TL;DR: Treat ML as measurement: quantify missingness, skew, leakage risks; stratify splits; scale when distances/gradients matter; encode to preserve signal without inventing order; use pipelines to automate the tedious preprocessing; validate with k-fold and report the right metric for your target.
Data Understanding
Start by making the dataset legible. I document each field and compute quick health stats:
- Shape & granularity: rows, columns, primary key uniqueness
- Type audit: numeric, categorical, datetime, text; parse dates, booleans
- Missingness: % missing per column, MCAR/MAR/MNAR hypothesis
- Cardinality: unique count per categorical; flag high-cardinality (> 1k)
- Basic stats: mean, std, median, IQR, skewness, kurtosis
- Ranges/anomalies: min/max plausibility (e.g., negative ages)
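Most of these health stats fall out of a few pandas calls. Here's a sketch on a made-up stand-in DataFrame (the column names are illustrative, not from a real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a real dataset (columns are illustrative)
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],      # duplicate key on purpose
    "age": [34, -1, 52, np.nan, 41],     # one implausible value, one missing
    "city": ["NY", "LA", "NY", "SF", "LA"],
})

# Per-column audit: type, missingness, cardinality
audit = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "pct_missing": df.isna().mean() * 100,
    "n_unique": df.nunique(),
})

key_is_unique = df["customer_id"].is_unique       # False: id 4 repeats
n_implausible_ages = int((df["age"] < 0).sum())   # 1 negative age
print(audit)
```

The same three-line `audit` frame works on any DataFrame, which makes it easy to rerun after every cleaning step.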
Exploratory Data Analysis (EDA)
Use stats and visuals to surface structure, leakage, and edge cases. I aim to answer:
- How are key features distributed? (look for skew/heavy tails)
- Which anomalies/outliers exist? (IQR rule: outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR])
- Which features correlate with the target? Which leak it? (time IDs, post-outcome fields)
- Numeric vs target: correlation/point-biserial for binary, mutual information
- Categorical vs target: target rate per category; rare category grouping
- Multicollinearity: correlation heatmap; consider VIF for linear models
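The IQR rule for outliers is a few lines of NumPy; a minimal sketch on a toy array:

```python
import numpy as np

def iqr_outlier_mask(x):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lo) | (x > hi)

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])  # 95 is an obvious outlier
mask = iqr_outlier_mask(x)  # True only for the last element
```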
Feature Scaling
Scale when distances or gradients matter; skip when the model is scale-invariant (e.g., tree ensembles).
- Standardization: mean 0, variance 1 → good default for linear/logistic regression, SVMs, neural networks
- Min–Max: rescales to [0, 1] → useful for KNN, clustering, models with cosine/Euclidean distance
- Robust scaling: median/IQR → resistant to outliers
Heuristic: if a feature's |z| > 3 for many rows, prefer robust scaling.
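In scikit-learn the three options map to `StandardScaler`, `MinMaxScaler`, and `RobustScaler`. A quick comparison on a toy column with one extreme value shows why the robust version earns its name:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One column with a single extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(x)   # mean 0, variance 1
minmax = MinMaxScaler().fit_transform(x)           # squashed into [0, 1]
robust = RobustScaler().fit_transform(x)           # (x - median) / IQR

# The outlier dominates min-max: the four "normal" points land near 0
print(minmax.ravel())
```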
Encoding Categorical Variables
Convert categories without inventing order. Choose the encoding by cardinality and model:
- One-hot (nominal, low cardinality)
- Ordinal (a true rank exists)
- Label (tree models tolerate it; avoid for linear models)
- High-cardinality: hashing trick; frequency encoding; target encoding with K-fold/out-of-fold means to avoid leakage
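Out-of-fold target encoding is the easiest of these to get wrong, so here's a minimal sketch (the helper name and toy data are mine, not from any library):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=3, seed=0):
    """Replace each category with the target mean computed on the OTHER
    folds, so a row's own label never leaks into its encoding."""
    enc = pd.Series(np.nan, index=cat.index, dtype=float)
    global_mean = y.mean()  # fallback for categories unseen in a fold
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in kf.split(cat):
        fold_means = y.iloc[tr].groupby(cat.iloc[tr]).mean()
        enc.iloc[va] = cat.iloc[va].map(fold_means).fillna(global_mean).to_numpy()
    return enc

cat = pd.Series(["a", "a", "b", "b", "a", "b"])
y = pd.Series([1, 0, 1, 1, 0, 0])
encoded = oof_target_encode(cat, y)  # every value is a leakage-safe target rate
```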
Pipelines & Column Transformers
This was a game-changer for me! Instead of manually applying different transformations to different columns, I now use ColumnTransformer and Pipeline to automate the entire preprocessing workflow.
- ColumnTransformer: Apply different transformations to different column types in one go
- Pipeline: Chain preprocessing steps with model training seamlessly
- No data leakage: Transformations are fit only on training data, applied to validation/test
- Reproducible: Same preprocessing logic for training and inference
- Clean code: No more manual column selection and transformation loops
Example workflow: numeric columns get standardized, categoricals get one-hot encoded, and everything flows through a single pipeline. What used to be 50+ lines of tedious preprocessing is now 5 lines of clean, maintainable code.
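That workflow looks roughly like this in scikit-learn (the toy data and column names here are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "age":  [22, 35, 58, 45, 30, 61],
    "fare": [7.25, 71.3, 26.55, 8.05, 10.5, 80.0],
    "sex":  ["male", "female", "female", "male", "male", "female"],
})
y = [0, 1, 1, 0, 0, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),                # scale numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),  # one-hot categoricals
])
clf = Pipeline([("pre", preprocess), ("model", LogisticRegression())])

clf.fit(X, y)                   # scalers/encoders are fit on training data only
preds = clf.predict(X.head(2))  # identical preprocessing reapplied at inference
```

Swapping `LogisticRegression` for any other estimator keeps the whole preprocessing chain unchanged, which is exactly what makes pipelines reproducible.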
Video Demo: I've created a comprehensive walkthrough showing how pipelines transformed my ML workflow from messy manual preprocessing to clean, automated pipelines. The video covers real examples and shows the before/after code comparison. Jump down to the Titanic app demo below.
Titanic Survival App
I built a small app that predicts whether a passenger would survive based on inputs like age, fare, and embarked station. It uses a ColumnTransformer for per-column preprocessing and a Pipeline to keep training and inference consistent end-to-end.
- Inputs: age, sex, pclass, fare, embarked
- Preprocessing: numeric scaling, categorical one-hot encoding
- Model: tried logistic regression and tree-based models via the same pipeline
Train–Validation–Test
Evaluate on unseen data and mirror deployment conditions:
- Typical split: 60/20/20 or 70/15/15; stratify for imbalanced targets
- K-fold CV: k = 5 or 10; use grouped or time-series CV when appropriate
- Time-based: train on the past, validate on future windows (no shuffling)
Metrics: pick by objective → classification (ROC-AUC, PR-AUC for imbalance, F1); regression (RMSE/MAE, MAPE if a scale-free metric is needed); ranking (NDCG, MAP).
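A stratified 5-fold run scored with ROC-AUC ties the split and metric advice together; this sketch uses synthetic imbalanced data from `make_classification`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary target with roughly 80/20 class balance
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# Stratification preserves the class ratio inside every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())  # report mean AND spread, not one lucky fold
```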
Graphs & Visualization
Plot early, plot often. Visuals surface structure and failure modes:
- Histograms: distribution, skew, zero-inflation
- Scatter: relationships, non-linearities, heteroscedasticity
- Box plots: outliers, spread by group
- Heatmaps: correlations, multicollinearity, leakage checks
- Learning curves: bias vs variance diagnosis
- Calibration/PR curves: probabilistic classifier quality (esp. on imbalance)
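The first three plot types above fit in one matplotlib figure; here's a minimal three-panel sketch on synthetic skewed data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)       # right-skewed, heavy-tailed feature
y = 2 * x + rng.normal(size=500)  # noisy linear relationship

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)
axes[0].set_title("Histogram: skew")
axes[1].scatter(x, y, s=5)
axes[1].set_title("Scatter: relationship")
axes[2].boxplot(x)
axes[2].set_title("Box plot: outliers")
fig.tight_layout()
fig.savefig("eda_panels.png")
```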
Takeaway: Be empirical. Quantify data quality, encode minimally, automate with pipelines, split correctly, and validate with metrics that match the business goal. The right tools make the tedious parts effortless.