TL;DR: Treat ML as measurement: quantify missingness, skew, leakage risks; stratify splits; scale when distances/gradients matter; encode to preserve signal without inventing order; use pipelines to automate the tedious preprocessing; validate with k-fold and report the right metric for your target.
Data Understanding
Start by making the dataset legible. I document each field and compute quick health stats:
- Shape & granularity: rows, columns, primary key uniqueness
- Type audit: numeric, categorical, datetime, text; parse dates, booleans
- Missingness: % missing per column, MCAR/MAR/MNAR hypothesis
- Cardinality: unique count per categorical; flag high-cardinality (> 1k)
- Basic stats: mean, std, median, IQR, skewness, kurtosis
- Ranges/anomalies: min/max plausibility (e.g., negative ages)
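Most of these health stats fall out of a few pandas calls. Here's a sketch on a made-up stand-in DataFrame (the column names are illustrative, not from a real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a real dataset (columns are illustrative)
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],      # duplicate key on purpose
    "age": [34, -1, 52, np.nan, 41],     # one implausible value, one missing
    "city": ["NY", "LA", "NY", "SF", "LA"],
})

# Per-column audit: type, missingness, cardinality
audit = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "pct_missing": df.isna().mean() * 100,
    "n_unique": df.nunique(),
})

key_is_unique = df["customer_id"].is_unique       # False: id 4 repeats
n_implausible_ages = int((df["age"] < 0).sum())   # 1 negative age
print(audit)
```

The same three-line `audit` frame works on any DataFrame, which makes it easy to rerun after every cleaning step.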
Exploratory Data Analysis (EDA)
Use stats and visuals to surface structure, leakage, and edge cases. I aim to answer:
- How are key features distributed? (look for skew/heavy tails)
- Which anomalies/outliers exist? (IQR rule: outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR])
- Which features correlate with the target? Which leak it? (time IDs, post-outcome fields)
- Numeric vs target: correlation/point-biserial for binary, mutual information
- Categorical vs target: target rate per category; rare category grouping
- Multicollinearity: correlation heatmap; consider VIF for linear models
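The IQR rule for outliers is a few lines of NumPy; a minimal sketch on a toy array:

```python
import numpy as np

def iqr_outlier_mask(x):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (x < lo) | (x > hi)

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])  # 95 is an obvious outlier
mask = iqr_outlier_mask(x)  # True only for the last element
```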
Feature Scaling
Scale when distances or gradients matter; skip when the model is scale-invariant (e.g., tree ensembles).
- Standardization: mean 0, variance 1 → good default for linear/logistic regression, SVMs, neural networks
- Min–Max: rescales to [0, 1] → useful for KNN, clustering, models with cosine/Euclidean distance
- Robust scaling: median/IQR → resistant to outliers
Heuristic: if a feature's |z| > 3 for many rows, prefer robust scaling.
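In scikit-learn the three options map to `StandardScaler`, `MinMaxScaler`, and `RobustScaler`. A quick comparison on a toy column with one extreme value shows why the robust version earns its name:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One column with a single extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(x)   # mean 0, variance 1
minmax = MinMaxScaler().fit_transform(x)           # squashed into [0, 1]
robust = RobustScaler().fit_transform(x)           # (x - median) / IQR

# The outlier dominates min-max: the four "normal" points land near 0
print(minmax.ravel())
```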
Encoding Categorical Variables
Convert categories without inventing order. Choose the encoding by cardinality and model:
- One-hot (nominal, low cardinality)
- Ordinal (a true rank exists)
- Label (tree models tolerate it; avoid for linear models)
- High-cardinality: hashing trick; frequency encoding; target encoding with K-fold/out-of-fold means to avoid leakage
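Out-of-fold target encoding is the easiest of these to get wrong, so here's a minimal sketch (the helper name and toy data are mine, not from any library):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=3, seed=0):
    """Replace each category with the target mean computed on the OTHER
    folds, so a row's own label never leaks into its encoding."""
    enc = pd.Series(np.nan, index=cat.index, dtype=float)
    global_mean = y.mean()  # fallback for categories unseen in a fold
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, va in kf.split(cat):
        fold_means = y.iloc[tr].groupby(cat.iloc[tr]).mean()
        enc.iloc[va] = cat.iloc[va].map(fold_means).fillna(global_mean).to_numpy()
    return enc

cat = pd.Series(["a", "a", "b", "b", "a", "b"])
y = pd.Series([1, 0, 1, 1, 0, 0])
encoded = oof_target_encode(cat, y)  # every value is a leakage-safe target rate
```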
Pipelines & Column Transformers
This was a game-changer for me! Instead of manually applying different transformations to different columns, I now use ColumnTransformer and Pipeline to automate the entire preprocessing workflow.
- ColumnTransformer: Apply different transformations to different column types in one go
- Pipeline: Chain preprocessing steps with model training seamlessly
- No data leakage: Transformations are fit only on training data, applied to validation/test
- Reproducible: Same preprocessing logic for training and inference
- Clean code: No more manual column selection and transformation loops
Example workflow: numeric columns get standardized, categoricals get one-hot encoded, and everything flows through a single pipeline. What used to be 50+ lines of tedious preprocessing is now 5 lines of clean, maintainable code.
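That workflow looks roughly like this in scikit-learn (the toy data and column names here are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "age":  [22, 35, 58, 45, 30, 61],
    "fare": [7.25, 71.3, 26.55, 8.05, 10.5, 80.0],
    "sex":  ["male", "female", "female", "male", "male", "female"],
})
y = [0, 1, 1, 0, 0, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),                # scale numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),  # one-hot categoricals
])
clf = Pipeline([("pre", preprocess), ("model", LogisticRegression())])

clf.fit(X, y)                   # scalers/encoders are fit on training data only
preds = clf.predict(X.head(2))  # identical preprocessing reapplied at inference
```

Swapping `LogisticRegression` for any other estimator keeps the whole preprocessing chain unchanged, which is exactly what makes pipelines reproducible.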
Video Demo: I've created a comprehensive walkthrough showing how pipelines transformed my ML workflow from messy manual preprocessing to clean, automated pipelines. The video covers real examples and shows the before/after code comparison. Jump down to the Titanic app demo below.
Titanic Survival App
I built a small app that predicts whether a passenger would survive based on inputs like age, fare, and embarked station. It uses a ColumnTransformer for per-column preprocessing and a Pipeline to keep training and inference consistent end-to-end.
- Inputs: age, sex, pclass, fare, embarked
- Preprocessing: numeric scaling, categorical one-hot encoding
- Model: tried logistic regression and tree-based models via the same pipeline
Train–Validation–Test
Evaluate on unseen data and mirror deployment conditions:
- Typical split: 60/20/20 or 70/15/15; stratify for imbalanced targets
- K-fold CV: k = 5 or 10; use grouped or time-series CV when appropriate
- Time-based: train on the past, validate on future windows (no shuffling)
Metrics: pick by objective → classification (ROC-AUC, PR-AUC for imbalance, F1); regression (RMSE/MAE, MAPE if a scale-free metric is needed); ranking (NDCG, MAP).
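A stratified 5-fold run scored with ROC-AUC ties the split and metric advice together; this sketch uses synthetic imbalanced data from `make_classification`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary target with roughly 80/20 class balance
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

# Stratification preserves the class ratio inside every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())  # report mean AND spread, not one lucky fold
```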
Graphs & Visualization
Plot early, plot often. Visuals surface structure and failure modes:
- Histograms: distribution, skew, zero-inflation
- Scatter: relationships, non-linearities, heteroscedasticity
- Box plots: outliers, spread by group
- Heatmaps: correlations, multicollinearity, leakage checks
- Learning curves: bias vs variance diagnosis
- Calibration/PR curves: probabilistic classifier quality (esp. on imbalance)
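The first three plot types above fit in one matplotlib figure; here's a minimal three-panel sketch on synthetic skewed data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(size=500)       # right-skewed, heavy-tailed feature
y = 2 * x + rng.normal(size=500)  # noisy linear relationship

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)
axes[0].set_title("Histogram: skew")
axes[1].scatter(x, y, s=5)
axes[1].set_title("Scatter: relationship")
axes[2].boxplot(x)
axes[2].set_title("Box plot: outliers")
fig.tight_layout()
fig.savefig("eda_panels.png")
```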
Takeaway: Be empirical. Quantify data quality, encode minimally, automate with pipelines, split correctly, and validate with metrics that match the business goal. The right tools make the tedious parts effortless.