Machine Learning • Journey • 2025

My Machine Learning Journey So Far

Foundational practices I now reach for in every project: data understanding, EDA, scaling, encoding, pipelines, proper splits, and clear visualizations.

TL;DR: Treat ML as measurement: quantify missingness, skew, leakage risks; stratify splits; scale when distances/gradients matter; encode to preserve signal without inventing order; use pipelines to automate the tedious preprocessing; validate with k-fold and report the right metric for your target.

šŸ” Data Understanding

Start by making the dataset legible. I document each field and compute quick health stats: missingness, dtype, cardinality, and skew.
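
A minimal pandas sketch of those health stats (the file name is a placeholder):

```python
import pandas as pd

def health_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column health report: dtype, missingness, cardinality, skew."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })
    # Skew is only meaningful for numeric columns; others stay NaN.
    report["skew"] = df.select_dtypes(include="number").skew()
    return report

# Usage: print(health_stats(pd.read_csv("train.csv")))  # file name assumed
```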

📊 Exploratory Data Analysis (EDA)

Use stats and visuals to surface structure, leakage, and edge cases. I aim to answer: what does each distribution look like, which features track the target too cleanly to be trusted, and which rare values could break preprocessing downstream?
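
Here's the kind of quick first pass I mean, sketched with pandas; the file name and the numeric `target` column are assumptions:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file name

# Distributions: spot skew and outliers at a glance.
print(df.describe(include="all").T)

# Leakage check: a feature that correlates near-perfectly with the target
# is usually leaking it. Assumes a numeric 'target' column.
numeric = df.select_dtypes(include="number")
print(numeric.corr()["target"].sort_values(ascending=False))

# Edge cases: rare categories often break encoders at inference time.
for col in df.select_dtypes(include="object"):
    print(df[col].value_counts(dropna=False).tail(3))
```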

āš–ļø Feature Scaling

Scale when distances or gradients matter (k-NN, SVMs, linear models, neural nets); skip when the model is scale-invariant (tree-based models).

Heuristic: if a feature’s |z| > 3 for many rows, prefer robust scaling.
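
A sketch of that heuristic with scikit-learn; the 1% tail threshold is my own rule of thumb, not a standard:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

def pick_scaler(x: np.ndarray, tail_frac: float = 0.01):
    """Prefer RobustScaler when more than tail_frac of rows have |z| > 3."""
    z = (x - x.mean()) / x.std()
    if np.mean(np.abs(z) > 3) > tail_frac:
        return RobustScaler()   # median/IQR: resistant to outliers
    return StandardScaler()     # mean/std: fine for well-behaved features

# Usage: scaler = pick_scaler(df["fare"].to_numpy())  # column name illustrative
```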

🔑 Encoding Categorical Variables

Convert categories without inventing order. Choose encoding by cardinality and model: one-hot for low-cardinality features (and for linear or distance-based models), ordinal only when the order is real, and target or hashing encoders when cardinality is high.
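
A minimal scikit-learn sketch of both cases; the category values are illustrative:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal, low cardinality: one-hot, tolerating unseen categories at inference.
color = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
print(color.fit_transform([["red"], ["blue"], ["red"]]))

# Truly ordered category: spell the order out instead of letting the
# encoder invent one alphabetically.
size = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(size.fit_transform([["small"], ["large"]]))  # -> [[0.], [2.]]
```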

🔧 Pipelines & Column Transformers

This was a game-changer for me! Instead of manually applying different transformations to different columns, I now use ColumnTransformer and Pipeline to automate the entire preprocessing workflow.

Example workflow: numeric columns get standardized, categoricals get one-hot encoded, and everything flows through a single pipeline. What used to be 50+ lines of tedious preprocessing is now 5 lines of clean, maintainable code.
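
Roughly what that looks like; a sketch with made-up column names and a logistic regression on top:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]       # illustrative column names
categorical_cols = ["city", "plan"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train)   # training and inference share the same transforms
# model.predict(X_new)
```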

Video Demo: I've recorded a walkthrough showing how pipelines took my workflow from messy manual preprocessing to clean, automated steps, with real examples and a before/after code comparison. Jump down to the Titanic app demo below.

🚢 Titanic Survival App

I built a small app that predicts whether a passenger would survive based on inputs like age, fare, and port of embarkation. It uses a ColumnTransformer for per-column preprocessing and a Pipeline to keep training and inference consistent end-to-end.
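
The core of the app, sketched here assuming lowercase Kaggle-style column names and a local CSV; the app's real fields may differ:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("titanic.csv")  # assumed path to the training data
X, y = df[["age", "fare", "embarked"]], df["survived"]

num = Pipeline([("impute", SimpleImputer(strategy="median")),
                ("scale", StandardScaler())])
cat = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                ("onehot", OneHotEncoder(handle_unknown="ignore"))])
prep = ColumnTransformer([("num", num, ["age", "fare"]),
                          ("cat", cat, ["embarked"])])

model = Pipeline([("prep", prep),
                  ("clf", LogisticRegression(max_iter=1000))]).fit(X, y)

# Inference sees exactly the same preprocessing as training.
passenger = pd.DataFrame([{"age": 29, "fare": 72.5, "embarked": "S"}])
print(model.predict_proba(passenger)[0, 1])  # estimated survival probability
```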

🧩 Train–Validation–Test

Evaluate on unseen data and mirror deployment conditions: hold out a test set you touch once, stratify classification splits so class balance survives, split by time when the data is temporal, and tune with k-fold cross-validation on the training portion.

Metrics: pick by objective. Classification: ROC-AUC, or PR-AUC and F1 under imbalance. Regression: RMSE/MAE, or MAPE when a scale-free error is needed. Ranking: NDCG, MAP.
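
A sketch of that workflow on synthetic imbalanced data, using PR-AUC (average precision) as the k-fold scoring metric:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Hold out a test set once; stratify so the 90/10 class balance survives.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Tune and compare models with k-fold CV on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train,
                         cv=cv, scoring="average_precision")  # PR-AUC for imbalance
print(scores.mean(), scores.std())
```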

📉 Graphs & Visualization

Plot early, plot often. Visuals surface structure and failure modes: histograms expose skew and outliers, scatter plots reveal relationships worth modeling, and residual plots show where predictions go wrong.
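
A minimal matplotlib sketch of the two plots I reach for first, on synthetic data so it runs standalone:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.lognormal(mean=3, sigma=0.6, size=500)   # skewed target, for illustration
y_pred = y_true * rng.normal(1.0, 0.15, size=500)     # a fake model's predictions

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: skew and outliers jump out immediately.
ax1.hist(y_true, bins=50)
ax1.set_title("Target distribution (check skew)")

# Residual plot: curvature or fanning means the model is missing structure.
ax2.scatter(y_pred, y_true - y_pred, s=8, alpha=0.5)
ax2.axhline(0, color="k", lw=1)
ax2.set_title("Residuals vs. predictions")

plt.tight_layout()
plt.show()
```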

Takeaway: Be empirical. Quantify data quality, encode minimally, automate with pipelines, split correctly, and validate with metrics that match the business goal. The right tools make the tedious parts effortless.