March 26, 2026 at 8:34 AM
🩺 diabetes prediction
what it is. a machine-learning pipeline that predicts diabetes risk from a handful of clinical features — glucose, bmi, insulin, age — with careful handling of missing values and aggressive feature selection. this project later became my ijmit paper (see papers).
why i built it
clinical datasets look clean in the textbook and are a landmine in real life. i wanted to know, at a gut level, what happens when you treat the unglamorous parts — missing values, leakage, class imbalance — as the main event instead of a side quest.
what i learned
- missing values are a modelling choice. imputation is a lie you tell the model. which lie you tell changes the result more than which model you pick.
- feature selection beats the model zoo. i tried eleven models. picking the right three features moved f1 more than picking the "best" model did.
- a paper is a forcing function. writing the work up honestly, for peer review, is where i caught my own shortcuts.
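the first point, in miniature. this is a minimal sketch, not the project's code: the insulin readings are made up, and `fit_imputer` and `impute` are hypothetical helper names. it fills the same gaps with a mean and with a median, learning the fill value from the training rows only so nothing leaks in from the test rows.

```python
from statistics import mean, median

# Made-up insulin readings for illustration; None marks a missing value.
train_insulin = [94.0, 168.0, None, 88.0, 543.0, None, 175.0]
test_insulin = [None, 120.0]


def fit_imputer(values, strategy):
    """Learn the fill value from the observed (non-missing) training values only."""
    observed = [v for v in values if v is not None]
    return strategy(observed)


def impute(values, fill):
    """Replace every missing entry with the previously learned fill value."""
    return [fill if v is None else v for v in values]


# Two different "lies" learned from the same training column.
mean_fill = fit_imputer(train_insulin, mean)
median_fill = fit_imputer(train_insulin, median)

# Apply the training-derived fill to the test column; test values never
# influence the fill, which is what keeps the split leak-free.
test_mean = impute(test_insulin, mean_fill)
test_median = impute(test_insulin, median_fill)

print(f"mean fill:   {mean_fill:.1f}")
print(f"median fill: {median_fill:.1f}")
```

one high reading (543) pulls the mean fill well above the median fill, so every patient with a missing insulin value looks different to the model depending on which strategy was chosen.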