Amershi et al. (2019) studied AI development workflows at Microsoft and reported a now-famous number: roughly 70% of AI development time is spent on data preparation and feature engineering. Not modeling. Not deployment. Not evaluation. Data wrangling.

If you're building anything ML-adjacent in 2026, that number is still mostly true, and it's still mostly avoidable if you treat data infrastructure as a first-class system instead of an afterthought you write inside notebooks.

Where the 70% actually goes

The study's breakdown across 551 ML practitioners pointed at four sinks:

- Schema reconciliation: same entity, four upstream sources, four shapes
- Imputation and outlier handling: null rates that move week-over-week
- Feature pipelines that drift silently: train/serve skew nobody owns
- Labeling and re-labeling: the cost line nobody budgets for

The single most common antipattern is treating each of these as a notebook exercise instead of a deployable artifact.

What actually cuts the tax

Treat features as code.…
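
One way to picture a feature as a deployable artifact rather than a notebook cell is a single function that both the training job and the serving path import, so the two cannot drift apart. The sketch below is only an illustration under assumed names: `RawOrder`, `order_features`, `DEFAULT_AMOUNT`, and the field names are hypothetical and do not come from the study.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RawOrder:
    # Hypothetical upstream record; the fields are illustrative only.
    amount: Optional[float]
    country: Optional[str]


# Assumed fallback for missing amounts. In a real pipeline this value would
# likely be derived from the training set and versioned with the model.
DEFAULT_AMOUNT = 0.0


def order_features(row: RawOrder) -> dict:
    """Single definition of the features, shared by training and serving."""
    amount_missing = row.amount is None
    amount = DEFAULT_AMOUNT if amount_missing else row.amount
    return {
        # Clamp negatives to zero before the log transform.
        "log_amount": math.log1p(max(amount, 0.0)),
        # Keep the null as a signal instead of hiding it behind the default.
        "amount_missing": int(amount_missing),
        # Normalize casing and whitespace in one place.
        "country": (row.country or "unknown").strip().lower(),
    }


def build_training_rows(rows):
    # Batch path: the same function, applied row by row.
    return [order_features(r) for r in rows]


def serve(row: RawOrder) -> dict:
    # Online path: the same function, applied to one request.
    return order_features(row)
```

Because both paths call `order_features`, a unit test can assert that `build_training_rows([row])[0] == serve(row)` for any sample row, which gives the train/serve skew problem an explicit check and an owner.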