The 70% Data-Prep Tax in AI Development (and How to Cut It in Half)

DEV Community·A3E Ecosystem·23 days ago

#ai #machinelearning #datascience #productivity #engineering #amershi

Amershi et al. (2019) studied AI development workflows at Microsoft and reported a now-famous number: ~70% of AI development time is spent on data preparation and feature engineering. Not modeling. Not deployment. Not evaluation. Data wrangling.

If you're building anything ML-adjacent in 2026, that number is still mostly true, and it's still mostly avoidable if you treat data infrastructure as a first-class system instead of an afterthought you write inside notebooks.

Where the 70% actually goes

Amershi's breakdown across 551 ML practitioners pointed at four sinks:

- Schema reconciliation: same entity, four upstream sources, four shapes
- Imputation and outlier handling: null rates that move week-over-week
- Feature pipelines that drift silently: train/serve skew nobody owns
- Labeling and re-labeling: the cost line nobody budgets for

The single most common antipattern is treating each of these as a notebook exercise instead of a deployable artifact.

What actually cuts the tax

Treat features as code.…
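The article's own explanation of "features as code" is cut off in this preview, so what follows is only one plausible reading, not the author's method. It is a minimal sketch that assumes a single numeric column with missing values: the imputation value is fit once on training data, stored in an immutable object, and reused unchanged at serving time, which is one common way to avoid train/serve skew. The names `AgeFeature`, `null_rate`, and `null_rate_drifted`, and the 0.05 tolerance, are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class AgeFeature:
    """Hypothetical feature definition: fit once, reuse in training and serving."""

    fill_value: float  # imputation value learned from training data
    version: str = "v1"  # bump when the definition changes

    @classmethod
    def fit(cls, ages: Sequence[Optional[float]]) -> "AgeFeature":
        # Learn the imputation value from the non-null training values only.
        observed = [a for a in ages if a is not None]
        if not observed:
            raise ValueError("cannot fit on an all-null column")
        return cls(fill_value=float(median(observed)))

    def transform(self, age: Optional[float]) -> float:
        # Same code path for a training row and a live request.
        return self.fill_value if age is None else float(age)


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of missing values in a batch (0.0 for an empty batch)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_rate_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a batch whose null rate moved more than `tolerance` from the baseline.

    The tolerance is an assumed placeholder, not a recommended threshold.
    """
    return abs(current - baseline) > tolerance


if __name__ == "__main__":
    train_ages = [34, None, 29, 41, None]
    feature = AgeFeature.fit(train_ages)
    baseline = null_rate(train_ages)

    this_week = [None, None, None, 52, 38]
    print(feature.transform(None), feature.transform(52))
    print(null_rate_drifted(baseline, null_rate(this_week)))
```

Because the fitted object is frozen and versioned, it can be reviewed, unit-tested, and shipped like any other code, and the null-rate check gives the week-over-week movement mentioned above an explicit owner instead of leaving it inside a notebook.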
