The first time I swapped an embedding model in production, the answer quality on our internal eval set jumped by twelve points and the latency went down. I felt very smart for about a week. Then a customer success engineer asked why the assistant had stopped finding documents that contained exact product SKUs, and I spent a Saturday discovering that the new model, which was great at semantic similarity, had gotten worse at lexical matching. The old model carried enough surface-level signal to find the SKU. The new one had been trained out of that and pretended every SKU was a similar SKU. Recall on a specific class of query had collapsed, and our eval set had not covered that class. That is the standard embedding-model story. The model that wins on benchmarks is not always the model that wins on your data, and the model that wins on your data is not always the model that keeps winning when the queries change shape next quarter. Embeddings are not a commodity.…