TL;DR: We pick the 512 hardest images for INT8 PTQ calibration by scoring a candidate pool with a small VLM. Bifrost sits between our calibration pipeline and four providers and gives us semantic caching, per-engineer virtual keys, and hard budget caps, so a runaway loop can't burn EUR800 of API spend over the weekend.

Calibration set selection is one of those topics nobody writes about until the day your INT8 model is 4.2 points of mAP worse than the FP16 reference and the release branch is already cut. Pick the wrong 512 images and your activation histograms get biased toward easy frames. Pick the right 512 and the gap closes to 0.6 points without any QAT work at all.

For about a year we picked them by hand. Then we tried random sampling, then sampling stratified by class. What finally worked, on an industrial defect detector we shipped to a customer last quarter, was scoring our candidate pool with a vision-language model and biasing the selection toward the high-difficulty tail.…
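
To make "biasing toward the high-difficulty tail" concrete, here is a minimal sketch of one way to do it. It is not necessarily the selection rule used in our pipeline. It assumes each candidate image already has a difficulty score in [0, 1] (a hypothetical scale), and it swaps in a standard technique: weighted sampling without replacement (Efraimidis-Spirakis) with softmax-style weights. The names `select_calibration_set`, `temperature`, and the `img_…` IDs are illustrative only.

```python
import math
import random


def select_calibration_set(scores, k=512, temperature=0.2, seed=0):
    """Pick k image IDs, biased toward high difficulty scores.

    scores: dict mapping image ID -> difficulty in [0, 1] (assumed scale).
    temperature: lower values concentrate the sample on the hardest images;
                 higher values approach uniform random sampling.
    """
    if k > len(scores):
        raise ValueError("candidate pool is smaller than k")
    rng = random.Random(seed)
    keyed = []
    for image_id, score in scores.items():
        weight = math.exp(score / temperature)
        # Efraimidis-Spirakis: draw u in (0, 1], use key u^(1/w).
        # We compare log keys (log(u) / w), which preserves the ordering.
        u = 1.0 - rng.random()
        keyed.append((math.log(u) / weight, image_id))
    # The k largest keys form a weighted sample without replacement.
    keyed.sort(reverse=True)
    return [image_id for _, image_id in keyed[:k]]


# Usage with a synthetic pool of 5000 scored candidates.
pool_rng = random.Random(1)
pool = {f"img_{i:05d}": pool_rng.random() for i in range(5000)}
chosen = select_calibration_set(pool, k=512)
mean_pool = sum(pool.values()) / len(pool)
mean_chosen = sum(pool[i] for i in chosen) / len(chosen)
```

Sampling with weights, rather than taking a strict top-512 by score, keeps some easier frames in the set, so the activation histograms still see the whole distribution while leaning on the hard cases. The `temperature` knob controls how hard it leans.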