We added OpenAI’s gpt-5.5 model to our eval suite the day it launched. We ran 1,742 tests overall, which included over 45 task scenarios across using 11 real engineering skills, each run 6 times and averaged the data, which is shown in this blog. TL;DR The gpt-5.5 model has the highest raw capability of any OpenAI model we've tested. When it uses agent skills and performs the same tasks, it pretty much ties with gpt-5.4 on score but costs 63% more per run. Question Answer Best Codex model out of the box? gpt-5.5: 75.6 avg baseline, highest in the family Best Codex model with skills loaded? gpt-5.4 and gpt-5.5 tie at 89.3 and 89.4 Worth the 63% price premium over gpt-5.4? With this data, we don’t think so Any scenario where it wins? Latency: 89.5s vs 135.4s for gpt-5.4 Should you use gpt-5.3 instead? No, oddly enough, gpt-5.3 costs 47% more than gpt-5.4 for a worse result because of the token bloat.…