TL;DR 9.91% is not "did the model get it right on the first try" — it's "did the model walk through the procedure to the end." Even a frontier model can fail a simple constraint like "don't skip any endpoint." The 100% in the title means the contract can force the model to walk the procedure . CoT cannot be inspected if you leave it as free prose. The real question isn't "how long does the model think" — it's "can we turn that thinking into a submittable audit artifact?" The focus shifts from correctness to compliance. Part 1 was about compile / validate / test. Part 2 is about coverage / reason / audit. Beyond engineering, you can still guarantee a quality floor. Encode existing audit formats (SOAP / IRAC / ADR / postmortem) at the type level, and sloppy procedures stop passing. Prompt is a request. Schema is enforcement. 1. Preface This post is a follow-up to Function Calling Harness: From 6.75% to 100% . Part 1 had a simple thesis.…