Function Calling Harness 2: CoT Compliance from 9.91% to 100%

DEV Community·Jeongho Nam·about 1 month ago

#ai #typescript #programming #opensource #string #procedure

TL;DR

9.91% is not "did the model get it right on the first try"; it is "did the model walk through the procedure to the end." Even a frontier model can fail a simple constraint like "don't skip any endpoint." The 100% in the title means the contract can force the model to walk the procedure.

CoT cannot be inspected if you leave it as free prose. The real question isn't "how long does the model think"; it is "can we turn that thinking into a submittable audit artifact?"

The focus shifts from correctness to compliance. Part 1 was about compile / validate / test. Part 2 is about coverage / reason / audit.

Beyond engineering, you can still guarantee a quality floor. Encode existing audit formats (SOAP / IRAC / ADR / postmortem) at the type level, and sloppy procedures stop passing.

Prompt is a request. Schema is enforcement.

1. Preface

This post is a follow-up to Function Calling Harness: From 6.75% to 100%. Part 1 had a simple thesis.…
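To make the TL;DR's "don't skip any endpoint" constraint concrete, here is a minimal sketch of what a coverage contract could look like. It is not the author's harness: the endpoint names, the `Review` shape, and the `findViolations` helper are all hypothetical and chosen only for illustration. The idea is that a record keyed by every endpoint makes a skipped endpoint a checkable violation rather than something a prompt merely asks for.

```typescript
// Hypothetical endpoint list; these names are illustrative, not from the article.
const ENDPOINTS = ["GET /users", "POST /users", "DELETE /users/{id}"] as const;
type Endpoint = (typeof ENDPOINTS)[number];

// One audit entry per endpoint: a decision plus the reason behind it.
interface Review {
  decision: "keep" | "change" | "remove";
  reason: string;
}

// Every endpoint is a required key, so a hand-written report that
// skips one does not type-check.
type CoverageReport = Record<Endpoint, Review>;

const DECISIONS: readonly string[] = ["keep", "change", "remove"];

// Model output arrives as untyped JSON, so the same contract is
// checked again at runtime. Returns one message per violation.
function findViolations(output: unknown): string[] {
  if (typeof output !== "object" || output === null) {
    return ["output is not an object"];
  }
  const report = output as Record<string, unknown>;
  const violations: string[] = [];
  for (const endpoint of ENDPOINTS) {
    const entry = report[endpoint];
    if (typeof entry !== "object" || entry === null) {
      violations.push(`${endpoint}: missing review`);
      continue;
    }
    const { decision, reason } = entry as Record<string, unknown>;
    if (typeof decision !== "string" || DECISIONS.indexOf(decision) === -1) {
      violations.push(`${endpoint}: invalid decision`);
    }
    if (typeof reason !== "string" || reason.trim() === "") {
      violations.push(`${endpoint}: empty reason`);
    }
  }
  return violations;
}

// A report that skips one endpoint and leaves one reason blank.
const partial = {
  "GET /users": { decision: "keep", reason: "Used by the admin list view." },
  "POST /users": { decision: "change", reason: "" },
};

console.log(findViolations(partial));
// prints [ 'POST /users: empty reason', 'DELETE /users/{id}: missing review' ]
```

In a setup like this, the violation list could be fed back to the model as a correction request until it comes back empty, which is one plausible reading of "the contract can force the model to walk the procedure."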
