If you ask GPT-4o or Claude to extract Federal Acquisition Regulation clause numbers from a federal solicitation, a non-trivial fraction of the time they will hand you a number that does not exist. There is no FAR 52.999-99. The model just made it up. For a federal contractor staffing a proposal, that is the difference between a clean compliance matrix and a rejected bid.

I went looking for a benchmark that measured this. There isn't one. Commercial tools in the space (Capture2Proposal, GovTribe, GovWin, OrangeSlices) all do natural-language processing on federal solicitations, but none publish benchmarks. Academic work on RFP processing is narrow and one-off. GSA's own srt-fbo-scraper covers only Section 508 compliance. So I built one.

FedProc-Bench

FedProc-Bench is a multi-task benchmark for federal procurement NLP.…