I was setting up Copilot CLI on my work account last week and came across an experimental /security-review command. I didn't see any announcement for it, so I was curious how it worked and poked around a little. The short version of what it does: you finish your coding session, it reads the diff, and it produces a list of likely vulnerabilities. Useful on paper.

The thing I couldn't tell from poking at it manually was how much the underlying model matters. Does picking Opus over Haiku actually buy you better security findings, or are you just paying for the same answer in a fancier wrapper? So I built a small harness around OWASP Juice Shop to find out. This was a small-scale experiment funded by my work subscription. This post is what fell out of that.

The setup

I needed a target with a known answer key, and Juice Shop is an app I've poked at before. It's a deliberately vulnerable Node.js demo app that ships with a catalogue of known issues. I took the original app and created 10 changes from existing vulnerabilities.…
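
To make the "known answer key" idea concrete, here is a minimal sketch of how one review run could be scored against a set of seeded vulnerabilities. Everything in it is an assumption for illustration: the `SeededVuln` and `Finding` shapes, the rule that a finding counts as a hit when file and category both match, and the sample data are all hypothetical. It does not reflect the real output format of /security-review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeededVuln:
    """One entry in the answer key (hypothetical shape)."""
    file: str
    category: str  # e.g. "sqli", "xss"


@dataclass(frozen=True)
class Finding:
    """One issue reported by a review run (hypothetical shape)."""
    file: str
    category: str


def score(answer_key, findings):
    """Return (hits, false_positives, missed) for one review run.

    Assumed matching rule: a finding is a hit when it names the same
    file and the same category as a seeded vulnerability.
    """
    expected = {(v.file, v.category) for v in answer_key}
    reported = {(f.file, f.category) for f in findings}
    hits = expected & reported
    return len(hits), len(reported - expected), len(expected - reported)


# Made-up example data, not real results and not real file paths.
key = [
    SeededVuln("routes/example_login.ts", "sqli"),
    SeededVuln("routes/example_search.ts", "xss"),
]
run = [
    Finding("routes/example_login.ts", "sqli"),
    Finding("lib/example_utils.ts", "path-traversal"),
]

hits, false_positives, missed = score(key, run)
print(f"hits={hits} false_positives={false_positives} missed={missed}")
```

Matching on file plus category is deliberately coarse. A stricter rule (line ranges, for example) would penalise a model for describing the right bug in slightly different terms, while a looser one would reward vague findings.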