A precise, worked explanation of why paired bootstrap is the correct procedure for within-subject LLM evaluation designs, and why near-zero covariance between score vectors means it provides almost no practical benefit.
Personal project that turned into a bit of a rabbit hole. Got diagnosed with pre-diabetes a few months back and I want to own my own glucose data rather than have it locked in a proprietary app. I have an Accu-Chek Guide glucometer.…