I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.


DEV Community·Md Ayan Arshad·28 days ago


#experiment #ai #discuss #chunker #code #pdfs


I assumed chunking was a solved problem: pick a text splitter, set 512 tokens, add some overlap, move on. After running structured experiments across three different data types, that assumption collapsed. The best chunker for markdown documentation actively hurt performance on code. The winner changed completely depending on what I was chunking.

TL;DR

| Data type     | Winner                  | Headline metric                                     |
|---------------|-------------------------|-----------------------------------------------------|
| Markdown docs | HeadingAwareChunker     | MRR 0.755 vs SlidingWindow 0.687                    |
| PDFs          | RecursiveChar (512 tok) | Context Recall 0.9250, RAGAS SUM 3.4249             |
| GitHub code   | CodeBlockAwareChunker   | RAGAS SUM 3.5680 (highest across all experiments)   |

RecursiveChar won on PDFs. The same chunker scored 0.5690 Context Precision on code, meaning roughly half the retrieved chunks were irrelevant. There is no universal best chunker. The data type decides.

What I was building

A RAG system that ingests documentation sites, PDFs, and GitHub repositories for multiple tenants, then answers developer questions with citations.…
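For readers unfamiliar with the MRR figure quoted in the TL;DR, here is a minimal sketch of how Mean Reciprocal Rank is conventionally computed for a retrieval evaluation. This is the standard definition, not necessarily the exact evaluation code behind the numbers above; the function name, the chunk IDs, and the sample queries are all hypothetical.

```python
from typing import Sequence, Set


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[str]],
    relevant_ids: Sequence[Set[str]],
) -> float:
    """Average of 1/rank of the first relevant chunk, over all queries.

    ranked_results[i] is the retriever's ordered list of chunk IDs for query i.
    relevant_ids[i] is the set of chunk IDs judged relevant for query i.
    A query with no relevant chunk in its results contributes 0.
    """
    if len(ranked_results) != len(relevant_ids):
        raise ValueError("need one relevance set per query")
    if not ranked_results:
        return 0.0

    total = 0.0
    for ranked, relevant in zip(ranked_results, relevant_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


# Hypothetical example: three queries, with the first relevant chunk
# at rank 1, at rank 3, and not retrieved at all.
results = [["a", "b"], ["x", "y", "z"], ["q"]]
relevant = [{"a"}, {"z"}, {"missing"}]
score = mean_reciprocal_rank(results, relevant)  # (1 + 1/3 + 0) / 3
```

Under this definition, an MRR of 0.755 versus 0.687 means the first relevant chunk tends to appear closer to the top of the ranking with the first chunker than with the second.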
