Codebase-scale retrieval using AST-derived graphs + BM25 — reducing LLM context from 100K to 5K tokens [D]

Reddit r/MachineLearning·u/Altruistic_Night_327·about 1 month ago

Wanted to share an approach I've been using for retrieval-augmented generation over large codebases and get feedback from people thinking about similar problems.

**The problem**

Naive codebase RAG typically works by chunking files into text segments and embedding them for similarity search. This breaks down on code because semantic similarity at the chunk level doesn't capture structural relationships: if a function in file A uses a type defined in file C, embedding proximity alone won't surface that dependency.

**The approach: AST-derived typed graphs**

Instead of chunking, I parse every file into its AST using Tree-sitter, then extract a typed node/edge graph:

* Nodes: functions, classes, interfaces, types, modules
* Edges: imports, exports, call relationships, inheritance, composition

This gets stored in SQLite as a persistent graph. Parse cost is one-time per project.…
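To make the node/edge idea concrete, here is a minimal sketch of what extracting a typed graph and persisting it to SQLite could look like. It is not the code from the post: it swaps Tree-sitter for Python's built-in `ast` module so it runs with no dependencies, handles Python source only, and covers just import, call, and inheritance edges (no exports or composition). The schema and the names `extract_graph`, `store_graph`, and `neighbors` are hypothetical, chosen for illustration.

```python
import ast
import sqlite3

# Hypothetical two-table schema; the post does not describe its actual layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (
    kind TEXT NOT NULL,
    name TEXT NOT NULL,
    file TEXT NOT NULL,
    UNIQUE (kind, name, file)
);
CREATE TABLE IF NOT EXISTS edges (
    src  TEXT NOT NULL,
    kind TEXT NOT NULL,
    dst  TEXT NOT NULL,
    file TEXT NOT NULL
);
"""


def extract_graph(source, file):
    """Return (nodes, edges) for one Python file.

    nodes: set of (kind, name); edges: set of (src_name, edge_kind, dst_name).
    Names are unqualified, which is a simplification of this sketch.
    """
    tree = ast.parse(source, filename=file)
    nodes = {("module", file)}
    edges = set()

    def visit(node, owner):
        for child in ast.iter_child_nodes(node):
            scope = owner
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                nodes.add(("function", child.name))
                scope = child.name
            elif isinstance(child, ast.ClassDef):
                nodes.add(("class", child.name))
                scope = child.name
                for base in child.bases:
                    if isinstance(base, ast.Name):
                        edges.add((child.name, "inherits", base.id))
            elif isinstance(child, ast.Import):
                for alias in child.names:
                    edges.add((owner, "imports", alias.name))
            elif isinstance(child, ast.ImportFrom):
                edges.add((owner, "imports", child.module or "."))
            elif isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                edges.add((owner, "calls", child.func.id))
            visit(child, scope)

    visit(tree, file)
    return nodes, edges


def store_graph(conn, file, nodes, edges):
    """Persist one file's graph so later queries need no re-parse."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO nodes (kind, name, file) VALUES (?, ?, ?)",
        [(kind, name, file) for kind, name in sorted(nodes)],
    )
    conn.executemany(
        "INSERT INTO edges (src, kind, dst, file) VALUES (?, ?, ?, ?)",
        [(src, kind, dst, file) for src, kind, dst in sorted(edges)],
    )
    conn.commit()


def neighbors(conn, name):
    """Outgoing edges of a symbol, as (edge_kind, target_name) rows."""
    rows = conn.execute(
        "SELECT kind, dst FROM edges WHERE src = ? ORDER BY kind, dst", (name,)
    )
    return rows.fetchall()


SAMPLE = '''
from shapes import Shape

class Circle(Shape):
    def area(self):
        return compute_area(self.r)

def compute_area(r):
    return 3.14159 * r * r
'''

conn = sqlite3.connect(":memory:")
nodes, edges = extract_graph(SAMPLE, "circle.py")
store_graph(conn, "circle.py", nodes, edges)
print(neighbors(conn, "area"))    # [('calls', 'compute_area')]
print(neighbors(conn, "Circle"))  # [('inherits', 'Shape')]
```

The lookup for `Circle` returns its dependency on `Shape`, which lives in another file, without any embedding similarity involved. A real implementation would need qualified names and cross-file symbol resolution, which this sketch leaves out.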
