Building a Persistent Knowledge Base RAG System with FastAPI, llama.cpp, Chroma, and Open WebUI

1 / 2

Building a Persistent Knowledge Base RAG System with FastAPI, llama.cpp, Chroma, and Open WebUI

DEV Community·navid mirnouri·about 1 month ago

#VHDJkldz

#ai #llm #programming #python #model #fullscreen

Reading 0:00

15s threshold

Have you ever wanted to chat with your own PDF collection – textbooks, research papers, internal documentation – using a local LLM, while keeping your data completely private? This is exactly what I built. In this article, I’ll walk you through a complete, production‑ready setup that: Ingests a folder of PDFs into a vector database (Chroma) Serves an OpenAI‑compatible RAG API using FastAPI Uses llama.cpp as the local LLM backend (any GGUF model works) Connects seamlessly to Open WebUI for a beautiful chat interface Provides persistent memory (the vector store survives restarts) All code is available at the end of this article – ready to copy, paste, and run. 🧠 Why this system? Privacy first – everything runs on your machine. Long‑term knowledge – uploaded PDFs stay in the vector store; you can chat with them any time. Cross‑chat memory – the RAG pipeline works every time you ask a question. Modular – swap Chroma for Qdrant, replace llama.cpp with Ollama, or add hybrid search.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Create free account Log in

Menu

Building a Persistent Knowledge Base RAG System with FastAPI, llama.cpp, Chroma, and Open WebUI