A local RAG for local memories
By Robert Russell
I read a lot. But I don’t read the way I used to. I used to read books. Now I read articles online, conversation threads, and plenty of Wikipedia. Reading for me now feels a lot less structured and a lot more sprawling than it did when I was younger. It’s not just the time in my life that’s passed, though; the kinds of reading material available have changed a lot. The thing I’d like to bring back, for myself at least, from those younger days is the sense that there’s a thread or a connection running through the things I read. I can read a thousand pages on the same topic, but they’re scattered across different sources. A thousand loose leaves instead of a single bound book. There’s value in that binding. Collecting and arranging these ideas in order and making a single internally coherent work would take more effort than I’m willing to apply. There may be a way to get something close, though, I think.
Retrieval Augmented Generation (RAG) is a way to enhance prompts for LLMs by adding relevant context retrieved from a vector database. The initial prompt can be a string of plain English text. The vector database holds a collection of documents relevant to the user, indexed for similarity queries over embedding vectors. The RAG mechanism takes the initial plain-text prompt, uses it to form queries against the database, and then adds the results to the prompt before sending it to the LLM for inference. I’ve been intentionally vague about the exact mechanism of the query and the contents of the documents because those are parameters that vary a lot across RAG implementations.
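As a rough sketch of that retrieve-then-prompt loop, here’s what it could look like with a local Chroma collection and Ollama. The collection name, model name, and example passages here are placeholders I made up for illustration, not the setup described later in this post.

```python
# Minimal RAG loop sketch: retrieve similar chunks from a local Chroma
# collection, prepend them to the prompt, and send it to a local model
# via Ollama. "reading" and "llama3" are placeholder names.
import chromadb
import ollama

client = chromadb.PersistentClient(path="./reading_db")
collection = client.get_or_create_collection("reading")

# Index a couple of example passages; Chroma embeds them with its
# default embedding function unless you supply your own.
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Vector databases index text by embedding vectors for similarity search.",
        "Retrieval Augmented Generation adds retrieved context to an LLM prompt.",
    ],
)

question = "How does RAG use a vector database?"

# Retrieve the chunks most similar to the question.
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# Build the augmented prompt and ask the local model for an answer.
prompt = f"Use this context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])
```

The interesting variations are all in the middle of that loop: how the source material gets chunked, which embedding model is used, and how many retrieved results get stuffed into the prompt.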
So how do these two things go together? I’d like to build a database from a lot of the online text I’ve read, so that I can query it and pull together topics I’ve read about over time. To that end, I tried out ChromaDB running locally earlier this year. It covers a lot of the important parts of building the vector database and querying it. I learned that there are a lot of tricky decisions around splitting source material into appropriate chunks. A couple of weeks ago, at the PyTorch conference, I heard about Llama Index. Llama Index enables RAG applications and seems to be steering hard toward AI agents. I used their CLI tool to build a starter application, roughly as their docs describe: a Python backend that queries my ChromaDB instance and Ollama, plus a frontend serving a pleasant web UI. I pointed it at my own website as a starting point, and it’s definitely performing RAG, but there are a whole lot of independent pieces that are still pretty murky to me. I like that Llama Index provides a multi-step query pipeline, which allows for something like chain-of-thought reasoning, but so far the pipeline is hard for me to trace and debug.
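For a sense of how those pieces fit together, here’s a sketch of Llama Index sitting on top of a local ChromaDB collection and Ollama, with the chunk-size decision spelled out explicitly. It assumes the current `llama_index.core` package layout plus the Chroma, Ollama, and HuggingFace integration packages; the paths, collection name, and model names are placeholders rather than the actual starter app’s configuration.

```python
# Sketch: Llama Index wired to a local ChromaDB collection and an Ollama model.
# Requires llama-index plus the llama-index-vector-stores-chroma,
# llama-index-llms-ollama, and llama-index-embeddings-huggingface packages.
import chromadb
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore

# Local model for generation and a small local embedding model for indexing.
Settings.llm = Ollama(model="llama3", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# One of the tricky chunking decisions: how big each retrieved piece should be.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# Persist the vectors in ChromaDB so the index survives restarts.
db = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(chroma_collection=db.get_or_create_collection("my_reading"))
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load saved pages from disk, chunk them, embed them, and build the index.
documents = SimpleDirectoryReader("./saved_pages").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Query: retrieval plus generation in one call.
response = index.as_query_engine(similarity_top_k=3).query(
    "What have I read about vector databases?"
)
print(response)
```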
Building a little database from the things I read online feels like a natural progression from bookmarks and browser history. There’s a long way to go from a starter app to the tool I’m imagining, but this is a nice way to experiment with the idea. More importantly, it gives me an excuse to leave all those browser tabs open just a little longer.