Why I Built RAG From Scratch Before Using LangChain

Technical Note #01: Why I Built RAG From Scratch Before Using LangChain
Part of the Agentic Finance Beast Technical Notes series
Published: June 7, 2026
Reading Time: ~6 minutes

This technical note documents my first implementation of a Retrieval-Augmented Generation (RAG) pipeline. The goal was not to build a production-ready system. The goal was to understand what actually happens between a user's question and an AI-generated answer before relying on frameworks such as LangChain. Rather than starting with abstractions, I wanted to build the core pieces myself and learn where real-world AI systems succeed and fail.

I built a minimal RAG pipeline from scratch using Gemini Embeddings, cosine similarity search, and Mistral. The biggest lesson wasn't prompt engineering. It was discovering that retrieval quality often has a greater impact on answer quality than the language model itself.

Most RAG tutorials follow a similar pattern:

- Install LangChain.
- Connect a vector database.
- Load a document.
- Ask questions.

Within minutes, you have a working application. That's impressive. But it left me with a question: If the system retrieves the wrong information, how would I debug it?

Frameworks make development faster, but they also hide implementation details. Before using those abstractions, I wanted to understand the individual components behind Retrieval-Augmented Generation. So I built a simple version myself. No LangChain. No vector database. No orchestration framework. Just Python, embeddings, similarity search, and an LLM.

Retrieval-Augmented Generation combines two systems:

- A retrieval system that finds relevant information.
- A generation system that uses that information to answer questions.

Instead of relying entirely on the language model's training data, relevant information is retrieved at runtime and injected into the prompt.
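A minimal, self-contained sketch of that retrieve-then-inject loop, using only the standard library. It is not the pipeline from this note: the hypothetical `toy_embed` function replaces Gemini embeddings with simple word counts, so it only matches shared keywords rather than meaning, and the sketch stops at building the prompt instead of calling Mistral. The sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding API: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())  # missing words count as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str]) -> str:
    """Retrieval half: return the single chunk most similar to the question."""
    q = toy_embed(question)
    return max(chunks, key=lambda chunk: cosine(q, toy_embed(chunk)))


def build_prompt(question: str, context: str) -> str:
    """Generation half: inject the retrieved context into the prompt at runtime."""
    return (
        "Answer using only the context below.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}"
    )


# Made-up sample chunks, for illustration only
chunks = [
    "An AI agent perceives its environment and takes actions toward a goal.",
    "Traditional programs follow fixed instructions written in advance.",
]
question = "What is an AI agent?"
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt)  # this string is what would be sent to the LLM
```

Swapping `toy_embed` for a real embedding model is what lets retrieval match on meaning instead of on shared words; the rest of the loop stays the same.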
The simplified workflow looks like this:

Document → Chunking → Embeddings → Similarity Search → Context Retrieval → LLM Response

This architecture allows AI systems to answer questions using external knowledge without retraining the model.

My implementation consisted of five core stages:

Document → Sentence-Based Chunking → Gemini Embeddings → Cosine Similarity Search → Mistral Answer Generation

The knowledge base contained a short document about AI agents. Users could ask questions, and the system would retrieve relevant information before generating a response.

The first step was splitting the document into smaller pieces. I used a simple sentence-based approach:

```python
# `document` holds the raw knowledge-base text. Split on ". " and re-append the
# period; rstrip(".") avoids a doubled period on the final sentence.
chunks = [
    s.strip().rstrip(".") + "."
    for s in document.replace("\n", " ").split(". ")
    if s.strip()
]
```

At first, this felt like a minor preprocessing step. It wasn't. I quickly realized that chunking affects retrieval quality directly. Large chunks preserve context but often include irrelevant information. Small chunks improve retrieval precision but can lose important context. Even in this small project, chunking turned out to be a meaningful engineering decision.

After chunking the document, I generated embeddings using Gemini's embedding API. Each chunk was converted into a high-dimensional vector representation:

```python
# data: parsed JSON response, e.g. {"embedding": {"values": [...]}}; missing keys yield []
embedding = data.get("embedding", {}).get("values", [])
```

Before building this project, embeddings felt somewhat magical. After seeing the actual vectors returned by the API, the concept became easier to understand. Embeddings allow machines to compare meaning instead of matching exact words. For example, a query about decision-making could retrieve information related to reasoning even if the exact keywords do not appear. That capability is what makes semantic search possible.

Instead of using a vector database, I implemented cosine similarity manually.
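The function that follows implements the standard cosine similarity between a query embedding $a$ and a chunk embedding $b$, both of dimension $n$:

$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}}
$$

In the code, the numerator is `dot` and the two square roots are `norm_a` and `norm_b`. The score lies between -1 and 1 and depends only on the angle between the vectors, not on their lengths. As a small worked example, for $a = (1, 0)$ and $b = (1, 1)$ the result is $1 / (1 \cdot \sqrt{2}) \approx 0.707$. The formula is undefined when either vector has zero length.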
```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # guard: an empty or all-zero vector has no direction
    return dot / (norm_a * norm_b)
```

This was one of the most interesting parts of the project. Before building it, vector search seemed complicated. After implementing it myself, I realized the mathematics behind retrieval is relatively straightforward. The challenge is not the formula. The challenge is consistently retrieving the most useful context.

When a user asks a question, the same embedding model converts the question into a vector. The query embedding is then compared against every document embedding:

```python
# similarities: one cosine score per chunk; keep the index of the best one
best_idx = similarities.index(max(similarities))
```

The chunk with the highest similarity score becomes the retrieved context. For example, each of the following questions retrieved relevant information from the knowledge base:

- What is an AI agent?
- How do AI agents differ from traditional programs?
- What can financial AI agents do?

For a minimal implementation, the results were surprisingly effective.

Once relevant context is retrieved, it is passed to Mistral alongside the user's question:

```text
Context: {context}

Question: {question}
```

The model is instructed to answer using only the provided context. This is where retrieval and generation come together. Without retrieval, the model answers based on its training data. With retrieval, the model answers using information supplied at runtime. This simple shift dramatically improves factual grounding.

Before building this project, I assumed the language model would be the most important part of the system. I was wrong. Most answer-quality issues were retrieval issues. When retrieval returned weak context, answer quality suffered. When retrieval returned relevant context, answer quality improved significantly.

This changed how I think about AI applications. Prompt engineering matters. Model selection matters.
But retrieval quality often determines whether an answer is useful in the first place.

Even in a small project, several tradeoffs became visible. Sentence-level chunking was easy to implement. However, preserving context becomes more difficult as chunk sizes become smaller. Retrieving a single best match is simple. Retrieving multiple relevant chunks provides broader coverage but introduces additional complexity. Building retrieval manually helped me understand the system. In production environments, dedicated vector databases and retrieval frameworks become necessary.

Before starting this project, I believed prompt engineering would have the greatest impact on answer quality. The implementation showed otherwise. Poor retrieval produced poor answers regardless of prompt quality. Improving retrieval had a larger effect than rewriting prompts. That was one of the most valuable lessons from the entire exercise.

This implementation intentionally prioritizes learning over scalability. Several important features are missing:

- Top-K retrieval
- Persistent vector storage
- Metadata filtering
- Conversation memory
- Retrieval evaluation metrics
- Hybrid search techniques

These limitations are acceptable because the objective was understanding the fundamentals rather than building a production-ready system.

This implementation serves as the foundation for future work within Agentic Finance Beast. The next improvements I plan to explore include:

- Top-K retrieval instead of single-result retrieval
- Better chunking strategies
- Vector storage using pgvector
- Financial document retrieval
- Multi-step workflows with LangGraph
- Agent memory and reasoning systems

Each improvement builds upon the concepts explored in this first implementation.

Building a RAG pipeline from scratch did not make me an expert in retrieval systems. What it did provide was a practical understanding of how retrieval, embeddings, similarity search, and generation work together.
Frameworks such as LangChain are incredibly useful. But understanding the fundamentals behind those abstractions provides a different kind of value. When something breaks, I now have a mental model for where to investigate. For me, that understanding made building from scratch worthwhile.

GitHub: https://github.com/Sumayea104/agentic-finance-beast



