Combining RAG with Knowledge Graphs: Structure Your Text Around Concepts#
At ufirst we build AI-powered solutions while exploring what’s actually functional in this ever-shifting landscape. Our focus is utility and reliability, which means we spend significant effort on prototypes and tech demos—proving to ourselves, before our clients, how AI can solve real-world, industry-ready problems. Most explorations, unfortunately, end up gathering dust in forgotten GitHub repos, yet sometimes we stumble upon interesting findings worth sharing: patterns we believe will set standards for future implementations and deserve to reach the broader community in a joint effort to push the state of the art forward.
This article tells the story of one such exploration. It started almost as a way to kill some time, and unexpectedly led me to build a Knowledge Graph-based RAG system while trying to solve a completely different and rather specific problem: constructing a timeline of events surrounding people, places, and evidence from a large corpus of Italian legal documents related to the Garlasco Murder Case—one of Italy’s most controversial criminal cases, still generating developments 17 years later.
The interesting part wasn’t the timeline itself. It was discovering that building the Knowledge Graph properly made retrieval, visualization, and a plethora of other possibilities a natural consequence. The graph didn’t just enable RAG—it made RAG almost trivial, among many other possible solutions.
Here I want to recount that journey in a less technical, more conversational way, with the aim of showing a general audience what Knowledge Graphs combined with AI agents can unlock, no matter your technical background. I will probably write another, more technically involved article that dives deeper into the actual implementation.
Why a Knowledge Graph to build a timeline in the first place?#
My initial goal was to derive structured data from a large corpus of legal documents (humongous PDFs full of Italian court jargon) and use it to feed a visual timeline. Each section of the timeline would map to an event in time, detailing all the relevant entities associated with that event (People, Places, Evidence, Documents, etc.). For each visual element, I wanted to be able to reference the actual textual sources, so users could verify claims and dive deeper. And, on top of that, I wanted this thing to be extensible: adding new documents should enrich the timeline and the connected entities without reprocessing everything from scratch.
To do this, I needed to go beyond simple summarization. I needed to map documents to conceptual entities, track how those entities evolved across documents, and enable both information retrieval and artifact generation.
Given these constraints, the shape of the solution was clear. Tracking people, events, rulings, and dates—and their connections over time—is inherently a graph problem. I started thinking: “Ok, I might need to perform some named entity recognition for people and events, then maybe I tie dates to events and people. Then if I track where entity X appears across documents I should be good to go.”
⚠️ TODO: add visual asset to accompany this section.
After some research, I came to the conclusion that the best approach was to define a proper taxonomy of the entities I wanted to represent and build a knowledge graph. If it worked (and it did), the “timeline” would just be the result of querying Event nodes and sorting by date.
Knowledge Graphs 101#
Knowledge graphs aren’t new—they’ve been around for decades, powering everything from Google’s search results to enterprise data integration. The core idea is simple on paper: represent information as a network of entities (nodes) connected by relationships (edges). A Person node connects to an Event node via an INVOLVED_IN relationship. That Event connects to a Place via LOCATED_AT. Each node can carry properties: names, dates, roles, descriptions.
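As a concrete picture of that node-and-edge model, here is a minimal in-memory sketch in Python. The entity names, keys, and properties are purely illustrative, not taken from any real dataset:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    label: str         # e.g. "Person", "Event", "Place"
    key: str           # unique identifier
    props: tuple = ()  # (name, value) pairs: names, dates, roles...

# Nodes indexed by key; edges as (source, RELATIONSHIP_TYPE, target) triples.
nodes = {
    "p1": Node("Person", "p1", (("name", "Jane Doe"),)),
    "e1": Node("Event", "e1", (("title", "First hearing"), ("date", "2007-08-01"))),
    "pl1": Node("Place", "pl1", (("name", "Courthouse"),)),
}
edges = [
    ("p1", "INVOLVED_IN", "e1"),   # Person -> Event
    ("e1", "LOCATED_AT", "pl1"),   # Event -> Place
]

def neighbors(key: str, rel: str) -> list[Node]:
    """Follow typed edges out of a node."""
    return [nodes[dst] for src, r, dst in edges if src == key and r == rel]
```

Calling `neighbors("p1", "INVOLVED_IN")` walks from the Person to the Event it participates in—the same traversal a graph database performs, minus persistence and indexing.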
The appeal is intuitive: instead of flattening everything into rows and columns, you preserve the natural structure of how things relate. The challenge is that knowledge graphs have traditionally been a pain to build and maintain. You need clean data, consistent schemas, and a lot of manual curation. For most use cases, the juice wasn’t worth the squeeze, but since I love to make my life difficult, and I always wanted to play around with knowledge graphs, I decided to give it a go.
⚠️ TODO: add visual asset to accompany this section.
Of course this choice didn’t come out of nowhere. LLMs can now extract entities and relationships from unstructured text at scale, dramatically lowering the construction cost of a KG. And once you have a graph, you can combine structured traversal with semantic search—a hybrid that’s more powerful than either alone. Microsoft Research has been exploring this with their GraphRAG approach; Neo4j has built native vector search into their graph database; frameworks such as LangChain and LlamaIndex integrate with these technologies by default; and since I began my journey with these technologies I’ve seen articles on the topic popping up like mushrooms after a rainy day, so I felt reassured that I wasn’t completely off the beaten path.
⚠️ TODO: add visual asset to accompany this section.
For my use case, I needed a specific structure: entities like Person, Event, Place, and Evidence, connected by relationships like INVOLVED_IN or RELATIVE_OF. The structure that emerged—and the architectural insight that made everything else possible—I’ll explain when we get to building it.
Building the KG for the Garlasco Case#
The first challenge was getting text out of the documents themselves. Italian court filings aren’t exactly machine-friendly: scanned PDFs, inconsistent formatting, dense legal prose.
I set up a data manipulation pipeline with a PDF document as an entry point (and some configuration props) that starts with OCR and text extraction, then moves to semantic chunking—breaking the document into meaningful sections rather than arbitrary character counts. Getting the chunking right matters a lot: too small and you lose context; too large and the extraction model struggles to focus. I settled on a hybrid approach that respects document structure (headers, paragraphs, numbered lists) while keeping sections digestible for the LLM, and made sure this was configurable per document.
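A rough sketch of that hybrid chunking idea, under the simplifying assumption that structural boundaries can be spotted with a couple of regex patterns (real court filings need sturdier heuristics):

```python
import re

def semantic_chunks(text: str, max_chars: int = 1200) -> list[str]:
    """Split on structural boundaries first (numbered items, ALL-CAPS
    headers), then fall back to paragraph splits for oversized sections."""
    sections = re.split(r"\n(?=\d+\.\s|[A-Z][A-Z .]{8,}\n)", text)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        if len(sec) <= max_chars:
            chunks.append(sec)
            continue
        # Oversized section: regroup its paragraphs under the size cap.
        buf = ""
        for para in sec.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append(buf.strip())
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(buf.strip())
    return chunks
```

The structural split keeps numbered clauses intact; the paragraph fallback guarantees no chunk blows past the size the extraction model can focus on.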
With clean, chunked text in hand, the next step was entity extraction. My first instinct was to reach for traditional NER tools—I tried spaCy, for instance, since it has solid Italian language models. But the results were disappointing. Standard NER recognizes generic categories (PERSON, ORG, DATE), not domain-specific concepts like “evidence” or “legal ruling.” Worse, it mislabeled constantly: unusual words got tagged as people, and dates in Italian formats were parsed incorrectly. I’d need an LLM to verify and correct everything anyway, so why not cut out the middleman?
The LLM-based approach worked dramatically better. I feed each document section to the model along with a detailed description of what I’m looking for—not just “extract people,” but guidance on how Italian names work (surname first? title included?), what counts as a valid event (must have a date, should involve other entities), what distinguishes a “place” from an address. The model returns structured entities matching my Pydantic schemas, with remarkably high precision. The cost is higher than NER, but the quality difference is night and day, and thankfully, token costs haven’t been prohibitive at all, with plenty of room for optimizations through private models or fine-tuning down the line.
The biggest hurdle here was tuning the workload: how many document sections to process per batch, and how many entity types to recognize per call. I settled on extracting one entity type at a time, which works better than asking for everything at once; at the same time, modern, larger context windows let me provide many document sections per batch without losing much precision (and drastically speed up a full document pass).
By contrast, when I prompted the model to extract people, events, places, and evidence simultaneously, it would often miss one category entirely while over-indexing on another. Of course this varies a lot depending on the model used, the thinking budget, and so on, but it was a consistent pattern across multiple experiments.
⚠️ TODO: add visual asset to accompany this section.
One neat trick I used further down the line was to have the model attach to each extracted entity a list of facts—short summaries of how and where that entity appears in the document.
A Person might have facts like “mentioned as the defendant in the 2007 murder trial” and “testified during the appeal proceedings.” These facts aren’t free-form text; they reference the specific document sections they came from. This creates a three-level hierarchy: Entity → Fact → Section.
What this enables is quite versatile: you can query at the concept level (find all events involving this person), retrieve supporting context at the fact level, and drill down to the actual source text when you need it—structured queries with full traceability to the original documents. The hierarchy emerged organically from the extraction workflow, but it turned out to be the most valuable architectural decision in the whole project.
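Sketched as data structures (the names are mine, not the actual implementation), the three-level hierarchy looks like this:

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    doc_id: str
    section_id: str
    text: str            # the original source text

@dataclass
class Fact:
    summary: str             # "testified during the appeal proceedings"
    section_ids: list[str]   # the sections where this fact was observed

@dataclass
class Entity:
    label: str               # "Person", "Event", ...
    name: str
    facts: list[Fact] = field(default_factory=list)

def sources_for(entity: Entity, sections: dict[str, Section]) -> list[str]:
    """Drill from entity through facts down to the original source text."""
    return [sections[sid].text for f in entity.facts for sid in f.section_ids]
```

`sources_for` is the traceability guarantee in miniature: every claim attached to an entity can be walked back to the exact passage it came from.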
⚠️ TODO: add visual asset to accompany this section.
Deduplication was the next hurdle. Legal documents don’t use consistent naming. The same person might appear as “Alberto Stasi,” “Stasi,” “l’imputato” (the defendant), or “il giovane” (the young man); the same event might be referenced by date in one document and by description in another. One of my goals was to have a single canonical representation for each entity, no matter how many documents and references it appears in. I needed to merge duplicates without accidentally collapsing distinct entities.
The solution: a two-stage approach. First, straightforward field comparisons catch obvious duplicates—same name and surname, same event title and date. For ambiguous cases, I route to an LLM with all the facts for both candidates: “Given these two Person entities and their associated facts, are they the same individual?” The model can reason about context in ways that string matching never could. When merging against an existing knowledge graph, the same logic applies: new entities either match existing ones (and enrich them with new facts) or create new nodes.
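A minimal sketch of that two-stage check, with the field comparison reduced to a name match and the LLM call stubbed out (`ask_llm` is a hypothetical yes/no function, not a real API):

```python
def same_entity(a: dict, b: dict, ask_llm) -> bool:
    """Stage 1: cheap field comparison catches obvious duplicates.
    Stage 2: route ambiguous pairs to an LLM with both fact lists."""
    # Stage 1: exact match on canonical fields.
    if a["name"].lower() == b["name"].lower():
        return True
    # Stage 2: let the model reason over the context the facts provide.
    prompt = (
        "Given these two Person entities and their associated facts, "
        "are they the same individual?\n"
        f"A: {a['name']} — facts: {a['facts']}\n"
        f"B: {b['name']} — facts: {b['facts']}"
    )
    return ask_llm(prompt)
```

The same predicate serves incremental merging: a new entity that matches an existing node contributes its facts to it; one that matches nothing becomes a new node.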
Relationship extraction is where things get genuinely hard. The combinatorial problem is daunting: every person could potentially relate to every event, every event to every place, every piece of evidence to every person who might have handled it. You can’t just compare everything against everything—the token costs alone would be prohibitive.
⚠️ TODO: add visual asset to accompany this section.
My approach uses relationship “bundles”: predefined groups of relationship types that make sense together. Social relationships (RELATIVE_OF, KNOWS, ASSOCIATED_WITH) form one bundle and only apply between people. Factual relationships (INVOLVED_IN, LOCATED_AT, CORRELATED_TO) form another, connecting people to events and events to places. For each bundle, I present the LLM with all relevant entities and their facts, asking it to identify which relationships actually exist. The facts provide the context needed to make these judgments.
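The bundle idea can be sketched like this; the bundle layout mirrors the description above, while the exact label pairs are my assumption:

```python
from itertools import product

# Relationship types grouped with the node labels they may connect.
BUNDLES = {
    "social":  {"rels": ["RELATIVE_OF", "KNOWS", "ASSOCIATED_WITH"],
                "pairs": [("Person", "Person")]},
    "factual": {"rels": ["INVOLVED_IN", "LOCATED_AT", "CORRELATED_TO"],
                "pairs": [("Person", "Event"), ("Event", "Place")]},
}

def candidate_pairs(entities: list[dict], bundle: str) -> list[tuple]:
    """Only entity pairs whose labels fit the bundle are ever shown to
    the LLM, pruning the all-against-all comparison."""
    allowed = BUNDLES[bundle]["pairs"]
    return [(a, b) for a, b in product(entities, entities)
            if a is not b and (a["label"], b["label"]) in allowed]
```

With the candidate set pruned per bundle, each LLM call sees only the entities (and facts) that could plausibly hold one of that bundle’s relationship types.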
I’ll be honest: entity extraction works well; relationship extraction can still be improved by a decent margin. It’s not that I’m unsatisfied with the final results, but rather that my optimizer brain knows there’s a better way than brute-forcing the extracted entities and their facts through an LLM once again, and I’ve begun exploring more architecture-dependent solutions, such as using Neo4j’s Graph Data Science (GDS) library to detect clusters of semantically related entities and then letting the LLM confirm or refine those clusters. But even with imperfect relationship extraction, the graph is robust and precise. And in retrospect, the entities themselves, with their facts and source links, already enable powerful queries.
From KG to KG-RAG#
This, for me, is the funniest part: RAG wasn’t even the original goal. I built the knowledge graph for the timeline—that was the whole point. But once the graph existed, with entities linked to facts linked to source sections, adding RAG capabilities was almost trivial. All it took, in about 30 minutes of coding, was wiring up a Pydantic AI agent with a handful of tools (i.e., methods that launch specific queries) to navigate the graph and writing a prompt that was honestly pretty rough around the edges. The results exceeded every expectation I had. The graph had already done the hard work of organizing information; the agent just needed to navigate it.
This accidental discovery got me thinking about why traditional RAG approaches struggle with certain problems, and why the graph-based alternative felt so much more natural.
The Limits of Naive RAG#
The bare, “naive” RAG pattern is straightforward: take a user query, convert it to an embedding, find document chunks with similar embeddings, stuff those chunks into an LLM’s context window, and generate a response. It works remarkably well for many use cases. But there’s a fundamental limitation baked into the approach: relevance is measured geometrically, not semantically.
When you retrieve by embedding similarity, you get chunks that look like the query in some high-dimensional vector space. That’s not the same as chunks that actually answer the query. A question about “the defendant’s alibi” might retrieve passages mentioning “defendant” and “alibi” separately, but miss the one paragraph that actually explains what the alibi was—because that paragraph uses different words. The retrieved chunks can also be homogeneous and redundant: you get five variations of the same information instead of the diverse context you actually need.
The top-k cutoff makes this worse. If the relevant information happens to be in the 11th most similar chunk and you’re retrieving the top 10, you’ll never see it. Tracking relationships over time—“how did the court’s position on the DNA evidence evolve across appeals?”—isn’t a similarity search problem at all. It’s a structural query that requires understanding which entities connect to which events in which order.
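The naive pattern, reduced to its geometric core (toy vectors standing in for real embeddings):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: the geometric relevance measure of naive RAG."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query_vec: list[float],
          chunk_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Rank chunks by similarity to the query and keep the k best.
    Anything at rank k+1 is invisible, however relevant it is."""
    ranked = sorted(chunk_vecs,
                    key=lambda c: cosine(query_vec, chunk_vecs[c]),
                    reverse=True)
    return ranked[:k]
```

Everything a real system adds—reranking, query expansion, hybrid search—is layered on top of exactly this ranking-and-cutoff step.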
⚠️ TODO: add visual asset to accompany this section.
Research confirms these limitations. Microsoft’s GraphRAG paper found that traditional vector RAG struggles with what they call “sensemaking queries”—questions requiring global understanding of a dataset rather than point lookups. The KG2RAG paper on arXiv documented that semantically retrieved chunks “can be homogeneous and redundant, failing to provide intrinsic relationships among chunks.”
⚠️ TODO: add visual asset to accompany this section and links to original documents.
To be fair: production RAG systems aren’t as naive as the basic pattern I described. There’s a whole toolkit of optimizations—reranking with cross-encoders, query expansion, hybrid keyword-plus-semantic search, iterative retrieval, LLM-based refinement of results. These techniques genuinely help, and for many applications they’re sufficient. All I’m saying here is that they’re still fundamentally patching a geometric retrieval system to behave more semantically, and of course that might (and should) be the right tool for the job. Knowledge Graph-based RAG strikes me as a more elegant solution: instead of optimizing around the limitations of embedding similarity, you structure the information in a way that makes the right context naturally accessible, and since graph databases such as Neo4j let you store embeddings for basically anything, you still get to do semantic search all the same. It’s versatile, it’s domain-agnostic once you define your taxonomy, and it addresses the root problem rather than its symptoms.
The Emerging Landscape#
I wasn’t the first to notice this gap of course. The idea of combining knowledge graphs with RAG has been gaining momentum, and Microsoft Research’s GraphRAG approach deserves particular mention. Their system extracts entities from documents, uses Leiden community detection to cluster related entities, generates hierarchical summaries for these communities, and answers queries by synthesizing partial answers across the hierarchy. On comprehensiveness metrics, they report 72-83% win rates over vector RAG. It’s impressive work, especially for “global” questions like “What are the main themes in this dataset?”
Neo4j, meanwhile, has built native vector search directly into their graph database, enabling hybrid queries that combine structured traversal with semantic similarity. The ecosystem is clearly moving toward this kind of integration.
When I tested Microsoft’s approach out of the box, I ran into friction: auto-inferred entities yielded duplicates, inconsistent naming, and a knowledge base that only the AI could really navigate. You end up with a huge database full of interesting things, but no clear way to query it programmatically. Now, to be fair, GraphRAG does support configuration for predetermined entity types—you’re not locked into pure auto-inference. But by the time I understood the framework deeply enough to customize it properly, I’d already started building my own pipeline. We developers love to preach “don’t reinvent the wheel” while quietly crafting artisanal wheels in our garages. In my defense, rolling my own gave me exactly the control I wanted: a graph optimized not just for RAG retrieval, but for the kind of structured workflows I had in mind—like building a timeline, or tracking how a specific person’s involvement changed over multiple documents.
Why I Prefer a Predetermined Taxonomy#
My approach goes the other direction: define your entity types upfront, extract to that schema, and accept that you might miss some interesting connections in exchange for control and predictability.
The tradeoffs are real. A predetermined taxonomy means you’re deciding in advance what matters, which requires domain knowledge and might cause you to overlook unexpected patterns. Auto-inference can surface things you didn’t know to look for. But for my use case—and I suspect for most production applications—the benefits of a fixed schema are substantial.
First, you can actually query the graph programmatically. Want all events involving a specific person, sorted by date? That’s a simple Cypher query when you know Event and Person are node types with predictable properties. With auto-inferred entities, you’re back to hoping the LLM figures out what you mean.
Second, debugging and quality assessment become tractable. When extraction goes wrong, you know which entity type misbehaved and can adjust its description. With auto-inference, problems are diffuse and hard to diagnose.
Third—and this is the real payoff—predetermined entities act like model objects in a traditional application. They’re not just retrieval fodder; they’re structured data you can build features around. The timeline visualization? Just sort Event nodes by date. A relationship graph showing who knows whom? Query the KNOWS and RELATIVE_OF edges. Automated triage based on entity types? Filter by label. The graph enables workflows that go beyond question-answering.
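The “simple Cypher query” from the first point can be sketched over an in-memory graph like this; the schema and property names are my assumptions, and the docstring shows the equivalent Cypher:

```python
def events_for(person_name: str, nodes: list[dict],
               edges: list[tuple]) -> list[dict]:
    """In-memory stand-in for the Cypher query:
        MATCH (p:Person {name: $name})-[:INVOLVED_IN]->(e:Event)
        RETURN e ORDER BY e.date
    """
    people = {n["id"] for n in nodes
              if n["label"] == "Person" and n["name"] == person_name}
    event_ids = {dst for src, rel, dst in edges
                 if rel == "INVOLVED_IN" and src in people}
    events = [n for n in nodes if n["id"] in event_ids]
    return sorted(events, key=lambda e: e["date"])  # the timeline, for free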
Navigation, Not Just Search#
Once you have a well-structured knowledge graph, retrieval stops being a search problem and becomes a navigation problem. The agent doesn’t need to guess which chunks might be relevant; it can traverse from a known entity to its facts to related entities to their facts, building context deliberately rather than probabilistically.
The Entity → Fact → Section hierarchy makes this work. An agent answering “What role did Alberto Stasi play in the initial investigation?” doesn’t search for similar text. It finds the Person entity for Alberto Stasi, retrieves his facts, identifies which facts relate to investigative events, pulls those Event entities, and follows the chain to the original source sections. Every step is traceable. Every piece of context is there for a reason.
⚠️ TODO: add visual asset to accompany this section.
This isn’t just more accurate—it’s more transparent. When the agent cites a source, you can verify exactly why that source was included. The reasoning path through the graph is explicit, not hidden inside embedding similarity scores.
The implementation itself is surprisingly simple. The agent has a small set of tools: search for entities by name or description, retrieve facts for a given entity, find related entities through specific relationship types, and fetch the original source text for any fact. That’s it. No text-to-Cypher translation, no complex query language exposed to the LLM. I considered letting the agent write arbitrary graph queries, but it’s error-prone—the model hallucinates property names, misremembers relationship directions, constructs syntactically invalid Cypher. Abstracting the graph behind simple, purpose-built tools turned out to be far more reliable.
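Sketched in Python (the class and method names are mine, and `graph` is whatever object wraps the database layer), the toolset is deliberately tiny:

```python
class GraphTools:
    """The four purpose-built tools exposed to the agent—no Cypher,
    no query language, just typed navigation primitives."""

    def __init__(self, graph):
        self.graph = graph  # any object implementing the calls below

    def search_entities(self, query: str) -> list[dict]:
        """Find entities by name or description."""
        return self.graph.search(query)

    def get_facts(self, entity_id: str) -> list[str]:
        """Retrieve the facts attached to an entity."""
        return self.graph.facts(entity_id)

    def related(self, entity_id: str, rel_type: str) -> list[dict]:
        """Traverse one typed relationship out of an entity."""
        return self.graph.neighbors(entity_id, rel_type)

    def get_source(self, fact_id: str) -> str:
        """Fetch the original document section behind a fact."""
        return self.graph.source(fact_id)
```

Keeping the surface this small is the reliability trade: the model can’t hallucinate property names or relationship directions it never gets to write.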
The agent plans its own traversal strategy. Given a question, it decides which entities to search for first, what relationships to explore, when it has enough context to answer. I cap the number of iterations to avoid loops, but in practice the agent converges quickly. The graph’s structure guides it naturally toward relevant information—there’s less flailing than you’d see with a pure retrieval system trying to guess which chunks matter.
What’s Next#
The RAG results were satisfying, but the bigger realization came later: the entities in the graph aren’t just retrieval fodder—they’re model objects. They behave like rows in a database, but with semantic context attached. You can run traditional data-driven workflows on them: filter by type, aggregate by property, trigger actions based on entity state. And you can do all of this while keeping the semantic search and LLM-powered reasoning that make RAG useful in the first place.
This opens up possibilities that pure RAG can’t touch. Imagine a tech support system built on a knowledge graph of product entities, known issues, and troubleshooting steps. An agent doesn’t just retrieve similar documentation—it navigates from the user’s reported symptom to matching issues to resolution steps, building a response from structured relationships rather than hoping the right paragraphs appear in the top-k results. Or consider an automated triage pipeline: incoming documents get parsed, entities extracted, and routed based on what the graph already knows. The graph provides context for classification that embeddings alone would miss.
The pattern is domain-agnostic. Legal documents were my test case, but the same approach applies to healthcare records, supply chain logistics, financial compliance—anywhere relationships matter and “find similar text” isn’t enough. Define your taxonomy. Build the graph. Get navigation, retrieval, and data-driven workflows as natural consequences of the same investment.
I’ve started generalizing the approach into a proper framework—something reusable beyond this single experiment. The prototype was messy, hardcoded, purpose-built for Italian court filings. The framework is cleaner: entity types defined through descriptors, extraction and deduplication as composable operations, the graph database abstracted behind ports that don’t care if you’re using Neo4j or something else. It’s not ready for public release, but it’s getting closer.
The insight I keep coming back to isn’t “use a knowledge graph.” It’s simpler: structure your text around concepts—around what actually matters to your domain. People, events, relationships, whatever your application cares about. Once you do, retrieval becomes a navigation problem with clear paths to follow. The knowledge graph is one implementation of that principle. Vector databases with careful metadata might be another. The point is that semantic similarity alone isn’t enough when you need to reason about structure, and adding structure upfront pays dividends across everything you build on top of it.
Whether you’re building a legal timeline, a tech support agent, or something I haven’t imagined, the pattern holds: understand your domain, define your entities, connect them to your source material, and let the structure do the heavy lifting. The LLM becomes a navigator rather than a guesser. That’s the shift.