A melanoma patient's tumor sequencing comes back with a BRAF V600E mutation. Standard of care is dabrafenib plus trametinib. First-line, well-supported, unambiguous.
Another patient. Same lab, same panel, same gene. The mutation this time is BRAF V600K. One amino acid substitution different — a different nucleotide change at the same codon.
If you build a retrieval system that treats these two cases as semantically similar — which is what vector RAG does by default — you'll surface the same recommendation for both. That's not just a retrieval failure. In medicine, it's a wrong drug for a real patient.
This is the failure mode that killed IBM Watson for Oncology, and it's the one worth internalizing before you ship any generative AI into a clinical workflow. The fix isn't a better embedding model. It's a different primitive entirely.
What is the vector RAG failure mode in clinical AI?
Vector RAG fails for oncology because cosine similarity treats semantically near-identical passages as interchangeable, even when they describe clinically distinct variants with different approved drugs. The failure surfaces predictably on near-miss variants — BRAF V600E vs. V600K, EGFR exon 19 deletions vs. insertions — where a single amino acid change means a different treatment.
Vector RAG works like this. A query comes in. You embed it, look up the nearest neighbors in a vector store, pass the retrieved chunks to an LLM, and ask for a synthesized answer. For general question answering, this is a reasonable pattern. For oncology, it breaks on a specific class of inputs: near-miss variants.
Consider a knowledge base that includes two nearby passages:
"BRAF V600E: first-line combination therapy with dabrafenib + trametinib. OncoKB Level 1; FDA-approved (2018)."
"BRAF V600K: first-line dabrafenib + trametinib or vemurafenib + cobimetinib, with lower response rates than V600E."
The two passages differ by a single character (E vs. K) and a clause about response rates. Run them through a typical off-the-shelf embedding model and the cosine similarity lands well north of 0.9. When your retriever runs against a V600K query, the V600E chunk is likely to surface in the top-k — and often above the correct chunk, because the V600E passage is more frequently referenced across the corpus.
Now the LLM sees both passages and generates a recommendation. The problem is that, given conflicting-but-similar inputs, LLMs tend to favor the more common, more strongly stated claim. V600E accounts for roughly 90% of BRAF mutations in melanoma, so the V600K chunk gets overshadowed. The output silently recommends V600E treatment for a V600K patient.
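You don't need an embedding model to see the collision. A character-trigram cosine (a crude, illustrative stand-in for what dense embeddings measure, not any production retriever) already scores one-character-apart passages as near-identical:

```python
from collections import Counter
from math import sqrt

def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity over character n-gram counts (toy proxy for an embedding)."""
    ca = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    cb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm

v600e = "BRAF V600E: first-line combination therapy with dabrafenib + trametinib."
v600k = "BRAF V600K: first-line combination therapy with dabrafenib + trametinib."

# One character apart: similarity stays well above 0.9,
# yet the two lines describe different clinical entities.
print(ngram_cosine(v600e, v600k))
```

The exact number depends on the model, but the shape of the failure is the same: surface similarity saturates long before clinical identity is resolved.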
You can try to patch this. Better embeddings. Hybrid retrieval with BM25. Re-ranking. Chain-of-thought prompting that asks the model to "check the exact variant first." I've seen all of these tried. None of them address the root issue: the retrieval layer has no structural knowledge of the difference between V600E and V600K. It only has surface-level linguistic proximity, and that's not enough for clinical precision.
Why does cosine similarity fail for variant-level reasoning?
Cosine similarity fails because it measures linguistic proximity, while clinical reasoning requires exact identity checks over discrete entities. Two variants described in nearly identical prose are still two distinct clinical objects with separate drug indications, contraindications, and evidence tiers. Similarity scoring collapses that distinction; graph traversal preserves it.
The deeper issue is that vector similarity and clinical reasoning are different operations over different kinds of data.
Vector similarity measures approximate linguistic proximity in an unstructured corpus. Clinical reasoning requires exact identity checks and typed relationship traversal over discrete entities: a specific variant, a specific drug, a specific OncoKB evidence level, a specific contraindication.
Forcing the second through the first doesn't make it robust. It makes it confidently wrong in the exact places you can't afford to be.
In software terms: we're using fuzzy string matching to compare memory addresses. The right primitive is an exact lookup — or in this domain's case, a graph traversal over explicitly typed variant-drug relationships.
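In code, the difference between the two primitives is one line. A toy sketch, with a hypothetical `recommendations` table keyed by exact HGVS identifier (drug strings taken from the passages above, for illustration only):

```python
# Hypothetical lookup table keyed by exact HGVS identifier (illustration only).
recommendations = {
    "BRAF:p.Val600Glu": "dabrafenib + trametinib",
    "BRAF:p.Val600Lys": "dabrafenib + trametinib or vemurafenib + cobimetinib",
}

def recommend(hgvs: str) -> str:
    # Exact identity check: an unknown variant is an explicit miss,
    # never a "close enough" nearest neighbor.
    return recommendations.get(hgvs, "no evidence-based recommendation available")

print(recommend("BRAF:p.Val600Lys"))  # returns the V600K entry, never the V600E one
```

An exact lookup cannot confuse V600E with V600K, and it fails loudly on a variant it has never seen, which is precisely the behavior similarity search cannot guarantee.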
What is GraphRAG, and how does it replace vector retrieval?
GraphRAG is a retrieval architecture where the knowledge base is a typed graph — genes, variants, drugs, trials, and evidence tiers are discrete nodes connected by typed relationships — and retrieval is a deterministic Cypher query rather than a similarity search. For clinical reasoning, this preserves variant-level identity that vector retrieval erases.
GraphRAG, in our usage, means treat the knowledge base as a typed graph. Clinically meaningful entities — genes, variants, drugs, trials, biomarkers, contraindications — are nodes. Clinically meaningful relationships are typed, directional edges.
Here's the slice of our Neo4j schema that handles the BRAF case:
// Nodes
CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.patient_id IS UNIQUE;
CREATE CONSTRAINT FOR (g:Gene) REQUIRE g.hgnc_id IS UNIQUE;
CREATE CONSTRAINT FOR (v:Variant) REQUIRE v.hgvs IS UNIQUE;
CREATE CONSTRAINT FOR (d:Drug) REQUIRE d.rxcui IS UNIQUE;
// Labels + example instances
(:Patient {patient_id: "PT-123", tumor_type: "MELANOMA"})
(:Gene {hgnc_id: "HGNC:1097", symbol: "BRAF"})
(:Variant {hgvs: "BRAF:p.Val600Glu", short: "V600E", variant_type: "missense"})
(:Variant {hgvs: "BRAF:p.Val600Lys", short: "V600K", variant_type: "missense"})
(:Drug {rxcui: "1425099", name: "Dabrafenib"})
(:Drug {rxcui: "1425113", name: "Trametinib"})
// Relationships
(Patient)-[:HAS_MUTATION {vaf, tumor_type}]->(Gene)
(Gene)-[:SPECIFIC_VARIANT]->(Variant)
(Variant)-[:INDICATES_RESPONSE_TO {
    evidence_level: "Level 1",  // OncoKB
    fda_status: "FDA-approved",
    source: "OncoKB + FDA label + COMBI-d/COMBI-v trials"
}]->(Drug)
(Variant)-[:CONTRAINDICATES {severity, reason}]->(DrugClass)
Notice: V600E and V600K are separate Variant nodes. They share an incoming SPECIFIC_VARIANT edge from the BRAF Gene node, but every outgoing edge — every therapy recommendation, every contraindication — is specific to the variant instance itself. There is no ambiguity to resolve. The graph makes the distinction structurally, not statistically.
A real recommendation query looks like this:
MATCH (p:Patient {patient_id: $patient_id})
      -[:HAS_MUTATION]->(g:Gene)
      -[:SPECIFIC_VARIANT]->(v:Variant {hgvs: $variant_hgvs})
      -[r:INDICATES_RESPONSE_TO]->(d:Drug)
WHERE r.evidence_level IN ['Level 1', 'Level 2A']
RETURN
    g.symbol AS gene,
    v.hgvs AS variant,
    d.name AS drug,
    r.evidence_level AS tier,
    r.nccn_category AS nccn,
    r.source AS citations
ORDER BY r.evidence_level ASC
LIMIT 5;
Run this for a V600E patient and you get dabrafenib + trametinib at OncoKB Level 1, with the FDA label and COMBI-trial citations. Run the identical query with $variant_hgvs = "BRAF:p.Val600Lys" and you get the V600K edge — which, correctly, notes the different approved combinations and the lower response rate. The query is deterministic. No retrieval step involves similarity.
That is the core shift. Retrieval becomes a structured database query against pre-curated, version-pinned clinical knowledge. Not a similarity search over free text.
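To make the determinism concrete, here is the same traversal modeled in plain Python: an in-memory sketch of the edge structure above, with illustrative data, not the production Neo4j store:

```python
# In-memory sketch of Variant -[:INDICATES_RESPONSE_TO]-> Drug edges.
# Edge data is illustrative, not a clinical source.
EDGES = {
    "BRAF:p.Val600Glu": [
        {"drug": "Dabrafenib + Trametinib", "evidence_level": "Level 1"},
    ],
    "BRAF:p.Val600Lys": [
        {"drug": "Dabrafenib + Trametinib", "evidence_level": "Level 1"},
        {"drug": "Vemurafenib + Cobimetinib", "evidence_level": "Level 1"},
    ],
}

def indicated_drugs(variant_hgvs: str, tiers=("Level 1", "Level 2A")) -> list[str]:
    """Deterministic traversal: same variant in, same drugs out, every time."""
    return sorted(
        edge["drug"]
        for edge in EDGES.get(variant_hgvs, [])
        if edge["evidence_level"] in tiers
    )

print(indicated_drugs("BRAF:p.Val600Lys"))
```

There is no similarity score anywhere in the path: a V600K query can only ever touch V600K edges.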
The graph is built from authoritative sources on a disciplined cadence: OncoKB (quarterly), ClinVar (monthly), ClinicalTrials.gov (nightly), openFDA drug labels (on release). Every edge carries provenance: which knowledge-base entry, which RCT, which FDA label. That provenance is what makes the output auditable — and auditability is what makes it defensible when a CAP inspector asks how you arrived at a recommendation. Vector RAG cannot answer that question. The graph answers it by construction.
What role does the LLM play in a GraphRAG clinical pipeline?
In UNMIRI's GraphRAG pipeline, the LLM is scoped to two narrow jobs — extraction edge cases and long-tail variant fallback — and never touches the clinical output. The 2-page cheat sheet is rendered by deterministic templates from structured graph data. Templates can't hallucinate. In clinical contexts, determinism is a feature.
Here's where people get confused about our architecture. There is still an LLM in the pipeline. It just isn't writing the clinical output.
The graph traversal returns structured data — drug names, evidence levels, citations, contraindications, dosing. That structured data feeds into typed templates, not into a language model. Every sentence in the 2-page cheat sheet is rendered from a data field with a verified citation.
The rendering pattern, in sketch:
from pydantic import BaseModel

class Recommendation(BaseModel):
    variant: str          # "EGFR L858R"
    drug: str             # "Osimertinib"
    evidence_level: str   # "Level 1" (OncoKB)
    fda_status: str       # "FDA-approved"
    citations: list[str]  # ["OncoKB:EGFR-L858R", "FLAURA 2018"]
    dosing: str

def render_recommendation(r: Recommendation) -> str:
    return (
        f"{r.drug} is indicated for {r.variant}-mutant NSCLC as "
        f"{r.evidence_level.lower()} evidence per OncoKB; {r.fda_status}. "
        f"Dosing: {r.dosing}. Sources: {', '.join(r.citations)}."
    )
That's a template. A function. It can be unit-tested against every variant in the graph. It cannot produce a drug that isn't in its input. It cannot cite a trial that doesn't exist.
Where does the LLM come in, then? Two places, both outside the clinical path:
- Extraction edge cases. When AWS Textract and the per-lab parsers hit an unusual NGS report format, a narrow LLM call helps normalize the structured variant JSON before it enters the graph. The output of this step is data, not prose; the downstream template rendering is unchanged.
- Long-tail variant fallback. When a variant has no edge in the graph, a narrow LLM call surfaces the most recent literature context with an explicit lower-confidence flag in the output. The clinician sees the uncertainty; the template renders it as such.
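For the extraction path, the important property is that LLM output is validated as data before it can touch the graph. A minimal sketch of such a gate, assuming the simplified GENE:p.RefPosAlt shorthand used in the schema above (real HGVS validation is considerably stricter; `validate_variant` is a hypothetical helper):

```python
import re

# Simplified pattern for the "GENE:p.RefPosAlt" shorthand used in this post.
# Real HGVS validation is far more involved; this sketches the gate, not HGVS.
HGVS_SHORTHAND = re.compile(r"^[A-Z0-9]+:p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}$")

def validate_variant(extracted: dict) -> dict:
    """Reject malformed LLM-extracted variant records before graph ingestion."""
    hgvs = extracted.get("hgvs", "")
    if not HGVS_SHORTHAND.match(hgvs):
        raise ValueError(f"rejected extraction: malformed variant id {hgvs!r}")
    return extracted

validate_variant({"hgvs": "BRAF:p.Val600Lys"})  # passes the gate
```

A failed check routes the report to human review instead of the graph, so a hallucinated variant becomes a visible extraction error rather than a silent recommendation.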
The LLM subprocessor is Anthropic, on its HIPAA-ready API tier with a signed BAA. Prompts carry only de-identified variant data; Anthropic does not train on customer inputs or outputs on that tier.
Two properties fall out of this design:
The clinical output cannot contain fabricated content. Templates render from structured graph data. If the graph has no edge for a variant, the template either omits that section or renders a "no evidence-based recommendation available" clause. No LLM improvisation. No hallucinated citations.
Hallucinations become extraction errors, not reasoning errors. The narrow LLM use at the extraction boundary can produce a miscoded variant — recoverable through parser improvements and human review. The class of errors that destroys clinical trust — fabricated drugs, invented evidence levels, wrong trial NCT IDs — is structurally ruled out by the architecture.
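The no-edge path can be sketched in a few lines, assuming a hypothetical `render_section` helper alongside the template above:

```python
from typing import Optional

NO_EVIDENCE = "No evidence-based recommendation available for this variant."

def render_section(rec: Optional[dict]) -> str:
    """Render from graph data when an edge exists; otherwise say so explicitly.
    There is no code path that can invent a drug."""
    if rec is None:
        return NO_EVIDENCE
    return f"{rec['drug']} ({rec['evidence_level']}, per OncoKB)"

print(render_section(None))
```

The fallback string is itself a template constant, so even the "we don't know" case is deterministic and auditable.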
When is vector search still useful in medical AI?
Vector search remains useful for three non-reasoning tasks: literature discovery across unstructured PubMed content, free-text clinical-trial eligibility parsing, and duplicate case clustering across LIMS imports. In those contexts, approximate similarity is the right primitive. For clinical recommendations, it is not.
One more thing worth being honest about. Vector search is not useless in this domain. It's just not the reasoning substrate.
Three places where neural retrieval earns its keep in our pipeline:
- Literature discovery. When a clinician wants to explore the current state of research on a specific variant, neural retrieval over PubMed abstracts is better than graph traversal. We use it for a "related research" side panel, clearly separated from the recommendation itself.
- Free-text eligibility parsing. ClinicalTrials.gov eligibility criteria are half-structured and half free-text. Embeddings help extract structured criteria from the prose during our trial-ingestion job. Once extracted, the structured criteria are graph edges.
- Duplicate case clustering. Identifying two reports that describe the same patient across LIMS imports benefits from embedding-based similarity on patient metadata.
None of those decisions involve telling a clinician what drug to use. That decision lives in the graph.
How should a lab evaluate an oncology AI vendor?
The single most diagnostic question is: when the system produces a clinical recommendation, can the vendor trace it to the specific knowledge-base entry and version that generated it? A vendor that can't answer this is selling vector RAG with a confident UI — which will sometimes be wrong for the patient in front of you.
If you're evaluating oncology AI vendors — as a lab CTO, bioinformatics lead, or engineer who's about to inherit this problem — here's the question that cuts through most of the marketing: when the system produces a recommendation, can you trace the specific knowledge-base entry that generated it?
If the answer is "the model just knows," you're buying vector RAG with a confident UI. That system will sometimes be wrong for a V600K patient. And when you're wrong in medicine, you don't get a ranking penalty — you get a patient on the wrong drug.
The architecture that avoids this isn't novel. It's graphs instead of vectors, and deterministic templates instead of LLM-generated prose. For a fuller walkthrough of how UNMIRI puts these pieces together — PDF extraction, knowledge graph on OncoKB + ClinVar + ClinicalTrials.gov + openFDA, deterministic rendering, narrow LLM use — see the product page.
Frequently asked questions
- Is vector RAG ever appropriate for medical AI?
- Yes, for non-reasoning use cases: literature discovery across unstructured PubMed abstracts, parsing free-text clinical-trial eligibility criteria, and clustering duplicate case records across LIMS imports. Vector search is disqualified as the reasoning substrate for treatment recommendations because variant-level identity requires exact matching, not similarity scoring.
- Which knowledge bases does UNMIRI's graph include?
- OncoKB (quarterly refresh), ClinVar (monthly refresh), ClinicalTrials.gov (nightly), openFDA drug labels (on FDA publish), and COSMIC (on release). Every edge in the graph carries a reference to the specific knowledge-base version and entry that created it, which is what enables full provenance on every clinical recommendation UNMIRI produces.
- Can an LLM do clinical reasoning if you give it better prompts?
- No. Better prompting reduces obvious errors but doesn't change the underlying architecture: LLMs approximate responses from their training distribution, while clinical recommendations require exact variant-level identity checks. The fix is architectural — move reasoning out of the LLM and into a typed graph where V600E and V600K are separate, unambiguous nodes.
- How is GraphRAG different from IBM Watson for Oncology?
- Watson generated clinical recommendations directly from a statistical model trained largely on synthetic cases, which led to unsafe and incorrect suggestions and ultimately to the product's discontinuation. GraphRAG inverts that architecture: a structured knowledge graph makes every clinical call, deterministic templates render the output, and the LLM is confined to narrow jobs outside the clinical path.
Umair Khan
Founder, UNMIRI
Building UNMIRI — a GraphRAG-based NGS interpretation engine for regional diagnostic labs. Previously: software engineer working on data-intensive systems. Writing here on architecture, clinical data, and HIPAA-ready AI.