A melanoma patient's tumor sequencing comes back with a BRAF V600E mutation. Standard of care is dabrafenib plus trametinib. First-line, well-supported, unambiguous.
Another patient. Same lab, same panel, same gene. The mutation this time is BRAF V600K. One amino acid substitution different — a different nucleotide change at the same codon.
If you build a retrieval system that treats these two cases as semantically similar — which is what vector RAG does by default — you'll surface the same recommendation for both. That's not just a retrieval failure. In medicine, it's a wrong drug for a real patient.
This is the failure mode that killed IBM Watson for Oncology, and it's the one worth internalizing before you ship any generative AI into a clinical workflow. The fix isn't a better embedding model. It's a different primitive entirely.
What is the vector RAG failure mode in clinical AI?
Vector RAG fails for oncology because cosine similarity treats semantically near-identical passages as interchangeable, even when they describe clinically distinct variants with different approved drugs. The failure surfaces predictably on near-miss variants — BRAF V600E vs. V600K, EGFR exon 19 deletions vs. insertions — where a single amino acid change means a different treatment.
Vector RAG works like this. A query comes in. You embed it, look up the nearest neighbors in a vector store, pass the retrieved chunks to an LLM, and ask for a synthesized answer. For general question answering, this is a reasonable pattern. For oncology, it breaks on a specific class of inputs: near-miss variants.
Consider a knowledge base that includes two nearby passages:
"BRAF V600E: first-line combination therapy with dabrafenib + trametinib. OncoKB Level 1; FDA-approved (2018)."
"BRAF V600K: first-line dabrafenib + trametinib or vemurafenib + cobimetinib, with lower response rates than V600E."
The two passages differ by a single character (E vs. K) and a clause about response rates. Run them through a typical off-the-shelf embedding model and the cosine similarity lands well north of 0.9. When your retriever runs against a V600K query, the V600E chunk is likely to surface in the top-k — and often above the correct chunk, because the V600E passage is more frequently referenced across the corpus.
Now the LLM sees both passages and generates a recommendation. The problem is that, given conflicting-but-similar inputs, LLMs tend to favor the more common, more strongly stated claim. V600E accounts for roughly 90% of BRAF mutations in melanoma, so the V600K chunk gets overshadowed. The output silently recommends V600E treatment for a V600K patient.
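You don't need an embedding model to see the collision. A character-trigram cosine (a crude, illustrative stand-in for what dense embeddings measure, not any production retriever) already scores one-character-apart passages as near-identical:

```python
from collections import Counter
from math import sqrt

def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity over character n-gram counts (toy proxy for an embedding)."""
    ca = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    cb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm

v600e = "BRAF V600E: first-line combination therapy with dabrafenib + trametinib."
v600k = "BRAF V600K: first-line combination therapy with dabrafenib + trametinib."

# One character apart: similarity stays well above 0.9,
# yet the two lines describe different clinical entities.
print(ngram_cosine(v600e, v600k))
```

The exact number depends on the model, but the shape of the failure is the same: surface similarity saturates long before clinical identity is resolved.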
You can try to patch this. Better embeddings. Hybrid retrieval with BM25. Re-ranking. Chain-of-thought prompting that asks the model to "check the exact variant first." I've seen all of these tried. None of them address the root issue: the retrieval layer has no structural knowledge of the difference between V600E and V600K. It only has surface-level linguistic proximity, and that's not enough for clinical precision.
Why does cosine similarity fail for variant-level reasoning?
Cosine similarity fails because it measures linguistic proximity, while clinical reasoning requires exact identity checks over discrete entities. Two variants described in nearly identical prose are still two distinct clinical objects with separate drug indications, contraindications, and evidence tiers. Similarity scoring collapses that distinction; graph traversal preserves it.
The deeper issue is that vector similarity and clinical reasoning are different operations over different kinds of data.
Vector similarity measures approximate linguistic proximity in an unstructured corpus. Clinical reasoning requires exact identity checks and typed relationship traversal over discrete entities: a specific variant, a specific drug, a specific OncoKB evidence level, a specific contraindication.
Forcing the second through the first doesn't make it robust. It makes it confidently wrong in the exact places you can't afford to be.
In software terms: we're using fuzzy string matching to compare memory addresses. The right primitive is an exact lookup — or in this domain's case, a graph traversal over explicitly typed variant-drug relationships.
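In code, the difference between the two primitives is one line. A toy sketch, with a hypothetical `recommendations` table keyed by exact HGVS identifier (drug strings taken from the passages above, for illustration only):

```python
# Hypothetical lookup table keyed by exact HGVS identifier (illustration only).
recommendations = {
    "BRAF:p.Val600Glu": "dabrafenib + trametinib",
    "BRAF:p.Val600Lys": "dabrafenib + trametinib or vemurafenib + cobimetinib",
}

def recommend(hgvs: str) -> str:
    # Exact identity check: an unknown variant is an explicit miss,
    # never a "close enough" nearest neighbor.
    return recommendations.get(hgvs, "no evidence-based recommendation available")

print(recommend("BRAF:p.Val600Lys"))  # returns the V600K entry, never the V600E one
```

An exact lookup cannot confuse V600E with V600K, and it fails loudly on a variant it has never seen, which is precisely the behavior similarity search cannot guarantee.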
What is GraphRAG, and how does it replace vector retrieval?
GraphRAG is a retrieval architecture where the knowledge base is a typed graph — genes, variants, drugs, trials, and evidence tiers are discrete nodes connected by typed relationships — and retrieval is a deterministic Cypher query rather than a similarity search. For clinical reasoning, this preserves variant-level identity that vector retrieval erases.
GraphRAG, in our usage, means treat the knowledge base as a typed graph. Clinically meaningful entities — genes, variants, drugs, trials, biomarkers, contraindications — are nodes. Clinically meaningful relationships are typed, directional edges.
Here's the slice of our Neo4j schema that handles the BRAF case:
// Nodes
CREATE CONSTRAINT FOR (p:Patient) REQUIRE p.patient_id IS UNIQUE;
CREATE CONSTRAINT FOR (g:Gene) REQUIRE g.hgnc_id IS UNIQUE;
CREATE CONSTRAINT FOR (v:Variant) REQUIRE v.hgvs IS UNIQUE;
CREATE CONSTRAINT FOR (d:Drug) REQUIRE d.rxcui IS UNIQUE;
// Labels + example instances
(:Patient {patient_id: "PT-123", tumor_type: "MELANOMA"})
(:Gene {hgnc_id: "HGNC:1097", symbol: "BRAF"})
(:Variant {hgvs: "BRAF:p.Val600Glu", short: "V600E", variant_type: "missense"})
(:Variant {hgvs: "BRAF:p.Val600Lys", short: "V600K", variant_type: "missense"})
(:Drug {rxcui: "1425099", name: "Dabrafenib"})
(:Drug {rxcui: "1425113", name: "Trametinib"})
// Relationships
(Patient)-[:HAS_MUTATION {vaf, tumor_type}]->(Gene)
(Gene)-[:SPECIFIC_VARIANT]->(Variant)
(Variant)-[:INDICATES_RESPONSE_TO {
    evidence_level: "Level 1",  // OncoKB
    fda_status: "FDA-approved",
    source: "OncoKB + FDA label + COMBI-d/COMBI-v trials"
}]->(Drug)
(Variant)-[:CONTRAINDICATES {severity, reason}]->(DrugClass)
Notice: V600E and V600K are separate Variant nodes. They share an incoming SPECIFIC_VARIANT edge from the BRAF Gene node, but every outgoing edge — every therapy recommendation, every contraindication — is specific to the variant instance itself. There is no ambiguity to resolve. The graph makes the distinction structurally, not statistically.
A real recommendation query looks like this:
MATCH (p:Patient {patient_id: $patient_id})
      -[:HAS_MUTATION]->(g:Gene)
      -[:SPECIFIC_VARIANT]->(v:Variant {hgvs: $variant_hgvs})
      -[r:INDICATES_RESPONSE_TO]->(d:Drug)
WHERE r.evidence_level IN ['Level 1', 'Level 2A']
RETURN
    g.symbol AS gene,
    v.hgvs AS variant,
    d.name AS drug,
    r.evidence_level AS tier,
    r.nccn_category AS nccn,
    r.source AS citations
ORDER BY r.evidence_level ASC
LIMIT 5;
Run this for a V600E patient and you get dabrafenib + trametinib at OncoKB Level 1, with the FDA label and COMBI-trial citations. Run the identical query with $variant_hgvs = "BRAF:p.Val600Lys" and you get the V600K edge — which, correctly, notes the different approved combinations and the lower response rate. The query is deterministic. No retrieval step involves similarity.
That is the core shift. Retrieval becomes a structured database query against pre-curated, version-pinned clinical knowledge. Not a similarity search over free text.
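To make the determinism concrete, here is the same traversal modeled in plain Python: an in-memory sketch of the edge structure above, with illustrative data, not the production Neo4j store:

```python
# In-memory sketch of Variant -[:INDICATES_RESPONSE_TO]-> Drug edges.
# Edge data is illustrative, not a clinical source.
EDGES = {
    "BRAF:p.Val600Glu": [
        {"drug": "Dabrafenib + Trametinib", "evidence_level": "Level 1"},
    ],
    "BRAF:p.Val600Lys": [
        {"drug": "Dabrafenib + Trametinib", "evidence_level": "Level 1"},
        {"drug": "Vemurafenib + Cobimetinib", "evidence_level": "Level 1"},
    ],
}

def indicated_drugs(variant_hgvs: str, tiers=("Level 1", "Level 2A")) -> list[str]:
    """Deterministic traversal: same variant in, same drugs out, every time."""
    return sorted(
        edge["drug"]
        for edge in EDGES.get(variant_hgvs, [])
        if edge["evidence_level"] in tiers
    )

print(indicated_drugs("BRAF:p.Val600Lys"))
```

There is no similarity score anywhere in the path: a V600K query can only ever touch V600K edges.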
The graph is built from authoritative sources on a disciplined cadence: OncoKB (quarterly), ClinVar (monthly), ClinicalTrials.gov (nightly), openFDA drug labels (on release). Every edge carries provenance: which knowledge-base entry, which RCT, which FDA label. That provenance is what makes the output auditable — and auditability is what makes it defensible when a CAP inspector asks how you arrived at a recommendation. Vector RAG cannot answer that question. The graph answers it by construction.
What role does the LLM play in a GraphRAG clinical pipeline?
In UNMIRI's GraphRAG pipeline, the LLM is scoped to two narrow jobs — extraction edge cases and long-tail variant fallback — and never touches the clinical output. The 2-page cheat sheet is rendered by deterministic templates from structured graph data. Templates can't hallucinate. In clinical contexts, determinism is a feature.
Here's where people get confused about our architecture. There is still an LLM in the pipeline. It just isn't writing the clinical output.
The graph traversal returns structured data — drug names, evidence levels, citations, contraindications, dosing. That structured data feeds into typed templates, not into a language model. Every sentence in the 2-page cheat sheet is rendered from a data field with a verified citation.
The rendering pattern, in sketch:
from pydantic import BaseModel

class Recommendation(BaseModel):
    variant: str          # "EGFR L858R"
    drug: str             # "Osimertinib"
    evidence_level: str   # "Level 1" (OncoKB)
    fda_status: str       # "FDA-approved"
    citations: list[str]  # ["OncoKB:EGFR-L858R", "FLAURA 2018"]
    dosing: str

def render_recommendation(r: Recommendation) -> str:
    return (
        f"{r.drug} is indicated for {r.variant}-mutant NSCLC as "
        f"{r.evidence_level.lower()} evidence per OncoKB; {r.fda_status}. "
        f"Dosing: {r.dosing}. Sources: {', '.join(r.citations)}."
    )
That's a template. A function. It can be unit-tested against every variant in the graph. It cannot produce a drug that isn't in its input. It cannot cite a trial that doesn't exist.
Where does the LLM come in, then? Two places, both outside the clinical path:
- Extraction edge cases. When AWS Textract and the per-lab parsers hit an unusual NGS report format, a narrow LLM call helps normalize the structured variant JSON before it enters the graph. The output of this step is data, not prose; the downstream template rendering is unchanged.
- Long-tail variant fallback. When a variant has no edge in the graph, a narrow LLM call surfaces the most recent literature context with an explicit lower-confidence flag in the output. The clinician sees the uncertainty; the template renders it as such.
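For the extraction path, the important property is that LLM output is validated as data before it can touch the graph. A minimal sketch of such a gate, assuming the simplified GENE:p.RefPosAlt shorthand used in the schema above (real HGVS validation is considerably stricter; `validate_variant` is a hypothetical helper):

```python
import re

# Simplified pattern for the "GENE:p.RefPosAlt" shorthand used in this post.
# Real HGVS validation is far more involved; this sketches the gate, not HGVS.
HGVS_SHORTHAND = re.compile(r"^[A-Z0-9]+:p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}$")

def validate_variant(extracted: dict) -> dict:
    """Reject malformed LLM-extracted variant records before graph ingestion."""
    hgvs = extracted.get("hgvs", "")
    if not HGVS_SHORTHAND.match(hgvs):
        raise ValueError(f"rejected extraction: malformed variant id {hgvs!r}")
    return extracted

validate_variant({"hgvs": "BRAF:p.Val600Lys"})  # passes the gate
```

A failed check routes the report to human review instead of the graph, so a hallucinated variant becomes a visible extraction error rather than a silent recommendation.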
The LLM subprocessor is Anthropic, on its HIPAA-ready API tier with a signed BAA. Prompts carry only de-identified variant data; Anthropic does not train on customer inputs or outputs on that tier.
Two properties fall out of this design:
The clinical output cannot contain fabricated content. Templates render from structured graph data. If the graph has no edge for a variant, the template either omits that section or renders a "no evidence-based recommendation available" clause. No LLM improvisation. No hallucinated citations.
Hallucinations become extraction errors, not reasoning errors. The narrow LLM use at the extraction boundary can produce a miscoded variant — recoverable through parser improvements and human review. The class of errors that destroys clinical trust — fabricated drugs, invented evidence levels, wrong trial NCT IDs — is structurally ruled out by the architecture.
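The no-edge path can be sketched in a few lines, assuming a hypothetical `render_section` helper alongside the template above:

```python
from typing import Optional

NO_EVIDENCE = "No evidence-based recommendation available for this variant."

def render_section(rec: Optional[dict]) -> str:
    """Render from graph data when an edge exists; otherwise say so explicitly.
    There is no code path that can invent a drug."""
    if rec is None:
        return NO_EVIDENCE
    return f"{rec['drug']} ({rec['evidence_level']}, per OncoKB)"

print(render_section(None))
```

The fallback string is itself a template constant, so even the "we don't know" case is deterministic and auditable.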
When is vector search still useful in medical AI?
Vector search remains useful for three non-reasoning tasks: literature discovery across unstructured PubMed content, free-text clinical-trial eligibility parsing, and duplicate case clustering across LIMS imports. In those contexts, approximate similarity is the right primitive. For clinical recommendations, it is not.
One more thing worth being honest about. Vector search is not useless in this domain. It's just not the reasoning substrate.
Three places where neural retrieval earns its keep in our pipeline:
- Literature discovery. When a clinician wants to explore the current state of research on a specific variant, neural retrieval over PubMed abstracts is better than graph traversal. We use it for a "related research" side panel, clearly separated from the recommendation itself.
- Free-text eligibility parsing. ClinicalTrials.gov eligibility criteria are half-structured and half free-text. Embeddings help extract structured criteria from the prose during our trial-ingestion job. Once extracted, the structured criteria are graph edges.
- Duplicate case clustering. Identifying two reports that describe the same patient across LIMS imports benefits from embedding-based similarity on patient metadata.
None of those decisions involve telling a clinician what drug to use. That decision lives in the graph.
How should a lab evaluate an oncology AI vendor?
The single most diagnostic question is: when the system produces a clinical recommendation, can the vendor trace it to the specific knowledge-base entry and version that generated it? A vendor that can't answer this is selling vector RAG with a confident UI — which will sometimes be wrong for the patient in front of you.
If you're evaluating oncology AI vendors — as a lab CTO, bioinformatics lead, or engineer who's about to inherit this problem — here's the question that cuts through most of the marketing: when the system produces a recommendation, can you trace the specific knowledge-base entry that generated it?
If the answer is "the model just knows," you're buying vector RAG with a confident UI. That system will sometimes be wrong for a V600K patient. And when you're wrong in medicine, you don't get a ranking penalty — you get a patient on the wrong drug.
The architecture that avoids this isn't novel. It's graphs instead of vectors, and deterministic templates instead of LLM-generated prose. For a fuller walkthrough of how UNMIRI puts these pieces together — PDF extraction, knowledge graph on OncoKB + ClinVar + ClinicalTrials.gov + openFDA, deterministic rendering, narrow LLM use — see the product page.
Frequently asked questions
- Is vector RAG ever appropriate for medical AI?
- Yes, for non-reasoning use cases: literature discovery across unstructured PubMed abstracts, parsing free-text clinical-trial eligibility criteria, and clustering duplicate case records across LIMS imports. Vector search is disqualified as the reasoning substrate for treatment recommendations because variant-level identity requires exact matching, not similarity scoring.
- Which knowledge bases does UNMIRI's graph include?
- OncoKB (quarterly refresh), ClinVar (monthly refresh), ClinicalTrials.gov (nightly), openFDA drug labels (on FDA publish), and COSMIC (on release). Every edge in the graph carries a reference to the specific knowledge-base version and entry that created it, which is what enables full provenance on every clinical recommendation UNMIRI produces.
- Can an LLM do clinical reasoning if you give it better prompts?
- No. Better prompting reduces obvious errors but doesn't change the underlying architecture: LLMs approximate responses from their training distribution, while clinical recommendations require exact variant-level identity checks. The fix is architectural — move reasoning out of the LLM and into a typed graph where V600E and V600K are separate, unambiguous nodes.
- How is GraphRAG different from IBM Watson for Oncology?
- Watson generated clinical recommendations directly from a statistical model trained largely on synthetic cases, which led to unsafe and incorrect suggestions and ultimately to the product's discontinuation. GraphRAG inverts that architecture: a structured knowledge graph makes every clinical call, deterministic templates render the output, and the LLM is confined to narrow jobs outside the clinical path.
Umair Khan
Founder, UNMIRI
Building UNMIRI — a GraphRAG-based NGS interpretation engine for regional diagnostic labs. Previously: software engineer working on data-intensive systems. Writing here on architecture, clinical data, and HIPAA-ready AI.