Why manual multi-KB matching fails
A bioinformatician classifying a complex case typically opens 4–6 browser tabs: OncoKB for therapy tiers, ClinVar for clinical significance, COSMIC for mutation prevalence, openFDA for label-level drug status, CIViC for additional evidence, and PubMed for the source literature. They cross-reference by hand, reconcile disagreements, and write up a summary. That is 30–60 minutes per case on matching alone.
The tabs-and-spreadsheets approach doesn't scale with panel volume and doesn't produce a consistent audit trail. That's the gap this engine closes.
How the graph traversal works
Each oncology knowledge base is represented as a set of typed nodes and relationships in a single Neo4j graph. When a variant arrives, one query traverses all relevant sources simultaneously:
| Source | What it contributes | Graph relationship type |
|---|---|---|
| OncoKB | Therapy-oriented evidence tiers (Levels 1–4, plus resistance levels R1/R2) | SENSITIZES_TO, RESISTS |
| ClinVar | Aggregated clinical significance (pathogenic, VUS, benign) | HAS_CLINICAL_SIGNIFICANCE |
| COSMIC | Mutation prevalence by tumor type | OCCURS_IN_CANCER_TYPE |
| openFDA | Current approval + label-level contraindications | APPROVED_FOR, CONTRAINDICATED_WITH |
| CIViC | Additional predictive/prognostic evidence | HAS_EVIDENCE |
| gnomAD | Population allele frequency for germline filtering | HAS_POPULATION_FREQUENCY |
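As a rough illustration of the single-query traversal, the sketch below builds one parameterized Cypher query that follows every relationship type in the table from a variant node. The `Variant` label, the `hgvs` property, and the `source` and `source_version` relationship properties are assumptions made for this example, not the engine's actual schema. It also assumes every relationship hangs directly off the variant node; a real schema may need multi-hop patterns, for example reaching a drug label through a drug node.

```python
# Relationship types taken from the table above.
RELATIONSHIP_TYPES = [
    "SENSITIZES_TO", "RESISTS",
    "HAS_CLINICAL_SIGNIFICANCE",
    "OCCURS_IN_CANCER_TYPE",
    "APPROVED_FOR", "CONTRAINDICATED_WITH",
    "HAS_EVIDENCE",
    "HAS_POPULATION_FREQUENCY",
]

# Hypothetical schema: a Variant node keyed by an `hgvs` property, with
# `source` and `source_version` stored on each relationship.
CLASSIFY_QUERY = """
MATCH (v:Variant {hgvs: $hgvs})-[r]->(n)
WHERE type(r) IN $rel_types
RETURN type(r) AS relationship,
       r.source AS source,
       r.source_version AS source_version,
       properties(n) AS entry
ORDER BY relationship, source
"""


def build_classify_query(hgvs: str) -> tuple[str, dict]:
    """Return the Cypher text and its parameters for one variant."""
    return CLASSIFY_QUERY, {"hgvs": hgvs, "rel_types": list(RELATIONSHIP_TYPES)}


# Illustrative input only: BRAF c.1799T>A in HGVS notation.
query, params = build_classify_query("NM_004333.6:c.1799T>A")
```

The variant is passed as a parameter rather than interpolated into the query text, and the explicit `ORDER BY` keeps the row order stable between runs. With the official Neo4j Python driver, the pair could be handed to `session.run(query, params)`.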
What the response looks like
A single POST /v1/variants/classify call returns a unified classification: evidence tier, clinical significance, disagreement flags (when sources conflict), mutation prevalence context, and a complete citation set. Output structure example:
- variant — HGVS notation, transcript, gene
- amp_tier — I / II / III / IV (aggregated)
- oncokb_level — 1, 2, 3A, 3B, 4, or R1/R2
- clinvar_significance — pathogenic, likely pathogenic, VUS, etc.
- source_agreement — “concordant” | “partial” | “discordant” + details
- citations — per-source, with version and retrieval date
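The fields listed above could be modelled on the client side roughly as follows. This is a sketch, not the API's published schema: the top-level field names come from the list, but the nested names (`status`, `details`, `source`, `version`, `retrieved`) and the validation rules are assumptions, and the example values are placeholders rather than real engine output.

```python
from dataclasses import dataclass, field, asdict

AMP_TIERS = {"I", "II", "III", "IV"}
ONCOKB_LEVELS = {"1", "2", "3A", "3B", "4", "R1", "R2"}
AGREEMENT_STATES = {"concordant", "partial", "discordant"}


@dataclass
class Citation:
    source: str      # knowledge base name, e.g. "OncoKB"
    version: str     # source release the entry came from
    retrieved: str   # retrieval date as an ISO 8601 string


@dataclass
class SourceAgreement:
    status: str                          # one of AGREEMENT_STATES
    details: list[str] = field(default_factory=list)


@dataclass
class Classification:
    variant: dict                        # hgvs, transcript, gene
    amp_tier: str
    oncokb_level: str
    clinvar_significance: str
    source_agreement: SourceAgreement
    citations: list[Citation]

    def __post_init__(self) -> None:
        # Reject values outside the ranges the field list describes.
        if self.amp_tier not in AMP_TIERS:
            raise ValueError(f"unknown AMP tier: {self.amp_tier!r}")
        if self.oncokb_level not in ONCOKB_LEVELS:
            raise ValueError(f"unknown OncoKB level: {self.oncokb_level!r}")
        if self.source_agreement.status not in AGREEMENT_STATES:
            raise ValueError(f"unknown status: {self.source_agreement.status!r}")


# Placeholder values for illustration only, not a real lookup result.
example = Classification(
    variant={"hgvs": "NM_004333.6:c.1799T>A", "transcript": "NM_004333.6", "gene": "BRAF"},
    amp_tier="I",
    oncokb_level="1",
    clinvar_significance="pathogenic",
    source_agreement=SourceAgreement(status="concordant"),
    citations=[Citation(source="OncoKB", version="<version>", retrieved="<date>")],
)
payload = asdict(example)
```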
Handling source disagreement
When OncoKB and ClinVar disagree on a variant, the matching engine doesn't silently pick a winner. The output explicitly flags the disagreement with both positions. Your bioinformatician — or your institutional variant interpretation committee — makes the final call. That call can be pinned as a lab-specific override that applies to all future reports.
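One way to picture the flag-then-override flow is the sketch below. It assumes each source's position has already been normalized to a comparable label, which is a simplification, and the rule for "partial" (some sources agree, others do not) is an assumption of this example. The function names and the in-memory override store are hypothetical.

```python
def flag_agreement(calls: dict[str, str]) -> dict:
    """Summarize agreement across sources without choosing a winner.

    `calls` maps a source name to its normalized call.
    """
    distinct = set(calls.values())
    if len(distinct) <= 1:
        status = "concordant"
    elif len(distinct) < len(calls):
        status = "partial"       # some sources agree, others do not
    else:
        status = "discordant"    # every source takes a different position
    # Every source's position is kept so the reviewer sees all of them.
    return {"status": status, "positions": dict(calls)}


# Hypothetical lab-specific override store, keyed by HGVS string.
overrides: dict[str, str] = {}


def pin_override(hgvs: str, decision: str) -> None:
    """Record the committee's final call for reuse in future reports."""
    overrides[hgvs] = decision


def resolve(hgvs: str, calls: dict[str, str]) -> dict:
    """Attach any pinned override while leaving the source positions intact."""
    result = flag_agreement(calls)
    result["lab_override"] = overrides.get(hgvs)  # None until a call is pinned
    return result


# Labels below are placeholders, not real knowledge base entries.
before = resolve("variant-1", {"OncoKB": "call-a", "ClinVar": "call-b"})
pin_override("variant-1", "call-a")
after = resolve("variant-1", {"OncoKB": "call-a", "ClinVar": "call-b"})
```

The override is stored next to the source positions rather than replacing them, so a later report still shows what each knowledge base said alongside the lab's decision.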
Full architectural context: Why Vector RAG Fails for Oncology. Compliance details on the security page.
How UNMIRI actually does this
The OncoKB and ClinVar data are normalized into a single Neo4j graph, along with ClinicalTrials.gov and openFDA drug labels. A classification request runs as a Cypher query that returns all matched entries with provenance. No similarity scoring, no LLM reasoning — matching is deterministic and fully auditable. More on the architecture.