Section 1

Every healthtech product hits the same wall.

The patient's genomic data sits in a clinical NGS report from a third-party lab. The clinical report runs roughly 10 to 30 pages depending on vendor, with technical appendices and QC data that can extend the full bundle further. The lab is one of 8 to 12 major vendors. Each vendor's report layout is subtly different: different gene panels, different evidence-grading conventions, different file formats (PDF for most, structured XML for some, HL7 v2 for legacy systems), and different update cadences.

Engineering teams shipping oncology features end up with one of two paths, both bad.

The build-it-yourself path: hire a bioinformatics-aware engineering team, write per-vendor parsers, maintain a regression suite for every report version, redo the work every time a vendor reformats. The cost is real. A single Foundation Medicine layout change can break weekly batch ingestion in production. Most teams budget for this once, then under-budget for the maintenance tail.

The skip-it path: accept structured data only when the lab can deliver it (rare in practice), and treat PDF reports as read-only artifacts the clinician opens manually. The product loses the ability to reason over genomic data programmatically. Variant-aware decision support, automated trial matching, and biomarker-eligibility flags for CDx all become clinician-mediated steps.

Vendor format change frequency is the part most engineering teams underestimate. Foundation Medicine's CDx layout has shifted at least twice in the past 18 months. Tempus xT outputs vary by cancer type and panel revision. Caris ships separate report formats for solid tumor and liquid biopsy panels. Guardant360 has its own conventions for actionability tiers. Each change requires regression testing against real reports, and real reports are PHI, which means the testing apparatus itself has to be HIPAA-ready before the parser can ship.

For any company that is not a top-five EHR vendor or a national reference lab, the in-house economics do not work. There is a third path: outsource the parsing layer to a vendor whose only job is staying current on every major lab format and emitting a single unified schema downstream products can build on. That is what Engine 1 is.

Section 2

A six-stage pipeline, each stage independently inspectable.

Engine 1 has six layers, each deliberately boring. The separation between parsing, normalization, knowledge lookup, and rendering is the architectural choice that lets the system stay correct as vendor formats and clinical knowledge both change underneath it.
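The separation described above can be sketched as a chain of plain functions whose intermediate outputs are all kept, so any single stage can be inspected on its own. This is a minimal illustration, not the Engine 1 implementation: the stage functions, the `run_pipeline` helper, and the toy graph are all hypothetical.

```python
from typing import Any, Callable

# Hypothetical stage functions; each one takes the previous stage's output.
def parse(raw_text: str) -> dict:
    # Toy parser: expects "GENE PROTEIN_CHANGE", e.g. "EGFR L858R".
    gene, change = raw_text.split()
    return {"gene": gene, "protein_change": change}

def normalize(call: dict) -> dict:
    # Toy normalization: attach a canonical lookup key.
    return {**call, "key": f"{call['gene']}:{call['protein_change']}"}

# Toy stand-in for the knowledge graph.
TOY_GRAPH = {"EGFR:L858R": ["Osimertinib"]}

def lookup(variant: dict) -> dict:
    return {**variant, "drugs": TOY_GRAPH.get(variant["key"], [])}

def render(result: dict) -> str:
    drugs = ", ".join(result["drugs"]) or "no recommendation"
    return f"{result['gene']} {result['protein_change']}: {drugs}"

STAGES: list[tuple[str, Callable[[Any], Any]]] = [
    ("parse", parse),
    ("normalize", normalize),
    ("lookup", lookup),
    ("render", render),
]

def run_pipeline(raw_text: str) -> dict[str, Any]:
    """Run every stage in order and keep each intermediate output."""
    trace: dict[str, Any] = {}
    value: Any = raw_text
    for name, stage in STAGES:
        value = stage(value)
        trace[name] = value
    return trace

trace = run_pipeline("EGFR L858R")
print(trace["normalize"])  # output of one stage, inspectable in isolation
print(trace["render"])
```

Because each stage only sees the previous stage's output, swapping the toy `parse` for a different vendor layout would leave `normalize`, `lookup`, and `render` untouched.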

01
Ingest
Accepted formats: PDF (the common case), structured XML (FoundationOne CDx, Tempus when available), HL7 v2 ORU messages (legacy LIS/LIMS), and FHIR Bundles. The ingest layer normalizes file metadata, extracts patient identifiers (kept separate from variant data downstream), and routes each file to the right parser pipeline.
02
OCR and layout detection
AWS Textract handles raw OCR for PDF inputs, with custom post-processing for table extraction across vendor-specific column conventions. The layout detector identifies report sections (variant calls, biomarkers, CDx flags, negative findings) so downstream parsers do not have to scan the whole document.
03
Format-specific parsers
One parser per major vendor, with versioning. Each parser is a deterministic state machine that maps the vendor's layout to UNMIRI's internal variant schema. When a vendor changes their layout, only that parser changes; the downstream graph and rendering layers stay stable.
04
Variant normalization
All variants are normalized to HGVS using the open-source `hgvs` Python library. Transcript selection prefers MANE Select where available, with explicit fallback rules for variants outside MANE coverage. Genomic coordinates are emitted in both GRCh37 and GRCh38.
05
Knowledge graph lookup
UNMIRI's architecture is built on a Neo4j knowledge graph encoding typed relationships across CIViC (CC0 public domain), ClinVar, ClinicalTrials.gov, openFDA drug labels, CPIC pharmacogenomics, and OncoKB level assignments. A variant query returns drug-gene-evidence edges, eligibility-criteria edges to actively recruiting trials, contraindication edges to FDA labels, and provenance metadata for every edge.
06
Deterministic rendering
The final output is rendered by typed templates from the graph results. Templates do not hallucinate. They cannot fabricate a drug name, misattribute an evidence tier, or invent a trial NCT ID. In clinical contexts, determinism is a feature, not a constraint.
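As a rough illustration of what a typed template means in practice, the sketch below renders one line of output from a typed edge record. The `DrugEdge` shape and the wording of the rendered sentence are assumptions made for this example, not the Engine 1 templates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DrugEdge:
    """One drug-gene-evidence edge as returned by the graph (hypothetical shape)."""
    variant: str
    drug: str
    evidence_level: str
    source: str

def render_drug_line(edge: DrugEdge) -> str:
    # The template interpolates only fields carried by the typed edge, so it
    # has no way to introduce a drug name or evidence tier that the graph
    # did not return.
    return (
        f"{edge.variant} sensitizes to {edge.drug} "
        f"(evidence: {edge.evidence_level}; source: {edge.source})"
    )

edge = DrugEdge("EGFR L858R", "Osimertinib", "OncoKB Level 1", "FLAURA (NEJM 2018)")
line = render_drug_line(edge)
print(line)
```

Calling `render_drug_line` twice with the same edge returns the same string, which is the property the rendering layer relies on.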

LLMs help in exactly two narrow places: extraction edge cases where the per-lab parser hits an unusual format, and long-tail variants where the graph has no exact match and the LLM is asked to summarize literature context, with that output flagged at a lower confidence band. The default clinical path is LLM-free.
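A minimal sketch of the long-tail routing rule, assuming the graph can be queried by an exact variant key. The band labels, the `interpret` function, and the `summarize_literature` callback are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

# Hypothetical band labels; the only grounded fact is that LLM output
# sits in a lower confidence band than graph-derived output.
DETERMINISTIC = "deterministic"
LLM_LOWER_CONFIDENCE = "llm_lower_confidence"

def interpret(
    variant_key: str,
    graph: dict[str, list[str]],
    summarize_literature: Optional[Callable[[str], str]] = None,
) -> dict:
    """Prefer an exact graph match; use the LLM only when there is none."""
    if variant_key in graph:
        return {"band": DETERMINISTIC, "drugs": graph[variant_key], "summary": None}
    if summarize_literature is not None:
        # Long-tail path: no drug recommendation, only flagged context.
        return {
            "band": LLM_LOWER_CONFIDENCE,
            "drugs": [],
            "summary": summarize_literature(variant_key),
        }
    return {"band": DETERMINISTIC, "drugs": [], "summary": None}

toy_graph = {"EGFR:L858R": ["Osimertinib"]}
hit = interpret("EGFR:L858R", toy_graph)
miss = interpret("GENE:X1Y", toy_graph, lambda key: f"literature context for {key}")
```

In this sketch the LLM callback is never invoked when the graph has an exact match, and it never contributes to the `drugs` list.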

The rationale for using a structured graph instead of a vector store is the subject of a separate post: Why Vector RAG Fails for Oncology. The short version: cosine similarity conflates clinically distinct variants. BRAF V600E and V600K differ only in the amino acid substituted at codon 600, yet they are not interchangeable: some BRAF-inhibitor labels name one and not the other. A retrieval layer with no structural knowledge of that difference will return the wrong answer for one of them, confidently.
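The conflation problem can be seen with a toy comparison. The sketch below uses character-level similarity from Python's `difflib` as a simple stand-in for embedding cosine similarity (it is not the same metric), and a dictionary keyed on the exact variant as a stand-in for a structured lookup. The record IDs are placeholders, not clinical content.

```python
from difflib import SequenceMatcher

# Toy structured lookup keyed on the exact (gene, protein change) pair.
RECORDS = {
    ("BRAF", "V600E"): "record-A",
    ("BRAF", "V600K"): "record-B",
}

def string_similarity(a: str, b: str) -> float:
    # Character-level similarity, standing in for embedding similarity.
    return SequenceMatcher(None, a, b).ratio()

score = string_similarity("BRAF V600E", "BRAF V600K")
print(round(score, 2))  # near-identical under a similarity measure
print(RECORDS[("BRAF", "V600E")], RECORDS[("BRAF", "V600K")])  # distinct under exact lookup
```

The two strings share nine of ten characters, so any threshold loose enough to tolerate formatting noise would also treat them as the same query. The exact-key lookup has no such failure mode.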

A concrete traversal for a single variant looks like this:

(Variant: EGFR L858R)
  ├── SENSITIZES_TO ──▶ (Drug: Osimertinib)
  │                      ↳ evidence_level: Level 1 (OncoKB)
  │                      ↳ source: FLAURA (NEJM 2018)
  ├── SENSITIZES_TO ──▶ (Drug: Erlotinib + Ramucirumab)
  │                      ↳ evidence_level: Level 1 (OncoKB)
  │                      ↳ source: RELAY (Lancet Oncol 2019)
  └── HAS_RESISTANCE_PATHWAY ▶ (Trial: NCT04988295 · MARIPOSA-2)
                         ↳ eligibility: EGFR L858R + prior osimertinib
                         ↳ match_type: pre-matched for progression

Queries are Cypher. Same input, same output, every time. No similarity scoring, no ambiguous retrieval, no LLM improvising the clinical answer. If a variant has no edge to a drug in the graph, the pipeline returns no recommendation, which is the correct answer in that case.
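The production queries are Cypher against Neo4j. As a language-neutral illustration of the same behavior, here is a minimal in-memory sketch in Python that mirrors the traversal above. The `Edge` type and `drug_recommendations` function are hypothetical.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    relation: str
    target: str
    properties: dict

# Toy in-memory graph mirroring the EGFR L858R traversal.
GRAPH: dict[str, list[Edge]] = {
    "EGFR L858R": [
        Edge("SENSITIZES_TO", "Drug: Osimertinib",
             {"evidence_level": "Level 1 (OncoKB)", "source": "FLAURA (NEJM 2018)"}),
        Edge("SENSITIZES_TO", "Drug: Erlotinib + Ramucirumab",
             {"evidence_level": "Level 1 (OncoKB)", "source": "RELAY (Lancet Oncol 2019)"}),
        Edge("HAS_RESISTANCE_PATHWAY", "Trial: NCT04988295 · MARIPOSA-2",
             {"eligibility": "EGFR L858R + prior osimertinib"}),
    ],
}

def drug_recommendations(variant: str) -> list[Edge]:
    """Exact-match lookup; a variant with no edges yields an empty list."""
    return [e for e in GRAPH.get(variant, []) if e.relation == "SENSITIZES_TO"]

known = drug_recommendations("EGFR L858R")
unknown = drug_recommendations("EGFR A000A")  # made-up variant with no edges
```

An empty list for the unknown variant is the in-memory equivalent of the pipeline returning no recommendation.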

Vendor coverage targets

The list below reflects the parser roadmap for Engine 1. Specific parser status per vendor is shared with design partners under NDA.

Vendor	Formats	Panels
Foundation Medicine	PDF + structured XML	F1CDx, F1 Liquid, F1 Heme
Tempus	PDF	xT, xT Onco, xR
Caris	PDF	MI Profile (solid + liquid)
Guardant	PDF + structured	Guardant360, Reveal
Natera	PDF	Signatera, Empower
NeoGenomics	PDF	NeoTYPE Discovery, NeoLAB
Strata Oncology	PDF	StrataNGS
Personalis	PDF	ImmunoID NeXT, NeXT Personal
OmniSeq	PDF	OmniSeq Comprehensive

Section 3

What you get back.

A single API call. A structured response. The fields below come back on every successful request, with provenance metadata pointing at the specific knowledge-base entry, FDA label, or trial record that produced each value.

Sample request

POST /v1/reports HTTP/1.1
Host: api.unmiri.com
Content-Type: multipart/form-data
Authorization: Bearer <api-key>

vendor=foundation_medicine
report=@F1CDx_synthetic_NSCLC.pdf
output=fhir_genomics
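A minimal client-side sketch of assembling that request with the Python standard library. The host, path, and field names come from the sample above; the `https` scheme, the `build_report_request` helper, and the placeholder PDF bytes are assumptions. A real multipart request also needs a boundary parameter in the Content-Type header, which the sketch generates. The request is built but not sent.

```python
import uuid
import urllib.request

# Host and path from the sample request; the https scheme is assumed.
API_URL = "https://api.unmiri.com/v1/reports"

def build_report_request(
    api_key: str,
    pdf_bytes: bytes,
    filename: str,
    vendor: str = "foundation_medicine",
    output: str = "fhir_genomics",
) -> urllib.request.Request:
    """Assemble (but do not send) a multipart/form-data POST."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    # Plain text form fields.
    for name, value in (("vendor", vendor), ("output", output)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # The report file itself.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="report"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'.encode()
        + pdf_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )

request = build_report_request(
    "<api-key>", b"%PDF-1.7 placeholder", "F1CDx_synthetic_NSCLC.pdf"
)
# urllib.request.urlopen(request) would send it; omitted so the sketch stays offline.
```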

Sample response (truncated)

{
  "report_id": "rpt_synthetic_001",
  "vendor": "foundation_medicine",
  "patient": {
    "external_id": "synthetic-001",
    "tumor": "NSCLC adenocarcinoma, stage IVA"
  },
  "variants": [
    {
      "gene": "EGFR",
      "hgvs_p": "p.Leu858Arg",
      "hgvs_c": "c.2573T>G",
      "transcript": "NM_005228.5",
      "vaf": 0.342,
      "tier": "Tier IA",
      "evidence": "OncoKB Level 1",
      "drugs": [
        {
          "name": "Osimertinib",
          "evidence_level": "1",
          "fda_status": "approved",
          "source": "FLAURA (NEJM 2018)"
        }
      ]
    }
  ],
  "biomarkers": {
    "tmb": 2.3,
    "msi": "MSS",
    "pd_l1_tps": 0.5
  },
  "cdx_eligibility": ["EGFR_TKI", "ANTI_VEGF"],
  "contraindications": [
    {
      "drug_class": "PD-1/PD-L1 inhibitor monotherapy",
      "reason": "EGFR-mutant NSCLC: poor response independent of PD-L1",
      "source": "openFDA label"
    }
  ],
  "trial_matches": [
    {
      "nct_id": "NCT04988295",
      "name": "MARIPOSA-2",
      "match_type": "pre-matched for resistance pathway"
    }
  ]
}
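A short sketch of how a downstream product might consume that response, assuming the field names shown in the sample. The `approved_drugs` helper is hypothetical, and the embedded JSON is a trimmed copy of the sample rather than real API output.

```python
import json

# Trimmed copy of the sample response.
SAMPLE = json.loads("""
{
  "variants": [
    {"gene": "EGFR", "hgvs_p": "p.Leu858Arg", "tier": "Tier IA",
     "drugs": [{"name": "Osimertinib", "fda_status": "approved",
                "source": "FLAURA (NEJM 2018)"}]}
  ],
  "trial_matches": [{"nct_id": "NCT04988295", "name": "MARIPOSA-2"}]
}
""")

def approved_drugs(response: dict) -> list[tuple[str, str, str]]:
    """Return (variant, drug, source) for every drug marked approved."""
    rows = []
    for variant in response.get("variants", []):
        label = f"{variant['gene']} {variant['hgvs_p']}"
        for drug in variant.get("drugs", []):
            if drug.get("fda_status") == "approved":
                rows.append((label, drug["name"], drug["source"]))
    return rows

rows = approved_drugs(SAMPLE)
trial_ids = [trial["nct_id"] for trial in SAMPLE.get("trial_matches", [])]
print(rows)
print(trial_ids)
```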

Output fields

variants[]: HGVS-normalized variant calls (gene, transcript, VAF, tier, evidence)
biomarkers: TMB, MSI, HRD, PD-L1 (where reported by source vendor)
cdx_eligibility[]: Companion-diagnostic flags ready for downstream matching
contraindications[]: Drug-class flags derived from openFDA labels
trial_matches[]: Variant-aware ClinicalTrials.gov match list, pre-matched for resistance pathways where relevant
patient_timeline: Longitudinal aggregation when multiple reports for one patient are submitted

The full output schema, including FHIR R4 Genomics Bundle conformance details, mCODE-compatible export format, and authentication guidance, is shared with design partners under NDA alongside the API reference. A fully rendered end-to-end example for a synthetic NSCLC case is at /sample-report.

Pricing

Per-report consumption pricing or annual platform license, whichever fits the integration shape. Typical ACV is $30K to $150K for mid-market integrations and $250K to $750K for top-tier EHR and digital pathology platforms. Specific pricing is part of the design-partner conversation, where it depends on volume, latency requirements, and BAA scope. Design partners get pricing locked at design-partner rates for the first year of production use.

Section 4

Who Engine 1 is for.

Four buyer types are in scope today. The thread tying them together is the same: cross-vendor parsing is not the differentiating part of any of these products. It is a tax. Engine 1 collapses that tax into a single integration.

EHR vendors

Epic, Athenahealth, eClinicalWorks, and other oncology-EHR shops typically bring in genomic data through custom, per-customer integrations. Engine 1 collapses that into a single API call. Your oncology module gets variant-aware fields it can render in the patient chart without owning the parsing layer or the knowledge graph behind it.

Digital pathology platforms

Paige.AI, PathAI, Proscia, and Indica Labs are expanding from imaging into molecular. Engine 1 lets these platforms ingest molecular reports from any of their pathology customers' downstream NGS labs and surface variant data alongside H&E and IHC findings inside one workflow.

Decision support vendors

Flatiron OncoEMR, Navigating Cancer, OncoLens, Cota, and Carevive build oncology decision support without owning the parsing layer. Engine 1 provides structured genomic input that decision-support engines can reason over directly: drug-gene matches, contraindications, and pre-matched trials.

Mid-tier NGS labs

Mid-market reference labs serving hospital customers often need to ingest competitor reports (the patient was tested elsewhere first; the new lab is monitoring or repeating). Engine 1 handles that ingestion, so the lab's pipeline can reconcile internal and external testing without rebuilding parsers for every competitor.

Engine 1 does not include EHR-side rendering UI, the pathology image overlay, or decision-support interaction logic. Those stay your differentiation. Engine 1 sits below them as parsing-and-knowledge infrastructure.

Section 5

Why this is hard to build.

Three things make Engine 1 a real moat rather than a thin wrapper around a couple of parsers.

The vendor-format coverage is itself the product. UNMIRI maintains a parser per major vendor with version pinning and a regression suite of synthetic test reports for each known layout variant. When a vendor changes their format, the parser changes; the downstream graph and rendering stay stable. The regression suite grows with every customer-reported edge case, and that growth compounds. A team starting from scratch needs 12 to 18 months and access to real reports to reach parity, and real reports are PHI, which adds compliance overhead before the engineering work begins.
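Version pinning plus a regression suite can be sketched as a registry keyed on (vendor, layout version) and a list of golden cases. Everything here is hypothetical: the `register` decorator, the `example_vendor` layouts, and the golden inputs are invented for illustration and do not reflect any real vendor format.

```python
from typing import Callable

Parser = Callable[[str], dict]
PARSERS: dict[tuple[str, str], Parser] = {}

def register(vendor: str, layout_version: str):
    """Decorator that pins a parser to one (vendor, layout version) pair."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[(vendor, layout_version)] = fn
        return fn
    return wrap

# Both layouts below are invented for illustration.
@register("example_vendor", "v1")
def parse_v1(text: str) -> dict:
    gene, change = text.split(";")
    return {"gene": gene.strip(), "protein_change": change.strip()}

@register("example_vendor", "v2")
def parse_v2(text: str) -> dict:
    gene, change = text.split("|")
    return {"gene": gene.strip(), "protein_change": change.strip()}

# Golden cases: (vendor, version, synthetic input, expected output).
GOLDEN = [
    ("example_vendor", "v1", "EGFR; L858R", {"gene": "EGFR", "protein_change": "L858R"}),
    ("example_vendor", "v2", "EGFR | L858R", {"gene": "EGFR", "protein_change": "L858R"}),
]

def run_regression() -> list[str]:
    """Return a description of every golden case whose output drifted."""
    failures = []
    for vendor, version, text, expected in GOLDEN:
        actual = PARSERS[(vendor, version)](text)
        if actual != expected:
            failures.append(f"{vendor}/{version}: {actual!r} != {expected!r}")
    return failures

print(run_regression())
```

A layout change becomes a new registry entry and new golden cases; the older entries keep guarding reports issued under the previous layout.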

The clinical evidence layer is open. CIViC is CC0 public domain. ClinVar is maintained by NCBI at the US National Library of Medicine and is freely accessible. ClinicalTrials.gov is federally maintained. openFDA is run by the FDA, which releases its data under a CC0 dedication. CPIC is CC BY 4.0. OncoKB is the exception: it requires a separate commercial license, tracked in our internal data-use agreements. Customers can verify any output against the underlying source on the source's own site. Closed-source variant interpretation services keep their reasoning private; UNMIRI does not.

The output is deterministic. The same input produces the same output, with audit logs that capture the exact graph state at rendering time. A case re-run a year later produces the same recommendation, or a clear delta with documented reasons. This matters when a clinician or auditor asks how a specific recommendation was reached.
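One way to make that property checkable is to fingerprint each run from its normalized input and the graph snapshot it was rendered against. The sketch below is an assumption about how such an audit record could be keyed, not a description of UNMIRI's audit log; `audit_fingerprint` and the snapshot identifiers are hypothetical.

```python
import hashlib
import json

def audit_fingerprint(report_payload: dict, graph_snapshot_id: str) -> str:
    """Hash the normalized input together with the graph snapshot identifier."""
    canonical = json.dumps(
        {"input": report_payload, "graph_snapshot": graph_snapshot_id},
        sort_keys=True,            # key order must not affect the hash
        separators=(",", ":"),     # fixed whitespace for a stable encoding
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = audit_fingerprint({"gene": "EGFR", "hgvs_p": "p.Leu858Arg"}, "snapshot-A")
b = audit_fingerprint({"hgvs_p": "p.Leu858Arg", "gene": "EGFR"}, "snapshot-A")
c = audit_fingerprint({"gene": "EGFR", "hgvs_p": "p.Leu858Arg"}, "snapshot-B")
```

Here `a` and `b` match because only key order differs, while `c` differs because the graph snapshot changed. A re-run with a different fingerprint is the signal that a documented delta is needed.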

Clinical accuracy review is being built out through ongoing recruitment of board-certified pathologist advisors; public introductions will be added once each engagement is formalized.

From the buyer's perspective, the practical question is what you do not have to build or maintain yourself once Engine 1 is integrated: per-vendor PDF and XML parsers, OncoKB and ClinVar ingestion pipelines, ClinicalTrials.gov eligibility parsing, FDA-label contraindication mapping, HGVS normalization with MANE Select handling, FHIR Genomics Bundle emission, and a regression suite covering vendor layout drift. That stack is not the differentiating part of any product downstream of it. Engine 1 owns it so customers can stop owning it.

Section 6

Compliance and integration.

HIPAA-ready posture. UNMIRI's architecture is built on AWS for the PHI path, with the following architectural targets:

AWS RDS Postgres for structured data and audit logs
Encrypted S3 (SSE-KMS, access-logged, versioned) for primary document storage
AWS Textract for PDF extraction, with a separate transient input bucket on a 1-day Lifecycle expiration
Anthropic HIPAA-ready API tier for narrow LLM use (extraction edge cases only)
us-east-1 region pinning across all PHI-handling resources
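As an illustration of the 1-day expiration target on the transient Textract input bucket, here is a minimal sketch of an S3 lifecycle configuration expressed as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The bucket name and rule ID are hypothetical, and the call itself is left in a comment so the sketch runs without AWS credentials.

```python
# Hypothetical bucket name for the transient Textract input bucket.
TRANSIENT_BUCKET = "example-textract-transient-input"

LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "expire-transient-inputs-after-1-day",  # hypothetical rule ID
            "Filter": {"Prefix": ""},   # empty prefix: applies to every object
            "Status": "Enabled",
            "Expiration": {"Days": 1},  # delete objects one day after creation
        }
    ]
}

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket=TRANSIENT_BUCKET,
#       LifecycleConfiguration=LIFECYCLE_CONFIGURATION,
#   )
```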

Cloud-subprocessor BAAs (AWS, Anthropic, Vercel) are in active negotiation; customer BAAs are part of design-partner onboarding once the upstream chain is in place. PHI is processed in memory and not persisted by UNMIRI after the response is returned. The full subprocessor list, BAA status, and incident response posture live at /security/subprocessors.

FHIR R4 conformance. The default output is a FHIR R4 Genomics-IG-conformant Bundle. DiagnosticReport carries the LOINC code 51969-4 (Genetic analysis report). Per-variant Observations use the genetic-variant profile with appropriate component coding. Treatment recommendations are emitted as genomic-implication Observations linked to their source variants. mCODE-compatible output is available for customers feeding cancer-registry pipelines.
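A heavily simplified sketch of the Bundle shape described above, built as plain Python dictionaries. Only the DiagnosticReport code 51969-4 comes from this page. The profile URL and the LOINC codes 69548-6 and 48005-3 are taken from the public HL7 Genomics Reporting IG as best recalled and should be checked against the IG. The resource ids are hypothetical, and the result has not been run through a FHIR validator.

```python
import json

LOINC = "http://loinc.org"

# Per-variant Observation (assumed profile URL and component coding).
variant_observation = {
    "resourceType": "Observation",
    "id": "variant-1",  # hypothetical local id
    "meta": {"profile": [
        "http://hl7.org/fhir/uv/genomics-reporting/StructureDefinition/variant"
    ]},
    "status": "final",
    "code": {"coding": [{"system": LOINC, "code": "69548-6",
                         "display": "Genetic variant assessment"}]},
    "component": [
        {
            "code": {"coding": [{"system": LOINC, "code": "48005-3",
                                 "display": "Amino acid change (pHGVS)"}]},
            "valueCodeableConcept": {"coding": [
                {"system": "http://varnomen.hgvs.org", "code": "p.Leu858Arg"}
            ]},
        }
    ],
}

# DiagnosticReport carrying the LOINC code named in the text.
diagnostic_report = {
    "resourceType": "DiagnosticReport",
    "id": "report-1",  # hypothetical local id
    "status": "final",
    "code": {"coding": [{"system": LOINC, "code": "51969-4",
                         "display": "Genetic analysis report"}]},
    "result": [{"reference": "Observation/variant-1"}],
}

bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [{"resource": diagnostic_report}, {"resource": variant_observation}],
}

print(json.dumps(bundle, indent=2)[:200])
```

A conformant Bundle would carry more than this (subject references, specimen, implication Observations, full URLs on entries); the sketch only shows how the report and a variant Observation link together.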

EHR integration patterns. The common production pattern is LIMS-to-EHR via FHIR. Your LIMS POSTs the inbound NGS report to Engine 1, gets back the enriched FHIR Bundle, and routes it to the EHR through whatever channel your stack already uses (HL7 v2 for legacy, FHIR Bundle for modern, SMART-on-FHIR for app-style integrations). Engine 1 sits behind the LIMS, not behind the EHR, so you do not need a new EHR integration to ship.

Compliance documentation, including the full security posture overview, is shipped to design partners under NDA alongside the API reference. The publicly available portion is at /security.

Section 7

Become a design partner.

We are onboarding a small number of design partners for Engine 1. Design partners get the API reference under NDA, hands-on integration support from the engineering team, and pricing locked at design-partner rates for the first year of production use. In exchange, we ask for honest feedback on parser accuracy across your real-world report mix and a willingness to flag edge cases as they appear.

Cross-vendor NGS Interpretation API