Engine 1

Cross-vendor NGS Interpretation API

Stop building parsers for every lab. UNMIRI's API handles Foundation Medicine, Tempus, Caris, Guardant, Natera, NeoGenomics, Strata, Personalis, and OmniSeq (LabCorp) with unified FHIR Genomics output. Variant normalization, biomarker extraction, CDx eligibility flags, and trial matching ship in the response, deterministically rendered, with provenance on every field.

Inspect the schema first. Talk to our team when you're ready to integrate.

Open schema on GitHub · HIPAA-ready (AWS + Microsoft BAAs active) · FHIR R4 conformant · 10+ vendors

Section 1

Every healthtech product hits the same wall.

The patient's genomic data sits in a clinical NGS report from a third-party lab. The clinical report runs roughly 10 to 30 pages depending on vendor, with technical appendices and QC data that can extend the full bundle further. The lab is one of 8 to 12 major vendors. Each vendor's report layout is subtly different: different gene panels, different evidence-grading conventions, different file formats (PDF for most, structured XML for some, HL7 v2 for legacy systems), and different update cadences.

Engineering teams shipping oncology features end up with one of two paths, both bad.

The build-it-yourself path: hire a bioinformatics-aware engineering team, write per-vendor parsers, maintain a regression suite for every report version, and redo the work every time a vendor reformats. The cost is real. A single Foundation Medicine layout change can break weekly batch ingestion in production. Most teams budget for the build once, then under-budget for the maintenance tail.

The skip-it path: accept structured data only when the lab can deliver it (rare in practice), and treat PDF reports as read-only artifacts the clinician opens manually. The product loses the ability to reason over genomic data programmatically. Variant-aware decision support, automated trial matching, and biomarker-eligibility flags for CDx all become clinician-mediated steps.

Vendor format change frequency is the part most engineering teams underestimate. Foundation Medicine's CDx layout has shifted at least twice in the past 18 months. Tempus xT outputs vary by cancer type and panel revision. Caris ships separate report formats for solid tumor and liquid biopsy panels. Guardant360 has its own conventions for actionability tiers. Each change requires regression testing against real reports, and real reports are PHI, which means the testing apparatus itself has to be HIPAA-ready before the parser can ship.

For any company that is not a top-five EHR vendor or a national reference lab, the in-house economics do not work. There is a third path: outsource the parsing layer to a vendor whose only job is staying current on every major lab format and emitting a single unified schema downstream products can build on. That is what Engine 1 is.

Section 2

A six-stage pipeline, each stage independently inspectable.

Engine 1 has six layers, each deliberately boring. The separation between parsing, normalization, knowledge lookup, and rendering is the architectural choice that lets the system stay correct as vendor formats and clinical knowledge both change underneath it.

  1. Ingest

    Accepted formats: PDF (the common case), structured XML (FoundationOne CDx, Tempus when available), HL7 v2 ORU messages (legacy LIS/LIMS), and FHIR Bundles. The ingest layer normalizes file metadata, extracts patient identifiers (kept separate from variant data downstream), and routes to the right parser pipeline.

  2. OCR and layout detection

    AWS Textract handles raw OCR for PDF inputs, with custom post-processing for table extraction across vendor-specific column conventions. The layout detector identifies report sections (variant calls, biomarkers, CDx flags, negative findings) so downstream parsers do not have to scan the whole document.

  3. Format-specific parsers

    One parser per major vendor, with versioning. Each parser is a deterministic state machine that maps the vendor's layout to UNMIRI's internal variant schema. When a vendor changes their layout, only that parser changes; the downstream graph and rendering layers stay stable.

  4. Variant normalization

    All variants are normalized to HGVS using the open-source `hgvs` Python library. Transcript selection prefers MANE Select where available, with explicit fallback rules for variants outside MANE coverage. Genomic coordinates are emitted in both GRCh37 and GRCh38.

  5. Knowledge graph lookup

    UNMIRI's architecture is built on a Neo4j knowledge graph encoding typed relationships across CIViC (CC0 public domain), ClinVar, ClinicalTrials.gov, openFDA drug labels, and CPIC pharmacogenomics. A variant query returns drug-gene-evidence edges, eligibility-criteria edges to actively recruiting trials, contraindication edges to FDA labels, and provenance metadata for every edge.

  6. Deterministic rendering

    The final output is rendered by typed templates from the graph results. Templates do not hallucinate. They cannot fabricate a drug name, misattribute an evidence tier, or invent a trial NCT ID. In clinical contexts, determinism is a feature, not a constraint.
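To make the variant normalization stage concrete, here is a minimal sketch of one small step: turning a vendor's missense shorthand into three-letter HGVS protein notation. It is an illustration only. It does not use the `hgvs` library (which needs a transcript data source to validate and project variants), it handles simple missense changes and nothing else, and the function name `shorthand_to_hgvs_protein` is hypothetical.

```python
import re
from typing import Optional

# Standard IUPAC one-letter to three-letter amino acid codes.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# Matches simple missense shorthand such as "L858R" or "p.L858R".
_SHORTHAND = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z])$")


def shorthand_to_hgvs_protein(text: str) -> Optional[str]:
    """Return three-letter HGVS protein notation, or None if not recognized."""
    match = _SHORTHAND.match(text.strip())
    if match is None:
        return None
    ref, position, alt = match.groups()
    if ref not in AA3 or alt not in AA3:
        return None
    return f"p.{AA3[ref]}{position}{AA3[alt]}"


print(shorthand_to_hgvs_protein("L858R"))
```

Returning `None` for anything outside the narrow pattern mirrors the pipeline's general stance: an unrecognized input produces no answer rather than a guessed one.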

LLMs help in exactly two narrow places: extraction edge cases where the per-lab parser hits an unusual format, and long-tail variants where the graph has no exact match and the LLM is asked to summarize literature context, with that output flagged at a lower confidence band. The default clinical path is LLM-free.
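A sketch of what LLM-free, template-based rendering can look like, assuming a hypothetical `DrugEvidenceEdge` shape for one graph result. The class, the template wording, and the field names are illustrative assumptions, not UNMIRI's actual templates.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class DrugEvidenceEdge:
    """One drug-gene-evidence edge from the graph lookup (hypothetical shape)."""
    variant: str
    drug: str
    tier: str
    source: str


# The template only has slots for fields that exist on the edge.
SENSITIVITY_TEMPLATE = Template(
    "$variant: sensitizing evidence for $drug (AMP/ASCO/CAP Tier $tier; source: $source)"
)


def render(edge: DrugEvidenceEdge) -> str:
    # substitute() raises KeyError if a slot has no value, so a missing
    # field fails loudly instead of being filled with invented text.
    return SENSITIVITY_TEMPLATE.substitute(
        variant=edge.variant, drug=edge.drug, tier=edge.tier, source=edge.source
    )


edge = DrugEvidenceEdge("EGFR L858R", "Osimertinib", "I-A", "FLAURA (NEJM 2018)")
print(render(edge))
```

Every word in the output comes either from the fixed template or from a field on the edge, which is why the same edge always renders the same sentence.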

The rationale for using a structured graph instead of a vector store is the subject of a separate post: Why Vector RAG Fails for Oncology. The short version: cosine similarity conflates clinically distinct variants. BRAF V600E and V600K differ by one amino acid, and not every BRAF inhibitor label covers both. A retrieval layer with no structural knowledge of that difference will return the wrong answer for one of them, confidently.
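The conflation problem can be shown with a toy example. The sketch below uses character-bigram counts as a stand-in for an embedding; it is not a real embedding model, and the record IDs are placeholders rather than real evidence. The two variant strings share eight of nine bigrams, so their cosine similarity comes out at roughly 0.89, while an exact-key lookup keeps them apart.

```python
import math
from collections import Counter


def bigram_vector(text: str) -> Counter:
    """Toy 'embedding': counts of adjacent character pairs."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[key] * b[key] for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Exact-key store: placeholder record IDs, not real evidence.
EXACT_INDEX = {
    ("BRAF", "V600E"): "record-for-V600E",
    ("BRAF", "V600K"): "record-for-V600K",
}

similarity = cosine(bigram_vector("BRAF V600E"), bigram_vector("BRAF V600K"))
print(f"similarity: {similarity:.2f}")
print(EXACT_INDEX[("BRAF", "V600E")], EXACT_INDEX[("BRAF", "V600K")])
```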

A concrete traversal for a single variant looks like this:

(Variant: EGFR L858R)
  ├── SENSITIZES_TO ──▶ (Drug: Osimertinib)
  │                      ↳ evidence_level: Tier I-A (AMP/ASCO/CAP)
  │                      ↳ source: FLAURA (NEJM 2018)
  ├── SENSITIZES_TO ──▶ (Drug: Erlotinib + Ramucirumab)
  │                      ↳ evidence_level: Tier I-A (AMP/ASCO/CAP)
  │                      ↳ source: RELAY (Lancet Oncol 2019)
  └── HAS_RESISTANCE_PATHWAY ▶ (Trial: NCT04988295 · MARIPOSA-2)
                         ↳ eligibility: EGFR L858R + prior osimertinib
                         ↳ match_type: pre-matched for progression

Queries are Cypher. Same input, same output, every time. No similarity scoring, no ambiguous retrieval, no LLM improvising the clinical answer. If a variant has no edge to a drug in the graph, the pipeline returns no recommendation, which is the correct answer in that case.
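A minimal sketch of that behavior, assuming an in-memory edge list in place of Neo4j so it runs without a database. The Cypher string is a hypothetical query for the same lookup (the `name` property is an assumption), and `sensitizing_drugs` is an illustrative helper, not part of the API.

```python
from typing import Dict, List

# Hypothetical Cypher for this lookup; the `name` property is an assumption.
SENSITIZING_DRUGS_QUERY = """
MATCH (v:Variant {name: $variant})-[r:SENSITIZES_TO]->(d:Drug)
RETURN d.name AS drug, r.evidence_level AS evidence_level, r.source AS source
ORDER BY drug
"""

# In-memory stand-in for the graph, copied from the traversal above.
EDGES = [
    ("EGFR L858R", "SENSITIZES_TO", "Osimertinib", "Tier I-A", "FLAURA (NEJM 2018)"),
    ("EGFR L858R", "SENSITIZES_TO", "Erlotinib + Ramucirumab", "Tier I-A", "RELAY (Lancet Oncol 2019)"),
]


def sensitizing_drugs(variant: str) -> List[Dict[str, str]]:
    """Exact-match lookup; an unknown variant yields an empty list."""
    rows = [
        {"drug": drug, "evidence_level": level, "source": source}
        for (name, relation, drug, level, source) in EDGES
        if name == variant and relation == "SENSITIZES_TO"
    ]
    # Sorting makes the output order independent of insertion order.
    return sorted(rows, key=lambda row: row["drug"])


print([row["drug"] for row in sensitizing_drugs("EGFR L858R")])
print(sensitizing_drugs("EGFR T790M"))
```

The second call returns an empty list because the stand-in graph has no edge for that variant, which is the "no recommendation" case.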

Vendor coverage targets

The list below reflects the parser roadmap for Engine 1. Specific parser status per vendor is shared with design partners under NDA.

Vendor               Formats
Foundation Medicine  PDF + structured XML
Tempus               PDF
Caris                PDF
Guardant             PDF + structured
Natera               PDF
NeoGenomics          PDF
Strata Oncology      PDF
Personalis           PDF
OmniSeq (LabCorp)    PDF
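One parser per vendor, with versioning, can be pictured as a registry keyed by vendor and layout version. The sketch below is an assumption about how such a registry might look; the keys `foundation-medicine` and `layout-a`, the decorator, and the placeholder parser body are all hypothetical.

```python
from typing import Callable, Dict, Tuple

ParserFn = Callable[[str], dict]
_REGISTRY: Dict[Tuple[str, str], ParserFn] = {}


def register(vendor: str, layout_version: str):
    """Decorator that files a parser under (vendor, layout_version)."""
    def decorator(fn: ParserFn) -> ParserFn:
        _REGISTRY[(vendor, layout_version)] = fn
        return fn
    return decorator


@register("foundation-medicine", "layout-a")
def parse_fmi_layout_a(text: str) -> dict:
    # Placeholder body: a real parser would walk the report sections.
    return {"vendor": "Foundation Medicine", "layout": "layout-a", "variants": []}


def select_parser(vendor: str, layout_version: str) -> ParserFn:
    """Return the pinned parser, or fail loudly for an unknown layout."""
    try:
        return _REGISTRY[(vendor, layout_version)]
    except KeyError:
        raise LookupError(
            f"no parser registered for {vendor!r} {layout_version!r}"
        ) from None


print(select_parser("foundation-medicine", "layout-a")("...")["layout"])
```

Under this arrangement a vendor layout change means registering one more function under a new key; older layouts keep their pinned parser and nothing downstream changes.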

Section 3

What you get back.

Submit a report, poll the job, get one canonical response. Every successful parse returns the same schema regardless of source vendor, with an audit envelope whose provenance metadata points at the specific knowledge-base entry, FDA label, or trial record that produced each value.

Sample request

POST /v1/parse HTTP/1.1
Host: api.unmiri.com
Content-Type: multipart/form-data
Authorization: Bearer <partner-api-key>

file=@F1CDx_synthetic_NSCLC.pdf

# 202 Accepted -> { "job_id": "...", "status": "processing",
#   "poll_url": "/v1/parse/{job_id}" }
# The canonical result lands at GET /v1/parse/{job_id}/result.
# To try the pipeline without an upload, POST /v1/parse/sample
# with { "sample_id": "fmi-egfr-l858r" } returns the same shape.
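The submit-then-poll flow above can be wrapped in a small helper. This is a sketch under stated assumptions: only the `processing` status appears in the sample, so the terminal status names `completed` and `failed` are guesses, and `poll_job` with its parameters is a hypothetical helper. The HTTP call is injected as `get_status` so the loop can be exercised without network access.

```python
import time
from typing import Callable, Dict

# Assumed terminal status names; only "processing" appears in the sample above.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_job(
    get_status: Callable[[str], Dict],
    poll_url: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call get_status(poll_url) until the job reaches a terminal status."""
    for _ in range(max_attempts):
        job = get_status(poll_url)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval_s)
    raise TimeoutError(
        f"job at {poll_url} still processing after {max_attempts} polls"
    )


# Usage with a fake status function standing in for an authenticated GET.
_fake_responses = iter([{"status": "processing"}, {"status": "completed"}])
print(poll_job(lambda url: next(_fake_responses), "/v1/parse/abc", sleep=lambda s: None))
```

In a real integration `get_status` would issue an authenticated GET against the `poll_url` returned by POST /v1/parse, and the result would then be fetched from GET /v1/parse/{job_id}/result.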

Sample response (truncated)

{
  "audit": {
    "responseId": "c5b8cbd0-2f21-4532-adad-99ee332a121c",
    "schemaVersion": "0.2.0",
    "engineVersion": "fmi_parser/0.1.0",
    "vendorSource": {
      "vendor": "Foundation Medicine",
      "product": "FoundationOne CDx",
      "reportFormat": "pdf"
    },
    "knowledgeBases": [
      { "name": "ClinVar", "version": "2026-01" },
      { "name": "ClinicalTrials.gov", "version": "2026-01" },
      { "name": "openFDA", "version": "2026-01" }
    ],
    "watermark": "Synthetic data - demonstration only"
  },
  "specimen": {
    "specimenType": "ffpe-tumor-tissue",
    "primaryTumorSite": { "display": "lung adenocarcinoma" }
  },
  "variants": [
    {
      "variantId": "50bfb335-bac6-4331-8116-d2373c3c25eb",
      "gene": { "symbol": "EGFR" },
      "hgvsProtein": "p.Leu858Arg",
      "hgvsCoding": "c.2573T>G",
      "variantType": "snv",
      "variantAlleleFraction": 0.342,
      "clinicalSignificance": "likely-pathogenic",
      "germlineOrSomatic": "somatic",
      "evidence": { "ampAscoCapTier": "I-A" }
    }
  ],
  "biomarkers": [
    { "type": "TMB", "value": 2.3, "unit": "mut/Mb", "interpretation": "low" },
    { "type": "MSI", "interpretation": "MSS" }
  ],
  "cdxFlags": [
    {
      "drug": { "name": "Osimertinib" },
      "indication": { "tumorType": "non-small cell lung cancer" },
      "approvalRegime": "FDA"
    }
  ],
  "contraindications": [
    {
      "drug": { "name": "PD-1/PD-L1 inhibitor monotherapy" },
      "reason": "EGFR-mutant NSCLC: poor response independent of PD-L1",
      "citations": ["openFDA label"]
    }
  ],
  "trialMatches": [
    {
      "nctId": "NCT04988295",
      "title": "MARIPOSA-2",
      "matchStrength": "strong"
    }
  ]
}

Output fields

  • audit: Provenance envelope: response ID, schema + engine version, vendor source, the knowledge bases used, and the synthetic-data watermark
  • specimen: Specimen type and primary tumor site, normalized from the source report
  • variants[]: Canonical variant calls: gene, HGVS coding/protein, variant type, allele fraction, clinical significance, and AMP/ASCO/CAP evidence tier
  • biomarkers[]: TMB, MSI, HRD, PD-L1 and other biomarkers where reported by the source vendor
  • cdxFlags[]: Companion-diagnostic eligibility: the drug, indication, and approval regime for each paired indication
  • contraindications[]: Drug-class contraindication flags with the reason and citing source
  • trialMatches[]: Variant-aware ClinicalTrials.gov matches with NCT ID, title, and match strength
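As a sketch of how a downstream product might consume these fields, the snippet below flattens a response into one line per variant, CDx flag, and trial match. It reads only keys that appear in the truncated sample response; `summarize` is a hypothetical helper, and the full schema may require more defensive handling than shown here.

```python
import json
from typing import List


def summarize(response_json: str) -> List[str]:
    """Flatten a parse response into one line per variant, CDx flag, and trial."""
    data = json.loads(response_json)
    lines = []
    for variant in data.get("variants", []):
        tier = variant.get("evidence", {}).get("ampAscoCapTier", "unknown")
        lines.append(
            f"variant {variant['gene']['symbol']} {variant['hgvsProtein']} (tier {tier})"
        )
    for flag in data.get("cdxFlags", []):
        lines.append(f"cdx {flag['drug']['name']} ({flag['approvalRegime']})")
    for trial in data.get("trialMatches", []):
        lines.append(
            f"trial {trial['nctId']} {trial['title']} [{trial['matchStrength']}]"
        )
    return lines


# A cut-down copy of the sample response above.
SAMPLE = json.dumps({
    "variants": [{"gene": {"symbol": "EGFR"}, "hgvsProtein": "p.Leu858Arg",
                  "evidence": {"ampAscoCapTier": "I-A"}}],
    "cdxFlags": [{"drug": {"name": "Osimertinib"}, "approvalRegime": "FDA"}],
    "trialMatches": [{"nctId": "NCT04988295", "title": "MARIPOSA-2",
                      "matchStrength": "strong"}],
})

for line in summarize(SAMPLE):
    print(line)
```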

The full output schema, including FHIR R4 Genomics Bundle conformance details, mCODE-compatible export format, and authentication guidance, is shared with design partners under NDA alongside the API reference. A fully rendered end-to-end example for a synthetic NSCLC case is at /sample-report.

Pricing

Per-report consumption pricing or annual platform license, whichever fits the integration shape. Typical annual contract value (ACV) is $30K to $150K for mid-market integrations and $250K to $750K for top-tier EHR and digital pathology platforms. Specific pricing is part of the design-partner conversation and depends on volume, latency requirements, and BAA scope. Design partners get pricing locked at design-partner rates for the first year of production use.

Section 4

Who Engine 1 is for.

Four buyer types are in scope today. One thread ties them together: cross-vendor parsing is not the differentiating part of any of these products. It is a tax. Engine 1 collapses that tax into a single integration.

EHR vendors

Epic, Athenahealth, eClinicalWorks, and other oncology-EHR shops typically integrate genomic data via per-customer custom integrations. Engine 1 collapses that into a single API call. Your oncology module gets variant-aware fields it can render in the patient chart without owning the parsing layer or the knowledge graph behind it.

Digital pathology platforms

Paige.AI, PathAI, Proscia, and Indica Labs are expanding from imaging into molecular. Engine 1 lets these platforms ingest molecular reports from any of their pathology customers' downstream NGS labs and surface variant data alongside H&E and IHC findings inside one workflow.

Decision support vendors

Flatiron OncoEMR, Navigating Cancer, OncoLens, Cota, and Carevive build oncology decision support without owning the parsing layer. Engine 1 provides structured genomic input that decision-support engines can reason over directly: drug-gene matches, contraindications, and pre-matched trials.

Mid-tier NGS labs

Mid-market reference labs serving hospital customers often need to ingest competitor reports (the patient was tested elsewhere first; the new lab is monitoring or repeating). Engine 1 handles that ingestion, so the lab's pipeline can reconcile internal and external testing without rebuilding parsers for every competitor.

Engine 1 does not include EHR-side rendering UI, the pathology image overlay, or decision-support interaction logic. Those stay your differentiation. Engine 1 sits below them as parsing-and-knowledge infrastructure.

Section 5

Why this is hard to build.

Three things make Engine 1 a real moat rather than a thin wrapper around a couple of parsers.

The vendor-format coverage is itself the product. UNMIRI maintains a parser per major vendor with version pinning and a regression suite of synthetic test reports for each known layout variant. When a vendor changes their format, the parser changes; the downstream graph and rendering stay stable. The regression suite grows with every customer-reported edge case, and that growth compounds. A team starting from scratch needs 12 to 18 months and access to real reports to reach parity, and real reports are PHI, which adds compliance overhead before the engineering work begins.

The clinical evidence layer is open. CIViC is CC0 public domain. ClinVar is maintained by NCBI at the US National Institutes of Health. ClinicalTrials.gov is federally maintained. openFDA is operated by the US Food and Drug Administration. CPIC is CC BY 4.0. Proprietary or commercial-tier KBs (such as OncoKB or COSMIC commercial) are not part of the open contract; customers who need them integrate under their own licensing terms via the evidence.externalLevels extension point. Customers can verify any output against the underlying source on the source's own site. Closed-source variant interpretation services keep their reasoning private; UNMIRI does not.

The output is deterministic. The same input produces the same output, with audit logs that capture the exact graph state at rendering time. A case re-run a year later produces the same recommendation, or a clear delta with documented reasons. This matters when a clinician or auditor asks how a specific recommendation was reached.
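One way a customer could check this property on their side is to fingerprint responses and compare sections between runs. The sketch below is an assumption about how that might be done, not a description of UNMIRI's audit log: it treats `responseId` as the only per-call field (an assumption), and `fingerprint` and `changed_sections` are hypothetical helpers.

```python
import hashlib
import json
from typing import Any, Dict, List

# Assumed to differ on every call, so excluded from comparisons.
VOLATILE_KEYS = {"responseId"}


def _stable(response: Dict[str, Any]) -> Dict[str, Any]:
    """Copy of the response with volatile audit fields removed."""
    body = dict(response)
    audit = response.get("audit", {})
    body["audit"] = {k: v for k, v in audit.items() if k not in VOLATILE_KEYS}
    return body


def fingerprint(response: Dict[str, Any]) -> str:
    """SHA-256 over canonical JSON (sorted keys, fixed separators)."""
    canonical = json.dumps(_stable(response), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_sections(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Top-level sections whose content differs between two runs."""
    old_s, new_s = _stable(old), _stable(new)
    keys = sorted(set(old_s) | set(new_s))
    return [k for k in keys if old_s.get(k) != new_s.get(k)]


run_1 = {"audit": {"responseId": "a", "schemaVersion": "0.2.0"},
         "trialMatches": [{"nctId": "NCT04988295"}]}
run_2 = {"audit": {"responseId": "b", "schemaVersion": "0.2.0"},
         "trialMatches": []}
print(changed_sections(run_1, run_2))
```

A matching fingerprint means the re-run reproduced the earlier answer; a non-empty `changed_sections` list tells the reviewer where to look for the documented delta.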

Clinical accuracy review is being built out through ongoing recruitment of board-certified pathologist advisors, with public introductions added once each engagement is formalized.

From the buyer's perspective, the practical question is what you do not have to build or maintain yourself once Engine 1 is integrated: per-vendor PDF and XML parsers, CIViC and ClinVar ingestion pipelines, ClinicalTrials.gov eligibility parsing, FDA-label contraindication mapping, HGVS normalization with MANE Select handling, FHIR Genomics Bundle emission, and a regression suite covering vendor layout drift. That stack is not the differentiating part of any product downstream of it. Engine 1 owns it so customers can stop owning it.

Section 6

Compliance and integration.

HIPAA-ready posture. The entire PHI path runs in a single AWS account in us-east-1 on HIPAA-eligible services, under an active AWS BAA. Microsoft Azure OpenAI handles narrow LLM inference under the Microsoft Online Services BAA, network-locked to UNMIRI's AWS NAT egress IP. The architecture meets these targets:

  • Managed PostgreSQL Multi-AZ for structured data and audit logs (AWS RDS in the BAA-covered AWS account; the entire PHI path is AWS-only)
  • Encrypted object storage with customer-managed keys, access logging, and versioning for primary document storage (AWS S3 with SSE-KMS in the BAA-covered AWS account)
  • Managed document extraction for PDF parsing (AWS Textract in the BAA-covered AWS account), with a separate transient input bucket on short-window auto-expiration
  • HIPAA-eligible LLM inference for extraction edge cases (Azure OpenAI Service under the Microsoft Online Services BAA), de-identified inputs only
  • US-only data residency across all PHI-handling resources

Upstream subprocessor BAAs are active: AWS Business Associate Addendum (account-scoped) and the Microsoft Online Services HIPAA BAA. Customer BAAs are part of design-partner onboarding. PHI is processed in memory and not persisted by UNMIRI after the response is returned. The full subprocessor list, BAA effective dates, and incident response posture are on the subprocessors page.

FHIR R4 conformance. The default output is a FHIR R4 Genomics-IG-conformant Bundle. DiagnosticReport carries the LOINC code 51969-4 (Genetic analysis report). Per-variant Observations use the genetic-variant profile with appropriate component coding. Treatment recommendations are emitted as genomic-implication Observations linked to their source variants. mCODE-compatible output is available for customers feeding cancer-registry pipelines.
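For orientation, here is a rough sketch of the Bundle shape this implies, built as plain Python dictionaries. It is not UNMIRI's actual output and has not been validated against the Genomics Reporting IG. The report code 51969-4 comes from the text above; the variant profile URL and the LOINC codes 69548-6, 48018-6, 48004-6, and 48005-3 reflect one reading of the IG and should be checked against it. Subject, category, and other fields a conformant resource would carry are omitted, and `variant_bundle` is a hypothetical helper.

```python
from typing import Dict

LOINC = "http://loinc.org"
# Profile URL as understood from the Genomics Reporting IG; verify before use.
VARIANT_PROFILE = "http://hl7.org/fhir/uv/genomics-reporting/StructureDefinition/variant"


def _component(code: str, text: str) -> Dict:
    return {
        "code": {"coding": [{"system": LOINC, "code": code}]},
        "valueCodeableConcept": {"text": text},
    }


def variant_bundle(gene: str, hgvs_c: str, hgvs_p: str) -> Dict:
    """Minimal DiagnosticReport plus one variant Observation (illustrative only)."""
    observation = {
        "resourceType": "Observation",
        "id": "variant-1",
        "meta": {"profile": [VARIANT_PROFILE]},
        "status": "final",
        "code": {"coding": [{"system": LOINC, "code": "69548-6"}]},
        "component": [
            _component("48018-6", gene),    # gene studied
            _component("48004-6", hgvs_c),  # DNA change (c.HGVS)
            _component("48005-3", hgvs_p),  # amino acid change (p.HGVS)
        ],
    }
    report = {
        "resourceType": "DiagnosticReport",
        "id": "report-1",
        "status": "final",
        "code": {"coding": [{"system": LOINC, "code": "51969-4",
                             "display": "Genetic analysis report"}]},
        "result": [{"reference": "Observation/variant-1"}],
    }
    return {"resourceType": "Bundle", "type": "collection",
            "entry": [{"resource": report}, {"resource": observation}]}


bundle = variant_bundle("EGFR", "c.2573T>G", "p.Leu858Arg")
print(bundle["entry"][0]["resource"]["code"]["coding"][0]["code"])
```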

EHR integration patterns. The common production pattern is LIMS-to-EHR via FHIR. Your LIMS POSTs the inbound NGS report to Engine 1, gets back the enriched FHIR Bundle, and routes it to the EHR through whatever channel your stack already uses (HL7 v2 for legacy, FHIR Bundle for modern, SMART-on-FHIR for app-style integrations). Engine 1 sits behind the LIMS, not behind the EHR, so you do not need a new EHR integration to ship.

Compliance documentation, including the full security posture overview, is shipped to design partners under NDA alongside the API reference. The publicly available portion is on the security overview.

Section 7

Become a design partner.

We are onboarding a small number of design partners for Engine 1. Design partners get the API reference under NDA, hands-on integration support from the engineering team, and pricing locked at design-partner rates for the first year of production use. In exchange, we ask for honest feedback on parser accuracy across your real-world report mix and a willingness to flag edge cases as they appear.

Please do not include patient names, medical record numbers, dates of birth, or any patient-identifying information in this form. This is a public intake channel and is not covered by a Business Associate Agreement. For secure transmission of patient data, email us to set up a covered intake.

Submissions are routed directly to partnerships@unmiri.com. We reply within one business day.