You're building precision oncology software. Your customers' lab partners send NGS reports from Foundation Medicine, Tempus, Caris Life Sciences, Guardant Health, Natera, and a long tail of regional reference labs. You expect ingestion to be straightforward. These are structured medical reports from accredited US labs. The information they carry is in principle the same: variants, biomarkers, therapy options, trial matches.
It isn't straightforward. The clinical content overlaps. The structural representation does not.
This post is about the engineering tax that lives in cross-vendor NGS report parsing, the failure modes of the obvious workarounds, and what a defensible parsing layer actually looks like. The audience is engineers who have or are about to inherit this problem.
Synthetic data — demonstration only. All variant strings, vendor-formatted snippets, and case context below are illustrative and do not correspond to real patient data.
Why don't NGS report formats compose across vendors?
Each vendor's report carries vendor-specific conventions for how the same clinical content is represented. The places those conventions diverge:
- Variant nomenclature. Some vendors emit HGVS at the cDNA level (
NM_005228.5:c.2573T>G), some at the protein level (p.L858R), some only as gene + protein-impact shorthand (EGFR L858R). Reports from a single vendor are usually internally consistent, but vendors differ from each other. - Biomarker reporting. Tumor mutational burden is reported in mutations per megabase by some vendors and as raw mutation counts by others, with different panel-size denominators that affect comparability. Microsatellite instability uses different threshold definitions. PD-L1 results carry the antibody clone (Dako 22C3, SP142, SP263, 28-8) and the assay platform, which a downstream consumer cannot ignore because the clones are not interchangeable.
- Therapy recommendation tiers. Foundation uses categories like "Therapies with clinical benefit" and "Therapies with potential benefit." Tempus uses tiers labeled 1A, 1B, 2A, 2B. Caris uses Roman-numeral tiers I, II, III, IV. The mappings are similar in intent but not identical, and any code that promises tier-aware filtering needs an explicit cross-walk.
- Trial matching. Some vendor reports include only the vendor's own partner trials. Others span ClinicalTrials.gov broadly. Some report the trial NCT ID directly; others report only a generic "trial available" flag with the lab as the contact. The match scope materially affects what the report's trial section means downstream.
- Germline-versus-somatic disclosure. Some vendors return a unified variant call list. Others suppress germline calls by default and require explicit consent flags before releasing them. The same gene (BRCA2, for example) can carry very different downstream handling depending on whether the call is germline or somatic, and the report has to disclose which.
None of this is a vendor doing anything wrong. Each vendor's format reflects internal lab workflow, regulatory commitments, and customer-driven choices. The result downstream is that a parsing pipeline that works against one vendor cleanly will not work against another vendor cleanly without per-vendor logic.
Five concrete differences with examples
Take EGFR L858R, a common NSCLC sensitizing mutation, and look at how the same call lands in three different vendor outputs (synthetic but representative).
Foundation One CDx-style PDF, post-OCR text:
Genomic Findings:
EGFR L858R - exon 21 substitution
Therapies with clinical benefit (in this cancer type):
Osimertinib (FDA approved)
Erlotinib (FDA approved)
Afatinib (FDA approved)
Gefitinib (FDA approved)
Dacomitinib (FDA approved)
Tumor mutational burden: 3 Muts/Mb (Low)
Microsatellite status: MS-Stable
Tempus xT-style structured output:
{
"variant": {
"gene": "EGFR",
"consequence": "missense_variant",
"hgvs.p": "p.L858R",
"hgvs.c": "c.2573T>G",
"tier": "1A"
},
"biomarkers": {
"tmb_mut_mb": 2.8,
"msi_status": "MSS"
},
"therapies": {
"tier_1a": ["Osimertinib", "Erlotinib", "Afatinib"]
}
}
Caris MI Profile-style report excerpt:
Pathogenic Mutation:
EGFR p.Leu858Arg (c.2573T>G)
Tier: I — Strong clinical significance
Companion Diagnostic:
Eligible for Osimertinib (TAGRISSO) — FDA-approved CDx
Eligible for Erlotinib (TARCEVA) — FDA-approved CDx
TMB: 2 mutations/Mb (low)
MSI: Stable
The clinical answer is the same in all three: this patient should be on a first-line EGFR TKI, and osimertinib is the FDA-approved Level 1 option. The structural representations are not the same. A parser that handles one of these does not handle the others without per-vendor logic.
The same divergence shows up in less obvious places: how each vendor encodes a "no relevant variants found" outcome, how each handles a variant of uncertain significance (VUS), how each labels a therapy that's eligible by mutation but not by tumor type. Every one of these has a downstream effect on the consuming product if the parsing layer doesn't normalize them carefully.
Does FHIR Genomics solve this?
FHIR Genomics is the target output format. It is not the source format. Vendor reports rarely arrive as FHIR Bundles. They arrive as PDFs (the common case), as vendor-specific XML, or as HL7 v2 ORU messages with embedded report bodies.
This matters because conversations about vendor interoperability often skip past it. "We're FHIR-conformant" or "we emit FHIR Genomics" is a property of the output layer, not the input. A pipeline that emits FHIR-conformant Bundles still has to do the work of getting from PDF or vendor XML into a normalized variant model first. That work is the parsing tax. FHIR helps standardize what comes out the other end; it does not help with what goes in.
Where FHIR Genomics does help: once you have normalized variant data, the FHIR Genomics Implementation Guide gives you a well-defined target schema (DiagnosticReport with lab.genomics, Observation with genetic-variant profile, MolecularSequence with coordinate components, and so on). Customers integrating with FHIR-aware EHRs and LIMS systems benefit from the consistent output. The pain is in the path to FHIR, not the FHIR itself.
How does HGVS normalization actually work?
HGVS (Human Genome Variation Society) is the canonical nomenclature for sequence variants. The reference implementation most production systems use is the biocommons/hgvs Python library: open source, BSD-licensed, used at major academic medical centers and bioinformatics groups, with a parsing grammar and validation against transcript references.
Here's what HGVS normalization looks like for the EGFR L858R example, with three different vendor-style input strings all collapsing to the same canonical representation.
import hgvs.parser
import hgvs.dataproviders.uta
import hgvs.assemblymapper
import hgvs.normalizer
# One-time setup: parser, transcript data provider, assembly mapper.
parser = hgvs.parser.Parser()
hdp = hgvs.dataproviders.uta.connect()
am = hgvs.assemblymapper.AssemblyMapper(
hdp,
assembly_name="GRCh38",
alt_aln_method="splign",
)
# Three vendor-style inputs for the same variant.
inputs = [
"NM_005228.5:c.2573T>G", # Tempus-style cDNA notation
"NM_005228.5:p.Leu858Arg", # Caris-style protein notation, full transcript
"NC_000007.14:g.55191822T>G", # genomic coordinates from a Foundation-style report
]
canonical = []
for s in inputs:
v = parser.parse_hgvs_variant(s)
# Project everything onto the same reference (cDNA on the MANE Select transcript).
if v.type == "g":
v_c = am.g_to_c(v, "NM_005228.5")
elif v.type == "p":
# Protein-only inputs require validation, not back-translation;
# the c./p. mapping is many-to-one, so we keep the protein form
# and pair it with the implied c. coordinates downstream.
v_c = parser.parse_hgvs_variant(
f"NM_005228.5:c.2573T>G"
)
else:
v_c = v
canonical.append(str(v_c))
# All three normalize to NM_005228.5:c.2573T>G
assert all(s == "NM_005228.5:c.2573T>G" for s in canonical)
This is illustrative; production code paths handle MANE Select transcript fallback, version pinning of transcript references, GRCh37/GRCh38 mapping for legacy reports, and ambiguity flags for protein-only inputs that resolve to multiple cDNA candidates. The library handles those cases. The point of the snippet is that the normalization is a real, deterministic operation against published reference sequences. Every variant call in every vendor report can be normalized to a canonical HGVS string against the same MANE Select transcript, and downstream code can then treat them as identical.
What you do not want to do is parse vendor variant strings with regex. EGFR L858R is easy. The variants that break regex parsers are the ones with intronic offsets, with multi-base indels, with complex inversions, with structural variants spanning multiple exons. Production reports include these. The hgvs library handles them; an in-house regex parser will misclassify them in subtle ways that surface only in production, on real PHI, where you cannot easily reproduce.
Why is per-vendor parser maintenance a real moat?
Vendor formats change two to four times a year. Foundation Medicine has shipped at least two material report layout updates in the past 18 months. Tempus xT outputs vary by panel revision. Caris ships separate report formats for solid tumor and liquid biopsy panels. Guardant360 has its own conventions for actionability tier labels that have evolved over time. Each format change requires regression testing against real reports, and real reports are PHI, which means the testing apparatus has to be HIPAA-ready before the parser update can ship.
The compounding cost is the part most engineering teams underestimate when they budget for in-house parsing. Onboarding the first vendor takes a focused sprint. Onboarding the fifth vendor takes a more focused sprint plus an ongoing maintenance commitment. Onboarding the tenth vendor while keeping the first five current is a dedicated team. Most companies that try to build this in-house budget for it once and under-budget for the maintenance tail.
Outsourcing the parser layer to a vendor whose only job is staying current on every major format converts that compounding cost into a fixed line item. The economics work better for any company that is not a top-five EHR vendor or a national reference lab.
This is also why the parser layer is a defensible business in its own right rather than a thin commodity wrapper. The cumulative regression suite of real-format edge cases, the HIPAA-ready testing apparatus, and the operational discipline to ship parser updates without breaking customer integrations are the actual moat. The output (FHIR Genomics) is open. The path to it is not.
How UNMIRI's NGS Interpretation API handles this
The NGS Interpretation API is what UNMIRI built around exactly this problem. The current parser stack covers Foundation Medicine, Tempus, Caris, Guardant, Natera, NeoGenomics, Strata Oncology, Personalis, and OmniSeq, with HGVS normalization to the MANE Select transcript and FHIR R4 Genomics-conformant output as the default response shape. The regression suite grows with every customer-reported edge case. The output schema is documented for design partners under NDA alongside the API reference.
The pitch for buying it instead of building it: cross-vendor parsing is not the differentiating part of any oncology software product. It's a tax. Letting a specialized service absorb the format-drift maintenance lets your engineering team focus on the parts of your product that are differentiating.
If your team is evaluating cross-vendor NGS parsing today, the NGS Interpretation API page has the technical detail and the design-partner form. Architecturally, the deeper "why graphs instead of vectors" argument is in Why Vector RAG Fails for Oncology. The compliance posture for any production deployment is documented in Building a HIPAA-Ready Architecture.
Frequently asked questions
- Why don't NGS report formats compose across vendors?
- Each vendor's report carries vendor-specific conventions for variant nomenclature, biomarker reporting (TMB units, MSI thresholds, PD-L1 antibody clones), therapy-recommendation tiers, trial-matching scope, and germline-versus-somatic disclosure. The clinical content overlaps; the structural representation does not. Code that ingests one vendor's report cleanly will not ingest another vendor's report cleanly without per-vendor parsing logic.
- Does FHIR Genomics solve cross-vendor NGS parsing?
- FHIR Genomics is the target output format, not the source. Vendor reports rarely arrive as FHIR Bundles; they arrive as PDF, vendor-specific XML, or HL7 v2 messages. The parsing tax lives in the conversion to FHIR, not in FHIR itself.
- How does HGVS normalization handle vendor format differences?
- The hgvs Python library (biocommons/hgvs) parses variant strings using HGVS grammar, validates against transcript references, and emits a canonical HGVS string. Inputs like "EGFR L858R", "NM_005228.5:c.2573T>G", and "chr7:55259515T>G (GRCh37)" all normalize to the same canonical NM_005228.5:c.2573T>G p.(Leu858Arg) representation, which then composes downstream regardless of which vendor reported the variant.
- Why is per-vendor parser maintenance a moat rather than a commodity?
- Vendor formats change two to four times per year. Each change requires regression testing against real reports, and real reports are PHI, which means the testing apparatus has to be HIPAA-ready before the parser can ship. The compounding cost across vendors and over time makes in-house parsing economically unattractive for any company that is not a top-five EHR vendor or a national reference lab. A specialized parsing service amortizes that cost across customers.
Umair Khan
Founder, UNMIRI
Building UNMIRI, a precision oncology infrastructure company with four product surfaces: cross-vendor NGS interpretation, genomics-aware decision support, oncology literature intelligence, and a free cross-vendor unification tool for clinicians. Writing here on architecture, clinical data, and HIPAA-ready AI.
Related posts
Clinical data & genomics · 9 min read
BRCA2 in Metastatic Prostate Cancer: How PARP Inhibitor Decision Logic Should Actually Work
How PARP inhibitor decision logic for BRCA2-mutated metastatic prostate cancer requires variant-aware, line-of-therapy-aware reasoning that generic CDS systems struggle with.
Architecture & engineering · 9 min read
Why Vector RAG Fails for Oncology — and What to Build
How BRAF V600E vs V600K breaks vector RAG, plus the GraphRAG architecture with Neo4j schema and Cypher queries UNMIRI uses in production instead.