:An open-source pipeline that turns oncology research papers into structured, auditable biomarker data. Drop in a PDF, get back a 4-sheet workbook with study details, the biomarker catalog, statistical results, and author inferences — every row tagged with a verbatim source quote and the section it came from.
The problem
Pharma analysts and translational researchers spend hours hand-extracting biomarkers, p-values, hazard ratios, and cohort details from each paper. Existing tools either miss the statistical layer or hide the source, which makes the output un-auditable.
How it works
Local PDF parsing (pymupdf4llm + pdfplumber) — no Azure dependency, no cloud upload of your PDFs.
A study classifier detects disease, study type, and biomarker type, then composes a 4-level prompt (core + disease addon + study-type addon + biomarker-type addon).
Four specialized LLM agents run in parallel: Study Details, Biomarker Details, Results, Inferences.
Outputs: Excel (4 sheets), flat CSV (one row per finding, pivot-ready), and full JSON. A corpus-wide registry shows biomarker frequency across papers.
Disease coverage today: breast, lung, liver, colorectal, gastric, pancreatic, thyroid. Study types: longitudinal clinical, survival oncology, diagnostic, methylation, immune infiltration.
0
8
A productized service for pharma medicinal-chemistry teams that converts
literature and curated databases into a structured, source-traced SAR
landscape — in 24 hours, at a fraction of the cost of an analyst.
What you get
Structured dataset of compounds (canonical SMILES, Bemis–Murcko scaffold, MW / cLogP / TPSA / HBD / HBA descriptors, server-rendered SVG structures)
Bioactivity table (IC50, Ki, EC50, Kd) with values normalised to nM
Author-stated SAR claims and trends, each anchored to a verbatim snippet
Every fact carries a source pointer: paper / patent / database, section, quoted text — pharma compliance and IP teams require this
CSV export, with PDF report packaging available on request