Training Transformers to Solve the 95% Failure Rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik
Most Important Takeaway
Roughly 90-95% of cancer drugs fail in the clinic, and Noetik’s thesis is that this is a patient-selection problem, not a pharmacology problem. By generating massive, intentionally designed multimodal patient-tissue datasets (H&E, protein stains, spatial transcriptomics, genotyping) and training transformer-based “virtual cell” foundation models on that data, they can predict which patients will respond to which drugs — effectively turning a clinical trial’s “which cohort?” question into an AI inference problem.
Chapter Summaries
1. Intro & What is Noetik
Ron Alfa (co-founder, physician-scientist) and Dan Bear (VP of AI) introduce Noetik. Founding thesis: cancer drugs fail because we select the wrong patients, not because we make bad drugs. Models can be used bidirectionally — to find which target fits a patient population and which patient population fits a molecule.
2. Why Preclinical Systems Don’t Translate
Cell lines are decades-old "Frankensteinian" immortalized cells that don't represent real human tumors. Mouse xenograft models are similarly unreliable. This drives clinical trials into expensive "open-label" enrollment, where sponsors throw the drug at broad populations hoping to find a signal.
3. Data Strategy — Generate, Don’t Scrape
Noetik generates its own data in-house rather than relying on public repositories. Inspired by ImageNet and PDB as examples of intentionally curated datasets. Uses image-based modalities for scale and cost-efficiency; randomizes patients across multiple slides to control for batch effects.
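The slide-randomization idea can be sketched in a few lines: spread each patient's samples across slides at random so that slide (and hence staining batch) is never confounded with patient identity. The function name and round-robin scheme below are illustrative assumptions, not Noetik's actual pipeline, which would also balance by site, indication, and other covariates.

```python
import random

def randomize_to_slides(patient_ids, n_slides, seed=0):
    """Shuffle patients across slides so batch effects (staining day,
    slide position) are not confounded with patient identity.

    Illustrative sketch only.
    """
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    # Round-robin over the shuffled order spreads patients evenly
    # across slides while keeping the assignment random.
    return {pid: i % n_slides for i, pid in enumerate(shuffled)}

assignment = randomize_to_slides([f"P{i:03d}" for i in range(12)], n_slides=3)
```

Without this step, a model's patient embeddings tend to encode staining day rather than biology, as the bullet notes.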
4. The Multimodal Data Stack
Each patient sample produces: H&E histology (tissue-level), multi-channel protein immunofluorescence (cell-type-level), spatial transcriptomics (~20,000 gene RNA expression per spatial location), and DNA genotyping. Spatial transcriptomics is the “meaty” ~20,000-channel computer vision problem.
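As a rough sketch of the data shapes this stack implies (all spatial sizes, channel counts, and the genotype encoding are illustrative assumptions; only the ~20,000-gene figure comes from the discussion):

```python
import numpy as np

# Hypothetical per-patient sample; sizes are illustrative, not Noetik's.
H, W = 32, 32  # spatial patch size (pixels / spatial bins)
sample = {
    "he": np.zeros((H, W, 3), dtype=np.uint8),            # H&E histology (RGB)
    "protein": np.zeros((H, W, 16), dtype=np.float32),    # multi-channel protein IF
    "spatial_rna": np.zeros((H, W, 20_000), dtype=np.float16),  # ~20k genes per location
    "genotype": np.zeros(500, dtype=np.int8),             # hypothetical variant calls
}
# The spatial transcriptomics layer alone is a ~20,000-channel "image" —
# the "meaty" computer vision problem described above.
```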
5. Virtual Cell Models & Octo VC
Noetik’s virtual cell model simulates cell biology in context to answer practical drug-design questions. They train on multimodal data, but at inference time only need an H&E image (the “lingua franca” of pathology) — making deployment clinically practical.
6. In Vivo Perturbation & “In Silico Humanization”
A parallel mouse platform with barcoded CRISPR knockouts lets them run hundreds of genetic perturbations in a single mouse, then use models trained on human data to infer human-gene-level behavior from mouse tissue. Validates predictions and bridges mouse-to-human translation.
7. Tario — Autoregressive Transformer for Biology
Shift from masked-autoencoder (BERT-style) training to autoregressive (GPT-style) next-token prediction on spatial transcriptomic tokens. Shows good scaling behavior, with larger context windows (more tissue area) unlocking the benefits of bigger models.
8. GSK Deal & Business Model
$50M deal with GSK (January) licensing Octo VC — first known foundation-model licensing deal in the pharma space. Structure includes upfront, milestones, and annual license fees. GSK can fine-tune on their own siloed translational data.
9. Advice to Founders & Closing
Don’t start from data — start from the problem and design the right dataset to solve it. Biotech needs conviction to spend years/capital generating data before models work. Call to action: more ML talent should engage with biology; the space is in its “first ChatGPT moment.”
Summary
Core Thesis & Actionable Insights
- The real bottleneck in oncology drug development is patient selection, not drug discovery. 90-95% of cancer drugs fail in clinical trials because trials enroll heterogeneous patient populations without knowing which subtype will respond. Responders exist (there’s no placebo effect in cancer), but they get drowned out statistically.
- Classical cancer subtypes (defined by pathologists over the last century) are too coarse. What looks like “one lung cancer” is probably 3-10 functionally distinct subtypes — and nobody currently knows the true carving. Models trained self-supervised on rich multimodal data can discover these.
- Cell lines and mouse xenografts don’t translate. Most cell lines don’t even carry the mutations of the human cancers they supposedly represent. Noetik takes a “no cell lines, no organoids” approach for its foundation models — training purely on human patient data.
Technical / Data Insights
- Intentional data generation beats scraping. ImageNet (1.2M curated images) and PDB (50 years of intentional structure collection) are the templates. Biology data is orders of magnitude smaller than internet-scale text, so you can't brute-force your way to performance with scraped data.
- Randomize patients across multiple slides to control for batch effects. Otherwise patient embeddings will encode staining day rather than biology.
- H&E as the deployment layer is strategically important. Train multimodal (H&E + protein + spatial transcriptomics + genotype), but at inference time only need H&E — which already exists for virtually every oncology patient and every historical trial.
- Masking ratio matters for multimodal bio transformers. ~99% masking produces much richer representations than low-ratio BERT-style training: protein/gene channels are strongly spatially correlated, so at low ratios the model can interpolate from visible neighbors instead of learning the underlying biology.
- Autoregressive > masked autoencoder for scaling. The Tario model (autoregressive next-token prediction on spatial transcriptomic tokens) showed better scaling behavior, especially at longer context lengths (more tissue area).
- Biology is data-hungry but tractable. Dropping from 100% to 40% or 10% of training data made models substantially worse, particularly at cross-cancer generalization. A few hundred patients per major cancer indication may be enough to generalize across oncology.
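The high-masking-ratio point above can be sketched as the mask-sampling step of a generic masked autoencoder. This is not Noetik's architecture; the ~99% ratio is from the discussion, everything else (names, token counts) is illustrative.

```python
import numpy as np

def sample_mask(n_tokens, mask_ratio=0.99, seed=0):
    """Choose which tokens to hide for masked-autoencoder training.

    With spatially correlated channels (protein stains, gene expression),
    a low mask ratio lets the model copy from visible neighbors; hiding
    ~99% of tokens forces it to model the biology instead.
    """
    rng = np.random.default_rng(seed)
    n_masked = int(round(n_tokens * mask_ratio))
    masked = np.zeros(n_tokens, dtype=bool)
    masked[rng.choice(n_tokens, size=n_masked, replace=False)] = True
    return masked  # True = hidden from the encoder, predicted by the decoder

mask = sample_mask(n_tokens=1024, mask_ratio=0.99)
# At this ratio only ~10 of 1024 tokens stay visible to the encoder.
```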
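The masked-autoencoder vs. autoregressive contrast can likewise be sketched as a loss computation: a GPT-style model predicts each spatial-transcriptomic token from the tokens before it in a raster-ordered sequence. This is a generic next-token cross-entropy in NumPy, not Tario's actual formulation; the vocabulary size and shapes are made up.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Autoregressive cross-entropy: position t's logits predict token t+1.

    logits: (T, V) model outputs over V discretized expression tokens.
    tokens: (T,) raster-ordered token ids for one tissue sequence.
    Generic GPT-style objective, not Tario's.
    """
    pred, target = logits[:-1], tokens[1:]  # shift predictions vs. targets
    # Log-softmax computed stably by subtracting the per-row max.
    z = pred - pred.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target)), target].mean()

rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(16, 32)), rng.integers(0, 32, size=16))
```

Longer context here corresponds to more tissue area per sequence, which is where the scaling benefits reportedly appear.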
Career Advice (Explicit)
- For ML researchers: Noetik is hiring people who enjoy tackling ML on unfamiliar data modalities where first-principles custom model building is required. A biology background is NOT required — willingness to learn the minimum biology to make progress is sufficient.
- For biotech founders: Do NOT start from “I’ll generate X dataset and do ML on it.” Start from the problem, then design the dataset that can solve it. Most datasets are not ML datasets.
- Technology awareness matters: Capabilities shift rapidly (Dan did 2 genes per week in 2016; today’s platforms do 20,000 genes at once). Design for where the technology is going, not where it is.
- Conviction and capital are prerequisites. Noetik spent ~1.5-2 years generating data before training the first working model, and ~4 years before the first major commercial validation. There’s no shortcut if you’re entering a data-scarce regime.
- Career pattern — “Recursion mafia”: Both Ron and Dan came out of Recursion Pharmaceuticals, which seeded a cohort of founders with conviction to build data-at-scale biotech-AI companies. Working at a platform-building pioneer gives leverage when starting something new.
Business / Deal Insights
- First foundation-model licensing deal in pharma: Noetik × GSK, $50M, announced January. Structure: upfront + milestones + annual license fee. GSK can use the models internally and fine-tune them on their own translational data.
- Pharma appetite for AI has whiplashed upward. A year or two ago biotech-AI was “dying”; now deals and attention are flowing. Driver: pharma wants cross-pipeline access to models, not bespoke one-project collaborations.
- Diagnostic optionality: Because inference only needs H&E, a successful drug trial also produces a deployable companion diagnostic — a potential secondary revenue stream.
- Moat = data + custom architectures + interpretability. Noetik claims ≥1 order of magnitude more paired H&E + spatial transcriptomic + protein data than anything public/academic, plus custom self-supervised architectures, plus the ability to explain which genes drive predicted response.
Stocks / Investments Mentioned
- GSK (GlaxoSmithKline) — licensed Noetik’s Octo VC model in a $50M deal. Actionable angle: GSK is positioning aggressively on AI-for-biology; arguably one of the top pharma AI teams.
- Merck (Keytruda) — referenced as the archetype of a drug run through 1,000+ trials to find sub-indications. Illustrates the value unlock if Noetik-style models can short-circuit that process for future assets.
- Recursion Pharmaceuticals (implicit) — both founders are alumni; the “Recursion mafia” is producing multiple platform-oriented biotech-AI companies. Worth tracking as a talent-and-thesis indicator.
- Noetik itself — private; watch for further partnership announcements (Agenus collaboration already announced). Clinical-trial re-analysis work was hinted at as upcoming.
Most Actionable Takeaways
- For operators / investors: Foundation-model licensing (not project-based collaborations) is emerging as the template deal structure for AI-in-pharma. Watch for similar deals at other platform companies.
- For ML engineers: There’s high leverage in learning enough biology to work on multimodal tissue/spatial transcriptomic models — the field is data-constrained, not talent-saturated.
- For founders in bio-AI: Build the dataset deliberately, budget 18+ months before the first trainable model, and pick a deployment modality (like H&E) that already exists at scale in clinical workflows.
- For investors tracking oncology: Companies that can show “we can tell you which patients respond” materially de-risk Phase 2/3 trials — this is where the value accrues, not in molecule design.