When was LifeSciBench released?

OpenAI published LifeSciBench on June 17, 2026, alongside a preprint paper.¹

What was the best AI score on LifeSciBench?

GPT-Rosalind, OpenAI's life-sciences model, led with a 36.1% overall pass rate, up from 25.7% for the general-purpose GPT-5.5.¹

What is GPT-Rosalind?

GPT-Rosalind is OpenAI's frontier reasoning model purpose-built for biology, drug discovery, and translational medicine. Named after Rosalind Franklin, it launched as a limited research preview in April 2026 for qualified enterprise customers.²

Why do AI models struggle on LifeSciBench?

Performance drops sharply on tasks that require interpreting attached artifacts (45.1% text-only versus 28.1% with files for GPT-Rosalind), producing exact numeric or sequence outputs, and handling design and optimization work.¹

How was the benchmark validated?

An independent panel of 453 expert reviewers, 97% of whom held a Ph.D. or equivalent, assessed the tasks, with agreement exceeding 96% in every quality category.¹

LifeSciBench: AI Fails 64% of Life-Science Tasks 2026

June 24, 2026

#LifeSciBench #OpenAI #GPT-Rosalind #AI benchmark #life sciences AI #drug discovery AI #GPT-5.5 #AI for science

OpenAI's new LifeSciBench grades AI on 750 real, free-response life-science research tasks written by 173 practicing Ph.D. scientists — and even the company's purpose-built science model passes fewer than four in ten of them.¹

TL;DR

On June 17, 2026, OpenAI released LifeSciBench, a 750-task benchmark grading AI on realistic drug-discovery and biology research work.¹ The top scorer, the life-sciences-specific GPT-Rosalind, posted a 36.1% pass rate: better than GPT-5.5's 25.7%, but still short of the bar on nearly two-thirds of expert tasks.¹

What You'll Learn

What LifeSciBench is and why it differs from existing biology benchmarks
How OpenAI built it with 173 scientists and 19,020 rubric criteria
The headline result: GPT-Rosalind passes only 36.1% of tasks
Where AI is improving fastest and where it still breaks down
Why the 70% pass threshold and partial-credit scoring both matter
What the benchmark means for anyone building AI into scientific work

What is LifeSciBench?

LifeSciBench is an expert-written, expert-reviewed benchmark from OpenAI that measures whether AI systems can support real life-science research, not just answer biology trivia.¹ Published on June 17, 2026, it contains 750 free-response tasks spanning seven research workflows and seven biological domains, each grounded in the judgment of practicing scientists with Ph.D.-level training and direct experience advancing drug-discovery programs in biotech and pharma.¹

The distinction matters. Most existing life-science evaluations use multiple-choice questions with clean reference answers. LifeSciBench instead presents each task the way a researcher would brief a knowledgeable collaborator: a scientific prompt, any relevant context or data files, and a free-response answer that an expert then grades against a detailed rubric.¹ Real research, as OpenAI frames it, "rarely looks like a single fact-recall question" — scientists reconcile conflicting results, troubleshoot assays, design experiments, and decide what to do next under uncertainty.¹

How OpenAI built it

The scale of the human effort behind LifeSciBench is the most striking part of the project. Tasks were authored by 173 expert scientists, each holding Ph.D.-level training plus biotech or pharmaceutical industry experience.¹ The benchmark covers seven workflows — evidence handling; analysis; design, optimization, and prediction; scientific reasoning; validation and operations; translation; and scientific communication.¹

Grading is where LifeSciBench gets unusually granular. Each task carries a custom rubric that breaks the expected answer into specific claims, calculations, decisions, and justifications. Across the benchmark, those rubrics add up to 19,020 criteria, an average of about 25 per task.¹ The design reflects how science is actually judged: a response can reach the right high-level conclusion but still be marked incomplete if it overlooks a key assay limitation or fails to flag a consequential biological nuance.¹
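
To make the rubric idea concrete, here is a minimal Python sketch of how a per-task reward could be computed from individual criteria. The `Criterion` class, the example criteria, and the equal weighting are all assumptions made for illustration; the article does not describe OpenAI's actual rubric schema or weights.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line item (hypothetical structure, not OpenAI's schema)."""
    description: str
    met: bool


def rubric_reward(criteria: list[Criterion]) -> float:
    """Fraction of criteria satisfied, assuming every criterion weighs the same."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


# Illustrative rubric: the conclusion is right, but an assay limitation is missed.
example = [
    Criterion("Reaches the correct high-level conclusion", True),
    Criterion("Shows the supporting calculation", True),
    Criterion("Flags the key assay limitation", False),
    Criterion("Justifies the recommended next experiment", True),
]

reward = rubric_reward(example)  # 3 of the 4 criteria are met

# Benchmark-wide average from the figures quoted above (19,020 criteria, 750 tasks).
average_criteria_per_task = 19_020 / 750
```

Under this equal-weight assumption, a response can reach the right conclusion and still lose credit for each overlooked caveat, which is the behavior the paragraph above describes.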

The tasks are also demanding by construction. Seventy-nine percent require multiple reasoning or decision steps, averaging four steps each, and the benchmark ships with 1,062 attached artifacts — figures, PDFs, tables, genomic sequence files, chemical structures, and web references.¹ More than half of all tasks (53%) require the model to interpret or synthesize information from at least one of those artifacts rather than relying on prompt text alone.¹

The headline result: AI passes barely a third

Here is the number that drove the coverage: GPT-Rosalind, OpenAI's frontier model purpose-built for life sciences, reached an overall exact pass rate of just 36.1%.¹ A task counts as "passed" only when the model meets a task-level success threshold of 70%, so a 36.1% pass rate means the model cleared that bar on a little over a third of the benchmark and fell short on the other 63.9%.¹

GPT-Rosalind is OpenAI's life-sciences research model, named after DNA-structure pioneer Rosalind Franklin and launched as a limited research preview in April 2026 for qualified enterprise customers.² It is a meaningful step up from the general-purpose GPT-5.5, which OpenAI shipped on April 23, 2026³ and which scored 25.7% on the same benchmark.¹ The 10.4-percentage-point gain shows targeted scientific training helps, but the low absolute pass rate underscores how far AI still sits from autonomous research-grade work.
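
The gap between the two models can be read as an absolute or a relative change, and the two are easy to confuse. The short Python sketch below works through both using only the pass rates quoted above; the function names are illustrative and the relative figure is derived arithmetic, not a number reported by OpenAI.

```python
def percentage_point_gain(new: float, old: float) -> float:
    """Absolute difference between two rates expressed in percent."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Improvement as a fraction of the old rate."""
    return (new - old) / old


ROSALIND_PASS = 36.1  # percent, as quoted in the article
GPT55_PASS = 25.7     # percent, as quoted in the article

pp = percentage_point_gain(ROSALIND_PASS, GPT55_PASS)  # about 10.4 points
rel = relative_gain(ROSALIND_PASS, GPT55_PASS)         # about 0.40, i.e. roughly 40%
shortfall = 100.0 - ROSALIND_PASS                      # about 63.9% of tasks not passed
```

Read this way, GPT-Rosalind passes roughly 40% more tasks than GPT-5.5 in relative terms while still missing the threshold on most of the benchmark.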

Where AI is improving fastest

LifeSciBench is not a story of uniform failure. Frontier models are relatively strongest on tasks involving scientific synthesis, communication, and structured interpretation.¹ The sharpest gains from GPT-5.5 to GPT-Rosalind show up in two workflows:

Scientific communication climbed from 56.3% to 71.1% — though OpenAI cautions this category is small (n=9) and should be read carefully.¹
Translation, the "bench-to-bedside" work of connecting preclinical evidence to clinical implications, rose from 36.8% to 57.7%.¹

Rubric-level results reinforce the pattern. On tasks demanding expert-useful, actionable output, GPT-Rosalind scored 44.7% versus GPT-5.5's 29.1%; on tasks requiring careful handling of uncertainty and caveats, it scored 44.8% versus 29.3%.¹ In short, today's models are most useful when a task has a clear evidence boundary and rewards structured scientific judgment.

Where AI still breaks down

Performance collapses on artifact-heavy, design-heavy, and precision-dependent work.¹ Design, optimization, and prediction remains one of the hardest workflows, with GPT-Rosalind passing just 30.7%, and analysis is similarly difficult at 30.3%.¹

The clearest weakness is working with real scientific files. GPT-Rosalind's pass rate drops from 45.1% on text-only tasks to 28.1% once a task includes artifacts or URLs; GPT-5.5 shows the same slide, from 29.9% to 21.9%.¹ OpenAI's analysis confirms that frontier models struggle to extract information from complex figures or large sequence files and weave it into a final answer.¹
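
As a rough consistency check, the text-only and artifact pass rates can be blended back into the overall figure. The Python sketch below assumes that the 53% of tasks described earlier as requiring artifact interpretation is the same group as the "artifacts or URLs" split here; the article does not confirm that the two groups match exactly, so treat this as an approximation.

```python
def blended_pass_rate(text_only: float, with_artifacts: float,
                      artifact_share: float) -> float:
    """Weighted average of two subgroup pass rates (all rates in percent)."""
    return (1.0 - artifact_share) * text_only + artifact_share * with_artifacts


ARTIFACT_SHARE = 0.53  # assumed share of tasks in the artifact group

rosalind_overall = blended_pass_rate(45.1, 28.1, ARTIFACT_SHARE)
gpt55_overall = blended_pass_rate(29.9, 21.9, ARTIFACT_SHARE)

# Both blends land within about 0.1 points of the reported overall pass rates
# (36.1% for GPT-Rosalind, 25.7% for GPT-5.5).
```

That the blends reproduce the headline numbers suggests the artifact-heavy half of the benchmark is what pulls each model's overall pass rate down.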

Exactness is the other wall. Tasks needing precise numeric answers saw GPT-Rosalind reach only 14.8%, while sequence- or structure-level outputs hit 24.0% and construct-generation tasks landed at 27.3% with little improvement over GPT-5.5.¹ These failures are scientifically meaningful: workflows like CRISPR/HDR donor design or siRNA design need outputs exact enough to use directly.¹ Notably, on roughly 14% of tasks the models earned substantial rubric credit while still failing the pass threshold — for GPT-Rosalind, 109 tasks scored below a 20% pass rate yet still earned at least 50% of the rubric reward.¹

Why two metrics matter

LifeSciBench reports two complementary numbers, and the gap between them is the point. Pass rate is the share of tasks where a model clears the 70% task-level threshold. Score is the average rubric reward, giving partial credit for individual criteria even when the full task is not solved.¹ A scientific answer can be partly correct and still useful — or reach the right conclusion while missing a critical caveat that makes it unusable. Reporting both keeps a single headline number from hiding that nuance, and it explains why a model can look like it is "almost there" on score while still failing as a research collaborator.
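
A small Python sketch can show how the two metrics diverge on the same results. The per-task rewards below are made up for illustration, and treating "meets the threshold" as greater-than-or-equal is an assumption; only the 70% threshold itself comes from the article.

```python
PASS_THRESHOLD = 0.70  # task-level success threshold quoted in the article


def pass_rate(rewards: list[float], threshold: float = PASS_THRESHOLD) -> float:
    """Share of tasks whose rubric reward meets the threshold."""
    return sum(r >= threshold for r in rewards) / len(rewards)


def mean_score(rewards: list[float]) -> float:
    """Average rubric reward, which gives partial credit on failed tasks."""
    return sum(rewards) / len(rewards)


def near_misses(rewards: list[float], floor: float = 0.50,
                threshold: float = PASS_THRESHOLD) -> int:
    """Tasks that earn at least `floor` of the reward but still fail."""
    return sum(floor <= r < threshold for r in rewards)


# Hypothetical per-task rubric rewards, not benchmark data.
rewards = [0.90, 0.65, 0.60, 0.55, 0.30]

overall_pass = pass_rate(rewards)    # only the first task clears 0.70
overall_score = mean_score(rewards)  # noticeably higher than the pass rate
almost_there = near_misses(rewards)  # three tasks earn substantial partial credit
```

In this toy set the mean score sits near 60% while only one task in five passes, which mirrors the "almost there on score, still failing on pass rate" pattern described above.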

How experts validated it

To check that the tasks were fair, OpenAI ran an independent review with 453 reviewers who had not written any tasks.¹ Their credentials were heavyweight: 97% held a Ph.D. or equivalent, with an average of 12 years of field experience and 14 peer-reviewed publications, and 88% reported at least one award or fellowship.¹ Reviewer agreement exceeded 96% in every category, including 98.3% overall agreement that tasks reflected realistic research work and 96.5% that they were scientifically grounded and answerable.¹ That level of consensus is what separates LifeSciBench from quickly assembled benchmarks — the difficulty is verified, not assumed.

What it means for builders

For anyone wiring AI into scientific or technical workflows, LifeSciBench is a useful reality check disguised as a research paper. The same failure modes it exposes in drug discovery — degraded performance the moment real files enter the prompt, brittleness on exact numeric output, and confident partial answers that miss a decisive constraint — are exactly the risks that show up when you point a general model at messy enterprise data. The lesson is not that models are useless; GPT-Rosalind's gains in synthesis and communication are real. It is that "passes the demo" and "trustworthy on a 25-criteria rubric" are very different bars, and that human review remains non-negotiable for high-stakes output. That is the same theme behind building trustworthy AI with real-world guardrails and the ongoing work on hallucination prevention and testing.

It also reframes the leaderboard race. While most 2026 benchmark coverage has fixated on coding scores from models like Claude Opus 4.8 and Kimi K2.7 Code, LifeSciBench measures something harder to game: whether a model can do the judgment-heavy work of a trained scientist. By that measure, the frontier is still wide open.

Bottom line

LifeSciBench is one of the most rigorous AI evaluations yet built, and its verdict is sobering: even a model trained specifically for life sciences passes only 36.1% of expert research tasks. The gains GPT-Rosalind shows over GPT-5.5 prove that domain-focused training moves the needle, especially in scientific communication and translation. But the benchmark's artifact, precision, and design failures map almost exactly onto the things real research demands. For now, AI is a fast, increasingly capable collaborator for science — not a substitute for the scientist holding the rubric.

¹ OpenAI, "Introducing LifeSciBench," June 17, 2026. https://openai.com/index/introducing-life-sci-bench/
² VentureBeat, "OpenAI debuts GPT-Rosalind, a new limited access model for life sciences," April 2026. https://venturebeat.com/technology/openai-debuts-gpt-rosalind-a-new-limited-access-model-for-life-sciences-and-broader-codex-plugin-on-github/ ; OpenAI, "Introducing GPT-Rosalind for life sciences research." https://openai.com/index/introducing-gpt-rosalind/
³ TechCrunch, "OpenAI releases GPT-5.5, bringing company one step closer to an AI 'super app,'" April 23, 2026. https://techcrunch.com/2026/04/23/openai-chatgpt-gpt-5-5-ai-model-superapp/

Frequently Asked Questions

What is LifeSciBench?

LifeSciBench is an OpenAI benchmark of 750 free-response life-science research tasks, written by 173 Ph.D.-level scientists and validated by an independent expert panel, that measures whether AI can support real drug-discovery and biology research rather than answer multiple-choice questions.¹