PaperOrchestra: Google's 5-Agent AI Writes Research Papers
April 10, 2026
TL;DR
Google Cloud AI Research has introduced PaperOrchestra, a multi-agent framework that converts unstructured pre-writing materials — rough ideas, experimental logs, result tables — into submission-ready LaTeX research manuscripts. Five specialized agents handle outlining, plotting, literature review, section writing, and peer-review-style refinement. The full pipeline completes in roughly 39.6 minutes per paper using about 60–70 LLM API calls. In side-by-side human evaluations, it beat autonomous baselines by 50%–68% on literature review quality and 14%–38% on overall manuscript quality. The paper was posted to arXiv in April 2026 and introduces a new benchmark, PaperWritingBench, built from 200 accepted CVPR 2025 and ICLR 2025 papers.[1][2]
What You'll Learn
- What PaperOrchestra is and why its five-agent architecture matters
- How each of the five specialized agents contributes to the writing pipeline
- How the PaperWritingBench evaluation set was constructed and what it tests
- Exact win rates against autonomous baselines like AI Scientist-v2 in both human and automated evaluations
- How simulated peer review is used to refine drafts before submission
- What this means for the future of AI-assisted scientific writing
A Writing Assistant for Researchers, Not a Replacement
Autonomous "AI scientist" systems have been trending in the research community for more than a year, promising end-to-end pipelines that ideate, run experiments, and write up results with no human in the loop. The most widely cited example, Sakana AI's AI Scientist-v2, proved the concept is feasible but exposed a hard limitation: such systems can only write the papers they themselves generated through their own internal research loops. If you already have experimental results, rough notes, and a direction in mind, an end-to-end AI scientist has nothing to offer.
PaperOrchestra takes the opposite angle. It was built by Yiwen Song, Yale Song, Tomas Pfister, and Jinsung Yoon at Google Cloud AI Research, and it assumes the human has already done the interesting scientific work.[1] Give it a rough idea summary and raw experimental logs, and it will return a submission-ready LaTeX manuscript — with verified citations, generated figures, and language polished through simulated peer review. The goal is to compress the most tedious part of research writing, not to replace the ideas behind it.
The Five Agents
PaperOrchestra decouples the writing process across five specialized agents, each responsible for a different stage of manuscript construction.[2]
1. Outline Agent
The first agent ingests the researcher's raw materials and produces a structured outline tailored to the target venue. Conference formats differ — CVPR uses a double-column layout, ICLR uses a single-column layout, and each has its own conventions around section depth and figure placement. The Outline Agent produces a plan that fits the target format and the specific evidence the researcher has supplied.
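The paper does not publish the Outline Agent's internals, but the venue-adaptation idea is easy to picture as data. Below is a minimal Python sketch of a venue-constraint record; the field names, style-file names, page limits, and section skeleton are all illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VenueFormat:
    """Formatting constraints an outline must respect (hypothetical schema)."""
    name: str
    columns: int     # 1 = single-column (ICLR), 2 = double-column (CVPR)
    page_limit: int  # main-text page limit; values here are illustrative
    style_file: str  # LaTeX style the draft targets (names are illustrative)

CVPR = VenueFormat("CVPR", columns=2, page_limit=8, style_file="cvpr.sty")
ICLR = VenueFormat("ICLR", columns=1, page_limit=9, style_file="iclr_conference.sty")

def outline_sections(venue: VenueFormat) -> list[str]:
    """Return a venue-appropriate section skeleton for the agent to fill."""
    base = ["Abstract", "Introduction", "Related Work",
            "Method", "Experiments", "Conclusion"]
    # Venue-specific convention differences are the point; this particular
    # branch (a separate Limitations section for single-column venues) is
    # an invented example of such a rule, not one the paper states.
    return base if venue.columns == 2 else base + ["Limitations"]
```

The real agent also conditions on the specific evidence supplied, which a static skeleton like this obviously cannot capture.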
2. Plotting Agent
The Plotting Agent generates both conceptual methodology diagrams and statistical plots directly from the experimental log. Under the hood it calls PaperVizAgent (originally released under the name PaperBanana) — a sibling academic-illustration framework from the same Google Cloud AI Research team — which uses a Vision-Language Model critic to iteratively refine generated figures against the source content until they meet a design quality bar.[3] The generated figures are then integrated into the LaTeX source alongside tables extracted directly from the experimental log.
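The critic-in-the-loop pattern PaperVizAgent relies on can be sketched in a few lines. This is a hypothetical interface, not PaperVizAgent's actual API: `render` stands in for the plotting backend and `critique` for the VLM critic, which here returns a numeric score plus textual feedback.

```python
def refine_figure(draft_spec, render, critique, max_rounds=3, pass_score=0.8):
    """Iteratively render a figure and revise it until a critic is satisfied.

    `render(spec) -> figure` and `critique(figure) -> (score, feedback)` are
    stand-ins for the plotting backend and the VLM critic (both hypothetical).
    """
    figure = render(draft_spec)
    for _ in range(max_rounds):
        score, feedback = critique(figure)
        if score >= pass_score:          # quality bar met: stop refining
            break
        # Fold the critic's feedback into the spec and re-render.
        draft_spec = draft_spec + "\n% revision note: " + feedback
        figure = render(draft_spec)
    return figure
```

The cap on rounds matters in practice: without it, a never-satisfied critic would burn API calls indefinitely.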
3. Literature Review Agent
This is arguably the most important component for scientific integrity. Rather than relying on an LLM's internal knowledge — which is notoriously prone to hallucinating citations to papers that do not exist — the Literature Review Agent runs a two-phase citation pipeline. First, an LLM with web search surfaces candidate papers. Then each candidate is verified against the Semantic Scholar API: the title is fuzzy-matched using Levenshtein distance, the abstract and metadata are retrieved, and a temporal cutoff tied to the target conference's submission deadline filters out papers the authors could not have cited.[2] Unverifiable references are discarded and the rest are compiled into a BibTeX file. The agent then drafts the Introduction and Related Work sections from the verified pool under a hard constraint that at least 90% of the gathered literature must be actively cited — a mechanism designed to prevent both fabricated citations and token-efficient but shallow related-work sections. On PaperWritingBench, PaperOrchestra generated an average of 45.73 to 47.98 verified citations per paper, against roughly 59 citations in the human-written ground-truth papers — close to human scale without sacrificing grounding.[2]
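The verification step is the part worth internalizing, and it is simple to approximate. The sketch below mimics the described checks — fuzzy title match plus a deadline cutoff — using Python's stdlib `difflib` as a stand-in for Levenshtein distance; the `record` schema is an assumption for illustration, not the actual Semantic Scholar API response format.

```python
from datetime import date
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Normalized title similarity in [0, 1]. The paper uses Levenshtein
    fuzzy matching; difflib's ratio is a stdlib stand-in with the same role."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def verify_citation(candidate_title, record, deadline, min_sim=0.9):
    """Keep a candidate citation only if a bibliographic record confirms it.

    `record` mimics a bibliographic lookup result: a dict with 'title' and
    'published' (a datetime.date) — these field names are assumptions.
    `deadline` is the target conference's submission deadline.
    """
    if record is None:                  # no bibliographic match at all
        return False
    if record["published"] > deadline:  # temporal cutoff: the cited work
        return False                    # must predate the submission deadline
    return title_similarity(candidate_title, record["title"]) >= min_sim
```

Anything that fails these checks is dropped rather than guessed at, which is exactly the property that separates this pipeline from pure-LLM citation generation.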
4. Section Writing Agent
With the outline, verified citations, and generated figures in hand, the Section Writing Agent drafts the remaining body of the paper — the abstract, methodology, experiments, and conclusion — leaving the introduction and related work to the Literature Review Agent. It extracts numeric values directly from the experimental log to construct the results tables and integrates the generated figures into the LaTeX source, stitching the whole manuscript into a coherent draft that respects the target venue's length and style constraints.
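Extracting numbers from a log and emitting LaTeX rows is mechanical, which is exactly why it automates well. A minimal sketch, assuming a hypothetical `name acc=… f1=…` log format — the paper does not specify its log schema:

```python
import re

def log_to_latex_rows(log_text: str) -> str:
    """Turn 'method acc=X f1=Y' log lines into LaTeX tabular rows.

    The log format here is an invented example; a real pipeline would need
    a parser matched to the researcher's actual logging conventions.
    """
    rows = []
    for line in log_text.strip().splitlines():
        m = re.match(r"(\S+)\s+acc=([\d.]+)\s+f1=([\d.]+)", line)
        if m:
            name, acc, f1 = m.groups()
            # One decimal place, LaTeX column separators, row terminator.
            rows.append(f"{name} & {float(acc):.1f} & {float(f1):.1f} \\\\")
    return "\n".join(rows)
```

Pulling values programmatically like this (rather than asking the model to retype them) is what keeps the results tables faithful to the experimental log.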
5. Content Refinement Agent
The final agent uses AgentReview, a previously published simulated peer-review system, to iteratively critique and revise the draft.[2] AgentReview was introduced in a separate 2024 paper as an LLM-based framework for simulating the peer review process.[4] In PaperOrchestra, it acts as a quality gate: a refinement pass is accepted only if it raises the overall AgentReview score, or holds it steady while the sub-axis scores change by a non-negative net amount. This prevents the refinement loop from wandering into worse drafts in pursuit of novelty.
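The acceptance rule is worth spelling out in code, since it is the only thing keeping the refinement loop monotone. A minimal sketch, assuming a hypothetical score schema of one overall score plus named sub-axes (the paper does not publish its exact score fields):

```python
def accept_revision(prev: dict, new: dict) -> bool:
    """Quality gate for one refinement pass, per the rule described above:
    keep the revision only if the overall score improves, or stays equal
    while the sub-axis changes sum to a non-negative net.

    `prev` / `new` are {'overall': float, 'axes': {name: float}} dicts —
    a hypothetical schema chosen for illustration.
    """
    if new["overall"] > prev["overall"]:
        return True
    if new["overall"] == prev["overall"]:
        net = sum(new["axes"][k] - prev["axes"][k] for k in prev["axes"])
        return net >= 0   # ties allowed only without net sub-axis regression
    return False          # any overall drop is rejected outright
```

Rejected revisions are simply discarded, so the retained draft's score can never decrease across passes.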
PaperWritingBench: A New Benchmark for AI Paper Writing
To evaluate the framework rigorously, the authors built PaperWritingBench, described as the first standardized benchmark of reverse-engineered raw materials from top-tier AI conference papers.[2]
The benchmark contains 200 accepted papers — 100 from CVPR 2025 and 100 from ICLR 2025. The two venues were chosen specifically because they differ in format: CVPR uses a double-column layout and ICLR uses a single-column layout, forcing any paper-writing system to adapt to both. For each paper, the authors reverse-engineered the raw pre-writing materials a researcher might have started with — a rough idea summary, experimental logs, result tables — and used that as the input to the paper-writing system. This gives PaperOrchestra the same starting point a human author would have had, and lets the evaluators compare the generated manuscript directly against the paper the humans eventually wrote.
How PaperOrchestra Performs
The authors evaluated PaperOrchestra in two complementary ways: an automated side-by-side (SxS) evaluation using LLM judges, and a human side-by-side evaluation where expert annotators compared generated manuscripts to baseline outputs.
Human Evaluation
The authors ran a side-by-side (SxS) study with 11 AI researchers performing 180 paired manuscript comparisons, blindly judging PaperOrchestra drafts against autonomous baselines and against the human-written ground truth.[2] On literature review quality, PaperOrchestra achieved absolute win-rate margins of 50% to 68% over baseline systems. On overall manuscript quality, the margins ranged from 14% to 38%.[2]
The literature review gap is the more striking result. It reflects the payoff of grounding every citation in the Semantic Scholar API rather than letting the model hallucinate references from its training data — a chronic problem with pure-LLM paper generators.
Automated Evaluation
In automated SxS evaluations, the gap was even larger. PaperOrchestra dominated on literature review quality with absolute win margins of 88% to 99% over AI baselines. For overall paper quality, it outperformed AI Scientist-v2 by 39% to 86% and a Single Agent baseline by 52% to 88% across all settings.[2]
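For readers comparing these numbers across papers: "absolute win margin" over paired judgments is most naturally read as win rate minus loss rate. That reading is an assumption on our part, sketched below, not the paper's stated formula.

```python
def absolute_win_margin(judgments: list) -> float:
    """Win-rate margin (percentage points) from paired SxS judgments.

    `judgments` holds 'A', 'B', or 'tie' per comparison, with A the system
    under test. Margin = win-rate(A) - win-rate(B); ties count toward
    neither side. This is one common convention, assumed here.
    """
    n = len(judgments)
    wins_a = judgments.count("A") / n
    wins_b = judgments.count("B") / n
    return round(100 * (wins_a - wins_b), 1)
```

Under this convention a 68% margin could arise from, say, 78% wins against 10% losses, so the margin alone does not pin down the raw win rate.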
The gap between human and automated numbers is worth noting: human evaluators are stricter, and the automated judges are more generous with the top-line scores. The honest read is that PaperOrchestra clearly wins on both axes, but the human numbers are the ones to quote when comparing against prior work.
Simulated Acceptance
Under ScholarPeer — a sibling Google Research reviewer agent released alongside PaperOrchestra and built by an overlapping team — PaperOrchestra-generated manuscripts achieved simulated acceptance rates of 84% on CVPR and 81% on ICLR, compared to the ground-truth (human-authored) rates of 86% and 94% respectively.[2][5] Interpreted carefully — these are simulated acceptances from a sibling LLM reviewer, not real conference decisions — the results suggest that PaperOrchestra drafts land within a few percentage points of the human versions on the CVPR side and roughly 13 points behind on ICLR. The framework also reported absolute acceptance gains of 13% on CVPR and 9% on ICLR over the strongest autonomous baseline.[2]
Runtime and Cost
Speed matters for research automation, because any system that takes hours or days per manuscript will never fit into a real writing workflow. PaperOrchestra completes the full pipeline in a mean of 39.6 minutes per paper and consumes approximately 60 to 70 LLM API calls per manuscript.[2] For reference, the autonomous baseline AI Scientist-v2 runs in 35.1 minutes end-to-end, so PaperOrchestra delivers its large quality gains at a runtime cost of only about 4.5 extra minutes per paper — roughly the same wall-clock budget, not a slower system.[2] The paper does not publish an end-to-end dollar cost, and the actual figure depends on which backbone model is used and the pricing tier at the time of the run; readers should consult the full paper for the exact configuration used in the reported experiments.
Why the Multi-Agent Approach Wins
The core insight behind PaperOrchestra is that paper writing is not a single task. It is a collection of very different tasks that a monolithic LLM call tends to do badly when bundled together. Asking one prompt to simultaneously outline, plot, cite, write, and refine a manuscript produces a draft that is mediocre at each of those steps. Decoupling them lets each agent specialize — the Literature Review Agent can be tuned to maximize citation recall, the Plotting Agent can be tuned to match the visual conventions of the target venue, and the Content Refinement Agent can focus on peer-review critique without worrying about introducing new hallucinations.
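The decoupling itself is just sequential hand-off with intermediate artifacts. A minimal sketch with hypothetical agent interfaces — the real system's call structure is not published at this level of detail:

```python
def paper_pipeline(idea: str, log: str, agents: dict):
    """Sequential hand-off between specialized stages, mirroring the
    decoupling described above. Each entry in `agents` is a callable;
    the keys and signatures are invented for illustration.
    """
    outline = agents["outline"](idea, log)      # venue-aware structure
    figures = agents["plot"](log)               # diagrams + statistical plots
    citations = agents["literature"](idea)      # verified BibTeX pool
    draft = agents["write"](outline, figures, citations, log)
    return agents["refine"](draft)              # simulated-review polish
```

The payoff of this shape is that each stage can be prompted, tooled, and evaluated in isolation, which is exactly the specialization argument made above.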
This pattern — multi-agent specialization beats single-agent prompting — keeps showing up across 2025 and 2026 agent research. PaperOrchestra adds another data point to the trend, and its 50%–68% human-evaluated literature review gap is one of the clearest demonstrations of the effect published so far.
What This Means for Researchers
PaperOrchestra is not a replacement for scientific reasoning. It does not run experiments, propose hypotheses, or evaluate whether a result is interesting. What it does is automate the mechanical labor that surrounds a finished piece of work — outlining, plotting, citing, writing the introduction and related work, polishing language under simulated review.
For researchers who find paper writing the bottleneck between producing a result and sharing it with the community, a 40-minute pipeline that produces a submission-ready draft is significant. For reviewers and program committees, it raises important questions about how to handle submissions that were partially or fully drafted by a system like this. Google's framing — that PaperOrchestra is a writing assistant, not an autonomous author — matches how most working researchers would want to use it, but the line between "assisted" and "generated" is going to get blurry fast.
The Bottom Line
PaperOrchestra is a concrete demonstration of where multi-agent systems add the most value: tasks that decompose naturally into specialized sub-tasks, where each sub-task benefits from a dedicated prompt, dedicated tools, and a dedicated quality bar. A five-agent pipeline that produces a submission-ready manuscript in 40 minutes — with citations grounded in a real bibliographic database and a peer-review-style refinement loop — is a meaningful step beyond the first generation of autonomous paper-writing systems.
Whether the research community embraces it or pushes back against the rising tide of AI-assisted submissions is a separate question. But the technical direction is clear: specialization beats monoliths, grounded tools beat hallucination, and the writing layer of science is about to get a lot more automated.
Sources
1. Song, Y., Song, Y., Pfister, T., & Yoon, J. (2026). PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. arXiv:2604.05018. Available at: https://arxiv.org/abs/2604.05018
2. MarkTechPost. (April 8, 2026). Google AI Research Introduces PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. Available at: https://www.marktechpost.com/2026/04/08/google-ai-research-introduces-paperorchestra-a-multi-agent-framework-for-automated-ai-research-paper-writing/
3. Zhu, D., Meng, R., Song, Y., Wei, X., Li, S., Pfister, T., & Yoon, J. (2026). PaperVizAgent (originally released as PaperBanana): Automating Academic Illustration for AI Scientists. arXiv:2601.23265. Available at: https://arxiv.org/abs/2601.23265
4. Jin, Y., Zhao, Q., Wang, Y., Chen, H., Zhu, K., Xiao, Y., & Wang, J. (2024). AgentReview: Exploring Peer Review Dynamics with LLM Agents. EMNLP 2024. arXiv:2406.12708. Available at: https://arxiv.org/abs/2406.12708
5. Goyal, P., Parmar, M., Song, Y., Palangi, H., Pfister, T., & Yoon, J. (2026). ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review. arXiv:2601.22638. Available at: https://arxiv.org/abs/2601.22638