PaperOrchestra: Google's 5-Agent AI Writes Research Papers
April 10, 2026
TL;DR
Google Cloud AI Research has introduced PaperOrchestra, a multi-agent framework that converts unstructured pre-writing materials — rough ideas, experimental logs, result tables — into submission-ready LaTeX research manuscripts. Five specialized agents handle outlining, plotting, literature review, section writing, and peer-review-style refinement. The full pipeline completes in roughly 39.6 minutes per paper using about 60–70 LLM API calls. In side-by-side human evaluations, it beat autonomous baselines by 50%–68% on literature review quality and 14%–38% on overall manuscript quality. The paper was posted to arXiv in April 2026 and introduces a new benchmark, PaperWritingBench, built from 200 accepted CVPR 2025 and ICLR 2025 papers.[1][2]
What You'll Learn
- What PaperOrchestra is and why its five-agent architecture matters
- How each of the five specialized agents contributes to the writing pipeline
- How the PaperWritingBench evaluation set was constructed and what it tests
- Exact win rates against autonomous baselines like AI Scientist-v2 in both human and automated evaluations
- How simulated peer review is used to refine drafts before submission
- What this means for the future of AI-assisted scientific writing
A Writing Assistant for Researchers, Not a Replacement
Autonomous "AI scientist" systems have been trending in the research community for more than a year, promising end-to-end pipelines that ideate, run experiments, and write up results with no human in the loop. The most widely cited example, Sakana AI's AI Scientist-v2, proved the concept is feasible but exposed a hard limitation: such systems can only write the papers they themselves generated through their own internal research loops. If you already have experimental results, rough notes, and a direction in mind, an end-to-end AI scientist has nothing to offer.
PaperOrchestra takes the opposite angle. It was built by Yiwen Song, Yale Song, Tomas Pfister, and Jinsung Yoon at Google Cloud AI Research, and it assumes the human has already done the interesting scientific work.[1] Give it a rough idea summary and raw experimental logs, and it will return a submission-ready LaTeX manuscript — with verified citations, generated figures, and language polished through simulated peer review. The goal is to compress the most tedious part of research writing, not to replace the ideas behind it.
The Five Agents
PaperOrchestra decouples the writing process across five specialized agents, each responsible for a different stage of manuscript construction.[2]
1. Outline Agent
The first agent ingests the researcher's raw materials and produces a structured outline tailored to the target venue. Conference formats differ — CVPR uses a double-column layout, ICLR uses a single-column layout, and each has its own conventions around section depth and figure placement. The Outline Agent produces a plan that fits the target format and the specific evidence the researcher has supplied.
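The paper does not publish the Outline Agent's internals, but the venue-adaptation idea is easy to picture as data. Below is a minimal Python sketch of a venue-constraint record; the field names, style-file names, page limits, and section skeleton are all illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VenueFormat:
    """Formatting constraints an outline must respect (hypothetical schema)."""
    name: str
    columns: int     # 1 = single-column (ICLR), 2 = double-column (CVPR)
    page_limit: int  # main-text page limit; values here are illustrative
    style_file: str  # LaTeX style the draft targets (names are illustrative)

CVPR = VenueFormat("CVPR", columns=2, page_limit=8, style_file="cvpr.sty")
ICLR = VenueFormat("ICLR", columns=1, page_limit=9, style_file="iclr_conference.sty")

def outline_sections(venue: VenueFormat) -> list[str]:
    """Return a venue-appropriate section skeleton for the agent to fill."""
    base = ["Abstract", "Introduction", "Related Work",
            "Method", "Experiments", "Conclusion"]
    # Venue-specific convention differences are the point; this particular
    # branch (a separate Limitations section for single-column venues) is
    # an invented example of such a rule, not one the paper states.
    return base if venue.columns == 2 else base + ["Limitations"]
```

The real agent also conditions on the specific evidence supplied, which a static skeleton like this obviously cannot capture.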
2. Plotting Agent
The Plotting Agent generates both conceptual methodology diagrams and statistical plots directly from the experimental log. Under the hood it calls PaperVizAgent (originally released under the name PaperBanana) — a sibling academic-illustration framework from the same Google Cloud AI Research team — which uses a Vision-Language Model critic to iteratively refine generated figures against the source content until they meet a design quality bar.[3] The generated figures are then integrated into the LaTeX source alongside tables extracted directly from the experimental log.
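The critic-in-the-loop pattern PaperVizAgent relies on can be sketched in a few lines. This is a hypothetical interface, not PaperVizAgent's actual API: `render` stands in for the plotting backend and `critique` for the VLM critic, which here returns a numeric score plus textual feedback.

```python
def refine_figure(draft_spec, render, critique, max_rounds=3, pass_score=0.8):
    """Iteratively render a figure and revise it until a critic is satisfied.

    `render(spec) -> figure` and `critique(figure) -> (score, feedback)` are
    stand-ins for the plotting backend and the VLM critic (both hypothetical).
    """
    figure = render(draft_spec)
    for _ in range(max_rounds):
        score, feedback = critique(figure)
        if score >= pass_score:          # quality bar met: stop refining
            break
        # Fold the critic's feedback into the spec and re-render.
        draft_spec = draft_spec + "\n% revision note: " + feedback
        figure = render(draft_spec)
    return figure
```

The cap on rounds matters in practice: without it, a never-satisfied critic would burn API calls indefinitely.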
3. Literature Review Agent
This is arguably the most important component for scientific integrity. Rather than relying on an LLM's internal knowledge — which is notoriously prone to hallucinating citations to papers that do not exist — the Literature Review Agent runs a two-phase citation pipeline. First, an LLM with web search surfaces candidate papers. Then each candidate is verified against the Semantic Scholar API: the title is fuzzy-matched using Levenshtein distance, the abstract and metadata are retrieved, and a temporal cutoff tied to the target conference's submission deadline filters out papers the authors could not have cited.[2] Unverifiable references are discarded and the rest are compiled into a BibTeX file. The agent then drafts the Introduction and Related Work sections from the verified pool under a hard constraint that at least 90% of the gathered literature must be actively cited — a mechanism designed to prevent both fabricated citations and token-efficient but shallow related-work sections. On PaperWritingBench, PaperOrchestra generated an average of 45.73 to 47.98 verified citations per paper, against roughly 59 citations in the human-written ground-truth papers — close to human scale without sacrificing grounding.[2]
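The verification step is the part worth internalizing, and it is simple to approximate. The sketch below mimics the described checks — fuzzy title match plus a deadline cutoff — using Python's stdlib `difflib` as a stand-in for Levenshtein distance; the `record` schema is an assumption for illustration, not the actual Semantic Scholar API response format.

```python
from datetime import date
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Normalized title similarity in [0, 1]. The paper uses Levenshtein
    fuzzy matching; difflib's ratio is a stdlib stand-in with the same role."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def verify_citation(candidate_title, record, deadline, min_sim=0.9):
    """Keep a candidate citation only if a bibliographic record confirms it.

    `record` mimics a bibliographic lookup result: a dict with 'title' and
    'published' (a datetime.date) — these field names are assumptions.
    `deadline` is the target conference's submission deadline.
    """
    if record is None:                  # no bibliographic match at all
        return False
    if record["published"] > deadline:  # temporal cutoff: the cited work
        return False                    # must predate the submission deadline
    return title_similarity(candidate_title, record["title"]) >= min_sim
```

Anything that fails these checks is dropped rather than guessed at, which is exactly the property that separates this pipeline from pure-LLM citation generation.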
4. Section Writing Agent
With the outline, verified citations, and generated figures in hand, the Section Writing Agent drafts the remaining body of the paper — the abstract, methodology, experiments, and conclusion — leaving the introduction and related work to the Literature Review Agent. It extracts numeric values directly from the experimental log to construct the results tables and integrates the generated figures into the LaTeX source, stitching the whole manuscript into a coherent draft that respects the target venue's length and style constraints.
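Extracting numbers from a log and emitting LaTeX rows is mechanical, which is exactly why it automates well. A minimal sketch, assuming a hypothetical `name acc=… f1=…` log format — the paper does not specify its log schema:

```python
import re

def log_to_latex_rows(log_text: str) -> str:
    """Turn 'method acc=X f1=Y' log lines into LaTeX tabular rows.

    The log format here is an invented example; a real pipeline would need
    a parser matched to the researcher's actual logging conventions.
    """
    rows = []
    for line in log_text.strip().splitlines():
        m = re.match(r"(\S+)\s+acc=([\d.]+)\s+f1=([\d.]+)", line)
        if m:
            name, acc, f1 = m.groups()
            # One decimal place, LaTeX column separators, row terminator.
            rows.append(f"{name} & {float(acc):.1f} & {float(f1):.1f} \\\\")
    return "\n".join(rows)
```

Pulling values programmatically like this (rather than asking the model to retype them) is what keeps the results tables faithful to the experimental log.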
5. Content Refinement Agent
The final agent uses AgentReview, a previously published simulated peer-review system, to iteratively critique and revise the draft.[2] AgentReview was introduced in a separate 2024 paper as an LLM-based framework for simulating the peer review process.[4] In PaperOrchestra, it acts as a quality gate: a refinement pass is accepted only if it raises the overall AgentReview score, or holds it steady while the sub-axis scores change by a non-negative net amount. This prevents the refinement loop from wandering into worse drafts in pursuit of novelty.
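The acceptance rule is worth spelling out in code, since it is the only thing keeping the refinement loop monotone. A minimal sketch, assuming a hypothetical score schema of one overall score plus named sub-axes (the paper does not publish its exact score fields):

```python
def accept_revision(prev: dict, new: dict) -> bool:
    """Quality gate for one refinement pass, per the rule described above:
    keep the revision only if the overall score improves, or stays equal
    while the sub-axis changes sum to a non-negative net.

    `prev` / `new` are {'overall': float, 'axes': {name: float}} dicts —
    a hypothetical schema chosen for illustration.
    """
    if new["overall"] > prev["overall"]:
        return True
    if new["overall"] == prev["overall"]:
        net = sum(new["axes"][k] - prev["axes"][k] for k in prev["axes"])
        return net >= 0   # ties allowed only without net sub-axis regression
    return False          # any overall drop is rejected outright
```

Rejected revisions are simply discarded, so the retained draft's score can never decrease across passes.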
PaperWritingBench: A New Benchmark for AI Paper Writing
To evaluate the framework rigorously, the authors built PaperWritingBench, described as the first standardized benchmark of reverse-engineered raw materials from top-tier AI conference papers.[2]
The benchmark contains 200 accepted papers — 100 from CVPR 2025 and 100 from ICLR 2025. The two venues were chosen specifically because they differ in format: CVPR uses a double-column layout and ICLR uses a single-column layout, forcing any paper-writing system to adapt to both. For each paper, the authors reverse-engineered the raw pre-writing materials a researcher might have started with — a rough idea summary, experimental logs, result tables — and used that as the input to the paper-writing system. This gives PaperOrchestra the same starting point a human author would have had, and lets the evaluators compare the generated manuscript directly against the paper the humans eventually wrote.
How PaperOrchestra Performs
The authors evaluated PaperOrchestra in two complementary ways: an automated side-by-side (SxS) evaluation using LLM judges, and a human side-by-side evaluation where expert annotators compared generated manuscripts to baseline outputs.
Human Evaluation
The authors ran a side-by-side (SxS) study with 11 AI researchers performing 180 paired manuscript comparisons, blindly judging PaperOrchestra drafts against autonomous baselines and against the human-written ground truth.[2] On literature review quality, PaperOrchestra achieved absolute win-rate margins of 50% to 68% over baseline systems. On overall manuscript quality, the margins ranged from 14% to 38%.[2]
The literature review gap is the more striking result. It reflects the payoff of grounding every citation in the Semantic Scholar API rather than letting the model hallucinate references from its training data — a chronic problem with pure-LLM paper generators.
Automated Evaluation
In automated SxS evaluations, the gap was even larger. PaperOrchestra dominated on literature review quality with absolute win margins of 88% to 99% over AI baselines. For overall paper quality, it outperformed AI Scientist-v2 by 39% to 86% and a Single Agent baseline by 52% to 88% across all settings.[2]
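For readers comparing these numbers across papers: "absolute win margin" over paired judgments is most naturally read as win rate minus loss rate. That reading is an assumption on our part, sketched below, not the paper's stated formula.

```python
def absolute_win_margin(judgments: list) -> float:
    """Win-rate margin (percentage points) from paired SxS judgments.

    `judgments` holds 'A', 'B', or 'tie' per comparison, with A the system
    under test. Margin = win-rate(A) - win-rate(B); ties count toward
    neither side. This is one common convention, assumed here.
    """
    n = len(judgments)
    wins_a = judgments.count("A") / n
    wins_b = judgments.count("B") / n
    return round(100 * (wins_a - wins_b), 1)
```

Under this convention a 68% margin could arise from, say, 78% wins against 10% losses, so the margin alone does not pin down the raw win rate.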
The gap between human and automated numbers is worth noting: human evaluators are stricter, and the automated judges are more generous with the top-line scores. The honest read is that PaperOrchestra clearly wins on both axes, but the human numbers are the ones to quote when comparing against prior work.
Simulated Acceptance
Under ScholarPeer — a sibling Google Research reviewer agent released alongside PaperOrchestra and built by an overlapping team — PaperOrchestra-generated manuscripts achieved simulated acceptance rates of 84% on CVPR and 81% on ICLR, compared to the ground-truth (human-authored) rates of 86% and 94% respectively.[2][5] Interpreted carefully — these are simulated acceptances from a sibling LLM reviewer, not real conference decisions — the results suggest that PaperOrchestra drafts land within a few percentage points of the human versions on the CVPR side and roughly 13 points behind on ICLR. The framework also reported absolute acceptance gains of 13% on CVPR and 9% on ICLR over the strongest autonomous baseline.[2]
Runtime and Cost
Speed matters for research automation, because any system that takes hours or days per manuscript will never fit into a real writing workflow. PaperOrchestra completes the full pipeline in a mean of 39.6 minutes per paper and consumes approximately 60 to 70 LLM API calls per manuscript.[2] For reference, the autonomous baseline AI Scientist-v2 runs in 35.1 minutes end-to-end, so PaperOrchestra delivers its large quality gains at a runtime cost of only about 4.5 extra minutes per paper — roughly the same wall-clock budget, not a slower system.[2] The paper does not publish an end-to-end dollar cost, and the actual figure depends on which backbone model is used and the pricing tier at the time of the run; readers should consult the full paper for the exact configuration used in the reported experiments.
Why the Multi-Agent Approach Wins
The core insight behind PaperOrchestra is that paper writing is not a single task. It is a collection of very different tasks that a monolithic LLM call tends to do badly when bundled together. Asking one prompt to simultaneously outline, plot, cite, write, and refine a manuscript produces a draft that is mediocre at each of those steps. Decoupling them lets each agent specialize — the Literature Review Agent can be tuned to maximize citation recall, the Plotting Agent can be tuned to match the visual conventions of the target venue, and the Content Refinement Agent can focus on peer-review critique without worrying about introducing new hallucinations.
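The decoupling itself is just sequential hand-off with intermediate artifacts. A minimal sketch with hypothetical agent interfaces — the real system's call structure is not published at this level of detail:

```python
def paper_pipeline(idea: str, log: str, agents: dict):
    """Sequential hand-off between specialized stages, mirroring the
    decoupling described above. Each entry in `agents` is a callable;
    the keys and signatures are invented for illustration.
    """
    outline = agents["outline"](idea, log)      # venue-aware structure
    figures = agents["plot"](log)               # diagrams + statistical plots
    citations = agents["literature"](idea)      # verified BibTeX pool
    draft = agents["write"](outline, figures, citations, log)
    return agents["refine"](draft)              # simulated-review polish
```

The payoff of this shape is that each stage can be prompted, tooled, and evaluated in isolation, which is exactly the specialization argument made above.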
This pattern — multi-agent specialization beats single-agent prompting — keeps showing up across 2025 and 2026 agent research. PaperOrchestra adds another data point to the trend, and its 50%–68% human-evaluated literature review gap is one of the clearest demonstrations of the effect published so far.
What This Means for Researchers
PaperOrchestra is not a replacement for scientific reasoning. It does not run experiments, propose hypotheses, or evaluate whether a result is interesting. What it does is automate the mechanical labor that surrounds a finished piece of work — outlining, plotting, citing, writing the introduction and related work, polishing language under simulated review.
For researchers who find paper writing the bottleneck between producing a result and sharing it with the community, a 40-minute pipeline that produces a submission-ready draft is significant. For reviewers and program committees, it raises important questions about how to handle submissions that were partially or fully drafted by a system like this. Google's framing — that PaperOrchestra is a writing assistant, not an autonomous author — matches how most working researchers would want to use it, but the line between "assisted" and "generated" is going to get blurry fast.
The Bottom Line
PaperOrchestra is a concrete demonstration of where multi-agent systems add the most value: tasks that decompose naturally into specialized sub-tasks, where each sub-task benefits from a dedicated prompt, dedicated tools, and a dedicated quality bar. A five-agent pipeline that produces a submission-ready manuscript in 40 minutes — with citations grounded in a real bibliographic database and a peer-review-style refinement loop — is a meaningful step beyond the first generation of autonomous paper-writing systems.
Whether the research community embraces it or pushes back against the rising tide of AI-assisted submissions is a separate question. But the technical direction is clear: specialization beats monoliths, grounded tools beat hallucination, and the writing layer of science is about to get a lot more automated.
Sources
1. Song, Y., Song, Y., Pfister, T., & Yoon, J. (2026). PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. arXiv:2604.05018. Available at: https://arxiv.org/abs/2604.05018
2. MarkTechPost. (April 8, 2026). Google AI Research Introduces PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. Available at: https://www.marktechpost.com/2026/04/08/google-ai-research-introduces-paperorchestra-a-multi-agent-framework-for-automated-ai-research-paper-writing/
3. Zhu, D., Meng, R., Song, Y., Wei, X., Li, S., Pfister, T., & Yoon, J. (2026). PaperVizAgent (originally released as PaperBanana): Automating Academic Illustration for AI Scientists. arXiv:2601.23265. Available at: https://arxiv.org/abs/2601.23265
4. Jin, Y., Zhao, Q., Wang, Y., Chen, H., Zhu, K., Xiao, Y., & Wang, J. (2024). AgentReview: Exploring Peer Review Dynamics with LLM Agents. EMNLP 2024. arXiv:2406.12708. Available at: https://arxiv.org/abs/2406.12708
5. Goyal, P., Parmar, M., Song, Y., Palangi, H., Pfister, T., & Yoon, J. (2026). ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review. arXiv:2601.22638. Available at: https://arxiv.org/abs/2601.22638