Meta Muse Spark: Benchmarks and Strategy (2026)

April 9, 2026

TL;DR

Meta launched Muse Spark on April 8, 2026 — the first model from Meta Superintelligence Labs (MSL), the unit led by Chief AI Officer Alexandr Wang. Unlike the open-weight Llama series, Muse Spark is proprietary. It scores 52 on the Artificial Analysis Intelligence Index, placing it in the top five globally behind GPT-5.4 (57), Gemini 3.1 Pro (57), and Claude Opus 4.6 (53). Its standout results are in health (42.8% on HealthBench Hard, the highest among frontier models) and visual reasoning (86.4 on CharXiv Reasoning), but it trails significantly in coding (59.0 on Terminal-Bench 2.0 vs. GPT-5.4's 75.1) and abstract reasoning (42.5 on ARC-AGI-2 vs. 76.5 for Gemini 3.1 Pro). It is free on meta.ai today; API access is invite-only with no public pricing yet.


What You'll Learn

  • Why Meta went proprietary after years of championing open-weight models
  • Where Muse Spark beats and trails GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6
  • How the three operating modes (Instant, Thinking, Contemplating) work
  • What the Llama 4 benchmark controversy has to do with this launch
  • What Muse Spark's arrival means for developers and the open-source AI community

Why Muse Spark Matters

A year ago, Meta was the loudest champion of open-weight AI. The Llama family gave researchers and startups access to competitive models without API fees, and Mark Zuckerberg framed open-source AI as a strategic moat.1 That narrative took damage in April 2025 when the Llama 4 launch imploded: the version submitted to the LM Arena leaderboard was not the same model released publicly, and departing chief AI scientist Yann LeCun later acknowledged that benchmark results had been manipulated.2

Muse Spark is Meta's answer. Built from the ground up by Meta Superintelligence Labs — the division Zuckerberg established in June 2025 after investing $14.3 billion in Scale AI and recruiting its CEO, Alexandr Wang, as Meta's first Chief AI Officer — it represents a deliberate pivot.3 The model is proprietary, the architecture is new, and the benchmarks tell a mixed but credible story.


The Backstory: From Llama to MSL

The Llama 4 Fallout

In April 2025, Meta released Llama 4 Scout and Llama 4 Maverick as open-weight models. Maverick initially ranked second on LM Arena, but independent evaluators discovered that Meta had submitted a custom-tuned variant rather than the publicly available weights. When the vanilla Maverick was tested, it ranked below GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.2 The backlash was severe, and the LM Arena maintainers changed their policies in response.

The Leadership Overhaul

Less than three months later, Zuckerberg restructured Meta's AI organization. He invested $14.3 billion for a 49% stake in Scale AI, bringing Wang on board to lead the new Meta Superintelligence Labs.3 Yann LeCun, Meta's chief AI scientist since 2013 and a Turing Award laureate, departed in November 2025 to found Advanced Machine Intelligence Labs (AMI Labs), a startup focused on world-model architectures.4 LeCun had publicly described LLMs as a dead end for reaching superintelligence — a view at odds with MSL's direction.

The $14.3 Billion Bet

The Scale AI deal was not a simple acquisition. Meta bought a 49% stake, roughly doubling Scale AI's valuation to $29 billion.3 Wang, 28, stepped down as Scale AI CEO (remaining on the board) and became Meta's Chief AI Officer, reporting directly to Zuckerberg. The deal gave Meta access to Scale's data-labeling infrastructure — a resource that feeds directly into the kind of reinforcement learning from human feedback (RLHF) and synthetic data pipelines that modern frontier models depend on.


Architecture and Operating Modes

Built From Scratch

Muse Spark is not a Llama derivative. Meta describes it as a natively multimodal reasoning model, rebuilt from the ground up to integrate visual information across its internal logic rather than stitching vision and text modules together after training.5 The model accepts text, image, and voice input but currently produces text-only output. Its context window is 262,144 tokens (262K).6

Meta has not disclosed the parameter count, but reports that Muse Spark achieves its capability using over an order of magnitude less compute than Llama 4 Maverick.5 The efficiency comes from a training technique Meta calls "thought compression": during reinforcement learning, the model is penalized for excessive thinking time, forcing it to solve problems with fewer reasoning tokens without sacrificing accuracy. On the Artificial Analysis Intelligence Index evaluation, Muse Spark used 58 million output tokens — compared to 120 million for GPT-5.4 and 157 million for Claude Opus 4.6.6
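Meta has not published the "thought compression" objective, but the description — penalizing excessive thinking time during reinforcement learning — suggests a correctness reward shaped by a length penalty. A minimal sketch, where the function name, token budget, and penalty form are all assumptions rather than Meta's actual method:

```python
# Illustrative length-penalized reward in the spirit of "thought compression".
# The budget and penalty shape are invented for illustration.

def compressed_reward(answer_correct: bool,
                      reasoning_tokens: int,
                      token_budget: int = 4096,
                      penalty_weight: float = 0.2) -> float:
    """Reward correctness, then subtract a penalty that grows with the
    number of reasoning tokens spent beyond a target budget."""
    base = 1.0 if answer_correct else 0.0
    overrun = max(0, reasoning_tokens - token_budget) / token_budget
    return base - penalty_weight * overrun
```

Under a shaping like this, two correct answers are not equal: the one that spends fewer reasoning tokens earns a higher reward, which pushes the policy toward shorter chains of thought without rewarding wrong-but-fast answers.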

Three Reasoning Modes

Muse Spark ships with two modes available today and a third coming soon:

Instant mode handles everyday queries with minimal latency. It is the default experience on meta.ai — quick lookups, simple conversations, and lightweight tasks.

Thinking mode applies step-by-step reasoning for complex problems. The model takes additional processing time, similar to the chain-of-thought approaches in GPT-5.4 Thinking and Gemini 3.1 Pro's reasoning mode. Most of the published benchmark scores use Thinking mode.

Contemplating mode is the most architecturally distinctive feature, though it is not yet publicly available. Instead of a single model reasoning for longer (which scales latency linearly), Contemplating mode orchestrates multiple sub-agents that reason in parallel and synthesize their findings.7 Meta reports that Contemplating mode scored 50.2% on Humanity's Last Exam, ahead of GPT-5.4 Pro (43.9%) and Gemini 3 Deep Think (48.4%).7
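Meta has not disclosed how Contemplating mode is implemented, but the latency argument can be sketched conceptually: parallel sub-agents make wall-clock time track the slowest branch rather than the sum of all reasoning. Everything below — `solve_subproblem`, the thread-based orchestration, the join-based synthesis — is a hypothetical stand-in, not Meta's design:

```python
# Conceptual sketch of parallel sub-agent reasoning with a synthesis step.
from concurrent.futures import ThreadPoolExecutor

def solve_subproblem(question: str, approach: str) -> str:
    # Stand-in for one sub-agent reasoning along a particular approach.
    return f"answer to {question!r} via {approach}"

def contemplate(question: str, approaches: list[str]) -> str:
    # Sub-agents run concurrently, so latency scales with the slowest
    # branch instead of growing linearly with total reasoning.
    with ThreadPoolExecutor(max_workers=len(approaches)) as pool:
        drafts = list(pool.map(lambda a: solve_subproblem(question, a),
                               approaches))
    # Synthesis step: a plain join here; a real system would reconcile
    # and rank the drafts before answering.
    return " | ".join(drafts)
```

The design point this illustrates is the trade: a single model reasoning twice as long roughly doubles latency, while doubling parallel sub-agents (capacity permitting) leaves latency roughly flat.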


Benchmark Deep Dive

The table below compares Muse Spark (Thinking mode unless noted) against the current frontier:

Benchmark               Muse Spark   GPT-5.4   Gemini 3.1 Pro   Claude Opus 4.6
AA Intelligence Index   52           57        57               53
CharXiv Reasoning†      86.4         82.8      80.2             —
HealthBench Hard        42.8%        40.1%     20.6%            —
ARC-AGI-2               42.5         76.1      76.5             —
Terminal-Bench 2.0      59.0         75.1      68.5             —
GDPval-AA (Elo)         1,444        1,672     —                1,606
HLE (Contemplating)     50.2%        43.9%*    48.4%**          —

(— indicates no reported score.)

*GPT-5.4 Pro variant. **Gemini 3 Deep Think variant. †Some sources attribute the CharXiv score of 86.4 to Contemplating mode rather than Thinking mode; Meta's official breakdown is unclear.6,7,8

Where Muse Spark Leads

Health and scientific reasoning stand out. The 42.8% on HealthBench Hard — a benchmark of open-ended health queries — is the highest score among general-purpose frontier models, surpassing GPT-5.4 (40.1%) and roughly doubling Gemini 3.1 Pro (20.6%).8 CharXiv Reasoning, which tests comprehension of scientific figures, shows a similar lead at 86.4 versus 82.8 for GPT-5.4 and 80.2 for Gemini 3.1 Pro.8

If the Contemplating mode numbers hold under independent evaluation, the 50.2% on Humanity's Last Exam would be a significant result — ahead of both GPT-5.4 Pro (43.9%) and Gemini 3 Deep Think (48.4%). That benchmark consists of 2,500 expert-level questions across dozens of disciplines, co-created by the Center for AI Safety and Scale AI.9

Where Muse Spark Trails

Coding and abstract reasoning are clear weaknesses. The 59.0 on Terminal-Bench 2.0 (agentic terminal coding) sits 16 points below GPT-5.4 and nearly 10 below Gemini 3.1 Pro.8 On ARC-AGI-2, which tests novel problem-solving, Muse Spark's 42.5 is roughly 34 points behind both GPT-5.4 (76.1) and Gemini 3.1 Pro (76.5).8 For agentic office tasks (GDPval-AA), its 1,444 Elo is substantially below Claude Opus 4.6 (1,606) and GPT-5.4 (1,672).8

The efficiency angle is worth noting but does not fully compensate. Muse Spark reaches its Intelligence Index score of 52 with less than half the output tokens used by Claude Opus 4.6 (53) and roughly half of GPT-5.4's (57).6 That token efficiency matters for Meta's scale — it serves billions of users across WhatsApp, Instagram, Facebook, and Messenger — but it does not offset the benchmark gaps for developers who need strong coding or general agentic performance.
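The efficiency claim can be sanity-checked with quick arithmetic, using only the figures already cited above (output tokens spent on the Artificial Analysis Intelligence Index run):

```python
# Token-efficiency check from the cited Intelligence Index figures.
tokens = {"Muse Spark": 58e6, "GPT-5.4": 120e6, "Claude Opus 4.6": 157e6}
scores = {"Muse Spark": 52, "GPT-5.4": 57, "Claude Opus 4.6": 53}

for model in tokens:
    ratio = tokens["Muse Spark"] / tokens[model]
    per_point = tokens[model] / scores[model] / 1e6
    print(f"{model}: Muse Spark uses {ratio:.2f}x its tokens; "
          f"{per_point:.1f}M output tokens per index point")
```

The ratios bear out the prose: 58M is about 0.37x Claude Opus 4.6's 157M (less than half) and about 0.48x GPT-5.4's 120M (roughly half).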


The Proprietary Shift

This is the most strategically significant aspect of the launch. Meta's AI strategy under Yann LeCun was defined by open weights: after an initial research-only release for the original Llama in 2023, Llama 2, Llama 3, and Llama 4 were released with increasingly permissive licenses (with some restrictions at scale). Muse Spark breaks that pattern.10

Access is limited to meta.ai and an invite-only API preview for select partners. There are no public model weights. A Meta spokesperson confirmed that existing Llama models remain available as open source, but did not commit to releasing future Llama models — and pointedly did not address whether Muse Spark's weights would ever be released.10 The company has said only that it hopes to open-source future versions of the model.

For the open-source AI community, this is a deflating signal. Meta's open-weight releases set a standard that pushed Google to release Gemma (including the recent Gemma 4) and encouraged other labs to follow. If Meta's frontier efforts stay proprietary, one of the strongest forces driving open-weight model quality has weakened.

For developers already building on Llama models, the immediate impact is limited — Llama 4 Scout and Maverick remain available. The longer-term question is whether Meta continues to invest in the Llama line now that its flagship research effort has moved to a proprietary architecture.


Availability and Pricing

Muse Spark is available today on meta.ai in both Instant and Thinking modes, free of charge. Meta may impose rate limits but has not disclosed specifics.5

API access is limited to a private preview for invited partners. No public pricing has been announced.6 This stands in stark contrast to the other frontier labs, all of which publish per-token pricing (GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro each have well-documented API rate cards).

Rollout to Meta's broader ecosystem — WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban Meta AI glasses — is expected in the coming weeks.5


What Developers Should Watch

Contemplating mode availability. The multi-agent parallel reasoning approach is the most technically novel feature, and the reported benchmark numbers are strong. Independent evaluation will determine whether those numbers hold. No timeline has been given for public access.

API pricing and rate limits. Until Meta publishes pricing, it is impossible to compare Muse Spark's cost-performance ratio against competitors. For production workloads, this is a dealbreaker.

Open-weight future. Meta's language about hoping to open-source future versions is deliberately noncommittal. If Muse Spark's weights never ship, developers building on Meta's ecosystem will face a choice between older Llama models and a proprietary API.

Health and science applications. The HealthBench Hard and CharXiv results suggest that Muse Spark may find its strongest niche in medical, scientific, and research-adjacent use cases — areas where it currently leads every frontier competitor.


The Bottom Line

Muse Spark is a credible re-entry into the frontier model race. Its health and science benchmarks are genuinely strong, and the Contemplating mode — if it delivers under independent testing — introduces an architecturally novel approach to scaling reasoning. But the coding and general-reasoning gaps are real, the API is not yet publicly accessible, and the proprietary pivot undercuts the open-source positioning that made Meta a unique force in AI.

For developers, the practical calculus is straightforward: if your workload centers on health, science, or visual reasoning, Muse Spark is worth evaluating when API access opens. For coding, agentic automation, or anything requiring open weights, GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro remain ahead on both capability and accessibility.


Footnotes

  1. Meta's open-source AI strategy, as articulated by Mark Zuckerberg in multiple public statements throughout 2024-2025.

  2. LeCun confirmed the benchmark manipulation in a January 2026 interview reported by Slashdot, corroborated by TechCrunch's April 2025 reporting on the Llama 4 launch controversy.

  3. Meta's $14.3 billion investment in Scale AI and establishment of Meta Superintelligence Labs, announced June 30, 2025, reported by CNBC, Axios, and Maginative.

  4. Yann LeCun's departure from Meta announced November 2025, founding AMI Labs with $1.03 billion in funding at $3.5 billion valuation as reported by Fortune (December 2025) and TechCrunch.

  5. Meta's official Muse Spark announcement, about.fb.com, April 8, 2026; corroborated by VentureBeat, TechCrunch, and Axios.

  6. Artificial Analysis Intelligence Index v4.0 and model specifications for Muse Spark, artificialanalysis.ai, April 2026.

  7. Muse Spark Contemplating mode benchmarks (HLE 50.2%, FrontierScience Research 38.3%) reported by 9to5Mac and Lushbinary from Meta's technical blog, April 8, 2026.

  8. Benchmark comparison data from OfficeChai, Lushbinary, and Artificial Analysis model evaluations, April 2026.

  9. Humanity's Last Exam benchmark created by Center for AI Safety and Scale AI; GPT-5.4 Pro score (43.9%) from 9to5Mac and Lushbinary citing Meta's comparison data; Gemini 3 Deep Think score (48.4%) from Remio.ai and Demis Hassabis announcement, February 12, 2026.

  10. Muse Spark's proprietary status confirmed by VentureBeat, The Register, and The Next Web, with Meta spokesperson quotes, April 8, 2026.

Frequently Asked Questions

Is Muse Spark open source?

No. Muse Spark is proprietary. Meta has said it hopes to open-source future versions, but has made no commitment. Existing Llama models remain available under their current licenses.
