Mistral Medium 3.5: 128B Open-Weight Frontier Coder
May 1, 2026
TL;DR
Mistral AI released Mistral Medium 3.5 on April 29, 2026 — a 128 billion-parameter dense model with a 256,000-token context window and configurable reasoning effort, published as open weights on Hugging Face under a Modified MIT license.12 The model scores 77.6% on SWE-Bench Verified and 91.4% on the τ³-Telecom agentic benchmark, a narrow lead over Anthropic's previous-generation Claude Sonnet 4.5 (77.2% on SWE-Bench Verified) but a step behind the current Anthropic Sonnet flagship, Sonnet 4.6, at 79.6%.134 API pricing is $1.50 per million input tokens and $7.50 per million output tokens, roughly half the $3.00/$15.00 list price shared by Sonnet 4.5 and Sonnet 4.6.564
The other half of the news is the deployment story. Mistral says Medium 3.5 self-hosts "on as few as four GPUs," and ships an EAGLE speculative-decoding draft head (Mistral-Medium-3.5-128B-EAGLE) for low-concurrency latency-bound serving.17 The model also became the new default for Le Chat and powers a refreshed Mistral Vibe that now runs coding agents in the cloud asynchronously rather than only on a developer's laptop.8
What You'll Learn
- What Mistral Medium 3.5 actually is, and why a dense 128B was an unusual choice in 2026
- The 256K context window, multimodal inputs, and configurable reasoning effort
- The Modified MIT license and where its revenue carve-outs apply
- How Medium 3.5 benchmarks against Claude Sonnet 4.5, Sonnet 4.6, and DeepSeek V4
- How EAGLE speculative decoding and four-GPU self-hosting change the deployment math
- How Le Chat and Mistral Vibe consume the model in production
What Mistral Shipped
On April 29, 2026 Mistral published Medium 3.5 alongside a refresh of its product surface. The release bundle has four pieces:128
| Component | What it is |
|---|---|
| mistralai/Mistral-Medium-3.5-128B | Base 128B dense weights on Hugging Face |
| mistralai/Mistral-Medium-3.5-128B-EAGLE | EAGLE draft head for speculative decoding |
| Mistral API | mistral-medium-3.5 endpoint, $1.50 / $7.50 per million tokens |
| Le Chat + Mistral Vibe | Medium 3.5 set as the new default model |
The Hugging Face collection page consolidates the base model and the Mistral-Medium-3.5-128B-EAGLE draft head used for speculative decoding.2 Third-party packagers — Unsloth, Ollama, NVIDIA NIM — picked up GGUF and container images within hours.910
This is a single model that "merges instruction-following, reasoning, and coding into a single 128B dense model."1 In other words, Mistral collapsed the previous specialised line — Devstral 2 for coding, Magistral for reasoning, Medium 3.1 for general instruction-following — into one weight set. Reasoning effort is configurable per request rather than baked into a separate model variant.
The 128B Dense Choice
Almost every frontier-class open-weight release in early 2026 has been Mixture-of-Experts. DeepSeek V4-Pro is a 1.6 trillion-parameter MoE with 49 billion parameters activated per token; V4-Flash is a 284B MoE with 13B activated.11 GLM 5.1 and Kimi K2.6 are MoE. Even Meta's now-internal Muse Spark was described publicly as a sparse architecture.
Mistral went the other way: a dense 128B model. Every parameter participates in every token. That is heavier per FLOP than an equivalent-quality MoE, but it has two practical consequences:
- Memory residency is predictable. A dense 128B model in BF16 needs roughly 256 GB of weights, which technically fits inside four 80-GB GPUs (320 GB) but leaves little headroom for activations and KV cache. Mistral's "as few as four GPUs" claim rests on FP8 quantisation bringing the weights closer to 128 GB.112
- Speculative decoding is straightforward. EAGLE works best when the draft head shares the target's tokenizer and head, and dense models give you a clean vocabulary to draft against. Mistral's published EAGLE numbers are roughly 1.41× output throughput and around 29% lower end-to-end latency at low concurrency, with an acceptance length of about 1.72 tokens per draft cycle.7
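The memory arithmetic above is simple enough to sanity-check directly. A minimal sketch, using the parameter count and byte widths from the article (the four-GPU split is an illustration of the sizing math, not a published Mistral recipe):

```python
# Back-of-envelope weight footprint for a dense 128B model.
PARAMS = 128e9  # 128 billion parameters, every one active per token

def weight_gb(bytes_per_param: float) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes, matching vendor math)."""
    return PARAMS * bytes_per_param / 1e9

bf16 = weight_gb(2.0)  # ~256 GB: tight on 4x80 GB once KV cache is added
fp8 = weight_gb(1.0)   # ~128 GB: fits a 4x80 GB node with real headroom

print(f"BF16 weights: {bf16:.0f} GB -> {bf16 / 4:.0f} GB per GPU on 4 GPUs")
print(f"FP8  weights: {fp8:.0f} GB -> {fp8 / 4:.0f} GB per GPU on 4 GPUs")
```

The per-GPU numbers explain both claims in the bullet: BF16 consumes 64 of each GPU's 80 GB before any KV cache, while FP8 halves that to 32 GB.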
The trade-off: a dense 128B will not match a 1T-class MoE on raw benchmark ceilings. Medium 3.5's 77.6% on SWE-Bench Verified sits below DeepSeek V4-Pro's 80.6% and below the current closed-source frontier.111 Mistral is betting that "small enough to self-host, big enough to ship" is a more useful operating point for enterprises than the absolute leaderboard top.
The 256K Context Window and Multimodal Inputs
Medium 3.5 supports a 256,000-token context window — large enough to ingest a mid-sized codebase or several hundred pages of policy documents in one pass.12 Mistral has stayed below the megacontext threshold that DeepSeek V4 (1M tokens) and Claude Sonnet 4.6 (1M tokens, GA without surcharge as of February 2026) target.4 The result is a context that is deep but kept inside a single sequence-parallelism regime, which is part of why a four-GPU self-host stays plausible.
The model is multimodal on input: it accepts text and images, but produces only text.213 The vision encoder was trained from scratch to handle variable image sizes and aspect ratios, which is a quieter but useful detail — most open-weight vision encoders stamp images down to a fixed square. For document-parsing and visual-QA workloads in enterprises, native variable-resolution input means fewer pre-processing surprises.
Medium 3.5 also supports function calling and "configurable reasoning effort" per request — the same pattern Anthropic, Google, and OpenAI have been moving toward, where a single model can be asked to spend more or fewer thinking tokens on a problem rather than being forked into a separate "thinking" SKU.12
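A hedged sketch of what a per-request reasoning-effort call might look like. The endpoint shape, `model`, `messages`, and `tools` fields follow Mistral's public chat-completions API; the `reasoning_effort` field name and its values are assumptions based on the article's description of configurable effort, not a documented parameter:

```python
import json

# Request body for POST https://api.mistral.ai/v1/chat/completions (sketch).
payload = {
    "model": "mistral-medium-3.5",
    "reasoning_effort": "high",  # assumed parameter name and values
    "messages": [
        {"role": "user", "content": "Why does this test deadlock?"},
    ],
    "tools": [{  # standard function-calling tool definition
        "type": "function",
        "function": {
            "name": "run_tests",  # hypothetical tool for illustration
            "description": "Run the project test suite",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
}

print(json.dumps(payload, indent=2))
```

The point of the pattern is that effort is a request-time knob on one weight set, so the same deployment serves both quick completions and deliberate multi-step reasoning.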
Benchmarks: Where Medium 3.5 Actually Lands
Mistral's headline benchmarks sit on a narrow strip between the open-weight MoE leaders of April 2026 and the proprietary Sonnet tier:1311144
| Model | License | Total params | SWE-Bench Verified | Notes |
|---|---|---|---|---|
| DeepSeek V4-Pro | MIT | 1.6T (MoE, 49B activated) | 80.6% | 1M context, $1.74 / $3.48 per M tokens |
| Claude Sonnet 4.6 | Closed | Not disclosed | 79.6% | 1M context (no surcharge), $3.00 / $15.00 per M tokens |
| Mistral Medium 3.5 | Modified MIT | 128B (dense) | 77.6% | 256K context, $1.50 / $7.50 per M tokens |
| Claude Sonnet 4.5 (legacy) | Closed | Not disclosed | 77.2% | 200K standard context (1M beta retired April 30, 2026), $3.00 / $15.00 per M tokens |
| DeepSeek V4-Flash | MIT | 284B (MoE, 13B activated) | Not separately disclosed | Cost-optimised tier of V4 |
The Mistral-vs-Sonnet comparison is the cleanest: SWE-Bench Verified is a fixed test set, and all of these models were evaluated against it. Medium 3.5's 77.6% edges legacy Sonnet 4.5's 77.2% by 0.4 percentage points but trails the current Sonnet 4.6 (79.6%, released February 17, 2026, at the same $3/$15 list price as its predecessor) by two points.134 On τ³-Telecom, an agentic-tool-use benchmark used to stress-test multi-turn workflows, Mistral reports 91.4% for Medium 3.5.1 Mistral has not published full MMLU, GPQA, or AIME scores as of publication, so reasoning and general-knowledge ceilings are harder to triangulate than the coding numbers.
Medium 3.5 is not the leaderboard king. DeepSeek V4-Pro stays ahead by three percentage points on SWE-Bench Verified on a model roughly twelve times bigger by parameter count, and Sonnet 4.6 stays ahead by two points on a closed model. The story is what those points cost: V4-Pro needs many more GPUs to host than a 128B dense network, and Sonnet at any version cannot be self-hosted — your prompts run on Anthropic's infrastructure or its cloud partners (AWS Bedrock, Google Vertex AI), but never on hardware you control.
Self-Hosting on Four GPUs
Mistral's "self-hosts on as few as four GPUs" claim is the most consequential part of the release for engineering teams.1 In practice it lands like this:
- BF16 native: 128B parameters × 2 bytes ≈ 256 GB of weights. Add KV cache (which scales with context length) and activations, and you are looking at four to eight 80-GB GPUs depending on context.
- FP8 quantised (Mistral's published recipe): roughly 128 GB of weights, comfortably within a 4×H100 or 4×H200 node, with KV-cache headroom for medium-length contexts.12
- EAGLE draft head (Mistral-Medium-3.5-128B-EAGLE): FP8-quantised, two-layer GQA, about 4 GB of weight overhead. It buys roughly 1.41× output throughput and around 29% lower end-to-end latency in low-concurrency, latency-bound serving, with an average acceptance length of 1.72 tokens per draft cycle.7
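The published EAGLE numbers hang together under a standard first-order model of speculative decoding: each target-model forward pass verifies a batch of drafted tokens and emits the accepted run, so speedup is acceptance length divided by the relative cost of drafting. A sketch (the draft-cost figure is an assumed value chosen for illustration, not a Mistral-published number):

```python
# First-order speculative-decoding estimate (illustrative model, not
# Mistral's benchmarking methodology).

def est_speedup(accepted: float, draft_tokens: int, draft_cost: float) -> float:
    """accepted: avg tokens emitted per target forward pass (article: ~1.72)
    draft_tokens: tokens speculated per cycle
    draft_cost: draft-head cost per token, as a fraction of one target pass
    """
    # One target pass yields `accepted` tokens; drafting adds overhead.
    return accepted / (1.0 + draft_tokens * draft_cost)

# With ~1.72 accepted tokens and a cheap two-layer draft head, the
# first-order estimate lands near the reported ~1.41x.
print(f"{est_speedup(1.72, 3, 0.073):.2f}x")
```

The model also shows why EAGLE pays off mainly at low concurrency: at high batch sizes the target GPU is already saturated, so the "free" verification slots that speculation exploits disappear.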
vLLM is the inference engine Mistral and the third-party docs both lean toward, with SGLang as the alternative that ships first-class EAGLE support.12 If you have ever tried to deploy a 1.6T-parameter MoE in production, the difference is not a rounding error — Medium 3.5 fits inside a single high-end inference node where DeepSeek V4-Pro needs a small cluster.
The Modified MIT License
Mistral published Medium 3.5 under a Modified MIT license: an MIT-style permissive base with revenue-based exceptions for very large companies.12 For independent developers, startups, mid-market vendors, and most universities, the license behaves like vanilla MIT — commercial use, redistribution, and self-hosting are allowed without additional terms. Companies above Mistral's revenue threshold need to come to a separate commercial arrangement.
The exact threshold is documented in the Hugging Face card and Mistral's help center; multiple aggregators have summarised it as a monthly-revenue test, but the precise figure, the trigger conditions, and the carve-outs for usage versus redistribution are worth reading directly from the license text rather than from third-party paraphrases.215 Pulling the license out of LICENSE.md on the Hugging Face repo is a five-minute exercise; trusting a blog summary is the kind of thing that lawyers eventually re-read in a different mood.
Le Chat and Mistral Vibe: How Mistral Uses It
Mistral made Medium 3.5 the default model for two of its own products on the same day:18
Le Chat — Mistral's consumer and enterprise chat assistant — gained a new "Work mode" in preview that supports multi-step tasks like research and cross-tool actions, with Medium 3.5 underneath for Pro, Team, and Enterprise plans.16
Mistral Vibe — the open-source Apache 2.0 CLI coding agent originally announced alongside Devstral 2 — picked up cloud-side execution in this release.817 Until April 29, Vibe was a local-only CLI: you ran it in your terminal, it shelled out to bash, it edited files in your working directory. The new "remote agents" feature lets multiple Vibe agents run simultaneously in Mistral's cloud, kicked off either from the CLI or from inside Le Chat. The use case Mistral pitches is asynchronous coding work — file a task from Le Chat, walk away, come back to a pull-request-ready diff.
That is structurally similar to what GitHub's Copilot coding agent, Anthropic's Claude Code Remote Sessions, and Google's Gemini Enterprise Agent Platform already offer; the Mistral angle is that the underlying model is open-weight and the CLI client is Apache 2.0, so the whole stack can in principle be self-hosted by a customer who does not want to use Mistral's cloud.
How Medium 3.5 Compares to the Sonnet Tier
For most enterprise readers the practical question is "Mistral Medium 3.5 versus the Anthropic Sonnet tier." Mistral's own announcement compared against Sonnet 4.5, but Sonnet 4.6 has been the current Anthropic Sonnet flagship since February 17, 2026, holding the same $3/$15 list price as its predecessor.13564
| Dimension | Mistral Medium 3.5 | Claude Sonnet 4.6 (current) | Claude Sonnet 4.5 (legacy) |
|---|---|---|---|
| License | Modified MIT (open weights) | Closed (API + first-party clients only) | Closed |
| Parameter count | 128B dense | Not disclosed | Not disclosed |
| Context window | 256,000 tokens | 1,000,000 tokens (no surcharge, GA) | 200,000 tokens (1M beta retired April 30, 2026) |
| SWE-Bench Verified | 77.6% | 79.6% | 77.2% |
| Self-hostable | Yes — Mistral cites four GPUs at FP8 | No | No |
| API price (input) | $1.50 / M tokens | $3.00 / M tokens | $3.00 / M tokens |
| API price (output) | $7.50 / M tokens | $15.00 / M tokens | $15.00 / M tokens |
| Multimodal input | Text + image | Text + image + PDFs | Text + image + PDFs |
| Released | April 29, 2026 | February 17, 2026 | September 29, 2025 |
Sonnet 4.6 wins on raw SWE-Bench score, on the now-GA 1M-token context window without surcharges, on PDF input, and on overall product polish — the Anthropic stack is older and more battle-tested. Medium 3.5 wins on price at any context length below 200K, on the option to bring weights in-house, and against Sonnet 4.5 specifically on the coding benchmark Mistral chose to highlight.
The practical decision tree is closer to "which side of the build/buy fence are you on" than "which is the better model." If your security team is uncomfortable with prompts leaving the building, Medium 3.5 is the only one of the three you can deploy. If you want a managed API and are comfortable on that side, the two-point SWE-Bench gap to Sonnet 4.6 may matter more than the price gap when you weigh the operational lift of standing up a four-GPU inference node, three replicas for high availability, and the EAGLE draft pipeline.
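The price side of that fence is easy to quantify from the list prices in the table. A toy comparison (the workload volumes are hypothetical):

```python
# API cost comparison at the article's list prices, USD per million tokens.
PRICES = {
    "mistral-medium-3.5": (1.50, 7.50),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def monthly_cost(model: str, in_mtok: float, out_mtok: float) -> float:
    """Cost in USD for a workload of in/out token volumes (millions)."""
    p_in, p_out = PRICES[model]
    return in_mtok * p_in + out_mtok * p_out

# Hypothetical workload: 400M input + 80M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 400, 80):,.0f}/month")
```

Because Mistral's input and output prices are each exactly half of Sonnet's, the ratio is 2× at any token mix; what the arithmetic cannot capture is the operational cost of the self-hosted alternative, which is the real variable in the decision.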
The European AI Independence Story
Mistral was founded in April 2023 by Arthur Mensch, Guillaume Lample, and Timothée Lacroix — three researchers who came together after stints at DeepMind and Meta.18 The company is headquartered in Paris with offices in the U.S., London, Luxembourg, and Singapore, and reportedly employs around 350 people.1819 On March 30, 2026 Mistral raised $830 million in debt from a seven-bank European consortium to fund a 44-megawatt NVIDIA-powered data center at Bruyères-le-Châtel near Paris — its largest infrastructure financing to date, and a separate effort from a $1.4 billion February 2026 pledge with EcoDataCenter for Swedish capacity.20
The strategic frame Mistral has pushed since 2023 is that Europe needs a domestic frontier-class lab whose weights companies can actually inspect, fine-tune, and self-host. Medium 3.5 is the most pointed version of that pitch so far. It is small enough to deploy on European-owned hardware, openly licensed enough to satisfy procurement teams that are wary of U.S. closed APIs, and benchmark-credible enough to be a real choice rather than a political one.
Whether that pitch lands depends less on the model itself than on enterprise buyers' patience for self-hosting frontier-class models. Most companies do not actually want to operate four-GPU inference nodes. Most companies do, however, want the option to walk away from a vendor without rebuilding their stack — and that optionality is exactly what an open-weight 77.6% SWE-Bench model under a Modified MIT license is selling.
Bottom Line
Mistral Medium 3.5 is not the highest-scoring model in the open-weight category — DeepSeek V4-Pro still owns that crown by three percentage points on the headline coding benchmark. It is, however, among the most deployable open-weight frontier-class models published as of April 29, 2026: a dense 128B network that fits inside a single four-GPU node with EAGLE-accelerated inference and a 256K-token context, shipped by one of the few European labs still pushing into this performance tier.
For enterprises that need to keep prompts in-house, for European procurement teams that want a credible domestic option, and for engineering teams that want the price floor of an open-weight model with the deployment ergonomics of a dense network, Medium 3.5 is one of the most interesting releases of the week. The cost of admission is a four-GPU node and a vLLM cluster; the prize is a model that lands within two percentage points of Anthropic's current Sonnet flagship at half the API price, edges past the previous Sonnet generation by 0.4 points on the same benchmark, and gives you the option to walk away from the API entirely.
Footnotes
- Mistral Medium 3.5 — Mistral Docs (model card, version 26-04)
- mistralai/Mistral-Medium-3.5-128B — Hugging Face model card
- Claude Sonnet 4.5: Highest-Scoring Claude Model Yet on SWE-bench — Caylent
- Introducing Claude Sonnet 4.6 — Anthropic (released February 17, 2026; 79.6% SWE-Bench Verified at $3/$15 list price; 1M context GA without surcharge)
- Mistral Medium 3.5 Developer Guide: API & Benchmarks — Lushbinary
- mistralai/Mistral-Medium-3.5-128B-EAGLE — Hugging Face model card
- Remote agents in Vibe — Mistral AI news (April 29, 2026)
- DeepSeek V4 (2026): 1T Parameters, 81% SWE-bench, $0.30/MTok — NxCode
- Self-Host Mistral Medium 3.5: vLLM, SGLang & GPU Guide — Lushbinary
- Mistral Medium 3.5 Launched: What It Means for Self Hosted AI Infrastructure — VRLA Tech
- Mistral Medium 3.5 vs Claude Sonnet 4 vs GPT-4o Compared — Lushbinary
- Under which license are Mistral's open models available? — Mistral AI Help Center
- Mistral AI unveils Medium 3.5 model and Work Mode for Le Chat — testingcatalog.com
- Introducing: Devstral 2 and Mistral Vibe CLI — Mistral AI news
- Mistral secures $830 million in debt financing to fund AI data center — CNBC (March 30, 2026)