Google Gemma 4: Frontier Open AI You Can Run Locally (2026)
April 4, 2026
TL;DR
Google DeepMind released Gemma 4 on April 2, 2026 — a family of four open models under the Apache 2.0 license that bring frontier-level reasoning, coding, and multimodal capabilities to hardware ranging from smartphones to single-GPU workstations.1 The flagship 31B Dense model scores 89.2% on AIME 2026 math reasoning and ranks #3 among all open models on the Arena AI text leaderboard.2 The 26B Mixture-of-Experts variant achieves similar quality while activating only 3.8B parameters per token.3 The smallest edge variant (E2B) fits under 1.5 GB with aggressive quantization, and both edge models add native audio understanding — a first for the Gemma family.4
What You'll Learn
- Why the Apache 2.0 license change matters more than the benchmarks for commercial and sovereign AI deployments
- How Gemma 4's hybrid attention architecture balances speed with long-context understanding
- Benchmark comparisons against Qwen 3.5, Llama 4, and other open models at similar parameter counts
- Hardware requirements and VRAM budgets for running each variant locally
- Known limitations the community discovered within 24 hours of launch
- Where Gemma 4 fits in a production stack — from on-device agents to cloud-scale inference
Why Gemma 4 Matters
Open-weight AI models have reached an inflection point. A year ago, running a model that could compete with proprietary APIs on reasoning and coding tasks required renting clusters. With Gemma 4, a 31B-parameter model that outperforms many larger alternatives fits on a single consumer GPU with 4-bit quantization — and Google is releasing it under the most permissive license it has ever used for a Gemma model.
The Apache 2.0 license is the headline change that separates Gemma 4 from its predecessors and most competitors. Previous Gemma releases shipped under Google's custom Gemma Terms of Use, which imposed acceptable-use restrictions and required attribution. Llama 4 uses Meta's community license, which caps commercial usage at 700 million monthly active users and enforces Meta's acceptable-use policy.5 Qwen 3.5 ships under Apache 2.0 as well, making it and Gemma 4 the two most permissively licensed frontier-class open model families as of April 2026.6
For enterprises building sovereign AI deployments — systems that must run entirely within a specific jurisdiction with no external API dependencies — this licensing distinction is not academic. Apache 2.0 means no monthly active user limits, no acceptable-use policy enforcement by the model creator, and full freedom to modify, redistribute, and commercialize.
Architecture: How Gemma 4 Works
Gemma 4 is built on the same research and technology as Gemini 3, optimized for efficient open deployment.1 The family ships in four sizes, each targeting a different hardware tier.
The Four Model Variants
| Model | Total Parameters | Active Parameters | Context Window | Target Hardware |
|---|---|---|---|---|
| E2B | 5.1B | 2.3B (effective) | 128K tokens | Phones, embedded devices |
| E4B | ~8B | 4.5B (effective) | 128K tokens | Laptops, tablets |
| 26B-A4B (MoE) | 26B | 3.8B | 256K tokens | Consumer GPUs (RTX 4090, etc.) |
| 31B Dense | 31B | 31B | 256K tokens | High-end consumer/workstation GPUs |
The naming convention deserves explanation. The "E" prefix on E2B and E4B stands for "effective parameters" — these models use Per-Layer Embeddings (PLE), a technique that feeds a secondary embedding signal into every decoder layer, giving them the representational depth of larger models while keeping the actual parameter count small enough for on-device deployment.4
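To make the PLE idea concrete, here is a toy sketch: a per-layer embedding table is looked up per token and injected into every decoder layer. All dimensions are hypothetical, and the additive injection is a simplifying assumption, not Google's actual mechanism.

```python
import numpy as np

def decoder_layer(h, ple_embed, W):
    """One toy decoder layer: mix the hidden state with a layer-specific
    embedding signal. Additive injection is an assumption for illustration."""
    return np.tanh((h + ple_embed) @ W)

def forward(tokens, tok_embed, ple_table, layers):
    """ple_table[l] holds the per-layer embedding for layer l, looked up
    per token just like the normal token embedding."""
    h = tok_embed[tokens]                              # (seq, d_model)
    for l, W in enumerate(layers):
        h = decoder_layer(h, ple_table[l][tokens], W)
    return h

rng = np.random.default_rng(0)
d, n_layers, vocab = 16, 4, 100
tok_embed = rng.normal(size=(vocab, d))
ple_table = rng.normal(size=(n_layers, vocab, d))     # one table per layer
layers = [rng.normal(size=(d, d)) / d**0.5 for _ in range(n_layers)]
out = forward(np.array([3, 7, 42]), tok_embed, ple_table, layers)
print(out.shape)  # (3, 16)
```

The payoff is that the per-layer tables are plain lookups, so they can live in cheap memory and be streamed in, while only the small core transformer occupies accelerator memory.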
The "A4B" suffix on the 26B model means "4B active parameters." This Mixture-of-Experts variant contains 128 small experts and activates 8 plus 1 shared expert per token, meaning only 3.8B parameters fire during each forward pass.3
Hybrid Attention Mechanism
All Gemma 4 models use a hybrid attention design that alternates between two types of attention layers:1
Local sliding-window attention operates over a fixed window (512 tokens for the smaller models, 1,024 for the larger ones). These layers are computationally cheap and handle local context — understanding the relationship between nearby tokens.
Global full-context attention layers see the entire sequence. The architecture ensures the final layer is always a global attention layer, so the model's last representation has full awareness of the entire input.
This interleaving strategy delivers the processing speed of a lightweight model on most tokens while preserving the deep contextual awareness needed for complex, long-range tasks. The models also use dual RoPE (Rotary Position Embedding) configurations: standard RoPE for sliding-window layers and proportional RoPE for global layers, which enables the extended context windows of 128K and 256K tokens.7
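The interleaving above can be sketched with two masks and a layer schedule. The 5:1 local-to-global ratio and the tiny window here are illustrative assumptions; the article only specifies the window sizes and that the final layer is global.

```python
import numpy as np

def attention_mask(seq_len, kind, window=512):
    """Causal mask for one layer: a 'global' layer attends to all previous
    tokens; a 'local' layer sees only the last `window` tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if kind == "local":
        return causal & (i - j < window)
    return causal

def layer_schedule(n_layers, local_per_global=5):
    """Interleave local and global layers, forcing the final layer to be
    global so the last representation sees the whole input."""
    kinds = ["global" if (l + 1) % (local_per_global + 1) == 0 else "local"
             for l in range(n_layers)]
    kinds[-1] = "global"
    return kinds

kinds = layer_schedule(12)
m_local = attention_mask(8, "local", window=4)
m_global = attention_mask(8, "global")
print(kinds[-1])                                  # global
print(int(m_local.sum()), int(m_global.sum()))    # local mask has fewer entries
```

The attended-position counts make the cost argument visible: the global mask grows quadratically with sequence length, while the local mask grows only linearly once the sequence exceeds the window.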
KV Cache Sharing
The last N layers of each model reuse key-value states from earlier layers, eliminating redundant KV projections.7 This optimization reduces memory consumption during inference — a meaningful improvement when running long-context workloads on consumer hardware.
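A back-of-envelope calculation shows why this matters. If the last N layers reuse KV states instead of storing their own, the cache shrinks proportionally. All dimensions below are hypothetical, chosen only to illustrate the arithmetic.

```python
def kv_cache_bytes(n_layers, n_shared, n_kv_heads, head_dim, seq_len,
                   dtype_bytes=2):
    """KV cache size when the last `n_shared` layers reuse KV states from
    earlier layers instead of storing their own (hypothetical dims)."""
    per_layer = 2 * n_kv_heads * head_dim * seq_len * dtype_bytes  # K and V
    return (n_layers - n_shared) * per_layer

full = kv_cache_bytes(48, 0, 8, 128, 32_768)
shared = kv_cache_bytes(48, 12, 8, 128, 32_768)
print(f"{full / 2**30:.1f} GiB -> {shared / 2**30:.1f} GiB")
```

With these toy numbers, sharing a quarter of the layers cuts the cache by a quarter. The same logic applies regardless of the model's real head counts.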
Multimodal Capabilities
All four variants natively process images at variable resolutions, supporting tasks like OCR, chart understanding, and visual question answering. The E2B and E4B edge models add native audio understanding, including automatic speech recognition and speech-to-translated-text across multiple languages — a first for the Gemma family, with support for up to 30 seconds of audio input.4
Benchmarks: Where Gemma 4 Stands
Academic Benchmarks
| Benchmark | Gemma 4 31B | Gemma 4 26B-A4B | Qwen 3.5 35B-A3B | Notes |
|---|---|---|---|---|
| AIME 2026 (math) | 89.2% | 88.3% | ~82% | Graduate-level math competition2 |
| MMLU Pro (knowledge) | 85.2% | — | — | Broad academic knowledge2 |
| GPQA Diamond (science) | 84.3% | 82.3% | — | Graduate-level science reasoning2 |
| LiveCodeBench v6 (coding) | 80.0% | 77.1% | — | Real-world coding tasks2 |
| Codeforces Elo | 2,150 | — | — | Competitive programming2 |
| MMMU Pro (visual reasoning) | 76.9% | — | — | Multimodal understanding2 |
Arena AI Leaderboard
The community-driven Arena AI leaderboard, which ranks models based on blind human preference votes, places Gemma 4 31B at #3 among all open models with an estimated Elo rating of 1,452. The 26B-A4B MoE variant sits at #6 with 1,441 — notable because it achieves this with only 3.8B active parameters per token.8
How It Compares
Against the closest competitors at the 26–35B parameter range:
Qwen 3.5 35B-A3B offers a larger vocabulary (250K tokens) optimized for CJK and non-Latin scripts, and it runs significantly faster on current tooling — partly because it activates only 3B parameters per token. However, Gemma 4 31B leads on math reasoning (AIME 2026: 89.2% vs. ~82%) and competitive coding (Codeforces Elo).9
Llama 4 Scout (109B total, 17B active) offers a massive 10-million-token context window but requires substantially more VRAM. Gemma 4's 256K context covers the vast majority of production use cases at a fraction of the hardware cost.10
For most English-language and coding tasks, Gemma 4 currently leads the 26–31B size tier. For CJK-heavy multilingual workloads, Qwen 3.5's vocabulary gives it an edge. For extremely long-context applications, Llama 4 Scout remains unmatched.
Running Gemma 4 Locally: Hardware Guide
VRAM Requirements
| Model | 4-bit Quantized | 8-bit Quantized | BF16 (Full) | Notes |
|---|---|---|---|---|
| E2B | ~3 GB (Q4_K_M GGUF) | ~5 GB | ~10 GB | Runs on phones with extreme quantization4 |
| E4B | ~5 GB (Q4_K_M GGUF) | ~8 GB | ~16 GB | Laptop-friendly4 |
| 26B-A4B (MoE) | ~10 GB | ~18 GB | ~52 GB | All 26B weights stay resident; only ~4B worth of compute fires per token11 |
| 31B Dense | ~18-20 GB | ~34 GB | ~61 GB | Fits on RTX 4090 (24 GB) at Q411 |
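The weight figures in the table follow from simple arithmetic: parameter count times bits per weight. Here is that rule of thumb as a sketch; the 4.5 bits-per-weight value is an assumed average for mixed-precision Q4 quants, and real GGUF files run somewhat higher because some tensors (embeddings, norms) stay at higher precision.

```python
def weight_gb(params_billion, bits_per_weight):
    """Back-of-envelope weight-only memory: parameters x bits per weight,
    converted to gigabytes. Ignores higher-precision tensors kept in the file."""
    return params_billion * bits_per_weight / 8

print(f"31B @ 4.5 bpw ~ {weight_gb(31, 4.5):.1f} GB")
print(f"31B @ 8 bpw   ~ {weight_gb(31, 8):.1f} GB")
print(f"31B @ 16 bpw  ~ {weight_gb(31, 16):.1f} GB")
```

The estimates land close to the table's Q4, 8-bit, and BF16 columns, which is a useful sanity check when sizing a GPU purchase for any model, not just Gemma 4.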
The Context Window Caveat
These VRAM figures cover model weights only. The KV cache for long contexts adds substantially to memory requirements. For the 31B model at 4-bit quantization without KV quantization:11
- 32K context: ~27 GB total VRAM
- 128K context: ~109 GB total VRAM
- 256K context: ~218 GB total VRAM
For practical local deployment on a single GPU, you will likely be limited to 8K–32K context windows with the 31B model. The 26B MoE variant is more memory-efficient here because its active parameter count is much smaller.
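The scaling behavior behind those numbers can be sketched as follows. With hybrid attention, global layers cache KV for the full context while sliding-window layers cap out at the window size, so the global layers dominate at long contexts. Every dimension here is a toy assumption, not Gemma 4's actual config, and the reported real-world figures above are considerably larger.

```python
def kv_cache_gb(context, n_layers, global_every=6, window=1024,
                n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache under hybrid attention: global layers cache the full context;
    sliding-window layers cap at the window size. Hypothetical dimensions."""
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    per_token = 2 * n_kv_heads * head_dim * dtype_bytes   # K + V, one layer
    total = (n_global * context + n_local * min(context, window)) * per_token
    return total / 1e9

for ctx in (32_768, 131_072, 262_144):
    print(f"{ctx // 1024}K context: ~{kv_cache_gb(ctx, 48):.1f} GB KV cache")
```

Note how the local layers' contribution is a constant once the context exceeds the window: doubling the context roughly doubles only the global layers' share, which is exactly why KV quantization and window capping are the levers that make long contexts fit on consumer cards.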
Day-One Framework Support
Gemma 4 launched with immediate support across the open-source inference ecosystem: Ollama, LM Studio, llama.cpp, vLLM, SGLang, Hugging Face Transformers, NVIDIA NIM, Keras, and more.1 Google also provides quantized checkpoints directly, including an NVIDIA NVFP4 variant for 4-bit inference with minimal accuracy loss.12
The Speed Problem: What the Community Found
Within 24 hours of launch, the community identified a significant inference speed gap that developers should understand before committing to Gemma 4 for latency-sensitive production workloads.13
vLLM and FlashAttention Incompatibility
Gemma 4's heterogeneous attention head dimensions (different sizes for sliding-window vs. global layers) force vLLM — the most popular production serving framework — to disable FlashAttention and fall back to a slower Triton attention kernel. One developer reported the E4B model generating at approximately 9 tokens per second on an RTX 4090, compared to over 100 tokens per second for a similarly sized Llama 3.2 3B model on the same hardware.14
For the 26B-A4B MoE variant, community reports showed approximately 11 tokens per second compared to over 60 tokens per second for Qwen 3.5 35B-A3B on equivalent hardware.13
KV Cache Memory Hunger
Gemma models have historically consumed more VRAM for KV cache than comparably sized alternatives. One developer reported fitting Qwen 3.5 27B Q4 with 190K context on an RTX 5090, while Gemma 3 27B Q4 on the same card topped out at 20K context. Gemma 4 continues this pattern.13
The Tradeoff
These speed limitations are real but may be temporary. FlashAttention support for heterogeneous head dimensions is an engineering problem, not a fundamental architectural limitation. Framework developers are actively working on optimizations. The quality-per-parameter that Gemma 4 achieves is genuine — the question is whether the serving ecosystem catches up quickly enough for production deployments where latency matters as much as accuracy.
Multilingual Quality: A Genuine Strength
Community testing across German, Arabic, Vietnamese, and French revealed that Gemma 4 outperforms Qwen 3.5 in non-English tasks despite Qwen's larger multilingual vocabulary. One tester described Gemma 4's multilingual capability as being "in a tier of its own" for translation quality.13 Given that Gemma 4 is natively trained on over 140 languages, this is an area where it may offer the strongest competitive advantage for non-English deployments.1
Licensing Deep Dive: Why Apache 2.0 Changes Everything
The shift from Google's custom Gemma Terms of Use to Apache 2.0 has concrete implications:5
| Aspect | Gemma 4 (Apache 2.0) | Llama 4 (Meta Community) | Qwen 3.5 (Apache 2.0) |
|---|---|---|---|
| Commercial use | Unrestricted | Capped at 700M MAU | Unrestricted |
| Modification | Unrestricted | Allowed with restrictions | Unrestricted |
| Redistribution | Allowed with license notice | Allowed with Meta's terms | Allowed with license notice |
| Acceptable use policy | None from Google | Meta's AUP applies | None from Alibaba |
| Sovereign deployment | Full freedom | Must comply with Meta's terms | Full freedom |
For startups, this means no need to worry about crossing a monthly active user threshold. For governments and regulated industries building sovereign AI infrastructure, Apache 2.0 eliminates the legal complexity of navigating a custom license.
The Bottom Line
Gemma 4 is the most capable open model family in the 26–31B parameter range as of April 2026, measured by academic benchmarks and human preference rankings. The Apache 2.0 license removes the last significant barrier to unrestricted commercial and sovereign deployment. The edge-optimized E2B and E4B variants bring genuine multimodal intelligence — including audio — to devices that fit in your pocket.
The tradeoff is inference speed. Until the serving ecosystem fully supports Gemma 4's heterogeneous attention architecture, developers building latency-sensitive applications may find Qwen 3.5 faster in practice despite lower benchmark scores. For applications where accuracy matters more than tokens-per-second — research, analysis, code generation, multilingual content — Gemma 4 is the new default in the open-weight tier.
The timing is significant. April 2026 is the densest model release period in open-source AI history, with Qwen 3.5, Llama 4, and now Gemma 4 all competing for developer adoption. Google's choice to ship under Apache 2.0 — matching Qwen and undercutting Llama's restrictions — signals that the open-model race is now being won on licensing terms as much as benchmark points.
If you are building local AI pipelines with tools like Ollama, Gemma 4 slots in as a direct upgrade — our guide on building local AI with RAG agents and Ollama covers the infrastructure patterns that apply equally well to Gemma 4. For context on the broader competitive landscape including Alibaba's multimodal push, see our coverage of Qwen 3.5 Omni.
Footnotes
1. Google DeepMind, "Gemma 4: Byte for byte, the most capable open models," blog.google, April 2, 2026.
2. Google DeepMind, Gemma 4 technical report and benchmark results, deepmind.google/models/gemma/gemma-4/, April 2, 2026.
3. Hugging Face model card, google/gemma-4-26B-A4B, huggingface.co, April 2, 2026.
4. Hugging Face blog, "Welcome Gemma 4: Frontier multimodal intelligence on device," huggingface.co/blog/gemma4, April 2, 2026.
5. VentureBeat, "Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks," April 2, 2026.
6. Qwen team, Qwen 3.5 release notes, Apache 2.0 license.
7. Kaitchup, "Gemma 4 31B and 26B A4B: Architecture and Memory Consumption," kaitchup.substack.com, April 2026.
8. Arena AI text leaderboard, arena.ai/leaderboard/text, accessed April 4, 2026.
9. ai.rs, "Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader," April 2026.
10. Meta, Llama 4 Scout model specifications, llama.meta.com, 2026.
11. Unsloth, gemma-4-31B-it-GGUF discussion, huggingface.co, April 2026.
12. NVIDIA, nvidia/Gemma-4-31B-IT-NVFP4, huggingface.co, April 2026.
13. DEV Community, "Gemma 4 After 24 Hours: What the Community Found vs What Google Promised," dev.to, April 2026.
14. vLLM GitHub issue #38887, "Gemma 4 E4B extremely slow on v0.19.0," github.com/vllm-project/vllm, April 2026.