Is Gemma 4 faster than Qwen 3.5 for inference?

Not currently. Community benchmarks show Gemma 4 running 3–6× slower than Qwen 3.5 on equivalent hardware, primarily due to FlashAttention compatibility issues with Gemma 4's attention architecture. Framework updates are expected to narrow this gap.

Which variant should I use for an AI agent on a laptop?

The E4B model is the best fit for laptop deployment. It supports text, image, and audio inputs, requires approximately 5 GB at 4-bit quantization, and offers 128K context — enough for most agentic workflows.

Does Gemma 4 support function calling?

Yes. All variants support native function calling, structured JSON output, and system instructions out of the box, making them suitable for agentic applications that need tool use. 1

How does the Apache 2.0 license affect fine-tuning and redistribution?

You can fine-tune Gemma 4 for any purpose — including commercial products — and redistribute the fine-tuned weights with no restrictions beyond including the Apache 2.0 license notice. There are no monthly active user caps or acceptable-use policies to comply with.

ai-ml

Gemma 4 Review: Apache-2.0 Open AI, 89.2% on AIME 2026

April 4, 2026

#Gemma 4 #Google DeepMind #open-source AI #local AI deployment #Apache 2.0 #mixture of experts #on-device AI #multimodal AI

Gemma 4 Review: Apache-2.0 Open AI, 89.2% on AIME 2026

TL;DR

Google DeepMind released Gemma 4 on April 2, 2026 — a family of four open models under the Apache 2.0 license that bring frontier-level reasoning, coding, and multimodal capabilities to hardware ranging from smartphones to single-GPU workstations.¹ The flagship 31B Dense model scores 89.2% on AIME 2026 math reasoning and ranks #3 among all open models on the Arena AI text leaderboard.² The 26B Mixture-of-Experts variant achieves similar quality while activating only 3.8B parameters per token.³ The smallest edge variant (E2B) fits under 1.5 GB with aggressive quantization, and both edge models add native audio understanding, building on the audio support Google first introduced with Gemma 3n.⁴

What You'll Learn

Why the Apache 2.0 license change matters more than the benchmarks for commercial and sovereign AI deployments
How Gemma 4's hybrid attention architecture balances speed with long-context understanding
Benchmark comparisons against Qwen 3.5, Llama 4, and other open models at similar parameter counts
Hardware requirements and VRAM budgets for running each variant locally
Known limitations the community discovered within 24 hours of launch
Where Gemma 4 fits in a production stack — from on-device agents to cloud-scale inference

Why Gemma 4 Matters

Open-weight AI models have reached an inflection point. A year ago, running a model that could compete with proprietary APIs on reasoning and coding tasks required renting clusters. With Gemma 4, a 31B-parameter model that outperforms many larger alternatives fits on a single consumer GPU with 4-bit quantization — and Google is releasing it under the most permissive license it has ever used for a Gemma model.

The Apache 2.0 license is the headline change that separates Gemma 4 from its predecessors and most competitors. Previous Gemma releases shipped under Google's custom Gemma Terms of Use, which imposed acceptable-use restrictions and required attribution. Llama 4 uses Meta's community license, which caps commercial usage at 700 million monthly active users and enforces Meta's acceptable-use policy.⁵ Qwen 3.5 ships under Apache 2.0 as well, making it and Gemma 4 the two most permissively licensed frontier-class open model families as of April 2026.⁶

For enterprises building sovereign AI deployments — systems that must run entirely within a specific jurisdiction with no external API dependencies — this licensing distinction is not academic. Apache 2.0 means no monthly active user limits, no acceptable-use policy enforcement by the model creator, and full freedom to modify, redistribute, and commercialize.

Architecture: How Gemma 4 Works

Gemma 4 is built on the same research and technology as Gemini 3, optimized for efficient open deployment.¹ The family ships in four sizes, each targeting a different hardware tier.

The Four Model Variants

Model	Total Parameters	Active Parameters	Context Window	Target Hardware
E2B	5.1B	2.3B (effective)	128K tokens	Phones, embedded devices
E4B	~8B	4.5B (effective)	128K tokens	Laptops, tablets
26B-A4B (MoE)	26B	3.8B	256K tokens	Consumer GPUs (RTX 4090, etc.)
31B Dense	31B	31B	256K tokens	High-end consumer/workstation GPUs

The naming convention deserves explanation. The "E" prefix on E2B and E4B stands for "effective parameters" — these models use Per-Layer Embeddings (PLE), a technique that feeds a secondary embedding signal into every decoder layer, giving them the representational depth of larger models while keeping the actual parameter count small enough for on-device deployment.⁴

The "A4B" suffix on the 26B model means "4B active parameters." This Mixture-of-Experts variant contains 128 small experts and activates 8 plus 1 shared expert per token, meaning only 3.8B parameters fire during each forward pass.³

Hybrid Attention Mechanism

All Gemma 4 models use a hybrid attention design that alternates between two types of attention layers:¹

Local sliding-window attention operates over a fixed window (512 tokens for the smaller models, 1,024 for the larger ones). These layers are computationally cheap and handle local context — understanding the relationship between nearby tokens.

Global full-context attention layers see the entire sequence. The architecture ensures the final layer is always a global attention layer, so the model's last representation has full awareness of the entire input.

This interleaving strategy delivers the processing speed of a lightweight model on most tokens while preserving the deep contextual awareness needed for complex, long-range tasks. The models also use dual RoPE (Rotary Position Embedding) configurations: standard RoPE for sliding-window layers and proportional RoPE for global layers, which enables the extended context windows of 128K and 256K tokens.⁷

The last N layers of each model reuse key-value states from earlier layers, eliminating redundant KV projections.⁷ This optimization reduces memory consumption during inference — a meaningful improvement when running long-context workloads on consumer hardware.

Multimodal Capabilities

All four variants natively process images at variable resolutions, supporting tasks like OCR, chart understanding, and visual question answering. The E2B and E4B edge models add native audio understanding, including automatic speech recognition and speech-to-translated-text across multiple languages, with support for up to 30 seconds of audio input. Native audio was first introduced in the Gemma family with Gemma 3n in 2025; Gemma 4's edge models carry it forward with a smaller, more efficient audio encoder.⁴

Benchmarks: Where Gemma 4 Stands

Academic Benchmarks

Benchmark	Gemma 4 31B	Gemma 4 26B-A4B	Qwen 3.5 35B-A3B	Notes
AIME 2026 (math)	89.2%	88.3%	~82%	Graduate-level math competition²
MMLU Pro (knowledge)	85.2%	—	—	Broad academic knowledge²
GPQA Diamond (science)	84.3%	82.3%	—	Graduate-level science reasoning²
LiveCodeBench v6 (coding)	80.0%	77.1%	—	Real-world coding tasks²
Codeforces ELO	2,150	—	—	Competitive programming²
MMMU Pro (visual reasoning)	76.9%	—	—	Multimodal understanding²

Arena AI Leaderboard

The community-driven Arena AI leaderboard, which ranks models based on blind human preference votes, places Gemma 4 31B at #3 among all open models with an estimated ELO of 1,452. The 26B-A4B MoE variant sits at #6 with 1,441 — notable because it achieves this with only 3.8B active parameters per token.⁸

How It Compares

Against the closest competitors at the 26–35B parameter range:

Qwen 3.5 35B-A3B offers a larger vocabulary (250K tokens) optimized for CJK and non-Latin scripts, and its inference speed is significantly faster on current tooling — partly because it activates only 3B parameters per token. However, Gemma 4 31B leads on math reasoning (AIME 2026: 89.2% vs. ~82%) and competitive coding (Codeforces ELO).⁹

Llama 4 Scout (109B total, 17B active) offers a massive 10-million-token context window but requires substantially more VRAM. Gemma 4's 256K context covers the vast majority of production use cases at a fraction of the hardware cost.¹⁰

For most English-language and coding tasks, Gemma 4 currently leads the 26–31B size tier. For CJK-heavy multilingual workloads, Qwen 3.5's vocabulary gives it an edge. For extremely long-context applications, Llama 4 Scout remains unmatched.

Running Gemma 4 Locally: Hardware Guide

VRAM Requirements

Model	4-bit Quantized	8-bit Quantized	BF16 (Full)	Notes
E2B	~3 GB (Q4_K_M GGUF)	~5 GB	~10 GB	Runs on phones with extreme quantization⁴
E4B	~5 GB (Q4_K_M GGUF)	~8 GB	~16 GB	Laptop-friendly⁴
26B-A4B (MoE)	~10 GB	~18 GB	~52 GB	Active params only need ~4B worth of compute¹¹
31B Dense	~18-20 GB	~34 GB	~61 GB	Fits on RTX 4090 (24 GB) at Q4¹¹

The Context Window Caveat

These VRAM figures cover model weights only. The KV cache for long contexts adds substantially to memory requirements. For the 31B model at 4-bit quantization without KV quantization:¹¹

32K context: ~27 GB for the KV cache alone (~48 GB total VRAM once model weights are included)
128K context: ~109 GB for the KV cache alone
256K context: ~218 GB for the KV cache alone

For practical local deployment on a single GPU, you will likely be limited to 8K–32K context windows with the 31B model. The 26B MoE variant is more memory-efficient here because its active parameter count is much smaller.

Day-One Framework Support

Gemma 4 launched with immediate support across the open-source inference ecosystem: Ollama, LM Studio, llama.cpp, vLLM, SGLang, Hugging Face Transformers, NVIDIA NIM, Keras, and more.¹ Google also provides quantized checkpoints directly, including an NVIDIA NVFP4 variant for 4-bit inference with minimal accuracy loss.¹²

The Speed Problem: What the Community Found

Within 24 hours of launch, the community identified a significant inference speed gap that developers should understand before committing to Gemma 4 for latency-sensitive production workloads.¹³

vLLM and FlashAttention Incompatibility

Gemma 4's heterogeneous attention head dimensions (different sizes for sliding-window vs. global layers) force vLLM — the most popular production serving framework — to disable FlashAttention and fall back to a slower Triton attention kernel. One developer reported the E4B model generating at approximately 9 tokens per second on an RTX 4090, compared to over 100 tokens per second for a similarly-sized Llama 3.2 3B model on the same hardware.¹⁴

For the 26B-A4B MoE variant, community reports showed approximately 11 tokens per second compared to over 60 tokens per second for Qwen 3.5 35B-A3B on equivalent hardware.¹³

KV Cache Memory Hunger

Gemma models have historically consumed more VRAM for KV cache than comparably-sized alternatives. One developer reported fitting Qwen 3.5 27B Q4 with 190K context on an RTX 5090, while Gemma 3 27B Q4 on the same card topped out at 20K context. Gemma 4 continues this pattern.¹³

The Tradeoff

These speed limitations are real but may be temporary. FlashAttention support for heterogeneous head dimensions is an engineering problem, not a fundamental architectural limitation. Framework developers are actively working on optimizations. The quality-per-parameter that Gemma 4 achieves is genuine — the question is whether the serving ecosystem catches up quickly enough for production deployments where latency matters as much as accuracy.

Multilingual Quality: A Genuine Strength

Community testing across German, Arabic, Vietnamese, and French revealed that Gemma 4 outperforms Qwen 3.5 in non-English tasks despite Qwen's larger multilingual vocabulary. One tester described Gemma 4's multilingual capability as being "in a tier of its own" for translation quality.¹³ Given that Gemma 4 is natively trained on over 140 languages, this is an area where it may offer the strongest competitive advantage for non-English deployments.¹

Licensing Deep Dive: Why Apache 2.0 Changes Everything

The shift from Google's custom Gemma Terms of Use to Apache 2.0 has concrete implications:⁵

Aspect	Gemma 4 (Apache 2.0)	Llama 4 (Meta Community)	Qwen 3.5 (Apache 2.0)
Commercial use	Unrestricted	Capped at 700M MAU	Unrestricted
Modification	Unrestricted	Allowed with restrictions	Unrestricted
Redistribution	Allowed with license notice	Allowed with Meta's terms	Allowed with license notice
Acceptable use policy	None from Google	Meta's AUP applies	None from Alibaba
Sovereign deployment	Full freedom	Must comply with Meta's terms	Full freedom

For startups, this means no need to worry about crossing a monthly active user threshold. For governments and regulated industries building sovereign AI infrastructure, Apache 2.0 eliminates the legal complexity of navigating a custom license.

The Bottom Line

Gemma 4 is the most capable open model family in the 26–31B parameter range as of April 2026, measured by academic benchmarks and human preference rankings. The Apache 2.0 license removes the last significant barrier to unrestricted commercial and sovereign deployment. The edge-optimized E2B and E4B variants bring genuine multimodal intelligence — including audio — to devices that fit in your pocket.

The tradeoff is inference speed. Until the serving ecosystem fully supports Gemma 4's heterogeneous attention architecture, developers building latency-sensitive applications may find Qwen 3.5 faster in practice despite lower benchmark scores. For applications where accuracy matters more than tokens-per-second — research, analysis, code generation, multilingual content — Gemma 4 is the new default in the open-weight tier.

The timing is significant. April 2026 is the densest model release period in open-source AI history, with Qwen 3.5, Llama 4, and now Gemma 4 all competing for developer adoption. Google's choice to ship under Apache 2.0 — matching Qwen and undercutting Llama's restrictions — signals that the open-model race is now being won on licensing terms as much as benchmark points.

If you are building local AI pipelines with tools like Ollama, Gemma 4 slots in as a direct upgrade — our guide on building local AI with RAG agents and Ollama covers the infrastructure patterns that apply equally well to Gemma 4. For context on the broader competitive landscape including Alibaba's multimodal push, see our coverage of Qwen 3.5 Omni.

Google DeepMind, "Gemma 4: Byte for byte, the most capable open models," blog.google, April 2, 2026. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶
Google DeepMind, Gemma 4 technical report and benchmark results, deepmind.google/models/gemma/gemma-4/, April 2, 2026. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷
Hugging Face model card, google/gemma-4-26B-A4B, huggingface.co, April 2, 2026. ↩ ↩²
Hugging Face blog, "Welcome Gemma 4: Frontier multimodal intelligence on device," huggingface.co/blog/gemma4, April 2, 2026. ↩ ↩² ↩³ ↩⁴ ↩⁵
VentureBeat, "Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks," April 2, 2026. ↩ ↩²
Qwen team, Qwen 3.5 release notes, Apache 2.0 license. ↩
Kaitchup, "Gemma 4 31B and 26B A4B: Architecture and Memory Consumption," kaitchup.substack.com, April 2026. ↩ ↩²
Arena AI text leaderboard, arena.ai/leaderboard/text, accessed April 4, 2026. ↩
ai.rs, "Gemma 4 vs Qwen 3.5 vs Llama 4: Updated Benchmarks, New Leader," April 2026. ↩
Meta, Llama 4 Scout model specifications, llama.meta.com, 2026. ↩
Unsloth, gemma-4-31B-it-GGUF discussion, huggingface.co, April 2026. ↩ ↩² ↩³
NVIDIA, nvidia/Gemma-4-31B-IT-NVFP4, huggingface.co, April 2026. ↩
DEV Community, "Gemma 4 After 24 Hours: What the Community Found vs What Google Promised," dev.to, April 2026. ↩ ↩² ↩³ ↩⁴
vLLM GitHub issue #38887, "Gemma 4 E4B extremely slow on v0.19.0," github.com/vllm-project/vllm, April 2026. ↩

Frequently Asked Questions

Yes. With 4-bit quantization, the 31B model fits on an RTX 4090 (24 GB VRAM) with short context windows (up to approximately 8K–16K tokens). For longer contexts, you will need more VRAM or should consider the 26B MoE variant, which uses less memory due to its smaller active parameter count.