AirLLM: Run 70B Models on a 4GB GPU — Hype vs Reality
April 5, 2026
TL;DR
AirLLM is an open-source Python library that lets you run 70-billion-parameter language models on a single 4GB GPU — and even 405B models on 8GB of VRAM — without quantization. It works by loading one transformer layer at a time from disk, performing the computation, and swapping in the next. The tradeoff is speed: real-world measurements range from about 0.7 tokens per second on faster hardware to minutes per token on slower setups. AirLLM is a genuine technical achievement for accessibility, but it is not a replacement for production inference tools like llama.cpp or vLLM. Here is what it actually does, where it excels, and where it falls short.
What You'll Learn
- What AirLLM is and the problem it solves
- How layer-wise inference works under the hood
- Real-world performance numbers and speed tradeoffs
- How AirLLM compares to llama.cpp, vLLM, and Ollama
- Which use cases AirLLM is genuinely good for — and which it is not
- Whether the hype around running 70B models on consumer hardware is justified
What Is AirLLM?
AirLLM is an open-source Python library created by Gavin Li, a former Airbnb AI engineer who led the company's AI customer support effort[1]. First released on November 20, 2023, AirLLM tackles one of the biggest barriers to running large language models locally: GPU memory.
A 70-billion-parameter model in full precision requires roughly 140GB of VRAM — far beyond what any consumer GPU offers. The standard solution is quantization: compressing the model to 4-bit or 8-bit precision so it fits in less memory. This works well but always involves some accuracy loss.
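The arithmetic behind that 140GB figure is simple: weight storage is parameter count times bytes per parameter. A quick sketch (sizes are approximate, with 1 GB taken as 1e9 bytes, and activations and KV cache ignored):

```python
# Approximate weight storage for a model, ignoring activations and KV cache.
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

print(weight_gb(70e9, 2.0))   # FP16: 140.0 GB of weights alone
print(weight_gb(70e9, 0.5))   # 4-bit quantized: 35.0 GB
```

The same arithmetic explains why quantization is the standard workaround: at 4 bits per parameter, the same 70B model shrinks to roughly 35GB, which fits on a pair of consumer GPUs.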
AirLLM takes a different approach entirely. Instead of shrinking the model, it streams it. The library loads one transformer layer at a time into GPU memory, runs the computation for that layer, saves the intermediate result, and then swaps in the next layer. The result: a 70B model runs on a GPU with just 4GB of VRAM, at full precision, with no quantization required[2].
The latest version (2.11.0) added support for Llama 3.1 405B inference on just 8GB of VRAM, along with support for the Qwen2.5 model family[3].
How Layer-Wise Inference Works
Traditional LLM inference loads the entire model into GPU memory at once. For a 70B model, that means all 80+ transformer layers sit in VRAM simultaneously. AirLLM's layer-wise approach is fundamentally different.
When you first load a model, AirLLM's layer decomposition engine splits the full checkpoint into individual layer shards — typically 80 to 100 separate files for a 70B model. During inference, the pipeline follows a strict sequence: load layer N into GPU memory, compute that layer's activations, save the output, offload layer N, then prefetch layer N+1. This continues until all layers have processed the input[4].
The memory savings are dramatic. Peak VRAM usage drops from 140GB+ to under 4GB — a reduction of over 95%. The model runs at full precision because no weights are modified; they are just loaded and unloaded sequentially[5].
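The load-compute-offload loop can be sketched in miniature. This is an illustrative toy, not AirLLM's internals: each "layer" is a single scale factor and loading is a list append, but it shows the invariant AirLLM relies on, namely that at most one layer's weights are ever resident at a time:

```python
# Toy layer-streaming loop: only one "layer" is resident at any moment.
def layerwise_forward(x, shards):
    resident = []          # stand-in for layers currently in GPU memory
    peak = 0
    for shard in shards:
        resident.append(shard)            # load layer N from disk
        peak = max(peak, len(resident))   # track peak residency
        x = shard * x                     # compute layer N's output
        resident.pop()                    # offload layer N before layer N+1
    return x, peak

out, peak = layerwise_forward(1.0, [1.0, 2.0, 0.5, 3.0])
print(out, peak)  # 3.0 1 -> peak residency is one layer, not four
```

Peak residency stays at one layer no matter how many layers the model has, which is exactly why VRAM requirements stop scaling with model size.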
AirLLM v2.0 added optional block-wise quantization (4-bit or 8-bit) that can accelerate inference by up to 3x when full precision is not required. Version 2.5 introduced prefetching to overlap disk reads with GPU computation, reducing idle time between layers[6].
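Prefetching of this kind can be approximated with a reader thread and a one-slot queue: while the consumer computes layer N, the producer is already reading layer N+1. A minimal sketch, where the `load` and `compute` callables are placeholders rather than AirLLM's API:

```python
import queue
import threading

# Double-buffered prefetch: overlap "disk reads" with "GPU compute".
def prefetch_forward(x, layer_ids, load, compute):
    q = queue.Queue(maxsize=1)    # at most one prefetched layer in flight
    def reader():
        for lid in layer_ids:
            q.put(load(lid))      # blocks until the consumer catches up
        q.put(None)               # end-of-model sentinel
    threading.Thread(target=reader, daemon=True).start()
    while True:
        weights = q.get()
        if weights is None:
            return x
        x = compute(weights, x)   # runs while the next layer is loading

result = prefetch_forward(1.0, [1, 2, 3],
                          load=float,                  # stand-in disk read
                          compute=lambda w, x: w * x)  # stand-in layer math
print(result)  # 6.0
```

The bounded queue is the key design choice: it caps memory at one extra layer while still hiding whichever of the two costs (I/O or compute) is smaller.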
The Speed Tradeoff: Where the Hype Meets Reality
This is where the headline claim — "run 70B on 4GB" — needs serious context. AirLLM makes large models possible on consumer hardware, but it does not make them fast.
Every layer swap requires reading tens of megabytes from disk. Even on an NVMe SSD, this I/O overhead adds up across 80+ layers for every single token generated. The actual performance varies dramatically depending on hardware:
| Hardware | Model | Approximate Speed |
|---|---|---|
| NVMe SSD + dedicated GPU | Llama 2 70B | ~0.7 tokens/second |
| M2 MacBook Pro (16GB RAM) | 70B model | ~50 tokens in 12 minutes (~0.07 tok/s) |
| Google Colab T4 | Llama 2 7B | ~1 token per minute (reported by users) |
| Batch mode (50 prompts) | 70B model | ~5.3 seconds per token |
For comparison, llama.cpp running a quantized 70B model on an RTX 4090 typically achieves 8-15 tokens per second — roughly 10 to 20 times faster than AirLLM's best case[7].
The bottom line: AirLLM is not suitable for interactive chat applications, which generally require 20+ tokens per second for a smooth experience. It is designed for offline batch processing, one-shot analysis, and research experimentation where latency is acceptable.
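To see why storage dominates, consider a naive lower bound: if every byte of a 140GB model had to be re-read from disk for each token, per-token latency would be at least model size divided by disk bandwidth. The bandwidth figures below are typical rough numbers, not measurements, and real runs can beat this naive bound when the OS page cache keeps recently used shards in RAM:

```python
# Naive no-cache lower bound on per-token latency: every layer shard is
# re-read from storage for each generated token.
def seconds_per_token(model_gb: float, disk_gb_per_s: float) -> float:
    return model_gb / disk_gb_per_s

for name, bandwidth in [("NVMe SSD", 3.5), ("SATA SSD", 0.5), ("HDD", 0.15)]:
    print(f"{name}: ~{seconds_per_token(140, bandwidth):.0f} s/token")
```

Even under generous assumptions, the per-token cost is measured in seconds to minutes, which is why the table above looks the way it does.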
Supported Models and Platform Compatibility
AirLLM supports a broad range of open-source model architectures. As of version 2.11.0, the supported model families include Llama (including Llama 3 and Llama 3.1 405B), Qwen and Qwen2.5, Mistral and Mixtral, ChatGLM, Baichuan, and InternLM[8].
Version 2.6 introduced AutoModel, which automatically detects the model architecture from the checkpoint files, eliminating the need to specify the model class manually during initialization[9].
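In code, that workflow is short. The sketch below follows the usage pattern shown in the project's README; the model ID and generation settings are illustrative, and the function degrades to a stub when airllm is not installed (running it for real needs a CUDA GPU and on the order of 140GB of free disk):

```python
def run_airllm(prompt: str,
               model_id: str = "meta-llama/Meta-Llama-3.1-70B-Instruct",
               max_new_tokens: int = 20) -> str:
    """Generate a short completion with AirLLM, or return a stub offline."""
    try:
        from airllm import AutoModel  # v2.6+ detects the architecture itself
    except ImportError:
        return "(airllm not installed -- sketch only)"
    model = AutoModel.from_pretrained(model_id)  # shards layers on first run
    tokens = model.tokenizer([prompt], return_tensors="pt",
                             return_attention_mask=False,
                             truncation=True, max_length=128)
    out = model.generate(tokens["input_ids"].cuda(),
                         max_new_tokens=max_new_tokens,
                         use_cache=True, return_dict_in_generate=True)
    return model.tokenizer.decode(out.sequences[0])

# Example: print(run_airllm("What is the capital of France?"))
```

Note that the first call is much slower than subsequent ones, because it performs the one-time layer decomposition described earlier.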
Platform compatibility is broad: AirLLM runs on Linux, Windows, and macOS, including Apple Silicon (M1 through M4). CPU-only inference is supported as of version 2.10.1, though performance on CPU is significantly slower than even the slowest GPU configurations[10].
Disk Space: The Hidden Requirement
While AirLLM eliminates the VRAM barrier, it shifts the storage requirement to disk. A 70B model in full precision occupies 120-140GB of disk space initially. After AirLLM's first-run layer decomposition, the original model is deleted automatically (via a `delete_original` flag), leaving approximately 70GB of layer shards — a 50% savings[11].
For the 405B Llama 3.1 model, disk requirements scale proportionally. An NVMe SSD provides 3-5x faster layer loading than a SATA SSD, making storage type a critical factor in real-world performance. Mechanical hard drives will work but make inference painfully slow[12].
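One practical consequence worth planning for: during the first-run decomposition, the original checkpoint and the growing set of layer shards exist side by side until the original is deleted, so transient disk usage exceeds the steady-state figure. A rough planning sketch, where the 50% shard fraction is taken from the 140GB-to-70GB numbers above and should be treated as an estimate:

```python
# Rough disk planning for AirLLM's first run: the original checkpoint and
# the layer shards coexist until delete_original removes the checkpoint.
def peak_first_run_gb(checkpoint_gb: float, shard_fraction: float = 0.5) -> float:
    return checkpoint_gb + checkpoint_gb * shard_fraction

def steady_state_gb(checkpoint_gb: float, shard_fraction: float = 0.5) -> float:
    return checkpoint_gb * shard_fraction

print(peak_first_run_gb(140))  # 210.0 GB free needed during sharding
print(steady_state_gb(140))    # 70.0 GB of shards afterwards
```

In other words, budget free space for roughly 1.5x the checkpoint size for the first run, not just the final shard footprint.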
AirLLM vs. the Alternatives
AirLLM is not the only way to run large models on consumer hardware. Here is how it compares to the three most popular alternatives:
| Feature | AirLLM | llama.cpp (GGUF) | vLLM | Ollama |
|---|---|---|---|---|
| Primary approach | Layer-wise streaming | Quantization (GGUF) | PagedAttention | Quantization + serving |
| Min VRAM for 70B | 4GB | ~32-40GB (Q4) | 80GB+ | ~32-40GB (Q4) |
| Speed (70B) | ~0.7 tok/s best case | 8-15 tok/s (RTX 4090) | 20+ tok/s (A100) | 5-10 tok/s |
| Precision | Full (FP16/FP32) | Quantized (Q4-Q8) | Full or quantized | Quantized |
| Production ready | No | Yes | Yes | Yes |
| Interactive chat | No | Yes | Yes | Yes |
| Ecosystem | Standalone | Extensive (LM Studio, etc.) | Server-focused | Desktop-friendly |
llama.cpp is the most direct comparison. It uses quantization to compress models — typically to 4-bit precision — so they fit in available VRAM. The accuracy loss from well-implemented quantization is usually small (often undetectable in practice), and the speed advantage is enormous. For most users who want to run large models locally, llama.cpp with GGUF quantization is the better practical choice[13].
vLLM is designed for high-throughput production serving with features like PagedAttention for efficient memory management. It requires substantially more VRAM but delivers production-grade performance.
AirLLM's distinguishing value is that it can run a 70B model at full precision on a 4GB GPU without requiring any quantization. Other frameworks, such as llama.cpp, support partial layer offloading to RAM but typically still require quantization to fit the model. If your use case demands zero quantization loss and you can tolerate slow inference, AirLLM is one of the few tools that delivers it.
Who Should Use AirLLM?
AirLLM solves a real problem, but for a narrow audience. Based on real-world performance data and community feedback, AirLLM is well-suited for:
- Researchers who need to test or evaluate large models without cloud GPU access
- Offline batch processing tasks where latency is irrelevant (such as processing a dataset overnight)
- Quick experimentation with different model sizes to understand capability differences
- Situations where full-precision inference is genuinely required and quantization artifacts are unacceptable[14]
AirLLM is not the right tool for:
- Interactive chatbots or real-time applications
- Production API serving
- Any workflow where users are waiting for responses
- Situations where a quantized model via llama.cpp would produce acceptable quality (which is most situations)
Does AirLLM Live Up to the Hype?
The honest answer: partially.
The core technical claim is real and verified. AirLLM genuinely runs a 70B model on a 4GB GPU at full precision. That is not marketing — it is a legitimate engineering achievement. The layer-streaming architecture works, and for the narrow set of use cases where full precision on minimal hardware matters, nothing else does this[15].
But the hype — driven by headlines like "Run 70B on Your Laptop!" — oversells the practical experience. Running a 70B model at 0.07 tokens per second on a MacBook is technically possible but practically useless for anything interactive. The speed tradeoff is not a minor footnote; it is the defining characteristic of using AirLLM.
For most people who want to run large models locally, the pragmatic recommendation is to start with llama.cpp or Ollama with quantized models. A well-quantized 70B model at Q4 precision runs 10-20x faster than AirLLM's full-precision approach, with accuracy differences that are rarely noticeable in practice. If you are interested in how open-source models are evolving more broadly, our guide on Google Gemma 4 covers the latest in open-weight model deployment.
AirLLM's real value is as a research and experimentation tool — a way to access models that would otherwise be completely out of reach. On that narrow promise, it delivers. On the broader promise of "democratizing" 70B models for everyday use, the speed penalty is too steep for most workflows. For a wider look at how AI infrastructure costs shape what is practical, see our AI cost breakdown.
Footnotes
1. LLM by Examples: Layer-wise inference using PyTorch or using AirLLM — Medium
2. AirLLM: Run 70B Models on 4GB GPUs Without Compromise — BrightCoding
3. Unbelievable! Run 70B LLM Inference on a Single 4GB GPU — Hugging Face Blog
4. AirLLM: Run 70B Models on Your 4GB GPU (But Pack a Lunch) — AI505
5. AirLLM: Democratizing Large Language Models on Everyday Hardware — StartupNaama
6. What Is AirLLM and Why It Matters for Running LLMs on Limited Hardware — Medium
7. AirLLM and "70B on a 4GB GPU" — What's Actually Going On? — Medium
8. AirLLM: Run Massive AI Models on Modest Hardware Without Extra VRAM — Nacho Conesa
9. AirLLM Review: Democratizing Access vs. The Unavoidable Physics of Latency — DataSharePro
10. I Ran a 70B AI Model on My Old Laptop — Here's How AirLLM Did It — Towards AI