AirLLM: Run 70B Models on a 4GB GPU — Hype vs Reality
April 5, 2026
TL;DR
AirLLM is an open-source Python library that lets you run 70-billion-parameter language models on a single 4GB GPU — and even 405B models on 8GB of VRAM — without quantization. It works by loading one transformer layer at a time from disk, performing the computation, and swapping in the next. The tradeoff is speed: real-world measurements range from about 0.7 tokens per second on faster hardware to minutes per token on slower setups. AirLLM is a genuine technical achievement for accessibility, but it is not a replacement for production inference tools like llama.cpp or vLLM. Here is what it actually does, where it excels, and where it falls short.
What You'll Learn
- What AirLLM is and the problem it solves
- How layer-wise inference works under the hood
- Real-world performance numbers and speed tradeoffs
- How AirLLM compares to llama.cpp, vLLM, and Ollama
- Which use cases AirLLM is genuinely good for — and which it is not
- Whether the hype around running 70B models on consumer hardware is justified
What Is AirLLM?
AirLLM is an open-source Python library created by Gavin Li, a former Airbnb AI engineer who led the company's AI customer support effort[1]. First released on November 20, 2023, AirLLM tackles one of the biggest barriers to running large language models locally: GPU memory.
A 70-billion-parameter model in full precision requires roughly 140GB of VRAM — far beyond what any consumer GPU offers. The standard solution is quantization: compressing the model to 4-bit or 8-bit precision so it fits in less memory. This works well but always involves some accuracy loss.
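The arithmetic behind that 140GB figure is simple: weight storage is parameter count times bytes per parameter. A quick sketch (sizes are approximate, with 1 GB taken as 1e9 bytes, and activations and KV cache ignored):

```python
# Approximate weight storage for a model, ignoring activations and KV cache.
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

print(weight_gb(70e9, 2.0))   # FP16: 140.0 GB of weights alone
print(weight_gb(70e9, 0.5))   # 4-bit quantized: 35.0 GB
```

The same arithmetic explains why quantization is the standard workaround: at 4 bits per parameter, the same 70B model shrinks to roughly 35GB, which fits on a pair of consumer GPUs.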
AirLLM takes a different approach entirely. Instead of shrinking the model, it streams it. The library loads one transformer layer at a time into GPU memory, runs the computation for that layer, saves the intermediate result, and then swaps in the next layer. The result: a 70B model runs on a GPU with just 4GB of VRAM, at full precision, with no quantization required[2].
The latest version (2.11.0) added support for Llama 3.1 405B inference on just 8GB of VRAM, along with support for the Qwen2.5 model family[3].
How Layer-Wise Inference Works
Traditional LLM inference loads the entire model into GPU memory at once. For a 70B model, that means all 80+ transformer layers sit in VRAM simultaneously. AirLLM's layer-wise approach is fundamentally different.
When you first load a model, AirLLM's layer decomposition engine splits the full checkpoint into individual layer shards — typically 80 to 100 separate files for a 70B model. During inference, the pipeline follows a strict sequence: load layer N into GPU memory, compute that layer's activations, save the output, offload layer N, then prefetch layer N+1. This continues until all layers have processed the input[4].
The memory savings are dramatic. Peak VRAM usage drops from 140GB+ to under 4GB — a reduction of over 95%. The model runs at full precision because no weights are modified; they are just loaded and unloaded sequentially[5].
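The load-compute-offload loop can be sketched in miniature. This is an illustrative toy, not AirLLM's internals: each "layer" is a single scale factor and loading is a list append, but it shows the invariant AirLLM relies on, namely that at most one layer's weights are ever resident at a time:

```python
# Toy layer-streaming loop: only one "layer" is resident at any moment.
def layerwise_forward(x, shards):
    resident = []          # stand-in for layers currently in GPU memory
    peak = 0
    for shard in shards:
        resident.append(shard)            # load layer N from disk
        peak = max(peak, len(resident))   # track peak residency
        x = shard * x                     # compute layer N's output
        resident.pop()                    # offload layer N before layer N+1
    return x, peak

out, peak = layerwise_forward(1.0, [1.0, 2.0, 0.5, 3.0])
print(out, peak)  # 3.0 1 -> peak residency is one layer, not four
```

Peak residency stays at one layer no matter how many layers the model has, which is exactly why VRAM requirements stop scaling with model size.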
AirLLM v2.0 added optional block-wise quantization (4-bit or 8-bit) that can accelerate inference by up to 3x when full precision is not required. Version 2.5 introduced prefetching to overlap disk reads with GPU computation, reducing idle time between layers[6].
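Prefetching of this kind can be approximated with a reader thread and a one-slot queue: while the consumer computes layer N, the producer is already reading layer N+1. A minimal sketch, where the `load` and `compute` callables are placeholders rather than AirLLM's API:

```python
import queue
import threading

# Double-buffered prefetch: overlap "disk reads" with "GPU compute".
def prefetch_forward(x, layer_ids, load, compute):
    q = queue.Queue(maxsize=1)    # at most one prefetched layer in flight
    def reader():
        for lid in layer_ids:
            q.put(load(lid))      # blocks until the consumer catches up
        q.put(None)               # end-of-model sentinel
    threading.Thread(target=reader, daemon=True).start()
    while True:
        weights = q.get()
        if weights is None:
            return x
        x = compute(weights, x)   # runs while the next layer is loading

result = prefetch_forward(1.0, [1, 2, 3],
                          load=float,                  # stand-in disk read
                          compute=lambda w, x: w * x)  # stand-in layer math
print(result)  # 6.0
```

The bounded queue is the key design choice: it caps memory at one extra layer while still hiding whichever of the two costs (I/O or compute) is smaller.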
The Speed Tradeoff: Where the Hype Meets Reality
This is where the headline claim — "run 70B on 4GB" — needs serious context. AirLLM makes large models possible on consumer hardware, but it does not make them fast.
Every layer swap requires reading tens of megabytes from disk. Even on an NVMe SSD, this I/O overhead adds up across 80+ layers for every single token generated. The actual performance varies dramatically depending on hardware:
| Hardware | Model | Approximate Speed |
|---|---|---|
| NVMe SSD + dedicated GPU | Llama 2 70B | ~0.7 tokens/second |
| M2 MacBook Pro (16GB RAM) | 70B model | ~50 tokens in 12 minutes (~0.07 tok/s) |
| Google Colab T4 | Llama 2 7B | ~1 token per minute (reported by users) |
| Batch mode (50 prompts) | 70B model | ~5.3 seconds per token |
For comparison, llama.cpp running a quantized 70B model on an RTX 4090 typically achieves 8-15 tokens per second — roughly 10 to 20 times faster than AirLLM's best case[7].
The bottom line: AirLLM is not suitable for interactive chat applications, which generally require 20+ tokens per second for a smooth experience. It is designed for offline batch processing, one-shot analysis, and research experimentation where latency is acceptable.
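To see why storage dominates, consider a naive lower bound: if every byte of a 140GB model had to be re-read from disk for each token, per-token latency would be at least model size divided by disk bandwidth. The bandwidth figures below are typical rough numbers, not measurements, and real runs can beat this naive bound when the OS page cache keeps recently used shards in RAM:

```python
# Naive no-cache lower bound on per-token latency: every layer shard is
# re-read from storage for each generated token.
def seconds_per_token(model_gb: float, disk_gb_per_s: float) -> float:
    return model_gb / disk_gb_per_s

for name, bandwidth in [("NVMe SSD", 3.5), ("SATA SSD", 0.5), ("HDD", 0.15)]:
    print(f"{name}: ~{seconds_per_token(140, bandwidth):.0f} s/token")
```

Even under generous assumptions, the per-token cost is measured in seconds to minutes, which is why the table above looks the way it does.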
Supported Models and Platform Compatibility
AirLLM supports a broad range of open-source model architectures. As of version 2.11.0, the supported model families include Llama (including Llama 3 and Llama 3.1 405B), Qwen and Qwen2.5, Mistral and Mixtral, ChatGLM, Baichuan, and InternLM[8].
Version 2.6 introduced AutoModel, which automatically detects the model architecture from the checkpoint files, eliminating the need to specify the model class manually during initialization[9].
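In code, that workflow is short. The sketch below follows the usage pattern shown in the project's README; the model ID and generation settings are illustrative, and the function degrades to a stub when airllm is not installed (running it for real needs a CUDA GPU and on the order of 140GB of free disk):

```python
def run_airllm(prompt: str,
               model_id: str = "meta-llama/Meta-Llama-3.1-70B-Instruct",
               max_new_tokens: int = 20) -> str:
    """Generate a short completion with AirLLM, or return a stub offline."""
    try:
        from airllm import AutoModel  # v2.6+ detects the architecture itself
    except ImportError:
        return "(airllm not installed -- sketch only)"
    model = AutoModel.from_pretrained(model_id)  # shards layers on first run
    tokens = model.tokenizer([prompt], return_tensors="pt",
                             return_attention_mask=False,
                             truncation=True, max_length=128)
    out = model.generate(tokens["input_ids"].cuda(),
                         max_new_tokens=max_new_tokens,
                         use_cache=True, return_dict_in_generate=True)
    return model.tokenizer.decode(out.sequences[0])

# Example: print(run_airllm("What is the capital of France?"))
```

Note that the first call is much slower than subsequent ones, because it performs the one-time layer decomposition described earlier.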
Platform compatibility is broad: AirLLM runs on Linux, Windows, and macOS, including Apple Silicon (M1 through M4). CPU-only inference is supported as of version 2.10.1, though performance on CPU is significantly slower than even the slowest GPU configurations[10].
Disk Space: The Hidden Requirement
While AirLLM eliminates the VRAM barrier, it shifts the storage requirement to disk. A 70B model in full precision occupies 120-140GB of disk space initially. After AirLLM's first-run layer decomposition, the original model is deleted automatically (via a `delete_original` flag), leaving approximately 70GB of layer shards — a 50% savings[11].
For the 405B Llama 3.1 model, disk requirements scale proportionally. An NVMe SSD provides 3-5x faster layer loading than a SATA SSD, making storage type a critical factor in real-world performance. Mechanical hard drives will work but make inference painfully slow[12].
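One practical consequence worth planning for: during the first-run decomposition, the original checkpoint and the growing set of layer shards exist side by side until the original is deleted, so transient disk usage exceeds the steady-state figure. A rough planning sketch, where the 50% shard fraction is taken from the 140GB-to-70GB numbers above and should be treated as an estimate:

```python
# Rough disk planning for AirLLM's first run: the original checkpoint and
# the layer shards coexist until delete_original removes the checkpoint.
def peak_first_run_gb(checkpoint_gb: float, shard_fraction: float = 0.5) -> float:
    return checkpoint_gb + checkpoint_gb * shard_fraction

def steady_state_gb(checkpoint_gb: float, shard_fraction: float = 0.5) -> float:
    return checkpoint_gb * shard_fraction

print(peak_first_run_gb(140))  # 210.0 GB free needed during sharding
print(steady_state_gb(140))    # 70.0 GB of shards afterwards
```

In other words, budget free space for roughly 1.5x the checkpoint size for the first run, not just the final shard footprint.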
AirLLM vs. the Alternatives
AirLLM is not the only way to run large models on consumer hardware. Here is how it compares to the three most popular alternatives:
| Feature | AirLLM | llama.cpp (GGUF) | vLLM | Ollama |
|---|---|---|---|---|
| Primary approach | Layer-wise streaming | Quantization (GGUF) | PagedAttention | Quantization + serving |
| Min VRAM for 70B | 4GB | ~32-40GB (Q4) | 80GB+ | ~32-40GB (Q4) |
| Speed (70B) | ~0.7 tok/s best case | 8-15 tok/s (RTX 4090) | 20+ tok/s (A100) | 5-10 tok/s |
| Precision | Full (FP16/FP32) | Quantized (Q4-Q8) | Full or quantized | Quantized |
| Production ready | No | Yes | Yes | Yes |
| Interactive chat | No | Yes | Yes | Yes |
| Ecosystem | Standalone | Extensive (LM Studio, etc.) | Server-focused | Desktop-friendly |
llama.cpp is the most direct comparison. It uses quantization to compress models — typically to 4-bit precision — so they fit in available VRAM. The accuracy loss from well-implemented quantization is usually small (often undetectable in practice), and the speed advantage is enormous. For most users who want to run large models locally, llama.cpp with GGUF quantization is the better practical choice[13].
vLLM is designed for high-throughput production serving with features like PagedAttention for efficient memory management. It requires substantially more VRAM but delivers production-grade performance.
AirLLM's distinguishing value is that it can run a 70B model at full precision on a 4GB GPU without requiring any quantization. Other frameworks, such as llama.cpp, support partial layer offloading to RAM but typically still require quantization to fit the model. If your use case demands zero quantization loss and you can tolerate slow inference, AirLLM is one of the few tools that delivers it.
Who Should Use AirLLM?
AirLLM solves a real problem, but for a narrow audience. Based on real-world performance data and community feedback, AirLLM is well-suited for:
- Researchers who need to test or evaluate large models without cloud GPU access
- Offline batch processing tasks where latency is irrelevant (such as processing a dataset overnight)
- Quick experimentation with different model sizes to understand capability differences
- Situations where full-precision inference is genuinely required and quantization artifacts are unacceptable[14]
AirLLM is not the right tool for:
- Interactive chatbots or real-time applications
- Production API serving
- Any workflow where users are waiting for responses
- Situations where a quantized model via llama.cpp would produce acceptable quality (which is most situations)
Does AirLLM Live Up to the Hype?
The honest answer: partially.
The core technical claim is real and verified. AirLLM genuinely runs a 70B model on a 4GB GPU at full precision. That is not marketing — it is a legitimate engineering achievement. The layer-streaming architecture works, and for the narrow set of use cases where full precision on minimal hardware matters, nothing else does this[15].
But the hype — driven by headlines like "Run 70B on Your Laptop!" — oversells the practical experience. Running a 70B model at 0.07 tokens per second on a MacBook is technically possible but practically useless for anything interactive. The speed tradeoff is not a minor footnote; it is the defining characteristic of using AirLLM.
For most people who want to run large models locally, the pragmatic recommendation is to start with llama.cpp or Ollama with quantized models. A well-quantized 70B model at Q4 precision runs 10-20x faster than AirLLM's full-precision approach, with accuracy differences that are rarely noticeable in practice. If you are interested in how open-source models are evolving more broadly, our guide on Google Gemma 4 covers the latest in open-weight model deployment.
AirLLM's real value is as a research and experimentation tool — a way to access models that would otherwise be completely out of reach. On that narrow promise, it delivers. On the broader promise of "democratizing" 70B models for everyday use, the speed penalty is too steep for most workflows. For a wider look at how AI infrastructure costs shape what is practical, see our AI cost breakdown.
Footnotes
1. LLM by Examples: Layer-wise inference using PyTorch or using AirLLM — Medium
2. AirLLM: Run 70B Models on 4GB GPUs Without Compromise — BrightCoding
3. Unbelievable! Run 70B LLM Inference on a Single 4GB GPU — Hugging Face Blog
4. AirLLM: Run 70B Models on Your 4GB GPU (But Pack a Lunch) — AI505
5. AirLLM: Democratizing Large Language Models on Everyday Hardware — StartupNaama
6. What Is AirLLM and Why It Matters for Running LLMs on Limited Hardware — Medium
7. AirLLM and "70B on a 4GB GPU" — What's Actually Going On? — Medium
8. AirLLM: Run Massive AI Models on Modest Hardware Without Extra VRAM — Nacho Conesa
9. AirLLM Review: Democratizing Access vs. The Unavoidable Physics of Latency — DataSharePro
10. I Ran a 70B AI Model on My Old Laptop — Here's How AirLLM Did It — Towards AI