SubQ: First Subquadratic LLM Ships 12M Context Window

June 1, 2026

SubQ: First Subquadratic LLM Ships 12M Context Window

TL;DR

On May 5, 2026, a 13-person Miami startup called Subquadratic came out of stealth with $29 million in seed funding and SubQ 1M-Preview — a frontier-tier LLM it claims is the first commercial model built on a fully subquadratic sparse-attention (SSA) architecture.12 The pitch: a 12-million-token native context window in the research model, a 1M-token production API, and roughly one-fifth the per-token cost of Claude Opus or GPT-5.5 at comparable context lengths.13 Subquadratic's own numbers show SubQ scoring 95.0% on RULER 128K at $8 of compute, versus 94.8% for Claude Opus at roughly $2,600 — a ~300x cost reduction at the same accuracy.4 But the launch arrives without an arXiv paper, open weights, or third-party leaderboard entries on Artificial Analysis, LiveBench, or LMArena, and researchers are already pushing back on the "1,000x efficiency" framing.5 This post walks through what Subquadratic actually shipped, where the architectural claim is defensible, where the benchmarks need asterisks, and how SubQ stacks against Claude Opus, GPT-5.5, and Gemini 3.1 Pro on the three tests Subquadratic chose to publish.


What You'll Learn

  • What Subquadratic Sparse Attention (SSA) is and how it differs from FlashAttention and from prior subquadratic models like Mamba and RWKV
  • The full benchmark table Subquadratic published: RULER 128K, MRCR v2, and SWE-Bench Verified — and where the research vs. production numbers diverge
  • How the 300x and 1,000x cost-reduction claims are constructed, and what they leave out
  • Why SubQ's "first subquadratic LLM" framing is narrower than it sounds
  • What products Subquadratic shipped on day one — SubQ API, SubQ Code, SubQ Search — and how to get access
  • The specific verification gaps that have researchers asking for independent reproduction

What Subquadratic Actually Shipped

Subquadratic announced on May 5, 2026 with a single product family: SubQ 1M-Preview.1 The "1M" refers to the production API's context window. A separate research configuration extends to a native 12 million tokens, currently gated to research and select enterprise partners.3

Three surfaces went live simultaneously, all behind a waitlist:6

  • SubQ API — OpenAI-compatible endpoints exposing the 1M-token production model with tool-use support
  • SubQ Code — a CLI coding agent that loads an entire codebase into a single context window, licensed per developer seat and pitched as a long-context layer for teams already using Claude Code, Codex, or Cursor
  • SubQ Search — a long-context retrieval product targeting enterprise document corpora

The company has not yet published a public per-token rate. In launch coverage, Subquadratic positioned SubQ API at roughly one-fifth the per-token cost of Claude Opus or GPT-5.5 at comparable context lengths.23

Subquadratic is run by CEO Justin Dangel, a five-time founder, and CTO Alex Whedon, who was previously a software engineer at Meta and later head of generative AI at TribeAI.7 The seed round was led by Javier Villamizar (formerly of SoftBank Vision Fund), with participation from Justin Mateen (Tinder co-founder and founder of JAM Fund), Grant Gittlin, Jaclyn Rice Nelson, and early investors in Anthropic, OpenAI, Stripe, and Brex.7 The company is 13 people in total, including an 11-person all-PhD research team with prior affiliations across Meta, Google, Oxford, Cambridge, BYU, ByteDance, and Adobe.8


What Subquadratic Sparse Attention Actually Is

Standard transformer attention is O(n²) in sequence length. Doubling the context quadruples the compute. FlashAttention reorganizes the same math to be I/O-efficient on GPUs and reduces the memory footprint from quadratic to linear, but it does not change the compute complexity class — the FLOP count is still O(n²).9

Subquadratic Sparse Attention (SSA), as Subquadratic describes it, is a content-dependent sparse routing scheme that learns which token pairs actually matter and computes exact attention only over the relevant subset.10 The published claim is that this gives roughly linear scaling in compute and memory for long sequences while preserving the ability to recover specific information introduced far earlier in the sequence — something pure linear-recurrent models like Mamba and RWKV historically struggle with.1011

Subquadratic's prefill speedup numbers, measured against FlashAttention-2 on Nvidia B200 GPUs:10

Context lengthSSA speedup vs FlashAttention-2
128K tokens7.2x
256K tokens13.2x
512K tokens23x
1M tokens52.2x

The widely-quoted "52x faster than FlashAttention at 1M tokens" figure is the 1M-token row, rounded.1 At 12M tokens, Subquadratic claims attention compute drops by roughly 1,000x compared with frontier transformer models, with attention FLOPs cut by ~62.8x relative to standard quadratic attention at 1M tokens.210

Important framing. Subquadratic's marketing calls SubQ "the first fully subquadratic LLM."10 That phrasing erases years of work: Mamba (Gu and Dao, 2023), RWKV, and Jamba are already subquadratic and in some cases fully linear in sequence length.11 DeepSeek also shipped DeepSeek Sparse Attention (DSA) inside Multi-Head Latent Attention in V3.2-Exp, then extended the approach into V4's Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) layers, so a commercial frontier-tier model with sparse attention is not unprecedented either.12 The narrower claim Subquadratic actually defends — "the first commercial frontier model with a fully subquadratic sparse-attention architecture" — is more defensible. DeepSeek's variants combine sparse attention with KV-cache compression and dense layers; SubQ's pitch is that SSA is the entire attention mechanism end-to-end, which is what produces the headline scaling numbers.1210


The Benchmark Table

Subquadratic published exactly three benchmarks: RULER 128K, MRCR v2, and SWE-Bench Verified. All three were run by a third-party testing service, and all three are tasks SubQ was explicitly designed for — long-context retrieval and coding. Broader evaluations across general reasoning, math, multilingual performance, and safety are not in the launch material.513

RULER 128K (long-context retrieval)

ModelAccuracyCompute cost per run
SubQ95.0%$8
Claude Opus94.8%~$2,600

At the same accuracy, SubQ runs roughly 300x cheaper.4 The caveat is that RULER at 128K is at the top of the curve for frontier models — leaderboard entries cluster in the low-to-mid 90s, so the headline here is "did not regress on a benchmark frontier models already win, while costing two orders of magnitude less," not "beat frontier accuracy."14

MRCR v2 (multi-needle retrieval and coreference)

ModelScore
SubQ (research)83
GPT-5.574.0
SubQ (third-party)65.9
Claude Opus 4.732.2
Gemini 3.1 Pro26.3

The MRCR result is where the asterisk gets large. Subquadratic's research figure is 83, but the third-party verified production-model score is 65.9 — a 17-point gap that Subquadratic has not fully explained, and that pushes the production model below GPT-5.5 on this benchmark.513 Against Claude Opus 4.7 and Gemini 3.1 Pro, SubQ's production model still posts a wide lead.

SWE-Bench Verified (coding agent)

ModelPass@1
Opus 4.787.6%
SubQ81.8%
Opus 4.680.8%

On general agentic coding, SubQ edges out the previous-generation Opus 4.6 but trails the current Opus 4.7 by ~6 points and is well behind Anthropic's Opus 4.8 results from May 28, which Anthropic reports at 88.6% on SWE-bench Verified.15 Subquadratic does not publish SubQ numbers against Opus 4.8.


How the "1,000x" and "300x" Claims Are Constructed

Two cost-related numbers dominate the launch coverage, and they measure different things:

  1. "~1,000x reduction in attention compute" — this is an architectural number measured at the full 12-million-token context, comparing SubQ's SSA against quadratic transformer attention scaling. It is a FLOP-count argument about how compute would grow if you tried to run a frontier transformer at 12M tokens.2
  2. "~300x cost reduction" — this is a benchmark-level number from the RULER 128K run, comparing SubQ's $8 dollar cost to Claude Opus's ~$2,600 at the same accuracy.4

Both numbers come from Subquadratic. Neither has been independently reproduced. The 300x figure also has confounders: Subquadratic's preview pricing is below long-term sustainable pricing for any frontier provider, and the cost denominator uses Claude Opus's standard $5/$25 per million tokens, which is the rate any RULER-scale eval pays — Subquadratic's advantage is per-token, not because Anthropic charges an extended-context surcharge on current Opus models (it doesn't; Opus 4.6/4.7/4.8 include the full 1M context at standard pricing).16 A like-for-like cost comparison at SubQ's eventual GA pricing has not been published.


What's Missing: Verification Gaps

A research-community pushback emerged within 48 hours of the launch and is summarized in VentureBeat's coverage and several independent technical write-ups.517 The recurring concerns:

  • No peer-reviewed paper, no arXiv preprint. The technical material lives on Subquadratic's company blog and a marketing site.
  • No open weights. External researchers cannot run their own evaluations.
  • No entries on Artificial Analysis, LiveBench, or LMArena. The standard public leaderboards have nothing to compare against.
  • Narrow benchmark set. Three tests, all in long-context retrieval and coding — Subquadratic's strongest territory.
  • Single-run benchmarks. Each model was evaluated once with no confidence intervals reported, which leaves room for run-to-run variance.
  • Unexplained research-vs-production gap. The 17-point MRCR drop between Subquadratic's research figure and its third-party verified production score is the most concrete red flag.

None of this means the architecture is fake — Mamba and RWKV are both real, published, and reproduced subquadratic architectures, so the category is well-established.11 It does mean that the specific magnitude of SubQ's claimed cost and speed advantages remains a marketing number until external groups can stress-test it.


Where SubQ Could Matter (If the Numbers Hold)

The honest read of the launch: even at half the claimed advantage, a subquadratic frontier-tier model with credible long-context retrieval would change three workloads materially.

  • Whole-repo coding agents. SubQ Code's pitch is that you load the entire repository into context in a single pass. At Claude Opus 4.7's pricing, that is uneconomic for most codebases. At one-fifth the cost, the math changes.
  • Document-corpus search and Q&A. SSA preserves exact retrieval at long ranges, which is the failure mode for pure linear-recurrent models. If verified, that's a meaningful differentiator for legal and financial use cases that pure-RAG pipelines struggle with.
  • Long-horizon agentic planning. Multi-step agents with memory of every prior turn become cheap enough to be the default rather than the exception.

The next 60 days are the real test. Either independent groups reproduce something close to the published numbers and SubQ becomes the long-context default, or the gap between research figures and verified production numbers widens and the launch joins the list of architecturally interesting but commercially overstated subquadratic attempts.


How to Get Access

All three SubQ products — API, Code, and Search — are in private beta, with access through waitlist forms on the Subquadratic site.6 The API exposes OpenAI-compatible endpoints, so most existing code can route to SubQ by changing a base URL once access is granted. The 12-million-token research context window is not in the public API; it is restricted to research and select enterprise partners.

If you are already on Claude Code, Codex, or Cursor, Subquadratic's pitch for SubQ Code is a long-context layer licensed per seat, framed as a cost reducer for teams running large-context agentic workflows rather than a replacement for the host agent.


Bottom Line

Subquadratic shipped a credible architectural story, an aggressive pricing pitch, and a benchmark deck with one clean win (RULER 128K at 300x cost reduction), one mixed result (MRCR v2 with a 17-point research-to-production gap), and one underperformance (SWE-Bench Verified, behind Opus 4.7). The "first subquadratic LLM" framing oversells what is otherwise the most interesting attention-architecture launch of 2026 so far. Whether SubQ becomes the new long-context default or joins the list of subquadratic models that did not scale at frontier depends almost entirely on what happens when independent researchers get hands-on weights or API access — a test the company has 60 days at best to clear before the launch narrative gets stale.

For related reading on the long-context cost picture, see our DeepSeek V4 deep dive on million-token context economics and the GPU cloud comparison for 2026 AI compute.


Footnotes

  1. SiliconANGLE — "Subquadratic launches with $29M to bring 12M-token context windows to AI" (May 5, 2026). https://siliconangle.com/2026/05/05/subquadratic-launches-29m-bring-12m-token-context-windows-ai/ 2 3 4

  2. VentureBeat — "Miami startup Subquadratic claims 1,000x AI efficiency gain with SubQ model; researchers demand independent proof." https://venturebeat.com/technology/miami-startup-subquadratic-claims-1-000x-ai-efficiency-gain-with-subq-model-researchers-demand-independent-proof 2 3 4

  3. eWeek — "Subquadratic Launches SubQ, a 12M-Token AI Model for Long-Context Tasks." https://www.eweek.com/news/subquadratic-subq-12m-token-llm-neuron/ 2 3 4

  4. felloai — "SubQ Review: The First Subquadratic LLM with a 12 Million Token Context." https://felloai.com/subq-llm-review/ 2 3

  5. VentureBeat coverage and dataworldbank summary of researcher pushback on SubQ's claims. https://www.dataworldbank.net/2026/05/06/miami-startup-subquadratic-claims-1000x-ai-efficiency-gain-with-subq-model-researchers-demand-independent-proof/ 2 3 4 5 6

  6. LLM Reference — "SubQ by Subquadratic — Models, Pricing & API." https://www.llmreference.com/model-family/subq 2

  7. Pulse 2.0 — "Subquadratic: $29 Million Seed Raised For Long-Context AI Architecture." https://pulse2.com/subquadratic-29-million-seed-raised-for-long-context-ai-architecture/ 2

  8. 36Kr — "13 People Overthrow Transformer: New Architecture SSA Cuts Computing Power by a Thousand Times with Cost Just 5% of Opus." https://eu.36kr.com/en/p/3797755244157959

  9. Tri Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022). https://arxiv.org/abs/2205.14135 — FlashAttention reorganizes attention for I/O efficiency without changing complexity class.

  10. Subquadratic — "How SSA Makes Long Context Practical." https://subq.ai/how-ssa-makes-long-context-practical 2 3 4 5 6 7

  11. CallSphere — "Beyond Transformers: Mamba, RWKV, and State-Space Models Challenging the Dominant Architecture (2026)." https://callsphere.ai/blog/transformer-alternatives-mamba-rwkv-state-space-models-2026 2 3 4

  12. Sebastian Raschka — "DeepSeek Sparse Attention" architecture overview (DSA introduced in V3.2-Exp, 2025, within the MLA framework): https://sebastianraschka.com/llm-architecture-gallery/deepseek-sparse-attention/ — and the V4 extension into Compressed Sparse Attention (CSA) plus Heavily Compressed Attention (HCA): https://www.intoai.pub/p/what-makes-deekseek-v4-so-good 2

  13. DataCamp — "SubQ AI Explained: How Good Is the 12M Context Window LLM?" https://www.datacamp.com/blog/subq-ai-explained 2

  14. RULER long-context benchmark documentation; multiple frontier models cluster at 94-95% at the 128K budget. https://llm-stats.com/benchmarks/ruler

  15. NerdLevelTech — "Claude Opus 4.8: Benchmarks, Dynamic Workflows, Pricing." Opus 4.8 reaches 88.6% on SWE-bench Verified. 2

  16. Anthropic Claude API pricing documentation; Opus 4.6/4.7/4.8 include the full 1M context window at standard $5/$25 per-million-token pricing. https://platform.claude.com/docs/en/about-claude/pricing

  17. byteiota — "SubQ's 12M-Token AI: Breakthrough or Theranos?" Independent technical commentary on the verification gap. https://byteiota.com/subqs-12m-token-ai-breakthrough-or-theranos/

Frequently Asked Questions

Not strictly. Mamba, RWKV, and Jamba are all subquadratic and have been published for years. 11 Subquadratic's defensible framing is that SubQ is the first commercial frontier-tier LLM built on a fully subquadratic sparse-attention architecture. Prior subquadratic models are mostly recurrent or hybrid.

FREE WEEKLY NEWSLETTER

Stay on the Nerd Track

One email per week — courses, deep dives, tools, and AI experiments.

No spam. Unsubscribe anytime.