Huawei Ascend 950PR: The AI Chip Challenging NVIDIA (2026)
April 2, 2026
TL;DR
Huawei unveiled the Ascend 950PR at the China Partner Conference on March 20, 2026 — an inference-focused AI accelerator delivering 1 PFLOPS at FP8 and 1.56 PFLOPS at FP4, roughly 2.8 times the single-card compute of NVIDIA's H20.[1][2] Packaged in the Atlas 350 card with 112 GB of Huawei's in-house HiBL 1.0 high-bandwidth memory, the chip is the first Chinese AI accelerator to support FP4 low-precision inference at scale.[1] The real breakthrough, however, is software: CANN Next introduces a CUDA-compatible programming model that has convinced ByteDance and Alibaba to plan large orders, with Huawei targeting 750,000 units shipped in 2026.[3][4]
What You'll Learn
- Full hardware specifications of the Ascend 950PR and Atlas 350 accelerator card
- How the 950PR compares to NVIDIA's H20 and where each chip excels
- What CANN Next is and why CUDA compatibility changes the game for Huawei
- Why ByteDance and Alibaba are placing orders after years of hesitation
- Pricing, production timeline, and what this means for the global AI chip market
- The US export control context that shaped this moment
Why the Ascend 950PR Matters
For years, Huawei's AI chips had a hardware-good-enough, software-not-ready-enough problem. The Ascend 910 series delivered respectable compute numbers, but enterprises struggled to migrate CUDA-based workloads to Huawei's ecosystem. The result was that even Chinese companies — despite government pressure to adopt domestic silicon — kept buying NVIDIA's export-compliant H20 GPUs because the software switching cost was too high.
The Ascend 950PR changes this equation on both fronts. On hardware, it delivers roughly three times the inference compute of the H20 at the precision levels that matter most for large language model serving. On software, the new CANN Next stack introduces CUDA-compatible programming abstractions — thread blocks, warps, kernel launches — that let developers port existing CUDA code with minimal rewrites rather than starting from scratch.[5]
The timing is significant. US export controls have restricted China's access to cutting-edge NVIDIA hardware since October 2022, with the H20 serving as NVIDIA's compliance-friendly offering for the Chinese market. The 950PR represents Huawei's most credible attempt yet to make that compromise unnecessary.[6]
Hardware Specifications: What's Inside the Atlas 350
The Ascend 950PR is the compute die at the heart of the Atlas 350 accelerator card. Here is how the specifications break down.
Compute Performance
| Precision | Performance | vs. NVIDIA H20 |
|---|---|---|
| FP4 | 1.56 PFLOPS | ~5.3x vs. H20 FP8 (H20 lacks native FP4) |
| FP8 | 1 PFLOPS (1,000 TFLOPS) | ~3.4x (H20: 296 TFLOPS FP8) |
| FP16 | Not officially disclosed | — |
Huawei's official marketing claims 2.8x the H20's single-card compute.[1] The raw FP8 ratio is closer to 3.4x; the more conservative 2.8x figure likely reflects real-world inference benchmarks rather than peak theoretical throughput — a more honest basis for comparison.
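For readers who want to check the arithmetic, the table's ratios follow directly from the peak figures cited above:

```python
# Peak throughput figures cited in this article (TFLOPS).
ascend_fp8 = 1000.0   # Ascend 950PR at FP8
ascend_fp4 = 1560.0   # Ascend 950PR at FP4
h20_fp8 = 296.0       # NVIDIA H20 at FP8, its fastest dense format

print(f"FP8 ratio:       {ascend_fp8 / h20_fp8:.1f}x")  # ~3.4x
print(f"FP4 vs. H20 FP8: {ascend_fp4 / h20_fp8:.1f}x")  # ~5.3x
```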
Memory and Bandwidth
| Specification | Ascend 950PR (Atlas 350) | NVIDIA H20 |
|---|---|---|
| Memory capacity | 112 GB HBM (HiBL 1.0) | 96 GB HBM3 |
| Memory bandwidth | 1.4 TB/s | 4.0 TB/s |
| Interconnect bandwidth | 2.0 TB/s (LingQu) | 900 GB/s (NVLink) |
The memory bandwidth gap is the most notable trade-off. The H20's 4.0 TB/s memory bandwidth is nearly three times the 950PR's 1.4 TB/s, which matters for memory-bound workloads like long-context LLM inference where the bottleneck is moving key-value cache data rather than raw compute.[1][2] Huawei's 2.0 TB/s LingQu interconnect, however, is more than double the H20's NVLink bandwidth, giving multi-chip configurations an advantage in distributed inference.[1]
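To see why decode-time inference is bandwidth-bound, consider a rough lower bound on per-token latency: each generated token must stream the model weights and the KV cache from memory. The model and cache sizes below are illustrative assumptions, not figures from either vendor:

```python
# Rough lower bound on per-token decode latency for a memory-bound LLM:
# every generated token streams the weights plus the KV cache from HBM.
def decode_latency_ms(weights_gb, kv_cache_gb, bandwidth_tbs):
    bytes_moved = (weights_gb + kv_cache_gb) * 1e9
    return bytes_moved / (bandwidth_tbs * 1e12) * 1e3

# Hypothetical workload: a 70 GB model with a 20 GB long-context KV cache.
weights, kv = 70.0, 20.0
print(f"Ascend 950PR (1.4 TB/s): {decode_latency_ms(weights, kv, 1.4):.1f} ms/token")
print(f"NVIDIA H20   (4.0 TB/s): {decode_latency_ms(weights, kv, 4.0):.1f} ms/token")
```

On this simplified model, the H20's bandwidth advantage translates almost directly into lower per-token latency, which is why the article calls this the 950PR's most significant constraint.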
Power and Fabrication
| Specification | Ascend 950PR | NVIDIA H20 |
|---|---|---|
| TDP | 600 W | 400 W |
| Process node | SMIC 7nm (N+2) | TSMC 4nm |
| HBM type | HiBL 1.0 (Huawei in-house) | HBM3 (SK Hynix/Samsung) |
The 950PR draws 50% more power than the H20, a direct consequence of the 7nm versus 4nm process gap.[2] Huawei compensates with higher absolute compute, but data center operators need to account for the thermal and power delivery overhead. The HiBL 1.0 memory is Huawei's first in-house high-bandwidth memory, designed to reduce dependence on foreign HBM suppliers — a strategic priority given ongoing supply chain uncertainties.[1]
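The 200 W TDP delta compounds at scale. A back-of-envelope sketch — utilization and electricity price here are assumptions chosen for illustration, not sourced figures:

```python
# Illustrative annual energy cost of the 200 W TDP delta per card.
tdp_delta_w = 600 - 400    # Ascend 950PR vs. NVIDIA H20
utilization = 0.7          # assumed average load factor
price_per_kwh = 0.08       # assumed industrial rate, USD/kWh
hours_per_year = 24 * 365

extra_kwh = tdp_delta_w / 1000 * utilization * hours_per_year
print(f"Extra energy per card-year: {extra_kwh:.0f} kWh "
      f"(~${extra_kwh * price_per_kwh:.0f})")
```

At fleet scale — hundreds of thousands of cards — this overhead, plus the matching cooling load, becomes a material line item in the total cost of ownership comparison.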
CANN Next: The Software Breakthrough That Actually Matters
Raw compute has never been Huawei's primary obstacle. Software compatibility has. CANN Next represents Huawei's answer to the CUDA lock-in that has kept enterprises tied to NVIDIA hardware.
What CANN Next Actually Does
CANN Next introduces a SIMT (Single Instruction, Multiple Threads) programming model that mirrors CUDA's core abstractions. Developers familiar with CUDA will recognize the building blocks: thread blocks for organizing parallel work, warps for hardware-level thread scheduling, and kernel launches for dispatching compute functions to the accelerator.[5]
The key distinction from previous Huawei software efforts is the approach. Rather than building a translation layer that converts CUDA code at runtime (which introduces overhead and compatibility gaps), CANN Next provides near-drop-in replacements for CUDA's constructs. Huawei treats CUDA as a de facto language standard while mapping operations to the Ascend hardware's native capabilities.[5]
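Huawei has not published CANN Next code samples, but the SIMT abstractions it reportedly adopts can be sketched in plain Python: a kernel is a single function that every thread executes, locating its own slice of the work from its block and thread indices. This is a conceptual model of the CUDA-style launch semantics, not CANN Next's actual API:

```python
# Conceptual model of a SIMT kernel launch: every (block, thread) pair runs
# the same kernel body, deriving a global index from blockIdx/threadIdx —
# the abstraction CUDA defines and that CANN Next reportedly mirrors.
def launch(kernel, grid_dim, block_dim, *args):
    for block_idx in range(grid_dim):          # blocks in the grid
        for thread_idx in range(block_dim):    # threads in each block
            kernel(block_idx, thread_idx, block_dim, *args)

def vector_add(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx     # global thread id
    if i < len(out):                           # bounds guard, as in CUDA
        out[i] = a[i] + b[i]

a, b = list(range(8)), [10] * 8
out = [0] * 8
launch(vector_add, 2, 4, a, b, out)            # grid of 2 blocks x 4 threads
print(out)  # [10, 11, 12, 13, 14, 15, 16, 17]
```

The point of the abstraction is that code written against this mental model — index arithmetic, bounds guards, block-level decomposition — carries over between hardware targets, which is exactly the porting story CANN Next is selling.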
Why This Changed ByteDance's Mind
ByteDance and Alibaba had tested previous Ascend chips and found the software migration cost prohibitive. With CANN Next, sources told Reuters that the companies are "much happier now that the chip is more compatible with Nvidia's CUDA software system and has better response speeds."[3] The practical implication is that teams with existing CUDA inference pipelines can port workloads without rewriting their entire software stack — a prerequisite that previous Ascend generations could not meet.
Customer Adoption and Production Plans
ByteDance and Alibaba Orders
Reuters reported on March 27, 2026 that ByteDance and Alibaba plan to place orders for the Ascend 950PR after customer testing went well.[3] These are not speculative interest signals — both companies received samples in January 2026 and have been running production-grade inference benchmarks for approximately two months.[4]
This matters because ByteDance (which operates TikTok and Douyin) and Alibaba (which runs one of the largest cloud platforms in Asia) represent exactly the scale of customer that validates a new chip platform. If either company deploys 950PR clusters at scale, it establishes a commercial installed base that future Ascend chips can build on — something Huawei has not achieved before in the private sector.[4]
Production Timeline and Pricing
| Detail | Value |
|---|---|
| Mass production start | Q2 2026 |
| 2026 shipment target | 750,000 units |
| DDR memory version | — |
| HBM version (HiBL 1.0) | — |
| Full shipment ramp | H2 2026 |
The pricing positions the Atlas 350 competitively against the H20, which has sold in China at varying prices depending on supply constraints and intermediary markups. The two-tier pricing — with a cheaper DDR version for cost-sensitive deployments and a premium HBM version for bandwidth-hungry workloads — suggests Huawei is targeting both inference farms and recommendation systems.[4]
The Export Control Context
The Ascend 950PR arrives in a complex geopolitical environment. US export controls, first imposed in October 2022 and tightened in October 2023, restricted China's access to NVIDIA's most powerful chips. The H20 was specifically designed by NVIDIA to comply with these restrictions while still offering meaningful AI compute for the Chinese market.[6]
In a shift, the Trump administration announced on December 8, 2025 that it would approve export licenses for NVIDIA's H200 chips to China, subject to a 25% surcharge and volume caps limiting exports to 50% of US domestic sales.[7] Reuters reported in late January 2026 that China approved its first major batch of H200 imports, clearing ByteDance, Alibaba, and Tencent to purchase more than 400,000 units collectively.[8] NVIDIA CEO Jensen Huang initially noted in January that purchase orders had not been placed, but by mid-March 2026 confirmed that NVIDIA had received orders from Chinese customers and was restarting manufacturing.[9] This means the 950PR is not entering a market starved of alternatives — it is competing against both the incumbent H20 and the newly available H200.
A Council on Foreign Relations report noted that Huawei would still produce only about 5% of NVIDIA's aggregate AI computing power in 2025, declining to an estimated 4% in 2026.[6] The 950PR does not close this gap in absolute terms, but it does not need to. If it captures a meaningful share of China's domestic inference market — the fastest-growing segment of AI compute demand — it builds the commercial foundation Huawei needs for its more ambitious next-generation chips.
Where the 950PR Excels and Where It Doesn't
Strengths
The 950PR is purpose-built for inference and recommendation workloads. Its FP4 support — a first among Chinese AI accelerators — is particularly relevant as the industry moves toward lower-precision inference for cost efficiency. The 1 PFLOPS FP8 throughput means a single Atlas 350 card can serve large language models at a scale that would require multiple H20 cards, reducing per-query infrastructure cost.
The LingQu interconnect at 2.0 TB/s bandwidth is another advantage. For distributed inference across multiple cards — increasingly common with models that exceed single-card memory — the 950PR's interconnect outpaces the H20's NVLink by a factor of 2.2x.
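A rough sense of what that interconnect gap means for collective operations: a ring all-reduce moves about 2(N−1)/N times the payload over each link, so completion time scales inversely with link bandwidth. The payload size below is an illustrative assumption:

```python
# Bandwidth-bound lower bound for a ring all-reduce across N cards:
# each card sends and receives 2*(N-1)/N times the payload over its link.
def allreduce_ms(payload_gb, n_cards, link_tbs):
    volume = 2 * (n_cards - 1) / n_cards * payload_gb * 1e9
    return volume / (link_tbs * 1e12) * 1e3

# Hypothetical case: syncing 4 GB of tensors across an 8-card node.
size, n = 4.0, 8
print(f"LingQu (2.0 TB/s): {allreduce_ms(size, n, 2.0):.1f} ms")
print(f"NVLink (0.9 TB/s): {allreduce_ms(size, n, 0.9):.1f} ms")
```

For tensor-parallel inference, where a collective sits on the critical path of every forward pass, this per-step saving recurs on every token generated.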
Limitations
Memory bandwidth remains the 950PR's most significant constraint. At 1.4 TB/s versus the H20's 4.0 TB/s, memory-bound operations — particularly long-context LLM serving where the key-value cache dominates latency — will favor the H20. This is not a minor gap; it means certain workload profiles will run faster on an H20 despite having less raw compute.
The 7nm process also means the 950PR consumes 50% more power per card. For data centers operating at scale, the total cost of ownership calculation must factor in power and cooling overhead, not just chip price.
Finally, CANN Next is new. While ByteDance and Alibaba have validated it in testing, the software ecosystem is still maturing. The breadth of third-party library support, debugging tools, and community resources available for CUDA dwarfs what CANN Next offers today. This gap will narrow over time, but early adopters should expect more friction than a pure NVIDIA deployment.
What This Means for the Global AI Chip Market
The Ascend 950PR is not going to displace NVIDIA globally. NVIDIA's ecosystem advantages — CUDA's 19-year head start, Blackwell and Hopper architectures, and a software stack that virtually every AI framework is optimized for — remain formidable outside China.
But within China, the 950PR represents a credible domestic alternative for the first time. The combination of competitive inference performance, CUDA-compatible software, and the backing of China's two largest tech companies creates a flywheel: more adoption drives more software optimization, which drives more adoption.
For the broader industry, this development accelerates the bifurcation of the AI hardware market into US-aligned and China-aligned ecosystems. While Western tech giants race to build custom AI chips to reduce NVIDIA dependence, Huawei is pursuing the same goal from the other side of the geopolitical divide. Companies building global AI infrastructure will increasingly need to plan for a world where the dominant accelerator differs by geography — NVIDIA in most markets, Ascend in China.
References
1. TrendForce, "Huawei Debuts Atlas 350 on Ascend 950PR with In-house HBM, Touting 2.8X H20 Performance," March 23, 2026.
2. Tom's Hardware, "Huawei unveils new Atlas 350 AI accelerator with 1.56 PFLOPS of FP4 compute and up to 112GB of HBM," March 2026.
3. CNBC/Reuters, "Huawei's new AI chip finds favour with ByteDance, Alibaba which plan to place orders," March 27, 2026.
4. Technetbook, "Huawei Ascend 950PR AI Chip Secures Major Orders from ByteDance and Alibaba," March 2026.
5. WCCFTech, "Huawei's Ascend 950PR AI Chip Just Won Over Chinese Customers By Mimicking CUDA Through CANN Next," March 2026.
6. Council on Foreign Relations, "China's AI Chip Deficit: Why Huawei Can't Catch Nvidia and U.S. Export Controls Should Remain," 2026.
7. CNN, "Trump greenlights exports of Nvidia H200 chips to China," December 8, 2025.
8. Reuters/Sherwood News, "China has approved the sale of 400,000 H200 chips to Chinese tech firms," January 2026.
9. CNBC, "Jensen Huang says Nvidia has received orders from China and is restarting manufacturing," March 17, 2026.