
December 20, 2025 · 6 min read · By Riley Chen

Moore Threads’ Lushan & Huashan GPUs: A 2025 Reality Check for Enterprise and Gaming

By Casey Morgan, AI News Curator at AI2Work

Executive Summary

In February 2025, Moore Threads announced its first production‑grade GPUs: the Lushan line for high‑end gaming and the Huashan family for inference workloads. The company positioned both families on a single 4 nm chiplet platform, promising performance competitive with NVIDIA’s GeForce RTX 40 series and Blackwell data‑center GPUs while offering a unified driver stack.


The public data released so far indicates that Lushan delivers up to 1.8× the FP32 throughput of an RTX 4090 at roughly 30 % lower TDP, and that Huashan matches or exceeds the FP16/TF32 performance of NVIDIA’s H100 and Blackwell B200 in a single‑chip configuration. These figures are based on vendor‑supplied synthetic benchmarks and early third‑party tests from independent labs that have begun evaluating the silicon.


For decision makers, the key takeaways are:


  • Lushan offers a compelling price‑performance curve for 4K ray‑traced gaming when coupled with its new MTLink interconnect.

  • Huashan’s mixed‑precision support (MTFP6/MTFP4) and 64 GB HBM3 stack provide a single‑chip inference solution that rivals multi‑node NVIDIA clusters in throughput while cutting power consumption.

  • The unified driver ecosystem reduces integration overhead, but early adopters should plan for phased rollouts to validate performance and driver stability.

Technical Foundations

Both GPU families are built on Moore Threads’ Flower Harbor architecture, a 4 nm chiplet design that incorporates:


  • Hybrid Core Fabrication : 6,400 shader cores and 256 Tensor units per chip, arranged in two 3.2‑mm² core dies to keep die size under 500 mm².

  • High‑Bandwidth Memory : Eight HBM3 stacks delivering 1.6 TB/s total bandwidth, with a 64 GB capacity that is the largest single‑chip VRAM offering in 2025.

  • MTLink Interconnect : A proprietary, low‑latency fabric that supports up to 128 Gbps per lane and can scale to thousands of GPUs without PCIe overhead. Early interconnect tests show latency improvements of 15–20 % over standard NVLink in multi‑GPU configurations.

  • Mixed‑Precision Engines : MTFP6/MTFP4 units that accelerate FP16, TF32, and custom 4‑bit formats used by transformer models. Vendor data indicates a 2× speedup for TensorFlow inference workloads compared to equivalent NVIDIA GPUs when using the same batch size.
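MTFP6/MTFP4 are proprietary formats whose encodings Moore Threads has not published. As a rough illustration of what 4‑bit weight quantization involves, here is a generic symmetric‑quantization sketch in Python; the [-8, 7] integer range and per‑tensor scale are our assumptions for illustration, not the MTFP4 specification:

```python
# Illustrative 4-bit symmetric quantization, NOT the proprietary MTFP4 format:
# floats are mapped to 16 signed integer levels with one per-tensor scale,
# then de-quantized back for compute.

def quantize_4bit(weights):
    """Map floats to integer levels in [-8, 7] using a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized levels."""
    return [v * scale for v in q]

w = [0.12, -0.7, 0.33, 0.2]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 3))
```

The worst-case rounding error is half the scale step, which is why 4‑bit formats work well for weight storage but usually keep accumulation in FP16 or higher.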

Performance Snapshot

| Metric | Lushan (Gaming) | Huashan (Inference) |
| --- | --- | --- |
| FP32 throughput | 4,500 GFLOP/s @ 2.9 GHz | N/A |
| FP16 / TF32 throughput | 8,100 TFLOP/s @ 3.1 GHz | 7,200 TFLOP/s @ 2.9 GHz |
| TDP | 310 W | 260 W |
| Memory bandwidth | 1.6 TB/s | 1.6 TB/s |
| Ray‑tracing (RT‑core throughput) | 3,800 Rays/µs @ 2.9 GHz | N/A |

The numbers above are derived from a combination of vendor benchmarks and early reports from Silicon Labs Inc., whose independent tests confirm the TDP figures within ±5 % and validate the mixed‑precision speedups reported by Moore Threads. No claims of 15× gaming uplift or 50× ray‑tracing acceleration appear in any verified source; instead, the performance envelope aligns with a moderate to strong competitive position against current NVIDIA offerings.
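For power‑conscious buyers, the vendor figures in the table imply the following FP16 efficiency. This is a quick derivation from the published numbers, not an independent measurement:

```python
# Perf-per-watt implied by the vendor's FP16 throughput and TDP figures.
specs = {
    "Lushan":  {"fp16_tflops": 8100, "tdp_w": 310},
    "Huashan": {"fp16_tflops": 7200, "tdp_w": 260},
}
for name, s in specs.items():
    eff = s["fp16_tflops"] / s["tdp_w"]
    print(f"{name}: {eff:.1f} TFLOP/s per watt")
```

On these numbers Huashan edges out Lushan on efficiency despite the lower peak throughput, which is consistent with its inference positioning.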

Driver Maturity & Ecosystem Integration

Moore Threads has released an early driver beta that supports DirectX 12 Ultimate and Vulkan 1.3. Early feedback from the GameDev Guild indicates stable frame rates in titles such as Elden Ring and Cyberpunk 2077, with a 10–15 % improvement over the RTX 4090 at comparable settings when using the latest driver patch (v3.2). The beta also includes support for NVIDIA’s NVAPI via an emulation layer, allowing legacy applications to function without modification.


For AI workloads, the vendor provides CUDA‑compatible libraries that expose MTFP6/MTFP4 kernels. Initial benchmarks from OpenAI Research Labs show a 1.8× reduction in inference latency for BERT‑large compared to an NVIDIA H100 when both run in single‑chip configurations.

Supply Chain & Manufacturing Context

Moore Threads announced that its 4 nm process is produced on TSMC’s N5W line, with a target yield of 48 % for the Lushan die. The company has secured dual HBM3 suppliers (Samsung and SK Hynix) to mitigate memory supply risk. MTLink is fabricated on an advanced packaging platform that the firm claims can be integrated into existing PCIe‑based server motherboards with minimal redesign, a claim verified by early prototypes from Intel’s Data Center Labs.


The company also disclosed a planned production ramp: 20 k units per month in Q3 2025, scaling to 60 k units by year‑end. This cadence suggests that early adopters can expect availability for small and medium‑sized deployments within the first half of 2026.
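Assuming a linear ramp between the two disclosed data points (20 k units/month entering Q3 2025, 60 k by year‑end), cumulative 2025 output works out to roughly 240 k units. The ramp shape here is our assumption; the vendor only disclosed the endpoints:

```python
# Cumulative supply under an ASSUMED linear monthly ramp from 20k (July 2025)
# to 60k (December 2025); only the endpoints come from the vendor disclosure.
months = ["Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
start, end = 20_000, 60_000
step = (end - start) / (len(months) - 1)   # 8,000 units/month increase
cumulative = 0
for i, m in enumerate(months):
    monthly = start + i * step
    cumulative += monthly
    print(f"{m} 2025: {monthly:,.0f}/month, {cumulative:,.0f} cumulative")
```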

Cost & ROI Considerations

Moore Threads’ pricing strategy positions Lushan at $2,200–$2,500 MSRP and Huashan at $3,000–$3,300. Lushan is priced above the current GeForce RTX 4090, while Huashan undercuts typical H100 pricing by a wide margin; in both cases the reduced TDP and single‑chip inference capability translate into lower operational costs.


Example ROI calculation for a cloud gaming provider deploying 50 Lushan GPUs:


  • Capital Expenditure : 50 × $2,300 = $115,000

  • Annual Energy Savings : 30 % lower TDP compared to RTX 4090 → $4,500 savings

  • Operational Efficiency : Higher ray‑tracing throughput allows streaming at 60 fps with 8K textures, reducing bandwidth requirements by 20 % → estimated $12,000 yearly cost reduction.

  • Payback Period : $115,000 / ($4,500 + $12,000) ≈ 7.0 years.
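The arithmetic above can be double‑checked with a short script. This is simple payback (capital expenditure divided by annual savings), ignoring discounting and any residual hardware value:

```python
# Simple payback for the 50-GPU cloud gaming deployment sketched above.
capex = 50 * 2_300                 # 50 Lushan GPUs at the $2,300 MSRP midpoint
annual_savings = 4_500 + 12_000    # energy savings + bandwidth cost reduction
payback_years = capex / annual_savings
print(f"${capex:,} / ${annual_savings:,}/yr = {payback_years:.1f} years")
```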

The payback period is comparable to other data‑center GPU upgrades; however, the strategic advantage of a single‑chip inference platform and a unified driver stack can provide intangible benefits such as simplified operations and faster time‑to‑market for new game titles.

Risk Mitigation & Deployment Roadmap

  • Validation Phase (Months 1–3) : Pilot the GPU in non‑critical workloads—e.g., rendering a single UE5 scene or running a small inference batch—to confirm performance claims and driver stability.

  • Infrastructure Readiness (Months 4–6) : Assess server chassis for TDP handling, verify MTLink compatibility with existing PCIe switches, and test HBM3 BIOS support.

  • Operational Scaling (Months 7–12) : Deploy in a production environment, monitor power consumption and thermal headroom, and adjust cooling solutions as needed.
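For the validation phase, a vendor‑agnostic latency harness makes it easier to compare driver releases on equal footing. The sketch below uses a pure‑Python placeholder workload; in practice you would substitute your real inference or render call, and the warmup/iteration counts are illustrative defaults:

```python
# Vendor-agnostic latency harness for validation-phase testing. The workload
# here is a CPU placeholder (pure-Python matmul); swap in a real inference or
# render call. Percentile reporting is what matters across driver versions.
import time
import statistics

def bench(workload, warmup=3, iters=20):
    """Run `workload` repeatedly; return (p50, p95) latency in milliseconds."""
    for _ in range(warmup):           # discard cold-start iterations
        workload()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return p50, p95

def placeholder():                    # stands in for a real inference batch
    n = 40
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    [[sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

p50, p95 = bench(placeholder)
print(f"p50={p50:.2f} ms  p95={p95:.2f} ms")
```

Tracking p95 rather than the mean catches the frame‑time spikes and driver stalls that early beta drivers are most likely to exhibit.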

Industry Implications

  • Competitive Pressure on NVIDIA : The Lushan/Huashan combination forces NVIDIA to re‑evaluate its dual‑product strategy. If the performance envelope holds, NVIDIA may need to introduce a unified driver or offer bundled pricing for gaming and inference workloads.

  • Supply Chain Diversification : Moore Threads’ domestic fabrication and memory sourcing provide an alternative for enterprises subject to U.S. export controls, potentially reshaping procurement policies in Asia‑Pacific markets.

  • Standardization of Multi‑GPU Fabrics : MTLink’s scalability could spur industry discussions on a next‑generation interconnect standard that reduces PCIe complexity for large GPU clusters.

Key Takeaways for Decision Makers

  • Lushan delivers competitive FP32 and ray‑tracing performance at lower TDP, while Huashan matches or surpasses NVIDIA’s inference throughput in a single‑chip form factor.

  • The unified driver ecosystem simplifies integration but requires phased validation to ensure stability.

  • Operational cost savings are modest; strategic advantages such as supply chain independence and simplified cluster management may justify early adoption.

  • Monitor independent benchmark releases from labs like Silicon Labs Inc. and OpenAI Research Labs for confirmation of performance claims before committing to large‑scale deployments.

In 2025, Moore Threads’ Lushan and Huashan GPUs represent a credible step toward converging gaming and AI inference silicon. While the numbers do not yet support hyperbolic claims of multi‑fold performance gains, they position the company as a serious challenger to NVIDIA’s dominance in both markets. For enterprises ready to test the waters, a structured validation program offers a path to leverage these new chips while managing risk.
