hardware · 2026-04-21 · Tier 1

SemiAnalysis: GPU Cluster Economics and the Goodput Reckoning

Date: 2026-04-21
Source: SemiAnalysis Newsletter
Coverage: How Much Do GPU Clusters Really Cost
Raw: (parallel daily digest 2026-04-21)


TL;DR

SemiAnalysis published a comprehensive GPU cluster TCO framework based on testing 80+ neoclouds and interviewing 150+ customers. The central finding: two cloud offerings with identical GPU-hour pricing can differ widely in useful work delivered, with goodput expense ranging from roughly 6% to 21% of total cost, because downtime, debugging, fault recovery, and storage performance never appear on the monthly bill. "Goodput" (useful GPU work per dollar) is the metric that resolves this. At the same nominal price, gold-tier neoclouds (Nebius, Fluidstack, Crusoe) come out 5–15% cheaper than silver-tier providers for large pretraining jobs; the gap collapses to near zero for single-node inference.


The Goodput Framework

Eight-component TCO decomposition:

GPU cluster total cost of ownership:
  ┌────────────────────────────────────────────────────┐
  │  On-bill (visible):                                │
  │    GPU cost                                        │
  │    Storage cost                                    │
  │    Networking cost                                 │
  │    Control plane cost                              │
  │    Support cost                                    │
  │                                                    │
  │  Off-bill (hidden):                                │
  │    Goodput Expense  ← wasted compute from failures │
  │    Setup Expense    ← time to configure cluster    │
  │    Debugging Expense← time diagnosing issues       │
  └────────────────────────────────────────────────────┘

The three off-bill components are the core insight. They are invisible to the buyer until they accumulate into a real cost: a single GPU failure in a 5,184-GPU cluster training job can waste 10–15 minutes of full-cluster time for re-initialization plus all compute since the last checkpoint.
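The arithmetic behind that claim can be sketched in a few lines. This is a back-of-the-envelope illustration, not the SemiAnalysis model: the function name `wasted_gpu_hours` and the 30 minutes since the last checkpoint are assumptions made for the example, while the cluster size and the 10–15 minute re-initialization range come from the paragraph above.

```python
def wasted_gpu_hours(num_gpus: int,
                     reinit_minutes: float,
                     minutes_since_checkpoint: float) -> float:
    """GPU-hours lost to one failure in a synchronous training job.

    Every GPU idles during re-initialization, and the work done since the
    last checkpoint has to be recomputed, so both terms scale with the
    full cluster size.
    """
    stalled_minutes = reinit_minutes + minutes_since_checkpoint
    return num_gpus * stalled_minutes / 60.0


# Cluster size and the 10-15 minute re-init range come from the text;
# the 30 minutes since the last checkpoint is a hypothetical value.
reinit_only_low = wasted_gpu_hours(5184, 10, 0)    # 864.0 GPU-hours
reinit_only_high = wasted_gpu_hours(5184, 15, 0)   # 1296.0 GPU-hours
with_replay = wasted_gpu_hours(5184, 15, 30)       # 3888.0 GPU-hours
```

Multiplying any of these figures by the GPU-hour price gives the off-bill cost of a single failure, which is why checkpoint frequency and re-initialization time dominate the goodput term at this scale.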

Grand Unifying Theory of Goodput — three recovery scenarios:

  Scenario         Spare node             Recovery time  Downtime cost
  Checkpoint-cold  Needs repair           Hours to days  Very high
  Checkpoint-hot   Available immediately  Minutes        Medium
  Fault-tolerant   Job continues          Seconds        Low
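To make the three scenarios comparable, one can convert each into a fraction of wall-clock time spent on useful work. The sketch below does that under loud assumptions: the stall and replay durations are hypothetical values picked to match the orders of magnitude in the table (hours, minutes, seconds), and the failure rate of one per day is also invented for illustration.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class RecoveryScenario:
    name: str
    stall_minutes: float   # full-cluster stall per failure (hypothetical)
    replay_minutes: float  # work recomputed since last checkpoint (hypothetical)


def goodput_fraction(scenario: RecoveryScenario, failures_per_day: float) -> float:
    """Share of wall-clock time spent on useful work, floored at zero."""
    lost = failures_per_day * (scenario.stall_minutes + scenario.replay_minutes)
    return max(0.0, 1.0 - lost / MINUTES_PER_DAY)


# Illustrative values only; none of these durations come from the source.
SCENARIOS = [
    RecoveryScenario("checkpoint-cold", stall_minutes=240.0, replay_minutes=15.0),
    RecoveryScenario("checkpoint-hot", stall_minutes=15.0, replay_minutes=15.0),
    RecoveryScenario("fault-tolerant", stall_minutes=0.5, replay_minutes=0.0),
]

fractions = {s.name: goodput_fraction(s, failures_per_day=1.0) for s in SCENARIOS}
```

The ordering of the results, not the exact values, is the point: the same failure rate costs very different amounts depending on which recovery path the cluster supports.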

Fault-tolerance framework comparison:

  Framework                              Status       Performance overhead                   Constraints
  TorchFT                                Open source  10%+ (GLOO comms)                      None (fully open)
  AWS SageMaker HyperPod Checkpointless  Proprietary  5% memory overhead, 1min 45s recovery  Kubernetes + NeMo only
  TorchPass (Clockwork.io)               Licensed     0%                                     Requires idle spare nodes at 0.62% of cluster

Key Numbers

Goodput expense as % of total for large pretraining (5,184 GB300 NVL72 cluster):

  • Gold tier + TorchPass: 6.14%
  • Silver tier + checkpoint restart: 20.91%

Cost difference holding GPU-hour price constant:

  • Large training workloads: gold tier 5–15% cheaper than silver tier
  • Single-node inference workloads: gap collapses to near zero
  • Hyperscaler premium at equal GPU pricing: ~10% (driven by support costs and EFA tuning overhead)
  • 2,048 B200 multimodal RL research scenario: 61% hyperscaler premium at real-world pricing
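A rough way to see how a goodput expense share turns into a cost gap is sketched below. It assumes, purely for illustration, that both tiers have the same on-bill spend and that goodput expense is the only off-bill item; the report's full eight-component model is not reproduced here, and `total_cost_multiplier` is a hypothetical helper name.

```python
def total_cost_multiplier(goodput_expense_share: float) -> float:
    """Total cost per on-bill dollar when goodput expense is the only
    off-bill item and makes up the given share of total cost."""
    if not 0.0 <= goodput_expense_share < 1.0:
        raise ValueError("share must be in [0, 1)")
    return 1.0 / (1.0 - goodput_expense_share)


# Goodput expense shares for the 5,184-GPU pretraining case listed above.
gold = total_cost_multiplier(0.0614)     # gold tier + TorchPass
silver = total_cost_multiplier(0.2091)   # silver tier + checkpoint restart

# Fraction by which gold is cheaper than silver under these assumptions.
saving = 1.0 - gold / silver             # about 0.157
```

Under these simplifying assumptions the gap comes out near the top of the 5–15% range quoted above; the published range presumably reflects the other cost components as well.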

ClusterMAX 2.1 Ratings (New)

New ratings added to SemiAnalysis's ClusterMAX provider scorecard:

  • Core42 (UAE/US, MI300X, Broadcom Thor-II NICs) — gold tier
  • BitDeer (Malaysia, GB200 NVL72) — gold tier
  • FPT Smart Cloud (Vietnam) — strong monitoring, but PKey/SAKey misconfiguration (security issue)
  • Radiant/Ori (London/Dallas) — similar security issues, no automated health checks

Relation to Prior Wiki Pages

Fills a gap in the hardware concept page. The GPU kernels page covers optimization at the software level (kernel fusion, memory access patterns). This article covers optimization at the economic level: the same GPU-hours have different productive value depending on the cluster's fault tolerance and reliability infrastructure.

Connects to AccelOpt (04-20): AccelOpt optimized GPU kernel throughput at the software level (+12pp throughput). The SemiAnalysis framework is the business-level complement: a 6–21% goodput cost from reliability alone can outweigh the economic gains from kernel optimization.

Extends the "selective compute convergence" pattern (04-16 through 04-21): TIP, LongAct, STOP, AVR, and W-RAC all discard wasted computation at different stack levels (token, gradient, path, format, pipeline). SemiAnalysis adds the cluster-operations level: not all GPU-hours are equal, and the ones lost to fault recovery are structurally different from GPU-hours on useful work.


Why It Matters (Tier 1 Assessment)

The goodput framework quantifies what the industry has known intuitively but failed to price: reliability is a cost multiplier. The most practically important finding is the asymmetry between workloads. For large pretraining jobs, where cluster synchrony matters, a single failure compounds across the whole cluster. For inference workloads, where requests are independent, failures stay local. This is why the same provider can offer excellent value for inference at a price point that makes it unsuitable for training.

The fault-tolerance tooling gap is the most actionable finding. TorchFT is the only open-source option and carries meaningful overhead. There is no open, zero-overhead fault-tolerant training framework. This is a concrete infrastructure gap.


Research Angle (Tier 1)

Open problem 1: Goodput-aware model routing. If fault-tolerance overhead varies by 3–15x across cluster tiers, and inference routing already considers latency and cost, there should be a unified routing framework that factors in expected goodput loss per provider and workload type. This is a direct Tier 1 intersection: GPU goodput economics + routing.
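One minimal form such a router could take is sketched below: rank offers by price per useful GPU-hour rather than nominal price. Everything here is hypothetical, including the `Offer` type, the provider names, the prices, and the loss fractions (chosen only to echo the roughly 6% and 21% figures from Key Numbers); no such framework is described in the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Offer:
    provider: str                    # hypothetical provider label
    price_per_gpu_hour: float        # nominal on-bill price (made up)
    goodput_loss: Dict[str, float]   # workload type -> expected fraction lost


def effective_price(offer: Offer, workload: str) -> float:
    """Price per *useful* GPU-hour for the given workload type."""
    loss = offer.goodput_loss[workload]
    return offer.price_per_gpu_hour / (1.0 - loss)


def route(offers: List[Offer], workload: str) -> Offer:
    """Pick the offer with the lowest effective price for this workload."""
    return min(offers, key=lambda o: effective_price(o, workload))


# Illustrative numbers only: the silver offer is nominally cheaper.
offers = [
    Offer("gold-example", 3.00, {"pretraining": 0.06, "inference": 0.01}),
    Offer("silver-example", 2.70, {"pretraining": 0.21, "inference": 0.01}),
]

best_for_training = route(offers, "pretraining").provider   # "gold-example"
best_for_inference = route(offers, "inference").provider    # "silver-example"
```

With these made-up inputs the nominally cheaper offer wins for inference but loses for pretraining, which is the workload asymmetry the framework highlights. A real router would also need latency, capacity, and uncertainty in the loss estimates.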

Open problem 2: Open-source zero-overhead fault tolerance. TorchFT's 10%+ overhead comes from GLOO all-reduce communications during checkpoint state replication. Can this be reduced to <2% by using NCCL or hardware-level checkpointing? The theoretical floor should be much lower than current implementations.

Open problem 3: Goodput-aware training curricula. If some training steps are more critical than others (e.g., loss spikes that require checkpoint rollback are often concentrated in certain phases), could a training scheduler de-risk those phases by increasing checkpoint frequency? This would change the goodput calculation from a static average to a dynamic, risk-adjusted metric.
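A starting point for the scheduler idea is Young's classic approximation for the checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures). The sketch below applies it per phase by shrinking the effective MTBF in riskier phases. The `risk_multiplier` parameter and all numeric values are assumptions introduced here for illustration; the source does not propose this method.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval (seconds) that
    balances checkpoint overhead against expected recomputation."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def phase_interval(checkpoint_cost_s: float, mtbf_s: float,
                   risk_multiplier: float) -> float:
    """Interval for a phase where rollbacks are `risk_multiplier` times
    more likely, modeled (as an assumption) as a shorter effective MTBF."""
    if risk_multiplier <= 0:
        raise ValueError("risk_multiplier must be positive")
    return young_interval(checkpoint_cost_s, mtbf_s / risk_multiplier)


# Hypothetical inputs: 60 s to write a checkpoint, one failure every 8 hours.
baseline = young_interval(60.0, 8 * 3600.0)      # about 1859 s (~31 min)
risky = phase_interval(60.0, 8 * 3600.0, 4.0)    # half the baseline interval
```

Because the interval scales with the square root of MTBF, a phase that is four times riskier only needs checkpoints twice as often, which keeps the added checkpoint overhead modest.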


Related Pages