NVLink 5.0 (B200/GB200) delivers 1,800 GB/s total bandwidth per GPU — 2× NVLink 4.0's 900 GB/s (H100/H200). NVLink 4.0 alone provides ~14× the bandwidth of PCIe 5.0 x16, making NVLink essential for tensor-parallel inference across multiple GPUs.
| Generation | BW per Link (Bidir) | Links/GPU | Total GPU BW | First Available | GPUs Supported |
|---|---|---|---|---|---|
| NVLink 1.0 | 40 GB/s | 4 | 160 GB/s | 2016 | P100 |
| NVLink 2.0 | 50 GB/s | 6 | 300 GB/s | 2017 | V100 |
| NVLink 3.0 | 50 GB/s | 12 | 600 GB/s | 2020 | A100 |
| NVLink 4.0 | 50 GB/s | 18 | 900 GB/s | 2022 | H100, H200 |
| NVLink 5.0 | 100 GB/s | 18 | 1,800 GB/s | 2024 | B100, B200, GB200 |
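Each generation's per-GPU total follows directly from per-link bandwidth times link count. A minimal sketch using the figures from the table above:

```python
# Per-GPU NVLink totals: per-link bidirectional bandwidth (GB/s) x link count.
# Figures are the ones listed in the table above.
NVLINK_GENERATIONS = {
    "NVLink 1.0": (40, 4),
    "NVLink 2.0": (50, 6),
    "NVLink 3.0": (50, 12),
    "NVLink 4.0": (50, 18),
    "NVLink 5.0": (100, 18),
}

def total_gpu_bandwidth(generation: str) -> int:
    """Total per-GPU NVLink bandwidth in GB/s."""
    per_link_gbs, links = NVLINK_GENERATIONS[generation]
    return per_link_gbs * links

for gen in NVLINK_GENERATIONS:
    print(f"{gen}: {total_gpu_bandwidth(gen)} GB/s")
```

Note that NVLink 5.0 doubles the per-link rate (50 → 100 GB/s) while keeping the same 18-link count as NVLink 4.0, which is exactly where the 2× generational jump comes from.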

NVSwitch Generations

| Generation | Ports | Per-Port BW | Total Switch BW | GPU Topology |
|---|---|---|---|---|
| NVSwitch 1.0 | 18 | 50 GB/s | 900 GB/s | DGX-2 (16× V100) |
| NVSwitch 2.0 | 36 | 50 GB/s | 1.8 TB/s | DGX A100 (8× A100) |
| NVSwitch 3.0 | 64 | 50 GB/s | 3.2 TB/s | DGX H100 (8× H100) |
| NVSwitch 4.0 | 64 | 100 GB/s | 6.4 TB/s | DGX B200, GB200 NVL72 |

InfiniBand: What Speed Is Used for H100 and B200 Clusters?

H100/H200 clusters use NDR 400G InfiniBand (96 GB/s effective bandwidth, NVIDIA ConnectX-7). B200/GB200 clusters use XDR 800G InfiniBand (192 GB/s effective bandwidth, ConnectX-8). Older A100 clusters use HDR 200G.

| Standard | Per-Lane Rate | Port Rate (4× lanes) | Effective BW (Bidir) | Year | Used For |
|---|---|---|---|---|---|
| HDR | 50 Gb/s | 200 Gb/s | 48 GB/s | 2018 | A100 clusters |
| NDR | 100 Gb/s | 400 Gb/s | 96 GB/s | 2022 | H100/H200 clusters |
| XDR | 200 Gb/s | 800 Gb/s | 192 GB/s | 2025 | B200/GB200 clusters |
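The "effective" column can be approximated from the raw port rate. The ~96% efficiency factor below is an assumption chosen to match the table's figures; it stands in for protocol and encoding overhead:

```python
# Rough conversion from an InfiniBand 4x port rate (Gb/s) to the effective
# bidirectional bandwidth figures used in the table above.
# PROTOCOL_EFFICIENCY is an assumed overhead factor, not an official number.
PROTOCOL_EFFICIENCY = 0.96

def effective_bidir_gb_s(port_rate_gbit: float) -> float:
    """4x port rate in Gb/s -> approximate effective bidirectional GB/s."""
    one_direction_gb_s = port_rate_gbit / 8   # bits to bytes
    bidirectional = 2 * one_direction_gb_s    # both directions
    return bidirectional * PROTOCOL_EFFICIENCY

for name, rate in [("HDR", 200), ("NDR", 400), ("XDR", 800)]:
    print(f"{name} {rate}G: ~{effective_bidir_gb_s(rate):.0f} GB/s effective")
```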

NVIDIA InfiniBand NICs and Switches

| Product | Generation | Bandwidth | Use Case |
|---|---|---|---|
| ConnectX-6 | HDR 200G | 200–400 Gb/s | A100 clusters |
| ConnectX-7 | NDR 400G | 400–800 Gb/s | H100/H200 clusters |
| ConnectX-8 | XDR 800G | 800–1600 Gb/s | B200/GB200 clusters |
| Quantum-2 (QM9700) | NDR | 51.2 Tb/s switch | Spine/leaf switches |
| Quantum-X800 | XDR | 102.4 Tb/s switch | Next-gen fabric |

PCIe Bandwidth by Generation

| Generation | x16 Bandwidth | Per Lane | Available | GPU Examples |
|---|---|---|---|---|
| PCIe 3.0 | 16 GB/s | 1 GB/s | 2010 | P100, V100 |
| PCIe 4.0 | 32 GB/s | 2 GB/s | 2017 | A100 |
| PCIe 5.0 | 64 GB/s | 4 GB/s | 2021 | H100 PCIe, L40S |
| PCIe 6.0 | 128 GB/s | 8 GB/s | 2024 | B200 |

PCIe bandwidth is far below NVLink for multi-GPU communication: PCIe 5.0 x16 provides 64 GB/s vs NVLink 4.0's 900 GB/s, a ~14× difference. Prefer NVLink-connected topologies for multi-GPU workloads.
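The multipliers quoted in this article fall out of simple division over the figures above:

```python
# Intra-node interconnect comparison, using this article's figures (GB/s).
PCIE5_X16 = 64    # PCIe 5.0 x16
NVLINK4 = 900     # NVLink 4.0 total per GPU (H100)
NVLINK5 = 1800    # NVLink 5.0 total per GPU (B200)

print(f"NVLink 4.0 is ~{NVLINK4 / PCIE5_X16:.0f}x PCIe 5.0 x16")  # ~14x
print(f"NVLink 5.0 is ~{NVLINK5 / PCIE5_X16:.0f}x PCIe 5.0 x16")  # ~28x
```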

AI Cluster Networking Topologies

| Cluster Size | Intra-Node | Inter-Node Fabric | Topology | Total Bisection BW |
|---|---|---|---|---|
| 8 GPUs (1 node) | NVLink + NVSwitch | N/A | Fully connected | 7.2 TB/s (H100) |
| 64 GPUs (8 nodes) | NVLink + NVSwitch | 400G IB NDR | Fat-tree | ~25.6 TB/s |
| 256 GPUs (32 nodes) | NVLink + NVSwitch | 400G IB NDR | 2-tier fat-tree | ~51.2 TB/s |
| 1,024 GPUs (128 nodes) | NVLink + NVSwitch | 400G IB NDR | 3-tier fat-tree | ~204.8 TB/s |
| 4,096 GPUs (512 nodes) | NVLink + NVSwitch | 800G IB XDR | 3-tier fat-tree | ~819 TB/s |

DGX SuperPOD Configurations

| Configuration | GPUs | Nodes | Network Fabric | Compute |
|---|---|---|---|---|
| DGX H100 SuperPOD | 256 | 32 | NDR 400G IB | ~1 EFLOPS FP8 |
| DGX B200 SuperPOD | 576 | 72 | XDR 800G IB | ~1.4 EFLOPS FP4 |
| GB200 NVL72 | 72 B200 + 36 Grace | 36 superchips | NVLink 5.0 + XDR IB | ~720 PFLOPS FP4 |

Frequently Asked Questions

What is NVLink 5.0 bandwidth compared to NVLink 4.0?

NVLink 5.0 (B200/GB200) delivers 1,800 GB/s total bidirectional bandwidth per GPU — double NVLink 4.0's 900 GB/s (H100/H200). This is achieved by doubling per-link bandwidth from 50 to 100 GB/s while maintaining 18 links per GPU. NVSwitch 4.0 provides 6.4 TB/s total switch bandwidth vs NVSwitch 3.0's 3.2 TB/s.

What InfiniBand should I use for an H100 cluster?

H100/H200 clusters should use NDR InfiniBand 400G (NVIDIA ConnectX-7 NICs, Quantum-2 switches). NDR provides 96 GB/s effective bidirectional bandwidth per node and is the standard for H100-era clusters. A 256-GPU DGX H100 SuperPOD with NDR provides ~51.2 TB/s total bisection bandwidth.

When should I use InfiniBand vs RoCE for AI clusters?

Use InfiniBand for large-scale training clusters (64+ GPUs) where performance is critical — it provides native RDMA, deterministic latency, and better congestion control. Use RoCE (RDMA over Converged Ethernet) for cost-sensitive deployments or when integrating with existing Ethernet infrastructure. RoCE at 400 GbE is now competitive for H100-era clusters. Intel Gaudi 3 uses RoCE natively via 24× 200GbE ports.

What is the bandwidth hierarchy in an AI cluster?

AI cluster bandwidth hierarchy (fastest to slowest): HBM3e memory (H200): 4.8 TB/s → NVLink 4.0: 900 GB/s (~19% of HBM) → PCIe 5.0 x16: 64 GB/s (1.3% of HBM) → NDR InfiniBand: ~50 GB/s per direction (~1% of HBM) → 100 GbE: 12.5 GB/s (0.26% of HBM). This hierarchy explains why model-parallelism strategies must match communication patterns to the bandwidth available at each tier.
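The percentages in the hierarchy above are each tier's bandwidth divided by HBM3e bandwidth. A quick sketch using this article's figures (the NDR entry is the per-direction number):

```python
# Each interconnect tier as a fraction of H200 HBM3e bandwidth (GB/s),
# reproducing the percentages quoted in the hierarchy above.
HBM3E = 4800
TIERS = {
    "NVLink 4.0": 900,
    "PCIe 5.0 x16": 64,
    "NDR InfiniBand (per direction)": 50,
    "100 GbE": 12.5,
}

for name, bw in TIERS.items():
    print(f"{name}: {bw} GB/s ({bw / HBM3E:.2%} of HBM)")
```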