
Multi-GPU Benchmark: B200 vs H200 vs H100 vs MI300X

Sedat Dogan
updated on Oct 15, 2025

For over two decades, optimizing compute performance has been a cornerstone of my work. We benchmarked NVIDIA’s B200, H200, H100 and AMD’s MI300X to assess how well they scale for Large Language Model (LLM) inference. Using the vLLM framework with the meta-llama/Llama-3.1-8B-Instruct model, we ran tests on 1, 2, 4 and 8 GPUs.

We analyzed throughput and scaling efficiency to illustrate how each GPU architecture handles parallelized, compute-intensive workloads.

Multi-GPU benchmark results

Total throughput vs. GPU count

  • Total throughput (tokens/second): This metric represents the raw processing power of the entire multi-GPU system. It measures the total number of input and output tokens processed per second, making it the most important indicator of maximum performance under a saturated, offline workload.
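As a concrete illustration of how this number is produced, the short sketch below sums input and output tokens across concurrently running instances and divides by the overall wall-clock time. The record layout and the figures in it are hypothetical, not our measured data.

```python
# Hypothetical per-instance results, e.g. parsed from the vLLM benchmark output.
results = [
    {"gpu": 0, "input_tokens": 5_600_000, "output_tokens": 2_400_000, "seconds": 312.0},
    {"gpu": 1, "input_tokens": 5_600_000, "output_tokens": 2_400_000, "seconds": 315.0},
]

# All instances run concurrently, so system-level wall time is set by the slowest one.
wall_time = max(r["seconds"] for r in results)
total_tokens = sum(r["input_tokens"] + r["output_tokens"] for r in results)

total_throughput = total_tokens / wall_time  # tokens/second for the whole node
print(f"Total throughput: {total_throughput:,.0f} tokens/s")
```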

To understand how we calculated the score, see our multi-GPU benchmark methodology.

Key performance insights:

Performance analysis: The NVIDIA H200 demonstrates the highest throughput measurements across all tested configurations, exhibiting 9-10% performance improvements relative to the H100. The H200 achieves 99.8% scaling efficiency in the dual-GPU configuration, indicating near-optimal resource utilization.

AMD MI300X performance characteristics: The AMD MI300X achieves single-GPU throughput of 18,752 tokens per second, representing approximately 74% of the H200’s performance. It maintains scaling efficiencies of 95% and 81% for two-GPU and four-GPU configurations, respectively.
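For reference, the scaling-efficiency percentages above follow the usual definition: measured multi-GPU throughput divided by N times the single-GPU throughput. A minimal sketch, using the MI300X single-GPU figure from above and an illustrative two-GPU measurement:

```python
def scaling_efficiency(single_gpu_tps: float, multi_gpu_tps: float, num_gpus: int) -> float:
    """Percentage of ideal linear scaling achieved by a multi-GPU configuration."""
    return 100.0 * multi_gpu_tps / (single_gpu_tps * num_gpus)

single_gpu = 18_752   # MI300X single-GPU throughput quoted above (tokens/s)
two_gpu = 35_600      # illustrative two-GPU measurement, not our actual number

print(f"{scaling_efficiency(single_gpu, two_gpu, 2):.1f}% scaling efficiency")  # ~94.9%
```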

Average inference latency vs. GPU count

  • Average inference latency (milliseconds): This metric measures the average time it takes to process a single request from start to finish. Lower latency translates to a faster, more responsive experience for the end-user.

Key performance insights:

Latency performance analysis: The NVIDIA B200 exhibits the lowest latency measurements across all evaluated configurations, achieving 2.40ms with eight-GPU implementations. These performance characteristics position it for applications requiring minimal response times, such as real-time interactive systems where sub-3ms latency is a design requirement.

Scaling efficiency observations: Analysis reveals diminishing returns in latency reduction as GPU count increases across all platforms. The most substantial latency reduction occurs in the transition from single to dual-GPU configurations (approximately 50% reduction across platforms). Configurations exceeding four GPUs demonstrate progressively smaller latency improvements.

H200 and H100 comparative analysis: The H200 demonstrates 5-8% lower latency compared to the H100 across all scales, with the absolute difference decreasing at higher GPU counts (2.81ms versus 2.86ms at eight GPUs, representing a 0.05ms difference). This marginal performance differential, when considered against the 41% price difference, suggests that the H100 may provide more favorable cost-performance characteristics for latency-sensitive deployments.

AMD MI300X latency characteristics: The MI300X demonstrates latency values 37-75% higher than the H200 across tested configurations, which may be attributed to current differences in software stack maturity between vLLM ROCm and CUDA implementations. At eight-GPU scale, the MI300X achieves 4.20ms latency, which remains within acceptable parameters for numerous production applications despite the performance differential relative to NVIDIA platforms.

Performance vs. price: A cost-efficiency analysis

While raw performance metrics are crucial, the ultimate decision for any organization hinges on cost-efficiency. To analyze the return on investment (ROI) for each platform, we’ve mapped our throughput results against the on-demand hourly pricing from RunPod at the time of testing. This allows us to calculate a “performance-per-dollar” score, revealing which setup offers the most computational power for the lowest cost.


Note: All pricing information reflects the on-demand rates available on the RunPod Cloud platform at the time of the benchmark (September 2025) and is subject to change. The costs are presented for comparative analysis and do not include storage or network fees.

How we calculated throughput-per-dollar

To generate this graph, we processed our raw performance data against the hourly costs. The calculation divides each configuration’s total throughput by its hourly cost: throughput_per_dollar = total throughput (tokens/s) / hourly cost ($/hr).

  • Data Preparation: For each data point in our results table, we retrieved the corresponding hourly cost for that specific GPU configuration (e.g., 4x H100 cost is $10.76).
  • Calculation: We then applied the formula to compute the throughput_per_dollar value. For example, the H100 at 1x GPU delivered 23,243 tokens/s at a cost of $2.69/hr, resulting in a score of 8,642 tokens/s per dollar.
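A minimal sketch of that division, using the 1x H100 figures quoted above (the commented result is approximate):

```python
# 1x H100 figures quoted above; the same formula applies to every configuration.
throughput_tps = 23_243   # tokens/second
hourly_cost = 2.69        # USD per hour (RunPod on-demand, September 2025)

throughput_per_dollar = throughput_tps / hourly_cost
print(f"{throughput_per_dollar:,.0f} tokens/s per dollar")  # roughly 8.6k, in line with the score above
```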

This efficiency score provides a decision-making tool, moving the conversation from “which is fastest?” to “which is the smartest investment for our workload?”.

What is multi-GPU scaling?

Multi-GPU scaling refers to a system’s ability to increase its performance by distributing a single large task across multiple GPUs. For LLM inference, this can be achieved through data parallelism, where independent copies of the model run on each GPU, with a load balancer distributing incoming requests across all instances.

Ideally, using two GPUs would deliver twice the performance of a single GPU (2x speedup). However, in reality, performance gains are limited by CPU and system bottlenecks, the time the host system spends managing multiple concurrent processes, memory bandwidth constraints, and resource contention. Our benchmark measures how efficiently each platform manages these system-level constraints, which is a critical factor for building cost-effective, high-performance AI inference servers for small to medium models.
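To make the data-parallel picture concrete, the toy sketch below routes incoming requests round-robin across independent model replicas. It is purely illustrative: the replica URLs are hypothetical, and it is not the load-balancing mechanism used by vLLM or any particular serving stack.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy load balancer: each incoming request goes to the next replica in turn."""

    def __init__(self, replica_urls):
        self.replicas = cycle(replica_urls)

    def route(self, request_id: str) -> str:
        target = next(self.replicas)
        # A real deployment would forward the request to the replica's HTTP endpoint;
        # here we only return the routing decision.
        return f"request {request_id} -> {target}"

# One independent model replica per GPU (URLs are hypothetical).
balancer = RoundRobinBalancer([f"http://localhost:{8000 + i}/v1" for i in range(4)])
for i in range(6):
    print(balancer.route(str(i)))
```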

What are the challenges in Multi-GPU scaling tests?

Benchmarking multi-GPU systems presents unique challenges that can significantly impact performance outcomes.

Communication overhead and interconnect bottlenecks

When a model is split across GPUs, the interconnect, such as NVIDIA’s NVLink or AMD’s Infinity Fabric, becomes a critical performance bottleneck. The efficiency of inter-GPU communication directly impacts scaling. If the time spent waiting for data from another GPU is longer than the time saved by parallelizing the computation, performance gains will diminish. This effect is particularly pronounced in models that are not large enough to fully saturate the computational capacity of each individual GPU.

Software ecosystem maturity

Performance is not solely a function of hardware. The software stack, including drivers, communication libraries (like NCCL for NVIDIA and RCCL for AMD), and the inference engine (vLLM), plays a monumental role. We discovered that a platform’s performance is deeply tied to the maturity of its software support. An established ecosystem like NVIDIA’s CUDA often benefits from years of fine-tuning and optimization, which can lead to superior scaling efficiency compared to newer integrations like AMD’s ROCm, even on powerful hardware.

Platform-specific optimizations

As our tests revealed, achieving optimal performance often requires platform-specific configurations. Using a generic, “one-size-fits-all” approach can lead to misleadingly low performance. The correct Docker image, environment variables (e.g., enabling custom AMD kernels), and even model data types (bfloat16 for Blackwell) are essential for unlocking the true potential of the hardware. This makes fair “apples-to-apples” comparisons a significant technical challenge.
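As one small example of such a knob, vLLM exposes the model data type as an engine argument. The sketch below pins bfloat16 explicitly; everything else is left at defaults, and the prompt is illustrative.

```python
from vllm import LLM, SamplingParams

# Pinning the model data type explicitly (bfloat16 here) is one of the
# platform-specific settings discussed above; other arguments stay at defaults.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

params = SamplingParams(temperature=0.0, max_tokens=64)
output = llm.generate(["Briefly explain data parallelism."], params)[0]
print(output.outputs[0].text)
```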

Multi-GPU benchmark methodology

We tested the latest high-performance GPU architectures from both NVIDIA and AMD to evaluate their scaling capabilities. Our benchmark measured the performance of single and multi-GPU (1x, 2x, 4x, 8x) configurations using the standard meta-llama/Llama-3.1-8B-Instruct model and the vLLM inference engine.

Test environment and process

  • Platform: All benchmarks were executed on RunPod Cloud for consistent hardware access.
  • Inference engine: vLLM (vllm bench throughput tool) was used as the standardized engine.
  • Model: meta-llama/Llama-3.1-8B-Instruct.
  • Dataset: ShareGPT Vicuna dataset (25,000 prompts) to simulate a conversational workload.
  • Strategy: Data parallelism. Each multi-GPU test ran an independent vLLM instance on each GPU, with the total prompt load divided equally among the instances, which were launched simultaneously to simulate a load-balanced production environment (see the launch sketch after this list). This approach eliminates inter-GPU communication (NVLink/PCIe) as a bottleneck, shifting the performance limiters to the host system (CPU, RAM).
  • Automation: Custom Bash scripts were used to automate environment setup, test execution, resource monitoring (nvidia-smi, rocm-smi), and results aggregation.
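For illustration, a data-parallel launch along these lines can be scripted by pinning each benchmark process to a single GPU via CUDA_VISIBLE_DEVICES. Treat the command-line flags below as a sketch rather than our exact invocation; they may differ across vLLM versions, and dataset flags are omitted.

```python
import os
import subprocess

NUM_GPUS = 4
TOTAL_PROMPTS = 25_000
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

procs = []
for gpu in range(NUM_GPUS):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)                # one independent instance per GPU
    cmd = [
        "vllm", "bench", "throughput",
        "--model", MODEL,
        "--num-prompts", str(TOTAL_PROMPTS // NUM_GPUS),  # equal share of the load
        # dataset flags (e.g. the ShareGPT file) omitted for brevity
    ]
    procs.append(subprocess.Popen(cmd, env=env))

# All instances are launched back-to-back (effectively simultaneously); wait for completion.
for p in procs:
    p.wait()
```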

Platform-specific configurations

Achieving optimal performance required tailored configurations for each architecture.

NVIDIA platforms (H100, H200, B200)

  • Base image: runpod/pytorch:2.8.0-py3.11-cuda12.8.1.
  • vLLM installation:
    • H100/H200 (Hopper): Standard installation via pip install vllm.
    • B200 (Blackwell): vLLM was compiled from source (pip install -e .) to enable native support for the new architecture, resolving “no kernel image” errors.
  • Key parameters:
  • Critical environment variable:

AMD platform (MI300X)

  • Base image: rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909
  • vLLM installation: No installation was needed, as the optimized version was included in the image.
  • Key parameters & optimizations: Extensive tuning identified the following non-default settings as critical for achieving maximum throughput:
  • AMD-specific environment variables:
  • Device visibility: ROCR_VISIBLE_DEVICES was used instead of CUDA’s equivalent to assign instances to specific GPUs.
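A small sketch of how instance-to-GPU pinning differs between the two stacks; the helper function is hypothetical, and only the environment variable names come from the setup described above.

```python
import os

def pin_instance_to_gpu(base_env, gpu_index: int, vendor: str) -> dict:
    """Return a copy of the environment with the vendor-appropriate visibility variable set."""
    env = dict(base_env)
    if vendor == "amd":
        env["ROCR_VISIBLE_DEVICES"] = str(gpu_index)   # ROCm device masking
    else:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)   # CUDA device masking
    return env

# Example: per-instance environments for a 2x MI300X run.
envs = [pin_instance_to_gpu(os.environ, i, "amd") for i in range(2)]
print([e["ROCR_VISIBLE_DEVICES"] for e in envs])
```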

Benchmark execution phases

Each benchmark run followed a three-phase execution protocol to ensure accurate and reproducible results:

Phase 1: Warmup

Before each multi-GPU configuration test, we performed a dedicated warmup phase to eliminate cold-start effects:

  • Duration: 100 prompts processed on GPU 0
  • Purpose: Model loading, KV cache initialization, and CUDA/ROCm kernel compilation
  • Output: Discarded (not included in measurements)
  • Platform-specific behavior:
    • NVIDIA (CUDA): Kernel compilation and CUDA graph optimization (~30-60 seconds)
    • AMD (ROCm): Kernel compilation and optional TunableOp tuning (duration varies based on the tuning setting)
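Our harness drove the warmup through Bash and the vLLM CLI; purely as an illustration, an equivalent in-process warmup with the vLLM Python API might look like the sketch below (prompt text is arbitrary, output is discarded):

```python
from vllm import LLM, SamplingParams

# Warmup: 100 throwaway prompts on GPU 0 so model loading, KV cache initialization
# and kernel compilation do not contaminate the measured runs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
warmup_prompts = ["Summarize the benefits of multi-GPU inference."] * 100

_ = llm.generate(warmup_prompts, SamplingParams(max_tokens=32))  # output discarded
print("Warmup complete; measured benchmark can start.")
```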

Phase 2: GPU monitoring initialization

Concurrent with benchmark execution, we launched dedicated monitoring processes for each GPU:

  • Sampling rate: 1 second intervals
  • Metrics collected: GPU utilization, memory usage, temperature, power consumption
  • Tools: nvidia-smi (NVIDIA) or rocm-smi (AMD)
  • Output: CSV logs for post-analysis
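For illustration, a 1-second polling loop on NVIDIA hardware could look like the sketch below. The query fields are standard nvidia-smi options, but the script itself is not our monitoring code, and the AMD equivalent would call rocm-smi instead.

```python
import subprocess
import time

QUERY = "utilization.gpu,memory.used,temperature.gpu,power.draw"

def sample_gpus() -> list[str]:
    """One sample per GPU: utilization, memory, temperature and power as CSV fields."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return out.strip().splitlines()

with open("gpu_metrics.csv", "a") as log:
    for _ in range(60):                    # one minute of 1-second samples
        timestamp = int(time.time())
        for gpu_index, line in enumerate(sample_gpus()):
            log.write(f"{timestamp},{gpu_index},{line}\n")
        time.sleep(1)
```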
Phase 3: Parallel benchmark execution

After warmup completion, all GPU instances were launched simultaneously:

  • Each GPU processed an equal share of the 25,000 total prompts
  • All instances started within the same second to simulate production load balancing
  • Total throughput was measured as the sum of all GPU outputs
  • Execution time was measured from the first instance start to the last instance completion (see the aggregation sketch below)
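A minimal sketch of that aggregation step, assuming each instance reports its throughput plus start and end timestamps (the record layout and numbers are hypothetical):

```python
# Hypothetical per-instance records gathered after an 8-GPU run.
instances = [
    {"gpu": i, "tokens_per_s": 20_000 + 100 * i, "start": 100.0 + 0.1 * i, "end": 410.0 + i}
    for i in range(8)
]

# Total throughput is the sum of what every GPU delivered...
total_throughput = sum(r["tokens_per_s"] for r in instances)

# ...while execution time spans from the first instance start to the last completion.
execution_time = max(r["end"] for r in instances) - min(r["start"] for r in instances)

print(f"Total: {total_throughput:,} tokens/s over {execution_time:.1f} s")
```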

Real-world performance impact from testing

Our testing revealed that minor configuration errors can lead to significant, misleading performance results. The following table illustrates the impact of platform-specific misconfigurations:

Conclusion

For serving models in the 8B-13B class, data parallelism is a highly efficient strategy. The choice of hardware depends on the specific priorities of the deployment.

For workloads where cost-effectiveness is a primary consideration, the NVIDIA H100 demonstrates favorable characteristics, offering a balanced combination of performance metrics, acquisition costs, and predictable scaling behavior.

When throughput maximization is the principal objective without budgetary constraints, the NVIDIA H200 exhibits the highest performance measurements among the evaluated platforms.

The AMD MI300X presents notable characteristics for long-term deployment strategies and AMD-based infrastructure environments. Performance improvements are anticipated through software optimization iterations, and the platform’s substantial VRAM capacity provides advantages for accommodating larger model architectures.

The NVIDIA B200 demonstrates limitations for this specific workload configuration, exhibiting CPU-related performance constraints and suboptimal cost-efficiency metrics. The architecture appears more suited to implementations utilizing large-scale models with tensor parallelism strategies.

Further reading

Explore other AI hardware research on AIMultiple.

Sedat Dogan
CTO
Sedat is a technology and information security leader with experience in software development, web data collection and cybersecurity. Sedat:
- Has 20 years of experience as a white-hat hacker and development guru, with extensive expertise in programming languages and server architectures.
- Is an advisor to C-level executives and board members of corporations with high-traffic and mission-critical technology operations like payment infrastructure.
- Has extensive business acumen alongside his technical expertise.

Researched by
Ekrem Sarı
AI Researcher
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and RAG frameworks.
