Updated on Sep 12, 2025

Multi-GPU Benchmark: B200 vs H200 vs H100 vs MI300X


We benchmarked NVIDIA’s B200, H200, H100, and AMD’s MI300X to measure how well they scale for Large Language Model (LLM) inference. Using the vLLM framework with the meta-llama/Llama-3.1-8B-Instruct model, we ran tests on 1, 2, 4, and 8 GPUs. We analyzed throughput and scaling efficiency to show how each GPU architecture manages parallelized, compute-intensive workloads.

Multi-GPU benchmark results

Total throughput vs. GPU count

  • Total throughput (tokens/second): This metric represents the raw processing power of the entire multi-GPU system. It measures the total number of input and output tokens processed per second, making it the most important indicator of maximum performance under a saturated, offline workload.
  • Requests per second: Shown in the tooltip, this indicates how many individual user prompts the system can complete per second.

To understand how we calculated the score, see our multi-GPU benchmark methodology.

Key performance insights:

Best overall performer: NVIDIA’s B200 delivers the highest throughput at every configuration. Its single-GPU performance of 31,343 tokens/s is 77% faster than the H100, making it the premier choice for workloads where maximum processing power is the primary objective.

Most efficient scaler: NVIDIA’s H100 scales most efficiently across GPU counts. While its single-GPU performance is the lowest of the NVIDIA cards, its strong scaling efficiency allows it to nearly close the gap with the H200 at 4 GPUs and effectively match it at 8 GPUs.

Strong single-GPU contender: AMD’s MI300X delivers strong single-GPU throughput (16,474 tokens/s), making it highly competitive with the H100. However, its performance gains diminish significantly in multi-GPU setups, yielding only a 2x improvement at 8 GPUs, indicating a software-related scaling bottleneck.

Underperforming generational leap: NVIDIA’s H200, despite being newer, fails to maintain a significant lead over the H100 in multi-GPU configurations. The H100’s superior scaling efficiency closes the gap, making the H200 a less compelling upgrade for large, 8-GPU node deployments in this specific test.

Average inference latency vs. GPU count

  • Average inference latency (milliseconds): This metric measures the average time it takes to process a single request from start to finish. Lower latency translates to a faster, more responsive experience for the end-user.
  • Total throughput: Shown in the tooltip, this provides context on how much total work the system was doing while achieving that latency.

Key performance insights:

Best for real-time applications: NVIDIA’s B200 consistently achieves the lowest latency, dropping to 7.93 ms at an 8-GPU scale. This makes it the ideal choice for user-facing applications like chatbots, where response time is critical.

The law of diminishing returns: All GPUs show that adding more accelerators improves latency, but the gains shrink with each addition. For most cards, the biggest latency drop occurs when moving from 1 to 2 GPUs; for AMD’s MI300X, however, the sharpest improvement comes between 2 and 4 GPUs.

Cost-effective low latency: While the H200 starts with lower latency than the H100, the H100 all but closes the gap at the 8-GPU mark (10.39 ms vs. the H200’s 10.27 ms). This makes the H100 a highly cost-effective option for achieving low latency at scale without the premium price of the latest hardware.

Highest latency profile: The AMD MI300X exhibits the highest latency at every GPU count, reinforcing the finding that its current software stack within vLLM is less optimized for inter-GPU communication, which directly impacts the time required to complete a parallelized request.

Performance vs. price: A cost-efficiency analysis

While raw performance metrics are crucial, the ultimate decision for any organization hinges on cost-efficiency. To analyze the return on investment (ROI) for each platform, we’ve mapped our throughput results against the on-demand hourly pricing from RunPod at the time of testing. This allows us to calculate a “performance-per-dollar” score, revealing which setup offers the most computational power for the lowest cost.

Note: All pricing information reflects the on-demand rates available on the RunPod Cloud platform at the time of the benchmark (September 2025) and is subject to change. The costs are presented for comparative analysis and do not include storage or network fees.

How we calculated throughput-per-dollar

To generate this graph, we processed our raw performance data against the hourly costs. The calculation formula is:

Throughput per Dollar = Total Throughput (tokens/s) / Hourly Cost ($/hr)
  • Data Preparation: For each data point in our results table, we retrieved the corresponding hourly cost for that specific GPU configuration (e.g., 4x H100 cost is $10.76).
  • Calculation: We then applied the formula to compute the throughput_per_dollar value. For example, the H100 at 1x GPU delivered 17,684 tokens/s at a cost of $2.69/hr, resulting in a score of 6,574 tokens/s per dollar.

This efficiency score provides a decision-making tool, moving the conversation from “which is fastest?” to “which is the smartest investment for our workload?”.
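
To make the arithmetic concrete, the snippet below reproduces the 1x H100 data point from this section with a short shell calculation; awk is simply one convenient way to perform the division and is not part of our benchmark tooling.

  # Throughput-per-dollar for the 1x H100 example above
  total_throughput=17684   # tokens/s measured in the benchmark
  hourly_cost=2.69         # on-demand $/hr on RunPod at the time of testing
  awk -v t="$total_throughput" -v c="$hourly_cost" \
    'BEGIN { printf "%.0f tokens/s per dollar\n", t / c }'   # prints 6574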

What is multi-GPU scaling?

Multi-GPU scaling refers to a system’s ability to increase its performance by distributing a single large task across multiple GPUs. For LLM inference, this is typically achieved through tensor parallelism, where a model’s weights and computations are sharded across several accelerators.

Ideally, using two GPUs would deliver twice the performance of a single GPU (2x speedup). However, in reality, performance gains are limited by communication overhead, the time GPUs spend synchronizing and exchanging data with each other. Our benchmark measures how efficiently each platform manages this overhead, which is a critical factor for building cost-effective, high-performance AI inference servers for very large models.
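
For readers who want to see what this looks like in practice, the one-liner below asks vLLM to shard the benchmark model across four GPUs using tensor parallelism; the GPU count is illustrative, and exact flags may vary slightly between vLLM releases.

  # Serve the model sharded across 4 GPUs via tensor parallelism
  vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 4

  # Scaling efficiency can then be judged as T_N / (N * T_1), where T_N is the
  # throughput measured on N GPUs; 1.0 would be perfect linear scaling.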

What are the challenges in Multi-GPU scaling tests?

Benchmarking multi-GPU systems presents unique challenges that can significantly impact performance outcomes.

1. Communication overhead and interconnect bottlenecks

When a model is split across GPUs, the interconnect, such as NVIDIA’s NVLink or AMD’s Infinity Fabric, becomes a critical performance factor. The efficiency of inter-GPU communication directly impacts scaling: if the time spent waiting for data from another GPU exceeds the time saved by parallelizing the computation, performance gains diminish. This effect is particularly pronounced when the model is not large enough to fully saturate the computational capacity of each individual GPU.

2. Software ecosystem maturity

Performance is not solely a function of hardware. The software stack, including drivers, communication libraries (like NCCL for NVIDIA and RCCL for AMD), and the inference engine (vLLM), plays a monumental role. We discovered that a platform’s performance is deeply tied to the maturity of its software support. An established ecosystem like NVIDIA’s CUDA often benefits from years of fine-tuning and optimization, which can lead to superior scaling efficiency compared to newer integrations like AMD’s ROCm, even on powerful hardware.

3. Platform-specific optimizations

As our tests revealed, achieving optimal performance often requires platform-specific configurations. Using a generic, “one-size-fits-all” approach can lead to misleadingly low performance. The correct Docker image, environment variables (e.g., enabling custom AMD kernels), and even model data types (bfloat16 for Blackwell) are essential for unlocking the true potential of the hardware. This makes fair “apples-to-apples” comparisons a significant technical challenge.

Multi-GPU benchmark methodology

We tested the latest high-performance GPU architectures from both NVIDIA and AMD to evaluate their scaling capabilities. Our benchmark measured the performance of single and multi-GPU (1x, 2x, 4x, 8x) configurations using the standard meta-llama/Llama-3.1-8B-Instruct1  model and the vLLM2 inference engine.

  • Cloud platform: All benchmarks were executed on RunPod Cloud to ensure consistent access to a wide range of GPU instances (NVIDIA H100, H200, B200, and AMD MI300X).
  • Inference engine: vLLM, an open-source library for high-throughput LLM inference, was used as the standardized software engine. We utilized its built-in offline throughput benchmark (vllm bench throughput); a sample invocation is shown after this list.
  • Test model: To maintain consistency, meta-llama/Llama-3.1-8B-Instruct was selected as the primary test model due to its widespread adoption and robust cross-platform support.
  • Dataset: All throughput tests were conducted using the ShareGPT Vicuna dataset3  to simulate a realistic conversational workload.
  • Automation & monitoring: A custom suite of Bash scripts was developed to automate the entire workflow, including environment setup, test execution, per-second VRAM monitoring (nvidia-smi, rocm-smi), and results post-processing.
  • Multi-GPU scaling considerations: For multi-GPU tests, we employed tensor parallelism (TP), where the model’s weights are sharded across multiple GPUs. It is important to note that performance does not scale linearly (e.g., 2x GPUs do not yield 2x performance). This is due to the communication overhead required for GPUs to synchronize and exchange tensor data after each computation layer, a factor that becomes a significant bottleneck affecting overall scaling efficiency.
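
The sketch below shows roughly how one such run was launched; flag names can vary between vLLM versions, the dataset path is a placeholder, and the prompt count is an example rather than our exact setting.

  # Offline throughput benchmark on 4 GPUs (flags may differ by vLLM version)
  vllm bench throughput \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path /data/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --tensor-parallel-size 4

  # Per-second VRAM monitoring in a separate shell (rocm-smi on AMD)
  nvidia-smi --query-gpu=index,memory.used --format=csv,noheader -l 1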

Platform-specific configurations

Recognizing that each hardware platform requires a unique software environment to perform optimally, we tailored the setup for each GPU family. This was the most critical aspect of the benchmark, revealing key insights into the maturity of each ecosystem.

NVIDIA platform (H100 / H200 – Hopper architecture)

  • Base Docker image: A clean, official runpod/pytorch:2.8.0-py3.11-cuda12.8.1 image was used as the foundation, providing a stable CUDA environment.
  • vLLM installation: vLLM was installed directly from PyPI within the container using pip install vllm, leveraging its mature support for the CUDA ecosystem (see the snippet after this list).
  • Performance optimization: Standard vLLM parameters were used, as this hardware is well-supported out of the box.
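
A minimal way to reproduce this environment locally is sketched below; on RunPod the image is simply selected when the pod is created, so the docker run line is shown only for completeness.

  # Start the base CUDA 12.8 container with GPU access
  docker run -it --rm --gpus all runpod/pytorch:2.8.0-py3.11-cuda12.8.1 bash

  # Inside the container, install vLLM from PyPI
  pip install vllm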

NVIDIA platform (B200 – Blackwell architecture)

  • Base Docker image: The same runpod/pytorch:2.8.0-py3.11-cuda12.8.1 image was used.
  • vLLM installation: Due to the novelty of the Blackwell architecture (sm_100), the standard pip version of vLLM was incompatible. To resolve this, vLLM was compiled from its source code directly within the container (pip install -e .). This crucial step built native Blackwell support, resolving the “no kernel image is available” error.
  • Performance optimization: As recommended for the Blackwell architecture, the --dtype bfloat16 parameter was explicitly passed to vLLM to leverage hardware-level acceleration for this data type (an abbreviated version of the build and run steps follows this list).
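
In abbreviated form, the source build looked roughly like the following; the repository URL is the public vLLM project on GitHub, and additional build dependencies may be required depending on the container.

  # Compile vLLM from source so native Blackwell (sm_100) kernels are built
  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install -e .

  # Run the throughput benchmark with bfloat16 explicitly enabled
  vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16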

AMD platform (MI300X – CDNA 3 architecture)

  • Base Docker image: After extensive testing revealed instabilities with generic ROCm environments, we standardized on the official, pre-built AMD image: rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812. This image provided a stable, pre-configured environment where vLLM and all ROCm libraries were already installed and validated.
  • vLLM installation: No installation was required; the pre-built, optimized version within the image was used.
  • Performance optimization: As per AMD’s official documentation4, the following platform-specific environment variables were set to unlock maximum performance by enabling AMD’s custom, high-performance attention kernels (a sample container launch follows this list):
- export VLLM_USE_TRITON_FLASH_ATTN=0
- export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
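
For completeness, a typical way to bring up this image and apply the settings above is sketched below; the device and group flags are the ones AMD documents for ROCm containers, the 8-GPU count is an example, and on RunPod the image is supplied as the pod’s container image rather than started via docker run.

  # Launch the pre-built AMD vLLM image with GPU access
  docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --ipc=host \
    --security-opt seccomp=unconfined \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812

  # Inside the container: enable the custom kernels, then benchmark
  export VLLM_USE_TRITON_FLASH_ATTN=0
  export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
  vllm bench throughput --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 8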

Conclusion

Our multi-GPU scaling analysis shows distinct performance patterns across different hardware. The NVIDIA B200 achieves the highest single-GPU throughput. For multi-GPU setups, performance depends heavily on software maturity. NVIDIA’s Hopper and Blackwell architectures scale consistently across multiple GPUs, supported by well-developed CUDA and NCCL software frameworks.

The AMD MI300X delivers strong performance when using its optimized software stack. However, its weak scaling efficiency with this relatively small model indicates that vLLM’s multi-GPU communication support for ROCm is still maturing.

These results show that organizations building multi-GPU inference servers must consider both hardware capabilities and software maturity. NVIDIA currently offers more predictable scaling performance. AMD shows strong potential, but full performance depends on further software development and optimization.

