We benchmarked NVIDIA's latest GPUs, the H100, H200, and B200, to analyze how they scale with concurrency. Using the vLLM framework with the gpt-oss-20b model, we tested how these GPUs handle between 1 and 1024 concurrent requests. By measuring system output throughput, per-query output speed, and end-to-end latency, we share findings to help you understand how these GPUs perform under AI inference workloads.
Concurrency benchmark results
The charts in this section plot, for each GPU, system output throughput vs. concurrency, output speed per query vs. concurrency, and end-to-end latency vs. concurrency.
What is concurrency?
Concurrency refers to a GPU’s ability to process multiple requests simultaneously, a key factor for AI workloads such as large language model inference. In our performance evaluation, concurrency levels represent the number of simultaneous requests (from 1 to 1024) sent to the GPU during test runs. Higher concurrency tests the GPU’s capacity to manage parallel tasks without degrading performance, balancing throughput and latency.
Understanding concurrency helps you choose the right GPU for workloads with varying demand or batch-processing needs. Concurrency behavior can differ significantly between GPUs and system configurations, so it is worth comparing benchmark results across hardware and price points rather than relying on a single peak-throughput number.
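To make "concurrency level" concrete, here is a minimal client sketch, not our actual benchmark harness, that caps the number of in-flight requests with an asyncio semaphore against a vLLM OpenAI-compatible endpoint. The endpoint URL, model name, prompt, and request counts are assumptions to adjust for your own deployment.

```python
# Minimal sketch: "concurrency = N" means at most N requests are active at once.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumed endpoint and model; adjust to your vLLM deployment.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "openai/gpt-oss-20b"
CONCURRENCY = 64          # one of the tested levels (1 .. 1024)
TOTAL_REQUESTS = 256

async def one_request(sem: asyncio.Semaphore) -> int:
    """Send a single chat completion and return its output token count."""
    async with sem:
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "Summarize vLLM in one paragraph."}],
            max_tokens=1000,
        )
        return resp.usage.completion_tokens

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(sem) for _ in range(TOTAL_REQUESTS)))
    elapsed = time.perf_counter() - start
    print(f"system output throughput: {sum(tokens) / elapsed:.1f} tok/s")

asyncio.run(main())
```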
What is vLLM?
vLLM is a fast and easy-to-use open-source library for large language model (LLM) inference and serving, supported by a community of contributors. It handles both cloud and self-hosted LLM deployments by managing memory, processing concurrent requests, and serving models like gpt-oss-20b efficiently. For self-hosted LLMs, vLLM simplifies deployment with features like PagedAttention for memory management, continuous batching, and support for NVIDIA GPUs, enabling multiple concurrent requests on local hardware.
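For illustration, the snippet below uses vLLM's offline Python API to generate from a couple of prompts; the model identifier and sampling settings here are assumptions. Our benchmark instead drove vLLM's OpenAI-compatible server (started with `vllm serve`) over HTTP, but the same engine features apply in both modes.

```python
# Minimal sketch of vLLM's offline Python API (the benchmark used the HTTP server).
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")            # assumed model ID, pulled from Hugging Face
params = SamplingParams(temperature=0.8, max_tokens=1000)

# vLLM schedules these prompts internally via continuous batching and PagedAttention.
prompts = ["Explain continuous batching.", "What is PagedAttention?"]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:80])
```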
What are the challenges and limitations in concurrency testing?
Concurrency testing can be complex due to several theoretical challenges that affect system performance in AI workloads.
Resource contention and GPU limitations
High concurrency levels can lead to resource contention, where many inference requests compete for limited GPU memory (VRAM) and compute. This happens regardless of whether the accelerator comes from NVIDIA, AMD, or Intel, and it is especially pronounced in LLM serving, where each active request holds memory for the duration of its generation. Contention increases latency and can cause requests to time out or fail, particularly if the host CPU becomes a bottleneck while feeding the GPU.
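One way to observe contention is to log VRAM usage and GPU utilization while a load test runs. The sketch below uses NVML via the `pynvml` bindings; the device index and sampling interval are assumptions.

```python
# Sample GPU memory and utilization once per second for ~10 seconds.
# pynvml ships with the nvidia-ml-py package; device index 0 is assumed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
    print(f"VRAM used: {mem.used / 2**30:.1f} GiB, GPU util: {util.gpu}%")
    time.sleep(1)

pynvml.nvmlShutdown()
```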
Request queue management and test automation
Managing the request queue is critical when automating inference benchmarks. Submitting requests too quickly can overwhelm the server, while inserting delays to protect it can cap measured throughput and skew the scores. Keeping this balance consistent across different model sizes and server configurations is difficult.
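One simple client-side mitigation, shown as a hypothetical sketch below, is to stagger request launches so the server's queue fills gradually instead of receiving the full burst at once. The `send_request` stub stands in for a real inference call, such as the client sketched earlier.

```python
# Stagger request launches over a ramp-up window instead of firing a burst.
import asyncio

async def send_request(i: int) -> None:
    ...  # placeholder: issue one inference request here

async def ramped_launch(total: int, ramp_seconds: float) -> None:
    delay = ramp_seconds / total              # spacing between launches
    tasks = []
    for i in range(total):
        tasks.append(asyncio.create_task(send_request(i)))
        await asyncio.sleep(delay)            # gradual ramp-up
    await asyncio.gather(*tasks)

asyncio.run(ramped_launch(total=1024, ramp_seconds=30.0))
```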
Cloud environment variability
In cloud environments, network variability, shared infrastructure, and noisy neighbors can introduce run-to-run inconsistency. These variables can cause noticeable differences between otherwise identical test runs, making it harder to compare results or to state an expected score for a given hardware configuration.
Hardware limitations in self-hosted testing
For self-hosted setups, hardware differences such as GPU model, power limits, cooling efficiency, and host CPU and memory can all affect results. The same inference engine can perform very differently across machines, particularly when comparing consumer gaming GPUs with professional data-center hardware at different price points.
Impact on test results and consumer decisions
These factors shape how concurrency test results should be interpreted. Anyone evaluating hardware for AI workloads should keep these limitations in mind when reviewing benchmark data: measured performance can drop under certain test conditions, and no single run tells the whole story.
Concurrency benchmark methodology
We tested NVIDIA’s latest GPU architectures using Runpod cloud infrastructure to evaluate their concurrency scaling capabilities for AI inference workloads. Our benchmark tested the H100, H200, and B200 GPUs running the OpenAI gpt-oss-20b model via vLLM under varying concurrent load conditions, ranging from single requests to scenarios with 1,024 simultaneous connections. Through measurement of throughput metrics, latency distributions, and resource utilization patterns, this analysis aims to provide insights for AI inference deployments.
Test Infrastructure
We deployed our tests on Runpod’s cloud infrastructure, utilizing NVIDIA’s most advanced GPU architectures and the vLLM framework.
- GPU Platform: Runpod cloud infrastructure (H100, H200, B200)
- Model: OpenAI GPT-OSS-20B via vLLM framework
Benchmark Configuration
Each GPU was tested across 10 concurrency levels with standardized parameters to ensure consistent results; a sketch of how such a sweep can be organized follows the list.
- Concurrency Levels: 1, 4, 8, 16, 32, 64, 128, 256, 512, 1024 concurrent requests
- Test Duration: 180-second measurement phase with 30-second ramp-up and cool-down
- Request Size: 1,000 input/output tokens per request
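As referenced above, here is a hypothetical sweep driver showing how the runs can be organized. `run_benchmark` is a placeholder for the actual load-generation step (for example, a client like the one sketched earlier) and would return aggregate metrics for one concurrency level.

```python
# Hypothetical sweep over the concurrency levels listed above.
CONCURRENCY_LEVELS = [1, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

def run_benchmark(concurrency: int,
                  measure_s: int = 180,   # measurement phase
                  ramp_s: int = 30,       # ramp-up / cool-down
                  max_tokens: int = 1000) -> dict:
    """Placeholder for one measurement run; a real implementation would drive
    the load generator and return aggregate throughput/latency metrics."""
    return {"throughput_tok_s": None, "p95_latency_s": None}

results = {level: run_benchmark(level) for level in CONCURRENCY_LEVELS}
print(results)
```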
Key Metrics
We tracked performance across multiple dimensions to provide a comprehensive view of GPU capabilities under load; a sketch of how the latency percentiles can be computed follows the list.
- Throughput: System output tokens per second, successful requests per second, individual request token generation speed
- Latency: Time to First Token (TTFT), end-to-end latency with P50/P95/P99 percentiles, average latency per request
- Reliability: Success rate percentage, timeout vs. other error classification
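The percentile figures above can be derived from raw per-request latency samples. The sketch below shows a simple nearest-rank implementation using only the standard library; the latency values in it are made-up placeholders, not benchmark data.

```python
# Nearest-rank percentiles over per-request end-to-end latencies (seconds).
import math
import statistics

def percentile(latencies: list[float], pct: float) -> float:
    """Return the nearest-rank percentile, e.g. pct=95 for P95."""
    ordered = sorted(latencies)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(0, rank - 1)]

e2e_latencies = [1.9, 2.0, 2.1, 2.2, 2.4, 2.5, 2.6, 3.0, 3.5, 9.8]  # placeholder samples

print("mean:", round(statistics.mean(e2e_latencies), 2))
for p in (50, 95, 99):
    print(f"P{p}:", percentile(e2e_latencies, p))
```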
Conclusion
Based on our concurrency testing of NVIDIA’s H100, H200, and B200 GPUs, each architecture shows different performance characteristics as concurrent load increases. Higher concurrency levels can boost overall system throughput, but this comes with trade-offs in per-query response times and end-to-end latency.
These benchmark results may help organizations planning AI inference deployments choose the right GPU for their specific needs—whether prioritizing throughput for batch processing or maintaining low latency for real-time applications. The performance patterns observed across different concurrency levels can inform decisions about hardware selection, capacity planning, and deployment strategies in AI infrastructure.