We benchmarked NVIDIA's latest GPUs, the H100, H200, and B200, to analyze how they scale with concurrency. Using the vLLM framework with the gpt-oss-20b model, we tested how these GPUs handle between 1 and 1024 concurrent requests. By measuring system output throughput, per-query output speed, and end-to-end latency, we share findings to help you understand how these GPUs perform under AI inference workloads.
Concurrency benchmark results
The charts in this section plot, for each GPU, system output throughput vs. concurrency, output speed per query vs. concurrency, and end-to-end latency vs. concurrency.
What is concurrency?
Concurrency refers to a GPU’s ability to process multiple requests simultaneously, a key factor for AI workloads such as large language model inference. In our performance evaluation, concurrency levels represent the number of simultaneous requests (from 1 to 1024) sent to the GPU during test runs. Higher concurrency tests the GPU’s capacity to manage parallel tasks without degrading performance, balancing throughput and latency.
Understanding concurrency helps you choose the right GPU for workloads with varying demand or batch-processing needs. Concurrency behavior can differ significantly between GPUs and system configurations, so it is worth comparing benchmark results across hardware and price points rather than relying on a single peak-throughput number.
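To make "concurrency level" concrete, here is a minimal client sketch, not our actual benchmark harness, that caps the number of in-flight requests with an asyncio semaphore against a vLLM OpenAI-compatible endpoint. The endpoint URL, model name, prompt, and request counts are assumptions to adjust for your own deployment.

```python
# Minimal sketch: "concurrency = N" means at most N requests are active at once.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumed endpoint and model; adjust to your vLLM deployment.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "openai/gpt-oss-20b"
CONCURRENCY = 64          # one of the tested levels (1 .. 1024)
TOTAL_REQUESTS = 256

async def one_request(sem: asyncio.Semaphore) -> int:
    """Send a single chat completion and return its output token count."""
    async with sem:
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "Summarize vLLM in one paragraph."}],
            max_tokens=1000,
        )
        return resp.usage.completion_tokens

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(sem) for _ in range(TOTAL_REQUESTS)))
    elapsed = time.perf_counter() - start
    print(f"system output throughput: {sum(tokens) / elapsed:.1f} tok/s")

asyncio.run(main())
```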
What is vLLM?
vLLM is a fast and easy-to-use open-source library for large language model (LLM) inference and serving, supported by a community of contributors. It handles both cloud and self-hosted LLM deployments by managing memory, processing concurrent requests, and serving models like gpt-oss-20b efficiently. For self-hosted LLMs, vLLM simplifies deployment with features like PagedAttention for memory management, continuous batching, and support for NVIDIA GPUs, enabling multiple concurrent requests on local hardware.
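For illustration, the snippet below uses vLLM's offline Python API to generate from a couple of prompts; the model identifier and sampling settings here are assumptions. Our benchmark instead drove vLLM's OpenAI-compatible server (started with `vllm serve`) over HTTP, but the same engine features apply in both modes.

```python
# Minimal sketch of vLLM's offline Python API (the benchmark used the HTTP server).
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")            # assumed model ID, pulled from Hugging Face
params = SamplingParams(temperature=0.8, max_tokens=1000)

# vLLM schedules these prompts internally via continuous batching and PagedAttention.
prompts = ["Explain continuous batching.", "What is PagedAttention?"]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:80])
```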
What are the challenges and limitations in concurrency testing?
Concurrency testing can be complex due to several theoretical challenges that affect system performance in AI workloads.
Resource contention and GPU limitations
High concurrency levels can lead to resource contention, where many inference requests compete for limited GPU memory (VRAM) and compute. This happens regardless of whether the accelerator comes from NVIDIA, AMD, or Intel, and it is especially pronounced in LLM serving, where each active request holds memory for the duration of its generation. Contention increases latency and can cause requests to time out or fail, particularly if the host CPU becomes a bottleneck while feeding the GPU.
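One way to observe contention is to log VRAM usage and GPU utilization while a load test runs. The sketch below uses NVML via the `pynvml` bindings; the device index and sampling interval are assumptions.

```python
# Sample GPU memory and utilization once per second for ~10 seconds.
# pynvml ships with the nvidia-ml-py package; device index 0 is assumed.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
    print(f"VRAM used: {mem.used / 2**30:.1f} GiB, GPU util: {util.gpu}%")
    time.sleep(1)

pynvml.nvmlShutdown()
```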
Request queue management and test automation
Managing the request queue is critical when automating inference benchmarks. Submitting requests too quickly can overwhelm the server, while inserting delays to protect it can cap measured throughput and skew the scores. Keeping this balance consistent across different model sizes and server configurations is difficult.
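One simple client-side mitigation, shown as a hypothetical sketch below, is to stagger request launches so the server's queue fills gradually instead of receiving the full burst at once. The `send_request` stub stands in for a real inference call, such as the client sketched earlier.

```python
# Stagger request launches over a ramp-up window instead of firing a burst.
import asyncio

async def send_request(i: int) -> None:
    ...  # placeholder: issue one inference request here

async def ramped_launch(total: int, ramp_seconds: float) -> None:
    delay = ramp_seconds / total              # spacing between launches
    tasks = []
    for i in range(total):
        tasks.append(asyncio.create_task(send_request(i)))
        await asyncio.sleep(delay)            # gradual ramp-up
    await asyncio.gather(*tasks)

asyncio.run(ramped_launch(total=1024, ramp_seconds=30.0))
```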
Cloud environment variability
In cloud environments, network variability, shared infrastructure, and noisy neighbors can introduce run-to-run inconsistency. These variables can cause noticeable differences between otherwise identical test runs, making it harder to compare results or to state an expected score for a given hardware configuration.
Hardware limitations in self-hosted testing
For self-hosted setups, hardware differences such as GPU model, power limits, cooling efficiency, and host CPU and memory can all affect results. The same inference engine can perform very differently across machines, particularly when comparing consumer gaming GPUs with professional data-center hardware at different price points.
Impact on test results and consumer decisions
These factors shape how concurrency test results should be interpreted. Anyone evaluating hardware for AI workloads should keep these limitations in mind when reviewing benchmark data: measured performance can drop under certain test conditions, and no single run tells the whole story.
Concurrency benchmark methodology
We tested NVIDIA’s latest GPU architectures using Runpod cloud infrastructure to evaluate their concurrency scaling capabilities for AI inference workloads. Our benchmark tested the H100, H200, and B200 GPUs running the OpenAI gpt-oss-20b model via vLLM under varying concurrent load conditions, ranging from single requests to scenarios with 1,024 simultaneous connections. Through measurement of throughput metrics, latency distributions, and resource utilization patterns, this analysis aims to provide insights for AI inference deployments.
Test Infrastructure
We deployed our tests on Runpod’s cloud infrastructure, utilizing NVIDIA’s most advanced GPU architectures and the vLLM framework.
- GPU Platform: Runpod cloud infrastructure (H100, H200, B200)
- Model: OpenAI GPT-OSS-20B via vLLM framework
Benchmark Configuration
Each GPU was tested across 10 concurrency levels with standardized parameters to ensure consistent results; a sketch of how such a sweep can be organized follows the list.
- Concurrency Levels: 1, 4, 8, 16, 32, 64, 128, 256, 512, 1024 concurrent requests
- Test Duration: 180-second measurement phase with 30-second ramp-up and cool-down
- Request Size: 1,000 input/output tokens per request
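As referenced above, here is a hypothetical sweep driver showing how the runs can be organized. `run_benchmark` is a placeholder for the actual load-generation step (for example, a client like the one sketched earlier) and would return aggregate metrics for one concurrency level.

```python
# Hypothetical sweep over the concurrency levels listed above.
CONCURRENCY_LEVELS = [1, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

def run_benchmark(concurrency: int,
                  measure_s: int = 180,   # measurement phase
                  ramp_s: int = 30,       # ramp-up / cool-down
                  max_tokens: int = 1000) -> dict:
    """Placeholder for one measurement run; a real implementation would drive
    the load generator and return aggregate throughput/latency metrics."""
    return {"throughput_tok_s": None, "p95_latency_s": None}

results = {level: run_benchmark(level) for level in CONCURRENCY_LEVELS}
print(results)
```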
Key Metrics
We tracked performance across multiple dimensions to provide a comprehensive view of GPU capabilities under load; a sketch of how the latency percentiles can be computed follows the list.
- Throughput: System output tokens per second, successful requests per second, individual request token generation speed
- Latency: Time to First Token (TTFT), end-to-end latency with P50/P95/P99 percentiles, average latency per request
- Reliability: Success rate percentage, timeout vs. other error classification
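The percentile figures above can be derived from raw per-request latency samples. The sketch below shows a simple nearest-rank implementation using only the standard library; the latency values in it are made-up placeholders, not benchmark data.

```python
# Nearest-rank percentiles over per-request end-to-end latencies (seconds).
import math
import statistics

def percentile(latencies: list[float], pct: float) -> float:
    """Return the nearest-rank percentile, e.g. pct=95 for P95."""
    ordered = sorted(latencies)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(0, rank - 1)]

e2e_latencies = [1.9, 2.0, 2.1, 2.2, 2.4, 2.5, 2.6, 3.0, 3.5, 9.8]  # placeholder samples

print("mean:", round(statistics.mean(e2e_latencies), 2))
for p in (50, 95, 99):
    print(f"P{p}:", percentile(e2e_latencies, p))
```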
Conclusion
Based on our concurrency testing of NVIDIA’s H100, H200, and B200 GPUs, each architecture shows different performance characteristics as concurrent load increases. Higher concurrency levels can boost overall system throughput, but this comes with trade-offs in per-query response times and end-to-end latency.
These benchmark results may help organizations planning AI inference deployments choose the right GPU for their specific needs—whether prioritizing throughput for batch processing or maintaining low latency for real-time applications. The performance patterns observed across different concurrency levels can inform decisions about hardware selection, capacity planning, and deployment strategies in AI infrastructure.