We benchmarked three leading LLM inference engines on an NVIDIA H100: vLLM, LMDeploy, and SGLang. Each engine processed an identical workload: 1,000 ShareGPT prompts against Llama 3.1 8B-Instruct, isolating the true performance impact of their architectural choices and optimization strategies.
Inference engine benchmark results
We measured offline batch throughput across 10,000 inference operations per engine (1,000 prompts × 10 runs) to ensure statistical stability.
- Throughput: Output tokens generated per second in batch inference mode. Measures how efficiently each engine utilizes the H100’s compute capabilities.
All engines ran the same configuration, tuned for maximum performance: Llama 3.1 8B-Instruct, bfloat16 precision, and 0.8 GPU memory utilization on H100 80GB hardware.
To understand how we calculated throughput rates, please see our inference benchmark methodology.
Key findings
Our approach minimizes confounding variables: identical model, hardware, dataset, sampling configuration, memory limits, and warmup protocol. This isolation reveals what each engine’s architecture truly contributes.
The architectural gap is 29%: Even when vLLM is optimized with the exact same kernels (FlashInfer) used by SGLang, it significantly trails the leaders. SGLang (16,215 tok/s) and LMDeploy (16,132 tok/s) maintain a 29% advantage over the fully optimized vLLM (12,553 tok/s). This indicates that the bottleneck is no longer the mathematical kernel, but the engine’s internal orchestration overhead.
SGLang and LMDeploy are effectively tied: The performance difference between SGLang and LMDeploy is less than 0.6%, which falls within the margin of error. This suggests that both the “Python + Native Kernels” approach (SGLang) and the “Pure C++ Engine” approach (LMDeploy) are equally valid strategies for achieving peak performance on Hopper architectures.
GPU memory “safe zone” at 80% utilization: Attempts to allocate 95% GPU memory caused immediate crashes during CUDA Graph compilation across all engines, despite the 80GB capacity. The root cause was identified as system RAM exhaustion during graph capture, not GPU memory limits. A 0.8 fraction provided the optimal balance of stability and batch size.
Understanding the performance hierarchy
The throughput differences reveal a clear distinction between engine architectures on H100:
SGLang & LMDeploy: These engines achieve ~16,200 tok/s. SGLang achieves this via RadixAttention, a specialized memory manager designed for complex serving patterns. LMDeploy achieves this via TurboMind, a custom C++ backend that eliminates Python overhead entirely.
vLLM: Even with the FlashInfer backend enabled, vLLM peaks at ~12,500 tok/s. While this is a massive improvement over standard configurations, the remaining gap highlights the cost of vLLM’s flexible, plugin-based architecture (PagedAttention) versus the hyper-specialized designs of the leaders.
Architecture philosophy differences: SGLang and LMDeploy co-design their attention mechanisms with kernel assumptions. vLLM maintains a broader compatibility layer where attention algorithms must work with various backends, which limits specific optimization depth on bleeding-edge hardware.
Memory access pattern optimization: The 29% gap suggests SGLang and LMDeploy optimize memory coalescing, cache locality, and batch scheduling more aggressively than vLLM’s scheduler allows, particularly in how they handle the H100’s Tensor Memory Accelerator (TMA).
Benchmark methodology
Test environment
Hardware configuration:
- GPU: NVIDIA H100 80GB HBM3
- System: RunPod cloud instance
- Docker base: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
Software versions:
- CUDA: 12.8.1
- PyTorch: 2.8.0
- vLLM: 0.11.0 (FlashInfer enabled)
- LMDeploy: 0.10.2
- SGLang: v0.2.3
Dataset and workload
Source: ShareGPT_Vicuna_unfiltered dataset from Hugging Face
Selection criteria: 1,000 conversation prompts sampled from the dataset, with the same set used for every engine.
Why this dataset: ShareGPT contains real user-chatbot conversations with natural length variation, representing production chatbot workloads more accurately than synthetic benchmarks.
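As a rough illustration of how such a prompt set can be assembled (the repository id, file name, and sampling logic below are assumptions based on the commonly used Hugging Face mirror, not the exact script used in this benchmark):

```python
import json
import random

from huggingface_hub import hf_hub_download

# Assumed dataset location: the widely used mirror of ShareGPT_Vicuna_unfiltered.
path = hf_hub_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    filename="ShareGPT_V3_unfiltered_cleaned_split.json",
    repo_type="dataset",
)

with open(path) as f:
    conversations = json.load(f)

# Use the opening turn of each conversation as the prompt.
prompts = [
    conv["conversations"][0]["value"]
    for conv in conversations
    if conv.get("conversations")
]

random.seed(0)  # fixed seed so every engine sees the same 1,000 prompts
prompts = random.sample(prompts, 1000)
```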
Engine configurations
All engines were configured for maximum performance while maintaining fairness:
vLLM setup (FlashInfer Backend):
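The exact launch script is not reproduced here; a minimal sketch of an equivalent offline setup looks like the following, assuming FlashInfer is selected via vLLM's attention-backend environment variable and using placeholder sampling values:

```python
import os

# Select the FlashInfer attention backend before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    gpu_memory_utilization=0.8,  # the 0.8 "safe zone" discussed below
)

sampling = SamplingParams(temperature=0.0, max_tokens=256)  # placeholder sampling values
outputs = llm.generate(prompts, sampling)  # prompts: the 1,000 ShareGPT prompts
```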
LMDeploy setup:
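Again as a sketch rather than the original script, an equivalent LMDeploy setup with the TurboMind backend might look like this (parameter names follow LMDeploy's TurbomindEngineConfig; generation values are placeholders):

```python
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

pipe = pipeline(
    "meta-llama/Llama-3.1-8B-Instruct",
    backend_config=TurbomindEngineConfig(
        dtype="bfloat16",
        cache_max_entry_count=0.8,  # fraction of free GPU memory reserved for the KV cache
    ),
)

gen_config = GenerationConfig(temperature=0.0, max_new_tokens=256)  # placeholder values
responses = pipe(prompts, gen_config=gen_config)
```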
SGLang setup:
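And a comparable SGLang sketch, assuming a recent SGLang release that exposes the offline Engine API, again with placeholder sampling values:

```python
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    mem_fraction_static=0.8,  # fraction of GPU memory held for weights and KV cache
)

outputs = engine.generate(
    prompts,
    sampling_params={"temperature": 0.0, "max_new_tokens": 256},  # placeholder values
)
engine.shutdown()
```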
Measurement procedure
Standard protocol applied to all engines:
- Model loading: Download and initialize model with bfloat16 precision.
- Warmup phase: Process 20 prompts to trigger JIT compilation and stabilize GPU clocks.
- Benchmark runs: Execute 10 complete passes of all 1,000 prompts.
- Timing methodology: Record wall-clock duration for each complete pass, from the first prompt submitted to the last token generated.
- Token counting: Extract actual token counts from engine-specific output formats.
- Throughput calculation: total_output_tokens / duration (see the sketch below).
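To make the protocol concrete, here is a minimal sketch of such a harness; the generate_fn wrapper and token-count extraction are illustrative assumptions, since each engine reports token counts in its own output format:

```python
import statistics
import time

def run_benchmark(generate_fn, prompts, num_runs=10, warmup=20):
    """generate_fn(prompts) returns a list of per-prompt output token counts."""
    generate_fn(prompts[:warmup])  # warmup: trigger JIT compilation, stabilize GPU clocks

    throughputs = []
    for _ in range(num_runs):
        start = time.perf_counter()
        token_counts = generate_fn(prompts)  # one full pass over all 1,000 prompts
        duration = time.perf_counter() - start
        throughputs.append(sum(token_counts) / duration)  # total_output_tokens / duration

    return statistics.mean(throughputs), statistics.stdev(throughputs)
```

For vLLM, for instance, the per-prompt counts can be read from each RequestOutput's generated token_ids; LMDeploy and SGLang expose equivalent fields in their response objects.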
Statistical rigor:
- 10,000 inference operations per engine (1,000 prompts × 10 runs).
- ~1.5 million tokens generated per engine.
- Standard deviation consistently <1% of mean across all engines.
Interpreting the results
What you can conclude:
For offline batch inference of Llama 3.1 8B on H100 hardware, architectural efficiency dictates the winner. Even when vLLM is given the best available kernels (FlashInfer), it cannot match the throughput of SGLang or LMDeploy. The 29% gap reflects engine-level orchestration and scheduling overhead rather than kernel quality.
The performance hierarchy applies to this exact scenario: batch processing 1,000 prompts simultaneously. SGLang and LMDeploy are robust choices that deliver ~45% more value per GPU hour compared to standard deployments, and ~29% more compared to highly optimized vLLM deployments.
What you cannot generalize:
- Different models: Results specific to Llama 3.1 8B. Larger models (70B) or different architectures (Mixtral, Qwen) will show different scaling patterns.
- Different hardware: These rankings apply to H100 80GB. On A100 or V100, vLLM’s portability may outweigh SGLang’s specialization.
- Different metrics: This measures throughput only. Online serving requires TTFT and latency percentiles, where results differ significantly.
- Different workloads: Random prompts minimize prefix caching benefits. Repeated system prompts or multi-turn conversations change the performance landscape drastically in favor of SGLang.
Developer experience comparison
Performance numbers don’t capture the full deployment picture. Each engine offers distinct developer workflows:
vLLM: Industry standard for good reason
Simplicity meets broad compatibility. Single pip install vllm supports 100+ model architectures across NVIDIA, AMD, and Intel hardware. Massive community means Stack Overflow has your answers. OpenAI-compatible API server included.
- Choose vLLM for: Rapid prototyping, heterogeneous GPU environments, maximum model coverage, or leveraging the largest ecosystem.
LMDeploy: Production-grade with minimal friction
One-line installation (pip install lmdeploy) delivers 99.5% of peak H100 performance. Native C++ backend means zero Python overhead. First-class quantization support (AWQ, GPTQ) for further optimization. No dependency hell.
- Choose LMDeploy for: Production deployments where you need maximum H100 performance without sacrificing installation simplicity or stability.
SGLang: Performance ceiling with complexity cost
Absolute peak throughput (16,215 tok/s) comes at a price: significant effort debugging FlashInfer installation. Requires specific PyTorch version. Binary incompatibilities with some pre-built wheels. RadixAttention shines on conversational workloads.
- Choose SGLang for: Dedicated inference clusters where a specialized team can manage dependencies, and you need every last percentage point of throughput.
Installation and deployment challenges
Fair comparison required overcoming significant engineering hurdles:
Challenge 1: FlashInfer dependency conflicts
Issue: SGLang’s FlashInfer wheels expect specific PyTorch versions, but H100-optimized containers often ship different ones.
Resolution: Identify and pin a mutually compatible PyTorch/FlashInfer version pair before installing SGLang.
Time investment: 6 hours identifying compatible versions.
Takeaway: Pre-compiled ML wheels often hide version constraints that only surface at runtime.
Challenge 2: Enabling FlashInfer in vLLM
Issue: Standard vLLM versions often lack FlashInfer support or require complex source compilation.
Breakthrough: We used the vLLM 0.11.0 build on PyTorch 2.8 Nightly. This enabled native FlashInfer support via pip install "vllm[flashinfer]==0.11.0", bypassing the compilation barriers of older versions.
Impact: This provided the fairest possible comparison, confirming that while kernels help, they don’t solve the architectural bottleneck.
Challenge 3: Memory utilization sweet spot discovery
Issue: Standard recommendation of 0.9 GPU memory utilization caused std::bad_alloc crashes.
Testing progression: 0.95 and 0.9 both crashed during CUDA Graph compilation; 0.8 ran stably across all engines.
Discovery: CUDA Graph capture allocates temporary system RAM proportional to the GPU memory in use. At 0.9 × 80 GB = 72 GB of GPU allocation, system RAM was exhausted during compilation.
Practical limit: 0.8 GPU utilization is the “safe zone” despite 80GB hardware capacity.
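A hypothetical version of that sweep (shown for vLLM; the other engines expose analogous memory-fraction settings) is sketched below. Note that an allocation failure during CUDA Graph capture can take down the whole process rather than raise a catchable Python exception, so in practice each fraction is best tested in a fresh process:

```python
from vllm import LLM

# Hypothetical sweep to locate the highest stable memory fraction.
for fraction in (0.95, 0.9, 0.8):
    try:
        llm = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            dtype="bfloat16",
            gpu_memory_utilization=fraction,
        )
        print(f"{fraction:.2f}: initialized and captured CUDA graphs successfully")
        break
    except RuntimeError as exc:  # std::bad_alloc may surface as a RuntimeError, if at all
        print(f"{fraction:.2f}: failed ({exc})")
```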
Conclusion
For Llama 3.1 8B batch inference on H100, the performance hierarchy has two clear tiers: vLLM (optimized with FlashInfer) provides a solid baseline, while the leaner, more specialized architectures of SGLang and LMDeploy unlock an additional 29% throughput.
SGLang (16,215 tok/s) and LMDeploy (16,132 tok/s) achieve near-identical throughput, suggesting both engines saturate H100’s memory bandwidth. The minimal gap between them is statistical noise.
For production deployments: LMDeploy emerges as the practical winner, delivering 99.5% of SGLang’s peak throughput with trivial installation (pip install lmdeploy) versus SGLang’s complex dependency resolution.
vLLM with FlashInfer (12,553 tok/s) offers a compelling middle ground: respectable performance while maintaining full hardware compatibility and the industry’s largest model support matrix. However, for dedicated H100 clusters, leaving 29% performance on the table is a significant cost.
For standardization across heterogeneous infrastructure or rapid model experimentation, vLLM remains the rational choice. For dedicated H100 deployments where throughput is paramount, LMDeploy’s combination of peak performance and installation simplicity is unmatched.





