Quantization reduces LLM inference cost by running models at lower numerical precision.
We benchmarked four precision formats of Qwen3-32B on a single H100 GPU, running over 2,000 inference passes and more than 12,000 MMLU-Pro questions to measure the real-world trade-offs between speed, memory, and accuracy.
LLM quantization benchmark results
1. Performance and accuracy metrics
The following table summarizes the core performance indicators observed during the evaluation.
2. Effective memory capacity analysis
Standard GPU monitoring tools (e.g., NVIDIA-SMI) often report near-full memory utilization regardless of model size, due to the pre-allocation strategy of inference engines such as vLLM. The table below decomposes the actual memory usage to reveal the effective capacity available for the Key-Value (KV) Cache, which dictates the maximum context length and concurrency.
3. Concurrency analysis: How many users can we serve?
The “Max Concurrency” figures represent the memory-bound limit on the number of active users the GPU can hold in its “Working Memory” (KV Cache) simultaneously before crashing with an Out-of-Memory (OOM) error.
The figures below are derived from the formula: Max Concurrency = Total Token Capacity / Context Length per User. A worked example follows the two scenarios.
Scenario A: Heavy Workload (Document Analysis / RAG)
User Context: 4,096 Tokens (Filling the maximum window we configured).
Use Cases: Summarizing long PDFs and analyzing codebases.
Scenario B: Typical Chatbot (Customer Service)
User Context: 1,024 Tokens (Standard conversation history).
Use Cases: General Q&A, customer support bots.
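To make the arithmetic concrete, the sketch below applies this formula to both scenarios, using the total token capacities measured in this benchmark (roughly 17,000 tokens for BF16 and 193,000 for GPTQ-Int4; see the discussion below and the telemetry logs in the Methodology section). The variable names and layout are illustrative only.

```python
# Max Concurrency = Total Token Capacity / Context Length per User
# Token capacities are the measured KV-cache capacities from this benchmark.
TOKEN_CAPACITY = {"BF16": 17_000, "GPTQ-Int4": 193_000}
SCENARIOS = {"A: Heavy workload (RAG)": 4096, "B: Typical chatbot": 1024}

for precision, capacity in TOKEN_CAPACITY.items():
    for scenario, context_per_user in SCENARIOS.items():
        users = capacity // context_per_user
        print(f"{precision:9s} | {scenario:24s} | ~{users} concurrent users")
```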
Why this matters for “Reasoning” Models: Modern reasoning models (such as DeepSeek-R1 or Qwen-QwQ) generate massive amounts of internal “thought” tokens (often 2k-5k tokens) before giving a final answer.
- On BF16: A single reasoning request could easily consume the entire 17k capacity, causing the system to reject a second user.
- On INT4: The 193k capacity ensures plenty of room for multiple users to perform deep reasoning simultaneously.
Technical analysis of LLM quantization benchmark
The “Memory Wall” and throughput
The most significant finding is the 2.69x increase in throughput observed in the GPTQ-Int4 model. In LLM inference (particularly with a batch size of 1), the performance is bound by memory bandwidth rather than compute power.
By reducing the model size from 61GB to 18GB, the system transfers significantly less data per token generated, allowing the H100 GPU to utilize its compute resources more effectively.
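As a rough sanity check of the bandwidth argument, the sketch below estimates the theoretical ceiling on single-stream decode speed as memory bandwidth divided by the bytes streamed per generated token (approximately the weight footprint). The ~3.35 TB/s figure is the nominal H100 HBM3 bandwidth and is an assumption of this sketch (the PCIe variant is lower); real engines never reach the ceiling, but the ratio between formats is what matters.

```python
# Batch-size-1 decoding is roughly bandwidth-bound: each generated token requires
# streaming (approximately) the full set of model weights from HBM once.
HBM_BANDWIDTH_GB_S = 3350                                  # nominal H100 HBM3 bandwidth (assumption)
WEIGHT_FOOTPRINT_GB = {"BF16": 61.0, "GPTQ-Int4": 18.1}    # from the vLLM load logs below

ceiling = {fmt: HBM_BANDWIDTH_GB_S / gb for fmt, gb in WEIGHT_FOOTPRINT_GB.items()}
for fmt, tps in ceiling.items():
    print(f"{fmt}: at most ~{tps:.0f} tokens/s")

# The weight-size ratio (~3.4x) is an upper bound on the speedup; the measured 2.69x is
# lower because activations, KV-cache reads, and kernel overheads do not shrink with the weights.
print(f"theoretical speedup: ~{ceiling['GPTQ-Int4'] / ceiling['BF16']:.2f}x")
```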
Accuracy, retention, and robustness
Contrary to concerns regarding “model collapse” at lower precisions, the Qwen3-32B model exhibited remarkable resilience:
- Int8 Stability: The drop from BF16 to Int8 resulted in a negligible accuracy loss of 0.04%. This suggests that 8-bit precision is sufficient to capture the full dynamic range of the model’s weights.
- Int4 Viability: Even with aggressive 4-bit quantization, the model retained 98.1% of its baseline reasoning capability on the rigorous MMLU-Pro dataset.
Analogy: The “Real Estate” of GPU memory
To conceptualize the memory dynamics, one can view the 80GB GPU memory as a physical room:
- In the BF16 scenario, the model behaves like a massive piece of furniture occupying 76% of the room. This leaves space for only a few “guests” (users or long-term contexts) before the room reaches capacity.
- In the INT4 scenario, the model is compacted to occupy only 23% of the room. The remaining 77% is now open space. This does not merely “save” memory; it converts it into operational capacity, allowing the system to handle 12x longer conversations or serve 12x more simultaneous users.
Recommendations from our analysis
Based on the empirical data, we categorize the optimal deployment strategies as follows:
Cost-efficiency (economics of quantization)
Infrastructure cost is a primary concern for production deployment. Using the specific pricing of the NVIDIA H100 PCIe on RunPod ($2.39/hour), we calculated the raw generation cost per 1 million tokens; the arithmetic is sketched at the end of this subsection.
- Business impact: Switching from the baseline BF16 to GPTQ-Int4 reduces operational hardware costs by 63%. For an application processing 100 million tokens per month, this shift represents approximately $1,600 in monthly savings per GPU instance, purely by optimizing model precision.
Important Context on “Batch Size = 1”: These cost figures reflect a Low-Latency / Real-Time Chat scenario (Batch Size=1), where the GPU is optimized for speed rather than volume.
- In this scenario, the GPU computes for a single user while ignoring its parallel processing capacity, resulting in a higher “cost per token”.
- In a high-throughput Batch Processing scenario (e.g., Batch Size=32), the cost per 1M tokens would be significantly lower (likely <$1.00), but the relative savings between BF16 and INT4 would remain similar or increase due to INT4’s higher concurrency limit.
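For readers who want to reproduce the cost math, here is a minimal sketch that converts an hourly GPU price into a cost per million generated tokens. The throughput value is a placeholder to substitute with your own measurement; the $2.39/hour price and the 2.69x speedup are the figures used in this article.

```python
# cost per 1M tokens = hourly GPU price / tokens generated per hour * 1,000,000
GPU_PRICE_PER_HOUR = 2.39        # USD, RunPod H100 PCIe (price used in this article)

def cost_per_million_tokens(tokens_per_second: float) -> float:
    return GPU_PRICE_PER_HOUR / (tokens_per_second * 3600) * 1_000_000

baseline_tps = 25.0                    # placeholder BF16 single-stream throughput -- substitute your own
quantized_tps = baseline_tps * 2.69    # applying the measured GPTQ-Int4 speedup

bf16 = cost_per_million_tokens(baseline_tps)
int4 = cost_per_million_tokens(quantized_tps)
print(f"BF16: ${bf16:.2f}/1M tok | INT4: ${int4:.2f}/1M tok | savings: {1 - int4 / bf16:.0%}")  # ~63%
```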
What LLM quantization is and why it is important
Large language models are typically trained and served using floating-point formats such as BF16 or FP16. While these formats preserve numerical precision, they significantly increase memory usage and inference cost, especially at deployment time.
Quantization reduces this overhead by representing model weights using lower-precision formats such as INT8 or INT4. This is typically done via post-training quantization (PTQ), where a trained model is converted to a lower-precision format without retraining. This approach avoids the high cost of quantization-aware training while retaining most of the model’s accuracy.
Lower precision formats reduce:
- Model size
- Memory bandwidth requirements during inference
- KV cache pressure, which directly limits context length and concurrency
Modern PTQ methods, such as GPTQ, minimize accuracy loss by quantizing weights layer-by-layer while accounting for error propagation.
In contrast, quantization-aware training (QAT) typically yields better accuracy but is rarely used for large language models due to retraining cost and data requirements.
Fundamentals of numerical precision in LLMs
At the core of quantization is how numerical values are represented. Standard training relies on floating-point representation, typically using FP32 or half-precision formats such as FP16 or BF16. A floating-point number allocates bits to both range and precision, allowing it to represent high-precision values across a wide dynamic range.
Quantization replaces floating-point numbers with quantized values drawn from a smaller, discrete set. These quantized data types, such as INT8 or INT4, use fewer bits and therefore store fewer numerical values. The key challenge is to map high-precision values into a quantized range while controlling quantization error.
This mapping depends on quantization parameters (illustrated in a short sketch after this list), including:
- A scale factor that determines how floating-point values are scaled into a lower precision format.
- A zero point that shifts the quantized range to align with the original distribution.
- Minimum and maximum values that define the quantization range.
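Here is a minimal sketch of that mapping, assuming simple asymmetric (affine) quantization of a tensor to unsigned 8-bit integers; real libraries add refinements such as per-channel scales and clipping strategies.

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 8):
    """Map float values to integers using a scale factor and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1               # quantized range, e.g. [0, 255]
    x_min, x_max = float(x.min()), float(x.max())   # min/max define the float range
    scale = (x_max - x_min) / (qmax - qmin)         # float units per integer step
    zero_point = int(round(qmin - x_min / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = affine_quantize(w)
print("max round-trip error:", np.abs(w - dequantize(q, scale, zp)).max())
```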
Quantization paradigms
Quantization methods for large language models are commonly divided into post-training quantization and quantization-aware training.
Post-training quantization
Post-training quantization (PTQ) converts a trained full-precision model into a quantized version without additional fine-tuning. This approach is widely used for large language models because retraining via gradient descent is expensive and often impractical.
Within PTQ, two common approaches are used:
- Static quantization, where quantization parameters are fixed using a calibration dataset before inference.
- Dynamic quantization, where quantization parameters are computed on the fly based on input data.
PTQ typically focuses on weight quantization, since weights are more stable to quantize than activations. Activations exhibit a more variable dynamic range and depend on runtime input data, which makes them harder to quantize.
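The difference between static and dynamic quantization comes down to when the activation scale is computed. A minimal sketch, assuming a symmetric per-tensor scale:

```python
import numpy as np

def symmetric_scale(x: np.ndarray, num_bits: int = 8) -> float:
    # The largest magnitude maps to the edge of the signed integer range.
    return float(np.abs(x).max()) / (2 ** (num_bits - 1) - 1)

# Static quantization: the activation scale is frozen once, from a calibration set.
calibration_batch = np.random.randn(1024, 128).astype(np.float32)
static_scale = symmetric_scale(calibration_batch)

# Dynamic quantization: the scale is recomputed from every incoming activation tensor.
runtime_batch = np.random.randn(8, 128).astype(np.float32)
dynamic_scale = symmetric_scale(runtime_batch)

print(f"static (frozen): {static_scale:.4f}  |  dynamic (per input): {dynamic_scale:.4f}")
```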
Quantization-aware training
Quantization-aware training (QAT) introduces simulated quantization during training. The model learns to operate with quantized values while maintaining floating-point gradients for optimization.
Quantization-aware training generally produces better accuracy than PTQ, but it requires access to training data and additional computing power. For large language models, this cost often outweighs the benefits, which is why QAT is less common in practice.
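To show what “simulated quantization” means in practice, here is a minimal PyTorch-style sketch of a fake-quantization step with a straight-through estimator; it is a conceptual illustration, not a production QAT recipe.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Round weights to a low-precision grid in the forward pass, but let
    gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Forward uses the quantized value; backward sees an identity function.
    return w + (w_q - w).detach()

# Toy training step: the layer "experiences" quantization while still receiving gradients.
layer = torch.nn.Linear(16, 4)
x = torch.randn(32, 16)
out = torch.nn.functional.linear(x, fake_quantize(layer.weight), layer.bias)
loss = out.pow(2).mean()
loss.backward()
print(layer.weight.grad.shape)  # gradients exist despite the rounding in the forward pass
```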
Key quantization algorithms and formats
GPTQ
GPTQ is a post-training quantization method that quantizes weights layer-by-layer while minimizing quantization error. It uses second-order information to estimate how quantizing weights affects the model’s output.
GPTQ is commonly used to quantize INT4 weights in linear layers, producing a compact representation with relatively small accuracy loss.
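The full GPTQ algorithm uses blocked updates and Cholesky factorizations, but the core error-compensation idea can be sketched in a few lines. The code below is a heavily simplified, unoptimized illustration under our own simplifications (a single symmetric scale, no blocking), not the reference implementation.

```python
import numpy as np

def gptq_like_quantize(W: np.ndarray, X: np.ndarray, num_bits: int = 4, damp: float = 0.01) -> np.ndarray:
    """Quantize weight columns one at a time, pushing each column's quantization
    error onto the not-yet-quantized columns via the inverse Hessian."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    H = 2.0 * X.T @ X                                   # Hessian of the layer-wise squared error
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)      # dampening for numerical stability
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(W).max() / qmax                      # one crude symmetric scale for the sketch
    Q = np.zeros_like(W)

    for j in range(d_in):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate the remaining columns
    return Q

# Toy usage: W is [out_features, in_features], X holds calibration activations (rows = samples).
W, X = np.random.randn(8, 16), np.random.randn(64, 16)
Q = gptq_like_quantize(W, X)
print("mean weight change:", np.abs(W - Q).mean())
```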
AWQ
AWQ focuses on identifying important weight channels that disproportionately affect the model’s performance. Instead of uniformly quantizing all weights, it rescales selected channels before quantization.
This approach reduces error propagation while keeping the quantization process simpler than optimization-heavy methods.
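A toy sketch of that rescaling idea, under our own simplifications (activation magnitude as the importance proxy, a single smoothing exponent): salient input channels are scaled up before weight quantization, and the inverse scale is folded back so the layer output is mathematically unchanged.

```python
import numpy as np

def awq_like_rescale(W: np.ndarray, X: np.ndarray, alpha: float = 0.5):
    """Scale important weight columns up before quantization; fold the inverse
    scale into the activations so the product X @ W.T is unchanged."""
    importance = np.abs(X).mean(axis=0)      # per-input-channel activation magnitude
    s = importance ** alpha                  # smoothing exponent (searched per model in real AWQ)
    s /= s.mean()                            # keep scales centred around 1
    return W * s, X / s, s                   # W: [out, in]; scaling is applied per input channel

W, X = np.random.randn(8, 16), np.random.randn(64, 16)
W_scaled, X_scaled, s = awq_like_rescale(W, X)
print(np.allclose(X @ W.T, X_scaled @ W_scaled.T))   # True: the rescaling preserves layer outputs
```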
GGUF
GGUF is a model format designed to efficiently store quantized weights, particularly for CPU inference. It supports multiple quantization schemes and precision levels, allowing models to be deployed at a precision suited to the available hardware, including machines with limited compute.
GGUF emphasizes practical deployment rather than training-time optimization.
Methodology
This benchmark was designed to ensure rigorous, apples-to-apples comparability between models by controlling all variables except the model precision format.
1. Hardware & environment
- Compute: Single NVIDIA H100 80GB HBM3 GPU.
- Environment: RunPod Cloud Container (Ubuntu 22.04).
- Drivers: CUDA 12.4.1, PyTorch 2.4.0.
- Inference Engine: vLLM (v0.6.3+), utilizing PagedAttention and CUDA Graphs.
- Evaluation Framework: lm-evaluation-harness (EleutherAI).
2. Controlled variables (Standardization)
To eliminate external factors affecting performance metrics, the following parameters were fixed across all tests: maximum context length (4,096 tokens), output token limit (256), and sampling settings (temperature 0.7, top-p 0.9).
3. Evaluation datasets
- Performance (Speed): Measured using a custom script generating 500 iterations of text generation to calculate stable Mean/P95 latency and Throughput.
- Accuracy (Intelligence): Measured using the MMLU-Pro benchmark, a harder, more reasoning-focused extension of the Massive Multitask Language Understanding (MMLU) suite.
- Scope: Over 12,000 questions covering Physics, Chemistry, Biology, Math, Computer Science, Psychology, Law, and others.
- Protocol: 5-Shot (providing 5 solved examples in the context window before the question) to evaluate in-context learning capabilities.
Performance benchmarking protocol (speed test)
Unlike static dataset evaluations, the throughput and latency metrics were derived from a live generation test designed to simulate real-world usage patterns.
- Prompt strategy: We used a Rotational Sampling method with 5 distinct prompts across diverse domains (Science, Coding, General Knowledge) to prevent caching bias.
- Warmup phase: Prior to measurement, 10 warmup iterations were executed to stabilize GPU clock frequencies and allow vLLM to compile necessary CUDA graphs.
- Measurement phase: The model generated text for a defined number of iterations (default: 500). Timing was captured using Python’s high-precision time.perf_counter(). A simplified reconstruction of this loop appears after the parameter list below.
Test prompts include:
- “Explain the theory of relativity in simple terms.” (Science/Abstract)
- “Write a Python function to find the longest palindromic substring.” (Coding)
- “What are the main causes of climate change and their effects?” (Complex Reasoning)
- “Describe the process of photosynthesis step by step.” (Process Description)
- “How does a neural network learn from data?” (Technical Explanation)
Generation parameters are:
- Input Length: ~15-25 tokens (variable based on prompt).
- Output Limit: Fixed at 256 tokens to standardize the workload per request.
- Sampling: Temperature 0.7, Top-P 0.9.
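The measurement loop can be sketched roughly as follows. This is a minimal reconstruction of the protocol described above (rotational prompt sampling, 10 warmup iterations, 500 timed iterations, perf_counter timing, the listed sampling parameters), not the full benchmark script.

```python
import time
from itertools import cycle
from vllm import LLM, SamplingParams

PROMPTS = [
    "Explain the theory of relativity in simple terms.",
    "Write a Python function to find the longest palindromic substring.",
    "What are the main causes of climate change and their effects?",
    "Describe the process of photosynthesis step by step.",
    "How does a neural network learn from data?",
]

llm = LLM(model="Qwen/Qwen3-32B", max_model_len=4096)       # swap in the quantized checkpoints to compare
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompt_cycle = cycle(PROMPTS)                               # rotational sampling to avoid caching bias

for _ in range(10):                                         # warmup: stabilize clocks, build CUDA graphs
    llm.generate([next(prompt_cycle)], params)

latencies, tokens = [], 0
for _ in range(500):                                        # measurement phase
    start = time.perf_counter()
    out = llm.generate([next(prompt_cycle)], params)
    latencies.append(time.perf_counter() - start)
    tokens += len(out[0].outputs[0].token_ids)

latencies.sort()
print(f"mean latency: {sum(latencies) / len(latencies):.3f}s, "
      f"p95: {latencies[int(0.95 * len(latencies))]:.3f}s, "
      f"throughput: {tokens / sum(latencies):.1f} tok/s")
```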
4. Data verification: Runtime telemetry
The memory capacity figures presented above were derived directly from the vLLM engine initialization logs during the benchmark execution.
Evidence 1: BF16 Initialization
INFO … Loading weights took 23.45 seconds
INFO … Model loading took 61.0347 GiB memory
INFO … Available KV cache memory: 4.38 GiB
INFO … Maximum concurrency for 4,096 tokens per request: 4.38x
Evidence 2: GPTQ-Int4 Initialization
INFO … Loading weights took 8.76 seconds
INFO … Model loading took 18.1423 GiB memory
INFO … Available KV cache memory: 47.28 GiB
INFO … Maximum concurrency for 4,096 tokens per request: 47.27x
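These concurrency figures can be cross-checked by hand from the KV-cache sizes in the logs and the model’s attention geometry. The layer and head counts below reflect the published Qwen3-32B configuration as we use it here (64 layers, 8 KV heads of dimension 128, 16-bit KV cache); treat them as an assumption of this sketch.

```python
# Reproduce the "Maximum concurrency" figures from the KV-cache sizes in the logs above.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 64, 8, 128, 2            # assumed Qwen3-32B config; 16-bit KV cache
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES   # keys and values for every layer

for name, kv_gib in [("BF16", 4.38), ("GPTQ-Int4", 47.28)]:
    capacity_tokens = kv_gib * 1024**3 / bytes_per_token
    print(f"{name}: ~{capacity_tokens:,.0f} tokens -> {capacity_tokens / 4096:.2f}x at 4,096 tokens/request")
# Prints ~17,940 tokens (4.38x) and ~193,660 tokens (47.28x), matching the logs up to rounding.
```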
5. Models evaluated
All models are based on the Qwen/Qwen3-32B architecture:
- BF16: Qwen/Qwen3-32B (Original weights, loaded as bfloat16).
- FP8: Qwen/Qwen3-32B-FP8 (Official FP8 quantization).
- INT8: JunHowie/Qwen3-32B-GPTQ-Int8 (Post-training quantization via AutoGPTQ).
- INT4: JunHowie/Qwen3-32B-GPTQ-Int4 (4-bit quantization).
Limitations of the LLM quantization benchmark
While this benchmark provides evidence for quantization efficiency, the following limitations should be considered:
- Single-stream focus: Tests were conducted with a Batch Size of 1 to measure pure latency. In high-throughput scenarios (Batch Size > 64), the performance gap between INT4 and BF16 would likely be significantly larger due to memory bandwidth saturation.
- Hardware specificity: Results are based on the NVIDIA H100 architecture. Older generations (A100, A10) lack native FP8 support and might exhibit different performance characteristics for that specific format.
- Quantization scope: We focused on GPTQ and Native FP8. Other methods like AWQ (Activation-aware Weight Quantization) or BitsAndBytes NF4 might offer different trade-offs, though GPTQ is generally considered the standard for production serving.
- Model family: Results are specific to the Qwen3 architecture (Dense). Mixture-of-Experts (MoE) models or models with different activation functions might react differently to aggressive quantization.
💡 Conclusion
This study confirms that precision is no longer the primary constraint for high-performance LLM deployment. The Qwen3-32B model demonstrates that modern quantization techniques, specifically GPTQ, can effectively decouple model intelligence from model size.
Moving from BF16 to GPTQ-Int4 unlocks a ~2.7x increase in throughput and an order-of-magnitude increase in effective context capacity (from roughly 17k to 193k tokens), with minimal impact on reasoning accuracy. For enterprise applications running on H100 hardware, deploying uncompressed 16-bit models is increasingly difficult to justify on both cost and performance grounds.