Quantization reduces LLM inference cost by running models at lower numerical precision.
We benchmarked four precision formats of Qwen3-32B on a single H100 GPU, running over 2,000 inference passes and more than 12,000 MMLU-Pro questions to measure the real-world trade-offs between speed, memory, and accuracy.
LLM quantization benchmark results
1. Performance and accuracy metrics
The following table summarizes the core performance indicators observed during the evaluation.
2. Effective memory capacity analysis
Standard GPU monitoring tools (e.g., NVIDIA-SMI) often report near-full memory utilization regardless of model size, due to the pre-allocation strategy of inference engines such as vLLM. The table below decomposes the actual memory usage to reveal the effective capacity available for the Key-Value (KV) Cache, which dictates the maximum context length and concurrency.
3. Concurrency analysis: How many users can we serve?
The “Max Concurrency” figures represent the memory-bound limit on the number of active users the GPU can hold in its “Working Memory” (KV Cache) simultaneously before crashing with an Out-of-Memory (OOM) error.
The figures below are derived from the formula: Max Concurrency = Total Token Capacity / Context Length per User. A worked example follows the two scenarios.
Scenario A: Heavy Workload (Document Analysis / RAG)
User Context: 4,096 Tokens (Filling the maximum window we configured).
Use Cases: Summarizing long PDFs and analyzing codebases.
Scenario B: Typical Chatbot (Customer Service)
User Context: 1,024 Tokens (Standard conversation history).
Use Cases: General Q&A, customer support bots.
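To make the arithmetic concrete, the sketch below applies this formula to both scenarios, using the total token capacities measured in this benchmark (roughly 17,000 tokens for BF16 and 193,000 for GPTQ-Int4; see the discussion below and the telemetry logs in the Methodology section). The variable names and layout are illustrative only.

```python
# Max Concurrency = Total Token Capacity / Context Length per User
# Token capacities are the measured KV-cache capacities from this benchmark.
TOKEN_CAPACITY = {"BF16": 17_000, "GPTQ-Int4": 193_000}
SCENARIOS = {"A: Heavy workload (RAG)": 4096, "B: Typical chatbot": 1024}

for precision, capacity in TOKEN_CAPACITY.items():
    for scenario, context_per_user in SCENARIOS.items():
        users = capacity // context_per_user
        print(f"{precision:9s} | {scenario:24s} | ~{users} concurrent users")
```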
Why this matters for “Reasoning” Models: Modern reasoning models (such as DeepSeek-R1 or Qwen-QwQ) generate massive amounts of internal “thought” tokens (often 2k-5k tokens) before giving a final answer.
- On BF16: A single reasoning request could easily consume the entire 17k capacity, causing the system to reject a second user.
- On INT4: The 193k capacity ensures plenty of room for multiple users to perform deep reasoning simultaneously.
Technical analysis of LLM quantization benchmark
The “Memory Wall” and throughput
The most significant finding is the 2.69x increase in throughput observed in the GPTQ-Int4 model. In LLM inference (particularly with a batch size of 1), the performance is bound by memory bandwidth rather than compute power.
By reducing the model size from 61GB to 18GB, the system transfers significantly less data per token generated, allowing the H100 GPU to utilize its compute resources more effectively.
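As a rough sanity check of the bandwidth argument, the sketch below estimates the theoretical ceiling on single-stream decode speed as memory bandwidth divided by the bytes streamed per generated token (approximately the weight footprint). The ~3.35 TB/s figure is the nominal H100 HBM3 bandwidth and is an assumption of this sketch (the PCIe variant is lower); real engines never reach the ceiling, but the ratio between formats is what matters.

```python
# Batch-size-1 decoding is roughly bandwidth-bound: each generated token requires
# streaming (approximately) the full set of model weights from HBM once.
HBM_BANDWIDTH_GB_S = 3350                                  # nominal H100 HBM3 bandwidth (assumption)
WEIGHT_FOOTPRINT_GB = {"BF16": 61.0, "GPTQ-Int4": 18.1}    # from the vLLM load logs below

ceiling = {fmt: HBM_BANDWIDTH_GB_S / gb for fmt, gb in WEIGHT_FOOTPRINT_GB.items()}
for fmt, tps in ceiling.items():
    print(f"{fmt}: at most ~{tps:.0f} tokens/s")

# The weight-size ratio (~3.4x) is an upper bound on the speedup; the measured 2.69x is
# lower because activations, KV-cache reads, and kernel overheads do not shrink with the weights.
print(f"theoretical speedup: ~{ceiling['GPTQ-Int4'] / ceiling['BF16']:.2f}x")
```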
Accuracy, retention, and robustness
Contrary to concerns regarding “model collapse” at lower precisions, the Qwen3-32B model exhibited remarkable resilience:
- Int8 Stability: The drop from BF16 to Int8 resulted in a negligible accuracy loss of 0.04%. This suggests that 8-bit precision is sufficient to capture the full dynamic range of the model’s weights.
- Int4 Viability: Even with aggressive 4-bit quantization, the model retained 98.1% of its baseline reasoning capability on the rigorous MMLU-Pro dataset.
Analogy: The “Real Estate” of GPU memory
To conceptualize the memory dynamics, one can view the 80GB GPU memory as a physical room:
- In the BF16 scenario, the model behaves like a massive piece of furniture occupying 76% of the room. This leaves space for only a few “guests” (users or long-term contexts) before the room reaches capacity.
- In the INT4 scenario, the model is compacted to occupy only 23% of the room. The remaining 77% is now open space. This does not merely “save” memory; it converts it into operational capacity, allowing the system to handle 12x longer conversations or serve 12x more simultaneous users.
Recommendations from our analysis
Based on the empirical data, we categorize the optimal deployment strategies as follows:
Cost-efficiency (economics of quantization)
Infrastructure cost is a primary concern for production deployment. Using the specific pricing of the NVIDIA H100 PCIe on RunPod ($2.39/hour), we calculated the raw generation cost per 1 million tokens; the arithmetic is sketched at the end of this subsection.
- Business impact: Switching from the baseline BF16 to GPTQ-Int4 reduces operational hardware costs by 63%. For an application processing 100 million tokens per month, this shift represents approximately $1,600 in monthly savings per GPU instance, purely by optimizing model precision.
Important Context on “Batch Size = 1”: These cost figures reflect a Low-Latency / Real-Time Chat scenario (Batch Size=1), where the GPU is optimized for speed rather than volume.
- In this scenario, the GPU computes for a single user while ignoring its parallel processing capacity, resulting in a higher “cost per token”.
- In a high-throughput Batch Processing scenario (e.g., Batch Size=32), the cost per 1M tokens would be significantly lower (likely <$1.00), but the relative savings between BF16 and INT4 would remain similar or increase due to INT4’s higher concurrency limit.
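For readers who want to reproduce the cost math, here is a minimal sketch that converts an hourly GPU price into a cost per million generated tokens. The throughput value is a placeholder to substitute with your own measurement; the $2.39/hour price and the 2.69x speedup are the figures used in this article.

```python
# cost per 1M tokens = hourly GPU price / tokens generated per hour * 1,000,000
GPU_PRICE_PER_HOUR = 2.39        # USD, RunPod H100 PCIe (price used in this article)

def cost_per_million_tokens(tokens_per_second: float) -> float:
    return GPU_PRICE_PER_HOUR / (tokens_per_second * 3600) * 1_000_000

baseline_tps = 25.0                    # placeholder BF16 single-stream throughput -- substitute your own
quantized_tps = baseline_tps * 2.69    # applying the measured GPTQ-Int4 speedup

bf16 = cost_per_million_tokens(baseline_tps)
int4 = cost_per_million_tokens(quantized_tps)
print(f"BF16: ${bf16:.2f}/1M tok | INT4: ${int4:.2f}/1M tok | savings: {1 - int4 / bf16:.0%}")  # ~63%
```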
What LLM quantization is and why it is important
Large language models are typically trained and served using floating-point formats such as BF16 or FP16. While these formats preserve numerical precision, they significantly increase memory usage and inference cost, especially at deployment time.
Quantization reduces this overhead by representing model weights using lower-precision formats such as INT8 or INT4. This is typically done via post-training quantization (PTQ), where a trained model is converted to a lower-precision format without retraining. This approach avoids the high cost of quantization-aware training while retaining most of the model’s accuracy.
Lower precision formats reduce:
- Model size
- Memory bandwidth requirements during inference
- KV cache pressure, which directly limits context length and concurrency
Modern PTQ methods, such as GPTQ, minimize accuracy loss by quantizing weights layer-by-layer while accounting for error propagation.
In contrast, quantization-aware training (QAT) typically yields better accuracy but is rarely used for large language models due to retraining cost and data requirements.
Fundamentals of numerical precision in LLMs
At the core of quantization is how numerical values are represented. Standard training relies on floating-point representation, typically using FP32 or half-precision formats such as FP16 or BF16. A floating-point number allocates bits to both range and precision, allowing it to represent high-precision values across a wide dynamic range.
Quantization replaces floating-point numbers with quantized values drawn from a smaller, discrete set. These quantized data types, such as INT8 or INT4, use fewer bits and therefore store fewer numerical values. The key challenge is to map high-precision values into a quantized range while controlling quantization error.
This mapping depends on quantization parameters (illustrated in a short sketch after this list), including:
- A scale factor that determines how floating-point values are scaled into a lower precision format.
- A zero point that shifts the quantized range to align with the original distribution.
- Minimum and maximum values that define the quantization range.
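Here is a minimal sketch of that mapping, assuming simple asymmetric (affine) quantization of a tensor to unsigned 8-bit integers; real libraries add refinements such as per-channel scales and clipping strategies.

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 8):
    """Map float values to integers using a scale factor and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1               # quantized range, e.g. [0, 255]
    x_min, x_max = float(x.min()), float(x.max())   # min/max define the float range
    scale = (x_max - x_min) / (qmax - qmin)         # float units per integer step
    zero_point = int(round(qmin - x_min / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = affine_quantize(w)
print("max round-trip error:", np.abs(w - dequantize(q, scale, zp)).max())
```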
Quantization paradigms
Quantization methods for large language models are commonly divided into post-training quantization and quantization-aware training.
Post-training quantization
Post-training quantization (PTQ) converts a trained full-precision model into a quantized version without additional fine-tuning. This approach is widely used for large language models because retraining via gradient descent is expensive and often impractical.
Within PTQ, two common approaches are used:
- Static quantization, where quantization parameters are fixed using a calibration dataset before inference.
- Dynamic quantization, where quantization parameters are computed on the fly based on input data.
PTQ typically focuses on weight quantization, since weights are more stable to quantize than activations. Activations exhibit a more variable dynamic range and depend on runtime input data, which makes them harder to quantize.
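The difference between static and dynamic quantization comes down to when the activation scale is computed. A minimal sketch, assuming a symmetric per-tensor scale:

```python
import numpy as np

def symmetric_scale(x: np.ndarray, num_bits: int = 8) -> float:
    # The largest magnitude maps to the edge of the signed integer range.
    return float(np.abs(x).max()) / (2 ** (num_bits - 1) - 1)

# Static quantization: the activation scale is frozen once, from a calibration set.
calibration_batch = np.random.randn(1024, 128).astype(np.float32)
static_scale = symmetric_scale(calibration_batch)

# Dynamic quantization: the scale is recomputed from every incoming activation tensor.
runtime_batch = np.random.randn(8, 128).astype(np.float32)
dynamic_scale = symmetric_scale(runtime_batch)

print(f"static (frozen): {static_scale:.4f}  |  dynamic (per input): {dynamic_scale:.4f}")
```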
Quantization-aware training
Quantization-aware training (QAT) introduces simulated quantization during training. The model learns to operate with quantized values while maintaining floating-point gradients for optimization.
Quantization-aware training generally produces better accuracy than PTQ, but it requires access to training data and additional computing power. For large language models, this cost often outweighs the benefits, which is why QAT is less common in practice.
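To show what “simulated quantization” means in practice, here is a minimal PyTorch-style sketch of a fake-quantization step with a straight-through estimator; it is a conceptual illustration, not a production QAT recipe.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Round weights to a low-precision grid in the forward pass, but let
    gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Forward uses the quantized value; backward sees an identity function.
    return w + (w_q - w).detach()

# Toy training step: the layer "experiences" quantization while still receiving gradients.
layer = torch.nn.Linear(16, 4)
x = torch.randn(32, 16)
out = torch.nn.functional.linear(x, fake_quantize(layer.weight), layer.bias)
loss = out.pow(2).mean()
loss.backward()
print(layer.weight.grad.shape)  # gradients exist despite the rounding in the forward pass
```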
Key quantization algorithms and formats
GPTQ
GPTQ is a post-training quantization method that quantizes weights layer-by-layer while minimizing quantization error. It uses second-order information to estimate how quantizing weights affects the model’s output.
GPTQ is commonly used to quantize INT4 weights in linear layers, producing a compact representation with relatively small accuracy loss.
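The full GPTQ algorithm uses blocked updates and Cholesky factorizations, but the core error-compensation idea can be sketched in a few lines. The code below is a heavily simplified, unoptimized illustration under our own simplifications (a single symmetric scale, no blocking), not the reference implementation.

```python
import numpy as np

def gptq_like_quantize(W: np.ndarray, X: np.ndarray, num_bits: int = 4, damp: float = 0.01) -> np.ndarray:
    """Quantize weight columns one at a time, pushing each column's quantization
    error onto the not-yet-quantized columns via the inverse Hessian."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    H = 2.0 * X.T @ X                                   # Hessian of the layer-wise squared error
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)      # dampening for numerical stability
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(W).max() / qmax                      # one crude symmetric scale for the sketch
    Q = np.zeros_like(W)

    for j in range(d_in):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate the remaining columns
    return Q

# Toy usage: W is [out_features, in_features], X holds calibration activations (rows = samples).
W, X = np.random.randn(8, 16), np.random.randn(64, 16)
Q = gptq_like_quantize(W, X)
print("mean weight change:", np.abs(W - Q).mean())
```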
AWQ
AWQ focuses on identifying important weight channels that disproportionately affect the model’s performance. Instead of uniformly quantizing all weights, it rescales selected channels before quantization.
This approach reduces error propagation while keeping the quantization process simpler than optimization-heavy methods.
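A toy sketch of that rescaling idea, under our own simplifications (activation magnitude as the importance proxy, a single smoothing exponent): salient input channels are scaled up before weight quantization, and the inverse scale is folded back so the layer output is mathematically unchanged.

```python
import numpy as np

def awq_like_rescale(W: np.ndarray, X: np.ndarray, alpha: float = 0.5):
    """Scale important weight columns up before quantization; fold the inverse
    scale into the activations so the product X @ W.T is unchanged."""
    importance = np.abs(X).mean(axis=0)      # per-input-channel activation magnitude
    s = importance ** alpha                  # smoothing exponent (searched per model in real AWQ)
    s /= s.mean()                            # keep scales centred around 1
    return W * s, X / s, s                   # W: [out, in]; scaling is applied per input channel

W, X = np.random.randn(8, 16), np.random.randn(64, 16)
W_scaled, X_scaled, s = awq_like_rescale(W, X)
print(np.allclose(X @ W.T, X_scaled @ W_scaled.T))   # True: the rescaling preserves layer outputs
```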
GGUF
GGUF is a model format designed to efficiently store quantized weights, particularly for CPU inference. It supports multiple quantization schemes and precision levels, allowing models to be deployed at a precision suited to the available hardware, including machines with limited compute.
GGUF emphasizes practical deployment rather than training-time optimization.
Methodology
This benchmark was designed to ensure rigorous, apples-to-apples comparability between models by controlling all variables except the model precision format.
1. Hardware & environment
- Compute: Single NVIDIA H100 80GB HBM3 GPU.
- Environment: RunPod Cloud Container (Ubuntu 22.04).
- Drivers: CUDA 12.4.1, PyTorch 2.4.0.
- Inference Engine: vLLM (v0.6.3+), utilizing PagedAttention and CUDA Graphs.
- Evaluation Framework: lm-evaluation-harness (EleutherAI).
2. Controlled variables (Standardization)
To eliminate external factors affecting performance metrics, the following parameters were fixed across all tests: maximum context length (4,096 tokens), output token limit (256), and sampling settings (temperature 0.7, top-p 0.9).
3. Evaluation datasets
- Performance (Speed): Measured using a custom script generating 500 iterations of text generation to calculate stable Mean/P95 latency and Throughput.
- Accuracy (Intelligence): Measured using the MMLU-Pro benchmark, a harder, more reasoning-focused extension of the Massive Multitask Language Understanding (MMLU) suite.
- Scope: Over 12,000 questions covering Physics, Chemistry, Biology, Math, Computer Science, Psychology, Law, and others.
- Protocol: 5-Shot (providing 5 solved examples in the context window before the question) to evaluate in-context learning capabilities.
Performance benchmarking protocol (speed test)
Unlike static dataset evaluations, the throughput and latency metrics were derived from a live generation test designed to simulate real-world usage patterns.
- Prompt strategy: We used a Rotational Sampling method with 5 distinct prompts across diverse domains (Science, Coding, General Knowledge) to prevent caching bias.
- Warmup phase: Prior to measurement, 10 warmup iterations were executed to stabilize GPU clock frequencies and allow vLLM to compile necessary CUDA graphs.
- Measurement phase: The model generated text for a defined number of iterations (default: 500). Timing was captured using Python’s high-precision time.perf_counter(). A simplified reconstruction of this loop appears after the parameter list below.
Test prompts include:
- “Explain the theory of relativity in simple terms.” (Science/Abstract)
- “Write a Python function to find the longest palindromic substring.” (Coding)
- “What are the main causes of climate change and their effects?” (Complex Reasoning)
- “Describe the process of photosynthesis step by step.” (Process Description)
- “How does a neural network learn from data?” (Technical Explanation)
Generation parameters are:
- Input Length: ~15-25 tokens (variable based on prompt).
- Output Limit: Fixed at 256 tokens to standardize the workload per request.
- Sampling: Temperature 0.7, Top-P 0.9.
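The measurement loop can be sketched roughly as follows. This is a minimal reconstruction of the protocol described above (rotational prompt sampling, 10 warmup iterations, 500 timed iterations, perf_counter timing, the listed sampling parameters), not the full benchmark script.

```python
import time
from itertools import cycle
from vllm import LLM, SamplingParams

PROMPTS = [
    "Explain the theory of relativity in simple terms.",
    "Write a Python function to find the longest palindromic substring.",
    "What are the main causes of climate change and their effects?",
    "Describe the process of photosynthesis step by step.",
    "How does a neural network learn from data?",
]

llm = LLM(model="Qwen/Qwen3-32B", max_model_len=4096)       # swap in the quantized checkpoints to compare
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompt_cycle = cycle(PROMPTS)                               # rotational sampling to avoid caching bias

for _ in range(10):                                         # warmup: stabilize clocks, build CUDA graphs
    llm.generate([next(prompt_cycle)], params)

latencies, tokens = [], 0
for _ in range(500):                                        # measurement phase
    start = time.perf_counter()
    out = llm.generate([next(prompt_cycle)], params)
    latencies.append(time.perf_counter() - start)
    tokens += len(out[0].outputs[0].token_ids)

latencies.sort()
print(f"mean latency: {sum(latencies) / len(latencies):.3f}s, "
      f"p95: {latencies[int(0.95 * len(latencies))]:.3f}s, "
      f"throughput: {tokens / sum(latencies):.1f} tok/s")
```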
4. Data verification: Runtime telemetry
The memory capacity figures presented above were derived directly from the vLLM engine initialization logs during the benchmark execution.
Evidence 1: BF16 Initialization
INFO … Loading weights took 23.45 seconds
INFO … Model loading took 61.0347 GiB memory
INFO … Available KV cache memory: 4.38 GiB
INFO … Maximum concurrency for 4,096 tokens per request: 4.38x
Evidence 2: GPTQ-Int4 Initialization
INFO … Loading weights took 8.76 seconds
INFO … Model loading took 18.1423 GiB memory
INFO … Available KV cache memory: 47.28 GiB
INFO … Maximum concurrency for 4,096 tokens per request: 47.27x
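These concurrency figures can be cross-checked by hand from the KV-cache sizes in the logs and the model’s attention geometry. The layer and head counts below reflect the published Qwen3-32B configuration as we use it here (64 layers, 8 KV heads of dimension 128, 16-bit KV cache); treat them as an assumption of this sketch.

```python
# Reproduce the "Maximum concurrency" figures from the KV-cache sizes in the logs above.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 64, 8, 128, 2            # assumed Qwen3-32B config; 16-bit KV cache
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES   # keys and values for every layer

for name, kv_gib in [("BF16", 4.38), ("GPTQ-Int4", 47.28)]:
    capacity_tokens = kv_gib * 1024**3 / bytes_per_token
    print(f"{name}: ~{capacity_tokens:,.0f} tokens -> {capacity_tokens / 4096:.2f}x at 4,096 tokens/request")
# Prints ~17,940 tokens (4.38x) and ~193,660 tokens (47.28x), matching the logs up to rounding.
```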
5. Models evaluated
All models are based on the Qwen/Qwen3-32B architecture:
- BF16: Qwen/Qwen3-32B (Original weights, loaded as bfloat16).
- FP8: Qwen/Qwen3-32B-FP8 (Official FP8 quantization).
- INT8: JunHowie/Qwen3-32B-GPTQ-Int8 (Post-training quantization via AutoGPTQ).
- INT4: JunHowie/Qwen3-32B-GPTQ-Int4 (4-bit quantization).
Limitations of the LLM quantization benchmark
While this benchmark provides evidence for quantization efficiency, the following limitations should be considered:
- Single-stream focus: Tests were conducted with a Batch Size of 1 to measure pure latency. In high-throughput scenarios (Batch Size > 64), the performance gap between INT4 and BF16 would likely be significantly larger due to memory bandwidth saturation.
- Hardware specificity: Results are based on the NVIDIA H100 architecture. Older generations (A100, A10) lack native FP8 support and might exhibit different performance characteristics for that specific format.
- Quantization scope: We focused on GPTQ and Native FP8. Other methods like AWQ (Activation-aware Weight Quantization) or BitsAndBytes NF4 might offer different trade-offs, though GPTQ is generally considered the standard for production serving.
- Model family: Results are specific to the Qwen3 architecture (Dense). Mixture-of-Experts (MoE) models or models with different activation functions might react differently to aggressive quantization.
💡 Conclusion
This study confirms that precision is no longer the primary constraint for high-performance LLM deployment. The Qwen3-32B model demonstrates that modern quantization techniques, specifically GPTQ, can effectively decouple model intelligence from model size.
Moving from BF16 to GPTQ-Int4 unlocks a ~2.7x increase in throughput and an order-of-magnitude increase in effective context capacity (from roughly 17k to 193k tokens), with minimal impact on reasoning accuracy. For enterprise applications running on H100 hardware, deploying uncompressed 16-bit models is increasingly difficult to justify on both cost and performance grounds.