The effectiveness of large language models (LLMs) is determined not only by their accuracy and capabilities but also by the speed at which they engage with users.
We benchmarked the performance of leading language models across various use cases, measuring their response times to user input. From the time it takes to generate the first word to the pace of the complete response, the results reveal which models deliver faster interactions when it counts.
LLM latency benchmark results
We benchmarked the latency performance of the following large language models: GPT-5.2, Mistral Large 2512, Claude 4.5 Sonnet, Grok 4.1 Fast Reasoning, and DeepSeek V3.2. We focus on two key metrics: First Token Latency, the time it takes for the model to start generating the first token of a response, and Per Token Latency, the time taken to generate each token throughout the response.
You can see our methodology below.
Performance analysis by use case
We observe that latency varies with task type, indicating that these models exhibit different performance profiles across use cases.
Q&A
In Q&A scenarios, such as customer support, virtual assistants, and enterprise knowledge tools, speed and response times directly impact user experience.
- Mistral Large 2512 delivers the fastest initial response, with a first-token latency of 0.30 seconds, making it ideal for live support systems that require immediate answers. Its per-token latency of 0.025 seconds offers excellent efficiency for generating responses of any length.
- GPT-5.2 follows closely with a first token latency of 0.60 seconds and a per-token latency of 0.020 seconds. While slightly slower to start, its lower per-token latency makes it highly efficient for longer, more detailed responses.
- Claude 4.5 Sonnet, with a first-token latency of 2 seconds and a per-token latency of 0.030 seconds, shows moderate initial responsiveness. The delay before the first token can impact real-time interactions, though its steady generation speed maintains reasonable overall performance.
- Grok 4.1 Fast Reasoning has a first-token latency of 3 seconds and an excellent per-token latency of 0.010 seconds. Despite the slower start, once generation begins, it produces tokens extremely quickly, making it suitable for applications where total generation time matters more than immediate response.
- DeepSeek V3.2, with a first-token latency of 7 seconds and a per-token latency of 0.032 seconds, is the slowest model overall. The significant wait before the first token makes it less suitable for speed-critical Q&A systems.
Summary generation
The summary generation use case plays a critical role in applications where users need to quickly grasp long texts. For example, in scenarios where customer service teams need to summarize a call recording within seconds and take action, the first token latency directly impacts the user experience.
- Mistral Large 2512 leads with a first-token latency of 0.45 seconds and a per-token latency of 0.025 seconds, making it an effective option for scenarios requiring quick document summarization.
- GPT-5.2 follows with a first token latency of 0.60 seconds and the fastest per-token latency at 0.020 seconds, allowing it to maintain speed even with longer content.
- Claude 4.5 Sonnet has a slower initial response, with a first-token latency of 2 seconds. However, its per-token latency of 0.030 seconds still provides decent overall performance for summarization tasks.
- Grok 4.1 Fast Reasoning shows a first token latency of 4 seconds but compensates with an excellent per-token latency of 0.010 seconds, making it efficient once generation begins.
- DeepSeek V3.2 stands out as the slowest model, with a first-token latency of 7.5 seconds and a per-token latency of 0.025 seconds.
Language translation
Based on our benchmark, translation tasks reveal interesting performance trade-offs between initial response time and sustained generation speed.
- Mistral Large 2512 delivers the fastest initial response, with a first-token latency of 0.40 seconds and a per-token latency of 0.020 seconds, making it ideal for real-time translation scenarios.
- GPT-5.2 starts at 0.55 seconds with the lowest per-token latency at 0.010 seconds, providing exceptional efficiency for longer translations once generation begins.
- Claude 4.5 Sonnet, with a first-token latency of 2 seconds and a per-token latency of 0.015 seconds, balances moderate initial responsiveness with strong sustained generation speed.
- Grok 4.1 Fast Reasoning has a first-token latency of 6 seconds but maintains an excellent per-token latency of 0.005 seconds, the fastest in this category, making it highly efficient for batch translation tasks.
- DeepSeek V3.2 exhibits the highest first-token latency at 7.5 seconds, with a per-token latency of 0.025 seconds, limiting its applicability in time-sensitive translation workflows.
Business analysis
Based on the results we observed in the business analysis use case, the models exhibit varied performance profiles suited to different analytical scenarios.
- Mistral Large 2512 delivers a strong initial response, with a first-token latency of 0.40 seconds, though its per-token latency of 0.040 seconds is higher than in other use cases. It remains suitable for routine business analysis tasks.
- GPT-5.2 starts at 0.50 seconds with a per-token latency of 0.020 seconds, making it suitable for business analysis tasks that require both quick starts and efficient longer outputs, such as daily reports or dashboards.
- Claude 4.5 Sonnet responds with a first token latency of 2 seconds and a per-token latency of 0.035 seconds. While the initial delay can cause lags in real-time workflows, it provides consistent output speed for batch data reviews or scheduled reporting.
- Grok 4.1 Fast Reasoning shows a first token latency of 4 seconds but maintains excellent per-token efficiency at 0.010 seconds, making it effective for comprehensive analytical reports where total completion time matters more than immediate response.
- DeepSeek V3.2 is the slowest model, with a first-token latency of 8 seconds and a per-token latency of 0.030 seconds, making it less suitable for time-sensitive business analysis scenarios.
Coding
Coding tasks reveal distinct performance characteristics, with models optimized for different aspects of code generation.
- Mistral Large 2512 had the lowest first-token latency at 0.30 seconds, with a per-token latency of 0.025 seconds, making it the fastest model to start generating code and maintain solid throughput throughout.
- GPT-5.2 followed with a first token latency of 0.50 seconds and the best per-token latency at 0.015 seconds. This combination allows GPT-5.2 to quickly catch up after a slightly slower start, making it highly efficient at handling longer or more complex coding tasks where sustained token-generation speed matters.
- Claude 4.5 Sonnet, with a first-token latency of 2 seconds and a per-token latency of 0.028 seconds, demonstrated moderate responsiveness. While not the fastest to start, it maintains reasonable generation speed for typical coding workflows.
- Grok 4.1 Fast Reasoning had a first-token latency of 11 seconds but the fastest per-token latency at 0.005 seconds. Despite the significant initial delay, once generation begins, it produces code extremely rapidly, making it potentially suitable for batch code generation tasks.
- DeepSeek V3.2 had the highest first-token latency at 19 seconds, with a per-token latency of 0.030 seconds, making it the slowest among the group for coding tasks and limiting its applicability in interactive development environments where immediate feedback is essential.
What is LLM latency, and why is it important?
LLM latency refers to the time it takes for a large language model to generate a response after receiving user input. In practice, latency is not a single number but a collection of latency measures that describe how quickly a system reacts and completes output generation.
One of the most important measures is end-to-end latency (E2E latency). E2E latency measures the total time from when the server receives a request to when it completes sending the response, including the final token. This value reflects the full waiting time experienced by the user and corresponds closely to what users perceive as responsiveness.
Latency is commonly broken down into key metrics such as:
- Time to first token (TTFT) or first token latency, which captures how long it takes before the model begins generating output
- Inter-token latency (ITL), which measures the delay between tokens generated during the response
- Total generation time, which spans from prompt submission to the completion of the response
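To make these definitions concrete, here is a minimal sketch of how the metrics can be captured from a streaming endpoint. It assumes a hypothetical `stream_tokens()` generator that yields tokens as they arrive; real client libraries differ in how they expose the stream.

```python
import time

def measure_latency(stream_tokens):
    """Capture TTFT, average inter-token latency, and total generation time.

    `stream_tokens` is a hypothetical callable that yields response tokens
    as they arrive from a streaming API.
    """
    request_sent = time.perf_counter()
    first_token_at = None
    gaps = []
    previous = None

    for _ in stream_tokens():
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now            # TTFT window ends here
        else:
            gaps.append(now - previous)     # one inter-token latency sample
        previous = now

    ttft = first_token_at - request_sent
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    total_generation_time = previous - request_sent
    return ttft, itl, total_generation_time
```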
Low latency is critical in interactive applications such as chatbots, coding assistants, and customer support tools. High latency can interrupt the natural flow of interaction, reduce engagement, and negatively affect user satisfaction. Over time, consistently high latency can also limit the adoption of AI-powered solutions, especially in real-time or customer-facing use cases.
Why does latency directly affect user experience?
The impact of latency on user experience goes beyond inconvenience. Users perceive response times differently depending on the context, complexity of the request, and expectations set by the application. A short delay may be acceptable for complex reasoning tasks, while even minor delays can feel disruptive in conversational interfaces.
- Delayed responses can break conversational flow in interactive AI systems.
- Consistent response times often lead to greater user satisfaction than highly variable ones.
- A slightly slower but more predictable response speed is often preferred over occasional fast replies mixed with long delays.
This psychological aspect of waiting explains why perceived responsiveness matters as much as raw response times. In many cases, maintaining consistent performance is more important than achieving the lowest possible latency for a single request.
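As a rough illustration of why variability matters, the sketch below compares two hypothetical latency profiles: one that is usually fast but occasionally stalls, and one that is slower but steady. The numbers are illustrative, not benchmark data.

```python
import statistics

def latency_profile(samples):
    """Summarize end-to-end latencies (in seconds) collected over many requests."""
    cuts = statistics.quantiles(samples, n=100)   # 99 cut points: index 49 = p50, 94 = p95
    return {
        "p50": round(cuts[49], 2),
        "p95": round(cuts[94], 2),
        "spread": round(statistics.pstdev(samples), 2),  # lower spread = more predictable
    }

# Hypothetical profiles: mostly fast with occasional long stalls vs. steady but slower
spiky = [0.4] * 90 + [6.0] * 10
steady = [0.9] * 100
print(latency_profile(spiky))    # high p95 and spread despite a fast median
print(latency_profile(steady))   # slower median, but predictable
```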
Factors that affect LLM latency
LLM latency varies based on several technical and operational factors. Understanding these key factors helps teams identify performance bottlenecks and apply targeted latency optimization strategies.
Model size and configuration
Model size directly affects processing speed. Larger models typically require more compute resources and more time to process the same input tokens. While larger models may offer better output quality, they often increase both first-token latency and per-token latency.
Important considerations include:
- Model size and internal architecture
- Model configurations, such as context window length
- Tradeoffs between response quality and low latency
Selecting a model that aligns with the application’s performance requirements is a central part of model optimization.
Hardware and system architecture
Hardware plays a critical role in determining response times. Powerful GPUs or AI accelerators can significantly reduce computation time, lowering latency across both TTFT and inter-token latency. Key contributors include:
- GPU utilization and availability
- Memory bandwidth and data transfer efficiency
- Overall system architecture and compute resources
System throughput, typically measured in tokens per second (TPS), indicates how much output a system can generate under concurrent load. High throughput is essential for handling multiple requests without degrading response times.
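As a simple illustration, throughput is just output volume divided by elapsed time; the figures below are hypothetical.

```python
# Illustrative throughput calculation; the numbers are hypothetical, not benchmark results.
tokens_generated = 4_800      # total output tokens across all concurrent requests
wall_clock_seconds = 12.0     # elapsed time to serve the whole batch

tps = tokens_generated / wall_clock_seconds
print(f"System throughput: {tps:.0f} tokens/second")   # 400 tokens/second
```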
Concurrency, batching, and system load
Latency behaves differently in single-request and concurrent-request scenarios. While batching can improve throughput, it can also introduce queuing delays that increase initial response time.
Factors that influence latency here include:
- Number of concurrent requests
- Batching and scheduling policies
- Current system load and usage patterns
Systems optimized only for throughput may experience high latency during peak usage, even if average performance looks acceptable.
Network and deployment effects
Network latency can add meaningful delays, especially in distributed or cloud-based systems. Communication between services, regions, and users contributes to overall end-to-end latency.
Cold starts are another critical factor. When models are scaled to zero during idle periods, the first request must wait for the model to load, which can significantly increase latency. Cold start effects can distort accurate latency measurements if not accounted for separately from steady-state performance.
Strategies to reduce LLM latency
Reducing latency requires coordinated changes across models, infrastructure, and application design. Effective latency optimization focuses on both actual and perceived responsiveness.
Model optimization approaches
Model optimization techniques aim to improve processing speed while maintaining acceptable response quality. Common methods include:
- Quantization and pruning to reduce model size
- Fine-tuning smaller models for specific tasks
- Adjusting model configurations to prioritize low latency
Optimizing the model in these ways can significantly reduce latency and lower operational costs.
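As one illustrative example of the quantization approach mentioned above, the sketch below applies PyTorch's post-training dynamic quantization to a toy model. It is not the procedure used by any of the benchmarked providers, and production LLM deployments typically rely on more specialized quantization schemes.

```python
import torch
import torch.nn as nn

# Illustrative only: a toy feed-forward block stands in for a real model.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)

# Post-training dynamic quantization stores Linear weights in int8, shrinking the
# model and potentially speeding up CPU inference (and hence per-token latency).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(quantized(x).shape)   # torch.Size([1, 1024])
```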
Prompt design and token efficiency
Prompt engineering directly affects latency. Longer prompts increase the number of input tokens the model must process, slowing both TTFT and output generation.
Best practices include:
- Using only relevant context
- Reducing prompt complexity and unnecessary instructions
- Limiting tokens generated when a complete response is not required
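A minimal sketch of these practices is shown below. The `max_tokens` parameter name follows common chat-completion APIs and should be treated as an assumption, since exact parameter names vary by provider.

```python
def build_prompt(question, context_chunks, max_context_chars=2000):
    """Assemble a prompt from only the most relevant context, capped in size.

    `context_chunks` are assumed to be pre-ranked by relevance (for example,
    by a retrieval step); less relevant chunks are dropped once the cap is hit.
    """
    context = ""
    for chunk in context_chunks:
        if len(context) + len(chunk) > max_context_chars:
            break
        context += chunk.strip() + "\n"
    return (
        "Answer briefly using only the context below.\n\n"
        f"Context:\n{context}\nQuestion: {question}"
    )

# When sending the request, also cap the output length so tokens are not generated
# needlessly; the exact parameter name (e.g. max_tokens) depends on the API.
request_params = {"max_tokens": 150, "temperature": 0.1}
```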
Streaming, caching, and response handling
Streaming response techniques allow the model to begin generating output as soon as the first token is ready, rather than waiting for the final token. This improves perceived responsiveness even when total generation time remains unchanged.
Additional techniques include:
- Caching responses for repeated or identical input queries
- Semantic caching for similar prompts with overlapping intent
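A minimal sketch of a semantic cache follows. It assumes `embed` and `generate` callables (an embedding model and an LLM call) and uses an illustrative cosine-similarity threshold; production systems usually rely on a vector store rather than a linear scan.

```python
import math

_cache = []   # list of (embedding, response) pairs

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cached_answer(prompt, embed, generate, threshold=0.92):
    """Serve semantically similar prompts from cache; otherwise call the model.

    `embed` and `generate` are assumed callables (an embedding model and an LLM
    call). The 0.92 similarity threshold is illustrative and needs tuning.
    """
    query_vec = embed(prompt)
    for vec, response in _cache:
        if _cosine(query_vec, vec) >= threshold:
            return response              # cache hit: no model latency at all
    response = generate(prompt)          # cache miss: pay full TTFT + generation time
    _cache.append((query_vec, response))
    return response
```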
Infrastructure and throughput optimization
Infrastructure tuning is essential for maintaining performance at scale. This includes:
- Balancing throughput metrics and latency measures
- Ensuring sufficient compute resources for peak demand
- Reducing queuing delays during concurrent requests
Measuring and monitoring latency in production
Accurate latency measurements are essential for diagnosing issues and validating improvements. Different testing methods serve different purposes:
- Synchronous testing processes one request at a time, providing clean and isolated latency data.
- Asynchronous testing simulates real-world scenarios with multiple simultaneous requests, though it can complicate isolating individual latencies.
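The sketch below contrasts the two approaches using `asyncio`; `query_model` is a hypothetical placeholder for a real asynchronous API call.

```python
import asyncio
import time

async def query_model(prompt):
    """Hypothetical placeholder for an asynchronous call to a model API."""
    await asyncio.sleep(0.5)    # stands in for real network and generation time
    return "response"

async def synchronous_test(prompts):
    """One request at a time: clean, isolated latency per prompt."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        await query_model(prompt)
        latencies.append(time.perf_counter() - start)
    return latencies

async def asynchronous_test(prompts):
    """All requests in flight at once: closer to production load, but latencies interact."""
    async def timed(prompt):
        start = time.perf_counter()
        await query_model(prompt)
        return time.perf_counter() - start
    return await asyncio.gather(*(timed(p) for p in prompts))

prompts = [f"question {i}" for i in range(10)]
print(asyncio.run(synchronous_test(prompts)))
print(asyncio.run(asynchronous_test(prompts)))
```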
Monitoring key performance metrics helps teams identify bottlenecks, track trends, and maintain consistent performance over time. Continuous monitoring is critical as usage patterns evolve.
Common tools used in production include:
- NVIDIA GenAI-Perf and LLMPerf for capturing latency metrics
- Prometheus and Grafana for monitoring and visualizing latency distributions
These tools support ongoing optimization and help ensure consistent performance under changing workloads.
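As a minimal example of the Prometheus side, the sketch below exports time-to-first-token as a histogram with the `prometheus_client` library, which Grafana can then chart. The bucket boundaries and the simulated measurements are illustrative.

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Bucket boundaries (seconds) chosen to cover sub-second to multi-second TTFT values.
TTFT_SECONDS = Histogram(
    "llm_time_to_first_token_seconds",
    "Time to first token per request",
    buckets=(0.25, 0.5, 1, 2, 4, 8, 16),
)

start_http_server(8000)   # metrics exposed at :8000/metrics for Prometheus to scrape

while True:
    ttft = random.uniform(0.3, 8.0)   # stand-in for a real TTFT measurement
    TTFT_SECONDS.observe(ttft)        # record one request's first-token latency
    time.sleep(1)
```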
Why consistency matters more than speed alone
While low latency is essential, consistency often matters more for user satisfaction. Systems with highly variable response times tend to feel unreliable, even if some responses are fast. In contrast, consistent response times create predictable interactions and improve perceived responsiveness.
In interactive AI applications, response speed shapes trust, usability, and long-term adoption. Optimizing LLM latency is therefore not just about minimizing milliseconds, but about delivering stable, predictable performance that aligns with user expectations.
By combining accurate measurement, thoughtful system design, and continuous monitoring, teams can significantly reduce latency while maintaining performance, response quality, and cost efficiency.
LLM latency benchmark methodology
Benchmark setup
We measured the latency performance of multiple LLMs across five use cases. The benchmark was executed on a remote server to ensure consistent network conditions. All models were tested using their respective official APIs. We set the temperature to 0.1.
Data collection
A single run was performed with 500 total questions (100 questions per use case). Each question was sent to the model’s streaming API endpoint, and timing measurements were captured at three critical points:
- Request sent: Timestamp when the API request was initiated
- First token received: Timestamp when the first response token arrived
- Final token received: Timestamp when the streaming response completed
Metrics
Time to First Token (TTFT)
Measures the initial response latency – how long it takes for the model to begin generating a response.
Per Token Latency (PTL)
Measures the average time required to generate each token after the first token is received.
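The sketch below shows how the two metrics can be derived from the three captured timestamps; the example values are illustrative, not benchmark results.

```python
def compute_metrics(request_sent, first_token_received, final_token_received, output_tokens):
    """Derive TTFT and per-token latency from the three captured timestamps.

    Timestamps are in seconds; `output_tokens` is the number of tokens in the response.
    """
    ttft = first_token_received - request_sent
    per_token_latency = (
        (final_token_received - first_token_received) / (output_tokens - 1)
        if output_tokens > 1
        else 0.0
    )
    return ttft, per_token_latency

# Example with illustrative values (not benchmark data):
print(compute_metrics(0.0, 0.6, 10.6, 501))   # (0.6, 0.02)
```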
Q&A
We benchmarked the models on a set of 10 questions covering a variety of common factual and conceptual topics across technical, business, and general-knowledge domains. These inputs averaged around 13 tokens per prompt, making them relatively short.
This use case evaluates the models’ ability to generate clear, accurate, and informative answers suitable for educational, documentation, and customer support contexts. The required responses typically involve moderate-length explanations that balance detail with clarity.
Coding
We evaluated the models on a set of 10 distinct programming tasks, ranging from simple functions to more advanced API development. These tasks involved generating Python code snippets, such as basic scripts, web applications using Flask or FastAPI, and data-processing scripts.
This use case assesses the models’ ability to produce structured, functional, and coherent code, which often requires longer, more complex outputs than typical text generation. The input prompts averaged around 20 tokens each, reflecting concise but descriptive programming requests.
Language translation
We benchmarked the models using a set of 10 diverse translation prompts covering multiple languages (Spanish, Chinese, Russian) and text types, including long academic passages, short everyday sentences, scientific abstracts, business emails, and literary excerpts. These inputs varied significantly in length and complexity, ranging from short sentences of around 10 tokens to detailed multi-paragraph texts exceeding several hundred tokens.
This use case evaluates the models’ ability to accurately comprehend and faithfully reproduce meaning across different languages and domains, preserving nuances, style, and technical content. By using varied text types and lengths, we tested both general translation quality and the models’ handling of specialized or formal language.
Business analysis
We evaluated the models using 10 distinct business analysis prompts, each simulating real-world decision-making scenarios across domains like sales performance, customer retention, supply chain bottlenecks, marketing ROI, employee productivity, and competitive strategy. The prompts included structured tabular data and open-ended analytical questions, requiring models to interpret multiple business metrics and generate concise, actionable insights. Inputs varied in complexity, with an average input length of approximately 105 tokens.
This use case tests a model’s ability to synthesize quantitative data, apply logical reasoning, and communicate recommendations clearly in a business context.
Summary generation
We tasked models with producing academic-style summaries (~500 tokens) of technical articles on diverse topics, including AI in healthcare, climate change, renewable energy, blockchain, remote work, electric vehicles, cybersecurity, social media, urbanization, and quantum computing. Each summary was structured into main arguments, supporting ideas, and conclusions, with key terms highlighted and briefly explained.
This use case tests a model’s ability to comprehend detailed technical articles and generate clear, structured, academically styled summaries with key-term explanations.