The effectiveness of large language models (LLMs) is determined not only by their accuracy and capabilities but also by the speed at which they engage with users.
We benchmarked the performance of leading language models across various use cases—measuring how quickly they respond to user input. From the time it takes to generate the first word to the pace of the full response, the results reveal which models deliver faster interactions when it counts.
LLM latency benchmark results
We benchmarked the latency performance of the following large language models: GPT-4.1, Mistral-large, Claude-3-opus-20240229, Grok-2, and DeepSeek. We focus on two key metrics: First Token Latency—the time it takes for the model to start generating the first token of a response—and Per Token Latency—the time taken to generate each token throughout the response.
You can see our methodology here.
Performance analysis by use case
Latency values vary with the task type, indicating that these models exhibit different performance profiles across use cases.
Q&A
In Q&A scenarios, such as customer support, virtual assistants, and enterprise knowledge tools, speed and response times directly impact user experience. Grok, with a first token latency of 0.345 seconds, provides the fastest initial response. This quick reaction is a significant advantage in live support systems and situations needing rapid answers. Its per-token latency is 0.015 seconds, offering high efficiency for short to medium-length queries.
GPT-4, with a first token latency of 0.615 seconds, responds slower than Grok and also has a higher per-token latency of 0.026 seconds. However, this per-token latency remains at a strong performance level, making GPT-4 suitable for generating longer and more detailed responses efficiently.
Mistral, with a first token latency of 0.495 seconds, lags behind Grok but remains faster than GPT-4 in initial response time, offering a solid balance in responsiveness. Notably, its per-token latency of 0.041 seconds is higher than GPT-4’s. Overall, Mistral is a balanced choice for users seeking quick turnaround on brief queries without compromising generation speed.
Claude, with a first token latency of 1.162 seconds, is one of the slowest models in terms of initial response. Although its per-token latency of 0.049 seconds is reasonable and close to Mistral’s performance, the delay before the first token appears can negatively impact overall responsiveness.
DeepSeek, with a first token latency of 2.270 seconds, is the slowest model overall. While its per-token latency of 0.060 seconds is not excessively high, the long wait before the first token makes it less suitable for speed-critical Q&A systems. It may be used in cases with less time pressure.
Summary generation
The summary generation use case plays a critical role in applications where users need to quickly grasp long texts. For example, in scenarios where customer service teams need to summarize a call recording within seconds and take action, the first token latency directly impacts the user experience.
Mistral delivered the fastest initial response with a first token latency of 0.551 seconds. Its per-token latency of 0.029 seconds makes it an effective option for scenarios requiring quick summarization of short documents.
Grok follows closely with a first token latency of 0.594 seconds. However, its per-token latency of 0.023 seconds allows it to maintain speed even with longer content.
GPT-4 falls between Grok and Mistral with a first token latency of 0.589 seconds and has the fastest per-token latency at 0.021 seconds.
Claude has a slower initial response with a first token latency of 1.298 seconds. However, its per-token latency of 0.047 seconds still provides decent overall performance.
DeepSeek stands out as the slowest model with a first token latency of 3.942 seconds and a per-token latency of 0.068 seconds.
Multi-source synthesis
According to our observations, Grok clearly stands out as the fastest model in scenarios where rapid synthesis from multiple sources is critical. With the lowest first token latency at 0.374 seconds, Grok excels in real-time applications like live data dashboards or instant decision support systems. Its per-token latency of just 0.017 seconds ensures consistently fast output, even as the content length increases.
Mistral and GPT-4 follow Grok in terms of performance. Mistral starts faster than GPT-4, with a 0.520-second first token latency, but has a slower per-token latency of 0.037 seconds.
On the other hand, GPT-4 begins slightly slower with a first token latency of 0.566 seconds, but compensates with a lower per-token latency of 0.024 seconds.
Claude is noticeably slower, with a first token latency of 1.540 seconds. This delay can impact workflows where rapid feedback is key. However, its per-token latency of 0.045 seconds makes it still a reasonable choice for tasks where overall throughput is more important than immediate responsiveness.
Finally, DeepSeek ranks as the slowest model in our tests, with a first token latency of 2.834 seconds and per-token latency of 0.073 seconds.
Language translation
Based on our benchmark, Grok delivers the fastest initial response with a first token latency of 0.354 seconds, making it an ideal model for real-time translation tasks. Additionally, its per-token latency of 0.017 seconds provides both speed and efficiency, enabling it to perform well even on longer or more complex translation tasks.
GPT-4 starts slower than Grok with a first token latency of 0.766 seconds but has the lowest per-token latency at 0.014 seconds.
Mistral, with a first token latency of 0.558 seconds, falls between Grok and GPT-4 in terms of responsiveness, and delivers slower per-token latency compared to GPT-4, at 0.042 seconds.
Although Claude’s first token latency is much higher than Mistral’s at 1.191 seconds, their per-token latencies are quite similar, with Claude around 0.046 seconds and Mistral at 0.042 seconds.
DeepSeek stands out as the slowest model with a first token latency of 2.427 seconds and a relatively high per-token latency of 0.067 seconds.
Business analysis
Based on the results we observed in the Business Analysis use case, Grok delivers the fastest initial response. With a first token latency of 0.351 seconds, it demonstrates strong performance in scenarios involving real-time business analysis and rapid decision-making. Additionally, with a per-token latency of just 0.017 seconds, Grok maintains its speed not only at the start but throughout the entire output.
GPT-4 starts slower with a first token latency of 0.576 seconds, but it still delivers high efficiency with a per-token latency of 0.026 seconds, making it suitable for business analysis tasks that require moderate speed, such as daily reports or low-traffic dashboards, where near real-time response is sufficient.
Mistral positions itself between Grok and GPT-4, with a first token latency of 0.529 seconds. Its per-token latency of 0.040 seconds provides balanced performance, although it is slower than GPT-4.
Claude responds noticeably slower, with a first token latency of 1.368 seconds. This initial delay can cause perceptible lags in workflows that expect real-time answers—such as analysts working in customer support tools. However, with a per-token latency of 0.047 seconds, it still provides consistent output speed, making it a reasonable choice in less time-sensitive scenarios like batch data reviews or scheduled executive reporting.
DeepSeek was the slowest model among those we tested, with a first token latency of 2.425 seconds. Its per-token latency of 0.072 seconds also adds more time to longer outputs.
Coding
Grok showed the lowest first token latency at 0.344 seconds, making it the fastest model to start generating tokens. With a per-token latency of 0.022 seconds, it demonstrates strong performance both in terms of the first-token and overall per-token latency.
GPT-4 followed with a first token latency of 0.561 seconds and the fastest per-token latency at 0.021 seconds. This combination allows GPT-4 to quickly catch up after a slightly slower start, making it highly efficient for handling longer or more complex coding tasks where sustained token generation speed matters.
Mistral showed a solid first token latency of 0.502 seconds and a moderate per-token latency of 0.035 seconds, positioning it as a balanced choice for everyday coding tasks that require both decent responsiveness and steady token throughput.
Claude, with a first token latency of 1.173 seconds and per-token latency of 0.062 seconds, demonstrated slower initial responsiveness and token generation speed, making it less suitable for scenarios where immediate feedback is essential.
DeepSeek had the highest first token latency at 2.369 seconds and a per-token latency of 0.078 seconds, indicating it is the slowest among the group, limiting its applicability in fast-paced coding environments.
What is LLM latency and why is it important?
LLM latency refers to the amount of time it takes for a large language model (LLM) to generate a response after receiving a prompt—a core component in LLM performance benchmarking and real-world application responsiveness. Key performance metrics such as latency to first token and per-token latency are crucial in this evaluation.
Low latency is crucial for delivering smooth, real-time user experiences—especially in applications like chatbots, coding assistants, customer support, and translation tools, where inference performance directly affects user satisfaction. High latency can lead to frustration, reduced engagement, and lower user satisfaction. As LLMs become integrated into more products and services, optimizing latency is key to maintaining performance, responsiveness, and overall product success.
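As a back-of-the-envelope illustration of how these two metrics combine, the total time to finish a response can be approximated as the first token latency plus the per-token latency multiplied by the remaining tokens. The short sketch below applies this to the Q&A figures reported above; the 200-token response length is an assumed example, not something we measured.

```python
def estimated_response_time(first_token_s: float, per_token_s: float, output_tokens: int) -> float:
    """Approximate end-to-end time: wait for the first token, then generate the rest."""
    return first_token_s + per_token_s * (output_tokens - 1)

# Q&A figures from the benchmark above, with an assumed 200-token answer.
print(round(estimated_response_time(0.345, 0.015, 200), 2))  # Grok:  ~3.33 s
print(round(estimated_response_time(0.615, 0.026, 200), 2))  # GPT-4: ~5.79 s
```

For short answers the first token latency dominates the perceived wait, while for long outputs the per-token latency increasingly determines the total time.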
Factors that affect LLM latency
The response time of a large language model (LLM) can vary significantly depending on a few key factors. Understanding these helps identify where improvements can be made.
- Model Size: Larger models typically require more processing power, which can increase latency. More parameters often mean better results, but also slower response times. This tradeoff is a central concern in performance benchmarking efforts.
- Hardware Capabilities: The type of GPU or TPU used, available memory, memory bandwidth, and system architecture all play a role. High-performance hardware with high system throughput and efficient resource utilization can significantly reduce latency and improve LLM inference efficiency.
- Batch Size: Processing multiple requests in a batch can be efficient, but it may also delay the response for individual users—especially the first token (see the sketch after this list). Knowing how many requests are being processed simultaneously is essential for accurate latency measurements.
- Network Latency: In distributed systems or cloud environments, communication between components can introduce delays. The physical location of servers also matters.
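As a rough illustration of the batch-size tradeoff, the toy model below assumes a server that holds incoming requests for a fixed batching window before processing them together; the window and compute times are made-up numbers for illustration, not measurements from our benchmark.

```python
def first_token_wait(arrival_offset_s: float, batch_window_s: float, compute_s: float) -> float:
    """Toy model: a request arriving `arrival_offset_s` seconds into the current
    batching window waits for the window to close, then for the batch to be
    processed, before its first token appears."""
    remaining_window = max(batch_window_s - arrival_offset_s, 0.0)
    return remaining_window + compute_s

# A request arriving just as a 50 ms batching window opens waits ~0.25 s for its
# first token, versus ~0.20 s if it arrives right as the window closes.
print(first_token_wait(arrival_offset_s=0.00, batch_window_s=0.05, compute_s=0.20))  # 0.25
print(first_token_wait(arrival_offset_s=0.05, batch_window_s=0.05, compute_s=0.20))  # 0.2
```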
Strategies to reduce LLM latency
Reducing latency is critical for improving the user experience, particularly in real-time applications. Here are a few effective strategies:
- Model Optimization: Techniques like quantization, pruning, and using distilled (smaller, faster) versions of models can speed up response times and improve latency without sacrificing too much accuracy. Additionally, these optimizations enhance cost efficiency by reducing computational resource demands, thereby improving overall inference performance.
- Improved Infrastructure: Running models on optimized inference engines and high-speed hardware (like modern GPUs or custom accelerators) can make a big difference. Additionally, deploying LLMs on infrastructure designed for high throughput ensures that more requests can be processed concurrently without sacrificing latency.
- Prompt Design: Crafting shorter, more efficient prompts reduces the number of input tokens the model has to process—saving both time and compute resources. Reducing unnecessary tokens, particularly by minimizing input sequence length and controlling how many tokens are used, also improves inter token latency.
- Streaming & Caching: Delivering the output as it’s being generated (streaming) and caching frequent responses can help minimize perceived delays, especially when generating long sequences with many output tokens; a sketch follows below.
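For illustration, here is a minimal sketch of the streaming-and-caching idea. It is not tied to any specific provider SDK; `generate_stream` is a hypothetical stand-in for a streaming LLM call, and a production cache would also need eviction and prompt normalization.

```python
import time
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call; a real client would yield
    tokens as the provider returns them."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.02)  # simulate per-token generation time
        yield token

_cache: dict[str, str] = {}

def answer(prompt: str) -> Iterator[str]:
    """Serve cached responses instantly; otherwise stream tokens as they arrive
    so the user sees output before the full response has finished generating."""
    if prompt in _cache:
        yield _cache[prompt]          # cache hit: no generation latency at all
        return
    pieces = []
    for token in generate_stream(prompt):
        pieces.append(token)
        yield token                   # stream each token to the user immediately
    _cache[prompt] = "".join(pieces)  # remember the full answer for repeat prompts

for chunk in answer("Say hello"):
    print(chunk, end="", flush=True)
print()
```

Streaming does not shorten total generation time, but it sharply reduces the perceived delay, because the user starts reading after the first token rather than after the last.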
LLM latency benchmark methodology
We benchmarked the response latency performance of various large language models (LLMs) across different use cases, focusing on two key metrics: First Token Latency (time to first token) and Latency per Token. We tested the following large language models: GPT-4.1, Mistral-large, Claude-3-opus-20240229, Grok-2, and DeepSeek. Model names and versions corresponded to those specified by their respective API providers at the time of testing.
We used the same input prompt, input size, and model configuration for each use case to ensure a fair and consistent comparison. To minimize the impact of transient network delays and server load variations, we avoided sending concurrent requests during testing.
We accessed each model through its official API, provided by its respective developer or platform (e.g., OpenAI, Anthropic, xAI, etc.), ensuring that performance measurements, especially the time to first token, reflected real-world usage conditions.
For each use case, we prepared 10 distinct questions to represent realistic variations in input. We executed all the requests—each question 10 times per model—to minimize the impact of transient network delays and server load variations. We reported the median latency values to reduce the influence of outliers.
Before measurements, we sent a warm-up request with an empty or neutral input to mitigate cold-start effects such as cache loading or connection setup.
We measured latency using streaming response mode, enabling precise timing of the time to first token and the per-token latency throughout the response, which together provide an approximation of the overall end-to-end latency. A simplified sketch of this measurement loop follows the two definitions below.
- First Token Latency, or time to first token, is defined as the time from request initiation to receiving the first response token, a crucial metric for assessing how quickly a model begins to respond.
- Latency per Token is the average time a model takes to generate each token after starting to respond. It reflects the model’s generation speed during output, which is critical for real-time use cases.
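For illustration, here is a minimal sketch of this measurement loop, assuming a `stream_chat` callable that wraps a provider's streaming client and yields response tokens as they arrive. The callable, the warm-up prompt, and the aggregation details are simplified placeholders rather than our exact harness.

```python
import statistics
import time
from typing import Callable, Iterator

def measure_latency(stream_chat: Callable[[str], Iterator[str]], prompt: str,
                    runs: int = 10) -> tuple[float, float]:
    """Return median first token latency and median per-token latency over `runs` repeats."""
    for _ in stream_chat("Hello"):                    # warm-up request to mitigate cold-start effects
        pass
    first_token, per_token = [], []
    for _ in range(runs):
        start = time.perf_counter()
        arrivals = []
        for _token in stream_chat(prompt):            # consume the streamed response
            arrivals.append(time.perf_counter())      # record each token's arrival time
        first_token.append(arrivals[0] - start)       # time to first token for this run
        gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
        per_token.append(statistics.mean(gaps) if gaps else 0.0)  # average inter-token gap
    return statistics.median(first_token), statistics.median(per_token)
```

In practice, each provider's official streaming API plays the role of `stream_chat` here; reporting medians reduces the influence of outliers, as noted above.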
Q&A
We benchmarked the models using a set of 10 distinct questions designed to represent a variety of common factual and conceptual topics across technical, business, and general knowledge domains. These inputs averaged around 13 tokens per prompt, making them relatively short and concise.
This use case evaluates the models’ ability to generate clear, accurate, and informative answers suitable for educational, documentation, and customer support contexts. The required responses typically involve moderate-length explanations that balance detail with clarity.
Coding
We evaluated the models using a set of 10 distinct programming tasks of varying complexity, ranging from simple functions to more advanced API development. These tasks involved generating Python code snippets such as basic scripts, web applications using Flask or FastAPI, and data processing scripts.
This use case assesses the models’ ability to produce structured, functional, and coherent code that often requires longer and more complex outputs compared to typical text generation. The input prompts averaged around 20 tokens each, reflecting concise but descriptive programming requests.
Language translation
We benchmarked the models using a set of 10 diverse translation prompts covering multiple languages (Spanish, Chinese, Russian) and text types, including long academic passages, short everyday sentences, scientific abstracts, business emails, and literary excerpts. These inputs varied significantly in length and complexity, ranging from short sentences of around 10 tokens to detailed multi-paragraph texts exceeding several hundred tokens.
This use case evaluates the models’ ability to accurately comprehend and faithfully reproduce meaning across different languages and domains, preserving nuances, style, and technical content. By using varied text types and lengths, we tested both general translation quality and the models’ handling of specialized or formal language.
Business analysis
We evaluated the models using 10 distinct business analysis prompts, each simulating real-world decision-making scenarios across domains like sales performance, customer retention, supply chain bottlenecks, marketing ROI, employee productivity, and competitive strategy. The prompts included structured tabular data and open-ended analytical questions, requiring models to interpret multiple business metrics and generate concise, actionable insights. Inputs varied in complexity, with an average input length of approximately 105 tokens.
This use case tests a model’s ability to synthesize quantitative data, apply logical reasoning, and communicate recommendations clearly in a business context.
Multi-source synthesis
We assessed the models’ performance by providing 10 distinct prompts, each containing five customer reviews with varying and sometimes contradictory opinions about a product. The tasks required the models to analyze individual reviews in detail, highlight important positives, negatives, and contradictions, then synthesize these insights into a balanced overall evaluation. Finally, models were asked to produce actionable advice by listing critical factors prospective buyers should consider. The prompts featured a mix of qualitative feedback on packaging, delivery, product performance, customer service, and value, with an average input length of approximately 135 tokens.
This use case tests a model’s capability to comprehend nuanced, multi-source user feedback and generate clear, structured, and practical summaries.
Summary generation
We tasked models with producing academic-style summaries (~500 tokens) of technical articles on diverse topics such as AI in healthcare, climate change, renewable energy, blockchain, remote work, electric vehicles, cybersecurity, social media, urbanization, and quantum computing. Each summary was structured into main arguments, supporting ideas, and conclusions, with key terms highlighted and briefly explained.
This use case tests a model’s capability to comprehend detailed, technical articles and generate clear, structured, and academically styled summaries with key term explanations.