
Benchmark 30 Finance LLMs: GPT-5, Gemini 2.5 Pro & more

Hazal Şimşek
updated on Aug 9, 2025

Large language models (LLMs) are transforming finance by automating complex tasks such as risk assessment, fraud detection, customer support, and financial analysis. Benchmarking finance LLMs can help identify the most reliable and effective solutions. We evaluated 30 finance LLMs, including the latest state-of-the-art models, and found:

  • For tasks demanding the highest accuracy: gpt-5-2025-08-07 and gpt-5-mini-2025-08-07 are the recommended models.
  • For a balance of high performance and efficiency: claude-opus-4-1-20250805 stands out with strong accuracy and a significantly lower token count.

Benchmark overview

We evaluated LLMs on 238 hard questions from the FinanceReasoning benchmark (Hendrycks et al.). This subset targets the most challenging financial reasoning tasks, assessing complex multi-step quantitative reasoning with financial concepts and formulas. Our evaluation employed a custom prompt design and scoring criteria of accuracy and token consumption.

Here are the results:

For a detailed explanation of how these metrics were calculated and the framework used for this evaluation, please see our financial benchmark methodology.

Results

Strong performers:

  • GPT-5 (especially gpt-5-2025-08-07 and gpt-5-mini-2025-08-07) significantly outperforms all other models with the highest accuracy rates, regardless of token consumption. There is a notable performance gap between the top two models and the rest of the field.
  • Other strong contenders in the top tier include gemini-2.5-pro (81.93% accuracy, 711,359 tokens) and grok-3 (81.51% accuracy, 510,671 tokens).
  • Claude-opus-4-1-20250805 offers a good balance of high accuracy (81.51%) and low token usage (139,373 tokens).

Middle tier:

This group of models has accuracy scores primarily in the 70% range.

  • Kimi-k2 (78.15% accuracy, 100,323 tokens) and o3-pro-2025-06-10 (78.15% accuracy, 473,659 tokens) are tied for 7th place. Kimi-k2 is also the most efficient high-performing model in this group.
  • gpt-5-nano-2025-08-07 (76.89% accuracy, 1,028,909 tokens) and claude-sonnet-4-20250514 (76.05% accuracy, 135,462 tokens) also fit into this category.

Low performers:

  • gpt-oss-20b (67.65% accuracy, 515,041 tokens) and gemini-2.5-flash (65.55% accuracy, 286,603 tokens) sit near the bottom of the ranking.
  • The lowest-ranking model is deepseek-v3-0324, which had an accuracy of only 10.92% with 100,861 tokens.

Some models’ low accuracy might be due to:

  • Lack of domain training: Many open models like LLaMA or Qwen may not be fine-tuned on financial text.
  • Hallucination and ambiguity: Some models might give grammatically correct but nonsensical answers.
  • Difficulty handling dense, multi-sentence questions: questions involving corporate finance, macroeconomics, or scenario analysis are especially challenging for these models.

Despite their impressive parameter counts (like LLaMA 3 70B or Qwen 2 72B), these models lagged in precision. This shows that model size alone is not enough; domain-specific alignment and instruction tuning are critical for complex tasks.

Impact of tokens:

  • There’s no clear correlation between the number of tokens used and a model’s accuracy. For instance, Deepseek-r1-0528 used the highest number of tokens (1,251,064) but only achieved 62.18% accuracy.
  • Models like claude-opus-4-1-20250805 and claude-opus-4-20250514 demonstrated strong performance (81.51% and 80.25% accuracy) with a significantly lower token count of 139,373 and 132,274, respectively.

Financial reasoning benchmark methodology

Our benchmark provides a fair, transparent, and reproducible evaluation of Large Language Model (LLM) performance on complex financial reasoning tasks.

Test setup & Data corpus

  • Benchmark suite: We utilized the data, code, and evaluation scripts from the FinanceReasoning benchmark. We selected it for its specialized focus on quantitative and inferential financial problems.
  • Knowledge corpus & test queries: We focused our analysis on the hard subset, comprising 238 challenging questions. As defined by the benchmark, each data point includes:
    1. A question requiring multi-step logical and numerical deduction.
    2. A context, which often contains dense information presented in structured formats like Markdown tables (e.g., balance sheets, stock performance data).
    3. A definitive ground truth answer for objective scoring.
  • Illustrative query types: The benchmark’s difficulty stems from its requirement for models to handle diverse and complex financial reasoning tasks. To illustrate this breadth, we highlight two representative examples from the test set:

Example: algorithmic & time-series reasoning (technical analysis)

Context: An investor is analyzing… stock prices over the last 25 days… to calculate the Keltner Channel using a 10-day EMA period and a 10-day ATR period, with a multiplier of 1.5…

Question: What is the value of the last upper band in the Keltner Channel…? Answer to two decimal places.

This query tests a model’s ability to act as a quantitative analyst by:

  1. Deconstructing a composite indicator: Recognizing that the “Keltner Channel” is derived from two other complex indicators:
    • the exponential moving average (EMA)
    • the average true range (ATR).
  2. Implementing algorithmic logic: Correctly implementing the iterative algorithms for both EMA and ATR from scratch over a time-series of 25 data points.
  3. Synthesizing results: Combining the calculated values according to the final Keltner Channel formula (Upper Band = EMA + (Multiplier × ATR)).
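The three steps above can be sketched in Python. This is an illustrative implementation, not the benchmark's reference code: the EMA is seeded with a simple moving average and the ATR uses Wilder's smoothing, both common conventions that individual benchmark problems may define differently.

```python
def ema(values, period):
    # Exponential moving average, seeded with the SMA of the first `period` values.
    k = 2 / (period + 1)
    avg = sum(values[:period]) / period
    for v in values[period:]:
        avg = v * k + avg * (1 - k)
    return avg

def atr(highs, lows, closes, period):
    # Average true range: true range per day, then Wilder's smoothing
    # (one common convention; others use a plain EMA of the true range).
    trs = []
    for i in range(1, len(closes)):
        trs.append(max(highs[i] - lows[i],
                       abs(highs[i] - closes[i - 1]),
                       abs(lows[i] - closes[i - 1])))
    avg = sum(trs[:period]) / period
    for tr in trs[period:]:
        avg = (avg * (period - 1) + tr) / period
    return avg

def keltner_upper(highs, lows, closes, period=10, mult=1.5):
    # Upper Band = EMA + (Multiplier x ATR), per the formula in step 3.
    return ema(closes, period) + mult * atr(highs, lows, closes, period)
```

A model answering the example query must effectively reproduce this pipeline over the 25 supplied data points, then round the final band value to two decimal places.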

Core evaluation principles

  • Isolated & standardized API calls: For each of the 30 models, we conducted the evaluation programmatically via their respective API endpoints (e.g., OpenRouter, OpenAI). This ensured that every model received the exact same input under identical conditions, eliminating variability from UI interactions.
  • Free-form generation: We did not constrain the models to a multiple-choice format. Instead, they were prompted to generate a comprehensive, free-form response, allowing for a more authentic assessment of their reasoning capabilities.
  • Chain-of-Thought (CoT) prompting: To elicit and evaluate the models’ reasoning process, we employed a Chain-of-Thought (CoT) prompting strategy. The system prompt explicitly instructed each model to “first think through the problem step by step” before concluding with a final answer. This approach allows for a deeper analysis of how a model arrives at its conclusion, beyond just the final output.
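A minimal sketch of how such a standardized, CoT-prompted request could be assembled. The prompt wording and payload shape below are our own illustration of the approach described above, not the benchmark's verbatim prompt:

```python
def build_cot_request(model, question, context):
    """Assemble a chat-completions style payload with a CoT system prompt.

    The system prompt wording is illustrative; the benchmark's exact
    instruction text is not reproduced here.
    """
    system = ("You are a financial analyst. First think through the problem "
              "step by step, then state your final answer on the last line.")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": f"{context}\n\n{question}"},
        ],
    }
```

Because every model receives this same payload via its API, differences in output reflect the models themselves rather than prompt or interface variation.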

Evaluation metrics & framework

We utilized the FinanceReasoning benchmark’s own fully automated evaluation framework to score the model outputs. This framework is designed to measure both conceptual correctness and computational cost.

1. Primary metric: Accuracy

This metric answers the critical question: “Can the model correctly solve the financial problem?” The scoring process involves a sophisticated two-step pipeline:

  • Step 1: LLM-based answer extraction: A model’s raw output is an unstructured text containing both its reasoning and the final answer. To reliably parse the definitive numerical or boolean value, we utilized a powerful supervisor model (openai/gpt-4o) as an intelligent parser. This method consistently identifies the intended final answer, even with slight variations in formatting across different models.
  • Step 2: Tolerance-based comparison: A simple “exact match” is insufficient for numerical problems. Therefore, the extracted answer was programmatically compared against the ground truth. The script applies a numerical tolerance threshold (a relative difference of 1%) to fairly handle minor floating-point or rounding variations, ensuring that conceptually sound solutions are marked as correct.
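The tolerance check in Step 2 can be sketched as follows. The zero-ground-truth fallback is our own assumption, since the source only specifies a 1% relative difference:

```python
def is_correct(extracted, ground_truth, rel_tol=0.01):
    """Mark an answer correct when it is within a 1% relative difference
    of the ground truth; boolean answers are compared exactly."""
    if isinstance(ground_truth, bool):
        return extracted == ground_truth
    if ground_truth == 0:
        # Assumption: fall back to an absolute tolerance when the
        # ground truth is zero, since relative difference is undefined.
        return abs(extracted) < rel_tol
    return abs(extracted - ground_truth) / abs(ground_truth) <= rel_tol
```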

2. Secondary metric: Token consumption

This metric answers the question: “How computationally expensive is it for the model to solve these problems?” It measures the total cost associated with generating the 238 answers.

  • Calculation: For each API call, we collected the usage data returned by the model provider, which includes prompt_tokens and completion_tokens. The final score for a model is the sum of total_tokens consumed across all 238 questions. This provides a clear measure of the model’s verbosity and overall computational cost for the task.
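The tally described above reduces to summing the usage fields returned with each response. The dictionary shape below mirrors the prompt_tokens and completion_tokens fields named in the text, though individual providers may nest them differently:

```python
def total_tokens(usage_records):
    # Sum prompt and completion tokens across all API responses;
    # the result is the model's total token consumption for the task.
    return sum(u["prompt_tokens"] + u["completion_tokens"] for u in usage_records)
```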

This two-metric approach, provided by the FinanceReasoning benchmark itself, allows for a holistic assessment, balancing a model’s raw problem-solving capability (accuracy) against its operational efficiency (token consumption).

Financial reasoning with Retrieval-Augmented Generation (RAG)

To surpass standalone models, we designed and implemented a custom RAG framework distinct from the benchmark’s original implementation. Our approach is built on a modern vector database stack (Qdrant) to supply LLMs with relevant, domain-specific knowledge at inference time, helping them solve problems beyond their training data. We tested this on gpt-4o-mini to measure its impact.

Results and analysis: The RAG trade-off

The introduction of RAG had a significant and measurable impact on the performance of gpt-4o-mini.

| Test type | Accuracy (%) | Correct answers | Token consumption | Total time |
| --- | --- | --- | --- | --- |
| Standalone (without RAG) | 43.70% | 104 / 238 | 159,207 | ~3 minutes |
| Augmented (RAG-powered) | 53.78% | 128 / 238 | 2,818,601 | ~59 minutes |
| Impact | +10.08 points | +24 questions | ~17.7x higher | ~20x slower |

Key Takeaways from the RAG evaluation:

  • Significant accuracy improvement: RAG demonstrably enhanced the model’s problem-solving capability, boosting accuracy by over 10 percentage points. This confirms that providing external, relevant context is highly effective for complex, domain-specific reasoning tasks.
  • The cost of accuracy: This performance gain came at a significant cost. Total token consumption increased by nearly 18x, and total execution time by roughly 20x. This is due to the additional API calls for embedding and, more importantly, the vastly larger and more complex prompts that the LLM must process.
  • Implications for larger models: The results from gpt-4o-mini suggest that while RAG can unlock higher performance, applying this method to larger, more expensive models like GPT-4o or Claude Opus will be substantially more costly and time-consuming. This highlights the critical trade-off between accuracy, cost, and latency in designing production-grade financial AI systems.

Financial reasoning RAG methodology

Our RAG pipeline is built on a modern stack using Qdrant as the vector database and OpenAI’s text-embedding-3-small model for generating semantic vector representations. The process consists of two main phases: an offline indexing phase and an online retrieval-generation phase.

1. Knowledge corpus indexing

  • Corpus creation: We curated a specialized knowledge base from two sources provided by the benchmark:
    1. Financial documents: A collection of articles (financial_documents.json) explaining various financial concepts and terms.
    2. Financial functions: A library of ready-to-use Python functions (functions-article-all.json) designed to solve specific financial calculations.
  • Intelligent chunking & embedding: To prepare this corpus for efficient retrieval, each document and function was processed and indexed:
    1. Chunking: Documents were segmented into smaller, semantically coherent chunks based on their sections. Each Python function was treated as a single atomic chunk. This ensures that retrieved context is focused and relevant.
    2. Embedding: Each chunk was then converted into a 1536-dimension vector using the text-embedding-3-small model.
    3. Indexing: These vectors were indexed into two separate collections within our local Qdrant instance (financial_documents_openai_small and financial_functions_openai_small), optimized for cosine similarity search.

2. RAG-powered inference

For each of the 238 questions, the model’s reasoning process was augmented with the following automated steps:

  1. Embedding generation (API calls 1 & 2): The user’s query (question + context) was converted into an embedding vector. This required two calls to OpenAI’s embedding API to prepare for searches in both collections.
  2. Multi-source retrieval: The query vector was used to perform a semantic search against both Qdrant collections simultaneously to retrieve the most relevant information:
    • The top 3 most relevant document chunks from the financial_documents collection.
    • The top 2 most relevant Python functions from the financial_functions collection.
  3. Prompt augmentation: The retrieved documents and functions were dynamically injected into the prompt, creating a rich, context-aware “information packet”. This significantly increased the input prompt size (from ~300-500 tokens to ~3,000-5,000+ tokens).
  4. Final answer generation (API call 3): This augmented prompt was sent to the gpt-4o-mini model to generate the final, reasoned answer.
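The multi-source retrieval in step 2 can be sketched in plain Python. Qdrant performs this cosine-similarity search server-side; the brute-force version below only illustrates the ranking logic, and uses tiny 2-dimensional vectors in the test rather than the real 1536-dimension embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, collection, k):
    # Rank chunks by similarity to the query; Qdrant does this
    # server-side against an indexed collection.
    ranked = sorted(collection, key=lambda c: cosine(query_vec, c["vector"]),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

def retrieve(query_vec, documents, functions):
    # Multi-source retrieval: top 3 document chunks plus top 2 Python
    # functions, matching the counts described in step 2.
    return top_k(query_vec, documents, 3) + top_k(query_vec, functions, 2)
```

The five retrieved chunks are then concatenated into the augmented prompt of step 3.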

Limitations 

Our benchmark, while comprehensive, is subject to several key limitations:

  • Data contamination risk: It is possible that these models have been trained on the benchmark’s dataset since the dataset is public. This could lead to inflated scores, making the true reasoning ability difficult to assess.
  • Single-model RAG analysis: The RAG evaluation was performed on only one model (gpt-4o-mini), so the observed trade-offs between performance and cost may not apply to all other models.

💡Conclusion

Our benchmark of 30 models on complex financial reasoning tasks reveals key findings:

  • GPT-5 dominance: GPT-5 models, specifically gpt-5-2025-08-07 and gpt-5-mini-2025-08-07, set a new standard for accuracy, outperforming all other models.
  • Performance vs. efficiency trade-off: Models like Claude Opus 4.1 offer a strong balance of high accuracy and low token usage, highlighting the importance of efficiency alongside raw performance.
  • RAG’s impact: Retrieval-Augmented Generation (RAG) significantly boosts a model’s accuracy but at a substantial cost in terms of token consumption and latency.

Hazal Şimşek, Industry Analyst: Hazal is an industry analyst at AIMultiple, focusing on process mining and IT automation.

Researched by Ekrem Sarı, AI Researcher: Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, AI Agents, and RAG frameworks.
