Updated on Aug 8, 2025

Benchmark 30 Finance LLMs: GPT-5, Gemini 2.5 Pro & more


Large language models (LLMs) are transforming finance by automating complex tasks such as risk assessment, fraud detection, customer support, and financial analysis. Benchmarking finance LLMs can help identify the most reliable and effective solutions. We evaluated 30 finance LLMs, including the latest state-of-the-art models, and found:

  • For tasks demanding the highest accuracy: gpt-5-2025-08-07 and gpt-5-mini-2025-08-07 are the recommended models.
  • For a balance of high performance and efficiency: claude-opus-4-1-20250805 stands out with strong accuracy and a significantly lower token count.

Benchmark overview

We evaluated LLMs on 238 hard questions from the FinanceReasoning benchmark. This subset targets the most challenging financial reasoning tasks, assessing complex multi-step quantitative reasoning with financial concepts and formulas. Our evaluation used a custom prompt design and scored models on accuracy and token consumption.

Here are the results:

For a detailed explanation of how these metrics were calculated and the framework used for this evaluation, please see our financial benchmark methodology.

Results

Strong performers:

  • GPT-5 (especially gpt-5-2025-08-07 and gpt-5-mini-2025-08-07) significantly outperforms all other models, achieving the highest accuracy rates regardless of token consumption. There is a notable gap in performance between the top two models and the rest of the field.
  • Other strong contenders in the top tier include gemini-2.5-pro (81.93% accuracy, 711,359 tokens) and grok-3 (81.51% accuracy, 510,671 tokens).
  • Claude-opus-4-1-20250805 offers a good balance of high accuracy (81.51%) and low token usage (139,373 tokens).

Middle tier:

This group of models has accuracy scores primarily in the 70% range.

  • Kimi-k2 (78.15% accuracy, 100,323 tokens) and o3-pro-2025-06-10 (78.15% accuracy, 473,659 tokens) are tied for 7th place. Kimi-k2 is also the most efficient high-performing model in this group.
  • gpt-5-nano-2025-08-07 (76.89% accuracy, 1,028,909 tokens) and claude-sonnet-4-20250514 (76.05% accuracy, 135,462 tokens) also fit into this category.

Low performers:

  • Gemini-2.5-flash (65.55% accuracy, 286,603 tokens) and gpt-oss-20b (67.65% accuracy, 515,041 tokens) trail the middle tier.
  • The lowest-ranking model is deepseek-v3-0324, which had an accuracy of only 10.92% with 100,861 tokens.

Some models’ low accuracy might be due to:

  • Lack of domain training: Many open models like LLaMA or Qwen may not be fine-tuned on financial text.
  • Hallucination and ambiguity: Some models might give grammatically correct but nonsensical answers.
  • Difficulty handling dense, multi-sentence questions: questions involving corporate finance, macroeconomics, or scenario analysis are especially challenging for these models.

Despite their impressive parameter counts (like LLaMA 3 70B or Qwen 2 72B), these models lagged in precision. This shows that model size alone is not enough; domain-specific alignment and instruction tuning are critical for complex tasks.

Impact of tokens:

  • There’s no clear correlation between the number of tokens used and a model’s accuracy. For instance, Deepseek-r1-0528 used the highest number of tokens (1,251,064) but only achieved 62.18% accuracy.
  • Models like claude-opus-4-1-20250805 and claude-opus-4-20250514 demonstrated strong performance (81.51% and 80.25% accuracy) with a significantly lower token count of 139,373 and 132,274, respectively.

Financial reasoning benchmark methodology

Our benchmark provides a fair, transparent, and reproducible evaluation of Large Language Model (LLM) performance on complex financial reasoning tasks.

Test setup & Data corpus

  • Benchmark suite: We selected the FinanceReasoning 1 benchmark for its specialized focus on quantitative and inferential financial problems, which go beyond simple knowledge retrieval.
  • Knowledge corpus & test queries: We utilized the hard subset, comprising 238 challenging questions. Unlike other benchmarks, FinanceReasoning integrates the knowledge corpus directly into each test query. Each data point includes:
    1. A question requiring multi-step logical and numerical deduction.
    2. A context, which often contains dense information presented in structured formats like Markdown tables (e.g., balance sheets, stock performance data).
    3. A definitive ground truth answer for objective scoring.
  • Illustrative query types: The benchmark’s difficulty stems from its requirement for models to handle diverse and complex financial reasoning tasks. To illustrate this breadth, we highlight two representative examples from the test set:

Example: algorithmic & time-series reasoning (Technical Analysis)

Context: An investor is analyzing… stock prices over the last 25 days… to calculate the Keltner Channel using a 10-day EMA period and a 10-day ATR period, with a multiplier of 1.5…

Question: What is the value of the last upper band in the Keltner Channel…? Answer to two decimal places.

This query tests a model’s ability to act as a quantitative analyst by:

  1. Deconstructing a composite indicator: Recognizing that the “Keltner Channel” is derived from two other complex indicators:
    • the exponential moving average (EMA)
    • the average true range (ATR).
  2. Implementing algorithmic logic: Correctly implementing the iterative algorithms for both EMA and ATR from scratch over a time-series of 25 data points.
  3. Synthesizing results: Combining the calculated values according to the final Keltner Channel formula (Upper Band = EMA + (Multiplier × ATR)).
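To make the required computation concrete, here is a minimal Python sketch of the upper-band calculation. The EMA seeding (simple moving average of the first 10 values) and Wilder-style ATR smoothing are assumptions of this sketch; the benchmark question may specify slightly different conventions.

```python
# Illustrative sketch of the Keltner Channel upper-band calculation described above.
# Seeding and smoothing conventions are assumptions, not the benchmark's exact spec.

def ema(values, period):
    """Exponential moving average, seeded with the SMA of the first `period` values."""
    k = 2 / (period + 1)
    result = sum(values[:period]) / period
    for v in values[period:]:
        result = v * k + result * (1 - k)
    return result

def atr(highs, lows, closes, period):
    """Average true range with Wilder smoothing, seeded with the mean of the first `period` TRs."""
    trs = []
    for i in range(1, len(closes)):
        trs.append(max(highs[i] - lows[i],
                       abs(highs[i] - closes[i - 1]),
                       abs(lows[i] - closes[i - 1])))
    result = sum(trs[:period]) / period
    for tr in trs[period:]:
        result = (result * (period - 1) + tr) / period
    return result

def keltner_upper_band(highs, lows, closes, ema_period=10, atr_period=10, multiplier=1.5):
    # Upper Band = EMA + (Multiplier x ATR)
    return ema(closes, ema_period) + multiplier * atr(highs, lows, closes, atr_period)
```

A model answering the example query correctly has to reproduce this kind of logic step by step in plain text, over 25 data points, without executing code.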

Core evaluation principles

  • Isolated & standardized API calls: For each of the 30 models, we conducted the evaluation programmatically via their respective API endpoints (e.g., OpenRouter, OpenAI). This ensured that every model received the exact same input under identical conditions, eliminating variability from UI interactions.
  • Free-form generation: We did not constrain the models to a multiple-choice format. Instead, they were prompted to generate a comprehensive, free-form response, allowing for a more authentic assessment of their reasoning capabilities.
  • Chain-of-Thought (CoT) prompting: To elicit and evaluate the models’ reasoning process, we employed a Chain-of-Thought (CoT) prompting strategy. The system prompt explicitly instructed each model to “first think through the problem step by step” before concluding with a final answer. This approach allows for a deeper analysis of how a model arrives at its conclusion, beyond just the final output.
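As an illustration of this setup, the sketch below shows what a single standardized call could look like with an OpenAI-compatible client. The system prompt is a paraphrase of the CoT instruction described above, not the exact benchmark prompt.

```python
# Minimal sketch of one standardized, isolated API call with CoT prompting.
from openai import OpenAI

client = OpenAI()  # or an OpenRouter-compatible client configured with a custom base_url

def ask_model(model: str, question: str, context: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Paraphrased CoT system prompt: think step by step, then give a final answer.
            {"role": "system", "content": (
                "You are a financial analyst. First think through the problem step by step, "
                "then conclude with a final answer."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{question}"},
        ],
    )
    return response.choices[0].message.content
```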

Evaluation metrics & framework

Our evaluation framework is fully automated and designed to measure both the conceptual correctness and the computational cost of each model’s response.

1. Primary metric: Accuracy

This metric answers the critical question: “Can the model correctly solve the financial problem?” The scoring process involves a sophisticated two-step pipeline:

  • Step 1: LLM-based answer extraction: A model’s raw output is an unstructured text containing both its reasoning and the final answer. To reliably parse the definitive numerical or boolean value, we utilized a powerful supervisor model (openai/gpt-4o) as an intelligent parser. This method consistently identifies the intended final answer, even with slight variations in formatting across different models.
  • Step 2: tolerance-based comparison: A simple “exact match” is insufficient for numerical problems. Therefore, the extracted answer was programmatically compared against the ground truth. Our script applies a numerical tolerance threshold (a relative difference of 1%) to fairly handle minor floating-point or rounding variations, ensuring that conceptually sound solutions are marked as correct.
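A minimal sketch of the tolerance check in step 2, assuming the 1% relative-difference threshold described above (the helper name and example values are illustrative):

```python
def is_correct(extracted: float, ground_truth: float, rel_tol: float = 0.01) -> bool:
    """Mark an answer correct if it is within a 1% relative difference of the ground truth."""
    if ground_truth == 0:
        # Edge-case choice for this sketch: fall back to an absolute threshold.
        return abs(extracted) <= rel_tol
    return abs(extracted - ground_truth) / abs(ground_truth) <= rel_tol

# e.g. a model answering 105.37 against a ground truth of 105.40 is scored as correct
assert is_correct(105.37, 105.40)
```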

2. Secondary metric: Token consumption

This metric answers the question: “How computationally expensive is it for the model to solve these problems?” It measures the total cost associated with generating the 238 answers.

  • Calculation: For each API call, we collected the usage data returned by the model provider, which includes prompt_tokens and completion_tokens. The final score for a model is the sum of total_tokens consumed across all 238 questions. This provides a clear measure of the model’s verbosity and overall computational cost for the task.
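For illustration, a per-model total could be accumulated from the provider-reported usage fields roughly as follows (the helper is hypothetical; field names follow the OpenAI-style usage schema):

```python
def total_token_consumption(responses) -> int:
    # Each response's usage.total_tokens equals prompt_tokens + completion_tokens.
    return sum(r.usage.total_tokens for r in responses)
```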

This two-metric approach allows for a holistic assessment, balancing a model’s raw problem-solving capability (accuracy) against its operational efficiency (token consumption).

Financial reasoning with Retrieval-Augmented Generation (RAG)

To surpass standalone models, we utilized the Retrieval-Augmented Generation (RAG) framework to supply LLMs with relevant, domain-specific knowledge at inference time, helping them solve problems beyond their training data. We tested this on gpt-4o-mini to measure its impact.

Results and analysis: The RAG trade-off

The introduction of RAG had a significant and measurable impact on the performance of gpt-4o-mini.

Updated at 08-08-2025
| Test type | Accuracy (%) | Correct answers | Token consumption | Total time |
| --- | --- | --- | --- | --- |
| Standalone (without RAG) | 43.70% | 104 / 238 | 159,207 | ~3 minutes |
| Augmented (RAG-powered) | 53.78% | 128 / 238 | 2,818,601 | ~59 minutes |
| Impact | +10.08 points | +24 questions | ~17.7x higher | ~20x slower |

Key takeaways from the RAG evaluation:

  • Significant accuracy improvement: RAG demonstrably enhanced the model’s problem-solving capability, boosting accuracy by over 10 percentage points. This confirms that providing external, relevant context is highly effective for complex, domain-specific reasoning tasks.
  • The cost of accuracy: This performance gain came at a significant cost. Total token consumption increased nearly 18x, and total execution time increased roughly 20x. This is due to the additional API calls for embedding and, more importantly, the vastly larger and more complex prompts that the LLM must process.
  • Implications for larger models: The results from gpt-4o-mini suggest that while RAG can unlock higher performance, applying this method to larger, more expensive models like GPT-4o or Claude Opus will be substantially more costly and time-consuming. This highlights the critical trade-off between accuracy, cost, and latency in designing production-grade financial AI systems.

Financial reasoning RAG methodology

Our RAG pipeline is built on a modern stack using Qdrant as the vector database and OpenAI’s text-embedding-3-small model for generating semantic vector representations. The process consists of two main phases: an offline indexing phase and an online retrieval-generation phase.

1. Knowledge corpus indexing

  • Corpus creation: We curated a specialized knowledge base from two sources provided by the benchmark:
    1. Financial documents: A collection of articles (financial_documents.json) explaining various financial concepts and terms.
    2. Financial functions: A library of ready-to-use Python functions (functions-article-all.json) designed to solve specific financial calculations.
  • Intelligent chunking & embedding: To prepare this corpus for efficient retrieval, each document and function was processed and indexed:
    1. Chunking: Documents were segmented into smaller, semantically coherent chunks based on their sections. Each Python function was treated as a single atomic chunk. This ensures that retrieved context is focused and relevant.
    2. Embedding: Each chunk was then converted into a 1536-dimension vector using the text-embedding-3-small model.
    3. Indexing: These vectors were indexed into two separate collections within our local Qdrant instance (financial_documents_openai_small and financial_functions_openai_small), optimized for cosine similarity search.
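The indexing phase could look roughly like the following sketch. It assumes a local Qdrant instance and reuses the collection names mentioned above; the chunk lists and the `embed` helper are simplified placeholders rather than the exact pipeline code.

```python
# Sketch of the offline indexing phase: embed chunks and index them for cosine search.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")  # local Qdrant instance

def embed(text: str) -> list[float]:
    # text-embedding-3-small produces 1536-dimension vectors
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def index_chunks(collection: str, chunks: list[str]) -> None:
    qdrant.recreate_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )
    points = [
        PointStruct(id=i, vector=embed(chunk), payload={"text": chunk})
        for i, chunk in enumerate(chunks)
    ]
    qdrant.upsert(collection_name=collection, points=points)

# e.g. index_chunks("financial_documents_openai_small", document_chunks)
#      index_chunks("financial_functions_openai_small", function_chunks)
```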

2. RAG-powered inference

For each of the 238 questions, the model’s reasoning process was augmented with the following automated steps:

  1. Embedding generation (API calls 1 & 2): The user’s query (question + context) was converted into an embedding vector. This required two calls to OpenAI’s embedding API to prepare for searches in both collections.
  2. Multi-source retrieval: The query vector was used to perform a semantic search against both Qdrant collections simultaneously to retrieve the most relevant information:
    • The top 3 most relevant document chunks from the financial_documents collection.
    • The top 2 most relevant Python functions from the financial_functions collection.
  3. Prompt augmentation: The retrieved documents and functions were dynamically injected into the prompt, creating a rich, context-aware “information packet”. This significantly increased the input prompt size (from ~300-500 tokens to ~3,000-5,000+ tokens).
  4. Final answer generation (API call 3): This augmented prompt was sent to the gpt-4o-mini model to generate the final, reasoned answer.
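A corresponding sketch of the online phase is below. It reuses the `embed` helper and clients from the indexing sketch, retrieves the top 3 document chunks and top 2 functions as described, and uses an illustrative prompt template; for brevity it issues a single embedding call rather than the two described above.

```python
# Sketch of the online retrieval-and-generation step for one benchmark question.
def answer_with_rag(question: str, context: str, model: str = "gpt-4o-mini") -> str:
    query_vector = embed(f"{question}\n{context}")

    # Multi-source retrieval: top-3 document chunks, top-2 Python functions.
    docs = qdrant.search("financial_documents_openai_small", query_vector=query_vector, limit=3)
    funcs = qdrant.search("financial_functions_openai_small", query_vector=query_vector, limit=2)
    retrieved = "\n\n".join(hit.payload["text"] for hit in docs + funcs)

    # Prompt augmentation: inject the retrieved material ahead of the question.
    augmented_prompt = (
        f"Relevant reference material:\n{retrieved}\n\n"
        f"Context:\n{context}\n\nQuestion:\n{question}\n"
        "Think through the problem step by step, then give a final answer."
    )
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    return response.choices[0].message.content
```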

Limitations 

Our benchmark, while comprehensive, is subject to several key limitations:

  • Data contamination risk: It is possible that these models have been trained on the benchmark’s dataset since the dataset is public. This could lead to inflated scores, making the true reasoning ability difficult to assess.
  • Single-model RAG analysis: The RAG evaluation was performed on only one model (gpt-4o-mini), so the observed trade-offs between performance and cost may not apply to all other models.

Conclusion

Our benchmark of 30 models on complex financial reasoning tasks reveals key findings:

  • GPT-5 dominance: GPT-5 models, specifically gpt-5-2025-08-07 and gpt-5-mini-2025-08-07, set a new standard for accuracy, outperforming all other models.
  • Performance vs. efficiency trade-off: Models like Claude Opus 4.1 offer a strong balance of high accuracy and low token usage, highlighting the importance of efficiency alongside raw performance.
  • RAG’s impact: Retrieval-Augmented Generation (RAG) significantly boosts a model’s accuracy but at a substantial cost in terms of token consumption and latency.

FAQ

What is LLM in finance?

An LLM (Large Language Model) in finance is an AI model trained on vast amounts of financial data and texts using natural language processing techniques to perform complex financial analysis, compliance management, and document understanding. These models help financial institutions navigate financial law, regulatory requirements, and the dynamic demands of the financial industry.

Finance LLM use cases

Intelligent chatbots:
LLM-driven virtual assistants enable financial firms to provide automated, 24/7 customer support by handling routine queries and onboarding tasks without human intervention. This reduces wait times and improves customer satisfaction while freeing human agents for complex issues.

Advisory & analysis:
Investment banks use LLMs to analyze market trends, financial news, and client data. These models digest large volumes of unstructured information, enabling advisors to deliver personalized investment advice and portfolio management with real-time insights.

Regulatory document analysis:
Law firms and financial institutions employ LLMs to process dense regulatory documents like SEC filings. These models extract key information and summarize reports, reducing manual review time and helping firms stay compliant with evolving regulations.

Fraud detection:
LLMs analyze vast financial datasets in real time to detect suspicious transaction patterns and emerging fraud tactics. Their continual learning capabilities allow faster and more accurate fraud identification than traditional methods.

Legal and compliance automation:
Law firms and compliance teams use LLMs to review contracts, interpret banking laws, and verify regulatory compliance. Automating these tasks reduces review time and legal costs while ensuring adherence to complex financial regulations.

Document Q&A and Named Entity Recognition (NER):
Financial institutions deploy LLMs to answer questions from investors by extracting data from financial reports and earnings calls. NER enables automatic tagging of company names, stock tickers, and regulatory entities, streamlining data retrieval.

Key benefits

Efficiency and automation: LLMs automate routine analysis (e.g., summarizing earnings reports, processing loans or filings), saving analyst hours and reducing errors.

24/7 customer service: AI virtual assistants and chatbots powered by LLMs can handle customer queries around the clock with conversational answers, improving customer experience and satisfaction.

Personalized financial advice: By analyzing a client’s history and risk profile, LLMs deliver tailored financial or investment advice.

Fraud detection & risk management: LLMs sift through large transaction datasets to spot anomalies or fraud patterns, adapting to new scam tactics and helping build risk profiles.

Compliance & reporting: LLMs automatically draft regulatory reports, extract policy-relevant facts, and help parse complex finance law and regulations for compliance.

Challenges and considerations

Data privacy & security: Financial data is highly sensitive, requiring strong encryption, strict access controls, and anonymization to prevent breaches and regulatory fines. Use AI platforms compliant with banking law and data protection standards.

Regulatory compliance & explainability: LLMs often act as “black boxes,” making decision explanations difficult for regulators. Employ explainable AI tools, maintain audit trails, and combine AI outputs with human oversight using governance frameworks.

Bias & fairness: Training data biases can lead to unfair lending or advice. Use responsible AI tools for bias detection and mitigation, audit model decisions regularly, and ensure diverse, balanced domain-specific training data.

Integration & cost: Legacy systems and high costs complicate LLM adoption. Reduce expenses by continual pre-training on base models, plan phased integration, and upskill staff in finance and AI.

Dynamic domain: Rapid financial changes require frequent model updates. Set up pipelines for continual data retraining and collaborate with finance and legal experts on domain-specific curricula.

Skills & governance: Success demands multidisciplinary teams and strong oversight. Implement human-in-the-loop workflows and AI governance for compliance and quality control.

Is there a financial LLM?

Yes, several domain-specific models exist for finance. For example, BloombergGPT is designed to assist with financial regulation, capital markets, and compliance management by processing large financial datasets, including regulatory filings and documents from national securities exchanges.

Other models, such as FinBERT and FinGPT, adapt large language models to the specialized vocabulary of finance, from trading symbols to regulatory texts, supporting tasks such as financial sentiment analysis and personalized financial advice.

Hazal is an industry analyst at AIMultiple, focusing on process mining and IT automation.
Ekrem is an industry analyst at AIMultiple, focusing on intelligent automation, AI Agents, and RAG frameworks.
