We evaluated 36 LLMs in finance on 238 hard questions from the FinanceReasoning benchmark to identify which models excel at complex financial reasoning tasks like statement analysis, forecasting, and ratio calculations.
LLM finance benchmark overview
We evaluated LLMs on 238 hard questions from the FinanceReasoning benchmark (Tang et al.).1 This subset targets the most challenging financial-reasoning tasks, assessing complex, multi-step quantitative reasoning involving financial concepts and formulas. Our evaluation used a custom prompt design and scored models on accuracy and token consumption.
For a detailed explanation of how these metrics were calculated and the framework used for this evaluation, please see our financial benchmark methodology.
Results: Which LLM is the best for finance?
Top-tier performers (>85% accuracy):
gpt-5-2025-08-07 achieves the highest accuracy at 88.23% with 829,720 tokens. This represents the current state-of-the-art performance for financial reasoning tasks.
claude-opus-4.6 scores 87.82% accuracy with 164,369 tokens, delivering near-top performance while consuming significantly fewer tokens than the leader.
gpt-5-mini-2025-08-07 reaches 87.39% accuracy with 595,505 tokens, offering a strong alternative within the GPT-5 family.
gemini-3-pro-preview and gpt-5.2 are tied at 86.13% accuracy. However, gpt-5.2 achieves this with 247,660 tokens compared to gemini-3-pro-preview’s 730,759 tokens, making it roughly three times more token-efficient.
Strong performers (80-85% accuracy):
claude-opus-4.5 delivers 84.03% accuracy with 144,505 tokens, maintaining Claude’s strong balance of performance and efficiency.
gemini-3-flash-preview scores 83.61% accuracy with 118,530 tokens, the most token-efficient option among all high-performing models.
kimi-k2.5 achieves 82.77% accuracy but requires 877,868 tokens, the highest consumption among models in this performance tier.
Middle tier (70-80% accuracy):
o3-pro-2025-06-10 (78.15% accuracy, 473,659 tokens) and kimi-k2 (78.15% accuracy, 100,323 tokens) are tied. Kimi-k2 is the most efficient model in this group.
o3-mini-2025-01-31 (77.31% accuracy, 376,929 tokens), gpt-5-nano-2025-08-07 (76.89% accuracy, 1,028,909 tokens), and claude-sonnet-4-20250514 (76.05% accuracy, 135,462 tokens) follow closely.
Low performers (<70% accuracy):
claude-3-5-sonnet-20241022 (67.65% accuracy, 90,103 tokens) and gpt-oss-20b (67.65% accuracy, 515,041 tokens) lead this tier.
gemini-2.5-flash (65.55% accuracy, 286,603 tokens), glm-4.5 (64.29% accuracy, 692,662 tokens), and gpt-4.1-nano-2025-04-14 (63.45% accuracy, 171,096 tokens) follow.
The lowest-ranking model is deepseek-v3-0324, which had an accuracy of 10.92% with 100,861 tokens.
Performance insights:
The benchmark shows no clear correlation between token consumption and accuracy. deepseek-r1-0528 consumed the most tokens (1,251,064) yet achieved 62.18% accuracy, while claude-opus-4-20250514 scored 80.25% with 132,274 tokens.
Token efficiency varies dramatically even among high-performing models. gemini-3-flash-preview uses 118,530 tokens to achieve 83.61% accuracy, while kimi-k2.5 consumes 877,868 tokens for 82.77% accuracy (7.4x more tokens for slightly lower performance).
The table above lists our other AI model benchmarks, including those used for this evaluation.
Financial reasoning benchmark methodology
Our benchmark provides a fair, transparent, and reproducible evaluation of Large Language Model (LLM) performance on complex financial reasoning tasks.
Test setup & data corpus
- Benchmark suite: We utilized the data, code, and evaluation scripts from the FinanceReasoning benchmark. We selected it for its specialized focus on quantitative and inferential financial problems.
- Knowledge corpus & test queries: We focused our analysis on the hard subset, comprising 238 challenging questions. As defined by the benchmark, each data point includes:
- A question requiring multi-step logical and numerical deduction.
- A context, which often contains dense information presented in structured formats like Markdown tables (e.g., balance sheets, stock performance data).
- A definitive ground truth answer for objective scoring.
- Illustrative query types: The benchmark’s difficulty stems from its requirement for models to handle diverse and complex financial reasoning tasks. To illustrate this, we highlight a representative example from the test set:
Example: Algorithmic & time-series reasoning (technical analysis)
Context: An investor is analyzing… stock prices over the last 25 days… to calculate the Keltner Channel using a 10-day EMA period and a 10-day ATR period, with a multiplier of 1.5…
Question: What is the value of the last upper band in the Keltner Channel…? Answer to two decimal places.
This query tests a model’s ability to act as a quantitative analyst by:
- Deconstructing a composite indicator: Recognizing that the “Keltner Channel” is derived from two other complex indicators:
- the exponential moving average (EMA)
- the average true range (ATR).
- Implementing algorithmic logic: Correctly implementing the iterative algorithms for both EMA and ATR from scratch over a time series of 25 data points.
- Synthesizing results: Combining the calculated values according to the final Keltner Channel formula (Upper Band = EMA + (Multiplier × ATR)).
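To make the computation concrete, below is a minimal Python sketch of this calculation. It is illustrative only: conventions for EMA seeding and ATR smoothing vary between sources (this version seeds the EMA with the first close and averages the last 10 true ranges), so the benchmark’s reference solution may use a slightly different variant.

```python
def ema(values, period):
    """Exponential moving average, seeded with the first value."""
    alpha = 2 / (period + 1)
    series = [values[0]]
    for v in values[1:]:
        series.append(alpha * v + (1 - alpha) * series[-1])
    return series

def true_ranges(highs, lows, closes):
    """True range per day: max of high-low, |high - prev close|, |low - prev close|."""
    trs = [highs[0] - lows[0]]
    for i in range(1, len(closes)):
        trs.append(max(highs[i] - lows[i],
                       abs(highs[i] - closes[i - 1]),
                       abs(lows[i] - closes[i - 1])))
    return trs

def keltner_upper_band(highs, lows, closes, ema_period=10, atr_period=10, multiplier=1.5):
    """Upper band = EMA(close) + multiplier * ATR (simple average of true ranges here)."""
    middle = ema(closes, ema_period)[-1]
    atr = sum(true_ranges(highs, lows, closes)[-atr_period:]) / atr_period
    return round(middle + multiplier * atr, 2)
```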
Core evaluation principles
- Isolated & standardized API calls: For each of the 36 models, we conducted the evaluation programmatically via their respective API endpoints (e.g., OpenRouter, OpenAI). This ensured that every model received the exact same input under identical conditions, eliminating variability from UI interactions.
- Free-form generation: We did not constrain the models to a multiple-choice format. Instead, they were prompted to generate a comprehensive, free-form response, allowing for a more authentic assessment of their reasoning capabilities.
- Chain-of-Thought (CoT) prompting: To elicit and evaluate the models’ reasoning process, we employed a Chain-of-Thought (CoT) prompting strategy. The system prompt explicitly instructed each model to “first think through the problem step by step” before concluding with a final answer. This approach allows for a deeper analysis of how a model arrives at its conclusion, beyond the final output.
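As an illustration of this setup, here is a minimal sketch of such a call using the OpenAI Python SDK. The exact prompt wording and model identifiers we used are not reproduced here; the snippet only shows the Chain-of-Thought instruction pattern described above.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a financial analyst. First think through the problem step by step, "
    "then conclude with a single line of the form 'Final answer: <value>'."
)

def ask(question: str, context: str, model: str = "gpt-4o") -> str:
    # One isolated, standardized call per question; the same messages are sent
    # to every model under evaluation.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{context}\n\n{question}"},
        ],
    )
    return response.choices[0].message.content
```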
Evaluation metrics & framework
We utilized the FinanceReasoning benchmark’s own fully automated evaluation framework to score the model outputs. This framework is designed to measure both conceptual correctness and computational cost.
1. Primary metric: Accuracy
This metric answers the critical question: “Can the model correctly solve the financial problem?” The scoring process involves a sophisticated two-step pipeline:
- Step 1: LLM-based answer extraction: A model’s raw output is unstructured text containing both its reasoning and the final answer. To reliably parse the definitive numerical or boolean value, we utilized a powerful supervisor model (openai/gpt-4o) as an intelligent parser. This method consistently identifies the intended final answer, even with slight variations in formatting across different models.
- Step 2: Tolerance-based comparison: A simple “exact match” is insufficient for numerical problems. Therefore, the extracted answer was programmatically compared against the ground truth. The script applies a numerical tolerance threshold (a relative difference of 0.2%) to fairly handle minor floating-point or rounding variations, ensuring that conceptually sound solutions are marked as correct.
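For illustration, the tolerance check can be expressed as a small helper like the one below. This is a sketch of the logic described above, not the benchmark’s actual script; non-numeric (e.g., boolean) answers fall back to a normalized string comparison.

```python
def is_correct(predicted, ground_truth, rel_tol: float = 0.002) -> bool:
    """Return True if the extracted answer matches the ground truth within 0.2%."""
    try:
        pred, truth = float(predicted), float(ground_truth)
    except (TypeError, ValueError):
        # Non-numeric answers (e.g., True/False): compare normalized strings.
        return str(predicted).strip().lower() == str(ground_truth).strip().lower()
    if truth == 0:
        return abs(pred) <= rel_tol
    return abs(pred - truth) / abs(truth) <= rel_tol
```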
2. Secondary metric: Token consumption
This metric answers the question: “How computationally expensive is it for the model to solve these problems?” It measures the total cost associated with generating the 238 answers.
- Calculation: For each API call, we collected the usage data returned by the model provider, which includes prompt_tokens and completion_tokens. The final score for a model is the sum of total_tokens consumed across all 238 questions. This provides a clear measure of the model’s verbosity and overall computational cost for the task.
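In code, the accumulation looks roughly like the sketch below (reusing the client from the prompting example above; questions, build_messages, and MODEL are placeholders, not names from the benchmark).

```python
total_tokens = 0
for q in questions:  # the 238 hard questions
    resp = client.chat.completions.create(model=MODEL, messages=build_messages(q))
    # usage reports prompt_tokens, completion_tokens, and their sum
    total_tokens += resp.usage.total_tokens
```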
This two-metric approach, provided by the FinanceReasoning benchmark itself, allows for a holistic assessment, balancing a model’s raw problem-solving capability (accuracy) against its operational efficiency (token consumption).
Financial reasoning with Retrieval-Augmented Generation (RAG)
To surpass standalone models, we designed and implemented a custom RAG framework distinct from the benchmark’s original implementation. Our approach is built on a modern vector database stack (Qdrant) to supply LLMs with relevant, domain-specific knowledge at inference time, helping them solve problems beyond their training data. We tested this on gpt-4o-mini to measure its impact.
Results and analysis: The RAG trade-off
The introduction of RAG had a significant and measurable impact on the performance of gpt-4o-mini.
Key takeaways from the RAG evaluation:
- Significant accuracy improvement: RAG demonstrably enhanced the model’s problem-solving capability, boosting accuracy by over 10 percentage points. This confirms that providing external, relevant context is highly effective for complex, domain-specific reasoning tasks.
- The cost of accuracy: This performance gain came at a high cost. Total token consumption increased nearly 18-fold, and total execution time increased roughly 20-fold. This is due to the additional API calls for embedding and, more importantly, the vastly larger and more complex prompts that the LLM must process.
- Implications for larger models: The results from gpt-4o-mini suggest that while RAG can unlock higher performance, applying this method to larger, more expensive models like GPT-4o or Claude Opus will be substantially more costly and time-consuming. This highlights the critical trade-off between accuracy, cost, and latency in designing production-grade financial AI systems.
Financial reasoning RAG methodology
Our RAG pipeline is built on a modern stack using Qdrant as the vector database and OpenAI’s text-embedding-3-small model for generating semantic vector representations. The process consists of two main phases: an offline indexing phase and an online retrieval-generation phase.
1. Knowledge corpus indexing
- Corpus creation: We curated a specialized knowledge base from two sources provided by the benchmark:
- Financial documents: A collection of articles (financial_documents.json) explaining various financial concepts and terms.
- Financial functions: A library of ready-to-use Python functions (functions-article-all.json) designed to solve specific financial calculations.
- Intelligent chunking & embedding: To prepare this corpus for efficient retrieval, each document and function was processed and indexed:
- Chunking: Documents were segmented into smaller, semantically coherent chunks based on their sections. Each Python function was treated as a single atomic chunk. This ensures that the retrieved context is focused and relevant.
- Embedding: Each chunk was then converted into a 1536-dimension vector using the text-embedding-3-small model.
- Indexing: These vectors were indexed into two separate collections within our local Qdrant instance (financial_documents_openai_small and financial_functions_openai_small), optimized for cosine similarity search.
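The sketch below shows what this offline indexing phase can look like with qdrant-client and the OpenAI embeddings API. Chunking and file handling are simplified, and the payload fields are illustrative; only the collection names, the embedding model, the 1536-dimension vectors, and the cosine distance match the description above.

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient("localhost", port=6333)  # local Qdrant instance

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding  # 1536-dimensional vector

def index_chunks(collection: str, chunks: list[str]) -> None:
    qdrant.recreate_collection(
        collection_name=collection,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )
    points = [
        PointStruct(id=i, vector=embed(chunk), payload={"text": chunk})
        for i, chunk in enumerate(chunks)
    ]
    qdrant.upsert(collection_name=collection, points=points)

# index_chunks("financial_documents_openai_small", document_chunks)
# index_chunks("financial_functions_openai_small", function_chunks)
```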
2. RAG-powered inference
For each of the 238 questions, the model’s reasoning process was augmented with the following automated steps:
- Embedding generation (API calls 1 & 2): The user’s query (question + context) was converted into an embedding vector. This required two calls to OpenAI’s embedding API to prepare for searches in both collections.
- Multi-source retrieval: The query vector was used to perform a semantic search against both Qdrant collections simultaneously to retrieve the most relevant information:
- The top 3 most relevant document chunks from the financial_documents collection.
- The top 2 most relevant Python functions from the financial_functions collection.
- Prompt augmentation: The retrieved documents and functions were dynamically injected into the prompt, creating a rich, context-aware “information packet”. This significantly increased the input prompt size (from ~300-500 tokens to ~3,000-5,000+ tokens).
- Final answer generation (API call 3): This augmented prompt was sent to the gpt-4o-mini model to generate the final, reasoned answer.
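Putting the three steps together, the online phase looks roughly like the sketch below (reusing embed() and the clients from the indexing sketch). Prompt wording and payload fields are illustrative; the retrieval parameters (top 3 document chunks, top 2 functions) match the description above. Note that the sketch embeds the query once and reuses the vector for both searches, whereas our pipeline issued one embedding call per collection.

```python
def answer_with_rag(question: str, context: str) -> str:
    query = f"{context}\n\n{question}"
    query_vec = embed(query)

    # Multi-source retrieval from both Qdrant collections
    docs = qdrant.search("financial_documents_openai_small", query_vector=query_vec, limit=3)
    funcs = qdrant.search("financial_functions_openai_small", query_vector=query_vec, limit=2)
    retrieved = "\n\n".join(hit.payload["text"] for hit in docs + funcs)

    # Prompt augmentation: inject the retrieved material ahead of the problem
    prompt = (
        "Use the reference material below if it is relevant.\n\n"
        f"--- Reference material ---\n{retrieved}\n\n"
        f"--- Problem ---\n{query}\n\n"
        "Think step by step, then give a final numeric answer."
    )

    # Final answer generation
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```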
LLMs in finance benchmark limitations
Our benchmark, while comprehensive, is subject to several key limitations:
- Data contamination risk: Since the FinanceReasoning dataset is public, the evaluated models may have been trained on it. This could lead to inflated scores, making true reasoning ability difficult to assess.
- Single-model RAG analysis: The RAG evaluation was performed on one model (gpt-4o-mini), so the observed trade-offs between performance and cost may not apply to all other models.
💡 Conclusion
Our benchmark of 36 models on complex financial reasoning tasks reveals key findings:
- gpt-5-2025-08-07 leads the field: With 88.23% accuracy, this model sets the current standard for financial reasoning tasks.
- Multiple strong alternatives exist: claude-opus-4.6 (87.82%) and gpt-5-mini-2025-08-07 (87.39%) offer near-top performance, with Claude Opus 4.6 achieving this with significantly lower token consumption (164,369 tokens).
- Efficiency matters as much as accuracy: gemini-3-flash-preview achieves 83.61% accuracy with 118,530 tokens, proving that high performance and low cost can coexist. Similarly, gpt-5.2 demonstrates strong efficiency at 247,660 tokens while achieving 86.13% accuracy.
- RAG’s impact: Retrieval-Augmented Generation (RAG) significantly boosts a model’s accuracy (+10 percentage points for gpt-4o-mini) but at a substantial cost in terms of token consumption (18x increase) and latency (20x slower).
Changelog
February 6, 2026
Added 7 new models to the benchmark:
- Claude Opus 4.6 (anthropic/claude-opus-4.6)
- Gemini 3 Pro Preview (google/gemini-3-pro-preview)
- GPT 5.2 (openai/gpt-5.2)
- Claude Opus 4.5 (anthropic/claude-opus-4.5)
- Gemini 3 Flash Preview (google/gemini-3-flash-preview)
- Kimi K2.5 (moonshotai/kimi-k2.5)
- Claude Sonnet 4.5 (anthropic/claude-sonnet-4.5)
Further reading
Financial analysis may refer to multiple capabilities, such as stock analysis, financial law interpretation, and financial reasoning. In our benchmark, we focused specifically on financial reasoning, while other tasks are covered in separate articles:
- LLM for stock analysis: These models help process market data, company reports, and news to identify investment opportunities. (See full analysis here: AI-based Stock Trading)
- Finance law AI: Some LLMs can interpret financial regulations, contracts, and compliance requirements to assist legal-finance tasks. (See our legal AI tools list here: Legal AI Tools)