AIMultiple Research
We follow ethical norms & our process for objectivity.
This research is not funded by any sponsors.
Updated on Apr 7, 2025

Receipt OCR Benchmark with LLMs in 2025

Extracting data from receipts is essential for businesses, since millions of employees submit their work-related expenses as receipts.

With the latest developments in generative AI and large language models, data extraction accuracy is now approximately at human level.

Benchmark results

We used Claude 3.5 Sonnet to measure the receipt data extraction accuracy of LLMs:

This image shows that Anthropic Claude 3.5 Sonnet extracted data from receipts with 97% accuracy.
Figure 1: Results of data extraction accuracy.

Dataset

We divided our dataset into two parts:

  • High quality: Scanned, high-resolution receipts. These images are well aligned and have high contrast.1

  • Low quality: Photographed, low-quality receipts. These images are not properly aligned and received no pre-processing to increase contrast.2

Figure 2: Samples from the high-quality and low-quality datasets.

Our aim is to cover real-life cases as much as possible.

We asked for JSON output to make evaluation easier. Our prompt is: “Please output the text on the PDFs in a proper JSON format.”

We also have an invoice OCR benchmark if you are interested.

Methodology

Results were evaluated at key-value pair level:

  • If a field includes the correct label and value, it is marked as correct.

  • If the label or the value differs from the ground truth by even one character, that pair is marked as incorrect.

Extraction accuracy: Number of correctly extracted key-value pairs divided by the total number of key-value pairs.
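The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the exact script behind the benchmark: the function names `flatten` and `extraction_accuracy`, the dotted path notation for nested fields, and the sample field names are all assumptions made for the example.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted.path: value-as-string} pairs."""
    pairs = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            pairs.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            pairs.update(flatten(value, f"{prefix}{index}."))
    else:
        # Leaf value: compare as text so any character difference counts.
        pairs[prefix.rstrip(".")] = str(obj)
    return pairs


def extraction_accuracy(predicted, ground_truth):
    """Correct key-value pairs divided by total ground-truth pairs."""
    truth = flatten(ground_truth)
    guess = flatten(predicted)
    if not truth:
        raise ValueError("ground truth has no key-value pairs")
    # A pair is correct only if both the label (key) and the value match exactly.
    correct = sum(1 for key, value in truth.items() if guess.get(key) == value)
    return correct / len(truth)


# Hypothetical example: one digit of the total is misread ("12.48" vs "12.40").
truth = {"vendor": "Cafe", "date": "2025-04-07", "total": "12.40", "tax": "0.40"}
predicted = {"vendor": "Cafe", "date": "2025-04-07", "total": "12.48", "tax": "0.40"}
print(extraction_accuracy(predicted, truth))  # 0.75
```

Because values are compared as exact strings, a single misread character makes the whole pair count as incorrect, which matches the strict rule described above.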

Next steps

We will add more LLMs (ChatGPT etc.) to this benchmark to better examine their data extraction ability.

What is receipt OCR?

Receipt OCR (Optical Character Recognition) is a technology that extracts data from scanned and digital receipts using artificial intelligence and machine learning algorithms. Receipt OCR parses the data, converts it to a structured format and captures details in the receipt, like date, items, and prices.

Best practices to extract data from receipts

To increase the accuracy of the OCR, the images should be:

  • High resolution

  • Aligned well

  • Free of printing errors

You should also be aware of the following issues:

Most receipt OCR tools fail to match an item with its correct price when the line after the item is a note about it with no price listed. In that case, tools commonly read the next item’s price as the note’s price. Consider the following example:

Blank fields on receipts affect Receipt OCR accuracy.
Figure 3: A common mistake of receipt OCR tools.

In such cases, the OCR output may match “SpcyDlx +PJ” with the price 0.40, which is not correct. This is especially likely when image resolution and quality are low and the image is not aligned straight.
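One way to guard against this mismatch in post-processing is to treat any line without a price as a note on the preceding item, instead of letting it take the next price. The sketch below is only an illustration under strong assumptions: the receipt text is already available as lines in reading order, prices appear at the end of a line with two decimals, and the function name `pair_items_with_prices` and the sample lines are hypothetical (they are not taken from Figure 3).

```python
import re

# Assumption: a priced line ends with an amount such as "5.99".
PRICE_AT_END = re.compile(r"^(?P<name>.*?)\s+(?P<price>\d+\.\d{2})$")


def pair_items_with_prices(lines):
    """Attach unpriced lines to the previous item as notes."""
    items = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = PRICE_AT_END.match(line)
        if match:
            items.append(
                {"name": match.group("name"), "price": match.group("price"), "notes": []}
            )
        elif items:
            # No price on this line: keep it as a note, never borrow the next price.
            items[-1]["notes"].append(line)
        else:
            # Unpriced line before any item: keep it, but with no price.
            items.append({"name": line, "price": None, "notes": []})
    return items


# Hypothetical receipt lines.
for item in pair_items_with_prices(["Burger 5.99", "no onions", "Extra sauce 0.40"]):
    print(item)
```

Prices are kept as strings here so that nothing is rounded before comparison with the ground truth. Recovering the lines in the right order from a skewed, low-resolution image remains the hard part, and this sketch does not address it.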

We noticed that with low resolution or printing errors (for example, ink that does not cover a letter completely), tools have trouble distinguishing similar characters, such as “8” and “9” or “5” and “6”. Confusing “/” and “1” is also common, especially in dates.
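Date fields are one place where such confusions can be caught automatically, because a misread separator usually produces a string that no longer parses as a date. The following sketch assumes dates are printed as MM/DD/YYYY, which will not hold for every receipt; the function names `parse_receipt_date` and `slash_repair_candidates` are made up for this example.

```python
from datetime import date, datetime
from typing import List, Optional


def parse_receipt_date(text: str, fmt: str = "%m/%d/%Y") -> Optional[date]:
    """Return the parsed date, or None if the text does not fit the format."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def slash_repair_candidates(text: str, fmt: str = "%m/%d/%Y") -> List[date]:
    """Try replacing each "1" with "/" and keep the variants that parse."""
    candidates = []
    for index, char in enumerate(text):
        if char == "1":
            variant = text[:index] + "/" + text[index + 1:]
            parsed = parse_receipt_date(variant, fmt)
            if parsed is not None:
                candidates.append(parsed)
    return candidates


# Hypothetical OCR output where the first "/" was read as "1".
print(parse_receipt_date("04107/2025"))       # None
print(slash_repair_candidates("04107/2025"))  # [datetime.date(2025, 4, 7)]
```

A repaired candidate is only a suggestion. When several candidates parse, or none do, the field is better flagged for manual review than corrected silently.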

Types of data that can be extracted from receipts

  • Receipt number

  • Date

  • Vendor name

  • Subtotal amount

  • Tax amount

  • Total amount

  • Purchased items
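The fields listed above can be held in a small typed structure, which also makes simple arithmetic checks possible. The sketch below is an assumption-laden illustration: the class and attribute names are hypothetical, and the check that subtotal plus tax equals the total is only a hint, since real receipts may include discounts, tips, or rounding.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class LineItem:
    description: str
    price: Decimal


@dataclass
class Receipt:
    receipt_number: str
    date: str
    vendor_name: str
    subtotal: Decimal
    tax: Decimal
    total: Decimal
    items: List[LineItem] = field(default_factory=list)

    def inconsistencies(self) -> List[str]:
        """Return arithmetic mismatches that may point to misread digits."""
        problems = []
        if self.subtotal + self.tax != self.total:
            problems.append("subtotal + tax != total")
        item_sum = sum((item.price for item in self.items), Decimal("0"))
        if self.items and item_sum != self.subtotal:
            problems.append("item prices do not sum to subtotal")
        return problems


# Hypothetical receipt in which the total was misread as 12.48 instead of 12.40.
receipt = Receipt(
    receipt_number="0001",
    date="2025-04-07",
    vendor_name="Cafe",
    subtotal=Decimal("12.00"),
    tax=Decimal("0.40"),
    total=Decimal("12.48"),
    items=[LineItem("Burger", Decimal("5.99")), LineItem("Salad", Decimal("6.01"))],
)
print(receipt.inconsistencies())  # ['subtotal + tax != total']
```

`Decimal` is used instead of `float` so that amounts such as 5.99 and 6.01 add up exactly.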

A step-by-step guide to receipt data extraction:

  • Receipt scanning: Scan the receipt at high resolution. Scanning produces higher-quality images than photographing the receipt.

  • Receipt processing: Pre-processing may be needed to increase the contrast and readability of the input image.

  • Receipt parsing: Parsing the receipt image is essential to analyze and capture data; it breaks the data down into more organized portions.

  • Using structured data: Structured data can be used to automate data entry in existing systems like accounting software. The extracted data supports many use cases, such as tracking transaction dates in financial records and expense management. Automatically extracting data from receipts with LLMs or receipt OCR APIs can reduce errors and manual entry and increase overall efficiency.
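As an illustration of the last step, the sketch below turns an extracted JSON document into CSV rows, a format that many accounting tools can import. The JSON keys (`date`, `vendor`, `items`, `name`, `price`) and the function name `receipt_json_to_csv` are assumptions for this example; the prompt used in the benchmark does not fix a schema, so real model output would need to be mapped to these keys first.

```python
import csv
import io
import json


def receipt_json_to_csv(json_text: str) -> str:
    """Write one CSV row per purchased item, repeating date and vendor."""
    receipt = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "vendor", "item", "price"])
    for item in receipt.get("items", []):
        writer.writerow(
            [
                receipt.get("date", ""),
                receipt.get("vendor", ""),
                item.get("name", ""),
                item.get("price", ""),
            ]
        )
    return buffer.getvalue()


# Hypothetical extraction result.
sample = '{"date": "2025-04-07", "vendor": "Cafe", "items": [{"name": "Burger", "price": "5.99"}]}'
print(receipt_json_to_csv(sample))
```

Missing keys become empty cells instead of raising an error, so incomplete extractions still produce a row that can be reviewed by hand.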

You can also see our Handwriting OCR Benchmark.

FAQ

What are the business benefits of OCR receipt scanning?

OCR technology helps with expense tracking and identifying spending patterns. Line items in the JSON response provide key information and save time, because raw text is extracted automatically from documents and invoices. Businesses can fine-tune an OCR engine according to project needs. Business identifiers from different countries, such as the Australian Business Number and VAT numbers, can also be extracted from receipts.

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 55% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE and NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and resources that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.
