
OCR in 2024: Benchmarking Text Extraction/Capture Accuracy

Cem Dilmegani
Updated on Feb 14
6 min read

Optical Character Recognition (OCR) is a field of machine learning that specializes in recognizing characters within images such as scanned documents, printed books, or photos. Although it is a mature technology, no OCR product can yet recognize all kinds of text with 100% accuracy. Among the products we benchmarked, only a few produced consistently accurate results on our test set.

Companies use OCR tools to identify texts and their positions in images, classify business documents by subject, or conduct key-value pairing within documents. Other technology companies build applications, such as document automation, on top of OCR results. For all these business cases, accurate text recognition is critical. We identified:

  • Google Cloud Vision and AWS Textract as the leading technologies in the market across all cases
  • ABBYY as another top performer for non-handwritten documents
  • All benchmarked OCRs, including the open-source Tesseract, as performing well on digital screenshots

Objective

This benchmark focuses on the text extraction accuracy of the products. We measure accuracy as the semantic similarity between the OCR output and the actual text. We only work with and compare the raw texts from the images; other product capabilities, such as text location detection, key-value pairing, or document classification, are not evaluated in this benchmark.

Products

We tested five OCR products to measure their text accuracy performance, using the versions available as of May 2021. The products are:

  • ABBYY FineReader 15
  • Amazon Textract
  • Google Cloud Platform Vision API
  • Microsoft Azure Computer Vision API
  • Tesseract OCR Engine

OCR products in the market differ in their capabilities, so we focused on those that can output raw text results. The products for this benchmark were chosen based on:

  • Capability to extract text. We did not include solutions that only extract machine-readable (i.e., structured) data in this comparison
  • Their popularity in the market
  • Success based on various resources

This was not a comprehensive market review and we may have excluded some products with significant capabilities. If that is the case, please leave a comment and we are happy to expand the benchmarking.

Data

Although there are many image datasets for OCR, these are

  • mostly at the character level and do not reflect real business use cases,
  • or focused on text location rather than the text itself.

Thus, we decided to create our own dataset under three main categories:

  1. Category 1 – Web page screenshots that include text: screenshots of random Wikipedia pages and Google search results for random queries.
  2. Category 2 – Handwriting: random photos that include different handwriting styles.
  3. Category 3 – Receipts, invoices, and scanned contracts: a random collection of receipts, handwritten invoices, and scanned insurance contracts collected from the internet.

All input files are in .jpg or .png format. We will be publishing all images once we are done with the benchmarking exercise. We are currently holding back the images in case another major OCR company wants to be included in the benchmark. We will only consider requests from companies of similar market traction as those in our current benchmark.

For each image, we generated a ground-truth .txt file containing the text in the image. These .txt files were used for comparison with the product outputs. The original text of each image and the product outputs will be provided once the benchmarking is closed. The name of each .txt file matches the name of the corresponding image file.
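This naming convention can be resolved with a one-line helper; a minimal sketch (the directory layout and file names shown are hypothetical):

```python
from pathlib import Path

def ground_truth_path(image_path: str) -> Path:
    """Map an image file to its ground-truth transcript by swapping the extension."""
    return Path(image_path).with_suffix(".txt")

# A .jpg or .png input pairs with the .txt file of the same name.
print(ground_truth_path("category3/receipt_01.jpg").name)  # receipt_01.txt
```

This keeps the image/ground-truth pairing purely filename-based, so no separate index file is needed.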

Method

We ran all the products on the same data set and generated text outputs as .txt files. Then we compared these outputs with the original texts to measure text accuracy, using the similarity function from the spaCy library in Python to calculate a similarity score between each product’s output and the original text. This similarity score is what we report as text accuracy.

The similarity function uses a cosine formulation to calculate the similarity between two texts. We did not use Levenshtein distance for this benchmark because different products output text in different orders. Levenshtein distance penalizes such ordering differences, but we only care about how accurately the text is detected, not where it appears on the page. Cosine similarity applies a negligible penalty in such cases, so we chose it for this benchmark.
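The order-insensitivity that motivated this choice can be illustrated with a bag-of-words cosine similarity. Note this is a simplified stand-in, not the article's actual scoring code: spaCy's similarity averages word vectors rather than counting raw tokens, but the ordering property is the same.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity: insensitive to word order."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The same words extracted in a different order still score a perfect 1.0 ...
print(cosine_similarity("total due 42.00 thank you", "thank you total due 42.00"))  # 1.0
# ... while a misread word ("dve" for "due") lowers the score.
print(cosine_similarity("total due 42.00", "total dve 42.00"))
```

A Levenshtein-based score, in contrast, would penalize the reordered output even though every word was extracted correctly.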

Results

Overall Results

[Figure: Overall Results of OCR Text Accuracy, with 90% confidence intervals]

Google Cloud Platform’s Vision OCR tool has the highest text accuracy, 98.0%, when the whole data set is tested. While all products perform above 99.2% on Category 1, which contains typed text, the handwritten images in Categories 2 and 3 create the real difference between the products. The overall results show that GCP Vision and AWS Textract are the dominant OCR products, recognizing the given text most accurately.

Notes from the overall results:

  • There is a single instance where AWS Textract failed to recognize the handwritten text. This outlier significantly reduces AWS Textract’s category and overall performance, and it also widens the deviation for the category and in total, because AWS Textract performs very well in all other instances.
  • Azure was the leading product in Category 1 with 99.8% accuracy. However, it usually fails to recognize handwritten text, as seen in the Category 2 results. This is why Azure falls behind in Category 3 and overall.
  • Tesseract OCR is an open-source product that can be used for free. Compared to Azure and ABBYY, it performs better on handwritten instances and can be considered for handwriting recognition if the user cannot obtain the AWS or GCP products. However, it performs worse on scanned images.
  • Unlike the other products, ABBYY outputs a more structured .txt file, taking the location of the text on the image into account while generating the output. While the product has further useful capabilities, we focus only on text accuracy in this benchmark, and ABBYY performed poorly in handwriting recognition.

Removing the “Trouble-Maker” image

As mentioned in the overall results, there was a single “outlier” image in which AWS Textract could not recognize any text. While the product shows more than 95% text accuracy on all other images, this instance reduced AWS’ performance and widened its confidence interval. As this instance might be an exception, we also wanted to compare the products without it. We called this image the “trouble-maker” and re-ran the analysis to see whether excluding it makes a difference.
Here are the new results when the “trouble-maker” is excluded from the data set.

[Figure: OCR Text Accuracy Results when the “trouble-maker” is excluded, with 90% confidence intervals]

When the “trouble-maker” is excluded, AWS Textract becomes the top performer with an almost perfect (99.3%) text accuracy and a narrow confidence interval. While the scores do not change much, GCP Vision and AWS Textract remain the top two products in terms of text accuracy.

Results without Handwriting Recognition

The main factor reducing the text accuracy of certain products is the images that include handwriting. Thus, we excluded all handwritten images (all of Category 2 and six images from Category 3) and re-evaluated the text accuracy performance.

[Figure: OCR Text Accuracy without handwriting recognition cases]

The results are much closer when handwritten images are excluded. AWS Textract and GCP Vision remain the top two products in the benchmark, but ABBYY FineReader also performs very well (99.3%) this time. Although all products exceed 95% accuracy when handwriting is excluded, Azure Computer Vision and Tesseract OCR still have issues with scanned documents, which puts them behind in this comparison.

Limitations

  • Limited dataset: Originally, we had a fourth category consisting of photos of newspapers, to observe product performance on photos of printed documents. However, these photos included too much text, which made it hard to generate the ground truth, so we decided not to use them.
  • Inconsistencies in output formats: Many images contain separate blocks of text on the left- and right-hand sides. The products extract these texts in different orders, causing the output files to differ even when the text is accurately detected. This prevented us from using other distance measures (such as Levenshtein distance) and limited our options for calculating text accuracy.
  • Possible problem with cosine distance: The cosine distance uses embeddings to calculate similarity. For example, comparing the sentences “I like tea” and “I like coffee” would give a higher similarity score than it should. However, cases like detecting the word “tea” as “coffee” hardly ever occur, so we did not take this possibility into account in this exercise.
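The output-ordering limitation above can be made concrete with a word-level Levenshtein distance. This is a sketch with invented example words: when a two-column page is read column-first versus interleaved, the edit distance charges four edits out of six words, even though every single word was recognized correctly.

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    """Word-level edit distance: heavily penalizes reordered text."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            # Minimum of deletion, insertion, and substitution/match costs.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = curr
    return prev[-1]

# Same six words, two plausible reading orders of a two-column layout.
left_then_right = "alpha beta gamma one two three".split()
interleaved = "alpha one beta two gamma three".split()
print(levenshtein(left_then_right, interleaved))  # 4
```

A bag-of-words cosine score would rate these two outputs as identical, which is why cosine similarity was the safer choice given the products' inconsistent output ordering.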

We use other market data (e.g. software reviews, customer case studies) to rank software providers. Feel free to check out our list of OCR providers. However, since most corporates use the term “OCR” when searching for data extraction solutions (i.e. including those that generate machine-readable data), our list has a larger scope and more companies than those presented in this benchmarking exercise.


Cem Dilmegani
Principal Analyst

Cem has been the principal analyst at AIMultiple since 2017. AIMultiple informs hundreds of thousands of businesses (as per similarWeb) including 60% of Fortune 500 every month.

Cem's work has been cited by leading global publications including Business Insider, Forbes, Washington Post, global firms like Deloitte, HPE, NGOs like World Economic Forum and supranational organizations like European Commission. You can see more reputable companies and media that referenced AIMultiple.

Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised businesses on their enterprise software, automation, cloud, AI / ML and other technology related decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.

He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.



Comments


6 Comments
Webster
Feb 05, 2023 at 07:24

Hello, great work! Just curious, did you use a trained Tesseract when making these testing?

Bardia Eshghi
Feb 06, 2023 at 12:29

Hi, Webster. Glad you enjoyed the article. The tools we tested were:
ABBYY FineReader 15
Amazon Textract
Google Cloud Platform Vision API
Microsoft Azure Computer Vision API
Tesseract OCR Engine
Hope this answers your question.

Bobby
Aug 14, 2022 at 23:54

The graph images are not working for me at the moment. Otherwise great

Cem Dilmegani
Aug 15, 2022 at 14:48

Thank you Bobby! We have a glitch in the CMS and we are fixing it. Apologies for the issue, it should be fixed next week.

samsun
Jun 07, 2022 at 14:10

Thanks for sharing, can you add a free OCR for everyone to use?
https://www.geekersoft.com/ocr-online.html

Cem Dilmegani
Aug 17, 2022 at 07:46

Hi Samsun, unfortunately, we don’t share all OCR providers on this page, there are thousands of them. We tried to put together the largest ones in terms of market presence. If you have evidence that your solution is one of the top 10 globally, please share it with us at info@aimultiple.com so we can consider it.

Scott
Jan 20, 2022 at 20:42

What version of Tesseract did you test with? They recently released v5.

Cem Dilmegani
Aug 23, 2022 at 12:01

Hi Scott, we did the benchmarking before Tesseract 5. We will redo it soon and include the versions in the methodology section as well.

Bob
Jan 12, 2022 at 15:09

This is very informative, nice work. I assume your tests used documents/images in English? I’ve been experimenting with OCR tools on other languages and finding relatively poor accuracy.

Cem Dilmegani
Jan 15, 2022 at 13:52

Exactly, all texts were in English.
I hear similar things about OCR on non-Latin characters. We have an Arabic speaker in the team who claims that accuracy in Arabic is much lower compared to English.
We can do a benchmark on non-Latin characters if there is demand for it.

kin
Jun 21, 2021 at 02:22

interesting post!!!
do you have any suggestion about improving accuracy on scanned image ? i’m using tesseract right now. anyway , great work!

Cem Dilmegani
Jun 22, 2021 at 07:50

Thank you for the comment. There are pre-processing approaches that can be implemented to improve image quality. But such approaches may already be used in Tesseract. A detailed research into Tesseract image processing would be helpful in your case.
