Interest in optical character recognition (OCR) and intelligent character recognition (ICR) technology is declining as companies switch to more automated solutions, such as machine learning-enabled data extraction. However, due to its various benefits, many companies still use [1] or plan to use tools powered by OCR technology in their paper-based operations.
Whether you use OCR/ICR tools or machine learning-enabled data extraction tools, you need training data to develop robust models for such solutions. However, data collection challenges can arise when preparing datasets to train these models.
Therefore, this article explains how developers and business leaders can prepare effective datasets to streamline the development and implementation process of their intelligent document processing (IDP) solutions.
1. Define the purpose of the dataset
First, establish the dataset’s purpose. This will help you decide what data needs to be gathered and how it should be presented.
For instance, if the dataset aims to train an OCR system to recognize text in scanned paper-based or digital documents, the information gathered should include scanned images of text in a range of font sizes, styles, and arrangements.
On the other hand, if the system needs to scan documents like invoices or bills, then the dataset should include images of numerical values, calculations, formulas, etc. (see image below).
The following image shows an example of an OCR system identifying numerical values in an invoice through bounding box tags:
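To make the idea concrete, here is a minimal sketch of what such a bounding-box annotation might look like as a Python record. The field names (`bbox`, `text`, `field`) and file names are illustrative assumptions, not a standard annotation schema:

```python
# Hypothetical annotation record for one invoice image: each numerical
# value is tagged with a bounding box (x, y, width, height in pixels)
# and its transcription. Field names are illustrative, not a standard.
invoice_annotation = {
    "image": "invoice_0001.png",
    "boxes": [
        {"bbox": [412, 118, 96, 22], "text": "1,250.00", "field": "line_item_amount"},
        {"bbox": [412, 301, 96, 22], "text": "2,075.00", "field": "total"},
    ],
}

def crop_region(bbox, image_size):
    """Clamp a bounding box to the image so crops never go out of bounds."""
    x, y, w, h = bbox
    img_w, img_h = image_size
    x2, y2 = min(x + w, img_w), min(y + h, img_h)
    return max(x, 0), max(y, 0), x2, y2

print(crop_region([412, 301, 96, 22], (500, 320)))  # prints (412, 301, 500, 320)
```

Storing the transcription alongside each box lets the same record train both a text detector (boxes) and a recognizer (text).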
2. Collect relevant data
Once the purpose of the dataset is understood, the next step is to collect the relevant data.
It’s critical to gather data that is representative of the kinds of documents the system will be handling. For instance, for an AI-powered resume screening system, you need to gather data that contains images of different types of resumes, such as:
- Format (chronological, functional, or combination)
- Academic or professional
- Field-specific resumes (for instance, a software developer's resume will contain different terminology than that of a human resources candidate)
Similarly, handwritten text images may be required for a system that will scan handwritten documents like letters or forms. The more diverse the dataset is regarding variations in writing tools, content, writing styles, and other factors, the better the OCR system will function on new, unseen images.
In another example, a license plate recognition system also uses OCR technology. The data required to train such systems usually consists of blurry images of different types of license plates, captured at various angles and in various lighting scenarios, mainly because the system needs to scan fast-moving vehicles.
Our recommendations
If preparing your dataset through in-house data collection does not suit your project timeline or budget, you can consider outsourcing or crowdsourcing the data.
3. Annotate the data accurately
Data annotation is a crucial step in preparing training data for any machine learning model, and OCR is no exception. It involves enriching the data with labels and tags that make it easier for the system to recognize the text and identify the data that needs to be extracted.
Things to consider while annotating data for OCR and ICR systems:
- In the case of an OCR system, the data should be labeled with the text that appears on the input image.
- On the other hand, for an ICR system, the data should be annotated with the information attached to each unit of text/numerical value (e.g., date, amount, etc.).
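The difference between the two label types can be sketched as follows. The record layout and field names are illustrative assumptions, not a standard format:

```python
# Illustrative label records (not a standard schema): an OCR sample is
# labeled with the full transcription, while an ICR sample additionally
# attaches a semantic field (date, amount, etc.) to each text unit.
ocr_label = {
    "image": "receipt_017.png",
    "text": "Total due 14.99 on 2023-05-02",
}

icr_label = {
    "image": "receipt_017.png",
    "units": [
        {"text": "14.99", "field": "amount"},
        {"text": "2023-05-02", "field": "date"},
    ],
}

def fields_present(label):
    """Return the set of semantic fields covered by an ICR label."""
    return {unit["field"] for unit in label["units"]}

print(fields_present(icr_label))  # {'amount', 'date'} (set order may vary)
```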
For higher-quality annotation, you can have a validator double-check the work done by the first annotator on important annotation tasks. Human annotators can annotate data manually or with semi-automated tools.
Manual annotation
In manual annotation, human annotators can label the images with the corresponding text using tools like a text editor, a graphical user interface, or specialized annotation software. This process can be time-consuming and may require multiple annotators for large-scale datasets.
Semi-automated annotation
Leveraging semi-automated tools can speed up the annotation process by using handwriting recognition algorithms to pre-label the data. These tools automatically create text transcriptions, which human annotators then review and correct. This human-in-the-loop approach can significantly reduce the time required for manual annotation while ensuring the quality of the data.
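The human-in-the-loop workflow described above can be sketched in a few lines. Here `machine_transcribe` is a stand-in stub for a real handwriting recognition model (hypothetical), and the confidence threshold is an illustrative assumption:

```python
# Minimal human-in-the-loop sketch. `machine_transcribe` stands in for a
# real recognition model; here it just returns a stored guess with a
# confidence score. The reviewer only corrects low-confidence predictions.
def machine_transcribe(image_id):
    guesses = {"form_01.png": ("lnvoice", 0.55), "form_02.png": ("Date", 0.97)}
    return guesses[image_id]  # (predicted text, confidence)

def annotate(image_ids, review_fn, confidence_threshold=0.9):
    labels = {}
    for image_id in image_ids:
        text, conf = machine_transcribe(image_id)
        if conf < confidence_threshold:
            text = review_fn(image_id, text)  # human fixes the prediction
        labels[image_id] = text
    return labels

# A human reviewer would fix "lnvoice" -> "Invoice"; simulated here:
labels = annotate(["form_01.png", "form_02.png"],
                  review_fn=lambda img, guess: "Invoice")
print(labels)  # {'form_01.png': 'Invoice', 'form_02.png': 'Date'}
```

Routing only low-confidence predictions to a human is what saves time relative to fully manual annotation.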
Our recommendations
Regardless of the method applied, it is important to make sure that the annotated data is accurate and consistent throughout the dataset. If in-house data labeling and automated tools do not suit your project requirements, then you can work with a data labeling partner.
4. Split the dataset into training, validation, & test sets
Once the annotation is done, the dataset can now be divided into training, validation, and test sets.
- The training set is used to train the model.
- The validation set is used to evaluate the performance of the model during training.
- The test set is used to evaluate the performance of the model after the training phase is complete. Metrics such as character error rate (CER) and word error rate (WER) are computed on this subset to evaluate the model's output.
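The character error rate mentioned above is commonly defined as the edit distance between the reference and the model output, normalized by the reference length; word error rate applies the same idea to word tokens. A minimal sketch:

```python
# Character error rate (CER) = edit distance / reference length.
# A minimal Levenshtein implementation using a rolling row.
def edit_distance(ref, hyp):
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(cer("invoice", "lnvoice"))  # 1 edit over 7 chars, ~0.143
```

In practice, established libraries are often used for these metrics, but the definition is exactly this normalized edit distance.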
It’s important to ensure that the three subsets accurately represent the data the system will be processing. This can be done by randomly sampling the data to ensure that each set contains a similar distribution of data.
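A reproducible random split can be sketched as follows; the 80/10/10 ratio and the fixed seed are illustrative assumptions:

```python
import random

# Sketch of a reproducible 80/10/10 split. Shuffling before slicing gives
# each subset a similar distribution when samples are independent.
def split_dataset(samples, train=0.8, val=0.1, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(n * train), int(n * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

For datasets with rare document types, a stratified split (sampling each document category separately) may preserve the distribution better than plain random shuffling.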
5. Preprocess the data
The data should be pre-processed to ensure that it is in the correct format and has the desired quality for training before being fed into an OCR or ICR system. Pre-processing can help to reduce or eliminate noise sources, enhance the quality of the data as a whole, and boost the system’s accuracy.
For instance, consider a scenario where the OCR system is trained to recognize handwritten text. If the input data is not pre-processed, it may contain a lot of noise, such as smudges, creases, and distortions. This noise can make the recognition process more challenging.
On the other hand, if the input data has been processed to eliminate noise and improve the quality, the OCR system will be more likely to recognize the text accurately.
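As a toy illustration of one such pre-processing step, global thresholding converts a grayscale scan into a clean black-and-white image, suppressing faint background noise. The threshold value and the tiny example image are assumptions for illustration; real pipelines would also deskew, denoise, and normalize resolution:

```python
# Toy pre-processing step: global thresholding turns a grayscale page into
# a binary image, suppressing faint smudges lighter than the threshold.
def binarize(image, threshold=128):
    """image: rows of grayscale values in 0..255 -> rows of 0/1 pixels
    (1 = background, 0 = ink)."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]

page = [
    [250, 240, 30],   # 30 is dark ink
    [245, 60, 235],   # 60 is dark ink; the rest is light background
]
print(binarize(page))  # [[1, 1, 0], [1, 0, 1]]
```

Adaptive thresholding (choosing the threshold per region) tends to work better than a single global value on unevenly lit scans.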

References
1. Slotta, D. (2021, July 30). "Market share of optical character recognition among image recognition technology in China from 2014 to 2025." Statista.