Synthetic document generators create annotated, realistic document images that help train and evaluate machine learning models without relying on large, manually labeled datasets.
We benchmarked leading synthetic document generators by creating more than 2,500 synthetic documents and comparing how well each tool produces realistic layouts, accurate numerical data, and useful training datasets for document analysis tasks. The results show that:
- Genalog and DocCreator are strong performers across utility and fidelity, with Genalog slightly better for numerical accuracy.
- Tonic Textual excels in visual layout realism but lags behind in utility and numerical fidelity, making it more suitable for tasks where document appearance matters most.
Benchmark results
For more information on metrics, read the benchmark methodology.
- Utility measures how well models trained on synthetic data perform on real documents.
- Layout fidelity measures how well the spatial arrangement of elements in synthetic documents matches the real ones.
- Numerical fidelity checks whether numeric values in synthetic documents resemble the real data.
Comment on results: To better understand the performance differences, the benchmark was also conducted using the training set instead of the separate test set. This secondary evaluation aimed to determine whether providing the models with training material would improve their ability to reproduce structured and numerically accurate outputs.
The results show that, even when evaluated on the training data, the models achieved only slightly higher scores. This indicates that the scores reflect not only how well the tools handle the task but also factors outside the tools themselves. The moderate results are likely influenced by limitations in OCR quality and the trained model’s capacity, rather than by the benchmarking procedure.
Genalog
Genalog was the strongest performer overall. Its synthetic documents were very effective for model training and maintained a good balance between realistic layout elements and numerical accuracy. The generated documents closely reflected the structure and spacing of real forms and receipts, making them suitable for a variety of document analysis tasks.
DocCreator
DocCreator also produced high-quality outputs. Its documents were nearly as useful for training as Genalog’s. Layouts were realistic, and the synthetic documents generally preserved the statistical properties of numbers. DocCreator’s strength lies in combining diverse layout generation with its degradation models, making the outputs visually similar to scanned real-world documents.
Tonic Textual
Tonic Textual had mixed results. While this synthetic document generator produced very clean and consistent layouts, the documents were less effective for training models. In addition, the synthetic numbers were not always statistically similar to real data. This suggests that Tonic Textual is best suited for tasks that focus on document appearance or privacy-preserving PII replacement rather than full-scale training for layout structure and information extraction tasks.
Overall Insight
- Genalog is the best-balanced tool, providing both realistic layouts and accurate numbers.
- DocCreator is strong for complex and diverse layouts and document degradation, with minor numeric inaccuracies.
- Tonic Textual is ideal for layout-focused tasks but not for tasks needing precise numeric data.
Methodology Overview
Evaluation metrics
Each generated dataset was scored against the original data using the following metrics:
Utility score
Utility (KIE F1 Score): A score between 0 and 1, where higher is better. It is the F1 score on the key information extraction (KIE) task of the LayoutLMv3 model trained on the synthetic data and evaluated on the real test set. A high score indicates that the synthetic data is an effective substitute for real data.
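As a rough illustration of how an F1 score of this kind is computed, the sketch below compares predicted and gold entities as sets. The entity tuples, the example values, and the `micro_f1` helper are assumptions made for illustration. The benchmark's actual scoring code is not described in this article and may differ, for example by scoring at the token level.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over sets of (document_id, field_label, text) entities."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(gold)          # share of gold entities that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical entities for one form (labels follow the FUNSD scheme).
gold = {
    ("doc1", "header", "FAX COVER SHEET"),
    ("doc1", "question", "Date:"),
    ("doc1", "answer", "03/12/98"),
}
predicted = {
    ("doc1", "question", "Date:"),
    ("doc1", "answer", "03/12/98"),
    ("doc1", "answer", "FAX"),  # wrong label and wrong span
}

score = micro_f1(predicted, gold)
print(f"KIE F1: {score:.3f}")
```

Two of the three predictions match the gold set, so precision and recall are both 2/3 in this toy case.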
Fidelity scores
These metrics measure how closely the synthetic documents resemble the real ones.
- Layout Fidelity (EMD Score): The Earth Mover’s Distance (d_EMD) measures the difference between the distribution of bounding box center points in the real versus synthetic documents. It is a value from 0 to 1, where lower is better. A low score signifies that the spatial layout elements are well-preserved.
- Numerical Fidelity (K-S Distance): The Kolmogorov-Smirnov Distance (D_KS) measures the maximum difference between the cumulative distribution functions (CDFs) of numerical values (e.g., prices, quantities) in the real and synthetic data. It ranges from 0 to 1, where lower is better. A low score means the generator accurately reproduces the statistical properties of the numbers.
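The two fidelity metrics can be sketched with the standard library alone. This is a minimal illustration and not the benchmark's implementation: the function names are hypothetical, bounding boxes are assumed to be `(x0, y0, x1, y1)` tuples already normalized to the 0 to 1 range, and the layout score averages a one-dimensional EMD over the x and y coordinates of the box centers, which is a simplification of a full two-dimensional EMD.

```python
from bisect import bisect_right


def ks_distance(real, synthetic):
    """Two-sample K-S distance: the largest gap between the two empirical CDFs."""
    real, synthetic = sorted(real), sorted(synthetic)
    points = sorted(set(real) | set(synthetic))
    return max(
        abs(bisect_right(real, x) / len(real) - bisect_right(synthetic, x) / len(synthetic))
        for x in points
    )


def emd_1d(a, b):
    """1D Earth Mover's Distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap = abs(bisect_right(a, left) / len(a) - bisect_right(b, left) / len(b))
        total += cdf_gap * (right - left)
    return total


def layout_emd(real_boxes, synthetic_boxes):
    """Average per-axis EMD between bounding box centers (a simplification)."""
    def centers(boxes):
        return [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]

    real_c, synth_c = centers(real_boxes), centers(synthetic_boxes)
    emd_x = emd_1d([c[0] for c in real_c], [c[0] for c in synth_c])
    emd_y = emd_1d([c[1] for c in real_c], [c[1] for c in synth_c])
    return (emd_x + emd_y) / 2


# Hypothetical data: synthetic boxes are shifted right by 0.1 of the page width.
real_boxes = [(0.1, 0.1, 0.3, 0.2), (0.5, 0.5, 0.9, 0.7)]
synthetic_boxes = [(0.2, 0.1, 0.4, 0.2), (0.6, 0.5, 1.0, 0.7)]
print("layout EMD:", round(layout_emd(real_boxes, synthetic_boxes), 3))

real_prices = [4.99, 12.50, 20.00, 35.75]
synthetic_prices = [5.10, 11.90, 60.00, 80.00]
print("K-S distance:", ks_distance(real_prices, synthetic_prices))
```

Because the coordinates are normalized, the per-axis EMD stays within 0 to 1, which matches the range described above.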
All metrics were normalized during the calculation.
Datasets
FUNSD: A collection of 199 scanned forms characterized by noisy text, complex and diverse layouts, and handwritten annotations. This dataset tests a generator’s ability to handle unstructured and imperfect data. It was downloaded more than 1,500 times last month.1
- We split the dataset in two: 80% of the data is used for training the model, while the remaining 20% is reserved for testing after training.
- Each tool produced between three and six synthetic documents for every original, resulting in a total of more than 2,500 synthetic documents.
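An 80/20 split of this kind can be sketched as follows. The document IDs, the shuffling, and the fixed seed are assumptions for illustration, since the article does not say how documents were assigned to each part.

```python
import random


def split_documents(doc_ids, train_fraction=0.8, seed=0):
    """Shuffle document IDs reproducibly and split them into train and test lists."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical IDs standing in for the 199 FUNSD forms.
doc_ids = [f"form_{i:03d}" for i in range(199)]
train_ids, test_ids = split_documents(doc_ids)
print(len(train_ids), "training documents,", len(test_ids), "test documents")
```

Keeping the test documents out of the generation and training steps is what makes the later utility score a measure of performance on unseen real data.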
Task evaluation
To measure utility, a popular LayoutLMv3 model with 22K GitHub stars and over 750K downloads was trained on the synthetic data generated by each synthetic document generator tool.2
The performance of this model was then evaluated on a held-out test set of real documents from the original datasets. This directly measures how useful the synthetic data is for a real-world task.
Synthetic generation tools
Genalog
An open-source Python library by Microsoft for generating synthetic document images with synthetic noise. It works by taking text + layout templates (written in HTML + CSS) and rendering them via WeasyPrint, then applying degradation effects (blur, bleed-through, salt-and-pepper noise, morphological operations).3
DocCreator
A multi-platform, open-source tool for generating synthetic document images with associated ground truth. It has been widely used in Document Image Analysis and Recognition (DIAR) research.4,5
Tonic Textual
A solution for redaction and synthesis in real-world document formats (PDF, Word). It claims to scan unstructured documents, identify named entities (e.g., PII), redact or replace them with synthetic values, and output de-identified documents in similar formats.
8 Synthetic document degradation methods
Synthetic document generation often includes adding realistic defects to make artificial data resemble real-world documents. These defects, or degradation models, help train models that perform better on noisy, aged, or scanned documents. These tools apply several physical and visual transformations to simulate common document imperfections.6
1. Ink degradation
This model simulates fading, blotches, or streaks caused by aging or low-quality printing. It adds small ink spots or removes parts of letters to imitate real ink decay.
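A minimal sketch of the idea, assuming a grayscale image stored as nested lists where 0 is black ink and 255 is white paper. The function name, probabilities, and threshold are illustrative assumptions and are not taken from any of the benchmarked tools.

```python
import random


def degrade_ink(image, erase_prob=0.1, spot_prob=0.01, ink_threshold=128, seed=0):
    """Randomly erase ink pixels and add stray ink spots on the paper."""
    rng = random.Random(seed)
    out = [row[:] for row in image]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value < ink_threshold and rng.random() < erase_prob:
                out[y][x] = 255  # part of a letter is lost
            elif value >= ink_threshold and rng.random() < spot_prob:
                out[y][x] = 0  # small ink spot appears on the paper
    return out


# Tiny test image: a vertical black stroke on white paper.
image = [[255, 0, 0, 255] for _ in range(6)]
degraded = degrade_ink(image, erase_prob=0.3, spot_prob=0.05)
for row in degraded:
    print(row)
```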
2. Phantom characters
Old printing tools often left faint outlines or “ghost” marks around letters. The phantom character model recreates these by inserting extracted defects from real scans between printed characters.
3. Paper holes
Holes of different shapes and sizes are added randomly to documents, replicating tears or punch marks seen in worn papers.
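The sketch below adds circular holes only, assuming a grayscale image stored as nested lists (0 is black, 255 is white) and assuming holes show up as dark scanner background. The function name and parameters are hypothetical, and real tools may support other hole shapes.

```python
import random


def punch_holes(image, count=3, max_radius=4, hole_value=0, seed=0):
    """Fill randomly placed circles with a background value to imitate holes."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for _ in range(count):
        cy, cx = rng.randrange(height), rng.randrange(width)
        radius = rng.randint(1, max_radius)
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (y - cy) ** 2 + (x - cx) ** 2 <= radius * radius:
                    out[y][x] = hole_value
    return out


page = [[255] * 20 for _ in range(20)]
holed = punch_holes(page)
print(sum(v == 0 for row in holed for v in row), "pixels removed")
```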
4. Bleed-through
This effect mimics ink seeping through from the other side of the page. It uses both front and back images of a document to recreate how the ink partially transfers through the paper.
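One simple way to approximate this effect is to mirror the back page, fade its ink, and keep the darker value at each pixel. The sketch assumes grayscale nested lists of equal size (0 is black ink, 255 is white paper). The `strength` parameter and the darkest-pixel blend are illustrative choices, not the method used by the benchmarked tools.

```python
def bleed_through(front, back, strength=0.3):
    """Blend a faded, mirrored copy of the back page into the front page."""
    out = []
    for front_row, back_row in zip(front, back):
        mirrored = back_row[::-1]  # the back is seen flipped left to right
        out_row = []
        for f, b in zip(front_row, mirrored):
            ghost = 255 - strength * (255 - b)  # faded ink from the other side
            out_row.append(int(round(min(f, ghost))))  # darker value wins
        out.append(out_row)
    return out


front = [[255, 255, 0]]  # ink on the right of the front page
back = [[255, 255, 0]]   # ink on the right of the back page
print(bleed_through(front, back, strength=0.2))
```

After mirroring, the ink from the back lands on the left of the front page as a light gray ghost, while the front's own ink stays black.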
5. Adaptive blur
Scanning or photographing documents often creates slight blur. This model compares real blurred examples and applies a similar blur using Gaussian filters, keeping the result subtle and realistic.
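The blurring step itself can be sketched as a separable Gaussian filter. This covers only the filtering: choosing `sigma` by comparison with real blurred examples, which is the adaptive part described above, is left out, and the function names are hypothetical. Images are assumed to be grayscale nested lists.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights covering about three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping coordinates at the edges."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += weight * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma=1.0):
    """Apply the 1D kernel horizontally, then vertically."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(image, kernel)
    vertical = transpose(blur_rows(transpose(horizontal), kernel))
    return [[int(round(v)) for v in row] for row in vertical]


# A single black pixel on white paper spreads into a soft gray dot.
sharp = [[255] * 7 for _ in range(7)]
sharp[3][3] = 0
blurred = gaussian_blur(sharp, sigma=1.0)
for row in blurred:
    print(row)
```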
6. 3D paper deformation
Documents can bend, fold, or curve when scanned or photographed. Using 3D meshes from real papers, this model recreates these shapes and lighting effects, helping to train models for camera-based document analysis.
7. Nonlinear illumination
Uneven lighting during scanning can make one side of a document appear darker. This model adjusts brightness based on simulated light angles and page curvature, reproducing the effect of poor illumination.
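A much simplified version of this effect darkens the page progressively from left to right. The sketch ignores light angles and page curvature, and the `darkest` and `falloff` parameters are assumptions chosen for illustration. Images are assumed to be grayscale nested lists with 255 as white.

```python
def uneven_illumination(image, darkest=0.6, falloff=2.0):
    """Scale brightness from full on the left edge to `darkest` on the right edge."""
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x, value in enumerate(row):
            t = x / (width - 1) if width > 1 else 0.0  # 0 at left edge, 1 at right edge
            gain = 1.0 - (1.0 - darkest) * t ** falloff  # nonlinear brightness drop
            new_row.append(int(round(value * gain)))
        out.append(new_row)
    return out


page = [[200] * 5 for _ in range(2)]
print(uneven_illumination(page)[0])
```

A `falloff` above 1 keeps most of the page bright and concentrates the darkening near one edge, which is closer to a shadow than a uniform gradient would be.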
8. Salt-and-pepper noise
This model adds random black and white pixels to simulate dust, paper texture, or scanning sensor noise. This “salt-and-pepper” effect helps create the grainy appearance of aged or low-quality digital scans.
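Salt-and-pepper noise is simple enough to sketch in a few lines. The version below assumes a grayscale image stored as nested lists, and the `amount` parameter (the fraction of pixels to corrupt, split evenly between black and white) is an illustrative choice.

```python
import random


def salt_and_pepper(image, amount=0.05, seed=0):
    """Set roughly `amount` of the pixels to pure black or pure white."""
    rng = random.Random(seed)
    out = [row[:] for row in image]
    for row in out:
        for x in range(len(row)):
            roll = rng.random()
            if roll < amount / 2:
                row[x] = 0  # pepper: black speck
            elif roll < amount:
                row[x] = 255  # salt: white speck
    return out


gray_page = [[128] * 10 for _ in range(10)]
noisy = salt_and_pepper(gray_page, amount=0.1)
changed = sum(v != 128 for row in noisy for v in row)
print(changed, "of 100 pixels changed")
```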
Synthetic document generation as a solution for layout analysis challenges
The challenge of layout analysis
Understanding the structure of documents is harder than just reading the text. OCR tools can extract words, but they don’t explain the role of each block, such as titles, tables, or figures.
Several generations of methods have been developed to address this challenge:
Early methods for layout analysis were rule-based. They relied on geometric rules and texture analysis to split pages into blocks. While useful, these approaches required heavy manual tuning and did not generalize well.
Machine learning approaches like Support Vector Machines (SVMs) and Gaussian Mixture Models (GMMs) improved this by learning from data.7 However, they still depended on hand-crafted features and struggled with the diversity of real-world documents.
Deep learning transformed the field. Convolutional neural networks (CNNs) made it possible to treat layout recognition like object detection, identifying tables, figures, or formulas in much the same way models detect objects in natural images.8 Some models also combine both text and image features for more accurate results.
The challenge of deep learning: these models require large, labeled datasets to train.
Synthetic data as a solution: synthetic document generation offers a scalable way to create annotated training data without the cost of manual labeling.
Generative models now bring more advanced possibilities. Variational autoencoders (VAEs), attention-based models, and GANs can learn structural patterns of documents and produce realistic new layouts.9
Further readings
- Synthetic Data Generation Benchmark & Best Practices
- Synthetic Data vs Real Data: Benefits, Challenges
- Top 5 Synthetic Data Finance Applications
Reference Links
