Top 3 Synthetic Document Generators Benchmarked [2026]

updated on Nov 25, 2025

Synthetic document generators create annotated, realistic document images that help train and evaluate machine learning models without relying on large, manually labeled datasets.

We benchmarked three leading synthetic document generators by creating more than 2,500 synthetic documents and comparing how well each tool reproduces realistic layouts, preserves numerical data, and yields useful training datasets for document analysis tasks.

Document generation benchmark results

The results show that:

Genalog and DocCreator are strong performers across utility and fidelity, with Genalog slightly better for numerical accuracy.
Tonic Textual excels in visual layout realism but lags behind in utility and numerical fidelity, making it more suitable for tasks where document appearance matters most.

For more information on metrics, read the benchmark methodology.

Utility measures how well models trained on synthetic data perform on real documents.
Layout fidelity measures how well the spatial arrangement of elements in synthetic documents matches the real ones.
Numerical fidelity checks whether numeric values in synthetic documents resemble the real data.

Comment on results: To better understand the performance differences, the benchmark was also conducted using the training set instead of the separate test set. This secondary evaluation aimed to determine whether providing the models with training material would improve their ability to reproduce structured and numerically accurate outputs.

The results show that, even when evaluated on the training data, the models achieved only slightly higher scores. This indicates that the scores reflect more than how well the tools handle the task itself. The moderate results are likely influenced by limitations in OCR quality and the trained model’s capacity, rather than by the benchmarking procedure.

Genalog

Genalog performed the strongest overall. Its synthetic documents were very effective for model training and maintained a good balance between realistic layout elements and numerical accuracy. The generated documents reflected the structure and spacing of real forms and receipts closely, making them suitable for a variety of document analysis tasks.

DocCreator

DocCreator also produced high-quality outputs. Its documents were nearly as useful for training as Genalog’s. Layouts were realistic, and the synthetic documents generally preserved the statistical properties of numbers. DocCreator’s strength lies in combining diverse layout generation with its degradation models, making the outputs visually similar to scanned real-world documents.

Tonic Textual

Tonic Textual had mixed results. While this synthetic document generator produced very clean and consistent layouts, the documents were less effective for training models. In addition, the synthetic numbers were not always statistically similar to real data. This suggests that Tonic Textual is best suited for tasks that focus on document appearance or privacy-preserving PII replacement rather than full-scale training for layout structure and information extraction tasks.

Overall Insight

Genalog is the best-balanced tool, providing both realistic layouts and accurate numbers.

DocCreator is strong for complex and diverse layouts and document degradation, with minor numeric inaccuracies.

Tonic Textual is ideal for layout-focused tasks but not for tasks needing precise numeric data.

Methodology Overview

Evaluation metrics

Each generated dataset was scored against the original data using the following metrics:

Utility score

Utility (KIE F1 Score): A score between 0 and 1, where higher is better. It is defined by the F1 score of the LayoutLMv3 model trained on the synthetic data when evaluated on the real test set. A high score indicates the synthetic data is a highly effective substitute for real data.
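
As a rough illustration of the F1 computation behind the utility score, the sketch below computes a token-level, micro-averaged F1 in plain Python. It is a simplification under stated assumptions: the benchmark does not say whether F1 was computed per token or per entity, and the function name micro_f1, the example label names, and the choice to ignore the "O" (no entity) label are all hypothetical.

```python
def micro_f1(gold, pred, ignore="O"):
    """Token-level micro-averaged F1 over all labels except `ignore`.

    Assumption: one label per token; `ignore` marks "no entity".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp += 1  # correct entity label
            continue
        if p != ignore:
            fp += 1  # predicted an entity label that is wrong
        if g != ignore:
            fn += 1  # missed (or mislabeled) a real entity token
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for four tokens of a form.
gold = ["QUESTION", "ANSWER", "O", "HEADER"]
pred = ["QUESTION", "O", "O", "ANSWER"]
score = micro_f1(gold, pred)  # precision 1/2, recall 1/3
```

In the benchmark setting, gold would come from the real test set and pred from the model trained on synthetic documents.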

Fidelity scores

These metrics measure how closely the synthetic documents resemble the real ones.

Layout Fidelity (EMD Score): The Earth Mover’s Distance (d_EMD) measures the difference between the distribution of bounding box center points in the real versus synthetic documents. It is a value from 0 to 1, where lower is better. A low score signifies that the spatial layout elements are well-preserved.
Numerical Fidelity (K-S Distance): The Kolmogorov-Smirnov Distance (D_KS) measures the maximum difference between the cumulative distribution functions (CDFs) of numerical values (e.g., prices, quantities) in the real and synthetic data. It ranges from 0 to 1, where lower is better. A low score means the generator accurately reproduces the statistical properties of the numbers.
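
The two fidelity scores above can be sketched in plain Python as follows. This is a minimal illustration, not the benchmark’s implementation: the function names are hypothetical, and the layout score here averages a one-dimensional Earth Mover’s Distance over the x and y coordinates of normalized box centers, which is a simplification of a full two-dimensional EMD. The benchmark does not describe its exact normalization.

```python
from bisect import bisect_right


def _cdf_gap(a, b, x):
    """Absolute gap between the empirical CDFs of sorted lists a and b at x."""
    return abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))


def ks_distance(real, synth):
    """Two-sample K-S distance: largest gap between the empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    return max(_cdf_gap(a, b, x) for x in set(a) | set(b))


def emd_1d(real, synth):
    """1-D Earth Mover's Distance: area between the two empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    points = sorted(set(a) | set(b))
    return sum(
        _cdf_gap(a, b, left) * (right - left)
        for left, right in zip(points, points[1:])
    )


def box_centers(boxes, page_width, page_height):
    """Centers of (x0, y0, x1, y1) boxes, scaled to the range [0, 1]."""
    return [
        ((x0 + x1) / (2 * page_width), (y0 + y1) / (2 * page_height))
        for x0, y0, x1, y1 in boxes
    ]


def layout_distance(real_centers, synth_centers):
    """Simplified layout score: mean of the per-axis 1-D EMDs (an assumption)."""
    dx = emd_1d([c[0] for c in real_centers], [c[0] for c in synth_centers])
    dy = emd_1d([c[1] for c in real_centers], [c[1] for c in synth_centers])
    return (dx + dy) / 2
```

Because the centers are scaled to [0, 1], both distances stay between 0 and 1, matching the ranges described above.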

All metrics were normalized during the calculation.

Datasets

FUNSD: A collection of 199 scanned forms characterized by noisy text, complex and diverse layouts, and handwritten annotations. It was downloaded more than 1,500 times last month. The dataset tests a generator’s ability to handle unstructured and imperfect data.¹

We divide the sample into two: 80% of the data is used for training the model, while the remaining 20% is reserved for testing after training.
Each tool produced between three and six synthetic documents for every original, resulting in a total of more than 2,500 synthetic documents.
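
The 80/20 split described above can be sketched as below. The benchmark does not state how the split was drawn, so the shuffling, the fixed seed, and the function name split_documents are assumptions made for illustration.

```python
import random


def split_documents(doc_ids, train_fraction=0.8, seed=0):
    """Shuffle document IDs reproducibly and split them into train and test."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed is an assumption
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# FUNSD has 199 forms; IDs 0..198 stand in for the real file names.
train_ids, test_ids = split_documents(range(199))
```

With 199 forms, this yields 159 training documents and 40 test documents.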

Task evaluation

To measure utility, a popular LayoutLMv3 model with 22K GitHub stars and over 750K downloads was trained on the synthetic data generated by each synthetic document generator tool.²

The performance of this model was then evaluated on a held-out test set of real documents from the original dataset. This directly measures how useful the synthetic data is for a real-world task.

Synthetic generation tools

Genalog

An open-source Python library by Microsoft for generating synthetic document images with synthetic noise. It works by taking text + layout templates (written in HTML + CSS) and rendering them via WeasyPrint, then applying degradation effects (blur, bleed-through, salt-and-pepper noise, morphological operations).³

DocCreator

A multi-platform, open-source tool for generating synthetic document images with associated ground truth. It has been widely used in Document Image Analysis and Recognition (DIAR) research.⁴,⁵

Tonic Textual

A solution for redaction and synthesis in real-world document formats (PDF, Word). It claims to scan unstructured documents, identify named entities (e.g., PII), redact or replace them with synthetic values, and output de-identified documents in similar formats.
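
To make the replace-with-synthetic-values idea concrete, here is a toy sketch that swaps e-mail addresses and phone numbers for placeholder values. It is not Tonic Textual’s API or method: the product is described as identifying named entities, whereas this example uses two hand-written regular expressions, and the function name replace_pii and the replacement formats are hypothetical.

```python
import re

# Simplified patterns (assumptions); real PII detection needs far more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def replace_pii(text):
    """Replace e-mail addresses and phone numbers with synthetic stand-ins."""
    counters = {"email": 0, "phone": 0}

    def fake_email(match):
        counters["email"] += 1
        return f"user{counters['email']}@example.com"

    def fake_phone(match):
        counters["phone"] += 1
        return f"555-010-{counters['phone']:04d}"

    text = EMAIL.sub(fake_email, text)
    return PHONE.sub(fake_phone, text)


clean = replace_pii("Contact jane.doe@corp.org or 212-555-7788.")
```

The surrounding text is left untouched, which is why this style of tool preserves layout well but does not, by itself, produce new layouts or numeric distributions.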

8 Synthetic document degradation methods

Synthetic document generation often includes adding realistic defects so that artificial data resembles real-world documents. Degradation models simulate these defects and help train models that perform better on noisy, aged, or scanned documents. These tools apply several physical and visual transformations to simulate common document imperfections.⁶

1. Ink degradation

This model simulates fading, blotches, or streaks caused by aging or low-quality printing. It adds small ink spots or removes parts of letters to imitate real ink decay.

2. Phantom characters

Old printing tools often left faint outlines or “ghost” marks around letters. The phantom character model recreates these by inserting extracted defects from real scans between printed characters.

3. Paper holes

Holes of different shapes and sizes are added randomly to documents, replicating tears or punch marks seen in worn papers.
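
A minimal sketch of the idea, assuming a grayscale page stored as a list of rows with values from 0 (black) to 255 (white). It only punches circular holes, whereas the model described above uses holes of different shapes; the function name punch_holes, the dark fill value, and the parameters are assumptions.

```python
import random


def punch_holes(image, n_holes=3, max_radius=4, fill=0, seed=0):
    """Return a copy of `image` with randomly placed circular holes.

    `fill` is the value seen through a hole (assumed dark scanner background).
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input untouched
    for _ in range(n_holes):
        cy, cx = rng.randrange(height), rng.randrange(width)
        radius = rng.randint(1, max_radius)
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (y - cy) ** 2 + (x - cx) ** 2 <= radius * radius:
                    out[y][x] = fill
    return out


page = [[255] * 20 for _ in range(20)]  # blank white page
holed = punch_holes(page, n_holes=2, max_radius=3, seed=1)
```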

4. Bleed-through

This effect mimics ink seeping through from the other side of the page. It uses both front and back images of a document to recreate how the ink partially transfers through the paper.
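
One simple way to approximate this, assuming grayscale pages stored as lists of rows (0 is black ink, 255 is white paper), is to mirror the back page and let a fraction of its ink darken the front. The blend formula, the strength parameter, and the function name bleed_through are assumptions, not the formula used by any of the benchmarked tools.

```python
def bleed_through(front, back, strength=0.3):
    """Darken `front` with a faded, mirrored copy of `back`.

    strength=0 leaves the front unchanged; strength=1 shows the back ink fully.
    """
    if not 0 <= strength <= 1:
        raise ValueError("strength must be between 0 and 1")
    out = []
    for front_row, back_row in zip(front, back):
        mirrored = back_row[::-1]  # the reverse side appears flipped left to right
        out.append([
            # Faded back ink, but never lighter than what is already on the front.
            min(f, round(255 - strength * (255 - b)))
            for f, b in zip(front_row, mirrored)
        ])
    return out


front = [[255, 255, 0]]  # ink in the rightmost pixel
back = [[255, 255, 0]]   # lands on the leftmost pixel once mirrored
result = bleed_through(front, back, strength=0.2)
```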

5. Adaptive blur

Scanning or photographing documents often creates slight blur. This model compares real blurred examples and applies a similar blur using Gaussian filters, keeping the result subtle and realistic.
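
The Gaussian filtering step can be sketched in plain Python as a separable blur over a grayscale image stored as a list of rows. This sketch takes the blur width sigma as a given parameter; estimating it by comparison with real blurred examples, as described above, is not shown. The function names and the edge handling (repeating border pixels) are assumptions.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights covering about three sigmas per side."""
    radius = max(1, int(3 * sigma))
    weights = [
        math.exp(-(i * i) / (2 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_line(line, kernel):
    """Convolve one row or column, repeating the border pixels at the edges."""
    radius = len(kernel) // 2
    last = len(line) - 1
    return [
        sum(k * line[min(max(i + j - radius, 0), last)]
            for j, k in enumerate(kernel))
        for i in range(len(line))
    ]


def gaussian_blur(image, sigma=1.0):
    """Blur rows, then columns (a Gaussian filter is separable)."""
    kernel = gaussian_kernel(sigma)
    rows = [_blur_line(row, kernel) for row in image]
    cols = [_blur_line(list(col), kernel) for col in zip(*rows)]
    return [[round(v) for v in row] for row in zip(*cols)]


page = [[255] * 5 for _ in range(5)]
page[2][2] = 0  # a single black pixel on a white page
blurred = gaussian_blur(page, sigma=1.0)
```

A small sigma keeps the effect subtle; larger values spread each stroke over more neighboring pixels.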

6. 3D paper deformation

Documents can bend, fold, or curve when scanned or photographed. Using 3D meshes from real papers, this model recreates these shapes and lighting effects, helping to train models for camera-based document analysis.

7. Nonlinear illumination

Uneven lighting during scanning can make one side of a document appear darker. This model adjusts brightness based on simulated light angles and page curvature, reproducing the effect of poor illumination.
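
A much-simplified sketch of uneven lighting: brightness falls off from the left edge to the right edge of a grayscale image stored as a list of rows. The model described above accounts for light angles and page curvature, which this example does not attempt; the falloff curve, the parameter names, and the function name uneven_illumination are assumptions.

```python
def uneven_illumination(image, darkest_gain=0.6, exponent=2.0):
    """Scale brightness from 1.0 at the left edge to `darkest_gain` at the right.

    `exponent` bends the falloff so the darkening is not a straight ramp.
    """
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x, value in enumerate(row):
            t = x / (width - 1) if width > 1 else 0.0  # 0 = lit side, 1 = dark side
            gain = 1.0 - (1.0 - darkest_gain) * t ** exponent
            new_row.append(round(value * gain))
        out.append(new_row)
    return out


shaded = uneven_illumination([[200, 200, 200]])
```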

8. Salt-and-pepper noise

Adds random black and white pixels to simulate dust, paper texture, or scanning sensor noise. This “salt-and-pepper” effect helps create the grainy appearance of aged or low-quality digital scans.
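
A minimal sketch of salt-and-pepper noise on a grayscale image stored as a list of rows (0 is black, 255 is white). The amount parameter, the fixed seed, and the function name salt_and_pepper are assumptions chosen for the example.

```python
import random


def salt_and_pepper(image, amount=0.05, seed=0):
    """Set roughly `amount` of the pixels to pure black or pure white."""
    rng = random.Random(seed)
    out = []
    for row in image:
        new_row = []
        for value in row:
            if rng.random() < amount:
                new_row.append(rng.choice((0, 255)))  # pepper or salt
            else:
                new_row.append(value)
        out.append(new_row)
    return out


gray_page = [[128] * 50 for _ in range(50)]
noisy = salt_and_pepper(gray_page, amount=0.1)
```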

Synthetic document generation as a solution for layout analysis challenges

The challenge of layout analysis

Understanding the structure of documents is harder than just reading the text. OCR tools can extract words, but they don’t explain the role of each block, such as titles, tables, or figures.

Several generations of methods have been developed to address this challenge:

Early methods for layout analysis were rule-based. They relied on geometric rules and texture analysis to split pages into blocks. While useful, these approaches required heavy manual tuning and did not generalize well.

Machine learning approaches like Support Vector Machines (SVMs) and Gaussian Mixture Models (GMMs) improved this by learning from data.⁷ However, they still depended on hand-crafted features and struggled with the diversity of real-world documents.

Deep learning transformed the field. Convolutional neural networks (CNNs) made it possible to treat layout recognition like object detection, identifying tables, figures, or formulas in much the same way models detect objects in natural images.⁸ Some models also combine both text and image features for more accurate results.
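
To show why the early rule-based methods described above needed manual tuning, here is a sketch of one classic geometric rule: cutting a page into horizontal blocks wherever enough consecutive rows contain no ink. It is an illustration only and not the method of any cited work; the function name and both thresholds are assumptions.

```python
def split_into_row_blocks(image, ink_threshold=128, min_gap=2):
    """Return (top, bottom) row ranges separated by at least `min_gap` blank rows.

    A row counts as blank when no pixel is darker than `ink_threshold`.
    """
    has_ink = [any(v < ink_threshold for v in row) for row in image]
    blocks, start, gap = [], None, 0
    for y, inked in enumerate(has_ink):
        if inked:
            if start is None:
                start = y  # a new block begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # enough white space: close the block
                blocks.append((start, y - gap))
                start, gap = None, 0
    if start is not None:  # close a block that runs to the page end
        blocks.append((start, len(has_ink) - 1 - gap))
    return blocks


WHITE = [255, 255, 255, 255]
INK = [255, 0, 0, 255]
page = [INK, INK, WHITE, WHITE, INK, WHITE]
blocks = split_into_row_blocks(page)
```

Both ink_threshold and min_gap have to be chosen by hand for each document type, which is the kind of manual tuning that limited how well these approaches generalized.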

The challenge of deep learning: these models require large, labeled datasets to train.

Synthetic data as a solution: synthetic document generation offers a scalable way to create annotated training data without the cost of manual labeling.

Generative models now bring more advanced possibilities. Variational autoencoders (VAEs), attention-based models, and GANs can learn structural patterns of documents and produce realistic new layouts.⁹

Key Differences Between Synthetic Document Generators

The three synthetic document generators benchmarked differ in focus, output quality, and usability:

Genalog: Best balanced for both realistic layouts and numerical accuracy. Its Python-based workflow with HTML/CSS templates and degradation models makes it ideal for training machine learning models across diverse document analysis tasks.
DocCreator: Strong in generating visually complex and degraded documents, preserving layout diversity. Slightly less accurate numerically than Genalog, but effective for tasks requiring realistic scanned-document simulation.
Tonic Textual: Excels in clean, visually consistent layouts and privacy-preserving data synthesis. Less suitable for numeric accuracy or full training datasets, making it better for layout-focused tasks or PII replacement.

These differences reflect their primary approaches: Genalog balances realism and data fidelity, DocCreator emphasizes layout variety and document degradation, and Tonic Textual prioritizes appearance and privacy. This helps users select the right tool based on whether the priority is training effectiveness, layout realism, or data de-identification.

Reference Links

1. nielsr/funsd · Datasets at Hugging Face

2. microsoft/layoutlmv3-base · Hugging Face

3. Synthetic Document Generator

4. GitHub - DocCreator/DocCreator: DIAR software for synthetic document image and groundtruth generation, with various degradation models for data augmentation

5. DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images | MDPI

6. DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images | MDPI

7. Evaluation of SVM, MLP and GMM Classifiers for Layout Analysis of Historical Documents | Proceedings of the 2013 12th International Conference on Document Analysis and Recognition

8. CNN Based Page Object Detection in Document Images | Semantic Scholar | IEEE International Conference on Document Analysis and Recognition

9. [2104.02416] Variational Transformer Networks for Layout Generation

Industry Analyst

Ezgi Arslan, PhD.

Ezgi holds a PhD in Business Administration with a specialization in finance and serves as an Industry Analyst at AIMultiple. She drives research and insights at the intersection of technology and business, with expertise spanning sustainability, survey and sentiment analysis, AI agent applications in finance, answer engine optimization, firewall management, and procurement technologies.
