We compared the top 6 text-to-image models across 15 prompts to evaluate visual generation capabilities in terms of temporal consistency, physical realism, text and symbol rendering, human activity understanding, and complex multi-object scene coherence.
Text-to-image generator benchmark results
Review our benchmark methodology to understand how these results are calculated and see output examples.
Examples from the benchmark
Figure 1: Results from 6 text-to-image generators on the clocks task, featuring an analog and a digital clock displaying conflicting times.
Prompt: “An analog wall clock hangs on a light-colored wall, clearly visible with black hour and minute hands and numbered markings. On a wooden table below, a digital clock displays the time in bright LED numbers. The analog clock shows 12:35, and the digital clock shows 23:48.”
This prompt tests precise symbolic rendering and cross-object consistency. While most models display a readable digital time, errors commonly occur on the analog clock, where hand positions do not accurately correspond to the specified time.
Figure 2: Results from 6 text-to-image generators on the calendar task, depicting an impossible date (February 29, 2023).
Prompt: “A detailed close-up of a paper calendar on a desk. The calendar clearly shows the month “February 2023” printed at the top. The dates are laid out in a traditional grid format, and the calendar includes February 29 as a visible date. The paper texture is realistic, slightly off-white, with subtle shadows and soft lighting.”
This prompt is designed to test strict prompt compliance over real-world correctness by requiring an impossible calendar configuration. Stronger models correctly include February 29 while maintaining a realistic paper texture and grid layout, demonstrating instruction-following over factual priors. Lower-performing outputs either omit the 29th or show meaningless dates on the calendar, reducing compliance despite visual realism.
Figure 3: Results from 6 text-to-image generators on the notebook task, involving a long handwritten text.
Prompt: “A close-up view of an open notebook lying on a wooden desk. The pages are filled with neat handwritten text in dark ink. The handwriting contains sentences such as: “Time fractures perception when memory competes with intention, leaving behind echoes of decisions never fully made.” and “Language becomes fragile when meaning stretches beyond the limits of certainty.” The paper shows natural texture, slight creases, and realistic pen pressure variations. Warm ambient lighting, shallow depth of field.”
This prompt primarily evaluates long-form text generation in natural handwriting. Most models produce visually convincing handwriting textures but fail on semantic accuracy, line continuity, or exact sentence reproduction. Higher scores correlate with outputs that preserve readable and coherent text across multiple lines without degenerating into pseudo-writing.
Figure 4: Results from 6 text-to-image generators on the hands task, requiring nail painting with specific color and pattern constraints.
Prompt: “A close-up, highly detailed shot focusing only on a woman’s hands as she paints her fingernails. The hand on the table, three of her fingernails are painted a glossy blue, while two fingernails are painted red with a white dotted design. The other hand holds a small nail polish brush, carefully applying polish to the nails. The skin texture is realistic, with soft natural lighting highlighting the fingers and nail surfaces. The background is softly blurred and neutral, ensuring full focus on the hands and the contrasting nail colors and patterns.”
This prompt focuses on anatomical accuracy, fine motor interaction, and pattern control across multiple small objects. No model fully complied with the prompt.
Common errors include incorrect hand and nail counts, inconsistent nail colors, and implausible brush positioning. The stronger outputs separate the two hands more clearly, come closer to the specified color and pattern distribution, and maintain realistic skin and nail geometry.
Figure 5: Results from 6 text-to-image generators depicting a child using a calculator to apply the quadratic formula.
Prompt: “A realistic, well-lit scene of a child sitting at a desk, using a handheld calculator while concentrating on a complex mathematical problem. The calculator screen clearly displays the formula: x = (−b ± √(b² − 4ac)) / (2a). A notebook lies open on the desk with handwritten calculations and symbols matching the formula. The child’s hands are visible pressing the calculator buttons, and the expression on their face shows focus and curiosity. The environment feels like a quiet study space, with natural daylight, soft shadows, and a shallow depth of field for a photorealistic look.”
This prompt tests fine-grained text rendering, mathematical symbol accuracy, and narrative alignment between objects. The main differentiator is whether the calculator screen correctly displays the complete quadratic formula and whether the surrounding notebook supports it contextually. Models that approximate or simplify the formula lose significant compliance despite realistic scenes.
Figure 6: Results from the 6 text-to-image generators on the indoor–outdoor task, featuring a woman looking out of her bedroom window onto a busy street.
Prompt: “A young woman stands in pink pajamas in her messy bedroom, holding her hair up with one hand while staring out an open window toward a busy street below; outside, cars pass, and a cyclist waits at a red light.”
This prompt primarily evaluates human pose accuracy, spatial separation between interior and exterior, and narrative coherence across a window boundary. Most models correctly place the subject indoors and the street activity outside, but differences emerge in the naturalness of the posture and in how convincingly the exterior scene reads as spatially below and separate rather than composited.
Figure 7: Results from the 6 text-to-image generators on the cafe task, set on a rainy day with multiple interactions and reflections.
Prompt: “Inside a small cafe during heavy rain outside, a barista pours milk into a cup while chatting with a customer; raindrops streak down the window, a dog sleeps under a table, a cracked mirror behind the counter reflects shelves of cups and hanging plants, and pedestrians with umbrellas pass outside.”
This is a high-complexity prompt testing multi-element handling, causal weather cues, and reflective surface logic. Differences appear in whether secondary elements, such as the sleeping dog, the pedestrians outside, and the crack in the mirror, are integrated coherently. Higher-scoring models maintain clear role separation, a convincing mirror reflection, and consistent rain and lighting behavior.
Figure 8: Results from the 6 text-to-image generators on the living room renovation task, involving parallel actions.
Prompt: “A family living room mid-renovation: a child builds a Lego tower on the floor, the mother measures a wall with a tape measure, the father assembles furniture in the background, sunlight enters through half-installed blinds, and cardboard boxes labeled with room names are scattered around.”
This prompt primarily evaluates multi-agent role separation and object–tool interaction within a shared space. Higher-performing models clearly assign distinct tasks to each person and maintain renovation cues that align logically across the room. Lower-performing models often struggled with fine details such as the child’s hands and feet or the writing on the box labels.
Figure 9: Results from the 6 text-to-image generators on the street market task at dusk, showing vendors closing their stalls.
Prompt: “An outdoor street market at dusk with vendors closing stalls, warm street lights turning on, a child tugging their parent’s sleeve, steam rising from food carts, stray cats weaving between crates, and a musician packing up instruments in the background.”
This prompt tests large-scale scene orchestration, lighting transition, and storytelling density. Strong models balance many small events without visual overload, maintaining consistent dusk lighting and clear spatial depth. Weaker results tend to have low realism or omit secondary actions.
Figure 10: Results from the 6 text-to-image generators on the bathroom task, featuring two people, steam on the mirror, and visible clutter.
Prompt: “A small bathroom in the morning: one person brushing their teeth, another person adjusting makeup in the mirror, steam fogging the glass, towels hanging unevenly, sunlight bouncing off white tiles, and a phone lying on the sink counter.”
This prompt evaluates tight-space spatial logic, mirror behavior, and environmental effects such as steam. Higher-performing models partially preserve both individuals’ activities while keeping the mirror and steam physically plausible. However, no model is fully successful across all criteria.
Figure 11: Results from the 6 text-to-image generators on the glass refraction task.
Prompt: “A clear glass of water placed on a wooden table, with a pencil standing behind it; the pencil appears bent and magnified through the water, the background wall tiles distort through the glass, and light refracts realistically.”
This prompt primarily evaluates physical and optical accuracy, specifically refraction at the air–water boundary and distortion through cylindrical glass. Higher-performing models correctly bend the pencil at the waterline and apply consistent background distortion. Other models either understate refraction or introduce implausible curvature. None of the models fully complied with the prompt, as all placed the pencil inside the glass rather than behind it.
Figure 12: Results from the 6 text-to-image generators on the mirror task, showing a sideways person with objects visible only in reflection.
Prompt: “A person standing sideways in front of a mirror; their reflection is visible on the mirror, and objects behind them (a chair and lamp) appear only in the mirror.”
This prompt is a strict test of geometric correctness and mirror logic. All models correctly confine the chair and lamp to the reflection and maintain consistent orientation between the subject and the mirrored image.
Figure 13: Results from the 6 text-to-image generators on the shadow task at sunset, with long, aligned shadows.
Prompt: “An outdoor scene at sunset where people, trees, and a bicycle cast long shadows in the same direction, shadows stretching realistically across uneven pavement, with the sun low on the horizon.”
This prompt tests the consistency of global lighting and single-light-source logic across multiple objects and surfaces. All outputs align the shadows in a single direction, with lengths consistent with a low sun, even across uneven ground.
Figure 14: Results from the 6 text-to-image generators depicting a clownfish in a glass bowl with background distortion.
Prompt: “A red clown fish is inside a round glass bowl filled with water on a table, with books behind it visible through the glass surface.”
This prompt evaluates curved-glass optics, water behavior, and object integrity of an organic subject. Higher-quality results show realistic magnification and warping of background objects through the bowl while maintaining correct fish anatomy and scale. Lower-scoring images either fail to represent the glass optics correctly or do not follow the prompt.
Figure 15: Results from the 6 text-to-image generators on the cyclist task, featuring motion blur against a sharp background.
Prompt: “A moving cyclist passing in front of stationary parked cars, where the cyclist shows motion blur while background objects remain sharp, streetlights reflecting on wet pavement.”
This prompt primarily evaluates selective motion blur and temporal consistency. High-performing models blur the cyclist along the direction of travel while keeping parked cars and street elements sharp, with reflections on wet pavement remaining coherent. Lower-performing outputs often blur unrelated elements, thereby weakening the illusion of motion.
Text-to-image generation tools
Nano Banana Pro
Nano Banana Pro demonstrates the strongest overall performance, consistently handling scenes with multiple interacting elements, clear spatial organization, and coherent foreground–background relationships. It reliably maintains object integrity and scene coherence in complex environments involving several actors, environmental effects, and secondary details.
Performance decreases primarily in prompts that rely on precise physical or optical phenomena at small scales, such as refraction, magnification through curved glass, or subtle distortions caused by transparent materials. In these cases, the model tends to approximate physical behavior rather than accurately reproduce it. Despite these limitations, it rarely omits required elements, which contributes to its high overall score.
GPT Image 1.5
GPT Image 1.5 performs exceptionally well on prompts that require strict adherence to explicit instructions, including correct symbolic content, readable text, and clearly defined relationships between objects. It shows strong consistency in spatial logic, object completeness, and overall scene structure.
Its primary weakness appears in scenarios dominated by complex optical interactions, especially involving transparent or refractive materials. In such cases, physical accuracy can break down, leading to significant penalties in realism and physical correctness.
Seedream v4
Seedream v4 excels at generating visually convincing and aesthetically coherent scenes, particularly those involving people, outdoor environments, motion, and atmospheric lighting. It generally maintains global realism and consistent lighting across the image, which supports strong scores in realism-oriented evaluations.
However, the model is less reliable when prompts require high precision rather than visual plausibility. Text-heavy content, exact symbolic representations, and fine optical details are often rendered approximately or incorrectly. As a result, images may appear realistic at first glance but fail under closer inspection against strict compliance or physical accuracy criteria.
Flux 2 Pro
Flux 2 Pro exhibits high variability in performance across the benchmark. In prompts aligned with naturalistic scenes and loosely constrained visual descriptions, it produces highly realistic images with strong object integrity and believable lighting.
In contrast, prompts that impose strict constraints, such as exact text content, deliberate logical contradictions, or tightly specified multi-element interactions, often result in missing or misrepresented elements. This results in significant drops in prompt compliance and overall consistency.
Reve
Reve generally succeeds at constructing coherent scenes and maintaining a consistent visual style, particularly in prompts focused on overall composition rather than fine detail. It handles medium-complexity environments with reasonable spatial logic and recognizable objects.
Its performance declines substantially on prompts that require fine-grained control over detail, including accurate rendering of hands, readable handwriting, mathematical symbols, or small patterned elements. These limitations reduce scores in prompt compliance and object integrity, especially in tasks designed to test precision rather than general scene plausibility.
Dreamina v3.1
Dreamina v3.1 shows the lowest overall consistency across the benchmark. While it occasionally performs well in prompts centered on simple physical relationships such as lighting direction or mirror alignment, it frequently fails to include all required elements in more complex scenes.
Prompts involving multiple actors, dense environmental detail, or exact constraints often result in incomplete or non-compliant outputs. This pattern indicates limited reliability in handling compound requirements, significantly affecting its overall evaluation.
Methodology
We accessed the following models through their endpoints on fal.ai, except for GPT Image 1.5, for which we generated images through its own chat interface:
- Nano Banana Pro
- GPT Image 1.5
- Seedream v4
- Flux 2 Pro
- Reve
- Dreamina v3.1
The tools were evaluated in December 2025.
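For reference, the snippet below is a minimal sketch of how a single benchmark image can be requested through the fal.ai Python client. The endpoint ID, argument names, and response fields are illustrative assumptions; the exact values differ per model and should be taken from each endpoint’s documentation on fal.ai.

```python
# Minimal sketch, assuming the fal.ai Python client ("pip install fal-client").
# The endpoint ID and response shape below are illustrative, not the exact
# values used in this benchmark.
import fal_client

PROMPT = (
    "An analog wall clock hangs on a light-colored wall, clearly visible with "
    "black hour and minute hands and numbered markings. On a wooden table below, "
    "a digital clock displays the time in bright LED numbers. The analog clock "
    "shows 12:35, and the digital clock shows 23:48."
)

result = fal_client.subscribe(
    "fal-ai/flux-pro",               # hypothetical endpoint ID; one per model
    arguments={"prompt": PROMPT},
)
print(result["images"][0]["url"])    # assumed response shape for image endpoints
```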
Our benchmark consisted of 15 text-to-image prompts designed to evaluate the real-world product reliability and deployment readiness of text-to-image models. The prompts span a diverse set of failure-prone scenarios, including temporal and factual inconsistencies, physical and optical realism, text and symbol rendering, human activity and intent understanding, and multi-object scene coherence.
Each prompt was created to reflect conditions commonly encountered in production environments, such as conflicting visual signals, reflections and refractions, motion and lighting effects, and concurrent human actions, where model errors and hallucinations can materially impact downstream applications. Model outputs were assessed on their ability to render the specified visual details correctly, maintain internal consistency, and avoid unsupported additions, enabling a systematic comparison of reliability across models.
Evaluation criteria
Prompt compliance: Does the image follow all major elements, relationships, and actions described in the prompt? (0-10)
0: Ignores most prompt elements; the scene does not match the description
2: Includes a few elements but misses or misinterprets key actions or relationships
6: Most core elements present, but some are missing, misplaced, or incorrect
8: Nearly all elements are correctly depicted with minor omissions or inaccuracies
10: Fully complies with the prompt; all elements, actions, and relationships are clearly and correctly represented
Realism: How believable and lifelike is the scene overall? (0-5)
0: Highly artificial, uncanny, or cartoonish; breaks immersion
2: Noticeably unrealistic textures, lighting, or proportions
3: Some realistic aspects, but clear visual or physical inconsistencies
4: Mostly realistic with minor artifacts or stylization
5: Highly photorealistic; visually convincing and natural
Physical & optical accuracy: Does the image respect real-world physics, optics, and spatial logic? (e.g., shadows, reflections, refraction, scale) (0-5)
0: Severe physical impossibilities or contradictory lighting/perspective
2: Multiple incorrect shadows, reflections, or scale relationships
3: Generally plausible but with noticeable physical errors
4: Physically consistent with small inaccuracies
5: Physically and optically accurate, including complex interactions (glass, mirrors, motion)
Scene coherence & spatial logic: Do all elements exist logically in the same space and interact consistently? (0-5)
0: Disjointed or fragmented scene; elements feel unrelated
2: Weak spatial logic; unclear foreground/background relationships
3: Mostly coherent, but some depth or placement issues
4: Strong spatial consistency with minor perspective errors
5: Fully coherent scene with clear depth, scale, and believable interactions
Multi-element handling: How well does the model handle multiple people, objects, and actions in one scene? (0-5)
0: Many elements missing, merged, or nonsensical
2: Several elements present but confused or duplicated incorrectly
3: Most elements appear, but interactions are weak or unclear
4: Multiple elements handled well with minor errors
5: Complex, crowded scene handled cleanly with clear roles and interactions
Object integrity: Are individual objects clearly formed, complete, and recognizable? (0-5)
0: Objects are broken, fused, or unrecognizable
2: Objects lack structure or a clear identity
3: Objects are mostly correct with some deformation
4: Objects are accurate with minor visual flaws
5: Objects are crisp, complete, and clearly defined
Consistency of style & lighting: Is lighting, color, and style consistent across the entire image? (0-5)
0: Inconsistent lighting or conflicting visual styles
2: Multiple lighting sources or styles clash unnaturally
3: Mostly consistent with noticeable mismatches
4: Consistent lighting and style with minor anomalies
5: Fully consistent lighting, shadows, color temperature, and style
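The rubric above yields a maximum of 40 points per image (10 for prompt compliance plus 5 for each of the six remaining criteria). As a minimal sketch, the snippet below aggregates per-criterion ratings into a 0–100 score using an unweighted sum; the equal weighting is an assumption for illustration, not the published scoring formula of this benchmark.

```python
# Minimal sketch: combine the per-criterion ratings from the rubric into one
# 0-100 score. Equal weighting (a plain sum) is an assumption for illustration.

MAX_SCORES = {
    "prompt_compliance": 10,
    "realism": 5,
    "physical_optical_accuracy": 5,
    "scene_coherence_spatial_logic": 5,
    "multi_element_handling": 5,
    "object_integrity": 5,
    "style_lighting_consistency": 5,
}

def aggregate(scores: dict[str, int]) -> float:
    """Return a 0-100 score from per-criterion ratings, assuming equal weights."""
    total = sum(scores[name] for name in MAX_SCORES)
    return 100 * total / sum(MAX_SCORES.values())

# Hypothetical ratings for one generated image.
example = {
    "prompt_compliance": 8,
    "realism": 4,
    "physical_optical_accuracy": 3,
    "scene_coherence_spatial_logic": 4,
    "multi_element_handling": 4,
    "object_integrity": 5,
    "style_lighting_consistency": 5,
}
print(aggregate(example))  # 82.5
```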
Key features of the text-to-image generators
Quality & resolution
A text-to-image generator is often evaluated first by image quality. High-quality images show precise edges, accurate lighting, and consistent textures. This matters when generated images are used beyond casual experimentation, such as in commercial projects, concept art, or social posts.
Key aspects that influence output quality include:
- The underlying machine learning models and how well they handle fine detail.
- Support for higher resolution outputs, which helps when images are downloaded for print or large displays.
- Consistency across multiple images created from similar prompts, which helps teams maintain a unified visual identity.
Multiple aspect ratios
Support for different aspect ratio options improves flexibility when generating visuals for different formats. Instead of cropping images later, users can generate images that already match their intended layout.
Common aspect ratios include:
- Square for general-purpose visuals and thumbnails.
- Portrait for posters, mobile screens, or editorial layouts.
- Landscape and widescreen for presentations, web pages, and video covers.
For an AI image generator used in workflows such as marketing or design, this saves time and preserves composition quality from the start.
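As an illustration, the sketch below requests a layout-appropriate size at generation time instead of cropping afterwards. The image_size parameter and its preset names are modeled on common fal.ai endpoints and are assumptions here; other APIs expose explicit width/height or aspect-ratio fields instead.

```python
# Sketch: pick an aspect-ratio preset per layout at generation time.
# Endpoint ID, "image_size" parameter, preset names, and response shape are
# assumptions modeled on typical fal.ai image endpoints.
import fal_client

FORMATS = {
    "thumbnail": "square_hd",          # square, e.g. social thumbnails
    "poster": "portrait_4_3",          # portrait, e.g. posters or mobile
    "presentation": "landscape_16_9",  # widescreen, e.g. slides or web heroes
}

def generate_for_layout(prompt: str, layout: str) -> str:
    result = fal_client.subscribe(
        "fal-ai/flux-pro",             # hypothetical endpoint ID
        arguments={"prompt": prompt, "image_size": FORMATS[layout]},
    )
    return result["images"][0]["url"]  # assumed response shape
```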
Prompt understanding
Effective text-to-image systems accurately interpret a text description, even when prompts include multiple objects, relationships, or constraints. Strong prompt understanding ensures that images generated align closely with the user’s idea, rather than requiring repeated trial-and-error.
Good prompt comprehension typically includes:
- Understanding spatial relationships, such as foreground and background.
- Correct handling of adjectives, quantities, and actions.
- Logical interpretation of longer or more detailed text prompts.
AI image generators can also interpret image style and emotional tone directly from the prompt. Users can request specific artistic styles, lighting conditions, or moods without needing technical parameters.
Common use cases include:
- Selecting a specific art style, such as watercolor, anime, or photorealistic.
- Matching the tone of existing visuals or a reference photo.
- Exploring diverse styles during creative exploration.
Customization & control
Selecting from prompt templates reduces friction for users who are new to image generation or working under time constraints. Instead of writing a prompt from scratch, templates guide users toward clearer structure and better results.
Templates are often designed for:
- Marketing visuals and social posts.
- Character design and concept art.
- Product mockups and editorial images.
For a text-to-image generator, templates help generate AI images that are more predictable and usable, especially in professional contexts.
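A prompt template can be as simple as a string with named slots that a user fills in. The minimal sketch below uses plain Python string formatting for a hypothetical product-mockup template; the template text and field names are illustrative only.

```python
# Minimal sketch of a prompt template with named slots. Template text and
# field names are illustrative, not taken from any specific tool.
PRODUCT_MOCKUP_TEMPLATE = (
    "A studio photograph of {product} on a {surface}, {lighting} lighting, "
    "{background} background, shallow depth of field, photorealistic."
)

prompt = PRODUCT_MOCKUP_TEMPLATE.format(
    product="a matte black ceramic mug",
    surface="light oak table",
    lighting="soft natural",
    background="neutral gray",
)
print(prompt)
```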
Some image tools allow users to edit or refine AI-generated images after they are created. This can include adjusting details, regenerating specific parts, or continuing generation based on existing images.
Workflow integration
API and tool integration
Workflow integration allows AI image generation to fit into larger systems rather than operating as a standalone tool. APIs enable you to generate images programmatically or integrate the generator with other tools.
Common integration scenarios include:
- Embedding image generation into design or content platforms.
- Automating image creation for websites or applications.
- Supporting bulk image generation at scale.
For teams that regularly work with AI-generated content, integration options can matter as much as output quality.
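As a simple example, the sketch below loops over a list of prompts and collects the resulting image URLs, assuming the same fal.ai client conventions as in the methodology sketch above; a production pipeline would add error handling, retries, and concurrency.

```python
# Sketch of bulk generation in a workflow: iterate over prompts and collect
# image URLs. Endpoint ID and response fields are assumptions; real pipelines
# should also handle failures, rate limits, and parallelism.
import fal_client

prompts = [
    "Square product shot of a leather wallet on a marble surface, soft light.",
    "Widescreen hero image of a mountain lake at sunrise, photorealistic.",
]

urls = []
for prompt in prompts:
    result = fal_client.subscribe(
        "fal-ai/flux-pro",                   # hypothetical endpoint ID
        arguments={"prompt": prompt},
    )
    urls.append(result["images"][0]["url"])  # assumed response shape

print(urls)
```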
Challenges of text-to-image generation
Misinterpretation of complex prompts
A common limitation of a text-to-image generator is difficulty handling complex or nuanced text descriptions. When prompts include multiple objects, attributes, or abstract ideas, the AI image generator may prioritize some elements while ignoring others.
This issue often appears when:
- A single prompt includes several objects with specific roles or relationships.
- Descriptions rely on subtle language rather than explicit instructions.
- The prompt combines visual details with abstract concepts.
Even advanced AI models can misread intent, resulting in generated images that only partially match the original idea. Users often compensate by simplifying prompts or breaking a single idea into multiple image-generation steps.
Counting and numerical accuracy
Most AI image generators struggle with numerical precision. When a text prompt specifies an exact number of objects, such as “three cups” or “seven birds,” the images created often show the wrong count.
Key reasons this happens include:
- Image generation models are trained on patterns, not explicit counting rules.
- Numbers are treated as descriptive tokens rather than constraints.
- Prompt tweaking alone rarely fixes consistent counting errors.
This limitation is especially noticeable in use cases that require precision, such as diagrams, educational visuals, or structured layouts. It remains one of the most prominent problems to solve in AI image generation.1
Object relationships and spatial reasoning
Another challenge lies in how AI-generated images handle spatial relationships. Models may correctly generate individual objects but fail to position them accurately relative to one another.
Common issues include:
- Objects appear to float or overlap unnaturally.
- Incorrect foreground and background placement.
- Hands or tools do not interact realistically with other objects.
For scenes that depend on clear spatial logic, such as product setups or instructional visuals, this can reduce usability. While reference images or existing visuals can help guide composition, the results remain inconsistent.
Text rendering within images
Generating readable text within images remains a weak point for many image generators. Letters may appear distorted, misspelled, or replaced with symbols that resemble text but carry no meaning.
This affects scenarios such as:
- Signs, labels, or posters.
- Clothing designs like T-shirts or caps.
- Interface mockups that include UI text.
Although newer AI models show improvement, users often rely on manual editing or external design tools to add text after image generation rather than trusting AI-generated text directly.
Semantic and contextual errors
Even when image quality is high, AI-generated photos can contain subtle semantic mistakes. These errors occur when the model produces visuals that look plausible at first glance but break real-world logic.
Examples include:
- Inconsistent lighting or shadows.
- Objects interacting in physically impossible ways.
- Items placed where they would not realistically belong.
These issues stem from a limited understanding of physics and context. The AI focuses on visual similarity rather than true comprehension, which can be problematic for commercial projects that require realism.
Bias and representation issues
Bias remains a broader concern across artificial intelligence, including text-to-image systems. AI-generated content can reflect imbalances present in training data, leading to stereotypical or limited representations.
This may show up as:
- Overrepresentation of certain demographics in professional roles.
- Cultural stereotypes in clothing or environments.
- Limited diversity when prompts are vague.
While many platforms are actively working to address these issues, users creating AI-generated images for public or commercial use should carefully review outputs and avoid relying on default assumptions.
All tools perform better when generating a single object or only a few objects in one scene; in more complex scenarios with multiple objects, they tend to perform worse. Integrating a human figure into the scene also causes problems.