Top image-editing models maintain product details only 28% of the time

There has been a lot of progress on image editing models recently. Big labs like Google and OpenAI have released closed-source frontier models that are a real step up in quality. Realism is much better, and issues like extra fingers and other visual glitches are mostly gone. These models also brought a paradigm change: beyond generating images from text prompts, they natively support image editing. You input an image and a prompt, and the model edits it. That has unlocked a lot of new use cases, from small touch-ups to complete reimaginations.

One of those use cases is changing how e‑commerce produces catalog imagery. You can now convert a flat-lay shot of a product into a lifestyle shot of a virtual model wearing it. This use case is called virtual model, also known as virtual try-on (VTO).

For e‑commerce, one thing is critical: product fidelity. The product in the generated image has to look exactly the same as the product in the input. Every button, every zipper, every logo, every stitch, every color. Professional brands cannot advertise products with images that do not accurately depict the real thing.

Left: input image. Right: generated image. Center: zoom on the collar, where the rainbow stripe is reduced to a plain red band.

At Photoroom we build applications on top of the latest frontier models, with a strong focus on e‑commerce, so we know how much this matters to brands and where the models tend to break. To go from intuition to numbers, we benchmarked the top contenders on product fidelity. In this post we present our findings.

The benchmark

Models

Four models, all run at 2K resolution with the same prompt:

  • Nano Banana 2: Google's latest editing model.

  • Nano Banana Pro: Google's most capable editing model.

  • GPT Image 2 Medium: OpenAI's latest image-editing model.

  • FLUX.2 Klein 9B: one of the strongest open-weights image-editing models, from Black Forest Labs.

Setup

  • 850 products spanning clothing, footwear, bags, jewelry and accessories, each run through all four models on the virtual model task, for a total of 3,400 generations (850 × 4).

  • 10 trained annotators, each generation reviewed by at least 3 of them.

  • A custom annotation UI with side-by-side input and output, independent pan and zoom (so annotators can match fine-grained details), and a 30-second minimum dwell time before they can mark "no issues".

  • When annotators flag an issue, they describe it in free-form text (e.g. "logo distorted", "missing button on left cuff", "color shifted from navy to grey"). We use these descriptions later to group failures into categories.

  • A generation counts as a Pass only if no annotator flagged an issue.
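The pass rule above can be sketched in a few lines. This is a minimal illustration, not our production pipeline: the record layout `(generation_id, annotator_id, flagged)` and the function names are assumptions made for the example.

```python
from collections import defaultdict

def strict_pass(annotations):
    """Map each generation to True (Pass) only if no annotator flagged it.

    `annotations` is an iterable of (generation_id, annotator_id, flagged)
    tuples. This layout is hypothetical, chosen for the example.
    """
    flags_by_generation = defaultdict(list)
    for generation_id, _annotator_id, flagged in annotations:
        flags_by_generation[generation_id].append(flagged)
    # One flag from any annotator is enough to fail the generation.
    return {gen: not any(flags) for gen, flags in flags_by_generation.items()}

def pass_rate(verdicts):
    """Fraction of generations that passed."""
    return sum(verdicts.values()) / len(verdicts)

# Toy data: g1 is clean for all three raters, g2 has a single flag.
example = [
    ("g1", "a1", False), ("g1", "a2", False), ("g1", "a3", False),
    ("g2", "a1", False), ("g2", "a2", True), ("g2", "a3", False),
]
verdicts = strict_pass(example)
print(verdicts, pass_rate(verdicts))
```

In the toy data, g2 fails even though two of its three raters saw nothing wrong, which is exactly the behavior the rule is meant to have.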

Why one flag equals fail

The obvious alternative is majority pass: a generation is fine if at least 2 of 3 raters say so. We considered it and we tested it. It is the wrong call for product fidelity.

We started by sampling cases where one rater flagged an issue the other two had missed, and reviewed them manually. The lone flagger was usually right: the other two had not zoomed in enough, or the issue was subtle. Product fidelity annotation is recall-limited: humans miss issues, but they do not invent them.
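To make the recall-limited argument concrete, here is a rough sketch under simplifying assumptions that go beyond what we measured: each rater independently spots a real issue with the same probability `recall`, and nobody reports an issue that is not there. The value 0.5 is purely illustrative.

```python
from math import comb

def p_caught_one_flag(recall, n_raters=3):
    """P(at least one rater flags a real issue): the one-flag-fails rule."""
    return 1 - (1 - recall) ** n_raters

def p_caught_majority(recall, n_raters=3):
    """P(the generation fails under majority pass).

    Majority pass lets a generation through when a strict majority of
    raters saw no issue, so it fails only when at least
    n_raters - n_raters // 2 raters flag it (2 of 3 in our setup).
    """
    min_flags = n_raters - n_raters // 2
    return sum(
        comb(n_raters, k) * recall**k * (1 - recall) ** (n_raters - k)
        for k in range(min_flags, n_raters + 1)
    )

recall = 0.5  # hypothetical per-rater chance of spotting a subtle issue
print(p_caught_one_flag(recall), p_caught_majority(recall))
```

Under these assumptions the one-flag rule always catches at least as many real issues as majority pass, and the gap is widest for subtle issues that individual raters often miss.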

It also matches what a buyer actually cares about: if one careful pair of eyes spots a problem, the product on the shelf has a problem.

Results

A few things we can claim with confidence from these numbers:

  • None of the models is close to solving the product fidelity task out of the box. The best one passes only about 1 in 4 product-fidelity checks (28%). This is the main result.

  • FLUX.2 Klein 9B ranks lowest with a 14% pass rate, well behind the other contenders. Around 28% of its outputs are invalid virtual models (the garment ends up floating, the model grows extra limbs, the product is missing). For the other three models that rate is 4 to 8%.

  • The top three (Nano Banana 2, Nano Banana Pro, GPT Image 2) are essentially tied in a 25-28% pass-rate band. The gap between them is small enough that any of them can come out on top depending on which products you happen to test on. Picking between them should come down to cost, speed, and how well they fit a given workflow.
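One rough way to see why a 25-28% band counts as a tie at this sample size is to put a confidence interval around each pass rate. The sketch below uses the standard Wilson score interval with 850 products per model. The pass counts are hypothetical, back-computed from the rounded percentages in this post rather than taken from the raw data.

```python
from math import sqrt

def wilson_interval(passes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

N_PRODUCTS = 850  # from the setup: one generation per product per model

# Hypothetical pass counts, back-computed from rounded rates in the post.
hypothetical_passes = {
    "top of the 25-28% band": 238,     # about 28% of 850
    "bottom of the 25-28% band": 212,  # about 25% of 850
    "FLUX.2 Klein 9B": 119,            # about 14% of 850
}

intervals = {
    name: wilson_interval(count, N_PRODUCTS)
    for name, count in hypothetical_passes.items()
}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
```

With these assumed counts, the intervals for the two ends of the band overlap, while the FLUX.2 Klein 9B interval sits clearly below both. Because every model saw the same 850 products, a paired comparison would be a more precise check than independent intervals.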

Where the failures happen

We bucketed the failure explanations across the four models. The dominant failure modes are:

  • Logo and text distortion (19% of all generations). Brand names become gibberish, logos warp, small labels lose readability. This is the single biggest problem, by a clear margin.

  • Missing or changed elements (13%). A button removed, a zipper added, a pocket relocated.

  • Pattern or design changes (12%). Stripes rearranged, embroidery simplified, prints redrawn.

  • Virtual try-on (VTO) failure (8%). Anatomy artifacts and overlays. Mostly driven by FLUX.2 Klein 9B but present in all models.

  • Color shifts (8%). Usually subtle but visible.
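As an illustration of how free-form annotator descriptions can be grouped into categories like the ones above, here is a deliberately crude keyword-based sketch. The keyword lists and category names are hypothetical and are not the grouping procedure used for the benchmark.

```python
import re
from collections import Counter

# Hypothetical keyword prefixes per category, for illustration only.
CATEGORY_KEYWORDS = {
    "logo_text": ("logo", "text", "label", "brand"),
    "missing_or_changed": ("missing", "added", "removed", "extra", "relocated"),
    "pattern_design": ("pattern", "stripe", "embroider", "print"),
    "color_shift": ("color", "colour", "shade"),
}

def categorize(description):
    """Return every category whose keyword prefixes match a word."""
    words = re.findall(r"[a-z]+", description.lower())
    matches = [
        category
        for category, prefixes in CATEGORY_KEYWORDS.items()
        if any(word.startswith(prefix) for word in words for prefix in prefixes)
    ]
    return matches or ["other"]

descriptions = [
    "logo distorted",
    "missing button on left cuff",
    "color shifted from navy to grey",
]
counts = Counter(cat for d in descriptions for cat in categorize(d))
print(counts)
```

A description can land in several categories, so shares computed this way need not sum to the overall failure rate. Prefix matching is also error-prone (for example, "texture" would match "text"), so a real pipeline would need manual review or a stronger classifier.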

Here are some more examples of failure cases:

Top left: color shift on the backpack. Top right: flower detail missing. Bottom left: text on the t-shirt back has changed. Bottom right: extra button added on the blazer.

What to do about it

As our benchmark shows, even the current state-of-the-art models fail a majority of product-fidelity checks. There are two parts to the problem: catching and flagging all product fidelity issues in generated images, and fixing them.

At Photoroom we have been working on both. We have developed intelligent detectors that automatically flag fidelity issues before they reach your catalog, and a tool called Brush Fixer that lets you paint over a problem area (a distorted logo, a missing button, etc.) and regenerate only the indicated region using the original product image as reference. It is built specifically for the kind of fine-detail failures our benchmark measures. If you have hit any of these issues, you can try Brush Fixer in the Photoroom app.

Brush Fixer in action: brush over the distorted text, regenerate only that region against the reference product.

What's next

Benchmarking product fidelity is a valuable first step toward progress in this area, but fidelity is not the only thing brands care about. Realism, photographic integrity, brand-guideline adherence, and consistency across a catalog all matter, and they involve real trade-offs. We will keep publishing what we measure on these dimensions too. Brush Fixer is the first in a series of tools and editing models we are building to help e‑commerce brands generate product images that meet their high quality bar. Stay tuned for what's coming next!

Jon Almazán, Research scientist
Virtual try-on in action. Inputs (left): a model reference and the product flat-lays. Output (right): a generated image of the model wearing the products in a lifestyle scene.
