> The pipeline (bottom) shows how diverse OpenImages inputs are edited
using Nano-Banana and quality-filtered by Gemini-2.5-Pro, with failed attempts automatically retried.
Pretty interesting. I run a fairly comprehensive image-comparison site for SOTA generative AI in text-to-image and editing. Managing it manually got pretty tiring, so a while back I put together a small program that does something similar: it takes a starting prompt, a list of GenAI models, and a maximum number of retries.
It generates and evaluates images using a separate multimodal AI, then automatically rewrites failed prompts and tries again, repeating up to the set limit.
It's not perfect (the nine-pointed star example in particular), but oftentimes the recognition abilities of a multimodal model are superior to its generative capabilities, so you can run it in a sort of REPL until you get the desired outcome.
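The generate → evaluate → rewrite loop described above can be sketched roughly like this. A minimal sketch, assuming the loop shape only; `generate_image`, `evaluate_image`, and `rewrite_prompt` are hypothetical stand-ins for the actual model API calls:

```python
def refine(prompt, generate_image, evaluate_image, rewrite_prompt, max_retries=3):
    """Generate-evaluate-retry loop: regenerate with a rewritten prompt
    until the evaluator accepts the image or the retry budget runs out."""
    for attempt in range(max_retries + 1):
        image = generate_image(prompt)
        ok, feedback = evaluate_image(image, prompt)  # separate multimodal judge
        if ok:
            return image, prompt, attempt
        prompt = rewrite_prompt(prompt, feedback)  # fold the critique back into the prompt
    return None, prompt, max_retries  # gave up; return the last prompt for inspection
```

Returning the final prompt even on failure makes the REPL-style usage easy: you can hand-tweak where the automatic rewrites left off.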
That's a great website! Feature request: a button to toggle all the sliders left or right at the same time - it would make it easier to glance at the results without lots of finicky mouse moves.
Seconding this. Once you’ve seen the original image once, you don’t need to see it each time. The idea of syncing the sliders in the current group is a clever solution.
Thanks! It's probably the same site. It used to only be a showdown of text-to-image models (Flux, Imagen, Midjourney, etc.), but once there was a decent number of image-to-image models (Kontext, Seedream, Nano-Banana) I added a nav bar at the top so I could do similar comparisons for image editing.
Honestly it's kind of inconsistent. Model releases sometimes seem to come in flurries (it felt like Seedream and Nano-Banana landed within a few weeks of each other, for example), and then the site receives a pretty big update.
Recently I've found myself getting the evaluation simultaneously from OpenAI GPT-5, Gemini 2.5 Pro, and Qwen3 VL to give it a kind of "voting system". Purely anecdotal, but I do find that Gemini is the most consistent of the three.
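That kind of voting across judges can be sketched as a simple majority over pass/fail verdicts. A minimal sketch under that assumption (hypothetical helper; the real verdicts would come from each provider's API):

```python
from collections import Counter

def majority_verdict(verdicts):
    """Combine pass/fail verdicts from several multimodal judges into a
    single majority decision, returning the decision and the vote tally."""
    tally = Counter(verdicts)
    decision = tally.most_common(1)[0][0]
    return decision, dict(tally)

# e.g. GPT-5, Gemini 2.5 Pro, and Qwen3 VL each judged one edit:
decision, tally = majority_verdict(["pass", "fail", "pass"])
```

With an odd number of judges and binary verdicts, ties can't occur; with graded scores you'd want a median or mean instead.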
I found the opposite. GPT-5 is better at judging along a true gradient of scores, while Gemini loves to pick 100%, 20%, 10%, 5%, or 0%. You never get an 87% score.
I'm running a similar experiment, but so far changing the seed on the OpenAI side seems to give similar results. If that's confirmed, it's concerning to me how sensitive the evaluation could be.
https://genai-showdown.specr.net/image-editing