Which AI Image Generator Has the Best Character Consistency? OpenAI vs Gemini vs Black Forest Labs vs Runway (May 2026)

May 18, 2026 · 36 min read

Software engineer and technical writer

FLUX.2 and Gemini 3.1 Flash produce the strongest character consistency across our three tests. gpt-image-2 comes in third and Runway Gen-4 last.

Character consistency is one of the hardest problems in AI image generation. Getting a model to place the same person in a new scene without their features drifting is something users expect and models regularly fail at.

Side by side of the reference character photo versus the worst scene placement result

We ran three tests across four models to find out which handles this best:

Can it place a real person in a new scene without changing their features?
Can it add clothing items to an image while preserving every other detail?
Can it generate a stylized character consistently across six independent frames?

You can find all the test code in our GitHub repository.

note

This article was originally published using OpenAI's gpt-image-1. It has been updated to use gpt-image-2.

Results at a glance

FLUX.2 and Gemini 3.1 Flash produced the strongest character consistency across our three tests. Here is how all four models compare.

Placing the same person in a new scene

We gave each model a reference photo of a real person and asked it to place them in a new scene as a coffee shop barista, without changing their features.

Winner: FLUX.2

FLUX.2 and Gemini both passed this test. Here is a side-by-side of the best result:

Side by side of the reference character photo versus the best scene placement result

Loser: Runway

Runway failed this test. Here is the result:

Adding items to an image while preserving every other detail

We gave each model a reference photo of a person alongside three clothing items and asked it to place all three items on the person without changing anything else.

Winner: FLUX.2

FLUX.2 was the clear winner, placing all three items with near-perfect accuracy:

Side by side of the reference character and clothing items versus the best virtual try-on result

Loser: Runway

Runway struggled with item accuracy and character consistency:

Side by side of the reference character and clothing items versus the worst virtual try-on result

Generating a stylized character consistently across multiple images

We gave each model a pixel art sprite and asked it to generate six independent frames of a walk cycle, each with a different pose, with no chaining between frames.

Winner: Gemini 3.1 Flash

Gemini 3.1 Flash showed the most consistent character style across all six frames:

Side by side of the reference sprite versus the most consistent walk cycle result

Loser: Runway Gen-4

Runway generated characters that were inconsistent both with the reference and with each other:

Side by side of the reference sprite versus the least consistent walk cycle result

Comparing FLUX.2, Gemini 3.1 Flash, gpt-image-2, and Runway Gen-4

Each of the four models takes a different approach to multi-reference image generation. Here is how they compare at a high level.

Model	Provider	Approach	Max references
FLUX.2	Black Forest Labs	Multi-reference synthesis	8 via API
Gemini 3.1 Flash	Google	Multi-reference inference	Up to 14
gpt-image-2	OpenAI	Multi-reference inference (image edits endpoint)	Up to 16
Gen-4	Runway	Reference-based inference	3

Full feature comparison

Here is a full breakdown of how each model differs on API approach, output sizes, pricing, and SDK support.

	FLUX.2	Gemini 3.1 Flash	gpt-image-2	Runway Gen-4
Provider	Black Forest Labs	Google	OpenAI	Runway
API approach	Submit request, poll for result	Synchronous (response in single call)	Synchronous (response in single call)	Submit request, poll for result
Max reference images	8 via API, 10 in playground	Up to 14	Up to 16	3
Output sizes	Up to 4MP, any aspect ratio	512px, 1K, 2K, 4K	1024x1024, 1024x1536, 1536x1024	720p or 1080p
Price per image (standard)	From $0.03 ([pro]) to $0.07 ([max]) per MP	~$0.045 (512px) to ~$0.151 (4K) per image	$0.006 (low) to $0.211 (high) per 1024x1024	$0.05 (720p) or $0.08 (1080p) per image
SDK / auth	REST API, API key in header	Google GenAI SDK (Python/JS), API key	OpenAI SDK (Python/JS), API key	Runway SDK (Python/JS), API key
Async / polling	Yes, polling required	No	No	Yes, polling required

Which model can place the same person in a new scene without changing their features?

Maintaining a consistent character is both a huge problem with AI image generators as well as a huge expectation from their users, especially with human subjects.

A successful character-consistent image requires two things:

The unique qualities of the subject don't drift away in the new generation.
Their features don't move into the uncanny valley, where they seem slightly off in general.

For this test, we use this reference photo of a human subject with distinct features that we can use to easily track consistency across image generation:

An upside down rose tattoo on the subject's right cheek
A sunflower tattoo on the subject's right arm
Short green hair

Reference photo of a man with green hair and tattoos used as the character input for the scene placement test. Photo by Megan Ruth on Unsplash.

Thanks Megan Ruth for the photo.

We placed the subject in a completely different scene for these tests. Specifically, as a barista in a coffee shop.

FLUX.2

FLUX.2 uses multi-reference synthesis, where you pass a reference image directly as input_image alongside a text prompt.

The model uses this as a character reference and generates a new scene while preserving the subject's identity.

# Encode the reference photo as base64 — this becomes the character reference
character_b64 = encode_image("character.jpg")

# Pass it as input_image alongside a text prompt describing the new scene
response = requests.post(
    f"{BASE_URL}/flux-2-pro-preview",
    headers=HEADERS,
    json={
        "prompt": prompt,                                          # describes the new scene
        "input_image": f"data:image/jpeg;base64,{character_b64}", # the character to preserve
    },
).json()

Result

Here is what FLUX.2 generated:

Qualitative analysis

The FLUX.2 result is really quite impressive.

Character consistency

FLUX.2 did an excellent job maintaining the features of the subject in the newly generated image:

Sunflower tattoo on the right arm is clearly present
Upside down rose tattoo on the cheek is clearly present
Short green hair is maintained

Uncanny valley analysis

The subject has not drifted into the uncanny valley. They look extremely lifelike and realistic, maintaining all the same features of the original subject.

Overall image quality

The image is great quality:

The subject is clear and focused in the shot with the espresso machine
The background is warm and out of focus
No strange artifacts in the background or the subject

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	30.5s
Cost	$0.135

Gemini 3.1 Flash

Gemini 3.1 Flash uses the Google GenAI SDK, where you pass the reference image directly as a content part alongside the text prompt in a single call.

# Load the reference photo as a PIL Image — this becomes the character reference
character = Image.open("character.jpg")

# Pass it as a content part alongside the prompt
response = client.models.generate_content(
    model="gemini-3.1-flash-image-preview",
    contents=[prompt, character],  # prompt describes the scene, character is the reference
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"],
    ),
)

Result

Here is what Gemini 3.1 Flash generated:

Qualitative analysis

The Gemini 3.1 Flash result shows extremely strong character consistency with the original image.

Character consistency

The character is extremely consistent with the original image, with some small details that aren't extremely noticeable. For example, the rose tattoo on the cheek has become a more abstract teardrop tattoo.

Zoomed in annotated comparison of the rose tattoo on the cheek in the original reference photo versus the Gemini 3.1 Flash result

Otherwise, the remaining features are very accurate to the original:

Sunflower tattoo on the arm is clearly present
Second sunflower tattoo on the neck is clearly present
Unique facial features of the subject are maintained

Uncanny valley analysis

There is no uncanny valley detected here. The character's facial features are highly realistic.

Overall image quality

This is an extremely high quality image:

Good composition and a warm tone
No strange artifacts across either the subject or the background

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	31.11s
Input tokens	359
Output tokens	1,296
Cost	~$0.084

gpt-image-2

gpt-image-2 uses the OpenAI SDK's images.edit endpoint, where you pass the reference image as a file object alongside the text prompt.

# Open the reference photo and pass it directly to the images.edit endpoint
with open("character.jpg", "rb") as char_file:
    response = client.images.edit(
        model="gpt-image-2",
        image=char_file,   # the character to preserve
        prompt=prompt,     # describes the new scene
        size="1024x1024",
    )

Result

Here is what gpt-image-2 generated:

Qualitative analysis

The gpt-image-2 result is really quite impressive.

Character consistency

gpt-image-2 did an excellent job maintaining the features of the subject in the newly generated image:

Sunflower tattoo on the right arm is clearly present
Upside down rose tattoo on the cheek is clearly present
Short green hair is maintained

Uncanny valley analysis

There is a subtle uncanny valley effect here, but it comes from an unexpected source: too much consistency. gpt-image-2 has preserved the head position and facial expression from the reference photo so precisely that the character looks unnatural in the new scene. The head is tilted at the same angle as the original and the expression is blank, giving the result a stiff, posed quality.

Side by side of the reference photo versus the gpt-image-2 result showing how the head pose was carried over from the reference

Compare this with the FLUX.2 result, where the character is in a completely different pose and is smiling, yet still looks like the same person. FLUX.2 understood that consistency means preserving identity, not copying posture. gpt-image-2 treated the reference more literally, which produced a face that is arguably more accurate but a scene that feels less natural.

Overall image quality

The image is great quality:

The subject is well framed with a good composition
The background is blurred and the foreground is high definition

The one issue worth noting is the hands. Some fingers are morphing into each other, and the hands have an incorrect number of fingers. This is a subtle but common problem with AI image generators.

Close-up of the gpt-image-2 barista result showing finger artifacts on the hands

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	211.95s
Input tokens	1,566 (1,457 image + 109 text)
Output tokens	7,024
Cost	~$0.224

Runway Gen-4

Runway Gen-4 uses its Runway SDK, where you pass the reference image as a base64-encoded entry in a referenceImages array, with a tag assigned to it.

You then reference that tag directly in the prompt using @character, which tells the model which part of the prompt refers to the reference image.

# Encode the reference photo as base64
character_b64 = encode_image("character.jpg")

# Pass it in referenceImages with a tag, then reference it in the prompt via @character
response = requests.post(
    f"{BASE_URL}/text_to_image",
    headers=HEADERS,
    json={
        "model": "gen4_image",
        "promptText": "The @character is working as a barista...",  # @character refers to the tagged image
        "referenceImages": [
            {
                "uri": f"data:image/jpeg;base64,{character_b64}",
                "tag": "character",  # this tag is what @character in the prompt resolves to
            }
        ],
    },
)

Result

Here is what Runway Gen-4 generated:

Qualitative analysis

Runway Gen-4 did a pretty poor job with this test.

Character consistency

Again, this is not the same person as in the original image:

The facial features are all wrong
The tattoos are strange, abstract shapes that do not match the original at all

It seems to have understood only that it is trying to generate a man with tattoos and short green hair.

Uncanny valley analysis

There is some uncanny valley here. You can't quite put your finger on it, but his face is quite strange. It's also kind of unclear what he's doing with his hands on the coffee machine.

Overall image quality

The composition is great and the tone is warm. Everything looks pretty good except for the human features of the subject.

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	25.19s
Cost	$0.05

Which AI image generator best preserves a real person's face?

FLUX.2, Gemini 3.1 Flash, and gpt-image-2 all passed this test, producing images that are clearly recognisable as the same person from the reference photo. Runway Gen-4 failed, generating a different person who only broadly matches the description of the subject.

	FLUX.2	Gemini 3.1 Flash	gpt-image-2	Runway Gen-4
Character preserved?	Yes	Yes	Yes	No

On time and cost, gpt-image-2 was significantly slower and more expensive than the other models, taking 211.95 seconds and costing ~$0.224 compared to FLUX.2's 30.5 seconds at $0.135, Gemini's 31.11 seconds at ~$0.084, and Runway's 25.19 seconds at $0.05.

	FLUX.2	Gemini 3.1 Flash	gpt-image-2	Runway Gen-4
Time	30.5s	31.11s	211.95s	25.19s
Cost	$0.135	~$0.084	~$0.224	$0.05

Which model can add an item to an image while preserving every other detail?

Here we want to know whether a model can add items to an existing image while preserving every other detail.

We give each model a picture of a person and three items of clothing:

A multicoloured jacket
A large pair of sunglasses
A red duffel bag

All three items have very distinct, recognisable details that we can use to identify consistency across image generation.

Reference images showing the character, jacket, glasses, and bag used in the virtual try-on test

Thanks Babak Eshaghian for the photo.

FLUX.2

FLUX.2 accepts multiple reference images, so we use the same technique as test 2 but pass all four images in together, referencing them as input_image, input_image_2, and so on.

Result

Here is what FLUX.2 generated:

Qualitative analysis

FLUX.2 did a perfect job placing the items in the scene while maintaining all the details from the original image.

Item correctness

The small details of all the items placed in the scene are basically perfect:

The crochet jacket has exactly the right squares of colour in exactly the same places as the original image

Side by side comparison of the reference jacket versus the FLUX.2 result

The glasses have the small details like the holes at the top of the frame

Side by side comparison of the reference glasses versus the FLUX.2 result

The Supreme bag has the straps going over the exact same letters as in the original image, as well as the two straps the bag has

Side by side comparison of the reference bag versus the FLUX.2 result

Character consistency

The character consistency is perfect. The character is in the exact same pose and has all the same qualities as the original image.

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	14.47s
Cost	$0.09

Gemini 3.1 Flash

Gemini 3.1 Flash uses the same technique as test 2, passing all four images as content parts in a single call.

Result

Here is what Gemini 3.1 Flash generated:

Qualitative analysis

Gemini 3.1 Flash is almost perfect in its generated image, with some strange artifacts.

Item correctness

The items of clothing match the original items perfectly:

The crochet jacket has exactly the right squares of colour in exactly the same places as the original image
The glasses have the small details like the holes at the top of the frame
The Supreme bag has the straps going over the exact same letters as in the original image, as well as the two straps the bag has

The one strange artifact is that the character was given two bags for some reason.

Character consistency

The character consistency is almost perfect. The one detail that has changed is that the character's arm position has been changed to hold the bag that was added to the scene. This could be seen as a pro in the intelligence of the model, or it could be seen as a con if you are aiming for more of a paint-in kind of edit where your character stays consistent across generations.

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	16.22s
Input tokens	1,085
Output tokens	1,373
Cost	~$0.084

gpt-image-2

gpt-image-2 uses the same technique as the previous tests, passing all four images as a list to the images.edit endpoint.

Result

Here is what gpt-image-2 generated:

Qualitative analysis

Item correctness

The jacket is identical to the reference image, with the same colour blocks in the same positions.

Side by side comparison of the reference jacket versus the gpt-image-2 result

The glasses resemble the reference but with some stylistic differences. The lenses are more square than in the original, and the distinctive holes along the top of the frames are missing.

Side by side comparison of the reference glasses versus the gpt-image-2 result

The bag is identical to the reference image.

Side by side comparison of the reference bag versus the gpt-image-2 result

Character consistency

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	197.73s
Input tokens	3,403
Output tokens	7,024
Cost	~$0.224

Runway Gen-4

Runway Gen-4 uses the same @tag technique as test 2, but with one important limitation: it only supports up to 3 reference images.

Since we have 4 images (character, jacket, glasses, bag), we had to run this test in two passes. The first pass used the character, jacket, and glasses. The second pass added the bag.

Result

Here is what Runway Gen-4 generated in pass 1 (jacket and glasses):

And pass 2 (all three items including the bag):

Runway Gen-4 result showing the character wearing all three items including the bag (pass 2 of 2)

Qualitative analysis

At first, Runway's output might seem better than gpt-image-2's, but with some further analysis we can see it fell short in several key areas.

Item correctness

The crochet jacket was generated well, with good consistency across the colours of the jacket.
The sunglasses generated are not the sunglasses we expected at all.

Side by side comparison of the reference glasses versus the Runway Gen-4 result

The bag that was generated is consistent with the image we gave it. However, it does look like it gave the character an extra thumb to hold on to the bag.

Side by side comparison of the reference bag versus the Runway Gen-4 result

Character consistency

In the first pass, the character seems to have been flipped in its body position.

Side by side comparison of the reference character versus the Runway Gen-4 pass 1 result

This is not a huge deal if you are not too worried about these smaller details, but if you need the image to be exactly the same across generations, then obviously this fails.

Performance

Here's how long it took and how much it cost to generate this image.

Metric	Value
Time	30.01s avg across 2 passes
Cost	$0.10 (2 images x $0.05)

Which image generator is best for AI virtual try-on?

FLUX.2 was the clear winner of this test, producing a near-perfect result across all three items with no notable artifacts.

Gemini 3.1 Flash came in second, matching the items well but with the strange two-bag artifact.

gpt-image-2 passed, reproducing the jacket and bag accurately, but fell short on the glasses, missing the distinctive frame details of the original.

Runway Gen-4 struggled with item accuracy and character consistency.

	FLUX.2	Gemini 3.1 Flash	gpt-image-2	Runway Gen-4
Items preserved?	Yes	Mostly	Mostly	No

gpt-image-2 was significantly slower and more expensive than the other passing models, taking 197.73 seconds and costing ~$0.224 compared to FLUX.2's 14.47 seconds at $0.09 and Gemini's 16.22 seconds at ~$0.084.

	FLUX.2	Gemini 3.1 Flash	gpt-image-2
Time	14.47s	16.22s	197.73s
Cost	$0.09	~$0.084	~$0.224

Which model can generate a stylized character consistently across multiple images?

Image generation models are expected to maintain consistency with abstract characters as well as real people.

In this section, we test the model's ability to maintain consistency in a difficult art style for an abstract character. Specifically, this pixel art character:

We test a workflow that would be extremely useful in a real life application: generating the sprites needed for an animation sheet of a character.

Here is the human artist's character animation that we will be comparing the model output to:

The original artist's walk cycle animation for the pink pixel art character

FLUX.2

FLUX.2 uses the same technique as the previous tests, passing the sprite as input_image alongside a text description of each pose.

We make 6 independent calls, one per frame, with no chaining between them.

Result

Here are all six frames FLUX.2 generated:

All six FLUX.2 walk cycle frames side by side

Here is what it looks like when animated:

FLUX.2 animated walk cycle GIF showing 6 frames of the pink pixel art character

Qualitative analysis

FLUX.2 failed this test.

Character style consistency

FLUX.2 generated some sprites that could be close to the character style.

Side by side comparison of the reference sprite versus a FLUX.2 frame with close character style

But some of the other sprites were not so consistent.

Side by side comparison of the reference sprite versus a FLUX.2 frame with poor character style consistency

Character pose correctness

If the model had followed our pose instructions correctly, the animated GIF would show a smooth walk cycle where each frame advances from the previous. Instead, FLUX.2 generated inconsistent poses with no clear progression between frames.

Side by side of the original artist walk cycle versus the FLUX.2 output

Performance

Here's how long it took and how much it cost to generate these frames.

Metric	Value
Frames generated	6
Avg. time per frame	7.12s
Cost per frame	$0.045
Total cost	$0.27

Gemini 3.1 Flash

Gemini 3.1 Flash uses the same technique as the previous tests, passing the sprite alongside a text description of each pose.

We make 6 independent calls, one per frame, with no chaining between them.

Result

Here are all six frames Gemini 3.1 Flash generated:

All six Gemini 3.1 Flash walk cycle frames side by side

Here is what it looks like when animated:

Gemini 3.1 Flash animated walk cycle GIF showing 6 frames of the pink pixel art character

Qualitative analysis

Gemini also failed this test.

Character style consistency

Some frames manage to create a character consistent with our style.

Side by side comparison of the reference sprite versus a Gemini 3.1 Flash frame with close character style

While others seem to fail.

Side by side comparison of the reference sprite versus a Gemini 3.1 Flash frame with poor character style consistency

Character pose correctness

If the model had followed our pose instructions correctly, the animated GIF would show a smooth walk cycle where each frame advances from the previous. Instead, Gemini generated frames with no clear pose progression.

Side by side of the original artist walk cycle versus the Gemini 3.1 Flash output

Performance

Here's how long it took and how much it cost to generate these frames.

Metric	Value
Frames generated	6
Avg. time per frame	19.81s
Avg. output tokens	~1,401
Cost per frame	~$0.084
Total cost	~$0.50

gpt-image-2

gpt-image-2 uses the same technique, passing the sprite to the images.edit endpoint alongside a text description of each pose for 6 independent calls.

Result

Here are all six frames gpt-image-2 generated:

All six gpt-image-2 walk cycle frames side by side

Here is what it looks like when animated:

gpt-image-2 animated walk cycle GIF showing 6 frames of the pink pixel art character

Qualitative analysis

gpt-image-2 also failed this test. Like FLUX.2, it showed poor character consistency across frames.

Character style consistency

Some frames produced a character that was reasonably close to the reference style.

Side by side comparison of the reference sprite versus a gpt-image-2 frame with close character style

While others drifted significantly from the reference.

Side by side comparison of the reference sprite versus a gpt-image-2 frame with poor character style consistency

Character pose correctness

If the model had followed our pose instructions correctly, the animated GIF would show a smooth walk cycle where each frame advances from the previous. Instead, gpt-image-2 generated frames with no clear pose progression.

Side by side of the original artist walk cycle versus the gpt-image-2 output

Performance

Here's how long it took and how much it cost to generate these frames.

Metric	Value
Frames generated	6
Avg. time per frame	59.99s
Avg. input tokens	~81
Output tokens per frame	1,756
Cost per frame	~$0.053
Total cost	~$0.32

Runway Gen-4

Runway Gen-4 uses the same technique as the previous tests, passing the sprite as a tagged @sprite reference and describing each pose in text for 6 independent calls.

Result

Here are all six frames Runway Gen-4 generated:

All six Runway Gen-4 walk cycle frames side by side

Here is what it looks like when animated:

Runway Gen-4 animated walk cycle GIF showing 6 frames of the pink pixel art character

Qualitative analysis

Runway failed this test harder than any of the other models.

Character style consistency

Not only did Runway not generate a single character that was consistent with our reference character, it was unable to even maintain any sort of consistency across all six of the sprites it generated.

Side by side comparison of all six Runway Gen-4 walk cycle frames

Character pose correctness

If the model had followed our pose instructions correctly, the animated GIF would show a smooth walk cycle where each frame advances from the previous. Instead, Runway generated characters that all appear in the same position, with no progression between frames.

Side by side of the original artist walk cycle versus the Runway Gen-4 output

Performance

Here's how long it took and how much it cost to generate these frames.

Metric	Value
Frames generated	6
Avg. time per frame	26.47s
Cost per frame	$0.05
Total cost	$0.30

Which AI image generator maintains the most consistent character identity across multiple generations?

All four models failed this test. None of them were able to consistently follow our pose descriptions across six independent generations.

Gemini 3.1 Flash was the best performer on character style consistency, producing the most recognisable version of our reference character across all six frames.

FLUX.2 and gpt-image-2 both showed poor and inconsistent character style across their frames, performing roughly on par with each other. Neither followed the pose descriptions reliably.

Runway performed worst overall, generating characters that were neither consistent with the reference nor consistent with each other across the six frames.

	FLUX.2	Gemini 3.1 Flash	gpt-image-2	Runway Gen-4
Avg. time per frame	7.12s	19.81s	59.99s	26.47s
Cost per frame	$0.045	~$0.084	~$0.053	$0.05
Total cost (6 frames)	$0.27	~$0.50	~$0.32	$0.30
Character consistency	Inconsistent	Best	Inconsistent	Worst
Pose correctness	No	No	No	No

Which AI image generator is best for character consistency?

Here are the final results:

	FLUX.2	Gemini 3.1 Flash	gpt-image-2	Runway Gen-4
Test 1: walk cycle consistency	Inconsistent	Best	Inconsistent	Worst
Test 2: scene placement	Pass	Pass	Pass	Fail
Test 3: virtual try-on	Pass	Mostly	Mostly	Fail

First place: FLUX.2

FLUX.2 was the strongest overall performer across the three tests.

Preserved a realistic character's identity when placing them in a new scene
Produced the most accurate virtual try-on result across all three clothing items
Showed some character style consistency across the walk cycle frames

Second place: Gemini 3.1 Flash

Gemini came in a close second, performing well across all three tests.

Preserved a realistic character's identity when placing them in a new scene
Matched all three clothing items accurately in the virtual try-on test
Produced the most consistent character style across the walk cycle frames of any model

Third place: gpt-image-2

gpt-image-2 passed the realistic character tests but was the slowest and most expensive model by a significant margin.

Preserved a realistic character's identity in a new scene, though with a tendency to carry over the reference pose too literally
Reproduced most clothing items accurately in the virtual try-on test, with minor issues on the glasses
Showed poor and inconsistent character style across the walk cycle frames

Fourth place: Runway Gen-4

Runway Gen-4 failed all three tests and was the weakest performer overall.

Failed to preserve a realistic character's identity in a new scene
Failed to accurately reproduce clothing items in the virtual try-on test
Generated characters that were inconsistent with the reference and with each other across the walk cycle frames

FLUX.2 Speed vs Gemini Price

FLUX.2 and Gemini both produced strong results, but where they differ is on speed and price.

	FLUX.2	Gemini 3.1 Flash	gpt-image-2	Runway Gen-4
Avg. time per image (tests 2+3)	~22.5s	~23.7s	~128s	~28s
Avg. cost per image (tests 2+3)	~$0.113	~$0.084	~$0.224	~$0.065
Cost per frame (test 1)	$0.045	~$0.084	~$0.053	$0.05

FLUX.2 wins on speed and quality

FLUX.2 was the fastest model overall and produced the strongest quality results across the tests.

Gemini comes first on price

Gemini was able to produce very comparable results at a lower price. Across the single-image tests, Gemini averaged ~$0.084 per image compared to FLUX.2's ~$0.113, making it roughly 26% cheaper.

Which AI image generator is right for you?

The right model depends on what you are building and what tradeoffs matter most to you.

If you need the best character consistency overall

Use FLUX.2. It consistently preserved identity across both realistic and stylized characters, and produced the strongest virtual try-on results. It is also the fastest model of the four.

If price is your priority

Use Gemini 3.1 Flash. It produced results comparable to FLUX.2 at roughly 26% lower cost per image, and its synchronous API means no polling logic to manage.

If you are building for game dev or sprite animation

None of the four models reliably followed pose descriptions across independent generations in our tests. All four struggled to maintain consistent character style across a walk cycle. This is still an unsolved problem for current image generation APIs.

If you are building a virtual try-on or e-commerce tool

FLUX.2 is the clear choice. It reproduced fine item details across multiple reference images with no notable artifacts. Gemini is a solid second option.

If you are placing a real person in new scenes

FLUX.2, Gemini, and gpt-image-2 all passed this test. FLUX.2 and Gemini are the better choices on speed and cost. gpt-image-2 works but is significantly slower and more expensive, and tends to carry over the reference pose too literally. Runway Gen-4 failed this test entirely.

FAQ

How many reference images does FLUX.2 accept?

FLUX.2 accepts up to 8 reference images via the API and up to 10 in the playground. In our tests, we passed up to 4 reference images simultaneously, covering a character and three clothing items, and the model handled all of them accurately.

How many reference images does Gemini 3.1 Flash accept?

Gemini 3.1 Flash accepts up to 14 reference images as content parts in a single API call.

Does Runway Gen-4 support multiple reference images?

Runway Gen-4 supports up to 3 reference images. Each image is passed with a tag, which you then reference directly in the prompt using @tag. This 3-image limit is the most restrictive of the four models we tested.

Which AI image generator is best for virtual try-on?

FLUX.2 is the best choice for virtual try-on. It reproduced fine details across multiple clothing items, including colour patterns, hardware, and strap placement, while keeping the original character consistent. Gemini 3.1 Flash is a close second and a good option if cost is a priority.

Which AI image generator is cheapest for character consistency tasks?

Gemini 3.1 Flash is the cheapest of the top-performing models, averaging around $0.084 per image across our tests. FLUX.2 averaged around $0.113 per image, making Gemini roughly 26% cheaper while producing comparable results on most tasks.

Is gpt-image-2 fast for image generation?

gpt-image-2 was the slowest of the four models in our tests by a significant margin, averaging around 128 seconds per image across the single-image tests and around 60 seconds per frame in the walk cycle test. FLUX.2 was the fastest at around 22.5 seconds per image, followed by Gemini at around 23.7 seconds.

Results at a glance​

Placing the same person in a new scene​

Winner: FLUX.2​

Loser: Runway​

Adding items to an image while preserving every other detail​

Winner: FLUX.2​

Loser: Runway​

Generating a stylized character consistently across multiple images​

Winner: Gemini 3.1 Flash​

Loser: Runway Gen-4​

Comparing FLUX.2, Gemini 3.1 Flash, gpt-image-2, and Runway Gen-4​

Full feature comparison​

Which model can place the same person in a new scene without changing their features?​

FLUX.2​

Result​

Qualitative analysis​

Performance​

Gemini 3.1 Flash​

Result​

Qualitative analysis​

Performance​

gpt-image-2​

Result​

Qualitative analysis​

Performance​

Runway Gen-4​

Result​

Qualitative analysis​

Performance​

Which AI image generator best preserves a real person's face?​

Which model can add an item to an image while preserving every other detail?​

FLUX.2​

Result​

Qualitative analysis​

Performance​

Gemini 3.1 Flash​

Result​

Qualitative analysis​

Performance​

gpt-image-2​

Result​

Qualitative analysis​

Performance​

Runway Gen-4​

Result​

Qualitative analysis​

Performance​

Which image generator is best for AI virtual try-on?​

Which model can generate a stylized character consistently across multiple images?​

FLUX.2​

Result​

Qualitative analysis​

Performance​

Gemini 3.1 Flash​

Result​

Qualitative analysis​

Performance​

gpt-image-2​

Result​

Qualitative analysis​

Performance​

Runway Gen-4​

Result​

Qualitative analysis​

Performance​

Which AI image generator maintains the most consistent character identity across multiple generations?​

Which AI image generator is best for character consistency?​

First place: FLUX.2​

Second place: Gemini 3.1 Flash​

Third place: gpt-image-2​

Fourth place: Runway Gen-4​

FLUX.2 Speed vs Gemini Price​

FLUX.2 wins on speed and quality​

Gemini comes first on price​

Which AI image generator is right for you?​

If you need the best character consistency overall​

If price is your priority​

If you are building for game dev or sprite animation​

If you are building a virtual try-on or e-commerce tool​

If you are placing a real person in new scenes​

Results at a glance

Placing the same person in a new scene

Winner: FLUX.2

Loser: Runway

Adding items to an image while preserving every other detail

Winner: FLUX.2

Loser: Runway

Generating a stylized character consistently across multiple images

Winner: Gemini 3.1 Flash

Loser: Runway Gen-4

Comparing FLUX.2, Gemini 3.1 Flash, gpt-image-2, and Runway Gen-4

Full feature comparison

Which model can place the same person in a new scene without changing their features?

FLUX.2

Result

Qualitative analysis

Performance

Gemini 3.1 Flash

Result

Qualitative analysis

Performance

gpt-image-2

Result

Qualitative analysis

Performance

Runway Gen-4

Result

Qualitative analysis

Performance

Which AI image generator best preserves a real person's face?

Which model can add an item to an image while preserving every other detail?

FLUX.2

Result

Qualitative analysis

Performance

Gemini 3.1 Flash

Result

Qualitative analysis

Performance

gpt-image-2

Result

Qualitative analysis

Performance

Runway Gen-4

Result

Qualitative analysis

Performance

Which image generator is best for AI virtual try-on?

Which model can generate a stylized character consistently across multiple images?

FLUX.2

Result

Qualitative analysis

Performance

Gemini 3.1 Flash

Result

Qualitative analysis

Performance

gpt-image-2

Result

Qualitative analysis

Performance

Runway Gen-4

Result

Qualitative analysis

Performance

Which AI image generator maintains the most consistent character identity across multiple generations?

Which AI image generator is best for character consistency?

First place: FLUX.2

Second place: Gemini 3.1 Flash

Third place: gpt-image-2

Fourth place: Runway Gen-4

FLUX.2 Speed vs Gemini Price

FLUX.2 wins on speed and quality

Gemini comes first on price

Which AI image generator is right for you?

If you need the best character consistency overall

If price is your priority

If you are building for game dev or sprite animation

If you are building a virtual try-on or e-commerce tool

If you are placing a real person in new scenes