Google and OpenAI are in direct competition to create the best native image generation model. Following Google’s introduction of native image generation with Gemini, OpenAI quickly rolled out native image output functionality for all ChatGPT users. To determine which AI model produces superior results, I compared the native image generation capabilities of OpenAI’s ChatGPT with Google’s Gemini. In this article, I assessed both models based on factors like character consistency, text rendering, and adherence to user instructions, among others.
1. Transform Yourself into an Anime Character
I began my comparison of native image generation features by asking both ChatGPT and Gemini to produce an anime-style image. The results were striking: ChatGPT excelled, generating an image reminiscent of classic Studio Ghibli art with just one prompt. Conversely, Gemini struggled to produce any anime-style image, despite several attempts.

2. Whiteboard Session
For the next test, I requested both ChatGPT and Gemini to create an image depicting a man explaining the concept of relativity. ChatGPT, leveraging its advanced model, delivered a remarkable image with clearly legible handwritten text. It even included a reflection of the photographer.
On the other hand, the smaller Gemini 2.0 Flash model had difficulties producing accurate text on the whiteboard. While Gemini correctly added the label “Beebom” to the man’s t-shirt, it failed to capture the photographer’s reflection. However, the representation of the man in Gemini’s output appeared more authentic than that from ChatGPT.


This scenario vividly illustrates the differences between ChatGPT and Gemini in native image generation. ChatGPT crafted a beautiful menu card with excellent text rendering, though it did miss one dish on the menu. Nevertheless, it followed the provided instructions effectively. In contrast, Gemini tends to hallucinate when inundated with complex information in the prompts, resulting in jumbled text and inaccuracies.


4. Creating an Infographic
Next, I tasked both models with creating an infographic to illustrate the concept of gravity, featuring Newton as a character. Unsurprisingly, ChatGPT performed admirably, providing a well-designed visual that clearly explained the concept with readable text.
Its output was so impressive that the native image generation feature of ChatGPT could be utilized to develop comic strips, educational materials, visual guides, and more.
In stark contrast, Gemini’s results were unsatisfactory, with both text and visuals lacking coherence. Notably, while Gemini 2.0 Flash can generate an image in just 3 to 4 seconds, ChatGPT requires over a minute for a single image due to processing power demands from its larger 4o model, resulting in extremely coherent outputs.


5. Image Restyling
For the image restyling task, I uploaded a photograph of a cactus in a garden and instructed both models to add colorful flowers. My findings showed that ChatGPT tended to overdo each modification, drastically changing the image after every adjustment. In contrast, Gemini maintained greater consistency across multiple iterations.
While ChatGPT 4o is inherently multimodal (built on an auto-regressive architecture), some experts suggest that its image generation feature employs a diffusion-based decoder. This allows for accurate text rendering but results in a complete image regeneration with every iteration.
In contrast, Gemini 2.0 Flash is not a pure auto-regressive model, explaining the differences in outputs after each modification.
6. Blending Images Together
Next, I uploaded two images and tasked both ChatGPT and Gemini to generate an image featuring a woman holding a mug. Both models produced impressive outcomes, with Gemini exhibiting a bit more creativity by altering the posture. OpenAI claims that ChatGPT 4o can process up to 20 images in one prompt, optimally using in-context learning to create a cohesive image.
7. Changing the Point of View
In the subsequent test, I uploaded an image of a hallway and directed both models to change the perspective. The results were quite similar, but ChatGPT remained closer to the original image, whereas Gemini added an extra leg to an armchair in the process. Overall, this round goes to ChatGPT for its more accurate representation of the opposite viewpoint.
8. A Wall Clock Displaying 6:30
In the final test, both ChatGPT and Gemini failed to represent the specified time of 6:30 accurately on the wall clock. This issue frequently arises in AI-generated images, as models often default to rendering 10:10 due to biases in the training datasets. Even with their advanced native image generation capabilities, both platforms have yet to resolve this persistent challenge in following instructions.


Conclusion: Comparing Native Image Generation in ChatGPT and Gemini
After conducting a variety of tests, I can conclude that ChatGPT’s native image generation features are currently more advanced than those of Gemini 2.0 Flash. The larger ChatGPT 4o model offers a more extensive knowledge base, producing images that are both coherent and true to user instructions.
In comparison, Google’s Gemini 2.0 Flash model is smaller, resulting in quicker performance, but it often struggles with hallucinations during text rendering, leading to lower-quality outputs.
One area where Gemini does shine is its ability to maintain consistency across different generations, which is a significant advantage. Users are encouraged to look out for native image output support in the forthcoming Gemini 2.5 Pro model, which is expected to elevate performance in native image generation to new heights.