On Wednesday, OpenAI announced DALL-E 3, the latest version of its AI image synthesis model that features full integration with ChatGPT. DALL-E 3 renders images by closely following complex descriptions and handling in-image text generation (such as labels and signs), which challenged earlier models. Currently in research preview, it will be available to ChatGPT Plus and Enterprise customers in early October.
Like its predecessor, DALL-E 3 is a text-to-image generator that creates novel images based on written descriptions called prompts. Although OpenAI released no technical details about DALL-E 3, the AI model at the heart of previous versions of DALL-E was trained on millions of images created by human artists and photographers, some of them licensed from stock websites like Shutterstock. It’s likely DALL-E 3 follows this same formula, but with new training techniques and more computational training time.
Judging by the samples provided by OpenAI on its promotional blog, DALL-E 3 appears to be a radically more capable image synthesis model than anything else available in terms of following prompts. While OpenAI’s examples have been cherry-picked for their effectiveness, they appear to follow the prompt instructions faithfully and convincingly render objects with minimal deformations. Compared to DALL-E 2, OpenAI says that DALL-E 3 refines small details like hands more effectively, creating engaging images by default with “no hacks or prompt engineering required.”
I wish more people realised this. It’s much harder to create very specific images with the current image generation tools than most people seem to think, which is creating an inaccurate view of the technology in the public eye.
The generator will create something inspired by the prompt it is given, but it can be very hard to make the output match what the prompt writer imagined when writing the prompt. There are various tools that can refine and narrow the generator’s output, to try to control things like posing, composition, and style, and to redraw details. But even then the output is often pot luck. The generated images aren’t necessarily bad, just not what was wanted.
I think the comparison to stock photo images is apt: current image generators are great for creating themed but somewhat generic images. The tools are going to continue to advance, and they are already useful for some applications. But they are still a long way off from truly replacing human artistry.
The problem with that argument is that the artist is the only one who cares about specific output; the art consumer doesn’t. When somebody plays a game or watches a movie, they don’t know what to expect. That’s part of the fun; they just care about it being good. So as long as the output is good enough for the consumer, whatever the artist thinks about it really doesn’t matter, assuming they still have a job to begin with.