Drawing the Undrawable: DALL·E 2
Anna Kovalenko
Posted on December 7, 2022
Imagine that you can draw anything you want: from a bowl of soup that is actually a planet in the universe, rendered as classic digital art, to a portrait of a lady with a ruby necklace, painted in a Renaissance style. Believe it or not, you can generate those images, and all other imaginable and unimaginable things, using DALL·E 2.
So what exactly is DALL·E 2?
DALL·E 2 is an AI system that can create realistic images and art from a natural-language prompt in a couple of seconds. A prompt is simply a piece of text, no longer than 400 characters, that describes the image you want to create. It can be a detailed descriptive sentence or just an emoji; the AI will generate an image from it either way.
Here are some random examples of prompts:
- An astronaut playing basketball with cats in space, as a children’s book illustration
- Teddy bears mixing sparkling chemicals as mad scientists as a 1990s Saturday morning cartoon
- A bowl of soup that looks like a monster knitted out of wool
- Teddy bears (once again because AI makes them very cute) shopping for groceries in the style of ukiyo-e
DALL·E 2 has not been deliberately “taught” different art styles or individual artists’ techniques, just as it has not been “taught” what a bowl of soup is or what figurines made of plasticine look like. Instead, DALL·E studied 650 million images and their descriptions and drew its own conclusions. Its skills and abilities surprise even its creators: the developers of the AI system do not know exactly what DALL·E has and has not learned, or how it interprets a given prompt.
But how does DALL·E 2 actually work?
To be honest, at the highest level, DALL·E 2 works pretty simply:
Firstly, your text prompt is input into a text encoder that is trained to map the prompt to a representation space.
Secondly, a model called the prior maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt contained in the text encoding.
Finally, an image decoder stochastically generates an image that is a visual manifestation of the semantic information in your prompt.
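To make that flow concrete, here is a minimal sketch of the three stages in Python. All of the function names, dimensions, and random stubs are hypothetical stand-ins so the example runs; the point is only the shape of the dataflow, not OpenAI’s actual code:

```python
import numpy as np

# Stand-in stubs: in DALL·E 2 each of these is a large, separately trained model.
def text_encoder(prompt):
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(768)                     # prompt -> text encoding

def prior(text_encoding, rng):
    return text_encoding + 0.1 * rng.standard_normal(text_encoding.shape)  # -> image encoding

def decoder(image_encoding, rng):
    return rng.standard_normal((64, 64, 3))             # image encoding -> "pixels"

def generate_image(prompt, seed):
    rng = np.random.default_rng(seed)
    text_encoding = text_encoder(prompt)                # step 1
    image_encoding = prior(text_encoding, rng)          # step 2
    return decoder(image_encoding, rng)                 # step 3 (stochastic)

# The same prompt with a different seed gives a different, equally valid image.
img_a = generate_image("a bowl of soup that looks like a monster knitted out of wool", seed=0)
img_b = generate_image("a bowl of soup that looks like a monster knitted out of wool", seed=1)
```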
But at a more detailed level, there are several steps DALL·E takes to link related textual and visual abstractions.
Step 1. How DALL·E 2 links Textual and Visual Semantics
If you input the prompt “a bowl of soup that is a portal to another dimension in the style of Basquiat”, DALL·E 2 will output exactly that kind of image.
But how does the AI system know how a textual concept like “a bowl of soup” is manifested in the visual space? The link between textual semantics and their visual representations in DALL·E 2 is learned by another OpenAI model called CLIP (Contrastive Language-Image Pre-training).
CLIP is trained on millions of images and their captions, learning how much a given text prompt relates to an image. CLIP does not try to predict the caption for a given picture or artwork; instead, it learns how related a caption is to an image. This approach helps CLIP learn the link between textual and visual representations of the same abstract concept or object. To obtain this “knowledge”, the DALL·E 2 model relies on CLIP’s ability to learn semantics from natural language.
The principles of CLIP training are not really difficult:
Firstly, all images and their associated captions are passed through their respective encoders, mapping all objects into an m-dimensional space.
Next, the cosine similarity of each (image and text) pair is computed.
The training objective is to simultaneously maximize the cosine similarity between the N correct encoded image-caption pairs and minimize the cosine similarity between the N² - N incorrect encoded image-caption pairs.
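In code, that objective looks roughly like the PyTorch sketch below. It only shows the symmetric contrastive loss over a batch of N image-caption pairs; the real CLIP adds a learned temperature, very large batches, and of course the image and text encoders themselves:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, m) encodings where row i of each tensor
    comes from the same image-caption pair."""
    # Normalise so that a dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) matrix of cosine similarities between every image and every caption.
    logits = image_emb @ text_emb.t() / temperature

    # The N diagonal entries are the correct pairs (similarity pushed up);
    # the N² - N off-diagonal entries are the incorrect pairs (pushed down).
    targets = torch.arange(len(logits), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_texts = F.cross_entropy(logits.t(), targets)   # match each caption to its image
    return (loss_images + loss_texts) / 2

# Example with a batch of N = 4 random m = 512 dimensional encodings.
loss = clip_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```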
After the training, the CLIP model is “frozen” and DALL·E 2 moves on to its next step.
Step 2. How DALL·E 2 generates Images from Visual Semantics
During this step, DALL·E 2 learns how to reverse the image encoding mapping that the CLIP model just learned. CLIP learns a representation space in which it is easy to determine how textual and visual encodings relate to each other, but image generation requires the AI system to learn how to exploit that representation space to actually create an image.
DALL·E 2 uses the GLIDE model to perform the image generation. GLIDE, in turn, is built on a diffusion model. In brief, diffusion models learn to generate data by reversing a gradual noising process, which makes them a natural fit for text-to-image generation. If you want to learn more about diffusion models and Stable Diffusion, you can read my other article.
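To give a flavour of what “reversing a gradual noising process” means, here is a toy DDPM-style sampling loop. Everything in it, from the noise schedule to the stand-in denoiser, is made up for illustration; GLIDE’s real decoder is a large conditioned neural network, not this toy:

```python
import torch

def sample(denoiser, cond_emb, steps=200, shape=(1, 3, 64, 64)):
    """Toy DDPM-style reverse process: start from pure noise and repeatedly
    denoise, conditioned on an embedding (e.g. an image encoding)."""
    betas = torch.linspace(1e-4, 0.02, steps)       # made-up noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                          # begin with pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t, cond_emb)              # predicted noise at step t
        # Simplified DDPM update: strip out the predicted noise...
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # ...then re-inject a smaller amount of fresh noise.
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Stand-in denoiser just so the sketch runs; the real model is trained
# to predict exactly the noise that was added at each step.
toy_denoiser = lambda x, t, emb: torch.zeros_like(x)
image = sample(toy_denoiser, cond_emb=torch.randn(1, 512))
```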
Step 3. How DALL·E 2 maps from Textual Semantics to Corresponding Visual Semantics
GLIDE can generate an image from an image encoding, but DALL·E 2 still needs a way to produce that encoding from the text prompt in the first place. Put simply, DALL·E needs to inject the conditioning information from the text into the image generation process.
Keep in mind that the CLIP model also learns a text encoder in addition to the image encoder. To map from the text encodings of image captions to the image encodings of their corresponding images, DALL·E 2 uses another model called the Prior. The Prior is itself a diffusion model.
The Prior consists of a decoder-only Transformer. It operates on an ordered sequence of:
- the tokenized caption
- the CLIP text encoding of the caption
- an encoding for the diffusion timestep
- the noised CLIP image encoding
- a final encoding whose output from the Transformer is used to predict the un-noised CLIP image encoding
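As a sketch, assembling that sequence for the Transformer might look like this. The dimensions, layer sizes, and module names are hypothetical, and a standard PyTorch Transformer stack stands in for the decoder-only Transformer mentioned above:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just to make the sketch concrete.
d_model, n_caption_tokens, n_timesteps = 1024, 77, 1000

class DiffusionPriorSketch(nn.Module):
    """Minimal sketch of how the Prior's input sequence could be assembled."""

    def __init__(self):
        super().__init__()
        self.timestep_emb = nn.Embedding(n_timesteps, d_model)
        # Learned "final encoding" token appended to the sequence.
        self.final_token = nn.Parameter(torch.randn(1, 1, d_model))
        # Standard Transformer stack as a stand-in for the causally masked,
        # decoder-only Transformer the Prior actually uses.
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, caption_tokens, clip_text_enc, t, noised_image_enc):
        b = caption_tokens.shape[0]
        seq = torch.cat([
            caption_tokens,                       # embedded caption tokens
            clip_text_enc.unsqueeze(1),           # CLIP text encoding
            self.timestep_emb(t).unsqueeze(1),    # diffusion timestep encoding
            noised_image_enc.unsqueeze(1),        # noised CLIP image encoding
            self.final_token.expand(b, -1, -1),   # final encoding token
        ], dim=1)
        out = self.transformer(seq)
        # The output at the final token predicts the un-noised CLIP image encoding.
        return out[:, -1]

# Example with random stand-in inputs (batch of 2).
prior = DiffusionPriorSketch()
pred_image_enc = prior(
    torch.randn(2, n_caption_tokens, d_model),   # embedded caption tokens
    torch.randn(2, d_model),                     # CLIP text encoding
    torch.randint(0, n_timesteps, (2,)),         # diffusion timesteps
    torch.randn(2, d_model),                     # noised CLIP image encoding
)
```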
Step 4. How the developers of DALL·E 2 put all of it together
Those are all the components needed for DALL·E 2, and the final step is to chain them together for text-to-image generation:
First of all, the CLIP model text encoder maps the image description into the representation space.
Second of all, the diffusion prior maps from the CLIP text encoding to a corresponding CLIP image encoding.
Finally, the GLIDE generation model maps from the representation space into the image space with the help of reverse diffusion, generating one of the many possible images that convey the semantic information within the text prompt.
And that’s basically it.
With this technology, DALL·E 2 can create realistic, detailed, brand-new images, expand existing images beyond their original canvas, edit them, and do other fun stuff. If you want to learn more, you can check out the DALL·E 2 website and try out the AI system yourself.
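And if you would rather experiment from code than from the web interface, OpenAI also exposes image generation through its API. At the time of writing, a minimal call with the official openai Python package looks roughly like this (double-check the current documentation, since the interface may change):

```python
import openai

openai.api_key = "YOUR_API_KEY"   # placeholder: use your own API key

response = openai.Image.create(
    prompt="a bowl of soup that is a portal to another dimension in the style of Basquiat",
    n=1,                 # how many images to generate
    size="1024x1024",    # supported sizes: 256x256, 512x512, 1024x1024
)
print(response["data"][0]["url"])  # URL of the generated image
```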