You type a sentence. Seconds later, an image appears that never existed before. It looks like a photograph, or a painting, or something entirely new. But what actually happened between your words and those pixels?
Most explanations either drown in mathematics or wave their hands and say "AI magic." This guide takes a different approach. We are going to walk through the entire process, component by component, in a way that respects your intelligence without requiring a graduate degree.
By the end, you will understand not just that AI creates images, but how it creates them, and why that understanding makes you a dramatically better practitioner.
When you type a prompt into an AI image generator, four major things happen in rapid succession:
1. Text encoding. Your written prompt is converted into a mathematical representation, a set of numbers that capture the meaning of your words. This is handled by a text encoder, most commonly a model called CLIP.
2. Noise initialization. The system starts with pure random noise, visual static, like an old television between channels. This noise exists in a compressed mathematical space called latent space.
3. Iterative denoising. Over a series of steps (typically 20 to 50), the system gradually removes noise while being guided by your text encoding. Each step makes the image slightly more coherent, slightly more aligned with what your words described.
4. Decoding. The final cleaned-up representation is decoded from latent space back into actual pixels, producing the image you see on screen.
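To make the skeleton concrete, here is a toy sketch of the four stages in Python. Every function below is a deliberately simplified stand-in invented for this illustration (real systems use CLIP, a U-Net, and a VAE), but the control flow mirrors the actual pipeline: encode, initialize noise, loop, decode.

```python
import numpy as np

def encode_text(prompt):
    # Stage 1 stand-in: map words to a fixed-size vector (real: CLIP).
    vec = np.zeros(8)
    for word in prompt.lower().split():
        vec[hash(word) % 8] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def denoise_step(latent, text_encoding, t):
    # Stage 3 stand-in: nudge the latent toward a text-derived target
    # (real: a U-Net predicts and subtracts noise).
    target = np.outer(text_encoding[:4], text_encoding[4:])
    alpha = 1.0 / (t + 2)
    return (1 - alpha) * latent + alpha * target

def vae_decode(latent):
    # Stage 4 stand-in: upsample 2x by pixel repetition (real: VAE decoder).
    return np.kron(latent, np.ones((2, 2)))

def generate_image(prompt, steps=30, seed=0):
    text_encoding = encode_text(prompt)                           # 1. encode
    latent = np.random.default_rng(seed).standard_normal((4, 4))  # 2. noise
    for t in reversed(range(steps)):                              # 3. denoise
        latent = denoise_step(latent, text_encoding, t)
    return vae_decode(latent)                                     # 4. decode

image = generate_image("a sunset over the ocean", seed=42)
```

Note that the seed alone fixes the starting noise, so the same prompt, seed, and step count always reproduce the same output.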
That is the skeleton. Now let's put meat on those bones.
Before an AI model can generate a single image, it must be trained, and training is where the real magic lives.
The training process works in a beautifully counterintuitive way. Instead of teaching the model how to create images from scratch, we teach it how to repair damaged images. Specifically, we take millions of existing images, systematically add noise to them at varying intensities, and then train a neural network to predict the noise that was added, which is equivalent to predicting what the original, clean image looked like.
The AI model does this millions of times, across millions of images, each paired with text descriptions. Through this process, it learns two crucial things simultaneously: the statistical structure of what images look like (edges, textures, shapes, lighting, composition) and how text descriptions correspond to visual features.
This is fundamentally different from how most people imagine AI working. The model does not store a database of images and cut-and-paste from them. It learns patterns, the statistical relationships between visual features, and uses those patterns to construct new images from noise. No individual training image is memorized. What the model retains is a generalized understanding of how visual information is structured.
The technical name for this approach is diffusion modeling, and it borrows its name from physics. In thermodynamics, diffusion describes how particles spread from areas of high concentration to low concentration, eventually reaching a state of uniform randomness (equilibrium). Think of a drop of ink spreading through water until the color is evenly distributed.
Diffusion models work with this same concept, but in reverse:
Forward diffusion (used during training) takes a clean image and progressively adds Gaussian noise over many steps until the image becomes pure random static. This is the "ink spreading through water" direction, easy and natural.
Reverse diffusion (used during generation) starts with pure noise and progressively removes it, step by step, to reveal a coherent image. This is the hard direction, "un-spreading" the ink, and it is what the neural network learns to do.
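The easy, forward direction can be written down directly. In this minimal numpy sketch, a simple linear schedule stands in for the linear-beta or cosine schedules real models use; the essential mechanism is the variance-preserving blend of signal and noise, with the signal fraction shrinking as the timestep grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(clean, noise, t, num_steps=1000):
    # Forward diffusion: blend image and noise. alpha_bar shrinks from
    # ~1 (mostly image) at t=0 to 0 (pure static) at t=num_steps.
    # Linear schedule here is a simplification of real schedules.
    alpha_bar = 1.0 - t / num_steps
    return np.sqrt(alpha_bar) * clean + np.sqrt(1.0 - alpha_bar) * noise

clean = rng.standard_normal((4, 4))   # stand-in for a training image
noise = rng.standard_normal((4, 4))

slightly_noisy = add_noise(clean, noise, t=10)     # early: mostly image
pure_static    = add_noise(clean, noise, t=1000)   # late: all noise

# Training pairs each noisy input with the noise that produced it:
# given (noisy image, caption, t), the network must predict `noise`.
```

The network only ever sees the noisy blend; learning to undo it, one timestep at a time, is what makes the hard reverse direction possible.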
Key insight: The model never generates an image in one shot. It sculpts it gradually, through dozens of small refinements. Each step removes a little noise and adds a little structure, guided by your prompt. Early steps establish broad composition and color. Later steps add fine detail and texture.
This iterative refinement is why the "steps" parameter matters in practice. More steps generally means more refinement, though with diminishing returns. Twenty steps captures the broad structure. Thirty or forty adds detail. Beyond fifty, you are mostly wasting computation.
If diffusion models worked directly on pixel data, generating a single 512x512 image would require manipulating 786,432 individual values (512 x 512 x 3 color channels) at every step. That is computationally brutal. Generating a 1024x1024 image would mean over 3 million values per step.
The solution is latent space, and it is one of the most important concepts in modern AI image generation.
A latent space is a compressed mathematical representation of images. Before the diffusion process begins, a component called a Variational Autoencoder (VAE) learns to compress images from full pixel resolution down to a much smaller representation, typically a 64x64 grid with a few feature channels, that retains the essential information while discarding redundant data.
This is why the approach used by Stable Diffusion and most modern generators is called Latent Diffusion. The entire noising and denoising process happens in this compressed space, making it dramatically faster and less memory-intensive than working with raw pixels. The compression ratio is typically 8x in each dimension, meaning a 512x512 image becomes a 64x64 latent representation.
The VAE has two halves: an encoder that compresses images into latent space (used during training) and a decoder that reconstructs pixels from latent representations (used at the end of generation). The quality of the VAE directly affects the final image quality, which is why different models can produce noticeably different textures and details even with identical prompts.
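The savings are easy to quantify with Stable Diffusion's usual shapes (8x compression per side, 4 latent channels):

```python
# Per-step workload: pixel space vs Stable Diffusion's latent space.
pixel_values  = 512 * 512 * 3   # RGB pixels: 786,432 values per step
latent_values = 64 * 64 * 4     # latent tensor: 16,384 values per step

ratio = pixel_values / latent_values   # 48x fewer values to manipulate
```

A 48x reduction per step, repeated over dozens of denoising steps, is the difference between generation in seconds and generation in minutes.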
Your text prompt cannot directly influence a mathematical denoising process. Words need to be translated into the same mathematical language the diffusion model speaks. This is where text encoders come in.
The most common text encoder in AI image generation is CLIP (Contrastive Language-Image Pre-training), developed by OpenAI. CLIP was trained on hundreds of millions of image-text pairs scraped from the internet. Through that training, it learned to map text descriptions and images into a shared mathematical space where similar concepts cluster together.
When you write "a sunset over the ocean," CLIP does not look up those words in a dictionary. It converts the entire phrase into a high-dimensional vector, a list of numbers, typically 768 or 1024 values long, that encodes the semantic meaning. Critically, CLIP encodes meaning, not just words. "A sunset over the ocean" and "golden light on water at dusk" produce similar vectors because CLIP understands they describe similar visual concepts.
Key insight: This is why prompt engineering is more art than formula. The text encoder interprets meaning, not keywords. Word order, phrasing, and context all affect the resulting vector. Two prompts with the same words in different arrangements can produce noticeably different images because they encode different semantic relationships.
Newer models like SDXL and Flux use dual text encoders, combining CLIP with a second, larger language model (like OpenCLIP-G or T5) to capture richer semantic information. The first encoder handles broad visual concepts, while the second captures fine-grained details and longer-range dependencies in the text. This is why newer models tend to follow complex prompts more accurately than their predecessors.
Before encoding, your prompt is broken into tokens, which are not always whole words. The word "photograph" might be one token. "Cyberpunk" might be split into "cyber" and "punk." Most CLIP models have a limit of 77 tokens per prompt, which is why very long prompts get truncated. Some interfaces work around this with techniques like prompt chunking, processing the prompt in segments and combining the results.
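The truncation behavior is easy to sketch. Whitespace splitting below is a crude stand-in for CLIP's byte-pair-encoding tokenizer; the point is the 77-slot window, two slots of which go to start-of-text and end-of-text markers.

```python
def clip_style_tokenize(prompt, limit=77):
    # Split on whitespace (real CLIP uses byte-pair encoding) and
    # reserve 2 of the 77 slots for the start/end markers.
    words = prompt.split()
    kept = words[: limit - 2]
    return ["<start>"] + kept + ["<end>"]

# A 200-word prompt: everything after the 75th word is silently dropped.
long_prompt = " ".join(f"word{i}" for i in range(200))
tokens = clip_style_tokenize(long_prompt)
```

This is why front-loading the concepts you care about matters: whatever falls past the window simply never reaches the model.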
Each token gets an embedding (a vector of numbers), then attention mechanisms allow the tokens to influence each other, capturing relationships like "the red dress" (red modifies dress) versus "the dress near the red car" (red modifies car, not dress). This contextual understanding is what makes modern text-to-image so much more capable than earlier approaches that treated prompts as bags of independent keywords.
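Stripped of the learned projection matrices and multiple heads that real transformers use, the attention operation at the heart of this contextual mixing is just a weighted average, with weights computed from query-key similarity:

```python
import numpy as np

def attention(queries, keys, values):
    # Scaled dot-product attention: each token's output is a weighted
    # average of all value vectors; the weights (a softmax over
    # query-key similarity) let "red" attend to the noun it modifies.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 16))      # 5 token embeddings, 16 dims
mixed = attention(tokens, tokens, tokens)  # self-attention over a prompt
```

After this mixing, each token's vector carries information about its neighbors, which is exactly the contextual understanding that keyword-bag approaches lacked.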
This is the heart of the generation process. It is where noise becomes image.
The process begins with a latent representation filled with random Gaussian noise. At each step, the neural network (called a U-Net in most current architectures, or a Diffusion Transformer in newer ones like Flux) receives three inputs:
1. The current noisy latent. This is the image-in-progress, initially pure static, gradually becoming more structured.
2. The text encoding. This is your prompt, translated into mathematical form by CLIP, acting as a constant guide throughout the process.
3. The timestep. This tells the model how much noise remains to be removed, essentially where we are in the denoising schedule. The model behaves differently at early timesteps (making big structural decisions) versus late timesteps (refining details).
Given these inputs, the model predicts the noise component of the current latent, essentially answering: "What part of this image is noise, and what part is signal?" The predicted noise is then subtracted (with some mathematical nuance handled by the sampler), producing a slightly cleaner version. This slightly cleaner version becomes the input for the next step.
The role of the sampler (Euler, DPM++, DDIM, and others) is to manage how noise is subtracted at each step. Different samplers use different mathematical strategies for this subtraction, which is why switching samplers can change the character of the output even with identical prompts and seeds. Some samplers are more conservative (adding less randomness per step, producing more predictable results), while others are more exploratory (allowing more variation, sometimes producing more creative outputs).
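As one concrete example, here is the update rule of a basic Euler sampler in the style of k-diffusion implementations, assuming a model that predicts noise (epsilon) and a noise scale sigma; other samplers replace this update with higher-order or stochastic variants.

```python
import numpy as np

def euler_step(latent, predicted_noise, sigma_now, sigma_next):
    # Form the model's current best guess at the clean image, then move
    # the latent toward it along the noise schedule (one Euler step).
    denoised = latent - sigma_now * predicted_noise
    derivative = (latent - denoised) / sigma_now
    return latent + derivative * (sigma_next - sigma_now)

# At the final step (sigma_next = 0), the update lands exactly on the
# model's denoised estimate.
latent = np.array([1.0, -2.0, 0.5])
eps    = np.array([0.2, -0.1, 0.4])
final  = euler_step(latent, eps, sigma_now=1.5, sigma_next=0.0)
```

The choice of how large each sigma step is (the noise schedule) and how the update is computed is the sampler's entire job, which is why swapping samplers changes results without touching the model.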
After the final denoising step, you have a clean latent representation, a 64x64x4 tensor of numbers that encodes an image. But you cannot display a grid of abstract numbers on a screen. It needs to be converted back into pixels.
This is the job of the VAE decoder. It takes the compressed latent representation and reconstructs a full-resolution image, typically at 8x the latent resolution (64x64 becomes 512x512). The decoder has learned, through its own training, how to plausibly fill in the details that were lost during compression, adding texture, sharpness, and fine detail that the latent space could not represent.
The quality of the VAE decoder matters more than most people realize. A good decoder produces sharp, detailed images with accurate colors and clean textures. A mediocre decoder can introduce artifacts, muddy colors, or an overly soft look. This is why some community-created VAE replacements can meaningfully improve output quality, even without changing the diffusion model itself.
There is one more critical component that dramatically affects output quality: Classifier-Free Guidance, usually abbreviated as CFG.
During each denoising step, the model actually runs twice: once with your text prompt (the "conditioned" prediction) and once without any text at all (the "unconditioned" prediction). The final noise prediction is then calculated by comparing these two:
Final prediction = unconditioned prediction + CFG_scale * (conditioned prediction - unconditioned prediction)
The CFG scale, typically set between 5 and 15, controls how strongly the text prompt influences the output. At CFG 1, the formula reduces to the plain conditioned prediction, so the prompt exerts only a weak influence. At CFG 7, there is a strong but natural influence. At CFG 15 or higher, the model aggressively forces the output to match the prompt, which can produce oversaturated, artifact-heavy results.
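The guidance arithmetic itself is one line, and the edge cases make the scale's behavior obvious:

```python
import numpy as np

def cfg(uncond, cond, scale):
    # Classifier-free guidance: start from the unconditioned prediction
    # and move along the direction the prompt pulls in, scaled up.
    return uncond + scale * (cond - uncond)

uncond = np.array([0.0, 0.0])   # toy 2-value "noise predictions"
cond   = np.array([1.0, 2.0])

cfg(uncond, cond, 0.0)   # prompt ignored entirely: equals uncond
cfg(uncond, cond, 1.0)   # plain conditioned prediction: equals cond
cfg(uncond, cond, 7.5)   # the prompt's pull exaggerated 7.5x
```

High scales extrapolate far past the conditioned prediction itself, which is exactly why extreme CFG values push outputs into oversaturated, unnatural territory.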
Key insight: CFG scale is a trade-off between creativity and adherence. Lower values give the model more creative freedom (sometimes producing surprising, beautiful results, sometimes producing irrelevant ones). Higher values make the model follow instructions more literally (producing more predictable results, but sometimes at the cost of natural quality). The sweet spot for most use cases is between 6 and 9.
This dual-evaluation is also why generation takes longer than you might expect. Every step requires two full passes through the neural network, one conditioned and one unconditioned, effectively doubling the computation. Some optimization techniques (like using a distilled model for the unconditioned pass) can reduce this cost, but the fundamental trade-off remains.
Now that we have covered each component, here is the complete sequence from start to finish:
You type: "A cyberpunk cityscape at night with neon reflections on wet streets, highly detailed, cinematic lighting"
The prompt is broken into tokens: ["a", "cyber", "punk", "city", "scape", "at", "night", ...]. Each token is mapped to an embedding vector.
CLIP processes the token embeddings through its transformer layers, producing a final text encoding: a matrix of numbers that captures the semantic meaning of your entire description, including relationships between concepts.
A 64x64x4 tensor (for a 512x512 output) is filled with random Gaussian noise. The random seed determines the specific noise pattern, which is why the same seed produces the same image.
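This determinism is a property of the random number generator, not the model, as a quick check shows:

```python
import numpy as np

# The seed fully determines the starting noise, which is why the same
# seed (with the same prompt and settings) reproduces the same image.
noise_a = np.random.default_rng(42).standard_normal((4, 64, 64))
noise_b = np.random.default_rng(42).standard_normal((4, 64, 64))
noise_c = np.random.default_rng(43).standard_normal((4, 64, 64))
# noise_a and noise_b are identical; noise_c starts the whole process
# from a different point, yielding a different image.
```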
At each step: the U-Net receives the current noisy latent + text encoding + timestep. It predicts noise. The sampler subtracts predicted noise according to its algorithm. CFG scaling amplifies the text-conditioned direction. The result feeds into the next step.
The clean 64x64x4 latent is passed through the VAE decoder, which reconstructs a full 512x512x3 pixel image with proper color values.
The pixel array is saved as a PNG or JPEG file and displayed on your screen. Total time: typically 3 to 30 seconds depending on hardware and settings.
You can absolutely use AI image generators without understanding any of this. But knowing how the pipeline works transforms you from someone who tries random things and hopes for good results into someone who can diagnose problems and make targeted adjustments.
When your image looks washed out, you know the issue might be in the VAE decoder and can try a different one. When your prompt is being ignored, you understand that the text encoder might be truncating it at 77 tokens, so you should front-load the important concepts. When details are mushy, you know that adding more denoising steps might help the later refinement phase. When colors are oversaturated, you know your CFG scale is too high, pushing the model too hard toward your prompt at the expense of natural quality.
This is the difference between using a tool and understanding a tool. Both can produce results. But understanding gives you control, the ability to get what you want consistently rather than through trial and error.
The components we covered here, text encoding, latent diffusion, iterative denoising, VAE decoding, and CFG guidance, are the foundation that every other technique builds on. Prompt engineering is about optimizing the text encoding. LoRA fine-tuning is about adjusting the U-Net's learned patterns. ControlNet is about adding additional conditioning signals alongside the text. Inpainting is about masking portions of the latent space. Everything connects back to this pipeline.
Understanding the foundation makes everything else click into place.