Casting Spells: The Art of Text-to-Image Generative AI

Image created by the author using the Stable Diffusion text-to-image generation tool.

Express yourself! This familiar phrase is both an invitation and a command in our culture. Recognition, relevance, social status, and potential monetary reward all require that you “put yourself out there.” But as any creator will tell you, particularly as they stand before their latest creative work, there’s always a gap between what you envision in your mind’s eye and what you can actually create. Of course, everything is perfect when we imagine it—your mental picture of a finished work naturally glosses over the challenges and inevitable imperfections that come with rendering an idea in the real world. Otherwise, your visualization wouldn't serve its primary purpose: motivating you to begin the actual work of bringing your idea to life. But the difference between what we envision and what we can actually make also often comes down to a skills gap: most of us don’t have the skills required to realize our ideas in the exact way we imagine the finished product. Technology is one tool we use to close the gap between what we imagine and what we can create.

In the introduction to this series on generative AI (The Mirror and the Lens), I wrote about the long, intimate relationship between technology and the visual arts that has enabled artists to render the world around them more realistically. I also wrote about the disruptive impact of photography on painting—how it both undermined painters whose primary source of income was portraiture and freed artists to explore looser forms of painting that allowed for more obvious personal expression. The introduction of photography also had another important effect: it bridged the skills gap for those interested in creating visual images who lacked the requisite drawing and painting skills. Generative AI text-to-image systems are laying the foundation for a new way to bridge the skills gap, one that goes beyond enabling you to capture what you see with your eyes. They are tools that enable you to capture what you see in your mind’s eye.

Making Images with Generative AI

Although generative AI products like DALL·E 2 (DALL·E)* and ChatGPT have been in the news a lot lately, you may not know what generative AI really is. If so, you’re not alone—it’s a challenging technology to grasp. In addition, significant advances in artificial intelligence raise questions and fears about the future AI may lead to, and our place in that new world.

Generative AI is a type of artificial intelligence that’s capable of analyzing vast amounts of data (text, images, music, video, code, and more) to produce new, unique content. What makes generative AI special is the last part of that sentence—generative AI systems use existing data to produce unique and creative output that hasn’t been seen before. To date, machine learning systems have mainly been used to analyze existing information, make predictions, and make logic-based decisions. But as you’re about to see, with the introduction of generative AI they can do much, much more.

Below you’ll find a short tutorial on how to sign up for DALL·E and generate your first DALL·E-created image. (OpenAI doesn’t charge for limited use of DALL·E. You only pay if you exhaust your monthly allotment of free credits.)

  • To begin, go to the DALL·E 2 page on the OpenAI website. Click the Sign Up button in the top right corner. You’ll then see the “Create your account” form. Complete the form and the account validation process to create your account.

  • Once you've created your account, you’ll see the “Welcome to DALL·E” dialog. Review it, then click Continue. You’ll then see another dialog with information about your free credits. Once again, review it, then click “Start creating with DALL·E.” You'll then see this web page:

The white field in the middle is where you enter the text prompt for the image you want DALL·E to create.

To create your first image, type a brief description of the image you have in mind into the text prompt field. For example, to create an image of a dog frolicking in a park, type "a dog frolicking in a park," then click the Generate button.

After a few moments, DALL·E returns its interpretation of the prompt:

Notice that DALL·E returns four interpretations of the prompt you submitted, displayed as thumbnail images. You can click a thumbnail to view it at a larger size, edit it, or ask DALL·E to generate variations of it.
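If you’d rather work from code than the web interface, OpenAI also exposes DALL·E through its API. Here’s a minimal sketch using the openai Python library (the v0.x interface), assuming you’ve installed the package and have an API key; the placeholder key is yours to supply:

```python
# A minimal sketch of generating images with the openai Python library
# (v0.x interface). Assumes: pip install openai, plus a valid API key.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder: use your own key

response = openai.Image.create(
    prompt="a dog frolicking in a park",
    n=4,                # ask for four candidates, like the web UI
    size="1024x1024",
)

# Each entry in response["data"] holds the URL of one generated image.
for item in response["data"]:
    print(item["url"])
```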

If you used the same prompt I did, you may wonder why your images of a dog frolicking in the park don’t look exactly like the image DALL·E created for me. Generative AI systems like DALL·E use a random number called a “seed” to start the image generation process. You need to know and control the seed value to reproduce an image—and even when you use the same seed value, the system may not replicate every aspect of the image the same way. You’ll learn why in the next section.
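DALL·E’s web interface doesn’t expose the seed, but open systems like Stable Diffusion do. Here’s a minimal sketch of seed control using Hugging Face’s diffusers library (assuming the library is installed and a GPU is available; the model weights download on first run):

```python
# Reproducing an image by fixing the seed, using Stable Diffusion via
# Hugging Face's diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to("cuda")

prompt = "a dog frolicking in a park"
generator = torch.Generator("cuda").manual_seed(42)  # fixed seed

# The same seed, prompt, model, and settings yield (nearly) the same
# image on every run; change the seed and the image changes.
image = pipe(prompt, generator=generator).images[0]
image.save("dog_seed42.png")
```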

Because we didn't include any specific details about the type of dog we wanted to see, the park setting, or the style of image we wanted, the system "guessed" at the breed of the dog, the type of park, and the style of the image it returned. If you just glance at the photos, you might think all the system did was grab four pictures from the Internet that matched the general description. But look a little closer: you'll see telltale rendering artifacts that reveal the image was composed—the oddly curved left leg in the first image, the strange tail on the dog in the second image, the blurry, painterly dog in the third image, and the slightly horrifying mouth and tongue on the dog in the final image. (DALL·E is a work in progress!)

Let's try again. This time, we’re going to add a little more detail about what we want to see in the image. For example, "A blonde labrador retriever with a green ball in its mouth, frolicking in a green field of grass at sunset. Rendered in dry brush watercolor." Here are the images DALL·E generated for me when I submitted the more detailed prompt:

Here's my favorite in the sequence in a larger size:

You'll notice that there are still some odd artifacts in the image—for example, the ball in the dog's mouth isn’t quite round and its gait is improbable. But the overall composition of the image is appealing, the direction of the light and shadowing is consistent, and the dry watercolor brushwork (especially the grass) is lovely. The image is far, far better than anything I could paint.

DALL·E also enables you to edit an image and generate variations of it.

Here’s a set of variations DALL·E produced of the image above:

And here’s a larger version of the third variation:

The point of view (looking up at the dog from the level of the water) is engaging, and the reflections in the water are interesting and nicely rendered. DALL·E also rendered the image in a wetter watercolor style with loosened brushwork, which works well with the large, painterly, cloud-filled sky.

I’ll return to prompt crafting, but before we go any further into using generative AI text-to-image tools like DALL·E, it’s helpful to know a little more about how text-to-image conversion works.

Generative AI and Pattern Recognition

For those of us who aren’t well versed in AI and its subfields, such as machine learning, deep learning, neural networks, cognitive computing, natural language processing, and computer vision, it’s easy to get lost in the descriptions of how the different generative AI systems from companies like OpenAI, Stability AI, Midjourney, Google, Meta, and Microsoft work. Most descriptions quickly turn to the architecture, models, and specific technologies used in the system—and lose the casual reader along the way. So I’m not going to dive into those topics or related generative AI topics such as Recurrent Neural Networks (RNNs), transformer models, Generative Adversarial Networks (GANs), diffusion models—et cetera, et cetera, et cetera!

While the purpose of the various text-to-image generative AI systems is basically the same (produce a new image based on the instructions in a text prompt or other form of input), the architecture of the systems, the language and image datasets they use, and the specific technologies they use to perform the various tasks associated with translating text into an image differ—completely or in part. But every text-to-image system has to address the following four high-level technical challenges (sketched schematically in code after the list):

  1. Analyzing the text in the prompt to determine what kind of image should be created. The system looks for keywords and phrases that can be used to generate the desired output image, then translates those elements into a set of instructions the system can act on.

  2. Creating the various text-image pairs specified in the instructions, for example, the blonde labrador and the green ball in the prompt used above.

  3. Assembling the parts into a new, visually coherent image.

  4. Refining the image to match the style and attributes specified in the prompt.
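To make the division of labor concrete, here’s a purely schematic sketch of the four stages in Python. No real system is organized this way, and every function here is a hypothetical placeholder that just labels its stage:

```python
# A schematic sketch of the four stages. Every function is a stub that
# labels its stage; real systems implement each with learned models.

def parse_prompt(prompt: str) -> dict:
    # Stage 1: extract subjects, medium, and style from the prompt (stubbed).
    return {"subjects": prompt.split(","), "medium": "watercolor",
            "style": "dry brush"}

def text_to_visual_concept(subject: str) -> str:
    # Stage 2: map a textual element to a visual concept (stubbed).
    return f"concept({subject.strip()})"

def compose(concepts: list) -> str:
    # Stage 3: assemble the concepts into one coherent draft (stubbed).
    return " + ".join(concepts)

def refine(draft: str, medium: str, style: str) -> str:
    # Stage 4: push the draft toward the requested medium and style (stubbed).
    return f"{draft}, rendered as {style} {medium}"

spec = parse_prompt("a blonde labrador, a green ball, a grassy field")
concepts = [text_to_visual_concept(s) for s in spec["subjects"]]
print(refine(compose(concepts), spec["medium"], spec["style"]))
```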

A given technology might solve one challenge above, or several in combination. For example, analyzing the text in the prompt requires the integration of a Natural Language Processing (NLP) technology or function that can determine which parts of the text are important, resolve ambiguities, measure sentiment, and add the additional information required for continued processing of the data extracted. A generative AI developer might use a dedicated NLP technology, or a solution that spans the first two steps above, such as OpenAI’s CLIP, which embeds text and images in a shared space so the system can measure how well an image matches the text in a prompt.
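You can try CLIP yourself. Here’s a minimal sketch using Hugging Face’s transformers library (assuming it’s installed; the model weights download on first run) that scores how well an image matches each of several captions:

```python
# Scoring image-text matches with CLIP via Hugging Face's transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.png")  # any local image you'd like to score
captions = ["a blonde labrador with a green ball",
            "a city skyline at night"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability = a better match between the image and that caption.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0]):
    print(f"{p:.2%}  {caption}")
```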

The architecture, technologies, and data sets used by the different systems lead to different results. But what all generative AI systems have in common is that they rely on sophisticated pattern matching techniques. Think about the labrador retriever used in the example above… No two labradors look exactly alike. When you say a dog looks like a labrador, you are evaluating the dog in front of you and assessing whether it has enough of the traits you associate with labradors to call it one. We do this instantly.

An analytical AI system uses the patterns it derived from the dataset it was trained on, and statistical weighting, to determine which pattern, if any, the thing it’s evaluating comes closest to. In an AI system, a labrador is a pattern of traits that add up to a word the pattern is paired with, in this case “labrador.”
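Here’s a toy illustration of that idea: a pattern of traits, plus statistical weighting. The traits, weights, and scores below are invented for the example; real systems learn millions of such weights from their training data:

```python
# A toy version of "a pattern of traits plus statistical weighting."
# All numbers are invented for illustration.
candidate = {"floppy_ears": 1.0, "otter_tail": 0.9,
             "short_coat": 0.8, "webbed_feet": 0.7}

breed_patterns = {
    "labrador": {"floppy_ears": 0.9, "otter_tail": 1.0,
                 "short_coat": 0.9, "webbed_feet": 0.8},
    "poodle":   {"floppy_ears": 0.8, "otter_tail": 0.1,
                 "short_coat": 0.0, "webbed_feet": 0.2},
}

def match_score(candidate: dict, pattern: dict) -> float:
    # Weighted agreement between observed traits and a stored pattern.
    return sum(candidate[t] * pattern[t] for t in candidate) / len(candidate)

best = max(breed_patterns, key=lambda b: match_score(candidate, breed_patterns[b]))
print(best)  # "labrador": the pattern the candidate comes closest to
```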

Generative AI systems take pattern recognition a step further. They use the patterns in their training data and generative content creation techniques such as transformers, GANs, and diffusion to create the next iteration in the pattern: the next variation of a labrador in the series of labradors in its dataset. It’s important to remember the patterns are in the training data, not in the content creation model. This is why the make-up and quality of the content in the training dataset and its size matter so much. Garbage in, garbage out.
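To give you a feel for one of those techniques, here’s a heavily simplified sketch of the idea behind diffusion: start from pure random noise (seeded, as discussed above) and repeatedly nudge it toward something that matches the patterns in the training data. The denoising step here is a hypothetical stand-in for the learned neural network a real model uses:

```python
# A heavily simplified sketch of diffusion-style generation.
import numpy as np

rng = np.random.default_rng(seed=42)   # the "seed" from earlier
image = rng.normal(size=(64, 64, 3))   # begin with pure random noise

def denoise_step(x: np.ndarray, t: int) -> np.ndarray:
    # Hypothetical stand-in: a real model predicts and removes a little
    # noise at each timestep t, conditioned on the text prompt.
    return x * 0.95

for t in reversed(range(50)):          # iterative refinement
    image = denoise_step(image, t)
```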

Each generative AI system has its own sensibility—the combination of the approach it uses to determine the intent of your prompt, the dataset that informs its worldview (the information and biases it inherits from its training data),**** and the techniques it uses to create the image it returns. “Sensibility” and “worldview” imply sentience: To be clear, generative AI systems are not sentient. (Some would argue that the output from generative AI systems isn’t even “creative.”) But each system does have its own way of interpreting the text you use to describe the image you have in mind and of reflecting back the image it creates from its interpretation of your prompt, the content in its training data, and the biases inherent in the models and code it uses to create the image. Like the optics used by the Old Masters, the systems are a kind of lens and mirror—one that reflects a highly synthesized, selective version of our collective verbal and visual cultural heritage.

Prompt Crafting

Coaxing text-to-image systems like DALL·E to generate the image you have in your mind’s eye is both a skill and an art—so much so that new terms have popped up to describe the process, including “prompt engineering,” “prompt crafting,” and “spell casting.” The skill is learning the syntax and the defined attributes the designers of the AI system you’re using have implemented. The art is developing a feel for how the system responds to the ways you express your creative ideas—how it responds to the words and syntax you use to describe the images you want it to create. Just like the physical tools you work with, it takes time to develop a feel for how a given generative AI tool responds. You need to approach each tool with an experimenter's mindset.

Every picture ever made has rules, even the ones made by a surveillance camera in a car park. There's a limit to what it can see. Someone has put it there, arranged that it would cover a certain area. There is nothing automatic about it: someone had to choose its point of view.

—David Hockney and Martin Gayford, A History of Pictures

The most successful text-to-image prompts start with these four elements:

  • A concise subject statement: Use at least three words to describe your subject, but not too many: some systems, such as DALL·E, limit the length of your prompt.**

  • A few desired attributes: Adjectives like “beautiful,” “realistic,” “energetic,” and “dark” add emotional depth to your image.

  • The medium you’d like the system to imitate when it renders your image: An oil painting, watercolor, photograph, pen-and-ink drawing, animation, or another artistic medium.

  • Your desired style: If you want your image rendered in a specific style or genre, you can include it in the prompt. The style can be an artist’s name, a movement (such as Impressionism), or a type of image (such as a portrait).

While it’s usually a good idea to start with your subject, don’t be afraid to play with the order of these elements. Sometimes a simple change in the order can produce significantly different results, as can repeating an element, for example using the word “energetic” twice in your list of desired attributes.
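If it helps to see the four elements as a structure, here’s a toy template that assembles them into a single prompt. It’s purely illustrative; reorder or repeat the elements and compare what comes back:

```python
# A toy template assembling the four prompt elements. Purely illustrative.
subject = "a blonde labrador retriever with a green ball in its mouth"
attributes = ["energetic", "energetic", "warm"]  # repetition can add weight
medium = "dry brush watercolor"
style = "Impressionist"

prompt = f"{subject}, {', '.join(attributes)}, {medium}, {style} style"
print(prompt)
```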

This brief introduction to prompt crafting just barely scratches the surface of what you can do with generative AI text-to-image systems. The Recommended Resources section below contains links to several general guides on prompt crafting, as well as specific prompt crafting guides for DALL·E, Stable Diffusion, and Midjourney.

It’s Play Time!

Generative AI is a quickly evolving arena filled with new, often competing approaches and technologies. It takes time to develop an understanding of how the systems work. DALL·E, Stability AI’s DreamStudio***, and Midjourney all give you free credits that enable you to get started with their systems. Unfortunately, it’s very easy to burn through your allotment of free credits! If you do, you can purchase additional credits or create an account on one of the other systems and give it a try. This will help you decide which system you like best.

Before I send you off to play, I want to share a project I created for myself when I first started working with Stable Diffusion. I decided to use the system to create illustrations for a poem and chose Wallace Stevens’s “Thirteen Ways of Looking at a Blackbird.” I also decided that I’d try to create all of the illustrations in the style of artists associated with the New York School of the 1950s and 60s. Having a project in mind, and lots of iteration, helped me learn more about prompt crafting. Here’s the result: 13 Ways of Looking at a Blackbird.

Now it’s your turn. Have fun!

Footnotes

* The name DALL·E is a combination of the names of the robot character WALL·E (from Pixar's animated film of the same name) and the Spanish surrealist artist Salvador Dalí.

** The current version of DALL·E only recognizes the first 400 characters of a prompt.

*** DreamStudio is Stability AI’s public test site for Stable Diffusion. Stability AI has published Stable Diffusion as an open source project, so you can easily find other sites that have implemented Stable Diffusion.

**** I’ll be addressing the biases in training data and other concerns associated with generative AI later in this series.

Recommended Resources

Below you'll find links to resources on text-to-image generative AI and prompt crafting. You'll also find links to DALL·E, Stable Diffusion, and Midjourney, and to resources related to each platform.

On Text-to-Image Generative AI

On Prompt Crafting

On DALL·E 2 by OpenAI

DALL·E 2 prompt guide

On Stable Diffusion by Stability AI

Stable Diffusion prompt guide

On Midjourney

Midjourney prompt guide

An Advanced Guide to Writing Prompts for Midjourney (text-to-image), Lars Nielsen, Medium
