Okay, here it comes…the big moment! Everything I’d crammed in my brain from the past several weeks was about to have its spotlight moment. My partner requested that I create a logo for his photography business. I mean, sure, I could sketch something with crayons or snag a snazzy premade logo from Canva, but why do that when Midjourney and Stable Diffusion could do it for me? It’s just a logo – how tricky can it be?
I had a specific picture of his that I wanted to turn into a simple line drawing, but I was also open to other suggestions.
I tried Midjourney first. I uploaded an image and used the /describe command to generate descriptions of my image in “Midjourney syntax”, then used those descriptions to generate some images.
Images Generated from Midjourney’s /describe command
***Side note: I find it interesting that nothing about the original image indicates “hip hop”, but Midjourney classifies it as “hip hop aesthetic” or “hip hop flair”.
The results weren’t terrible but not really what I wanted. The next step was to use the /blend command to merge my original image with one of the generated images to get it back closer to the original. After several iterations, I finally started altering the prompt to introduce the concept of line drawing illustrations, but Midjourney’s output wasn’t quite what I wanted.
Blending a generated image with the original image and then introducing line drawing and silhouette in the prompt
I jumped over to Leonardo.ai, which is built on Stable Diffusion. I uploaded my original image and picked a community-trained model close to the style I wanted for my logo. This time, I tried image-to-image generation, specifying line drawing illustrations in the prompt. I went through several iterations here too, similar to Midjourney. The results were the same – still not quite what I wanted.
Images generated by Leonardo.ai
Why was this so hard? Was it because I had a very specific image I wanted to generate? Or maybe I wasn’t structuring my prompts correctly. How do these image-generation tools actually work anyway?
A High-Level Overview of Image Generation Models
In essence, an image generator is a type of artificial intelligence that creates images from a text prompt or an image input. These models are trained on large datasets of image-text pairs, which lets them learn the relationships between text descriptions and visual concepts.
Although AI art dates back to the 1960s, recent advancements in machine learning have sparked a surge of image generation platforms, notably in the past 2 years.
Stable Diffusion was released to the public in August 2022. It was trained on subsets of LAION-5B, a dataset of roughly 5.85 billion image-text pairs scraped from the web.
Midjourney was founded in August 2021 by David Holz. Holz started working on Midjourney as a self-funded research project, first testing the raw image-generation technology in September 2021. The Midjourney image-generating software went into limited beta on March 21, 2022, and open beta on July 12, 2022.
Several other text-to-image models appeared around this time too. OpenAI announced DALL-E in January 2021 and its successor, DALL-E 2, in April 2022.
Two kinds of image generation models come up most often: Generative Adversarial Networks and diffusion models.
Generative Adversarial Networks, or GANs, are made up of two neural networks, a generator and a discriminator. I like to imagine this as two bots playing a game. The generator’s job is to create images. At first, it’s really bad at it, but it wants to get better. The discriminator is trained to tell real images from fake ones. When the generator presents its work, the discriminator judges it, and if it says the image looks fake, it penalizes the generator or takes away a point. This feedback loop continues until the generator gets better at creating images that look real.
One significant downside of GANs is that training them is unstable. They are prone to problems such as mode collapse, where the generator fails to capture the full range of patterns in the training data, resulting in limited diversity in the generated images. Hence most image-generation platforms use diffusion.
Diffusion models are what power popular image generators like Stable Diffusion and Midjourney. In this post, we are going to focus mostly on diffusion models and how they generate images.
Diffusion models are trained by adding “noise” to an image and then learning to remove that noise to recreate the original image. Visual noise is a pattern of random dots, similar to television static or a very grainy picture. Let’s imagine a television screen that isn’t getting a clear picture. It has a lot of grain or static. We need to adjust the antenna to get a better signal, which will then produce a clearer picture. The process of adjusting the antenna to get a clear picture is similar to how a diffusion model removes noise. During training, we start with a clear picture and then obscure it heavily with noise. The model removes the noise in steps, recreating the original picture.
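The noising half of that process is simple enough to sketch in plain Python. This is a rough illustration under assumptions, not how any particular platform does it: the “image” is a short list of pixel values, the linear schedule (1,000 steps, betas from 0.0001 to 0.02) is one commonly cited choice, and the function names are hypothetical. The sketch also cheats on the hard part. A real diffusion model has to predict the noise with a neural network, while here we hand the exact noise back to show that a good prediction is enough to recover the picture.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """How much of the original picture survives at each step (shrinks toward 0)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Mix the picture with Gaussian static; return the noisy picture and the static."""
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(alpha_bar) * p + math.sqrt(1.0 - alpha_bar) * n
        for p, n in zip(pixels, noise)
    ]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Undo add_noise, given a prediction of the static that was mixed in."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
        for x, n in zip(noisy, predicted_noise)
    ]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
picture = [0.0, 0.25, 0.5, 1.0]  # four grayscale "pixels"

# Later steps bury the picture under more and more static.
for step in (0, 250, 999):
    noisy, _ = add_noise(picture, alpha_bars[step], rng)
    print(step, [round(v, 2) for v in noisy])

# With a perfect noise prediction, the original picture comes back.
noisy, noise = add_noise(picture, alpha_bars[500], rng)
print([round(v, 2) for v in remove_noise(noisy, noise, alpha_bars[500])])
```

Training amounts to showing the model many noisy pictures at many different steps and scoring how close its predicted noise is to the noise that was actually added.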
Text to Image
Before we can generate images, our prompt (“a beautiful beach sunset”) goes through a text encoder, where it is broken down and transformed into a numeric representation. In Stable Diffusion, the text encoder comes from CLIP (Contrastive Language-Image Pre-training), a model trained on images and their text descriptions. The prompt is split into tokens (a word may become one token or several), and each token is given a numeric ID so the model can work out the relationships between tokens. I like to think of this as similar to how humans refer to a dictionary to learn the meaning of a word, while also needing to learn what the word means in the context of the sentence it was used in.
How GPT-3 breaks down words into tokens and organizes them in an array or Tensor.
Image from OpenAI official documentation https://platform.openai.com/tokenizer
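Here is a toy version of that first step, turning a prompt into a list of IDs. It is only a sketch under simplifying assumptions: it splits on spaces and builds its vocabulary from two made-up captions, whereas real tokenizers such as CLIP’s use a fixed subword vocabulary, and the encoder then turns those IDs into vectors. The function names and the -1 “unknown” ID are hypothetical.

```python
def build_vocab(captions):
    """Assign each new word the next free ID, in order of first appearance."""
    vocab = {}
    for caption in captions:
        for word in caption.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(prompt, vocab, unknown_id=-1):
    """Turn a prompt into a list of token IDs; unseen words get unknown_id."""
    return [vocab.get(word, unknown_id) for word in prompt.lower().split()]


vocab = build_vocab(["a beautiful beach sunset", "a sunset over a city"])

print(encode("a beautiful beach sunset", vocab))  # [0, 1, 2, 3]
print(encode("a city sunset", vocab))             # [0, 5, 3]
print(encode("a purple sunset", vocab))           # [0, -1, 3]
```

The last line hints at why real tokenizers work on pieces of words: a word the vocabulary has never seen can still be assembled from smaller known pieces instead of collapsing into “unknown”.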
Next, we feed our tokenized prompt to the Image Generator. This is where the “diffusion” happens.
Starting from random noise and using our tokenized prompt as a guide, the model removes noise step by step, carving out an image until it is clear.
Image to Image
When we have an image as part of our prompt, the image first passes through a feature extractor. The original image is then infused with noise, and the diffusion model de-noises it, guided by the extracted features and the prompt. The output is a variation of the original image.
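How far the result drifts from the original depends on how much noise gets added. Many Stable Diffusion front ends expose a setting for this, often labeled something like denoising strength. The sketch below is an assumption about how such a setting can map onto denoising steps, not the code of any specific tool, and the function and parameter names are hypothetical.

```python
def img2img_steps(num_steps, strength):
    """Pick which denoising steps to run for a strength between 0 and 1.

    strength 0.0: add no noise and run no steps (the image comes back unchanged).
    strength 1.0: bury the image completely and run every step (the original
    barely influences the result).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_steps * strength), num_steps)
    first_step = num_steps - steps_to_run  # skip the earliest, noisiest steps
    return list(range(first_step, num_steps))


print(img2img_steps(8, 0.25))  # [6, 7]
print(img2img_steps(8, 1.0))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(img2img_steps(8, 0.0))   # []
```

A low strength keeps most of the original and only touches up details, which is roughly the regime I wanted when trying to turn a specific photo into a line drawing.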
I’m A Front End Dev! What Am I Doing Playing with Machine Learning and Image Generation?
Let’s rewind to a few weeks ago when a group of us here at WillowTree were given the task of exploring Image Generation tools. How might designers and illustrators use these tools to quickly generate illustration libraries for clients? We had two weeks to run our experiments. Our team consisted of two designers, Karolina Whitmore and Jenny Xie, two engineers, Tyler Baumler and myself, one data scientist, Cristiane de Paula Oliveira, and Jenn Upton, our project manager.
We knew that we needed to start with Stable Diffusion for several reasons. First, you can install it locally and keep your images on your machine. This keeps them off public channels, where they could be used by anyone who stumbles upon them. Second, you can train your own model with your own images and illustrations, thereby minimizing the ethical issues that come with infringing on copyrighted work. Third, it’s open source and doesn’t require a membership or subscription.
We also wanted to test out other platforms, like Midjourney, Leonardo.ai, Runway, and Adobe Firefly. We were looking for ease of use and how quickly we could get the image generator to consistently generate images aligned with a specific illustration style.
The first hurdle was installing and running Stable Diffusion locally. This required some knowledge of the command line and git. The installation process wasn’t too bad. For me, it only required updating Python, cloning the repo, and typing a magical incantation in the terminal. What greeted me in http://127.0.0.1:7860/ was a somewhat intimidating interface.
Stable Diffusion Web UI, Automatic1111
There were so many settings to toggle and terminology that was new and confusing. Nevertheless, we started tinkering and experimenting with generating anything. I’m not sure how everyone else did, but I immediately realized I needed to learn how to write effective prompts for Stable Diffusion.
Our initial run at training a model also yielded terrible results, most likely because we only had four images in our dataset. It wasn’t until after reading some forums and watching some videos that we decided to use an illustration pack that Jenn had obtained for us. This illustration pack consisted of 50 images with a consistent style and color palette.
One of our first experiments with model training
Images from our Blue Ink illustration pack
Before I continue, I have to confess my naivety about hardware requirements. I have an M1 Mac and I put too much trust in it. Training a model or an embedding through Automatic1111 was a very heavy lift for my machine. It took a whole day and a half for it to train, and even then, I’m not even sure if I trained it correctly. Thankfully, we can train our models through Google Colab, a hosted Jupyter Notebook service where you can write and execute Python through the browser. Training a model there took about an hour at most.
Now the fun part! Generating images using our model! Our results were… interesting.
The results seemed better but still showed some distortions. It was clear that whatever we ended up generating through Stable Diffusion would still require manual editing before it was usable for any client project.
Karolina was more familiar with Midjourney and ran similar experiments. Her results were better.
Karolina’s Midjourney Couch vs. My Stable Diffusion Couch
Those of us who jumped over to Midjourney to give it a try appreciated its easier setup. Although you can’t train the model, you can upload a sample image and iterate over it countless times until you get your desired output. In most cases, oddities and distortions were still present in the generated images, which led us to conclude that we would still need to do final edits on another platform.
Converting our images to vector images
Karolina took our images a step further and converted them into vectors using vectorizer.ai. As an engineer, I didn’t know much about editing vectors aside from tweaking a few parameters in an SVG file. I have very little experience using Adobe Illustrator, but I do know it can be a pain having to edit several anchor points. Vectorizer.ai produced SVGs with minimal rework.
Midjourney couch (right) and Stable Diffusion couch (left) in the style of Blue Ink in Adobe Illustrator
At the end of our two weeks, it seemed that we had barely scratched the surface of AI image generators. This is an emerging field and it is changing rapidly. There are a multitude of platforms out there to try, some requiring a subscription and some free. Here are the things our team collectively agreed on:
- Local installation of Stable Diffusion has the most flexibility when it comes to model training and generating images. However, it has a hardware requirement and a very steep learning curve for using its interface.
- There is a substantial time investment required to learn how to train and use models, checkpoints, LoRAs, embeddings, and ControlNets.
- Midjourney is a bit easier. It does require a subscription and a specific prompt structure which you’ll get the hang of as you continue to use it. The images generated can be reproduced, displayed, and modified by Midjourney itself to improve its service.
- Copyright laws and ownership of generated images are currently a little hazy. There are no laws protecting your images from being used by anyone.
- Most images will need further adjustments. Vectorizer.ai worked well!
There’s still so much to learn!
Our two-week spike flew by. Many of us who have been dabbling in AI image generation have learned little things here and there, and developed techniques and processes. We all carry a small piece of the puzzle.
So, was I successful in generating a logo using AI? Well, it depends on how you define success. In the end, I hand drew the logo on an iPad using Procreate, a digital art program, drawing inspiration from images generated by Midjourney and Leonardo.ai. So I suppose yes, it was a success! I used AI for ideation, and perhaps that’s the most common use case for artists, illustrators, and designers for now. As more people use these platforms and the models continue to learn from their users, we’ll likely see fluctuations in their effectiveness and results.
The finished product was inspired by Midjourney and Leonardo.ai’s generated images