Multi-Class Image Generation with AI

A fundamental aspect of human creativity involves mixing symbols and meanings in new ways, to generate fantasies, metaphors and imagined realities. Many existing artificial intelligence (AI) programs facilitate the translation of a word to an image in a 1:1 relationship. However, I was interested in constructing an AI that is able to draw sporadically, changing its mind partway. The machine trained for this project is able to draw more naturalistically by having the ability to focus on different subjects while still producing a single cohesive image.

I feed the machine thousands of hand-drawn images. The project employs a customized type of neural network to learn to draw called a variational autoencoder (VAE). This particular neural net consists of a convolutional neural net (CNN) encoder and a long short term memory (LSTM) decoder. The dual nature of the CNN-LSTM network allows it to identify spatial features within the image (CNN encoder) and relate them to stroke patterns in the drawing sequences (LSTM decoder), which enables the machine to start drawing one object then switch to drawing a second object based on what it has drawn so far. It was able to produce mixed object drawings with cohesive internal structure in terms of complete (closed-form) shapes, symmetry and naturalistic transitions.


I have uploaded the code for the (1) data preprocessing (taking Google Quickdraw stroke-based data and converting it to bitmap npys) (2) implementing the CNN-based VAE. The network model code was written in collaboration with Julie Chang.

Github Repository:


Neural networks have been used before for stroke-based image generation although prior efforts largely focus on single-object generation (pictures of one item) whereas we explore neural networks here that best allow us to switch between objects. The pen-stroke-based data model used here was first proposed to model Kanji writing. Our model also employs a mixture density network to determine out- puts based on a probability distribution, a technique first proposed by Edward Bishop that was employed in a similar case to generate naturalistic English handwriting.

A similar existing model is the sketch-rnn neural net developed by Google Magenta, which is a VAE that employs as bi-directional recurrent neural network (RNN) as its encoder. However, experimentation with sketch_rnn showed the limitations of using RNN’s to generate mixed-object sketches because it learns in a highly sequential format when object switching may involve thinking in a non-linear fashion. We therefore turned to a CNN encoder here, which processes an image spatially and demonstrates the ability to learn localized structures within images. In order to feed data into the CNN encoder, the stroke-based format of the Quickdraw dataset is converted into a grayscale bitmap png.

The key difference between the two models is that the RNN network model is purely stroke-based whereas the CNN incorporates spatial understanding. A challenge with the RNN model is that it learns from a sequential list of vectors. This limits its ability to create the most realistic composite doodles because the start of one doodle does not necessarily smoothly integrate into the end of another. In other words, the CNN-based model learns in a way that is more flexible for multi-class drawings.

A comparison of earlier timesteps between the RNN-based VAE and the CNN-based VAE on a four-class dataset show that the CNN more quickly begins to learn to draw closed-form shapes and image symmetry:

More details of experimentation with network models can be found here.


I considered three main methods for generating mixed-object drawings with our trained AI: (1) latent vector switching (2) interpolation (3) unconditional sampling. To gauge the success of these various measures, I assess the generated images on subjective qualities of cohesiveness. Specifically, I look for elements of cohesive internal structure such as complete shapes (closed shapes), moderate complexity (not too many shapes as to be indistinguishable and not too few as to be featureless), and symmetry. I additionally look for a naturalistic transition between multiple classes in mixed doodles, in that I look for elements of one object to be incorporated in the drawing of the second object rather than two objects simply juxtaposed side by side.

(1) Latent vector switching (z-switching): Z-switching was by far the most successful at creating cohesive multi-object images. This essentially amounts to switching the model halfway through a generated drawing. The trick is to re-encode the partway generated drawing of Object 1 as a hidden state for the model of Object 2. Specifically, the trained AI associates a series of latent vectors (z vectors) with drawings. In latent vector switching, we generate an incomplete drawing for Object 1, re-encode this drawing into a latent vector using the Object 2 model’s encoder, then finish the drawing using the Object 2 model. The resulting images demonstrate a cohesiveness in that the AI generated the best version of Object 2 using the incomplete drawing of Object 1, showing naturalistic transitions and cohesiveness. Experiments showed z-switching is effective using both single-class and multi-class trained models. The booklet is generated using z-switching on multi-class models (which is much more time-efficient to generate).

(2) Interpolation: Interpolation is a technique that has been previously used to explore the latent space (the imagined possibilities) of an AI. The method involves finding distinct latent vectors for different classes in a multi-class trained model, and interpolating the numerical values in between these vectors, then decoding. The result is an image that lies in the space between one object and another. Though theoretically interesting to explore, interpolation proved less effective than z-switching at generating cohesive imagery. This is largely because many of the images midpoint through two latent vectors of different objects do not resemble either object, but instead a transition point between the two (example below). This may be effective for creative projects interested in more abstract imagery but did not suit our goal here of generating mixed object drawings that recognizably embodied both objects.

(3) Unconditional sampling: I perform unconditional sampling by setting randomly generating a vector z in the space the model has learned, then passing this into the decoder LSTM. This rarely produced satisfactory mixed object images, but does allow us to compare the effectiveness of various network models at learning to draw. Unconditionally sampled images from the CNN, more of which can be seen in, suggest the model was able to distinguish between the four classes. In a subjective assessment of 100 randomly generated images based on a 70K step trained model, 63 percent could be visually distinguished as being recognizable as an object in one of the four training classes. Unconditionally sampled images from the RNN did not seem to capture the internal structure of training images. These results were often disconnected lines that did not form complete shapes, much less recognizable images. In a subjective assessment of 100 randomly generated images, 82 percent were lines that did not resemble any of the four training classes. In addition to not producing recognizable images, the RNN also seemed prone to producing long, jagged sequences of lines and almost produced closed form shapes. We suspect this may be related to the sequential na- ture of the RNN model, and the possibility that the network may not remember the full image and therefore continue sequences of open lines rather than complete closed shapes.


Leave a Reply

Your email address will not be published. Required fields are marked *