Multi-Class Image Generation with AI

A fundamental aspect of human creativity is mixing symbols and meanings in new ways to generate fantasies, metaphors and imagined realities. Many existing artificial intelligence (AI) programs translate a word to an image in a 1:1 relationship. I was interested instead in constructing an AI that can draw spontaneously, changing its mind partway through. The machine trained for this project draws more naturalistically because it can shift focus between different subjects while still producing a single cohesive image.

I feed the machine thousands of hand-drawn images. The project employs a customized variational autoencoder (VAE), a type of neural network that learns to draw. This particular network consists of a convolutional neural network (CNN) encoder and a long short-term memory (LSTM) decoder. The dual nature of the CNN-LSTM network allows it to identify spatial features within the image (CNN encoder) and relate them to stroke patterns in the drawing sequences (LSTM decoder), which enables the machine to start drawing one object and then switch to drawing a second based on what it has drawn so far. The model produced mixed-object drawings with cohesive internal structure: complete (closed-form) shapes, symmetry and naturalistic transitions.


I have uploaded the code for (1) the data preprocessing (converting Google QuickDraw stroke-based data to bitmap .npy files) and (2) the CNN-based VAE implementation. The network model code was written in collaboration with Julie Chang.
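As a sketch of the preprocessing step, stroke data can be rasterized to a bitmap with PIL. This assumes QuickDraw's simplified format, where each drawing is a list of strokes and each stroke is a pair of [x-coordinates, y-coordinates]; the 48x48 output size and padding are illustrative, not the project's actual settings:

```python
import numpy as np
from PIL import Image, ImageDraw

def strokes_to_bitmap(strokes, size=48, pad=4):
    """Rasterize QuickDraw-style strokes (list of [xs, ys] pairs)
    into a grayscale bitmap normalized to [0, 1]."""
    # Collect all points to find the drawing's bounding box.
    xs = [x for stroke in strokes for x in stroke[0]]
    ys = [y for stroke in strokes for y in stroke[1]]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    scale = (size - 2 * pad) / max(max_x - min_x, max_y - min_y, 1)

    img = Image.new("L", (size, size), color=0)
    draw = ImageDraw.Draw(img)
    for stroke in strokes:
        points = [
            (pad + (x - min_x) * scale, pad + (y - min_y) * scale)
            for x, y in zip(stroke[0], stroke[1])
        ]
        draw.line(points, fill=255, width=1)
    return np.asarray(img, dtype=np.float32) / 255.0

# Example: a single square stroke, saved as .npy for the CNN encoder.
bitmap = strokes_to_bitmap([[[0, 100, 100, 0, 0], [0, 0, 100, 100, 0]]])
# np.save("square.npy", bitmap)
```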

Github Repository:


Neural networks have been used before for stroke-based image generation, although prior efforts largely focus on single-object generation (pictures of one item), whereas we explore networks that best allow us to switch between objects. The pen-stroke-based data model used here was first proposed to model Kanji writing. Our model also employs a mixture density network to determine outputs based on a probability distribution, a technique first proposed by Christopher Bishop that was later employed to generate naturalistic English handwriting.

A similar existing model is the sketch-rnn neural net developed by Google Magenta, a VAE that employs a bi-directional recurrent neural network (RNN) as its encoder. However, experimentation with sketch-rnn showed the limitations of using RNNs to generate mixed-object sketches: the RNN learns in a highly sequential format, while object switching may involve thinking in a non-linear fashion. We therefore turned to a CNN encoder here, which processes an image spatially and demonstrates the ability to learn localized structures within images. To feed data into the CNN encoder, the stroke-based format of the QuickDraw dataset is converted into a grayscale bitmap PNG.

The key difference between the two models is that the RNN network model is purely stroke-based whereas the CNN incorporates spatial understanding. A challenge with the RNN model is that it learns from a sequential list of vectors. This limits its ability to create the most realistic composite doodles because the start of one doodle does not necessarily smoothly integrate into the end of another. In other words, the CNN-based model learns in a way that is more flexible for multi-class drawings.

A comparison of earlier timesteps between the RNN-based VAE and the CNN-based VAE on a four-class dataset shows that the CNN more quickly begins to learn to draw closed-form shapes and image symmetry:

More details of experimentation with network models can be found here.


I considered three main methods for generating mixed-object drawings with our trained AI: (1) latent vector switching, (2) interpolation, and (3) unconditional sampling. To gauge the success of these methods, I assess the generated images on subjective qualities of cohesiveness. Specifically, I look for elements of cohesive internal structure such as complete (closed) shapes, moderate complexity (not so many shapes as to be indistinguishable, nor so few as to be featureless), and symmetry. I additionally look for a naturalistic transition between classes in mixed doodles: elements of one object should be incorporated into the drawing of the second, rather than the two objects simply being juxtaposed side by side.

(1) Latent vector switching (z-switching): Z-switching was by far the most successful method for creating cohesive multi-object images. It essentially amounts to switching models halfway through a generated drawing. The trick is to re-encode the partly generated drawing of Object 1 as a hidden state for the model of Object 2. Specifically, the trained AI associates latent vectors (z vectors) with drawings. In latent vector switching, we generate an incomplete drawing for Object 1, re-encode this drawing into a latent vector using the Object 2 model’s encoder, then finish the drawing using the Object 2 model. The resulting images are cohesive in that the AI generates the best version of Object 2 it can from the incomplete drawing of Object 1, producing naturalistic transitions. Experiments showed z-switching is effective using both single-class and multi-class trained models. The booklet was generated using z-switching on multi-class models, which is much more time-efficient.
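The z-switching procedure can be sketched as the following control flow. The encoder, decoder and rasterizer here are hypothetical stand-ins for the trained CNN encoder and LSTM decoder of each model; only the switching logic is the point:

```python
import numpy as np

Z_DIM = 128  # latent size is illustrative, not the project's setting

def cnn_encode(bitmap, rng):
    """Hypothetical stand-in for a trained CNN encoder: bitmap -> z."""
    return rng.standard_normal(Z_DIM)

def lstm_decode(z, n_steps, rng):
    """Hypothetical stand-in for a trained LSTM decoder:
    z -> sequence of (dx, dy) pen deltas."""
    return rng.standard_normal((n_steps, 2))

def render(strokes):
    """Hypothetical rasterizer: strokes -> bitmap for re-encoding."""
    return np.zeros((48, 48))

def z_switch(rng, half=30):
    # 1. Draw the first half of Object 1 from its latent space.
    z1 = rng.standard_normal(Z_DIM)
    partial = lstm_decode(z1, half, rng)
    # 2. Re-encode the unfinished drawing with Object 2's encoder.
    z2 = cnn_encode(render(partial), rng)
    # 3. Let Object 2's decoder finish the drawing from that latent.
    rest = lstm_decode(z2, half, rng)
    return np.vstack([partial, rest])

drawing = z_switch(np.random.default_rng(0))
```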

(2) Interpolation: Interpolation is a technique that has been previously used to explore the latent space (the imagined possibilities) of an AI. The method involves finding distinct latent vectors for different classes in a multi-class trained model, interpolating the numerical values in between these vectors, then decoding. The result is an image that lies in the space between one object and another. Though theoretically interesting to explore, interpolation proved less effective than z-switching at generating cohesive imagery. This is largely because many of the images midway between two latent vectors of different objects resemble neither object, but rather a transition point between the two (example below). This may be effective for creative projects interested in more abstract imagery but did not suit our goal here of generating mixed-object drawings that recognizably embodied both objects.

(3) Unconditional sampling: I perform unconditional sampling by randomly generating a vector z in the space the model has learned, then passing it into the decoder LSTM. This rarely produced satisfactory mixed-object images, but does allow us to compare how effectively various network models learn to draw. Unconditionally sampled images from the CNN suggest the model was able to distinguish between the four classes. In a subjective assessment of 100 randomly generated images based on a model trained for 70K steps, 63 percent could be visually recognized as an object in one of the four training classes. Unconditionally sampled images from the RNN did not seem to capture the internal structure of training images. These results were often disconnected lines that did not form complete shapes, much less recognizable images. In a subjective assessment of 100 randomly generated images, 82 percent were lines that did not resemble any of the four training classes. In addition to not producing recognizable images, the RNN also seemed prone to producing long, jagged sequences of lines and almost never produced closed-form shapes. We suspect this may be related to the sequential nature of the RNN model, and the possibility that the network does not remember the full image and therefore continues sequences of open lines rather than completing closed shapes.
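Since the VAE's prior over z is a standard normal, unconditional sampling reduces to drawing a random vector and decoding it. A minimal sketch (the decoder call is a hypothetical placeholder for the trained LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw z ~ N(0, I); the 512-element size matches the latent
# dimension mentioned later in this writeup.
z = rng.standard_normal(512)

# strokes = lstm_decode(z)  # hypothetical call to the trained decoder
```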


Animation with Neural Networks


The latent space of a neural network is its “learned” space. In my prior experiments, I have achieved some success in unconditional generation via the latent space — for example, generating random images by defining random vectors in the latent space. This suggests that it may be possible to generate frame-by-frame animations via a series of generations in the latent space. Specifically, we can move from an initial frame to a final frame by interpolating between two images in the latent space. The neural network then fills in the transition states of the animation.

For this experiment, I worked with animation generation on facial features — the mouth, nose, eyes and ears — continuing my prior work with feature identification [link]. I wanted animation generation with a multi-class model so I wouldn’t have to train multiple networks when working with four facial features. Previously, I had found — and there is research to support — that a purely stroke-based LSTM model works poorly at separating multiple classes in the latent space. Convolutional neural networks, however, are better at separating classes. Since this process depends on latent space organization (we don’t want an eye turning into an ear partway), I worked with a CNN-LSTM model.


To perform interpolation on our trained VAE, we begin with an identified feature image and pass it to the encoder to generate z0. We randomly choose another feature image of the same class from the QuickDraw dataset and encode it into z1. Then, we create a series of intermediary latent vectors by performing spherical interpolation:

z_t = (sin((1 − t)α) / sin(α)) · z0 + (sin(tα) / sin(α)) · z1

where t is a fraction that expresses how far in between we interpolate, and α = cos⁻¹(z0 · z1). Typically, ten interpolation steps were enough to generate a visually smooth sequence based on qualitative evaluation.
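The spherical interpolation step can be implemented in a few lines of NumPy. This sketch assumes z0 and z1 are normalized, so that α = arccos(z0 · z1); the toy 2-element vectors stand in for real encodings:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between latent vectors z0 and z1,
    with alpha = arccos(z0 . z1 / (|z0||z1|))."""
    alpha = np.arccos(np.dot(z0, z1) /
                      (np.linalg.norm(z0) * np.linalg.norm(z1)))
    if np.isclose(alpha, 0.0):
        return (1 - t) * z0 + t * z1  # vectors nearly parallel
    return (np.sin((1 - t) * alpha) * z0 +
            np.sin(t * alpha) * z1) / np.sin(alpha)

# Ten intermediate frames between two encoded feature images.
z0, z1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
frames = [slerp(z0, z1, t) for t in np.linspace(0, 1, 10)]
```

Each frame's latent vector would then be decoded into an image, giving the transition states of the animation.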


The results here are generated using QuickDraw stroke-based input data. We evaluated the results of this interpolation qualitatively. In 25 tests, all variations on the two input images produced interpolated images that were fully formed and immediately recognizable as being of the same class. Several examples of this technique can be seen in Figure 5, such as one where the mouth moves from open to closed. These results demonstrate that our VAE successfully learned to differentiate between multiple classes in its latent space, since no interpolations produced images from other classes. Moreover, it was able to generate images comparable to those in the dataset, with symmetry, closed-form strokes and cohesive stroke styles throughout.


To explore the VAE’s learned latent space, we can use t-distributed stochastic neighbor embedding (TSNE), a technique for reducing high-dimensional vectors down to a lower-dimensional space [12]. Specifically, our 512-element latent vector z is reduced down to 2 dimensions for x-y chart plotting (Figure 5). TSNE graphs using encoded latent vectors from 200 images show that the VAE successfully created separate clusters for the four classes in its latent space. The TSNE allows us to make some observations on the network’s learning. It suggests latent vectors for sharp noses were clustered separately from latent vectors for curved noses (observe the rightmost cluster of sharp noses and the cluster of curved noses further left). Additionally, it shows that the mouths and eyes are clustered relatively close together, which aligns with their lower accuracy rates in the other classification methods we try in Experiments.
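One way to produce such a plot is scikit-learn's TSNE. In this sketch, random vectors stand in for the 200 encoded latents (in practice they come from passing dataset images through the trained encoder), so only the library call and the shapes reflect the real pipeline:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for 200 encoded 512-element latent vectors.
latents = rng.standard_normal((200, 512))

# Reduce 512 dimensions to 2 for x-y plotting; perplexity is a
# tunable hyperparameter (30 is scikit-learn's default).
embedding = TSNE(n_components=2, perplexity=30, init="random",
                 random_state=0).fit_transform(latents)
```

Each row of `embedding` is one image's 2-D position; coloring the points by class label reveals the clusters.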


The interpolations were compiled into video at 5 frames per second, and the demo cycles through interpolations between 7 images per feature. Vectorization was not included in this pipeline due to its low performance.


Image Vectorization

Image vectorization refers to various processes for converting pixel-based image formats to line-based image formats — for example, bitmaps to SVGs. Vectorization is useful because digital cameras record images in terms of pixels, but vector-based formats are much easier to manipulate, allowing you to warp or erase individual lines, seamlessly scale the image, or draw the image in sequence. This is an active field of research in computer vision, and I found most existing programs to be limited in their functionality. Here, I recap some of my explorations of custom processes for doing so (with thanks to Jeff Hara @ Stanford).


Most existing vectorization programs I found were applications with no command-line component, or only a limited one, such as Super Vectorizer and Inkscape. Existing command-line vectorizers such as Potrace work by tracing around the outline of colored pixels (tracing the outside contour of line strokes) rather than extracting the strokes themselves:

original PNG

contour tracing with potrace

A different type of vectorization, centerline tracing, involves recognizing lines within a pixel-based image.

desired centerline tracing

I was specifically interested in centerline tracing because I wanted to convert PNGs to SVGs, and from there to stroke-based images. Theoretically, this would allow any PNG image database to be converted into a stroke-based database that could be plugged into a stroke-based neural network model. The ability to construct an SVG dataset from a PNG one would have the potential for smoother and more flexible — less machinic — image generation.

Hough transforms have also been used for line extraction, but work better for clean, predefined shapes than for arbitrary ones. Instead, we employ centerline tracing, which has previously been used to capture handwriting and to process contour maps.


To perform centerline tracing, we first apply Guo-Hall thinning, an efficient thinning algorithm that reduces the image to a one-pixel-wide skeleton. This is akin to finding the centerline of a shape. We can then extract all the points of the skeleton.

These points are unordered, but there is an optimal stroke that passes through the points in a sensible order. This stroke may be vertical, horizontal, or any arbitrary shape, and we use a brute-force solution to find the optimal path, which minimizes the Euclidean distance between each consecutive pair of points. There may also be multiple disconnected strokes, so we use Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to cluster the points of a stroke together, and find the optimal paths per cluster.
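The clustering and ordering steps can be sketched as follows, assuming the skeleton points have already been extracted from the thinned image. A greedy nearest-neighbour ordering stands in here for the exhaustive minimum-distance search described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def order_points(points):
    """Greedily order a cluster's skeleton points so consecutive
    points are near each other (a cheap stand-in for the exhaustive
    minimum-distance search)."""
    points = list(map(tuple, points))
    path = [points.pop(0)]
    while points:
        last = np.array(path[-1])
        nearest = min(range(len(points)),
                      key=lambda i: np.linalg.norm(np.array(points[i]) - last))
        path.append(points.pop(nearest))
    return path

def extract_strokes(skeleton_points, eps=3.0):
    """Cluster skeleton points into disconnected strokes with DBSCAN,
    then order the points within each cluster."""
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(skeleton_points)
    return [order_points(skeleton_points[labels == k])
            for k in sorted(set(labels))]

# Two far-apart horizontal segments become two separate strokes.
points = np.array([[0, 0], [1, 0], [2, 0], [100, 0], [101, 0], [102, 0]])
strokes = extract_strokes(points)
```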

Finally, we take the optimal paths and apply the Ramer-Douglas-Peucker algorithm with an epsilon of 2, which simplifies lines by representing them with fewer points. We can convert the simplified lines to strokes simply by calculating the change in x and y between consecutive points. Now, we have the desired pen-stroke format.
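A self-contained sketch of the simplification and stroke-conversion steps; this is a generic Ramer-Douglas-Peucker implementation, not the project's exact code:

```python
import numpy as np

def rdp(points, epsilon=2.0):
    """Ramer-Douglas-Peucker line simplification."""
    points = np.asarray(points, dtype=float)
    start, end = points[0], points[-1]
    line = end - start
    norm = np.linalg.norm(line)
    # Perpendicular distance of every point to the start-end chord.
    if norm == 0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        dists = np.abs(line[0] * (points[:, 1] - start[1]) -
                       line[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # Keep the farthest point and recurse on both halves.
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

def to_pen_strokes(points):
    """Convert an ordered point list to (dx, dy) pen deltas."""
    return np.diff(np.asarray(points, dtype=float), axis=0)

# The near-collinear points collapse; the spike at (3, 5) survives.
simplified = rdp([[0, 0], [1, 0.1], [2, 0], [3, 5], [4, 0]], epsilon=2.0)
deltas = to_pen_strokes(simplified)
```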


To quantitatively evaluate the success of this centerline tracing, I run the results through a classifier. I take a base dataset of SVG images (Dataset A), convert them to PNG, then back to SVG through centerline tracing (Dataset B). I train the classifier on the base dataset, then test it with the traced dataset.
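The evaluation protocol (train on Dataset A, test on the round-tripped Dataset B) can be sketched with scikit-learn. The data here is synthetic: random features with added noise stand in for the detail lost in vectorization, so only the train/test structure mirrors the experiment:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for the datasets: flattened bitmap features, 4 classes.
X_a = rng.standard_normal((400, 64))          # Dataset A (base)
y = np.repeat(np.arange(4), 100)
X_b = X_a + 0.5 * rng.standard_normal(X_a.shape)  # Dataset B (traced)

# Train the classifier on A, then score it on both datasets.
clf = LogisticRegression(max_iter=1000).fit(X_a, y)
acc_a = clf.score(X_a, y)  # accuracy on the base dataset
acc_b = clf.score(X_b, y)  # accuracy on the traced dataset
```

The gap between `acc_a` and `acc_b` is the quantity reported below.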

The classifier had a high accuracy rate (98.48%) when tested on Dataset A, but much lower accuracy (69.5%) when tested on the bitmap-to-vector converted images of Dataset B. In particular, the classifier mis-categorized 45% of ears as noses based on vectorized images.

test on 100 samples per class

This is likely because some curvature detail is lost in the vectorization process due to noise or over-simplification of lines; ear images with reduced curvature may begin to resemble nose images. Comparing stroke-based inputs versus vectorized bitmap inputs to the classifier, vectorization adds a 29.5% error rate.

Interactive Documentary

This was one of my earlier experiments with interactive video. “A Tale of Many Children” is a documentary about families impacted by the “one child policy” in China, shot on-site in Beijing. It aims to emphasize the diversity of experience under the blanket rule across different types of families — the wealthy versus the poor, urban versus rural, parents versus children, etc.

Role: Directing, Editing, Producing
On-Site Translators: Jack Linshi, Li Boynton
Additional Translations: Meng Wang
Additional Research and Interviews: Haley Adams, Jess Leao, Jack Linshi, Li Boynton