Animation with Neural Networks


The latent space of a neural network is its “learned” space. In my prior experiments, I have had some success with unconditional generation from the latent space — for example, generating random images by sampling random latent vectors. This suggests that it may be possible to generate frame-by-frame animations via a series of unconditional generations in the latent space. Specifically, we can move from an initial frame to a final frame by interpolating between two images in the latent space; the neural network then fills in the transition states of the animation.

For this experiment, I worked on animation generation for facial features — the mouth, nose, eyes, and ears — continuing my prior work on feature identification [link]. I wanted a single multi-class model so I wouldn’t have to train a separate network for each of the four facial features. Previously, I had found — and there is research to support — that a purely stroke-based LSTM model separates multiple classes poorly in the latent space, whereas convolutional neural networks separate classes better. Since this process depends on latent space organization (we don’t want an eye turning into an ear partway through), I worked with a CNN-LSTM model.


To perform interpolation with our trained VAE, we begin with an identified feature image and pass it through the encoder to obtain z0. We randomly choose another feature image of the same class from the QuickDraw dataset and encode it into z1. Then we create a series of intermediate latent vectors by spherical interpolation (slerp):
z_t = [sin((1 − t)α) / sin(α)] · z0 + [sin(tα) / sin(α)] · z1

where t ∈ [0, 1] expresses how far between the two endpoints we interpolate, and α = cos⁻¹(z0 · z1) is the angle between the two latent vectors. Typically, 10 interpolation steps were enough to produce a visually smooth sequence in our qualitative evaluation.
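The slerp step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project’s actual code: the function name and the random stand-ins for the encoded vectors z0 and z1 are placeholders, and the vectors are normalized before computing α (the formula’s dot-product form assumes unit vectors).

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors, t in [0, 1]."""
    # Normalize before taking the angle, since cos⁻¹(z0 · z1) assumes unit vectors
    z0n = z0 / np.linalg.norm(z0)
    z1n = z1 / np.linalg.norm(z1)
    alpha = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if np.isclose(alpha, 0.0):
        # Nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * alpha) * z0 + np.sin(t * alpha) * z1) / np.sin(alpha)

# Stand-ins for two encoded 512-element latent vectors
rng = np.random.default_rng(0)
z0 = rng.normal(size=512)
z1 = rng.normal(size=512)

# 10 interpolation steps from z0 to z1, each decoded into one animation frame
frames = [slerp(z0, z1, t) for t in np.linspace(0.0, 1.0, 10)]
```

At t = 0 the formula returns z0 exactly, and at t = 1 it returns z1, so the endpoints of the animation match the two input images.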


The results here are generated from QuickDraw stroke-based input data, and we evaluated the interpolations qualitatively. In 25 tests, every pair of input images produced fully formed intermediate images that were immediately recognizable as the same class. Several examples of this technique can be seen in Figure 5, such as one where the mouth moves from open to closed. Since no interpolation produced images from other classes, these results demonstrate that our VAE successfully learned to differentiate between multiple classes in its latent space. Moreover, it was able to generate images comparable to those in the dataset, with symmetry, closed strokes, and a cohesive stroke style throughout.


To explore the VAE’s learned latent space, we can use t-distributed stochastic neighbor embedding (t-SNE), a technique for reducing high-dimensional vectors to a lower-dimensional space [12]. Specifically, our 512-element latent vector z is reduced to 2 dimensions for x-y plotting (Figure 5). t-SNE plots of the encoded latent vectors for 200 images show that the VAE successfully created separate clusters for the four classes in its latent space. The t-SNE plot also lets us make some observations about the network’s learning. It suggests that latent vectors for sharp noses were clustered separately from latent vectors for curved noses (observe the rightmost cluster of sharp noses and the cluster of curved noses further left). Additionally, it shows that the mouths and eyes are clustered relatively close together, which aligns with their lower accuracy rates in the other classification methods we try in Experiments.
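The dimensionality reduction described above can be sketched with scikit-learn’s t-SNE implementation. This is an illustrative example only: the random matrix stands in for the 200 encoded 512-element latent vectors, and the class labels are placeholders.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for 200 encoded 512-dimensional latent vectors (50 per class)
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 512))
labels = np.repeat(["mouth", "nose", "eye", "ear"], 50)

# Reduce 512 dimensions to 2 for an x-y scatter plot;
# perplexity must be smaller than the number of samples
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latents)
print(embedded.shape)  # (200, 2)
```

The resulting 200 × 2 array can be scattered with one color per label to reproduce the kind of cluster plot discussed above.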


The interpolations were compiled into a video at 5 frames per second, and the demo cycles through interpolations between 7 images per feature. Vectorization was omitted from this pipeline due to its poor performance.

