# BEGAN: State of the art generation of faces with Generative Adversarial Networks

## TL;DR

This post describes the theory behind the newly introduced BEGAN.

• As of early April 2017, the BEGAN is the state of the art when it comes to generating realistic faces.
• It is inspired by the EBGAN in that its discriminator is an auto-encoder.
• It shows fast and stable convergence even in the absence of batch norm.
• It automatically balances the trade-off between image diversity and quality of generation.
• And it offers an approximate measure of convergence.

As a bonus, I’ve added a comparison between the BEGAN and the Improved WGAN at the end.

You can comment on the Reddit page associated with this post, or on the Reddit page associated with the paper.

I hope you’ll enjoy this post! 🙂

## The reason it is interesting

Generative Adversarial Networks (GANs) have achieved impressive (and often state-of-the-art) results in various domains such as:

• image generation with Improved WGANs (Gulrajani et al. 2017)
• image editing with Neural Photo Editors (Brock et al. 2016)
• super-resolution with SRGANs (Ledig et al. 2016)
• semi-supervised learning with BiGANs (Donahue et al. 2016)
• domain adaptation with CycleGANs (Zhu et al. 2017)
• etc.

Here at Heuritech (a French start-up) we aim to use these models (and many others) to help consumers, researchers, corporations and kids with their day-to-day deep learning needs. As things turned out, a big part of our focus currently lies within the world of fashion (because Heuritech is also a stylish, well-dressed company), but who knows where we’ll be just a few years from now!

Heuritech was founded by four doctors in Artificial Intelligence and is very research oriented. If you’re interested in being part of the adventure you can contact us here.

## BEGAN

Recently, a new GAN architecture (the Boundary Equilibrium GAN) caught my interest by successfully generating anatomically coherent faces at a resolution of 128×128 pixels. As of early April 2017, this is the current state of the art. Have a look:

## Main goal

It’s all about providing a better loss to the networks.

It has been previously shown that the first GAN architecture minimizes (when D is optimal) the Jensen-Shannon divergence between the real data distribution $p_X$ and the generated data distribution $p_{G(z)}$. An unfortunate consequence of trying to minimize this divergence is that the discriminator D gives meaningless gradients to the generator G if D gets too good too quickly.
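To make the « meaningless gradients » problem concrete, here is a small numerical sketch (mine, not from the paper). It computes the Jensen-Shannon divergence, the one associated with the original GAN objective under an optimal discriminator, between two point masses on a made-up 10-bin histogram. The helper names `js_divergence` and `point_mass` are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(position, n_bins=10):
    """Toy distribution that puts all its mass in one bin."""
    p = np.zeros(n_bins)
    p[position] = 1.0
    return p

real = point_mass(0)
near = js_divergence(real, point_mass(1))  # generated mass right next to the real one
far = js_divergence(real, point_mass(9))   # generated mass far away
print(near, far)
```

Both values are equal to $\log 2$: as long as the two supports do not overlap, the divergence does not tell G whether it is getting closer, which is the situation a too-good D creates.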

Since then, a few publications have focused on finding better loss functions:

• The Improved Wasserstein GAN (since its first version) minimizes the Wasserstein distance (also called the Earth-Mover distance) by giving very simple gradients to the networks (+1 if the output should be considered real and -1 if the output should be considered fake).
• The Least Squares GAN applies a least squares loss between D’s output and its target, which amounts to minimizing a Pearson $\chi^2$ divergence.
• The Generalized Loss Sensitive GAN uses a discriminator that quantifies the quality of images. The loss is then computed as a function of the distance between the quality of real and generated images (which allows the model to focus more on improving poor samples than good samples).

The main goal behind the BEGAN is also to change the loss function. This time, it is achieved by making D an autoencoder. The loss is a function of the quality of reconstruction achieved by D on real and generated images. This idea (of making D an autoencoder) is inspired by the Energy Based GAN (EBGAN). Below is the architecture of the EBGAN shamelessly copy-pasted from the paper:

## The idea

Let’s start by clarifying something important. The reconstruction loss is not the same thing as the real loss that the nets are trying to minimize. The reconstruction loss is the error associated with reconstructing images through the autoencoder/discriminator. In the EBGAN schema the reconstruction loss is referred to as « Dist » and the real loss is referred to as « L ».

The main idea behind the BEGAN is that matching the distributions of the reconstruction losses can be a suitable proxy for matching the data distributions. The real loss is then derived from the Wasserstein distance between the reconstruction losses of real and generated data. Later, the networks are trained by using this real loss in conjunction with an equilibrium term to balance D and G.

## The training

Let me spoil the solution right away, to give you an understanding of where we’re going.

The training goes like this:

1. D (the autoencoder) reconstructs real images better and better. Said differently, the weights of D are updated so that the reconstruction loss of real images is minimized.
2. D simultaneously increases the reconstruction loss of generated images.
3. And G works adversarially to that by minimizing the reconstruction loss of generated images.

Points 1 and 2 can be rephrased as « D tries to discriminate real and generated distributions ». So G can only succeed with point 3 by generating more realistic images.

## Deriving the loss

Let’s first derive the real losses and we’ll focus on the equilibrium term later. What we want at this point is to use the Wasserstein Distance between the reconstruction losses to derive the real losses.

If we use an L1 norm between the input image of D and its reconstructed version, then the loss distribution (referred to as $\mathcal{L}(x)$) is approximately normal (at least it looks like it experimentally). We’re about to use this fact to simplify the Wasserstein distance between the reconstruction losses (of real and generated images).

Here is the Wasserstein distance mathematically:
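As a rough illustration of what this reconstruction loss looks like in code, here is a minimal NumPy sketch. It is an assumption-laden toy: the « autoencoder » is replaced by a stand-in that just adds a bit of noise to its input, the loss is averaged over pixels (my choice), and `per_image_l1` is a hypothetical helper name.

```python
import numpy as np

def per_image_l1(images, reconstructions):
    """L1 reconstruction loss of each image, averaged over its pixels."""
    diff = np.abs(images - reconstructions)
    return diff.reshape(len(images), -1).mean(axis=1)

rng = np.random.default_rng(0)
images = rng.uniform(-1.0, 1.0, size=(256, 32, 32, 3))
# Stand-in for D(x): the input plus small noise (a real autoencoder goes here).
reconstructions = images + rng.normal(0.0, 0.05, size=images.shape)

losses = per_image_l1(images, reconstructions)  # one loss value per image
print(losses.mean(), losses.std())
```

The array `losses` is a sample from the loss distribution; its mean plays the role of the $m$ terms that show up in the simplified Wasserstein distance.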
$W(\mu_{real}, \mu_{gen}) = |m_{real} - m_{gen}| + another\_term$
with:

• $\mu_{real}$ and $\mu_{gen}$: the distributions of the reconstruction losses of real and generated images ($\mathcal{L}(x)$ and $\mathcal{L}(G(z))$).
• $m_{real}$ and $m_{gen}$: the averages associated to the loss distributions.
• $another\_term$: a term that we drop thanks to 2 assumptions. The first assumption is that the loss distributions are normal (as said before) and the second assumption is that this term stays roughly constant. If (like me) you’re not so sure about these assumptions, you should know that the experimental results show that it’s fine.

In the end, this is the simplified distance that we want to minimize:
$W(\mu_{real}, \mu_{gen})\propto {|m_{real} - m_{gen}|}$
And from this formulation we can derive the GAN objective.
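If you want to convince yourself that the gap between the means can carry most of the distance, here is a small sanity check. It is a sketch under my own assumptions: it uses the empirical 1-D Wasserstein-1 distance (obtained by matching sorted samples of equal size), and the two loss distributions are made-up normals, not measurements from a trained BEGAN.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("samples must have the same size")
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
# Hypothetical reconstruction losses of real and generated images.
losses_real = np.abs(rng.normal(0.10, 0.02, size=10_000))
losses_gen = np.abs(rng.normal(0.25, 0.05, size=10_000))

w = wasserstein_1d(losses_real, losses_gen)
mean_gap = abs(losses_real.mean() - losses_gen.mean())
print(w, mean_gap)
```

The distance can never be smaller than the gap between the means, and in this toy setting the two are almost identical.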

## The complete BEGAN objective

The reconstruction losses $\mathcal{L}(x)$ and $\mathcal{L}(G(z))$ can only be positive, so as D wants to maximize the distance $W(\mu_{real}, \mu_{gen})$ between the losses, it has only two choices:

• either (case 1) it needs: $m_{real} \to \infty$ and $m_{gen} \to 0$
• or (case 2) it needs: $m_{real} \to 0$ and $m_{gen} \to \infty$

Case 1 doesn’t make sense of course because we want $m_{real}$ to go to 0 (real images should be reconstructed perfectly). We can enforce case 2 by applying the following loss function to D:
$\mathcal{L}_D = \mathcal{L}(x) - \mathcal{L}(G(z))$

G can then work adversarially to that by trying to minimize:
$\mathcal{L}_G = \mathcal{L}(G(z))$

In the end these losses (which are not the final ones) look very similar to those of the WGAN except:

• that we are matching distributions between losses,
• and there is no need for the discriminative function to be k-Lipschitz.

But this is not the end of the story.

In the next section I describe the second main contribution of the BEGAN paper: the diversity ratio $\gamma$. In short, the role of $\gamma$ is to balance the losses $\mathcal{L}(x)$ and $\mathcal{L}(G(z))$ to stabilize the training. This is done in an adaptive fashion, during training, with the help of its surrogate $k_t$.

Let me first spoil the final solution, and I’ll explain how $\gamma$ and $k_t$ work in the next section.

Here is the complete BEGAN objective:
$\mathcal{L}_D = \mathcal{L}(x) - k_t \cdot \mathcal{L}(G(z))$

$\mathcal{L}_G = \mathcal{L}(G(z))$

$k_{t+1} = k_t + \lambda \cdot (\gamma \cdot \mathcal{L}(x) - \mathcal{L}(G(z)))$

In this formulation:

• $\mathcal{L}_D$ and $\mathcal{L}_G$ are the respective losses for D and G (what they try to minimize).
• $\mathcal{L}_D$ is only used to optimize $\theta_D$ and $\mathcal{L}_G$ is only used to optimize $\theta_G$.
• $\mathcal{L}(x)$ and $\mathcal{L}(G(z))$ are the losses of reconstruction of real and generated images.
• $\gamma$ is the diversity ratio (in $[0,1]$) defined as: $\gamma = {\mathbb{E}[\mathcal{L}(G(z))]} / {\mathbb{E}[\mathcal{L}(x)]}$.
• $k_t$ is the adaptive term that will allow us to balance the losses automagically.
• $\lambda$ is the proportional gain for $k_t$ (aka the learning rate for $k_t$).
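Written as code, the objective fits in a handful of lines. This is a minimal sketch working on scalar reconstruction losses, not the authors’ implementation: the function name `began_step`, the default $\gamma = 0.5$ and the clipping of $k_t$ to $[0, 1]$ are my assumptions (the $10^{-3}$ gain is the value used in the experiments).

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """Evaluate the BEGAN objective once from scalar reconstruction losses.

    loss_real stands for L(x) and loss_fake for L(G(z)).
    Returns (loss_d, loss_g, k_next).
    """
    loss_d = loss_real - k * loss_fake   # minimized w.r.t. theta_D only
    loss_g = loss_fake                   # minimized w.r.t. theta_G only
    k_next = k + lambda_k * (gamma * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)  # assumption: keep k_t inside [0, 1]
    return loss_d, loss_g, k_next

loss_d, loss_g, k_next = began_step(loss_real=0.2, loss_fake=0.05, k=0.1)
print(loss_d, loss_g, k_next)
```

In this example the generated images are reconstructed « too well » compared to the target ratio ($0.05 < 0.5 \times 0.2$), so $k$ goes up slightly and D will put a bit more weight on pushing $\mathcal{L}(G(z))$ up at the next step.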

## The equilibrium term

What we aim to do now is to balance the losses however we want. This will help us control the equilibrium between G and D in real time, stabilize the training, get an approximate measure of convergence and adjust the trade-off between image diversity and realism with just one hyperparameter. Magic!

The reconstruction losses are considered to be at equilibrium when:
$\mathbb{E}[\mathcal{L}(x)] = \mathbb{E}[\mathcal{L}(G(z))]$

And the thing is we don’t want this equilibrium to ever be reached because this would mean that D becomes incapable of distinguishing generated samples from real ones. Said differently this would mean G wins. For the training to go smoothly, neither network should win over the other.

The strategy is to define a ratio $\gamma$ between the 2 losses.
$\gamma = \frac{\mathbb{E}[\mathcal{L}(G(z))]}{\mathbb{E}[\mathcal{L}(x)]}$

$\gamma$ is called the diversity ratio and it should be in $[0, 1]$. It is always positive because the reconstruction losses are always positive. And it is below 1 in practice because we are going to make it so (because we want to have $\mathcal{L}(x) > \mathcal{L}(G(z))$).

We will make sure this ratio is maintained during the training in the next section, but first I would like to give you an intuition of the importance of $\gamma$.

D has 2 competing goals: auto-encode real images and discriminate real from generated images. $\gamma$ helps us balance these goals:

• Lower values of $\gamma$ lead to $\mathbb{E}[\mathcal{L}(x)] \gg \mathbb{E}[\mathcal{L}(G(z))]$, which means D focuses more on its auto-encoding task, which means G has an incentive to produce more realistic images, which may be at the cost of image diversity.
• Higher values of $\gamma$ lead to $\mathbb{E}[\mathcal{L}(x)] \approx \mathbb{E}[\mathcal{L}(G(z))]$, which means D focuses more on its discrimination task, which means G has an incentive to produce more diverse images (as diverse as the dataset), which may be at the cost of image quality.

Take a look at what happens with different values of the hyperparameter $\gamma$:

## Balancing the losses

The strategy, now, is to maintain the ratio $\gamma$ between the 2 reconstruction losses over time. In practice we can control $\gamma$ by adding an adaptive term $k_t$.

What makes $k_t$ adaptive is something called « Proportional Control Theory ». It is a fancy name to describe what you’re doing when you’re driving at constant speed. If you’re going too fast you slow down proportionally to how much faster (than the cruise speed) you’re going. If you’re going too slow you accelerate proportionally to how much slower (than the cruise speed) you’re going.

Here, the ratio we want to maintain (the speed of the car) is the ratio $\gamma$ defined as:
$\gamma = \frac{\mathbb{E}[\mathcal{L}(G(z))]}{\mathbb{E}[\mathcal{L}(x)]}$
So in a perfect world we should have exactly this during training:
$\gamma = \frac{\mathcal{L}(G(z))}{\mathcal{L}(x)}$
which is equivalent to:
$\gamma \cdot \mathcal{L}(x) - \mathcal{L}(G(z)) = 0$

In practice $\gamma \cdot \mathcal{L}(x) - \mathcal{L}(G(z))$ is never equal to 0 during training. This value represents how off we are from the stable point (this is how much faster or slower your car is really going).

Now that we have this information we need to control this ratio dynamically during the training (we need to maintain the correct speed while driving). And the factor that controls this ratio is $k_t \in [0, 1]$ through the formula: $\mathcal{L}_D = \mathcal{L}(x) - k_t \cdot \mathcal{L}(G(z))$.

$k_t$ is adapted in the right direction like this: $k_{t+1} = k_t + \lambda \cdot (\gamma \cdot \mathcal{L}(x) - \mathcal{L}(G(z)))$ with $\lambda$ being the learning rate that adapts $k_t$ over time. $\lambda$ is also called the proportional gain for $k$ and in practice (in the experiments) it is set to 0.001.
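To see the proportional controller at work, here is a toy closed-loop simulation. Everything about the « plant » is made up for the sake of the illustration: I assume $\mathcal{L}(x)$ is fixed at 1 and that $\mathcal{L}(G(z))$ simply equals $k_t \cdot \mathcal{L}(x)$ (the more weight D puts on generated images, the worse it reconstructs them). A real GAN is obviously not this simple, and `simulate_k` is a hypothetical name.

```python
def simulate_k(gamma=0.5, lambda_k=0.001, steps=10_000):
    """Proportional control of k_t on a made-up, fully deterministic plant."""
    loss_real = 1.0                   # toy assumption: L(x) stays constant
    k = 0.0
    for _ in range(steps):
        loss_fake = k * loss_real     # toy assumption: L(G(z)) grows with k_t
        error = gamma * loss_real - loss_fake  # how far off the cruise speed we are
        k = min(max(k + lambda_k * error, 0.0), 1.0)  # keep k_t in [0, 1]
    ratio = (k * loss_real) / loss_real
    return k, ratio

k_final, ratio = simulate_k()
print(k_final, ratio)
```

With this toy plant the ratio $\mathcal{L}(G(z)) / \mathcal{L}(x)$ drifts toward $\gamma$ and settles there, which is exactly what the cruise control is supposed to do.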

It is this formulation of $k_t$ that justifies the complete BEGAN objective presented above.

## Consequences of balancing D and G during training

Training procedure: In most GANs, D and G are trained alternately. This is not the case here. They can be trained simultaneously at each time step, because D and G are automatically balanced. The training is still adversarial, it is just simultaneous.

There is no need to have $k_t$ show up in $\mathcal{L}_G$: This is because the ratio $\gamma$ (that $k_t$ controls) is there to balance the goals of D. It has nothing to do with the generator which has only one objective. So, in short, G does its own thing and D adapts. D adapts how much it wants to focus on its discrimination task (in which case D gets better faster than G) relative to its auto-encoding task (in which case G gets better faster than D).

## Convergence Measure

A convenient consequence of this BEGAN formulation is that it makes the derivation of an approximate convergence measure possible.

Let’s look at the complete BEGAN objective from another point of view:

1. $\mathcal{L}(x)$ should go to 0 as images are reconstructed better and better after each time step.
2. $\gamma \cdot \mathcal{L}(x) - \mathcal{L}(G(z))$ should stay close to 0 (so that the losses are balanced).

The combination of both points implies that the reconstruction losses get closer to 0 (and therefore to one another). This means that D has more and more trouble maximizing the distance between $\mathcal{L}(x)$ and $\mathcal{L}(G(z))$ (by minimizing $\mathcal{L}_D = \mathcal{L}(x) - k_t \cdot \mathcal{L}(G(z))$). Said differently the task of discriminating real from generated images becomes more difficult as time goes on. And there is only one way that this happens: G is getting better!

In short, the combination of (1) and (2) implies that $\mathcal{L}(G(z))$ goes to 0 and that $p_{G(z)}$ gets closer to $p_X$.

Now here is the awesome part: (1) and (2) have such simple forms that we can simply add them up to get the following convergence measure:
$M_{global} = \mathcal{L}(x) + |\gamma \cdot \mathcal{L}(x) - \mathcal{L}(G(z))|$
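In code this measure is a one-liner that you can log at every training step. A minimal sketch, where the function name `convergence_measure` and the example values are mine:

```python
def convergence_measure(loss_real, loss_fake, gamma):
    """M_global = L(x) + |gamma * L(x) - L(G(z))|, from scalar losses."""
    return loss_real + abs(gamma * loss_real - loss_fake)

# Hypothetical values: early in training vs. later on.
m_early = convergence_measure(loss_real=0.2, loss_fake=0.05, gamma=0.5)
m_late = convergence_measure(loss_real=0.08, loss_fake=0.04, gamma=0.5)
print(m_early, m_late)
```

The first value is about 0.25 and the second about 0.08: the measure goes down when real images are reconstructed better and the ratio of the losses stays close to $\gamma$.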

Why is this awesome? Because global measures of GAN convergence are hard to come by. There isn’t any in the typical GAN setup and to my knowledge only the WGAN provides one.

But now there is a way to know if the network converges or collapses:

## Critical implementation details

Model architecture:

• Both the generator and the decoder are deep deconvolutional networks with identical architectures but different weights. The encoder and the decoder have « opposite » architectures.
• Downsampling is done with strided convolutions, just like in the DCGAN (cf. this awesome blog post for an explanation of what strided convolutions are).
• Upsampling is done with nearest-neighbor interpolation.
• The non-linearities are exponential linear units (ELUs) (as opposed to a combination of ReLUs and LeakyReLUs in the DCGAN).
• The embedding state $h$ (aka the bottleneck of the autoencoder) is not connected to non-linearities.
• $z$ (the random input of G) is sampled uniformly in $[-1, 1]^N$.
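Three of these building blocks are easy to write down in plain NumPy. The snippet below is only a sketch of the individual pieces, not of the full architecture: the helper names, the channels-last image layout, the upsampling factor of 2 and the latent size of 64 are assumptions of mine.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: x if x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def upsample_nearest(images, factor=2):
    """Nearest-neighbor upsampling of a (batch, height, width, channels) array."""
    return images.repeat(factor, axis=1).repeat(factor, axis=2)

def sample_z(batch_size, n_dims, rng):
    """Sample the random input of G uniformly in [-1, 1]^N."""
    return rng.uniform(-1.0, 1.0, size=(batch_size, n_dims))

rng = np.random.default_rng(0)
z = sample_z(batch_size=16, n_dims=64, rng=rng)
feature_maps = rng.normal(size=(16, 8, 8, 3))
upsampled = upsample_nearest(elu(feature_maps))
print(z.shape, upsampled.shape)
```

Each pixel is simply repeated along both spatial axes, so an 8×8 feature map becomes a 16×16 one without any learned parameter.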

Interpolation: To display the interpolations of real images in latent space (Figure 4), the authors needed to obtain the embedding $z$ of real images. This is done by optimizing $z$ alone (the weights of G stay fixed), with the real image provided as the target of the generator. It is an L1 loss that is minimized in this case (between the generated image and the target image): $e_r = |x - G(z)|$. $h$ cannot be used in place of $z$ because there is nothing that constrains $h$ to converge toward $z$.
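Here is a toy version of this embedding procedure, to make the idea of « training on $z$ alone » concrete. It is a sketch with heavy assumptions: the generator is a made-up fixed linear map (not a trained deconvolutional network), and I use plain subgradient descent with a decaying step size, which is my choice and not necessarily the optimizer used by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8))  # hypothetical frozen "generator" weights

def generator(z):
    """Toy linear generator: maps an 8-dim z to a 64-pixel "image"."""
    return W @ z

z_true = rng.uniform(-1.0, 1.0, size=8)
x = generator(z_true)         # the "real image" we want to embed

def l1_error(z):
    """e_r = |x - G(z)|, averaged over pixels."""
    return float(np.mean(np.abs(x - generator(z))))

z = np.zeros(8)
initial_error = l1_error(z)
best_z, best_error = z.copy(), initial_error
for t in range(2000):
    residual = x - generator(z)
    # Subgradient of the mean absolute error with respect to z only.
    subgrad = -(W.T @ np.sign(residual)) / residual.size
    z = z - 0.05 / np.sqrt(t + 1.0) * subgrad
    error = l1_error(z)
    if error < best_error:    # subgradient steps are not monotone: keep the best z
        best_z, best_error = z.copy(), error

print(initial_error, best_error)
```

Only `z` is updated in the loop, the weights `W` never change. Once two real images are embedded this way, interpolating between their two `best_z` vectors and feeding the result to G gives the interpolated images.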

Overview: In the end, the architecture is pretty simple. As they say in the paper: « no batch normalization, no dropout, no transpose convolutions and no exponential growth for convolution filters ». Any of those techniques, though, may improve the results even more.

The secret trick: In Figure 2 of the paper the authors put side by side the results from the EBGAN and those from the BEGAN. The BEGAN results look much better than the EBGAN results, but you should keep in mind that the models were not trained on the same datasets.

## Beauty

As a side note: it is common knowledge that « averageness » is a strong indicator of beauty (there are fourteen sources to corroborate this statement on the Wikipedia page for Beauty). The idea is that the more you blend faces together the more beautiful they look.

The faces generated by BEGAN are not just visually realistic, they also seem to be very attractive (the authors even report that they saw few older people and more women than men). My understanding is that G could have developed a good representation of the « average » face to be more efficient, and the beauty of the generated faces could be a consequence of this.

## Points to take home

What BEGAN can do:

• It makes possible the generation of the « first anatomically coherent » face images at a resolution of 128×128.
• Its convergence is fast and stable even in the absence of batch norm.
• It also offers an approximate measure of convergence.

How BEGAN works:

• BEGAN uses an autoencoder as the discriminator (just like in the EBGAN).
• The losses used to update $\theta_G$ and $\theta_D$ are a function of the quality of reconstruction achieved by D.
• Matching the distributions of the reconstruction losses is assumed to be a suitable proxy for matching the data distributions.
• The trade-off between image diversity and quality of generation is automatically balanced by maintaining the ratio of the reconstruction losses close to a hyperparameter $\gamma$.
• When $\gamma$ is low, there is less variety and more realism. When $\gamma$ is high, there is more variety and less realism.

## Bonus: BEGAN vs Improved WGAN

BEGAN or Improved WGAN, which one is better?

First, they have a lot in common:

• Both models have a training procedure that leads to a fast and stable convergence.
• Both models provide an approximate measure of how efficient G becomes over time.
• And both models successfully make the use of batch normalization in D unnecessary.

The Improved WGAN paper is impressive mostly because:

• The authors managed to avoid the vanishing/exploding gradient problem by adding a constraint on the norm of the gradients.
• And they trained many different GAN architectures including 101-layer ResNets as well as a language model over discrete data.

The BEGAN paper is impressive mostly because:

• The authors try to make loss distributions match (as opposed to making data distributions match).
• This leads to a way to automatically balance G against D during training, as well as image diversity versus visual quality.

So which one is better? It doesn’t matter. Everybody is a winner!

## Footnote

My name is Loris Felardos and you can contact me at: felardos [dot] loris {at} gmail (dot) com.

The author (that’s me) would like to thank Charles Ollion for his very insightful comments and advice.

Here is Heuritech’s website

Here is the Reddit page associated with this post.

Here is the Reddit page associated with the paper.