In this blog, I will explain a model for multimodal image-to-image translation popularly known as BicycleGAN. The task of image-to-image translation can be framed as per-pixel regression or classification. Another approach to this problem is the generative adversarial network (GAN); the results obtained using GANs are more robust and perceptually realistic.

In the paper “Toward Multimodal Image-to-Image Translation”, the aim is to generate a distribution of output images given a single input image. It is essentially an extension of the pix2pix image-to-image translation model, which uses conditional GANs. Before pix2pix, many people tried to solve this problem with unconditional GANs, conditioning the output on the input through an L2 regression loss instead. I have already explained conditional GANs in my previous blog, along with the Variational Autoencoder.

In the first part of the model, we use a conditional Variational Autoencoder GAN (cVAE-GAN). The idea is to learn a low-dimensional latent representation of the target images using an encoder network, i.e., a probability distribution that could have generated all the target images. We encourage this distribution to be close to the standard normal distribution so that we can sample from it easily at inference time. Next, a generator maps the input image, together with the encoded latent code z, to the output image.
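The cVAE-GAN data flow can be sketched as below. This is a minimal illustration only: `encode`, `generate`, and the image shapes are hypothetical stand-ins for the real convolutional networks, but the reparameterization trick and the KL term that pulls the latent distribution toward N(0, I) are the actual mechanisms used.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(B, z_dim=8):
    # Stand-in for the encoder network E(B); a real encoder would predict
    # these Gaussian parameters (mu, log_var) from the target image B.
    mu = np.tanh(B.reshape(-1)[:z_dim])   # hypothetical mapping
    log_var = np.zeros(z_dim)
    return mu, log_var

def reparameterize(mu, log_var):
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, sigma^2) || N(0, I)): pushes the learned latent distribution
    # toward the standard normal so z can be sampled freely at inference.
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def generate(A, z):
    # Stand-in for the generator G(A, z); the real G is a U-Net that
    # injects z alongside the input image A.
    return np.tanh(A + z.mean())

A = rng.standard_normal((3, 8, 8))   # input image
B = rng.standard_normal((3, 8, 8))   # ground-truth target image
mu, log_var = encode(B)
z = reparameterize(mu, log_var)      # B -> z
B_hat = generate(A, z)               # (A, z) -> B'
kl = kl_to_standard_normal(mu, log_var)
```

At inference time the encoder is dropped entirely: we draw z ~ N(0, I) directly, which is exactly why the KL term matters during training.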

In the second part of the model, we use a conditional Latent Regressor GAN (cLR-GAN). Here, z is sampled from a normal distribution N(z) and, together with the input image A, is fed to the generator to produce the output image. This output image is then fed to the encoder network, which outputs z’, and we encourage z’ to be close to the sampled z. Combining these two steps, we calculate the loss. The final loss function looks like this:
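For reference, my reading of the paper’s hybrid objective is the following, where λ, λ_latent, and λ_KL weight the image reconstruction, latent recovery, and KL terms respectively (check against the paper for the exact form):

```latex
G^{*}, E^{*} = \arg\min_{G,E}\,\max_{D}\;
    \mathcal{L}_{\text{GAN}}^{\text{VAE}}(G, D, E)
  + \lambda \, \mathcal{L}_{1}^{\text{VAE}}(G, E)
  + \mathcal{L}_{\text{GAN}}(G, D)
  + \lambda_{\text{latent}} \, \mathcal{L}_{1}^{\text{latent}}(G, E)
  + \lambda_{\text{KL}} \, \mathcal{L}_{\text{KL}}(E)
```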

where G, D, and E stand for the Generator, Discriminator, and Encoder, respectively.

In this model, the mapping between the latent vector z and the output image is bijective: latent vectors map to output images and output images map back to latent vectors. The overall architecture consists of two cycles, B -> z -> B’ and z -> B’ -> z’, hence the name BicycleGAN. The architecture is clearly summarized in this figure.
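The second cycle, z -> B’ -> z’, can be sketched in the same toy style (again, `generate` and `encode_mu` are hypothetical placeholders for the real networks): sample z from the prior, generate B’, re-encode it, and penalize the L1 distance between the sampled and recovered latents.

```python
import numpy as np

rng = np.random.default_rng(1)
z_dim = 8

def generate(A, z):
    # Hypothetical stand-in for the generator G(A, z).
    return np.tanh(A + z.mean())

def encode_mu(B):
    # Stand-in for the encoder; the cLR-GAN cycle regresses only the
    # predicted mean of the latent distribution.
    return np.tanh(B.reshape(-1)[:z_dim])

A = rng.standard_normal((3, 8, 8))       # input image
z = rng.standard_normal(z_dim)           # z ~ N(0, I), not from the encoder
B_prime = generate(A, z)                 # z -> B'
z_prime = encode_mu(B_prime)             # B' -> z'
latent_loss = np.abs(z - z_prime).mean() # L1 latent recovery term
```

This latent recovery term is what discourages the generator from ignoring z, i.e., it is the mechanism that enforces multimodal outputs.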

Key Points:-

  • We have 3 different networks: a) Discriminator, b) Encoder, and c) Generator.
  • A cVAE-GAN (conditional Variational Autoencoder GAN) is used to encode the ground-truth output image B into a latent vector z, which is then used to reconstruct the output image B’, i.e., B -> z -> B’.
  • For the inverse mapping (z -> B’ -> z’), we use an LR-GAN (Latent Regressor GAN), in which the Generator produces B’ from the input image A and a sampled z.
  • Combining both these models, we get BicycleGAN.
  • The architecture of the Generator is the same as U-Net: an encoder-decoder network with symmetric skip connections.
  • For Encoder, we use several residual blocks for an efficient encoding of the input image.
  • The model is trained using the Adam optimizer with Batch Normalization and a batch size of 1.
  • The Leaky ReLU activation function is used in all the networks.
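Two of the points above can be made concrete with a tiny 1-D toy (not the real convolutional U-Net; the downsampling/upsampling and layer widths here are illustrative only): Leaky ReLU, and the symmetric skip connections where each decoder stage reuses the matching encoder activation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # Leaky ReLU, used throughout the BicycleGAN networks.
    return np.where(x > 0, x, slope * x)

def unet_like(x, depth=3):
    # Toy U-Net-style forward pass on a 1-D signal: the "encoder" halves
    # the resolution at each stage, and the "decoder" doubles it back,
    # adding the symmetric encoder activation (skip connection) each time.
    skips = []
    h = x
    for _ in range(depth):       # encoder path: downsample by 2
        skips.append(h)
        h = leaky_relu(h[::2])
    for _ in range(depth):       # decoder path: upsample, add skip
        h = np.repeat(h, 2)
        h = leaky_relu(h + skips.pop())
    return h

out = unet_like(np.linspace(-1.0, 1.0, 16))
```

The skips are popped in reverse order, so the innermost decoder stage pairs with the innermost encoder stage, which is exactly the symmetry the bullet list describes.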

Credits: Prakash Pandey