ERNIE-ViLG

ERNIE‑ViLG is Baidu’s flagship Chinese text-to-image model: capable, culturally aware, and technically advanced, especially in version 2.0 with its diffusion-based architecture and mixture of denoising experts. Despite strong results, it moderates sensitive content and is optimised for Chinese-language use.

🚀 Origins & Versions

  • ERNIE‑ViLG v1 (Dec 2021): Introduced as a bidirectional vision-language transformer, trained on Chinese image-text pairs; about 10 billion parameters.
  • ERNIE‑ViLG 2.0 (Oct 2022): Upgraded to a diffusion model with ~24 billion parameters and a novel “mixture-of-denoising-experts” approach.

GANs

Generative Adversarial Networks are a type of neural network architecture invented in 2014 by Ian Goodfellow and his collaborators. GANs are foundational to much of today’s AI image generation.

A GAN is made of two neural networks that play a game:

  • Generator (G): tries to create fake data that looks like real data (e.g., fake images).
  • Discriminator (D): tries to tell real from fake; it acts like a critic or detective.

They train together:

  1. The generator creates an image.
  2. The discriminator decides if it’s fake or real.
  3. Feedback from the discriminator helps the generator improve.
  4. Over time, the generator gets so good the discriminator can’t tell the difference.

This is why it’s called adversarial — the two networks are in a constant battle.

GANs are unsupervised (or self-supervised) learning models — they don’t need labeled data.

They learn the distribution of training data and generate new data from that distribution.
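To make the adversarial game concrete, here is a minimal training-loop sketch in PyTorch. The network sizes, optimiser settings, and the flattened 28×28 image shape are illustrative assumptions for this sketch, not details from any particular GAN paper.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28  # assumed sizes (e.g., flattened MNIST-style images)

# Generator: maps a random latent vector to a fake image
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())

# Discriminator: maps an image to a probability of being real
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1. The generator creates images from random noise
    z = torch.randn(batch, latent_dim)
    fake_images = G(z)

    # 2. The discriminator learns to separate real from fake
    #    (detach() so this step does not update the generator)
    d_loss = bce(D(real_images), real_labels) + bce(D(fake_images.detach()), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 3.-4. The generator is updated to fool the discriminator:
    #       it is rewarded when D labels its fakes as "real"
    g_loss = bce(D(fake_images), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Example call: d_loss, g_loss = train_step(real_batch.view(-1, img_dim))
```

Note how no labels describing the image content are needed: the only "labels" are real vs. fake, which the training procedure provides itself.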

Many improved GANs have been developed since 2014, including:

  • DCGAN (2015): Deep Convolutional GAN; popular for image generation
  • StyleGAN (2018–2021): introduced “style” control; used in “This Person Does Not Exist”
  • CycleGAN: image-to-image translation (e.g., horses ↔ zebras)
  • BigGAN: high-quality, class-conditional image generation (trained on ImageNet)

StyleGAN

StyleGAN, developed by NVIDIA Research, is a groundbreaking architecture for generating ultra-realistic synthetic images, especially of human faces. Its ability to control “styles” across image layers set a new standard in AI image generation and led to viral real-world applications like This Person Does Not Exist. Today, it’s used across art, games, fashion, and media, with both exciting and troubling implications.

The lead authors are Tero Karras, Samuli Laine, and Timo Aila. Unlike earlier GANs, which often offered limited control over image attributes, StyleGAN introduced a “style-based” architecture that revolutionized image synthesis. Images are generated in a multi-scale, layered way: high-level attributes (pose, identity, …), mid-level features (eye shape, …), and low-level details (color, texture, …). A latent input vector is first transformed by a learned mapping network into an intermediate latent space (W space), and the resulting style vector modulates the generator at each layer.
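A toy sketch of this idea in PyTorch is shown below; class names and sizes are assumptions for illustration, and the real StyleGAN generator is considerably more elaborate. An 8-layer MLP maps z into W space, and the resulting w is turned into per-channel scale and shift (“style”) parameters that modulate feature maps at a given layer, roughly in the spirit of StyleGAN1’s AdaIN.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a latent vector z to the intermediate latent space W."""
    def __init__(self, z_dim=512, w_dim=512, n_layers=8):
        super().__init__()
        layers, dim = [], z_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class StyleModulation(nn.Module):
    """AdaIN-style modulation: w controls per-channel scale and shift."""
    def __init__(self, w_dim, channels):
        super().__init__()
        self.to_style = nn.Linear(w_dim, channels * 2)
        self.norm = nn.InstanceNorm2d(channels)

    def forward(self, x, w):
        scale, shift = self.to_style(w).chunk(2, dim=1)
        x = self.norm(x)
        return x * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

# Usage: one style vector modulates feature maps at one resolution/layer
mapping = MappingNetwork()
style = StyleModulation(w_dim=512, channels=64)
z = torch.randn(4, 512)                 # random latent input
w = mapping(z)                          # intermediate latent (W space)
features = torch.randn(4, 64, 16, 16)   # feature maps at some generator layer
modulated = style(features, w)          # "style" applied at this layer
```

Because a separate style is injected at each resolution, coarse layers end up steering pose and identity while fine layers steer texture and color, which is the layered control described above.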

The following timeline shows when the different StyleGAN versions were released:

🔹 StyleGAN1 (2018)
  • Introduced style-based generation
  • Produced realistic but occasionally distorted faces
🔹 StyleGAN2 (2019–2020)
  • Major quality improvement
  • Fixed artifacts and strange features in faces (e.g., weird teeth or asymmetry)
  • Used in “This Person Does Not Exist”
🔹 StyleGAN3 (2021)
  • Introduced equivariance, making it better at handling rotation and translation
  • Improved realism and temporal coherence (useful for video)