ERNIE-ViLG

ERNIE‑ViLG is Baidu’s flagship text-to-image model, built for Chinese-language prompts and attuned to Chinese cultural context. Version 2 moved to a diffusion-based architecture that uses a mixture of denoising experts. Despite strong results, the model is optimised for Chinese-language use, and its output is moderated for sensitive content.

🚀 Origins & Versions

  • ERNIE‑ViLG v1 (Dec 2021): A transformer with about 10 billion parameters, trained on Chinese image-text pairs for bidirectional generation (text-to-image and image-to-text).
  • ERNIE‑ViLG 2.0 (Oct 2022): Rebuilt as a diffusion model with ~24 billion parameters and a novel “mixture-of-denoising-experts” approach, in which separate denoising networks handle different stages of the denoising process.
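
To make the “mixture-of-denoising-experts” idea more concrete, here is a minimal sketch of how denoising timesteps could be routed to stage-specific experts. It is an illustration only, not Baidu’s implementation: the function names (`expert_index`, `expert_schedule`), the even split into contiguous blocks, and the step and expert counts are all assumptions chosen for the example.

```python
from typing import List


def expert_index(t: int, num_steps: int, num_experts: int) -> int:
    """Return which expert handles timestep t (hypothetical routing rule).

    Assumption: the num_steps timesteps are split into num_experts
    contiguous blocks of (nearly) equal size, one expert per block.
    """
    if num_experts < 1 or num_steps < num_experts:
        raise ValueError("need at least one timestep per expert")
    if not 0 <= t < num_steps:
        raise ValueError("timestep out of range")
    return t * num_experts // num_steps


def expert_schedule(num_steps: int, num_experts: int) -> List[int]:
    """List the expert used at each step, from the noisiest step down to 0."""
    return [expert_index(t, num_steps, num_experts)
            for t in reversed(range(num_steps))]


if __name__ == "__main__":
    # Illustrative sizes only: 8 denoising steps shared by 4 experts.
    print(expert_schedule(num_steps=8, num_experts=4))
```

In a real sampler, each index would select a separate denoising network for that step, so that early (high-noise) and late (low-noise) stages are handled by different experts while only one expert runs per step.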