ERNIE‑ViLG is Baidu’s flagship Chinese text-to-image model. It handles Chinese-language prompts and culturally specific subject matter well, and version 2.0 moves to a diffusion-based architecture that uses a mixture of denoising experts. Despite strong results, its output is moderated for sensitive content, and it is optimised for Chinese-language use.
🚀 Origins & Versions
- ERNIE‑ViLG v1 (Dec 2021): A transformer for bidirectional vision-language generation (text-to-image and image-to-text), trained on Chinese image-text pairs; about 10 billion parameters.
- ERNIE‑ViLG 2.0 (Oct 2022): Upgraded to a diffusion model with ~24 billion parameters and a novel “mixture-of-denoising-experts” approach.
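
To make the “mixture-of-denoising-experts” idea concrete: the denoising timesteps are split into consecutive blocks, and each block is handled by its own denoiser. The sketch below is only a toy illustration of that routing, not Baidu’s implementation. The function names, the equal-width blocks, and the scalar “denoisers” are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

# A toy "denoiser": takes a noisy value and a timestep, returns a cleaner value.
# (Hypothetical stand-in for a full denoising network.)
Denoiser = Callable[[float, int], float]


def expert_for_timestep(t: int, num_timesteps: int, num_experts: int) -> int:
    """Map timestep t to an expert index, assuming equal-width timestep blocks."""
    if not 0 <= t < num_timesteps:
        raise ValueError("t must be in [0, num_timesteps)")
    return t * num_experts // num_timesteps


def make_toy_expert(strength: float) -> Denoiser:
    """Build a toy expert that shrinks the input by a fixed fraction."""
    def denoise(x: float, t: int) -> float:
        return x * (1.0 - strength)
    return denoise


def run_denoising(
    x: float, experts: List[Denoiser], num_timesteps: int
) -> Tuple[float, List[int]]:
    """Walk timesteps from noisy (high t) to clean (low t), one expert per step."""
    used = []
    for t in reversed(range(num_timesteps)):
        k = expert_for_timestep(t, num_timesteps, len(experts))
        x = experts[k](x, t)  # only the expert owning this block runs
        used.append(k)
    return x, used
```

Because only one expert runs at each step, adding experts increases the total parameter count without increasing the compute spent per denoising step.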