
Text-To-Image Models: How Diffusion and GANs Create Photorealistic Images

Midjourney, Stable Diffusion, and DALL-E are cutting-edge text-to-image models: given a natural-language prompt, they generate a matching image. They use state-of-the-art machine learning techniques to produce high-quality images that are becoming increasingly photorealistic.

In 2014, researchers introduced a framework called the Generative Adversarial Network (GAN). In this approach, two neural networks compete to produce natural-looking images: the generator creates an image, while the discriminator judges whether that image is real or generated. The two networks are trained together, each improving in response to the other.

In 2015, a technique called "diffusion" was proposed as a new way to train generative models. During training, the model is shown a mathematical representation of an image, and noise is gradually added to the data until it becomes unrecognizable. The model then learns to reverse this process, starting from pure noise and denoising it step by step into a new image.
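The forward (noising) half of that process can be sketched in a few lines of numpy. This is a minimal illustration, not the training code of any particular model: the linear noise schedule and the 8x8 "image" are illustrative assumptions, and the closed-form jump to step t follows the standard formulation x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε. The part a real model must learn is the reverse of this, typically by predicting the added noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule over T steps (values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative signal fraction ᾱ_t

def forward_diffuse(x0, t):
    """Jump straight to noise level t: x_t = sqrt(ᾱ_t)·x0 + sqrt(1-ᾱ_t)·ε."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# A stand-in "image": an 8x8 grid of pixel values.
x0 = rng.standard_normal((8, 8))
x_early = forward_diffuse(x0, 10)     # still close to the original
x_late = forward_diffuse(x0, T - 1)   # essentially pure Gaussian noise

print(f"signal remaining at t=10:  {alpha_bar[10]:.4f}")
print(f"signal remaining at t=999: {alpha_bar[T - 1]:.6f}")
```

Note how ᾱ_t shrinks toward zero as t grows: early steps barely perturb the image, while by the final step almost no trace of the original remains, which is exactly why a model that learns to undo each small step can start from random noise and produce a brand-new image.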

Despite their impressive capabilities, text-to-image models are just the beginning. Text-to-video and image-to-video models are also in development. This opens up new avenues for creativity and expression, and it is sure to have a profound impact on the way we tell stories and communicate ideas.


Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. arXiv.

Wiggers, K. (2022, December 22). A brief history of diffusion, the tech at the heart of modern image-generating AI. TechCrunch.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. arXiv.

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., & Taigman, Y. (2022, September 29). Make-A-Video: Text-to-video generation without text-video data. arXiv.

Esser, P., Chiu, J., Atighehchian, P., Granskog, J., & Germanidis, A. (2023). Structure and content-guided video synthesis with diffusion models. arXiv.

Molad, E., Horwitz, E., Valevski, D., Rav Acha, A., Matias, Y., Pritch, Y., Leviathan, Y., & Hoshen, Y. (2023). Dreamix: Video diffusion models are general video editors. arXiv.


