Stability AI announced the release of a preliminary version of its flagship artificial intelligence (AI) model, Stable Diffusion 3.0. Engineered to generate images from text-based descriptions, the model will ship in several versions built on neural networks ranging in size from 800 million to 8 billion parameters.
Over the past year, Stability AI has been refining and launching a series of neural networks of growing complexity and quality. The release of SDXL in July significantly improved the base Stable Diffusion model, and the company aims to build on that progress with 3.0.
Stable Diffusion 3.0 aims to deliver enhanced image quality and better performance on complex prompts. It offers significantly improved typography, allowing more accurate rendering of text within generated images, an area where previous Stable Diffusion versions and other AI image generators had fallen short.
Stable Diffusion 3.0 is more than just an upgraded version of its predecessor from Stability AI – it’s built on a new architecture. “Stable Diffusion 3 is a diffusion transformer model, a new type of architecture that is similar to that used in the recently presented OpenAI Sora,” explained Emad Mostaque, CEO of Stability AI. “This is the true successor to the original Stable Diffusion.”
Stability AI is constantly experimenting with various image creation approaches. Earlier this month, they launched a preliminary version of Stable Cascade, leveraging the Würstchen architecture to enhance performance and accuracy. However, Stable Diffusion 3.0 adopts a different method, using diffusion transformer models.
Transformers underpin most contemporary neural networks and have driven the recent revolution in AI; they are the standard backbone for text generation, while image generation has largely been led by diffusion models. The academic paper introducing diffusion transformers (DiTs) describes them as a new architecture for diffusion models that replaces the widely used U-Net backbone with a transformer operating on latent image patches. According to that work, DiTs use compute more efficiently and outperform other approaches to image diffusion.
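To make the idea concrete, below is a minimal, hypothetical sketch of a single diffusion-transformer block: latent image patches are treated as a sequence of tokens and processed by self-attention and an MLP, modulated by a conditioning embedding. This is not Stability AI's or the DiT paper's code; the layer names and sizes are illustrative assumptions.

```python
# Illustrative sketch of a DiT-style block (assumed names and sizes, not SD3 code).
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One transformer block standing in for a U-Net stage: self-attention + MLP
    over patch tokens, modulated by a timestep/text conditioning embedding."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning is injected as a learned shift/scale of the normalized tokens.
        self.modulation = nn.Linear(dim, 2 * dim)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        shift, scale = self.modulation(cond).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm1(tokens) * (1 + scale) + shift
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens

# Toy usage: a 32x32 latent split into 2x2 patches gives 256 tokens of width 512.
tokens = torch.randn(1, 256, 512)   # patchified latent image
cond = torch.randn(1, 512)          # timestep/text conditioning embedding
out = DiTBlock(dim=512, heads=8)(tokens, cond)
print(out.shape)                    # torch.Size([1, 256, 512])
```

The key design difference from a U-Net is visible here: there are no convolutions or resolution stages, just a uniform stack of attention blocks over latent patches, which is what lets the architecture scale with compute.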
Another significant innovation employed by Stable Diffusion 3.0 is flow matching. Detailed in its own research paper, flow matching is a technique for training continuous normalizing flows (CNFs), with a variant called Conditional Flow Matching (CFM), to model complex data distributions. According to the researchers, using CFM with optimal transport paths results in faster training, more efficient sampling, and better performance compared to diffusion paths.
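As a rough illustration of what that training objective looks like, the sketch below implements the conditional flow-matching loss with straight "optimal transport" paths: a sample is linearly interpolated between noise and data, and a network is trained to regress the constant velocity along that path. The tiny MLP and 2-D toy data are assumptions made only so the example runs; they are not the paper's or Stability AI's code.

```python
# Illustrative conditional flow matching (CFM) with straight OT paths (toy example).
import torch
import torch.nn as nn

# Hypothetical velocity network over 2-D data plus a time input.
velocity_net = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))

def cfm_loss(x1: torch.Tensor) -> torch.Tensor:
    """x1: a batch of data samples; returns the flow-matching regression loss."""
    x0 = torch.randn_like(x1)           # noise sample
    t = torch.rand(x1.shape[0], 1)      # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1          # straight-line (optimal transport) path
    target_velocity = x1 - x0           # constant velocity along that path
    pred = velocity_net(torch.cat([xt, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

# One toy training step on random 2-D data standing in for image latents.
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)
loss = cfm_loss(torch.randn(128, 2))
loss.backward()
optimizer.step()
print(float(loss))
```

Because the target is a simple regression against a straight path, training avoids simulating the full flow, which is where the claimed speed and sampling-efficiency gains over diffusion paths come from.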
Stable Diffusion 3.0’s improved typography is the outcome of multiple enhancements by Stability AI. According to Mostaque, higher-quality text generation was made possible by the diffusion transformer architecture and additional text encoders. With Stable Diffusion 3.0, users can now generate full sentences with coherent text styling.
While Stable Diffusion 3.0 is primarily showcased as a text-to-image technology, it will also serve as a foundation for much more: over the coming months, Stability AI plans to build neural networks for 3D image and video generation on top of it.
“We are creating open models that can be used anywhere and adapted to any needs,” stated Mostaque. “This is a series of models of different sizes, which will serve as the basis for the development of our next-generation visual models, including video, 3D, and much more,” he added.