In a blog post shared with SurgeZirc SA, Meta stated, “With CM3Leon’s capabilities, image generation can be more coherent with imagery that better aligns with the input prompts.”
Most existing image generators, including OpenAI’s DALL-E 2, Google’s Imagen, and Stable Diffusion, rely on a computationally intensive process known as diffusion to create artwork.
In diffusion, a model starts from an image made entirely of noise and progressively removes that noise, moving the image step by step closer to what the prompt describes.
While the results have been impressive, the computational intensity of diffusion makes it expensive and impractical for real-time applications.
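To make the cost argument concrete, here is a toy sketch of an iterative denoising loop in Python. It is not a real diffusion model: the `predicted_noise` line stands in for a trained neural network and simply cheats by looking at the target, and the names `toy_denoise`, `target`, and `steps` are made up for illustration. What it does show is that every denoising step needs one more pass through the model, so the cost grows with the number of steps.

```python
import random


def toy_denoise(target, steps, seed=0):
    """Move a noisy sample toward `target` in `steps` small denoising steps.

    Returns the final sample and the number of (stand-in) model calls made.
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    model_calls = 0

    for step in range(steps):
        # Assumption: an oracle replaces the neural network here. A real
        # diffusion model would predict this noise from the sample and prompt.
        predicted_noise = [s - t for s, t in zip(sample, target)]
        model_calls += 1

        # Subtract a fraction of the predicted noise; the fraction grows
        # so that the last step removes whatever noise is left.
        remaining = steps - step
        sample = [s - n / remaining for s, n in zip(sample, predicted_noise)]

    return sample, model_calls


target = [0.2, 0.8, 0.5, 0.1]  # hypothetical 4-pixel "image"
final_sample, calls = toy_denoise(target, steps=50)
```

In this sketch, 50 steps mean 50 model calls for a single image, which is the kind of repeated work that makes diffusion costly in practice.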
In contrast, CM3Leon is built on a transformer model, which uses a mechanism called “attention” to weigh the relevance of input data such as text or images.
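For readers curious what “attention” means in practice, the following is a minimal sketch of generic scaled dot-product attention, the building block commonly used in transformers. It is not CM3Leon’s implementation, which Meta’s post does not detail here; the function names and the tiny two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Relevance score of each input: dot product of query and key, scaled.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the relevance-weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical embeddings for three inputs (e.g. pieces of text or image).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
```

Here the first and third inputs line up with the query better than the second, so they receive larger weights and contribute more to the output.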
It’s worth noting that OpenAI previously explored transformers for image generation with Image GPT but ultimately abandoned the idea in favor of diffusion, and it is now reportedly considering a move to an approach called “consistency.”
Meta trained CM3Leon using a dataset comprising millions of licensed images from Shutterstock. The most advanced version of CM3Leon boasts over 7 billion parameters, more than twice the number in DALL-E 2.
A key element contributing to CM3Leon’s superior performance is a technique called supervised fine-tuning (SFT). SFT has proven effective in training text-generating models like OpenAI’s ChatGPT. Meta hypothesized that applying SFT to the image domain would yield positive results.
Indeed, SFT not only improved CM3Leon’s image generation capabilities but also enhanced its image captioning skills, enabling it to answer questions about images and edit images based on text instructions, such as changing the color of the sky.
While many image generators struggle with complex objects and text prompts containing numerous constraints, CM3Leon exhibits fewer difficulties in these areas.
Meta showcased examples where CM3Leon successfully generated images based on prompts like:
A small cactus wearing a straw hat and neon sunglasses in the Sahara desert
A close-up photo of a human hand, hand model
A raccoon main character in an Anime preparing for an epic battle with a samurai sword
A stop sign in a Fantasy style with the text “1991”
CM3Leon also demonstrates the ability to understand instructions for editing existing images.
For instance, when given the prompt “Generate a high-quality image of ‘a room that has a sink and a mirror in it’ with a bottle at location (199, 130),” the model produces a visually coherent and contextually appropriate image.
The image includes the room, sink, mirror, and bottle. DALL-E 2, on the other hand, struggles to capture the nuances of such prompts and often omits specified objects entirely.
Furthermore, unlike DALL-E 2, CM3Leon can generate short or long captions and answer questions about specific images.
Meta claims that CM3Leon outperforms specialized image captioning models (e.g., Flamingo, OpenFlamingo) despite receiving less text in its training data.
Addressing concerns regarding bias, Meta acknowledges that CM3Leon may reflect any biases present in the training data.
As the AI industry continues to evolve, Meta emphasizes the importance of transparency in accelerating progress and addressing these challenges.
Given the controversies surrounding open-source art generators, Meta has not specified if or when CM3Leon will be released. Nevertheless, anticipation remains high for the potential impact of this innovative model.