A new image generator reasons over prompts to produce outstanding pictures.
What’s new: Tencent released HunyuanImage-3.0, an image generator fine-tuned via a variety of reinforcement learning methods to apply reasoning to prompts. The company says this reasoning helps the model understand users’ intentions and improves its output.
- Input/output: Text and images in, text and images out (fine-tuned for text in, images out only)
- Architecture: Mixture of experts (MoE) diffusion transformer (80 billion parameters, 13 billion parameters active per token), one VAE, one vision transformer, two vanilla neural network projectors
- Performance: Currently tops LMArena Text-to-Image leaderboard
- Availability: Weights available for commercial and noncommercial use by companies with fewer than 100 million monthly active users, under a Tencent license
- Undisclosed: Input and output size limits; parameter counts of VAE, vision transformer, and projectors; training data; models used for labeling, filtering, and captioning images; reward models
How it works: The authors built a training dataset of paired text and images. They trained the model on image generation via diffusion through several stages and fine-tuned it on text-to-image generation in further stages.
- To produce the dataset, the authors collected 10 billion images. (i) They built models specially trained to measure image clarity and aesthetic quality, and removed images that didn’t make the grade. (ii) They also built models to identify text and named entities such as brands, artworks, and celebrities, and extracted this information from the remaining images. (iii) They fed the images, extracted text, and extracted entities to a captioning model that produced a text caption for each image. (iv) For a subset of the data, they manually annotated chains of thought, producing data that linked text to chains of thought to images. (v) They added text-to-text data and image-text data from unspecified corpora. (A simplified sketch of this curation pipeline follows this list.)
- The authors pretrained the system to generate text and images from the various text and image elements in the dataset. Specifically, for text-to-image tasks: (i) First, the VAE’s encoder embedded an image. (ii) The authors added noise to the embedding. (iii) Given the noisy embedding and a text prompt, the MoE transformer learned to remove the noise. (iv) The VAE’s decoder generated an image from the denoised embedding. (A simplified training-step sketch also follows this list.)
- The authors fine-tuned the system (i) for text-to-image tasks by training it in a supervised fashion to remove noise from human-annotated examples, (ii) via DPO to make it more likely to generate higher-quality images, like the human-annotated examples, than lower-quality ones, (iii) via the reinforcement learning method MixGRPO to encourage the model to generate more aesthetically pleasing images as judged by unspecified reward models, and (iv) via SRPO (another reinforcement learning method) to encourage the model to generate images that better match a text description of desired traits and less closely match a description of undesired traits. While applying SRPO, they also encouraged the model to generate images similar to those in an author-chosen distribution. (A simplified sketch of the DPO objective follows this list as well.)
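The sketch below illustrates the data-curation pipeline described above. Every helper (clarity_model, aesthetic_model, ocr_model, entity_model, captioner) and every threshold is a hypothetical stand-in; Tencent has not disclosed the actual models or filtering criteria.

```python
# Hypothetical sketch of the data-curation pipeline; all helpers and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str
    caption: str

def curate(image_paths, clarity_model, aesthetic_model, ocr_model, entity_model,
           captioner, clarity_min=0.5, aesthetic_min=0.5):
    """Filter images by quality scores, then caption the survivors."""
    dataset = []
    for path in image_paths:
        # (i) Drop images that fail either quality check.
        if clarity_model(path) < clarity_min or aesthetic_model(path) < aesthetic_min:
            continue
        # (ii) Extract text and named entities from the image.
        text = ocr_model(path)
        entities = entity_model(path)
        # (iii) Feed image, text, and entities to a captioning model.
        caption = captioner(path, text=text, entities=entities)
        dataset.append(Example(image_path=path, caption=caption))
    return dataset
```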
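Here is a minimal PyTorch sketch of the text-to-image pretraining step. The module names, the linear noise schedule, and the flow-matching-style objective are assumptions for illustration; the model's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def pretraining_step(vae, moe_transformer, text_encoder, image, prompt_tokens):
    """One denoising training step: encode the image, add noise, learn to remove it."""
    # (i) The VAE's encoder embeds the image into a latent.
    with torch.no_grad():
        latent = vae.encode(image)

    # (ii) Add noise to the latent at a randomly sampled timestep t in (0, 1).
    t = torch.rand(latent.shape[0], device=latent.device).view(-1, 1, 1, 1)
    noise = torch.randn_like(latent)
    noisy_latent = (1.0 - t) * latent + t * noise

    # (iii) Given the noisy latent and the prompt, the MoE transformer predicts
    # the direction that removes the noise (a flow-matching-style target).
    prompt_embedding = text_encoder(prompt_tokens)
    prediction = moe_transformer(noisy_latent, t.flatten(), prompt_embedding)
    loss = F.mse_loss(prediction, noise - latent)

    # (iv) At inference, iteratively applying the model yields a clean latent,
    # and vae.decode(latent) produces the final image.
    return loss
```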
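Finally, a minimal sketch of a DPO-style objective for the preference fine-tuning stage, following Diffusion-DPO in spirit: the model is pushed to assign lower denoising error to the preferred image than to the rejected one, relative to a frozen reference model. The exact loss Tencent used is not disclosed.

```python
import torch.nn.functional as F

def diffusion_dpo_loss(policy_err_win, policy_err_lose,
                       ref_err_win, ref_err_lose, beta=0.1):
    """Each argument is the per-example denoising MSE for the preferred ('win')
    or rejected ('lose') image, under the trainable policy or the frozen reference."""
    # Lower denoising error implies higher implicit likelihood, hence the minus signs.
    policy_margin = -(policy_err_win - policy_err_lose)
    ref_margin = -(ref_err_win - ref_err_lose)
    # Standard DPO objective: maximize the log-sigmoid of the scaled margin difference.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```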
Results: At present, HunyuanImage 3.0 holds first place on the LMArena Text-to-Image leaderboard, ahead of Google Gemini 2.5 Flash Image (Nano Banana), Google Imagen 4.0 Ultra Generate, and ByteDance Seedream 4.0. In addition, 100 people compared 1,000 outputs of 4 competing models to those of HunyuanImage 3.0 in side-by-side contests, judging which image was better or whether the two were equally good or equally poor.
- On average, the people preferred HunyuanImage 3.0’s images over those of the competitors.
- For example, in the comparison with Seedream 4.0, evaluators preferred HunyuanImage 3.0 20.01 percent of the time and Seedream 4.0 18.84 percent of the time; they judged the two images equally good 39.3 percent of the time and equally poor 21.85 percent of the time.
Behind the news: Tencent has been on a streak of releasing vision models.
- Tencent recently launched the API version of Hunyuan-Vision-1.5, its latest vision-language model, with promises to release the weights and a paper soon.
- The company released Hunyuan3D-Omni, a model that takes an image and a rough 3D representation (such as a skeleton or bounding box) and generates a detailed 3D representation.
- It also played a role in the release of FlashWorld, which accepts an image and text prompt and generates a 3D scene.
Why it matters: Simplifying training recipes can be helpful, since each additional stage adds time spent not only training but also debugging, and each additional component can interact with the others in unexpected ways, further extending the time required to debug the system. Yet Tencent used several stages of pretraining and fine-tuning and produced a superior model.
We’re thinking: One key to this success may be to use different methods for different purposes. For instance, the team used MixGRPO to fine-tune the model for aesthetics and SRPO to better match human preferences.
