
We’re training a text-to-image model from scratch and open-sourcing it

Early steps in the model training

We’re training a text-to-image model from scratch and are publishing the code, weights, and research process in the open. The goal is to build something both lightweight and capable: easy to train and fine-tune, but still making use of the latest advances in the field.

We’ll publish the model weights (Hugging Face, diffusers-compatible) under a permissive license, together with other useful resources. More than just weights, we want the process itself to be transparent and reusable: how we trained, what worked, what didn’t, and the little details that usually stay hidden.
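Once released, loading the weights should work like any other diffusers pipeline. Here's a minimal sketch; note that the repo id below is a placeholder we made up for illustration, and the final name, pipeline class, and dtype may differ at release time:

```python
import torch
from diffusers import DiffusionPipeline

# "photoroom/mirage" is a hypothetical repo id for illustration only;
# the actual release name may differ.
pipe = DiffusionPipeline.from_pretrained(
    "photoroom/mirage", torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe("an orange tabby cat gazing at its reflection in a mirror").images[0]
image.save("sample.png")
```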

To that end, we’ll be documenting the journey through blog posts, intermediate releases, and eventually a full ablation study. Our hope is that this becomes not just a strong open model, but also a practical resource for anyone interested in training diffusion models from scratch.

What we’ve done so far

Here’s a sneak peek at where we are right now; we’ll save the deeper technical training details for our next blog post.

For the past few weeks, we’ve been experimenting with a wide range of recent techniques to refine our training recipe:

  • On the architecture side, we tested DiT [1], UViT [2], MMDiT [3], DiT-Air [4], and our own MMDiT-like variant, which we call Mirage.

  • For losses, we tried REPA [5] with both DINOv2 [6] and DINOv3 [7] features, as well as contrastive flow matching (a minimal sketch of a REPA-style alignment loss follows this list).

  • We also integrated different VAEs, including Flux’s [8] and DC-AE [9], and explored recent text embedders such as GemmaT5 [10].

  • In addition, we looked at a few other techniques like uniform RoPE [11], Immiscible Diffusion [12], distillation with LADD [13], and the Muon optimizer [14].

  • Finally, we paid close attention to the training process itself, carefully exploring hyperparameters and implementation details such as EMA and numerical precision.
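To make the REPA idea concrete, here's a minimal sketch of an alignment loss in PyTorch. It assumes the DiT's intermediate tokens and the frozen DINOv2 patch features share the same patch grid; the tapped layer, MLP shape, and loss weight are illustrative assumptions, not our exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class REPAHead(nn.Module):
    """Projects intermediate DiT tokens into the DINOv2 feature space.

    A minimal sketch of a REPA-style alignment loss [5]; the MLP shape
    and the choice of DiT block to tap are illustrative assumptions.
    """

    def __init__(self, dit_dim: int, dino_dim: int, hidden: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dit_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dino_dim),
        )

    def forward(self, dit_tokens: torch.Tensor, dino_tokens: torch.Tensor) -> torch.Tensor:
        # dit_tokens:  (B, N, dit_dim)  tokens from an intermediate DiT block
        # dino_tokens: (B, N, dino_dim) frozen DINOv2 patch features,
        #              assumed to share the same patch grid
        pred = self.proj(dit_tokens)
        # Maximize patch-wise cosine similarity, i.e. minimize its negative.
        return -F.cosine_similarity(pred, dino_tokens, dim=-1).mean()

# The alignment term is simply added to the main objective with a scalar weight:
# total_loss = flow_matching_loss + repa_weight * repa_head(hidden_tokens, dino_feats)
```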

We believe we already have something very exciting on our hands, and we’re eager to show it to you. Here are some early samples from our best checkpoint so far.

A photograph depicts an orange tabby cat standing in a bathroom, gazing at its reflection in a mirror above a sink. The cat is positioned on the right side of the frame, its body angled slightly away from the viewer. Its fur is a rich, warm orange with subtle variations in shade, appearing soft and slightly fluffy. The cat's reflection in the mirror is a near-perfect duplicate, creating a symmetrical composition. The mirror is framed in a simple, light-colored frame, and the background shows a portion of a white-tiled bathroom wall. A white porcelain sink with a chrome double faucet is positioned below the mirror. A small, clear glass bottle with a white label is visible on the counter next to the sink. The label has illegible text. A white towel hangs on the wall behind the mirror's reflection of the cat. The lighting is soft and diffused, creating a calm and serene atmosphere. The overall aesthetic is minimalist and slightly vintage, with muted colors and a focus on natural light. The image has a nostalgic and peaceful vibe. There are no synthetic elements. The image is a single photograph, not a collage or graphic. The text on the bottle is illegible.

A photograph depicting a close-up portrait of a young woman with freckles wearing a vintage-style, bright orange astronaut helmet. The helmet's visor is clear, showing the woman's face and some of her brown hair. Her expression is serious and contemplative. The background is blurred, suggesting a depth of field effect, with muted teal and gray tones. The lighting is soft and diffused, creating a moody atmosphere. The overall aesthetic is retro-futuristic, blending vintage elements with a modern, slightly melancholic feel. The color palette is dominated by the orange of the helmet and the muted background colors. The image has a cinematic quality, with a focus on mood and atmosphere. No synthetic elements are apparent. The style is realistic portraiture with a strong emphasis on lighting and color grading. The vibe is introspective and mysterious. There is no text in the image.

A photograph depicts a rustic roadside fruit stand, partially shaded by a weathered wooden awning. A faded light-blue vintage pickup truck is parked behind the stand, its hood adorned with several pomegranates. The stand overflows with vibrant fruits: mangoes, papayas, oranges, limes, cucumbers, and more, arranged in wooden crates. The lighting is natural, dappled sunlight filtering through leaves, creating a warm, slightly hazy atmosphere. The overall aesthetic is vintage, rustic, and evokes a sense of rural charm. The colours are rich and saturated, with the bright hues of the fruits contrasting against the muted tones of the truck and wood. A simple wooden bench sits in front of the stand. A handwritten sign on the stand reads "SALE $3.50". The font is a simple, sans-serif style, likely hand-painted. The image is composed using a slightly low angle, emphasizing the abundance of fruit and the truck's worn condition. No synthetic elements are apparent. The vibe is relaxed, idyllic, and nostalgic.

A digital painting depicting a skateboarder in motion. The composition is a dynamic, slightly low-angle shot of a male skateboarder wearing a white helmet and olive green t-shirt and jeans, executing a trick. The background is an abstract wash of teal, light green, and beige, suggesting a sunny outdoor setting. The style is painterly, with visible brushstrokes creating a sense of movement and energy. The lighting is bright, with a sunlit feel, highlighting the skateboarder's form against the textured background. The overall aesthetic is vibrant, energetic, and slightly nostalgic, evoking a sense of freedom and youthful rebellion. The colours are muted yet saturated, creating a harmonious balance between the figure and the background. The vibe is active, dynamic, and expressive. There is no text in the image. The image is a digital painting, not a photograph, and contains no synthetic elements beyond those inherent to digital painting.

A photograph depicts a woman wearing a vibrant yellow jumpsuit holding a similarly colored circular plate in front of her torso. The image is a medium shot, focusing on the woman and the plate. The jumpsuit is a solid, bright yellow, with a collared, button-down design and large pockets. The fabric appears to be a lightweight material, possibly cotton or a similar blend. The plate is a simple, round shape, smooth and glossy, matching the jumpsuit's color perfectly. The woman's skin tone is light, and her hair is brown, partially visible around her face. Her expression is neutral, her gaze not directly visible. The background is a clear, unblemished, bright blue sky, suggesting an outdoor setting, possibly a beach or a sunny field. The lighting is natural, bright, and even, casting no harsh shadows, indicating a sunny day. The overall aesthetic is minimalist, with a strong emphasis on color contrast and simplicity. The style is reminiscent of 1970s fashion photography, with a retro vibe. The atmosphere is cheerful and sunny, conveying a sense of warmth and optimism. There are no synthetic elements visible in the image. The image is a single photograph, not a collage or graphic. There is no text in the image.

A photograph depicts a minimalist living room interior. The composition centers on a white sofa with a beige cushion and brown throw pillows. The sofa is positioned against a plain white wall. Natural light streams in from an unseen window, casting a soft shadow on the wall. The floor is whitewashed wood. The overall aesthetic is calm, serene, and modern, with a muted color palette of white, beige, and brown. The image evokes a feeling of tranquility and simplicity. There are no synthetic elements visible. The lighting is soft and diffused, creating a gentle atmosphere. The style is Scandinavian minimalist. The vibe is peaceful and relaxing.

These images come from a 1.2B-parameter Mirage model, trained for 1.4M steps at 256-pixel resolution in under 9 days on 64 H200s. This particular checkpoint uses REPA [5] with DINOv2 [6] features, Flux’s VAE [8], and GemmaT5 [10] as the text embedder. Finally, we distilled it with LADD [13] so it can generate in 4 steps. Below are animations built from every checkpoint, using the same prompt and seed, so you can see the model’s progression from scratch to the final result.
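To give a feel for what 4-step generation means in practice, here's a minimal Euler sampler for a flow-matching model. It follows the common convention that t = 1 is pure noise and the network predicts a velocity field; the model signature, latent shape, and uniform schedule are assumptions, and our distilled sampler may differ:

```python
import torch

@torch.no_grad()
def sample_flow(model, text_emb, steps: int = 4, shape=(1, 16, 32, 32)):
    """Few-step Euler sampling for a flow-matching model (a sketch under
    common conventions; not necessarily our exact distilled sampler)."""
    x = torch.randn(shape)                    # start from pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1)  # integrate from noise to data
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v = model(x, t, text_emb)             # predicted velocity at (x, t)
        x = x + (ts[i + 1] - ts[i]) * v       # Euler step (dt is negative)
    return x                                  # latents, decoded by the VAE afterwards
```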

The plot below shows how training went for this model. We started with REPA (blue), which helped the model converge faster. Disabling it later in training (orange) led to a further drop in the validation loss (we’re running more experiments to better understand REPA’s impact, but switching it off after convergence seems to help fine-tune the loss). Finally, we enabled an exponential moving average of the weights (green), which we integrated into the codebase at that stage of training, and saw a positive impact.
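For reference, a weight EMA takes only a few lines. Here's a minimal sketch (the decay value and update schedule are illustrative, and buffers are omitted for brevity):

```python
import copy
import torch

class EMA:
    """Exponential moving average of model weights (a minimal sketch;
    our decay value and update schedule may differ)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.decay)  # s <- decay * s + (1 - decay) * p
```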

We'll talk more about this and other experiments in our next blog post, which will come with the first model release.

What’s next?

This wraps up the first of a series of research blog posts and releases we’re planning. There’s still plenty in the pipeline:

  • We’ve just launched training at 512-pixel resolution with both Flux’s VAE [8] and DC-AE [9], as well as distillation for all our 256-pixel-resolution models.

  • We’ve started exploring preference alignment: we’re currently looking at supervised fine-tuning and DPO [15], and are ready to launch a training run (see the sketch after this list).

  • We’re sketching out the roadmap of upcoming experiments and considering other recent techniques.

  • In parallel, we’re preparing the first release in Diffusers and Hugging Face (coming shortly), while documenting everything along the way for the ablation study.
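As mentioned in the preference-alignment item above, here's a rough sketch of what a Diffusion-DPO-style objective [15] could look like on preferred/rejected latent pairs. The `noised` forward-diffusion helper, the epsilon-prediction convention, and the beta value are all assumptions for illustration, not confirmed details of our setup:

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x_w, x_l, noise, t, cond, beta=2500.0):
    """Sketch of a Diffusion-DPO-style objective [15] on preferred (x_w)
    and rejected (x_l) latents, with a frozen reference model."""

    def err(net, x0):
        x_t = noised(x0, noise, t)  # assumed helper: add noise at timestep t
        # Per-sample L2 error of the noise prediction.
        return (net(x_t, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))

    # How much better the policy fits each sample than the frozen reference.
    w_diff = err(model, x_w) - err(ref_model, x_w)
    l_diff = err(model, x_l) - err(ref_model, x_l)
    # Reward improving on the preferred sample relative to the rejected one.
    return -F.logsigmoid(-beta * (w_diff - l_diff)).mean()
```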

Stay tuned for the next blog post, which will include the first model release and the details of how we trained it.

Interested in contributing?

We’ve set up a Discord server (join here!) for more regular updates and discussion with the community. Join us there if you’d like to follow progress more closely or talk through details.

If you’re interested in contributing, you can either message us on Discord or email jon@photoroom.com. We’d be glad to have more people involved.

The team

This project is the result of contributions from across the team in engineering, data, and research: David Bertoin, Quentin Desreumaux, Roman Frigg, Simona Maggio, Lucas Gestin, Marco Forte, David Briand, Thomas Bordier, Matthieu Toulemont, Benjamin Lefaudeux, and Jon Almazán. We’re hiring for senior roles!

References

[1] Peebles et al. Scalable Diffusion Models with Transformers

[2] Bao et al. All are Worth Words: A ViT Backbone for Diffusion Models

[3] Esser et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

[4] Chen et al. DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

[5] Yu et al. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

[6] Oquab et al. DINOv2: Learning Robust Visual Features without Supervision

[7] Siméoni et al. DINOv3

[8] Black Forest Labs, FLUX

[9] Chen et al. Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

[10] Dua et al. EmbeddingGemma

[11] Jerry Xiong. On N-dimensional Rotary Positional Embeddings

[12] Li et al. Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment

[13] Sauer et al. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

[14] Jordan et al. Muon: An optimizer for hidden layers in neural networks

[15] Rafailov et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Jon Almazán, Research Scientist