In the realm of artificial intelligence, a significant advance is reshaping how machines comprehend and generate interconnected sensory data. Researchers have unveiled Emu3, a next-generation multimodal model designed to bridge language, vision, and action through next-token prediction. The architecture merges text, images, video, and actions into one cohesive framework, marking a shift in how large multimodal systems are built.
At the heart of Emu3 lies a unified tokenizer system that translates words, images, and videos into a shared discrete token space. Text is encoded with byte-pair encoding (BPE), while a vector quantization (VQ)-based visual tokenizer compresses high-resolution images and video into manageable token sequences. This shared vocabulary lets the model treat textual and visual information equivalently, supporting autoregressive prediction while preserving modality-specific detail.
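In effect, both modalities end up as integer ids in one flat vocabulary. The sketch below shows one way such a shared token space could be laid out; the offsets and helper functions are illustrative assumptions rather than the published implementation, with the vocabulary sizes taken from the figures reported for Emu3.

```python
# Illustrative sketch of a shared text + vision token space (assumed layout).
# Text tokens from a BPE tokenizer and codebook indices from a VQ visual
# tokenizer are mapped into one flat vocabulary by offsetting the visual ids.

TEXT_VOCAB_SIZE = 151_000      # regular BPE text tokens (approximate, per article)
SPECIAL_TOKENS = 211           # template-control tokens (per article)
VISUAL_CODEBOOK_SIZE = 32_768  # VQ codebook entries (per article)

VISION_OFFSET = TEXT_VOCAB_SIZE + SPECIAL_TOKENS

def text_ids_to_shared(bpe_ids):
    """Text token ids already occupy the low range of the shared vocabulary."""
    return list(bpe_ids)

def vision_ids_to_shared(codebook_ids):
    """Shift VQ codebook indices past the text range so both modalities
    can be predicted by a single softmax over one vocabulary."""
    return [VISION_OFFSET + i for i in codebook_ids]

# A multimodal sample is then just one interleaved id sequence:
sample = text_ids_to_shared([12, 873, 44]) + vision_ids_to_shared([5, 4096, 31999])
print(len(sample), max(sample) < VISION_OFFSET + VISUAL_CODEBOOK_SIZE)
```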
The textual tokenizer is built on the Qwen tokenizer architecture, leveraging byte-level BPE and encompassing over 151,000 regular text tokens supplemented by 211 special tokens reserved for template control. In parallel, the visual tokenizer operates atop SBER-MoVQGAN technology, encoding 512×512 pixel images or equivalent video clips into 4,096 discrete tokens from a vast codebook containing 32,768 entries. Innovations such as temporal residual layers with 3D convolution kernels allow sophisticated temporal and spatial downsampling, enabling the model to process diverse video resolutions and durations efficiently.
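As a quick sanity check on these figures, the arithmetic below assumes an 8× spatial downsampling factor; the article gives only the input size, the token count and the codebook size.

```python
# Rough arithmetic behind "512×512 pixels → 4,096 discrete tokens",
# assuming an 8× spatial downsampling factor (an assumption; the article
# only reports the input size, token count and codebook size).

import math

H = W = 512
downsample = 8                       # assumed spatial reduction per side
codebook_size = 32_768               # per article

tokens = (H // downsample) * (W // downsample)
bits_per_token = math.log2(codebook_size)

print(tokens)                        # 4096, matching the reported figure
print(bits_per_token)                # 15 bits per token
print(tokens * bits_per_token / 8)   # ≈7.7 kB of tokens vs ≈786 kB of raw 8-bit RGB
```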
Emu3’s core structure employs a decoder-only Transformer design containing approximately 8.5 billion parameters distributed across 32 layers. Using RMSNorm for normalization, grouped-query attention (GQA), SwiGLU activation and rotary positional embeddings, the architecture is optimized to integrate vision and language representations into one harmonized stream. The shared multimodal vocabulary allows consistent interpretation of input across sensory domains, creating a robust foundation for complex multimodal reasoning.
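For a concrete sense of that scale, the hedged configuration sketch below lands near 8.5 billion parameters. Only the 32 layers, the roughly 8.5-billion-parameter scale and the named components come from the article; every other number is an illustrative placeholder.

```python
# Hedged configuration sketch for a decoder-only multimodal Transformer of
# roughly this size. Layer count, parameter scale and the listed components
# (RMSNorm, GQA, SwiGLU, rotary embeddings, shared vocabulary) follow the
# article; all concrete sizes below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class MultimodalDecoderConfig:
    vocab_size: int = 184_622        # assumed total: text + special + VQ codebook tokens
    hidden_size: int = 4096          # assumed
    num_layers: int = 32             # per article
    num_attention_heads: int = 32    # assumed
    num_key_value_heads: int = 8     # grouped-query attention: fewer K/V heads than query heads
    intermediate_size: int = 14_336  # SwiGLU feed-forward width (assumed)
    norm: str = "rmsnorm"
    activation: str = "swiglu"
    positional_encoding: str = "rope"

def approx_param_count(cfg: MultimodalDecoderConfig) -> float:
    """Coarse parameter estimate in billions: embeddings plus per-layer blocks."""
    embed = 2 * cfg.vocab_size * cfg.hidden_size          # input embedding + untied output head
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    attn = cfg.hidden_size * (
        cfg.hidden_size                                   # query projection
        + 2 * cfg.num_key_value_heads * head_dim          # shared key/value projections (GQA)
        + cfg.hidden_size                                 # output projection
    )
    ffn = 3 * cfg.hidden_size * cfg.intermediate_size     # SwiGLU uses three weight matrices
    return (embed + cfg.num_layers * (attn + ffn)) / 1e9

print(f"≈{approx_param_count(MultimodalDecoderConfig()):.1f}B parameters")  # ≈8.5B
```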
To rigorously evaluate model design choices, Emu3 was pitted against leading architectures, including diffusion models and encoder-plus-large-language-model (LLM) composites. When compared with a diffusion transformer trained on massive datasets, Emu3’s next-token prediction model demonstrated superior convergence speed, challenging assumptions that diffusion paradigms inherently excel in visual generation. Similarly, in head-to-head tests against late-fusion LLaVA-style models—both with and without pretraining—Emu3 matched or exceeded performance, confirming that decoder-only models trained from scratch can rival hybrid architectures without reliance on pretrained visual encoders.
Training such a versatile system demanded careful data curation and scheduling. Emu3 was pretrained from scratch on a mixture of language, image, and video data curated for quality and diversity. Visual inputs were resized, preserving aspect ratios, to spatial scales near 512×512 pixels so that tokenization remained uniform. Training followed a three-stage curriculum: an initial phase aimed at rapid convergence with longer sequences and no dropout, a stability-focused phase that introduced dropout regularization, and a final expansion to ultra-long sequences accommodating video data alongside text.
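The resizing step can be pictured as a small helper that scales each image or frame toward a 512×512-pixel budget while keeping its aspect ratio. The snap-to-multiple-of-8 rounding below is an assumption made so the result divides evenly under the visual tokenizer's downsampling; the article does not specify the exact procedure.

```python
# Minimal sketch of aspect-ratio-preserving resizing toward a ~512×512 budget.
# The rounding to multiples of 8 is an assumption tied to the assumed spatial
# downsampling factor of the VQ tokenizer, not a published detail.

def resize_for_tokenizer(height: int, width: int,
                         target_area: int = 512 * 512,
                         multiple: int = 8) -> tuple[int, int]:
    """Scale (height, width) so its area is near target_area, keep the aspect
    ratio, and snap both sides to a multiple of the tokenizer's stride."""
    scale = (target_area / (height * width)) ** 0.5
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    return new_h, new_w

print(resize_for_tokenizer(1080, 1920))  # a 16:9 frame maps to (384, 680)
```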
Post-training strategies further refined Emu3’s capabilities. For text-to-image (T2I) generation, the model underwent quality-focused fine-tuning using human preference scores aggregated from multiple evaluative metrics, enabling it to produce sharper, more aesthetically appealing high-resolution images. Direct Preference Optimization (DPO) was then applied to align outputs more closely with human taste: annotators ranked generated images, and the model was iteratively fine-tuned toward the preferred ones, balancing fidelity and diversity.
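The DPO stage follows the now-standard preference objective. The sketch below, written in plain PyTorch with assumed inputs, shows how annotator-chosen and rejected generations are compared against a frozen reference model; it illustrates the technique rather than reproducing the authors' training code.

```python
# Standard DPO loss as a minimal sketch. Inputs are assumed to be per-sequence
# log-probabilities of the chosen and rejected generations under the current
# policy and under a frozen reference copy of the model.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,    # log p_theta(y_chosen | x)
             policy_rejected_logp: torch.Tensor,  # log p_theta(y_rejected | x)
             ref_chosen_logp: torch.Tensor,       # log p_ref(y_chosen | x)
             ref_rejected_logp: torch.Tensor,     # log p_ref(y_rejected | x)
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer annotator-chosen outputs over rejected ones,
    regularized toward the reference model by beta."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

# Toy example with a batch of 4 preference pairs (random values, illustration only):
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss.item())
```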
Extending beyond still images, Emu3 was scaled to generate coherent and temporally consistent video sequences. Video fine-tuning incorporated stringent quality and motion filters on curated five-second clips, with sequence lengths exceeding 130,000 tokens. The model’s video outputs were quantitatively assessed across 16 key dimensions, including semantic fidelity and subject-background coherence, demonstrating landmark progress in controllable and believable video synthesis.
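A rough token budget shows why five-second clips reach such lengths. The frame rate, frame size and downsampling factors below are assumptions; only the clip duration and the more-than-130,000-token figure come from the article.

```python
# Back-of-the-envelope token budget for a five-second clip. Frame rate, frame
# size and the assumed 4× temporal / 8× spatial downsampling are illustrative;
# the article states only the clip length and the >130,000-token sequences.

fps, seconds = 24, 5                 # assumed frame rate
frames = fps * seconds               # 120 raw frames
temporal_downsample = 4              # assumed (temporal residual layers)
spatial_tokens = (512 // 8) ** 2     # 4,096 tokens per latent frame, as for images

video_tokens = (frames // temporal_downsample) * spatial_tokens
print(video_tokens)                  # 122,880 visual tokens before any text or control tokens
```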
Vision–language understanding tasks also showcased Emu3’s versatility. A two-stage post-training regimen first integrated large-scale image–text pair data while masking the loss on vision tokens, so that training concentrated on text prediction, followed by instruction tuning on millions of question–answer pairs about visual inputs. This multimodal fine-tuning improved performance on image-understanding benchmarks, highlighting the model’s capacity for reasoning and dialogue grounded in visual context.
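The masked vision-token loss can be read as a per-position weighting of the usual cross-entropy, so that only text targets are supervised. The sketch below uses assumed tensor shapes and an assumed boundary between text and vision ids in the shared vocabulary.

```python
# Sketch of masking the loss on vision tokens so only text positions are
# supervised. Shapes and the text/vision id boundary are assumptions; this is
# not the authors' code.

import torch
import torch.nn.functional as F

VISION_ID_START = 151_211  # assumed: ids at or above this are VQ visual tokens

def text_only_loss(logits: torch.Tensor,   # (batch, seq_len, vocab)
                   targets: torch.Tensor   # (batch, seq_len) token ids
                   ) -> torch.Tensor:
    """Cross-entropy over the full shared vocabulary, averaged only over
    positions whose target is a text token; vision targets contribute nothing."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    text_mask = (targets < VISION_ID_START).float()
    return (per_token * text_mask).sum() / text_mask.sum().clamp(min=1.0)

# Toy example: batch of 2, sequence of 6, vocabulary of 200,000 (illustrative sizes).
logits = torch.randn(2, 6, 200_000)
targets = torch.randint(0, 200_000, (2, 6))
print(text_only_loss(logits, targets).item())
```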
Emu3 also supports interleaved image–text generation, in which structured text instructions are augmented with inline illustrative images within a single coherent output stream. Fine-tuning for this format reinforces the model’s potential for producing explanatory content that combines modalities naturally, a foundation for applications that demand rich multimodal communication, such as educational tools and interactive storytelling.
Going further, the model has been adapted for vision–language–action tasks relevant to robotics. Fine-tuned on the CALVIN benchmark, which simulates long-horizon, language-conditioned robot manipulation, Emu3 ingests alternating visual observations and discrete action tokens, predicting sequences of perception and action. This integration underscores the model’s capacity to serve as a unified controller for agents that must continuously interpret sensory input and make decisions across perceptual and motor domains.
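One way to picture this formatting is as a single interleaved stream of instruction, observation and action tokens. The sketch below is purely illustrative; the special-token names and block sizes are assumptions rather than the actual CALVIN fine-tuning format.

```python
# Illustrative layout of an interleaved perception-action sequence for
# language-conditioned manipulation. Token names, block sizes and the action
# discretization are assumptions; only the overall interleaving of observation
# and action tokens follows the description above.

def build_episode_sequence(instruction_tokens: list[int],
                           observations: list[list[int]],  # VQ tokens per observed frame
                           actions: list[list[int]]         # discretized action tokens per step
                           ) -> list[str | int]:
    """Flatten an episode into one autoregressive stream:
    [instruction] ([obs_t] [act_t])* so the model predicts each action (and
    the next observation) conditioned on the full history."""
    assert len(observations) == len(actions)
    stream: list[str | int] = ["<instruction>", *instruction_tokens, "</instruction>"]
    for obs, act in zip(observations, actions):
        stream += ["<obs>", *obs, "</obs>", "<act>", *act, "</act>"]
    return stream

# Toy episode: 2 timesteps with tiny token blocks for readability.
seq = build_episode_sequence([101, 102, 103],
                             observations=[[9001, 9002], [9003, 9004]],
                             actions=[[7001], [7002]])
print(seq)
```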
Though real-world deployment on physical robots remains a future goal, Emu3’s ability to model complex interleaved perception–action sequences without bespoke modules signals a paradigm shift. Its autoregressive formulation naturally supports conditioning on arbitrarily long histories and enables recovery from partial or noisy inputs, key capabilities for robust, real-time robotic operation under uncertain sensory conditions.
Emu3’s development marks a milestone in multimodal AI research, demonstrating that next-token prediction frameworks can unify language, vision, and action at scale. It challenges the prior dominance of diffusion and hybrid encoder–LLM architectures, showing that single-stream Transformers trained from scratch can match or exceed them in efficiency and versatility. With advances in tokenization, model design, and fine-tuning strategies, the creators set a new standard for building large-scale, unified multimodal models, with broad implications for AI-assisted creativity, robotics, and interactive intelligence.
As the frontier of multimodal AI continues to expand, Emu3 exemplifies how integrative architectures enable more fluid and contextual machine understanding of the world. The ability to process and generate intertwined streams of text, imagery, video, and actions without switching modalities paves the way toward truly generalized artificial intelligence systems that learn, adapt, and interact seamlessly with humans and their environments.
Subject of Research: Multimodal learning architectures combining language, vision, and action modalities through next-token prediction.
Article Title: Multimodal learning with next-token prediction for large multimodal models.
Article References:
Wang, X., Cui, Y., Wang, J. et al. Multimodal learning with next-token prediction for large multimodal models. Nature (2026). https://doi.org/10.1038/s41586-025-10041-x
Image Credits: AI Generated
DOI: https://doi.org/10.1038/s41586-025-10041-x
Keywords: Emu3, multimodal models, next-token prediction, unified tokenizer, vector quantization, multimodal Transformer, vision-language-action, text-to-image, text-to-video, robotic manipulation, decoder-only architecture, direct preference optimization
Tags: autoregressive prediction capabilities, byte-pair encoding, Emu3 architecture, high-resolution visual content compression, merging textual and visual inputs, multimodal AI models, next-token prediction, Qwen tokenizer architecture, SBER-MoVQGAN technology, temporal residual layers in AI, unified tokenizer system, vector quantization visual tokenizer