Next-Token Prediction Powers Large Multimodal Models

By Bioengineer
January 29, 2026
in Technology
Reading Time: 4 mins read

In the realm of artificial intelligence, a groundbreaking advance is reshaping how machines comprehend and generate interconnected sensory data. Researchers have unveiled Emu3, a next-generation multimodal model designed to bridge language, vision, and action seamlessly through next-token prediction. The architecture merges textual and visual inputs into one cohesive framework, casting generation and understanding across modalities as a single sequence-prediction problem.

At the heart of Emu3 lies a unified tokenizer system that translates words, images, and videos into a shared discrete token space. Byte-pair encoding (BPE) handles textual data, while a vector quantization (VQ)-based visual tokenizer compresses high-resolution images and video into manageable token sequences. This shared vocabulary enables the model to treat textual and visual information equivalently, enhancing its autoregressive prediction capabilities without losing modality-specific nuances.
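
To make the shared-vocabulary idea concrete, the sketch below (not the released Emu3 code) shows how text tokens and visual tokens can be mapped into one integer id space so that a single autoregressive model predicts both. The tokenizer stubs and helper names are hypothetical; only the vocabulary sizes echo figures reported in the article.

```python
# Minimal sketch of a shared discrete token space for text and images.
# The tokenizer functions are stand-ins, not the actual Emu3 tokenizers.

TEXT_VOCAB_SIZE = 151_000 + 211        # ~151k regular text tokens + 211 special tokens (reported)
VISUAL_CODEBOOK_SIZE = 32_768          # VQ codebook entries (reported)

def encode_text(text: str) -> list[int]:
    """Stand-in for a byte-level BPE tokenizer; returns ids below TEXT_VOCAB_SIZE."""
    return [hash(token) % TEXT_VOCAB_SIZE for token in text.split()]

def encode_image(image) -> list[int]:
    """Stand-in for a VQ visual tokenizer mapping a 512x512 image to 4,096 codebook indices."""
    return [i % VISUAL_CODEBOOK_SIZE for i in range(4096)]

def build_multimodal_sequence(prompt: str, image) -> list[int]:
    # Visual ids are shifted past the text vocabulary so both modalities share one id space.
    visual_offset = TEXT_VOCAB_SIZE
    return encode_text(prompt) + [visual_offset + v for v in encode_image(image)]

sequence = build_multimodal_sequence("a photo of a red bicycle", image=None)
print(len(sequence), max(sequence) < TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE)
```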

The textual tokenizer is built on the Qwen tokenizer architecture, leveraging byte-level BPE and encompassing over 151,000 regular text tokens supplemented by 211 special tokens reserved for template control. In parallel, the visual tokenizer operates atop SBER-MoVQGAN technology, encoding 512×512 pixel images or equivalent video clips into 4,096 discrete tokens from a vast codebook containing 32,768 entries. Innovations such as temporal residual layers with 3D convolution kernels allow sophisticated temporal and spatial downsampling, enabling the model to process diverse video resolutions and durations efficiently.
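
The vector-quantization step itself can be pictured as a nearest-neighbour lookup: each continuous latent vector from the image encoder is replaced by the index of its closest codebook entry. The embedding dimension and grid size below are toy values; the actual tokenizer is built on SBER-MoVQGAN and adds temporal residual layers for video, which this sketch does not attempt to model.

```python
# Toy vector quantization: map continuous latents to discrete codebook indices.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32_768, 8))   # 32,768 codebook entries; toy embedding dim of 8
latents = rng.normal(size=(16 * 16, 8))   # small toy grid of encoder features (real model: 4,096 tokens per image)

# Nearest entry per latent, via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
d2 = (latents**2).sum(1, keepdims=True) - 2 * latents @ codebook.T + (codebook**2).sum(1)
visual_tokens = d2.argmin(axis=1)          # discrete visual token ids
print(visual_tokens.shape, int(visual_tokens.min()), int(visual_tokens.max()))
```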

Emu3’s core structure employs a decoder-only Transformer design, containing approximately 8.5 billion parameters distributed across 32 layers. Using RMSNorm for normalization, grouped-query attention (GQA), SwiGLU activation, and rotary positional embeddings, the architecture is optimized to integrate vision and language representations into one harmonized stream. The shared multimodal vocabulary facilitates consistent interpretation of input across different sensory domains, creating a robust foundation for complex multimodal reasoning.
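
Written as a plain configuration, the reported design choices look roughly like the sketch below. Only the items marked "reported" come from the article; the hidden size, head counts, and GQA grouping are not given and the values shown are placeholders.

```python
# Hedged configuration sketch of an Emu3-like decoder-only Transformer.
from dataclasses import dataclass

@dataclass
class Emu3LikeConfig:
    num_layers: int = 32                  # reported
    vocab_size: int = 151_211 + 32_768    # shared text + visual vocabulary (approximate)
    hidden_size: int = 4096               # assumed placeholder
    num_attention_heads: int = 32         # assumed placeholder
    num_kv_heads: int = 8                 # assumed placeholder (grouped-query attention)
    norm: str = "rmsnorm"                 # reported
    activation: str = "swiglu"            # reported
    positional_encoding: str = "rope"     # reported

print(Emu3LikeConfig())
```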

To rigorously evaluate model design choices, Emu3 was pitted against leading architectures, including diffusion models and encoder-plus-large-language-model (LLM) composites. When compared with a diffusion transformer trained on massive datasets, Emu3’s next-token prediction model demonstrated superior convergence speed, challenging assumptions that diffusion paradigms inherently excel in visual generation. Similarly, in head-to-head tests against late-fusion LLaVA-style models—both with and without pretraining—Emu3 matched or exceeded performance, confirming that decoder-only models trained from scratch can rival hybrid architectures without reliance on pretrained visual encoders.

Training such a versatile system demanded innovative data curation and scheduling. Emu3 was pretrained from scratch on a carefully constructed mixture of language, image, and video data curated for quality and diversity. Resizing of visual inputs adhered to consistent aspect ratios and spatial scales near 512×512 pixels, ensuring uniformity during tokenization. A dedicated curriculum spanning three stages was utilized: initial rapid convergence with longer sequences and no dropout, followed by a stability-focused phase introducing dropout regularization, and finally an expansion to ultra-long sequences accommodating video data alongside text.
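
For clarity, the three-stage curriculum can be written out as a simple schedule. Only the ordering, the no-dropout first stage, and the move to ultra-long video-bearing sequences come from the article; the dropout rate and sequence-length labels are illustrative placeholders.

```python
# Illustrative summary of the described three-stage pretraining curriculum.
pretraining_curriculum = [
    {"stage": 1, "focus": "rapid convergence",    "dropout": 0.0, "context": "long sequences"},
    {"stage": 2, "focus": "training stability",   "dropout": 0.1, "context": "long sequences"},       # rate assumed
    {"stage": 3, "focus": "video alongside text", "dropout": 0.1, "context": "ultra-long sequences"}, # rate assumed
]
for stage in pretraining_curriculum:
    print(stage)
```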

Post-training strategies further refined Emu3’s capabilities. For text-to-image (T2I) generation, the model underwent quality-focused fine-tuning using human preference scores aggregated from multiple evaluative metrics, enabling it to produce sharper and more aesthetically appealing high-resolution images. Additionally, Direct Preference Optimization (DPO) was deployed to align outputs more closely with human taste, involving annotator-guided ranking of generated images and iterative fine-tuning against chosen preferences, balancing fidelity and diversity.
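
Direct Preference Optimization has a standard objective, and sketching it shows what the annotator-guided rankings feed into: the policy is pushed to raise the likelihood of preferred images relative to rejected ones, measured against a frozen reference model. This is the generic DPO formula, not Emu3's training code, and the values below are toy numbers.

```python
# Generic DPO loss over a batch of (chosen, rejected) preference pairs.
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Mean DPO loss: -log(sigmoid(beta * preference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-beta * margin))))

# Toy sequence log-probabilities for two preference pairs.
loss = dpo_loss(np.array([-120.0, -98.0]), np.array([-130.0, -95.0]),
                np.array([-125.0, -99.0]), np.array([-128.0, -97.0]))
print(round(float(loss), 4))
```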

Extending beyond still images, Emu3 was scaled to generate coherent and temporally consistent video sequences. Video fine-tuning incorporated stringent quality and motion filters on curated five-second clips, with sequence lengths exceeding 130,000 tokens. The model’s video outputs were quantitatively assessed across 16 key dimensions, including semantic fidelity and subject-background coherence, demonstrating landmark progress in controllable and believable video synthesis.

Vision–language understanding tasks also showcased Emu3’s versatility. A two-stage post-training regimen first integrated large-scale image-text pair data with masked vision token losses to focus learning on text prediction, followed by instruction tuning on millions of question-answer pairs relating to visual inputs. This multimodal fine-tuning enhanced performance on image understanding benchmarks, highlighting the model’s capacity for reasoning and dialogue grounded in visual context.
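
The masked vision-token loss can be pictured as an ordinary next-token loss in which only text positions contribute: the model still conditions on the interleaved visual tokens, but they are excluded from the objective. The sketch below assumes per-position log-probabilities are already available and is illustrative only.

```python
# Text-only loss over an interleaved sequence: vision positions are masked out.
import numpy as np

def text_only_loss(token_logprobs, is_vision_token):
    """Average negative log-likelihood over text positions only."""
    text_mask = ~is_vision_token
    return -np.sum(token_logprobs * text_mask) / max(int(text_mask.sum()), 1)

logp = np.array([-0.2, -1.3, -0.7, -2.0])       # model log-prob of each gold next token
vision = np.array([False, True, True, False])   # True where the target is a visual token
print(text_only_loss(logp, vision))             # averages only the two text positions
```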

Innovatively, Emu3 supports interleaved image-text generation, where structured text instructions are augmented with inline illustrative images within a single coherent output stream. Fine-tuning for this complex format reinforces the model’s potential for producing explanatory content combining modalities naturally—foundational for applications demanding rich, multimodal communication such as educational tools and interactive storytelling.

Going even further, the model has been adapted for vision–language–action tasks relevant to robotics. By fine-tuning on the CALVIN benchmark, which simulates long-horizon, language-conditioned robot manipulation, Emu3 ingests visual observations and discrete action tokens alternately, predicting sequences of perception and actions accurately. This integration underscores the model’s capacity to serve as a unified controller for agents requiring continuous interpretation and decision-making across sensory and motor domains.
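
A hypothetical sketch of how such interleaving might look in sequence form follows: language tokens first, then alternating blocks of visual-observation tokens and discretized action tokens, all in one autoregressive stream. The block sizes and the action discretization are assumptions for illustration; the article states only that observations and action tokens alternate and are predicted in sequence.

```python
# Illustrative construction of an interleaved perception-action token sequence.
def build_control_sequence(instruction_tokens, observation_blocks, action_blocks):
    sequence = list(instruction_tokens)
    for obs, act in zip(observation_blocks, action_blocks):
        sequence += list(obs)   # visual tokens for the current observation
        sequence += list(act)   # discrete tokens encoding the next action
    return sequence

seq = build_control_sequence(
    instruction_tokens=[1, 2, 3],                 # e.g. "open the drawer", tokenized
    observation_blocks=[[100, 101], [102, 103]],  # toy per-frame visual tokens
    action_blocks=[[900], [901]],                 # toy discretized actions
)
print(seq)
```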

Though real-world deployment on physical robots remains a future goal, Emu3’s ability to model complex interleaved perception–action sequences without bespoke modules signals a paradigm shift. Its autoregressive formulation naturally supports conditioning on arbitrarily long histories and enables recovery from partial or noisy inputs, key capabilities for robust, real-time robotic operation under uncertain sensory conditions.

Emu3’s development marks a milestone in multimodal AI research, demonstrating that next-token prediction frameworks can unify language, vision, and actions at scale. It challenges prior dominance of diffusion and hybrid encoder-LLM architectures, proving that single-stream Transformers trained from scratch can deliver superior efficiency and versatility. With their advances in tokenization, model design, and fine-tuning strategies, the creators set a new standard for building large-scale, unified multimodal models with broad implications across AI-assisted creativity, robotics, and interactive intelligence.

As the frontier of multimodal AI continues to expand, Emu3 exemplifies how integrative architectures enable more fluid and contextual machine understanding of the world. The ability to process and generate intertwined streams of text, imagery, video, and actions without switching modalities paves the path toward truly generalized artificial intelligence systems that learn, adapt, and interact seamlessly with humans and their environments.

Subject of Research: Multimodal learning architectures combining language, vision, and action modalities through next-token prediction.

Article Title: Multimodal learning with next-token prediction for large multimodal models.

Article References:
Wang, X., Cui, Y., Wang, J. et al. Multimodal learning with next-token prediction for large multimodal models. Nature (2026). https://doi.org/10.1038/s41586-025-10041-x

Image Credits: AI Generated

DOI: https://doi.org/10.1038/s41586-025-10041-x

Keywords: Emu3, multimodal models, next-token prediction, unified tokenizer, vector quantization, multimodal Transformer, vision-language-action, text-to-image, text-to-video, robotic manipulation, decoder-only architecture, direct preference optimization

Tags: autoregressive prediction capabilities, byte-pair encoding, Emu3 architecture, high-resolution visual content compression, merging textual and visual inputs, multimodal AI models, next-token prediction, Qwen tokenizer architecture, SBER-MoVQGAN technology, temporal residual layers in AI, unified tokenizer system, vector quantization visual tokenizer
