MIRAGE: Adaptive Multimodal Gating
for Whole-Brain fMRI Encoding

Multimodal Integration with Representation-Adaptive Gated Encoding

Abdulkadir Gokce*   ·   Badr AlKhamissi*   ·   Martin Schrimpf
École Polytechnique Fédérale de Lausanne (EPFL)
* Equal contribution
Paper (arXiv)  ·  Model (Hugging Face)  ·  Code (GitHub)
MIRAGE Architecture  ·  Adaptive gating drives 4 subject readouts

A naturalistic scene enters on the left. The center of the pass is adaptive cross-attention gating over frozen backbone depth; a shared encoder and subject heads turn those gated tokens into cortical predictions.

Pipeline stages, left to right:
Naturalistic Stimulus (scene · waveform · transcript)
Backbone (Qwen3-Omni · 48 frozen layers)
Adaptive Layer Gating (cross-attention over depth)
Brain Encoder (shared context)
Subject Heads (linear)
fMRI Prediction (cortical parcels)

Legend: Vision  ·  Audio  ·  Text  ·  Shared brain encoder  ·  Subject linear head

Brighter paths and backbone layers carry more weight at the current time step; the subject-specific linear head changes across the 4 readouts.

Overview

Recent progress in task-optimized neural networks has established encoding models as a powerful tool for predicting brain responses to naturalistic stimuli, yet most existing approaches rely on unimodal representations. The emergence of omni-modal foundation models and rich multimodal neural datasets enables shared encoding models that jointly integrate visual, auditory, and linguistic information across subjects.


We introduce MIRAGE, a brain encoding framework that predicts whole-brain fMRI responses to naturalistic audiovisual stimuli with paired transcripts. MIRAGE extracts representations from a single pretrained omni-modal backbone through three modality-specific cross-attention modules whose latent queries adaptively aggregate features across the backbone's 48 layers, and combines them through a transformer-based brain encoder and a subject-specific linear head over the cortical parcels.
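The depth-gating step described above can be sketched as a single-head cross-attention in which latent queries attend over per-layer features. This is a minimal illustration and not the released implementation: the function name `layer_gate`, the projection matrices, the random weights, and the sizes `d_model` and `d_attn` are assumptions, while the 48 layers and 24 latent queries follow the text on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_gate(layer_feats, queries, w_k, w_v):
    """Single-head cross-attention over backbone depth (hypothetical sketch).

    layer_feats: (n_layers, d_model) features of one modality at one time step
    queries:     (n_queries, d_attn) learned latent queries
    w_k, w_v:    (d_model, d_attn) key and value projections
    Returns gated tokens (n_queries, d_attn) and weights (n_queries, n_layers).
    """
    keys = layer_feats @ w_k
    values = layer_feats @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # each query spreads its mass over layers
    return weights @ values, weights

rng = np.random.default_rng(0)
n_layers, n_queries = 48, 24   # taken from the text
d_model, d_attn = 64, 32       # hypothetical sizes for illustration
feats = rng.normal(size=(n_layers, d_model))
queries = rng.normal(size=(n_queries, d_attn))
w_k = rng.normal(size=(d_model, d_attn)) / np.sqrt(d_model)
w_v = rng.normal(size=(d_model, d_attn)) / np.sqrt(d_model)

tokens, weights = layer_gate(feats, queries, w_k, w_v)
print(tokens.shape, weights.shape)
```

In this sketch each row of `weights` sums to one, so it can be read as a distribution over backbone layers for one latent query. One such module per modality would feed the shared brain encoder.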


On the Algonauts benchmark, MIRAGE achieves state-of-the-art results on the out-of-distribution dataset. Controlled comparisons show that native multimodal fusion consistently outperforms post-hoc fusion of independently extracted unimodal streams, across architectural levels and backbones. Beyond predictive accuracy, the learned attention weights are directly inspectable: each modality's gating module discovers a distinct depth profile over the backbone, and each modality traces a distinct, anatomically structured pattern across cortex.

Measured and Predicted fMRI Activity

Friends season 6 is MIRAGE's validation set. Here we compare 50 consecutive TRs of measured ground-truth fMRI activity against model predictions for episode 20b across four subjects.*

MIRAGE closely tracks the temporal structure of brain activity, capturing when and where responses rise and fall. Its predictions are smooth in magnitude, reflecting MSE training on noisy fMRI measurements, which emphasizes robust shared signal over trial-specific noise.
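One way to see why MSE training produces smooth predictions is the following sketch under a simple additive-noise assumption. The symbols s, ε, and f are introduced here for illustration only and do not appear in the original.

```latex
\begin{align}
y &= s(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid x] = 0 \\
f^{*} &= \arg\min_{f}\, \mathbb{E}\big[(y - f(x))^{2}\big] = \mathbb{E}[y \mid x] = s(x) \\
\operatorname{Var}(y) &= \operatorname{Var}\big(s(x)\big) + \operatorname{Var}(\varepsilon) \;\ge\; \operatorname{Var}\big(f^{*}(x)\big)
\end{align}
```

Under this assumption the MSE-optimal predictor recovers only the stimulus-driven component, so its variance cannot exceed that of the measured signal.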

* Stimulus media are not shared due to copyright restrictions; see the CNeuroMod CC0 license and the Algonauts 2025 brain-data page for dataset details.

Viewer panels: Measured fMRI  ·  MIRAGE prediction, with selectable subject and brain view.

Spatial comparison across 1000 parcels.

friends/s06e20b - sub-01; 50 consecutive TRs from validation S06E20b.

TRs are 1.49 seconds; metrics stay aligned to the displayed TR.

Brain Response Visualization

Select a video clip below. The stimulus (left) and the predicted whole-brain fMRI response (right) play in sync, so you can observe how brain activity changes with the audiovisual input.

Panels: Stimulus video (left)  ·  Predicted brain response (right)

Input stimuli are Koala-36M clips, and the predicted whole-brain fMRI activity uses the sub-01 head; video-sample terms follow the related Panda-70M license.

Key Findings

Evaluation on Algonauts 2025 CNeuroMod splits.

Values are mean Pearson r across the four trained subjects. Friends S06 is the held-out validation split used during development, Friends S07 is the held-out in-distribution benchmark, and OOD is the held-out movie benchmark.
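The metric described above can be sketched in plain Python. The aggregation order (Pearson r per parcel over time, averaged over parcels, then over subjects) is an assumption, and the helper names and toy data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    # Pearson correlation between two equal-length sequences.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def subject_score(measured, predicted):
    # measured, predicted: lists of per-parcel time series (parcels x TRs).
    rs = [pearson_r(m, p) for m, p in zip(measured, predicted)]
    return sum(rs) / len(rs)

def benchmark_score(per_subject):
    # per_subject: dict subject -> (measured, predicted); mean over subjects.
    scores = {s: subject_score(m, p) for s, (m, p) in per_subject.items()}
    return sum(scores.values()) / len(scores), scores

# Toy data with hypothetical subject names, two parcels and one parcel.
toy = {
    "sub-A": ([[1, 2, 3, 4], [1, 3, 2, 4]], [[2, 4, 6, 8], [4, 3, 2, 1]]),
    "sub-B": ([[1, 2, 3, 4]], [[1, 2, 3, 5]]),
}
overall, per_subject = benchmark_score(toy)
print(round(overall, 3), {s: round(r, 3) for s, r in per_subject.items()})
```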

Model                       Friends S06 Eval   Friends S07 In-Dist Eval   OOD Eval   Notes
MIRAGE single model         0.319              0.310                      0.217      Hugging Face checkpoint
MIRAGE 15-member ensemble   0.335              0.323                      0.227      Algonauts 2025 final submission ensemble
OOD Subject   Pearson r
sub-01        0.244
sub-02        0.210
sub-03        0.235
sub-05        0.179
Figure 1  ·  Algonauts Benchmark

Method Comparison Across Benchmarks & Backbone Ablations.

(a) Method Comparison Across Benchmarks. Mean Pearson r between predicted and measured BOLD on the validation set (Friends S06), the in-distribution test set (Friends S07), and the out-of-distribution movie benchmark, grouped by architectural complexity: linear ridge baselines (gray), Qwen3-Omni features with a learned brain encoder but no cross-attention gating (orange), and MIRAGE as a single model (red) and as an ensemble (blue). Each group is shown under both post-hoc and native fusion where applicable; Linear (Challenge) reproduces the official Algonauts ridge baseline. (b) Backbone Ablation. Pearson r on the validation set when varying the feature-extraction backbone of MIRAGE, comparing native multimodal fusion (red) against post-hoc fusion (orange). Error bars denote SEM across the four subjects.
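As a small worked example of the error bars, the snippet below computes a mean and SEM across four subjects, using the per-subject OOD values listed under Key Findings. The sample (n − 1) standard deviation is an assumption, since the page does not state which estimator is used, and the result is illustrative rather than a reproduction of the plotted bars.

```python
from math import sqrt
from statistics import mean, stdev

def sem(values):
    # Standard error of the mean using the sample (n - 1) standard deviation.
    return stdev(values) / sqrt(len(values))

# Per-subject OOD Pearson r for sub-01, sub-02, sub-03, sub-05 (from this page).
ood_r = [0.244, 0.210, 0.235, 0.179]
print(f"mean = {mean(ood_r):.3f}, SEM = {sem(ood_r):.4f}")
```

The mean of these four values is 0.217, which matches the OOD score listed for the single model.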

Figure 2  ·  Modality Contributions

Cortical Alignment and Modality Contributions.

(a) Per-parcel Pearson r for MIRAGE on the validation set, shown on a cortical flatmap. (b) Dominant modality per parcel: vision (red), audio (blue), or text (green), defined as the modality whose ablation causes the largest drop in per-parcel Pearson r relative to the full trimodal model. Color saturation encodes dominance strength (the dominant modality's share of the total drop, normalized to [0, 1]); desaturated parcels reflect distributed multimodal contributions. (c) Mean Pearson r across cortex when restricting input to subsets of modalities during training (T = text, V = vision, A = audio), from each modality alone through pairwise combinations to the full trimodal model.
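The dominance rule in panel (b) can be sketched for a single parcel as follows. The function name, the clipping of negative drops to zero, and the example numbers are assumptions for illustration; the exact normalization used for the figure is not specified here, so the raw share of the total drop is returned.

```python
def dominant_modality(full_r, ablated_r):
    """Pick the modality whose ablation hurts one parcel the most.

    full_r:    Pearson r of the full trimodal model for the parcel
    ablated_r: dict modality -> Pearson r with that modality ablated
    Returns (dominant modality or None, share of the total drop in [0, 1]).
    """
    # Drop caused by removing each modality; negative drops are clipped (assumption).
    drops = {m: max(full_r - r, 0.0) for m, r in ablated_r.items()}
    total = sum(drops.values())
    if total == 0.0:
        return None, 0.0  # removing any modality does not hurt this parcel
    top = max(drops, key=drops.get)
    return top, drops[top] / total

# Hypothetical parcel: removing vision costs far more than removing audio or text.
modality, strength = dominant_modality(
    0.40, {"vision": 0.10, "audio": 0.35, "text": 0.35}
)
print(modality, round(strength, 2))
```

A share near 1 corresponds to a saturated color in the map, and a share near 1/3 to a desaturated parcel where the three modalities contribute about equally.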

Figure 3  ·  Architectural Ablations

Where Does MIRAGE Help, and Which Components Contribute?

(a) Parcel-wise difference in Pearson r between MIRAGE and the matched Linear (Native Fusion) baseline, averaged across subjects and projected onto an inflated cortical surface (LH/RH: left/right hemisphere); warmer colors mark parcels where MIRAGE improves. Both models share the same input features, so the difference isolates the contribution of the learned encoder. (b) Mean Pearson r for MIRAGE (red) and the linear baseline (gray) within each of the seven canonical Yeo-Krienen networks: Visual, Somatomotor (SomMot), Dorsal/Ventral Attention (DorsAttn/VentAttn), Limbic, Frontoparietal Control (Control), and Default Mode (Default). (c) Pearson r from a per-subject linear probe trained on representations at successive stages of MIRAGE: raw input features, post cross-attention, post Brain Encoder, and full model output (no additional fitting). Error bars in (b) and (c) denote SEM across the four CNeuroMod subjects.
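A per-subject linear probe of the kind used in panel (c) could look like the sketch below. Ridge regression in closed form is an assumption (the page only says "linear probe"), and the data are synthetic stand-ins for one subject's representations and parcel responses.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    # Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def parcel_pearson(Y_true, Y_pred):
    # Pearson r per parcel (column), computed over time (rows).
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom

# Synthetic, zero-mean data for one hypothetical subject (no intercept is fitted).
rng = np.random.default_rng(0)
n_train, n_test, d, n_parcels = 200, 100, 16, 5
W_true = rng.normal(size=(d, n_parcels))
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
Y_train = X_train @ W_true + 0.5 * rng.normal(size=(n_train, n_parcels))
Y_test = X_test @ W_true + 0.5 * rng.normal(size=(n_test, n_parcels))

W = fit_ridge(X_train, Y_train, alpha=1.0)
probe_r = parcel_pearson(Y_test, X_test @ W).mean()
print(round(float(probe_r), 3))
```

Repeating this fit on the representations from each stage (raw features, post cross-attention, post Brain Encoder) would give one probe score per stage for comparison.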

Figure 4  ·  Cortical Organization

Layer-wise Contributions of Qwen3-Omni Features.

Cross-attention weights from MIRAGE's per-modality cross-attention modules (vision, text, audio) over the 48 layers of the Qwen3-Omni language module, averaged across attention heads and the 24 latent queries; brighter cells indicate layers that contribute more strongly to the modality-specific readout. Per-head and per-query breakdowns are in the Appendix.
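The averaging described in this caption can be sketched as a reduction over the head and query axes of an attention tensor. The tensor here is random stand-in data and `n_heads = 8` is a hypothetical value; only the 24 latent queries and 48 layers come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, n_queries, n_layers = 8, 24, 48   # n_heads is hypothetical

# Stand-in attention weights for one modality: softmax over the layer axis.
logits = rng.normal(size=(n_heads, n_queries, n_layers))
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

# Average over heads and latent queries to get one weight per backbone layer.
layer_profile = attn.mean(axis=(0, 1))

# Indices of the five most heavily weighted layers, highest first.
top_layers = np.argsort(layer_profile)[::-1][:5]
print(layer_profile.shape, top_layers)
```

Because each head-query row is a distribution over layers, the averaged profile also sums to one and can be compared directly across modalities.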

BibTeX

@misc{gokce2026mirage,
  title         = {MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding},
  author        = {Abdulkadir Gokce and Badr AlKhamissi and Martin Schrimpf},
  year          = {2026},
  eprint        = {2605.29850},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2605.29850},
}