Multimodal Integration with Representation-Adaptive Gated Encoding
Recent progress in task-optimized neural networks has established encoding models as a powerful tool for predicting brain responses to naturalistic stimuli, yet most existing approaches rely on unimodal representations. The emergence of omni-modal foundation models and rich multimodal neural datasets enables shared encoding models that jointly integrate visual, auditory, and linguistic information across subjects.
We introduce MIRAGE, a brain encoding framework that predicts whole-brain fMRI responses to naturalistic audiovisual stimuli with paired transcripts. MIRAGE extracts representations from a single pretrained omni-modal backbone through three modality-specific cross-attention modules whose latent queries adaptively aggregate features across the backbone's 48 layers, and combines them through a transformer-based brain encoder and a subject-specific linear head over the cortical parcels.
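To make the layer-aggregation idea concrete, the following is a minimal sketch of a single latent query attending over per-layer backbone features. It is not the authors' implementation: the function name `aggregate_layers`, the single-head dot-product form without key or value projections, and the toy dimensions are all assumptions made for illustration. In MIRAGE the queries are learned and there is one such module per modality.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_layers(query, layer_feats):
    """Attend one latent query over per-layer features (hypothetical helper).

    query:       list[float] of length d (a learned vector in a real model)
    layer_feats: list of L feature vectors, one per backbone layer
    Returns (attention weights over layers, weighted-sum feature vector).
    """
    d = len(query)
    # Scaled dot-product score between the query and each layer's features.
    scores = [sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
              for feat in layer_feats]
    weights = softmax(scores)
    # Features act as both keys and values here (a simplifying assumption).
    pooled = [sum(w * feat[i] for w, feat in zip(weights, layer_feats))
              for i in range(d)]
    return weights, pooled

# Toy example: 4 layers (not 48) with 3-dimensional features.
layers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
query = [2.0, 2.0, 0.0]
weights, pooled = aggregate_layers(query, layers)
```

In this toy case the last layer receives the largest weight because it aligns best with the query; the resulting weight vector is the kind of per-layer depth profile that can be inspected after training.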
On the Algonauts 2025 benchmark, MIRAGE achieves state-of-the-art results on the out-of-distribution dataset. Controlled comparisons show that native multimodal fusion consistently outperforms post-hoc fusion of independently extracted unimodal streams, across architectural levels and backbones. Beyond predictive accuracy, the learned attention weights are directly inspectable: each modality's gating module discovers a distinct depth profile over the backbone, and each modality traces a distinct, anatomically structured pattern across cortex.
Friends season 6 is MIRAGE's validation set. Here we compare 50 consecutive TRs of measured ground-truth fMRI activity against model predictions for episode 20b across four subjects.*
MIRAGE closely tracks the temporal structure of brain activity, capturing when and where responses rise and fall. Its predictions are smooth in magnitude, reflecting MSE training on noisy fMRI measurements, which emphasizes robust shared signal over trial-specific noise.
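One way to see why MSE-trained predictions come out attenuated is the standard variance decomposition below. It is a textbook sketch rather than an analysis from the paper, and it assumes the measured response $y$ is a stimulus-driven signal $s(x)$ plus noise $\varepsilon$ with zero conditional mean (the notation is ours, not the authors').

```latex
\begin{aligned}
y &= s(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid x] = 0 \\
f^{*} &= \arg\min_{f}\, \mathbb{E}\big[(y - f(x))^{2}\big]
  \;\Longrightarrow\; f^{*}(x) = \mathbb{E}[y \mid x] = s(x) \\
\operatorname{Var}(y) &= \operatorname{Var}\big(s(x)\big) + \operatorname{Var}(\varepsilon)
  \;\ge\; \operatorname{Var}\big(f^{*}(x)\big)
\end{aligned}
```

Under these assumptions the ideal MSE predictor has no more variance than the measurements, which is consistent with predictions that follow the timing of the signal while being smaller in magnitude. Pearson r is unaffected by such a rescaling.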
* Stimulus media are not shared due to copyright restrictions; see the CNeuroMod CC0 license and the Algonauts 2025 brain-data page for dataset details.
Spatial comparison across 1000 parcels for friends/s06e20b, sub-01: 50 consecutive TRs from validation episode S06E20b. Each TR is 1.49 seconds; metrics stay aligned to the displayed TR.
Select a video clip below. The stimulus (left) and the predicted whole-brain fMRI response (right) play in sync, so you can observe how brain activity changes with the audiovisual input.
The input stimuli are Koala-36M clips, and the predicted whole-brain fMRI activity uses the sub-01 head; video-sample terms follow the related Panda-70M license.
Values are mean Pearson r across the four trained subjects. Friends S06 is the held-out validation split used during development, Friends S07 is the held-out in-distribution benchmark, and OOD is the held-out movie benchmark.
| Model | Friends S06 Eval | Friends S07 In-Dist Eval | OOD Eval | Notes |
|---|---|---|---|---|
| MIRAGE single model | 0.319 | 0.310 | 0.217 | Hugging Face checkpoint |
| MIRAGE 15-member ensemble | 0.335 | 0.323 | 0.227 | Algonauts 2025 final submission ensemble |

Per-subject Pearson r on the OOD benchmark:

| OOD Subject | Pearson r |
|---|---|
| sub-01 | 0.244 |
| sub-02 | 0.210 |
| sub-03 | 0.235 |
| sub-05 | 0.179 |
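As a rough illustration of how these scores are formed, the sketch below computes Pearson r for one parcel's time series and then averages the per-subject OOD values from the table above. The helper `pearson_r` and the toy time series are assumptions for illustration; the exact order of averaging over parcels and subjects in the official evaluation is not stated on this page.

```python
import math

def pearson_r(pred, meas):
    """Pearson correlation between two equal-length time series."""
    n = len(pred)
    mp, mm = sum(pred) / n, sum(meas) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(pred, meas))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_m = sum((m - mm) ** 2 for m in meas)
    # Undefined (division by zero) if either series is constant.
    return cov / math.sqrt(var_p * var_m)

# Hypothetical toy time series for one parcel (not real data).
measured = [0.1, 0.4, 0.9, 0.3, -0.2, -0.5]
predicted = [0.5 * v for v in measured]  # attenuated but perfectly aligned
r_toy = pearson_r(predicted, measured)   # stays at 1: r ignores overall scale

# Per-subject OOD scores copied from the table above.
ood_r = {"sub-01": 0.244, "sub-02": 0.210, "sub-03": 0.235, "sub-05": 0.179}
mean_ood = sum(ood_r.values()) / len(ood_r)
print(round(mean_ood, 3))  # 0.217
```

The subject mean of 0.217 equals the OOD value reported for the single model in the first table.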
(a) Method Comparison Across Benchmarks. Mean Pearson r between predicted and measured BOLD on the validation set (Friends S06), the in-distribution test set (Friends S07), and the out-of-distribution movie benchmark, grouped by architectural complexity: linear ridge baselines (gray), Qwen3-Omni features with a learned brain encoder but no cross-attention gating (orange), and MIRAGE as a single model (red) and as an ensemble (blue). Each group is shown under both post-hoc and native fusion where applicable; Linear (Challenge) reproduces the official Algonauts ridge baseline. (b) Backbone Ablation. Pearson r on the validation set when varying the feature-extraction backbone of MIRAGE, comparing native multimodal fusion (red) against post-hoc fusion (orange). Error bars denote SEM across the four subjects.
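For readers who want to reproduce error bars of this kind, here is a minimal sketch of the standard error of the mean across four subjects. It assumes the sample standard deviation with an n - 1 denominator, which the caption does not specify, and it uses the per-subject OOD values listed in the table on this page purely as example inputs.

```python
import math
import statistics

def sem(values):
    """Standard error of the mean using the sample standard deviation (n - 1)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Per-subject OOD Pearson r values reported on this page.
per_subject_r = [0.244, 0.210, 0.235, 0.179]
ood_sem = sem(per_subject_r)
print(round(ood_sem, 4))  # 0.0146
```

With only four subjects the SEM is a coarse estimate, so small differences between bars should be read with that in mind.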
(a) Per-parcel Pearson r for MIRAGE on the validation set, shown on a cortical flatmap. (b) Dominant modality per parcel, vision (red), audio (blue), or text (green), defined as the modality whose ablation causes the largest drop in per-parcel Pearson r relative to the full trimodal model. Color saturation encodes dominance strength (the dominant modality's share of the total drop, normalized to [0, 1]); desaturated parcels reflect distributed multimodal contributions. (c) Mean Pearson r across cortex when restricting input to subsets of modalities during training (T = text, V = vision, A = audio), from each modality alone through pairwise combinations to the full trimodal model.
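The dominance rule in panel (b) can be sketched for a single parcel as follows. The function name `dominant_modality`, the clipping of negative drops to zero, and the example numbers are assumptions; the caption's normalization to [0, 1] may also involve a rescaling step that is not shown here.

```python
def dominant_modality(full_r, ablated_r):
    """Return (dominant modality, dominance strength) for one parcel.

    full_r:    Pearson r of the full trimodal model for this parcel
    ablated_r: dict mapping an ablated modality to the resulting Pearson r
    """
    # Drop in r caused by removing each modality; negative drops are clipped
    # to zero (an assumption, the caption does not say how they are handled).
    drops = {m: max(full_r - r, 0.0) for m, r in ablated_r.items()}
    total = sum(drops.values())
    if total == 0.0:
        return None, 0.0  # no ablation hurts this parcel
    best = max(drops, key=drops.get)
    # Strength is the dominant modality's share of the total drop.
    return best, drops[best] / total

# Hypothetical values for a single parcel (not from the paper).
modality, strength = dominant_modality(
    0.40, {"vision": 0.15, "audio": 0.35, "text": 0.38}
)
```

Here removing vision costs far more than removing audio or text, so the parcel would be colored as vision-dominant with high saturation; a parcel with three similar drops would have a share near one third and appear desaturated.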
(a) Parcel-wise difference in Pearson r between MIRAGE and the matched Linear (Native Fusion) baseline, averaged across subjects and projected onto an inflated cortical surface (LH/RH: left/right hemisphere); warmer colors mark parcels where MIRAGE improves. Both models share the same input features, so the difference isolates the contribution of the learned encoder. (b) Mean Pearson r for MIRAGE (red) and the linear baseline (gray) within each of the seven canonical Yeo-Krienen networks: Visual, Somatomotor (SomMot), Dorsal/Ventral Attention (DorsAttn/VentAttn), Limbic, Frontoparietal Control (Control), and Default Mode (Default). (c) Pearson r from a per-subject linear probe trained on representations at successive stages of MIRAGE: raw input features, post cross-attention, post Brain Encoder, and full model output (no additional fitting). Error bars in (b) and (c) denote SEM across the four CNeuroMod subjects.
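The parcel-wise difference in panel (a) and the network averages in panel (b) amount to simple bookkeeping over per-parcel scores, sketched below. The helper `mean_by_network`, the six-parcel example, and the label assignment are hypothetical; the real analysis uses the full parcellation and subject averaging.

```python
from collections import defaultdict

def mean_by_network(parcel_r, parcel_network):
    """Average per-parcel Pearson r within each network label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, net in zip(parcel_r, parcel_network):
        sums[net] += r
        counts[net] += 1
    return {net: sums[net] / counts[net] for net in sums}

# Hypothetical scores and labels for six parcels (not real data).
mirage_r = [0.50, 0.46, 0.20, 0.24, 0.30, 0.10]
linear_r = [0.40, 0.42, 0.15, 0.17, 0.28, 0.12]
networks = ["Visual", "Visual", "Default", "Default", "SomMot", "Limbic"]

# Panel (a): parcel-wise gain of the learned encoder over the linear baseline.
delta = [m - l for m, l in zip(mirage_r, linear_r)]
# Panel (b): mean score per network for one model.
by_net = mean_by_network(mirage_r, networks)
```

Because both models see the same input features, a positive entry in `delta` reflects what the learned encoder adds for that parcel.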
Cross-attention weights from MIRAGE's per-modality cross-attention modules (vision, text, audio) over the 48 layers of the Qwen3-Omni language module, averaged across attention heads and the 24 latent queries; brighter cells indicate layers that contribute more strongly to the modality-specific readout. Per-head and per-query breakdowns are in the Appendix.
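The averaging described in this caption can be sketched as a reduction over heads and queries. The nested-list layout `attn[head][query][layer]` and the function name `layer_profile` are assumptions chosen for illustration, and the toy sizes are far smaller than the 24 queries and 48 layers reported here.

```python
def layer_profile(attn):
    """Average attention weights over heads and queries.

    attn[h][q][l] is the weight that head h, latent query q puts on layer l.
    Returns one averaged weight per layer.
    """
    n_heads, n_queries, n_layers = len(attn), len(attn[0]), len(attn[0][0])
    profile = [0.0] * n_layers
    for head in attn:
        for query_weights in head:
            for layer, w in enumerate(query_weights):
                profile[layer] += w
    return [p / (n_heads * n_queries) for p in profile]

# Toy tensor: 2 heads, 2 queries, 3 layers.
attn = [
    [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]],
    [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3]],
]
profile = layer_profile(attn)
```

If each query's weights sum to one, the averaged profile does too, so the cells of one modality's row can be compared as shares of that modality's total attention.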
@misc{gokce2026mirage,
title = {MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding},
author = {Abdulkadir Gokce and Badr AlKhamissi and Martin Schrimpf},
year = {2026},
eprint = {2605.29850},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2605.29850},
}