MIRAGE: Adaptive Multimodal Gating
for Whole-Brain fMRI Encoding

Multimodal Integration with Representation-Adaptive Gated Encoding

Abdulkadir Gokce^* · Badr AlKhamissi^* · Martin Schrimpf

École Polytechnique Fédérale de Lausanne (EPFL)

^* Equal contribution

Paper (arXiv) · Model (Hugging Face) · Code (GitHub)

MIRAGE Architecture · Adaptive gating drives 4 subject readouts

A naturalistic scene enters on the left. At the center of the forward pass, adaptive cross-attention gating selects features across the depth of the frozen backbone; a shared encoder and subject-specific heads then turn the gated tokens into cortical predictions.

Diagram components: Vision, Audio, and Text streams; shared brain encoder; subject linear head.

Trace a modality to see its layer distribution through cross-attention. Brighter paths and backbone layers carry more weight at the current time step; the subject-specific linear head changes across the 4 readouts.

Abstract

Overview

Recent progress in task-optimized neural networks has established encoding models as a powerful tool for predicting brain responses to naturalistic stimuli, yet most existing approaches rely on unimodal representations. The emergence of omni-modal foundation models and rich multimodal neural datasets enables shared encoding models that jointly integrate visual, auditory, and linguistic information across subjects.

We introduce MIRAGE, a brain encoding framework that predicts whole-brain fMRI responses to naturalistic audiovisual stimuli with paired transcripts. MIRAGE extracts representations from a single pretrained omni-modal backbone through three modality-specific cross-attention modules whose latent queries adaptively aggregate features across the backbone's 48 layers, and combines them through a transformer-based brain encoder and a subject-specific linear head over the cortical parcels.
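The layer-aggregation step described above can be illustrated with a minimal sketch. It assumes single-head dot-product attention with no learned projections, a tiny feature width (`DIM`), and random stand-ins for both the latent queries and the backbone features; the function name `gate_over_layers` is hypothetical and this is not the authors' implementation. Only the 48 layers and the 24 latent queries per modality come from the page.

```python
import math
import random

NUM_LAYERS = 48    # backbone depth stated in the text
NUM_QUERIES = 24   # latent queries per modality, as stated in the Figure 4 caption
DIM = 8            # hypothetical feature width, kept tiny for readability


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_over_layers(queries, layer_feats):
    """Single-head dot-product cross-attention over the layer axis (assumed form).

    queries:     list of query vectors (learned parameters in a real model)
    layer_feats: one feature vector per frozen backbone layer
    Returns (gated_tokens, weights); weights[q] sums to 1 over layers.
    """
    dim = len(layer_feats[0])
    scale = 1.0 / math.sqrt(dim)
    gated, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, feat)) for feat in layer_feats]
        w = softmax(scores)
        # Each gated token is a convex combination of the per-layer features.
        token = [sum(w[l] * layer_feats[l][d] for l in range(len(layer_feats)))
                 for d in range(dim)]
        weights.append(w)
        gated.append(token)
    return gated, weights


rng = random.Random(0)  # random stand-ins for learned queries and backbone features
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
layer_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LAYERS)]
gated_tokens, layer_weights = gate_over_layers(queries, layer_feats)
print(len(gated_tokens), len(gated_tokens[0]), len(layer_weights[0]))  # prints: 24 8 48
```

In this simplified form, each query's softmax weights over the 48 layers are the quantities one would inspect to read off a modality's depth profile.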

On the Algonauts 2025 benchmark, MIRAGE achieves state-of-the-art results on the out-of-distribution dataset. Controlled comparisons show that native multimodal fusion consistently outperforms post-hoc fusion of independently extracted unimodal streams, across architectural levels and backbones. Beyond predictive accuracy, the learned attention weights are directly inspectable: each modality's gating module discovers a distinct depth profile over the backbone, and each modality traces a distinct, anatomically structured pattern across cortex.

Validation Comparison

Measured and Predicted fMRI Activity

Friends season 6 is MIRAGE's validation set. Here we compare 50 consecutive TRs of measured ground-truth fMRI activity against model predictions for episode 20b across four subjects.*

MIRAGE closely tracks the temporal structure of brain activity, capturing when and where responses rise and fall. Its predictions are smooth in magnitude, reflecting MSE training on noisy fMRI measurements, which emphasizes robust shared signal over trial-specific noise.

* Stimulus media are not shared due to copyright restrictions; see the CNeuroMod CC0 license and the Algonauts 2025 brain-data page for dataset details.

Viewer panels: measured fMRI and MIRAGE prediction, compared spatially across 1000 parcels.

Displayed: friends/s06e20b, sub-01; 50 consecutive TRs from validation episode S06E20b. Each TR is 1.49 seconds.
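As a small worked example of the timing stated here, the snippet below converts TR indices to seconds and computes the duration of the 50-TR window. The 0-based indexing and the choice to measure onsets relative to the start of the window are assumptions made for illustration; `tr_to_seconds` is a hypothetical helper.

```python
TR_SECONDS = 1.49   # repetition time stated on the page
NUM_TRS = 50        # length of the displayed window


def tr_to_seconds(tr_index, tr_seconds=TR_SECONDS):
    """Onset of a TR relative to the start of the window (0-based index, assumed)."""
    return tr_index * tr_seconds


window_duration = NUM_TRS * TR_SECONDS
print(f"{window_duration:.1f} s")  # prints: 74.5 s
```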

Results

Key Findings

Evaluation on Algonauts 2025 CNeuroMod splits.

Values are mean Pearson r across the four trained subjects. Friends S06 is the held-out validation split used during development, Friends S07 is the held-out in-distribution benchmark, and OOD is the held-out movie benchmark.
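A minimal sketch of how such a score could be computed is given below. It assumes that Pearson r is first computed per parcel over time, then averaged over parcels within a subject, then averaged over subjects; the page states only the final averaging across subjects, so the order of the earlier steps, the `[parcel][TR]` data layout, and the function names are all assumptions. The toy time series are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def subject_score(measured, predicted):
    """Mean per-parcel r for one subject; inputs are indexed [parcel][TR]."""
    rs = [pearson_r(m, p) for m, p in zip(measured, predicted)]
    return sum(rs) / len(rs)


def benchmark_score(subjects):
    """Mean of subject scores; `subjects` is a list of (measured, predicted) pairs."""
    scores = [subject_score(m, p) for m, p in subjects]
    return sum(scores) / len(scores)


# Toy data (hypothetical): one subject, two parcels, four TRs.
measured = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
predicted = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]  # r = +1 and r = -1
toy_score = benchmark_score([(measured, predicted)])
```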

Model	Friends S06 Eval	Friends S07 In-Dist Eval	OOD Eval	Notes
MIRAGE single model	0.319	0.310	0.217	Hugging Face checkpoint
MIRAGE 15-member ensemble	0.335	0.323	0.227	Algonauts 2025 final submission ensemble

OOD Subject	Pearson r
sub-01	0.244
sub-02	0.210
sub-03	0.235
sub-05	0.179
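As a quick consistency check, averaging the four per-subject OOD values in the table above reproduces the 0.217 reported for the MIRAGE single model in the OOD Eval column. The snippet uses only numbers from the two tables; the variable names are illustrative.

```python
# Per-subject OOD Pearson r, copied from the table above.
ood_r = {"sub-01": 0.244, "sub-02": 0.210, "sub-03": 0.235, "sub-05": 0.179}

mean_r = sum(ood_r.values()) / len(ood_r)
print(round(mean_r, 3))  # prints: 0.217
```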

Figure 1 · Algonauts Benchmark

Method Comparison Across Benchmarks & Backbone Ablations.

(a) Method Comparison Across Benchmarks. Mean Pearson r between predicted and measured BOLD on the validation set (Friends S06), the in-distribution test set (Friends S07), and the out-of-distribution movie benchmark, grouped by architectural complexity: linear ridge baselines (gray), Qwen3-Omni features with a learned brain encoder but no cross-attention gating (orange), and MIRAGE as a single model (red) and as an ensemble (blue). Each group is shown under both post-hoc and native fusion where applicable; Linear (Challenge) reproduces the official Algonauts ridge baseline. (b) Backbone Ablation. Pearson r on the validation set when varying the feature-extraction backbone of MIRAGE, comparing native multimodal fusion (red) against post-hoc fusion (orange). Error bars denote SEM across the four subjects.
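For readers unfamiliar with the error bars, the standard error of the mean across subjects can be sketched as follows. The use of the sample standard deviation (ddof = 1) is an assumption, since the caption does not specify it. The input reuses the per-subject OOD values from the results table purely as an example; the resulting number is not a value reported by the authors.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean, using the sample standard deviation (ddof = 1)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Example input: per-subject OOD Pearson r from the results table (sub-01, -02, -03, -05).
per_subject_r = [0.244, 0.210, 0.235, 0.179]
sem_r = sem(per_subject_r)
print(f"mean = {statistics.mean(per_subject_r):.3f}, SEM = {sem_r:.4f}")
```

With only four subjects, the SEM is sensitive to a single outlying subject, which is worth keeping in mind when comparing bars that differ by small margins.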

Figure 2 · Modality Contributions

Cortical Alignment and Modality Contributions.

(a) Per-parcel Pearson r for MIRAGE on the validation set, shown on a cortical flatmap. (b) Dominant modality per parcel, vision (red), audio (blue), or text (green), defined as the modality whose ablation causes the largest drop in per-parcel Pearson r relative to the full trimodal model. Color saturation encodes dominance strength (the dominant modality's share of the total drop, normalized to [0, 1]); desaturated parcels reflect distributed multimodal contributions. (c) Mean Pearson r across cortex when restricting input to subsets of modalities during training (T = text, V = vision, A = audio), from each modality alone through pairwise combinations to the full trimodal model.
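The dominance definition in panel (b) can be sketched for a single parcel as below. The sketch assumes that negative drops (ablations that improve r) are clipped to zero and that a parcel with no positive drop has no dominant modality; neither detail is stated in the caption. It returns the raw share of the total drop, and whether the authors rescale that share further is not specified. The function name and the example numbers are hypothetical.

```python
def dominant_modality(full_r, ablated_r):
    """Return (dominant modality, share of total drop) for one parcel.

    full_r:    Pearson r of the full trimodal model
    ablated_r: dict mapping modality name -> Pearson r with that modality ablated
    """
    # Drop in r caused by removing each modality; negative drops clipped (assumption).
    drops = {m: max(full_r - r, 0.0) for m, r in ablated_r.items()}
    total = sum(drops.values())
    if total == 0.0:
        return None, 0.0
    winner = max(drops, key=drops.get)
    return winner, drops[winner] / total


# Hypothetical parcel: removing vision hurts far more than removing audio or text.
winner, share = dominant_modality(0.40, {"vision": 0.10, "audio": 0.35, "text": 0.35})
print(winner, round(share, 2))  # prints: vision 0.75
```

A share close to 1 corresponds to a saturated color on the map, while shares near 1/3 (for three modalities) correspond to the desaturated, distributed case.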

Figure 3 · Architectural Ablations

Where Does MIRAGE Help, and Which Components Contribute?

(a) Parcel-wise difference in Pearson r between MIRAGE and the matched Linear (Native Fusion) baseline, averaged across subjects and projected onto an inflated cortical surface (LH/RH: left/right hemisphere); warmer colors mark parcels where MIRAGE improves. Both models share the same input features, so the difference isolates the contribution of the learned encoder. (b) Mean Pearson r for MIRAGE (red) and the linear baseline (gray) within each of the seven canonical Yeo-Krienen networks: Visual, Somatomotor (SomMot), Dorsal/Ventral Attention (DorsAttn/VentAttn), Limbic, Frontoparietal Control (Control), and Default Mode (Default). (c) Pearson r from a per-subject linear probe trained on representations at successive stages of MIRAGE: raw input features, post cross-attention, post Brain Encoder, and full model output (no additional fitting). Error bars in (b) and (c) denote SEM across the four CNeuroMod subjects.
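The network-level summary in panel (b) amounts to grouping per-parcel scores by network label and averaging. A minimal sketch follows, assuming each parcel carries exactly one network label; the helper `network_means`, the four-parcel layout, and all numbers are hypothetical toy values, not results from the paper.

```python
from collections import defaultdict


def network_means(parcel_r, parcel_network):
    """Mean per-parcel Pearson r within each network label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r, net in zip(parcel_r, parcel_network):
        sums[net] += r
        counts[net] += 1
    return {net: sums[net] / counts[net] for net in sums}


# Toy example (hypothetical): four parcels, two networks.
networks = ["Visual", "Visual", "Default", "Default"]
mirage_r = [0.5, 0.3, 0.2, 0.4]
linear_r = [0.4, 0.2, 0.2, 0.2]

mirage_by_net = network_means(mirage_r, networks)
linear_by_net = network_means(linear_r, networks)
# Per-network gain of the learned encoder over the linear baseline.
gain_by_net = {net: mirage_by_net[net] - linear_by_net[net] for net in mirage_by_net}
```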

Figure 4 · Cortical Organization

Layer-wise Contributions of Qwen3-Omni Features.

Cross-attention weights from MIRAGE's per-modality cross-attention modules (vision, text, audio) over the 48 layers of the Qwen3-Omni language module, averaged across attention heads and the 24 latent queries; brighter cells indicate layers that contribute more strongly to the modality-specific readout. Per-head and per-query breakdowns are in the Appendix.
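The averaging described in this caption can be sketched as a reduction over the head and query axes. The `[head][query][layer]` layout and the function name `layer_profile` are assumptions, and the tiny weight tensor below is a made-up example with 3 layers rather than 48.

```python
def layer_profile(weights):
    """Average attention weights over heads and queries.

    weights: nested list indexed [head][query][layer]; each innermost row
             is assumed to be a softmax output that sums to 1 over layers.
    Returns one averaged weight per layer.
    """
    num_heads = len(weights)
    num_queries = len(weights[0])
    num_layers = len(weights[0][0])
    profile = [0.0] * num_layers
    for head in weights:
        for row in head:
            for layer, w in enumerate(row):
                profile[layer] += w
    return [p / (num_heads * num_queries) for p in profile]


# Toy weights (hypothetical): 2 heads, 2 queries, 3 layers.
toy_weights = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
]
profile = layer_profile(toy_weights)
print(profile)  # prints: [0.375, 0.375, 0.25]
```

Because every row sums to 1, the averaged profile also sums to 1, so it can be read as a distribution over backbone depth for that modality.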
