FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

ICCV 2025

Renshan Zhang1, Rui Shao1✉, Gongwei Chen1, Miao Zhang1,
Kaiwen Zhou2, Weili Guan1, Liqiang Nie1✉
1Harbin Institute of Technology, Shenzhen    2Huawei Noah’s Ark Lab
✉ Corresponding author  

Abstract

The incorporation of high-resolution visual input equips multimodal large language models (MLLMs) with enhanced visual perception capabilities for real-world tasks. However, most existing high-resolution MLLMs rely on a cropping-based approach to process images, which leads to fragmented visual encoding and a sharp increase in redundant tokens. To tackle these issues, we propose the FALCON model. FALCON introduces a novel visual register technique to simultaneously: 1) Eliminate redundant tokens at the stage of visual encoding. To directly address the visual redundancy present in the output of the vision encoder, we propose a Register-based Representation Compacting (ReCompact) mechanism. This mechanism introduces a set of learnable visual registers designed to adaptively aggregate essential information while discarding redundancy. It enables the encoder to produce a more compact visual representation with a minimal number of output tokens, thus eliminating the need for an additional compression module. 2) Ensure continuity in visual encoding. To address the potential encoding errors caused by fragmented visual inputs, we develop a Register Interactive Attention (ReAtten) module. This module facilitates effective and efficient information exchange across sub-images by enabling interactions between visual registers. It ensures the continuity of visual semantics throughout the encoding. We conduct comprehensive experiments with FALCON on high-resolution benchmarks across a wide range of scenarios. FALCON demonstrates superior performance with a remarkable 9-fold reduction in visual tokens. FALCON is open-sourced and publicly available at https://github.com/JiuTian-VL/FALCON.
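As a rough illustration of the ReCompact idea (not the authors' implementation), the following NumPy sketch shows the core mechanism the abstract describes: learnable register tokens are concatenated with patch tokens inside the encoder's attention, and only the register outputs are kept as the compact visual representation. All names, token counts, and dimensions here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def recompact_sketch(patch_tokens, registers):
    """Toy single-head self-attention over [registers; patches].
    Only the register outputs are returned, yielding a compact
    representation with far fewer tokens than the patch sequence."""
    x = np.concatenate([registers, patch_tokens], axis=0)   # (R+P, D)
    scores = x @ x.T / np.sqrt(x.shape[1])                  # joint attention
    out = softmax(scores) @ x                               # (R+P, D)
    return out[: registers.shape[0]]                        # keep registers only

rng = np.random.default_rng(0)
patches = rng.normal(size=(64, 32))   # 64 patch tokens of one sub-image, dim 32
regs = rng.normal(size=(8, 32))       # 8 learnable visual registers
compact = recompact_sketch(patches, regs)
print(compact.shape)  # (8, 32): 8 output tokens instead of 64
```

In a real ViT the registers would be trained parameters and the attention would run per layer with learned projections; the sketch only conveys how register outputs replace the full patch sequence, removing the need for a separate compression module.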

Overview Framework of FALCON


We propose FALCON, which addresses the issues of visual redundancy and fragmentation in high-resolution understanding of MLLMs in a unified manner through the proposed visual register mechanism. It synergizes two essential components: (1) To directly eliminate redundancy during visual encoding, we propose the Register-based Representation Compacting (ReCompact) mechanism. This mechanism introduces a set of learnable visual registers, which are paired with the image tokens of each sub-image and fed into the vision encoder to capture rich visual information. (2) To ensure the continuity of visual semantics throughout the encoding, a novel Register Interactive Attention (ReAtten) module is integrated into the Vision Transformer to facilitate information exchange between sub-images via the visual registers. Finally, the compact visual representations produced by the visual registers are processed through a simple MLP module before being fed into the LLM for further analysis.
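To make the ReAtten component concrete, here is a minimal NumPy sketch (again, an assumption-laden toy, not the released code): registers from all sub-images attend to one another in a shared attention step, then return to their per-sub-image streams, which is how information can flow across crop boundaries:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reatten_sketch(registers_per_subimage):
    """Toy register-interactive attention: registers from every
    sub-image are flattened into one sequence, attend jointly,
    then are reshaped back to their sub-image streams."""
    S, R, D = registers_per_subimage.shape      # sub-images, registers, dim
    flat = registers_per_subimage.reshape(S * R, D)
    scores = flat @ flat.T / np.sqrt(D)         # all registers see each other
    mixed = softmax(scores) @ flat
    return mixed.reshape(S, R, D)               # back to per-sub-image layout

rng = np.random.default_rng(1)
regs = rng.normal(size=(4, 8, 32))  # 4 sub-images, 8 registers each, dim 32
mixed = reatten_sketch(regs)
print(mixed.shape)  # (4, 8, 32)
```

Because only the registers (not the full patch sequences) interact across sub-images, this exchange stays cheap: the joint attention is over S×R register tokens rather than all patch tokens of all crops.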

Experiment

Table 1: Main Results of FALCON on MME-RealWorld.


Table 2 and 3: Main Results of FALCON on More Diverse Benchmarks.

Qualitative Results


Visualization of the register-to-image attention map. As shown in Figure (b), each visual register focuses on specific parts of the image, capturing rich visual patterns. Meanwhile, registers pay minimal attention to background areas, effectively avoiding the inclusion of redundant content. As shown in Figure (c), in the ViT model without ReAtten, the attention patterns across different sub-images appear extremely fragmented. In contrast, the ViT model with ReAtten shows continuous attention patterns, indicating effective information interaction between sub-images.



Case Study on Diverse Tasks. The figures on the left illustrate FALCON's exceptional ability to recognize small objects and text in natural scenes, demonstrating its capability to capture rich, fine-grained details in high-resolution images. The figure on the right highlights FALCON's proficiency in understanding and summarizing high-resolution document images with dense text, while also demonstrating its sensitivity to small text elements. These examples demonstrate FALCON's remarkable capabilities across various high-resolution vision-language tasks.

Conclusion

To address the visual redundancy and fragmentation in high-resolution MLLMs, we propose FALCON. FALCON employs an innovative visual register technique that simultaneously addresses both challenges. This technique uses a ReCompact mechanism to adaptively aggregate essential visual information through visual registers, creating a compact, non-redundant representation. Additionally, a novel ReAtten module is introduced to facilitate information exchange among sub-images via visual registers, thereby enhancing visual continuity during encoding. Extensive experiments demonstrate FALCON’s superiority in high-resolution understanding and validate the effectiveness of the proposed ReCompact and ReAtten.

BibTeX

@InProceedings{zhang2025falcon,
    author={Zhang, Renshan and Shao, Rui and Chen, Gongwei and Zhang, Miao and Zhou, Kaiwen and Guan, Weili and Nie, Liqiang},
    title={FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month={October},
    year={2025},
}