Singapore Institute of Technology
Image Segmentation with Vision-Language Models

conference contribution
posted on 2025-07-25, 06:29 authored by Lihu Pan, Yunting Yang, Zhengkui Wang, Rui Zhang, Wen Shan, Jiashu Li
Image segmentation traditionally relies on predefined object classes, which makes it hard to accommodate new categories or complex queries and often necessitates model retraining. Segmentation that relies solely on visual information depends heavily on annotated samples, and as the number of unknown classes grows, segmentation performance declines significantly. To address these challenges, this paper introduces ViLaSeg, an image segmentation model that generates binary segmentation maps for query images from either free-text prompts or support images. The model uses text prompts to establish comprehensive contextual logical relationships, while visual prompts leverage the GroupViT encoder to capture local features of multiple objects, enhancing segmentation precision. By employing selective attention and cross-modal interactions, the model fuses image and text features, which are further refined by a transformer-based decoder designed for dense prediction. ViLaSeg performs well across a spectrum of segmentation tasks, including referring expression, zero-shot, and one-shot segmentation, surpassing prior state-of-the-art approaches.
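The cross-modal fusion the abstract describes — image features attending to text-prompt features before a dense-prediction head — can be illustrated with a minimal sketch. This is not the ViLaSeg implementation; the function names, dimensions, and the toy linear mask head are all illustrative assumptions, showing only the general scaled dot-product cross-attention pattern:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(patch_feats, text_feats):
    """Fuse image patch features (queries) with text token features
    (keys/values) via scaled dot-product cross-attention.
    patch_feats: (P, d) image patch embeddings
    text_feats:  (T, d) text token embeddings
    Returns fused (P, d) features with a residual connection."""
    d = patch_feats.shape[-1]
    attn = softmax(patch_feats @ text_feats.T / np.sqrt(d), axis=-1)  # (P, T)
    return patch_feats + attn @ text_feats  # residual fusion

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 64))  # e.g. a 14x14 patch grid, d=64
tokens = rng.standard_normal((8, 64))     # e.g. 8 tokens from a text prompt
fused = cross_modal_fusion(patches, tokens)

# Toy per-patch head producing a binary segmentation map (illustrative only;
# the paper uses a transformer-based decoder for dense prediction).
mask_logits = fused @ rng.standard_normal((64, 1))
binary_mask = mask_logits.reshape(14, 14) > 0
```

In the paper the fused features would instead feed the transformer decoder, and the text/image encoders (e.g. GroupViT on the visual side) would supply the embeddings; the random tensors above merely stand in for them.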

History

Journal/Conference/Book title

CSAI '23: Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence

Publication date

2024-03-14

Version

  • Published
