Abstract: Pre-training a vision-language model and then fine-tuning it on downstream tasks has become a popular paradigm. However, pre-trained vision-language models with the Transformer architecture ...
CLIP for Unsupervised and Fully Supervised Visual Grounding. This repository is the official PyTorch implementation of the paper CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding.
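The core idea referenced in the abstract, adapting a pre-trained CLIP model to the downstream grounding task, can be illustrated with a minimal sketch. The code below assumes the HuggingFace `transformers` CLIP implementation and a hypothetical `box_head` regression module; it is not the repository's actual model code, only an illustration of the pre-train-then-fine-tune paradigm for grounding.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor


class CLIPGroundingSketch(nn.Module):
    """Hypothetical sketch: a pre-trained CLIP backbone with a small
    box-regression head that predicts one box per image-text pair.
    This is an illustration, not the repository's actual model."""

    def __init__(self, clip_name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)   # pre-trained backbone
        dim = self.clip.config.projection_dim               # shared embedding size
        self.box_head = nn.Sequential(                      # fine-tuned grounding head
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 4),                               # (cx, cy, w, h) in [0, 1]
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # Encode image and referring expression with the frozen-or-tuned CLIP encoders.
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        fused = torch.cat([img, txt], dim=-1)
        return self.box_head(fused).sigmoid()                # normalized box prediction


if __name__ == "__main__":
    from PIL import Image
    import numpy as np

    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model = CLIPGroundingSketch()

    dummy_image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
    batch = processor(text=["the dog on the left"], images=dummy_image,
                      return_tensors="pt", padding=True)
    boxes = model(batch["pixel_values"], batch["input_ids"], batch["attention_mask"])
    print(boxes.shape)  # torch.Size([1, 4])
```

In this paradigm, the CLIP weights carry the pre-trained vision-language knowledge, and only a lightweight head (plus, optionally, the backbone) is tuned on the grounding data; see the paper and the repository code for the actual CLIP-VG architecture and its self-paced curriculum adapting procedure.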