VinVL

A pre-trained large-scale object-attribute detection (OD) model based on the ResNeXt-152 C4 architecture [1]. The OD model was first trained on a large corpus combining multiple public object detection datasets, including COCO, OpenImages (OI), Objects365, and Visual Genome (VG). It was then fine-tuned on VG alone, since VG is the only one of these datasets with attribute labels (see issue #120). It predicts objects from 1594 classes with attributes from 524 classes. See the code and the paper for details.

Pre-trained models

mozuma.models.vinvl.pretrained.torch_vinvl_detector

VinVL object detection model

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| score_threshold | float | | required |
| attr_score_threshold | float | | required |
| device | torch.device | PyTorch device used to initialise the model. | required |
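
The two threshold parameters are listed without descriptions. As a minimal sketch, assuming they act as minimum-confidence cutoffs in the way detector thresholds usually do (objects scoring below `score_threshold` are dropped, and attributes scoring below `attr_score_threshold` are dropped from the objects that remain), the filtering could look like the following. The `Detection` class and `filter_detections` function are hypothetical names used for illustration and are not part of mozuma; whether the real comparison is inclusive is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Detection:
    """Hypothetical container for one detected object (not a mozuma type)."""

    label: str
    score: float
    # Maps attribute name to its confidence score.
    attributes: Dict[str, float] = field(default_factory=dict)


def filter_detections(
    detections: List[Detection],
    score_threshold: float,
    attr_score_threshold: float,
) -> List[Detection]:
    """Keep confident objects, then keep only their confident attributes.

    Assumption: both thresholds are inclusive lower bounds on confidence.
    """
    kept = []
    for det in detections:
        if det.score < score_threshold:
            continue  # object itself is not confident enough
        confident_attrs = {
            name: score
            for name, score in det.attributes.items()
            if score >= attr_score_threshold
        }
        kept.append(Detection(det.label, det.score, confident_attrs))
    return kept


if __name__ == "__main__":
    raw = [
        Detection("dog", 0.9, {"brown": 0.8, "wet": 0.1}),
        Detection("cat", 0.2, {"black": 0.9}),
    ]
    for det in filter_detections(raw, score_threshold=0.5, attr_score_threshold=0.5):
        print(det.label, det.attributes)
```

Under this reading, raising `score_threshold` yields fewer objects, while raising `attr_score_threshold` yields fewer attributes per object without changing which objects are returned.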

Base model

The VinVL model is an implementation of a TorchModel.

mozuma.models.vinvl.modules.TorchVinVLDetectorModule

VinVL object detection model

Attributes:

| Name | Type | Description |
|---|---|---|
| score_threshold | float | |
| attr_score_threshold | float | |
| device | torch.device | PyTorch device used to initialise the model. |

Provider store

See the stores documentation for usage.

mozuma.models.vinvl.stores.VinVLStore

Pre-trained model states for VinVL

These are identified by training_id=vinvl.

  1. Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5579–5588. June 2021.