VinVL

A large-scale pre-trained object-attribute detection (OD) model based on the ResNeXt-152 C4 architecture¹. The OD model is first trained on a large corpus that combines multiple public object detection datasets, including COCO, OpenImages (OI), Objects365, and Visual Genome (VG). It is then fine-tuned on VG alone, since VG is the only one of these datasets with attribute labels (see issue #120). The model predicts objects from 1594 classes, with attributes from 524 classes. See the code and the paper for details.

Pre-trained models

`mozuma.models.vinvl.pretrained.torch_vinvl_detector`

VinVL object detection model

Parameters:

Name	Type	Description	Default
`score_threshold`	`float`		required
`attr_score_threshold`	`float`		required
`device`	`torch.device`	PyTorch device on which the model is initialised.	required
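
The two threshold parameters have no description here, so the following is only a minimal sketch of how such thresholds are commonly applied, not the library's implementation. It assumes that `score_threshold` drops detected objects whose confidence is below it and that `attr_score_threshold` does the same for each object's attributes. The `Detection` class, the `filter_detections` helper, and the inclusive `>=` comparison are all hypothetical and are not part of mozuma.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Detection:
    """Hypothetical container for one detected object (not a mozuma type)."""

    label: str
    score: float
    # Maps attribute name to its confidence score
    attributes: Dict[str, float] = field(default_factory=dict)


def filter_detections(
    detections: List[Detection],
    score_threshold: float,
    attr_score_threshold: float,
) -> List[Detection]:
    """Keep objects scoring at least score_threshold; for each kept object,
    keep only the attributes scoring at least attr_score_threshold.

    Assumption: both comparisons are inclusive (>=).
    """
    kept = []
    for det in detections:
        if det.score < score_threshold:
            continue  # object is not confident enough, drop it entirely
        attrs = {
            name: score
            for name, score in det.attributes.items()
            if score >= attr_score_threshold
        }
        kept.append(Detection(det.label, det.score, attrs))
    return kept


# Made-up detections, for illustration only
raw = [
    Detection("dog", 0.92, {"brown": 0.81, "wet": 0.12}),
    Detection("frisbee", 0.35, {"red": 0.90}),
]
filtered = filter_detections(raw, score_threshold=0.5, attr_score_threshold=0.5)
```

Under these assumptions, raising `score_threshold` trades recall for precision on objects, while `attr_score_threshold` independently controls how many attributes are kept per object.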

Base model

The VinVL model is an implementation of a TorchModel.

`mozuma.models.vinvl.modules.TorchVinVLDetectorModule`

VinVL object detection model

Attributes:

Name	Type	Description
`score_threshold`	`float`
`attr_score_threshold`	`float`
`device`	`torch.device`	PyTorch device on which the model is initialised.

Provider store

See the stores documentation for usage.

`mozuma.models.vinvl.stores.VinVLStore`

Pre-trained model states for VinVL

These are identified by `training_id=vinvl`.

¹ Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5579–5588. June 2021.