
CLIP

The CLIP model[1] encodes text snippets and images into a common embedding space, which enables zero-shot retrieval and prediction. See OpenAI/CLIP for the source code and the original models.

Pre-trained models

mozuma.models.clip.pretrained.torch_clip_image_encoder

Pre-trained CLIP image encoder

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) | required |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). | required |
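
A minimal usage sketch of this factory function; "ViT-B/32" is one of the OpenAI CLIP model names and is used here purely as an illustrative value:

```python
import torch

from mozuma.models.clip.pretrained import torch_clip_image_encoder

# Build the pre-trained CLIP image encoder on CPU.
# "ViT-B/32" is assumed to be an accepted CLIP model name
# (run the listing utility at the end of this page for the full set).
image_encoder = torch_clip_image_encoder(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```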

mozuma.models.clip.pretrained.torch_clip_text_encoder

Pre-trained CLIP text encoder

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) | required |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). | required |
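
The text encoder is constructed the same way; a minimal sketch, again with "ViT-B/32" as an illustrative model name:

```python
import torch

from mozuma.models.clip.pretrained import torch_clip_text_encoder

# Build the matching pre-trained CLIP text encoder.
# Using the same clip_model_name as the image encoder keeps the
# image and text embeddings in the same space.
text_encoder = torch_clip_text_encoder(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```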

Models

CLIP comes with an image encoder (CLIPImageModule) and a text encoder (CLIPTextModule). Both modules are implementations of TorchModel.

mozuma.models.clip.image.CLIPImageModule

Image encoder of the CLIP model

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). |
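
A minimal construction sketch, assuming the constructor takes the attributes listed above; the resulting module is not initialised with pre-trained weights (see the stores section below):

```python
import torch

from mozuma.models.clip.image import CLIPImageModule

# Un-initialised CLIP image encoder; weights can be loaded afterwards
# from a store (see "Pre-trained states from CLIP" below).
image_module = CLIPImageModule(
    clip_model_name="ViT-B/32",  # illustrative CLIP model name
    device=torch.device("cpu"),
)
```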

mozuma.models.clip.text.CLIPTextModule

Text encoder of the CLIP model

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). |
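
The text module follows the same pattern; again a sketch assuming the constructor mirrors the attributes above:

```python
import torch

from mozuma.models.clip.text import CLIPTextModule

# Un-initialised CLIP text encoder for the same CLIP variant.
text_module = CLIPTextModule(
    clip_model_name="ViT-B/32",  # illustrative CLIP model name
    device=torch.device("cpu"),
)
```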

Pre-trained states from CLIP

See the stores documentation for usage.

mozuma.models.clip.stores.CLIPStore

Pre-trained model states by OpenAI CLIP

These are identified by training_id=clip.
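
A minimal sketch of loading the OpenAI weights into one of the modules above. It assumes the generic store interface (StateKey and store.load) described in the stores documentation; refer to that page for the authoritative calls:

```python
import torch

from mozuma.models.clip.image import CLIPImageModule
from mozuma.models.clip.stores import CLIPStore
from mozuma.states import StateKey  # assumed location of StateKey

# Un-initialised image encoder to be filled with the OpenAI CLIP weights.
image_module = CLIPImageModule(
    clip_model_name="ViT-B/32", device=torch.device("cpu")
)

store = CLIPStore()
# Pre-trained states published by this store are identified by training_id="clip".
store.load(
    image_module,
    state_key=StateKey(state_type=image_module.state_type, training_id="clip"),
)
```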

List models and parameters

A command-line utility lists all available CLIP models with their associated parameters in JSON format:

python -m mozuma.models.clip.list

The output is used to fill the mozuma.models.clip.parameters file.
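
For instance, the output can be redirected to a file for inspection (the file name below is purely illustrative):

python -m mozuma.models.clip.list > clip_parameters.json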


  1. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 8748–8763. PMLR, 18–24 Jul 2021. URL: https://proceedings.mlr.press/v139/radford21a.html