
CLIP

The CLIP model[1] encodes text snippets and images into a common embedding space, which enables zero-shot retrieval and prediction. See OpenAI/CLIP for the source code and the original models.

Pre-trained models

mozuma.models.clip.pretrained.torch_clip_image_encoder

Pre-trained CLIP image encoder

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) | required |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). | required |
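
A minimal usage sketch of this factory function; "ViT-B/32" is one of the OpenAI CLIP model names and is used here purely as an illustrative value:

```python
import torch

from mozuma.models.clip.pretrained import torch_clip_image_encoder

# Build the pre-trained CLIP image encoder on CPU.
# "ViT-B/32" is assumed to be an accepted CLIP model name
# (run the listing utility at the end of this page for the full set).
image_encoder = torch_clip_image_encoder(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```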

mozuma.models.clip.pretrained.torch_clip_text_encoder

Pre-trained CLIP text encoder

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) | required |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). | required |
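
The text encoder is constructed the same way; a minimal sketch, again with "ViT-B/32" as an illustrative model name:

```python
import torch

from mozuma.models.clip.pretrained import torch_clip_text_encoder

# Build the matching pre-trained CLIP text encoder.
# Using the same clip_model_name as the image encoder keeps the
# image and text embeddings in the same space.
text_encoder = torch_clip_text_encoder(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```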

Models

CLIP comes with an image encoder (CLIPImageModule) and a text encoder (CLIPTextModule). Both modules are implementations of TorchModel.

mozuma.models.clip.image.CLIPImageModule

Image encoder of the CLIP model

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). |
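
A minimal construction sketch, assuming the constructor takes the attributes listed above; the resulting module is not initialised with pre-trained weights (see the stores section below):

```python
import torch

from mozuma.models.clip.image import CLIPImageModule

# Un-initialised CLIP image encoder; weights can be loaded afterwards
# from a store (see "Pre-trained states from CLIP" below).
image_module = CLIPImageModule(
    clip_model_name="ViT-B/32",  # illustrative CLIP model name
    device=torch.device("cpu"),
)
```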

mozuma.models.clip.text.CLIPTextModule

Text encoder of the CLIP model

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| clip_model_name | str | Name of the model to load (see CLIP doc) |
| device | torch.device | The PyTorch device to initialise the model weights. Defaults to torch.device("cpu"). |
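
The text module follows the same pattern; again a sketch assuming the constructor mirrors the attributes above:

```python
import torch

from mozuma.models.clip.text import CLIPTextModule

# Un-initialised CLIP text encoder for the same CLIP variant.
text_module = CLIPTextModule(
    clip_model_name="ViT-B/32",  # illustrative CLIP model name
    device=torch.device("cpu"),
)
```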

Pre-trained states from CLIP

See the stores documentation for usage.

mozuma.models.clip.stores.CLIPStore

Pre-trained model states by OpenAI CLIP

These are identified by training_id=clip.
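
A minimal sketch of loading the OpenAI weights into one of the modules above. It assumes the generic store interface (StateKey and store.load) described in the stores documentation; refer to that page for the authoritative calls:

```python
import torch

from mozuma.models.clip.image import CLIPImageModule
from mozuma.models.clip.stores import CLIPStore
from mozuma.states import StateKey  # assumed location of StateKey

# Un-initialised image encoder to be filled with the OpenAI CLIP weights.
image_module = CLIPImageModule(
    clip_model_name="ViT-B/32", device=torch.device("cpu")
)

store = CLIPStore()
# Pre-trained states published by this store are identified by training_id="clip".
store.load(
    image_module,
    state_key=StateKey(state_type=image_module.state_type, training_id="clip"),
)
```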

List models and parameters

A command-line utility lists all available CLIP models with their associated parameters in JSON format:

python -m mozuma.models.clip.list

The output is used to fill the mozuma.models.clip.parameters file.
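
For instance, the output can be redirected to a file for inspection (the file name below is purely illustrative):

python -m mozuma.models.clip.list > clip_parameters.json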


  1. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 8748–8763. PMLR, 18–24 Jul 2021. URL: https://proceedings.mlr.press/v139/radford21a.html