CLIP
The CLIP model [1] encodes text snippets and images into a common embedding space, which enables zero-shot retrieval and prediction. See OpenAI/CLIP for the source code and the original models.
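To illustrate the shared embedding space, here is a minimal zero-shot prediction sketch using the upstream OpenAI clip package directly rather than the mozuma wrappers; the image path and candidate captions are placeholders:

```python
import clip
import torch
from PIL import Image

# Load a pre-trained CLIP model together with its image pre-processing pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and a few candidate captions into the common embedding space
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder path
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and each caption
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# The caption with the highest probability is the zero-shot prediction
print(probs)
```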
Pre-trained models
mozuma.models.clip.pretrained.torch_clip_image_encoder
Pre-trained CLIP image encoder
Parameters:
Name | Type | Description | Default |
---|---|---|---|
clip_model_name | str | Name of the model to load (see CLIP doc) | required |
device | torch.device | The PyTorch device to initialise the model weights. Defaults to `torch.device("cpu")`. | required |
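For example, a pre-trained image encoder could be obtained as follows; this is a usage sketch, and "ViT-B/32" is assumed to be one of the valid clip_model_name values:

```python
import torch
from mozuma.models.clip.pretrained import torch_clip_image_encoder

# Usage sketch: "ViT-B/32" is an assumed clip_model_name value
image_encoder = torch_clip_image_encoder(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```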
mozuma.models.clip.pretrained.torch_clip_text_encoder
Pre-trained CLIP text encoder
Parameters:
Name | Type | Description | Default |
---|---|---|---|
clip_model_name | str | Name of the model to load (see CLIP doc) | required |
device | torch.device | The PyTorch device to initialise the model weights. Defaults to `torch.device("cpu")`. | required |
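The text encoder factory mirrors the image encoder; a sketch under the same assumptions:

```python
import torch
from mozuma.models.clip.pretrained import torch_clip_text_encoder

# Usage sketch with the same assumed clip_model_name value
text_encoder = torch_clip_text_encoder(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```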
Models
CLIP comes with an image (CLIPImageModule) and a text (CLIPTextModule) encoder. These modules are implementations of TorchModel.
mozuma.models.clip.image.CLIPImageModule
Image encoder of the CLIP model
Attributes:
Name | Type | Description |
---|---|---|
clip_model_name | str | Name of the model to load (see CLIP doc) |
device | torch.device | The PyTorch device to initialise the model weights. Defaults to `torch.device("cpu")`. |
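A minimal construction sketch, assuming the attributes listed above double as constructor arguments; pre-trained weights are loaded separately through the store described below:

```python
import torch
from mozuma.models.clip.image import CLIPImageModule

# Sketch: the attributes above are assumed to be constructor arguments
image_module = CLIPImageModule(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```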
mozuma.models.clip.text.CLIPTextModule
Text encoder of the CLIP model
Attributes:
Name | Type | Description |
---|---|---|
clip_model_name | str | Name of the model to load (see CLIP doc) |
device | torch.device | The PyTorch device to initialise the model weights. Defaults to `torch.device("cpu")`. |
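The text module can be constructed in the same way, under the same assumption about its constructor:

```python
import torch
from mozuma.models.clip.text import CLIPTextModule

# Sketch: same constructor assumption as for CLIPImageModule
text_module = CLIPTextModule(
    clip_model_name="ViT-B/32",
    device=torch.device("cpu"),
)
```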
Pre-trained states from CLIP
See the stores documentation for usage.
mozuma.models.clip.stores.CLIPStore
Pre-trained model states by OpenAI CLIP
These are identified by training_id=clip.
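As a rough sketch of how these states might be loaded, assuming mozuma's generic store interface (a load() method taking a model and a StateKey, with StateKey assumed to live in mozuma.states); refer to the stores documentation for the authoritative API:

```python
from mozuma.models.clip.stores import CLIPStore
from mozuma.states import StateKey  # assumed location of StateKey

store = CLIPStore()

# Load the OpenAI pre-trained weights, identified by training_id="clip",
# into a CLIP module built as in the previous sections (here: image_module)
store.load(
    image_module,
    StateKey(state_type=image_module.state_type, training_id="clip"),
)
```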
List models and parameters
There is a command line utility that lists all available models from CLIP with their associated parameters in JSON format. Its output is used to fill the mozuma.models.clip.parameters file.
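As a rough approximation of that utility, the upstream clip package can at least enumerate the available model names (the actual utility additionally emits each model's parameters as JSON):

```python
import json
import clip

# Approximation only: list the CLIP model names known to the upstream package
print(json.dumps(clip.available_models(), indent=2))
```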
[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 8748–8763. PMLR, 18–24 Jul 2021. URL: https://proceedings.mlr.press/v139/radford21a.html.