

MoZuMa is a model zoo for multimedia search applications. It provides an easy-to-use interface to run models for:

  • Text to image retrieval: Rank images by their similarity to a text query.
  • Image similarity search: Rank images by their similarity to a query image.
  • Image classification: Add labels to images.
  • Face detection: Detect and retrieve images with similar faces.
  • Object detection: Detect and retrieve images with similar objects.
  • Video key-frame extraction: Retrieve the important frames of a video. Key-frames are used to apply all the other queries to videos.
  • Multilingual text search: Rank similar sentences from a text query in multiple languages.


We support Python >= 3.7 and PyTorch >= 1.9. MoZuMa will likely run on earlier versions of PyTorch, but we do not test against versions before 1.9.


mozuma requires PyTorch; we recommend following the PyTorch documentation for its installation procedure.

pip install mozuma

How to search multimedia collections?

Running a model with MoZuMa always follows the same structure:

# Getting a dataset of images (1)
dataset = ImageDataset(LocalBinaryFilesDataset(paths))

# Model definition (2)
model = torch_resnet_imagenet("resnet18")

# Creating the callback to collect data (3)
features = CollectFeaturesInMemory()
labels = CollectLabelsInMemory()
callbacks = [features, labels]

# Getting the torch runner for inference (4)
runner = TorchInferenceRunner(

  1. See Datasets for a list of available datasets.
  2. See the list of all models.
  3. See the list of available callbacks.
  4. See the list of available runners.
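
The four numbered steps share a simple shape: a runner iterates over a dataset, applies a model to each item, and hands every output to the callbacks. The sketch below mimics that flow in plain Python. All names in it (`ToyRunner`, `CollectInMemory`, `toy_model`) are hypothetical and are not part of the MoZuMa API; it only illustrates why swapping the model or the callbacks changes what gets collected.

```python
from typing import Callable, List, Sequence, Tuple


class CollectInMemory:
    """Hypothetical callback: stores (index, output) pairs in a list."""

    def __init__(self) -> None:
        self.items: List[Tuple[int, float]] = []

    def save(self, index: int, output: float) -> None:
        self.items.append((index, output))


class ToyRunner:
    """Hypothetical runner: applies a model to each item, then notifies callbacks."""

    def __init__(
        self,
        dataset: Sequence[Sequence[int]],
        model: Callable[[Sequence[int]], float],
        callbacks: Sequence[CollectInMemory],
    ) -> None:
        self.dataset = dataset
        self.model = model
        self.callbacks = callbacks

    def run(self) -> None:
        for index, item in enumerate(self.dataset):
            output = self.model(item)
            for callback in self.callbacks:
                callback.save(index, output)


def toy_model(pixels: Sequence[int]) -> float:
    """Stand-in for a neural network: returns the mean pixel intensity."""
    return sum(pixels) / len(pixels)


# Three tiny "images", each a flat list of pixel values (made-up data).
dataset = [[0, 0, 0], [10, 20, 30], [255, 255, 255]]
collector = CollectInMemory()
ToyRunner(dataset, toy_model, [collector]).run()
print(collector.items)
```

Because the runner only knows about the three interfaces, replacing `toy_model` or adding a second callback requires no change to the loop itself.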

The following sections discuss different search scenarios. In each of them, we change the model and callbacks variables.

Search with labels

Some models have been trained for classification; they produce labels that describe an image. See mozuma.labels for a list of label sets. The models supporting labels are:

Work with custom labels

We also provide a way to add custom labels on top of existing features. See the classification module or the CIFAR10 Image Classification example.
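
To give an intuition for what "custom labels on top of existing features" can mean, here is a minimal sketch of a nearest-centroid classifier over pre-computed feature vectors. This is a deliberately simple stand-in and not the method implemented by the classification module; the function names and the two-dimensional features are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fit_centroids(
    features: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Average the feature vectors of each label into one centroid per label."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for vector, label in zip(features, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(centroids: Dict[str, List[float]], vector: Sequence[float]) -> str:
    """Return the label whose centroid is closest to the given feature vector."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], vector))


# Made-up features, as if they had been extracted by a pre-trained model.
train_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(train_features, train_labels)
print(predict(centroids, [0.7, 0.3]))
print(predict(centroids, [0.3, 0.6]))
```

The point of the sketch is that the expensive model runs only once to produce features; the custom labels are learned from those features with a much cheaper second step.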

Search with features

Sometimes what we are looking for is not included in the labels. In this case, we can use similarity search to find images that resemble a query image. This is done by comparing the embeddings (or features) of images.

What are embeddings?

Embeddings (or features) are dense vectors that encode a lot of information about an image. Their most useful property is that embeddings that are close together usually represent similar entities.
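
As a rough illustration of "close together", the sketch below ranks a small collection by cosine similarity to a query embedding. The vectors and file names are made up, cosine similarity is only one of several possible measures, and the code assumes that no vector is all zeros.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], collection: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vector)) for name, vector in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional embeddings; real ones have hundreds of dimensions.
collection = {
    "beach_1.jpg": [0.9, 0.1, 0.0],
    "beach_2.jpg": [0.8, 0.2, 0.1],
    "office.jpg": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, collection)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For large collections, a brute-force scan like this one is usually replaced by an approximate nearest-neighbour index, but the ranking principle stays the same.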

For instance, embeddings from a DenseNet pre-trained on Places365 contain information on the context of the image: the scene where it takes place. Searching for similar images therefore returns images taken in a similar scene.

On the other hand, the version pre-trained on ImageNet tends to focus on the main subject.

Similarly, embeddings can also be used to find similar faces or objects with ArcFace and VinVL respectively.

Finally, embeddings are also used by CLIP to find images similar to a text description.

Handling videos

Searching videos can be inefficient if we have to apply models to all frames. Here, we provide video key-frame extractors that select only a few representative frames of a video. We then apply the other models to these key-frames.
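
One simple way to pick representative frames is to keep a frame only when its features differ enough from the last frame that was kept. The sketch below implements that heuristic on made-up per-frame feature vectors. It is an assumption for illustration, not necessarily the strategy used by the extractors provided here, and `select_keyframes` and its `threshold` parameter are hypothetical names.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_keyframes(
    frame_features: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Return the indices of frames that differ from the previous key-frame
    by more than `threshold`. The first frame is always kept."""
    if not frame_features:
        return []
    keyframes = [0]
    for index in range(1, len(frame_features)):
        last_kept = frame_features[keyframes[-1]]
        if euclidean(last_kept, frame_features[index]) > threshold:
            keyframes.append(index)
    return keyframes


# Made-up features for six frames: two scene changes, at frame 3 and frame 5.
frames = [
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [5.0, 5.0],
    [5.1, 5.0],
    [0.0, 0.0],
]
keyframes = select_keyframes(frames, threshold=1.0)
print(keyframes)
```

Only the selected indices then need to go through the heavier models, which is where the saving comes from.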

Going further