# Learning to link images with their descriptions

In this blog post, we will present an introduction to recent advances in multimodal information retrieval and conditional language models. In other words, it will be about machine learning, dealing with both image and textual data.

Everything written here is the ground and absolute truth, gated by the understanding of the author.

## Why do we want to do this?

First of all, it is necessary to explain why we are interested in extracting knowledge from both image and text data.

Many real-world applications involve more than one data modality. Web pages, for example, contain text, images, links to other pages, videos, ads, style, etc. Restricting oneself to a single modality means losing all the information contained in the others. As humans, we often need several of these sources to fully understand a Web page, and having only one can lead to ambiguous information. Other examples of multimodal real-world data include catalog products, YouTube videos, social network posts, and so on. We are interested in techniques that allow us to use knowledge from different modalities.

We won’t tackle this problem directly. It is a higher-level, longer-term goal, and there are several intermediate steps on the way to understanding multimodal data. One of them is being able to « link » an image with its description. Basically, we can distinguish two tasks when dealing with image and textual data:

• information retrieval: given an image (resp. text) query, retrieve the most relevant text (resp. image) in a database,
• caption generation: given an image, automatically generate the description of this image.

## Why is it hard?

Well, it is hard because we want to associate two views of one data point that belong to very different spaces. The human brain is very good at saying that this picture

Example from MS-COCO dataset

can be described by the sentence « a stuffed teddy bear in a water and soap filled sink with a rubber duck ». But for the computer, this task reduces to saying that a big table of numbers lying in $\mathbb{R}^{height \times width \times 3}$ is somehow « close » to a sequence of integers, each indexing a word in a vocabulary.
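To make this gap concrete, here is a minimal sketch of what the two views look like to a program. The tiny vocabulary, the 2x2 image and the helper `encode` are made up for illustration.

```python
# Toy illustration: an image is a height x width x 3 grid of numbers,
# a sentence is a sequence of integer indices into a vocabulary.
# The vocabulary and the pixel values are invented for this example.

height, width = 2, 2
image = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]  # RGB values

vocab = ["<unk>", "a", "teddy", "bear", "in", "sink"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def encode(sentence):
    """Map each word to its vocabulary index (0 for unknown words)."""
    return [word_to_index.get(word, 0) for word in sentence.lower().split()]


caption_ids = encode("a teddy bear in a sink")
image_shape = (len(image), len(image[0]), len(image[0][0]))
```

Nothing in these two objects tells the machine that they describe the same scene; that relation has to be learnt.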

## How

We will present, from a high level perspective, a few methods for image and text analysis. The typical datasets used in multimodal analysis are the following:

• Flickr8K: 8,000 images, each accompanied with 5 sentences describing image content,
• Flickr30K: 31,783 images, each accompanied with 5 sentences describing image content,
• MS-COCO: 82,783 training images and 40,504 validation images, each accompanied with 5 sentences.

### Information retrieval

As explained before, we tackle the problem of multimodal information retrieval. Given one view (image or text), we want to retrieve the most relevant other view from a database. Mathematically, we want to build a system that computes a similarity measure between an image and a text. To do so, we use the general machine learning framework: given a dataset of aligned images and textual descriptions, we try to learn from it a way to compute a similarity between the two views.

Many algorithms have tried to learn a similarity measure between an image and a text, and the goal here is not to provide an exhaustive bibliography on the topic. Therefore, we will present two different methods.

#### Linear mapping between pre-learnt embeddings

Suppose you have learnt a way to represent sentences in a « low » dimensional space, such that the position of a sentence in this space encodes some high-level information. This space provides a fixed-size embedding for sentences. There are several ways to embed a sentence:

• a huge vocabulary-size vector containing word counts,
• another huge vocabulary-size vector containing TF-IDF coefficients,
• if we pair each word with an embedding (such as Word2Vec), the sum of all embeddings in a sentence,
• a learnt combination of word embeddings.

We focus on the last bullet point. Skip-thought vectors (STV) [Kiros et al., Skip-thought vectors] are an effective way to learn how to combine the words of a sentence. Unlike the first three bullet points, the STV algorithm doesn’t make the bag-of-words assumption: word order matters. Intuitively, sentences that have the same meaning have very similar representations.
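For comparison, the first three options in the list above can be sketched in a few lines. This is only an illustration: the corpus is invented, the TF-IDF formula is one common variant among several, and `word_vectors` is a stand-in for real pretrained embeddings such as Word2Vec.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = ["a dog chases a ball", "a cat sleeps", "the dog sleeps"]
tokenized = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokenized for word in sentence})


def count_vector(tokens):
    """Vocabulary-size vector of raw word counts."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tfidf_vector(tokens):
    """Vocabulary-size vector of TF-IDF weights (one common variant)."""
    counts = Counter(tokens)
    vector = []
    for word in vocab:
        doc_freq = sum(1 for sentence in tokenized if word in sentence)
        idf = math.log(len(tokenized) / doc_freq)
        vector.append(counts[word] / len(tokens) * idf)
    return vector


# Stand-in for pretrained word embeddings: arbitrary 2-d vectors.
word_vectors = {word: [float(i), float(len(word))] for i, word in enumerate(vocab)}


def summed_embedding(tokens):
    """Sum of the embeddings of the words in the sentence."""
    dim = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[word][d] for word in tokens) for d in range(dim)]
```

All three functions return the same vector for « a dog chases a ball » and « a ball chases a dog », which is exactly the bag-of-words limitation that a learnt, order-aware encoder avoids.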

To illustrate, we computed the skip-thought vectors of sentences extracted from scientific papers, and computed the t-SNE to visualize the datapoints.

Each point represents the skip-thought vector of a sentence extracted from articles dealing with biology (green), machine learning (blue) or psychology (red)

As you can see, the skip-thought algorithm brings close together sentences that deal with the same topic, without any supervision. This suggests that the sentence embedding provided by the skip-thought model is qualitatively « good ».

Suppose you have also learnt a way to represent images in a space of the same kind: not the SAME space, but another one in which the position of an image encodes some high-level information. Convolutional neural networks (ConvNets) trained on a very large labelled image dataset (like ImageNet) have proven to provide good vector representations of images. Such a representation can be obtained by cutting off the last (classification) layer of the network and retrieving the activations of the layer just before it.

Note that until now, we haven’t used any aligned corpus to learn relations between images and texts. We use a huge amount of images to learn an image representation and a huge amount of sentences to learn a sentence representation, without any cross-modal knowledge. This being said, we use the following idea: if both our image and sentence representations are strong, and really able to encode high-level information, it shouldn’t be hard to map these two spaces into a third one, a multimodal one. That is precisely what the authors of [Kiros et al., Skip-thought vectors] did:

• each image from the MS-COCO dataset is presented at the input of a pretrained ConvNet, yielding a vector $x\in \mathbb{R}^{d_{im}}$
• each sentence from the MS-COCO dataset is presented at the input of a pretrained skip-thought encoder, yielding a vector $y \in \mathbb{R}^{d_{txt}}$
• then, we learn two simple linear mappings, following this objective function:

$\sum_x \sum_k \max \{0, \alpha - s(Ux, Vy)+s(Ux,Vy_k)\}$

$+ \sum_y \sum_k \max \{0, \alpha - s(Vy, Ux)+s(Vy,Ux_k)\}$

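where $\alpha$ is a margin, $y_k$ (resp. $x_k$) are contrastive sentences (resp. images) that do not match $x$ (resp. $y$), and $U$ and $V$ are the only parameters of our multimodal mapping, simply learnt by backpropagation. $Ux$ and $Vy$ are the representations of an image and a text in the multimodal space. The similarity score between an image and a text is the dot product $s(Ux, Vy) = Ux \cdot Vy$. If both vectors are scaled to unit norm, this dot product is exactly their cosine similarity.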
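The objective can be written down in a few lines of plain Python. This is a sketch under assumptions rather than the authors' implementation: the projections are scaled to unit norm before taking the dot product, the margin value `alpha=0.2` is arbitrary, and the contrastive examples are simply the other items of the batch.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def unit(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def score(U, V, x, y):
    """s(Ux, Vy): dot product of the unit-normalized projections."""
    a, b = unit(matvec(U, x)), unit(matvec(V, y))
    return sum(p * q for p, q in zip(a, b))


def ranking_loss(U, V, images, sentences, alpha=0.2):
    """Two-way hinge loss; images[i] and sentences[i] form a matching pair."""
    loss = 0.0
    for i in range(len(images)):
        positive = score(U, V, images[i], sentences[i])
        for k in range(len(images)):
            if k == i:
                continue
            # Image as query, non-matching sentence as contrastive example.
            loss += max(0.0, alpha - positive + score(U, V, images[i], sentences[k]))
            # Sentence as query, non-matching image as contrastive example.
            loss += max(0.0, alpha - positive + score(U, V, images[k], sentences[i]))
    return loss
```

The loss is zero as soon as every matching pair scores at least `alpha` higher than every non-matching pair, so training only has to push pairs apart until that margin is reached.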

This, because we have very high quality separate embeddings $x$ and $y$ for image and text, can yield impressive results. Here is a table from the original Skip-thought vectors article, where the authors present their information retrieval results on MS-COCO.

R@K: proportion of cases where the correct ground truth retrieved view is ranked in top K by the system’s similarity score.
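As a minimal sketch, R@K can be computed from a matrix of similarity scores as follows. The function name and the convention that the ground truth for query `i` is candidate `i` are assumptions made for the example; in the real datasets each image comes with several captions.

```python
def recall_at_k(similarity, k):
    """similarity[i][j] is the score between query i and candidate j.

    The ground-truth candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.1],
]
r_at_1 = recall_at_k(scores, 1)
r_at_2 = recall_at_k(scores, 2)
```

In this toy matrix only the first query ranks its ground truth first, and the second query ranks it second.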

The skip-thought results are the three bottom rows (uni-skip, bi-skip and combine-skip are three variants of the STV model). Though the STV score is not the best, one must keep in mind that the multimodal information is only used to train two linear mappings. The algorithms that outperform STV are much more complicated and explicitly model the interaction between the image and text modalities. Let us explore one of these models, which is more complex and performs better at information retrieval (since, unlike the skip-thought model, it was designed for this task).

#### Multimodal CNN

The model is presented in [Ma, L. et al., Multimodal convolutional neural networks for matching image and sentence].

The idea behind this model is the following: we want to map the global image representation (typically given by the output of a CNN) to local fragments of the sentence associated with it. The architecture is a three-stage one:

• Image-CNN: takes the raw pixels of the input image and yields a high-quality embedding
• Matching CNN: takes the image embedding and merges it with the sentence representation, yielding a vector that contains the merged image/sentence information. This merging can be carried out at the word level, at the phrase level, or at the sentence level

• Scorer: a multi-layer perceptron yielding a matching score.

We then train our model so that it maximizes the score of matching pairs and minimizes the score of non-matching ones.

Illustration of the multimodal CNN architecture

Incorporating the image representation into the text representation via the matching CNN is done through the convolution operation itself, as illustrated here:

Illustration of how matching-CNN merges image embedding into textual one
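One plausible reading of this word-level merging, given here only as a sketch, is that the image vector is concatenated to each window of consecutive word vectors before the convolution filters are applied. The function name, the window size and the ReLU non-linearity are assumptions for illustration; the exact layer in the paper may differ.

```python
def matching_layer(word_vectors, image_vector, filters, biases, window=2):
    """Apply each filter to [words in the window ; image], followed by a ReLU.

    Returns one output vector (one value per filter) per window position.
    """
    outputs = []
    for start in range(len(word_vectors) - window + 1):
        # Flatten the word vectors of the window and append the image vector.
        fragment = [c for word in word_vectors[start:start + window] for c in word]
        fragment += list(image_vector)
        outputs.append([
            max(0.0, sum(f * x for f, x in zip(flt, fragment)) + bias)
            for flt, bias in zip(filters, biases)
        ])
    return outputs


words = [[1, 0], [0, 1], [1, 1]]   # three 2-d word embeddings (toy values)
image = [2]                        # 1-d image embedding (toy value)
merged = matching_layer(words, image, filters=[[1, 1, 1, 1, 1]], biases=[0.0])
```

Because the image vector is part of every window, each output already mixes local text information with the global image content.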

As we said earlier, the matching can be done at different textual levels. Incorporating the image into the first convolutional layer will align the global image with words, but we can instead merge the image at a higher level in the matching-CNN. In the end, the best model is an ensemble of four models:

• $MatchCNN_{wd}$: image merged at the first layer (word level)
• $MatchCNN_{phs}$: image merged at the second layer (short phrases level)
• $MatchCNN_{phl}$: image merged at the third layer (long phrases level)
• $MatchCNN_{st}$: image merged at the output layer (sentence level)

The image CNN is pretrained on ImageNet, and the word embeddings are initialized with Word2Vec. All the weights are fine-tuned during training.

To the best of our knowledge, this model achieves state-of-the-art performance on Flickr30K and MS-COCO at the time of writing. It doesn’t perform very well on Flickr8K, probably because this dataset is too small to properly tune the m-CNN’s parameters.

To conclude on this information retrieval task:

• having a good high-level unimodal representation for both modalities allows us to learn only two simple linear mappings into a multimodal space. The similarity between a text and an image is then simply the cosine similarity between the mapped vectors,
• but if we want a better similarity metric, we might want to merge the global image representation with local text fragments.

These methods are good at computing a similarity metric between an image and a text, which is perfect for information retrieval. Now, we will show how one can train a system to automatically generate a caption, given an image.

### Automatic captioning

The task of sampling a sentence $\mathcal{S} = \{w_1 ,..., w_n \}$, given an image $\mathcal{I}$, can fit in the framework of conditional language models. We learn a parametric model for the probability distribution

$P \left( \mathcal{S} | \mathcal{I} \right) = \prod_{j=1}^n P \left( w_j | w_{1:j-1}, \mathcal{I} \right)$

which we sample from at inference.

We present the model proposed in [Vinyals et al., Show and tell: a neural image caption generator], which attempts to sample a sentence given a representation of the image.

Illustration of the NIC model

The idea is to compute the image embedding via a ConvNet, and to provide this representation to an LSTM language model. The LSTM is used to predict, at each timestep, the next occurring word given the previous words.

Global architecture of the NIC model

As we can see here, we compute the image representation with a pretrained CNN, and use this representation as the first input of the LSTM. Then, we compute the probability distribution $p_t$ at each timestep, which can be viewed as a vector containing $|Vocab|$ dimensions, and where each coordinate $j$ represents the probability of having the word $Vocab[j]$ at this position $t$ of the sentence.

At training time, we minimize the following cost function

$L \left( \mathcal{I}, \mathcal{S} \right) = - \sum_{t=1}^n \log p_t \left( w_t \right)$

using stochastic gradient descent on the LSTM weights.
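As a small sketch, the cost function above is just the negative log-likelihood of the ground-truth words under the per-timestep distributions. The function name and the toy distributions are made up for the example; in the real model the distributions come from the LSTM.

```python
import math


def caption_nll(step_distributions, target_ids):
    """Negative log-likelihood of a caption.

    step_distributions[t] is the distribution over the vocabulary at timestep t,
    target_ids[t] is the vocabulary index of the ground-truth word at timestep t.
    """
    return -sum(math.log(p_t[word]) for p_t, word in zip(step_distributions, target_ids))


# Two timesteps, vocabulary of size 2 (toy values).
distributions = [[0.5, 0.5], [0.25, 0.75]]
loss = caption_nll(distributions, [0, 1])
sentence_probability = math.exp(-loss)
```

Exponentiating the negated loss gives back the product of the per-word conditional probabilities, i.e. the probability of the whole sentence given the image.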

At test time, we present an image representation at the first timestep of the LSTM. From there, we can either sample from the distribution $P \left( \mathcal{S} | \mathcal{I} \right)$, or search for the sentence that maximizes this conditional probability.

• Sampling: at each timestep, the LSTM outputs a distribution which we can sample from. The word sampled at timestep $t$ is used as input for timestep $t+1$, until we sample a special <end_of_sequence> token or reach the maximum allowed length.
• Maximum search: a beam search can be used to get an approximation of the sentence $\mathcal{S}^*$ that maximizes $P(\mathcal{S} | \mathcal{I} )$.
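Both strategies can be sketched with a toy stand-in for the LSTM. Everything below is illustrative: `next_word_distribution` is a hypothetical lookup table with invented probabilities, whereas a real model would compute the distribution from the image and the previous words.

```python
import math
import random

END = "<end_of_sequence>"


def next_word_distribution(prefix):
    """Stand-in for the LSTM: P(next word | previous words, image) as a dict."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    return table.get(tuple(prefix), {END: 1.0})


def sample_caption(max_length=10, rng=random):
    """Sample one word at a time until the end token or the maximum length."""
    words = []
    while len(words) < max_length:
        dist = next_word_distribution(words)
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == END:
            break
        words.append(word)
    return words


def beam_search(beam_size=2, max_length=10):
    """Keep the beam_size most probable partial sentences at each timestep."""
    beams = [(0.0, [], False)]  # (log-probability, words, finished?)
    for _ in range(max_length):
        candidates = []
        for logp, words, done in beams:
            if done:
                candidates.append((logp, words, True))
                continue
            for word, p in next_word_distribution(words).items():
                if word == END:
                    candidates.append((logp + math.log(p), words, True))
                else:
                    candidates.append((logp + math.log(p), words + [word], False))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(done for _, _, done in beams):
            break
    best_logp, best_words, _ = beams[0]
    return best_words, math.exp(best_logp)
```

With this table, a beam of size 1 (greedy search) commits to « a » first and ends with a sentence of probability 0.3, while a beam of size 2 keeps « the » alive and finds « the dog » with probability 0.36.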

The datasets used to train and benchmark automatic captioning models can be the same as those used for multimodal information retrieval, but the evaluation metrics have to be different. A widely used criterion for evaluating automatic captioning models is the BLEU score [Papineni et al., BLEU: A Method for Automatic Evaluation of Machine Translation]. It was initially proposed for machine translation evaluation, but can be used for automatic captioning evaluation as well. It consists of comparing the n-grams of a candidate sentence (sampled by our model) with those of some reference sentences (the ground truth captions for the image).
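The n-gram comparison at the heart of BLEU can be sketched as follows. This is only the clipped (« modified ») n-gram precision for a single value of n, not the full score: BLEU additionally combines these precisions for several values of n with a geometric mean and applies a brevity penalty. The function names are chosen for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipping.

    Each candidate n-gram is counted at most as many times as it appears
    in the reference where it is most frequent.
    """
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
degenerate = modified_precision("the the the the the the the".split(), references, 1)
bigram = modified_precision("the cat sat on the mat".split(), references[:1], 2)
```

The clipping is what prevents a degenerate candidate that repeats a frequent word from getting a perfect unigram score.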

Here are a few captions generated by the model after training on the MS-COCO dataset.

Sentences generated by the neural image captioner

The model learnt to use the information lying in the image representation to condition a language model, yielding a sentence that:

• is a correct English sentence (since we learn a language model),
• is not just ANY correct sentence, but one that describes the image.

Even though this model performs well, we can think of two limitations that might, if overcome, lead to better-performing models:

• we use a global representation for the image that does not explicitly embed positional information,
• when we sample a sentence given an image, the same image representation is used at each timestep.

These two points are a problem because we might want to learn a model that can focus on a particular zone of the image when sampling certain words. For instance, if we want to sample the sentence

« a group of young people playing a game of frisbee »,

it might be useful to have a model that somehow pays attention to the image area containing the frisbee when it tries to sample the word « frisbee ». This is done via attention mechanisms, presented in the context of automatic captioning in [Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention]. In the following examples, we can see where the model paid attention when sampling the underlined word.

Illustration of results given by the attention mechanism
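The core of a « soft » attention step can be sketched in a few lines: the image is represented by one feature vector per region, each region gets a relevance score, and the representation fed to the language model is the softmax-weighted average of the region features. In this sketch the relevance scores are given as inputs; in the paper they are computed by a small network from the region features and the LSTM's previous hidden state. The function names and values are invented for the example.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, relevance_scores):
    """Return the attention weights and the weighted average of the regions."""
    weights = softmax(relevance_scores)
    dim = len(region_features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


regions = [[1.0, 0.0], [0.0, 1.0]]  # two image regions, 2-d features (toy values)
uniform_weights, uniform_context = attend(regions, [0.0, 0.0])
focused_weights, focused_context = attend(regions, [0.0, 50.0])
```

Since the weights are recomputed at every timestep, the image representation seen by the language model changes from word to word, which addresses both limitations listed above.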

## Conclusion

Finally, there are some things we need to keep in mind when dealing with multimodal data. We can « easily » represent images and sentences in a common multimodal space if we already have good unimodal embeddings. This yields pretty good results, and this multimodal embedding can be used for many tasks. But this kind of embedding will not allow us to sample a description given an image.

Moreover, using two global unimodal embeddings ($\Phi (image) = v \in \mathbb{R}^{m}$ and $\Psi (sentence) = w \in \mathbb{R}^{n}$) does not explicitly let us merge or align different fragments of the image with different fragments of the sentence. This more sophisticated kind of modeling can improve results in information retrieval.

At Heuritech, we use some of the ideas presented in this blog post (and others 😉 ) to build systems that learn to extract information from images and texts. In a following article, we will present a use case of image/text analysis applied to e-commerce product classification.