Our last Meetup was the largest so far: more than 300 people attended! Moreover, we had the pleasure of welcoming Yann LeCun, head of Facebook AI Research, who talked about the latest work produced by his team and shared more general insights about the future of deep learning, such as unsupervised learning. Photos and videos of the event will be released soon.
We would like to thank Telecom ParisTech for hosting the Meetup, and NVIDIA and Facebook AI Research for sponsoring the event.
Semi-supervised Learning for Multilingual sentence representation – Alexandre Rame & Hedi Ben-Younes (Heuritech)
Heuritech tackles the problem of semantic analysis of web documents. Web documents are an almost infinite resource and are composed of various things: images, hyperlinks, text… Alexandre and Hedi's talk focuses on understanding text; in other words, on building a semantic space for textual data. The advantage of an embedding is twofold. First, two words that are semantically close (such as cat and dog) will be projected to nearby places in the embedding space. Second, complicated relations between concepts can be linearized.
Building a uni-lingual embedding
One very important feature of the aforementioned semantic space is that it needs to be multi-lingual. Indeed, it needs to represent texts in different languages such that two texts about the same topic but in different languages have high similarity in the semantic space. Ideally, we would like to build a semantic space for each language and then align those spaces to create a single multi-lingual space. Such a process is depicted in the figure below.
We can easily collect lots of monolingual data (by crawling webpages); however, aligned translation datasets are a much more limited resource.
To represent words, the monolingual embeddings can be created without supervision using the word2vec algorithm developed by T. Mikolov. The skip-gram method brings words close to their context (the words before and after). Once the embedding is learnt, it projects words into a low-dimensional vector space (typically between 50 and 500 dimensions) in which similar words are located in similar places.
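To make the notion of "context" concrete, here is a minimal sketch of how skip-gram training pairs can be extracted from a sentence. It only builds the (word, context word) pairs and does not train any embedding; the function name `skipgram_pairs`, the window size and the toy sentence are illustrative assumptions, not something shown in the talk.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours.

    `window` is the number of words considered on each side of the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence (hypothetical data), one word of context on each side.
sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

During training, the model is then asked to predict the context word from the center word, which is what pulls words with similar contexts together.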
To represent sentences, two cutting-edge algorithms can be used: the skip-thought algorithm and the FastSent method. The skip-thought architecture reuses the idea of skip-gram and applies it to whole sentences. This method works well but is highly resource-intensive (around 2 weeks of training). The FastSent method is inspired by the skip-thought architecture but is lighter to run. In practice, it gives very good results in a much more reasonable time.
Building a multi-lingual embedding: aligning uni-lingual embeddings
To create a single multi-lingual embedding, 12 embeddings, one for each of 12 different languages, need to be aligned. A first attempt is to use Canonical Correlation Analysis (CCA). This method is inspired by Principal Component Analysis (PCA) and learns a linear projection between two embedding spaces. Many variants of this algorithm have been proposed, such as Kernel CCA or Deep CCA (where the learnt transformation is not linear). Unfortunately, this method scales poorly with the amount of data, as the full covariance matrix needs to be computed.
Correlational networks are models inspired by the autoencoder architecture and the CCA algorithm. As with CCA, these networks learn to correlate the projections in a common subspace. Self-reconstruction terms (as in autoencoders) are used to regularize the network and thus help it generalize better. Alexandre and Hedi have successfully made Correlational Networks deep by adding layers, as one would do for deep autoencoders.
Bridge Correlational Networks make it possible to align several languages at the same time using only one « pivot » language (English). The Deep Bridge Correlational Network (!) is then the method chosen to align all the embeddings.
The results are impressive. Here, for example, 3 topics were chosen (science, sport and fashion) and 10 words for each topic were selected in English and then translated into each of the 12 languages. The figure below shows a visualization (t-SNE) of all of these words projected in the multi-lingual embedding space. Interestingly, the words cluster by topic rather than by language: within each topic cluster, the languages are mixed together.
In the multi-lingual embedding, the similarity between two words can be measured with the cosine of the angle between their corresponding vectors.
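As a small illustration of this similarity measure, the sketch below computes the cosine between word vectors. The 3-dimensional vectors are made up for the example (real embeddings have tens to hundreds of dimensions), and the names `cosine_similarity` and `embedding` are ours.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: an English word, its French translation,
# and an unrelated English word.
embedding = {
    "cat": [0.9, 0.1, 0.0],
    "chat": [0.8, 0.2, 0.1],
    "football": [0.0, 0.2, 0.9],
}

same_concept = cosine_similarity(embedding["cat"], embedding["chat"])
other_topic = cosine_similarity(embedding["cat"], embedding["football"])
print(same_concept, other_topic)
```

With a well-aligned multi-lingual space, a word and its translation should score close to 1, while words from unrelated topics should score much lower.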
We can, for example, find the closest words in each language to « machine learning ». Something remarkable is that the English-English similarity model has been improved by the use of the 11 other languages.
Future work will involve character-aware embeddings and embeddings of images in the multi-lingual space. The space subsequently created will be multimodal.
A short presentation of NVIDIA’s activities – Guillaume Barat (NVIDIA)
Although NVIDIA was initially renowned for its gaming GPU series, it has been orienting its research towards devices built for deep learning over the last few years. This includes smart cars, cloud visualization and deep learning tasks. This presentation was given while the GPU Technology Conference (GTC) was ongoing, and fresh announcements had just been made concerning the release of new NVIDIA products.
Among the many announcements, the Tesla M40 sees its memory capacity doubled: from 12GB to 24GB. The Tesla M4 is a new card with low power consumption that can typically be used for inference. CUDA 8 is released; it simplifies much of the work for back-end developers. cuDNN 5 is also announced: full support for 3D convolution, LSTM implementations, 16-bit precision and Torch.
One of the major announcements concerns the Pascal board: the Tesla P100 GPU. The resulting card is no bigger than a hand and reaches 10.6 TFLOPS (TF) in 32-bit precision (FP32). The board also supports half precision (FP16), where it reaches 21.2 TF.
The second major announcement is about the NVIDIA DGX-1: 8 P100 GPUs embedded in a server. DGX-1 reaches 170 TF and can train an AlexNet in 3 hours, while 150 hours were needed so far. DGX-1 was fully built by NVIDIA: software, hardware, design and support. It is an end-to-end solution for deep learning designed for data scientists and researchers. All the GPUs are connected together with the NVLink technology: 80 GB/sec transfer (5 times more than today).
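The figures quoted here can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 170 TF figure for the DGX-1 is simply the sum of the half-precision peaks of its 8 GPUs; that reading is our assumption, not something stated in the presentation.

```python
# Figures quoted in the presentation summary.
p100_fp32_tflops = 10.6
p100_fp16_tflops = 21.2
gpus_in_dgx1 = 8

# Assumption: the DGX-1 figure is the sum of the FP16 peaks of its GPUs.
dgx1_tflops = gpus_in_dgx1 * p100_fp16_tflops

# Half precision doubles the peak throughput of a single P100.
fp16_gain = p100_fp16_tflops / p100_fp32_tflops

# AlexNet training time: 150 hours before, 3 hours on the DGX-1.
alexnet_speedup = 150 / 3

print(round(dgx1_tflops, 1), round(fp16_gain, 2), alexnet_speedup)
```

Under that assumption, the sum rounds to the announced 170 TF, and the AlexNet timings correspond to a 50-fold speed-up.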
Improving your Search Engine – Gregoire Mesnil (Phoenixia inc)
Phoenixia inc. is a company which aims at translating state-of-the-art machine learning into real-world solutions for companies. It employs a team of deep learning experts and provides in-house deep learning solutions.
A search engine is a system that ranks items (such as pages or articles) when given a query. The items are generally stored in a database which contains millions of products and attributes.
Classic search engine algorithms rely on Term Frequency-Inverse Document Frequency (TF-IDF). It is a statistical method that evaluates the importance of a word with respect to a document in a collection of documents. The results of this method are reasonable but can be improved with machine learning techniques that take full advantage of the amount of available data, either in the database or given by the clicks of the users (known as click-through data).
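For readers who have not met TF-IDF before, here is a minimal sketch of one common variant: term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency. Many other weighting schemes exist, and nothing here is specific to the system presented in the talk; the function name `tf_idf` and the tiny catalogue are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical product names from an online catalogue.
docs = [
    "red running shoes".split(),
    "blue running shoes".split(),
    "red leather wallet".split(),
]
weights = tf_idf(docs)
print(weights[1])
```

In the second product name, "blue" appears in no other document and therefore gets a higher weight than "running" or "shoes", which are shared with another product.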
Use as many signals as possible
Generally, basic search engines use only one signal (or feature) to rank the output, for example the names of the items in an online catalogue. This greatly limits the performance of the search engine, because two items that look the same but have different names will not be associated. Moreover, many signals are usually at hand: images, textual descriptions of the products, hyperlinks, price…
Embed queries and products in the same space
One way to use deep learning is to learn a joint representation of the query and the products. A mapping is learnt to project both queries and products in a joint space. Then a similarity measure can be computed between queries and products: the similarity between a query and the sought-after items will be high while the similarity between a query and random items will be very low.
We can embed text, such as the textual description of an item or the text of a webpage, in a vector space. For example, one can use letter 3-grams to encode text. The upper layer of the deep network will then produce a representation of the input. More precisely, one can use a Convolutional Neural Network (ConvNet) that goes through the character 3-grams with a sliding window. In order to learn longer-term dependencies, a recurrent network (LSTM) can be used to read the text sequentially. This yields better results.
Images can also be embedded. Typically, one would use the features of the last layer of a pre-trained ConvNet such as AlexNet or VGG-16. The advantage of this approach is that it can associate products that look similar, even though their names are different.
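The letter 3-gram encoding mentioned above can be sketched in a few lines. The `#` boundary marker and the lower-casing are assumptions on our side (a common convention, not necessarily the one used by the speaker), and the network that consumes these 3-grams is left out.

```python
def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams.

    The text is lower-cased and wrapped in '#' markers so that the first
    and last letters also appear in n-grams (an illustrative convention).
    """
    padded = "#" + text.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


trigrams = char_ngrams("cat")
print(trigrams)

# Two spellings of the same product share most of their 3-grams.
shared = set(char_ngrams("shoes")) & set(char_ngrams("shoe"))
print(sorted(shared))
```

One appeal of this encoding is that misspellings and inflected forms still share most of their 3-grams with the original word, which a word-level vocabulary cannot capture.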
Make the most of user feedback data
The click-through data, in the context of a search engine, are records of pairs (query, link_clicked). If records are stored for every user, those data can be used to learn a joint representation of queries and links, and so improve the quality of the search engine afterwards. In practice, one samples one existing pair (query, link_clicked) and one random pair. The model is trained so that the similarity of the existing pair is higher than the similarity of the random pair. Different training objectives (hinge loss, logistic regression, log-likelihood maximization) can be used.
The take-home message is that we can improve a search engine simply by learning powerful representations that take full advantage of all available channels (images, description, summary, price, color…) and of user feedback.
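As an illustration of the hinge-loss option, here is a minimal sketch of a pairwise ranking loss. It assumes that queries and links have already been embedded as vectors, that similarity is a dot product, and that the margin is 1; these choices, as well as the function names, are ours and not necessarily those of the speaker.

```python
def dot(u, v):
    """Dot-product similarity between two embedded items."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_hinge_loss(query, clicked, random_item, margin=1.0):
    """Zero when the clicked link beats the random one by at least `margin`."""
    return max(0.0, margin - dot(query, clicked) + dot(query, random_item))


# Hypothetical 2-d embeddings.
query = [1.0, 0.0]
clicked_link = [2.0, 0.0]   # well aligned with the query
random_link = [0.0, 1.0]    # unrelated to the query

good_ranking = pairwise_hinge_loss(query, clicked_link, random_link)
bad_ranking = pairwise_hinge_loss(query, random_link, clicked_link)
print(good_ranking, bad_ranking)
```

The loss vanishes once the clicked link is ranked above the random one with a sufficient margin, so gradient updates concentrate on the pairs that are still ranked incorrectly.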
Reasoning, Memory, Unsupervised learning – Yann LeCun (Facebook AI Research)
Convolutional neural networks are getting bigger and deeper. The most recent architectures have between 1 and 10 billion parameters and can have up to 152 layers! For example, in ResNet, a Convolutional Neural Network (ConvNet) developed by Microsoft Research, there are skip-connections between layers which allow information to flow through the net. It turns out that many layers are then useless because they have not learned anything. Yann underlines the need for research on network compression.
Quick overview of several of the latest works released at FAIR or NYU
DeepMask is a system trained to detect objects in an image along with their precise contours. There are two paths in the DeepMask architecture: one for detecting the label of the object, and one for detecting the contour. Some refinements are used to train the network, such as iterative multi-scale techniques or Elastic Averaging Stochastic Gradient Descent (EASGD). This last method was developed at NYU. A network is distributed over several servers. Each server keeps its own copy of the model parameters; however, the copies are tied together by an L2 penalty, enforced through communication between the servers.
The results are impressive and produce neat contours, as shown in the pictures below. However, the system is not perfect, as it lacks global awareness and common sense. Moreover, labelled data are needed to train the model. This is a problem because labelling is done by hand, which makes such data a limited and expensive resource.
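To give an intuition of elastic averaging, here is a toy, single-process sketch on a one-dimensional problem. Each "worker" holds its own copy of a scalar parameter and is pulled both by its local gradient and by an elastic force towards a shared center variable, which in turn moves towards the workers. It is a simplified synchronous version with an exact gradient instead of stochastic mini-batches; the function name `easgd` and the values of `lr` and `rho` are illustrative assumptions.

```python
def easgd(workers, center, grad, lr=0.1, rho=0.5, steps=500):
    """Toy synchronous elastic averaging on scalar parameters.

    workers: list of per-worker parameter copies
    center:  shared center variable
    grad:    gradient of the loss with respect to the parameter
    rho:     strength of the quadratic (L2) coupling to the center
    """
    for _ in range(steps):
        diffs = [x - center for x in workers]
        # Each worker follows its gradient plus the elastic pull to the center.
        workers = [x - lr * (grad(x) + rho * d) for x, d in zip(workers, diffs)]
        # The center moves towards the workers.
        center = center + lr * rho * sum(diffs)
    return workers, center


# Toy loss (x - target)^2 / 2, whose gradient is x - target.
target = 3.0
workers, center = easgd([0.0, 10.0, -5.0], 0.0, grad=lambda x: x - target)
print(workers, center)
```

On this toy problem all copies and the center end up at the minimum. In a real distributed setting, each worker would compute gradients on its own mini-batches and only exchange the elastic term with the server holding the center.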
ConvNets are everywhere!
For example, in smart cars (such as self-driving cars), ConvNets are embedded for scene parsing, where every pixel is labelled as shown in the picture below. The model enables automatic segmentation: it detects objects like roads, cars and pedestrians.
ConvNets are revolutionizing computer vision, but Yann underlines the problems that still need to be solved and require more research effort. Around 1 billion pictures are uploaded every day to the Facebook website. Each of those images is then passed through two ConvNets. The first one is used for tagging (image recognition), and its results are then used to suggest suitable content to each user. The second one is used for face recognition, in order to suggest tagging friends in images. Note that the ConvNet for face recognition is not used in Europe.
Memory augmented networks
Recurrent Neural Networks (RNN) are widely used to recall things from the past in sequential data. However, Yann states that if sentences are too long, the network will not recall the beginning. This problem was studied in 1994 by Yoshua Bengio, who argued that if we want memory in RNNs, then the state needs to be stable through time and the final state should not depend on initial perturbations. If this condition is satisfied, however, the gradients go to zero and no learning can happen. Nevertheless, Yann explains that it is possible to have memory stored in an orbit or reverberating state. Long Short-Term Memory (LSTM) networks are widely used today to tackle the issue of learning long-term dependencies. The memory is stored in separate units (called cells) which are protected from perturbations by a gating system.
Memory Neural Networks (MemNN) take the idea of LSTM further: a fixed memory is kept in a place separate from the computation part of the network. Similar ideas have arisen in several places at the same time: the Neural Turing Machines at DeepMind or the Stack-augmented Recurrent Net at FAIR.
MemNN are fully differentiable architectures and are trained by backpropagating gradients through the structure. They are very good at question-answering tasks like the ones from the bAbI dataset developed by Facebook.
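The link between a stable state and vanishing gradients can be illustrated with a deliberately simplistic model: a scalar linear recurrence h_t = w * h_(t-1). This is only a toy stand-in for the general argument, and the function name below is ours. The derivative of the final state with respect to the initial one is w multiplied by itself once per time step.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_(t-1)."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


stable = gradient_through_time(0.9, 100)     # |w| < 1: perturbations die out
unstable = gradient_through_time(1.1, 100)   # |w| > 1: perturbations blow up
print(stable, unstable)
```

When |w| < 1 the state forgets initial perturbations, but the gradient with respect to early inputs shrinks exponentially as well, so distant dependencies cannot be learnt. When |w| > 1 the gradient explodes instead.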
According to Yann, the major obstacle in current AI research is unsupervised learning. His dream is to find a single global rule, a learning method that encompasses all learning algorithms. Yann told us about his very personal definitions of concepts in AI:
- Reinforcement learning. At each trial, the model receives a single piece of information: a scalar value (a few bits per sample)
- Supervised learning. We show the answers to the model, which means that we also show which are *not* the answers, thus providing much more information: between 10 and 10,000 bits per sample
- Unsupervised learning. No explicit output is given, but there is a huge amount of information in each sample. For example, we can learn that the world is 3-dimensional simply by observing parallax when walking around.
Quantifying the number of bits provided in each case gives us insight into the number of examples needed to learn: billions of examples for reinforcement learning, millions of examples for supervised learning, but only a few examples for unsupervised learning. Despite that, unsupervised learning remains the biggest challenge of our time. Yann goes on to say that humans, and animals more generally, learn through reinforcement learning for the most part.
Here, we want to train a model that tries to predict the next frames of a video given the first ones. With a classical approach, like mean squared error minimization over a whole batch of videos, the algorithm will learn to predict the average of all possible situations that can occur. The result is then very disappointing because the predicted frames will be very blurry.
Adversarial networks consist of two agents: a generator and a discriminator. The generator network learns to produce plausible, real-world-like examples (like next frames in our example). The discriminator is a network that tries to detect which examples were invented (by the generator) and which ones are real examples. By jointly training both networks, the generated examples become impressive. For example, in the figure below, photos of bedrooms were generated quite realistically. Other examples involved the learning of simple physics.
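The blurring effect can be reproduced on a toy example. Suppose a 3-pixel "frame" in which a bright object ends up either on the left or on the right with equal probability (made-up data; the helper names are ours). The single prediction that minimizes the expected mean squared error is the pixel-wise average of the two futures, which matches neither of them.

```python
def mse(pred, target):
    """Mean squared error between two frames given as lists of pixels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred, futures):
    """Average MSE of one prediction over equally likely futures."""
    return sum(mse(pred, f) for f in futures) / len(futures)


# Two equally likely next frames (hypothetical): object on the left or right.
futures = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Pixel-wise average of the possible futures.
mean_frame = [sum(pixels) / len(futures) for pixels in zip(*futures)]

print(mean_frame)
print(expected_mse(mean_frame, futures), expected_mse(futures[0], futures))
```

The averaged frame has a lower expected error than either sharp frame, which is why a model trained with this loss prefers a half-bright object in both places over committing to one plausible outcome.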
During the questions, Yann discussed a really interesting question: whether reasoning could be addressed with fully differentiable techniques instead of discrete, symbolic approaches.
Thanks to everyone who attended this meetup, as well as to the speakers. Our next event will be a workshop on the Python deep learning frameworks TensorFlow and Keras.