Last week, Heuritech’s team was once again attending Yann LeCun’s lecture at the Collège de France. It was followed by a talk by Yann Ollivier on the optimization of neural networks.
You can find a report of course n°2 at: blog.heuritech.com/2016/02/24/yann-lecun-in-the-college-de-france-n2/
Architectures for Neural Networks: Yann LeCun
Last week, Yann LeCun presented convolutional networks. This week, he presented some more sophisticated architectures for solving different kinds of problems, such as localization or face recognition.
State-of-the-art networks for image classification
He presented some of the best CNNs for image classification.
How can we understand why networks need to be so deep to get good results? Yann LeCun explained that a deep network can be seen as an iterative algorithm, each layer being one iteration of the algorithm. In residual networks, a shortcut connection is added around each pair of 3×3 filters: it makes it possible to drop some layers if necessary, and gives flexibility in the number of layers actually needed.
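As an illustration, here is a minimal numpy sketch of a residual block, with dense matrices standing in for the 3×3 convolutions (a simplified sketch, not a real CNN layer); when the weights are near zero, the block reduces to the identity, which is what lets the network effectively drop unneeded layers:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """One residual block: two weight layers plus an identity shortcut.

    The output is relu(x + F(x)); if W1 and W2 are near zero the block
    reduces to the identity, so the layer can be 'dropped' by learning.
    """
    out = relu(W1 @ x)    # first layer (stand-in for a 3x3 conv)
    out = W2 @ out        # second layer, no activation yet
    return relu(x + out)  # shortcut connection: add the input back

# Tiny check: with zero weights, the block is the identity on positive inputs.
x = np.array([1.0, 2.0, 3.0])
W = np.zeros((3, 3))
print(residual_block(x, W, W))  # → [1. 2. 3.]
```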
Improving classification by localization
He presented two ways to improve image classification by also localizing objects.
The first one uses the CNN in a multi-scale setting, running it on crops of the picture at different positions and sizes. Matthieu Cord (Paris 6) talked about this method at the last Deep Learning Meetup; you can find more information here: Report of Deep Learning Meetup #5
He also presented the R-CNN algorithm, which solves this task by taking an image together with some « Regions of Interest » (RoI) in the picture, and computing the CNN on these regions. A fully connected network then takes the representations of these RoIs and produces a « RoI feature vector » representing the picture, which is used to classify the image. This network is trained end-to-end. [Fast R-CNN, ICCV, 2015]
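The ingredient that lets a fully connected layer consume regions of arbitrary shape is RoI max pooling, which maps each region of the feature map to a fixed-size grid. Here is a minimal single-channel sketch (an illustrative simplification, not the paper’s implementation):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of interest into a fixed out_size x out_size grid.

    feature_map: 2D array (one channel of the CNN output).
    roi: (row0, col0, row1, col1), half-open bounds on the feature map.
    Whatever the RoI's shape, the output has a fixed size, so a fully
    connected layer can consume it.
    """
    r0, c0, r1, c1 = roi
    region = feature_map[r0:r1, c0:c1]
    h, w = region.shape
    # Split the region into an out_size x out_size grid of cells.
    rows = np.array_split(np.arange(h), out_size)
    cols = np.array_split(np.arange(w), out_size)
    pooled = np.empty((out_size, out_size))
    for i, rs in enumerate(rows):
        for j, cs in enumerate(cols):
            pooled[i, j] = region[np.ix_(rs, cs)].max()
    return pooled

fmap = np.arange(36.0).reshape(6, 6)     # toy 6x6 feature map
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # 4x4 RoI → fixed 2x2 output
```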
For Facebook, recognizing faces is a major issue. Their program DeepFace already runs on every photo (but is not activated in Europe for the moment). When someone uploads a photo, in less than 2 seconds the photo is analyzed and the faces are detected and recognized.
How does it work? There are four stages in this procedure: first detect a face, then align the picture with an average face model, then compute a representation of the face, and finally classify it.
The alignment step takes a picture of the face and, by identifying some important landmarks on it, aligns it with an average 3D model of a face.
Then, the DeepFace architecture computes a representation of the face that is used for classification, with a contrastive loss function to learn to distinguish two different persons.
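As an illustration, here is a minimal sketch of a standard contrastive loss on a pair of face embeddings, assuming Euclidean distance; DeepFace’s exact formulation may differ:

```python
import numpy as np

def contrastive_loss(f1, f2, same_person, margin=1.0):
    """Contrastive loss on a pair of face embeddings.

    Pulls embeddings of the same person together, and pushes embeddings
    of different persons at least `margin` apart.
    """
    d = np.linalg.norm(f1 - f2)  # Euclidean distance between embeddings
    if same_person:
        return 0.5 * d ** 2      # penalize any distance for a matching pair
    return 0.5 * max(0.0, margin - d) ** 2  # penalize only pairs inside the margin

a = np.array([0.1, 0.9])
b = np.array([0.1, 0.9])
c = np.array([0.9, 0.1])
print(contrastive_loss(a, b, same_person=True))   # identical pair → 0.0
print(contrastive_loss(a, c, same_person=False))  # pair beyond the margin → 0.0
```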
This method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, closely approaching human-level performance. [DeepFace]
Segmenting and localizing objects
Yann LeCun presented the work of the FAIR team on the COCO image segmentation challenge. Unfortunately, their work is not published yet, even as a pre-print, so we won’t be able to tell more about it or show any pictures. The COCO dataset (Common Objects in COntext) is composed of segmented images, and the task is to compute masks for each object in them. These algorithms have applications in self-driving cars or brain tumor detection, for example.
Yann LeCun finished his lecture by talking about Recurrent Neural Networks (RNNs). We won’t explain here what these networks are, since it is a very large topic, but we can recommend this great blog post as a very good start: The Unreasonable Effectiveness of Recurrent Neural Networks
What was very interesting in Yann LeCun’s talk was the history of these networks. He explained that there had been a lot of research on them, but researchers all claimed they could never work, because they cannot hold enough memory to carry out sophisticated computations. He argued that this objection was not relevant, because there are (theoretical) computation models, like quantum computing, which allow computation without memory.
Therefore, the important point of recurrent layers (LSTM, GRU, …) is to simulate a memory, and to be able to propagate information along the network, in order to solve hard tasks like translation or image captioning.
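To make this “simulated memory” concrete, here is a minimal numpy sketch of one LSTM step; the cell state c is the memory that is carried along the sequence (an illustrative sketch, not a production implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step: the cell state c is the 'simulated memory'.

    W has shape (4*hidden, input+hidden): stacked weights for the
    forget, input, candidate and output transforms.
    """
    z = W @ np.concatenate([x, h]) + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates in (0, 1)
    c_new = f * c + i * np.tanh(g)  # forget part of the memory, write new info
    h_new = o * np.tanh(c_new)      # expose part of the memory as output
    return h_new, c_new

hidden, inp = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * hidden, inp + hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for t in range(5):  # information is carried along the sequence in c
    h, c = lstm_step(rng.normal(size=inp), h, c, W, b)
print(h.shape, c.shape)  # → (3,) (3,)
```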
The video of this lecture will be provided at: http://www.college-de-france.fr/site/yann-lecun/course-2016-02-26-11h00.htm (in French). For the moment, only the audio is available.
Invariance principles for neural network training: Yann Ollivier
Yann Ollivier presented the work he has done with Gaétan Marceau-Caron on Riemannian neural networks: http://arxiv.org/pdf/1602.08007v1.pdf
The question this work addresses is the following: why are neural networks so sensitive to parametrization? We will give two examples:
- Imagine you want to classify the MNIST handwritten digits and try a non-convolutional architecture. You learn the weights on the training data, and the classification works fine. Now you want to classify the same digits, but with inverted colors (white on black instead of black on white). You take the same network with re-initialized weights and train it again on the modified data. Then, surprise: it doesn’t converge at all (see figure below). This is a very surprising result, because we could even compute the « best » network for white-on-black digits from the « best » network for black-on-white digits: the inversion is only an affine transformation. It shows that stochastic gradient descent is sometimes unable to learn from the data if they are affine-transformed.
- It is well known that sometimes a sigmoid will work better than a tanh. Sometimes the activation function 1.7159·tanh(2x/3) is used. But all these functions are affine transformations of each other.
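Both examples really are affine re-parametrizations, which we can verify numerically (a small illustrative check, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
b = rng.normal(size=4)
x = rng.random(6)  # a 'black-on-white' input with pixels in [0, 1]

# 1) Color inversion x -> 1 - x is undone by an affine change of the
#    first layer's parameters: W' = -W, b' = b + W @ 1.
x_inv = 1.0 - x
W2, b2 = -W, b + W @ np.ones(6)
assert np.allclose(W @ x + b, W2 @ x_inv + b2)

# 2) sigmoid and tanh are affine transformations of each other:
#    sigmoid(x) = (tanh(x/2) + 1) / 2.
t = np.linspace(-5, 5, 11)
sig = 1.0 / (1.0 + np.exp(-t))
assert np.allclose(sig, (np.tanh(t / 2) + 1) / 2)
print("both affine equivalences check out")
```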
This is a huge problem when we want to use neural networks in practice, because it means that we have to try a lot of different parametrizations in order to make the network converge.
Yann Ollivier introduced an optimization algorithm that is invariant under all these transformations, based on Riemannian metrics. We won’t explain the mechanism here, but we warmly invite you to read his paper, which gives pseudo-code that is very easy to implement.
How does it impact convergence? The figure below shows the convergence results of different optimization methods on the same task but with different architectures. It shows that even if Adagrad can be very good sometimes, it is highly dependent on the choice of architecture, whereas the new algorithms always work, and converge faster.
This is a very interesting piece of work, because it can make architecture engineering easier, since many of these choices stop mattering.
The video of this lecture will be provided at: http://www.college-de-france.fr/site/yann-lecun/seminar-2016-02-26-12h00.htm (in French). For the moment, only the audio is available.