We were happy to host the latest Heuritech Deep Learning Meetup on the 4th of October 2016 in Paris. More than 100 people attended this time! We had many deeply (ahem…) interesting presentations, some of the academic kind and others more business-oriented!
1. Efficient networks to classify medium and small datasets
UPMC, Rémi Cadène, Supervised by Nicolas Thome and Matthieu Cord
We have seen a lot of progress in computer vision recently, and most of the credit goes to the intensive use of CNNs (convolutional neural networks). The CNN state of the art is a huge leap from previous methods (hand-crafted filters, SIFT…). Even though CNNs had been around for a while, it took several converging factors for them to show their immense potential: sustained research effort, the widespread use of ever more convenient and affordable GPUs, accessible dedicated code libraries built by the growing deep learning community, and, above all, the availability and use of large datasets.
However, for specific tasks, most teams don’t have access to huge datasets like the 1.2 million pictures of ImageNet; instead they have to deliver with medium (less than 130 000 images) and small (a few tens of thousands) datasets. To do so, they usually choose one of the three following methods:
- Train a CNN, custom or not, from scratch on the small custom dataset.
- Extract the image representation from a CNN trained on ImageNet and train a linear model (SVM…) or a smaller neural network on it.
- Fine-tune a CNN pretrained on ImageNet on the custom dataset, with a smaller learning rate.
The UPMC team tried these three methods on the UPMC Food-101 dataset: 80 000 images of food in 101 classes. This dataset is considered to have a “small semantic gap” with ImageNet, as ImageNet already has 6 classes of food.
The results were interesting:
The CNNs performed better than hand-crafted features, training from scratch beat feature extraction, and fine-tuning beat them both!
The Data Science Game (DSG) online challenge has a dataset with a bigger semantic gap: the aim is to classify roof images into 4 categories (north-south, east-west, flat and other), with 8 000 train images and 13 999 test images.
The UPMC solution came first in the challenge, with an ensemble of methods (fine-tuning + ensembling):
As usual, fine-tuning is better. Note also that ensembling is essential. The custom CNN showed encouraging results at first, but it was held back by the difficulty of training from scratch and the large number of hyperparameters to tune (which one should expect when building a completely new model).
Fine-tuning existing convnets, also called transfer learning, helps overcome some of the difficulties of training on a small dataset. Translation invariance, however, is another challenge, and transfer learning does not help that much there.
Normally, CNNs achieve translation invariance by themselves: the convolutional layers have it intrinsically, and a big dataset usually allows the dense layers to learn it, often with the help of data augmentation (see the figure above to understand why translation invariance is not intrinsic to the whole network, because of the dense layers). But even data augmentation may not be enough for a small dataset.
So UPMC introduced a multiple-instance learning framework, WELDON, which produces class heat maps and aggregates them with a special pooling layer (see the paper for details).
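A minimal sketch of the aggregation idea, assuming the max-plus-min pooling described in the WELDON paper (this is a simplification; the actual layer is more involved):

```python
import torch

def weldon_aggregate(score_maps, k=1):
    """Simplified WELDON-style aggregation: for each class score map,
    keep the k highest-scoring regions (positive evidence) and the k
    lowest-scoring ones (negative evidence) and sum them.

    score_maps: tensor of shape (batch, classes, H, W), as produced by
    a fully convolutional trunk.
    """
    b, c, h, w = score_maps.shape
    flat = score_maps.view(b, c, h * w)
    sorted_scores, _ = flat.sort(dim=2, descending=True)
    top = sorted_scores[:, :, :k].sum(dim=2)      # strongest positive regions
    bottom = sorted_scores[:, :, -k:].sum(dim=2)  # strongest negative evidence
    return top + bottom  # (batch, classes) image-level scores

scores = weldon_aggregate(torch.randn(2, 101, 7, 7), k=3)
```

The point of keeping the lowest-scoring regions too is that the absence of a class in part of the image is itself informative evidence.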
They tried it on the MIT67 indoor dataset, a small dataset with strong spatial variance: fine-tuning did not work there, and only WELDON gave better accuracy.
Conclusion: fine-tuning and ensembling are the best way to get good accuracy. When there is strong spatial variance, use a multiple-instance learning framework like WELDON.
Future work: the team wants to build more theoretical knowledge, to understand convergence with a non-convex loss and how a 140-million-parameter model can avoid overfitting.
Avila, Sandra et al. (2013). “Pooling in image representation: the visual codeword point of view”. In: Computer Vision and Image Understanding 117.5, pp. 453–465.
Durand, Thibaut, Nicolas Thome and Matthieu Cord (2016). “WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Simonyan, Karen and Andrew Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”.
Wang, Xin et al. (2015). “Recipe recognition with large multimodal food dataset”. In: Multimedia & Expo Workshops (ICMEW), 2015 IEEE International Conference on. IEEE, pp. 1–6.
GitHub (contributions are welcome): Cadene/torchnet-vision
2. Wide Residual Networks
Sergey Zagoruyko, Nikos Komodakis
The success of modern CNNs is commonly associated with increasingly deep architectures: the 8-layer AlexNet, the 19-layer VGG, then the 152-layer ResNet! If we plot depth versus performance, we can see that easily:
So Deeper is always better?
Not so fast…
There are some difficulties in going deeper, such as the famous vanishing/exploding gradient problem, but also the inability to converge properly even when the vanishing gradient problem is avoided. ResNet introduced the “residual block”: instead of learning a target mapping H(x) directly, the stacked layers learn the residual F(x) = H(x) − x, which is easier, and a shortcut connection adds x back so the block outputs F(x) + x.
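A residual block can be sketched like this in PyTorch (a minimal version; real ResNet blocks also handle stride and channel changes in the shortcut):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the stacked layers learn the residual
    F(x), and the shortcut adds x back, so the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + x)  # shortcut: add the input back

out = ResidualBlock(16)(torch.randn(1, 16, 8, 8))
```

The shortcut is what lets gradients flow directly to earlier layers, which is why these blocks stack so deep.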
So we overcome vanishing gradients and converge better with deep architectures, but did we introduce new unknown and obscure problems?
First, let’s ask ourselves a question: are ResNets really that deep? If we unfold them and follow the shortest paths through the residual blocks, they appear shallower than you might imagine. They look like a wider, shorter net, an ensemble of many blocks:
And in fact, an experiment with “stochastic depth”, which basically means randomly skipping some layers of the net during training, actually led to an improvement in performance!
Deeper is not better? Are you confused now?
In deep learning, theoretical understanding can be a little lacking and we rely a lot on intuition and experimentation, so we should not set any belief in stone, including the idea that deeper is always better! So, are all residual blocks useful? Can we do better with fewer layers? How important is depth? How about width?
Just by increasing width, they could improve performance, so what matters most is the raw number of parameters. And growing deeper costs more computationally than growing wider, because depth cannot take advantage of parallel computing.
They also note that dropout inside the residual blocks improves performance thanks to the additional regularization, and that training on raw data instead of whitened data improves the results significantly.
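A wide residual block can be sketched as follows, assuming the pre-activation order and the dropout placement from the Wide Residual Networks paper (the widening factor k multiplies the channel count):

```python
import torch
import torch.nn as nn

class WideResidualBlock(nn.Module):
    """Sketch of a wide residual block: channels are multiplied by a
    widening factor k, and dropout sits between the two convolutions
    (pre-activation order: BN -> ReLU -> conv)."""
    def __init__(self, base_channels, k=4, dropout=0.3):
        super().__init__()
        width = base_channels * k
        self.bn1 = nn.BatchNorm2d(width)
        self.conv1 = nn.Conv2d(width, width, 3, padding=1, bias=False)
        self.drop = nn.Dropout(dropout)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.drop(out)  # regularization inside the block
        out = self.conv2(self.relu(self.bn2(out)))
        return out + x

out = WideResidualBlock(base_channels=16, k=4)(torch.randn(1, 64, 8, 8))
```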
These initial findings were the results of experiments on relatively small datasets, so they could quickly run a lot of tests and get a lot of results. But would those conclusions stand on ImageNet? Would wide networks also beat state-of-the-art nets? Let’s cut the suspense: they found that wider achieves the same (WRN without bottleneck) or better (WRN with bottleneck)* accuracy as deeper, at the cost of slightly more parameters but with less computation time.
But ImageNet is a huge dataset, mainly used for academic benchmarking! Most real-world applications take advantage of transfer learning. Would WRNs also win at transfer learning?
On MS-COCO, the performance was comparable to ResNet-101 in speed and parameters for classification, with better or similar results on object detection, and WRNs generally beat the state of the art with the help of more augmentation (1 2 3).
- Decrease depth, increase width, profit.
- WRNs are not shallow: still very deep!
- Width is as important as depth.
- Width is less sequential and more parallel: faster with parallel computing.
- Model will be available soon!
* Bottleneck is a technique that uses a 1×1 convolution to reduce the number of dimensions before performing the big computation, then another 1×1 convolution to return to the higher dimension. It is useful to reduce computation time. See for instance the ResNet paper for more details.
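The bottleneck idea, sketched in the same style as the blocks above (a simplified version of the ResNet-style bottleneck):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck block: a 1x1 convolution shrinks the channel count,
    the expensive 3x3 convolution runs in the reduced space, then a
    second 1x1 convolution restores the original dimension."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, 1, bias=False)      # shrink
        self.conv = nn.Conv2d(mid, mid, 3, padding=1, bias=False)  # cheap 3x3
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)      # restore
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv(out))
        return self.relu(self.expand(out) + x)  # residual shortcut

out = Bottleneck(64)(torch.randn(1, 64, 8, 8))
```

With reduction=4, the 3x3 convolution runs on a quarter of the channels, which is where the compute savings come from.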
3. Building a deep learning powered search engine
This talk, by Koby, a data scientist at Equancy, was more business- and deployment-oriented than academic.
Today we have a lot of images available, from catalogues, social networks and marketplaces. In the case of fashion, we have three typical use cases: visual search engines, fashion object detection and “data quality”.
What Koby presented is the result of his work building a visual search engine, whose core task is to take a picture, search a database and return similar instances.
One challenge is that daily fashion cannot be expected to be shaped and presented in a normalized, standard way: we see it in many shapes, forms and contexts. Thus, preprocessing is hard. Very hard.
The search engine works in two steps:
First, there is a “batch phase”, in which we build the database: we take the dataset and embed the images in a convenient mathematical space with an associated distance metric, holding, hopefully, some semantic relevance.
Then comes the “online phase”: a user takes a picture and uploads it, the picture is embedded, and the distances to the known instances in the database are computed, allowing us to return a ranking of the closest and, hopefully again, most relevant elements.
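The two phases can be sketched as follows. This is a toy version: `embed` is a stub standing in for the CNN descriptor, and the catalogue names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(image):
    """Stub for the CNN embedding discussed later (4096-d vectors)."""
    return rng.standard_normal(4096)

# Batch phase: embed the catalogue once and store the vectors.
catalogue = [f"item_{i}" for i in range(100)]
index = np.stack([embed(img) for img in catalogue])

# Online phase: embed the query, compute distances to every known
# instance, and return the closest items first.
def search(query_image, top_k=5):
    q = embed(query_image)
    distances = np.linalg.norm(index - q, axis=1)  # Euclidean distance
    ranked = np.argsort(distances)[:top_k]
    return [catalogue[i] for i in ranked]

results = search("user_photo.jpg")
```

At real scale, the brute-force distance computation would typically be replaced by an approximate nearest-neighbour index.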
Of course, the hard part is the embedding. It should account for shape, color, texture… and it used to be extremely domain-specific (less so today with CNNs, but more on that later). Back in the day, we used edge detectors and image moments for shape, color histograms for color, HOG/HOF/Fourier/wavelets etc. for texture. There were way too many parameters to tune, the use of separate methods for shape, color and texture created the problem of weighting the models, it was awfully slow (the image went through several transformations), and it did not generalize. Grim.
But, as you guessed, then came the CNNs. The idea was to use a CNN for the description step, and they chose AlexNet, “the Beatles of the CNNs” as Koby gracefully dubbed it.
So he took AlexNet pretrained on ImageNet, removed the softmax layer, and used the CNN to generate embedding vectors of size 4096 from images. The chosen part of the net contains quite high-level features (it had to discriminate among the 1000 ImageNet categories), so it can be used as a general-purpose descriptor.
He tried it on a fashion dataset used as a benchmark: 60 000 items of medium quality, on a grey background. This resulted in good classification results. A t-SNE compression of the embeddings into a 2D representation, with 10% of the items for easier visualisation, gives the following:
If we zoom in on some areas, we can see a decent clustering of T-shirts, shoes, shorts, jeans, khakis, chinos, trousers, bags, jackets and “funky tops” (multicolored, vivid tops). And all of that totally unsupervised!
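Such a visualisation can be produced along these lines with scikit-learn (random stand-in embeddings here; the real input would be the 4096-d CNN vectors of the sampled items):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the CNN embeddings of a 10% sample of the catalogue.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((200, 4096))

# t-SNE compresses the high-dimensional embeddings to 2D so that nearby
# points stay nearby, which is what makes the clusters visible.
coords = TSNE(n_components=2, perplexity=30, init="random",
              random_state=0).fit_transform(embeddings)
```

The `coords` array can then be scattered with any plotting library, coloring points by category to reveal the clusters.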
After this initial success, he had a decent “batch” part, and he resolved to come up with a first implementation of a visual search engine, with an online part working on “natural pics”, i.e. pictures taken in an everyday context. A lot of preprocessing (not specified) is needed to get good results in that part.
The results are relatively solid, considering the small size of the dataset and how the convnet was used almost “off the shelf”.
Contact : email@example.com
- They tried to improve the results by using another metric on the embedding space. The default metric was the Euclidean distance, but they wanted to experiment with a potentially more relevant one, so he tried to infer a metric by asking people to judge which images were closest. He found that it was not worth the effort and that the Euclidean distance was very relevant considering its cost (none).
- Also be warned that applying this to real pictures requires a lot of preprocessing, cropping for instance, and that is a really hard process.
- Using e-commerce feedback to fine-tune the model is one of the possible ways to improve it.
4. Image tagging using transfer learning: application to a search engine
Dataiku is an end-to-end data science platform founded in 2013, allowing users to prepare, analyse and model their data at design time, while easily automating, monitoring and scoring in production.
A typically challenging task in data analysis is travel recommendation. The products have a short lifespan, unlike, for instance, physical objects in e-commerce (Amazon, eBay…). Buyers rarely come back, there are a lot of “weak signals”, and users tend to look at only a few offers among the many available. In particular, the images on an e-travel website are a very important but hard-to-quantify factor. Dataiku tried to bring valuable insights by creating a model that gives a better understanding of the image and its relationship with the other characteristics of the travel offer.
The first step was to get annotations from the images so that a recommendation engine could use them. They decided to use a pretrained CNN to extract features from the images, and chose VGG-16 for that matter.
The second step was to extract general topics from multiple categories: for instance “beach + sun + sea” became “at the beach”, and they wanted to build user profiles with these topics. Example: Martin Eden = 0.4 × “at the beach” + 0.3 × “mountain” + 0.2 × “hotel”.
Next, they conducted an analysis to find the most attractive visual elements for a given destination or travel category.
Finally, they wanted to rank sales by comparing them to user profiles and add the visual model to the recommendation engine.
Let’s see how they did all of this:
First, they fine-tuned a VGG-16 on MIT Places205: 205 labels and 2.5 million images of places, quite relevant for travel. After training, they chose to take the top-5 predictions for each image as its annotation (using a threshold failed, and tweaking the probabilities did not make sense). The top-5 method was empirically satisfying while staying very simple.
But they needed more labels, so they also used another dataset, SUN397 for scene recognition, comprising 100 000 images and 397 labels. The labels were good, but there were not enough images for a good training, so they took the VGG-16 already fine-tuned on Places205, kept it up to the last convolutional layer, ran SUN397 through it, and trained another model on those features, with good results (top-5 accuracy of 92% and accuracy of 72%).
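The top-5 annotation rule is simple enough to sketch directly (the labels and probabilities here are made up for illustration):

```python
import numpy as np

def top5_tags(probs, labels):
    """Keep the 5 highest-probability labels as the image's annotation,
    as the team did instead of thresholding the probabilities."""
    order = np.argsort(probs)[::-1][:5]  # indices by descending probability
    return [labels[i] for i in order]

labels = ["beach", "sun", "sea", "mountain", "hotel", "forest"]
probs = np.array([0.30, 0.25, 0.20, 0.10, 0.08, 0.07])
print(top5_tags(probs, labels))  # -> ['beach', 'sun', 'sea', 'mountain', 'hotel']
```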
By blending the two models, they could get up to 10 tags per image. Sometimes the tags were complementary and useful, but sometimes they carried redundant information:
There were also too many different tags. The label space badly needed some processing, so they studied the co-occurrence graph of the labels and saw that they could easily apply dimensionality reduction with NMF: X = WH, with X of shape (number of images × number of tags), W of shape (number of images × number of themes) and H of shape (number of themes × number of tags), ending up with 30 “themes” from the initial 500 labels.
This theme space is sparse (beach and mountain hardly ever come up together), allowing for a neat cluster representation, and they used it to build scores for each image.
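The factorisation can be sketched with scikit-learn (toy shapes here; the real X had roughly 500 tag columns reduced to 30 themes):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical binary image-tag matrix X (images x tags): X[i, j] = 1
# if image i was annotated with tag j.
rng = np.random.default_rng(0)
X = (rng.random((100, 50)) < 0.1).astype(float)

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)   # images x themes: theme weights per image
H = model.components_        # themes x tags: tag make-up of each theme

# The dominant tags of each theme tell you what the theme "means".
top_tags_per_theme = np.argsort(H, axis=1)[:, ::-1][:, :3]
```

NMF is a natural fit here because both factors are non-negative, so each theme reads as an additive mixture of tags rather than a signed combination.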
And finally, they could integrate the image scores into the recommendation engine. The data used was all the reservations made over a certain period, in the format (user, reservation, date); for each one, they sampled 5 random reservations that were available at that moment but that the user did not take. The chosen representation combined reservation characteristics, the user’s history of reservations and visits, and the new image-theme scores. The model was a logistic regression, and it allowed the travel agency to increase its sales by 7% in value.
For example, a user had viewed pool pictures for travels in France. The non-image model recommended travels in France but with many images showing no swimming pools; the image-only model recommended travels with pools but in Italy, Morocco, Sri Lanka, etc.; and the full mixed model recommended travels in France with swimming pools.
As a bonus, they developed an attractivity formula: the share of visited travels over the share of offered travels, for each image theme. Interesting finding: among ski travel offers, the image themes “bedroom”, “hotel room” and other indoor themes scored better than mountains and other outdoor themes. It may seem counter-intuitive, but it can be explained by the fact that the clients had already decided on the area of their ski trip and discriminated between offers based on the quality of the hotel or bedroom.
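A toy version of this ranking setup, with synthetic features standing in for the real composition of reservation characteristics, user history and image-theme scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 600, 12

# Each row stands for one (user, offer) pair: reservation features,
# user-history features and image-theme scores concatenated.
X = rng.standard_normal((n, d))
# Label 1 for the reservation actually taken, 0 for each of the 5
# sampled alternatives the user did not take.
y = np.array([1 if i % 6 == 0 else 0 for i in range(n)])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# At recommendation time, candidate offers are ranked by the predicted
# probability of being taken.
offers = rng.standard_normal((5, d))
ranking = np.argsort(clf.predict_proba(offers)[:, 1])[::-1]
```

The 1-positive-to-5-negatives structure mirrors the sampling scheme described above, turning recommendation into a binary classification problem.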
Overall, this was a simple but very powerful and practical use of image recognition and deep networks to improve travel recommendation!
Conclusion and future events
The meetup was of great quality: the speakers were all very interesting and the topics were diverse, both academic and business-oriented, about practical uses of CNNs!
Thanks again to everyone who attended this meetup, as well as to the speakers.
Stay tuned for the next meetup (by the end of the year) and for more news!