There is an unreasonable amount of information that can be extracted from what people publicly say on the internet. At Heuritech we use this information to better understand what people want, which products they like and why. This post explains from a scientific point of view what is Knowledge extraction and details a few recent method on how to do it.
What is knowledge extraction ?
Highly structured databases make it easy to reason with and can be used for inference. For example in WikiData or YAGO, entities are isolated and linked together with relations. However, most of the human knowledge expressions take the form of unstructured texts, from which it is very hard to reason and get wisdom. Consider the example here:
The raw text on the left contains a lot of useful information in an unstructured way, such as birthday, nationality, activity. Extracting those information corresponds to a challenging field in Natural Language Processing, which may require sentence parsing (mapping natural language to machine interpretable representations), entity detection and multi-reference resolution to aggregate information about the same entity. Knowledge extraction is guided, for example, by the will of being able to perform Question Answering tasks: in a structured knowledge base, one can make a query and then get the requested information. Another application is to perform arbitrarily complex reasoning by finding paths in a graph of extracted knowledge. In knowledge extraction, one can be interested in hypernymy where entities are included within other entities and one can also be interested in relation extraction.
Knowledge graph completion: link prediction
Even though at Heuritech we are more interested in knowledge extraction from raw text, we first quickly review here techniques that rely on a KG only (no extra text corpus available). The task we want to perform is to fill an incomplete KG. Before 2013, links were filled with graph theory techniques that ignored the fact that our graph is a KG.
Translating Embeddings for Modelling Multi-relational Data by Bordes et al. in 2013 is a first attempt of a dedicated method for KG completion. It learns an embedding for the entities and the relations in the same low-dimensional vector space. The objective function is such that it constraints entity e2 to be close to e1 + r. This is done with assigning a higher score to existing triplets than to random triplets obtained with negative sampling. This above model is known as TransE and this work is related with the word2vec work by Mikolov where relations between concepts naturally take the form of translations in the embedding space as seen in the picture here.
Several improvements have then been added to make way to TransH and TransR models for examples. State-of-the art is held by Probabilistic Reasoning via Deep Learning: Neural Association Models.
Triplet extraction from raw text
We focus here on extracting triplets (e1, r, e2) from a raw text. There are several variants of this task, depending on the kind of supervision used.
Triplet extraction can be done in a purely unsupervised way. Usually the text is first parsed with several tools (such as TreeBank parser, MiniPar or OpenNLP parser) then the texts between entities (as well as the annotations from the parsers) are clustered and finally simplified. While attractive at a first look as no supervision is needed, there are a few drawbacks. First, it requires lots of tedious work to hand-craft rules which depend on the parser used. Moreover, the clusters found contain semantically related relations but they do not give us fine-grained implications. Typically, a cluster may contain « is-capital-of » and « is-city-of » which are semantically closed relations. However, with the unsupervised approach, we will fail to discover that « is-capital-of » implies the relation « is-city-of » and not the opposite.
We will focus more on other kinds of supervision: supervised learning, distant supervision and universal schemas. We first give some definitions. Fixed-schema relation extraction implies that relations to be found are in a fixed list of possible relations. On the contrary, in open-domain relation extraction, relations are not constrained. In that case, there is no fixed-schema which would constrain too much the knowledge extraction, if not perfectly appropriate. However, it is much harder to generalize and to infer new relations in a graph built with open-domain relations, as there are lots of relations with various style. OpenIE (Open Information Extraction) is a tool that filters and normalizes raw text between entities to obtain open-domain relations.
Schema-based supervised learning
In this case, the available data is a collection of sentences where each sentence is annotated with the triplet extracted from it. This means that raw text aligned with a KG of the text. Two recent papers (both published in 2016) give cutting-edge solutions to this problem.
The End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures article by Miwa and Bansal shows an approach which uses two stacked networks: a Bidirectional LSTM for entity detection (it creates an embedding of the entities) and a Tree-based LSTM for the detection of the relation that links the entities found. The figure below from the original paper shows the architecture used.
Their approach uses POS tagging on the raw text which gives additional information that is fed in the bidirectional LSTM along with the original words. The strength of this method relies in being end-to-end, as the model jointly learns to detect entities and relations. The architecture is quite heavy and the authors use many tricks for training (such as schedule sampling and entity pre-training). Those tricks significantly improve the performance of the trained model. This method outperforms state-of-the art results on the relation-extraction task on ACE04 and ACE05 datasets and on relation classification on SemEval-2010 Task 8 development set. More than 80% of entities and 60% of relations can be discovered.
Language to Logical Form with neural attention by Dong and Lapata describes a sequence-to-sequence model. It takes a raw sentence as input (no need of POS tagging) and outputs desired information. In our case, the desired information is the triplet (e1, r, e2) but the model of the original paper is not restricted to this particular case. The model is an encoder-decoder architecture. More precisely, LSTM units are distributed in two stacked layers and an attention mechanism is used (see our other blog post for more details on attention mechanism). The attention mechanism makes it possible to learn a soft alignment between natural language expressions and fixed-schema relations. Rare words and numbers are handled in a post-processing step: they are masked during training with a rare-word-token or a number-token as well as a unique ID. At inference, the token+ID is replaced by the true value of word or of the number. This trick happens to be very useful to avoid having too big dictionaries. The results of the experiments reach state-of-the art results, sometimes even outperforming them. Note that the experiments are not done specifically on triplet extraction so we cannot compare this method with the model of Miwa and Bansal.
Supervised learning for relation extraction works well with end-to-end methods (in the case of the second article reviewed here, they do not even require POS tagging). However, this kind of learning is limited by the amount of labelled data while raw texts available on the web are virtually unlimited.
Schema-based distant supervision
Distant supervision, also called weak supervision, is when we wish to extract relations from a text and that an initial KG is available, as a seed, for the triplet extraction task. We say that a pair of entities is weakly-supervised by every relation of the KG that links the two entities of the pair. Of course, this kind of supervision is very noisy as for example the sentence « Barack Obama is 3 years older than Michelle Obama » will be weakly-labeled by the (Barack Obama, maried-to, Michelle Obama) instance of the KG (and all the other instances linking Barack Obama to his wife) but those two relations are obviously different. If we have a lot of texts and a big enough KG (with the same entities as the ones in the texts) we can learn a mapping from raw text to fixed-schema relations of the KG.
Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction by Weston, Bordes et al. (2013) is a model with distant supervision. It assumes that entities are discovered and disambiguated and the text between entities is converted in an open-domain relation using the OpenIE tool. An embedding of entities and relations is created in a same low-dimensional space. The mapping of openIE texts to most likely relations of the fixed schema is computed with a similarity measure over the embedding of the openIE text and the relations of the fixed schema. The system is trained with a ranking loss (as explained here). Given an openIE relation, the idea is to assign a higher score to a pair with one of its weak-labels than to a pair with a random relation of the KB (negative sampling). The article goes even further: once the triplets are extracted from the text, the model learns an embedding of the entities and of the relations. In that embedding, we want the relation r to correspond to the translation from e1 to e2. This embedding is created not only with the found triples but also with all the available triples from the original KG.
In all of the examples presented above, the relations found are within a initially proposed fixed-schema. However, as mentioned earlier there does not exist a fixed-schema that perfectly suits all the possible relations a text can express between two entities.
Universal schemas build a KG by embedding both relations of a seed KG (fixed-schema relations) and open-domain relations contained in a corpus. One big advantage of universal-schema is that it does not need distant supervision. A semantic space is built for entities and relations by learning an embedding. The embedding for fixed-schema relations is the same for the open-domain relations: inference regarding the two kinds of relations is made possible and KG completion can then be improved.
The first article that introduces universal schema is Relation Extraction with Matrix Factorization and Universal Schemas by Riedel et al. in 2013. In this article, open-domain relations are computed with the OpenIE tool. Then a binary matrix is created where rows correspond to pairs of entities and columns to the concatenation of fixed-schema relations and open-domain relations ; a 1 in the matrix means if there is a relation between the entities. We wish to predict missing values in the matrix and produce a confidence value (between 0 and 1) as shown in the picture here from the original paper. The consideration of the matrix brings the problem of relation extraction close to another field: collaborative filtering. Methods of collaborative filtering can then be used to infer new relations.
Several parametrizations of the embedding methods can be considered: latent feature model, neighborhood model and the entity model or even a combination of them. For training, Bayesian Personalized Ranking (BPR) is used ; it is a ranking procedure that gives observed facts a better score than random facts (obtained by negative sampling).
A problem with the above approach is that each openIE text is embedded in a different vector, and so it is not possible to generalize to a new openIE text which was absent from the training set.
Universal schemas with deep learning
Representing Text for Joint Embedding of Text and Knowledge Bases by Toutanova et al. (2015) addresses the issue of generalizing to new open-domain relations by embedding the text between entities with a Convolutional Neural Network (ConvNet). Instead of using the openIE tool as in the article presented above, the ConvNet is used to parametrize the text between two entities (at word-level). Syntactic dependency parsing is used and added as extra input. On the picture here, the yellow vector is the embedding of the open-domain relation. Note that embedding similar open-domain relations also avoids the problem of cold-start in collaborative filtering.
Multilingual Relation Extraction using Compositional Universal Schema by Verga et al. (2016) uses the same kind of architecture. They tried both ConvNet and LSTM recurrent neural network and it turns out that the LSTM network outperforms the ConvNet. There are two other differences between their model and the one of Toutanova. The first one is that the encoder network of the open-domain relation is used at inference time when we want to generalize to text without retraining the model. Second, Verga et al. do not use the syntactic dependency parsing information on the raw text. Verga et al. go even further since their model works with multi-lingual data. Importantly, their method performs multilingual transfer learning, providing a predictive model for a language which has no entities in the the KG, by learning identical representation for shared entities across text corpora. The figure below gives an overview of the matrix to be filled and the parametrization model. Note that different encoders (with tied weights) are used for different languages. Interestingly, joint learning of the model for English and Spanish improves the scores for the English only model.
The article also underlines that open-domain relations obtained by filtering and normalizing raw text between entities also have an advantage for idiomatic expressions for example when the meaning of the text snippet is not a composition of the words it contains. In this case, we do not want to feed the idiomatic expression to a LSTM network but better learn a unique embedding for it. In practice, the article shows that an ensemble of both embedding parametrizations (LSTM on words and unique embedding) words very well as it takes full advantage of complementary methods.
We have reviewed here various techniques to infer new relations in a Knowledge Graph and to extract relations from documents. We focused on most recent techniques which rely on embedding of relations and entities, deep learning, collaborative filtering… Further work for us is to consider texts that do not always give absolute and time-invariant knowledge. For example in social media, when people express their opinion, facts can be very different from a person to another one leading to contradictions in a knowledge base. Moreover, at Heuritech we are interested in multi-modal data and so we would like to be able to extract relevant information from images and put it in the same KG as the information found in texts.
Eloi Zablocki & Team Heuritech