Natural Language Processing Homeworks
Design and implementation of neural architectures for the Named Entity Recognition (NER) and Semantic Role Labeling (SRL) tasks.
Named Entity Recognition
Named Entity Recognition (NER) is the task of classifying every word of a sentence into categories such as person, organization, location, and others. There are different ways to deal with this task, such as unsupervised and supervised learning approaches; the one chosen here is supervised learning, under which NER can be treated as a sequence labeling task. Early approaches of this kind relied on hand-engineered features or specialized knowledge resources like gazetteers, but leveraging deep learning techniques to discover features automatically has improved the models up to the state of the art.
Recurrent neural networks (RNNs) work very well for modeling sequential data. In particular, bidirectional RNNs build, for every step of the input, a context that combines past information retrieved from the forward states with future information retrieved from the backward states; thus, each token of a sentence carries information about the whole sentence. In practice, however, RNNs have been shown to be biased towards the most recent inputs; long short-term memory networks avoid this behavior, as they can capture long-range dependencies. Moreover, since in NER predicting the label of each word benefits from neighboring tag information, a conditional random field is used for decoding.
Neural architecture
The chosen neural architecture is composed of an embedding layer, followed by a bidirectional long short-term memory with a sequential conditional random field layer on top.
Embedding layer
The input of the model is a vector that represents each word of the given sentence. Each word is represented by an embedding layer that returns a fixed-dimensional real-valued vector storing the semantic and syntactic information of the word.
Starting from randomly initialized embeddings, I observed significant improvements with pre-trained ones. In particular, with pre-trained word embeddings the model converges very quickly, reaching relevant results in fewer epochs compared with the model using randomly initialized embeddings. I tried both GloVe (J. Pennington et al. 2014), trained on Wikipedia 2014 and Gigaword 5, and fastText (E. Grave et al. 2018), trained on Common Crawl and Wikipedia, and observed better performance with the latter.
Bidirectional LSTM
Long short-term memory networks (LSTMs) are a variant of RNNs that use memory cells to propagate information through the sequence without applying non-linearities (J. Eisenstein et al. 2019).
The bidirectional LSTM is used to build the context of each word in the sentence. To do this, two LSTMs with different parameters are used: the network takes as input a sentence represented as a sequence of word embeddings; a forward LSTM computes the left context of each token, a backward LSTM reads the same sequence in reverse to compute the right context, and the two contexts are concatenated to obtain the representation of each token.
Conditional Random Field
Conditional random field (CRF) is a conditional probabilistic model for sequence labeling, used when there are dependencies across output labels, as in the NER task, since the grammar imposes constraints on the interpretable sequences of labels, which are therefore modeled jointly. The model is based on a score function that combines the emission scores of each token with the transition scores between consecutive labels; to estimate the parameters, the probability of a label sequence given the sentence, which depends on the label of the previous word and on the current word, is maximized, while decoding searches for the best-scoring sequence of labels.
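For reference, a minimal sketch of the linear-chain CRF formulation commonly used in BiLSTM-CRF taggers; the emission matrix $P$ (produced by the dense layer described later) and the transition matrix $A$ are assumed notation, not necessarily the exact symbols of the original report.

```latex
% Score of a tag sequence y = (y_1, ..., y_n) for a sentence X:
% A_{y_i, y_{i+1}} is a transition score, P_{i, y_i} an emission score,
% y_0 and y_{n+1} are the extra start and end labels.
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}

% Probability of a tag sequence (softmax over all possible sequences Y_X):
p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}

% Training minimizes the negative log-likelihood of the gold sequence:
-\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)

% Decoding picks the best-scoring sequence (computed with Viterbi):
y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})
```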
I implemented the CRF layer taking inspiration from an existing vectorized implementation.
Experiment
This section presents the model setup for the training phase, the workflow of the latter, and the obtained outcomes.
Training process
With the purpose of making the sentences readable by the neural network, each sentence is converted into a tensor of numbers, each of which identifies a word in a vocabulary built from the training dataset. Each sentence is then padded to make the lengths homogeneous, which is useful for working with batches of sentences. These batches are the input of the embedding layer, which is initialized using fastText pre-trained embeddings that cover 90% of the vocabulary; the remaining embeddings are initialized randomly.
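As an illustration, a minimal sketch of this preprocessing step in PyTorch; the helper names, the special tokens, and the 300-dimensional default are assumptions, and `pretrained` stands for any word-to-vector mapping loaded from the fastText files.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD, UNK = "<pad>", "<unk>"

def build_vocab(sentences):
    # Map every word seen in the training dataset to an integer id.
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_batch(sentences, vocab):
    # Convert each sentence into a tensor of word ids and pad to a common length.
    tensors = [torch.tensor([vocab.get(w, vocab[UNK]) for w in s]) for s in sentences]
    return pad_sequence(tensors, batch_first=True, padding_value=vocab[PAD])

def init_embeddings(vocab, pretrained, dim=300):
    # Use the fastText vector when available, otherwise keep a random initialization.
    weights = torch.randn(len(vocab), dim) * 0.1
    for word, idx in vocab.items():
        if word in pretrained:
            weights[idx] = torch.tensor(pretrained[word])
    return torch.nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=vocab[PAD])
```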
The output of the embedding layer is passed through a dropout layer and then fed into the bidirectional LSTM.
Thereafter, the output of the LSTM is passed through another dropout layer and a dense layer, which outputs the emission scores given as input to the CRF layer together with the transition matrix, obtained by adding two new labels that represent the start and the end of the sentences; the CRF layer then returns the predictions.
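A minimal sketch of how these layers could be wired together; the hidden size and dropout value are placeholders, and the CRF layer (which holds the transition matrix with the extra start/end labels) is assumed to live in a separate module that consumes the emission scores.

```python
import torch.nn as nn

class NerTagger(nn.Module):
    """Embedding -> dropout -> BiLSTM -> dropout -> dense emission scores for the CRF."""

    def __init__(self, embeddings, num_labels, hidden_dim=256, dropout=0.5):
        super().__init__()
        self.embedding = embeddings                       # pre-initialized nn.Embedding
        self.dropout_in = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embeddings.embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.dropout_out = nn.Dropout(dropout)
        # The start/end labels only live in the CRF transition matrix,
        # so the dense layer scores the real labels only.
        self.emission = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        x = self.dropout_in(self.embedding(token_ids))
        h, _ = self.lstm(x)                               # (batch, seq_len, 2 * hidden_dim)
        return self.emission(self.dropout_out(h))         # emission scores for the CRF
```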
The Adam optimizer is used with gradient clipping set to 2.0 (commonly used to avoid exploding gradients in RNNs). To prevent overfitting, besides the dropout layers, some hyperparameter tuning is done: in particular, the model works well with a learning rate of 1e-2 and a weight decay of 1e-7. The latter also penalizes large weights and therefore the complexity of the model.
To evaluate the model, the negative log-likelihood is used as the loss function instead of the likelihood, because the partial derivatives of the logarithm are easier to compute than those of the softmax equation; it must be minimized, since minimizing the negative log-likelihood is equivalent to maximizing the likelihood.
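A hedged sketch of the corresponding training step; the `crf` object and its `neg_log_likelihood(emissions, labels, mask)` method are a hypothetical interface standing in for the custom CRF layer described above.

```python
import torch

model = NerTagger(embedding_layer, num_labels=len(label_vocab))   # from the sketch above
params = list(model.parameters()) + list(crf.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2, weight_decay=1e-7)

def train_step(token_ids, labels, mask):
    optimizer.zero_grad()
    emissions = model(token_ids)
    # Negative log-likelihood of the gold label sequence under the CRF.
    loss = crf.neg_log_likelihood(emissions, labels, mask)
    loss.backward()
    # Clip gradients at 2.0 to avoid exploding gradients in the recurrent layers.
    torch.nn.utils.clip_grad_norm_(params, max_norm=2.0)
    optimizer.step()
    return loss.item()
```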
Results
The metrics used to evaluate the model are macro precision, macro recall, and macro f1, which are computed independently for each class and then averaged. The model is trained for 10 epochs, each taking approximately 2 minutes, and reaches about 88% macro f1 on the test dataset.
As can be seen in the confusion matrix, most of the wrong predictions concern the minority classes, i.e. the labels ‘ORG’ and ‘LOC’; this is the reason for using macro metrics, since they take class imbalance into account.
The performances obtained on the dev dataset are the following:
|  | precision | recall | f1-score | support |
|---|---|---|---|---|
| PER | 0.92 | 0.89 | 0.90 | 14396 |
| LOC | 0.86 | 0.80 | 0.83 | 12359 |
| O | 0.99 | 0.99 | 0.99 | 315809 |
| ORG | 0.82 | 0.73 | 0.77 | 9043 |
| accuracy |  |  | 0.97 | 351607 |
| macro avg | 0.90 | 0.85 | 0.87 | 351607 |
| weighted avg | 0.97 | 0.97 | 0.97 | 351607 |
Conclusion
The resulting model is kept very simple and as general as possible, so that it does not depend on the specific task, and it still achieves performance close to that of state-of-the-art models.
Semantic Role Labeling
A fundamental task in Natural Language Processing is Semantic Role Labeling (SRL), which extracts the predicate-argument structure of a sentence, determining “who did what to whom”, “when”, “where”, and “how”. The whole SRL process is divided into four steps:
- predicate identification: identifying the predicates in a sentence;
- predicate disambiguation: determining the type of the identified predicates;
- argument identification: identifying the arguments of a specific predicate in a sentence;
- argument classification: determining the role of the arguments in relation to a specific predicate.
For the purpose of the homework, the mandatory tasks are argument identification and argument classification, while the other two are optional.
In this work, two models are presented: one which deals with the argument identification and argument classification tasks, and another which also deals with predicate disambiguation in addition to argument identification and argument classification; from now on, these two models will be respectively referred to as BiLSTM and Stacked BiLSTM.
Since both argument classification and predicate disambiguation can be seen as sequence labeling tasks, the neural model chosen to face them is the bidirectional LSTM, which can capture syntactic information by leveraging the context created by its states.
Moreover, the two models use a pre-trained BERT model to get the embedding representation of each word of a sentence during the preprocessing of the dataset.
The next sections describe the structure of the two models, the encoding of the input, the training phase, and the obtained results.
Neural architecture
The architectures of the two models, BiLSTM and Stacked BiLSTM, and the flow of data through them are described in the following subsections.
BERT model
Taking inspiration from P. Shi et al. 2019, during the preprocessing of the dataset a pre-trained BERT model is used, instead of pre-trained word embeddings, to obtain the embedding representation of each sentence; BERT has shown impressive gains in many natural language tasks, among which are sequence labeling tasks. BERT, which stands for Bidirectional Encoder Representations from Transformers, is a neural model that makes use of the Transformer, an attention mechanism that learns contextual relations between words and/or sub-words in a text.
There are many pre-trained models that differ in the number of parameters; for this work, I adopted the smallest pre-trained BERT model (‘bert-base-cased’), which has 12 layers, a hidden size of 768, and 12 self-attention heads, for a total of about 110M parameters (as reported in J. Devlin et al. 2018).
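A minimal sketch of how the contextual embeddings could be extracted during preprocessing with the HuggingFace `transformers` library; the library choice and the sub-word averaging strategy are assumptions, since the report does not detail how sub-word vectors are aggregated back to words.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased").eval()

@torch.no_grad()
def sentence_embeddings(words):
    """Return one 768-dim vector per word, averaging the vectors of its sub-words."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state.squeeze(0)         # (num_subwords, 768)
    vectors = []
    for i in range(len(words)):
        subword_idx = [j for j, w in enumerate(enc.word_ids()) if w == i]
        vectors.append(hidden[subword_idx].mean(dim=0))
    return torch.stack(vectors)                               # (num_words, 768)

# e.g. sentence_embeddings(["The", "cat", "sat", "on", "the", "mat", "."])
```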
Embedding layers
With the aim of giving the model as much information as possible, the syntactic information provided by the dataset is exploited: the features used are POS tags, predicate frames, and lemmas (D. Marcheggiani et al. 2017). These are exploited in different ways in the two models, as described below.
After obtaining the contextual embedding representation of the sentence with the BERT model, this representation is concatenated, depending on the model, with the information obtained from the embedding layers.
- BiLSTM: in this model the sentence is simply represented by the concatenation of all the selected information; so, let $x_i = b_i \circ p_i \circ t_i \circ f_i \circ l_i$ be the representation of the $i$-th token $w_i$ of the sentence, where $\circ$ is the concatenation operator; then, $b_i$ is the contextual embedding of the token $w_i$, $p_i$ is the predicate indicator of the token $w_i$ (1 if $w_i$ is a predicate, 0 otherwise), $t_i$ is the embedding of the POS tag of $w_i$, $f_i$ is the embedding of the predicate frame of $w_i$, and $l_i$ is the lemma embedding of $w_i$.
- Stacked BiLSTM: since this model is composed of two bidirectional LSTMs, where the latter leverages the output of the former, a different representation is adopted. The first LSTM, which faces the predicate disambiguation task, is fed the sentence with only the information useful to disambiguate the predicates, namely $x_i = b_i \circ p_i \circ t_i \circ l'_i$; to give more relevance to the tokens that represent predicates, $l'_i$ is the lemma embedding of the token $w_i$ only if in position $i$ there is a predicate, while for the POS tag embeddings it is better to keep the information of all the tokens, because knowing what part of speech the other tokens play in the sentence also helps to disambiguate a predicate. The second LSTM of the Stacked BiLSTM leverages the output $o_i$ of the previous LSTM to face the argument identification and argument classification tasks, so the representation of each token becomes $x_i = b_i \circ p_i \circ o_i \circ l_i$, where in this case the lemma embeddings of all the tokens are used. Note that the POS tag embeddings are no longer used, since $o_i$ already carries that information.
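As an illustration of the first representation, a sketch of how the concatenation could be implemented; the embedding dimensions follow the hyperparameter table below, while the layer and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Concatenate b_i, p_i, t_i, f_i, l_i into x_i for the (non-stacked) BiLSTM model."""

    def __init__(self, num_pos, num_frames, num_lemmas,
                 pos_dim=128, frame_dim=200, lemma_dim=200):
        super().__init__()
        self.pos_emb = nn.Embedding(num_pos, pos_dim)
        self.frame_emb = nn.Embedding(num_frames, frame_dim)
        self.lemma_emb = nn.Embedding(num_lemmas, lemma_dim)

    def forward(self, bert_vectors, predicate_flags, pos_ids, frame_ids, lemma_ids):
        # bert_vectors: (batch, seq_len, 768); predicate_flags: (batch, seq_len) in {0, 1}
        return torch.cat([bert_vectors,
                          predicate_flags.unsqueeze(-1).float(),
                          self.pos_emb(pos_ids),
                          self.frame_emb(frame_ids),
                          self.lemma_emb(lemma_ids)], dim=-1)
```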
BiLSTM
Long short-term memory networks (LSTMs) are a variant of recurrent neural networks that use memory cells to propagate information through the sequence without applying non-linearities. The bidirectional LSTM is used to build the context of each word in the sentence using two different LSTMs: one computes the left context of each token, and a second one reads the same sequence in reverse to create the right context of the tokens; the two contexts are concatenated to obtain the representation of each token in the sentence.
Both models rely on this kind of bidirectional encoder: BiLSTM uses a single bidirectional LSTM for argument identification and classification, while Stacked BiLSTM stacks two bidirectional LSTMs, one for predicate disambiguation and one for the argument tasks.
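A simplified sketch of the stacked arrangement, assuming a linear classification head on top of each BiLSTM; the hidden sizes follow the hyperparameter tables below, while `role_in_extra` stands for the size of the extra features (BERT vectors, predicate indicator, lemma embeddings) concatenated to the output of the first BiLSTM.

```python
import torch
import torch.nn as nn

class StackedSrlModel(nn.Module):
    """First BiLSTM disambiguates the predicates; the second one labels the arguments."""

    def __init__(self, in_dim, num_frames, num_roles, role_in_extra,
                 pred_hidden=400, role_hidden=512):
        super().__init__()
        self.pred_lstm = nn.LSTM(in_dim, pred_hidden, batch_first=True, bidirectional=True)
        self.pred_head = nn.Linear(2 * pred_hidden, num_frames)
        # The role BiLSTM consumes the output of the predicate BiLSTM,
        # concatenated with the remaining word-level features.
        self.role_lstm = nn.LSTM(2 * pred_hidden + role_in_extra, role_hidden,
                                 batch_first=True, bidirectional=True)
        self.role_head = nn.Linear(2 * role_hidden, num_roles)

    def forward(self, pred_features, role_extra_features):
        o, _ = self.pred_lstm(pred_features)              # o_i: output of the first BiLSTM
        frame_scores = self.pred_head(o)                  # predicate disambiguation scores
        h, _ = self.role_lstm(torch.cat([o, role_extra_features], dim=-1))
        return frame_scores, self.role_head(h)            # role scores per token
```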
Experiment
The models are built incrementally, starting from using only the word embeddings to represent the sentence and gradually adding almost all the syntactic information given in the dataset.
Encoding of the input
To represent the sentences, some vocabularies are built for the POS tags, the lemmas, the predicates, and the roles (the last one is necessary to evaluate the performance of the model). After verifying that all the POS tags and roles present in the validation and test datasets also appear in the training dataset, their vocabularies are built using only the information in the provided dataset; for the vocabulary of the predicate frames, instead, I used VerbAtlas (A. Di Fabio et al. 2019), which provides all the possible frames, avoiding the use of an ‘unknown’ token.
Since the dataset contains sentences without predicates, these are still treated as relevant information; however, because the dictionary of roles for these sentences is empty, it is replaced with an array filled with the null token. In this way no information is lost, and it even helps the model to better identify which tokens should be predicted as the null token.
In both models, the BiLSTM that predicts the roles takes as input a sentence with only one predicate, in order to predict only the roles that each token assumes with respect to that specific predicate and to avoid missing predictions, since a token can have different roles for two distinct predicates. Therefore, for sentences with many predicates, the model duplicates the sentence during its forward step and assigns one predicate to each copy.
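A sketch of how a sentence with several predicates could be split into one copy per predicate; the dictionary layout and the '_' null marker are assumptions about the dataset format.

```python
def split_by_predicate(sentence):
    """Yield one copy of the sentence per predicate, with a single active predicate flag.

    `sentence` is assumed to be a dict with a 'words' list and a 'predicates' list,
    where non-predicate positions hold the null marker '_'.
    """
    predicate_positions = [i for i, p in enumerate(sentence["predicates"]) if p != "_"]
    for position in predicate_positions:
        copy = dict(sentence)
        # Only the selected predicate stays marked; the roles are predicted w.r.t. it.
        copy["predicate_flags"] = [1 if i == position else 0
                                   for i in range(len(sentence["words"]))]
        yield copy
```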
Training phase
The models are trained using the Adam optimizer. The hyperparameters used for the BiLSTM model are the following:
| Hyperparameter | Value |
|---|---|
| BERT embedding dim | 768 |
| POS tag embedding dim | 128 |
| Lemma embedding dim | 200 |
| Predicate embedding dim | 200 |
| Hidden dim | 512 |
| Batch size | 128 |
| Dropout | 0.2 |
| Learning rate | 1e-3 |
| Gradient clipping | 2 |
while the ones used for the Stacked BiLSTM model are:
| Hyperparameter | Value |
|---|---|
| BERT embedding dim | 768 |
| POS tag embedding dim | 128 |
| Lemma embedding dim | 200 |
| Predicate hidden dim | 400 |
| Roles hidden dim | 512 |
| Batch size | 128 |
| Dropout | 0.2 |
| Learning rate | 1e-3 |
| Gradient clipping | 2 |
Furthermore, different loss functions are used for the two models: since BiLSTM only faces the argument identification and classification tasks, a single classification loss over the roles is used, while for Stacked BiLSTM the loss of the predicate disambiguation task is combined with the loss of the role classification task.
To improve the models’ generalization ability and to avoid overfitting, some regularizers are introduced. One of them is the dropout layer, which is applied on the whole word representation and on the input and output of the BiLSTM that takes care of the role classification task; moreover, it is also applied to the output of the BiLSTM that takes care of the predicate disambiguation task. Another regularizer is gradient clipping, used to avoid exploding gradients, which negatively impact the performance of LSTMs.
The last regularizer is early stopping with a patience of 5 epochs and a minimum change of 1e-4: it breaks the training loop if the f1-score does not improve by at least 1e-4 for 5 consecutive epochs and restores the weights of the best model, namely the one with the highest f1-score.
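A minimal sketch of this early-stopping logic; checkpointing through `state_dict` is an implementation assumption.

```python
import copy

class EarlyStopping:
    """Stop when the validation f1 does not improve by at least `min_delta` for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best_f1, self.bad_epochs, self.best_weights = float("-inf"), 0, None

    def step(self, f1, model):
        if f1 > self.best_f1 + self.min_delta:
            self.best_f1, self.bad_epochs = f1, 0
            self.best_weights = copy.deepcopy(model.state_dict())  # keep the best model
            return False
        self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            model.load_state_dict(self.best_weights)               # restore the best weights
            return True                                            # signal: break the training loop
        return False
```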
Results
The performances of the final models on the given test dataset are the following:
| Architecture | Predicate disambiguation | Argument identification | Argument classification |
|---|---|---|---|
| BiLSTM (W + Pi) | - | 84.48 % | 76.45 % |
| BiLSTM (B + Pi) | - | 91.30 % | 82.50 % |
| BiLSTM (B + Pi + T + L + P) | - | 93.91 % | 90.09 % |
| Stacked BiLSTM (B + Pi + O + L) | 95.17 % | 93.50 % | 87.83 % |
The shown outcomes are obtained after 17 epochs of training, as shown by the loss curves, for both BiLSTM and Stacked BiLSTM.
Note that the metric used to evaluate the performance of the models is not accuracy, since the distribution of the possible roles (and also of the possible predicates in Stacked BiLSTM) is highly imbalanced, with the null label far more frequent than the others.
Moreover, since the tasks are multi-class problems, the adopted metric to evaluate the models is the f1-score.
These results do not match those of state-of-the-art models, but they are very close, so the proposed models are a good baseline for the SRL task.