A Primer on Natural Language Processing – Artificial intelligence

The Field of Natural Language Processing

Natural Language Processing is one of the fastest evolving fields in AI and machine learning. It might also be the shortest path to understand intelligence. When we think of an intelligent machine, we imagine a machine that can communicate with us, that has language skills.

Alan Turing in his famous 1950 paper on Computing Machinery and Intelligence (“I.—COMPUTING MACHINERY AND INTELLIGENCE | Mind | Oxford Academic,” 1950) proposes to answer the question “Can Machine Thinks?” with an Imitation Game (now called the Turing test) based on language. A machine that can have a natural conversation with a human would be considered a thinking machine. Solving AI would therefore be equivalent to solving NLP.

Solving NLP involves many practical tasks that should be useful beyond looking for artificial general intelligence. In this chapter, we review some of these tasks and go over the different models which are used in modern deep learning NLP including the GPT-3 model.

Language Tasks

NLP tasks are as diverse as the different uses of natural language. We present a non-exhaustive list of tasks: Question answering, machine translation, named entity extraction, coreference resolution, semantic role labeling, sentiment analysis, textual entailment.

Question answering

A question answering task tests reading comprehension of an NPL system. The NLP system should be able to answer questions. The prevalent benchmark is the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). It contains a list of 100k questions with answers identified as a segment of text (a span) in a Wikipedia entry. For instance to the question “What causes precipitation to fall?”, it answers “gravity”.

The latest version, SQuaD, also includes 50k unanswerable questions. If the question does not have answers, a system should not offer one. An NPL system is given the question and has to retrieve the answer from the Wikipedia articles. It is evaluated according to its F1 score (F1 Score = 2*(Recall * Precision) / (Recall + Precision)).

Machine translation

Machine translation is one of the most popular applications of NLP and is used in tools such as Google Translate or on Facebook to translate posts. Datasets used for machine translation are provided by the Workshop on Statistical Machine Translation (WMT) (“Translation Task – ACL 2014 Ninth Workshop on Statistical Machine Translation,” n.d.). They include the WMT2014 English-German dataset and the WMT2014 English-French dataset.

The models are evaluated with the BLEU score which considers human translation as the benchmark. The Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002)) is a precision measure. It counts the matches of 4-grams to the human translation and makes adjustments for the length of the translation. The BLEU score is found to be highly correlated with the human judgment of translation quality.

Named entity extraction

Named entity extraction identifies named entities in a text and assigns them to different categories such as persons, organizations and locations, or miscellaneous entities. This task is useful to search, reference, or classify documents. It has to identify the named entities, which can be one or several tokens such as the United States of America, and then classify the named entity correctly. It is evaluated according to its F1 score. A benchmark database is the Reuters RCV1 corpus (“Reuters Corpora @ NIST,” n.d.)with annotated entity classifications.

Coreference resolution

Coreference resolution consists of linking worlds referring to the same entity, especially pronouns in a sentence. For instance, a benchmark database is the OntoNotes coreference annotations (“OntoNotes Release 5.0 – Linguistic Data Consortium,” n.d.). It is evaluated according to its F1 score. An example of coreference resolution is the Winograd Schema Challenge. In the sentence “The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.” Depending on which verb is used, “they” refer to either the “city councilmen” (“feared”) or “demonstrators” (“advocated”). So some deep understanding of the sentence seems to be required to identify the correct coreference. The Winograd Schema Challenge has been compared to the Turing Test.

Semantic role labeling

Semantic role labeling consists of labeling words according to their role around a predicate in a sentence. For instance, The Proposition Bank or PropBank (“The Proposition Bank (PropBank),” n.d.), built on top of the Penn Treebank (“Treebank-3 – Linguistic Data Consortium,” n.d.), has a list of annotated sentences with verb predicates and defined roles for each argument of the predicate. The roles are specific for each verb predicate. For the predicate “agree”, the roles are “Agreer”, ”Proposition”, and “Other entity agreeing”. Another common source of labeling is FrameNet which is focused on frames and frame elements instead of a verb predicate and OntoNotes which build on top of the Penn Treebank for syntax, and Propbank for predicate-argument structure.

Sentiment analysis

Sentiment analysis deals with the polarity, positive or negative, of a sentence or piece of text. It can be applied to movie reviews, product reviews, written reports, news articles, social media posts, or customer voice interactions. A standard database with annotated sentiments is the Stanford Sentiment Treebank (“Treebank-3 – Linguistic Data Consortium,” n.d.) which uses around 11,000 sentences from movie reviews. Each movie review falls into one of five categories from very negative, negative, somewhat negative, neutral to somewhat positive, positive, and very positive as classified by Amazon mechanical Turks. A Bag of word approach can be used where each word is given a sentiment score but is sometimes not sufficient because it lacks context and order.

Textual entailment

Textual entailment is the relationship between a text and a hypothesis. Given a text or fact, the NLP system has to evaluate if a hypothesis is True (entailment), False (contradiction), or Neutral. A benchmark is the Stanford Natural Language Inference (SNLI)(“The Stanford Natural Language Processing Group,” n.d.). It has 570k entries of text, judgments (entailment, contradiction, or neutral), and hypothesis. For instance, the text could be “A soccer game with multiple males playing.”, the hypothesis is “A soccer game with multiple males playing.” and the judgment is “entailment”, because the hypothesis is backed by the text. If the text is “A black race car starts up in front of a crowd of people.” and the hypothesis is “A man is driving down a lonely road.” then the judgment is “contradiction”.

Other tasks

There are many other tasks such as speech recognition (used by personal assistants such as Siri or Alexa), text-to-speech to read texts, text summarization to summarize news articles, reports, or books, text classification to screen for email spams, offensive contents, or identify authorship, information extraction to collect data from web pages or online documents, information retrieval to find relevant documents or pieces of information (used in search engines in Google, YouTube or Amazon).

Classical NLP Modelling

Symbolic NLP

To solve these tasks one approach is to teach the computer vocabulary, syntax, and grammar, the rules of language. This approach is symbolic NLP and uses parsing techniques to identify the words, their roles, and their meanings (Part-of-Speech or POS tagging). Because of the complexity and ambiguity of language and its relative free form, it is difficult to make a hand-written inventory of all the rules required to understand and generate language.

Another approach is to learn language probabilistically, using a statistical language model that is trained on real-world data. Because of the now extremely large amount of digital text available with corpora of millions, billions, and even trillions of words, and the large availability of computing power, the statistical approach has gained the upper ground while the symbolic approach has not made meaningful progress in real-world applications. MIT Professor Noam Chomsky has been very critical of the statistical approach despite of its success. He was quoted as saying:

“It’s true there’s been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success … which I think is novel in the history of science. It interprets success as approximating unanalyzed data.” (“Pinker/Chomsky Q&A from MIT150 Panel,” n.d.)

Norvig (“On Chomsky and the Two Cultures of Statistical Learning,” n.d.) has an interesting article addressing his criticism. In particular, he points out the empirical success of these models applied to search engines (Norvig works at Google), speech recognition, machine translation, and question answering.

Language Model

A language model describes the probability distribution of words. It is a statistical representation of language. It answers the question of what is the probability that a word appears after a sequence of words, or what is the probability that a sentence was said vs another one. This is a very powerful approach to develop language applications because it can leverage existing textual data and can tell for instance, if a sentence is grammatically correct or logical because correct and logical sentences are more likely to occur in the data.

Bag of words

The simplest language model is the bag-of-words model where only the frequency of each word matters, not the ordering nor the presence of other words. It is a poor model to generate sentences but it is useful to measure sentiment or classify text. If some words tend to appear more frequently in a negative sentence, their presence can indicate that a sentence is likely to be negative, using the Bayes formula of conditional probabilities.

N-gram models

A more advanced approach than the bag-of-words is the N-gram model. In the N-gram model, the probability of each word is conditional on the previous N-1 words. A bigram model accounts only for the previous word, a 3-gram model will account for the previous two words, etc… Given these conditional probabilities, the probability of a full sentence can be calculated thanks to the law of iterated expectations. It will be expressed as a simple product of conditional probabilities or as the sum of logarithmic probabilities if logs are used. N-gram models can be used for spam detection, sentiment analysis, or document classification.

Deep NLP Modelling

Word Embedding and Word Vectors

The previous language models do not compare words. Two similar words or related words should be close in some dimension and word vectors allow these comparisons. Word vectors are also called world embeddings. Two successful approaches have been Glove and Word2Vec.

Glove (Global vectors)

Glove (Pennington et al., 2014) was developed at Stanford to construct vector representations of words. It is based on the co-occurences of words. Co-occurence means that words occur together in the same sentence. An unsupervised machine learning model on a corpus to estimate the co-occurence of pairs of words. The word vectors are estimated so that the dot product of two word vectors is equal to the probability of co-occurence. Thanks to this vector representation, relationships between words can be seen such as man to woman and king to queen.

Word2Vec

Word2Vec (Mikolov et al., 2013) was developed at Google and also aims to create word vectors where similar words have close representations. Closeness is measured either with the continuous bag of words (CBOW) or the continuous skip-gram. IWith continuous bag of words, a word is predicted according to its context. With continuous skip-gram a word predicts the context worlds surrounding it. A two layer neural network is used to estimate each model.

GLUE Benchmark

The General Language Understanding Evaluation (GLUE, (Wang et al., 2019)) benchmark is a set of tests to evaluate NLP models on different tasks of sentence understanding. Some tasks are based on individual sentences, some others on pairs of sentences.

Table 1. The GLUE tasks.

Because the new generations of NPL models tend to have superhuman performance in some tasks from the GLUE benchmark, SuperGLUE (Wang et al., 2020) has been introduced with more difficult and more varied tasks which also include human benchmarks.

Recurrent Neural Network (RNN)

RNNs (Elman, 1990) are a type of neural network that allows efficient modeling of sequences, such as time series or text data.

In a basic RNN, for each step t, an input vector x(t) is combined with a hidden vector or layer h(t-1) to produce an updated vector h(t) which then generates the vector y(t). In the next step t+1, the new input vector x(t+1) is combined with h(t) from the previous step to produce the new output vector y(t+1). The relationship between x(t+n),h(t+n-1),h(t+n),and x(t+n) is independent of n which makes it more efficient with fewer parameters to estimate.

Figure 1. Recurrent Neural Networks

The hidden layer h(t+n) keeps the memory of the previous step layers h(t+n-1),h(t+n-2),…,h(0). The parameters are estimated by back-propagation through time starting from the last period and moving back to the initial values of each layer.

As for a neural network with too many layers, the RNN can suffer from vanishing gradients t=(gradient becoming smaller and smaller as we go back in time) or exploding gradients (gradient becoming larger and larger). To address this issue, the LSTM model has been created.

Long Short Term Memory (LSTM)

LSTM was introduced in Hochreiter and Schmidhuber (Hochreiter and Schmidhuber, 1997). The LSTM uses a carry or memory cell c(t) which depends on an input gate it, and a forget gate f(t). The output depends on an output gate o(t).

The memory cell carries information from one step to the other but is more flexible than the hidden state. The information is copied with some adjustments. The memory cell depends on:

input gate it: the input gate modulates the information from the input layer x(t) and the hidden layer h(t)
a forget gate f(t): the forget date can erase some past memory cell information

The memory cell can therefore forget some past memory with the forget gate and use some new memory content thanks to the input fate. is the sigmoid function.

c(t+1)=f(t) ⊙ c(t)+i(t)⊙σ(b+Ux(t)+Wh(t))

⊙ is the element-wise multiplication.

The output uses an output gate o(t), the output gate modulates the memory cell c(t) to transform it into an output vector y(t). The output is calculated as:

yt=o(t)⊙tanh(ct)

The input gate, the output gate, and the forget date are updated with a sigmoid function :

i(t)=σ(b(i)+U(i)x(t)+W(i)h(t-1))

o(t)=σ(b(o)+U(o)x(t)+W(o)h(t-1))

f(t)=σ(b(f)+U(f)x(t)+W(f)h(t-1))

Figure 2. LTSM

The output y(t) will depend on the hidden state h(t) and memory cell c(t).

Compared to the simple RNN, the input layer x(t) does not feed the hidden layer h(t) directly but indirectly through the memory cell c(t). The hidden layer h(t-1) does not feed into the next hidden layer h(t) directly but only indirectly through the memory cell c(t).

Bi-directional LSTM

With the bidirectional LSTM (Graves and Schmidhuber, 2005), the same sequence is analyzed in reverse and the two LSTM outputs are combined by concatenation, sum, or product (Figure 3).

Figure 3. Bidirectional LSTM

Gated Recurrent Units (GRU)

The GRU was introduced by Cho (Cho et al., 2014) to simplify the LSTM. There is no more hidden layer. The output layer depends on an update gate u(t) and a reset gate r(t).

The update gate u(t) and the reset gate r(t) are updated with a sigmoid function :

u(t)=σ(b(u)+U(u)x(t)+W(u)y(t))

r(t)=σ(b(r)+U(r)x(t)+W(r)y(t))

The output layer y(t) is then updated as:

y(t+1)=u(t)⊙y(t)+(1-u(t))⊙(b+Ux(t)+Wr(t)⊙y(t))

Updating and resetting to a new value is determined in a single equation and bypasses a memory cell and a hidden layer.

ELMo (Embeddings from Language Models)

In traditional word embedding, a word can have only one meaning. ELmo, proposed in 2018 by the Allen AI Institute and the University of Washington (Peters et al., 2018), improves on traditional static word embeddings such as GloVe buy using the context of the word usage. It constructs vector representation of words based on the parameters of bidirectional on a LTSM model trained on a large text corpus. The representation depends on the whole sentence in which the world appears. These are contextualized representations since they depend on the context of the word.

The parameters are from all the layers of the biLSTMs and not only from the last layer.The parameters from the upper layers help understand context, while the parameters from the lower layers help to understand the syntax.

ELMo can be integrated to improve NLP tasks. The BLSTM model is run on the text and the ELMo representations and the status word representations are both fed into the supervised NLP tasks. ELMo improves the performance of many tasks such as question answering, text entailment, semantic role labeling, or coreference resolution.

Attention Model

Attention

The concept of attention allows to associate dynamically each word or token in a sequence to some words or tokens in another or the same sequence. This allows richer associations that do not depend on specific locations of the target words, in particular it can relate a word to words which are not in close proximity. This is useful in translation for instance where a meaningful word can be at the beginning of a sentence and still be useful to translate a word appearing at the end.

Attention (Vaswani et al., 2017) uses the concept of Queries, Keys, and Values. A Query is what we are looking for, the Key gives the location of what we are looking for and the Value is the result of the query. A Query is for instance a word, the key is a page in a dictionary where the word appears and the value is the translation of the word.

A word represented as an embedding vector x, is multiplied (matrix dot product) with a query weight matrix W(Q) to produce Queries Q, with a key weight matrix W(K) to produce Keys K. These matrices Q,K,V are then combined together and transformed into probabilities (through a softmax function and after normalization) to emphasize attention to specific values or tokens in the same sequence. The values V are calculated as the dot product between the initial token and a value weight matrix W(V).

The self-attention vector is then:

attention(Q,K,V)=softmax(QKᵀ/√d(k))V

d(k) is the dimension of the key vectors and is its square root is a normalization factor.

We can represent it in a picture:

Figure 4. Attention mechanism

attention(Q,K,V)is called a dot product attention (here a scaled dot product attention because of the scale factor). It would be called self-attention if the target sequence is the same. It is causal attention if attention cannot look forward and a mask is used to eliminate any forward looking attention. It is bimodal attention if attention can look backward and forward (two directions).

To have causal attention we just add to QKᵀ a triangular matrix M with 0 everywhere and -∞ in the upper triangular area of the matrix.

Multi-Head Attention

The procedure can me repeated several times with different W(Q,i),W(K,i),W(V,i) matrices to create several self-attention vectors or a “multi-head” attention. These vectors are then concatenated and multiplied by another weight matrix W(O) to produce a single self-attention vector.

z(i)=attention(Q(i),K(i),V(i)) for i=1,..,h if there are h heads.

Then

Multihead(Q,K,V)=Concat(z(1),…,z(h))W(O)

Transformers

Transformers (Vaswani et al., 2017) use the multi-head attention and as well as add & normalize layers, and feed-forward layers. A Transformer layer can combine an encoder and a decoder or use one only an encoder (as in BERT models) or only a decoder (as in GPT models).

They use as inputs word embeddings. Word embeddings are vector representations of words. Each word is represented by a vector of fixed dimension. The word embedding is a list of such vectors. The list size is fixed, usually equal to the longest sentence in a text.

Encoder

The encoder has four layers. As input it uses an embedding added to a positional encoding. The positional encoding indicates the location of each word vector. In the encoder, the input goes through a multi-head attention layer which encodes each word vector with other vectors which it needs to pay attention to. It is then added and normalized with the original layer input to preserve some memory of the input. It goes to a feed forward layer and again another add and normalize layer. The output is then fed into a multi-head attention layer of the decoder.

Figure 5. Transformer Encoder Layer

Decoder

The decoder is very similar to the encoder but it uses a masked multi-head attention layer to pay attention only to past word vectors.

Figure 6. Transformer Decoder Layer

Transformer

The Transformer layer uses a word embedding as input to the encoder, the result is fed into the decoder along with the output word embedding. The output word embedding is inputted into the first decoder layer repeatedly as new words get outputted.

Figure 7. Transformer

Transformers have launched a new wave of pre-trained language models such as BERT and GPT-3. We review some of them.

BERT (Bidirectional Encoder Representations from Transformers)

BERT is a pre-trained language model introduced by Google in 2018 (Devlin et al., 2019), that can be fine-tuned to perform many common NLP tasks such as the ones from the BLUE benchmark. Contrary to ELMo which uses the new embeddings as new features, BERT requires very little re-training.

BERT is using transformers with layers of decoders. It is trained first to identify randomly masked words (Masked Language Model) in a sentence using their contexts, words from the left and the right of the mask, and then to predict a next sentence (Next Sentence Prediction). It is therefore bidirectional contrary to GPT style models which are unidirectional. BERT uses a multi-layer bidirectional Transformer encoder.

There are two versions of BERT: BERT base and BERT large. BERT base has 12 layers of size 768, and 12 self-attention heads, and 110M parameters. BERT large has 24 layers of size 1024, and 16 self-attention heads, and 340M parameters. BERT is described in Figure 8.

BERT is pre-trained on a Corpus made of the BooksCorpus (800M words) and the English Wikipedia (2,500M words) representing a total of 3.3 billion words. The text goes through WordPiece tokenization and then runs through a masking step where tokens are masked at the rate of 15%. The token is replaced by [MASK] 80% of the time, by a random token 10% of time and remains the same 10% of the time. [MASK] is not used 100% of the time because it does appear in the fine-tuning step. BERT then goes through the Next Sentence Prediction step in which pairs of sentences can either be paired correctly with label [IsNext], 50% of the time or with the label [NotNext].

BERT is then fine-tuned on specific tasks. Most of the hyperparameters remain the same, the mode parameters are re-estimated. The input can be pairs of sentences in the case of machine translation or question answering and the output will be some token representations to be fed into a single additional task specific layer.

Figure 8. BERT

RoBERTa

RoBERTa (Robustly optimized BERT pretraining approach (Liu et al., 2019)) is a reimplementation of BERT by Facebook where they change the followings: a longer training period, bigger batches, more data, no the next sentence, longer sequences, and dynamic masking pattern on the training data. The authors find their improvements are significantly improving the model performance and that it achieves state-of-the-art results on GLUE, RACE and SQuAD.

XLNet

XLNet (Dai et al., 2019) is an improvement of the BERT model from Google and Stanford University. It uses a Transformer architecture but uses an auto-regressive approach without masking. It performs token permutations to feed into the encoder layer and tries to predict each token. XLNet also includes ideas from Transformer-XL such as the relative positional encoding scheme and the segment recurrence mechanism into pretraining. XLNet performs better than BERT on many NPL tasks including question answering, natural language inference, sentiment analysis, and document ranking.

ELECTRA

The ELECTRA (Clark et al., 2020) model proposes to use an alternative to masking which is more sample-efficient than BERT. It replaces tokens randomly with alternatives generated by a neural network and the task is to detect these replacements. ELECTRA outperforms BERT on the GLUE benchmark when both run with the same model size, data, and compute. It also outperforms XLNet and ROBERTa with the same amount of compute.

T5

T5 (Raffel et al., 2020) is a unified framework for language modelling based on the original transformer architecture with very changes. It is framed as a text-to-text problem. They use a new cleaned up data set, the “Colossal Clean Crawled Corpus”. They achieve state-of-the-art results on many MPL benchmark tasks such as summarization, question answering, text classification. The T5 needs to be fine-tuned by changing all the pre-trained weights.

GPT-3 (Generative Pre-Training)

GPT-3 (Brown et al., 2020) is a language model that can be used for many downstream tasks such as question answering, text completion, text generation, neural machine translation. GPT-3 is the third generation of the Generative Pre-Training (GPT) model (Radford et al., 2018). GPT-3 is described in Figure XX below.

The original GPT model is pre-trained on a large corpus of text using unsupervised learning and transformers. Each layer of the GPT model is a transformer decoder layer. A decoder layer contains an attention layer and a feed forward neural network. The attention layer is a self-attention layer. The masked attention layer is a masked multi-head self-attention layer that cannot look forward.

The model is then fine-tuned to specific tasks with supervised learning. The GPT-3 model skips the fine-tuning step.

GPT-3 has 175 Billion trainable parameters, 96 layers, 12,288 units in each bottleneck layer, 96 attention heads with 128 units each. Performance increases with the number of parameters. Because of its size, GPT-3 can perform well without fine tuning. The weights do not need to be re-estimated for a new task.

GPT-3 is trained on a combination of five data sets: filtered Common Crowl (410 billion tokens), WebText2 (19 billion), Books1 (12 billion), Books2 (55 billion), and Wikipedia (3 billion). Some datasets are seen several times if they are of higher quality.

GPT-3 is evaluated with few-shot learning, one-shot learning, and zero-shot learning. An X-shot learning means the model is given X examples before returning an answer to a query. GPT-3 improves the state of the art results on several benchmark tasks such as sentence completion, question answering, and machine translation to English but still falls short on some others such as common sense reasoning and reading comprehension. For benchmarks such SuperGlue, it falls short of the best fine-tuned models. GPT-3 shines at news article generation. Humans were only 52% accurate at guessing that an article was written by GPT-3 instead of a human.

Figure 9. GPT-3

Conclusion

It is not clear that we are closer to solving artificial intelligence but the recent progress in NPL has been very impressive. The outputs of these NPL models are very usable and some are already deployed in many commercial applications: digital assistants, mobile phones, customer support, machine translation, article generation etc..The more recent models such as GPT-3 are promising zero-shot learning which could be revolutionary. Still, the accuracies of GPT-3 on several NPL tasks are still lagging human performance by a lot. We expect future generations of models to be even more useful and to become ubiquitous in our daily lives.

References

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language Models are Few-Shot Learners. ArXiv200514165 Cs.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. ArXiv14061078 Cs Stat.

Clark, K., Luong, M.-T., Le, Q.V., Manning, C.D., 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ArXiv200310555 Cs.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R., 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. ArXiv190102860 Cs Stat.

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv181004805 Cs.

Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning, Illustrated edition. ed. The MIT Press, Cambridge, Massachusetts.

Graves, A., Schmidhuber, J., 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw., IJCNN 2005 18, 602–610. https://doi.org/10.1016/j.neunet.2005.06.042

Hochreiter, S., Schmidhuber, J., 1997. Long Short-term Memory. Neural Comput. 9, 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735

I.—COMPUTING MACHINERY AND INTELLIGENCE | Mind | Oxford Academic [WWW Document], n.d. URL https://academic.oup.com/mind/article/LIX/236/433/986238 (accessed 11.13.20).

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv190711692 Cs.

Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. ArXiv13013781 Cs.

On Chomsky and the Two Cultures of Statistical Learning [WWW Document], n.d. URL http://norvig.com/chomsky.html (accessed 7.25.20).

OntoNotes Release 5.0 – Linguistic Data Consortium [WWW Document], n.d. URL https://catalog.ldc.upenn.edu/LDC2013T19 (accessed 11.13.20).

Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., 2002. Bleu: a Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Presented at the ACL 2002, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. https://doi.org/10.3115/1073083.1073135

Pennington, J., Socher, R., Manning, C., 2014. Glove: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Presented at the Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp. 1532–1543. https://doi.org/10.3115/v1/D14-1162

Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations. ArXiv180205365 Cs.

Pinker/Chomsky Q&A from MIT150 Panel [WWW Document], n.d. URL http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html (accessed 11.13.20).

PyTorch documentation — PyTorch 1.7.0 documentation [WWW Document], n.d. URL https://pytorch.org/docs/stable/index.html (accessed 12.30.20).

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., 2018. Improving Language Understanding by Generative Pre-Training 12.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv191010683 Cs Stat.

Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P., 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. ArXiv160605250 Cs

TensorFlow [WWW Document], n.d. . TensorFlow. URL https://www.tensorflow.org/ (accessed 12.30.20).

The Proposition Bank (PropBank) [WWW Document], n.d. URL https://propbank.github.io/ (accessed 11.13.20).

The Stanford Natural Language Processing Group [WWW Document], n.d. URL https://nlp.stanford.edu/projects/snli/ (accessed 11.13.20).

Translation Task – ACL 2014 Ninth Workshop on Statistical Machine Translation [WWW Document], n.d. URL http://www.statmt.org/wmt14/translation-task.html (accessed 11.10.20).

Treebank-3 – Linguistic Data Consortium [WWW Document], n.d. URL https://catalog.ldc.upenn.edu/LDC99T42 (accessed 11.13.20).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention Is All You Need. ArXiv170603762 Cs.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R., 2020. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. ArXiv190500537 Cs.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R., 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ArXiv180407461 Cs.