A Primer on Natural Language Processing

The Field of Natural Language Processing

Natural Language Processing is one of the fastest evolving fields in AI and machine learning. It might also be the shortest path to understanding intelligence. When we think of an intelligent machine, we imagine a machine that can communicate with us, one that has language skills.

Alan Turing, in his famous 1950 paper on Computing Machinery and Intelligence (“I.—COMPUTING MACHINERY AND INTELLIGENCE | Mind | Oxford Academic,” 1950), proposes to answer the question “Can machines think?” with an Imitation Game (now called the Turing test) based on language. A machine that can have a natural conversation with a human would be considered a thinking machine. Solving AI would therefore be equivalent to solving NLP.

Solving NLP involves many practical tasks that should be useful beyond looking for artificial general intelligence. In this chapter, we review some of these tasks and go over the different models which are used in modern deep learning NLP including the GPT-3 model.

Language Tasks

NLP tasks are as diverse as the different uses of natural language. We present a non-exhaustive list of tasks: Question answering, machine translation, named entity extraction, coreference resolution, semantic role labeling, sentiment analysis, textual entailment.

Question answering

A question answering task tests the reading comprehension of an NLP system: the system should be able to answer questions about a text. The prevalent benchmark is the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). It contains a list of 100k questions with answers identified as a segment of text (a span) in a Wikipedia entry. For instance, to the question “What causes precipitation to fall?”, the answer is “gravity”.

The latest version, SQuAD 2.0, also includes 50k unanswerable questions. If the question does not have an answer, a system should not offer one. An NLP system is given the question and has to retrieve the answer from the Wikipedia articles. It is evaluated according to its F1 score (F1 Score = 2*(Recall * Precision) / (Recall + Precision)).
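The F1 formula above can be illustrated with a minimal sketch that compares a predicted answer with a reference answer token by token. The whitespace tokenization and the function name `f1_score` are assumptions made for illustration; the official SQuAD evaluation also normalizes punctuation and articles before comparing.

```python
from collections import Counter


def f1_score(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Tokens shared by both answers (multiset intersection).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


print(f1_score("gravity", "gravity"))
print(f1_score("the force of gravity", "gravity"))
```

A longer answer that still contains the reference span keeps a recall of 1 but is penalized on precision, which is why F1 rewards short, exact spans.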

Machine translation

Machine translation is one of the most popular applications of NLP and is used in tools such as Google Translate or on Facebook to translate posts. Datasets used for machine translation are provided by the Workshop on Statistical Machine Translation (WMT) (“Translation Task – ACL 2014 Ninth Workshop on Statistical Machine Translation,” n.d.). They include the WMT2014 English-German dataset and the WMT2014 English-French dataset. 

The models are evaluated with the BLEU score, which uses human translation as the benchmark. The Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) is a precision measure. It counts the n-grams (up to 4-grams) that match the human translation and adjusts for the length of the translation. The BLEU score has been found to be highly correlated with the human judgment of translation quality.
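As a rough illustration of how such a score can be computed, here is a simplified sentence-level sketch with a single reference translation. It assumes pre-tokenized input, no smoothing, and uniform weights over 1-grams to 4-grams; the function names are hypothetical and this is not the reference implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: translations shorter than the reference are penalized.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(bleu(reference, reference))
print(modified_precision(["the"] * 7, reference, 1))
```

The second call shows why counts are clipped: a candidate that repeats "the" seven times gets credit for only the two occurrences present in the reference.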

Named entity extraction

Named entity extraction identifies named entities in a text and assigns them to different categories such as persons, organizations, locations, or miscellaneous entities. This task is useful to search, reference, or classify documents. The system has to identify the named entities, which can span one or several tokens such as “the United States of America”, and then classify each named entity correctly. It is evaluated according to its F1 score. A benchmark database is the Reuters RCV1 corpus (“Reuters Corpora @ NIST,” n.d.) with annotated entity classifications.

Coreference resolution

Coreference resolution consists of linking words that refer to the same entity, especially pronouns in a sentence. A benchmark database is the OntoNotes coreference annotations (“OntoNotes Release 5.0 – Linguistic Data Consortium,” n.d.). It is evaluated according to its F1 score. An example of coreference resolution is the Winograd Schema Challenge. Consider the sentence “The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.” Depending on which verb is used, “they” refers to either the “city councilmen” (“feared”) or the “demonstrators” (“advocated”). So some deep understanding of the sentence seems to be required to identify the correct coreference. The Winograd Schema Challenge has been compared to the Turing Test.

Semantic role labeling

Semantic role labeling consists of labeling words according to their role around a predicate in a sentence. For instance, the Proposition Bank or PropBank (“The Proposition Bank (PropBank),” n.d.), built on top of the Penn Treebank (“Treebank-3 – Linguistic Data Consortium,” n.d.), has a list of annotated sentences with verb predicates and defined roles for each argument of the predicate. The roles are specific to each verb predicate. For the predicate “agree”, the roles are “Agreer”, “Proposition”, and “Other entity agreeing”. Other common sources of labeling are FrameNet, which is focused on frames and frame elements instead of verb predicates, and OntoNotes, which builds on the Penn Treebank for syntax and on PropBank for predicate-argument structure.

Sentiment analysis

Sentiment analysis deals with the polarity, positive or negative, of a sentence or piece of text. It can be applied to movie reviews, product reviews, written reports, news articles, social media posts, or customer voice interactions. A standard database with annotated sentiments is the Stanford Sentiment Treebank (“Treebank-3 – Linguistic Data Consortium,” n.d.), which uses around 11,000 sentences from movie reviews. Each movie review sentence falls into one of five categories ranging from very negative to very positive, as classified by Amazon Mechanical Turk workers. A bag-of-words approach, where each word is given a sentiment score, can be used but is sometimes not sufficient because it lacks context and order.

Textual entailment

Textual entailment is the relationship between a text and a hypothesis. Given a text or fact, the NLP system has to evaluate if a hypothesis is True (entailment), False (contradiction), or Neutral. A benchmark is the Stanford Natural Language Inference (SNLI) corpus (“The Stanford Natural Language Processing Group,” n.d.). It has 570k entries, each made of a text, a judgment (entailment, contradiction, or neutral), and a hypothesis. For instance, the text could be “A soccer game with multiple males playing.”, the hypothesis “Some men are playing a sport.” and the judgment “entailment”, because the hypothesis is backed by the text. If the text is “A black race car starts up in front of a crowd of people.” and the hypothesis is “A man is driving down a lonely road.” then the judgment is “contradiction”.

Other tasks

There are many other tasks such as speech recognition (used by personal assistants such as Siri or Alexa), text-to-speech to read texts aloud, text summarization to summarize news articles, reports, or books, text classification to screen for email spam or offensive content or to identify authorship, information extraction to collect data from web pages or online documents, and information retrieval to find relevant documents or pieces of information (used in search engines at Google, YouTube, or Amazon).

Classical NLP Modelling 

Symbolic NLP

To solve these tasks, one approach is to teach the computer the rules of language: vocabulary, syntax, and grammar. This approach is symbolic NLP and uses parsing techniques to identify the words, their roles (for instance through Part-of-Speech or POS tagging), and their meanings. Because of the complexity and ambiguity of language and its relatively free form, it is difficult to make a hand-written inventory of all the rules required to understand and generate language.

Another approach is to learn language probabilistically, using a statistical language model that is trained on real-world data. Because of the now extremely large amount of digital text available, with corpora of millions, billions, and even trillions of words, and the wide availability of computing power, the statistical approach has gained the upper hand while the symbolic approach has not made meaningful progress in real-world applications. MIT Professor Noam Chomsky has been very critical of the statistical approach despite its success. He was quoted as saying:

“It’s true there’s been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success … which I think is novel in the history of science. It interprets success as approximating unanalyzed data.” (“Pinker/Chomsky Q&A from MIT150 Panel,” n.d.)

Norvig (“On Chomsky and the Two Cultures of Statistical Learning,” n.d.) has an interesting article addressing his criticism. In particular, he points out the empirical success of these models applied to search engines (Norvig works at Google), speech recognition, machine translation, and question answering. 

Language Model

A language model describes the probability distribution of words. It is a statistical representation of language. It answers the question of what is the probability that a word appears after a sequence of words, or what is the probability that a sentence was said vs another one. This is a very powerful approach to develop language applications because it can leverage existing textual data and can tell for instance, if a sentence is grammatically correct or logical because correct and logical sentences are more likely to occur in the data.

Bag of words

The simplest language model is the bag-of-words model where only the frequency of each word matters, not the ordering nor the presence of other words. It is a poor model to generate sentences but it is useful to measure sentiment or classify text. If some words tend to appear more frequently in a negative sentence, their presence can indicate that a sentence is likely to be negative, using the Bayes formula of conditional probabilities.
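A minimal sketch of this idea is a naive Bayes classifier over word counts. The four training sentences, the add-one smoothing, and the function names are assumptions chosen purely for illustration.

```python
import math
from collections import Counter

# Tiny hypothetical training set, for illustration only.
train = [
    ("a great and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a boring and dull film", "neg"),
    ("dull story and terrible acting", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())  # word order is discarded here
    doc_counts[label] += 1

vocab = set(word_counts["pos"]) | set(word_counts["neg"])


def log_posterior(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score


def classify(text):
    return max(("pos", "neg"), key=lambda label: log_posterior(text, label))


print(classify("a great story"))
print(classify("a dull film"))
```

Because only word frequencies enter the score, shuffling the words of a sentence leaves its classification unchanged, which is exactly the limitation noted above.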

N-gram models

A more advanced approach than the bag-of-words is the N-gram model. In the N-gram model, the probability of each word is conditional on the previous N-1 words. A bigram model accounts only for the previous word, a 3-gram model accounts for the previous two words, and so on. Given these conditional probabilities, the probability of a full sentence can be calculated thanks to the chain rule of probability. It is expressed as a simple product of conditional probabilities, or as a sum of log probabilities if logs are used. N-gram models can be used for spam detection, sentiment analysis, or document classification.
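The sentence probability can be sketched for the bigram case as follows. The two-sentence corpus, the `<s>` and `</s>` boundary markers, and the absence of smoothing are assumptions made to keep the example short.

```python
import math
from collections import Counter

# Tiny hypothetical corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    context_counts.update(tokens[:-1])               # every token that has a successor
    bigram_counts.update(zip(tokens[:-1], tokens[1:]))


def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_log_prob(sentence):
    """Sum of log conditional probabilities over the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p = bigram_prob(prev, word)
        if p == 0.0:
            return float("-inf")  # unseen bigram: zero probability without smoothing
        log_prob += math.log(p)
    return log_prob


print(math.exp(sentence_log_prob("the cat sat on the rug")))
print(sentence_log_prob("the cat barked"))
```

The first sentence never appears in the corpus but still receives a non-zero probability because all of its bigrams were seen; the second contains an unseen bigram and gets probability zero, which is why practical N-gram models add smoothing.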

Deep NLP Modelling 

Word Embedding and Word Vectors

The previous language models do not compare words. Two similar or related words should be close in some dimension, and word vectors allow these comparisons. Word vectors are also called word embeddings. Two successful approaches have been GloVe and Word2Vec.

GloVe (Global Vectors)

GloVe (Pennington et al., 2014) was developed at Stanford to construct vector representations of words. It is based on the co-occurrences of words. Co-occurrence means that words occur together in the same sentence. An unsupervised machine learning model is trained on a corpus to estimate the co-occurrence of pairs of words. The word vectors are estimated so that the dot product of two word vectors equals the logarithm of the words' probability of co-occurrence. Thanks to this vector representation, relationships between words can be seen, such as man to woman and king to queen.

Word2Vec

Word2Vec (Mikolov et al., 2013) was developed at Google and also aims to create word vectors where similar words have close representations. The vectors are learned with either the continuous bag-of-words (CBOW) model or the continuous skip-gram model. With continuous bag-of-words, a word is predicted from its context. With continuous skip-gram, a word predicts the context words surrounding it. A two-layer neural network is used to estimate each model.
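The difference between the two training setups can be seen in the examples each one extracts from a sentence. The sketch below only builds those training examples; the window size and the function names are assumptions, and the neural network itself is not shown.

```python
def context_indices(i, length, window):
    """Indices of the neighbours of position i within the window."""
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]


def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    return [
        (center, tokens[j])
        for i, center in enumerate(tokens)
        for j in context_indices(i, len(tokens), window)
    ]


def cbow_examples(tokens, window=2):
    """(context, center) examples: the neighbours jointly predict the center word."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], center)
        for i, center in enumerate(tokens)
    ]


tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```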

GLUE Benchmark

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) is a set of tests to evaluate NLP models on different tasks of sentence understanding. Some tasks are based on individual sentences, others on pairs of sentences.

Table 1. The GLUE tasks.

Because the new generations of NLP models tend to have superhuman performance in some tasks from the GLUE benchmark, SuperGLUE (Wang et al., 2020) has been introduced with more difficult and more varied tasks, which also include human benchmarks.

Recurrent Neural Network (RNN)

RNNs (Elman, 1990) are a type of neural network that allows efficient modeling of sequences, such as time series or text data. 

In a basic RNN, for each step t, an input vector x(t) is combined with a hidden vector or layer h(t-1) to produce an updated vector h(t), which then generates the output vector y(t). In the next step t+1, the new input vector x(t+1) is combined with h(t) from the previous step to produce the new output vector y(t+1). The relationship between x(t+n), h(t+n-1), h(t+n), and y(t+n) is independent of n, which makes the model more efficient, with fewer parameters to estimate.
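The recurrence can be sketched in a few lines of NumPy. The tanh activation, the parameter names U, W, V, b, c, and the dimensions are assumptions; the text above does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2

# The same parameters are reused at every time step.
U = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
W = rng.normal(size=(hidden_dim, hidden_dim))  # previous hidden -> hidden
V = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output
b = np.zeros(hidden_dim)
c = np.zeros(output_dim)


def rnn_step(x_t, h_prev):
    """Combine x(t) with h(t-1) to get h(t), then produce y(t)."""
    h_t = np.tanh(b + U @ x_t + W @ h_prev)
    y_t = c + V @ h_t
    return h_t, y_t


def rnn_forward(xs):
    """Run the recurrence over a whole sequence, starting from h(0) = 0."""
    h = np.zeros(hidden_dim)
    ys = []
    for x_t in xs:
        h, y = rnn_step(x_t, h)
        ys.append(y)
    return np.stack(ys), h


xs = rng.normal(size=(5, input_dim))  # a sequence of 5 input vectors
ys, h_last = rnn_forward(xs)
print(ys.shape)  # (5, 2)
```

The number of parameters depends only on the layer sizes, not on the sequence length, since `rnn_step` is applied unchanged at every position.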

Figure 1. Recurrent Neural Networks

 The hidden layer h(t+n) keeps the memory of the previous step layers h(t+n-1),h(t+n-2),…,h(0). The parameters are estimated by back-propagation through time starting from the last period and moving back to the initial values of each layer.

As with a neural network that has too many layers, the RNN can suffer from vanishing gradients (the gradient becoming smaller and smaller as we go back in time) or exploding gradients (the gradient becoming larger and larger). To address this issue, the LSTM model was created.

Long Short Term Memory (LSTM)

LSTM was introduced by Hochreiter and Schmidhuber (1997). The LSTM uses a carry or memory cell c(t), which depends on an input gate i(t) and a forget gate f(t). The output depends on an output gate o(t).

The memory cell carries information from one step to the other but is more flexible than the hidden state. The information is copied with some adjustments. The memory cell depends on:

  • an input gate i(t): the input gate modulates the information from the input layer x(t) and the hidden layer h(t-1)
  • a forget gate f(t): the forget gate can erase some past memory cell information

The memory cell can therefore forget some past memory with the forget gate and use some new memory content thanks to the input gate:

c(t)=f(t)⊙c(t-1)+i(t)⊙σ(b+Ux(t)+Wh(t-1))

σ is the sigmoid function and ⊙ is the element-wise multiplication.

The output uses an output gate o(t), which modulates the memory cell c(t) to transform it into an output vector y(t). The output is calculated as:

y(t)=o(t)⊙tanh(c(t))

The input gate, the output gate, and the forget gate are updated with a sigmoid function:

i(t)=σ(b(i)+U(i)x(t)+W(i)h(t-1))

o(t)=σ(b(o)+U(o)x(t)+W(o)h(t-1))

f(t)=σ(b(f)+U(f)x(t)+W(f)h(t-1))

Figure 2. LSTM

The output y(t) will depend on the hidden state h(t) and memory cell c(t).

Compared to the simple RNN, the input layer x(t) does not feed the hidden layer h(t) directly but indirectly through the memory cell c(t). The hidden layer h(t-1) does not feed into the next hidden layer h(t) directly but only indirectly through the memory cell c(t).
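One step of this cell can be sketched in NumPy from the gate, memory cell, and output equations above. The sketch assumes that the hidden state passed to the next step is the output y(t), and that each gate has its own (b, U, W) parameters; the helper names and dimensions are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_dim, hidden_dim, rng):
    """One (b, U, W) triple per gate, plus one for the new memory content ("c")."""
    return {
        name: (
            np.zeros(hidden_dim),
            rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
        )
        for name in ("i", "f", "o", "c")
    }


def affine(params, name, x_t, h_prev):
    b, U, W = params[name]
    return b + U @ x_t + W @ h_prev


def lstm_step(params, x_t, h_prev, c_prev):
    i_t = sigmoid(affine(params, "i", x_t, h_prev))  # input gate
    f_t = sigmoid(affine(params, "f", x_t, h_prev))  # forget gate
    o_t = sigmoid(affine(params, "o", x_t, h_prev))  # output gate
    # Keep part of the old memory and add gated new content.
    c_t = f_t * c_prev + i_t * sigmoid(affine(params, "c", x_t, h_prev))
    y_t = o_t * np.tanh(c_t)                         # gated view of the memory cell
    return y_t, c_t


rng = np.random.default_rng(0)
params = init_params(input_dim=3, hidden_dim=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):  # a sequence of 5 input vectors
    h, c = lstm_step(params, x_t, h, c)
print(h.shape, c.shape)
```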

Bi-directional LSTM

With the bidirectional LSTM (Graves and Schmidhuber, 2005), the same sequence is also analyzed in reverse and the two LSTM outputs are combined by concatenation, sum, or product (Figure 3).

Figure 3. Bidirectional LSTM 

Gated Recurrent Units (GRU)

The GRU was introduced by Cho et al. (2014) to simplify the LSTM. It has no separate memory cell or hidden layer: the output layer depends on an update gate u(t) and a reset gate r(t).

The update gate u(t) and the reset gate r(t) are updated with a sigmoid function:

u(t)=σ(b(u)+U(u)x(t)+W(u)y(t))

r(t)=σ(b(r)+U(r)x(t)+W(r)y(t))

The output layer y(t) is then updated as:

y(t+1)=u(t)⊙y(t)+(1-u(t))⊙σ(b+Ux(t)+W(r(t)⊙y(t)))

Updating and resetting to a new value is determined in a single equation and bypasses a memory cell and a hidden layer.
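A single GRU update can be sketched as follows. The sketch assumes a nonlinearity on the candidate term (a sigmoid here, mirroring the LSTM equations; tanh is also common) and applies the reset gate to y(t) before the matrix product. The parameter dictionary and its key names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(p, x_t, y_t):
    """Return y(t+1) from the input x(t) and the previous output y(t)."""
    u_t = sigmoid(p["b_u"] + p["U_u"] @ x_t + p["W_u"] @ y_t)  # update gate
    r_t = sigmoid(p["b_r"] + p["U_r"] @ x_t + p["W_r"] @ y_t)  # reset gate
    # Candidate value; the reset gate decides how much of y(t) it can see.
    candidate = sigmoid(p["b"] + p["U"] @ x_t + p["W"] @ (r_t * y_t))
    # The update gate interpolates between the old output and the candidate.
    return u_t * y_t + (1.0 - u_t) * candidate


rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
p = {}
for suffix in ("_u", "_r", ""):
    p["b" + suffix] = np.zeros(hidden_dim)
    p["U" + suffix] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    p["W" + suffix] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

y = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    y = gru_step(p, x_t, y)
print(y.shape)
```

When the update gate is close to 1 the previous output is copied almost unchanged, which is how the GRU carries information across many steps without a separate memory cell.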

ELMo (Embeddings from Language Models)

In traditional word embeddings, each word has a single vector and therefore effectively a single meaning. ELMo, proposed in 2018 by the Allen Institute for AI and the University of Washington (Peters et al., 2018), improves on traditional static word embeddings such as GloVe by using the context in which a word is used. It constructs vector representations of words from the internal states of a bidirectional LSTM model trained on a large text corpus. The representation depends on the whole sentence in which the word appears. These are contextualized representations since they depend on the context of the word.

The representations combine the internal states of all the layers of the biLSTM and not only those of the last layer. The states from the upper layers help to understand context, while the states from the lower layers help to understand syntax.

ELMo can be integrated into existing models to improve NLP tasks. The biLSTM model is run on the text, and the ELMo representations and the static word representations are both fed into the supervised NLP model. ELMo improves the performance of many tasks such as question answering, textual entailment, semantic role labeling, or coreference resolution.

Attention Model

Attention

The concept of attention makes it possible to dynamically associate each word or token in a sequence with some words or tokens in another sequence or in the same one. This allows richer associations that do not depend on specific locations of the target words; in particular, it can relate a word to words that are not in close proximity. This is useful in translation, for instance, where a meaningful word can be at the beginning of a sentence and still be useful to translate a word appearing at the end.

Attention (Vaswani et al., 2017) uses the concept of Queries, Keys, and Values. A Query is what we are looking for, the Key gives the location of what we are looking for and the Value is the result of the query. A Query is for instance a word, the key is a page in a dictionary where the word appears and the value is the translation of the word.

A word, represented as an embedding vector x, is multiplied by a query weight matrix W(Q) to produce the Queries Q, by a key weight matrix W(K) to produce the Keys K, and by a value weight matrix W(V) to produce the Values V. The products of Queries and Keys are scaled and transformed into probabilities through a softmax function. These probabilities weight the Values and so emphasize attention to specific tokens in the sequence.

The self-attention vector is then:

attention(Q,K,V)=softmax(QKᵀ/√d(k))V

d(k) is the dimension of the key vectors, and its square root is a normalization factor.

We can represent it in a picture:

Figure 4. Attention mechanism
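The attention formula can be sketched directly in NumPy. The dimensions, the random weight matrices, and the function names are assumptions used only to make the example runnable.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """softmax(Q Kᵀ / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # one score per (query token, key token) pair
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 5
X = rng.normal(size=(seq_len, d_model))  # one embedding vector per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)             # (4, 5)
print(weights.sum(axis=-1))  # each row sums to 1
```

Each output row is a weighted average of the value vectors, with weights given by how well that token's query matches every key.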

attention(Q,K,V) is called dot-product attention (here scaled dot-product attention because of the scale factor). It is called self-attention when the Queries, Keys, and Values all come from the same sequence. It is causal attention if attention cannot look forward, in which case a mask is used to eliminate any forward-looking attention. It is bidirectional attention if attention can look both backward and forward.

To obtain causal attention, we add to QKᵀ, before the softmax, a matrix M with -∞ strictly above the diagonal and 0 everywhere else.
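A small sketch of the mask, under the same NumPy conventions as a plain scaled dot-product attention. The function names are hypothetical, and the mask is added after scaling, which is equivalent because scaling leaves 0 and -∞ unchanged.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(n):
    """n x n matrix with -inf strictly above the diagonal and 0 elsewhere."""
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)


def causal_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
    weights = softmax(scores)  # exp(-inf) = 0, so future positions get zero weight
    return weights @ V, weights


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 5)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
print(causal_mask(3))
print(np.round(weights, 2))  # lower-triangular: token t only attends to tokens <= t
```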

Multi-Head Attention

The procedure can be repeated several times with different W(Q,i),W(K,i),W(V,i) matrices to create several self-attention vectors, or a “multi-head” attention. These vectors are then concatenated and multiplied by another weight matrix W(O) to produce a single self-attention vector.

z(i)=attention(Q(i),K(i),V(i)) for i=1,..,h  if there are h heads.

Then 

Multihead(Q,K,V)=Concat(z(1),…,z(h))W(O)
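The two formulas above can be sketched as follows. The number of heads, the per-head dimension d_model / h, and the random weights are assumptions for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    z = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    # Concatenate the heads along the feature axis, then mix them with W_O.
    return np.concatenate(z, axis=-1) @ W_O


rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h  # a common choice, so that the total width stays d_model

X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, heads, W_O)
print(out.shape)  # (4, 8)
```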

Transformers

Transformers (Vaswani et al., 2017) use multi-head attention as well as add & normalize layers and feed-forward layers. A Transformer can combine an encoder and a decoder, or use only an encoder (as in BERT models) or only a decoder (as in GPT models).

They use as inputs word embeddings. Word embeddings are vector representations of words. Each word is represented by a vector of fixed dimension. The word embedding is a list of such vectors. The list size is fixed, usually equal to the longest sentence in a text.  

Encoder

The encoder has four layers. As input it uses an embedding added to a positional encoding. The positional encoding indicates the location of each word vector. In the encoder, the input goes through a multi-head attention layer which encodes each word vector with other vectors which it needs to pay attention to. It is then added and normalized with the original layer input to preserve some memory of the input. It goes to a feed forward layer and again another add and normalize layer. The output is then fed into a multi-head attention layer of the decoder. 

Figure 5. Transformer Encoder Layer
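The sequence of steps in an encoder layer can be sketched as below. To stay short, the sketch assumes a single attention head, a ReLU feed-forward network, a layer normalization without learned gain and bias, and a random stand-in for the positional encoding; all names and dimensions are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


def feed_forward(x, W_1, b_1, W_2, b_2):
    return np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2


def encoder_layer(x, p):
    # Attention, then add & normalize with the layer input.
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"]))
    # Feed-forward, then add & normalize again.
    return layer_norm(x + feed_forward(x, p["W_1"], p["b_1"], p["W_2"], p["b_2"]))


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 16
p = {
    "W_Q": rng.normal(size=(d_model, d_model)),
    "W_K": rng.normal(size=(d_model, d_model)),
    "W_V": rng.normal(size=(d_model, d_model)),
    "W_1": rng.normal(size=(d_model, d_ff)),
    "b_1": np.zeros(d_ff),
    "W_2": rng.normal(size=(d_ff, d_model)),
    "b_2": np.zeros(d_model),
}
embeddings = rng.normal(size=(seq_len, d_model))
positions = rng.normal(size=(seq_len, d_model))  # stand-in for a positional encoding
out = encoder_layer(embeddings + positions, p)
print(out.shape)  # (5, 8)
```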

Decoder

The decoder is very similar to the encoder but it uses a masked multi-head attention layer to pay attention only to past word vectors.  

Figure 6. Transformer Decoder Layer

Transformer

The Transformer uses a word embedding as input to the encoder; the result is fed into the decoder along with the output word embedding. The output word embedding is fed into the first decoder layer repeatedly as new words are generated.

Figure 7. Transformer

Transformers have launched a new wave of pre-trained language models such as BERT and GPT-3. We review some of them.

BERT (Bidirectional Encoder Representations from Transformers)

BERT is a pre-trained language model introduced by Google in 2018 (Devlin et al., 2019) that can be fine-tuned to perform many common NLP tasks such as the ones from the GLUE benchmark. Contrary to ELMo, whose embeddings are used as additional features, BERT requires very little re-training.

BERT uses a multi-layer bidirectional Transformer encoder. It is pre-trained on two tasks: identifying randomly masked words in a sentence from their context, that is, the words to the left and to the right of the mask (Masked Language Model), and predicting whether one sentence follows another (Next Sentence Prediction). It is therefore bidirectional, contrary to GPT-style models, which are unidirectional.

There are two versions of BERT: BERT base and BERT large. BERT base has 12 layers of size 768, and 12 self-attention heads, and 110M parameters. BERT large has 24 layers of size 1024, and 16 self-attention heads, and 340M parameters. BERT is described in Figure 8.

BERT is pre-trained on a corpus made of the BooksCorpus (800M words) and the English Wikipedia (2,500M words), representing a total of 3.3 billion words. The text goes through WordPiece tokenization and then runs through a masking step where tokens are masked at the rate of 15%. The selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and remains the same 10% of the time. [MASK] is not used 100% of the time because it does not appear in the fine-tuning step. BERT then goes through the Next Sentence Prediction step, in which pairs of sentences are either paired correctly, with the label [IsNext], 50% of the time, or paired with a random sentence, with the label [NotNext].
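The masking step can be sketched as follows. The sketch selects each position independently with probability 15%, which is a simplification, and the function name, the toy vocabulary, and the use of `None` for unselected positions are assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return the corrupted tokens and, for selected positions, the original token to predict."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means the position is not predicted
    for i, token in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                       # position not selected
        labels[i] = token                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return masked, labels


vocab = ["the", "cat", "dog", "sat", "on", "mat"]
tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, vocab, mask_rate=0.5, seed=1)
print(masked)
print(labels)
```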

BERT is then fine-tuned on specific tasks. Most of the hyperparameters remain the same; the model parameters are re-estimated. The input can be pairs of sentences, as in textual entailment or question answering, and the output is a set of token representations that are fed into a single additional task-specific layer.

Figure 8. BERT

RoBERTa

RoBERTa (Robustly optimized BERT pretraining approach; Liu et al., 2019) is a reimplementation of BERT by Facebook with the following changes: a longer training period, bigger batches, more data, no next sentence prediction objective, longer sequences, and a dynamic masking pattern applied to the training data. The authors find that these changes significantly improve model performance and achieve state-of-the-art results on GLUE, RACE, and SQuAD.

XLNet

XLNet (Dai et al., 2019) is an improvement on the BERT model from Carnegie Mellon University and Google. It uses a Transformer architecture with an auto-regressive approach and no masking. It considers permutations of the order in which tokens are predicted and tries to predict each token. XLNet also integrates ideas from Transformer-XL, such as the relative positional encoding scheme and the segment recurrence mechanism, into pretraining. XLNet performs better than BERT on many NLP tasks including question answering, natural language inference, sentiment analysis, and document ranking.

ELECTRA

The ELECTRA (Clark et al., 2020) model proposes an alternative to masking that is more sample-efficient than BERT. It replaces tokens randomly with alternatives generated by a neural network, and the task is to detect these replacements. ELECTRA outperforms BERT on the GLUE benchmark when both run with the same model size, data, and compute. It also outperforms XLNet and RoBERTa with the same amount of compute.

T5

T5 (Raffel et al., 2020) is a unified framework for language modelling based on the original transformer architecture with very few changes. Every task is framed as a text-to-text problem. The authors use a new cleaned-up data set, the “Colossal Clean Crawled Corpus”. They achieve state-of-the-art results on many NLP benchmark tasks such as summarization, question answering, and text classification. T5 needs to be fine-tuned by changing all the pre-trained weights.

GPT-3 (Generative Pre-Training)

GPT-3 (Brown et al., 2020) is a language model that can be used for many downstream tasks such as question answering, text completion, text generation, and neural machine translation. GPT-3 is the third generation of the Generative Pre-Training (GPT) model (Radford et al., 2018). GPT-3 is described in Figure 9 below.

The original GPT model is pre-trained on a large corpus of text using unsupervised learning and transformers. Each layer of the GPT model is a transformer decoder layer, which contains an attention layer and a feed-forward neural network. The attention layer is a masked multi-head self-attention layer that cannot look forward.

The model is then fine-tuned to specific tasks with supervised learning. The GPT-3 model skips the fine-tuning step.

GPT-3 has 175 billion trainable parameters, 96 layers, 12,288 units in each bottleneck layer, and 96 attention heads with 128 units each. Performance increases with the number of parameters. Because of its size, GPT-3 can perform well without fine-tuning. The weights do not need to be re-estimated for a new task.
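These figures are mutually consistent, as a back-of-the-envelope calculation shows. The sketch relies on a rule of thumb that is an assumption here, not something stated above: each layer holds about 4·d² attention weights and 8·d² feed-forward weights (with a feed-forward hidden size of 4·d), and embeddings and biases are ignored.

```python
n_layers = 96
n_heads = 96
d_head = 128

# The heads together span the bottleneck layer: 96 * 128 = 12,288 units.
d_model = n_heads * d_head

# Rule-of-thumb assumption: ~4*d^2 attention weights + ~8*d^2 feed-forward weights per layer.
params_per_layer = 12 * d_model ** 2
approx_params = n_layers * params_per_layer

print(d_model)                        # 12288
print(f"{approx_params / 1e9:.1f}B")  # 173.9B
```

The estimate of roughly 174 billion parameters is within about 1% of the 175 billion quoted, with the remainder plausibly accounted for by the terms the approximation leaves out.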

GPT-3 is trained on a combination of five data sets: filtered Common Crawl (410 billion tokens), WebText2 (19 billion), Books1 (12 billion), Books2 (55 billion), and Wikipedia (3 billion). Higher-quality data sets are seen several times during training.

GPT-3 is evaluated with few-shot learning, one-shot learning, and zero-shot learning. X-shot learning means the model is given X examples before returning an answer to a query. GPT-3 improves the state-of-the-art results on several benchmark tasks such as sentence completion, question answering, and machine translation to English, but still falls short on some others such as common sense reasoning and reading comprehension. For benchmarks such as SuperGLUE, it falls short of the best fine-tuned models. GPT-3 shines at news article generation. Humans were only 52% accurate at guessing that an article was written by GPT-3 instead of a human.
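The difference between the three settings lies only in how many solved examples are placed in the prompt before the query. A minimal sketch, in which the `=>` separator, the task description, the example pairs, and the function name are all illustrative assumptions:

```python
def build_prompt(task_description, examples, query):
    """Build an X-shot prompt: a task description, X solved examples, then the query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)


task = "Translate English to French:"
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt(task, [], "peppermint")
one_shot = build_prompt(task, examples[:1], "peppermint")
few_shot = build_prompt(task, examples, "peppermint")
print(few_shot)
```

No weights are updated in any of these settings; the examples only condition the text that the model generates next.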

Figure 9. GPT-3

Conclusion

It is not clear that we are closer to solving artificial intelligence, but the recent progress in NLP has been very impressive. The outputs of these NLP models are very usable and some are already deployed in many commercial applications: digital assistants, mobile phones, customer support, machine translation, article generation, etc. The more recent models such as GPT-3 promise zero-shot learning, which could be revolutionary. Still, the accuracy of GPT-3 on several NLP tasks lags human performance by a wide margin. We expect future generations of models to be even more useful and to become ubiquitous in our daily lives.

References

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language Models are Few-Shot Learners. ArXiv200514165 Cs.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. ArXiv14061078 Cs Stat.

Clark, K., Luong, M.-T., Le, Q.V., Manning, C.D., 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ArXiv200310555 Cs.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R., 2019. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. ArXiv190102860 Cs Stat.

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv181004805 Cs.

Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning, Illustrated edition. ed. The MIT Press, Cambridge, Massachusetts.

Graves, A., Schmidhuber, J., 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw., IJCNN 2005 18, 602–610. https://doi.org/10.1016/j.neunet.2005.06.042

Hochreiter, S., Schmidhuber, J., 1997. Long Short-term Memory. Neural Comput. 9, 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735

I.—COMPUTING MACHINERY AND INTELLIGENCE | Mind | Oxford Academic [WWW Document], n.d. URL https://academic.oup.com/mind/article/LIX/236/433/986238 (accessed 11.13.20).

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv190711692 Cs.

Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. ArXiv13013781 Cs.

On Chomsky and the Two Cultures of Statistical Learning [WWW Document], n.d. URL http://norvig.com/chomsky.html (accessed 7.25.20).

OntoNotes Release 5.0 – Linguistic Data Consortium [WWW Document], n.d. URL https://catalog.ldc.upenn.edu/LDC2013T19 (accessed 11.13.20).

Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., 2002. Bleu: a Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Presented at the ACL 2002, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. https://doi.org/10.3115/1073083.1073135

Pennington, J., Socher, R., Manning, C., 2014. Glove: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Presented at the Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, pp. 1532–1543. https://doi.org/10.3115/v1/D14-1162

Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations. ArXiv180205365 Cs.

Pinker/Chomsky Q&A from MIT150 Panel [WWW Document], n.d. URL http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html (accessed 11.13.20).

PyTorch documentation — PyTorch 1.7.0 documentation [WWW Document], n.d. URL https://pytorch.org/docs/stable/index.html (accessed 12.30.20).

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., 2018. Improving Language Understanding by Generative Pre-Training.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv191010683 Cs Stat.

Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P., 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. ArXiv160605250 Cs.

TensorFlow [WWW Document], n.d. URL https://www.tensorflow.org/ (accessed 12.30.20).

The Proposition Bank (PropBank) [WWW Document], n.d. URL https://propbank.github.io/ (accessed 11.13.20).

The Stanford Natural Language Processing Group [WWW Document], n.d. URL https://nlp.stanford.edu/projects/snli/ (accessed 11.13.20).

Translation Task – ACL 2014 Ninth Workshop on Statistical Machine Translation [WWW Document], n.d. URL http://www.statmt.org/wmt14/translation-task.html (accessed 11.10.20).

Treebank-3 – Linguistic Data Consortium [WWW Document], n.d. URL https://catalog.ldc.upenn.edu/LDC99T42 (accessed 11.13.20).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention Is All You Need. ArXiv170603762 Cs.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R., 2020. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. ArXiv190500537 Cs.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R., 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ArXiv180407461 Cs.

A Primer on Computer Vision

Computer vision has been a great success of deep machine learning. It is now widely used in many practical applications such as object recognition, classification and detection, self-driving cars, image captioning, image reconstruction, and generation. We present a primer on computer vision starting with how we understand vision in humans. 

Vision Recognition

Human eye

Human vision works by capturing light, which is refracted by the cornea and then passes through the anterior chamber, the pupil, the posterior chamber, the lens, and the vitreous humor before reaching the retina at the back of the eye (Figure 1). The pupil acts as the aperture of the eye, letting more or less light in depending on the ambient light and the need to focus.

Figure 1. Eye. Rhcastilhos. And Jmarchn., CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons

The retina contains photoreceptor cells made of rods (sensitive to light) and cones (sensitive to color), bipolar cells, and ganglion cells (Figure 2). All these cells are neurons. The ganglion cells then form the optic nerve with their axons.  Through the rods and cones, the photons generate electrical signals by phototransduction. 

Figure 2. Retinal layers. By Fig_retine.png: Ramón y Cajalderivative work Fig retine bended.png: Anka Friedrich (talk)derivative work: vectorisation by chris 論 – Fig_retine.pngFig retine bended.png, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=7550631

The optic nerve then connects to the optic tract and to the lateral geniculate nuclei (LGN, left and right) situated in the thalamus, which in turn connect to the primary visual cortex through the optic radiations (Figure 3). The visual information is processed in the primary visual cortex (also called visual area V1).

Figure 3. Optical cabling. Ratznium at en.wikipedia, CC BY-SA 3.0 <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons

Hubel and Wiesel experiment

In 1958, David Hubel and Torsten Wiesel, two scientists at Johns Hopkins University who later received the Nobel Prize in Medicine, discovered that neurons in the striate cortex, part of the visual cortex, were activated by lines of particular orientations and by movements. They used kittens looking at a projector screen, with tungsten microelectrodes inserted in the visual cortex and connected to an oscilloscope to measure neuron activation. They initially investigated neuron activation with black dots on a white slide until they accidentally showed the edge of the slide, which triggered the neuron to fire. They found that the receptive fields of the neurons were activated by specifically oriented patterns (slit, dark bar, or edge) and movements. Regions of a receptive field are either excitatory or inhibitory, and their geometry matches the specific pattern the neuron reacts to. Neurons reacting to the same pattern are organized in vertical columns, and neighboring cells react to patterns of similar shape but slightly different orientation.

Figure 4. The receptive field of a simple neuron cell is aligned with the pattern it reacts to

Convolutional Networks

Convolutional networks are inspired by the visual processing described in the previous section. Convolutional networks are particular cases of deep learning networks with layers of convolutions applied to images. 

Convolution

Convolution is a mathematical operation that combines two functions by multiplying their values in pairs and summing the products. In the strict mathematical definition, one of the two functions is flipped before the multiplication. The version commonly used in machine learning skips the flip and multiplies pairs of values evaluated at the same position. This is called cross-correlation.

One function works as the signal, the second function works as a filter. Figure 5 gives some examples of convolutional filters. The input signal is a 5×5 matrix with numerical values. It could be some color or light intensity. The input signal goes through the filter by multiplying each input cell by the corresponding filter cell in the same position. The input signal is then transformed into a filtered signal (also called the feature map). The output is calculated as the sum of all the values in the filtered signal.

The filters can be of different types. Each filter represents a different channel. Filter A at the top detects diagonal signals by filtering only values close to the main diagonal. Filter B detects horizontal signals and filter C detects signals on the secondary diagonal. If there is no overlap between the input signal and the filter, the final output value is zero. If the overlap is very large then the final output value is very large.   

Figure 5. Examples of convolutional filters

The filters can amplify the input signal or even invert it (by using negative values). Like the neurons in the visual cortex, each filter is specialized in detecting special features.

An image is however larger and more complex than a 5×5 matrix. A solution is to use different filters and make them scan the image starting from left to right then top to bottom. This is illustrated in Figure 6 (with a 10×10 image and a 3×3 filter). The convolution operation starts with the top-left submatrix and continues to the subsequent matrices on the right by moving by one column (the stride which can be 1 or a higher value) till all the cells are covered and then towards the bottom by moving by one row (or more). The output ends up being an 8×8 matrix. To maintain the 10×10 size it is possible to add paddings of 0 values by adding extra rows and columns around the initial input image.
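The scanning procedure described above can be sketched in a few lines of plain Python. This is only an illustration: the function name cross_correlate2d, the all-ones image, and the diagonal filter values are assumptions made for the example, not taken from the figures.

```python
def cross_correlate2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` and return the matrix of sums of products."""
    k = len(kernel)  # assumes a square k x k kernel
    # Surround the image with `padding` rows and columns of zeros.
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded += [[0.0] * width for _ in range(padding)]

    out = []
    for i in range(0, len(padded) - k + 1, stride):      # top to bottom
        out_row = []
        for j in range(0, width - k + 1, stride):        # left to right
            # Multiply each covered cell by the matching filter cell and sum.
            total = sum(padded[i + a][j + b] * kernel[a][b]
                        for a in range(k) for b in range(k))
            out_row.append(total)
        out.append(out_row)
    return out


image = [[1.0] * 10 for _ in range(10)]        # 10x10 input, all ones
diagonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # 3x3 filter for the main diagonal

valid = cross_correlate2d(image, diagonal)             # no padding
same = cross_correlate2d(image, diagonal, padding=1)   # one ring of zeros
print(len(valid), len(valid[0]))   # 8 8
print(len(same), len(same[0]))     # 10 10
```

Without padding the 10×10 input shrinks to 8×8, and one ring of zero padding keeps the 10×10 size, as stated in the text.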

Figure 6. Convolutional Neural Network with a (3,3) Convolution

Figure 7. Convolutional Neural Network with 3 channels

Max pooling and average pooling

Besides convolution, other common operations are max pooling (Figure 8) and average pooling (Figure 9). With max pooling, the filter selects the maximum value of the matrix cells it is covering instead of multiplying the cells by some weights and summing the results. With average pooling, the filter calculates the average value of the matrix cells. Average pooling is a particular case of convolution where the weights in the filter have the same value and are normalized to sum to one.

The pooling layers perform these operations, which aggregate the signals and reduce the size of the feature maps (also called downsampling). Some information is lost during pooling, and some more recent techniques avoid pooling for that reason.
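As a minimal sketch of both pooling operations, assuming non-overlapping 2×2 windows (the function name pool2d and the 4×4 values are made up for the illustration):

```python
def pool2d(matrix, size=2, mode="max"):
    """Apply non-overlapping size x size pooling (stride equal to size)."""
    out = []
    for i in range(0, len(matrix) - size + 1, size):
        out_row = []
        for j in range(0, len(matrix[0]) - size + 1, size):
            window = [matrix[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:  # "average": equal weights that sum to one
                out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


feature_map = [[1, 3, 2, 0],
               [4, 6, 1, 1],
               [0, 2, 9, 5],
               [2, 0, 3, 7]]
print(pool2d(feature_map, mode="max"))      # [[6, 2], [2, 9]]
print(pool2d(feature_map, mode="average"))  # [[3.5, 1.0], [1.0, 6.0]]
```

The 4×4 input is downsampled to 2×2 in both cases, which is where the information loss mentioned above comes from.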

Figure 8. Convolutional Neural Network with Max Pooling

Figure 9. Convolutional Neural Network with Average Pooling

Translation equivariance

Convolutional networks perform equally well at identifying and classifying an object when it moves horizontally or vertically in the image. The reason is that the same filters are applied at every position: when the object shifts, the filter responses shift with it. This is called translation equivariance. Convolutional networks are however not equivariant to rotation or reflection, which would require the filters themselves to be rotated and reflected. A practical solution is data augmentation: rotated and reflected copies of the images are added to the training data.

Locality

Convolutional networks operate at the local level. They identify features in limited parts of the image as defined by the filter size and feed the features through several layers of neural networks. 

Benchmarks

Figure 9. Image localization and identification

ImageNet

ImageNet is an image database created in 2009 by Professor Fei-Fei Li and her team as a benchmark for visual recognition and classification tasks. It contains over 14 million images from the internet, annotated by humans into around 20,000 categories called synonym sets (synsets). A higher-order category such as “fish” is divided into hundreds of synsets of fish species, each with hundreds of images. ImageNet is used for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), started in 2010, in which researchers compete to detect and classify objects in images and videos. AlexNet (Krizhevsky et al., 2012) won the competition in 2012 using convolutional neural networks. The competition has been hosted by Kaggle since 2017. Its validation and test sets have 150,000 photographs and 1,000 categories. The training set is drawn from the ImageNet images of the same 1,000 categories. Each photograph in the training and validation sets has the coordinates of bounding boxes with the attached object category.

MNIST

MNIST is a dataset of handwritten digits that was used by LeCun et al. (1998) for visual recognition. It has 60,000 digits in the training set and 10,000 in the test set. Each digit occupies a 28×28 grid. 250 human writers, a mix of Census employees and high school students, created the digits in the training set and another 250 did the same for the test set.

Figure 10. Examples of MNIST digits. Source: LeCun et al., 1998

Fashion MNIST

Fashion MNIST has the same structure as MNIST but is based on clothing articles from the company Zalando. Like MNIST, it has 60,000 images in the training set and 10,000 images in the test set. The size of each image is also 28×28. It also has ten categories (0: T-shirt/top, 1: Trouser, 2: Pullover, 3: Dress, 4: Coat, 5: Sandal, 6: Shirt, 7: Sneaker, 8: Bag, 9: Ankle boot). The task is however more difficult because clothing articles show more variation than handwritten digits.

Figure 11. Clothing articles from Fashion MNIST. Source: https://github.com/zalandoresearch/fashion-mnist/blob/master/doc/img/fashion-mnist-sprite.png 

CIFAR-10 and CIFAR-100

CIFAR-10 is a dataset with 60,000 photos classified in 10 categories. CIFAR-100 is an extension of CIFAR-10 with 100 categories.  

Figure 12. Photos from CIFAR-10. Source: https://www.cs.toronto.edu/~kriz/cifar.html

Convolutional Network Models

AlexNet

Building on convolutional networks such as LeNet (LeCun et al., 1989) (Figure 13), Krizhevsky et al. (2012) proposed AlexNet, which won the ImageNet competition in 2012 and put deep learning networks on the map for computer vision. It classified 1.2 million images into 1,000 classes with state-of-the-art results at the time.

AlexNet uses five convolutional layers, max-pooling layers, and three fully-connected layers (Figure 14) and ReLU activation functions. Images are of size 224×224 with three channels (RGB colors). To prevent overfitting, it performs data augmentation by extracting 224×224 patches and their inverses (horizontal reflections) from 256×256 images and by changing the RGB channel intensities. They also use dropout to reduce overfitting. 

Figure 13. LeNet architecture

Figure 14. AlexNet architecture

GoogLeNet

GoogLeNet is based on the inception network described in (Szegedy et al., 2014). Its basic building block is the inception module, inspired by the Network in Network of (Lin et al., 2014). Inception modules allow a shift from sparse to dense representations using small convolution filters (1×1, 3×3, 5×5), enhance the representational power of the network, and perform dimensionality reduction. The whole network is built by stacking inception modules, for a total depth of 22 layers.

In the inception module (Figure 15), the input goes through four parallel branches: a 1×1 convolution, a 1×1 convolution followed by a 3×3 convolution, a 1×1 convolution followed by a 5×5 convolution, and a 3×3 max pooling followed by a 1×1 convolution. The outputs of the branches are then concatenated. The whole network is presented in Figure 16. GoogLeNet achieved very good results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 Classification (first) and Detection (second) Challenges.

Figure 15. Inception module. Source: Szegedy et al., 2014

Figure 16. GoogLeNet

VGG

VGGNet was introduced in (Simonyan and Zisserman, 2015) as an extension of standard convolutional networks such as LeNet and AlexNet, with the difference that the network is deeper, with 16 to 19 layers, and uses smaller (3×3) convolutional filters. It achieved 2nd and 1st place in the 2014 ImageNet Challenge in classification and localization respectively. Stacking several 3×3 convolutions covers the same receptive field as one larger filter with fewer parameters, and works as a regularizer of the network. The configuration for VGG-16 (16 weight layers) combines stacks of two or three 3×3 convolutions with 64, 128, 256, 512, and 512 channels respectively, each stack followed by a max-pooling layer, and three fully connected layers of size 4096, 4096, and 1000 (for the 1000 classes). The activation function is ReLU. Figure 17 shows a truncated VGG-19 network.
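The saving in parameters can be checked with a short back-of-the-envelope calculation. The sketch below compares three stacked 3×3 convolutions with a single 7×7 convolution, which covers the same 7×7 receptive field. The channel count of 256 is an arbitrary value chosen for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, channels_in, channels_out):
    """Number of weights in one convolutional layer, ignoring biases."""
    return kernel_size * kernel_size * channels_in * channels_out


C = 256  # illustrative channel count, the same for inputs and outputs
stacked_3x3 = 3 * conv_weights(3, C, C)   # three 3x3 layers: 27 * C * C
single_7x7 = conv_weights(7, C, C)        # one 7x7 layer:    49 * C * C
print(stacked_3x3, single_7x7)            # 1769472 3211264
```

Under these assumptions the stack of small filters needs about 45% fewer weights, while adding two extra nonlinearities between the layers.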

ResNet

Residual networks (ResNet), introduced by (He et al., 2015b), are networks similar to VGG networks but with skip connections. These skip connections (Figure 17, the loops are the skip connections in the 34-Layer ResNet) connect inputs to outputs by adding the input values to the outputs of the convolutional layers. Because the identity function is built into the output at each step, the model focuses on fitting the residuals from the identity, a task that is easier to achieve as the authors have documented. Models can be very deep without encountering optimization problems or vanishing/exploding gradient issues. The authors evaluate their model on ImageNet and on CIFAR-10. Their ResNet model with 152 layers won the ILSVRC in 2015.
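A skip connection can be sketched as follows. This is a simplified illustration on plain vectors rather than images, and zero_layer is a hypothetical stand-in for the convolutional layers of a real residual block.

```python
def relu(values):
    return [max(v, 0.0) for v in values]


def residual_block(x, layer):
    """Return relu(layer(x) + x): the layer only has to model the residual."""
    fx = layer(x)
    return relu([f + xi for f, xi in zip(fx, x)])


def zero_layer(x):
    # Hypothetical layer whose weights are all zero: it contributes nothing.
    return [0.0] * len(x)


print(residual_block([1.0, 2.0, 3.0], zero_layer))  # [1.0, 2.0, 3.0]
```

Even when the layer outputs zeros, the block passes its (positive) input through unchanged, which is why adding more such blocks does not have to degrade the signal.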

Figure 17. VGG Net vs 34-Layer plain and 34-Layer ResNet. Source: (He et al., 2015b)

References

Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning, Illustrated edition. ed. The MIT Press, Cambridge, Massachusetts.

He, K., Zhang, X., Ren, S., Sun, J., 2015a. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ArXiv150201852 Cs.

He, K., Zhang, X., Ren, S., Sun, J., 2015b. Deep Residual Learning for Image Recognition. ArXiv151203385 Cs.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105.

LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551. https://doi.org/10.1162/neco.1989.1.4.541

Lin, M., Chen, Q., Yan, S., 2014. Network In Network. ArXiv13124400 Cs.

Simonyan, K., Zisserman, A., 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv14091556 Cs.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1929–1958.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2014. Going Deeper with Convolutions. ArXiv14094842 Cs.

A Primer on Deep Learning

“Machine intelligence is the last invention that humanity will ever need to make” – Nick Bostrom

Deep learning has greatly changed the landscape of machine learning and artificial intelligence in the last ten years. In 2018, professors Yoshua Bengio, Geoffrey Hinton, and Yann LeCun, pioneers of deep learning, received the prestigious ACM A.M. Turing Award for “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”. This chapter reviews the fundamentals of deep learning. Other chapters cover its applications in computer vision and natural language processing. Deep learning is covered in great detail in (Goodfellow et al., 2016) and in the documentation of TensorFlow and PyTorch.

Deep Learning

Perceptron

The work on artificial neurons started in the 1930s and 1940s. In 1943, McCulloch and Pitts proposed that “neural events and the relations among them can be treated by means of propositional logic. It is found that the behavior of every net can be described in these terms.” In 1958, Frank Rosenblatt, a researcher at the Cornell Aeronautical Laboratory whose work was funded by the Office of Naval Research, invented the Perceptron to perform image recognition using photocells and a one-layer neural network. It could however perform only some rudimentary image classification tasks.

Figure 1. Description of Mark I Perceptron. 

Source: https://apps.dtic.mil/dtic/tr/fulltext/u2/236965.pdf

Deep Learning

Deep learning is a branch of machine learning that uses layers of activation functions, described as neurons, linking inputs to outputs. The inputs form an input layer, which can take the form of a numerical value, a vector, a matrix, or a multidimensional array (a tensor). The input can represent a picture, a video frame, some text, a soundwave, or any data collected by a sensor. Each function acts as a neuron with inputs and outputs. The function can be linear or nonlinear. When it is nonlinear it works as an activation function: its value is very small when the combined inputs are sufficiently small and increases when the combined inputs are sufficiently large. The outputs in turn form an output layer. Between the input and the output layers, there can be several hidden layers (Figure 2).

Figure 2. A neural network with an input layer, a hidden layer, and an output layer.

Neuron

The inspiration for the artificial neuron is the human neuron (Figure 3). A human neuron has a cell body called the soma, receives nerve signals through its dendrites, and sends an output signal through the axon to other neurons or to other cells such as muscle cells. The axon connects to the dendrite of another neuron and forms a synapse. The signal can be electrical, carried by moving ions, or chemical, carried by neurotransmitters. Each neuron has contacts with thousands of other neurons. It is estimated that there are around 86 billion neurons in the brain. In addition to neurons, there are glial cells, which play important roles in supporting the neurons. They were long said to outnumber neurons by a factor of ten, although more recent counts suggest a ratio closer to one to one.

Figure 3. Neuron

By BruceBlaus – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=28761830

The human brain appears to be vastly more complicated than a neural network. This should not be a concern as there are many examples of artificial technologies playing the same role as natural ones: the wings and jet engines of a plane replacing the wings and muscles of a bird, or the combustion engine replacing carriage horses.

To be fair, the artificial neural network does not run in a vacuum. If we include the software, hardware, and power required to run the neural network (the brain cells have to generate their own power), the complexity can be as large. For instance, as of 2020, the Wafer Scale Engine 2 by Cerebras, a deep-learning integrated circuit chip, has 2.6 trillion transistors. A Graphics Processing Unit (GPU) can have more than 20 billion transistors and it is not uncommon to run hundreds of GPUs in parallel to train some deep learning neural networks.

Feedforward Neural Network

Figure 4 shows a feedforward neural network. It is feedforward because the information flows from the inputs to the outputs in only one direction, forward. Some other neural networks such as recurrent neural networks allow loops, with information moving backward. 

The network can be described in terms of input layers, hidden layers with numbers of inputs and outputs and with an activation function applied to the inputs, and output layers. The layer will contain a state with learnable parameters such as weights and biases and will perform some computation such as multiplying the inputs by the weights and adding the biases.

Figure 4. Neural network defined as a sequence of layers

The layers can be of different types:

Input layer

The input layer is a tensor object with an indication of the input shape, e.g. (n,), and the batch size m. Each observation is an n-dimensional vector and the model takes m observations at a time.

Dense layer

The dense layer uses the inputs from the previous layer, multiplies them by some weights, adds some bias terms, and transforms them through an activation function. The activation function is typically a ReLU (rectified linear unit), which takes the maximum between the output value and zero (max(output,0)). Another popular activation for classification problems is the softmax activation. In the softmax, outputs are converted to probabilities between 0 and 1 by taking the exponential of their values and normalizing them so that they add up to 1.

Figure 5. ReLU activation

Figure 6. Softmax activation
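The dense layer and its two activations can be sketched in plain Python. The weights, biases, and inputs below are arbitrary values chosen for illustration. A real framework such as TensorFlow or PyTorch would operate on tensors and batches instead of lists.

```python
import math


def relu(values):
    return [max(v, 0.0) for v in values]


def softmax(values):
    # Subtracting the maximum avoids overflow and does not change the result.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases, activation):
    """weights[j] holds the weights of output unit j, one per input."""
    pre_activation = [sum(w * x for w, x in zip(row, inputs)) + b
                      for row, b in zip(weights, biases)]
    return activation(pre_activation)


x = [1.0, -2.0]
hidden = dense(x, [[0.5, 0.5], [1.0, -1.0]], [0.0, 0.0], relu)
probs = dense(hidden, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], softmax)
print(hidden)       # [0.0, 3.0]
print(sum(probs))   # the softmax outputs add up to 1 (up to rounding)
```

The first unit of the hidden layer receives a negative value (-0.5), which the ReLU sets to zero. The softmax then turns the two hidden values into probabilities.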

Activation layer

The activation layer transforms input values with some functions similar to the ones used in the dense layer or more complex functions.

Embedding layer

The embedding layer transforms the input values into vector representations. This is commonly used in Natural Language Processing (word embedding) where indexed words are converted to vector representations such as Word2Vec (Mikolov et al., 2013). Words of similar meaning will tend to be close in the vector space and relationships between words will tend to be similar in that space. Closeness is measured by some distance.
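A minimal sketch of an embedding lookup follows. The three-word vocabulary and the vector values are invented for the example. In a trained model such as Word2Vec the vectors are learned from text. Cosine similarity is used here as one possible measure of closeness.

```python
import math

# Hypothetical 3-dimensional embedding table, one row per word index.
vocabulary = {"king": 0, "queen": 1, "apple": 2}
embeddings = [[0.9, 0.8, 0.1],
              [0.8, 0.9, 0.1],
              [0.1, 0.0, 0.9]]


def embed(word):
    """Look up the vector for a word through its integer index."""
    return embeddings[vocabulary[word]]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity(embed("king"), embed("queen")))  # close to 1
print(cosine_similarity(embed("king"), embed("apple")))  # much smaller
```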

Masking layer

A masking layer discards certain input values for instance because they are missing. Missing values could be coded as 0 and the mask value will be 0.  

Lambda layer 

A lambda layer allows arbitrary calculations on previous layers. It works as an activation layer but is more general as it can for instance make calculations with multiple input layers. 

Subclass layer

A subclassed layer extends an existing layer class and adds new states and computation methods. For instance, input layers can be combined and go through a new computation to produce new outputs.

Feedforward propagation

The successive transformations from the input layer through each layer up to the output layer form the feedforward propagation of the neural network. If all the network parameters are known, the propagation will give some model outputs. If the model needs to fit some output data, as in supervised learning, the parameters need to be learned with backward propagation.

Model Training

Model training will adjust the model parameters such as the weights and biases to minimize some loss function.

Model Loss Function

Sum of squared errors

The sum of squared errors is often used for regression problems. It is calculated as the sum of the squared differences between predicted values and true values. If we take the mean, it becomes the mean squared error.

Other losses that can be used for regressions are the mean absolute error, the mean absolute percentage error, the mean squared logarithmic error, and the cosine similarity among others.
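The regression losses above can be written out directly. The sketch below covers the sum of squared errors, the mean squared error, and the mean absolute error on made-up values.

```python
def sum_squared_errors(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    return sum_squared_errors(y_true, y_pred) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(sum_squared_errors(y_true, y_pred))   # 5.0
print(mean_absolute_error(y_true, y_pred))  # 1.0
```

The squared error penalizes the largest miss (2 units on the last observation) four times as much as the miss of 1 unit, whereas the absolute error penalizes it only twice as much.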

Cross-entropy loss

The cross-entropy loss is used for classification problems. It is calculated as the negative of the sum of the products between the true class probability values (so 0 or 1) and the logarithm of the predicted probability values.

KL divergence 

The KL divergence loss can also be used for classification problems. It is calculated as the sum of the products between the true class probability values (so 0 or 1) and the logarithm of the ratio of true class probability values to the predicted probability values.
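Both classification losses can be sketched for a single observation. The predicted probabilities are made-up values, and terms with a true probability of 0 are skipped because they contribute nothing to either sum.

```python
import math


def cross_entropy(p_true, p_pred):
    # Negative sum of true probability times log of predicted probability.
    return -sum(t * math.log(q) for t, q in zip(p_true, p_pred) if t > 0)


def kl_divergence(p_true, p_pred):
    # Sum of true probability times log of the ratio true / predicted.
    return sum(t * math.log(t / q) for t, q in zip(p_true, p_pred) if t > 0)


one_hot = [0.0, 1.0, 0.0]       # the true class is the second one
predicted = [0.1, 0.7, 0.2]
print(cross_entropy(one_hot, predicted))
print(kl_divergence(one_hot, predicted))
```

With 0/1 true probabilities the two losses coincide, since both reduce to minus the logarithm of the probability predicted for the true class. They differ only when the true distribution is not one-hot.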

Model Initialization

The model parameters are initialized when the layers are created. Zero initialization is usually not a good idea because of the need to break the symmetry between neurons. If all the weights start at zero, every neuron in a layer computes the same output and receives the same gradient update, so the neurons remain identical during training and the hidden layers cannot learn distinct features.

Normal initialization

With normal initialization, the weights are taken from a random normal distribution of a given mean (usually 0) and standard deviation. 

Glorot/Xavier initialization

With Glorot/Xavier (Glorot and Bengio, 2010) normal initialization, the weights are taken from a random normal distribution of a given mean (usually 0) and a standard deviation that depends inversely on the square root of the sum of the number of inputs and the number of outputs.

With Glorot/Xavier uniform initialization, the weights are taken from a random uniform distribution of a given mean (usually 0) and boundaries that depend inversely on the square root of the sum of the number of inputs and the number of outputs.

He initialization

He initialization (He et al., 2015a) is similar to the Glorot/Xavier normal initialization, but the variance depends only on the number of inputs and includes a factor of 2: it is equal to 2 divided by the number of inputs.
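The initializers above can be sketched with the standard library. The scaling constants used here, a standard deviation of sqrt(2 / (n_in + n_out)) and a uniform limit of sqrt(6 / (n_in + n_out)) for Glorot, and a standard deviation of sqrt(2 / n_in) for He, are the commonly used forms. Deep learning libraries may differ in details such as truncating the normal distribution.

```python
import math
import random


def glorot_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def glorot_uniform(n_in, n_out, rng):
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]


def he_normal(n_in, n_out, rng):
    std = math.sqrt(2.0 / n_in)   # depends on the number of inputs only
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)            # fixed seed so the example is repeatable
weights = glorot_uniform(n_in=64, n_out=32, rng=rng)
limit = math.sqrt(6.0 / (64 + 32))
print(len(weights), len(weights[0]), limit)   # 32 64 0.25
```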

Backpropagation

Deep neural networks have come back in vogue thanks to the rediscovery of backpropagation and the application of stochastic gradient descent (Bottou, 2011). The objective during the training of a neural network is to minimize a loss function by adjusting the weights and biases of the neural network.

In the univariate case (Figure 7), the first-order derivative indicates the direction in which the free weight parameter x has to be adjusted. If it is positive then x has to be lower. If it is negative then x has to be higher. If the loss function is convex, this procedure is very reliable for finding the global minimum. If it is not convex, the procedure might only find a local minimum.

Figure 7. Model loss as a function of weight (univariate case)

Optimizing a neural network adds two major complications to the univariate case. First, the derivative becomes a gradient when there is more than one variable, and there are many weights to optimize. Some very large language models such as GPT-3 (Brown et al., 2020) have billions of parameters. Second, there are many layers and the network is a composition of the layer functions, so the gradient has to be calculated with the chain rule.

Chain rule

The chain rule is a simple method to calculate the derivative of a composite function. For instance, if h(x)=f(g(x)) then h’(x)=f’(g(x))g’(x). The derivative of h is the product of two derivatives. If there are n layers, the derivative would be the product of n derivatives.
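The formula can be checked numerically. In the sketch below, f and g are arbitrary functions chosen for illustration (a sine and a square), and the analytical derivative from the chain rule is compared with a central finite difference.

```python
import math


def g(x):
    return x ** 2


def f(u):
    return math.sin(u)


def h(x):
    return f(g(x))


def h_prime(x):
    # Chain rule: h'(x) = f'(g(x)) * g'(x), with f'(u) = cos(u), g'(x) = 2x.
    return math.cos(g(x)) * 2 * x


x = 1.3
eps = 1e-6
numerical = (h(x + eps) - h(x - eps)) / (2 * eps)   # finite-difference check
print(h_prime(x), numerical)   # the two values agree to several decimals
```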

Gradient descent

With one weight variable, a new value x’ is calculated from the current weight x, the derivative at this point, and a positive learning rate parameter lr: x’ = x - lr * f’(x).

If the weights are vectors, we use the gradient instead of the derivative and the formula becomes: x’ = x - lr * ∇f(x), where ∇f(x) is the gradient of f with respect to x.

This procedure is iterative. Each application of the formula is an update. It is common to make an update after making the calculation for a group of observations (a mini-batch) taken from the training sample. The update is done by using the average gradients across the mini-batch observations: this is the stochastic gradient descent. Once all the mini-batches from the training sample are used, we have completed an epoch. We repeat the procedure and monitor the error on the training and validation sets after each epoch.
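The whole procedure (mini-batches, averaged gradients, updates, epochs) can be sketched on a toy problem. The example fits a line y = w * x + b to noiseless data with a mean squared error loss. The function name, learning rate, batch size, and number of epochs are arbitrary choices for the illustration.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=500, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                  # new mini-batches at each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the gradients of (w*x + b - y)^2 over the mini-batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i]
                         for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i])
                         for i in batch) / len(batch)
            w -= lr * grad_w                  # update: x' = x - lr * gradient
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]              # 0.0, 0.1, ..., 1.9
ys = [3.0 * x + 1.0 for x in xs]              # generated with w = 3 and b = 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)                                   # close to 3 and 1
```

With 20 observations and a batch size of 4, each epoch consists of 5 updates.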

Learning rate  

The learning rate is usually not constant and often decreases as learning progresses. Several methods adapt the size of the updates, such as momentum, AdaGrad, RMSProp, or Adam. The idea is to adjust the update by taking into account the past values of the gradient (first moment) or its past squared values (second moment). The higher the past values, the larger the adjustment of the parameters, but the higher the past squared values, the smaller the adjustment. RMSProp and Adam in effect normalize the gradient so that its direction counts more than its magnitude.
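As an illustration of how the two moments enter the update, here is a sketch of the Adam update for a single parameter. The default values for beta1, beta2, and eps are the ones commonly quoted for Adam. The learning rate, the number of steps, and the quadratic test function are assumptions for the example.

```python
import math


def adam_minimize(grad, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
                  steps=500):
    m, v = 0.0, 0.0   # running averages of the gradient and squared gradient
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like)
        v = beta2 * v + (1 - beta2) * g * g    # second moment
        m_hat = m / (1 - beta1 ** t)           # bias corrections for the
        v_hat = v / (1 - beta2 ** t)           # zero initial averages
        # The step grows with the first moment and shrinks with the second.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Minimize f(x) = (x - 2)^2, whose derivative is 2 * (x - 2).
x_min = adam_minimize(lambda x: 2 * (x - 2), x=10.0)
print(x_min)   # ends up near the minimum at x = 2
```

Dividing by the square root of the second moment is what makes the size of the step depend little on the raw magnitude of the gradient.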

Exploding and vanishing gradient 

Because of the product of gradients, the final gradient can end up being very small (vanishing gradient) or very large (exploding gradient). Vanishing gradient problems can be addressed by alternative weight initialization methods and activation functions such as ReLU. Exploding gradient problems can be addressed by gradient clipping which simply imposes a maximum and minimum value to the gradient.
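Gradient clipping by value is simple enough to show in full. The bounds of -1 and 1 are arbitrary choices for the example.

```python
def clip_by_value(gradient, min_value, max_value):
    """Force every component of the gradient into [min_value, max_value]."""
    return [min(max(g, min_value), max_value) for g in gradient]


print(clip_by_value([0.5, -12.0, 300.0], -1.0, 1.0))  # [0.5, -1.0, 1.0]
```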

Model Regularization

Overfitting

As in all supervised learning problems, there is always a risk of overfitting the model and losing generalization. The model will perform well in-sample on the training data but poorly out-of-sample on the validation data. Figure 8 shows the loss curves as a function of the number of epochs. The training loss and the validation loss both decrease until a point where the validation loss starts to increase. From that point, the model overfits the training data and generalizes less well to the validation data. Early stopping, which halts the training around that point, prevents some of the overfitting.

Figure 8. Model training and validation losses as a function of the number of epochs
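Early stopping can be sketched as a rule applied to the sequence of validation losses. The patience parameter (the number of epochs without improvement that is tolerated) and the loss values are assumptions for the illustration.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop training
    return best_epoch


losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(losses))  # 3
```

Training stops at epoch 5, after two epochs without improvement, and the weights from epoch 3 (the lowest validation loss) are the ones to keep.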

L1 and L2 Regularization

Another method to limit overfitting is to use L1 and L2 regularizations. They consist of limiting the size of the weights by adding a regularization term to the loss. Instead of minimizing f(x), the training minimizes f(x) + alpha * ||x||1 or f(x) + alpha * ||x||2, where ||.||1 is the L1 norm (sum of the absolute values of the vector components) and ||.||2 is the L2 norm (square root of the sum of squared component values). In practice, the squared L2 norm is often used. By limiting the size of the weights, there is less risk of overfitting to training data because the weights cannot take extreme values.
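The two penalized losses can be sketched as follows. The data loss of 1.0, the weights, and alpha = 0.1 are made-up values.

```python
import math


def l1_norm(weights):
    return sum(abs(w) for w in weights)


def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights))


def regularized_loss(data_loss, weights, alpha, norm):
    """Loss to minimize: the data loss plus alpha times a norm of the weights."""
    return data_loss + alpha * norm(weights)


weights = [3.0, -4.0]
print(regularized_loss(1.0, weights, alpha=0.1, norm=l1_norm))  # 1 + 0.1 * 7
print(regularized_loss(1.0, weights, alpha=0.1, norm=l2_norm))  # 1 + 0.1 * 5
```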

Dropout

Dropout (Srivastava et al., 2014) is a powerful regularization technique. Dropout randomly drops inputs (sets them to 0) at a fixed rate during training. The remaining inputs are scaled up by 1/(1 - rate) so that their expected sum is preserved. With dropout, the model does not rely on particular weights, is more robust to overfitting, and generalizes better.
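A minimal sketch of dropout as applied during training, in the variant often called inverted dropout where the surviving inputs are rescaled immediately. The rate of 0.2 and the constant inputs are illustrative choices. At inference time the inputs would be passed through unchanged.

```python
import random


def dropout(inputs, rate, rng):
    """Zero each input with probability `rate` and scale the survivors by
    1 / (1 - rate) so that the expected sum is unchanged."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in inputs]


rng = random.Random(0)             # fixed seed so the example is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, rng=rng)
print(sum(dropped) / len(dropped))   # close to 1.0
```

Each output is either 0 or 1.25 here, and the average stays close to the original value of 1.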

Batch normalization

Batch normalization (Ioffe and Szegedy, 2015) is a technique to stabilize the training of a deep neural network. Each mini-batch is renormalized to a mean of 0 and a standard deviation of 1 before entering an activation function. This makes the learning easier as the weight updates have a similar scale and do not become too large or too small.
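The normalization step can be sketched for one feature across a mini-batch. This simplified version leaves out the learnable scale and shift parameters and the running statistics that a full batch normalization layer keeps for inference. The small constant eps, which avoids a division by zero, is an assumption of the example.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Rescale the values of a mini-batch to mean 0 and standard deviation 1."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(variance + eps) for x in batch]


normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
new_mean = sum(normalized) / len(normalized)
new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(new_mean, new_variance)   # approximately 0 and 1
```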

Model Prediction

Metrics

Once the model is trained, additional metrics can be useful in addition to the model loss. Loss-type measures such as cross-entropy for probabilistic models or cosine similarity for regression models can be reported as metrics. In a classification model, accuracy is a useful statistic, as well as AUC (area under the ROC curve), true positives and negatives, false positives and negatives, precision and recall, and sensitivity and specificity.
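Several of these classification metrics follow directly from the counts of true and false positives and negatives. The sketch below assumes binary labels coded as 0 and 1 and at least one example of each kind, so that no denominator is zero. The labels are made up.

```python
def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # also called sensitivity
        "specificity": tn / (tn + fp),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["recall"], metrics["specificity"])
# 0.625 0.5 0.75
```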

Evaluate

While the model is training, it is also run on validation data. The same metrics and loss statistics are calculated for both training and validation data. Before being deployed in production, the model can be run on test data.

Inference

The model is then used for inference and prediction on new data online or in batch mode. 

Model Monitoring

During the training, validation, and inference phases, model and performance data and statistics should be collected. In TensorFlow, TensorBoard (Figure 9) can be used to present and monitor such data visually. The model weights, summary plots, and training graphs can easily be reported on such a dashboard.

Figure 9. TensorBoard

References

Bottou, L., 2011. Large-Scale Machine Learning with Stochastic Gradient Descent, in: Statistical Learning and Data Science. Chapman and Hall/CRC, pp. 33–42. https://doi.org/10.1201/b11429-6

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].

Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256.

Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. The MIT Press, Cambridge, Massachusetts.

He, K., Zhang, X., Ren, S., Sun, J., 2015a. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [cs].

Ioffe, S., Szegedy, C., 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs].

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1929–1958.

AI at JP Morgan

Figure 1. JP Morgan headquarters

“AI will be enormous. It will be enormous for idea generation. It will take care of errors. … I could go on and on and on about the complexities it raises, the opportunities it raises.”

Jamie Dimon, CEO JP Morgan Chase

Reasons to invest in AI

JP Morgan (JPM) is the largest US bank in terms of assets and market capitalization. It spends 11bn dollars annually on technology, 40% of which goes to new initiatives, and AI is becoming a larger share of that budget (for comparison, JPM’s net income was 36bn dollars in 2019).

Thanks to its size and scale it is able to gather a large amount of data (400 petabytes of data) on its customers, operations, transactions, and markets. Its challenge is to serve its customers better with personalized services at scale with efficiency, reliability, security, and confidentiality. 

It has brought in expertise from technology companies such as Google and from academia (Carnegie Mellon) to build its in-house AI capabilities. It pursues both academic research and applied research initiatives.

AI Initiatives

JPM lists six areas where it plans to use AI:

  • Anomaly Detection: Identifies unusual patterns in order to minimize and mitigate risk
  • Intelligent Pricing: Complements traditional pricing models, enabling more accurate prediction and confidence intervals
  • News Analytics: Aggregates news from various sources and provides analytics for sentiment, summarization, topics and trading signals
  • Quantitative Client Intelligence: Draws insights from multi-channel client communications to be used to improve client service
  • Smart Documents: Identifies meaningful information and insights from lengthy text sources in order to reduce manual operations and improve workflow
  • Virtual Assistants: Automates responses to client queries, (chat, email, voice) with the goal of improving client service and operational efficiency

Anomaly detection, and in particular fraud detection, is a very active area for AI deployment in financial institutions. It can be used for anti-money laundering, credit card fraud prevention and detection, trade manipulation detection, and cyber-security.

Pricing in banks already relies on standard analytics (matrix pricing, derivatives pricing), but it can be refined with richer machine learning models (deep learning models, reinforcement learning models).

News analytics, client intelligence, smart documents, and virtual assistants all rely heavily on Natural Language Processing (NLP). Automatic text recognition, classification, understanding, and generation are established NLP techniques that can be deployed in this context. The oldest application is probably check deposit at ATMs, which can now also be done with a mobile phone.

JPM uses virtual assistants in its treasury services portal to guide its corporate clients (CFOs, treasurers). The assistants help clients access information, provide recommendations, and support transactions. JPM also uses AI to facilitate trading and share relevant research with its clients.

JPM also has a research center led by Manuela Veloso, PhD, a CMU professor, which focuses on:

  • Data & Knowledge: Massive Data Understanding, Graphs Learning, Synthetic Data, Knowledge Representation
  • Learning From Experience: Reinforcement Learning, Learning from Data, Learning from Feedback
  • Reasoning and Planning: Domain Representation, Optimization, Reasoning under Uncertainty and Temporal Constraint
  • Safe Human AI Interaction: Agent Symbiosis, Ethics and Fairness, Explainability, Trusted AI
  • Multi Agent Systems: Multi Agent Simulation, Negotiation, Game and Behavior Theory, Mechanism Design
  • Secure and Private AI: Privacy, Cryptography, Secure Multi-Party Computation, Federated Learning

These areas of research are probably a bit more academic, though ethics, fairness, explainability, privacy, and cryptography are very important for companies using AI.

JPM is also working closely with some AI startups, either mentoring them or investing in them. 

Challenges

The main challenge is the sheer scale of JPM’s operations and the infrastructure required to sustain its AI efforts. It has recently deployed Omni AI to provide data to its AI researchers and engineers. As it relies more on the public cloud, security and confidentiality also become very important. Many activities, such as investment banking advisory, are probably not easy to replace with AI, though AI could help provide insights from new sources of data and make bankers more efficient and less focused on just collecting data.

AI at Ping An

Ping An is the largest insurance company in China. It is a publicly listed conglomerate providing services in Life and Health Insurance, Property and Casualty, Banking, Asset Management, Fintech, and Healthtech to over 200 million customers and over 500 million online customers. Its subsidiary Ping An Bank was recently named World’s Best Digital Bank at the Euromoney Global Awards for Excellence 2020.

Reasons to invest in AI

In banking, Ping An with its subsidiary Ping An Bank is a relative latecomer compared to the incumbents. There are four historical banks: Industrial & Commercial Bank of China (ICBC), China Construction Bank Corp. (CCB), Agricultural Bank of China (ABC), and Bank of China, and then newcomers such as China Merchants Bank (CMB) and China CITIC Bank. 

To differentiate itself, and because of the scale of its operations, Ping An has invested heavily in technology, particularly AI (along with blockchain and cloud computing), to serve its customers. It follows a finance + technology strategy, investing 1% of its revenues in R&D every year to enhance its technology, improve efficiency, lower its costs, and better manage risk. Technology feeds into five ecosystems: financial services, health care, auto services, real estate services, and smart city services.

AI Initiatives

Financial services

In financial services, Ping An Bank uses AI extensively in its “AI Banker” system. The “AI Banker” is used to:

  • Automate customer service workloads
  • Improve the efficiency and quality of its customer interactions
  • Shorten credit card and loan approvals online and reduce manual work
  • Automate credit limit calculations based on credit and transaction history
  • Identify eligible customers for private banking and wealth management
  • Provide research and investment recommendations such as stock recommendations and asset allocation to high-net-worth customers
  • Lower credit losses thanks to better risk management
  • Prevent credit card fraud by monitoring transactions and using fraud detection models
  • Provide services in mobile banking

Technologies such as face recognition and document recognition are also used for customer identification to provide credit or make payments. Product recommendation systems are also used to match customers with 

Ping An offers some of these capabilities to other banks and insurance companies through its OneConnect SaaS (Software as a Service) platform. 

Health care

In healthcare, Ping An has developed Ping An Good Doctor, a platform to connect doctors and their patients. Ping An Good Doctor has more than 300 million registered customers and 67 million monthly users. It provides information on 3,000 diseases and suggests treatments based on medical records and data. The doctor also has access to the electronic profile of the patient. The system is designed to prevent misdiagnoses and missed diagnoses.

Ping An has recently deployed one-minute clinics that allow patients to interact with an AI doctor for diagnosis and receive treatment. The AI doctor interacts with the patient in the clinic booth, finds a diagnosis, then a real doctor confirms the diagnosis and provides supplementary information. The common drugs are stored cryogenically in the booth and can be delivered on-site. Drugs can also be ordered through the Good Doctor App. 

Property & casualties insurance

Ping An’s insurance arm uses a Credit-Based Smart Auto Insurance Claim Solution to process auto claims. Several AI technologies are used in this process. After an automobile accident, a customer can file a claim on a mobile phone, take pictures of the damage, and submit any relevant documents. The customer is identified by face recognition. The AI system assesses the loss by identifying the damaged auto parts and looking up their replacement costs in a database. The customer then receives compensation based on the loss assessment as well as on her driving behavior and history. The whole process can take just a few minutes.

Ping An is anticipating the emergence of self-driving cars, where risk shifts from drivers to automakers, and is already thinking about how to cover this new risk. With AI and more data, it is moving to a predictive model of damage loss estimation instead of a simple ex-post model.

In other areas, Ping An is leveraging satellite imaging, drones, and Internet of Things (IoT) to assess business risks such as climate change. These data can be fed into AI models which can predict risk and losses more accurately. 

Challenges

The FinTech and HealthTech initiatives are still a small part of Ping An’s current profits. They require very large investments that might test the patience of investors. These are also very competitive areas where AI innovations are key but carry risk when they lack a long track record. Ping An also offers many of its AI models to third parties on platforms such as OneConnect. It will need to implement a smart AI risk management system to address these new risks internally and externally.

AI at Ant Group

“AI is being used in almost every corner of Ant’s business.”

Yuan (Alan) Qi, a vice president and chief data scientist at Ant

Ant Group

Ant Group is the fintech affiliate of Alibaba. It was founded as Ant Financial in 2011 to operate Alipay, the digital payment system of Alibaba set up in 2004 to establish escrow payments for customer transactions. Alipay has expanded to be much more than a payment platform and is now used for commercial transactions, financial transactions, daily life transactions, and to access over two million third-party apps (see Figure 1).

Alipay has over 1 billion annual active users, over 700 million monthly active users, and more than 80 million active merchants. Ant Group not only works with Alibaba, which remains its main customer, but also with many other partners such as banks, asset managers, and insurers. Ant Group works with more than 2,000 partner financial institutions to give them access to customers and help them offer financial services.  

Figure 1. Alipay on mobile phone

Ant Group has its own products: asset management (Yu’e Bao for money market funds), consumer credit (Huabei), health care (Xiang Hu Bao), private banking (MY Bank), and credit scoring (Zhima Credit). Some of its products can be combined with its partner products to enhance customer insights and risk management.

Figure 2. Ant Group offers Credit, Investment, and Insurance services

Ant Group’s strategy is to increase the trust and engagement of its customers in the Alipay platform by offering all kinds of services (digital finance, food, entertainment, transportation, travel, healthcare, public utilities, etc.) and to gain very accurate insights about them. These insights allow Ant Group to offer more innovative and customized products and services, either directly or indirectly through its partners.

Reasons to invest in AI

Ant Group serves over one billion customers and 80 million merchants, and processes over 15 trillion dollars of transactions (Total Payment Value) every year. It has to be very accurate to maintain trust and customer satisfaction and to keep offering appropriately tailored products while managing all the risks related to KYC, fraud, AML, credit, liquidity, operations, security, and data privacy. In particular, its expertise in fraud detection is critical for the success of its platform. AI is used extensively at Ant Group to support not only the scale and scope of its own business activities but also the operations of its numerous partners.

AI Initiatives

Ant Group specializes in technology applied to the world of consumer and small and micro-business finance and is an online leader in CreditTech, InvestmentTech, and InsureTech. AI techniques such as machine learning, natural language processing, man-machine interaction, secure collaborative intelligence, and time-series graph intelligence support all these activities.

Risk Management

Ant Group has developed AlphaRisk, an AI-based risk control engine to detect and prevent fraud. It offers real-time risk-based decisions to counter fraud attempts, real-time transaction verification, and customer authentication that can be used by third parties. AlphaRisk is powered by state-of-the-art AI algorithms. Its prediction models allow companies to manage their risks better, secure their platforms, and protect legitimate customer transactions against fraud. Its models are self-learning and refit automatically.

CreditTech

Credit is a growth area for Ant Group. The level of consumer credit in China is still very low compared to the US and other developed countries. Working with 100 partner banks, Ant Group offers loans to consumers and to small and micro businesses. Models are used to assess and reevaluate credit limits, a borrower’s ability and willingness to repay a loan, and the pricing of the loan.

Ant Group is developing joint credit risk models with some partner banks. As in federated learning, the models use data from both Ant Group (consumption, wealth, risk profile) and the bank (tax and income) without the data ever leaving either institution, which preserves data privacy.
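
The idea of training one model on features held by two institutions can be illustrated with a toy linear model. The sketch below is not Ant Group’s method: the function names, the data, and the learning rate are all assumptions, and it omits the encryption and secure aggregation that a real deployment would add. Each party computes a partial score from its own private features, only partial scores and residuals are exchanged, and each party updates its own weights locally.

```python
def partial_scores(features, weights):
    """Score contribution computed by one party on its own premises."""
    return [sum(w * x for w, x in zip(weights, row)) for row in features]


def local_gradient(features, residuals):
    """Gradient for one party's weights, using only its features and the shared residuals."""
    n = len(features)
    return [
        sum(r * row[j] for r, row in zip(residuals, features)) / n
        for j in range(len(features[0]))
    ]


def train_joint_model(ant_x, bank_x, labels, lr=0.1, steps=500):
    ant_w = [0.0] * len(ant_x[0])
    bank_w = [0.0] * len(bank_x[0])
    for _ in range(steps):
        ant_part = partial_scores(ant_x, ant_w)
        bank_part = partial_scores(bank_x, bank_w)
        # Only scores and residuals cross the boundary, never the raw features.
        residuals = [a + b - y for a, b, y in zip(ant_part, bank_part, labels)]
        ant_grad = local_gradient(ant_x, residuals)
        bank_grad = local_gradient(bank_x, residuals)
        ant_w = [w - lr * g for w, g in zip(ant_w, ant_grad)]
        bank_w = [w - lr * g for w, g in zip(bank_w, bank_grad)]
    return ant_w, bank_w


# Hypothetical standardized features for the same four customers.
ant_x = [[1.0], [-1.0], [1.0], [-1.0]]   # e.g. a consumption score held by one party
bank_x = [[1.0], [1.0], [-1.0], [-1.0]]  # e.g. an income score held by the other
labels = [5.0, 1.0, -1.0, -5.0]          # generated as 2 * ant + 3 * bank

ant_w, bank_w = train_joint_model(ant_x, bank_x, labels)
```

In this toy setting the jointly trained weights recover the coefficients used to generate the labels, although neither party ever sees the other’s feature values.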

InvestmentTech

AI is used to match customers to investment products according to their risk profiles and behavior. Ant Group lets asset managers leverage its customer database, technology, and AI models to offer more innovative investment products on its platform. 

Intelligent investment advisory is also used for asset allocation and investment recommendation. In partnership with Vanguard, it offers AI-based fund investment advisory services on its wealth management platform. It suggests a fund allocation based on the customer’s financial objectives, risk tolerance, and time horizon. The minimum investment is only 113 dollars.

InsureTech

The insurance market is relatively underdeveloped in China. With a wealthier and aging population, there are growth opportunities in life, health, and P&C insurance products. Ant Group offers shipping return insurance for merchandise purchased on the Taobao platform, health insurance, and pension annuity insurance. It also works with third-party insurers to sell their products and collect insurance premiums and contributions.

AI models can be deployed to assess the risk and pricing of insurance products based on the high-quality data collected on each customer. AI is also used to assist with insurance claims, in particular image recognition and natural language processing to analyze submitted documentation and photos.

Challenges

Due to the scale and scope of Ant Group’s operations, there are multiple challenges. We will focus on the ones related to AI.

First, Ant Group depends on trust in the accuracy of its AI models. It relies on these models for prediction, decision making, risk management, matching of customers to products, pricing and valuation, fraud detection and prevention, etc. A failure of any one of its models can be very costly. All the stakeholders in Ant Group’s AI models need to trust them. Markets, products, customers, and small businesses evolve all the time and become more sophisticated, requiring the AI models to be continuously improved and updated.

Second, Ant Group works with a lot of user data. These data are at risk of being misused intentionally or unintentionally, which can hurt users’ trust in Ant Group’s operations. Many countries, including China, have new privacy laws that are becoming more stringent. Bad actors can also attempt to steal or misuse data. They can disguise themselves as partners or users.

Third, Ant Group relies on a network of partners and affiliates that it does not control directly. Any model failures or data issues on their side can negatively impact Ant Group’s AI operations. For instance, if a financial partner uses incorrect income data for a customer, the final credit decision could be erroneous.

Fourth, Ant Group operates in financial services that are heavily regulated. Commercial and retail banking, asset management, and insurance are all regulated. Online financial services are also starting to be closely regulated. Failure to comply can be very costly to Ant Group, and compliance itself can be expensive.

Fifth, Ant Group is a technology company that needs to constantly innovate while maintaining operations at a huge scale domestically and internationally. With so many users, customers, merchants, small businesses, financial partners, products, and daily transactions, its operations can be extremely complex to manage and change.

AI at Amazon Web Services (AWS)

AWS is the cloud services subsidiary of Amazon. It provides many tools and services to develop AI and Machine Learning models on its platform, from data ingestion, data exploration, data transformation, to model training, tuning, optimization, and deployment.

Data ingestion

Amazon Athena

Amazon Athena is a serverless, fast, efficient, highly available, durable, and secure query engine for big data. It is based on Presto, an open-source query engine originally created by Facebook to query its own databases with low latency. It queries data stored in Amazon S3 in different formats such as CSV, JSON, ORC, Avro, or Parquet using standard SQL. It can also run join queries against JDBC-compliant databases such as MySQL and other data stores such as Amazon Redshift.
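
Submitting a SQL query to Athena from Python might look like the sketch below, which uses the `start_query_execution` call of the boto3 Athena client. The helper names, the `trades` table, the `market_data` database, the bucket, and the region are all hypothetical, and actually running the query assumes that boto3 is installed and AWS credentials are configured.

```python
def build_athena_request(sql, database, output_location):
    """Assemble keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_athena_query(request, region_name="us-east-1"):
    """Submit the query and return its execution id (requires AWS access)."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("athena", region_name=region_name)
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


request = build_athena_request(
    sql="SELECT symbol, AVG(price) AS avg_price FROM trades GROUP BY symbol",
    database="market_data",                                 # hypothetical database
    output_location="s3://example-bucket/athena-results/",  # hypothetical bucket
)
```

Athena runs queries asynchronously, so the returned execution id is then used to poll for completion and fetch the results from the output location.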

Amazon Redshift

Amazon Redshift is a cloud-based data warehouse and a relational database management system. It replaces on-site data warehouses and database systems. It is based on the open-source PostgreSQL project but works very differently as it is focused on very large databases. 

It works with clusters of nodes and slices of nodes to process the SQL queries and retrieve the structured data stored in the nodes. A cluster can have a leader node distributing tasks to the worker nodes and make them work in parallel. Amazon Redshift is highly scalable with added nodes when required and can run very fast queries on petabytes of data. It can be linked to ETL processes and feed analytical workloads (dashboard, visualization, and business intelligence tools) at the enterprise level.

Amazon Kinesis

Amazon Kinesis is a data streaming platform to ingest, process, analyze, and store real-time, high-throughput streaming data. It plays a role comparable to the open-source project Apache Kafka, initially developed by LinkedIn for its own needs. Streaming data can be video data, transaction data, time-series data, or any data that are produced continuously. Contrary to batch analytics, streaming analytics allows an almost immediate reaction to new events and constantly refreshed outputs for end-users and customers. It is, for instance, ideal for price data, fraud detection, and system monitoring data.

Amazon Kinesis offers four capabilities: Kinesis Video Streams for video data captured by cameras, Kinesis Data Streams to capture, process, and store streaming data from multiple sources, Kinesis Data Firehose for continuous ETL jobs and data transfer to AWS databases, and Kinesis Data Analytics for transforming and analyzing streaming data.
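
The difference between streaming and batch analytics can be illustrated without any AWS service. The sketch below does not use the Kinesis API; the function name, window size, threshold, and transaction amounts are assumptions. It processes events one at a time and flags an unusual value as soon as it arrives, using only a short window of recent history.

```python
from collections import deque


def flag_unusual(stream, window=3, threshold=2.0):
    """Yield (value, is_unusual) for each event as it arrives."""
    recent = deque(maxlen=window)
    for value in stream:
        if len(recent) == window:
            baseline = sum(recent) / window
            is_unusual = value > threshold * baseline
        else:
            is_unusual = False  # not enough history yet
        yield value, is_unusual
        recent.append(value)


# Hypothetical transaction amounts arriving one by one.
amounts = [10.0, 12.0, 11.0, 50.0, 12.0]
flags = [flagged for _, flagged in flag_unusual(iter(amounts))]
```

A batch job would detect the same spike only after the whole data set had been collected, whereas the generator reacts to the fourth event immediately.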

Data exploration

Amazon SageMaker Notebooks

Amazon SageMaker Notebooks are Jupyter-style notebooks that can import, process, and analyze data from AWS data stores. Usually, only small data samples can be analyzed in a SageMaker Python notebook. If necessary, Spark jobs can be run from SparkMagic notebooks on an EMR Spark cluster to process the data, or Redshift or Athena can be used directly to explore it.

Amazon Athena

Since Amazon Athena is a database query engine, it can be used for data exploration like in a normal relational database.

Amazon QuickSight

Amazon QuickSight is a business intelligence tool to create interactive dashboards that can be embedded into websites, analytics reports, and emails to share ML insights with the entire organization. It connects seamlessly with all AWS storage and database solutions. It is serverless and therefore scalable: as the number of users grows, it grows along with them. It allows quick iteration when developing new ML models, as results can quickly be shared with all stakeholders.

AWS Glue

AWS Glue is a serverless extract, transform, and load (ETL) tool to prepare data and identify useful metadata and data transformations from an AWS data lake or data source (Amazon S3, Redshift, etc.). The metadata and table definitions are stored in the AWS Glue Data Catalog. It can load the final data into a data store such as Amazon Redshift. It is built on Apache Spark and automatically generates modifiable ETL code in Scala or Python.

Data preparation

Amazon SageMaker Processing Jobs

Amazon SageMaker provides Notebooks that a user can use to write Python scripts and access the standard data science and machine learning libraries (Pandas, Matplotlib, Seaborn, Sklearn, TensorFlow, etc.). Athena and Redshift can also be accessed through these notebooks thanks to the Athena client library (PyAthena) and SQL libraries (SQLAlchemy). Complex queries can be sent directly from the notebooks.

Amazon SageMaker Processing is used when the whole production data needs to be processed and transformed into useful features at scale. The type and the number of instances need to be defined to perform the processing step.

Amazon Elastic MapReduce (EMR)

Amazon EMR is a scalable data processing engine built on Hadoop or Apache Spark. Apache Spark is a very popular distributed processing and analytics engine for big data. Workloads are automatically deployed to clusters and nodes. A SageMaker Notebook can run Spark commands and process data on a Spark cluster. The data can be analyzed and tested with the Amazon Deequ API. Data can be tested for missing or null values, range, correct formatting, completeness, uniqueness, consistency, size, correlation, etc.
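
The kind of checks described here can be sketched in plain Python. The example below is not the Deequ API, which runs on Spark; the helper names and the sample rows are assumptions used only to show what completeness, distinctness, and range constraints measure.

```python
def completeness(rows, column):
    """Fraction of rows in which the column is not missing."""
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)


def distinct_ratio(rows, column):
    """Number of distinct non-missing values divided by the number of non-missing values."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(set(values)) / len(values) if values else 0.0


def in_range(rows, column, low, high):
    """True when every non-missing value lies in [low, high]."""
    return all(low <= r[column] <= high for r in rows if r.get(column) is not None)


# Hypothetical records with one missing amount and one duplicated id.
rows = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 75.5},
    {"id": 3, "amount": 20.0},
]

report = {
    "amount_completeness": completeness(rows, "amount"),
    "id_distinct_ratio": distinct_ratio(rows, "id"),
    "amount_in_range": in_range(rows, "amount", 0.0, 1000.0),
}
```

On a production data set the same checks would be expressed as constraints and evaluated in parallel on the Spark cluster.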

Model training

Amazon SageMaker Notebooks

Amazon SageMaker Notebooks can use standard machine learning libraries such as Scikit-Learn, TensorFlow, MXNet, or PyTorch to transform the data, do feature engineering, split the data, and train the models on samples. The libraries are accessed by loading containers with pre-defined environments, through scripts, or customized containers. 

Some objective metrics such as accuracy have to be defined to evaluate the model performance. Model hyperparameters and parameters can be saved to be examined for model review and evaluation.

Amazon SageMaker Training Jobs Debugger

Amazon SageMaker Training Jobs Debugger uses rules to check for issues such as overfitting, data imbalance, or vanishing gradients. If the rules are triggered, the training stops to allow debugging of the model and inspection of intermediary steps and objects. 

Model tuning and optimization

Amazon SageMaker Hyper-Parameter Optimizer

Amazon SageMaker Hyper-Parameter Optimizer can find the best hyperparameters within some ranges to optimize some objective metrics using different methods such as grid search, random search, or Bayesian optimization. 
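
Grid search and random search are simple enough to sketch directly. The example below does not call SageMaker: the objective function is a hypothetical stand-in for the validation metric of a trained model, and the hyperparameter names and ranges are assumptions. Bayesian optimization, which uses past trials to choose the next candidate, is not shown.

```python
import itertools
import random


def validation_score(learning_rate, dropout_rate):
    """Hypothetical objective standing in for a model's validation accuracy."""
    return 1.0 - 1000 * (learning_rate - 0.01) ** 2 - (dropout_rate - 0.3) ** 2


def grid_search(grid):
    """Evaluate every combination of the listed values."""
    candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
    return max(candidates, key=lambda params: validation_score(**params))


def random_search(ranges, n_trials, seed=0):
    """Evaluate n_trials combinations drawn uniformly from the ranges."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        for _ in range(n_trials)
    ]
    return max(candidates, key=lambda params: validation_score(**params))


grid = {"learning_rate": [0.001, 0.01, 0.1], "dropout_rate": [0.1, 0.3, 0.5]}
best_grid = grid_search(grid)

ranges = {"learning_rate": (0.001, 0.1), "dropout_rate": (0.1, 0.5)}
best_random = random_search(ranges, n_trials=20)
```

Grid search costs grow multiplicatively with each added hyperparameter, which is why random search or Bayesian optimization is usually preferred for larger search spaces.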

Amazon SageMaker AutoPilot

Amazon SageMaker AutoPilot is the AutoML tool of SageMaker. It analyzes the raw data and the target to be predicted. It chooses the best algorithm candidates, processes the data to create the best features, and automatically trains and tunes the models. The best hyperparameters are automatically selected for each algorithm. 

Amazon SageMaker Experiment Tracking

Amazon SageMaker Experiments tracks multiple model runs and provides auditability, traceability, and reproducibility of these runs. Data, parameters, hyperparameters, and models can be accessed historically to review and reproduce feature engineering, training, tuning, and deployment results. Each experiment includes trials, each trial includes steps, and each step includes tracking information. Versioning and lineage are kept across all the trials.

Model deployment

Amazon SageMaker Model Endpoints

Amazon SageMaker Model Endpoints allow the user to interface with a model to get inference results on production data. It requires the location of the data and model artifacts (e.g. an S3 bucket), the container of the model, and some parameters and compute resource configurations to get inferences from the model. Different variants of the model can be requested to run in parallel. Endpoints are accessed through REST APIs.  

Amazon SageMaker Model Monitoring

Amazon SageMaker Model Monitoring monitors the model and identifies any deviations from a baseline. A baseline is created from the training data using a tool such as Amazon Deequ in Apache Spark. Model Monitoring captures the data and model inference results and checks that all the constraints are satisfied. If they are not, Amazon CloudWatch is triggered and sends warnings about the deviation. AWS CloudTrail saves all the model logs for model reviews and debugging.
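
The baseline-and-deviation logic can be illustrated with the standard library. The sketch below is not the SageMaker Model Monitor API; the function names, the z-score threshold, and the feature values are assumptions. It builds a baseline from training values and flags production data whose mean drifts too far from it.

```python
import statistics


def build_baseline(training_values):
    """Summarize a feature of the training data."""
    return {
        "mean": statistics.mean(training_values),
        "stdev": statistics.stdev(training_values),
    }


def check_drift(baseline, production_values, max_z=3.0):
    """Return True when the production mean deviates too far from the baseline."""
    standard_error = baseline["stdev"] / len(production_values) ** 0.5
    z = abs(statistics.mean(production_values) - baseline["mean"]) / standard_error
    return z > max_z


baseline = build_baseline([10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 11.0, 12.0])
stable = check_drift(baseline, [11.0, 10.0, 12.0, 11.0])
drifted = check_drift(baseline, [18.0, 19.0, 17.0, 20.0])
```

In a monitoring pipeline, a positive drift check is the condition that would raise the alert.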

Amazon SageMaker A/B Tests

A/B Tests are used to improve production models and test models and hypotheses on production data. Amazon SageMaker A/B Tests can be performed using Endpoints. Different training data, model versions, and compute resource configurations can be tested with Amazon SageMaker Model Endpoints. After reviewing the results of the different models, an improved model can be selected to replace the current one.

Amazon SageMaker Canary Rollouts

With Amazon SageMaker Canary Rollouts, a new model with a different production variant than the current model can be deployed through Endpoints to a limited number of customers and progressively be expanded to more customers if the model performance is satisfactory. 
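
Both A/B tests and canary rollouts come down to splitting traffic between model variants by weight. The sketch below simulates that routing locally and does not call the SageMaker API; the variant names, the rollout shares, and the seed are assumptions.

```python
import random


def route_request(variant_weights, rng):
    """Pick the variant that serves one request, in proportion to its weight."""
    names = list(variant_weights)
    weights = [variant_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def canary_schedule(shares):
    """Traffic split at each rollout stage; `shares` are the new model's shares."""
    return [{"current-model": 1.0 - share, "new-model": share} for share in shares]


rng = random.Random(42)
stages = canary_schedule([0.05, 0.25, 1.0])

# At the first stage, roughly 5% of 10,000 simulated requests reach the new model.
first_stage_hits = sum(
    route_request(stages[0], rng) == "new-model" for _ in range(10000)
)
final_variant = route_request(stages[-1], rng)
```

Moving to the next stage only after the new model’s metrics look satisfactory limits the number of customers exposed to a regression.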

Amazon SageMaker Batch inference

Amazon SageMaker Batch inference is an alternative to Endpoints if real-time results are not necessary. Amazon SageMaker reads the batch data from an S3 bucket location, runs inference from a model, and delivers the results to another S3 bucket location.  

Model Pipeline

AWS Step Functions

Figure 1. AWS Step Functions. Source: Amazon

AWS Step Functions is an orchestration tool to coordinate the tasks of a machine learning workflow, such as processing the data and running AWS Lambda functions or pre-trained models. It can be used for extract, transform, and load (ETL) processes, for breaking down a complex machine learning codebase into more modular steps, for coordinating batch processing jobs, and for triggering events and notifications. An AWS Step Functions workflow is presented as a visual graph.
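
A Step Functions workflow is defined as a JSON document in the Amazon States Language. The sketch below builds a minimal two-step definition as a Python dictionary; the state names and the Lambda function ARNs are hypothetical placeholders, and the definition is only serialized here, not deployed.

```python
import json


def build_state_machine(processing_arn, training_arn):
    """Return a two-task workflow: process the data, then train the model."""
    return {
        "Comment": "Sketch of a two-step ML workflow",
        "StartAt": "ProcessData",
        "States": {
            "ProcessData": {
                "Type": "Task",
                "Resource": processing_arn,
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": training_arn,
                "End": True,
            },
        },
    }


definition = build_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:process-data",  # hypothetical ARN
    "arn:aws:lambda:us-east-1:123456789012:function:train-model",   # hypothetical ARN
)
definition_json = json.dumps(definition, indent=2)
```

Each state names the state that follows it, which is what allows the service to render the workflow as a graph.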

Amazon EventBridge

Figure 2. Amazon EventBridge. Source: Amazon

Amazon EventBridge connects events (changes of state) to workflows. The events can come from SaaS applications (Datadog, OneLogin, PagerDuty, Saviynt, Segment, SignalFX, SugarCRM, Symantec, Whispir, and Zendesk), customized applications, or AWS services. They trigger workflows that can include connecting to applications, microservices, or databases, running AWS Lambda functions and other AWS applications, or communicating results.

AI at Tencent Holdings

Figure 1. WeChat payment

Tencent Holdings (“Tencent”) is a technology conglomerate based in China. It offers products and services in consumer internet, online gaming, social networks, media and entertainment, fintech, and cloud. Its best-known products are QQ, an instant messaging app for teenagers, and WeChat (Weixin in mainland China), a mobile messaging app that also offers other services such as digital payment, peer-to-peer payment, shopping, and games.

WeChat Pay is the digital payment service of WeChat. WeChat also includes mini-programs, which are apps within WeChat developed for third-party businesses. WeChat Pay competes directly with Alipay of Ant Financial. WeChat Pay can be used in stores at the point of sale with a WeChat Pay barcode or merchant QR code, on websites, in mobile apps, on WeChat official merchant accounts, or in mini-programs hosted in WeChat. A WeChat Pay account is most commonly linked to payment cards and can now be linked to an international credit card such as Visa, Mastercard, or American Express.

Like Alipay, WeChat Pay offers wealth management services such as savings and investment products through its platform LiCaitong, and it partners with banks, mutual funds, and wealth management companies or bank subsidiaries, including BlackRock.

Figure 2. WeChat Pay on mobile phone

Reasons to invest in AI

WeChat has more than 1.1 billion users and WeChat Pay has more than 900 million users. Tencent’s business is all digital and consumer-oriented. Given its size, it needs to leverage AI to support its products and services at scale. IT and cloud infrastructure management, customer support and enhanced customer engagement, payment fraud prevention and detection, digital content management and monitoring, and product innovation all require advanced AI in order to grow.

AI initiatives

Tencent has three labs dedicated to AI: Tencent AI Lab, Youtu Lab, and WeChat AI.  Tencent AI Lab is focused on fundamental research. Youtu Lab is developing applications in image processing, pattern recognition, and deep learning. WeChat AI is focused on Speech Recognition, Natural Language Processing, Computer Vision, Data Mining, and Machine Learning for WeChat.

Tencent also invests in many AI accelerators and AI startups. It has invested in over 800 companies, 70 of which have gone public. It has an office in Palo Alto, CA to invest in non-Chinese startups. It has invested in Tesla, Spotify, and Snap.

Tencent is also involved in agriculture, healthcare, industry, and manufacturing applications of AI. 

Tencent AI Lab

Social AI

Social AI aims at developing better interactions between humans and machines. For instance, the lab has developed a smart chat application using natural language processing and understanding. The chat can be customized and used by businesses on the WeChat App or other platforms to interact with their customers.

Game AI

Game AI facilitates the interaction between the real world and the virtual world of games and continuously enhances the players’ game experience. It supports the numerous online games offered by Tencent and its partners (Riot Games’ League of Legends, Epic Games’ Fortnite, BlueHole’s PlayerUnknown’s Battlegrounds). It has recently developed an AI player named Wukong AI, which learned how to play games such as Honor of Kings through reinforcement learning, the same way AlphaGo of DeepMind learned to play Go (Tencent has its own AI Go player named Fine Art). Humans can play against Wukong AI, and average players have difficulty beating it at higher levels.

Content AI

Content AI focuses on search, personalized recommendation, and content generation for its users. It improves the content and recommendations of online video subscription services (drama series, anime series, variety shows, and short videos in the Weishi app), music (paid streaming music), reading subscription platforms (Weixin Reading app), and news (WeChat Moments newsfeed).

Platform AI

Platform AI provides tools to develop AI applications using OCR, machine translation, conversational bots, speech recognition, natural language processing, sentiment analysis, computer vision, human body and face recognition, and image and video processing and enhancement.

Intelligent Titanium Machine Learning is a one-stop cloud-based machine learning platform for machine learning engineers and data scientists to perform model training, evaluation, and prediction. Tencent Yunzhi Tianshu Artificial Intelligence Service Platform is an AI platform service to deploy AI applications in enterprises. It connects edge devices, AI algorithms, and data through data connectors.

Youtu Lab

Youtu Lab specializes in computer vision and offers different applications in policing, person search and identification, vehicle traffic control and monitoring, face verification, graphical content monitoring, and censoring. 

WeChat AI

WeChat AI supports all the applications of AI on the WeChat platform. They include voice recognition, QR code scanning from images, machine translation, chatbots to entertain users, music/TV, and voice lock security. It uses speech recognition and audio processing, natural language processing, image and video processing, data mining and text understanding, and distributed machine learning.

Challenges

The most important challenges include government regulation, reputation, and competition risks. 

Tencent is exposed to many regulatory and compliance risks as the consumer internet and AI come under closer scrutiny in most countries, including China. Privacy, data protection, and consumer protection laws apply to Tencent in its social networks and gaming activities. Another set of laws and regulations in the financial sector, such as banking laws, investor protections, financial regulations and compliance, and risk regulations, applies to Tencent’s financial activities.

The Chinese government seems to have some control over WeChat, which could present a risk for Tencent’s international activities. The US government is, for instance, attempting to ban WeChat for American users because it might expose them to some security risk.

Internet and gaming activities can sometimes be perceived as damaging if they lead to psychological problems such as addiction, especially among young customers, and Tencent has to be careful in evaluating the social impact of its businesses. Furthermore, some activities in AI such as policing or surveillance can be controversial in some countries and present some reputation risk for Tencent.

Business competition is another challenge as consumers can change their behaviors and adopt new platforms, new products and services offered by other firms. If Tencent does not keep up with innovation, it might lose users and market share. In fintech, Ant Financial with AliPay is for instance a significant competitor. Tencent is very dominant in gaming but consumer taste can change quickly and large investments are required to keep up with the latest technologies such as augmented reality (AR) and virtual reality (VR).

AI at Netflix

Netflix is the largest video streaming service company in the world, present in 190 countries and serving around 195 million customers. It has annual revenues of close to 20 billion dollars and a market capitalization of over 200 billion dollars. It started in 1997 with DVD rentals and sales by mail and started video streaming in 2007. Netflix is available on many platforms including TVs, phones, and tablets. Netflix is also involved in the production of original content and in movie production with Netflix Studio.

Reasons to Invest in AI

Netflix is mostly a digital company with its infrastructure run in the cloud with AWS. It streams billions of hours of content every month in many countries and many languages. It collects a large amount of data from its users and strives to provide them with real-time recommendations based on viewing and preferences. Its objective is to keep its users watching the most enjoyable shows on its platform. It needs AI to operate at this scale.

AI initiatives

On its excellent blog, Netflix describes how AI and machine learning are used in different areas of its business.

Personalization and Recommendations

Netflix needs to help its customers find content to watch on its platform. A customer can watch a film she enjoys but then will be looking for another one to watch, maybe with the same theme (action, romantic comedy, science fiction, etc.), same director, or same cast.

Each user has a personalized page with recent views, trends, and recommendations by category, as well as original Netflix content. Everything on the page is customized to the viewer, including the suggested categories, films or series, and even their visuals. The image representing the film can show a particular actor or graphic that will attract the attention of the viewer.

Netflix uses several machine learning algorithms to select the content to show on the user home page. In particular, it uses A/B testing and contextual bandits. It runs real-time experiments on different page configurations and collects information on which configuration gets the most clicks. It knows which film the user ends up watching and whether the user has watched it to the end. It mixes predictions based on the user’s characteristics, preferences, and history with more randomized suggestions to uncover more information on the user’s preferences.
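
The trade-off between showing the configuration that currently performs best and trying randomized suggestions can be illustrated with an epsilon-greedy bandit, a simpler relative of the contextual bandits mentioned above. The sketch below is an illustration only and not Netflix’s implementation: the class name, the layout names, and the click rates are all hypothetical, and a contextual bandit would additionally condition its choice on user features.

```python
import random


class EpsilonGreedyLayout:
    """Choose a page layout: usually the best so far, sometimes a random one."""

    def __init__(self, layouts, epsilon=0.1, seed=0):
        self.layouts = list(layouts)
        self.epsilon = epsilon              # share of traffic used to explore
        self.rng = random.Random(seed)
        self.shows = {name: 0 for name in self.layouts}
        self.clicks = {name: 0 for name in self.layouts}

    def click_rate(self, name):
        """Observed click rate of a layout (0.0 if never shown)."""
        if self.shows[name] == 0:
            return 0.0
        return self.clicks[name] / self.shows[name]

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.layouts)         # explore
        return max(self.layouts, key=self.click_rate)    # exploit

    def record(self, name, clicked):
        """Store the outcome of showing one layout to one user."""
        self.shows[name] += 1
        self.clicks[name] += int(clicked)


def simulate(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Run the bandit against made-up click probabilities per layout."""
    bandit = EpsilonGreedyLayout(true_rates, epsilon=epsilon, seed=seed)
    user = random.Random(seed + 1)
    for _ in range(rounds):
        layout = bandit.choose()
        bandit.record(layout, user.random() < true_rates[layout])
    return bandit


if __name__ == "__main__":
    # Hypothetical click probabilities, for illustration only.
    result = simulate({"hero_image": 0.05, "actor_closeup": 0.08})
    print(result.shows)
```

A larger epsilon gathers information about under-tested layouts faster, at the cost of showing the current best layout less often.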

Content and Studio

Netflix has to constantly purchase the rights to, or produce, new content for its platform. For TV series, it will often agree to stream the full season without a pilot. It also needs to know how much to invest in new productions. It uses predictive modelling to forecast the demand for new shows. It looks, for instance, at similar shows, the similarity being measured by some distance between show attributes. Because it has detailed information on shows which have been popular and have found an audience, it knows with some probability which new show will be successful.
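
The idea of forecasting demand from similar shows can be sketched as a nearest-neighbor estimate. This is an assumption-laden illustration rather than Netflix’s actual model: the attribute vectors (hypothetical scores for action, comedy, and drama), the viewer figures, and the function names are all made up, and Euclidean distance is only one possible choice of distance.

```python
import math


def distance(a, b):
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def forecast_demand(new_show, catalog, k=2):
    """Average the viewers of the k catalog shows closest to new_show."""
    ranked = sorted(catalog, key=lambda show: distance(new_show, show["attributes"]))
    nearest = ranked[:k]
    return sum(show["viewers"] for show in nearest) / len(nearest)


# Hypothetical catalog: attributes are [action, comedy, drama] scores,
# viewers are in millions. All values are invented for illustration.
catalog = [
    {"title": "Show A", "attributes": [0.9, 0.1, 0.2], "viewers": 8.0},
    {"title": "Show B", "attributes": [0.8, 0.2, 0.3], "viewers": 6.0},
    {"title": "Show C", "attributes": [0.1, 0.9, 0.1], "viewers": 2.0},
]

# A new action-heavy show lands closest to Show A and Show B.
estimate = forecast_demand([0.85, 0.15, 0.25], catalog, k=2)
```

With these made-up numbers the estimate is the average of the two action shows; a production model would use many more attributes and would also estimate the uncertainty of the forecast.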

Netflix produces its own movies with Netflix Studio. It has optimized the movie creation life-cycle, from pre-production, planning, scheduling, production, and post-production to marketing, using data science and machine learning. For instance, scheduling is treated as a constrained optimization problem. Given the availability of the film crew, director, actors, and location, it can generate an optimal schedule in a very short time. It also chooses which film to produce and how much to invest in each film given the likelihood that it will attract sufficient viewers on its platform. Netflix has borrowed 20 billion dollars to finance its original productions.
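
To make the scheduling example concrete, here is a toy version of the constrained optimization problem: each scene needs certain people, each person is available only on some days, at most one scene is shot per day, and the goal is to finish as early as possible. The scene names, availabilities, and the brute-force search are assumptions chosen for illustration; a real production schedule would need a dedicated solver, since trying every assignment does not scale.

```python
from itertools import permutations


def best_schedule(scenes, availability, days):
    """Return a scene -> day plan that ends earliest, or None if none is feasible.

    scenes: dict mapping scene name to the people it requires.
    availability: dict mapping person to the set of days they are free.
    days: candidate shooting days; each day hosts at most one scene.
    """
    best = None
    for assignment in permutations(days, len(scenes)):
        plan = dict(zip(scenes, assignment))
        # Constraint: everyone needed for a scene is free on its day.
        feasible = all(
            day in availability[person]
            for scene, day in plan.items()
            for person in scenes[scene]
        )
        # Objective: minimize the last shooting day.
        if feasible and (best is None or max(plan.values()) < max(best.values())):
            best = plan
    return best


# Hypothetical inputs, invented for illustration.
scenes = {
    "opening": ["director", "lead"],
    "chase": ["director", "stunt"],
    "finale": ["director", "lead", "stunt"],
}
availability = {
    "director": {1, 2, 3, 4},
    "lead": {1, 3},
    "stunt": {2, 3, 4},
}

plan = best_schedule(scenes, availability, days=[1, 2, 3, 4])
```

Here the finale can only be shot on day 3, the one day all three people are free, which forces the opening onto day 1 and leaves day 2 as the earliest slot for the chase.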

Streaming

Streaming is a technical challenge as Netflix uses over a third of the national internet bandwidth in the US. It has to monitor the quality of the streaming experience for each individual user, who is at a different location and has a specific device, bandwidth, and internet provider. Even before new content is streamed, Netflix checks its quality and tries to predict whether it will have quality issues.

Marketing and Sales

Marketing messages are individualized so that they are more likely to convince non-subscribers to sign up. Netflix has to choose the marketing channel such as YouTube or Facebook or others and what content to show to a potential new member. It is using causal modelling to evaluate the effectiveness of its marketing spending. 

Challenges

A challenge for Netflix is to keep licensing and producing attractive content for its customer base. If tastes change, its models have to capture the change and quickly recommend appropriate new content. Netflix competes for customer attention with other activities such as VR video games or social networks. Netflix is not paying for the internet infrastructure per se, but if it continues to be a significant user of the national bandwidth, it might be asked to pay for it or to reduce the quality of its video streaming.

AI at DBS Bank (Singapore)

Figure 1. DBS Bank Marina Bay

DBS Bank is the largest bank in Singapore and Southeast Asia, with an international presence in China, Taiwan, Hong Kong, India, and Indonesia. It operates in consumer banking, wealth management (10.8 million customers in 2019), and institutional and SME banking (240,000 customers) across 18 markets globally.

It started a digital transformation process in 2014 to modernize its business operations and become a fully digital bank. It has since then received many awards as the best digital bank and the best bank.  

Reasons to invest in AI

It is not clear when AI became prevalent at DBS Bank but it has embraced digital transformation, the cloud, data, and analytics very early on to become more competitive and disrupt itself before being disrupted by competition from other banks, fintech companies, and foreign tech conglomerates such as Alibaba. In its early years, DBS Bank had a reputation for poor customer service.

Among the Singaporean banks, it offers more mobile apps (Figure 2) and has adopted mobile and digital banking to acquire, retain, and engage its customer base. 

Figure 2. DBS mobile apps

Analytics and AI initiatives 

DBS Bank has numerous initiatives that leverage AI and analytics:

Digital payments

DBS Bank owns DBS PayLah!, a digital wallet used by 1.6 million customers to make payments in stores, pay bills, order meals online, book shows, travel, and taxis, and make transfers to other users. It has many platform partners and uses its insights on its users for cross-marketing initiatives.

Contextualized marketing

DBS Bank uses contextualized marketing to sell products to its customers. It calls this hyper-personalization, and it is very similar to the recommendation systems (for products or ads) seen in other industries. This kind of personalized service used to be available only to high-net-worth individuals in private banking but can now be offered to all its clients thanks to technology.

Sentiment analysis

DBS Bank uses sentiment analysis to understand its clients better and address their needs and requests. This lowers the cost of customer support and increases customer satisfaction. Sentiment analysis leverages recent progress in Natural Language Processing by identifying positive and negative keywords and sentences in text and speech.
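
The keyword-based side of sentiment analysis can be sketched in a few lines. The word lists and function names below are hypothetical and deliberately tiny, and this is not a description of the system DBS Bank uses; modern NLP systems typically rely on trained models rather than fixed keyword lists.

```python
import re

# Hypothetical keyword lists, for illustration only.
POSITIVE = {"great", "helpful", "fast", "thanks", "easy"}
NEGATIVE = {"slow", "error", "failed", "unhappy", "complaint"}


def sentiment_score(text):
    """Count positive keywords minus negative keywords in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(text):
    """Map the keyword score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("The app is fast and easy, thanks!"))
print(label("My transfer failed with an error."))
```

A keyword count ignores negation and context ("not helpful" still scores as positive), which is one reason sentence-level models are used in practice.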

Algorithmic credit underwriting

In India and Indonesia, DBS Bank uses data-driven algorithmic credit underwriting models to approve small ticket-size loans to individuals through their mobile phones. These markets are much larger than Singapore and DBS Bank has to rely on automation and algorithms to service such markets. Mobile phones are also key to success because large shares of the population are under-banked. 

Credit risk assessment and monitoring

DBS Bank is developing automation and data-driven capabilities to assess and monitor the credit risk of the credit-sensitive assets in its portfolios and to help reduce their downgrade risk.

DBS Bank is also building a credit platform for its Institutional Banking Group to manage and modernize its credit workflow. It has rolled out the platform to several regions in Asia.

Wealth management

In Wealth Management, DBS Bank combines robo-advisors with human advice in its DBS digiPortfolio product. It offers customized market research and insights on its DBS iWealth platform.

Financial crime

DBS Bank is using artificial intelligence models to manage financial crime risks through dashboards, advanced customer and counterparty network monitoring, and priority ranking of financial crime risk.

Platform operating model

DBS Bank adopted a Platform Operating Model strategy in 2018. These platforms let business and technology teams collaborate on common projects and share data, models, analytical tools, predictive analytics, and workflow processes. DBS Bank has deployed over 33 such platforms across its business.

Call centers

DBS Bank is using AI models in its call centers in Singapore and India to predict customer issues, route calls more efficiently, and address issues automatically.

APIs

DBS Bank has built APIs in the real estate, education, healthcare, insurance, transport, logistics, and e-commerce sectors to connect with its ecosystem partners and cross-sell its services to their shared customers using contextualized marketing.

Challenges

As a financial institution, DBS Bank is exposed to credit and financial risk, financial crime risk, data governance and protection risk, cybersecurity risk, regulatory and reputational risk. As it expands its digital footprint, cybersecurity and data protection are becoming fundamental to the credibility of its digital operations.