Machine Learning Q&A (Set 2)


Tags: Natural language processing using deep learning (Lazy Programmer Inc.)
Ref: https://www.udemy.com/user/lazy-programmer/

Q1: Brief about 'Brown Corpus'.

The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University, Providence, Rhode Island as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961.

Here is Python code to use it:

from nltk.corpus import brown
def get_sentences():
    # Returns the 57340 sentences of the Brown Corpus.
    # Each sentence is represented as a list of individual string tokens.
    return brown.sents()
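
A quick usage sketch (assuming the 'brown' corpus has already been downloaded, as described below); the expected values are taken from the outputs shown later in this answer:

sentences = get_sentences()
print(len(sentences))      # 57340
print(sentences[0][:5])    # ['The', 'Fulton', 'County', 'Grand', 'Jury']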


If you get a LookupError like the one below, run the command nltk.download('brown') and then run the code snippet again:
LookupError:

Resource brown not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('brown')

Attempted to load corpora/brown.zip/brown/

Searched in:
- 'C:\\Users\\ashish/nltk_data'
- 'C:\\Users\\ashish\\AppData\\Local\\Continuum\\anaconda3\\nltk_data'
- 'C:\\Users\\ashish\\AppData\\Local\\Continuum\\anaconda3\\share\\nltk_data'
- 'C:\\Users\\ashish\\AppData\\Local\\Continuum\\anaconda3\\lib\\nltk_data'
- 'C:\\Users\\ashish\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'

If you are behind a proxy or in a protected network, you might get this error:
[nltk_data] Error loading brown: [urlopen error [Errno 11002] getaddrinfo failed]


ON SUCCESSFUL TRIAL:
[nltk_data] Downloading package brown to
[nltk_data] C:\Users\ashish\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\brown.zip.
Out[6]:
True

--- --- --- --- ---
Output of "brown.sents()":

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Ref: https://en.wikipedia.org/wiki/Brown_Corpus

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q2: Write the code for softmax function using 'numpy' package.

Answer:

import numpy as np

def softmax(a):
    # Subtract the row-wise max for numerical stability, then normalize each row to sum to 1.
    a = a - a.max(axis=1, keepdims=True)
    exp_a = np.exp(a)
    return exp_a / exp_a.sum(axis=1, keepdims=True)

~ ~ ~

(base) C:\Users\ashish>python
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np

>>> arr = np.array([[1, 2, 3], [2, 3, 4]])
>>> np.sum(arr)
15

>>> np.exp(1)
2.718281828459045

>>> np.e
2.718281828459045

>>> np.log(np.exp(1))
1.0


~ ~ ~

>>> from softmax import softmax as sm
>>> import numpy as np
>>> arr = np.array([[1, 2, 3], [2, 3, 4]]) # Won't work on one-D array

>>> np.exp(arr)
array([[ 2.71828183, 7.3890561 , 20.08553692],
[ 7.3890561 , 20.08553692, 54.59815003]])

>>> sm(arr)
array([[0.09003057, 0.24472847, 0.66524096],
[0.09003057, 0.24472847, 0.66524096]])
>>>

In mathematics, the softmax function, also known as softargmax or normalized exponential function, is a function that takes as input a vector of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval (0,1), and the components will add up to 1, so that they can be interpreted as probabilities.

Furthermore, the larger input components will correspond to larger probabilities. Softmax is often used in neural networks, to map the non-normalized output of a network to a probability distribution over predicted output classes.
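
A quick check of these two properties on the same 2-D array as in the transcript above (a sketch, assuming the softmax function defined earlier has been saved in softmax.py):

import numpy as np
from softmax import softmax   # the function defined in the answer above

arr = np.array([[1, 2, 3], [2, 3, 4]])
out = softmax(arr)
print(np.all((out > 0) & (out < 1)))      # True: every component lies in (0, 1)
print(np.allclose(out.sum(axis=1), 1.0))  # True: each row sums to 1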

Ref 1: https://en.wikipedia.org/wiki/Softmax_function
Ref 2: http://survival8.blogspot.com/p/ml-dose-with-ten-q-set-1.html

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q3: What is 'logistic regression'?

Answer:

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

Comparison to linear regression:
Given data on time spent studying and exam scores, linear regression and logistic regression can predict different things:

-- Linear Regression could help us predict the student’s test score on a scale of 0 - 100. Linear regression predictions are continuous (numbers in a range).

-- Logistic Regression could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view the probability scores underlying the model’s classifications.

It is important to understand the sigmoid function in order to understand logistic regression.
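
A minimal sketch of the sigmoid and of a logistic-regression-style prediction for the pass/fail example above (the weight and bias values are made up purely for illustration, not taken from any fitted model):

import numpy as np

def sigmoid(z):
    # Maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model: probability of passing as a function of hours studied.
w, b = 1.5, -4.0                      # illustrative coefficients only
hours = np.array([1.0, 2.5, 4.0])
prob_pass = sigmoid(w * hours + b)    # continuous probabilities in (0, 1)
passed = prob_pass >= 0.5             # mapped to discrete classes: pass / fail
print(prob_pass)                      # approximately [0.076, 0.438, 0.881]
print(passed)                         # [False, False, True]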

Ref: https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html

Sigmoid function, graphically: (see the figure in the reference below)
Ref: https://en.wikipedia.org/wiki/Sigmoid_function

Logistic regression curve: (see the figure in the reference below)
Ref: https://www.saedsayad.com/logistic_regression.htm

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q4: Explain 'word2vec'.

Answer:

Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand. [ Ref: https://skymind.ai/wiki/word2vec ]

There are various word embedding models available, such as word2vec (Google), GloVe (Stanford) and fastText (Facebook). [ Ref: https://www.guru99.com/word-embedding-word2vec.html ]

Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or to cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.

The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words. Measuring cosine similarity, no similarity is expressed as a 90 degree angle, while total similarity of 1 is a 0 degree angle, i.e. complete overlap: Sweden equals Sweden, while Norway has a cosine distance of 0.760124 from Sweden, the highest of any other country. A list of words associated with “Sweden” using Word2vec, in order of proximity, is shown in the reference. [ Ref: https://skymind.ai/wiki/word2vec ]
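
A small numpy sketch of cosine similarity between word vectors (the vectors below are random stand-ins for learned embeddings, used only to show the computation):

import numpy as np

def cosine_similarity(u, v):
    # 1.0 means a 0 degree angle (complete overlap); 0.0 means a 90 degree angle (no similarity).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
v_sweden = rng.normal(size=50)                   # stand-in for the vector of "Sweden"
v_norway = v_sweden + 0.5 * rng.normal(size=50)  # a nearby vector, standing in for "Norway"
print(cosine_similarity(v_sweden, v_sweden))     # 1.0: Sweden equals Sweden
print(cosine_similarity(v_sweden, v_norway))     # high, but below 1.0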
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q5: What is Gensim?

Answer:

Gensim is a topic modeling toolkit implemented in Python. Topic modeling is discovering hidden structure in a body of text. Word2vec can be imported from the Gensim toolkit. Note that Gensim provides not only an implementation of word2vec but also Doc2vec and FastText; this tutorial, however, is all about word2vec, so we will stick to that topic.

Ref: https://www.guru99.com/word-embedding-word2vec.html
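
As a minimal sketch (not taken from the referenced tutorial), a word2vec model can be trained with Gensim on the Brown corpus from Q1. Note that gensim 4.x calls the dimensionality argument vector_size, while older releases call it size:

from nltk.corpus import brown
from gensim.models import Word2Vec

# brown.sents() already yields lists of tokens, which is the input format Word2Vec expects.
model = Word2Vec(brown.sents(), vector_size=100, window=5, min_count=5, workers=2)

# Query the learned vectors for the words most similar to a given word.
print(model.wv.most_similar('money', topn=5))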
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q6: Brief about 'Named-entity recognition'.

Answer:

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Named-entity recognition platforms:
Notable NER platforms include:

-- GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API.
-- OpenNLP includes rule-based and statistical named-entity recognition.
-- SpaCy features fast statistical NER as well as an open-source named-entity visualizer.

Ref: https://en.wikipedia.org/wiki/Named-entity_recognition

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q7: What is negative sampling (such as in the Word2Vec implementation)?

Answer:

The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not:

        v_c * v_w
  ---------------------
  sum_c1 (v_c1 * v_w)

The numerator is basically the similarity between the context word c and the target word w. The denominator computes the similarity of all other contexts c1 and the target word w. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts c1.

Negative sampling is one of the ways of addressing this problem: just select a couple of contexts c1 at random. The end result is that if "cat" appears in the context of "food", then the vector of "food" is more similar to the vector of "cat" (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. "democracy", "greed", "Freddy"), instead of all other words in the language. This makes word2vec much, much faster to train.

Ref: https://stackoverflow.com/questions/27860652/word2vec-negative-sampling-in-layman-term
Ref: https://www.coursera.org/lecture/nlp-sequence-models/negative-sampling-Iwx0e (Negative Sampling as in Word2Vec, Coursera)
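
A toy numpy sketch of this idea for a single (target, context) pair: increase the similarity of "cat" with its true context "food" while decreasing its similarity with a few randomly drawn negative words, rather than normalizing over every word in the vocabulary (all vectors here are random placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
dim = 20
v_cat = rng.normal(size=dim)            # target word vector, e.g. "cat"
v_food = rng.normal(size=dim)           # true context vector, e.g. "food"
v_negatives = rng.normal(size=(3, dim)) # a few randomly chosen words, e.g. "democracy", "greed", "Freddy"

# Negative-sampling loss for this pair: reward a high dot product with the true context
# and low dot products with the sampled negatives.
loss = -np.log(sigmoid(v_food @ v_cat)) - np.sum(np.log(sigmoid(-v_negatives @ v_cat)))
print(loss)  # gradient descent on this loss pulls v_food toward v_cat and pushes the negatives away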
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q8: What is LSTM?

Ans 8.1:

LSTM (Long Short-Term Memory) is a prominent variation of the RNN. Recurrent nets are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical time series data emanating from sensors, stock markets and government agencies. [ Ref: https://skymind.ai/wiki/lstm ]

"Recurrent networks take as their input not just the current input example they see, but also what they have perceived previously in time." [ Ref: https://skymind.ai/wiki/lstm ]

"In the mid-90s, a variation of recurrent net with so-called Long Short-Term Memory units, or LSTMs, was proposed as a solution to the vanishing gradient problem. LSTMs help preserve the error that can be backpropagated through time and layers. By maintaining a more constant error, they allow recurrent nets to continue to learn over many time steps (over 1000), thereby opening a channel to link causes and effects remotely. This is one of the central challenges to machine learning and AI, since algorithms are frequently confronted by environments where reward signals are sparse and delayed, such as life itself." [ Ref: https://skymind.ai/wiki/lstm ]

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q9: What is RNN?

Answer:

A recurrent neural network looks quite similar to a traditional neural network, except that a memory state is added to the neurons. The computation to include a memory is simple. Imagine a simple model with only one neuron fed by a batch of data. In a traditional neural net, the model produces the output by multiplying the input with the weights and applying the activation function. With an RNN, this output is sent back to itself a number of times. The number of times the output becomes the input of the next matrix multiplication is called the timestep.

For instance, in the picture in the reference below, you can see that the network is composed of one neuron. The network computes the matrix multiplication between the input and the weights and adds non-linearity with the activation function. This becomes the output at t-1, which is then the input of the second matrix multiplication.

Ref: https://www.guru99.com/rnn-tutorial.html
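
A minimal numpy sketch of this recurrence: at each time step the cell combines the current observation with the previous output and applies a non-linearity (weights are random placeholders):

import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim, timesteps = 3, 4, 5

W_x = rng.normal(size=(input_dim, hidden_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_dim)

x = rng.normal(size=(timesteps, input_dim))      # a toy input sequence
h = np.zeros(hidden_dim)                         # initial memory state

for t in range(timesteps):
    # The previous output h is fed back in together with the current input x[t].
    h = np.tanh(x[t] @ W_x + h @ W_h + b)
print(h)  # hidden state after processing the whole sequence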
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q10: What are the limitations of RNN?

Ans:

In theory, an RNN is supposed to carry the information up to time t. However, it is quite challenging to propagate all this information when the time step is too long. When a network has too many deep layers, it becomes untrainable. This problem is called the vanishing gradient problem. If you remember, the neural network updates the weights using the gradient descent algorithm. The gradients grow smaller as the network progresses down to lower layers. Eventually the gradients stay effectively constant, meaning there is no room for improvement. The model learns from a change in the gradient; this change affects the network's output. However, if the difference in the gradient is too small (i.e., the weights change only a little), the network can't learn anything, and so the output does not improve. Therefore, a network facing a vanishing gradient problem cannot converge toward a good solution.

Improvement through LSTM: To overcome the vanishing gradient problem faced by the RNN, Hochreiter and Schmidhuber improved the RNN with an architecture called Long Short-Term Memory (LSTM). In brief, LSTM provides the network with relevant past information at more recent time steps. The machine uses a better architecture to select relevant information and carry it forward to later time steps.

Ref: https://www.guru99.com/rnn-tutorial.html

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q11: Discuss LSTM architecture.

Answer 11.1:

A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. [Ref-1]

Idea: In theory, classic (or "vanilla") RNNs can keep track of arbitrarily long-term dependencies in the input sequences. The problem of vanilla RNNs is computational (or practical) in nature: when training a vanilla RNN using back-propagation, the gradients which are back-propagated can "vanish" (that is, they can tend to zero) or "explode" (that is, they can tend to infinity), because of the computations involved in the process, which use finite-precision numbers. RNNs using LSTM units partially solve the vanishing gradient problem, because LSTM units allow gradients to also flow unchanged. However, LSTM networks can still suffer from the exploding gradient problem. [Ref-1]

Architecture: There are several architectures of LSTM units. A common architecture is composed of a cell (the memory part of the LSTM unit) and three "regulators", usually called gates, of the flow of information inside the LSTM unit: an input gate, an output gate and a forget gate. Some variations of the LSTM unit do not have one or more of these gates, or have other gates. For example, gated recurrent units (GRUs) do not have an output gate. Intuitively, the cell is responsible for keeping track of the dependencies between the elements in the input sequence. The input gate controls the extent to which a new value flows into the cell, the forget gate controls the extent to which a value remains in the cell, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. The activation function of the LSTM gates is often the logistic sigmoid function. There are connections into and out of the LSTM gates, a few of which are recurrent. The weights of these connections, which need to be learned during training, determine how the gates operate. [Ref-1]

The long-term memory is usually called the cell state. The looping arrows indicate the recursive nature of the cell. This allows information from previous intervals to be stored within the LSTM cell. The cell state is modified by the forget gate placed below the cell state and is also adjusted by the input modulation gate. From the equation, the previous cell state is partly forgotten by multiplying it with the forget gate, and new information is added through the output of the input gates.

The remember vector is usually called the forget gate. The output of the forget gate tells the cell state which information to forget by multiplying a position in the matrix by 0. If the output of the forget gate is 1, the information is kept in the cell state. From the equation, the sigmoid function is applied to the weighted input/observation and the previous hidden state.

The save vector is usually called the input gate. These gates determine which information should enter the cell state / long-term memory. The important parts are the activation functions for each gate. The input gate is a sigmoid function and has a range of [0, 1]. Because the equation of the cell state is a summation with the previous cell state, the sigmoid function alone would only add memory and would not be able to remove/forget memory. If you can only add a float number between [0, 1], that number will never be zero / turned off / forgotten. This is why the input modulation gate has a tanh activation function: tanh has a range of [-1, 1] and allows the cell state to forget memory.

The focus vector is usually called the output gate. Out of all the possible values from the matrix, which should be moving forward to the next hidden state? [ Ref-2 ] ...

The first sigmoid activation function is the forget gate: which information should be forgotten from the previous cell state (Ct-1)? The second sigmoid and the first tanh activation function form the input gate: which information should be saved to the cell state and which should be forgotten? The last sigmoid is the output gate and highlights which information should go on to the next hidden state.

Ref-1: https://en.wikipedia.org/wiki/Long_short-term_memory
Ref-2: https://medium.com/@kangeugine/long-short-term-memory-lstm-concept-cb3283934359

Answer 11.2:

The repeating module in an LSTM contains four interacting layers (see the figure in the reference below).

Ref: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
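
A minimal numpy sketch of one LSTM step following the gate description in Answer 11.1: the forget, input and output gates use the logistic sigmoid, the candidate values use tanh, and biases are omitted for brevity (all weights are random placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
input_dim, hidden_dim = 3, 4

# One weight matrix per gate, acting on the concatenated [previous hidden state, input].
W_f, W_i, W_o, W_c = (rng.normal(size=(input_dim + hidden_dim, hidden_dim)) for _ in range(4))

x_t = rng.normal(size=input_dim)       # observation at time t
h_prev = np.zeros(hidden_dim)          # previous hidden state
c_prev = np.zeros(hidden_dim)          # previous cell state (long-term memory)

z = np.concatenate([h_prev, x_t])
f = sigmoid(z @ W_f)                   # forget gate: what to drop from the old cell state
i = sigmoid(z @ W_i)                   # input gate: what new information to store
o = sigmoid(z @ W_o)                   # output gate: what to expose as the new hidden state
c_tilde = np.tanh(z @ W_c)             # candidate values in [-1, 1]

c_t = f * c_prev + i * c_tilde         # forget part of the old memory, add new memory
h_t = o * np.tanh(c_t)                 # new hidden state
print(c_t, h_t)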
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

Q12: Draw RNN architecture.

Ans: (The architecture diagram is not reproduced here; see the reference below.)
Ref: https://medium.com/@kangeugine/long-short-term-memory-lstm-concept-cb3283934359

The RNN cell takes in two inputs: the output from the last hidden state and the observation at time t. Besides the hidden state, there is no information about the past to REMEMBER.
