Machine Learning Q&A (Set 3)


Q1: What are the main features of SpaCy?
Ans:
Main features
1. Non-destructive tokenization
2. Named entity recognition
3. "Alpha tokenization" support for over 25 languages
4. Statistical models for 8 languages [including English, German, Spanish, Portuguese, French, Italian, and Dutch]
5. Pre-trained word vectors
6. Part-of-speech tagging
7. Labelled dependency parsing
8. Syntax-driven sentence segmentation
9. Text classification
10. Built-in visualizers for syntax and named entities
11. Deep learning integration

Ref: https://en.wikipedia.org/wiki/SpaCy (Dated: Dec 2019)
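
A minimal usage sketch showing a few of these features. This assumes spaCy and its small English model (en_core_web_sm) are installed; the example sentence is arbitrary.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Non-destructive tokenization, part-of-speech tagging, dependency labels
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)

# Syntax-driven sentence segmentation
for sent in doc.sents:
    print(sent.text)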

Q2: What is TF-IDF?
Ans:

Tf-idf stands for term frequency-inverse document frequency.

This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.

The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Tf-idf can be used successfully for stop-word filtering in various applications, including text summarization and classification.

Ref: http://www.tfidf.com/

Q3: How do you compute TF-IDF?
Ans:
Typically, the tf-idf weight is composed of two terms: the first computes the normalized term frequency (TF), i.e., the number of times a word appears in a document divided by the total number of words in that document; the second is the inverse document frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t in it). (The base of the logarithm only rescales the weights; the example below uses base 10.)

Consider a simple example: a document containing 100 words in which the word "cat" appears 3 times. The term frequency (i.e., tf) for "cat" is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word "cat" appears in one thousand of these. The inverse document frequency (i.e., idf) is then calculated as log(10,000,000 / 1,000) = 4, using a base-10 logarithm. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.

Ref: http://www.tfidf.com/
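
A small worked sketch of the computation above (the helper names are assumed; the base-10 logarithm matches the cat example).

import math

def tf(term_count, total_terms):
    # Term frequency, normalized by document length
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    # Inverse document frequency (base-10 log, to match the example above)
    return math.log10(total_docs / docs_with_term)

def tf_idf(term_count, total_terms, total_docs, docs_with_term):
    return tf(term_count, total_terms) * idf(total_docs, docs_with_term)

# The "cat" example: 3 occurrences in a 100-word document,
# 1,000 matching documents out of 10 million.
print(tf_idf(3, 100, 10_000_000, 1_000))  # 0.03 * 4.0 = 0.12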

Q4: What is the "vanishing gradient" problem with RNNs?
Ans:

In machine learning, the vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to that weight at each iteration of training. The problem is that in some cases the gradient becomes vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from training further.

As one example of the cause, traditional activation functions such as the hyperbolic tangent have derivatives in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the "front" layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the front layers train very slowly.

Ref: https://en.m.wikipedia.org/wiki/Vanishing_gradient_problem
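
A minimal numeric sketch of this effect (an assumed illustration, not taken from the reference): each layer contributes one chain-rule factor of at most 1, so the product shrinks rapidly with depth.

import numpy as np

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, always in (0, 1]
    return 1.0 - np.tanh(x) ** 2

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(1, 51):
    pre_activation = rng.normal(scale=2.0)   # a typical pre-activation value
    grad *= tanh_grad(pre_activation)        # one chain-rule factor per layer
    if layer % 10 == 0:
        print(f"after {layer:2d} layers: gradient factor ~ {grad:.3e}")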

Q5: What is the "gradient explosion" problem with RNNs?
Ans:

An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.

In deep networks or recurrent neural networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights, and in turn, an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN values.

The explosion occurs through exponential growth: as gradients are propagated back through the network layers, they are repeatedly multiplied by values larger than 1.0.

Ref: https://machinelearningmastery.com/exploding-gradients-in-neural-networks/
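
A minimal numeric sketch (assumed illustration): repeatedly multiplying by a recurrent weight whose magnitude exceeds 1.0 makes the backpropagated gradient grow exponentially with the number of time steps.

# Any magnitude above 1.0 will do; 1.5 is chosen arbitrarily for illustration.
grad = 1.0
recurrent_weight = 1.5
for step in range(1, 201):
    grad *= recurrent_weight
    if step % 50 == 0:
        print(f"after {step:3d} time steps: gradient ~ {grad:.3e}")

# Gradient clipping is a common remedy: rescale the gradient whenever its
# norm exceeds a chosen threshold.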

Q6: What is SMOTE?
Ans:

Using a machine learning algorithm out of the box is problematic when one class in the training set dominates the others. The Synthetic Minority Over-sampling Technique (SMOTE) addresses this problem by generating synthetic examples of the minority class.

SMOTE does not change the number of majority cases.

Ref: http://rikunert.com/SMOTE_explained

Q7: How does SMOTE work?
Ans:

The new instances are not just copies of existing minority cases; instead, the algorithm takes samples of the feature space for each target class and its nearest neighbors, and generates new examples that combine features of the target case with features of its neighbors. This approach increases the features available to each class and makes the samples more general. 
[ Ref: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/smote ]

As an example, consider a dataset of birds for classification. The feature space for the minority class that we want to oversample could be beak length, wingspan, and weight (all continuous). To oversample, take a sample from the dataset and consider its k nearest neighbors (in feature space). To create a synthetic data point, take the vector between one of those k neighbors and the current data point, multiply this vector by a random number x between 0 and 1, and add the result to the current data point to obtain the new, synthetic data point.

Ref: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

Research Paper on SMOTE:
Link 1: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html
Link 2: https://arxiv.org/pdf/1106.1813.pdf
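
A minimal sketch of the interpolation step described above, using assumed example data (in practice, the imbalanced-learn library provides a full implementation as imblearn.over_sampling.SMOTE).

import numpy as np

rng = np.random.default_rng(0)

# Minority-class samples: columns = beak length (cm), wingspan (cm), weight (g)
minority = np.array([
    [3.1, 25.0, 180.0],
    [2.9, 24.0, 170.0],
    [3.4, 27.0, 200.0],
    [3.0, 26.0, 185.0],
])

sample = minority[0]

# k nearest neighbours of `sample` within the minority class (excluding itself)
k = 2
distances = np.linalg.norm(minority - sample, axis=1)
neighbour_idx = np.argsort(distances)[1:k + 1]
neighbour = minority[rng.choice(neighbour_idx)]

# Synthetic point: sample + x * (neighbour - sample), with x drawn from [0, 1)
x = rng.random()
synthetic = sample + x * (neighbour - sample)
print(synthetic)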

Q8: Brief about CBOW model as in word2vec context.
Ans:

The Continuous Bag of Words (CBOW) Model
 
The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (the surrounding words). Consider the simple sentence “the quick brown fox jumps over the lazy dog”. This can be turned into pairs of (context_window, target_word): if we consider a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on. Thus the model tries to predict the target_word based on the context_window words.

Ref: https://www.kdnuggets.com/2018/04/implementing-deep-learning-methods-feature-engineering-text-data-cbow.html
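
A minimal sketch of how such (context_window, target_word) pairs can be generated (cbow_pairs is an assumed helper, not from the reference); pairs near the sentence edges simply have shorter contexts.

def cbow_pairs(tokens, window=2):
    # For each position, collect up to `window` words on each side as context
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(sentence)[:4]:
    print(context, "->", target)
# e.g. ['quick', 'brown'] -> the, ['the', 'brown', 'fox'] -> quick, ...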

Q9: Brief about Skip-gram model as in word2vec context.
Ans:

Skip-gram is used to predict the context words for a given target word. It is the reverse of the CBOW algorithm: the target word is the input, while the context words are the output. Since there is more than one context word to predict, the problem becomes more difficult.

Ref: https://towardsdatascience.com/skip-gram-nlp-context-words-prediction-algorithm-5bbf34f84e0c

Q10: What are the differences between word2vec and GloVe?
Ans:

1. Presence of neural networks: GloVe does not use neural networks, while word2vec does. In GloVe, the loss function is the difference between the product of word embeddings and the log of the probability of co-occurrence. We try to reduce that using SGD, but solve it much as we would solve a linear regression. In the case of word2vec, we train either a skip-gram model or a continuous bag-of-words model using a 1-hidden-layer neural network.
2. Global information: word2vec has no explicit global information embedded in it by default. GloVe creates a global co-occurrence matrix by estimating the probability that a given word will co-occur with other words. In principle, this presence of global information should make GloVe work better, although in practice the two behave very similarly and people have found comparable performance with both.

Ref: https://www.quora.com/How-is-GloVe-different-from-word2vec

Q11: How do you define a loss function?
Ans:

In mathematical optimization and decision theory, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function.

Ref: https://en.m.wikipedia.org/wiki/Loss_function
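
A minimal sketch of one concrete loss function (an assumed example): mean squared error maps predictions and targets to a single non-negative real number that the optimizer tries to minimize.

import numpy as np

def mse_loss(y_pred, y_true):
    # Average of squared differences between predictions and targets
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean((y_pred - y_true) ** 2))

print(mse_loss([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))  # 0.1666...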
