ML dose with ten Q&A (Set 1)


Q1: What is a 'language model'?
Ans (a):
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability P(w_1,..., w_m) to the whole sequence.

The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "recognize speech" and "wreck a nice beach" sound similar, but mean different things.

Data sparsity is a major problem in building language models. Most possible word sequences are not observed in training. One solution is to make the assumption that the probability of a word depends only on the previous n − 1 words. This is known as an n-gram model (or a unigram model when n = 1). The unigram model is also known as the bag-of-words model.

Estimating the relative likelihood of different phrases is useful in many natural language processing applications, especially those that generate text as an output. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval and other applications.

In speech recognition, sounds are matched with word sequences. Ambiguities are easier to resolve when evidence from the language model is integrated with a pronunciation model and an acoustic model.

Ref: https://en.wikipedia.org/wiki/Language_model

Ans (b):
The goal of probabilistic language modelling is to calculate the probability of a sentence or sequence of words:
P(W) = P(w_1, w_2,..., w_n) 

Such a model can also be used to find the probability of the next word in the sequence:

P(w_5 | w_1, w_2, w_3, w_4)

A model that computes either of these is called a Language Model.

Ref: https://towardsdatascience.com/learning-nlp-language-models-with-real-data-cdff04c51c25
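
To make the idea concrete, here is a minimal count-based sketch (the toy corpus and values are made up for illustration): it estimates a unigram probability P(w) and a bigram conditional P(w_i | w_i-1) from raw counts, i.e. n-gram models with n = 1 and n = 2, with no smoothing.

from collections import Counter

# Toy corpus (hypothetical); a real model would be estimated from far more text.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)                      # count(w)
bigrams = Counter(zip(corpus, corpus[1:]))      # count(w_prev, w)
total = len(corpus)

def p_unigram(w):
    # P(w) = count(w) / total number of tokens
    return unigrams[w] / total

def p_bigram(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_unigram("the"))        # 4/12 = 0.333...
print(p_bigram("sat", "cat"))  # 1/1 = 1.0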

Q2: What is the 'softmax function'?
Ans:
In mathematics, the softmax function, also known as softargmax or normalized exponential function, is a function that takes as input a vector of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval (0,1), and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities. Softmax is often used in neural networks, to map the non-normalized output of a network to a probability distribution over predicted output classes.
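
As an illustration, here is a minimal NumPy sketch of the definition (subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result, since softmax is invariant to adding a constant to every component):

import numpy as np

def softmax(z):
    # exponentiate, then normalize so the components sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])   # raw scores; may be negative or greater than one
probs = softmax(logits)
print(probs)         # each component in (0, 1); larger input -> larger probability
print(probs.sum())   # 1.0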

Q3: What is 'ReLU'?
Ans:
ReLU stands for "rectified linear unit". In a neural network, the activation function is responsible for transforming the summed weighted input of a node into the activation of the node, i.e. its output for that input. The rectified linear activation function is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.

We can describe this with a simple if-statement:

if input > 0:
    return input
else:
    return 0

We can describe this function g() mathematically using the max() function over the set of 0.0 and the input z:

g(z) = max{0, z}

Ref: https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

Potential problems:
1. Non-differentiable at zero; however, it is differentiable everywhere else, and the value of the derivative at zero can be arbitrarily chosen to be 0 or 1.
2. Not zero-centered.
3. Unbounded.
4. Dying ReLU problem: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, so the neuron becomes stuck in a perpetually inactive state and "dies". This is a form of the vanishing gradient problem. In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. The problem typically arises when the learning rate is set too high. It may be mitigated by using leaky ReLUs instead, which assign a small positive slope for x < 0 (see the sketch below).

Ref: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
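
As a small illustration of the mitigation mentioned in point 4, here is a NumPy sketch contrasting ReLU with a leaky ReLU (the slope 0.01 is just an illustrative choice):

import numpy as np

def relu(z):
    # g(z) = max{0, z}
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Same as ReLU for z > 0, but a small positive slope alpha for z < 0,
    # so the gradient is never exactly zero and units are less likely to "die".
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # -> 0, 0, 0, 0.5, 2
print(leaky_relu(z))  # -> -0.02, -0.005, 0, 0.5, 2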

Q4: Which is done first, encoding or SMOTE?
Ans:
It is assumed that you will first vectorize your categorical features with your preferred method. SMOTE and its variations work by calculating distances between examples from the majority and minority classes. In order to calculate such distances, your data has to be formatted as one feature vector per entry. That means that categorical features must first be encoded to numerical values (e.g. by using one-hot encoding) before being passed to the object. In the end, the SMOTE method (and all methods in this package, for that matter) takes as input a design matrix with all entries being numbers, in addition to the respective labels. (See the combined sketch after Q5 below.)

Ref: https://github.com/scikit-learn-contrib/imbalanced-learn/issues/33

Q5: Which is done first, normalization or splitting?
Ans:
It makes a HUGE difference and is one of the most common errors in data science. Part of the reason is that most software tools do not let you do this the right way. Luckily, RapidMiner is not "most software tools" and allows you to do this right.

Here is the answer: you should NEVER do anything that leaks information about your testing data BEFORE a split. If you normalize before the split, you will use the testing data to calculate the range or distribution of the data, which leaks information from the testing data into the training process. That "contaminates" your data and will lead to over-optimistic performance estimations on your testing data. This is, by the way, not just true for normalization but for all data preprocessing steps that change data based on all data points, including feature selection. Just to be clear: this contamination does not have to lead to over-optimistic performance estimations, but it often will.

What you SHOULD do instead is create the normalization on the training data only and use the preprocessing model coming out of the normalization operator. This preprocessing model can then be applied, like any other model, to the testing data as well; it will change the testing data based on the training data (which is fine) but not the other way around.

Ref: https://community.rapidminer.com/discussion/32592/normalising-data-before-data-split-or-after
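
To tie Q4 and Q5 together, here is a minimal sketch using pandas, scikit-learn and imbalanced-learn on hypothetical toy data (the column names and numbers are made up for illustration): encode the categorical feature first, split, fit the scaler on the training set only, and apply SMOTE to the training set only.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced toy data: one numeric feature, one categorical feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "colour": rng.choice(["red", "green", "blue"], size=100),
    "label": [1] * 20 + [0] * 80,          # 20% minority class
})

# Q4: encode categorical features to numbers first (one-hot encoding here).
X = pd.get_dummies(df[["x1", "colour"]], columns=["colour"])
y = df["label"]

# Q5: split before computing any statistics on the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit normalization on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Oversample the training data only; the test set stays untouched.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_s, y_train)
print(pd.Series(y_train_bal).value_counts())   # training classes are now balanced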

Q6: What is "containerization"?
Ans:
Operating-system-level virtualization, also known as containerization, refers to an operating system feature in which the kernel allows the existence of multiple isolated user-space instances. Such instances, called containers, partitions, virtual environments (VEs) or jails (FreeBSD jail or chroot jail), may look like real computers from the point of view of the programs running in them. A computer program running on an ordinary operating system can see all resources (connected devices, files and folders, network shares, CPU power, quantifiable hardware capabilities) of that computer. However, programs running inside a container can only see the container's contents and the devices assigned to the container.

On Unix-like operating systems, this feature can be seen as an advanced implementation of the standard chroot mechanism, which changes the apparent root folder for the current running process and its children. In addition to isolation mechanisms, the kernel often provides resource-management features to limit the impact of one container's activities on other containers.

Operating-system-level virtualization is frequently implemented in remote access applications with dynamic cloud access, allowing for simultaneous two-way data streaming over closed networks.

Q7: Give a brief overview of Docker.
Ans:
Docker is a computer program that performs operating-system-level virtualization, also known as "containerization". It was first released in 2013 and is developed by Docker, Inc. Docker is used to run software packages called "containers". Containers are isolated from each other and bundle their own application, tools, libraries and configuration files; they can communicate with each other through well-defined channels. All containers are run by a single operating system kernel and are thus more lightweight than virtual machines. Containers are created from "images" that specify their precise contents. Images are often created by combining and modifying standard images downloaded from public repositories.

Q8: How would you check if your TensorFlow installation has Keras alongside?
Ans:
(tensorflow) C:\Users\Admin>python
Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:05:27) [MSC v.1900 64 bit (AMD64)] on win32
>>> import tensorflow as tf
>>> tf.__version__
'1.5.0'
>>> dir(tf.keras)
['Input', 'Model', 'Sequential', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_impl', 'activations', 'applications', 'backend', 'callbacks', 'constraints', 'datasets', 'estimator', 'initializers', 'layers', 'losses', 'metrics', 'models', 'optimizers', 'preprocessing', 'regularizers', 'utils', 'wrappers']
>>> from tensorflow.python.keras.layers import Input, Dense
>>> from tensorflow.python.keras import layers

Ref: https://stackoverflow.com/questions/47262955/how-to-import-keras-from-tf-keras-in-tensorflow

Q9: What is a "logit" (in the context of TensorFlow)?
Ans:
In ML, "logit" can mean the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become the input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.

Ref: https://stackoverflow.com/questions/41455101/what-is-the-meaning-of-the-word-logits-in-tensorflow

Q10: Define "treatment group" in statistics.
Ans:
Control and treatment groups: a control group is used as a baseline measure. The control group is identical to all the other items or subjects that you are examining, with the exception that it does not receive the treatment or the experimental manipulation that the treatment group receives.

Q11: What are the various terms that can be used in place of "dependent variable"?
Ans:
Depending on the context, a dependent variable is sometimes called a "response variable", "regressand", "criterion", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", "output variable" or "label".
