Pre-read
Fig:1 Fig:2 Fig:3 If we have several continuous variables x1,...,xD, denoted collectively by the vector x, then we can define a joint probability density p(x) = p(x1,...,xD) such Fig:4Question
Fig:5Solution
Sunday, January 7, 2024
Exercise 1.4 On Probability Densities - Pattern Recognition and ML - By Christopher Bishop
Exercise 1.6 On Covariance - Pattern Recognition and ML - By Christopher Bishop
Pre-read
Fig 1 Fig 21.6: Show that if two variables x and y are independent, then their covariance is zero.
Solution
Fig 3 Fig 4
Exercise 1.7 - Identifying normalization constant for Gaussian Distribution and Proving that it's integral equals 1 - From Pattern Recognition and ML - By Christopher Bishop
Pre-read
Fig 1: Fig 2:Question
Fig 3:Solution
Fig 4: Fig 5: Fig 6: Fig 7: Fig 8: Fig 9: Fig 10: Fig 11:Index of Machine Learning
THEORY
What is Machine Learning?
Logistic Regression
- Logistic Regression Equation
- Logistic Regression is a binary classification algorithm
- Practice problems on Logistic Regression for interview preparation
- Getting the Geometric Intuition Behind Logistic Regression
- Likelihood of a Single Observation For Logistic Regression Model
- Cost Function For Logistic Regression Model
Decision Tree
Deep Learning
Integration
- What happens when we reverse the limits of a definite integral?
- Can you give the example of an odd integrand and prove that it is odd?
- What is the differentiation and integration of f(x)=1 wrt dx
- Show step by step integration of exp(-x by a)
- Change in limits of integration when changing from cartesian to polar coordinates
- Gaussian Integral: Integrate exp(-x^2) from -infinite to +infinite
Vector Calculus
- y = w.x Where w and x are vectors. What is the derivative of y wrt w?
- x is a vector and y = euclidean_norm(x). What is the derivative of y wrt x?
- y = (norm(x))^2. What is the derivative of y wrt x?
- Can you show the derivative X-transpose.X wrt x?
- What is Jacobian of change of variables?
- Derive Jacobian when we are changing from 2D cartesian coordinates to polar coordinates.
- Derive generalized Gaussian Integral from the Special Case of Gaussian Integral for Mu = 0 and Sigma = 1-by-sqrt(2)
- Solving 2D Gaussian Integral w.r.t x and y and also w.r.t polar coordinates
Exercises from the book "Pattern Recognition and Machine Learning" by Christopher Bishop
- Exercise 1.1 On Linear Equations for Regression - From Pattern Recognition and ML - By Christopher Bishop
- Exercise 1.2 On Regularization in Linear Regression - Pattern Recognition and ML - By Christopher Bishop
- Exercise 1.3 on 'sum rule and product rule of probability' (From the book Pattern Recognition and ML by Christopher Bishop)
- Exercise 1.4 On Probability Densities - Pattern Recognition and ML - By Christopher Bishop
- Exercise 1.5 On Concept of Variance From Statistics - Pattern Recognition and ML - By Christopher Bishop
- Exercise 1.6 On Covariance - Pattern Recognition and ML - By Christopher Bishop
- Exercise 1.7 - Identifying normalization constant for Gaussian Distribution and Proving that it's integral equals 1 - From Pattern Recognition and ML - By Christopher Bishop
- Ex 1.8 - Show that the mean value of Gaussian distribution is µ. (Pattern Recognition and ML by Christopher Bishop)
- Ex 1.9 - Show that the mode of the multivariate Gaussian is given by µ. (Pattern Recognition and ML by Christopher Bishop)
- Ex 1.10 - On mean and variance of two variables (Pattern Recognition and ML by Christopher Bishop)
- Ex 1.12 - Pattern Recognition and ML by Christopher Bishop
- Ex 1.13 - Pattern Recognition and Machine Learning - By Christopher Bishop
- Ex 1.14 - Pattern Recognition and Machine Learning - By Christopher Bishop
Statistics
PRACTICALS
ML1: Category Encoding
- Installing 'Category Encoders' Python Package Using Pip And Conda
- Category Encoders Analysis (in Python)
- One Hot Encoding Using Pandas' get_dummies() Method on Titanic Dataset
- Do we need all the one hot features?
- One Hot Encoding from PySpark, Pandas, Category Encoders and skLearn
- Comparing StringIndexer (PySpark), LabelEncoder (skLearn), OrdinalEncoder (skLearn), OrdinalEncoder (category_encoders) [Tags: Machine Learning, Spark]
- Effect of PySpark's StringIndexer on clustering of data [Tags: Machine Learning, Spark]
- Two ways to get Frequency Based Order of Categorical Data (Python)
ML2: Data Preprocessing
- Binning (of a column or 1D Numerical Data)
- Feature Scaling in Machine Learning (when to use which among MinMaxScaler and StandardScaler)
- Similarity and Dissimilarity for Numerical (interval-scaled) variables, Asymmetric binary variables, Categorical variables, For text
- Correlation between continuous-numeric columns, and between categorical columns
- Importance of posing right question for machine learning, data analysis and data preprocessing [Tags: Machine Learning, Spark]
- Working with skLearn's MinMax scaler and defining our own
- Data Preprocessing Using Python package Pandas (Use case: Creating a Nifty50 SIP Simulator)
- Loading data from Pandas to PostgreSQL [Tags: Databases, Machine Learning]
- Exploring skLearn's CountVectorizer
- Using Snorkel to create test data and classifying using Scikit-Learn
- Pandas DataFrame Filtering Using eval()
- Three Types of Input Data Format For Apriori Algorithm (Association Analysis)
- Bagging in overcoming variance of a classifier, clustering algorithm or regressor
ML3: Data visualization
ML3.1: Misc
- Data Visualization's Basic Theory
- Box Plot and Anomaly Detection in 1D [Tags: Data Visualization, Anomaly Detection]
- Social Analysis (SOAN using Python 3) Report [Tags: Data Visualization, NLP]
- Plotting Correlation Matrix in Three Ways Using Pandas, Matplotlib and Seaborn
- Creating Heatmap from Pandas DataFrame correlation matrix
- Binomial Probability Distribution (visualization using Seaborn) [Tags: Machine Learning, NumPy]
- Google Analytics for Beginners (Assessments Dump, Oct 2020)
- survival8 Audience Around The World (Jun 2021)
-
Topic modeling using Latent Dirichlet Allocation from sklearn and visualization using pyLDAvis
[Tags: Data Visualization, Natural Language Processing]
ML3.2: Using PowerBI
- PowerBI's HTML Content Visualization [Tags: Data Visualization, PowerBI]
- Sorting an 'HTML Content' Visual in PowerBI [Tags: Data Visualization, PowerBI]
- Concatenate Two Tables using R in PowerBI [Tags: Data Visualization, PowerBI, R Language]
- Timeline View Using HTML Content Visual in PowerBI [Tags: Data Visualization, PowerBI, Python]
ML3.3: Line Chart
- Creating and Editing Line Chart in LibreOffice Calc [Tags: Data Visualization, FOSS]
- Line plot with multiple lines for page views for survival8 (Sep 2022)
ML3.4: Pie Plot
- Stratified sampling and fixed size sampling plus visualization using pie plot (Nov 2022)
- Plotting changes in Nifty50's top-5 sectors after last three market crashes Using Pie Plot, Bar Chart and Grouped Bar Chart
ML3.5: Choropleth and Cartography
Choropleth: A choropleth map is a type of statistical thematic map that uses pseudocolor, i.e., color corresponding with an aggregate summary of a geographic characteristic within spatial enumeration units, such as population density or per-capita income.
Cartography: the science or practice of drawing maps.
- Drawing a world heat map using 'cartopy'
- Ownership of Bicycle, 2-wheeler and car in India by percentages (2022)
- Global Footprint of Survival8 (Nov 2022)
ML3.6: Histogram
- Differences between 'bar graph' and histogram
- Histogram report and binning on Sales data
- An exercise in visualization (plotting line plot and multicolored histogram with -ve and +ve values) using Pandas and Matplotlib
ML4: Outlier Detection / Anomaly Detection
- Box Plot and Anomaly Detection in 1D [Tags: Data Visualization, Anomaly Detection]
- DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles (Research paper, 2018)
- Unsupervised Outlier Detection Using PyOD
- Isolation based anomaly detection using iForest (2133360.2133363 / Research Paper)
- Density-based algorithm for anomaly detection (Adeel Hashmi / Research Paper)
- Isolation Forest Implementation using skLearn, PyOD, and spark-iForest
- Anomaly Detection using Scikit-Learn and "eif" PyPI package (for Extended Isolation Forest)
- Distributed Deep Learning Using Python Packages Elephas, Keras, Tensorflow and PySpark [Tags: Anomaly Detection, Deep Learning, Spark]
- Anomalies in 'survival8' Viewers' Stats (Mar 2022)
- Estimating the Contamination Factor For Unsupervised Anomaly Detection
ML5: Classification
ML5.1: Decision Tree
- Decision Tree Learning
- Interpretation of Decision Tree J48 Classifier output in Weka [Tags: FOSS, Machine Learning, Weka]
- Calculations for Info Gain and Gini Coefficient for Building Decision Tree
ML5.2: Miscellaneous of Classification
- Creating ML model, saving it, and creating Flask API [Tags: Flask, Machine Learning, Classification]
- kNN classification parallelized using MapReduce [Tags: Hadoop / Spark, Machine Learning]
- Elbow Method for identifying k in kMeans (clustering) and kNN (classification)
- Snorkel's Analysis Package Overview (v0.9.6, Sep 2020). This dicusses how to interpret classification results
- Improving a Classifier (ML) Using Snorkel's Slicing Technique
- Multi-label Classification using Python
- Naïve Bayes Classifier for Spam Filtering
- Weka classification experiment on Iris dataset [Tags: FOSS, Machine Learning, Weka]
ML6: Clustering
- Elbow Method for identifying k in kMeans (clustering) and kNN (classification)
- Weka clustering experiment on Iris dataset [Tags: FOSS, Machine Learning, Weka]
- 'Supervised classification-oriented measures' for Cluster Analysis
- Similarity-oriented measures for cluster analysis
ML7: Regression
- Linear Regression (Theory)
- Improvements over OLS (Forward Stepwise, Ridge, Lasso and LARS forms of Regression)
- Descriptive Statistics and Linear Regression Using 'statistics' module and 'statsmodels' module
- Hands-on 5 Regression Algorithms Using Scikit-Learn
- Demo of Linear Regression on Boston Housing Data Using Weka [Tags: FOSS, Machine Learning]
- Saving Model, Loading Model and Making Predictions for Linear Regression (in Weka)
- Linear Regression Using Java Code And Weka JAR [Tags: FOSS, Java, Machine Learning, Weka]
ML8: Association Mining Between Attributes
- Three Types of Input Data Format For Apriori Algorithm (Association Analysis)
- The Concept of Lift in Association Rules Mining
- Apriori Algorithm For Association Mining Using Weka's Supermarket Dataset
- Running Weka's Apriori on 9_TXN_5_ITEMS Dataset
- Interpretation of output from Weka for Apriori Algorithm
ML9: Weka Tool
- Demo of Linear Regression on Boston Housing Data Using Weka [Tags: FOSS, Machine Learning, Weka]
- Interpretation of Decision Tree J48 Classifier output in Weka [Tags: FOSS, Machine Learning, Weka]
- Weka classification experiment on Iris dataset [Tags: FOSS, Machine Learning, Weka]
- Weka clustering experiment on Iris dataset [Tags: FOSS, Machine Learning, Weka]
- Demo of Linear Regression on Boston Housing Data Using Weka [Tags: FOSS, Machine Learning]
- Saving Model, Loading Model and Making Predictions for Linear Regression (in Weka)
- Linear Regression Using Java Code And Weka JAR [Tags: FOSS, Java, Machine Learning, Weka]
- Apriori Algorithm For Association Mining Using Weka's Supermarket Dataset
- Running Weka's Apriori on 9_TXN_5_ITEMS Dataset
- Interpretation of output from Weka for Apriori Algorithm
- Machine Learning and Weka Interview (5 Questions) [Tags: Machine Learning Q&A, Weka Tool]
ML10: Traffic Prediction on my Blog
- Traffic Prediction on my Blog (Oct 2023)
- When not to use Poisson Distribution for prediction?
- Time Series Analysis and Forecasting Using Exponential Moving Average (A use case of traffic prediction on my blog)
ML11: Questions and Answers
- Machine Learning dose with ten Q&A (Set 1)
- Machine Learning Q&A (Set 2)
- Machine Learning Q&A (Set 3)
- LinkedIn Machine Learning Assessment Dump (Aug 2021)
- Machine Learning and Weka Interview (5 Questions) [Tags: Machine Learning Q&A, Weka Tool]
- ARIMA forecast for timeseries is one step ahead. Why? (Solved Interview Problem)
ML12: Miscellaneous
- Simple demonstration of how important data is for machine learning
- Reading a JSON file from the Google Drive in the Google Colab
- A case of cyclic dependencies between PyPI packages [Tags: Machine Learning, Python]
- Extracting Information From Search Engines [Tags: Machine Learning, Python]
-
Digging deeper into your toolbox (Viewing LDiA code of sklearn)
[Tags: FOSS, Machine Learning, Natural Language Processing]
ML13: Articles
- Machine Learning Resources (Dec 2019)
- Machine Learning Evolution (Jan 2020)
- Data Science Timeline (Aug 2020)
ML14: Our 'Machine Learning' Videos on YouTube
- Session 1 - Linear Regression (OLS Method and Theory) - 20210716
- Session 2 - Improvements over Linear Regression method of OLS (Forward Stepwise, Ridge, Lasso, LARS)
- Linear Regression Theory (2022-02-15)
- Pandas and Linear Regression in Code (Dated: 2022-Feb-16)
- Naive Bayes Classifier / Application: Spam Filtering / Dated: 2022 Feb 17
- Decision Trees Learning (2022 Feb 22)
- Perceptron in Machine Learning (24 Apr 2022)
Thursday, December 28, 2023
Mushroom Picker (A problem on Prefix Sums)
Problem
You are given a non-empty, zero-indexed array A of n (1 <= n <= 100000) integers a0, a1,..., an−1 (0 <= ai <= 1000). This array represents number of mushrooms growing on the consecutive spots along a road. You are also given integers k and m (0 <= k, m < n).
A mushroom picker is at spot number k on the road and should perform m moves. In one move she moves to an adjacent spot. She collects all the mushrooms growing on spots
she visits. The goal is to calculate the maximum number of mushrooms that the mushroom
picker can collect in m moves.
For example, consider array A such that:
A[0]=2, A[1]=3, A[2]=7, A[3]=5, A[4]=1, A[5]=3, A[6]=9
The mushroom picker starts at spot k = 4 and should perform m = 6 moves. She might
move to spots 3, 2, 3, 4, 5, 6 and thereby collect 1 + 5 + 7 + 3 + 9 = 25 mushrooms. This is the
maximal number of mushrooms she can collect.
Code
# Counting prefix sums def prefix_sums(A): n = len(A) P = [0] * (n + 1) for k in xrange(1, n + 1): P[k] = P[k - 1] + A[k - 1] return P # Total of one slice def count_total(P, x, y): return P[y + 1] - P[x] # Mushroom picker — O(n + m) def mushrooms(A, k, m): n = len(A) result = 0 pref = prefix_sums(A) # When we first take p steps on the left and return from their back in right direction. for p in xrange(min(m, k) + 1): left_pos = k - p right_pos = min(n - 1, max(k, k + m - 2*p)) result = max(result, count_total(pref, left_pos, right_pos)) # When we first take p steps on the right and return from their back in left direction. for p in xrange(min(m + 1, n - k)): right_pos = k + p left_pos = max(0, min(k, k - (m - 2*p))) result=max(result, count_total(pref, left_pos, right_pos)) return result
Explanation (Part 1)
# When we first take p steps on the left and return from their back in right direction. for p in xrange(min(m, k) + 1): left_pos = k - p right_pos = min(n - 1, max(k, k + m - 2*p)) result = max(result, count_total(pref, left_pos, right_pos))
Expression “for p in xrange(min(m, k) + 1)” has min(m, k) because we can only go m steps to the left or k steps to the left whichever number is minimum.
After going k steps, we cannot go past 0th position.
Or after taking m steps, we cannot take another step.
Expression “right_pos = min(n - 1, max(k, k + m – 2*p))” has ‘k+m-2*p’ because after taking p steps to the left, we are returning p steps back to position k, hence 2p.
And then number of steps left is: m – 2p
Then right position is identified by: m-2p steps after k ie, k+m-2p
Expression “right_pos = min(n - 1, max(k, k + m – 2*p))” has max(k, k + m – 2*p) because what if we take all the m steps to left and value of p becomes m.
Then there are no steps left to take to the right and right position is simply identified by k.
Similarly, for any value of p > m/2 (as in 0.6m or 0.75m), we would get same right position as k.
Explanation (Part 2)
# When we first take p steps on the right and return from their back in left direction. for p in xrange(min(m + 1, n - k)): right_pos = k + p left_pos = max(0, min(k, k - (m - 2*p))) result = max(result, count_total(pref, left_pos, right_pos))
Here in the expression “for p in xrange(min(m + 1, n – k))”, we have m+1 and not just m because xrange goes till m when passed the argument m+1. Similarly, we have n-k and not (n-1)-k because xrange goes till (n-1)-k when passed the argument n-k.
And we take minimum of m or (n-1)-k because that’s is the possible number of steps we can take on the right.
Here in the expression “left_pos = max(0, min(k, k - (m – 2*p)))”, we have k-(m-2p) because we take p steps on the right, then take those p steps back to k (on the left now).
Number of steps yet to take: m-2p
Left position would be identified by: k-(m-2p)
If p takes value m, which means we have taken m steps on the right, then we have no steps remaining to take to the left and left position is identified by k itself. (This logic is valid for any value of p > m/2)
Side Note
The way we calculating prefix sums and sums of subsections is without using the Python built-ins. But if we were to use Python built-ins, the code would look somewhat like this:def get_mushrooms_sum(A, start, end): return sum(A[start : end + 1])
Coding Round in Interview for Data Scientist Role at a Bank (17 Nov 2023)
Task 1
Write a function: def solution(A) that, given an array A of N integers, returns the smallest positive integer (greater than 0) that does not occur in A. For example, given A = [1, 3, 6, 4, 1, 2], the function should return 5. Given A = [1, 2, 3], the function should return 4. Given A = [−1, −3], the function should return 1. Write an efficient algorithm for the following assumptions: N is an integer within the range [1..100,000]; each element of array A is an integer within the range [−1,000,000..1,000,000].A brute-force solution to this problem that will fail the performance tests will look like this:
def solution(A): # a_sort = sorted(A) rtn = -1 for i in range(1, 100001): if i in A: # This is the most time consuming instruction of this code that runs repeatedly for all of i(s) continue else: rtn = i break return rtnTask 2
Task 3
Fig 3.1 Fig 3.2 Fig 3.3
Monday, December 25, 2023
Building a Tree from a BODMAS Expression (An exercise in Recursion)
There is an expression with N variables as in A, B, C, X, Y, Z. Note that variables do not repeat. And N-1 variables which can be +,-,*, and /
For example:
Ex 1: X + Y * Z Ex 2: X - Y * Z + A Ex 3: X * Y - Z / A Ex 4: X / Y / Z Ex 5: X * Y / Z Ex 6: Z - Y + X
Now, can you give me a recursion based JavaScript code to create a tree with operators on non-leaf nodes and leaf nodes contains the variables based on the BODMAS evaluation of the expression.
Code Pre-read
Certainly! To represent an expression as a tree structure in JavaScript, you can use a recursive data structure. Here's a simple example of how you can create a tree with operators on non-leaf nodes and variables on leaf nodes based on BODMAS evaluation:
…
Code in on next slide
This code defines a TreeNode class to represent nodes in the expression tree. The buildExpressionTree function takes an expression as input and returns the root of the expression tree. The tree is built recursively based on the BODMAS rules, with operators on non-leaf nodes and variables on leaf nodes.
Note: This code assumes that the input expression is properly formatted and follows BODMAS rules.
Code to Build a Tree Node
class TreeNode { constructor(value) { this.value = value; this.left = null; this.right = null; } }
Code
function buildExpressionTree(expression) { // Tokenize Expression const tokens = expression.split(/\s/).filter(token => token.trim() !== ''); // Helper function to build the expression tree recursively function buildTree(start, end) { if (start === end) { // Base case: single variable return new TreeNode(tokens[start]); } // Find the operator with the lowest precedence let minPrecedence = Infinity; let minIndex = -1; for (let i = start; i < end; i++) { if (tokens[i] === '+' || tokens[i] === '-') { const precedence = 1; if (precedence <= minPrecedence) { minPrecedence = precedence; minIndex = i; } } else if (tokens[i] === '*' || tokens[i] === '/') { const precedence = 2; if (precedence <= minPrecedence) { minPrecedence = precedence; minIndex = i; } } } // Create the current node with the operator const currentNode = new TreeNode(tokens[minIndex]); // Recursively build left and right subtrees currentNode.left = buildTree(start, minIndex - 1); currentNode.right = buildTree(minIndex + 1, end); return currentNode; } // Build the tree starting from the root return buildTree(0, tokens.length - 1); } // Example usage: const expression = "X + Y * Z"; const expressionTree = buildExpressionTree(expression); console.log(expressionTree);
Explanation of the code
Certainly! Let's go through the code line by line:
TreeNode Class
class TreeNode { constructor(value) { this.value = value; this.left = null; this.right = null; } } This part defines a TreeNode class, which will be used to represent nodes in the expression tree. Each node has a value (either an operator or a variable), a left child, and a right child.
function buildExpressionTree(expression) { const tokens = expression.split(/\s/).filter(token => token.trim() !== ''); // Tokenize expression The buildExpressionTree function takes an input expression and splits it into tokens using a regular expression. The regular expression \b matches word boundaries, and filter is used to remove any empty or whitespace-only tokens. // Helper function to build the expression tree recursively function buildTree(start, end) { if (start === end) { // Base case: single variable return new TreeNode(tokens[start]); } The buildTree function is a helper function for building the expression tree recursively. The base case checks if start and end are the same, which means we have reached a leaf node containing a single variable. In this case, a new TreeNode is created with the variable as its value.Operator Precedence Next, we are going to look at operator precedence. Viz: In the order DM > AS And if it’s DD or MM or DM, then left to right. // Find the operator with the lowest precedence let minPrecedence = Infinity; let minIndex = -1; for (let i = start; i < end; i++) { if (tokens[i] === '+' || tokens[i] === '-') { const precedence = 1; if (precedence <= minPrecedence) { minPrecedence = precedence; minIndex = i; } } else if (tokens[i] === '*' || tokens[i] === '/') { const precedence = 2; if (precedence <= minPrecedence) { minPrecedence = precedence; minIndex = i; } } } This part of the code searches for the operator with the lowest precedence in the given range (start to end). It iterates through the tokens, updating minPrecedence and minIndex whenever it finds an operator with lower precedence.
// Create the current node with the operator const currentNode = new TreeNode(tokens[minIndex]); // Recursively build left and right subtrees currentNode.left = buildTree(start, minIndex - 1); currentNode.right = buildTree(minIndex + 1, end); return currentNode; } Once the operator with the lowest precedence is found, a new TreeNode is created with that operator as its value. The left and right children of this node are then recursively built using the buildTree function. // Build the tree starting from the root return buildTree(0, tokens.length - 1); } Finally, the buildExpressionTree function initiates the construction of the expression tree by calling buildTree with the entire range of tokens. The function returns the root of the expression tree. // Example usage: const expression = "X + Y * Z"; const expressionTree = buildExpressionTree(expression); console.log(expressionTree); This part demonstrates the usage of the code by creating an example expression ("X + Y * Z") and building an expression tree from it. The resulting tree is then printed to the console. The example expression can be replaced with other expressions as needed. Overall, this code recursively builds an expression tree based on the BODMAS rules, where operators are placed on non-leaf nodes and variables on leaf nodes.
Prompts we tried on ChatGPT
There is an expression with N variables as in A, B, C, X, Y, Z. And N-1 variables which can be +,-,*, and /
For example:
Ex 1: X + Y * Z Ex 2: X - Y * Z + A Ex 3: X * Y - Z / A Ex 4: X / Y / Z Ex 5: X * Y / Z Ex 6: Z - Y + X
A problem can have from 1 to 5 operators.
Now, can you give me a recursion based Python code to create a BODMAS based parenthesized expression for the given expression.