Sunday, February 12, 2023

PySpark Books (2023 Feb)

Download Books
1. Tomasz Drabas, Denny Lee
Packt Publishing Ltd, 27-Feb-2017

2.
Data Analysis with Python and PySpark
Jonathan Rioux
Simon and Schuster, 12-Apr-2022 

3.
PySpark Cookbook: Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python
Denny Lee, Tomasz Drabas
Packt Publishing Ltd, 29-Jun-2018

4.
Machine Learning with PySpark: With Natural Language Processing and Recommender Systems
Pramod Singh
Apress, 14-Dec-2018

5.
Learning Spark: Lightning-Fast Big Data Analysis
Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
"O'Reilly Media, Inc.", 28-Jan-2015 

6.
Advanced Analytics with PySpark
Akash Tandon, Sandy Ryza, Sean Owen, Uri Laserson, Josh Wills
"O'Reilly Media, Inc.", 14-Jun-2022

7.
PySpark Recipes: A Problem-Solution Approach with PySpark2
Raju Kumar Mishra
Apress, 09-Dec-2017

8.
Learn PySpark: Build Python-based Machine Learning and Deep Learning Models
Pramod Singh
Apress, 06-Sept-2019

9.
Learning Spark
Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
"O'Reilly Media, Inc.", 16-Jul-2020

10.
Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
"O'Reilly Media, Inc.", 12-Jun-2017

11.
Applied Data Science Using PySpark: Learn the End-to-End Predictive Model-Building Cycle
Ramcharan Kakarla, Sundar Krishnan, Sridhar Alla
Apress, 2021

12.
Essential PySpark for Scalable Data Analytics: A beginner's guide to harnessing the power and ease of PySpark 3
Sreeram Nudurupati
Packt Publishing Ltd, 29-Oct-2021

13.
Spark: The Definitive Guide: Big Data Processing Made Simple
Bill Chambers, Matei Zaharia
"O'Reilly Media, Inc.", 08-Feb-2018

14.
Spark for Python Developers
Amit Nandi
Packt Publishing, 24-Dec-2015

15.
Frank Kane's Taming Big Data with Apache Spark and Python
Frank Kane
Packt Publishing Ltd, 30-Jun-2017

16.
Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming
Gerard Maas, Francois Garillot
"O'Reilly Media, Inc.", 05-Jun-2019

17.
Data Analytics with Spark Using Python
Jeffrey Aven
Addison-Wesley Professional, 18-Jun-2018

18.
Graph Algorithms: Practical Examples in Apache Spark and Neo4j
Mark Needham, Amy E. Hodler
"O'Reilly Media, Inc.", 16-May-2019 

19.
Spark in Action: Covers Apache Spark 3 with Examples in Java, Python, and Scala
Jean-Georges Perrin
Simon and Schuster, 12-May-2020

20.
Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling
Javier Luraschi, Kevin Kuo, Edgar Ruiz
"O'Reilly Media, Inc.", 07-Oct-2019

21.
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
Holden Karau, Rachel Warren
"O'Reilly Media, Inc.", 25-May-2017

22.
Apache Spark in 24 Hours, Sams Teach Yourself
Jeffrey Aven
Sams Publishing, 31-Aug-2016
Tags: List of Books,Spark,

Hands-on 5 Regression Algorithms Using Scikit-Learn

Download Code and Data
What is Regression?

When the targets are real numbers and we are trying to establish a relationship between a target and a predictor, the problem is called a “regression problem”.

Example 1: Salary vs Years of Experience

Example 2: Weight vs Height
Regression: Predicting Bengaluru Housing Prices

1. Linear Regression (Ordinary Least Squares algorithm)
2. Polynomial Regression
3. Linear Regression using Stochastic Gradient Descent
4. Regression using Support Vector Machines
5. Regression using Decision Trees

1. Linear Regression (Ordinary Least Squares algorithm)

In Linear Regression, you try to fit a line to the data.
Basic idea behind the Ordinary Least Squares algorithm: measure how much the predictions deviate from the actual data. On a graph, each error is the vertical distance between a data point and the fitted line.
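To make the idea of "deviation from the actual data" concrete, here is a minimal sketch that computes the sum of squared errors for a candidate line and then finds the least-squares line with NumPy. The data values and the helper name `sum_squared_errors` are made up for illustration; they are not from the housing dataset.

```python
import numpy as np

# Hypothetical toy data: years of experience vs. salary (in thousands).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 45.0, 53.0])

def sum_squared_errors(slope, intercept):
    """Sum of squared vertical distances between the line and the data."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# Least-squares fit: solve for [slope, intercept] in one call.
A = np.column_stack([x, np.ones_like(x)])
(best_slope, best_intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print("best line:", best_slope, best_intercept)
print("error of best line:", sum_squared_errors(best_slope, best_intercept))
# Any other line, for example one with a steeper slope, has a larger error.
print("error of steeper line:", sum_squared_errors(best_slope + 1.0, best_intercept))
```

Ordinary Least Squares picks the slope and intercept that make this sum as small as possible, which is what `LinearRegression` does in the example that follows.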
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0...
>>> reg.predict(np.array([[3, 5]]))
array([16.])

Ref: scikit-learn.org

Which attributes to transform during EDA?

1. Check if your model requires numerical features and if you can make the attributes numerical. For example, for the problem of predicting housing prices, we can convert the BHK column to floating point numbers:

2 BHK -> 2
2 BHK + Study -> 2.5
3 BHK -> 3
3 BHK + Servant -> 3.5

2. What if the 'bhk' attribute is not given?

>>> pandas_df.dropna(subset = ['bhk'])

If we have engineered all the features, can we drop null records from all the features?

>>> pandas_df.dropna(inplace = True)

2. Polynomial Regression

What if your data is actually more complex than a simple straight line?
Generating Polynomial Features

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

include_bias: bool, default=True
If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])

Building the Polynomial Regression Model

In the snippet below, X is a different training set with a single feature, not the 3x2 array used above.

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654])

X_poly now contains the original feature of X plus the square of this feature. Now you can fit a LinearRegression model to this extended training data:

>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([ 1.78134581]), array([[ 0.93366893, 0.56456263]]))
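The snippets above depend on an X and y that are not shown in this post. As a self-contained sketch, the script below generates its own data from an assumed curve y = 0.5 * x^2 + x + 2 (no noise, chosen only for illustration) and checks that polynomial features followed by a linear fit recover those coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data drawn from y = 0.5 * x^2 + x + 2.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2

# Expand the single feature x into the two features [x, x^2].
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# A plain linear model on the expanded features fits the curve.
lin_reg = LinearRegression().fit(X_poly, y)
print("intercept:", lin_reg.intercept_)
print("coefficients for [x, x^2]:", lin_reg.coef_)
```

With real, noisy data the recovered coefficients would only approximate the underlying curve, as in the output shown earlier.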

3. Linear Regression using Stochastic Gradient Descent

What’s Gradient Descent?

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Suppose you are lost in the mountains in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum!

Concretely, you start by filling θ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum (see the figure below).
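The "one baby step at a time" idea can be sketched on a one-parameter cost function. The cost J(θ) = (θ - 3)^2, the fixed starting point, the learning rate and the step count below are all assumptions chosen for illustration.

```python
def gradient_descent(gradient, theta, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from theta."""
    for _ in range(n_steps):
        theta = theta - learning_rate * gradient(theta)
    return theta

# The cost J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3),
# which is zero exactly at the minimum theta = 3.
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=-10.0)
print(theta_min)
```

Each step moves θ a fraction of the way toward the point where the gradient vanishes, so the steps shrink as the minimum gets closer.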
Solving the problem of Linear Regression (Using SGD)
Here are the high-level steps that we take in implementing a simple and naive Linear Regression model using SGD:

1. Random Initialization: Initialize the model with a line along the x-axis.
2. Calculate the error function for this line.
3. By making minor changes (d(slope) and d(intercept)) in slope and intercept, adjust the linear model to reduce the error function.
4. Repeat steps (2) and (3) until convergence.

Code

>>> from sklearn.linear_model import SGDRegressor
>>> sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
>>> sgd_reg.fit(X, y.ravel())
>>> sgd_reg.intercept_, sgd_reg.coef_
(array([ 4.18380366]), array([ 2.74205299]))
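The four steps above can be sketched without scikit-learn. This is a naive illustration, not how SGDRegressor is implemented: the function name, the learning rate, the fixed number of passes (used instead of a real convergence test) and the toy data from y = 2x + 1 are all assumptions.

```python
import random

def sgd_linear_regression(xs, ys, learning_rate=0.01, n_epochs=2000, seed=0):
    """Fit y = slope * x + intercept by updating on one sample at a time."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0  # step 1: start with a line along the x-axis
    indices = list(range(len(xs)))
    for _ in range(n_epochs):  # step 4: repeat (fixed number of passes here)
        rng.shuffle(indices)  # visit the samples in random order
        for i in indices:
            # step 2: error of the current line on one sample
            error = (slope * xs[i] + intercept) - ys[i]
            # step 3: small changes to slope and intercept that reduce
            # the squared error on that sample
            slope -= learning_rate * 2.0 * error * xs[i]
            intercept -= learning_rate * 2.0 * error
    return slope, intercept

# Hypothetical noise-free data from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = sgd_linear_regression(xs, ys)
print(slope, intercept)
```

"Stochastic" refers to the inner loop: each update uses a single randomly ordered sample rather than the gradient over the whole training set.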

4. Regression using Support Vector Machines

We start by explaining what an SVM is and then move on to using it for regression.

The fundamental idea behind SVMs is best explained with some pictures. The figures below show part of the iris dataset. The two classes can clearly be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances.

In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification. The circled points are your ‘support vectors’.
SVM Regression

The SVM algorithm is quite versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression. The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by a hyperparameter ϵ. The figure below shows two linear SVM Regression models trained on some random linear data, one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ = 0.5).
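To see what "on the street" and "off the street" mean numerically, the sketch below uses the ϵ-insensitive loss: an instance costs nothing while its error stays within ϵ of the line and is penalized linearly beyond that. It assumes ϵ is the distance from the center line to each edge of the street. The fixed line y = 2x, the data points and the helper names are made up for illustration, and no model is trained here.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street of half-width epsilon, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def off_street(xs, ys, slope, intercept, epsilon):
    """Return the points whose error exceeds epsilon (margin violations)."""
    return [
        (x, y)
        for x, y in zip(xs, ys)
        if epsilon_insensitive_loss(y, slope * x + intercept, epsilon) > 0.0
    ]

# Hypothetical data scattered around the line y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.2, 2.9, 3.7, 6.1, 9.4]

print(off_street(xs, ys, slope=2.0, intercept=0.0, epsilon=1.5))  # wide street
print(off_street(xs, ys, slope=2.0, intercept=0.0, epsilon=0.5))  # narrow street
```

With the wide street every point fits on it; narrowing the street leaves the two points with the largest errors outside.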

5. Regression using Decision Trees

First, we explain what Decision Trees are and how they work. Binary decision trees operate by subjecting attributes to a series of binary (yes / no) decisions. Each decision leads to one of two possibilities: either another decision or a prediction.

How Does a Binary Decision Tree Generate Predictions?

When an observation or row is passed to a nonterminal node, the row answers the node’s question. If it answers yes, the row of attributes is passed to the child node below and to the left of the current node. If the row answers no, the row of attributes is passed to the child node below and to the right of the current node. The process continues recursively until the row arrives at a terminal (that is, leaf) node where a prediction value is assigned to the row. The value assigned by the leaf node is the mean of the outcomes of all the training observations that wound up in the leaf node.

Below is the Decision Tree for the Iris Dataset.
Simple Pseudo Code for ‘Regression Using Decision Tree’ (Only for the Purpose of Demonstration)

Step 1: Find the average values of x and y over the interval. Let us call these values XA and YA.
Step 2: Split the curve into two by drawing a vertical line at x = XA.
Step 3: For x < XA, choose the average values of (x, y) from the left side, drawing a horizontal line passing through this point on the left side.
Step 4: For x > XA, choose the average values of (x, y) from the right side, drawing a horizontal line passing through this point on the right side.
Repeat steps (1) to (4) (n-1) times, where n is the depth you want in your decision tree.

Moving on to Regression. Below is our sample data:
Block diagram of depth 1 tree for simple problem
Comparison of predictions and actual values versus attribute for simple example

Notice how the predicted value for each region is always the average target value of the instances in that region. The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.
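The demonstration pseudo code given earlier (split at the average x, predict the average y on each side) can be sketched in plain Python. The function names and the sample points are hypothetical. Note that this mirrors only the simplified demonstration: a real regression tree chooses each split to reduce the prediction error, not simply at the mean of x.

```python
def mean(values):
    return sum(values) / len(values)

def fit_demo_tree(points, depth):
    """Build a tree from (x, y) points by splitting at the average x."""
    ys = [y for _, y in points]
    if depth == 0 or len(points) < 2:
        return {"value": mean(ys)}  # leaf: predict the average y
    split_x = mean([x for x, _ in points])  # step 1: XA
    left = [p for p in points if p[0] < split_x]  # steps 2 and 3
    right = [p for p in points if p[0] >= split_x]  # steps 2 and 4
    if not left or not right:  # all x values equal: nothing to split
        return {"value": mean(ys)}
    return {
        "split": split_x,
        "left": fit_demo_tree(left, depth - 1),
        "right": fit_demo_tree(right, depth - 1),
    }

def predict_demo_tree(tree, x):
    """Follow the yes/no decisions down to a leaf and return its value."""
    while "value" not in tree:
        tree = tree["left"] if x < tree["split"] else tree["right"]
    return tree["value"]

# Hypothetical sample data with two clearly separated groups.
points = [(1, 2), (2, 4), (3, 6), (7, 20), (8, 22), (9, 24)]
tree = fit_demo_tree(points, depth=1)
print(tree)
print(predict_demo_tree(tree, 2), predict_demo_tree(tree, 8))
```

A depth-1 tree produces two horizontal segments, one per side of the split; each extra level of depth splits every segment again.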
DecisionTreeRegressor using sklearn

>>> from sklearn.tree import DecisionTreeRegressor
>>> tree_reg = DecisionTreeRegressor(max_depth=2)
>>> tree_reg.fit(X, y)
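The claim that each region predicts the average target of its training instances can be checked with a depth-1 tree. The six data points below are made up for illustration and form two well-separated groups, so the single split falls between them.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a low group (x = 1..3) and a high group (x = 10..12).
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 2.0, 3.0, 9.0, 10.0, 11.0])

tree_reg = DecisionTreeRegressor(max_depth=1, random_state=0)
tree_reg.fit(X, y)

# One query point from each side of the split.
pred = tree_reg.predict(np.array([[2.5], [10.5]]))
print(pred)
print("group means:", y[:3].mean(), y[3:].mean())
```

The two predictions match the group means, which is the behaviour described above for each region of the tree.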

References

1. Linear Regression (Ordinary Least Squares algorithm)
   1.1. linear-regression-theory
   1.2. penalized linear regression
2, 3, 4: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
   Book by Aurelien Geron
5: Machine Learning in Python (Essential Techniques For Predictive Analysis)
   By: Michael Bowles
Tags: Machine Learning,Technology,

Thursday, February 9, 2023

Machine Learning Books (Mar 2020)

Download Books

Putting the books listed below into three categories based on complexity

I: Mathematical Theory

  • 1. Deep Learning
    Book by Aaron Courville, Ian Goodfellow, and Yoshua Bengio
  • 5. Pattern Recognition and Machine Learning
    Book by Christopher Bishop
  • 8. Understanding Machine Learning: From Theory to Algorithms
    Textbook by Shai Ben-David and Shai Shalev-Shwartz

II: Mix of Theory and Applied Study

  • 2. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
    Concepts, Tools, and Techniques to Build Intelligent Systems
    Book by Aurelien Geron

III: Applied Study

  • 6. The Hundred-Page Machine Learning Book
    Book by Andriy Burkov
  • 9. Machine Learning for Absolute Beginners: A Plain English Introduction
    Book by O. Theobald
  • 52. Machine Learning in Python (Essential Techniques For Predictive Analysis)
    By: Michael Bowles
  • 53. Fifty Algorithms Every Programmer Should Know (2e)
    By: Imran Ahmad (PhD)
  • 54. Applied Machine Learning and AI for Engineers
    Solve Business Problems That Can't Be Solved Algorithmically (Release 1)
    Jeff Prosise
    O’Reilly Media, Inc. (2022)

All

1. Deep Learning
   Book by Aaron Courville, Ian Goodfellow, and Yoshua Bengio
2. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
   Concepts, Tools, and Techniques to Build Intelligent Systems
   Book by Aurelien Geron
3. The Elements of Statistical Learning
   Book by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
4. An Introduction to Statistical Learning: With Applications in R
   Book
5. Pattern Recognition and Machine Learning
   Book by Christopher Bishop
6. The Hundred-Page Machine Learning Book
   Book by Andriy Burkov
7. Deep Learning with Python
   Book by François Chollet
8. Understanding Machine Learning: From Theory to Algorithms
   Textbook by Shai Ben-David and Shai Shalev-Shwartz
9. Machine Learning for Absolute Beginners: A Plain English Introduction
   Book by O. Theobald
10. Python Machine Learning
    Book by Sebastian Raschka
11. Artificial Intelligence: A Modern Approach
    Textbook by Peter Norvig and Stuart J. Russell
12. Introduction to Machine Learning
    Textbook by Ethem Alpaydın
13. Machine Learning: A Probabilistic Perspective
    Textbook by Kevin P. Murphy
14. Machine Learning for Hackers
    Book by Drew Conway and John Myles White
15. Programming Collective Intelligence
    Book: O'Reilly
16. Machine Learning For Dummies
    Book by John Mueller and Luca Massaron
17. Bayesian Reasoning and Machine Learning
    Book by David Barber
18. Reinforcement Learning: An Introduction
    Book by Andrew Barto and Richard S. Sutton
19. Learning from Data: A Short Course
    Book by Hsuan-Tien Lin, Malik Magdon-Ismail, and Yaser Abu-Mostafa
20. Machine Learning in Action
    Book by Peter Harrington
21. Machine Learning: The Art and Science of Algorithms that Make Sense of Data
    Book by Peter Flach
22. Introduction to Machine Learning with Python: A Guide for Data Scientists
    Author(s): Andreas C. Müller, Sarah Guido
    Publisher: O’Reilly Media, Year: 2016
23. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies
    Textbook by Aoife D'Arcy, Brian Mac Namee, and John D. Kelleher
24. Mining of Massive Datasets
    Book by Anand Rajaraman and Jeffrey Ullman
25. Foundations of Machine Learning
    Textbook by Afshin Rostamizadeh, Ameet Talwalkar, and Mehryar Mohri
26. Superintelligence: Paths, Dangers, Strategies
    Book by Nick Bostrom
27. Make Your Own Neural Network: A Gentle Journey Through the Mathematics ...
    Book by Tariq Rashid
28. Python Machine Learning: Machine Learning and Deep Learning with Python, Scikit-learn, and TensorFlow 2, 3rd Edition
    Book by Sebastian Raschka and Vahid Mirjalili
29. Machine Learning: An Algorithmic Perspective
    Book by Stephen Marsland
30. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
    Book by Pedro Domingos
31. Grokking Deep Learning
    Book by Andrew W. Trask
32. Advances in Financial Machine Learning
    Book by Marcos Lopez de Prado
33. Machine Learning: A Guide to Current Research
    Book by Tom M. Mitchell
34. Pattern Classification
    Book by David G. Stork, Peter E. Hart, and Richard O. Duda
35. Building Machine Learning Systems with Python - Second Edition
    Book by Luis Pedro Coelho and Willi Richert
36. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference
    Book by Cameron Davidson-Pilon
37. Information Theory, Inference and Learning Algorithms
    Textbook by David J. C. MacKay
38. Probabilistic Graphical Models: Principles and Techniques
    Book by Daphne Koller and Nir Friedman
39. Interpretable Machine Learning
    Book by Christoph Molnar
40. The Book of Why: The New Science of Cause and Effect
    Book by Dana Mackenzie and Judea Pearl
41. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms
    Book by Nicholas Locascio and Nikhil Buduma
42. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More
43. Think Stats
    Book by Allen B. Downey
44. Gaussian Processes for Machine Learning
    Book by Carl Edward Rasmussen and Christopher K. I. Williams
45. Data Mining: Practical Machine Learning Tools and Techniques
    Book
46. Machine Learning with R
    Book by Brett Lantz
47. Python Data Science Handbook: Essential Tools for Working with Data
    Book by Jake VanderPlas
48. Real world machine learning: video edition
    Book by Henrik Brink, Joseph Richards, and Mark Fetherolf
49. Machine Learning Algorithms: Popular Algorithms for Data Science and Machine Learning
    Book by Giuseppe Bonaccorso
50. Machine Learning: A Bayesian and Optimization Perspective
    Book by Sergios Theodoridis
51. Mathematics for Machine Learning
    Textbook by A. Aldo Faisal, Cheng Soon Ong, and Marc Peter Deisenroth
52. Machine Learning in Python (Essential Techniques For Predictive Analysis)
    By: Michael Bowles
53. Fifty Algorithms Every Programmer Should Know (2e)
    By: Imran Ahmad (PhD)
54. Applied Machine Learning and AI for Engineers
    Solve Business Problems That Can't Be Solved Algorithmically (Release 1)
    Jeff Prosise
    O’Reilly Media, Inc. (2022)
Tags: List of Books,Machine Learning,

Wednesday, February 8, 2023

Spark SQL in Images

1. Spark's components

2. Spark SQL Architecture

3. SQL Data Types

4. Spark's context objects

5. File Formats Supported By Spark

6. SQL Workflow

7. Catalyst Optimizer

The steps below explain the workflow of the Catalyst optimizer:

1. Analyzing a logical plan with the metadata
2. Optimizing the logical plan
3. Creating multiple physical plans
4. Analyzing the plans and finding the most optimal physical plan
5. Converting the physical plan to RDDs
Tags: Spark,Technology,

A Solved Exercise in RDD Filter and Join Operations (Interview Preparation)

Download Code and Data
Problem Statement:

Consider the Universal Identity Number (UIN) data scenario with two datasets: UIN customer data and bank account linking data.

UIN Card data (UINCardData.csv):
Schema Details: UIN, MobileNumber,Gender,SeniorCitizens,Income

Bank account link data (BankAccountLink.csv):
Schema Details: MobileNumber, LinkedtoBankAccount, BankAccountNumber

Requirement

Join both datasets and find the UIN number that is not linked with the Bank Account number. Print UIN number and BankAccountNumber.
Save the final output to a specified HDFS directory.