Monday, May 29, 2023

Ch 2 - Descriptive Statistics

Descriptive Data Summarization

- Distributive measure: sum, count
- Algebraic measure: mean, weighted mean
- Holistic measure (expensive): median
- Mode: the value occurring most frequently
- Midrange: the average of the largest and smallest values
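As a rough illustration, the measures listed above can be computed with Python's standard `statistics` module. The sketch below uses the age values from the practice problem in this chapter; the weighted-mean inputs (`scores`, `weights`) are hypothetical and only show the formula.

```python
import statistics

# Age values from the practice problem in this chapter (already sorted).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

total = sum(ages)                       # distributive measure
count = len(ages)                       # distributive measure
mean = total / count                    # algebraic: built from sum and count
median = statistics.median(ages)        # holistic: needs the whole data set
modes = statistics.multimode(ages)      # every value tied for the top frequency
midrange = (min(ages) + max(ages)) / 2  # average of smallest and largest value

# Weighted mean on hypothetical data: each score counts in proportion to its weight.
scores = [80, 90]
weights = [1, 3]
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

print(f"sum={total}, count={count}, mean={mean:.2f}")
print(f"median={median}, modes={modes}, midrange={midrange}")
print(f"weighted mean={weighted_mean}")
```

The number of entries in `modes` indicates the modality: one entry means unimodal, two means bimodal, and so on.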

Practice Problem

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70
- What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
- What is the midrange of the data?

- - - - - -

The data characteristics discussed so far are called central tendencies. Another important characteristic is the dispersion (or variance) of the data, which is described by:
- Range
- Five-number summary (based on quartiles)
- Interquartile range

- - - - - -

Percentile

The kth percentile of a set of data in numerical order is the value x(i) having the property that k percent of the data entries lie at or below x(i). The median is the 50th percentile.
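The definition above can be sketched with the nearest-rank convention: take the smallest value such that at least k percent of the entries lie at or below it. This is only one of several conventions in use (many libraries interpolate between neighbouring values instead), and the function name `percentile_nearest_rank` is made up for this example.

```python
import math

def percentile_nearest_rank(sorted_values, k):
    """Return the smallest value with at least k percent of entries at or below it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in the range (0, 100]")
    rank = math.ceil(k * len(sorted_values) / 100)  # 1-based position in the sorted data
    return sorted_values[rank - 1]

# Age values from the practice problems in this chapter (already sorted).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

q1 = percentile_nearest_rank(ages, 25)      # first quartile
median = percentile_nearest_rank(ages, 50)  # 50th percentile
q3 = percentile_nearest_rank(ages, 75)      # third quartile
print(q1, median, q3)
```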

Practice Problem

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70 Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

Box Plot

Practice Problem

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70 Show a box plot.
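A box plot is drawn from the five-number summary, so a minimal sketch of the numbers behind it may help. The 1.5 x IQR rule used below for the whiskers and for flagging outliers is a common convention and is an assumption here, not something stated in these notes; `statistics.quantiles` with `method="inclusive"` is one of several quartile conventions.

```python
import statistics

# Age values from the practice problem above (already sorted).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

# Quartiles: the edges of the box and the line inside it.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
five_number = (min(ages), q1, median, q3, max(ages))

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("five-number summary:", five_number)
print("IQR:", iqr, "whiskers:", (whisker_low, whisker_high), "outliers:", outliers)
```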

Box Plot for Outlier Analysis

Variance and Standard Deviation
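As a rough sketch, the variance is the average squared deviation from the mean and the standard deviation is its square root. The version below divides by N (the population form); dividing by N - 1 instead gives the sample form. The age values are reused from the practice problems in this chapter.

```python
import math
import statistics

# Age values from the practice problems in this chapter.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

n = len(ages)
mean = sum(ages) / n

# Population variance: mean of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in ages) / n
std_dev = math.sqrt(variance)

# The standard library offers the same calculation; sample versions divide by n - 1.
library_variance = statistics.pvariance(ages)
sample_std_dev = statistics.stdev(ages)

print(f"mean={mean:.2f}, variance={variance:.2f}, std dev={std_dev:.2f}")
print(f"sample std dev={sample_std_dev:.2f}")
```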

Scatter Plot

A scatter plot visualizes the data as points in 2D or 3D.

In 2D

In 3D

Code

# Import libraries
from mpl_toolkits import mplot3d  # registers the "3d" projection on older Matplotlib versions
import numpy as np
import matplotlib.pyplot as plt

# Creating dataset: 50 random integer coordinates per axis
z = np.random.randint(100, size=50)
x = np.random.randint(80, size=50)
y = np.random.randint(60, size=50)

# Creating figure
fig = plt.figure(figsize=(10, 7))
ax = plt.axes(projection="3d")

# Creating plot
ax.scatter3D(x, y, z, color="green")
plt.title("Demo of 3D scatter plot")

# Show plot
plt.show()

Practice Problem

Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following result.
Calculate the mean, median and standard deviation of age and %fat.

Answers
For the variable age, the mean is 46.44, the median is 51, and the standard deviation is 12.85.
For the variable %fat, the mean is 28.78, the median is (???), and the standard deviation is 8.99.

Q: Draw the boxplots for age and %fat.
Q: Draw a scatter plot based on these two variables.

Correlation

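Answer

Correlation coefficient (Pearson's product moment coefficient):
- Moment: a - mean(A)
- Product moment: (a - mean(A))(b - mean(B))
- Sum the product moments over all N pairs, then divide by N (std_a)(std_b):

r(A, B) = sum of (a - mean(A))(b - mean(B)) / (N (std_a)(std_b))

Result: 0.82. Since it is greater than 0, the two variables are positively correlated. The scatter plot showed the same thing (refer to the earlier slide).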
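The steps in the answer above can be sketched as a small function. The function name `pearson_r` and the sample data (`hours`, `scores`) are hypothetical and chosen only so the result is easy to check by eye; they are not the hospital data from the practice problem.

```python
import math

def pearson_r(a, b):
    """Pearson correlation: summed product moments divided by N * std_a * std_b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("both sequences need the same length (at least 2)")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by N), matching the formula's N.
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if std_a == 0 or std_b == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    product_moments = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return product_moments / (n * std_a * std_b)

# Hypothetical data: scores rise in lockstep with hours.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

r_up = pearson_r(hours, scores)          # close to +1: points fall on a rising line
r_down = pearson_r(hours, scores[::-1])  # close to -1: points fall on a falling line
print(round(r_up, 4), round(r_down, 4))
```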

Now In Code

How Are Scatter Plots and Correlation Related?

The correlation coefficient ranges from -1 to 1. By looking at the scatter plot of two variables, we can make a rough estimate of their correlation.
Tags: Technology,Python,Data Analytics,

Ch 1 - What is Data Mining?

What is Data Mining?

- extracting knowledge from large amounts of data
- cleaning, integration, selection, transformation, mining/processing, pattern evaluation, presentation

- - - - -

What kind of patterns can be mined/found by data mining techniques?

1. Characterization and Discrimination
2. Frequent patterns, Associations, and Correlations
3. Classification and Prediction
4. Cluster analysis
5. Outlier analysis
6. Evolution analysis

Give examples of each of the following

Characterization and Discrimination

- Characteristics of customers who buy a certain kind of product
- Customers who buy product A vs. customers who buy another product B

Data Characterization: This refers to summarizing the data of the class under study. The class under study is called the target class.

Data Discrimination: This refers to the mapping or classification of a class with some predefined group or class.

- - - - -

Frequent patterns, Associations, and Correlations

- What kind of products do customers buy together? (Example of association mining)
- If a customer buys product A, what is the chance that he/she will buy product B as well? (Example of association mining)

Frequent patterns are itemsets, subsequences, or substructures that appear in a data set with a frequency no less than a user-specified threshold. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.

- - - - -
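A minimal sketch of the frequent-itemset idea, restricted to pairs of items: count how often each pair appears together and keep the pairs whose support meets the threshold. The transactions, the `min_support` value, and the helper `confidence` are all hypothetical and serve only to illustrate the definitions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data: each set is one customer's basket.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]
min_support = 0.6  # hypothetical user-specified threshold (fraction of baskets)

# Count every pair of items that occurs together in a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
frequent_pairs = {
    pair: count / n
    for pair, count in pair_counts.items()
    if count / n >= min_support
}

def confidence(antecedent, consequent):
    """Estimate the chance that a basket with `antecedent` also has `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

print("frequent pairs:", frequent_pairs)
print("confidence(milk -> bread):", confidence("milk", "bread"))
```

Real association-mining algorithms such as Apriori extend this to larger itemsets and prune candidates that cannot be frequent; this sketch only counts pairs directly.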

Classification and Prediction

- Classify the sales of various products into different classes
- Predict the sales of the product

Classification: Imagine a T-shirt store. As a customer, you tell the salesman your age, weight and height, and the salesman shows you T-shirts of an appropriate size: small, medium or large.

Prediction: This is a forecast, such as how many T-shirts the store might sell this month.

- - - - -

Cluster analysis

- Divide the data into different groups of similar items
- The number of clusters is not known a priori

Suppose we want to open a T-shirt store. We might not know which sizes of T-shirts should be stocked. We collect data such as weight and height, and then we have to group them into classes like small, medium and large. The question is: is this list of three sizes exhaustive, or can there be more sizes like XL or XXL? Who would tell us which sizes there should be?

- - - - -

Outlier analysis

- What is an outlier? Data that deviates from what is expected.

Example: a fraudulent transaction. The amount might be very high compared with the user's other routine transactions, because the criminal may try to pull out as much money as possible before the stolen card or hacked account is blocked.

- - - - -
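One simple way to sketch the fraud example is to compare each new amount with the user's routine history and flag amounts that lie many standard deviations above the usual level. The amounts, the threshold of 3 standard deviations, and the function name `flag_unusual` are all assumptions made for illustration; real fraud detection uses far richer models.

```python
import statistics

def flag_unusual(history, new_amounts, threshold=3.0):
    """Return the new amounts lying more than `threshold` std devs above the history mean."""
    mean = statistics.fmean(history)
    std_dev = statistics.stdev(history)  # sample standard deviation of routine amounts
    flagged = []
    for amount in new_amounts:
        z_score = (amount - mean) / std_dev
        if z_score > threshold:
            flagged.append(amount)
    return flagged

# Hypothetical routine transaction amounts for one user.
history = [20, 35, 25, 40, 30, 28, 32]
# Hypothetical new transactions; one is far above the routine level.
new_amounts = [33, 950, 45]

suspicious = flag_unusual(history, new_amounts)
print("suspicious amounts:", suspicious)
```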

Evolution analysis

- Time series analysis of data

Example: identifying the current trend in the stock market, that is, whether it is saturated, bullish or bearish.
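A very rough sketch of trend identification compares a short moving average with a longer one: if recent prices sit clearly above the longer-run level the trend is labeled bullish, if clearly below, bearish. The window sizes, the 1% tolerance, the price series, and the label "flat" (standing in loosely for a market that is going nowhere) are all assumptions for illustration.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def trend(prices, short=3, long=5, tolerance=0.01):
    """Label the latest trend by comparing short-run and long-run averages."""
    if len(prices) < long:
        raise ValueError("need at least `long` prices")
    short_avg = sum(prices[-short:]) / short
    long_avg = sum(prices[-long:]) / long
    change = (short_avg - long_avg) / long_avg  # relative gap between the averages
    if change > tolerance:
        return "bullish"
    if change < -tolerance:
        return "bearish"
    return "flat"

# Hypothetical closing prices.
rising = [100, 102, 104, 107, 110]
falling = [110, 107, 104, 102, 100]
sideways = [100, 101, 100, 101, 100]

print(trend(rising), trend(falling), trend(sideways))
print(moving_average([1, 2, 3, 4], 2))
```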

Discuss whether or not each of the following activities is a data mining task.

Q: Dividing the customers of a company according to their gender.
A: No, this is a simple database query.

Q: Dividing the customers of a company according to their profitability.
A: No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.

Q: Predicting the outcomes of tossing a (fair) pair of dice.
A: No. Since the dice are fair, this is a probability calculation.

Q: Predicting the future stock price of a company using historical records.
A: Yes. We would attempt to create a model that can predict the continuous value of the stock price. This is an example of the area of data mining known as predictive modeling.

Q: Monitoring seismic waves for earthquake activities.
A: Yes. In this case, we would build a model of the different types of seismic wave behavior associated with earthquake activities and raise an alarm when one of these types of seismic activity was observed. This is an example of the area of data mining known as classification.
Tags: Data Analytics,Technology

Sunday, May 28, 2023

Data Analytics Books (May 2023)

Download Books
1.
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
Tom Fawcett, 2013

2.
Big Data: A Revolution That Will Transform How We Live, Work, and Think
Viktor Mayer-Schönberger, 2013

3.
Storytelling With Data: A Data Visualization Guide for Business Professionals
Cole Nussbaumer Knaflic, 2015

4.
Python for Data Analysis
Wes McKinney, 2011

5.
Naked Statistics: Stripping the Dread from the Data
Charles Wheelan, 2012

6.
Business unIntelligence: Insight and Innovation beyond Analytics and Big Data
Barry Devlin, 2013

7.
Too Big to Ignore: The Business Case for Big Data
Phil Simon, 2013

8.
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data
2014

9.
Lean Analytics: Use Data to Build a Better Startup Faster
Benjamin Yoskovitz, 2013

10.
Artificial Intelligence: A Guide for Thinking Humans
Melanie Mitchell, 2019

11.
Data Strategy: How to Profit from a World of Big Data, Analytics and the Internet of Things
Bernard Marr, 2017

12.
The Hundred-Page Machine Learning Book
Andriy Burkov, 2019

13.
Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python
Peter Bruce, 2017

14.
Learning R: A Step-by-Step Function Guide to Data Analysis
Richard Cotton, 2013

15.
The Art of Statistics: How to Learn from Data
David Spiegelhalter, 2019

16.
Developing Analytic Talent: Becoming a Data Scientist
Vincent Granville, 2014

17.
Data Smart: Using Data Science to Transform Information into Insight
John W. Foreman, 2013

18.
R for Data Science
Hadley Wickham, 2016

19.
Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, Or Die
Eric Siegel, 2013

20.
Now You See it: Simple Visualization Techniques for Quantitative Analysis
Stephen Few, 2009

21.
Predictive Analytics For Dummies
Anasse Bari, 2013

22.
Data Analytics: Become a Master in Data Analytics
Richard Dorsey, 2017

23.
Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again
Eric Topol, 2019

24.
Creating Value with Social Media Analytics: Managing, Aligning, and Mining Social Media Text, Networks, Actions, Location, Aps, Hyperlinks, Multimedia, and Search Engines Data
Gohar F. Khan, 2018

25.
The Quick Python Book
Kenneth McDonald, 1999

26.
Numsense! Data Science for the Layman: No Math Added
Annalyn Ng, 2017

27.
Weapons of Math Destruction
Cathy O'Neil, 2016

28.
Business Analytics: Data Analysis & Decision Making
Wayne L. Winston, 2014

29.
Microsoft Excel Data Analysis and Business Modeling
Wayne L. Winston, 2004

30.
A PRACTITIONER'S GUIDE TO BUSINESS ANALYTICS: Using Data Analysis Tools to Improve Your Organization's Decision Making and Strategy
Randy Bartlett, 2012

31.
Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions
Matt Taddy, 2019

32.
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
Seth Stephens-Davidowitz, 2017

33.
Rebooting AI: Building Artificial Intelligence We Can Trust
Ernest Davis, 2019

34.
Data analysis using SQL and Excel
Gordon Linoff, 2007

35.
An introduction to statistical methods and data analysis
Lyman Ott, 1977

36.
Doing Data Science: Straight Talk from the Frontline
Cathy O'Neil, 2013

37.
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
Walter Shields, 2015

38.
Big Data in Practice: How 45 Successful Companies Used Big Data Analytics to Deliver Extraordinary Results
Bernard Marr, 2016

39.
Data Analytics for Beginners: Basic Guide to Master Data Analytics
Paul Kinley, 2016

40.
Data Strategy: How to Profit from a World of Big Data, Analytics and Artificial Intelligence
Bernard Marr, 2021

41.
SQL for Data Analytics: Perform Fast and Efficient Data Analysis with the Power of SQL
Upom Malik, 2019

42.
The Data Detective: Ten Easy Rules to Make Sense of Statistics
Tim Harford, 2021

43.
Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies
John D. Kelleher, 2015

44.
From Big Data to Big Profits: Success with Data and Analytics
Russell Walker, 2015

45.
Analytics in a Big Data World. The Essential Guide to Data Science and Its Applications
Bart Baesens, 2014

46.
Competing on Analytics: The New Science of Winning
Thomas H. Davenport, 2007

47.
The Elements of Statistical Learning
Trevor Hastie, 2001

48.
Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses
Michael Minelli, 2012

49.
Marketing Analytics: Data-Driven Techniques with Microsoft Excel
Wayne L. Winston, 2014

50.
Head First Statistics
Dawn Griffiths, 2008
Tags: List of Books,Technology,Python,Machine Learning,

Python Quiz (13 Questions, May 2023)

Q1: What will be the output of:

>>> s = 'malayalam'
>>> s.strip('mal')


Q2: Which of these are valid variable names?

a. &code = 'abc'
b. discount% = 90
c. _ = "Alpha"
d. string = "Beta"


Q3: What will be the output of:

>>> s = "   Python Program   "
>>> s.lstrip("P")


Q4: What will be the output of:

>>> l = ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon']
>>> l[-2][2]


Q5: What will be the output of:

var = 0

if var:
    print("In If")
elif (var == 0):
    print("In Elif 1")
elif (var == 0):
    print("In Elif 2")
else:
    print("In Else")


Q6: What will be the output of:

for i in range(10):
    if(i == 5):
        break
    else:
        print(i, sep = " ")
else:
    print("In Else 2")


Q7: What will be the output of:

for x in range(6):
  print(x)
else:
  print("Finally finished!")


Q8: A riddle.

>>> t = (1, 2, [30, 40])
>>> t[2] += [50, 60]

What happens next? Choose the best answer:

a) t becomes (1, 2, [30, 40, 50, 60]).
b) TypeError is raised with the message 'tuple' object does not support item assignment.
c) Neither.
d) Both a and b.


Q9: Which of these gives you back a dict:

a) a = dict(one=1, two=2, three=3)
b) b = {'one': 1, 'two': 2, 'three': 3}
c) c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d) d = dict([('two', 2), ('one', 1), ('three', 3)])
e) e = dict({'three': 3, 'one': 1, 'two': 2})


Q10: What is the output of:

i = 01
print(i + 5)


Q11: What is the output of:

import re
s = "Malaaavikaa"
s = re.sub("a{2}", "*", s) 
print(s)


Q12: What is the output of:

class Person():
    def __init__(self, pid):
        self.pid = pid
        
obama = Person(100)

obama.age = 49

print(obama.age + 2)


Q13: l = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']

Sort this list based on string length in one line.
Tags: Python,Technology,