Descriptive Data Summarization
- Distributive measure: sum, count
- Algebraic measure: mean, weighted mean
- Holistic measure (expensive): median
- Mode: the value occurring most frequently
- Midrange: the average of the largest and smallest values
Practice Problem
Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70
- What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
- What is the midrange of the data?

The data characteristics discussed so far are measures of central tendency. Another important characteristic is the dispersion, or variance, of the data:
- Range
- Five-number summary (based on quartiles)
- Interquartile range

Percentile
The kth percentile of a set of data in numerical order is the value x(i) having the property that k percent of the data entries lie at or below x(i). The median is the 50th percentile.
Practice Problem
Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70
Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
Box Plot
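A box plot is drawn from the five-number summary: minimum, Q1, median, Q3 and maximum. As a minimal sketch, the nearest-rank rule below is one way to read the percentile definition (the smallest value with at least k percent of the entries at or below it). The helper name `percentile_nearest_rank` is hypothetical, and other quartile conventions give slightly different values.

```python
import math

# Age values from the practice problem, already in increasing order
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]


def percentile_nearest_rank(sorted_data, k):
    """Return the smallest value with at least k percent of entries at or below it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in the interval (0, 100]")
    rank = math.ceil(k / 100 * len(sorted_data))  # 1-based position in the sorted list
    return sorted_data[rank - 1]


# Five-number summary: the values a box plot is built from
five_number_summary = {
    "min": ages[0],
    "Q1": percentile_nearest_rank(ages, 25),
    "median": percentile_nearest_rank(ages, 50),
    "Q3": percentile_nearest_rank(ages, 75),
    "max": ages[-1],
}
print(five_number_summary)
```

The box spans Q1 to Q3, the line inside the box marks the median, and the whiskers extend toward the minimum and maximum.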
Practice Problem
Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70
Show a box plot.
Box Plot for Outlier Analysis
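A common convention, assumed here rather than taken from the slides, flags a value as an outlier when it lies more than 1.5 × IQR below Q1 or above Q3. A minimal sketch on the age data, using `statistics.quantiles` with its default method (which may give slightly different quartiles than a hand calculation):

```python
import statistics as st

# Age values from the practice problem
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

# Quartiles with the default "exclusive" method of statistics.quantiles
q1, median, q3 = st.quantiles(ages, n=4)
iqr = q3 - q1  # interquartile range

# Assumed convention: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

In a box plot, such values are usually drawn as individual points beyond the whiskers.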
Variance and Standard Deviation
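Variance is the average squared deviation from the mean, and standard deviation is its square root. The sketch below computes both on the age data from the practice problems, in the population form (divide by N) and the sample form (divide by N - 1). The `st.variance` and `st.stdev` calls in the notebook cells are the sample versions.

```python
import math
import statistics as st

# Age values from the practice problem
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

n = len(ages)
mean = sum(ages) / n
squared_deviations = [(a - mean) ** 2 for a in ages]

# Population form divides by N; sample form divides by N - 1
population_variance = sum(squared_deviations) / n
sample_variance = sum(squared_deviations) / (n - 1)

population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print("Population variance / std:", population_variance, population_std)
print("Sample variance / std:", sample_variance, sample_std)
```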
Scatter Plot
A scatter plot visualizes the data as points in 2D or 3D.
In 2D
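A minimal 2D sketch in the same spirit as the 3D demo, assuming NumPy and Matplotlib are installed. The data is random and purely illustrative, and the seed is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dataset: 50 random integer points (seeded for repeatability)
rng = np.random.default_rng(seed=0)
x = rng.integers(0, 80, size=50)
y = rng.integers(0, 60, size=50)

# Create the figure and draw each (x, y) pair as one point
fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(x, y, color="green")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Demo of 2D scatter plot")

# Show the plot
plt.show()
```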
In 3D
Code
# Import libraries
from mpl_toolkits import mplot3d  # registers the "3d" projection on older Matplotlib versions
import numpy as np
import matplotlib.pyplot as plt

# Creating dataset
z = np.random.randint(100, size=50)
x = np.random.randint(80, size=50)
y = np.random.randint(60, size=50)

# Creating figure
fig = plt.figure(figsize=(10, 7))
ax = plt.axes(projection="3d")

# Creating plot
ax.scatter3D(x, y, z, color="green")
plt.title("Demo of 3D scatter plot")

# Show plot
plt.show()
Practice Problem
Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following result. Calculate the mean, median and standard deviation of age and %fat.
Answers
For the variable age, the mean is 46.44, the median is 51, and the standard deviation is 12.85. For the variable %fat, the mean is 28.78, the median is (???), and the standard deviation is 8.99.
Q: Draw the boxplots for age and %fat.
Q: Draw a scatter plot based on these two variables.
Correlation
Answer
Correlation coefficient (Pearson's product-moment coefficient):
- Moment: a - mean(A)
- Product moment: (a - mean(A))(b - mean(B))
- Sum the product moments over all N pairs, then divide by N·(std_a)(std_b)
The result is 0.82; since it is > 0, the two variables are positively correlated. The scatter plot showed the same thing. Refer to earlier slide…
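The recipe above can be sketched directly in Python. The function name `pearson_correlation` and the sample lists are hypothetical, and the sketch assumes the population standard deviation (dividing by N), which is what pairs with the N in the denominator.

```python
import math


def pearson_correlation(a_values, b_values):
    """Pearson's r: sum of product moments divided by N * std_a * std_b."""
    if len(a_values) != len(b_values) or len(a_values) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(a_values)
    mean_a = sum(a_values) / n
    mean_b = sum(b_values) / n

    # Population standard deviations (divide by N); assumed to be non-zero
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in a_values) / n)
    std_b = math.sqrt(sum((b - mean_b) ** 2 for b in b_values) / n)

    # Sum of product moments (a - mean(A)) * (b - mean(B))
    product_moments = sum((a - mean_a) * (b - mean_b)
                          for a, b in zip(a_values, b_values))
    return product_moments / (n * std_a * std_b)


# Hypothetical sample data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_correlation(x, y)
print(round(r, 4))
```

A constant column has zero standard deviation, so this sketch raises `ZeroDivisionError` for it; the coefficient is undefined in that case.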
Now In Code
In [1]:
# Applying basic statistics functions to a given data set stored as a list
# Available python packages to implement the same : Pandas, NumPy, SciPy & statsmodels
In [2]:
# For example (note: this list has 26 values, one more 35 than the practice-problem list)
DataSet = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 36, 40, 45, 46, 52, 70]
# Sum of all elements using the simple built-in sum function
print("Sum of all items of Data Set : " + str(sum(DataSet)))
# Getting the count of each item using the Counter collection
"""
Counter is a dict subclass where elements are stored as dict keys and their counts as dict values
"""
from collections import Counter
print("Count of each items in Data Set : ")
print(Counter(DataSet))
# Use of statistics module
import statistics as st
# Mean -> Sum of all data items / total no of data items
print("Mean of Data Set : ")
print(st.mean(DataSet))
# Median -> Middle value of the sorted data (average of the two middle items when the count is even)
print("Median of Data Set : ")
print(st.median(DataSet))
# Mode -> Item with highest frequency of appearance
print("Mode of Data Set : ")
print(st.mode(DataSet))
# Mid-range -> Average of the maximum and minimum items
print("Mid Range Value Of Data Set : ")
print(st.mean([max(DataSet), min(DataSet)]))
# Other Useful statistical measures
print("Quantiles Of Data Set : ")
print(st.quantiles(data = DataSet, n = 4)) # [20.0, 25.0, 35.25]
print("Std. Deviation Of Data Set : ")
print(st.stdev(DataSet))
print("Variance Of Data Set : ")
print(st.variance(DataSet))
Sum of all items of Data Set : 774
Count of each items in Data Set : 
Counter({25: 4, 35: 3, 16: 2, 20: 2, 22: 2, 33: 2, 13: 1, 15: 1, 19: 1, 21: 1, 30: 1, 36: 1, 40: 1, 45: 1, 46: 1, 52: 1, 70: 1})
Mean of Data Set : 
29.76923076923077
Median of Data Set : 
25.0
Mode of Data Set : 
25
Mid Range Value Of Data Set : 
41.5
Quantiles Of Data Set : 
[20.0, 25.0, 35.25]
Std. Deviation Of Data Set : 
13.158442741624686
Variance Of Data Set : 
173.14461538461538
In [3]:
import pandas as pd
In [4]:
df = pd.read_csv('HeightWeight.csv')
In [5]:
df.head()
Out[5]:
| | Index | Height(Inches) | Weight(Pounds) |
|---|---|---|---|
| 0 | 1 | 65.78331 | 112.9925 |
| 1 | 2 | 71.51521 | 136.4873 |
| 2 | 3 | 69.39874 | 153.0269 |
| 3 | 4 | 68.21660 | 142.3354 |
| 4 | 5 | 67.78781 | 144.2971 |
In [8]:
st.correlation(df['Height(Inches)'], df['Weight(Pounds)'])
Out[8]:
0.5028585206028441
Linear Regression
In [6]:
# statistics.linear_regression is new in Python 3.10
slope, intercept = st.linear_regression(df['Height(Inches)'], df['Weight(Pounds)'])
In [7]:
slope, intercept
Out[7]:
(3.0834764454029657, -82.57574306454092)
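The fitted line can then be used for prediction: weight ≈ slope × height + intercept. A minimal sketch using the slope and intercept copied from Out[7]; the helper name `predict_weight` and the 70-inch input are hypothetical.

```python
# Coefficients copied from the linear_regression output above
slope = 3.0834764454029657
intercept = -82.57574306454092


def predict_weight(height_inches):
    """Predicted weight in pounds for a given height in inches."""
    return slope * height_inches + intercept


# Hypothetical query: a person 70 inches tall
print(round(predict_weight(70), 2))
```

With a correlation of about 0.50 between the two columns, such predictions are only rough estimates.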
In [2]:
import pandas as pd
import matplotlib.pyplot as plt
# Read the data
df = pd.read_csv(r'bodyfat.csv')
# Adding title and axis labels
plt.title("Scatter Plot using matplotlib")
plt.xlabel("BodyFat")
plt.ylabel("Age")
# Plotting the data
plt.scatter(df['BodyFat'], df['Age'], alpha=0.8)
plt.show()
In [5]:
import pandas as pd
import matplotlib.pyplot as plt
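# Read the data
df = pd.read_csv(r'bodyfat.csv')
data = df[['BodyFat', 'Age']]
# Adding title
plt.title("Box Plot using matplotlib")
# Plotting the box plot (one box per column)
plt.boxplot(data.to_numpy(), patch_artist=True)
# Labelling each box with its column name
plt.xticks([1, 2], list(data.columns))
# Showing the plot
plt.show()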
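HOW ARE SCATTER PLOTS AND CORRELATION RELATED?
The correlation coefficient ranges from -1 to 1. By looking at the scatter plot of two variables, we can make a rough estimate of their correlation: points that rise from left to right suggest a positive value, points that fall suggest a negative value, and a cloud with no linear trend suggests a value near 0.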