Monday, May 29, 2023

Ch 2 - Descriptive Statistics

Descriptive Data Summarization

- Distributive Measure: sum, count
- Algebraic Measure: mean, weighted mean
- Holistic Measure (expensive): median
- Mode: value occurring most frequently
- Midrange: average of largest and smallest values
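To make the three kinds of measure concrete, here is a minimal sketch. It assumes the data is split across two partitions (the lists `part_a` and `part_b` are hypothetical, taken from the first ten age values of the practice problem below). Sum and count can be combined from per-partition results, the mean follows from those two numbers, but the median generally needs the full data.

```python
from statistics import median

# Hypothetical partitions of the data (e.g. stored on two machines).
part_a = [13, 15, 16, 16, 19]
part_b = [20, 20, 21, 22, 22]

# Distributive: per-partition results combine directly.
total = sum(part_a) + sum(part_b)
count = len(part_a) + len(part_b)

# Algebraic: derived from a fixed number of distributive results.
mean = total / count

# Holistic: combining per-partition medians does not give the true median,
# so the whole data set has to be looked at together.
overall_median = median(part_a + part_b)
median_of_medians = (median(part_a) + median(part_b)) / 2

print(total, count, mean)                 # 184 10 18.4
print(overall_median, median_of_medians)  # 19.5 18.5
```

The last line shows why holistic measures are called expensive: the shortcut of averaging the partition medians gives a different number than the median of the combined data.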

Practice Problem

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70

- What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
- What is the midrange of the data?

- - - - - -

The data characteristics we discussed so far are measures of Central Tendency. Another important characteristic is the Dispersion or Variance of the data, which is described by:

- Range
- Five-number summary (based on quartiles)
- Interquartile range

- - - - - -
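One way to check the mode, midrange and range of the age values is sketched below, using only the Python standard library. The variable names are illustrative; `multimode` returns every value tied for the highest frequency, so the length of its result tells us the modality.

```python
from statistics import multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

# All values that occur most frequently (one value -> unimodal,
# two values -> bimodal, and so on).
modes = multimode(ages)

# Midrange: average of the smallest and largest values.
midrange = (min(ages) + max(ages)) / 2

# Range: difference between the largest and smallest values.
value_range = max(ages) - min(ages)

print(modes)        # [25]
print(midrange)     # 41.5
print(value_range)  # 57
```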

Percentile

The kth percentile of a set of data in numerical order is the value x(i) having the property that k percent of the data entries lie at or below x(i). The median is the 50th percentile.
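The definition above can be sketched in code as follows. This uses the nearest-rank convention, which is only one of several ways to compute percentiles; the function name `percentile_nearest_rank` is hypothetical.

```python
import math

def percentile_nearest_rank(sorted_values, k):
    """Return the smallest value with at least k percent of entries at or below it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in the interval (0, 100]")
    # 1-based rank of the value we are looking for.
    rank = math.ceil(k / 100 * len(sorted_values))
    return sorted_values[rank - 1]

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

# The 50th percentile is the median.
print(percentile_nearest_rank(ages, 50))   # 25
print(percentile_nearest_rank(ages, 100))  # 70
```

Libraries such as NumPy interpolate between neighbouring values by default, so their results can differ slightly from this convention on other data sets.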

Practice Problem

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70

Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
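A rough check of the quartiles, together with the five-number summary and the interquartile range, can be sketched with `statistics.quantiles`. The choice of `method="inclusive"` is an assumption; other quartile conventions can give slightly different values, which is why the question says "roughly".

```python
from statistics import quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

# Cut points that split the sorted data into four equal parts.
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(ages), q1, q2, q3, max(ages))

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

print(five_number_summary)  # (13, 20.0, 25.0, 35.0, 70)
print(iqr)                  # 15.0
```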

Box Plot

Practice Problem

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order):

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 36, 40, 45, 46, 52, 70

Show a box plot.
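One possible way to draw this box plot is sketched below with matplotlib. The output file name `age_boxplot.png` is a placeholder; in an interactive session `plt.show()` could be used instead of saving to a file.

```python
import matplotlib.pyplot as plt

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

fig, ax = plt.subplots(figsize=(4, 6))

# whis=1.5 places the whiskers at the most extreme data points that lie
# within 1.5 * IQR of the box; points beyond them are drawn individually.
box = ax.boxplot(ages, whis=1.5)

ax.set_ylabel("age")
ax.set_xticks([])
ax.set_title("Box plot of age")

# Placeholder file name; use plt.show() when working interactively.
fig.savefig("age_boxplot.png")

# The individually drawn points (potential outliers).
print(list(box["fliers"][0].get_ydata()))
```

The box spans Q1 to Q3, the line inside the box marks the median, and the whiskers extend to the smallest and largest values that are not flagged as outliers.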

Box Plot for Outlier Analysis
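A common rule of thumb for box-plot outliers flags any value that lies more than 1.5 × IQR below Q1 or above Q3. The sketch below applies that rule to the age data from the practice problems; the quartile method (`"inclusive"`) and the names `lower_fence` and `upper_fence` are assumptions made for illustration.

```python
from statistics import quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

q1, _, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Values outside these fences are treated as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(lower_fence, upper_fence)  # -2.5 57.5
print(outliers)                  # [70]
```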

Variance and Standard Deviation
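The variance is the average squared distance of the values from their mean, and the standard deviation is its square root. The sketch below computes both for the age data from the practice problems, assuming the population form (dividing by N); the sample form divides by N - 1 instead and is available as `statistics.variance` and `statistics.stdev`.

```python
from statistics import mean, pstdev, pvariance

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 36, 40, 45, 46, 52, 70]

mu = mean(ages)

# Population variance computed by hand: mean of squared deviations.
variance_by_hand = sum((a - mu) ** 2 for a in ages) / len(ages)

# The same quantities from the standard library.
variance_pop = pvariance(ages)
std_pop = pstdev(ages)

print(f"mean = {mu:.2f}")
print(f"variance = {variance_pop:.2f}")
print(f"standard deviation = {std_pop:.2f}")
```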

Scatter Plot

A plot useful in visualizing the data as points in 2D or 3D is called Scatter Plot.

In 2D
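A minimal 2D scatter plot can be sketched as follows. The data is random and purely illustrative, and the output file name `scatter_2d.png` is a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative random dataset (seeded so the plot is reproducible).
rng = np.random.default_rng(0)
x = rng.integers(0, 80, size=50)
y = rng.integers(0, 60, size=50)

# Creating figure
fig, ax = plt.subplots(figsize=(7, 5))

# Creating plot: one point per (x, y) pair
points = ax.scatter(x, y, color="green")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Demo of 2D scatter plot")

# Placeholder file name; use plt.show() when working interactively.
fig.savefig("scatter_2d.png")
```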

In 3D

Code:

    # Import libraries
    from mpl_toolkits import mplot3d
    import numpy as np
    import matplotlib.pyplot as plt

    # Creating dataset
    z = np.random.randint(100, size=(50))
    x = np.random.randint(80, size=(50))
    y = np.random.randint(60, size=(50))

    # Creating figure
    fig = plt.figure(figsize=(10, 7))
    ax = plt.axes(projection="3d")

    # Creating plot
    ax.scatter3D(x, y, z, color="green")
    plt.title("Demo of 3D scatter plot")

    # show plot
    plt.show()

Practice Problem

Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following result.
Calculate the mean, median and standard deviation of age and %fat.

Answers

For the variable age, the mean is 46.44, the median is 51, and the standard deviation is 12.85. For the variable %fat, the mean is 28.78, the median is (???), and the standard deviation is 8.99.

Q: Draw the boxplots for age and %fat.

Q: Draw a scatter plot based on these two variables.

Correlation

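Answer

Correlation coefficient (Pearson's product moment coefficient):

- Moment: a - mean(A)
- Product moment: (a - mean(A))(b - mean(B))
- Sum the product moments over all N data pairs and divide by N.(std_a)(std_b)

r(A, B) = sum of (a - mean(A))(b - mean(B)) over all pairs / (N.(std_a)(std_b))

The result is 0.82; since it's > 0, the two variables are positively correlated. The scatter plot also showed the same thing (refer to the earlier slide).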

Now In Code
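A minimal sketch of Pearson's coefficient, following the product-moment formula from the answer above. The function name `pearson_r` and the small data lists are hypothetical and chosen only to show the two extremes of the coefficient; they are not the hospital data.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson's product moment coefficient of two equally long sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum of product moments: (a - mean(A)) * (b - mean(B)) over all pairs.
    product_moments = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Population standard deviations (divide by N), matching the formula.
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return product_moments / (n * std_a * std_b)

# Hypothetical data: y_up rises with x, y_down falls with x.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

print(round(pearson_r(x, y_up), 6))    # 1.0
print(round(pearson_r(x, y_down), 6))  # -1.0
```

Note that the function divides by the standard deviations, so it fails with a `ZeroDivisionError` if either sequence is constant.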

How Are Scatter Plot and Correlation Related?

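The correlation coefficient ranges from -1 to 1. By looking at the scatter plot of two variables, we can make a rough estimate of their correlation: points that rise from lower left to upper right suggest a positive correlation, points that fall from upper left to lower right suggest a negative correlation, and a cloud with no clear direction suggests a correlation near 0.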