Showing posts with label Data Analytics. Show all posts

Sunday, October 1, 2023

What are the Data Science Roles?

1. Data Strategist

Ideally, before a company collects any data, it hires a data strategist—a senior professional who understands how data can create value for businesses.

According to the famous data strategist Bernard Marr, there are four main ways how companies in different fields can use data science:

They can make data-driven decisions.

Data can help create smarter products and services.

Companies can use data to improve business processes.

They could create a new revenue stream via data monetization.

Companies often outsource such data roles. They hire external consultants to devise a plan that aligns with the organizational strategy.

Once a firm has a data strategy in place, it is time to ensure data availability. This is when a data architect comes into play.

2. Data Architect

A data architect (or data modeler) plans out high-level database structures. This involves the planning, organization, and management of information within a firm, ensuring its accuracy and accessibility. In addition, they must assess the needs of business stakeholders and optimize schemas to address them.

Such data roles are of crucial importance. Without proper data architecture, key business questions may remain answered due to the lack of coherence between different tables in the database.

A data architect is a senior professional and often a consultant. To become one, you'd need a solid resume and rigorous preparation for the interview process.

3. Data Engineer

The role of data engineers and data architects often overlaps—especially in smaller businesses. But there are key differences.

Data engineers build the infrastructure, organize tables, and set up the data to match the use cases defined by the architect. What's more, they handle the so-called ETL process, which stands for Extract, Transform, and Load. This involves retrieving data, processing it in a usable format, and moving it to a repository (the firm's database). Simply put, they pipe data into tables correctly.

Typically, they receive many ad-hoc ETL-related tasks throughout their work but rarely interact with business stakeholders directly. This is one of the best-paid data scientist roles, and for good reason. You need a plethora of skills to work in this position, including software engineering.

Okay, let's recap.

As you can see, the jobs in data science are interlinked and complement each other, but each position has slightly different requirements. First come data strategists who define how data can serve business goals. Next, the architect plans the database schemas necessary to achieve the objectives. Lastly, the engineers build the infrastructure and pipe the data into tables.

4. Data Analyst

Data analysts explore, clean, analyze, visualize, and present information, providing valuable insights for the business. They typically use SQL to access the database.

Next, they leverage an object-oriented programming language like Python or R to clean and analyze data and rely on visualization tools, such as Power BI or Tableau, to present the findings.

Side note: What is Data analysis?

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.

Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information. In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All of the above are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.

5. Business Intelligence Analyst

Data analyst's and BI analyst's duties overlap to a certain extent, but the latter has more of a reporting role. Their main focus is on building meaningful reports and dashboards and updating them frequently. More importantly, they have to satisfy stakeholders' informational needs at different levels of the organization.

6. Data Scientist

A data scientist has the skills of a data analyst but can leverage machine and deep learning to create models and make predictions based on past data.

We can distinguish three main types of data scientists:

# Traditional data scientists

# Research scientists

# Applied scientists

A traditional data scientist does all sorts of tasks, including data exploration, advanced statistical modeling, experimentation via A/B testing, and building and tuning machine learning models.

Research scientists primarily work on developing new machine learning models for large companies.

Applied scientists—frequently hired in big tech and larger companies—boast one of the highest-paid jobs in data science. These specialists combine data science and software engineering skills to productionize models.

More prominent companies prefer this combined skillset because it allows one person to oversee the entire ML implementation process—from the model building until productionization—which leads to quicker results. An applied scientist can work with data, model it for machine learning, select the correct algorithm, train the model, fine-tune hyperparameters, and then put the model in production.

As you can see, there's a significant overlap between data scientists, data analysts, and BI analysts. The image below is a simplified illustration of the similarities and differences between these data science roles.

7. ML Ops Engineer

Companies that don't have applied scientists hire ML Ops engineers. They are responsible for putting the ML models prepared by traditional data scientists into production.

In many instances, ML Ops engineers are former data scientists who have developed an engineering skillset. Their main responsibilities are to put the ML model in production and fix it if something breaks.

8. Data Product Manager

The last role we discuss in this article is that of а product manager. The person in this position is accountable for the success of a data product. They consider the bigger picture, identifying what product needs to be created, when to build it, and what resources are necessary.

A significant focus of such data science roles is data availability—determining whether to collect data internally or find ways to acquire it externally. Ultimately, product managers strategize the most effective ways to execute the production process.

What is Data Science Lifecycle?

A data science lifecycle indicates the iterative steps taken to build, deliver and maintain any data science product. All data science projects are not built the same, so their life cycle varies as well. Still, we can picture a general lifecycle that includes some of the most common data science steps. A general data science lifecycle process includes the use of machine learning algorithms and statistical practices that result in better prediction models. Some of the most common data science steps involved in the entire process are data extraction, preparation, cleansing, modeling, evaluation etc. The world of data science refers to this general process as the “Cross Industry Standard Process for Data Mining”.

Who Are Involved in The Projects?

Domain Expert : The data science projects are applied in different domains or industries of real life like Banking, Healthcare, Petroleum industry etc. A domain expert is a person who has experience working in a particular domain and knows in and out about the domain.

Business analyst : A business analyst is required to understand the business needs in the domain identified. The person can guide in devising the right solution and timeline for the same.

Data Scientist : A data scientist is an expert in data science projects and has experience working with data and can work out the solution as to what data is needed to produce the required solution.

Machine Learning Engineer : A machine learning engineer can advise on which model to be applied to get the desired output and devise a solution to produce the correct and required output.

Data Engineer and Architect : Data architects and Data engineers are the experts in the modeling of data. Visualization of data for better understanding, as well as storage and efficient retrieval of data, are looked after by them.

The Lifecycle of Data Science

1. Problem identification

This is the crucial step in any Data Science project. The first thing is understanding in what way Data Science is useful in the domain under consideration and identification of appropriate tasks which are useful for the same. Domain experts and Data Scientists are the key persons in the problem identification of problem. Domain expert has in depth knowledge of the application domain and exactly what the problem is to be solved. Data Scientist understands the domain and help in identification of problem and possible solutions to the problems.

2. Business Understanding

Understanding what customer exactly wants from the business perspective is nothing but Business Understanding. Whether customer wish to do predictions or want to improve sales or minimize the loss or optimize any particular process etc forms the business goals. During business understanding, two important steps are followed:

KPI (Key Performance Indicator)

For any data science project, key performance indicators define the performance or success of the project. There is a need to be an agreement between the customer and data science project team on Business related indicators and related data science project goals. Depending on the business need the business indicators are devised and then accordingly the data science project team decides the goals and indicators. To better understand this let us see an example. Suppose the business need is to optimise the overall spendings of the company, then the data science goal will be to use the existing resources to manage double the clients. Defining the Key performance Indicators is very crucial for any data science projects as the cost of the solutions will be different for different goals.

SLA (Service Level Agreement)

Once the performance indicators are set then finalizing the service level agreement is important. As per the business goals the service level agreement terms are decided. For example, for any airline reservation system simultaneous processing of say 1000 users is required. Then the product must satisfy this service requirement is the part of service level agreement.

Once the performance indicators are agreed and service level agreement is completed then the project proceeds to the next important step.

3. Collecting Data

Data Collection is the important step as it forms the important base to achieve targeted business goals. There are various ways the data will flow into the system as listed below:

1. Surveys

2. Social media

3. Archives

4. Transactional data

5. Enterprise data

6. Statistical methods

4. Pre-processing data

Large data is collected from archives, daily transactions and intermediate records. The data is available in various formats and in various forms. Some data may be available in hard copy formats also. The data is scattered at various places on various servers. All these data are extracted and converted into single format and then processed. Typically, as data warehouse is constructed where the Extract, Transform and Loading (ETL) process or operations are carried out. In the data science project this ETL operation is vital and important. A data architect role is important in this stage who decides the structure of data warehouse and perform the steps of ETL operations.

5. Analyzing data

Now that the data is available and ready in the format required then next important step is to understand the data in depth. This understanding comes from analysis of data using various statistical tools available. A data engineer plays a vital role in analysis of data. This step is also called as Exploratory Data Analysis (EDA). Here the data is examined by formulating the various statistical functions and dependent and independent variables or features are identified. Careful analysis of data revels which data or features are important and what is the spread of data. Various plots are utilized to visualize the data for better understanding. The tools like Tableau, PowerBI etc are famous for performing Exploratory Data Analysis and Visualization. Knowledge of Data Science with Python and R is important for performing EDA on any type of data.

6. Data Modelling

Data modelling is the important next step once the data is analysed and visualized. The important components are retained in the dataset and thus data is further refined. Now the important is to decide how to model the data? What tasks are suitable for modelling? The tasks, like classification or regression, which is suitable is dependent upon what business value is required. In these tasks also many ways of modelling are available. The Machine Learning engineer applies various algorithms to the data and generates the output. While modelling the data many a times the models are first tested on dummy data similar to actual data.

7. Model Evaluation/ Monitoring

As there are various ways to model the data so it is important to decide which one is effective. For that model evaluation and monitoring phase is very crucial and important. The model is now tested with actual data. The data may be very few and in that case the output is monitored for improvement. There may be changes in data while model is being evaluated or tested and the output will drastically change depending on changes in data. So, while evaluating the model following two phases are important:

Data Drift Analysis

Changes in input data is called as data drift. Data drift is common phenomenon in data science as depending on the situation there will be changes in data. Analysis of this change is called Data Drift Analysis. The accuracy of the model depends on how well it handles this data drift. The changes in data are majorly because of change in statistical properties of data.

Model Drift Analysis

To discover the data drift machine learning techniques can be used. Also, more sophisticated methods like Adaptive Windowing, Page Hinkley etc. are available for use. Modelling Drift Analysis is important as we all know change is constant. Incremental learning also can be used effectively where the model is exposed to new data incrementally.

8. Model Training

Once the task and the model are finalised and data drift analysis modelling is finalized then the important step is to train the model. The training can be done is phases where the important parameters can be further fine tuned to get the required accurate output. The model is exposed to the actual data in production phase and output is monitored.

9. Model Deployment

Once the model is trained with the actual data and parameters are fine tuned then model is deployed. Now the model is exposed to real time data flowing into the system and output is generated. The model can be deployed as web service or as an embedded application in edge or mobile application. This is very important step as now model is exposed to real world.

10. Driving insights and generating BI reports

After model deployment in real world, the next step is to find out how the model is behaving in real-world scenario. The model is used to get insights that aid in strategic decisions related to business. The business goals are bound to these insights. Various reports are generated to see how business is driving. These reports help in finding out if key process indicators are achieved or not.

11. Taking a decision based on insight

For data science to do wonders, every step indicated above has to be done very carefully and accurately. When the steps are followed properly, then the reports generated in the above step help in making key decisions for the organization. The insights generated help in taking strategic decisions for example, the organization can predict that there will be a need for raw materials in advance. Data science can be of great help in making many important decisions related to business growth and better revenue generation.

Friday, August 4, 2023

Mapping the AI Finance Services Roadmap: Enhancing the Financial Landscape

Introduction

Artificial Intelligence (AI) has rapidly transformed the financial services industry, revolutionizing how we manage money, make investments, and access personalized financial advice. From robo-advisors to AI-driven risk management, the potential for AI in finance services is boundless. In this article, we'll navigate the AI Finance Services Roadmap, exploring the key milestones and opportunities that are reshaping the financial landscape and empowering consumers and businesses alike.

The Development of AI in the Financial Industry

Step 1: Personalized Financial Planning with Robo-Advisors

Robo-advisors have emerged as a revolutionary AI-powered tool that democratizes access to sophisticated financial planning. These platforms use AI algorithms to analyze an individual's financial situation, risk tolerance, and goals, enabling the creation of personalized investment portfolios. With lower fees and greater convenience, robo-advisors are transforming how we plan for our financial future.

Step 2: AI-Driven Credit Scoring and Lending

AI has revolutionized the lending process by introducing more efficient and accurate credit scoring models. By analyzing vast amounts of data, including transaction history, social media behavior, and online presence, AI algorithms can assess creditworthiness more effectively. This has opened up new avenues for individuals and businesses to access loans and credit facilities.

Step 3: Fraud Detection and Cybersecurity

The financial services industry faces persistent threats from cybercriminals. AI-based fraud detection systems can analyze vast data streams in real time, detecting suspicious activities and protecting against potential threats. By bolstering cybersecurity measures with AI, financial institutions can safeguard sensitive customer information and maintain trust in their services.

Step 4: AI-Powered Virtual Assistants

AI virtual assistants are reshaping customer interactions in the finance sector. These intelligent chatbots provide personalized support, answer inquiries, and perform routine tasks, enhancing the overall customer experience. By automating these processes, financial institutions can improve efficiency and focus on delivering high-value services to their clients.

Step 5: AI for Compliance and Regulatory Reporting

Compliance and regulatory reporting are critical aspects of the financial services industry. AI technologies can streamline these processes, ensuring adherence to complex regulations and reporting requirements. AI-driven solutions can identify potential compliance issues and proactively address them, reducing the risk of costly penalties and reputational damage.

Step 6: AI-Enhanced Risk Management

AI-powered risk management solutions provide more accurate and real-time risk assessment. These tools analyze historical data and market trends, enabling financial institutions to identify potential risks and make data-driven decisions. Enhanced risk management fosters stability and resilience, even in volatile market conditions.

Conclusion

The AI Finance Services Roadmap is shaping a future where financial services are more accessible, personalized, and secure than ever before. From robo-advisors offering tailored investment strategies to AI-driven fraud detection systems protecting against cyber threats, the transformative power of AI is revolutionizing the financial landscape. As we continue to innovate and embrace AI technologies, the potential for growth, efficiency, and customer satisfaction in the financial services industry is limitless. By navigating the AI Finance Services Roadmap, we can ensure a prosperous and inclusive financial future for individuals and businesses worldwide.

Overall, the AI finance services roadmap is promising. AI has the potential to improve efficiency, accuracy, and customer experience in the financial industry. However, there are also some challenges that need to be addressed before AI can be fully adopted in the financial sector.

I hope this article was helpful. If you have any questions, please feel free to leave a comment below.

Monday, July 24, 2023

Smoothing (Part of Data Analytics Course)

Highlighting Algorithm Followed For Smoothing The Data

1. Decide which kind of binning you want to use?

- Equal frequency

- Equal width

2. Once you have binned the data, you have to decide whether you are going to assign the bin with a value from:

- mean

- median

- boundary

3. Replace each bin value by the formula selected in Step 2.

Smoothing(noisy data)

Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.

Equal-frequency partitioning

What is Smoothing by bin mean/median/boundary?

How do we define the first bin?

We need a bin that encloses 5, 10 and 11.

(4.5, 11.5]: This is also correct but let’s look at Pandas.

What Pandas has created is:

(4.999, 12.5]: Range exlusive of 4.999 and starting from there. Also range inclusive of 12.5 and ending there.

Is it wrong? No.

Next bin:

(12.5, 42.5]: Is it wrapping the elements 13, 15 and 35?

Next bin would start at 42.5. Can we say this?

Smoothing(noisy data)

Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.

Equal-frequency partitioning

What is Smoothing by bin mean/median/boundary?

Replace each bin value is replaced by mean/median/nearest boundary

On smoothing by bin-boundary (bins follow equal-frequency partitioning):

Bin 1: 5, 13, 13, 13

As 5 is closer to boundary value ‘5’. And, 10, 11 are closer to boundary value ‘13’

Bin 2: 15, 15, 55, 55

Bin 3: 72, 72, 215, 215

Original:

Smoothing by equal-frequency binning using the mean of each bin

1. creation of bins 
In code: pd.qcut()

2. grouping the data according to bins 
In code: df.groupby()

3. find the mean of each group
In code: df.groupby().mean()

4. create a map of bin labels and mean values
In code: it is essentially a dictionary that looks like this:
{
  '(4.999, 14.333]': 9.75, 
  '(14.333, 60.667]': 38.75, 
  '(60.667, 215.0]': 145.75
}

A dictionary is simply key-value pairs.
5. Populate a new column containing the mean of each bin for each data point.

Smoothing (noisy data)

Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.

Equal-width partitioning

The width of each interval is (215 - 5)/3 = 70.

Perform Smoothing by bin mean/median/boundary.

Bins using equal width partitioning.

Elements: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

The width of each interval is (215 – 5)/3 = 70.

Domain for bin-1: 5 up to, but not, 75 (= 5 + 70)

Domain for bin-2: 75 to 144

Domain for bin-3: 145 Onwards (inc. 215 from the input data set)

Data Analytics Course Using Python (Index)

Chapters

Chapters Pending Publication

(6) decision tree
(7) measuring classification
(8) linear regression between two variables
(9) categorical encoding
(10) naive bayes classifier
(11) logistic regression
(12) support vector machines
(13) one-vs-rest and one-vs-one strategies for multiclass classification
(14) generalization and overfitting
(15) kNN classification
(16) clustering and kmeans
(17) clustering (kmeans)
(18) clustering (agglomerative hierarchical clustering)
(19) clustering - dbscan
(20) cluster evaluation
(21) random forest
(22) five regression algorithms
(23) Ensemble techniques
Exercises

Equal Frequency Binning (Part of Data Analytics Course)

import pandas as pd
l = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
s = pd.Series(l)
s.nunique()


12


bins = pd.qcut(s, 3) # Equal frequency binning

How do we make data smooth using mean?

df = pd.DataFrame(
    {
        'data': l,
        'bins': bins
    }
)

df



t = df.groupby(['bins']).mean(['data'])

t



map_of_mean_values = {}
for i in t.iterrows():
    
    # type(i[0]): Interval
    # type(i[1]): Series
    
    map_of_mean_values.update({str(i[0]): i[1][0]})

map_of_mean_values

{'(4.999, 14.333]': 9.75, '(14.333, 60.667]': 38.75, '(60.667, 215.0]': 145.75}

df['bins'] = df['bins'].astype(str)

df['smoothed_values'] = df['bins'].apply(lambda x: map_of_mean_values[x])

df



Loading&#8230;

Thursday, June 1, 2023

Ch 3 - Scaling and Normalization

Scaling vs. Normalization: What's the difference?

One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. The difference is that:

% in scaling, you're changing the range of your data, while
% in normalization, you're changing the shape of the distribution of your data.

Ref: kaggle

Scaling

Why scale data?

Example:

Marks1 Marks2 Marks3

Student 1 280 70 60

Student 2 200 60 55

Student 3 270 40 30

Euclidean distance(s1,s2) = 80.77
Euclidean Distance(s1,s3) = 43.58
Euclidean Distance(s2,s3) = 71.58

Euclidean distance between Student-1 and Student-2 is being dominated by the marks1.
Same is the case for Euclidean Distance calculation between student-2 and student-3. There also the e. distance is being high because of difference in the marks-1 attribute.
If we see the e. distance for student-1 and student-3, there the distance is not high because marks-1 are close to each each other (viz. 280 and 270). Unlike 200 & 270 or 200 & 280.

Why do we have to scale / normalize the input for an artificial neural network?

There are 2 Reasons why we have to Normalize Input Features before Feeding them to Neural Network:

Reason 1:
If a Feature in the Dataset is big in scale compared to others then this big scaled feature becomes dominating and as a result of that, Predictions of the Neural Network will not be Accurate.

Example: In case of Employee Data, if we consider Age and Salary, Age will be a Two Digit Number while Salary can be 7 or 8 Digit (1 Million, etc..). In that Case, Salary will Dominate the Prediction of the Neural Network. But if we Normalize those Features, Values of both the Features will lie in the Range from (0 to 1).

Reason 2:
Front Propagation of Neural Networks involves the Dot Product of Weights with Input Features. So, if the Values are very high (for Image and Non-Image Data), Calculation of Output takes a lot of Computation Time as well as Memory. Same is the case during Back Propagation. Consequently, Model Converges slowly, if the Inputs are not Normalized.

Example: If we perform Image Classification, Size of Image will be very huge, as the Value of each Pixel ranges from 0 to 255. Normalization in this case is very important.

Mentioned below are the instances where Normalization is very important:

% K-Means
% K-Nearest-Neighbours
% Principal Component Analysis (PCA)
% Gradient Descent

Ref: stackoverflow

z-score is 0.67 in both cases.
Z-score is understood as how many standard deviations away the point is away from the mean.

What is there is an outlier in the data?

When there is outlier (on the higher side) in the data, MinMaxScaler sets the outlier data point to 1 and squashes all the other points to values near 0.

So, if you doubt that there are outliers in the data, don't use MinMaxScaler. Use StandardScaler.

Now in code

Min Max Scaler

In [8]:

import pandas as pd
import statsmodels as sm

In [2]:

df = pd.read_csv('student_marks.csv')

In [4]:

df

Out[4]:

	Student	Marks1	Marks2	Marks3
0	1	280	70	60
1	2	200	60	55
2	3	270	40	30
3	Harshitha	250	55	46
4	Yaju	260	54	45
5	Sahaj	230	55	35

In [7]:

# Short summary
df['Marks1'].describe()

Out[7]:

count      6.000000
mean     248.333333
std       29.268869
min      200.000000
25%      235.000000
50%      255.000000
75%      267.500000
max      280.000000
Name: Marks1, dtype: float64

In [9]:

import statsmodels.stats.descriptivestats as ds

In [11]:

# Long summary
desc_stats = ds.describe(df['Marks1'])
print("desc_stats using statsmodels : ", desc_stats)

desc_stats using statsmodels :                        Marks1
nobs                6.000000
missing             0.000000
mean              248.333333
std_err             4.878145
upper_ci          257.894321
lower_ci          238.772345
std                29.268869
iqr                32.500000
iqr_normal         24.092286
mad                22.222222
mad_normal         27.851425
coef_var            0.117861
range              80.000000
max               280.000000
min               200.000000
skew               -0.660190
kurtosis            2.228664
jarque_bera         0.584591
jarque_bera_pval    0.746548
mode              200.000000
mode_freq           0.166667
median            255.000000
1%                201.500000
5%                207.500000
10%               215.000000
25%               235.000000
50%               255.000000
75%               267.500000
90%               275.000000
95%               277.500000
99%               279.500000

/home/ashish/anaconda3/lib/python3.9/site-packages/statsmodels/stats/descriptivestats.py:418: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
mode_res = stats.mode(ser.dropna())

class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False) new_min: 0 (Default value as given by feature_range) new_max: 1 (Default value)

In [14]:

marks_col = ['Marks1', 'Marks2', 'Marks3']
for i in marks_col:
minmarks = df[i].describe().loc['min']
maxmarks = df[i].describe().loc['max']
print(i, minmarks, maxmarks)

Marks1 200.0 280.0
Marks2 40.0 70.0
Marks3 30.0 60.0

In [17]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_marks = scaler.fit_transform(df[marks_col])
scaled_marks = pd.DataFrame(scaled_marks)

In [19]:

scaled_marks.columns = ['Scaled1', 'Scaled2', 'Scaled3']

In [20]:

scaled_marks

Out[20]:

	Scaled1	Scaled2	Scaled3
0	1.000	1.000000	1.000000
1	0.000	0.666667	0.833333
2	0.875	0.000000	0.000000
3	0.625	0.500000	0.533333
4	0.750	0.466667	0.500000
5	0.375	0.500000	0.166667

>>> print(scaler.data_max_) [ 1. 18.] >>> print(scaler.transform(data)) [[0. 0. ] [0.25 0.25] [0.5 0.5 ] [1. 1. ]] >>> print(scaler.transform([[2, 2]])) [[1.5 0. ]]

In [ ]:

With outlier in data: MinMaxScaler squashes the other values¶

In [21]:

dfo = pd.read_csv('student_marks_outlier.csv')

In [22]:

dfo

Out[22]:

	Student	Marks1	Marks2	Marks3
0	1	280	70	60
1	2	200	60	55
2	3	270	40	30
3	Harshitha	250	55	46
4	Yaju	260	54	45
5	Sahaj	230	55	35
6	Outlier	5000	5000	5000

In [23]:

scaled_marks = scaler.fit_transform(dfo[marks_col])
scaled_marks = pd.DataFrame(scaled_marks)

In [24]:

scaled_marks

Out[24]:

	0	1	2
0	0.016667	0.006048	0.006036
1	0.000000	0.004032	0.005030
2	0.014583	0.000000	0.000000
3	0.010417	0.003024	0.003219
4	0.012500	0.002823	0.003018
5	0.006250	0.003024	0.001006
6	1.000000	1.000000	1.000000

Thing to remember: Do not use MinMaxScaler on data with outliers.¶

In [ ]:

Standard Scaler

Q1: Which subject is easy?

In [1]:

import statistics as st

In [2]:

import pandas as pd

In [3]:

df = pd.read_csv('marks.csv')

In [4]:

df

Out[4]:

	subject	marks	mean	std
0	Sub1	70	60	15
1	Sub2	72	68	6

class statistics.NormalDist(mu=0.0, sigma=1.0) Returns a new NormalDist object where mu represents the arithmetic mean and sigma represents the standard deviation.zscore = (x - mean) / std

Assuming there are 100K records in this kind of format. Then you would not go to fill z-score for these records. You will need something automated.¶

In [7]:

def get_zscore(in_row):
return (in_row['marks'] - in_row['mean']) / in_row['std']

df['zscore'] = df.apply(get_zscore, axis = 1)

In [8]:

df

Out[8]:

	subject	marks	mean	std	zscore
0	Sub1	70	60	15	0.666667
1	Sub2	72	68	6	0.666667

In [13]:

# How many std. away the data point is form the mean?
# This is given by zscore.
print((70-60)/15)
print((72-68)/6)

0.6666666666666666
0.6666666666666666

In [10]:

Out[10]:

0.6666666666666666

In [ ]:

Q2: Applying zscore normalization using Scikit Learn on Pandas DataFrame

In [1]:

import pandas as pd

In [2]:

df = pd.read_csv('age_fat.csv')

In [3]:

df

Out[3]:

	age	%fat
0	23	9.5
1	23	26.5
2	27	7.8
3	27	17.8
4	39	31.4
5	41	25.9
6	47	27.4
7	49	27.2
8	50	31.2
9	52	34.6
10	54	42.5
11	54	28.8
12	56	33.4
13	57	30.2
14	58	34.1
15	58	32.9
16	60	41.2
17	61	35.7

In [4]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = scaler.fit_transform(df)

In [9]:

df_scaled = pd.DataFrame(data.round(2), columns = ['age', 'fat'])

In [10]:

df_scaled

Out[10]:

	age	fat
0	-1.83	-2.14
1	-1.83	-0.25
2	-1.51	-2.33
3	-1.51	-1.22
4	-0.58	0.29
5	-0.42	-0.32
6	0.04	-0.15
7	0.20	-0.18
8	0.28	0.27
9	0.43	0.65
10	0.59	1.53
11	0.59	0.00
12	0.74	0.51
13	0.82	0.16
14	0.90	0.59
15	0.90	0.46
16	1.06	1.38
17	1.13	0.77

In [ ]:

Q3: Visualizing what minmax scaler and standard scaler do to the data.

In [16]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [44]:

df = pd.read_csv('../2_scatterplot and boxplot using matplotlib/bodyfat.csv')

In [45]:

df = df[['Age', 'BodyFat']]

In [34]:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = scaler.fit_transform(df)

In [36]:

df_std = pd.DataFrame(data.round(2), columns = ['age_std', 'fat_std'])

In [37]:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

df_minmax = scaler.fit_transform(df)
df_minmax = pd.DataFrame(df_minmax.round(2),
                        columns = ['age_minmax', 'fat_minmax'])

In [38]:

df = pd.concat([df, df_std, df_minmax], axis = 1)

In [39]:

df

Out[39]:

	Age	BodyFat	age_std	fat_std	age_minmax	fat_minmax
0	23	12.3	-1.74	-0.82	0.02	0.26
1	22	6.1	-1.82	-1.56	0.00	0.13
2	22	25.3	-1.82	0.74	0.00	0.53
3	26	10.4	-1.50	-1.05	0.07	0.22
4	24	28.7	-1.66	1.14	0.03	0.60
...	...	...	...	...	...	...
247	70	11.0	2.00	-0.98	0.81	0.23
248	72	33.6	2.16	1.73	0.85	0.71
249	72	29.3	2.16	1.22	0.85	0.62
250	72	26.0	2.16	0.82	0.85	0.55
251	74	31.9	2.31	1.53	0.88	0.67

252 rows × 6 columns

In [43]:

plt.hist(df['Age'])
plt.show() 

In [41]:

plt.hist(df['age_std'])
plt.show() 

In [42]:

plt.hist(df['age_minmax'])
plt.show() 

In [ ]:

Pages

Sunday, October 1, 2023

1. Data Strategist

2. Data Architect

3. Data Engineer

4. Data Analyst

Side note: What is Data analysis?

5. Business Intelligence Analyst

6. Data Scientist

7. ML Ops Engineer

8. Data Product Manager

Who Are Involved in The Projects?

The Lifecycle of Data Science

1. Problem identification

2. Business Understanding

KPI (Key Performance Indicator)

SLA (Service Level Agreement)

3. Collecting Data

4. Pre-processing data

5. Analyzing data

6. Data Modelling

7. Model Evaluation/ Monitoring

Data Drift Analysis

Model Drift Analysis

8. Model Training

9. Model Deployment

10. Driving insights and generating BI reports

11. Taking a decision based on insight

Friday, August 4, 2023

Introduction

Monday, July 24, 2023

Highlighting Algorithm Followed For Smoothing The Data

Smoothing(noisy data)

Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.

Equal-frequency partitioning

Smoothing(noisy data)

Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.

Equal-frequency partitioning

Smoothing (noisy data)

Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215

Partition them into three bins by each of the following methods.

Equal-width partitioning

The width of each interval is (215 - 5)/3 = 70.

Chapters

Chapters Pending Publication

How do we make data smooth using mean?

Thursday, June 1, 2023

Scaling vs. Normalization: What's the difference?

Scaling

Why do we have to scale / normalize the input for an artificial neural network?

What is there is an outlier in the data?

Now in code

Min Max Scaler

With outlier in data: MinMaxScaler squashes the other values¶

Thing to remember: Do not use MinMaxScaler on data with outliers.¶

Standard Scaler

Q1: Which subject is easy?

Assuming there are 100K records in this kind of format. Then you would not go to fill z-score for these records. You will need something automated.¶

Q2: Applying zscore normalization using Scikit Learn on Pandas DataFrame

Q3: Visualizing what minmax scaler and standard scaler do to the data.