Q1: What is Data Science Lifecycle?
Link to the solution
Q2: What are Data Science Roles?
Link to the solution
Q3: What all fields and subfields are closely related to Data Science?
Link to the solution
Q4: Draw the Data Science Venn Diagram.
Link to further explanation
Q5: What is EDA? And, what is CDA?
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods. A statistical model can be
used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and it thereby
contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970
to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data
collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on
checking assumptions required for model fitting and hypothesis testing, handling missing values, and making
transformations of variables as needed. EDA encompasses IDA.
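A minimal EDA sketch in Python, assuming pandas and matplotlib are installed and a hypothetical file named data.csv:

import pandas as pd
import matplotlib.pyplot as plt

# Load a data set (the file name is hypothetical)
df = pd.read_csv("data.csv")

# Summarize the main characteristics of each column
df.info()                 # column types and non-null counts
print(df.describe())      # count, mean, std, min, quartiles, max

# Visualize distributions and pairwise relationships
df.hist(figsize=(10, 8))
pd.plotting.scatter_matrix(df.select_dtypes("number"), figsize=(10, 8))
plt.show()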
Initial Data Analysis
The most important distinction between the initial data analysis phase and the main analysis phase is that during
initial data analysis one refrains from any analysis that is aimed at answering the original research question.
The initial data analysis phase is guided by the following four questions:
1. Quality of data
The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using
different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), normality
checks (skewness, kurtosis, frequency histograms), and an assessment of whether imputation of missing values is
needed. Further checks include:
# Analysis of extreme observations: outlying observations in the data are analyzed to see if they seem to disturb
the distribution.
# Comparison and correction of differences in coding schemes: variables are compared with coding schemes of
variables external to the data set, and possibly corrected if coding schemes are not comparable.
# Test for common-method variance.
The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses
that will be conducted in the main analysis phase.
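As a hedged illustration, some of these quality checks could look like the following pandas sketch (the file name
and the column names are hypothetical):

import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical data set

# Frequency counts for a categorical variable
print(df["payment_method"].value_counts(dropna=False))

# Descriptive statistics: mean, standard deviation, median
print(df["income"].agg(["mean", "std", "median"]))

# Normality indicators: skewness and kurtosis
print(df["income"].skew(), df["income"].kurt())

# Extreme observations: flag values more than 3 standard deviations from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
print(df[z.abs() > 3])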
2. Quality of measurements
The quality of the measurement instruments should only be checked during the initial data analysis phase when this
is not the focus or research question of the study. One should check whether the structure of the measurement
instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:
# Confirmatory factor analysis
# Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement
instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's alpha of
the scales, and the change in Cronbach's alpha when an item is deleted from a scale.
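Cronbach's alpha can be computed directly from its definition, alpha = k / (k - 1) * (1 - (sum of the item
variances) / (variance of the total scale scores)), where k is the number of items. A minimal NumPy sketch with a
hypothetical score matrix:

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the scale (row sums)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Hypothetical: 5 respondents answering a 3-item scale
scores = np.array([[4, 5, 4],
                   [3, 3, 4],
                   [5, 5, 5],
                   [2, 3, 2],
                   [4, 4, 5]])
print(cronbach_alpha(scores))

# Change in alpha when each item in turn is deleted from the scale
for i in range(scores.shape[1]):
    print(i, cronbach_alpha(np.delete(scores, i, axis=1)))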
3. Initial transformations
After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to
perform initial transformations of one or more variables, although this can also be done during the main analysis
phase.
Possible transformations of variables are:
# Square root transformation (if the distribution differs moderately from normal)
# Log-transformation (if the distribution differs substantially from normal)
# Inverse transformation (if the distribution differs severely from normal)
# Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no
transformations help)
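A minimal NumPy sketch of the transformations listed above, applied to a hypothetical right-skewed variable:

import numpy as np

x = np.random.default_rng(0).lognormal(size=1000)   # hypothetical right-skewed variable

sqrt_x = np.sqrt(x)         # if the distribution differs moderately from normal
log_x = np.log(x)           # if it differs substantially from normal
inverse_x = 1.0 / x         # if it differs severely from normal

# If no transformation helps: make the variable categorical (here, dichotomous at the median)
dichotomous_x = (x > np.median(x)).astype(int)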
4. Did the implementation of the study fulfill the intentions of the research
design?
One should check the success of the randomization procedure, for instance by checking whether background and
substantive variables are equally distributed within and across groups.
If the study did not need or use a randomization procedure, one should check the success of the non-random
sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:
# Dropout (this should be identified during the initial data analysis phase)
# Item non-response (whether this is random or not should be assessed during the initial data analysis phase)
# Treatment quality (using manipulation checks).
CDA: Confirmatory Data Analysis
Confirmatory data analysis (CDA) is used to confirm or reject a pre-specified hypothesis, such as a measurement theory. In reality,
exploratory and confirmatory data analysis are not performed one after another, but continually intertwine to help
you create the best possible model.
Confirmatory Data Analysis involves things like: testing hypotheses, producing estimates with a specified level of
precision, regression analysis, and variance analysis. In this way, your confirmatory data analysis is where you put
your findings and arguments to trial.
Q6: What is Descriptive Statistics? And, what is Inferential Statistics?
Descriptive statistics
A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or
summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the
process of using and analysing those statistics. Descriptive statistics is distinguished from inferential statistics
(or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about the population
that the sample of data is thought to represent. This generally means that descriptive statistics, unlike
inferential statistics, is not developed on the basis of probability theory and frequently consists of nonparametric
statistics.
Descriptive statistics provide simple summaries about the sample and about the observations that have been made.
Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs.
These summaries may either form the basis of the initial description of the data as part of a more extensive
statistical analysis, or they may be sufficient in and of themselves for a particular investigation.
For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a
player or a team. This number is the number of shots made divided by the number of shots taken. For example, a
player who shoots 33% is making approximately one shot in every three. The percentage summarizes or describes
multiple discrete events. Consider also the grade point average. This single number describes the general
performance of a student across the range of their course experiences.
The use of descriptive and summary statistics has an extensive history and, indeed, the simple tabulation of
populations and of economic data was the first way the topic of statistics appeared. More recently, a collection of
summarisation techniques has been formulated under the heading of exploratory data analysis: an example of such a
technique is the box plot.
In the business world, descriptive statistics provides a useful summary of many types of data. For example,
investors and brokers may use a historical account of return behaviour by performing empirical and analytical
analyses on their investments in order to make better investing decisions in the future.
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its central tendency
(including the mean, median, and mode) and dispersion (including the range and quartiles of the data-set, and
measures of spread such as the variance and standard deviation). The shape of the distribution may also be described
via indices such as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in
graphical or tabular format, including histograms and stem-and-leaf displays.
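A short pandas sketch of these univariate summaries, using a hypothetical series of heights:

import pandas as pd

heights = pd.Series([160, 172, 168, 181, 175, 169, 190, 158])  # hypothetical data

# Central tendency
print(heights.mean(), heights.median(), heights.mode().tolist())

# Dispersion
print(heights.max() - heights.min())              # range
print(heights.quantile([0.25, 0.5, 0.75]))        # quartiles
print(heights.var(), heights.std())               # variance, standard deviation

# Shape of the distribution
print(heights.skew(), heights.kurt())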
Bivariate and multivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to describe the relationship
between pairs of variables. In this case, descriptive statistics include:
# Cross-tabulations and contingency tables
# Graphical representation via scatterplots
# Quantitative measures of dependence
# Descriptions of conditional distributions
The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not only a
simple descriptive analysis: it also describes the relationship between two different variables. Quantitative
measures of dependence include correlation (such as Pearson's r when both variables are continuous, or Spearman's
rho if one or both are not) and covariance (which reflects the scale variables are measured on). The slope, in
regression analysis, also reflects the relationship between variables. The unstandardised slope indicates the unit
change in the criterion variable for a one unit change in the predictor. The standardised slope indicates this
change in standardised (z-score) units. Highly skewed data are often transformed by taking logarithms. The use of
logarithms makes graphs more symmetrical and look more similar to the normal distribution, making them easier to
interpret intuitively.
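A brief sketch of these bivariate measures with pandas and SciPy, on hypothetical paired data:

import pandas as pd
from scipy import stats

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]})  # hypothetical variables

print(df["x"].corr(df["y"]))                     # Pearson's r (default)
print(df["x"].corr(df["y"], method="spearman"))  # Spearman's rho
print(df["x"].cov(df["y"]))                      # covariance (scale-dependent)

# Unstandardised slope from a simple linear regression
slope, intercept, r, p, se = stats.linregress(df["x"], df["y"])
print(slope)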
Inferential statistics (or Statistical Inference or Inferential Statistical
Analysis)
Statistical inference is the process of using data analysis to infer properties of an underlying distribution of
probability. Inferential statistical analysis infers properties of a population, for example by testing hypotheses
and deriving estimates. It is assumed that the observed data set is sampled from a larger population.
Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned
with properties of the observed data, and it does not rest on the assumption that the data come from a larger
population. In machine learning, the term inference is sometimes used instead to mean "make a prediction, by
evaluating an already trained model"; in this context inferring properties of the model is referred to as
training or learning (rather than inference), and using a model for prediction is referred to as inference (instead
of prediction); see also predictive inference.
Statistical inference makes propositions about a population, using data drawn from the population with some form of
sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference
consists of (first) selecting a statistical model of the process that generates the data and (second) deducing
propositions from the model.
Konishi & Kitagawa state, "The majority of the problems in statistical inference can be considered to be
problems related to statistical modeling". Relatedly, Sir David Cox has said, "How [the] translation from
subject-matter problem to statistical model is done is often the most critical part of an analysis".
The conclusion of a statistical inference is a statistical proposition. Some common forms of statistical
proposition are the following:
# a point estimate, i.e. a particular value that best approximates some parameter of interest;
# an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed using a dataset
drawn from a population so that, under repeated sampling of such datasets, such intervals would contain the true
parameter value with the probability at the stated confidence level;
# a credible interval, i.e. a set of values containing, for example, 95% of posterior belief;
# rejection of a hypothesis;
# clustering or classification of data points into groups.
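As an illustration of a point estimate and an interval estimate, a minimal 95% confidence interval for a mean could
be computed as follows (the sample values are hypothetical):

import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])  # hypothetical sample

point_estimate = sample.mean()                     # point estimate of the population mean
se = sample.std(ddof=1) / np.sqrt(len(sample))     # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)    # two-sided 95% critical value

ci = (point_estimate - t_crit * se, point_estimate + t_crit * se)
print(point_estimate, ci)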
Topics Covered in Inferential Statistics
The topics below are usually included in the area of statistical inference.
1. Statistical assumptions
2. Statistical decision theory
3. Estimation theory
4. Statistical hypothesis testing
5. Revising opinions in statistics
6. Design of experiments, the analysis of variance, and regression
7. Survey sampling
8. Summarizing statistical data
Q7: What are the Four Types of Analytical Studies?
Descriptive Analytics tells you what happened in the past.
Diagnostic Analytics helps you understand why something happened in the past.
Predictive Analytics predicts what is most likely to happen in the future.
Prescriptive Analytics recommends actions you can take to affect those outcomes.
Q8: Ordinal categorical data vs Nominal categorical data
Ordinal categorical data are non-numerical pieces of information with implied order — for example, survey responses
on a scale from very dissatisfied to very satisfied.
And nominal categorical data are non-numerical pieces of information without any inherent order — for example,
colors or states.
Side Note:
Questions 8 to 12 are related to ‘Types of Data’.
Actual Interview Question:
“How do you figure out if the data is categorical or continuous?”
Q9: Categorical data vs Quantitative data
Quantitative variables are any variables where the data represent amounts (e.g. height, weight, or age).
Categorical variables are any variables where the data represent groups. This includes rankings (e.g. finishing
places in a race), classifications (e.g. brands of cereal), and binary outcomes (e.g. coin flips).
Q10: Categorical data vs Continuous data
Continuous data can take on any value within a defined range and is often measured on a continuous scale, such as
weight, height, or temperature.
Categorical data, on the other hand, consists of discrete values that fall into distinct categories or groups, such
as gender, ethnicity, or product types.
Q11: Categorical data vs Numerical data
- Categorical data refers to a data type that can be stored and identified based on the names or labels given to
them. It is also known as qualitative data, as it qualifies data before classifying it.
- Numerical data refers to data that is in the form of numbers, and not in any language or descriptive form. It is
also known as quantitative data.
Q12: Categorical data vs Discrete data
- Categorical data might not have a logical order. For example, categorical predictors include gender, material
type, and payment method.
- Discrete variables are numeric variables that have a countable number of values between any two values. A
discrete variable is always numeric.
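Tying Questions 8 to 12 together, a small pandas sketch with hypothetical columns shows how these data types differ
and how to tell categorical from numerical columns programmatically:

import pandas as pd

df = pd.DataFrame({
    "color":        ["red", "blue", "red"],        # nominal categorical
    "satisfaction": ["low", "high", "medium"],     # ordinal categorical
    "children":     [0, 2, 1],                     # discrete numerical
    "height_cm":    [172.4, 181.0, 165.2],         # continuous numerical
})

# Mark the ordinal column with an explicit order
df["satisfaction"] = pd.Categorical(df["satisfaction"],
                                    categories=["low", "medium", "high"],
                                    ordered=True)

# A quick way to figure out which columns are categorical and which are numerical
print(df.dtypes)
print(df.select_dtypes(include="number").columns.tolist())   # numerical columns
print(df.select_dtypes(exclude="number").columns.tolist())   # categorical / object columns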
Q13: When do we use supervised learning?
Supervised learning is a type of machine learning where the algorithm learns from labeled training data, meaning it
is provided with input-output pairs (also known as features and labels) and learns to map the input to the
corresponding output. Supervised learning is used in a wide range of applications whenever you have a dataset with
known outcomes and want to build a predictive model. Here are some common scenarios where supervised learning is
applied:
Classification: When you want to categorize data into predefined classes or labels. Examples include email spam
detection, image classification (e.g., identifying objects in images), sentiment analysis (determining if a text is
positive or negative), and disease diagnosis (categorizing patients as having a specific disease or not).
Regression: When you want to predict a continuous numeric value. Examples include predicting house prices based on
features like square footage and location, forecasting stock prices, or estimating the time it will take to complete
a task.
Recommendation Systems: When you want to make personalized recommendations to users. This is common in
applications like movie recommendations (Netflix), product recommendations (Amazon), and content recommendations
(YouTube).
Natural Language Processing (NLP): In NLP, supervised learning is used for tasks like named entity recognition,
text classification, machine translation, and chatbot responses. For instance, training a model to classify news
articles into different categories like sports, politics, or entertainment.
Speech Recognition: In applications like voice assistants (e.g., Siri, Alexa), supervised learning is used to
transcribe spoken words into text.
Computer Vision: Supervised learning is crucial for image and video analysis tasks such as object detection,
facial recognition, and autonomous driving (identifying pedestrians, vehicles, and road signs).
Anomaly Detection: Detecting unusual patterns or outliers in data, such as fraud detection in financial
transactions, network intrusion detection, and equipment failure prediction in industrial settings.
Handwriting Recognition: Converting handwritten text into machine-readable text, often used in applications like
OCR (Optical Character Recognition).
Biomedical and Healthcare: In medical imaging, supervised learning is used for tasks like tumor detection in MRI
scans, disease diagnosis based on medical data, and drug discovery.
Game Playing: Supervised learning can be used to train agents in games like chess, Go, or video games to make
decisions based on historical gameplay data.
Supervised learning is a versatile and widely used technique in machine learning, and its applications extend to
various domains where predictive modeling, classification, or regression tasks are required, and labeled training
data is available.
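A minimal scikit-learn sketch of supervised learning, with one toy classification task and one toy regression task
(the data sets are illustrative stand-ins):

from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: predict a class label from labeled examples
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Regression: predict a continuous numeric value (synthetic data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, reg.predict(X_te)))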
Q14: When do we use unsupervised learning?
Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data, meaning it
does not have access to predefined output labels. Instead, the algorithm aims to discover hidden patterns,
structures, or relationships within the data. Here are some common scenarios where unsupervised learning is used:
Clustering: Unsupervised learning is frequently used for clustering, where the goal is to group similar data
points together based on their inherent similarities or patterns. Common clustering algorithms include k-means
clustering, hierarchical clustering, and DBSCAN. Applications include customer segmentation in marketing, image
segmentation, and grouping similar news articles for recommendation.
Dimensionality Reduction: Unsupervised learning techniques like Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbor Embedding (t-SNE) are used to reduce the dimensionality of data. This is helpful
for visualization, feature selection, and improving the efficiency of machine learning algorithms. Dimensionality
reduction can be used in image compression, anomaly detection, and more.
Anomaly Detection: Unsupervised learning can be used to identify rare or anomalous data points that do not conform
to the normal patterns in the dataset. This is valuable for fraud detection, network intrusion detection, and
quality control in manufacturing.
Density Estimation: Unsupervised learning can be used to estimate the probability distribution of data. This is
useful in various applications, including anomaly detection, outlier detection, and generative modeling.
Topic Modeling: In natural language processing, unsupervised learning techniques like Latent Dirichlet Allocation
(LDA) and Non-Negative Matrix Factorization (NMF) are used for topic modeling. These methods help identify
underlying topics in a collection of text documents, making them useful for text summarization, content
recommendation, and document organization.
Recommendation Systems: Unsupervised learning can be used to find similarities between users or items in a
recommendation system. While recommendation systems often involve collaborative filtering, which can be considered a
form of unsupervised learning, they may also use supervised learning in conjunction with user ratings.
Generative Modeling: Unsupervised learning is used in generative modeling, where the goal is to generate new data
samples that are similar to the training data. Examples include Variational Autoencoders (VAEs) and Generative
Adversarial Networks (GANs), which are used in image generation, style transfer, and data augmentation.
Market Basket Analysis: In retail and e-commerce, unsupervised learning can be applied to discover associations
between products purchased together by customers. This information can be used for product recommendations and
inventory management.
Unsupervised learning is valuable when you want to explore and understand the structure and relationships within
your data, especially when you don't have access to labeled data or when the goal is to discover hidden patterns. It
is a versatile approach used in various domains, including data analysis, pattern recognition, and exploratory data
analysis.
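A minimal scikit-learn sketch of two common unsupervised tasks, clustering and dimensionality reduction, on
synthetic unlabeled data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled data: three hidden groups in ten dimensions (labels are discarded)
X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# Clustering: group similar points without any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])

# Dimensionality reduction: project to 2 components, e.g. for visualization
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)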
Q15: When a Data Science problem is given to you by your company, what strategy
will you follow to solve the problem? What is your approach?
Solving a data science problem for a company involves a systematic and well-structured approach. Here's a general
strategy and approach that you can follow:
1. Understand the Problem:
Start by thoroughly understanding the problem statement and its business context. Engage in discussions with
stakeholders and subject matter experts to gain insights into the problem's significance and objectives.
2. Define Clear Goals:
Clearly define the objectives and goals of the project. What specific outcomes or insights are you aiming to
achieve? Establish measurable success criteria to evaluate the effectiveness of your solution.
3. Data Collection and Exploration:
# Identify the data sources and collect the relevant data required for the project. This may involve data
acquisition, data scraping, or access to existing databases.
# Perform initial data exploration and visualization to gain insights into the data's characteristics, such as
distributions, missing values, outliers, and correlations.
4. Data Preprocessing:
Clean and preprocess the data to ensure it is in a suitable format for analysis. This includes handling missing
values, encoding categorical variables, scaling/normalizing numerical features, and addressing outliers.
5. Feature Engineering:
Create relevant features or transform existing ones to improve the model's performance. This step often requires
domain knowledge and creativity.
6. Model Selection:
Based on the nature of the problem (e.g., classification, regression, clustering), select appropriate machine
learning algorithms or techniques. Consider factors such as the volume of data, data dimensionality, and
interpretability of the model.
7. Model Training:
# Split the data into training, validation, and test sets to evaluate model performance.
# Train the selected models on the training data, tune hyperparameters using cross-validation, and assess their
performance using appropriate evaluation metrics.
8. Model Evaluation:
# Evaluate the models using the validation dataset to choose the best-performing one(s).
# Perform a detailed analysis of model performance, considering metrics like accuracy, precision, recall, F1-score,
or RMSE (Root Mean Square Error), depending on the problem type.
9. Model Interpretability:
If applicable, ensure that the model's decisions are interpretable and explainable to stakeholders. Use techniques
such as feature importance analysis, SHAP values, or LIME to provide insights into model predictions.
10. Deployment and Integration:
# If the project involves real-time predictions, deploy the model into a production environment. Ensure that the
deployment process is reliable, scalable, and monitored for performance.
# Integrate the model into the company's existing systems and workflows, if necessary.
11. Monitoring and Maintenance:
# Continuously monitor the model's performance and retrain it as new data becomes available. Set up alerts for
model degradation or data drift.
# Keep the documentation up to date and maintain a version control system for the code and models.
12. Communication and Reporting:
# Regularly communicate progress and results to stakeholders through reports, dashboards, or presentations.
# Explain the implications of your findings and provide actionable insights to support decision-making.
13. Feedback and Iteration:
# Collect feedback from users and stakeholders and use it to make improvements to the model or the overall
solution.
# Iterate on the project as necessary to address evolving business needs or data quality issues.
14. Documentation:
Maintain thorough documentation of the entire process, including data sources, preprocessing steps, model
architecture, hyperparameters, and results. This documentation is crucial for reproducibility.
15. Ethical Considerations:
Ensure that the project adheres to ethical and legal guidelines, especially when dealing with sensitive data or
automated decision-making.
The specific details of your approach may vary depending on the nature of the problem, the available resources, and
the company's goals. Flexibility and adaptability are essential qualities for a data scientist to navigate the
complexities of real-world data science projects successfully.
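To illustrate steps 7 and 8 above, here is a hedged scikit-learn sketch of splitting the data, tuning
hyperparameters with cross-validation, and evaluating on a held-out test set (the data set is a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune hyperparameters with cross-validation on the training split
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
                    cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Evaluate the chosen model on the held-out test set
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))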
Q16: What do you do as part of Data Engineering?
Data engineering is a critical component of the data pipeline in a data-driven organization. Data engineers are
responsible for designing, building, and maintaining the infrastructure and systems that enable the collection,
storage, processing, and retrieval of data. Here are some of the key tasks and responsibilities of data engineers:
Data Ingestion:
Set up processes to collect data from various sources, including databases, external APIs, streaming platforms,
log files, and IoT devices.
Ensure data is ingested efficiently, reliably, and at scale.
Data Storage:
Design and implement data storage solutions that accommodate the organization's data volume and access patterns.
Choose appropriate data storage technologies, such as relational databases, NoSQL databases, data lakes, data
warehouses, and distributed file systems.
Data Transformation:
Clean, preprocess, and transform raw data into formats suitable for analysis or downstream applications.
Create data pipelines to automate ETL (Extract, Transform, Load) processes.
Data Modeling:
Develop data models and schemas to represent structured and semi-structured data.
Optimize data models for efficient querying and reporting.
Data Quality and Validation:
Implement data quality checks and validation procedures to ensure data accuracy and consistency.
Monitor and address data quality issues.
Data Integration:
Integrate data from various sources to create a unified view of the data.
Implement data integration solutions, such as data consolidation, data federation, and data synchronization.
Data Security and Compliance:
Implement security measures to protect sensitive data.
Ensure compliance with data privacy regulations (e.g., GDPR, HIPAA) and industry standards.
Data Pipeline Orchestration:
Manage and orchestrate data pipelines using tools like Apache Airflow, Luigi, or cloud-based solutions.
Schedule and automate data processing tasks.
Scalability and Performance:
Architect data systems to be scalable, ensuring they can handle increasing data volumes and user loads.
Optimize system performance to deliver fast query responses.
Data Versioning and Cataloging:
Maintain a catalog of available data assets, making it easier for data scientists and analysts to discover and use
data.
Implement data versioning to track changes in data assets.
Monitoring and Logging:
Set up monitoring and logging systems to track the health and performance of data pipelines and systems.
Implement alerting mechanisms for identifying and addressing issues in real-time.
Documentation:
Document data engineering processes, data schemas, and infrastructure configurations for knowledge sharing and
future reference.
Collaboration:
Collaborate with data scientists, analysts, and other stakeholders to understand data requirements and deliver
data solutions that meet business needs.
Cloud Services:
Leverage cloud platforms (e.g., AWS, Azure, GCP) to build and maintain data infrastructure and services.
Take advantage of managed services for data storage, computation, and orchestration.
Continuous Improvement:
Stay up-to-date with the latest data engineering technologies and best practices.
Continuously improve data engineering processes for efficiency and reliability.
Data engineers play a crucial role in enabling data-driven decision-making within an organization by ensuring that
data is available, reliable, and accessible to data users, including data scientists, analysts, and business
stakeholders. Their work forms the foundation upon which data analytics and machine learning efforts are built.
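As a hedged illustration of data transformation and ETL, a toy pipeline with pandas and SQLite might look like this
(the file, table, and column names are hypothetical):

import sqlite3
import pandas as pd

# Extract: read raw data from a source file (hypothetical path)
raw = pd.read_csv("raw_orders.csv")

# Transform: clean, type-cast, and derive columns
orders = (raw.dropna(subset=["order_id", "amount"])
             .assign(order_date=lambda d: pd.to_datetime(d["order_date"]),
                     amount=lambda d: d["amount"].astype(float)))

# Load: write the cleaned table into a warehouse-like store (SQLite here)
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("orders", conn, if_exists="replace", index=False)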
Q17: What do you do as part of MLOps?
MLOps, short for Machine Learning Operations, is a set of practices and tools aimed at streamlining and automating
the end-to-end machine learning lifecycle, from model development to deployment and monitoring. MLOps involves
collaboration between data scientists, machine learning engineers, data engineers, and IT/DevOps teams to ensure
that machine learning models are deployed and maintained in a reliable and efficient manner. Here's what you might
do as part of MLOps:
Infrastructure Provisioning and Management:
Set up and manage the infrastructure required for machine learning workloads, including cloud computing resources
or on-premises servers.
Utilize containerization technologies (e.g., Docker) and container orchestration platforms (e.g., Kubernetes) to
manage and deploy machine learning models.
Environment Management:
Create and manage development, testing, and production environments with consistent dependencies and
configurations for reproducibility.
Use tools like virtual environments or containerization to isolate dependencies.
Version Control:
Implement version control for machine learning code, data, and models using platforms like Git.
Track changes to models and data to ensure traceability and reproducibility.
Continuous Integration and Continuous Deployment (CI/CD):
Set up CI/CD pipelines to automate the testing, building, and deployment of machine learning models.
Implement automated testing to validate model performance and data quality.
Model Packaging:
Package machine learning models in a way that allows them to be easily deployed and managed, such as containerized
models or model-serving libraries like TensorFlow Serving or PyTorch Serve.
Model Deployment:
Deploy machine learning models into production, making them accessible to applications and end-users.
Implement strategies for gradual or canary deployments to minimize downtime and mitigate risks.
Monitoring and Logging:
Set up monitoring and logging for deployed models to track their performance, detect anomalies, and ensure they
are functioning as expected.
Implement alerting systems to notify teams of issues or performance degradation.
Model Versioning and Rollback:
Implement model versioning and rollback mechanisms to manage different model versions and quickly revert to a
previous version if necessary.
Scalability and Resource Management:
Design systems that can handle increased traffic and scale resources as needed to accommodate growing workloads.
Implement auto-scaling and resource optimization strategies.
Security and Compliance:
Ensure that machine learning systems meet security and compliance requirements, especially when handling sensitive
data.
Implement access controls, encryption, and audit logs as needed.
Collaboration and Communication:
Facilitate collaboration between data science and engineering teams to ensure smooth transitions from model
development to deployment.
Communicate updates and changes to stakeholders effectively.
Feedback Loop:
Collect feedback from production systems and end-users to improve models over time.
Use feedback to retrain and update models as needed.
Cost Management:
Monitor and optimize the costs associated with machine learning infrastructure and services to ensure
cost-effectiveness.
Documentation and Knowledge Sharing:
Maintain comprehensive documentation of the MLOps process and infrastructure to facilitate knowledge sharing and
onboarding of new team members.
Automated Testing:
Implement automated tests for machine learning models to ensure that they perform as expected in different
environments.
MLOps practices help organizations streamline and automate the management of machine learning workflows, leading to
faster development cycles, improved model reliability, and better alignment between data science and operations
teams. The specific tasks you perform in MLOps can vary depending on the organization's size, technology stack, and
machine learning use cases.
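As a hedged illustration of model packaging and deployment, one might persist a trained model with joblib and expose
it through a minimal FastAPI endpoint (the model path and feature layout are hypothetical):

import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")   # a previously trained, versioned artifact (hypothetical path)

class Features(BaseModel):
    values: list[float]               # hypothetical flat feature vector

@app.post("/predict")
def predict(features: Features):
    x = np.array(features.values).reshape(1, -1)
    return {"prediction": model.predict(x).tolist()}

# Run with: uvicorn serve:app --port 8000   (assuming this file is saved as serve.py)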
Tags: Technology, Data Analytics, Interview Preparation