Q1: What is Data Science Lifecycle?
Link to the solution
Q2: What are Data Science Roles?
Link to the solution
Q3: What all fields and subfields are closely related to Data Science?
Link to the solution
Q4: Draw the Data Science Venn Diagram.
Link to further explanation
Q5: What is EDA? And, what is CDA?
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods. A statistical model can be
used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and it thereby
contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970
to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data
collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on
checking assumptions required for model fitting and hypothesis testing, handling missing values, and making
transformations of variables as needed. EDA encompasses IDA.
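A minimal EDA sketch in Python, assuming pandas and matplotlib are installed and a hypothetical file named data.csv:

import pandas as pd
import matplotlib.pyplot as plt

# Load a data set (the file name is hypothetical)
df = pd.read_csv("data.csv")

# Summarize the main characteristics of each column
df.info()                 # column types and non-null counts
print(df.describe())      # count, mean, std, min, quartiles, max

# Visualize distributions and pairwise relationships
df.hist(figsize=(10, 8))
pd.plotting.scatter_matrix(df.select_dtypes("number"), figsize=(10, 8))
plt.show()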
Initial Data Analysis
The most important distinction between the initial data analysis phase and the main analysis phase is that during
initial data analysis one refrains from any analysis that is aimed at answering the original research question.
The initial data analysis phase is guided by the following four questions:
1. Quality of data
The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using
different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), normality
checks (skewness, kurtosis, frequency histograms), and an assessment of whether imputation of missing values is
needed. Further checks include:
# Analysis of extreme observations: outlying observations in the data are analyzed to see if they seem to disturb
the distribution.
# Comparison and correction of differences in coding schemes: variables are compared with coding schemes of
variables external to the data set, and possibly corrected if coding schemes are not comparable.
# Test for common-method variance.
The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses
that will be conducted in the main analysis phase.
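As a hedged illustration, some of these quality checks could look like the following pandas sketch (the file name
and the column names are hypothetical):

import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical data set

# Frequency counts for a categorical variable
print(df["payment_method"].value_counts(dropna=False))

# Descriptive statistics: mean, standard deviation, median
print(df["income"].agg(["mean", "std", "median"]))

# Normality indicators: skewness and kurtosis
print(df["income"].skew(), df["income"].kurt())

# Extreme observations: flag values more than 3 standard deviations from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
print(df[z.abs() > 3])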
2. Quality of measurements
The quality of the measurement instruments should only be checked during the initial data analysis phase when this
is not the focus or research question of the study. One should check whether the structure of the measurement
instruments corresponds to the structure reported in the literature.
There are two ways to assess measurement quality:
# Confirmatory factor analysis
# Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement
instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's alpha of
the scales, and the change in Cronbach's alpha when an item is deleted from a scale.
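Cronbach's alpha can be computed directly from its definition, alpha = k / (k - 1) * (1 - (sum of the item
variances) / (variance of the total scale scores)), where k is the number of items. A minimal NumPy sketch with a
hypothetical score matrix:

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the scale (row sums)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Hypothetical: 5 respondents answering a 3-item scale
scores = np.array([[4, 5, 4],
                   [3, 3, 4],
                   [5, 5, 5],
                   [2, 3, 2],
                   [4, 4, 5]])
print(cronbach_alpha(scores))

# Change in alpha when each item in turn is deleted from the scale
for i in range(scores.shape[1]):
    print(i, cronbach_alpha(np.delete(scores, i, axis=1)))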
3. Initial transformations
After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to
perform initial transformations of one or more variables, although this can also be done during the main analysis
phase.
Possible transformations of variables are:
# Square root transformation (if the distribution differs moderately from normal)
# Log-transformation (if the distribution differs substantially from normal)
# Inverse transformation (if the distribution differs severely from normal)
# Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no
transformations help)
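A minimal NumPy sketch of the transformations listed above, applied to a hypothetical right-skewed variable:

import numpy as np

x = np.random.default_rng(0).lognormal(size=1000)   # hypothetical right-skewed variable

sqrt_x = np.sqrt(x)         # if the distribution differs moderately from normal
log_x = np.log(x)           # if it differs substantially from normal
inverse_x = 1.0 / x         # if it differs severely from normal

# If no transformation helps: make the variable categorical (here, dichotomous at the median)
dichotomous_x = (x > np.median(x)).astype(int)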
4. Did the implementation of the study fulfill the intentions of the research
design?
One should check the success of the randomization procedure, for instance by checking whether background and
substantive variables are equally distributed within and across groups.
If the study did not need or use a randomization procedure, one should check the success of the non-random
sampling, for instance by checking whether all subgroups of the population of interest are represented in the sample.
Other possible data distortions that should be checked are:
# Dropout (this should be identified during the initial data analysis phase)
# Item non-response (whether this is random or not should be assessed during the initial data analysis phase)
# Treatment quality (using manipulation checks).
CDA: Confirmatory Data Analysis
Confirmatory data analysis (CDA) is used to confirm or reject a pre-specified hypothesis, such as a measurement theory. In reality,
exploratory and confirmatory data analysis are not performed one after another, but continually intertwine to help
you create the best possible model.
Confirmatory Data Analysis involves things like: testing hypotheses, producing estimates with a specified level of
precision, regression analysis, and variance analysis. In this way, your confirmatory data analysis is where you put
your findings and arguments to trial.
Q6: What is Descriptive Statistics? And, what is Inferential Statistics?
Descriptive statistics
A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or
summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the
process of using and analysing those statistics. Descriptive statistics is distinguished from inferential statistics
(or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about the population
that the sample of data is thought to represent. This generally means that descriptive statistics, unlike
inferential statistics, is not developed on the basis of probability theory and frequently consists of nonparametric
statistics.
Descriptive statistics provide simple summaries about the sample and about the observations that have been made.
Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs.
These summaries may either form the basis of the initial description of the data as part of a more extensive
statistical analysis, or they may be sufficient in and of themselves for a particular investigation.
For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a
player or a team. This number is the number of shots made divided by the number of shots taken. For example, a
player who shoots 33% is making approximately one shot in every three. The percentage summarizes or describes
multiple discrete events. Consider also the grade point average. This single number describes the general
performance of a student across the range of their course experiences.
The use of descriptive and summary statistics has an extensive history and, indeed, the simple tabulation of
populations and of economic data was the first way the topic of statistics appeared. More recently, a collection of
summarisation techniques has been formulated under the heading of exploratory data analysis: an example of such a
technique is the box plot.
In the business world, descriptive statistics provides a useful summary of many types of data. For example,
investors and brokers may use a historical account of return behaviour by performing empirical and analytical
analyses on their investments in order to make better investing decisions in the future.
Univariate analysis
Univariate analysis involves describing the distribution of a single variable, including its central tendency
(including the mean, median, and mode) and dispersion (including the range and quartiles of the data-set, and
measures of spread such as the variance and standard deviation). The shape of the distribution may also be described
via indices such as skewness and kurtosis. Characteristics of a variable's distribution may also be depicted in
graphical or tabular format, including histograms and stem-and-leaf displays.
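A short pandas sketch of these univariate summaries, using a hypothetical series of heights:

import pandas as pd

heights = pd.Series([160, 172, 168, 181, 175, 169, 190, 158])  # hypothetical data

# Central tendency
print(heights.mean(), heights.median(), heights.mode().tolist())

# Dispersion
print(heights.max() - heights.min())              # range
print(heights.quantile([0.25, 0.5, 0.75]))        # quartiles
print(heights.var(), heights.std())               # variance, standard deviation

# Shape of the distribution
print(heights.skew(), heights.kurt())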
Bivariate and multivariate analysis
When a sample consists of more than one variable, descriptive statistics may be used to describe the relationship
between pairs of variables. In this case, descriptive statistics include:
# Cross-tabulations and contingency tables
# Graphical representation via scatterplots
# Quantitative measures of dependence
# Descriptions of conditional distributions
The main reason for differentiating univariate and bivariate analysis is that bivariate analysis is not only a
simple descriptive analysis: it also describes the relationship between two different variables. Quantitative
measures of dependence include correlation (such as Pearson's r when both variables are continuous, or Spearman's
rho if one or both are not) and covariance (which reflects the scale variables are measured on). The slope, in
regression analysis, also reflects the relationship between variables. The unstandardised slope indicates the unit
change in the criterion variable for a one unit change in the predictor. The standardised slope indicates this
change in standardised (z-score) units. Highly skewed data are often transformed by taking logarithms. The use of
logarithms makes graphs more symmetrical and look more similar to the normal distribution, making them easier to
interpret intuitively.
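A brief sketch of these bivariate measures with pandas and SciPy, on hypothetical paired data:

import pandas as pd
from scipy import stats

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]})  # hypothetical variables

print(df["x"].corr(df["y"]))                     # Pearson's r (default)
print(df["x"].corr(df["y"], method="spearman"))  # Spearman's rho
print(df["x"].cov(df["y"]))                      # covariance (scale-dependent)

# Unstandardised slope from a simple linear regression
slope, intercept, r, p, se = stats.linregress(df["x"], df["y"])
print(slope)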
Inferential statistics (or Statistical Inference or Inferential Statistical
Analysis)
Statistical inference is the process of using data analysis to infer properties of an underlying distribution of
probability. Inferential statistical analysis infers properties of a population, for example by testing hypotheses
and deriving estimates. It is assumed that the observed data set is sampled from a larger population.
Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned
with properties of the observed data, and it does not rest on the assumption that the data come from a larger
population. In machine learning, the term inference is sometimes used instead to mean "make a prediction, by
evaluating an already trained model"; in this context inferring properties of the model is referred to as
training or learning (rather than inference), and using a model for prediction is referred to as inference (instead
of prediction); see also predictive inference.
Statistical inference makes propositions about a population, using data drawn from the population with some form of
sampling. Given a hypothesis about a population, for which we wish to draw inferences, statistical inference
consists of (first) selecting a statistical model of the process that generates the data and (second) deducing
propositions from the model.
Konishi & Kitagawa state, "The majority of the problems in statistical inference can be considered to be
problems related to statistical modeling". Relatedly, Sir David Cox has said, "How [the] translation from
subject-matter problem to statistical model is done is often the most critical part of an analysis".
The conclusion of a statistical inference is a statistical proposition. Some common forms of statistical
proposition are the following:
# a point estimate, i.e. a particular value that best approximates some parameter of interest;
# an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed using a dataset
drawn from a population so that, under repeated sampling of such datasets, such intervals would contain the true
parameter value with the probability at the stated confidence level;
# a credible interval, i.e. a set of values containing, for example, 95% of posterior belief;
# rejection of a hypothesis;
# clustering or classification of data points into groups.
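As an illustration of a point estimate and an interval estimate, a minimal 95% confidence interval for a mean could
be computed as follows (the sample values are hypothetical):

import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])  # hypothetical sample

point_estimate = sample.mean()                     # point estimate of the population mean
se = sample.std(ddof=1) / np.sqrt(len(sample))     # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)    # two-sided 95% critical value

ci = (point_estimate - t_crit * se, point_estimate + t_crit * se)
print(point_estimate, ci)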
Topics Covered in Inferential Statistics
The topics below are usually included in the area of statistical inference.
1. Statistical assumptions
2. Statistical decision theory
3. Estimation theory
4. Statistical hypothesis testing
5. Revising opinions in statistics
6. Design of experiments, the analysis of variance, and regression
7. Survey sampling
8. Summarizing statistical data
Q7: What are the Four Types of Analytical Studies?
Descriptive Analytics tells you what happened in the past.
Diagnostic Analytics helps you understand why something happened in the past.
Predictive Analytics predicts what is most likely to happen in the future.
Prescriptive Analytics recommends actions you can take to affect those outcomes.
Q8: Ordinal categorical data vs Nominal categorical data
Ordinal categorical data are non-numerical pieces of information with implied order — for example, survey responses
on a scale from very dissatisfied to very satisfied.
And nominal categorical data are non-numerical pieces of information without any inherent order — for example,
colors or states.
Side Note:
Questions 8 to 12 are related to ‘Types of Data’.
Actual Interview Question:
“How do you figure out if the data is categorical or continuous?”
Q9: Categorical data vs Quantitative data
Quantitative variables are any variables where the data represent amounts (e.g. height, weight, or age).
Categorical variables are any variables where the data represent groups. This includes rankings (e.g. finishing
places in a race), classifications (e.g. brands of cereal), and binary outcomes (e.g. coin flips).
Q10: Categorical data vs Continuous data
Continuous data can take on any value within a defined range and is often measured on a continuous scale, such as
weight, height, or temperature.
Categorical data, on the other hand, consists of discrete values that fall into distinct categories or groups, such
as gender, ethnicity, or product types.
Q11: Categorical data vs Numerical data
- Categorical data refers to a data type that can be stored and identified based on the names or labels given to
them. It is also known as qualitative data, as it qualifies data before classifying it.
- Numerical data refers to data that is in the form of numbers, and not in any language or descriptive form. It is
also known as quantitative data.
Q12: Categorical data vs Discrete data
- Categorical data might not have a logical order. For example, categorical predictors include gender, material
type, and payment method.
- Discrete variables are numeric variables that have a countable number of values between any two values. A
discrete variable is always numeric.
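Tying Questions 8 to 12 together, a small pandas sketch with hypothetical columns shows how these data types differ
and how to tell categorical from numerical columns programmatically:

import pandas as pd

df = pd.DataFrame({
    "color":        ["red", "blue", "red"],        # nominal categorical
    "satisfaction": ["low", "high", "medium"],     # ordinal categorical
    "children":     [0, 2, 1],                     # discrete numerical
    "height_cm":    [172.4, 181.0, 165.2],         # continuous numerical
})

# Mark the ordinal column with an explicit order
df["satisfaction"] = pd.Categorical(df["satisfaction"],
                                    categories=["low", "medium", "high"],
                                    ordered=True)

# A quick way to figure out which columns are categorical and which are numerical
print(df.dtypes)
print(df.select_dtypes(include="number").columns.tolist())   # numerical columns
print(df.select_dtypes(exclude="number").columns.tolist())   # categorical / object columns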
Q13: When do we use supervised learning?
Supervised learning is a type of machine learning where the algorithm learns from labeled training data, meaning it
is provided with input-output pairs (also known as features and labels) and learns to map the input to the
corresponding output. Supervised learning is used in a wide range of applications whenever you have a dataset with
known outcomes and want to build a predictive model. Here are some common scenarios where supervised learning is
applied:
Classification: When you want to categorize data into predefined classes or labels. Examples include email spam
detection, image classification (e.g., identifying objects in images), sentiment analysis (determining if a text is
positive or negative), and disease diagnosis (categorizing patients as having a specific disease or not).
Regression: When you want to predict a continuous numeric value. Examples include predicting house prices based on
features like square footage and location, forecasting stock prices, or estimating the time it will take to complete
a task.
Recommendation Systems: When you want to make personalized recommendations to users. This is common in
applications like movie recommendations (Netflix), product recommendations (Amazon), and content recommendations
(YouTube).
Natural Language Processing (NLP): In NLP, supervised learning is used for tasks like named entity recognition,
text classification, machine translation, and chatbot responses. For instance, training a model to classify news
articles into different categories like sports, politics, or entertainment.
Speech Recognition: In applications like voice assistants (e.g., Siri, Alexa), supervised learning is used to
transcribe spoken words into text.
Computer Vision: Supervised learning is crucial for image and video analysis tasks such as object detection,
facial recognition, and autonomous driving (identifying pedestrians, vehicles, and road signs).
Anomaly Detection: Detecting unusual patterns or outliers in data, such as fraud detection in financial
transactions, network intrusion detection, and equipment failure prediction in industrial settings.
Handwriting Recognition: Converting handwritten text into machine-readable text, often used in applications like
OCR (Optical Character Recognition).
Biomedical and Healthcare: In medical imaging, supervised learning is used for tasks like tumor detection in MRI
scans, disease diagnosis based on medical data, and drug discovery.
Game Playing: Supervised learning can be used to train agents in games like chess, Go, or video games to make
decisions based on historical gameplay data.
Supervised learning is a versatile and widely used technique in machine learning, and its applications extend to
various domains where predictive modeling, classification, or regression tasks are required, and labeled training
data is available.
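A minimal scikit-learn sketch of supervised learning, with one toy classification task and one toy regression task
(the data sets are illustrative stand-ins):

from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: predict a class label from labeled examples
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Regression: predict a continuous numeric value (synthetic data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, reg.predict(X_te)))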
Q14: When do we use unsupervised learning?
Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data, meaning it
does not have access to predefined output labels. Instead, the algorithm aims to discover hidden patterns,
structures, or relationships within the data. Here are some common scenarios where unsupervised learning is used:
Clustering: Unsupervised learning is frequently used for clustering, where the goal is to group similar data
points together based on their inherent similarities or patterns. Common clustering algorithms include k-means
clustering, hierarchical clustering, and DBSCAN. Applications include customer segmentation in marketing, image
segmentation, and grouping similar news articles for recommendation.
Dimensionality Reduction: Unsupervised learning techniques like Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbor Embedding (t-SNE) are used to reduce the dimensionality of data. This is helpful
for visualization, feature selection, and improving the efficiency of machine learning algorithms. Dimensionality
reduction can be used in image compression, anomaly detection, and more.
Anomaly Detection: Unsupervised learning can be used to identify rare or anomalous data points that do not conform
to the normal patterns in the dataset. This is valuable for fraud detection, network intrusion detection, and
quality control in manufacturing.
Density Estimation: Unsupervised learning can be used to estimate the probability distribution of data. This is
useful in various applications, including anomaly detection, outlier detection, and generative modeling.
Topic Modeling: In natural language processing, unsupervised learning techniques like Latent Dirichlet Allocation
(LDA) and Non-Negative Matrix Factorization (NMF) are used for topic modeling. These methods help identify
underlying topics in a collection of text documents, making them useful for text summarization, content
recommendation, and document organization.
Recommendation Systems: Unsupervised learning can be used to find similarities between users or items in a
recommendation system. While recommendation systems often involve collaborative filtering, which can be considered a
form of unsupervised learning, they may also use supervised learning in conjunction with user ratings.
Generative Modeling: Unsupervised learning is used in generative modeling, where the goal is to generate new data
samples that are similar to the training data. Examples include Variational Autoencoders (VAEs) and Generative
Adversarial Networks (GANs), which are used in image generation, style transfer, and data augmentation.
Market Basket Analysis: In retail and e-commerce, unsupervised learning can be applied to discover associations
between products purchased together by customers. This information can be used for product recommendations and
inventory management.
Unsupervised learning is valuable when you want to explore and understand the structure and relationships within
your data, especially when you don't have access to labeled data or when the goal is to discover hidden patterns. It
is a versatile approach used in various domains, including data analysis, pattern recognition, and exploratory data
analysis.
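A minimal scikit-learn sketch of two common unsupervised tasks, clustering and dimensionality reduction, on
synthetic unlabeled data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled data: three hidden groups in ten dimensions (labels are discarded)
X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# Clustering: group similar points without any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])

# Dimensionality reduction: project to 2 components, e.g. for visualization
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)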
Q15: When a Data Science problem is given to you by your company, what strategy
will you follow to solve the problem? What is your approach?
Solving a data science problem for a company involves a systematic and well-structured approach. Here's a general
strategy and approach that you can follow:
1. Understand the Problem:
Start by thoroughly understanding the problem statement and its business context. Engage in discussions with
stakeholders and subject matter experts to gain insights into the problem's significance and objectives.
2. Define Clear Goals:
Clearly define the objectives and goals of the project. What specific outcomes or insights are you aiming to
achieve? Establish measurable success criteria to evaluate the effectiveness of your solution.
3. Data Collection and Exploration:
# Identify the data sources and collect the relevant data required for the project. This may involve data
acquisition, data scraping, or access to existing databases.
# Perform initial data exploration and visualization to gain insights into the data's characteristics, such as
distributions, missing values, outliers, and correlations.
4. Data Preprocessing:
Clean and preprocess the data to ensure it is in a suitable format for analysis. This includes handling missing
values, encoding categorical variables, scaling/normalizing numerical features, and addressing outliers.
5. Feature Engineering:
Create relevant features or transform existing ones to improve the model's performance. This step often requires
domain knowledge and creativity.
6. Model Selection:
Based on the nature of the problem (e.g., classification, regression, clustering), select appropriate machine
learning algorithms or techniques. Consider factors such as the volume of data, data dimensionality, and
interpretability of the model.
7. Model Training:
# Split the data into training, validation, and test sets to evaluate model performance.
# Train the selected models on the training data, tune hyperparameters using cross-validation, and assess their
performance using appropriate evaluation metrics.
8. Model Evaluation:
# Evaluate the models using the validation dataset to choose the best-performing one(s).
# Perform a detailed analysis of model performance, considering metrics like accuracy, precision, recall, F1-score,
or RMSE (Root Mean Square Error), depending on the problem type.
9. Model Interpretability:
If applicable, ensure that the model's decisions are interpretable and explainable to stakeholders. Use techniques
such as feature importance analysis, SHAP values, or LIME to provide insights into model predictions.
10. Deployment and Integration:
# If the project involves real-time predictions, deploy the model into a production environment. Ensure that the
deployment process is reliable, scalable, and monitored for performance.
# Integrate the model into the company's existing systems and workflows, if necessary.
11. Monitoring and Maintenance:
# Continuously monitor the model's performance and retrain it as new data becomes available. Set up alerts for
model degradation or data drift.
# Keep the documentation up to date and maintain a version control system for the code and models.
12. Communication and Reporting:
# Regularly communicate progress and results to stakeholders through reports, dashboards, or presentations.
# Explain the implications of your findings and provide actionable insights to support decision-making.
13. Feedback and Iteration:
# Collect feedback from users and stakeholders and use it to make improvements to the model or the overall
solution.
# Iterate on the project as necessary to address evolving business needs or data quality issues.
14. Documentation:
Maintain thorough documentation of the entire process, including data sources, preprocessing steps, model
architecture, hyperparameters, and results. This documentation is crucial for reproducibility.
15. Ethical Considerations:
Ensure that the project adheres to ethical and legal guidelines, especially when dealing with sensitive data or
automated decision-making.
The specific details of your approach may vary depending on the nature of the problem, the available resources, and
the company's goals. Flexibility and adaptability are essential qualities for a data scientist to navigate the
complexities of real-world data science projects successfully.
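To illustrate steps 7 and 8 above, here is a hedged scikit-learn sketch of splitting the data, tuning
hyperparameters with cross-validation, and evaluating on a held-out test set (the data set is a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune hyperparameters with cross-validation on the training split
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
                    cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Evaluate the chosen model on the held-out test set
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))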
Q16: What do you do as part of Data Engineering?
Data engineering is a critical component of the data pipeline in a data-driven organization. Data engineers are
responsible for designing, building, and maintaining the infrastructure and systems that enable the collection,
storage, processing, and retrieval of data. Here are some of the key tasks and responsibilities of data engineers:
Data Ingestion:
Set up processes to collect data from various sources, including databases, external APIs, streaming platforms,
log files, and IoT devices.
Ensure data is ingested efficiently, reliably, and at scale.
Data Storage:
Design and implement data storage solutions that accommodate the organization's data volume and access patterns.
Choose appropriate data storage technologies, such as relational databases, NoSQL databases, data lakes, data
warehouses, and distributed file systems.
Data Transformation:
Clean, preprocess, and transform raw data into formats suitable for analysis or downstream applications.
Create data pipelines to automate ETL (Extract, Transform, Load) processes.
Data Modeling:
Develop data models and schemas to represent structured and semi-structured data.
Optimize data models for efficient querying and reporting.
Data Quality and Validation:
Implement data quality checks and validation procedures to ensure data accuracy and consistency.
Monitor and address data quality issues.
Data Integration:
Integrate data from various sources to create a unified view of the data.
Implement data integration solutions, such as data consolidation, data federation, and data synchronization.
Data Security and Compliance:
Implement security measures to protect sensitive data.
Ensure compliance with data privacy regulations (e.g., GDPR, HIPAA) and industry standards.
Data Pipeline Orchestration:
Manage and orchestrate data pipelines using tools like Apache Airflow, Luigi, or cloud-based solutions.
Schedule and automate data processing tasks.
Scalability and Performance:
Architect data systems to be scalable, ensuring they can handle increasing data volumes and user loads.
Optimize system performance to deliver fast query responses.
Data Versioning and Cataloging:
Maintain a catalog of available data assets, making it easier for data scientists and analysts to discover and use
data.
Implement data versioning to track changes in data assets.
Monitoring and Logging:
Set up monitoring and logging systems to track the health and performance of data pipelines and systems.
Implement alerting mechanisms for identifying and addressing issues in real-time.
Documentation:
Document data engineering processes, data schemas, and infrastructure configurations for knowledge sharing and
future reference.
Collaboration:
Collaborate with data scientists, analysts, and other stakeholders to understand data requirements and deliver
data solutions that meet business needs.
Cloud Services:
Leverage cloud platforms (e.g., AWS, Azure, GCP) to build and maintain data infrastructure and services.
Take advantage of managed services for data storage, computation, and orchestration.
Continuous Improvement:
Stay up-to-date with the latest data engineering technologies and best practices.
Continuously improve data engineering processes for efficiency and reliability.
Data engineers play a crucial role in enabling data-driven decision-making within an organization by ensuring that
data is available, reliable, and accessible to data users, including data scientists, analysts, and business
stakeholders. Their work forms the foundation upon which data analytics and machine learning efforts are built.
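As a hedged illustration of data transformation and ETL, a toy pipeline with pandas and SQLite might look like this
(the file, table, and column names are hypothetical):

import sqlite3
import pandas as pd

# Extract: read raw data from a source file (hypothetical path)
raw = pd.read_csv("raw_orders.csv")

# Transform: clean, type-cast, and derive columns
orders = (raw.dropna(subset=["order_id", "amount"])
             .assign(order_date=lambda d: pd.to_datetime(d["order_date"]),
                     amount=lambda d: d["amount"].astype(float)))

# Load: write the cleaned table into a warehouse-like store (SQLite here)
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("orders", conn, if_exists="replace", index=False)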
Q17: What do you do as part of MLOps?
MLOps, short for Machine Learning Operations, is a set of practices and tools aimed at streamlining and automating
the end-to-end machine learning lifecycle, from model development to deployment and monitoring. MLOps involves
collaboration between data scientists, machine learning engineers, data engineers, and IT/DevOps teams to ensure
that machine learning models are deployed and maintained in a reliable and efficient manner. Here's what you might
do as part of MLOps:
Infrastructure Provisioning and Management:
Set up and manage the infrastructure required for machine learning workloads, including cloud computing resources
or on-premises servers.
Utilize containerization technologies (e.g., Docker) and container orchestration platforms (e.g., Kubernetes) to
manage and deploy machine learning models.
Environment Management:
Create and manage development, testing, and production environments with consistent dependencies and
configurations for reproducibility.
Use tools like virtual environments or containerization to isolate dependencies.
Version Control:
Implement version control for machine learning code, data, and models using platforms like Git.
Track changes to models and data to ensure traceability and reproducibility.
Continuous Integration and Continuous Deployment (CI/CD):
Set up CI/CD pipelines to automate the testing, building, and deployment of machine learning models.
Implement automated testing to validate model performance and data quality.
Model Packaging:
Package machine learning models in a way that allows them to be easily deployed and managed, such as containerized
models or model-serving libraries like TensorFlow Serving or PyTorch Serve.
Model Deployment:
Deploy machine learning models into production, making them accessible to applications and end-users.
Implement strategies for gradual or canary deployments to minimize downtime and mitigate risks.
Monitoring and Logging:
Set up monitoring and logging for deployed models to track their performance, detect anomalies, and ensure they
are functioning as expected.
Implement alerting systems to notify teams of issues or performance degradation.
Model Versioning and Rollback:
Implement model versioning and rollback mechanisms to manage different model versions and quickly revert to a
previous version if necessary.
Scalability and Resource Management:
Design systems that can handle increased traffic and scale resources as needed to accommodate growing workloads.
Implement auto-scaling and resource optimization strategies.
Security and Compliance:
Ensure that machine learning systems meet security and compliance requirements, especially when handling sensitive
data.
Implement access controls, encryption, and audit logs as needed.
Collaboration and Communication:
Facilitate collaboration between data science and engineering teams to ensure smooth transitions from model
development to deployment.
Communicate updates and changes to stakeholders effectively.
Feedback Loop:
Collect feedback from production systems and end-users to improve models over time.
Use feedback to retrain and update models as needed.
Cost Management:
Monitor and optimize the costs associated with machine learning infrastructure and services to ensure
cost-effectiveness.
Documentation and Knowledge Sharing:
Maintain comprehensive documentation of the MLOps process and infrastructure to facilitate knowledge sharing and
onboarding of new team members.
Automated Testing:
Implement automated tests for machine learning models to ensure that they perform as expected in different
environments.
MLOps practices help organizations streamline and automate the management of machine learning workflows, leading to
faster development cycles, improved model reliability, and better alignment between data science and operations
teams. The specific tasks you perform in MLOps can vary depending on the organization's size, technology stack, and
machine learning use cases.
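As a hedged illustration of model packaging and deployment, one might persist a trained model with joblib and expose
it through a minimal FastAPI endpoint (the model path and feature layout are hypothetical):

import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")   # a previously trained, versioned artifact (hypothetical path)

class Features(BaseModel):
    values: list[float]               # hypothetical flat feature vector

@app.post("/predict")
def predict(features: Features):
    x = np.array(features.values).reshape(1, -1)
    return {"prediction": model.predict(x).tolist()}

# Run with: uvicorn serve:app --port 8000   (assuming this file is saved as serve.py)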
Tags: Technology, Data Analytics, Interview Preparation