Monday, October 2, 2023

What all fields and subfields are closely related to Data Science?

Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data.

Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.

Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.

A data scientist is a professional who creates programming code and combines it with statistical knowledge to create insights from data.

A list of fields and subfields related to Data Science

1. Data analysis

2. Data engineering

3. Machine learning

4. Business intelligence

5. Statistics

6. Business analytics

7. Software development

8. Data mining

9. Natural language processing

10. Computer vision

11. Data storytelling

12. Product Management

13. Artificial intelligence

14. Data modeling

Some Roles That Require Task Related Guidance And Training:

1. Data architect

2. Database Administrator

3. System Administrator

1. Data analysis

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains. In today's business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively.

Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.

In statistical applications, data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data while CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All of the above are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination.
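
To make the descriptive and exploratory side of data analysis concrete, here is a minimal sketch (the tiny sales table and its column names are invented for illustration) showing inspection, cleansing, and summary statistics with pandas:

import pandas as pd

# A tiny, made-up dataset standing in for real business data
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "units":  [120, 90, None, 150, 80, 130],      # one missing value to cleanse
    "price":  [9.99, 9.99, 10.49, 10.49, 9.49, 10.99],
})

# Inspect and cleanse: drop rows with missing values
clean = df.dropna()

# Descriptive statistics (mean, std, quartiles) for the numeric columns
print(clean.describe())

# A simple exploratory question: revenue by region
clean = clean.assign(revenue=clean["units"] * clean["price"])
print(clean.groupby("region")["revenue"].sum())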

2. Data engineering

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning. Making the data usable usually involves substantial compute and storage, as well as data processing.

In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer. Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management. This change in approach was particularly focused on cloud computing. Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT.

Who is a Data engineer?

A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights. They are focused on the production readiness of data and things like formats, resilience, scaling, and security. Data engineers usually hail from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust. Compared with data scientists, they tend to be more familiar with databases, architecture, cloud computing, and Agile software development.
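
As a rough illustration of the ETL work described above, here is a minimal, self-contained sketch (the records, field names, and SQLite table are invented for the example) that extracts raw records, transforms them into a consistent format, and loads them into a database:

import sqlite3

# Extract: raw records as they might arrive from an upstream source (made up)
raw = [
    {"user": "alice", "amount": "10.50", "ts": "2023-10-02"},
    {"user": "bob",   "amount": "n/a",   "ts": "2023-10-02"},   # unparseable value
    {"user": "carol", "amount": "7.25",  "ts": "2023-10-03"},
]

# Transform: drop unusable rows and convert types
def transform(rows):
    for r in rows:
        try:
            yield (r["user"], float(r["amount"]), r["ts"])
        except ValueError:
            continue  # skip records that cannot be parsed

# Load: write the cleaned rows into a table for later analysis
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (user TEXT, amount REAL, ts TEXT)")
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", transform(raw))

print(conn.execute("SELECT user, amount FROM payments").fetchall())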

Who is a Data Scientist?

Data scientists are more focused on the analysis of the data; they tend to be more familiar with mathematics, algorithms, statistics, and machine learning.

3. Machine learning

Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines "discover" their "own" algorithms, without needing to be explicitly told what to do by any human-developed algorithms. Recently, generative artificial neural networks have been able to surpass results of many previous approaches. Machine-learning approaches have been applied to large language models, computer vision, speech recognition, email filtering, agriculture and medicine, where it is too costly to develop algorithms to perform the needed tasks.

The mathematical foundations of ML are provided by mathematical optimization (mathematical programming) methods. Data mining is a related (parallel) field of study, focusing on exploratory data analysis through unsupervised learning.

ML is known in its application across business problems under the name predictive analytics. Although not all machine learning is statistically based, computational statistics is an important source of the field's methods.

Machine learning approaches

Machine learning approaches are traditionally divided into three broad categories, which correspond to learning paradigms, depending on the nature of the "signal" or "feedback" available to the learning system:

Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that's analogous to rewards, which it tries to maximize. Although each algorithm has advantages and limitations, no single algorithm works for all problems.
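
For a concrete, toy-sized feel of the first two paradigms, here is a minimal sketch using scikit-learn's built-in iris data: a supervised classifier learns from labelled examples, while an unsupervised clustering algorithm finds structure without labels.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: learn a general rule mapping inputs to known labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: find structure (clusters) without using the labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [list(km.labels_).count(c) for c in range(3)])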

4. Business intelligence

Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information. Common functions of business intelligence technologies include reporting, online analytical processing, analytics, dashboard development, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics, and prescriptive analytics.

BI tools can handle large amounts of structured and sometimes unstructured data to help identify, develop, and otherwise create new strategic business opportunities. They aim to allow for the easy interpretation of this big data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability, and help them make strategic decisions.

Business intelligence can be used by enterprises to support a wide range of business decisions ranging from operational to strategic. Basic operating decisions include product positioning or pricing. Strategic business decisions involve priorities, goals, and directions at the broadest level. In all cases, BI is most effective when it combines data derived from the market in which a company operates (external data) with data from company sources internal to the business such as financial and operations data (internal data). When combined, external and internal data can provide a complete picture which, in effect, creates an "intelligence" that cannot be derived from any singular set of data.

Among myriad uses, business intelligence tools empower organizations to gain insight into new markets, to assess demand and suitability of products and services for different market segments, and to gauge the impact of marketing efforts.

BI applications use data gathered from a data warehouse (DW) or from a data mart, and the concepts of BI and DW combine as "BI/DW" or as "BIDW". A data warehouse contains a copy of analytical data that facilitates decision support.
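
Since BI work relies heavily on aggregation, a minimal sketch of the kind of roll-up a report or dashboard might show (the sales figures and column names are invented) could look like this with pandas:

import pandas as pd

# Made-up transactional data such as might live in a data mart
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East", "East"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95, 60, 75],
})

# A simple OLAP-style pivot: total revenue by region and quarter
report = pd.pivot_table(sales, values="revenue", index="region",
                        columns="quarter", aggfunc="sum")
print(report)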

Definition of ‘Business Intelligence’

According to Solomon Negash and Paul Gray, business intelligence (BI) can be defined as systems that combine:

1. Data gathering

2. Data storage

3. Knowledge management


Some elements of business intelligence are:

1. Multidimensional aggregation and allocation

2. Denormalization, tagging, and standardization

3. Real-time reporting with analytical alerts

4. A method of interfacing with unstructured data sources

5. Group consolidation, budgeting, and rolling forecasts

6. Statistical inference and probabilistic simulation

7. Key performance indicators optimization

8. Version control and process management

9. Open item management

Roles in the field of ‘Business Intelligence’

Some common technical roles for business intelligence developers are:

# Business analyst

# Data analyst

# Data engineer

# Data scientist

# Database administrator

5. Statistics (and Statisticians)

Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty and decision-making in the face of uncertainty.

A statistician is a person who works with theoretical or applied statistics. The profession exists in both the private and public sectors.

It is common to combine statistical knowledge with expertise in other subjects, and statisticians may work as employees or as statistical consultants.

6. Business analytics

Business analytics (BA) refers to the skills, technologies, and practices for iterative exploration and investigation of past business performance to gain insight and drive business planning. Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning. In other words, business intelligence focuses on description, while business analytics focuses on prediction and prescription.

Business analytics makes extensive use of analytical modeling and numerical analysis, including explanatory and predictive modeling, and fact-based management to drive decision making. It is therefore closely related to management science. Analytics may be used as input for human decisions or may drive fully automated decisions. Business intelligence is querying, reporting, online analytical processing (OLAP), and "alerts".

In other words, querying, reporting, OLAP, and alert tools can answer questions such as what happened, how many, how often, where the problem is, and what actions are needed. Business analytics can answer questions like why is this happening, what if these trends continue, what will happen next (prediction), and what is the best outcome that can happen (optimization).

7. Software development

Software development is the process used to conceive, specify, design, program, document, test, and bug fix in order to create and maintain applications, frameworks, or other software components. Software development involves writing and maintaining the source code, but in a broader sense, it includes all processes from the conception of the desired software through the final manifestation, typically in a planned and structured process often overlapping with software engineering. Software development also includes research, new development, prototyping, modification, reuse, re-engineering, maintenance, or any other activities that result in software products.

8. Data mining

Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information (with intelligent methods) from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

The term "data mining" is a misnomer because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data Mining: Practical Machine Learning Tools and Techniques with Java (which covers mostly machine learning material) was originally to be named Practical Machine Learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics—or, when referring to actual methods, artificial intelligence and machine learning—are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, although they do belong to the overall KDD process as additional steps.
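
As a small, hedged illustration of two of the pattern types mentioned above (clusters and unusual records), here is a sketch on synthetic data using scikit-learn; real mining would of course run on much larger datasets:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Two synthetic groups of records plus one obvious outlier
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    [[20.0, 20.0]],
])

# Cluster analysis: discover groups of similar records
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# Anomaly detection: flag records that fit neither group (-1 means anomaly)
flags = IsolationForest(random_state=0).fit_predict(data)
print("records flagged as anomalous:", int((flags == -1).sum()))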

The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data. In contrast, data mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of data.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Data Mining Process

The knowledge discovery in databases (KDD) process is commonly defined with the stages:

1. Selection

2. Pre-processing

3. Transformation

4. Data mining

5. Interpretation/evaluation.

Many variations on this theme exist, however, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM), which defines six phases:

1. Business understanding

2. Data understanding

3. Data preparation

4. Modeling

5. Evaluation

6. Deployment

or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.

Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA. However, 3–4 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.

Note: SEMMA is an acronym that stands for Sample, Explore, Modify, Model, and Assess.

9. Natural language processing

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to process and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
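
A tiny rule-based sketch of the "processing natural language datasets" step, using only the standard library (the example sentences are made up), might tokenize text and count word frequencies like this:

import re
from collections import Counter

corpus = [
    "Natural language processing gives computers the ability to handle text.",
    "Text corpora and speech corpora are common NLP datasets.",
]

def tokenize(text):
    # A simple rule-based tokenizer: lowercase, keep runs of alphabetic characters
    return re.findall(r"[a-z]+", text.lower())

counts = Counter(tok for doc in corpus for tok in tokenize(doc))
print(counts.most_common(5))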

10. Computer vision

Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images (the input to the retina in the human analog) into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.

The scientific discipline of computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences, views from multiple cameras, multi-dimensional data from a 3D scanner, 3D point clouds from LiDAR sensors, or medical scanning devices. The technological discipline of computer vision seeks to apply its theories and models to the construction of computer vision systems.

Sub-domains of computer vision include scene reconstruction, object detection, event detection, activity recognition, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, visual servoing, 3D scene modeling, and image restoration.

Adopting computer vision technology might be painstaking for organizations as there is no single point solution for it. There are very few companies that provide a unified and distributed platform or an Operating System where computer vision applications can be easily deployed and managed.
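
As a very small illustration of extracting information from image data, here is a sketch that works on a synthetic grayscale image with NumPy; real systems would use camera or scanner data and far richer models:

import numpy as np

# Synthetic 8x8 grayscale image: a bright square on a dark background
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

# Thresholding: classify each pixel as foreground or background
foreground = image > 0.5
print("foreground pixels:", int(foreground.sum()))

# A crude edge measure: magnitude of the intensity gradients
gy, gx = np.gradient(image)
edges = np.hypot(gx, gy) > 0.25
print("edge pixels:", int(edges.sum()))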

11. Data storytelling

Data storytelling is a creative role that sits between data analysis and human-centered communication. Data storytellers visualize data, create reports, search for narratives that best characterize the data, and design innovative methods to convey those narratives. They reduce the data to focus on a particular feature, evaluate the behavior, and craft a story that helps others better understand business trends.

12. Product Management

Product management is the business process of planning, developing, launching, and managing a product or service. It includes the entire lifecycle of a product, from ideation through development to go-to-market. Product managers are responsible for ensuring that a product meets the needs of its target market and contributes to the business strategy, while managing a product or products at all stages of the product lifecycle. Software product management adapts the fundamentals of product management for digital products.

13. Artificial intelligence

Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of humans or animals. It is also the field of study in computer science that develops and studies intelligent machines. "AI" may also refer to the machines themselves.

AI technology is widely used throughout industry, government and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Waymo), generative or creative tools (ChatGPT and AI art), and competing at the highest level in strategic games (such as chess and Go).

Artificial intelligence was founded as an academic discipline in 1956. The field went through multiple cycles of optimism followed by disappointment and loss of funding, but after 2012, when deep learning surpassed all previous AI techniques, there was a vast increase in funding and interest.

The various sub-fields of AI research are centered around particular goals and the use of particular tools. The traditional goals of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception, and support for robotics. General intelligence (the ability to solve an arbitrary problem) is among the field's long-term goals. To solve these problems, AI researchers have adapted and integrated a wide range of problem-solving techniques, including search and mathematical optimization, formal logic, artificial neural networks, and methods based on statistics, probability, and economics. AI also draws upon psychology, linguistics, philosophy, neuroscience and many other fields.

14. Data Modeling

Data modeling is a process used to define and analyze data requirements needed to support the business processes within the scope of corresponding information systems in organizations. Therefore, the process of data modeling involves professional data modelers working closely with business stakeholders, as well as potential users of the information system.

There are three different types of data models produced while progressing from requirements to the actual database to be used for the information system. The data requirements are initially recorded as a conceptual data model which is essentially a set of technology independent specifications about the data and is used to discuss initial requirements with the business stakeholders. The conceptual model is then translated into a logical data model, which documents structures of the data that can be implemented in databases. Implementation of one conceptual data model may require multiple logical data models. The last step in data modeling is transforming the logical data model to a physical data model that organizes the data into tables, and accounts for access, performance and storage details. Data modeling defines not just data elements, but also their structures and the relationships between them.
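
To make the conceptual-to-physical progression concrete, here is a hedged sketch of a physical data model (the customer and purchase entities, column names, and types are invented for illustration) expressed as SQL tables created through Python's sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")

# Physical data model: tables, keys, and a relationship between two entities
conn.executescript("""
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    email        TEXT UNIQUE
);
CREATE TABLE purchase (
    purchase_id  INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    amount       REAL,
    purchased_at TEXT
);
""")

# The structure (not the data) is what the model defines
for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(row[0])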

Data modeling techniques and methodologies are used to model data in a standard, consistent, predictable manner in order to manage it as a resource. The use of data modeling standards is strongly recommended for all projects requiring a standard means of defining and analyzing data within an organization, e.g., using data modeling:

# to assist business analysts, programmers, testers, manual writers, IT package selectors, engineers, managers, related organizations and clients to understand and use an agreed upon semi-formal model that encompasses the concepts of the organization and how they relate to one another

# to manage data as a resource

# to integrate information systems

# to design databases/data warehouses (aka data repositories)

Data modeling may be performed during various types of projects and in multiple phases of projects. Data models are progressive; there is no such thing as the final data model for a business or application. Instead a data model should be considered a living document that will change in response to a changing business. The data models should ideally be stored in a repository so that they can be retrieved, expanded, and edited over time. Whitten et al. (2004) determined two types of data modeling:

# Strategic data modeling: This is part of the creation of an information systems strategy, which defines an overall vision and architecture for information systems. Information technology engineering is a methodology that embraces this approach.

# Data modeling during systems analysis: In systems analysis logical data models are created as part of the development of new databases.

Data modeling is also used as a technique for detailing business requirements for specific databases. It is sometimes called database modeling because a data model is eventually implemented in a database.

Tags: Data Analytics,Technology,Interview Preparation,

Sunday, October 1, 2023

20 Quiz Questions For JavaScript Beginners with Easy Difficulty (Oct 2023)


    

What are the Data Science Roles?

1. Data Strategist

Ideally, before a company collects any data, it hires a data strategist—a senior professional who understands how data can create value for businesses.

According to the famous data strategist Bernard Marr, there are four main ways in which companies in different fields can use data science:

They can make data-driven decisions.

Data can help create smarter products and services.

Companies can use data to improve business processes.

They could create a new revenue stream via data monetization.

Companies often outsource such data roles. They hire external consultants to devise a plan that aligns with the organizational strategy.

Once a firm has a data strategy in place, it is time to ensure data availability. This is when a data architect comes into play.

2. Data Architect

A data architect (or data modeler) plans out high-level database structures. This involves the planning, organization, and management of information within a firm, ensuring its accuracy and accessibility. In addition, they must assess the needs of business stakeholders and optimize schemas to address them.

Such data roles are of crucial importance. Without proper data architecture, key business questions may remain unanswered due to the lack of coherence between different tables in the database.

A data architect is a senior professional and often a consultant. To become one, you'd need a solid resume and rigorous preparation for the interview process.

3. Data Engineer

The role of data engineers and data architects often overlaps—especially in smaller businesses. But there are key differences.

Data engineers build the infrastructure, organize tables, and set up the data to match the use cases defined by the architect. What's more, they handle the so-called ETL process, which stands for Extract, Transform, and Load. This involves retrieving data, processing it in a usable format, and moving it to a repository (the firm's database). Simply put, they pipe data into tables correctly.

Typically, they receive many ad-hoc ETL-related tasks throughout their work but rarely interact with business stakeholders directly. This is one of the best-paid data scientist roles, and for good reason. You need a plethora of skills to work in this position, including software engineering.

Okay, let's recap.

As you can see, the jobs in data science are interlinked and complement each other, but each position has slightly different requirements. First come data strategists who define how data can serve business goals. Next, the architect plans the database schemas necessary to achieve the objectives. Lastly, the engineers build the infrastructure and pipe the data into tables.

4. Data Analyst

Data analysts explore, clean, analyze, visualize, and present information, providing valuable insights for the business. They typically use SQL to access the database.

Next, they leverage a programming language like Python or R to clean and analyze data and rely on visualization tools, such as Power BI or Tableau, to present the findings.


5. Business Intelligence Analyst

Data analyst's and BI analyst's duties overlap to a certain extent, but the latter has more of a reporting role. Their main focus is on building meaningful reports and dashboards and updating them frequently. More importantly, they have to satisfy stakeholders' informational needs at different levels of the organization.

6. Data Scientist

A data scientist has the skills of a data analyst but can leverage machine and deep learning to create models and make predictions based on past data.

We can distinguish three main types of data scientists:

# Traditional data scientists

# Research scientists

# Applied scientists

A traditional data scientist does all sorts of tasks, including data exploration, advanced statistical modeling, experimentation via A/B testing, and building and tuning machine learning models.

Research scientists primarily work on developing new machine learning models for large companies.

Applied scientists—frequently hired in big tech and larger companies—boast one of the highest-paid jobs in data science. These specialists combine data science and software engineering skills to productionize models.

More prominent companies prefer this combined skillset because it allows one person to oversee the entire ML implementation process—from the model building until productionization—which leads to quicker results. An applied scientist can work with data, model it for machine learning, select the correct algorithm, train the model, fine-tune hyperparameters, and then put the model in production.

As you can see, there's a significant overlap between data scientists, data analysts, and BI analysts.

7. ML Ops Engineer

Companies that don't have applied scientists hire ML Ops engineers. They are responsible for putting the ML models prepared by traditional data scientists into production.

In many instances, ML Ops engineers are former data scientists who have developed an engineering skillset. Their main responsibilities are to put the ML model in production and fix it if something breaks.
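
One small piece of that production work is persisting a trained model so a serving process can load it later. A minimal sketch with the standard library's pickle (the tiny training data is invented) might look like this; real pipelines typically add versioning, validation, and monitoring around it:

import pickle
from sklearn.linear_model import LogisticRegression

# Train a small model (a stand-in for one handed over by a data scientist)
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Serialize the fitted model as it might be shipped to a serving environment
blob = pickle.dumps(model)

# Later, the serving process restores the model and makes predictions
restored = pickle.loads(blob)
print(restored.predict([[2.5]]))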

8. Data Product Manager

The last role we discuss in this article is that of a data product manager. The person in this position is accountable for the success of a data product. They consider the bigger picture, identifying what product needs to be created, when to build it, and what resources are necessary.

A significant focus of such data science roles is data availability—determining whether to collect data internally or find ways to acquire it externally. Ultimately, product managers strategize the most effective ways to execute the production process.

Tags: Technology,Data Analytics,Interview Preparation

What is Data Science Lifecycle?

A data science lifecycle indicates the iterative steps taken to build, deliver, and maintain any data science product. Not all data science projects are built the same, so their lifecycles vary as well. Still, we can picture a general lifecycle that includes some of the most common data science steps. A general data science lifecycle includes the use of machine learning algorithms and statistical practices that result in better prediction models. Some of the most common steps involved in the entire process are data extraction, preparation, cleansing, modeling, and evaluation. The world of data science often refers to this general process as the Cross-Industry Standard Process for Data Mining (CRISP-DM).

Who Are Involved in The Projects?

Domain Expert: Data science projects are applied in different real-life domains or industries, such as banking, healthcare, and the petroleum industry. A domain expert is a person who has experience working in a particular domain and knows it inside and out.

Business Analyst: A business analyst is required to understand the business needs in the identified domain. This person can help devise the right solution and a timeline for it.

Data Scientist: A data scientist is an expert in data science projects, has experience working with data, and can work out what data is needed to produce the required solution.

Machine Learning Engineer: A machine learning engineer can advise on which model should be applied to get the desired output and can devise a solution that produces the correct, required output.

Data Engineer and Architect: Data architects and data engineers are the experts in data modeling. They look after the visualization of data for better understanding, as well as the storage and efficient retrieval of data.

The Lifecycle of Data Science

1. Problem identification

This is a crucial step in any data science project. The first thing is understanding in what way data science is useful in the domain under consideration and identifying the appropriate tasks for it. Domain experts and data scientists are the key people in problem identification. The domain expert has in-depth knowledge of the application domain and of exactly what problem is to be solved. The data scientist understands the domain and helps identify the problem and possible solutions to it.

2. Business Understanding

Business understanding means understanding what exactly the customer wants from the business perspective. Whether the customer wishes to make predictions, improve sales, minimize losses, or optimize a particular process forms the business goals. During business understanding, two important steps are followed:

KPI (Key Performance Indicator)

For any data science project, key performance indicators define the performance or success of the project. There needs to be an agreement between the customer and the data science project team on the business-related indicators and the related data science project goals. Depending on the business need, the business indicators are devised, and the data science project team then decides the goals and indicators accordingly. To better understand this, let us see an example: suppose the business need is to optimize the overall spending of the company; then the data science goal might be to use the existing resources to manage double the number of clients. Defining the key performance indicators is crucial for any data science project, as the cost of the solution will differ for different goals.

SLA (Service Level Agreement)

Once the performance indicators are set, finalizing the service level agreement is important. The service level agreement terms are decided as per the business goals. For example, an airline reservation system might be required to handle, say, 1,000 simultaneous users; the requirement that the product satisfy this load then becomes part of the service level agreement.

Once the performance indicators are agreed upon and the service level agreement is completed, the project proceeds to the next important step.

3. Collecting Data

Data collection is an important step, as it forms the base for achieving the targeted business goals. There are various ways data can flow into the system, as listed below:

1. Surveys

2. Social media

3. Archives

4. Transactional data

5. Enterprise data

6. Statistical methods

4. Pre-processing data

Large amounts of data are collected from archives, daily transactions, and intermediate records. The data is available in various formats and forms, and some data may be available only in hard copy. The data is scattered across various places and servers. All of this data is extracted, converted into a single format, and then processed. Typically, a data warehouse is constructed, where the Extract, Transform and Load (ETL) operations are carried out. In a data science project this ETL operation is vital. The data architect plays an important role at this stage, deciding the structure of the data warehouse and performing the ETL steps.

5. Analyzing data

Now that the data is available and ready in the required format, the next important step is to understand the data in depth. This understanding comes from analyzing the data using the various statistical tools available. A data engineer plays a vital role in the analysis of the data. This step is also called Exploratory Data Analysis (EDA). Here the data is examined by formulating various statistical functions, and dependent and independent variables or features are identified. Careful analysis of the data reveals which data or features are important and what the spread of the data is. Various plots are used to visualize the data for better understanding. Tools like Tableau and Power BI are well known for performing exploratory data analysis and visualization. Knowledge of data science with Python and R is important for performing EDA on any type of data.

6. Data Modelling

Data modelling is the important next step once the data is analyzed and visualized. The important components are retained in the dataset, and thus the data is further refined. The key questions now are how to model the data and which tasks are suitable for modelling. Whether a task such as classification or regression is suitable depends on the business value required, and many ways of modelling are available for each of these tasks. The machine learning engineer applies various algorithms to the data and generates the output. While modelling the data, the models are often first tested on dummy data similar to the actual data.

7. Model Evaluation/ Monitoring

As there are various ways to model the data, it is important to decide which one is effective. For that, the model evaluation and monitoring phase is crucial. The model is now tested with actual data. The data may be very limited, in which case the output is monitored for improvement. There may be changes in the data while the model is being evaluated or tested, and the output can change drastically depending on those changes. So, while evaluating the model, the following two phases are important:

Data Drift Analysis

Changes in the input data are called data drift. Data drift is a common phenomenon in data science because, depending on the situation, the data will change. Analysis of this change is called data drift analysis. The accuracy of the model depends on how well it handles this data drift. The changes in the data are mostly due to changes in the statistical properties of the data.
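
A simple, hedged way to check for this kind of drift is to compare the distribution of a feature in the training (reference) data against recent production data, for example with a two-sample Kolmogorov-Smirnov test. The synthetic data below simulates a shift in the mean:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)   # training-time feature values
current = rng.normal(loc=0.4, scale=1.0, size=1000)     # production values, mean shifted

stat, p_value = ks_2samp(reference, current)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Distributions differ significantly: possible data drift.")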

Model Drift Analysis

Machine learning techniques can be used to discover data drift, and more sophisticated methods such as Adaptive Windowing (ADWIN) and Page-Hinkley are also available. Model drift analysis is important because, as we all know, change is constant. Incremental learning, where the model is exposed to new data incrementally, can also be used effectively.

8. Model Training

Once the task, the model, and the data drift analysis approach are finalized, the important step is to train the model. The training can be done in phases, where the important parameters can be further fine-tuned to get the required accurate output. The model is exposed to the actual data in the production phase, and the output is monitored.

9. Model Deployment

Once the model is trained with the actual data and the parameters are fine-tuned, the model is deployed. The model is now exposed to real-time data flowing into the system, and output is generated. The model can be deployed as a web service or as an embedded application in an edge or mobile application. This is a very important step, as the model is now exposed to the real world.
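
As a rough sketch of the "deployed as a web service" option, the snippet below (assuming Flask is installed; the model file name and input format are invented) exposes a trained model behind a /predict endpoint:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained, fine-tuned model saved earlier (hypothetical file name)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [[1.2, 3.4, ...]]}
    payload = request.get_json()
    prediction = model.predict(payload["features"]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)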

10. Driving insights and generating BI reports

After model deployment in the real world, the next step is to find out how the model behaves in real-world scenarios. The model is used to get insights that aid in strategic decisions related to the business. The business goals are bound to these insights. Various reports are generated to see how the business is doing, and they help in finding out whether the key performance indicators have been achieved or not.

11. Taking a decision based on insight

For data science to do wonders, every step indicated above has to be done very carefully and accurately. When the steps are followed properly, the reports generated in the previous step help in making key decisions for the organization. The insights generated help in taking strategic decisions; for example, the organization can predict in advance that there will be a need for raw materials. Data science can be of great help in making many important decisions related to business growth and better revenue generation.

Tags: Interview Preparation,Data Analytics,

Tuesday, September 26, 2023

Interview for Statistics and Data Science Profile (Part 1) - 26 Sep 2023

1: The Empirical Rule For Normal Distribution

The lifespans of gorillas in a particular zoo are normally distributed. The average gorilla lives 20.8 years; the standard deviation is 3.1 years. Estimate the probability of a gorilla living between 11.5 and 27 years.
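
One way to work this out with the empirical (68-95-99.7) rule: 11.5 years is three standard deviations below the mean (20.8 - 3 × 3.1) and 27 years is two standard deviations above it (20.8 + 2 × 3.1). A minimal sketch of the arithmetic:

mean, sd = 20.8, 3.1

# Express 11.5 and 27 in standard deviations from the mean
low_z = (11.5 - mean) / sd    # about -3
high_z = (27.0 - mean) / sd   # about +2

# Empirical rule: ~99.7% within 3 SD, ~95% within 2 SD (half on each side)
prob = 0.997 / 2 + 0.95 / 2
print(low_z, high_z, prob)    # roughly -3, +2, 0.9735 (about 97.35%)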

2: Calculating the equation of the least-squares line

3: Simple probability

4: Dependent probability

5: Conditional probability

6: Dependent and independent events

Solution:

7: Probability with permutations and combinations

Solution:

Number of ways the mom can choose the favorable outcome (Sausage and onion): 1

Total number of ways in which two ingredients can be chosen: 8C2
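
For reference, a short check of that count and the resulting probability (1 favorable outcome out of 8C2 equally likely pairs):

from math import comb

total_pairs = comb(8, 2)              # 28 ways to choose 2 of 8 ingredients
print(total_pairs, 1 / total_pairs)   # 28, probability = 1/28 (about 0.0357)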

Tags: Interview Preparation,Technology,Mathematical Foundations for Data Science,

Thursday, September 21, 2023

Interview for Data Engineering and Machine Learning Profile (20 Sep 2023) - For the position of Infosys Digital Specialist

Section 1: Programming

1. How would you rate yourself on a scale of 1 to 5 in these three:

Data engineering, ML Ops, Cloud

2. Broad concepts around Data Engineering and MLOps.

3. Write code to find the number of factors of a number.


import math

n = int(input("The number:"))

# Divisors come in pairs (i, n // i), so it is enough to check i up to sqrt(n)
sqrt_n = math.ceil(math.sqrt(n))

factors = set()

for i in range(1, sqrt_n + 1):
    if n % i == 0:
        q, r = divmod(n, i)   # q is the cofactor paired with the divisor i
        factors.add(i)
        factors.add(q)

print(factors)
print(len(factors))


Sample output:
The number:12
{1, 2, 3, 4, 6, 12}
6

The number:100
{1, 2, 4, 100, 5, 10, 50, 20, 25}
9

4. What is the complexity of this code?

5. Can you suggest any optimization in it?

6. Write code to tell if a number is a happy number.

A happy number is a number defined by the following process:

- Starting with any positive integer, replace the number by the sum of the squares of its digits

- Repeat the process until the number equals 1 (where it will stay), or it “loops endlessly in a cycle” which does not include 1

- Those numbers for which this process “ends in 1” are happy.

Return true if n is a happy number, and false if not.

For example: 19 is a happy number. It produces the following sequence of sums of the squares of its digits: 19, 82, 68, 100, 1.

And 2 is an unhappy number.

7. How would you identify an unhappy number, for example 2?

A number is either a happy number or unhappy number.

We can create a list of all the happy numbers till 1000 and a list of unhappy numbers. Then preemptively stop on encountering one of those.

This way memoization would allow for optimization.
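
A minimal sketch of one common approach (detecting the cycle with a set of previously seen values, rather than the precomputed lists mentioned above):

def is_happy(n: int) -> bool:
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        # Replace n with the sum of the squares of its digits
        n = sum(int(d) ** 2 for d in str(n))
    return n == 1

print(is_happy(19))  # True: 19 -> 82 -> 68 -> 100 -> 1
print(is_happy(2))   # False: enters a cycle that never reaches 1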

Section 2: Machine Learning

8. Which ML algorithm you are most comfortable with?

9. Can you take up questions on SVM?

10. The Machine Learning problem:

Let’s say you work in a financial institution, and you are given the task of using Support Vector Machines (SVM) to build a trading strategy for equities based on multiple features, such as moving average, volatility, and market sentiment.

Problem Statement:

To create an optimized SVM model that can effectively classify equities into “Buy”, “Hold”, and “Sell” categories based on historical and real time data.

Build an initial SVM model with a radial basis function (RBF) or polynomial kernel. Experiment with different parameters like the regularization constant (C), the kernel coefficient (gamma), and others.

Discuss how you are going to do this.
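
One way to frame the discussion is with a small, hedged sketch: the synthetic three-class dataset below stands in for engineered features such as moving averages, volatility, and sentiment scores, and a grid search tunes C and gamma for an RBF-kernel SVM. A real strategy would also need proper time-series splits, scaling fitted only on training data, and backtesting.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for Buy / Hold / Sell labels and engineered features
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

# Cross-validated grid search over C and gamma
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print("best params:", search.best_params_)
print("cv score:", search.best_score_, "test accuracy:", search.score(X_test, y_test))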

11. How would you tune the hyperparameters of the model?

12. How would you use SVM on real time data?

13. What would be your strategy for feature selection?

14. What is RBF - Radial basis function?

15. What is matrix factorization with respect to SVMs?

Section 3: Cloud

16. Which cloud platform have you used?

17. Which features of GCP have you used?

18. Which features of AWS have you used?

19. What is Elastic Cloud Compute or EC2?

20. What are the steps for creating a project in GCP to use Buckets?

21. What are the steps of creating a project in AWS to use Lambda functions?

Tags: Machine Learning,Interview Preparation,Technology,

Recursion (An Introduction)

What's Recursion Got to do With Russian Dolls?

Before we get into recursion, have you ever seen a Russian doll?

You open it further...

And once more...

Recursion

The process of solving a problem using Recursion is like opening Russian dolls.

Once you open the biggest doll, a similar doll but a bit smaller in size appears.

In the same way, when you try to solve a bigger problem using recursion, it usually requires you to solve a smaller instance of the same problem.

Let's take a look at some problems that can be solved using Recursion

1. Factorial

2. Determine whether a word is a palindrome

3. Computing powers of a number

4. N-th Fibonacci Number

5. Generating permutations of letters such as 'a', 'b', 'c'.

Factorial

# n! = n.(n-1).(n-2)...3.2.1

Or, equivalently:

# n! = n.(n-1)!

If n = 0, then declare that n! = 1.

Otherwise, n must be positive. Solve the subproblem of computing (n-1)! and multiply this result by n, and declare n! equal to the result of this product.

When we're computing n!, in this way, we call the first case, where we immediately know the answer, the base case, and we call the second case, where we have to compute the same function but on a different value, the recursive case.
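
In Python, that description translates almost line for line into code:

def factorial(n: int) -> int:
    if n == 0:                        # base case: we immediately know the answer
        return 1
    return n * factorial(n - 1)       # recursive case: a smaller subproblem

print(factorial(5))  # 120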

Distilling the idea of recursion into two simple rules

1. Each recursive call should be on a smaller instance of the same problem, that is, a smaller subproblem.

2. The recursive calls must eventually reach a base case, which is solved without further recursion.

Determine whether a word is a palindrome

1) If the string is made of no letters or just one letter, then it is a palindrome.

2) Otherwise, compare the first and last letters of the string.

3) If the first and last letters differ, then the string is not a palindrome.

4) Otherwise, the first and last letters are the same. Strip them from the string, and determine whether the string that remains is a palindrome. Take the answer for this smaller string and use it as the answer for the original string.
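
A direct Python translation of those four steps:

def is_palindrome(word: str) -> bool:
    if len(word) <= 1:                    # base case: zero or one letter
        return True
    if word[0] != word[-1]:               # first and last letters differ
        return False
    return is_palindrome(word[1:-1])      # strip them and recurse on the rest

print(is_palindrome("racecar"), is_palindrome("recursion"))  # True False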

Computing powers of a number

x**n = x**(n-1) * x

N-th Fibonacci Number

fib(n) = fib(n-1) + fib(n-2)
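
Both formulas above map directly onto small recursive functions, with n = 0 as the base case for the power and the first two Fibonacci numbers as base cases:

def power(x, n: int):
    if n == 0:                        # base case: x**0 == 1
        return 1
    return power(x, n - 1) * x        # x**n = x**(n-1) * x

def fib(n: int) -> int:
    if n < 2:                         # base cases: fib(0) = 0, fib(1) = 1
        return n
    return fib(n - 1) + fib(n - 2)

print(power(2, 10), fib(10))  # 1024 55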

Generating permutations of letters such as 'a', 'b', 'c'

Permutations of a, b, c =

a + Permutations of (b, c),

b + Permutations of (a, c), and

c + Permutations of (a, b)
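
That rule can be written recursively as well: pick each letter in turn and prepend it to every permutation of the remaining letters.

def permutations(letters):
    if len(letters) <= 1:                  # base case: one (or zero) letters
        return [letters]
    result = []
    for i, letter in enumerate(letters):
        rest = letters[:i] + letters[i + 1:]
        for perm in permutations(rest):    # permutations of the remaining letters
            result.append(letter + perm)
    return result

print(permutations("abc"))  # ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']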

Tags: Python,Technology,Algorithms