
Tuesday, November 10, 2020

Introduction to Big Data (by Databricks)



Learn foundational concepts about the big data landscape.
This course was created for individuals who are new to the big data landscape and want to become conversant with big data terminology. It covers foundational concepts related to the big data landscape, including the characteristics of big data; the relationship between big data, artificial intelligence and data science; how individuals on data science teams work with big data; and how organizations can use big data to enable better business decisions. Note: This course will not cover Databricks concepts or functionality. This is an introductory-level course focused on big data concepts.

Learning objectives

- Explain foundational concepts used to define big data.
- Explain how the characteristics of big data have changed traditional organizational workflows for working with data.
- Summarize how individuals on data science teams work with big data on a daily basis to drive business outcomes.
- Articulate examples of real-world use-cases for big data in businesses across a variety of industries.

-----

Lesson 1: Technology and the explosion of data

Sources of data

Human-generated
What is it? Data that humans create and share.
What are some examples of human-generated data?
- Social media posts*
- Emails
- Spreadsheets
- Presentations
- Audio files
- Video files

* Social media has been a leading force in the propagation of human-generated data. Just think - every time we post a message, change our online status, upload images, or like and forward comments, we are generating data. Let's look at Facebook, for example. According to Forbes, 1.5 billion people are active on Facebook every day, 510,000 comments are posted every minute, and five new profiles are created every second.

Machine-generated
What is it? Data generated by machines without active human intervention.
What are some examples of machine-generated data sources?
- Sensors on vehicles, appliances and industrial machinery
- Security cameras
- Satellites
- Medical devices
- Personal tools such as smartphone apps or fitness trackers

What does it mean for data to be generated without active human intervention? Think of a fitness tracker. Depending on the model you have, it might generate records for your heart rate, your geographic location, the calories you burn and more. You don't tell your fitness tracker to track these things - it comes programmed to do it and does it on its own.

Organization-generated
What is it? Data generated as organizations run their businesses.
What are some examples of organization-generated data? Records generated every time you make a purchase at an online or physical store - things like unique customer numbers, the items you purchased, the date and time you purchased items and how many of each item you purchased. Organization-generated data is often referred to as transactional data. You'll hear this term frequently in the world of big data.

Lesson 2: What makes big data “big”?

The major characteristics used to define big data are volume, velocity and variety.

VOLUME: Volume refers to the vast amount of data being generated every second of every day. The International Data Corporation (IDC) forecasts that the amount of data that exists in the world will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. Just to put that into perspective, the computer being used to create this course has 256GB of storage. That's equivalent to just 0.000000000256 (9 zeros) zettabytes. Organizations working with big data find ways to process, store and analyze data coming in at massive volumes that surpass what traditional methods can process and store.

VELOCITY: The second characteristic that defines big data is velocity, which refers to the speed at which new data is generated and the speed at which data moves around. A good example of data velocity is a social media post going viral in seconds. Another example is the speed at which credit card transactions are checked for fraudulent activity. Have you ever tried to purchase something you don't normally purchase and had that transaction declined? In just a matter of seconds, your credit card company received information about your purchase, compared it to the purchases you usually make and decided whether or not to flag it as a fraudulent transaction. Organizations working with big data find ways to work with data that is generated and moves around this quickly (or even faster!).

VARIETY: Finally, variety is also used to define big data. Data variety refers to the many different types of data that exist today - social media posts, credit card transactions, legal contracts, biometric data and geographic information, just to name a few. Organizations working with big data find ways to use different types of data together - for example, an organization might want to extract data insights from a combination of social media posts, customer transaction records and real-time product usage.
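As a quick sanity check of the volume comparison above, here is a small Python snippet (assuming decimal units, i.e. 1 GB = 10^9 bytes and 1 ZB = 10^21 bytes) that converts 256GB into zettabytes:

# Convert the 256GB drive from the example into zettabytes,
# assuming decimal units (1 GB = 10**9 bytes, 1 ZB = 10**21 bytes).
gb_in_bytes = 256 * 10**9
zb_in_bytes = 10**21
print(gb_in_bytes / zb_in_bytes)  # 2.56e-10, i.e. 0.000000000256 zettabytes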
-----

At times you will also hear about the 5 V's of big data:

V4: Veracity - Data veracity, in general, is how accurate or truthful a data set may be. In the context of big data, however, it takes on a bit more meaning: it is not just the quality of the data itself, but how trustworthy the data source, the data type and the processing of it are.

V5: Value - This refers to the ability to transform a tsunami of data into business value.

Lesson 3: Types of big data

Three types of big data are: structured, unstructured and semi-structured.

Structured
The term structured data refers to any data that conforms to a certain format or schema. A popular example of structured data is a spreadsheet. In a spreadsheet, there are usually clearly labeled rows and columns, and the information within those rows and columns follows a certain format. In a sales spreadsheet, for example, months might be written as three-letter abbreviations, customer IDs as five-digit numbers and colors formatted as "Name|Name". Because structured data is clearly organized, it's generally easier to analyze. For example, if I asked you to look at such a spreadsheet and tell me how much money we made for all of these orders, you could easily tell me, because the prices listed are numeric and can be summed up. A lot of the data that organizations work with every day can be categorized as structured data.

Unstructured
By contrast, unstructured data is often referred to as "messy" data, because it isn't easily searched compared to structured data. For example, imagine that instead of providing you with a spreadsheet of sales, I ask you to review camera footage that shows customers buying products and tell me how much money was made. That task would be much harder to do than using our spreadsheet. Unstructured data is the most widespread type of data - the IDC reports that almost 90% of data today is unstructured. Today, many organizations struggle with trying to make sense of unstructured data, especially when trying to use it for business insights. That's where different fields of artificial intelligence become an important part of the data analysis process. Aside from videos, other examples of unstructured data include:
- Social media posts
- Photographs and other images
- Emails
- Audio files

Semi-structured
Finally, we have semi-structured data. Semi-structured data fits somewhere in between structured and unstructured data. It does not reside in a formatted table, but it does have some level of organization. A good example of semi-structured data is HTML code. If you've ever right-clicked in your browser and selected "inspect" or "inspect element", you've seen an example of this. Although you are not restricted in how much information you collect or what kind of information you collect, there is still a defined way to express the data.
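To make the distinction concrete, here is a small, hypothetical Python example (the field names and values are made up, not from the course): structured data fits a fixed table, while semi-structured records such as JSON are self-describing but can vary in shape from record to record.

import json

# Structured: every row has the same columns, like a spreadsheet.
orders = [
    ("Jan", 10482, 19.99),
    ("Feb", 10513, 5.49),
]
print(sum(price for _, _, price in orders))  # easy to aggregate

# Semi-structured: self-describing keys, but records can carry different fields.
records = [
    '{"customer_id": 10482, "month": "Jan", "color": "Red|Blue"}',
    '{"customer_id": 10513, "month": "Feb", "notes": "gift order"}',
]
for r in records:
    print(json.loads(r))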

Knowledge Check: Three Questions

Lesson 4: An introduction to distributed computing

The de facto standard tool for distributed computing is Apache Spark. A very high-level way to understand the Apache Spark architecture is the example of counting a tub of candies with a group of 5 friends. Here:

Job: count the candies in parallel
Driver: myself (who collects the results from the other friends and does the reporting)
Executors: the other four friends (who do the counting)
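A minimal PySpark sketch of the same idea (assuming a local Spark installation; the numbers are purely illustrative): the driver program coordinates the job, and Spark splits the data into partitions that the executors count in parallel.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("candy-count").getOrCreate()

# Pretend each element is one candy; Spark splits the range into 4
# partitions, and the executors count their partitions in parallel.
candies = spark.sparkContext.parallelize(range(1_000_000), numSlices=4)
total = candies.count()  # the driver gathers the per-partition counts
print("Total candies:", total)

spark.stop()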

Lesson 5: Batch vs. streaming processing

Batch data
What is it? Batch data is data that we have in storage and that we process all at once, or in a batch. Say for example that someone gives you a large jar of candies and asks you to count all of the candies in the jar. That is a simple example of a batch job - we take candies that were already present in some form of storage (in this case, a jar) and count them. Since we count all of the candies one time, this is considered batch processing.

Example of batch processing
A real-world example of batch processing is how telecommunication companies process cellular phone usage each month to generate our monthly phone bills. To do this they process batch data - the phone calls you've made, the text messages you've sent and any additional charges you've incurred through that billing cycle - to generate your bill. They process that batch data in a batch job.

Streaming data
What is it? On the other hand, we have streaming data. Streaming data is data that is being continually produced by one or more sources and therefore must be processed incrementally as it arrives. Now, what if instead of counting candy sitting in a jar, we are asked to count candy coming towards us on a conveyor belt? As the candy reaches us, we have to count the new pieces and constantly update our overall candy count. In a streaming job, our final count changes in real time as more and more candy arrives on the conveyor belt.

Example of stream processing
A real-world example of stream processing is how heart monitors work. All day long, as you wear your heart monitor, it receives new data - tens of thousands of data points per day as your heart beats. Every time your heart beats, your heart monitor adds new data to its data store in real time. If your heart monitor displays your average heart rate for the day, that average must be constantly updated with the new numbers from the incoming stream of data.

Both batch and streaming data have their place when it comes to big data analytics. Batch data is used for things like periodic reporting, and streaming data for things like fraud detection, which needs to happen in real time. Historically it has been difficult to use these different types of data in conjunction. Thanks to new advances in technology, however, combining batch and stream processing is possible, and it leads to significant advantages in big data analytics. At this point, we've reviewed how we process big data and have explored the types of input data we have to work with. Next, we'll discuss another topic in data management - where to store big data.
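Here is a hedged PySpark sketch of the difference (the "events/" directory, schema and column names are made up for illustration): the same aggregation can run once over data already in storage, or incrementally over data that keeps arriving.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

# Batch: read everything already in storage and aggregate it once.
batch_df = spark.read.json("events/")
batch_df.groupBy("user_id").agg(F.avg("heart_rate").alias("avg_hr")).show()

# Streaming: process new files incrementally as they arrive,
# keeping the running average up to date.
stream_df = spark.readStream.schema(batch_df.schema).json("events/")
query = (stream_df.groupBy("user_id")
                  .agg(F.avg("heart_rate").alias("avg_hr"))
                  .writeStream.outputMode("complete")
                  .format("console")
                  .start())
query.awaitTermination()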

Lesson 6: Data storage systems

Today, most organizations are storing their big data in one or a combination of the following storage systems: data warehouses, data lakes and unified data platforms.

Data warehouses
Data warehouse technology emerged in the 1980s and provides a centralized repository for storing all of an organization's data. Data warehouses can be on-premises or in the cloud.

Benefits of data warehouses
They've been around for decades, work well for structured data and are reliable. Since they generally only take structured data, the data is typically clean and easy to query.

Challenges with data warehouses
They can be hard and expensive to scale (if you need more space, for example). You lose a lot of valuable potential by not taking advantage of unstructured data. You often have to deal with vendor lock-in, which occurs when your data is stored in a system that does not belong to you. Data warehouses are very expensive to build, license and maintain, especially for large data volumes, even with the availability of cloud storage.

Lesson 7: Knowledge Check

Lesson 8: Techniques for working with big data

Artificial intelligence
What is it? Artificial intelligence (AI) is a branch of computer science in which computer systems are developed to perform tasks that would typically need human intelligence. AI is a broad field, and it encapsulates many techniques within its umbrella.

Example
To contextualize AI, let's look at a classic example - the Turing test. In a Turing test:
1. A human evaluator asks a series of text-based questions to a machine and a human, without being able to see either.
2. The human and the machine answer the questions.
3. If the evaluator cannot differentiate between the human and machine responses, the computer passes the Turing test. This means that it exhibited human-like behavior - or, artificial intelligence!

Machine learning
What is it? Machine learning (ML) is a subset of artificial intelligence that works very well with structured data. The goal behind machine learning is for machines to learn patterns in your data without you explicitly programming them to do so. There are a few types of machine learning; the most commonly used type is called supervised machine learning.

Example
Supervised machine learning is commonly used in detecting fraud. At a high level, it works like this (see the code sketch at the end of this lesson):
1. A human being specifies rules for what constitutes fraud (for example, a bank account with more than 20 transactions a month or an average balance of less than $100).
2. These rules, along with data labeled either as "fraud" or "not fraud", are passed to an algorithm, and the machine learns what fraudulent data looks like.
3. The machine uses what it has learned to predict fraud.
4. A human manually investigates and verifies the model's predictions whenever the model predicts "fraud".

Deep learning
What is it? Deep learning (DL) is a subset of machine learning that uses neural networks - sets of algorithms modeled on the structure of the human brain. They are much more complex than most machine learning models and require significantly more time and effort to build. Unlike machine learning, which plateaus after a certain amount of data, deep learning continues to improve as the data size increases. It performs well on complex datasets like images, sequences and natural language.

Example
Deep learning is often used to classify images. For example, say that you want to build a model to classify whether an image contains a koala. You would feed hundreds, thousands, or millions of pictures into a machine - some showing koalas and others not. Over time, the model learns what a koala is and what it isn't, and it can more easily and quickly identify a koala among other images. It's important to note that while humans might recognize koalas by their fluffy ears or large oval-shaped noses, a machine will detect things that we cannot - things like patterns in the koala's fur or the exact shape of its eyes. It is able to make decisions quickly based on that information.

Data science
What is it? Data science is a field that combines tools and workflows from disciplines like math and statistics, computer science and business to process, manage and analyze data. Data science is very popular in businesses today as a way to extract insights from big data to help inform business decisions.

Example
You have already seen a couple of examples of data science techniques! Machine learning and deep learning are common tools (among many others) in a data scientist's toolbox to help extract insights from data.
While this was not an exhaustive list, these are some of the most popular techniques for working with big data. An important idea to note here is that one of the benefits of using these techniques, particularly machine learning and deep learning, is that they help scale analytics. As you can imagine, once machines learn how to detect patterns in our data, they are able to make predictions much faster than humans can. As we discussed, all of these techniques are used by data science practitioners to help extract insights from big data. They use these techniques as part of a data science workflow, a series of steps they follow to process, manage and analyze data.
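As promised above, here is a minimal sketch of the supervised fraud-detection example using scikit-learn (the course does not name a library, and the feature values, labels and thresholds below are purely illustrative toy data):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one account: [transactions_per_month, average_balance].
# Labels: 1 = "fraud", 0 = "not fraud" (toy data, not real rules).
X = np.array([[25, 80], [30, 50], [40, 90], [5, 2000], [8, 1500], [3, 3000]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)  # the machine learns what fraud looks like

# Predict whether a new account's activity looks fraudulent;
# a human would still review any "fraud" prediction.
new_account = np.array([[22, 95]])
print("fraud" if model.predict(new_account)[0] == 1 else "not fraud")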

Lesson 9: The data science workflow

The data science workflow is a series of steps that data practitioners follow to work with big data. It is a cyclical process that often starts with identifying business problems and ends with delivering business value.

Lesson 10: Roles on a data science team

Platform administrators
What do they do?

Platform administrators can also be called devops engineers, infrastructure engineers and cloud engineers. They are responsible for managing and supporting big data infrastructure. Some of these tasks include:

  • setting up of big data infrastructure
  • performing updates and maintenance work
  • performing health checks
  • keeping track of how team members are using the platform by setting up and monitoring alerts, for example
  • implementing best practices for managing data
Additionally, platform administrators provide governance to development teams around change, configuration and upgrades, and often evaluate new tools and technologies that can complement their big data infrastructure.

What do they need? To perform their duties, platform administrators often use tools like the infrastructure and monitoring services that major cloud providers offer to help them keep data secure and scale and manage their infrastructure.

Data engineers
What do they do? Data engineers develop, construct, test and maintain data pipelines, which are mechanisms that allow data to move between systems or people. If we think back to the data science workflow, we talked about data ingestion and that once data is ingested, it needs to be prepared for use in machine learning and business analytics. This is where a data pipeline fits in - taking data from its raw data source and moving it along that pipeline to where it can be used at different stages of a machine learning or data analytics project (a minimal pipeline sketch follows the tool list below).

What do they need? To perform their duties, data engineers use a set of tools to build and maintain these pipelines, including:
  • Programming languages like Python and Scala
  • Different data storage solutions
  • Data processing engines like Apache Spark
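A hedged sketch of one small pipeline step in PySpark (the paths and column names are made up for illustration): ingest raw transactional data, clean it, and write it out for analysts and data scientists to use downstream.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-pipeline").getOrCreate()

raw = spark.read.json("raw/orders/")            # ingest raw transactional data
clean = (raw.dropna(subset=["customer_id"])     # drop records missing a key field
            .withColumn("order_date", F.to_date("order_ts")))  # standardize types

clean.write.mode("overwrite").parquet("curated/orders/")  # hand off downstream
spark.stop()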
Data analysts
What do they do? Data analysts take data prepared by data engineers and extract insights from it. Typically, a data analyst will also present data in the form of graphs, charts and dashboards to stakeholders to help them make business decisions. Data analysts can also take advantage of the work of machine learning engineers to help derive insights from data. They are typically well-versed in data visualization tools and business intelligence concepts and are often in charge of interpreting data insights and effectively communicating their findings to stakeholders.

What do they need? To perform their duties, data analysts often use:
  • SQL programming language
  • Visualization tools like Tableau, PowerBI, Looker and others.
Data scientists
What do they do? Data scientists take the data prepared by data engineers and use a variety of methods to extract insights. Data scientists usually have a strong background in disciplines like math, statistics and computer science. They are often tasked with building machine learning models, testing those models and keeping track of their machine learning experiments.

What do they need? To perform their duties, data scientists use tools like:
  • Programming languages like Python, R and SQL
  • Machine learning libraries
  • Notebook interfaces like Jupyter

Lesson 12: Big data for business decision-making

Welcome to the last lesson in this course! At this point, you should have a good understanding of what big data is and how organizations process, manage and analyze it to extract business insights. In this lesson, we'll take this information one step further and answer the question, "Why does all of this matter?"

First, let's do a quick exercise to see how well you can predict customer behavior.

How well do you know this customer?
Imagine that you and I work at a bank, and we accidentally transferred $5,000 into the wrong customer's account. All we know about this customer is that he:
- Is male
- Is 20 years old
- Is single
- Currently resides in New York City
- Makes $100,000 a year

Based on this information, what do you think our customer would do with the $5,000?
A. Give the money back
B. Take the money and run

Take a second to think about this. Once you have an answer, continue reading.

What did you decide? How did you come to this decision? Were you thinking that you needed more data? What if I now told you that this error has happened twice in the past to the same individual, and that each time he has given the money back? How would this information change your assessment of what would happen? To those of you who thought, "I need more data!" - you were right! Without the additional piece of information about past transactions, there was really no way to know or even make an educated guess about what our customer would do.

Big data analytics for business
Think about this example at a much more complex and massive scale. Imagine if we could use data to understand our customers' next moves. Imagine if we could go into a business decision with a fairly certain idea of what the outcome would be. Big data analytics makes this possible. With big data analytics, organizations are able to do things like:
- Understand their customers better: Who are our customers? What do they like? How do they use our products?
- Improve their products: Do we need to make changes to our products? What types of changes should we make? What do people like most about our product?
- Protect their business: Are we investing money in the correct things? Will the risks we take pay off?
- Stay ahead of the competition: Who are our biggest competitors? How will we do with upcoming trends in our industry?

Regardless of industry, big data enables organizations to find answers to questions they want to know and helps them find patterns to answer questions they didn't even know to ask. Next, we'll take this one step further and look at specific examples of how big data analytics is being used in a wide range of industries.

Lesson 13: Big data use-cases in different industries

Thousands of organizations around the world are applying advanced analytics to big data to enrich and accelerate business outcomes. In this section, we'll review some of these examples, by industry. Please note that there is a lot of content here! While you're more than welcome to read all of it, it might help to start with the industries of most interest to you.

Advertising and marketing
Who? Organizations like global agencies, small ad tech companies and more
Goals: Apply advanced analytics to large volumes of consumer, clickstream and ad data to improve return on ad spend, inventory management and audience segmentation efforts

Automotive
Who? Automotive and automotive parts manufacturers
Goals: Apply advanced analytics to large volumes of driver, vehicle, supply chain and IoT data to improve manufacturing efficiency and create better and safer autonomous driving experiences

Financial services
Who? Retail and commercial banks, hedge funds, fintech innovators and more
Goals: Apply advanced analytics to large volumes of customer and transaction data to reduce risk, boost returns and improve customer satisfaction

Health and life sciences
Who? Large integrated healthcare systems, major pharmaceutical companies, diagnostic labs and more
Goals: Apply advanced analytics to large volumes of clinical and research data to accelerate R&D and improve patient outcomes

Media and entertainment
Who? Major publishers, streamers, gaming companies and more
Goals: Apply advanced analytics to large volumes of audience and content data to deepen audience engagement, reduce churn and optimize advertising revenues

Oil, gas and energy
Who? Upstream and downstream oil organizations, utility companies and more
Goals: Apply advanced analytics to large volumes of sensor, supply chain and customer data to improve exploration, reduce machinery downtime and optimize sales and supply chain operations

Retail
Who? Traditional brick-and-mortar companies, e-commerce companies
Goals: Apply advanced analytics to large volumes of customer, product and supply chain data to better attract customers, increase basket size and reduce costs

Telecom
Who? Global communication service providers, network and equipment providers and more
Goals: Apply advanced analytics to large volumes of customer and network data to improve network services and performance while reducing customer churn