Learn foundational concepts about the big data landscape. This course was created for individuals who are new to the big data landscape and want to become conversant with big data terminology. It covers foundational concepts related to the big data landscape, including the characteristics of big data, the relationship between big data, artificial intelligence and data science, how individuals on data science teams work with big data, and how organizations can use big data to enable better business decisions.

Note: This course will not cover Databricks concepts or functionality. This is an introductory-level course focused on big data concepts.

Learning objectives
- Explain foundational concepts used to define big data.
- Explain how the characteristics of big data have changed traditional organizational workflows for working with data.
- Summarize how individuals on data science teams work with big data on a daily basis to drive business outcomes.
- Articulate examples of real-world use cases for big data in businesses across a variety of industries.

-----

Lesson 1: Technology and the explosion of data
Sources of data

Human-generated – What is it?
Data that humans create and share.

What are some examples of human-generated data?
- Social media posts*
- Emails
- Spreadsheets
- Presentations
- Audio files
- Video files

* Social media has been a leading force in the propagation of human-generated data. Just think - every time we post a message, change our online status, upload images, or like and forward comments, we are generating data. Let's look at Facebook, for example. According to Forbes, 1.5 billion people are active on Facebook every day, 510,000 comments are posted every minute, and five new profiles are created every second.

~ ~ ~

Machine-generated – What is it?
Data generated by machines without active human intervention.

What are some examples of machine-generated data sources?
- Sensors on vehicles, appliances and industrial machinery
- Security cameras
- Satellites
- Medical devices
- Personal tools such as smartphone apps or fitness trackers

What does it mean for data to be generated without active human intervention? Think of a fitness tracker. Depending on the model you have, it might generate records for your heart rate, your geographic location, the calories you burn and more. You don't tell your fitness tracker to track these things - it comes programmed to do it and does it on its own.

~ ~ ~

Organization-generated – What is it?
Data generated as organizations run their businesses.

What are some examples of organization-generated data?
Records generated every time you make a purchase at an online or physical store - things like unique customer numbers, the items you purchased, the date and time you purchased them and how many of each item you purchased. Organization-generated data is often referred to as transactional data. You'll hear this term frequently in the world of big data.

Lesson 2: What makes big data “big”?
The major characteristics used to define big data are volume, variety and velocity.

VOLUME: Volume refers to the vast amount of data being generated every second of every day. The International Data Corporation (IDC) forecasts that the amount of data that exists in the world will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. Just to put that into perspective, the computer being used to create this course has 256GB of storage. That's equivalent to just 0.000000000256 (9 zeros) zettabytes - see the quick calculation at the end of this lesson. Organizations working with big data find ways to process, store and analyze data coming in at volumes that surpass what traditional methods can handle.

VELOCITY: The second characteristic that defines big data is velocity, which refers to the speed at which new data is generated and the speed at which data moves around. A good example of data velocity is a social media post going viral in seconds. Another example is the speed at which credit card transactions are checked for fraudulent activity. Have you ever tried to purchase something you don't normally purchase and had that transaction declined? In just a matter of seconds, your credit card company received information about your purchase, compared it to the purchases you usually make and decided whether or not to flag it as fraudulent. Organizations working with big data find ways to work with data that is generated and moves around this quickly (or even faster!).

VARIETY: Finally, variety is also used to define big data. Data variety refers to the many different types of data that exist today - social media posts, credit card transactions, legal contracts, biometric data and geographic information, just to name a few. Organizations working with big data find ways to use different types of data together - for example, an organization might want to extract insights from a combination of social media posts, customer transaction records and real-time product usage.

-----

At times you will also hear about the 5 V's of big data:

V4: Veracity
Data veracity, in general, is how accurate or truthful a data set may be. In the context of big data, however, it takes on a bit more meaning. When it comes to the accuracy of big data, it's not just the quality of the data itself but how trustworthy the data source, the data type and the processing of it are.

V5: Value
This refers to the ability to transform a tsunami of data into business value.
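Here is the quick calculation behind the storage comparison in the VOLUME section - a minimal sketch in Python, assuming decimal units (1 zettabyte = 10^21 bytes = 10^12 gigabytes):

```python
# Quick sanity check of the "256GB vs. zettabytes" comparison above.
gb_per_zettabyte = 1e12                        # 1 ZB = 10^12 GB (decimal units)
laptop_storage_gb = 256
print(laptop_storage_gb / gb_per_zettabyte)    # -> 2.56e-10, i.e. 0.000000000256 ZB
```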
Lesson 3: Types of big data

Three types of big data are: structured, unstructured and semi-structured.

Structured
The term structured data refers to any data that conforms to a certain format or schema. A popular example of structured data is a spreadsheet. In a spreadsheet, there are usually clearly labeled rows and columns, and the information within those rows and columns follows a certain format. In the spreadsheet below, for example, we see that months are written as three-letter abbreviations, customer IDs are five-digit numerical values and colors are formatted as "Name|Name". Because structured data is clearly organized, it's generally easier to analyze. For example, if I asked you to look at the spreadsheet below and tell me how much money we made for all of these orders, you could easily tell me, because the prices listed here are numeric and can be summed up. A lot of the data that organizations work with every day can be categorized as structured data.

Unstructured
By contrast, unstructured data is often referred to as "messy" data, because it isn't easily searched compared to structured data. For example, imagine that instead of providing you with a spreadsheet of sales, I ask you to review camera footage of customers buying products and tell me how much money was made. That task would be much harder than using our spreadsheet. Unstructured data is the most widespread type of data - the IDC reports that almost 90% of data today is unstructured. Many organizations struggle to make sense of unstructured data, especially when trying to use it for business insights. That's where different fields of artificial intelligence become an important part of the data analysis process. Aside from videos, other examples of unstructured data include:
- Social media posts
- Photographs
- Emails
- Audio files, and
- Images

Semi-structured
Finally, we have semi-structured data. Semi-structured data fits somewhere in between structured and unstructured data. It does not reside in a formatted table, but it does have some level of organization. A good example of semi-structured data is HTML code. If you've ever right-clicked in your browser and selected "inspect" or "inspect element", you've seen an example of this. Although you are not restricted in how much or what kind of information you collect, there is still a defined way to express that data.
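To make the difference more concrete, here is a minimal Python sketch contrasting structured and semi-structured data; the field names and records are made-up examples.

```python
import json

# Structured: every record follows the same schema, like rows in a spreadsheet.
orders = [
    {"month": "Jan", "customer_id": 10342, "color": "Navy|Blue", "price": 19.99},
    {"month": "Feb", "customer_id": 20871, "color": "Fire|Red", "price": 24.50},
]
# Easy to analyze: just sum the price column.
print(sum(order["price"] for order in orders))

# Semi-structured: there is organization (keys and nesting), but records do not
# have to share one fixed schema - one profile has a loyalty tier, the other doesn't.
profiles = json.loads("""
[
  {"name": "Ana",  "purchases": ["mug", "poster"], "loyalty_tier": "gold"},
  {"name": "Liam", "purchases": ["sticker"]}
]
""")
print([p.get("loyalty_tier", "none") for p in profiles])
```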
Knowledge Check: Three Questions

Lesson 4: An introduction to distributed computing
The de facto standard tool for distributed computing is Apache Spark. A very high-level, basic way to understand Apache Spark's architecture is the example of five friends counting a tub of candies together. Here:

Job: count the candies in parallel
Driver: myself (who collects the results from the other friends and does the reporting)
Executors: the other four friends (who do the counting)
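As a minimal sketch of the same idea in code (assuming a local PySpark installation; the "tub of candies" is made up), the driver below splits the work into four slices, the executors count their slices in parallel, and the driver adds up the results:

```python
from pyspark.sql import SparkSession

# The driver: builds the session, hands work to executors and gathers results.
spark = SparkSession.builder.appName("candy-count").getOrCreate()

# A made-up tub of candies; in a real job this data would live in distributed storage.
candies = spark.sparkContext.parallelize(
    ["red", "green", "blue", "red", "yellow"] * 1_000,
    numSlices=4,  # four slices of work, like the four counting friends
)

# Each executor counts its own slice; the driver combines the partial counts.
print(candies.count())   # -> 5000

spark.stop()
```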
Lesson 5: Batch vs. streaming processing

Batch data – What is it?
Batch data is data that we have in storage and that we process all at once, or in a batch. Say, for example, that someone gives you a large jar of candies and asks you to count all of the candies in the jar. That is a simple example of a batch job - we take candies that were already present in some form of storage (in this case, a jar) and count them. Since we count all of the candies one time, this is considered batch processing.

Example of batch processing
A real-world example of batch processing is how telecommunication companies process cellular phone usage each month to generate our monthly phone bills. To do this, they process batch data - the phone calls you've made, the text messages you've sent and any additional charges you've incurred through that billing cycle - to generate your bill. They process that batch data in a batch job.

Streaming data – What is it?
On the other hand, we have streaming data. Streaming data is data that is being continually produced by one or more sources and therefore must be processed incrementally as it arrives. Now, what if instead of counting candy sitting in a jar, we are asked to count candy coming toward us on a conveyor belt? As the candy reaches us, we have to count the new pieces and constantly update our overall candy count. In a streaming job, our final count changes in real time as more and more candy arrives on the conveyor belt.

Example of stream processing
A real-world example of stream processing is how heart monitors work. All day long, as you wear your heart monitor, it receives new data - tens of thousands of data points per day as your heart beats. Every time your heart beats, your heart monitor adds new data to its data store in real time. If your heart monitor displays your average heart rate for the day, that average must be constantly updated with the new numbers from the incoming stream of data.

- - -

Both batch and streaming data have their place when it comes to big data analytics. Batch data is used for things like periodic reporting, and streaming data for things like fraud detection, which needs to happen in real time. Historically, it has been difficult to use these two kinds of processing in conjunction. Thanks to new advances in technology, however, combining batch and stream processing is possible, and it leads to significant advantages in big data analytics. A minimal code sketch of both modes appears at the end of this lesson.

At this point, we've reviewed how we process big data and have explored the types of input data we have to work with. Next, we'll discuss another topic in data management - where to store big data.
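Here is that sketch - a minimal, illustrative PySpark example, assuming a local Spark installation. The input file name and column names are hypothetical, and the built-in "rate" source is used only to generate a demo stream. The batch query processes a finite dataset once; the streaming query keeps updating its result as new rows arrive.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

# Batch: read a finite dataset that already sits in storage and process it once.
calls = spark.read.json("monthly_call_records.json")          # hypothetical input file
bills = calls.groupBy("customer_id").agg(F.sum("charge").alias("total_charge"))
bills.show()

# Streaming: read an unbounded source and keep the aggregate continuously updated.
beats = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
running_avg = beats.agg(F.avg("value").alias("running_average"))

query = (running_avg.writeStream
         .outputMode("complete")      # re-emit the full, updated aggregate each trigger
         .format("console")
         .start())
query.awaitTermination(timeout=30)    # let the demo run for about 30 seconds
query.stop()
spark.stop()
```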
Lesson 6: Data storage systems

Today, most organizations are storing their big data in one or a combination of the following storage systems: data warehouses, data lakes and unified data platforms.

Data warehouses
Data warehouse technology emerged in the 1980s and provides a centralized repository for storing all of an organization's data. Data warehouses can be on-premises or in the cloud.

Benefits of data warehouses
They've been around for decades, work well for structured data and are reliable. Since they generally only take structured data, the data is typically clean and easy to query.

Challenges with data warehouses
They can be hard and expensive to scale (if you need more space, for example). You lose a lot of valuable potential by not taking advantage of unstructured data. You often have to deal with vendor lock-in, which occurs when your data is stored in a system that does not belong to you. Data warehouses are also very expensive to build, license and maintain, especially for large data volumes, even with the availability of cloud storage.

Lesson 7: Knowledge Check
Lesson 8: Techniques for working with big data
Artificial intelligence – What is it?
Artificial intelligence (AI) is a branch of computer science in which computer systems are developed to perform tasks that would typically require human intelligence. AI is a broad field, and it encapsulates many techniques under its umbrella.

Example
To contextualize AI, let's look at a classic example - the Turing test. In a Turing test:
1. A human evaluator asks a series of text-based questions to a machine and a human, without being able to see either.
2. The human and the machine answer the questions.
3. If the evaluator cannot differentiate between the human and machine responses, the computer passes the Turing test. This means that it exhibited human-like behavior - or, artificial intelligence!

~ ~ ~

Machine learning – What is it?
Machine learning (ML) is a subset of artificial intelligence that works very well with structured data. The goal behind machine learning is for machines to learn patterns in your data without you explicitly programming them to do so. There are a few types of machine learning; the most commonly used type is called supervised machine learning.

Example
Supervised machine learning is commonly used to detect fraud. At a high level, it works like this (a minimal code sketch appears later in this lesson):
1. A human being specifies what constitutes fraud (for example, a bank account with more than 20 transactions a month or an average balance of less than $100) and labels example data as either "fraud" or "not fraud".
2. That labeled data is passed through an algorithm, and the machine learns what fraudulent data looks like.
3. The machine uses what it learned to predict whether new data is fraudulent.
4. A human manually investigates and verifies the model's predictions whenever the model predicts "fraud".

~ ~ ~

Deep learning – What is it?
Deep learning (DL) is a subset of machine learning that uses neural networks - sets of algorithms modeled on the structure of the human brain. They are much more complex than most machine learning models and require significantly more time and effort to build. Unlike machine learning, whose performance tends to plateau after a certain amount of data, deep learning continues to improve as the data size increases. It performs well on complex datasets like images, sequences and natural language.

Example
Deep learning is often used to classify images. For example, say that you want to build a model to classify whether an image contains a koala. You would feed hundreds, thousands or millions of pictures into a machine - some showing koalas and others not. Over time, the model learns what a koala is and what it isn't, and it can identify a koala more easily and quickly. It's important to note that while humans might recognize koalas by their fluffy ears or large oval-shaped noses, a machine will detect things that we cannot - things like patterns in the koala's fur or the exact shape of its eyes - and it is able to make decisions quickly based on that information.

~ ~ ~

Data science – What is it?
Data science is a field that combines tools and workflows from disciplines like math and statistics, computer science and business to process, manage and analyze data. Data science is very popular in businesses today as a way to extract insights from big data to help inform business decisions.

Example
You have already seen a couple of examples of data science techniques! Machine learning and deep learning are common tools (among many others) in a data scientist's toolbox to help extract insights from data.
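Here is the promised sketch of supervised machine learning for fraud detection - a minimal, illustrative example in Python using scikit-learn. The feature names, values and labels are made up for demonstration; a real fraud model would use far more data and features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Each row describes an account: [transactions_per_month, average_balance].
X = np.array([[25, 80], [3, 2500], [40, 50], [5, 1800],
              [30, 60], [2, 3200], [28, 90], [4, 2100]])
# Labels supplied by humans: 1 = fraud, 0 = not fraud.
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The model learns patterns that separate fraudulent accounts from legitimate ones.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Accounts predicted as fraud (1) would then be reviewed by a human investigator.
print(model.predict(X_test))
```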
~ ~ ~

While this was not an exhaustive list, these are some of the most popular techniques for working with big data. An important idea to note here is that one of the benefits of using these techniques, particularly machine learning and deep learning, is that they help scale analytics. As you can imagine, once machines learn how to detect patterns in our data, they are able to make predictions much faster than humans can. As we discussed, all of these techniques are used by data science practitioners to help extract insights from big data. They use these techniques as part of a data science workflow - a series of steps they follow to process, manage and analyze data.

Lesson 9: The data science workflow
The data science workflow is a series of steps that data practitioners follow to work with big data. It is a cyclical process that often starts with identifying business problems and ends with delivering business value.

Lesson 10: Roles on a data science team
Platform administrators – What do they do?
Platform administrators can also be called DevOps engineers, infrastructure engineers or cloud engineers. They are responsible for managing and supporting big data infrastructure. Some of their tasks include:
- setting up big data infrastructure
- performing updates and maintenance work
- performing health checks
- keeping track of how team members are using the platform (by setting up and monitoring alerts, for example)
- implementing best practices for managing data
Data engineers – What tools do they use?
- Programming languages like Python and Scala
- Different data storage solutions
- Data processing engines like Apache Spark
Data analysts – What tools do they use?
- SQL programming language
- Visualization tools like Tableau, PowerBI, Looker and others.
Data scientists – What tools do they use?
- Programming languages like Python, R and SQL
- Machine learning libraries
- Notebook interfaces like Jupyter