
Saturday, March 9, 2024

What is an RDD in PySpark?

RDD, which stands for Resilient Distributed Dataset, is a fundamental data structure in Apache Spark, a distributed computing framework for big data processing. RDDs are immutable, partitioned collections of objects that can be processed in parallel across a cluster of machines. The term "resilient" in RDD refers to the fault-tolerance feature, meaning that RDDs can recover lost data due to node failures.

Here are some key characteristics and properties of RDDs in PySpark:

# Immutable: Once created, RDDs cannot be modified. However, you can transform them into new RDDs by applying various operations.

# Distributed: RDDs are distributed across multiple nodes in a cluster, allowing for parallel processing.

# Partitioned: RDDs are divided into partitions, which are the basic units of parallelism. Each partition can be processed independently on different nodes.

# Lazy Evaluation: Transformations on RDDs are lazily evaluated, meaning that the execution is deferred until an action is triggered. This helps optimize the execution plan and avoid unnecessary computations.

# Fault-Tolerant: RDDs track the lineage information to recover lost data in case of node failures. This is achieved through the ability to recompute lost partitions based on the transformations applied to the original data.

In PySpark, you can create RDDs from existing data in memory or by loading data from external sources such as HDFS, HBase, or other storage systems. Once created, you can perform various transformations (e.g., map, filter, reduce) and actions (e.g., count, collect, save) on RDDs.
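A minimal sketch of that workflow, assuming an existing SparkContext named sc (as in the PySpark shell) and a placeholder file path:

# Create an RDD from an in-memory collection
numbersRDD = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD from an external file (the path is a placeholder)
linesRDD = sc.textFile("/HDFSPATH/sample.txt")

# Transformations are lazy: nothing executes yet
squaredRDD = numbersRDD.map(lambda x: x * x)
evenRDD = squaredRDD.filter(lambda x: x % 2 == 0)

# Actions trigger execution
print(evenRDD.collect())   # [4, 16]
print(linesRDD.count())    # number of lines in the file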

However, it's worth noting that while RDDs were the primary abstraction in earlier versions of Spark, newer versions have introduced higher-level abstractions like DataFrames and Datasets, which provide a more structured and optimized API for data manipulation and analysis. These abstractions are built on top of RDDs and offer better performance and ease of use in many scenarios. 
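For comparison, a short sketch of the same idea with a DataFrame (assuming a SparkSession named spark is available; the column names and rows are made up for illustration):

# Build a DataFrame from an in-memory list of rows
df = spark.createDataFrame(
    [("pen", 10.0), ("notebook", 45.0), ("bag", 550.0)],
    ["ProductName", "Price"])

# Declarative, Catalyst-optimized operations instead of hand-written lambdas
df.filter(df.Price > 100).select("ProductName", "Price").show()

# A DataFrame can still be viewed as an RDD of Row objects when needed
print(df.rdd.take(1))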

5 Questions on PySpark Technology

Q1 of 5 

Which of the below Spark Core API calls is used to load the retail.csv file and create an RDD? 

retailRDD = sc.readFile("/HDFSPATH/retail.csv") 

retailRDD = sc.parallelize("/HDFSPATH/retail.csv") 

retailRDD = sc.textFile("/HDFSPATH/retail.csv") *** 

retailRDD = sc.createFile("/HDFSPATH/retail.csv") 

Q2 of 5 

Shane works on a data analytics project and needs to process user event data (the UserLogs.csv file). Which of the below code snippets can be used to split the fields with a comma as the delimiter and fetch only the first two fields? 

logsRDD = sc.textFile("/HDFSPATH/UserLogs.csv"); 
FieldsRDD = logsRDD.map(lambda r : r.split(",")).map(lambda r: (r[0],r[1])) *** 

logsRDD = sc.parallelize("/HDFSPATH/UserLogs.csv"); 
FieldsRDD = logsRDD.map(lambda r : r.split(",")).map(lambda r: (r[0],r[1])) 

logsRDD = sc.parallelize("/HDFSPATH/UserLogs.csv"); 
FieldsRDD = logsRDD.filter(lambda r : r.split(",")).map(lambda r: (r[0],r[1])) 

logsRDD = sc.textFile("/HDFSPATH/UserLogs.csv"); 
FieldsRDD = logsRDD.filter(lambda r : r.split(",")).map(lambda r: (r[0],r[1])) 

Q3 of 5

Consider a retail scenario where a paired RDD exists with data (ProductName, Price). The Price value must be reduced by 500 as a customer discount. Which paired RDD function in Spark can be used for this requirement? 

mapValues() *** 

keys() 

values() 

map() 

--- mapValues applies the function logic to the value part of the paired RDD without changing the key 
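
A small sketch of that discount scenario, with hypothetical product data, assuming an existing SparkContext sc:

# Paired RDD of (ProductName, Price); the sample values are made up
productsRDD = sc.parallelize([("TV", 45000), ("Fridge", 32000), ("Mixer", 4500)])

# mapValues applies the discount to the value only; keys stay unchanged
discountedRDD = productsRDD.mapValues(lambda price: price - 500)

print(discountedRDD.collect())   # [('TV', 44500), ('Fridge', 31500), ('Mixer', 4000)]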

Q4 of 5 
Consider a banking scenario where credit card transaction logs need to be processed. The log contains CustomerID, CustomerName, CreditCardNumber, and TransactionAmount fields. Which code snippet below creates a paired RDD? 

logsRDD = sc.textFile("/HDFSPath/Logs.txt"); 

logsRDD = sc.textFile("/HDFSPath/Logs.txt"); 
LogsPairedRDD = logsRDD.map(lambda r : r.split(",")).map(lambda r: (r[0],int(r[3]))) *** 

logsRDD = sc.textFile("/HDFSPath/Logs.txt"); 
LogsPairedRDD = logsRDD.map(lambda r : r.split(",")).map(lambda r: (r[0],int(r[2]))) 

logsRDD = sc.textFile("/HDFSPath/Logs.txt").map(lambda r: (r[0],int(r[3]))) 
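
Building on the correct snippet above, a hedged sketch of a typical next step with such a paired RDD, for example totalling TransactionAmount per customer (field positions follow the schema in the question; the file path is a placeholder):

logsRDD = sc.textFile("/HDFSPath/Logs.txt")

# (CustomerID, TransactionAmount) pairs
LogsPairedRDD = logsRDD.map(lambda r: r.split(",")).map(lambda r: (r[0], int(r[3])))

# Aggregate transaction amounts per customer
totalsPerCustomer = LogsPairedRDD.reduceByKey(lambda a, b: a + b)
print(totalsPerCustomer.take(5))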

Q5 of 5 

Consider a Spark scenario where an array must be used as a broadcast variable. Which of the below code snippets is used to access the broadcast variable's value? 

bv = sc.broadcast(Array(100,200,300)) 
bv.getValue 

bv = sc.broadcast(Array(100,200,300)) 
bv.value *** 

bv = sc.broadcast(Array(100,200,300)) 
bv.find 

bv = sc.broadcast(Array(100,200,300)) 
bv.fetchValue 
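
The correct choice is bv.value. The same idea in PySpark syntax (a Python list rather than a Scala Array); a minimal sketch assuming an existing SparkContext sc:

# Broadcast a read-only list to all executors
bv = sc.broadcast([100, 200, 300])

# .value returns the broadcast content on the driver or inside tasks
print(bv.value)               # [100, 200, 300]

# Example use inside a transformation
numbersRDD = sc.parallelize([100, 250, 300])
matchedRDD = numbersRDD.filter(lambda n: n in bv.value)
print(matchedRDD.collect())   # [100, 300]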

Spark Core Challenges

Business Scenario

Arisconn Cars provides a rental car service across the globe. To improve its customer service, the client wants to periodically analyze each car's sensor data to repair faults and problems in the car. Sensor data from cars is streamed through an events hub (a data ingestion tool) into Hadoop's HDFS (distributed file system) and analyzed using Spark Core programming to find the cars generating the maximum number of errors. This analysis would help Arisconn send the service team to repair cars even before they fail.

Below is the schema of the Arisconn Cars big dataset, which holds approximately 10 million records:

[sensorID, carID, latitude, longitude, engine_speed, accelerator_pedal_position, vehicle_speed, torque_at_transmission, fuel_level, typeOfMessage, timestamp]

typeOfMessage: INFO, WARN, ERR, DEBUG

Arisconn has the below set of requirements to be performed against the dataset:

1. Filter fields - Sensor id, Car id, Latitude, Longitude, Vehicle Speed, TypeOfMessage
2. Filter valid records, i.e., discard records containing '?'
3. Filter records holding only error messages (ignore warning and info messages)
4. Apply aggregation to count the number of error messages produced by cars

Below is the Python code to implement the first three requirements.

#Loading a text file into an RDD
Car_Info = sc.textFile("/HDFSPath/ArisconnDataset.txt")

#Referring to the header of the file
header = Car_Info.first()

#Removing the header, splitting records with ',' as the delimiter and fetching the relevant fields
#(sensorID, carID, latitude, longitude, vehicle_speed, typeOfMessage)
Car_temp = Car_Info.filter(lambda record: record != header).map(lambda r: r.split(",")).map(lambda c: (c[0], c[1], float(c[2]), float(c[3]), int(c[6]), c[9]))

#Filtering only valid records (records whose sensorID does not start with '?'); f[0] refers to sensorID
Car_Eng_Specs = Car_temp.filter(lambda f: not str(f[0]).startswith("?"))

#Filtering records holding only error messages; f[5] refers to typeOfMessage
Car_Error_logs = Car_Eng_Specs.filter(lambda f: str(f[5]).startswith("ERR"))

In the above code:

1. Arisconn's dataset is loaded into an RDD (Car_Info).
2. The header of the dataset is removed and only the fields (sensorID, carID, latitude, longitude, vehicle_speed, typeOfMessage) are fetched. Refer to RDD Car_temp.
3. Records starting with '?' are removed. Refer to RDD Car_Eng_Specs.
4. Records containing typeOfMessage = "ERR" are filtered into Car_Error_logs.

There are a few challenges in the above code, and even the fourth requirement is too complex to implement in Spark Core. We shall discuss this next.
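
For reference, a minimal sketch of how the fourth requirement (counting error messages per car) could be attempted in Spark Core, assuming the Car_Error_logs RDD built above, where f[1] is carID; as noted, this is exactly the kind of aggregation that becomes cumbersome in plain RDD code compared to Spark SQL:

# Map each error record to (carID, 1)
Car_Error_Pairs = Car_Error_logs.map(lambda f: (f[1], 1))

# Count error messages per car
Error_Count_Per_Car = Car_Error_Pairs.reduceByKey(lambda a, b: a + b)

# Cars producing the most errors first
Top_Error_Cars = Error_Count_Per_Car.sortBy(lambda kv: kv[1], ascending=False)
print(Top_Error_Cars.take(10))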

Sunday, February 12, 2023

PySpark Books (2023 Feb)

Download Books
1.
Learning PySpark
Tomasz Drabas, Denny Lee
Packt Publishing Ltd, 27-Feb-2017

2.
Data Analysis with Python and PySpark
Jonathan Rioux
Simon and Schuster, 12-Apr-2022 

3.
PySpark Cookbook: Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python
Denny Lee, Tomasz Drabas
Packt Publishing Ltd, 29-Jun-2018

4.
Machine Learning with PySpark: With Natural Language Processing and Recommender Systems
Pramod Singh
Apress, 14-Dec-2018

5.
Learning Spark: Lightning-Fast Big Data Analysis
Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
"O'Reilly Media, Inc.", 28-Jan-2015 

6.
Advanced Analytics with PySpark
Akash Tandon, Sandy Ryza, Sean Owen, Uri Laserson, Josh Wills
"O'Reilly Media, Inc.", 14-Jun-2022

7.
PySpark Recipes: A Problem-Solution Approach with PySpark2
Raju Kumar Mishra
Apress, 09-Dec-2017

8.
Learn PySpark: Build Python-based Machine Learning and Deep Learning Models
Pramod Singh
Apress, 06-Sep-2019

9.
Learning Spark
Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee
"O'Reilly Media, Inc.", 16-Jul-2020

10.
Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
"O'Reilly Media, Inc.", 12-Jun-2017

11.
Applied Data Science Using PySpark: Learn the End-to-End Predictive Model-Building Cycle
Ramcharan Kakarla, Sundar Krishnan, Sridhar Alla
Apress, 2021

12.
Essential PySpark for Scalable Data Analytics: A beginner's guide to harnessing the power and ease of PySpark 3
Sreeram Nudurupati
Packt Publishing Ltd, 29-Oct-2021

13.
Spark: The Definitive Guide: Big Data Processing Made Simple
Bill Chambers, Matei Zaharia
"O'Reilly Media, Inc.", 08-Feb-2018

14.
Spark for Python Developers
Amit Nandi
Packt Publishing, 24-Dec-2015

15.
Frank Kane's Taming Big Data with Apache Spark and Python
Frank Kane
Packt Publishing Ltd, 30-Jun-2017

16.
Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming
Gerard Maas, Francois Garillot
"O'Reilly Media, Inc.", 05-Jun-2019

17.
Data Analytics with Spark Using Python
Jeffrey Aven
Addison-Wesley Professional, 18-Jun-2018

18.
Graph Algorithms: Practical Examples in Apache Spark and Neo4j
Mark Needham, Amy E. Hodler
"O'Reilly Media, Inc.", 16-May-2019 

19.
Spark in Action: Covers Apache Spark 3 with Examples in Java, Python, and Scala
Jean-Georges Perrin
Simon and Schuster, 12-May-2020

20.
Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling
Javier Luraschi, Kevin Kuo, Edgar Ruiz
"O'Reilly Media, Inc.", 07-Oct-2019

21.
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
Holden Karau, Rachel Warren
"O'Reilly Media, Inc.", 25-May-2017

22.
Apache Spark in 24 Hours, Sams Teach Yourself
Jeffrey Aven
Sams Publishing, 31-Aug-2016
Tags: List of Books,Spark,

Wednesday, February 8, 2023

Spark SQL in Images

1. Spark's components

2. Spark SQL Architecture

3. SQL Data Types

4. Spark's context objects

5. File Formats Supported By Spark

6. SQL Workflow

7. Catalyst Optimizer

Below steps explain the workflow of the Catalyst optimizer:
1. Analyzing a logical plan with the metadata
2. Optimizing the logical plan
3. Creating multiple physical plans
4. Analyzing the plans and finding the most optimal physical plan
5. Converting the selected physical plan to RDDs
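
These stages can be inspected from PySpark itself; a minimal sketch (assuming a SparkSession named spark and a throwaway DataFrame) that prints the parsed, analyzed, and optimized logical plans along with the physical plan:

# A small DataFrame just to have something to query
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])

# extended=True prints the logical plans and the selected physical plan
df.filter(df.id > 1).select("label").explain(extended=True)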
Tags: Spark,Technology,

A Solved Exercise in RDD Filter and Join Operations (Interview Preparation)

Download Code and Data
Problem Statement:

Consider the Universal Identity Number (UIN) data scenario with two datasets: UIN customer data and bank account linking data.

UIN Card data (UINCardData.csv):
Schema Details: UIN, MobileNumber, Gender, SeniorCitizens, Income

Bank account link data (BankAccountLink.csv):
Schema Details: MobileNumber, LinkedtoBankAccount, BankAccountNumber

Requirement

Join both datasets and find the UIN numbers that are not linked to a bank account. Print the UIN number and BankAccountNumber.
Save the final output to a specified HDFS directory.
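
A hedged sketch of one possible Spark Core solution, assuming MobileNumber is the join key, that field positions follow the stated schemas, and that LinkedtoBankAccount holds the literal string "NO" for unlinked records (that last detail is an assumption, not stated in the problem):

# (MobileNumber, UIN) pairs from the UIN card data
uinRDD = sc.textFile("/HDFSPath/UINCardData.csv").map(lambda r: r.split(",")).map(lambda f: (f[1], f[0]))

# (MobileNumber, (LinkedtoBankAccount, BankAccountNumber)) pairs from the bank link data
bankRDD = sc.textFile("/HDFSPath/BankAccountLink.csv").map(lambda r: r.split(",")).map(lambda f: (f[0], (f[1], f[2])))

# Join on MobileNumber and keep records not linked to a bank account
joinedRDD = uinRDD.join(bankRDD)
notLinkedRDD = joinedRDD.filter(lambda kv: kv[1][1][0].upper() == "NO")

# (UIN, BankAccountNumber) output, saved to an HDFS directory (path is a placeholder)
resultRDD = notLinkedRDD.map(lambda kv: (kv[1][0], kv[1][1][1]))
resultRDD.saveAsTextFile("/HDFSPath/UIN_Not_Linked_Output")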