Thursday, February 2, 2023

Grouping Data using RDD, SQL DataFrame and PySpark.Pandas

Download Code and Data
Requirement: Compute revenue generated by all postpaid and prepaid service customers.

Schema details:
CustomerID, Mobile Number, Gender, SeniorCitizen Flag, Mode, Calls, SMS, Internet Service Status, MonthlyCharges (USD), CustomerChurn Flag

Before solving the requirement, let us understand the concept of Paired RDD.

In most analytical programs, data in the <Key, Value> format provides a convenient way to perform computations. An RDD holding data in this format is called a Paired RDD: every record contains exactly two fields, a key and a value.
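For instance, a tiny paired RDD built from hard-coded, purely illustrative values looks like this:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Each element is a (key, value) tuple, e.g. (Mode, MonthlyCharges)
pairs = sc.parallelize([("prepaid", 29.85), ("postpaid", 56.95), ("prepaid", 53.85)])
print(pairs.keys().collect())    # ['prepaid', 'postpaid', 'prepaid']
print(pairs.values().collect())  # [29.85, 56.95, 53.85]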

Solution:

Step 1: Create a paired RDD with the <key, value> structure. For this requirement, it is <Mode, MonthlyCharges>.

Step 2: Group the values by key and apply the sum function to each group.

Step 3: As an alternative, use the same paired RDD and apply the reduceByKey() transformation. (A sketch of Steps 1 and 2 follows below; reduceByKey() is covered further down.)
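Here is a minimal sketch of Steps 1 and 2, assuming the data sits in a comma-delimited text file named telecom_data.csv (the file name is an assumption) laid out per the schema above, so Mode is the 5th field and MonthlyCharges the 9th; if the file carries a header row, it would need to be filtered out first.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Step 1: build the <Mode, MonthlyCharges> paired RDD from the raw text file
paired_rdd = sc.textFile("telecom_data.csv") \
               .map(lambda line: line.split(",")) \
               .map(lambda cols: (cols[4], float(cols[8])))

# Step 2: group the charges by Mode, then add up each group
revenue_by_mode = paired_rdd.groupByKey() \
                            .map(lambda kv: (kv[0], sum(kv[1]))) \
                            .collect()
print(revenue_by_mode)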

pyspark.RDD.groupByKey

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.
Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
map()
A narrow transformation used to create a new RDD from the parent RDD with the required structure and fields
Allows fetching the required fields from the parent RDD
Used to create Paired RDDs

groupByKey()

A wide transformation used to group the values of the same key. It should be used only on Paired RDDs.
Results in a new RDD with the format <Key, List of values>
Note: In the solution above, the map() transformation is used to add up all the grouped values; sum() is the built-in Python function that performs the addition.

The groupByKey() operation performs a huge amount of data shuffling, leading to network congestion and performance issues. Let us solve the requirement with an alternate solution using reduceByKey().

reduceByKey()

A wide transformation that performs aggregation operations on the values of the same key. It should be used only on Paired RDDs.
Results in a new RDD with the format <Key, aggregated value>
Launches combiners to perform local aggregation before the final reduce, which drastically cuts the amount of data shuffled and improves performance
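As a sketch, the reduceByKey() alternative on the same paired_rdd built in the earlier snippet:

# Step 3: aggregate the charges per Mode with reduceByKey();
# the lambda is applied pairwise, first within each partition (combiner) and then across partitions
revenue_by_mode = paired_rdd.reduceByKey(lambda a, b: a + b).collect()
print(revenue_by_mode)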

pyspark.sql.DataFrame.groupBy

Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions. groupby() is an alias for groupBy().

Examples:
>>> df.groupBy().avg().collect()
[Row(avg(age)=3.5)]
>>> sorted(df.groupBy('name').agg({'age': 'mean'}).collect())
[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]
>>> sorted(df.groupBy(df.name).avg().collect())
[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]
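A sketch of the same requirement with the DataFrame API, assuming the file telecom_data.csv has a header row and is read with inferSchema=True so that MonthlyCharges comes in as a numeric type (the Note below covers what happens when it stays a string):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the delimited file into a DataFrame with column types inferred
df = spark.read.csv("telecom_data.csv", header=True, inferSchema=True)

# Group by service Mode and sum the monthly charges
df.groupBy("Mode").sum("MonthlyCharges").show()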

Reference

1. SQL Query - GroupBy
2. pyspark.sql.DataFrame.groupBy

Note

The "MonthlyCharges" column, as read from the text file, is of type String, on which we cannot perform aggregation.
>>> df.groupBy('Mode').sum('MonthlyCharges').collect()
AnalysisException: "MonthlyCharges" is not a numeric column. Aggregation function can only be applied on a numeric column.
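One way around this, as a sketch assuming the DataFrame df above, is to cast the column to a numeric type before grouping:

from pyspark.sql.functions import col, sum as sum_

# Cast the string column to double, then aggregate per Mode
df_cast = df.withColumn("MonthlyCharges", col("MonthlyCharges").cast("double"))
df_cast.groupBy("Mode").agg(sum_("MonthlyCharges").alias("TotalRevenue")).show()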
Tags: Spark, Technology