Tuesday, February 7, 2023

Broadcasting variables to nodes to reduce data shuffling from RDD joins

Broadcast variables

A broadcast variable sends a large dataset to every worker node once, so that tasks running there can use it as a lookup.
The value is cached on each machine rather than shipped as a copy with every task.
Broadcast variables are read-only.
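
As a minimal sketch of the points above, the PySpark snippet below broadcasts a small dictionary and reads it inside a task through `.value`. The country codes, the `country_name` helper, and the `run_with_spark` wrapper are hypothetical names chosen for illustration, and the sketch assumes a working PySpark installation.

```python
def country_name(code, lookup):
    """Return the country name for a code, or "UNKNOWN" if it is not in the lookup."""
    return lookup.get(code, "UNKNOWN")


def run_with_spark():
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Hypothetical lookup data; it is sent to each executor once and cached there.
    countries = sc.broadcast({"IN": "India", "US": "United States"})

    codes = sc.parallelize(["IN", "US", "FR"])

    # Tasks read the cached copy through .value; the broadcast value is read-only.
    return codes.map(lambda c: country_name(c, countries.value)).collect()
```

Calling `run_with_spark()` on a cluster or in local mode returns one name per input code, with "UNKNOWN" for codes missing from the lookup.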

Requirement: Telecom customer data is distributed across 4 slave nodes in HDFS. As part of this data processing requirement, the international roaming dataset needs to be used as a lookup. The initial solution created two RDDs, one per dataset, and joined them.

This led to performance issues due to a massive amount of data shuffling.

Solution: Broadcast the roaming dataset to all worker nodes (where the job is executed) and use it as a lookup, so the customer data no longer has to be shuffled for a join.
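
The solution can be sketched roughly as follows. This is not the code from the download, only an illustration under stated assumptions: customer records are assumed to be `(customer_id, country_code, minutes)` tuples, the roaming dataset is assumed to be a pair RDD of `(country_code, rate)` small enough to fit in driver and executor memory, and `enrich` and `enrich_with_broadcast` are hypothetical names.

```python
def enrich(record, roaming_rates):
    """Attach the roaming rate for the record's country code (None if absent)."""
    customer_id, country_code, minutes = record
    return (customer_id, country_code, minutes, roaming_rates.get(country_code))


def enrich_with_broadcast(sc, customers_rdd, roaming_rdd):
    """Map-side lookup that replaces an RDD join between the two datasets.

    Assumed shapes (not taken from the original data):
      customers_rdd: RDD of (customer_id, country_code, minutes)
      roaming_rdd:   pair RDD of (country_code, rate)
    """
    # Collect the small roaming dataset to the driver as a dict,
    # then broadcast it so each executor caches one copy.
    roaming_bc = sc.broadcast(roaming_rdd.collectAsMap())

    # Each task looks up its records locally; the customer data is not shuffled.
    return customers_rdd.map(lambda rec: enrich(rec, roaming_bc.value))
```

Unlike an inner join, this sketch keeps customer records whose country code is missing from the roaming dataset and gives them a rate of `None`; adding a `filter` after the `map` would reproduce inner-join behaviour.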