Monday, February 6, 2023

Counting and filtering blank lines from input file in PySpark

pyspark.Accumulator

A shared variable that can be accumulated, i.e., has a commutative and associative "add" operation. Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to read its value, using the value attribute. Updates from the workers are propagated automatically to the driver program.

Accumulators

Accumulators provide a way of aggregating values on worker nodes and sending the final value back to the driver program.
They are used to count events that occur during job execution, e.g., for debugging purposes.
They can be used to implement counters similar to MapReduce counters.

Limitations:
If an executor fails for any reason, its tasks are re-executed from the beginning, so accumulator updates can be applied more than once and you may see an inconsistent (over-counted) value for a count or sum. For the same reason, updates made inside transformations (such as map) are not guaranteed to be applied exactly once; only updates made inside actions carry that guarantee.

Requirement: Count the empty lines in a data set distributed across multiple worker nodes.
Tags: Spark, Technology
