pyspark.RDD.reduceByKey()
Merge the values for each key using an associative and commutative reduce function. The associative property says that the way operands are grouped does not change the result, e.g. (a + b) + c = a + (b + c). The commutative property says that the order of the operands does not change the result, e.g. a + b = b + a. These properties matter because Spark may combine a key's values in any order and grouping across partitions. spark.apache.org/docs
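Because Spark is free to combine values in any order and grouping, only a function with both properties gives a deterministic result. A minimal pure-Python illustration (no Spark required) — addition qualifies, subtraction does not:

```python
from functools import reduce

values = [1, 2, 3, 4]

# Associative: grouping does not matter for addition.
assert (1 + 2) + 3 == 1 + (2 + 3)

# Commutative: order does not matter, so reducing the list
# forwards or backwards gives the same sum.
total_fwd = reduce(lambda a, b: a + b, values)
total_rev = reduce(lambda a, b: a + b, list(reversed(values)))
assert total_fwd == total_rev  # both are 10

# Subtraction is neither associative nor commutative, so a
# distributed reduce with it would give partition-dependent results.
assert (1 - 2) - 3 != 1 - (2 - 3)
```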
In [32]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
In [33]:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())
Out[33]:
[('a', 2), ('b', 1)]
One More Basic Example Before The Word Count Program¶
In [34]:
data = [('Project', 1),
('Gutenberg’s', 1),
('Alice’s', 1),
('Adventures', 1),
('in', 1),
('Wonderland', 1),
('Project', 1),
('Gutenberg’s', 1),
('Adventures', 1),
('in', 1),
('Wonderland', 1),
('Project', 1),
('Gutenberg’s', 1)]
rdd = sc.parallelize(data)
rdd2 = rdd.reduceByKey(lambda a,b: a+b)
for element in rdd2.collect():
    print(element)
('Gutenberg’s', 3)
('Alice’s', 1)
('in', 2)
('Adventures', 2)
('Wonderland', 2)
('Project', 3)
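Conceptually, reduceByKey is a per-key fold: for each key, the reduce function is applied pairwise across that key's values. A pure-Python sketch of the same aggregation over the same data (a plain dict plays the role of the shuffled key space):

```python
data = [('Project', 1), ('Gutenberg’s', 1), ('Alice’s', 1),
        ('Adventures', 1), ('in', 1), ('Wonderland', 1),
        ('Project', 1), ('Gutenberg’s', 1), ('Adventures', 1),
        ('in', 1), ('Wonderland', 1), ('Project', 1), ('Gutenberg’s', 1)]

counts = {}
for key, value in data:
    # Apply the reduce function (here: addition) to values sharing a key.
    counts[key] = counts.get(key, 0) + value

print(counts['Project'])  # 3, matching the Spark output above
```

Unlike this local loop, Spark performs the fold in parallel per partition and then merges the partial results, which is exactly why the reduce function must be associative and commutative.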
Word Count Program¶
In [35]:
txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet, consectetur adipiscing elit."""
txt_tokens = txt.split()
In [36]:
txt_rdd = sc.parallelize(txt_tokens)
In [37]:
txt_rdd.map(lambda s : (s, 1)).collect()
Out[37]:
[('Lorem', 1),
('ipsum', 1),
('dolor', 1),
('sit', 1),
('amet,', 1),
('consectetur', 1),
('adipiscing', 1),
('elit,', 1),
('sed', 1),
('do', 1),
('eiusmod', 1),
('tempor', 1),
('incididunt', 1),
('ut', 1),
('labore', 1),
('et', 1),
('dolore', 1),
('magna', 1),
('aliqua.', 1),
('Lorem', 1),
('ipsum', 1),
('dolor', 1),
('sit', 1),
('amet,', 1),
('consectetur', 1),
('adipiscing', 1),
('elit.', 1)]
In [38]:
word_counts = txt_rdd.map(lambda s : (s,1)).reduceByKey(lambda a, b: a + b)
for word in word_counts.collect():
    print(word)
('Lorem', 2)
('ipsum', 2)
('consectetur', 2)
('do', 1)
('elit.', 1)
('dolor', 2)
('sit', 2)
('amet,', 2)
('adipiscing', 2)
('eiusmod', 1)
('tempor', 1)
('incididunt', 1)
('ut', 1)
('et', 1)
('sed', 1)
('labore', 1)
('magna', 1)
('elit,', 1)
('dolore', 1)
('aliqua.', 1)
Line Count Program¶
In [39]:
lines = sc.textFile("in/lorem_ipsum.txt")
lineCounts = lines.map(lambda s : (s,1)).reduceByKey(lambda a, b: a + b)
for line in lineCounts.collect():
    print(line)
('Lorem ipsum dolor sit amet', 2)
('consectetur adipiscing elit', 2)
('sed do eiusmod tempor incididunt ut labore et dolore magna aliqua', 1)
Note: This program can also be used to compute the number of duplicate occurrences of records in a dataset.
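The same duplicate-counting pattern in plain Python uses collections.Counter, which performs the (record, 1) → sum-per-key aggregation locally. The records list below is assumed to match the file contents implied by the output above:

```python
from collections import Counter

records = [
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua",
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
]

# Counter maps each distinct record to its number of occurrences,
# exactly what map + reduceByKey computes in the Spark version.
duplicates = Counter(records)
for record, count in duplicates.items():
    print((record, count))
```

For data that fits in memory this is simpler; reduceByKey earns its keep when the records are spread across a cluster.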
In [ ]: