pyspark.RDD.reduceByKey()
Merge the values for each key using an associative and commutative reduce function. The associative property says that the way operands are grouped does not change the result, e.g. (a + b) + c = a + (b + c). The commutative property says that the order of the operands does not change the result, e.g. a + b = b + a. These properties matter because Spark may combine a key's values in any order and grouping across partitions. spark.apache.org/docs
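Because Spark is free to combine values in any order and grouping, only a function with both properties gives a deterministic result. A minimal pure-Python illustration (no Spark required) — addition qualifies, subtraction does not:

```python
from functools import reduce

values = [1, 2, 3, 4]

# Associative: grouping does not matter for addition.
assert (1 + 2) + 3 == 1 + (2 + 3)

# Commutative: order does not matter, so reducing the list
# forwards or backwards gives the same sum.
total_fwd = reduce(lambda a, b: a + b, values)
total_rev = reduce(lambda a, b: a + b, list(reversed(values)))
assert total_fwd == total_rev  # both are 10

# Subtraction is neither associative nor commutative, so a
# distributed reduce with it would give partition-dependent results.
assert (1 - 2) - 3 != 1 - (2 - 3)
```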
In [32]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
In [33]:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())
Out[33]:
[('a', 2), ('b', 1)]
One More Basic Example Before The Word Count Program¶
In [34]:
data = [('Project', 1),
('Gutenberg’s', 1),
('Alice’s', 1),
('Adventures', 1),
('in', 1),
('Wonderland', 1),
('Project', 1),
('Gutenberg’s', 1),
('Adventures', 1),
('in', 1),
('Wonderland', 1),
('Project', 1),
('Gutenberg’s', 1)]
rdd = sc.parallelize(data)
rdd2 = rdd.reduceByKey(lambda a,b: a+b)
for element in rdd2.collect():
    print(element)
('Gutenberg’s', 3)
('Alice’s', 1)
('in', 2)
('Adventures', 2)
('Wonderland', 2)
('Project', 3)
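Conceptually, reduceByKey is a per-key fold: for each key, the reduce function is applied pairwise across that key's values. A pure-Python sketch of the same aggregation over the same data (a plain dict plays the role of the shuffled key space):

```python
data = [('Project', 1), ('Gutenberg’s', 1), ('Alice’s', 1),
        ('Adventures', 1), ('in', 1), ('Wonderland', 1),
        ('Project', 1), ('Gutenberg’s', 1), ('Adventures', 1),
        ('in', 1), ('Wonderland', 1), ('Project', 1), ('Gutenberg’s', 1)]

counts = {}
for key, value in data:
    # Apply the reduce function (here: addition) to values sharing a key.
    counts[key] = counts.get(key, 0) + value

print(counts['Project'])  # 3, matching the Spark output above
```

Unlike this local loop, Spark performs the fold in parallel per partition and then merges the partial results, which is exactly why the reduce function must be associative and commutative.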
Word Count Program¶
In [35]:
txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet, consectetur adipiscing elit."""
txt_tokens = txt.split()
In [36]:
txt_rdd = sc.parallelize(txt_tokens)
In [37]:
txt_rdd.map(lambda s : (s, 1)).collect()
Out[37]:
[('Lorem', 1),
('ipsum', 1),
('dolor', 1),
('sit', 1),
('amet,', 1),
('consectetur', 1),
('adipiscing', 1),
('elit,', 1),
('sed', 1),
('do', 1),
('eiusmod', 1),
('tempor', 1),
('incididunt', 1),
('ut', 1),
('labore', 1),
('et', 1),
('dolore', 1),
('magna', 1),
('aliqua.', 1),
('Lorem', 1),
('ipsum', 1),
('dolor', 1),
('sit', 1),
('amet,', 1),
('consectetur', 1),
('adipiscing', 1),
('elit.', 1)]
In [38]:
word_counts = txt_rdd.map(lambda s : (s,1)).reduceByKey(lambda a, b: a + b)
for word in word_counts.collect():
    print(word)
('Lorem', 2)
('ipsum', 2)
('consectetur', 2)
('do', 1)
('elit.', 1)
('dolor', 2)
('sit', 2)
('amet,', 2)
('adipiscing', 2)
('eiusmod', 1)
('tempor', 1)
('incididunt', 1)
('ut', 1)
('et', 1)
('sed', 1)
('labore', 1)
('magna', 1)
('elit,', 1)
('dolore', 1)
('aliqua.', 1)
Line Count Program¶
In [39]:
lines = sc.textFile("in/lorem_ipsum.txt")
lineCounts = lines.map(lambda s : (s,1)).reduceByKey(lambda a, b: a + b)
for line in lineCounts.collect():
    print(line)
('Lorem ipsum dolor sit amet', 2)
('consectetur adipiscing elit', 2)
('sed do eiusmod tempor incididunt ut labore et dolore magna aliqua', 1)
Note: This program can also be used to compute the number of duplicate occurrences of records in a dataset.
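The same duplicate-counting pattern in plain Python uses collections.Counter, which performs the (record, 1) → sum-per-key aggregation locally. The records list below is assumed to match the file contents implied by the output above:

```python
from collections import Counter

records = [
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua",
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
]

# Counter maps each distinct record to its number of occurrences,
# exactly what map + reduceByKey computes in the Spark version.
duplicates = Counter(records)
for record, count in duplicates.items():
    print((record, count))
```

For data that fits in memory this is simpler; reduceByKey earns its keep when the records are spread across a cluster.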
In [ ]: