pyspark.RDD.reduceByKey()
Merge the values for each key using an associative and commutative reduce function. What is the associative property? An operation is associative when the way its operands are grouped does not change the result: (a + b) + c = a + (b + c). What is the commutative property? An operation is commutative when the order of its operands does not change the result: a + b = b + a. Spark requires both properties because partial results may be merged across partitions in any grouping and order. spark.apache.org/docs
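Both properties can be checked directly for the addition operator used in the examples below. A quick plain-Python sanity check (not Spark code, just an illustration of the two properties):

```python
from operator import add

# Associative: grouping does not change the result.
assert add(add(1, 2), 3) == add(1, add(2, 3))

# Commutative: order does not change the result.
assert add(2, 3) == add(3, 2)

# Subtraction has neither property, so it would be unsafe in
# reduceByKey: partitions may combine partial results in any order.
assert (1 - 2) - 3 != 1 - (2 - 3)
```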
In [32]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
In [33]:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())
Out[33]:
[('a', 2), ('b', 1)]
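Conceptually, reduceByKey folds all the values of each key together with the given function. A minimal plain-Python sketch of that per-key merge (ignoring Spark's partitioning and shuffle) reproduces the output above; the helper name `reduce_by_key` is just for illustration:

```python
from operator import add

def reduce_by_key(pairs, func):
    """Merge values per key with func, mimicking RDD.reduceByKey locally."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return sorted(out.items())

print(reduce_by_key([("a", 1), ("b", 1), ("a", 1)], add))
# [('a', 2), ('b', 1)]
```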
One More Basic Example Before The Word Count Program
In [34]:
data = [('Project', 1),
('Gutenberg’s', 1),
('Alice’s', 1),
('Adventures', 1),
('in', 1),
('Wonderland', 1),
('Project', 1),
('Gutenberg’s', 1),
('Adventures', 1),
('in', 1),
('Wonderland', 1),
('Project', 1),
('Gutenberg’s', 1)]
rdd = sc.parallelize(data)
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
for element in rdd2.collect():
    print(element)
('Gutenberg’s', 3)
('Alice’s', 1)
('in', 2)
('Adventures', 2)
('Wonderland', 2)
('Project', 3)
Word Count Program
In [35]:
txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet, consectetur adipiscing elit."""
txt_tokens = txt.split()
In [36]:
txt_rdd = sc.parallelize(txt_tokens)
In [37]:
txt_rdd.map(lambda s: (s, 1)).collect()
Out[37]:
[('Lorem', 1), ('ipsum', 1), ('dolor', 1), ('sit', 1), ('amet,', 1), ('consectetur', 1), ('adipiscing', 1), ('elit,', 1), ('sed', 1), ('do', 1), ('eiusmod', 1), ('tempor', 1), ('incididunt', 1), ('ut', 1), ('labore', 1), ('et', 1), ('dolore', 1), ('magna', 1), ('aliqua.', 1), ('Lorem', 1), ('ipsum', 1), ('dolor', 1), ('sit', 1), ('amet,', 1), ('consectetur', 1), ('adipiscing', 1), ('elit.', 1)]
In [38]:
word_counts = txt_rdd.map(lambda s: (s, 1)).reduceByKey(lambda a, b: a + b)
for word in word_counts.collect():
    print(word)
('Lorem', 2)
('ipsum', 2)
('consectetur', 2)
('do', 1)
('elit.', 1)
('dolor', 2)
('sit', 2)
('amet,', 2)
('adipiscing', 2)
('eiusmod', 1)
('tempor', 1)
('incididunt', 1)
('ut', 1)
('et', 1)
('sed', 1)
('labore', 1)
('magna', 1)
('elit,', 1)
('dolore', 1)
('aliqua.', 1)
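Note that punctuation keeps ('elit,', 1) and ('elit.', 1) as separate keys. Normalizing tokens before counting merges them; a plain-Python sketch using collections.Counter in place of the Spark pipeline (in Spark, the normalized tokens would be passed to sc.parallelize the same way as above):

```python
import string
from collections import Counter

txt = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
       "eiusmod tempor incididunt ut labore et dolore magna aliqua. "
       "Lorem ipsum dolor sit amet, consectetur adipiscing elit.")

# Lowercase and strip leading/trailing punctuation before counting.
tokens = [w.strip(string.punctuation).lower() for w in txt.split()]
counts = Counter(tokens)
print(counts["elit"])  # 2 -- 'elit,' and 'elit.' now merge into one key
```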
Line Count Program
In [39]:
lines = sc.textFile("in/lorem_ipsum.txt")
lineCounts = lines.map(lambda s: (s, 1)).reduceByKey(lambda a, b: a + b)
for line in lineCounts.collect():
    print(line)
('Lorem ipsum dolor sit amet', 2)
('consectetur adipiscing elit', 2)
('sed do eiusmod tempor incididunt ut labore et dolore magna aliqua', 1)
Note: This program can also be used to compute the number of duplicate occurrences of records in a dataset.
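In Spark, that would amount to appending .filter(lambda kv: kv[1] > 1) after reduceByKey to keep only records seen more than once. A plain-Python sketch of the same idea, with hypothetical sample records:

```python
from collections import Counter

# Hypothetical records; in Spark these would come from an RDD.
records = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Count each record, then keep only those seen more than once.
counts = Counter(records)
duplicates = {rec: n for rec, n in counts.items() if n > 1}
print(duplicates)  # {'apple': 3, 'banana': 2}
```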