PySpark reduceByKey on RDD with nested lists
I have an RDD composed of nested lists, as shown below:
This is one of roughly 150k nested lists in this RDD.
What I want to do is call reduceByKey() on each list so that I'm left with a single tuple per word, pairing the word with the total number of times it appears, e.g. ('the', 7). I also want to keep the lists nested, so I don't want to use flatMap.
I've been trying something like this:
word_ct = myrdd.reduceByKey(lambda xs: [x + y for x in xs])
which mirrors how I've run other commands on these nested lists, but it isn't working.
Any help?
In case the image doesn't load, here is a text version of my RDD:
[[('the', 1),
('text', 1),
('the', 1)],
[('he', 1),
('them', 1)]]
and I want the final result to be:
[[('the', 2),
('text', 1)],
[('he', 1),
('them', 1)]]
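One likely reason the reduceByKey() attempt fails is that reduceByKey operates on an RDD of (key, value) pairs, while each element of this RDD is itself a list of pairs. A common alternative is to map a per-list reduction over the RDD, which preserves the nesting. Below is a minimal sketch of that idea; `reduce_list` is a hypothetical helper name, `myrdd` is assumed to be the RDD from the question, and the example runs locally with plain Python so the logic can be checked without a Spark cluster:

```python
from collections import defaultdict

def reduce_list(pairs):
    # Sum the counts for each word within a single nested list,
    # returning one (word, total) tuple per distinct word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

# The example data from the question (note tuples use commas, not colons).
data = [[('the', 1), ('text', 1), ('the', 1)],
        [('he', 1), ('them', 1)]]

# On an actual RDD this would be:
#   word_ct = myrdd.map(reduce_list)
# Here we apply the same function with a list comprehension to show the result.
result = [reduce_list(lst) for lst in data]
print(result)
# → [[('the', 2), ('text', 1)], [('he', 1), ('them', 1)]]
```

Because map() applies the function to each list independently, the outer structure stays intact, unlike flatMap, which would merge everything into one flat stream of pairs.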
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow