PySpark reduceByKey on RDD with nested lists
I have an RDD composed of nested lists, as shown below:
This is one of roughly 150k nested lists in this RDD.
What I want to do is call reduceByKey() on each list so that I'm left with a single tuple per word, pairing the word with the total number of times it appears, e.g. ('the', 7). I also want to keep the lists nested, so I don't want to use flatMap.
I've been trying something like this:
word_ct = myrdd.reduceByKey(lambda xs: [x + y for x in xs])
which mirrors how I've run other commands on these nested lists, but it isn't working.
Any help?
In case the image doesn't load, here is a text version of my RDD:
[[('the', 1),
('text', 1),
('the', 1)],
[('he', 1),
('them', 1)]]
and I want the final result to be:
[[('the', 2),
('text', 1)],
[('he', 1),
('them', 1)]]
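One likely reason the reduceByKey() attempt fails is that reduceByKey operates on an RDD of (key, value) pairs, while each element of this RDD is itself a list of pairs. A common alternative is to map a per-list reduction over the RDD, which preserves the nesting. Below is a minimal sketch of that idea; `reduce_list` is a hypothetical helper name, `myrdd` is assumed to be the RDD from the question, and the example runs locally with plain Python so the logic can be checked without a Spark cluster:

```python
from collections import defaultdict

def reduce_list(pairs):
    # Sum the counts for each word within a single nested list,
    # returning one (word, total) tuple per distinct word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

# The example data from the question (note tuples use commas, not colons).
data = [[('the', 1), ('text', 1), ('the', 1)],
        [('he', 1), ('them', 1)]]

# On an actual RDD this would be:
#   word_ct = myrdd.map(reduce_list)
# Here we apply the same function with a list comprehension to show the result.
result = [reduce_list(lst) for lst in data]
print(result)
# → [[('the', 2), ('text', 1)], [('he', 1), ('them', 1)]]
```

Because map() applies the function to each list independently, the outer structure stays intact, unlike flatMap, which would merge everything into one flat stream of pairs.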
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow