Can't aggregate count for an RDD?
I am working on a PySpark streaming application and I'm running into the following problem.
After performing some transformations on a DStream object, I end up with the following RDD, called "rdd_new":
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
I then run this RDD through the following command, which should aggregate the values in the RDD by key:
rdd_new = rdd_new.updateStateByKey(aggregate_count)
where aggregate_count looks like this:
def aggregate_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)
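For context, here is how aggregate_count behaves on its own in plain Python (the sample batches below are just illustrative, not taken from my Spark job):

```python
def aggregate_count(new_values, total_sum):
    # Sum the values from the current batch and add the running total.
    # total_sum is None on the first call, so `or 0` supplies a default.
    return sum(new_values) + (total_sum or 0)

# First batch: no prior state, so total_sum is None.
print(aggregate_count([36, 10], None))  # 46
# A later batch: prior state of 46 plus a new value of 4.
print(aggregate_count([4], 46))         # 50
```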
But when that line of code executes, I get this error:
for obj in iterator:
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2052, in add_shuffle_key
ValueError: too many values to unpack (expected 2)
The full traceback is much longer, but I've narrowed it down to those lines. The thing is, the aggregate function works fine when my RDD looks like this:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
The key difference is that this one has only two columns. Since I really need this aggregate_count function to work for my project, how can I feed my three-column RDD into the function and have it actually work? I have no idea how to even approach this sort of issue. Thanks!
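For what it's worth, I can reproduce the same ValueError in plain Python. My guess (an assumption on my part, not something I've confirmed in the Spark source) is that the shuffle code unpacks each record as a (key, value) pair, which a three-element tuple cannot satisfy:

```python
row = ("Python", 36, 10)
try:
    # Mimic unpacking one record into a (key, value) pair,
    # which is what the error in add_shuffle_key suggests happens.
    key, value = row
except ValueError as exc:
    print(exc)  # too many values to unpack (expected 2)

# A two-element record unpacks cleanly:
key, value = ("Python", (36, 10))
```

So it seems the three-column shape itself is the problem, but I don't know the right way to reshape the RDD for updateStateByKey.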
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
