Can't aggregate count for an RDD?
I am working on a PySpark streaming application and I'm running into the following problem.
After performing some transformations on a DStream object, I end up with the following RDD, called "rdd_new":
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
I then run this RDD through the following command, which should aggregate the values in the RDD by key:
rdd_new = rdd_new.updateStateByKey(aggregate_count)
where aggregate_count looks like this:
def aggregate_count(new_values, total_sum):
    return sum(new_values) + (total_sum or 0)
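For context, here is how aggregate_count behaves on its own in plain Python (the sample batches below are just illustrative, not taken from my Spark job):

```python
def aggregate_count(new_values, total_sum):
    # Sum the values from the current batch and add the running total.
    # total_sum is None on the first call, so `or 0` supplies a default.
    return sum(new_values) + (total_sum or 0)

# First batch: no prior state, so total_sum is None.
print(aggregate_count([36, 10], None))  # 46
# A later batch: prior state of 46 plus a new value of 4.
print(aggregate_count([4], 46))         # 50
```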
But when that line of code executes, I get this error:
for obj in iterator:
File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2052, in add_shuffle_key
ValueError: too many values to unpack (expected 2)
The full traceback is much longer, but I've narrowed it down to those lines. The thing is, the aggregate function works fine when my RDD looks like this:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
The key difference is that this one has only two columns. Since I really need this aggregate_count function to work for my project, how can I feed my three-column RDD into the function and have it actually work? I have no idea how to even approach this sort of issue. Thanks!
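For what it's worth, I can reproduce the same ValueError in plain Python. My guess (an assumption on my part, not something I've confirmed in the Spark source) is that the shuffle code unpacks each record as a (key, value) pair, which a three-element tuple cannot satisfy:

```python
row = ("Python", 36, 10)
try:
    # Mimic unpacking one record into a (key, value) pair,
    # which is what the error in add_shuffle_key suggests happens.
    key, value = row
except ValueError as exc:
    print(exc)  # too many values to unpack (expected 2)

# A two-element record unpacks cleanly:
key, value = ("Python", (36, 10))
```

So it seems the three-column shape itself is the problem, but I don't know the right way to reshape the RDD for updateStateByKey.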
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
