How to output a tuple from foldByKey in PySpark?

I was practicing foldByKey, trying to generate tuples in the output.

I have some input in the form:

    x = sc.parallelize([[1,2],[3,4],[5,6],[1,1],[1,3],[3,2],[3,6]])

Converting it to a paired RDD:

    x2 = x.map(lambda y: (y[0],y[1]))

I want two values for each key in the input: the sum of all values belonging to that key, and the count of those values.

So, the output should be something like this:

    [(1,(6,3)),(3,(12,3)),(5,(6,1))]

I have tried code for this as:

    x3 = x2.foldByKey((0,0), lambda acc,x: (acc[0] + x,acc[1] + 1))

But, I am getting this error:

    TypeError: unsupported operand type(s) for +: 'int' and 'tuple'

I don't understand how acc[0] and acc[1] are tuples. They should be integers.



Solution 1:[1]

I was getting this error because foldByKey requires (by definition) that the fold function return the same type as the RDD's value type. My RDD's values are integers, but my fold function returns a tuple, so the error appears as soon as Spark tries to merge two partial results and passes a tuple where an integer is expected. What I was trying to achieve is a job for aggregateByKey(), because it allows the result type to differ from the RDD's value type.

That said, foldByKey can still be made to work: if I map each value to a (value, 1) tuple, the fold function combines tuples with tuples throughout, and I get the correct output:

     x2 = x.map(lambda y: (y[0], (y[1], 1)))
     x3 = x2.foldByKey((0, 0), lambda acc, x: (acc[0] + x[0], acc[1] + x[1]))
     x3.collect()

     [(1, (6, 3)), (5, (6, 1)), (3, (12, 3))]

Please feel free to provide suggestions.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Shanif Ansari