How do I convert strings in a 2-D RDD to int using PySpark?
I'm new to pyspark, and have been trying to figure this out for hours.
Currently, my RDD looks like this:
[['74', '85', '123'], ['73', '84', '122'], ['72', '83', '121'], ['70', '81', '119'], ['70', '81', '119'], ['69', '80', '118'], ['70', '81', '119'], ['70', '81', '119'], ['76', '87', '125'], ['76', '87', '125']]
I want it to look like this (where all the entries are integers):
[[74, 85, 123], [73, 84, 122], [72, 83, 121], [70, 81, 119], [70, 81, 119], [69, 80, 118], [70, 81, 119], [70, 81, 119], [76, 87, 125], [76, 87, 125]]
The closest I've gotten was using flatMap to turn it into a 1-D array and then converting the entries to integers. However, I want to process the integers three at a time (calculating the sum and average of each group of three), and I figured keeping the data in a 2-D structure would be the easiest way to do that. I also tried list comprehensions, but they don't seem to work directly on an RDD, since it isn't a list. Any help would be greatly appreciated!
Solution 1:[1]
Update
Before performing the operations below, you can use map and collect to convert your RDD into a list of lists:
rdd = spark.sparkContext.parallelize(data)
list_string = rdd.map(list).collect()
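If you would rather keep the data distributed instead of collecting it first, the same conversion can be applied inside the RDD with map. A minimal sketch of the per-row converter (shown here applied locally; with Spark you would pass it to rdd.map):

```python
# Converts one row of strings to a row of ints; in Spark you would run
# rdd.map(convert_row) instead of converting after collect().
def convert_row(row):
    return [int(x) for x in row]

data = [['74', '85', '123'], ['73', '84', '122']]
# Locally equivalent to: spark.sparkContext.parallelize(data).map(convert_row).collect()
converted = [convert_row(row) for row in data]
print(converted)  # [[74, 85, 123], [73, 84, 122]]
```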
A nested list comprehension is quick and efficient enough to convert all the strings to integers:
list_value = [[int(i) for i in list_] for list_ in list_string]
print(list_value)
[[74, 85, 123], [73, 84, 122], [72, 83, 121], [70, 81, 119], [70, 81, 119], [69, 80, 118], [70, 81, 119], [70, 81, 119], [76, 87, 125], [76, 87, 125]]
The same approach works for summing and averaging each row of the 2-D list:
list_sum = [sum(vector) for vector in list_value]
list_avg = [sum(vector)/len(vector) for vector in list_value]
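Since the question asks for both the sum and the average of each group of three, the two can also be computed together in one pass per row. A small sketch (the row_stats helper is hypothetical; with Spark it could be passed to rdd.map after the int conversion):

```python
# Compute (sum, average) for one row in a single pass.
def row_stats(row):
    total = sum(row)
    return total, total / len(row)

list_value = [[74, 85, 123], [73, 84, 122]]
stats = [row_stats(row) for row in list_value]
print(stats)  # [(282, 94.0), (279, 93.0)]
```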
Or better, use NumPy to do the trick.
import numpy as np

array = np.array(list_value)
np.sum(array, axis = 1)
Out[174]: array([282, 279, 276, 270, 270, 267, 270, 270, 288, 288])
np.average(array, axis=1)
Out[175]: array([94., 93., 92., 90., 90., 89., 90., 90., 96., 96.])
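As an aside, NumPy can also handle the string-to-int conversion itself: astype converts the whole 2-D array of strings in one vectorized step, so you can skip the list comprehension entirely. A sketch:

```python
import numpy as np

list_string = [['74', '85', '123'], ['73', '84', '122']]
# astype(int) converts every string element to an integer at once
array = np.array(list_string).astype(int)
print(np.sum(array, axis=1))      # [282 279]
print(np.average(array, axis=1))  # [94. 93.]
```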
For a speed comparison, I've created a list and an array of shape (1000, 3). Hopefully this gives you some insight into their relative efficiency.
%timeit np.sum(array, axis=1)
%timeit [sum(vector) for vector in list_value]
20.3 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
167 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.average(array, axis=1)
%timeit [sum(vector)/len(vector) for vector in list_value]
29.3 µs ± 536 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256 µs ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For small 2-D lists, however, the plain list comprehensions are faster than the NumPy calls, since each NumPy operation carries fixed per-call overhead.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
