pyspark sc.parallelize().map() hangs
I'm trying to parallelize a map over a list, running on 64 cores. My code is along these lines:
```python
r = sc.parallelize(X_list, 128)
m = r.map(func).collect()
```
The first 107 tasks breeze by, each in under 4 minutes, but the last 21 hang forever. I checked for skew (I didn't think this would be an issue without a group/join key anyway) and it looks like there's around the same number of items in each slice (rough check shown below). What could be happening here?
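For reference, the skew check was something along these lines (this isn't verbatim what I ran, just the gist): count the items per partition with `glom()`, which materializes each partition as a list, so only the counts get collected back to the driver.

```python
# Count items per partition without collecting the data itself:
# glom() turns each partition into a list, map(len) keeps only the counts.
counts = r.glom().map(len).collect()
print(min(counts), max(counts), len(counts))
```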
I've also tried this with 64 slices and with the default number of slices, and I see the same issue.