Simultaneously download and process data in Apache Spark
I have multiple files that ideally would all be the input to a single map-reduce job. Each one takes some time to download, so I was wondering whether I can begin in-memory processing of the files that have already been downloaded while the others are still downloading.
As I see it, this would pipeline the whole process. How do I achieve this with Apache Spark, and would it lead to large differences in mapper sizes, forcing me to shuffle and shard? Do you see any other problems with this approach?
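To make the idea concrete, here is a minimal sketch of one way this overlap might be obtained: instead of downloading everything first, the list of URLs is distributed and each Spark task downloads and parses its own file, so tasks whose download finishes early can start processing while others are still fetching. The helper names `fetch_lines` and `word_counts`, the word-count workload, and the assumption that every file is line-oriented UTF-8 text reachable by URL from the executors are all illustrative and not part of the question.

```python
from urllib.request import urlopen


def fetch_lines(url, timeout=60):
    """Hypothetical helper: stream one remote file and yield its lines.

    Runs inside a Spark task, so each file is downloaded by whichever
    executor picks up that task, independently of the other files.
    """
    with urlopen(url, timeout=timeout) as resp:
        for raw in resp:
            yield raw.decode("utf-8", errors="replace").rstrip("\r\n")


def word_counts(sc, urls):
    """Hypothetical driver function; `sc` is an existing SparkContext.

    One partition per URL means one task per file. Files of very
    different sizes therefore give partitions of very different sizes.
    """
    lines = sc.parallelize(urls, len(urls)).flatMap(fetch_lines)
    # Optional: lines = lines.repartition(n) evens out skewed partitions,
    # at the cost of a shuffle of the downloaded data.
    return (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)  # this step shuffles regardless
        .collect()
    )
```

With a running context this would be called as `word_counts(sc, ["https://example.com/a.txt", "https://example.com/b.txt"])`, where the URLs are placeholders. The trade-off in this sketch is that a failed or retried task downloads its file again, and how much overlap is actually seen depends on how many task slots the cluster has.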
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
