Spark limit + write is too slow

I have a dataset of 8 billion records stored as Parquet files in Azure Data Lake Storage Gen2.

I wanted to separate out a sample dataset of 2 billion records into a different location for some benchmarking, so I did the following:

df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', f'/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')

This job is running on 8 nodes of 8-core, 28 GB RAM machines [ 8 worker nodes + 1 master node ]. It has been running for over an hour without a single file written yet. The load finished within 2 s, so I know the limit + write is what's causing the bottleneck [ although load just infers the schema and builds a list of files, without actually reading the data ].

So I started inspecting the Spark UI for some clues, and here are my observations:

  1. Two jobs have been created by Spark. [screenshot]
  2. The first job took 35 minutes. Here's the DAG: [screenshots]
  3. The second job has been running for about an hour now with no progress at all. It has two stages in it. [screenshot]
  4. If you notice, stage 3 has one running task, but if I open the stages panel, I can't see any details of that task. I also don't understand why it's doing a shuffle when all I have is a limit on my DataFrame. Does limit really need a shuffle? Even if it does, an hour seems awfully long to shuffle the data around. And if this stage is what's actually performing the limit, what did the first job do? Just read the data? 35 minutes for that also seems too long, but for now I'd settle for the job completing at all. [screenshots]
  5. Stage 4 is just stuck. I believe it is the actual writing stage, waiting for this shuffle to finish.

I am new to Spark and I'm fairly clueless about what's happening here. Any insights into what I'm doing wrong would be very useful.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
