Spark: writing to CSV/Hive takes too much time, and what is the performance baseline?
I am having a very simple problem with Spark, but there is very little information about it on the web. I have run into it with both PySpark and Scala.
The problem is that saving to CSV or Hive takes a very long time.
Here is a minimal piece of code that shows it.
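from pyspark.sql import SparkSession

# Reuse the active session or create a new one
spark = SparkSession.builder.getOrCreate()
sql = '''
select * from some_table
'''
df = spark.sql(sql)
# path is the output directory, defined elsewhere
df.write.csv(path)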
This code is very simple, but writing 200,000 rows can take about 30-40 minutes, and writing 10 million rows can take hours. Even repartition(1) does not significantly improve write performance. saveAsTable (to Hive) may be better, but it still takes an unacceptable amount of time, and it is much faster to use Hive directly. However, Hive is hard to engineer into a big project.
My questions are:
- Is there a way to improve performance?
- What is the performance baseline? Roughly how long should writing 1 million rows take, and on what configuration?
Solution 1:[1]
It's not just the number of rows; the number of columns and the data itself matter too. The size of the table in GB is what mainly determines the dump time. You can use the commands below to get that information.
show tblproperties some_table;
or
analyze table some_table compute statistics;
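When statistics are available, the table properties typically include a `totalSize` value in bytes. As a minimal sketch, the arithmetic for turning that into GB and a rough dump time could look like this. The helper names are hypothetical, and the throughput is an assumption you would have to measure on your own cluster; neither Hive nor Spark reports it. Keep in mind that the on-disk size of a compressed table can be much smaller than the CSV it produces.

```python
def bytes_to_gb(total_size_bytes: int) -> float:
    """Convert a byte count (e.g. Hive's totalSize property) to GB (2**30 bytes)."""
    return total_size_bytes / 2**30


def estimate_dump_seconds(total_size_bytes: int, throughput_mb_per_s: float) -> float:
    """Rough estimate: data size divided by a measured write throughput.

    throughput_mb_per_s is supplied by the caller (an assumption, not a
    value reported by Hive or Spark).
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    size_mb = total_size_bytes / 2**20
    return size_mb / throughput_mb_per_s


if __name__ == "__main__":
    size = 5 * 2**30  # hypothetical totalSize of 5 GB
    print(f"{bytes_to_gb(size):.1f} GB")
    # 50 MB/s is a made-up figure used only for illustration
    print(f"{estimate_dump_seconds(size, 50.0):.0f} s")
```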
Now, once you know the size in GB, you can estimate the dump time. If the table is too large, the dump will take a long time, and you can improve it in the following ways:
- Add some filters to exclude unwanted data.
- Select only the required columns.
- Dump the data at night, when the system is not busy.
- Check with the final consumer of the data and see whether the dump can be tuned to what they actually need.
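The first two suggestions can be applied in Spark before the write. Here is a minimal sketch, assuming a PySpark DataFrame; `prune_for_export` is a hypothetical helper, not part of Spark.

```python
def prune_for_export(df, columns, condition=None):
    """Keep only the needed rows and columns before writing.

    df is expected to be a pyspark.sql.DataFrame, columns a list of column
    names, and condition an optional SQL filter string.
    """
    if condition is not None:
        # Filter first so the condition may reference columns that are not exported
        df = df.filter(condition)
    # Then keep only the required columns
    return df.select(*columns)
```

With made-up column names, usage might look like `prune_for_export(df, ["id", "amount"], "amount > 0").write.csv(path)`.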
You can also try this command to dump the data to CSV; it is easy and faster.
hive -e 'select * from some_table' | sed 's/[\t]/,/g' > /tmp/some_table.csv
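One caveat with the `sed` approach: it replaces every tab with a comma and does not quote fields, so a value that itself contains a comma will shift the columns. As a minimal sketch of a safer conversion, Python's standard `csv` module can do the quoting. `tsv_to_csv` is a hypothetical helper, and it assumes the Hive output is tab-separated with one row per line.

```python
import csv
import io


def tsv_to_csv(src, dst):
    """Read tab-separated lines from src and write properly quoted CSV to dst."""
    writer = csv.writer(dst, lineterminator="\n")
    for line in src:
        # Split on tabs only; the csv writer quotes fields that contain commas or quotes
        writer.writerow(line.rstrip("\n").split("\t"))


# Small in-memory demonstration with a field that contains a comma
sample = io.StringIO("1\tSmith, John\t42\n")
out = io.StringIO()
tsv_to_csv(sample, out)
print(out.getvalue(), end="")  # prints: 1,"Smith, John",42
```

To use it in the pipeline above, a script could call `tsv_to_csv(sys.stdin, sys.stdout)` in place of the `sed` step.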
Solution 2:[2]
I know this question is pretty old, but after a couple of rounds of trial and error my team figured out that when writing to CSV, we need to make sure the data is repartitioned according to the parallelism required. For example:
Executors used = 30
Repartition before the write can be higher, 100+
If there is more data to write, increase the repartition count and the executor count too.
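The rule of thumb above can be sketched as follows. The helper names and the `tasks_per_core` multiplier are assumptions for illustration, not a Spark API; the idea is only to give every executor core a few write tasks.

```python
def choose_partitions(executors: int, cores_per_executor: int = 1, tasks_per_core: int = 3) -> int:
    """Pick a partition count that gives every core several write tasks."""
    if min(executors, cores_per_executor, tasks_per_core) < 1:
        raise ValueError("all arguments must be >= 1")
    return executors * cores_per_executor * tasks_per_core


def write_csv_repartitioned(df, path, num_partitions):
    """Repartition a PySpark DataFrame, then write it as CSV (one file per partition)."""
    df.repartition(num_partitions).write.csv(path)


if __name__ == "__main__":
    # 30 executors with an assumed 1 core each and 4 tasks per core
    print(choose_partitions(30, cores_per_executor=1, tasks_per_core=4))
```

With 30 executors this lands in the same "100+" range mentioned above. Note that each partition becomes its own output file, so a very high count produces many small files.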
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Koushik Roy |
Solution 2 | ArunSelvam P M |