Efficiently join one large table (1TB) and another small table (250GB) in Spark
There is a huge dataset of size 1 TB and another dataset of size 250 GB. The current setup is as follows:
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) // As I was getting a TimeoutException
var masterDfPrimary: DataFrame = spark.read //Huge data set
.format("com.databricks.spark.sqldw")
.options(readAdwMap) //Connection and table details passed as parameter
.load().persist(StorageLevel.DISK_ONLY) //Persisted to disk as it will be used twice for computation
masterDfPrimary.createOrReplaceTempView("HugeData")
var smallDF= spark.read.parquet(path)
smallDF.createOrReplaceTempView("smallData")
var computeDf = spark.sql("""select * from HugeData as h left outer join smallData as s on h.id = s.id""")
display(computeDf)
How does the above code partition the huge data? How can I improve performance further, either by using repartition and broadcast, or with some other configuration setting I can leverage?
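For reference, one way to see how Spark actually partitions and joins the data in the snippet above is to inspect the physical plan and the shuffle settings. A minimal sketch, using only the objects already defined above:

    // Show the physical plan: with autoBroadcastJoinThreshold = -1 you should see a
    // SortMergeJoin preceded by Exchange (shuffle) nodes on h.id / s.id.
    computeDf.explain(true)

    // Number of shuffle partitions used for the join (default is 200).
    println(spark.conf.get("spark.sql.shuffle.partitions"))

    // Number of partitions of the result after the shuffle.
    println(computeDf.rdd.getNumPartitions)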
Solution 1:[1]
You need to go with a sort-merge join: since the second table is 250 GB, it is not feasible to broadcast it. Another approach: if the small table has a single row per "id", you can pre-process it to limit its records (and columns), and then broadcast the reduced table to the larger one to get better performance, as sketched below.
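A minimal sketch of both options, assuming the DataFrames and the "id" join key from the question; the extra column names (colA, colB), the filter on the small table, and the partition count are illustrative assumptions, not prescriptive values:

    import org.apache.spark.sql.functions.{broadcast, col}

    // Option A: keep the sort-merge join, but give the shuffle enough partitions
    // for ~1 TB of data and co-partition both sides on the join key up front.
    spark.conf.set("spark.sql.shuffle.partitions", 2000) // tune to cluster size and data volume
    val hugeByKey  = masterDfPrimary.repartition(col("id"))
    val smallByKey = smallDF.repartition(col("id"))
    val sortMergeJoined = hugeByKey.join(smallByKey, Seq("id"), "left_outer")

    // Option B: shrink the small table first (keep only the join key and the columns
    // actually needed, drop duplicate ids), then broadcast the reduced result if it
    // now fits comfortably in executor memory.
    val reducedSmall = smallDF
      .select("id", "colA", "colB")  // hypothetical columns - prune to what you need
      .dropDuplicates("id")
    val broadcastJoined = masterDfPrimary.join(broadcast(reducedSmall), Seq("id"), "left_outer")

Whether Option B is viable depends entirely on how small the pre-processed table becomes; `broadcast()` is only a hint, and a table that is still tens of gigabytes after pruning will overwhelm the driver and executors.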
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Doney |
