Efficiently join one large table (1TB) with another small table (250GB) in Spark

There is a huge dataset of size 1TB and another dataset of size 250GB. The current setup is as follows:

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) // Disabled because I was getting a TimeoutException

var masterDfPrimary: DataFrame = spark.read // Huge dataset
      .format("com.databricks.spark.sqldw")
      .options(readAdwMap) // Connection and table details passed as parameters
      .load().persist(StorageLevel.DISK_ONLY) // Persisted to disk as it will be used twice for computation

masterDfPrimary.createOrReplaceTempView("HugeData")

var smallDF= spark.read.parquet(path)
    smallDF.createOrReplaceTempView("smallData")

var computeDf = spark.sql("""select * from HugeData as h left outer join smallData as s on h.id = s.id""")

display(computeDf)

How does the above code partition the huge dataset? And how can I improve performance further, either by using repartition and broadcast, or through some other configuration setting?



Solution 1:[1]

You need to go with a sort-merge join. Since the second table is 250GB, it is not feasible to broadcast it. Another approach: if the small table has a single row per "id" (or can be pre-processed to reduce its records), you can cut it down first and broadcast the reduced table into the join with the larger table for better performance. Both options are sketched below.
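
A minimal sketch of the sort-merge route, assuming the join key is "id" and reusing the spark, masterDfPrimary and smallDF objects from the question. The partition count of 2000 is only an illustrative value; tune it to your cluster.

import spark.implicits._

// Keep auto-broadcast disabled so Spark falls back to a shuffle (sort-merge) join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Raise shuffle parallelism so the 1TB side is split into reasonably sized tasks
// (200 is the default; the right value depends on cluster size and data skew)
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// Repartitioning both sides on the join key co-locates matching ids,
// which is exactly the layout the sort-merge join needs
val hugeByKey  = masterDfPrimary.repartition(2000, $"id")
val smallByKey = smallDF.repartition(2000, $"id")

val computeDf = hugeByKey.join(smallByKey, Seq("id"), "left_outer")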

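And a sketch of the pre-processing-plus-broadcast idea, which only makes sense if the reduced small table fits comfortably in executor memory (broadcasting anything close to 250GB is not feasible). The selected column names here are hypothetical placeholders.

import org.apache.spark.sql.functions.broadcast

// Keep only the columns needed downstream and deduplicate on the join key
val smallLookup = smallDF
  .select("id", "attr1", "attr2") // hypothetical column names
  .dropDuplicates("id")

// Broadcasting the reduced table avoids shuffling the 1TB side at all
val computeDf = masterDfPrimary.join(broadcast(smallLookup), Seq("id"), "left_outer")
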
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Doney