When to use refreshTable and refreshByPath in a Spark application

I have a use case where I need to overwrite the data in a specific partition of a Hive table.

Since the insertOverwrite method overwrites the entire table instead of only the partition, I am instead altering the partition directory in the application and then overwriting only that partition directory.
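For reference, a minimal sketch of that approach, assuming a Parquet-backed table `db.events` partitioned by `dt` (the table name, partition value, and path here are all hypothetical):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionOverwriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-overwrite-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hypothetical names: table db.events, partition column dt, Parquet files.
    val partitionPath = "/warehouse/db.db/events/dt=2021-01-01"

    val newData = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Overwrite only this partition's directory, not the whole table.
    newData.write.mode(SaveMode.Overwrite).parquet(partitionPath)

    // Point the Hive partition at the rewritten directory.
    spark.sql(
      s"ALTER TABLE db.events PARTITION (dt='2021-01-01') SET LOCATION '$partitionPath'")
  }
}
```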

This worked fine the first time, but from the second run onwards I started getting the error below:

```
Caused by: java.io.FileNotFoundException: Item not found: ''. Note, it is possible that the live version is still available but the requested generation is deleted.

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
```

To avoid this I am calling spark.catalog.refreshTable(<table_name>), but it is only intermittently successful: sometimes the job succeeds and sometimes it fails.
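In code, the workaround looks roughly like this, continuing the hypothetical names from the sketch above:

```scala
// Invalidate Spark's cached metadata and file listing for the table
// after its partition directory has been rewritten out-of-band.
spark.catalog.refreshTable("db.events")
```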

Can anyone guide me on what I am doing wrong? I also explored the spark.catalog.refreshByPath() method, but I couldn't find a clear explanation of when to use which one.
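For context, the two APIs differ in what the invalidation is keyed on; a rough sketch of both calls, again using the hypothetical names from above:

```scala
// refreshTable is keyed on the catalog table name: it invalidates the
// cached data and metadata for that table.
spark.catalog.refreshTable("db.events")

// refreshByPath is keyed on a filesystem path: it invalidates any cached
// Dataset whose underlying data source contains that path, which also
// covers data read directly with spark.read.parquet(...) and never
// registered in the catalog.
spark.catalog.refreshByPath("/warehouse/db.db/events")
```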


