Spark: cache a file to prevent the file being deleted while processing it

I have a Spark application that reads a file. Because Spark loads data lazily, the file may exist at the time of spark.read but be deleted by the time the data is actually read, for example when a count action runs.

// t0: file exists when initially trying to load the file
val ds = spark.read.json("s3://some-location/some-file")

// some operations on ds

// t1: the file s3://some-location/some-file is deleted from S3 by someone else

// t2: continue doing some operations on ds

ds.count // throws exception

Can I mitigate the problem by caching the dataset immediately after spark.read, something like this?

val ds = spark.read.json("s3://some-location/some-file")
ds.cache
ds.count // force load the file


Solution 1:[1]

Spark does not read the file and hold all of its data in memory up front; depending on the type of file you are dealing with, reading is a continuous process that happens throughout the lifecycle of the application.
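
To make this concrete, here is a minimal sketch (assuming the same S3 path as in the question) of what eager caching buys you and where it still falls short: persist plus an eager action materializes every partition, but evicted or lost blocks are still recomputed from the original file.

import org.apache.spark.storage.StorageLevel

val ds = spark.read.json("s3://some-location/some-file")

// Materialize all partitions right away: persist marks the dataset for
// caching, and the count action forces a full scan so every partition
// actually lands in the cache (memory, spilling to local disk).
val cached = ds.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()

// Caveat: if an executor is lost or a cached block is evicted, Spark
// recomputes the missing partitions from the original S3 path, which
// will fail if the file has been deleted in the meantime.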

If your problem is that another application is deleting the file, I would suggest using some sort of lock that prevents deletion of the file. S3 does appear to support this via Object Lock: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html (full disclosure: I have never used it).
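
As a rough sketch of what Object Lock could look like from the JVM (untested; the bucket and key names below are placeholders, and the bucket must have Object Lock enabled and versioning turned on), you could place a governance-mode retention on the object with the AWS SDK for Java v2:

import java.time.Instant
import java.time.temporal.ChronoUnit
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{ObjectLockRetention, ObjectLockRetentionMode, PutObjectRetentionRequest}

val s3 = S3Client.create()

// Retain the object for one day so it cannot be deleted while the job runs.
// GOVERNANCE mode can still be bypassed by principals with the
// s3:BypassGovernanceRetention permission; COMPLIANCE mode cannot.
val retention = ObjectLockRetention.builder()
  .mode(ObjectLockRetentionMode.GOVERNANCE)
  .retainUntilDate(Instant.now().plus(1, ChronoUnit.DAYS))
  .build()

s3.putObjectRetention(
  PutObjectRetentionRequest.builder()
    .bucket("some-location")   // placeholder bucket name
    .key("some-file")          // placeholder object key
    .retention(retention)
    .build()
)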

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Sai Kiran KrishnaMurthy