Spark: cache a file to guard against the file being deleted while processing it
I have a Spark application that reads a file. Because Spark evaluates lazily, the file may exist at the time of spark.read but be deleted by the time it is actually read, for example when an action such as count runs.
// t0: file exists when initially trying to load the file
val ds = spark.read.json("s3://some-location/some-file")
// some operations on ds
// t1: the file s3://some-location/some-file is deleted from S3 by someone else
// t2: continue doing some operations on ds
ds.count // throws exception
Can I mitigate the problem by caching the file immediately after spark.read, something like the following?
val ds = spark.read.json("s3://some-location/some-file")
ds.cache
ds.count // force load the file
Solution 1:[1]
Depending on the type of file you are dealing with, Spark does not read the whole file and hold the data in memory up front. Reading is a continuous process that can happen throughout the lifecycle of the application.
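For illustration only (this is not part of the original answer): a minimal sketch of the approach proposed in the question, forcing an eager read with persist plus an action. The S3 path is the hypothetical one from the question, and persist(StorageLevel.MEMORY_AND_DISK) is an assumption chosen so that evicted partitions spill to local disk instead of being recomputed from S3.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("eager-read-sketch").getOrCreate()

// Read and pin the data as early as possible; the S3 path is hypothetical.
val ds = spark.read.json("s3://some-location/some-file")
  .persist(StorageLevel.MEMORY_AND_DISK) // spill to local disk rather than recompute from S3

// Running an action here forces Spark to scan the file now and populate the cache.
ds.count()

// Later operations use the cached partitions. Caveat: if an executor is lost,
// Spark recomputes its partitions from the source, which fails if the file is gone.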
If your problem is that another application may delete your file, I would suggest using some form of lock that prevents deletion of the file. S3 appears to support this with Object Lock: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html (full disclosure: I have never used it).
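Again as a sketch rather than a tested recipe: with the AWS SDK for Java v2, a retention period can be placed on the object before processing starts, assuming the bucket was created with Object Lock enabled. The bucket, key, and retention window below are made up.

import java.time.Instant
import java.time.temporal.ChronoUnit
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{ObjectLockRetention, ObjectLockRetentionMode, PutObjectRetentionRequest}

val s3 = S3Client.create()

// Ask S3 to refuse deletions of this object until the retention date passes.
// GOVERNANCE mode can still be bypassed by users with special permissions;
// COMPLIANCE mode cannot.
s3.putObjectRetention(
  PutObjectRetentionRequest.builder()
    .bucket("some-location")          // hypothetical bucket
    .key("some-file")                 // hypothetical key
    .retention(
      ObjectLockRetention.builder()
        .mode(ObjectLockRetentionMode.GOVERNANCE)
        .retainUntilDate(Instant.now().plus(2, ChronoUnit.HOURS))
        .build())
    .build())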
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sai Kiran KrishnaMurthy |
