'Spark dataframe is being re-evaluated after a cache
I am running into some issues using cache on a spark dataframe. My expectation is that after a cache on a dataframe, the dataframe is created and cached the fist time it is needed. Any further calls to the dataframe should be from the cache
here's my code:
val mydf = spark.sql("read about 400 columns from a hive table").
withColumn ("newcol", someudf("existingcol")).
cache()
To test I ran a mydf.count() twice. I would expect the first time to take some time since the data is being cached. But the second time should be instantaneous?
What I am actually seeing is that it takes the same time for both the counts. This first one comes back pretty quickly which I think tells me that the data was not cached. If I remove the withColumn part of the code and just cache the raw data, the second count is instantaneous
Am I doing something wrong? How can I load raw data from hive, add columns and then cache the dataframe for further use? Using spark 2.3
Any help will be great!
Solution 1:[1]
the problem with your case is that mydf.count() is not actually materializing the dataframe (i.e. not all columns are read, your udf will no be called). That is because count() is highly optimized.
To make sure the entire dataframe is cached into memory, you should repeat your experiment with mydf.rdd.count() or another query (e.g. using sorting and/or aggregation)
See e.g. this SO question
Solution 2:[2]
As you are caching a dataset/dataframe, se the documented default behavior:
def cache(): Dataset.this.typePersist this Dataset with the default storage level (
MEMORY_AND_DISK).
So for your case you can try persist(MEMORY_ONLY)
def persist(newLevel: StorageLevel): Dataset.this.typePersist this Dataset with the given storage level.
newLevelOne of:MEMORY_ONLY,MEMORY_AND_DISK,MEMORY_ONLY_SER,MEMORY_AND_DISK_SER,DISK_ONLY,MEMORY_ONLY_2,MEMORY_AND_DISK_2, etc.
Solution 3:[3]
If its relevant
.cache/persist is lazy evaluation, to force it you can use the spark SQL's API which have the capability change form lazy to eager.
CACHE [ LAZY ] TABLE table_identifier
[ OPTIONS ( 'storageLevel' [ = ] value ) ] [ [ AS ] query ]
Unless LAZY specified it would be eager mode, you need to register a temp table prior to this.
Pseudo code would be:
df.createOrReplaceTempView("dummyTbl")
spark.sql("cache table dummyTbl")
More on the document reference - https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Raphael Roth |
| Solution 2 | Martijn Pieters |
| Solution 3 | Bhargav Kosaraju |
