What's the most efficient way to display the first n rows of a PySpark DataFrame?
In pandas, every time I do some operation on a DataFrame, I call .head() to get a quick visual check of what the data looks like.
While working with a large dataset in PySpark, calling df.show(5) takes a very long time.
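For context, here is a minimal sketch of the two habits being compared (the pdf and df DataFrames are hypothetical stand-ins): pandas already holds the data in memory, while in PySpark show(5) has to execute the lazy plan behind df.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas: data is already in memory, so head() returns immediately
pdf = pd.DataFrame({"value": range(100)})
print(pdf.head())

# PySpark: df is a lazy plan; show(5) executes enough of that plan to
# produce 5 rows, which can be slow when the upstream work is heavy
df = spark.range(100).toDF("value")   # stand-in for a real pipeline
df.show(5)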
After reading about caching and persisting, I tried caching the DataFrame with df.cache(), hoping that once it was cached I could display the contents of df as easily as in pandas; however, it doesn't seem to improve the speed. This is what I did to measure the time. Note that the timing was done on a small dataset that fits in memory.
# each %%timeit snippet below runs as its own notebook cell

# without cache
%%timeit
df.show()

# cache the DataFrame (cache() is lazy and returns the same DataFrame)
df.cache()

# with cache
%%timeit
df.show()

# remove the DataFrame from the cache
df.unpersist()

# testing again without cache
%%timeit
df.show()
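For what it's worth, cache() only marks the DataFrame for caching; nothing is materialized until an action runs, so the first timed show() still pays the full cost. A common pattern, sketched below with a hypothetical df built from spark.range, is to force materialization once before timing:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).toDF("value")   # hypothetical stand-in DataFrame

df.cache()       # lazy: only marks df as cacheable
df.count()       # an action materializes the cached data

df.show(5)       # later actions can read from the cache

df.unpersist()   # release the cached blocks when finished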
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
