What's the most efficient way to display the first n rows of a PySpark DataFrame?
In pandas, every time I do some operation on a DataFrame, I call .head() to get a quick visual check of what the data looks like.
While working with a large dataset in PySpark, calling df.show(5) takes a very long time.
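For context, here is a minimal sketch of the two habits being compared (the pdf and df DataFrames are hypothetical stand-ins): pandas already holds the data in memory, while in PySpark show(5) has to execute the lazy plan behind df.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas: data is already in memory, so head() returns immediately
pdf = pd.DataFrame({"value": range(100)})
print(pdf.head())

# PySpark: df is a lazy plan; show(5) executes enough of that plan to
# produce 5 rows, which can be slow when the upstream work is heavy
df = spark.range(100).toDF("value")   # stand-in for a real pipeline
df.show(5)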
After reading about caching and persisting, I tried caching the DataFrame with df.cache(), hoping that once it was cached I could display the contents of df as easily as in pandas; however, it doesn't seem to improve the speed. This is what I did to measure the time. Note that the timing was done on a small dataset that fits in memory.
# each %%timeit snippet below runs as its own notebook cell

# without cache
%%timeit
df.show()

# cache the DataFrame (cache() is lazy and returns the same DataFrame)
df.cache()

# with cache
%%timeit
df.show()

# remove the DataFrame from the cache
df.unpersist()

# testing again without cache
%%timeit
df.show()
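For what it's worth, cache() only marks the DataFrame for caching; nothing is materialized until an action runs, so the first timed show() still pays the full cost. A common pattern, sketched below with a hypothetical df built from spark.range, is to force materialization once before timing:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).toDF("value")   # hypothetical stand-in DataFrame

df.cache()       # lazy: only marks df as cacheable
df.count()       # an action materializes the cached data

df.show(5)       # later actions can read from the cache

df.unpersist()   # release the cached blocks when finished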
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
