In PySpark, how to select n rows of a DataFrame without scanning the whole table
I'm using PySpark and want to show the user a preview of a very large table (10 million rows, for example). The user should be able to see about 5000 rows of the table (first, last, or random, any 5000 rows are fine), so what is the fastest way to get n rows from the table? I have tried limit and sample, but these functions still scan the whole table, so the time complexity is roughly O(N), which takes a lot of time.
spark.sql('select * from some_table').limit(N)
Can someone help me?
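For reference, the limit and sample attempts described above might look roughly like this (a sketch; `some_table`, the fraction, and the seed are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sql('select * from some_table')

# Attempt 1: limit, then materialize the rows on the driver
preview = df.limit(5000).collect()

# Attempt 2: sample a small fraction; sampling still touches every partition
preview = df.sample(fraction=0.0005, seed=42).collect()
```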
Solution 1:[1]
spark.sql('select * from some_table limit 10')
Since you are making a SQL call from Python, this is by far the easiest solution, and it's fast. I don't think it scans the whole table when you use a SQL call. Assuming your table is already cached, are you sure the delay is caused by scanning the table, or is it caused by materializing the table?
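A minimal sketch of that approach, assuming a SQL-registered table named `some_table` and a preview size of 5000 rows (both illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

N = 5000
preview_df = spark.sql(f'select * from some_table limit {N}')

# Materialize only the preview rows on the driver, e.g. for display in a UI
preview_rows = preview_df.collect()

# Or, if pandas is available on the driver, convert the small result directly
preview_pdf = preview_df.toPandas()
```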
As an alternative, assuming you had a Python DataFrame handle, df_some_table, it gets trickier because the .head() and .show() functions return something other than a DataFrame (a list of Rows and None, respectively), but they can work for peeking at the DataFrame.
df_some_table.head(N)
df_some_table.show(N)
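A short sketch of the difference, assuming a DataFrame handle named `df_some_table` and a field called `some_column` (both placeholders):

```python
# head(N) returns a list of Row objects, not a DataFrame
rows = df_some_table.head(5000)
first_value = rows[0]['some_column']  # fields are accessible by name

# show(N) prints a formatted preview to stdout and returns None
df_some_table.show(5000, truncate=False)

# To keep working with a DataFrame, wrap the collected rows back up if needed
preview_df = spark.createDataFrame(rows, schema=df_some_table.schema)
```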
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nathan T Alexander |
