In PySpark, how to select n rows of a DataFrame without scanning the whole table
I'm using PySpark and want to show the user a preview of a very large table (10 million rows, for example). The user should be able to see about 5000 rows of the table (first, last, or random, any 5000 rows are fine), so what is the fastest way to get n rows from the table? I have tried limit and sample, but these functions still scan the whole table, so the time complexity is roughly O(N), which takes a lot of time.
spark.sql('select * from some_table').limit(N)
Can someone help me?
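For reference, the limit and sample attempts described above might look roughly like this (a sketch; `some_table`, the fraction, and the seed are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sql('select * from some_table')

# Attempt 1: limit, then materialize the rows on the driver
preview = df.limit(5000).collect()

# Attempt 2: sample a small fraction; sampling still touches every partition
preview = df.sample(fraction=0.0005, seed=42).collect()
```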
Solution 1:[1]
spark.sql('select * from some_table limit 10')
Since you are making a SQL call from Python, this is by far the easiest solution, and it's fast. I don't think it scans the whole table when you use a SQL call. Assuming your table is already cached, are you sure the delay is caused by scanning the table, or is it caused by materializing the table?
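A minimal sketch of that approach, assuming a SQL-registered table named `some_table` and a preview size of 5000 rows (both illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

N = 5000
preview_df = spark.sql(f'select * from some_table limit {N}')

# Materialize only the preview rows on the driver, e.g. for display in a UI
preview_rows = preview_df.collect()

# Or, if pandas is available on the driver, convert the small result directly
preview_pdf = preview_df.toPandas()
```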
As an alternative, assuming you had a Python DataFrame handle, df_some_table, it gets trickier because the .head() and .show() functions return something other than a DataFrame (a list of Rows and None, respectively), but they can work for peeking at the DataFrame.
df_some_table.head(N)
df_some_table.show(N)
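A short sketch of the difference, assuming a DataFrame handle named `df_some_table` and a field called `some_column` (both placeholders):

```python
# head(N) returns a list of Row objects, not a DataFrame
rows = df_some_table.head(5000)
first_value = rows[0]['some_column']  # fields are accessible by name

# show(N) prints a formatted preview to stdout and returns None
df_some_table.show(5000, truncate=False)

# To keep working with a DataFrame, wrap the collected rows back up if needed
preview_df = spark.createDataFrame(rows, schema=df_some_table.schema)
```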
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nathan T Alexander |
