Fast SQLite data retrieval with pandas

For one batch process I query my database ~30,000 times. Every query returns between 0 and 20,000 rows. Profiling the code shows that most of the time is spent getting this data out of the database.

I tested 2 different methods with similar results. Assuming that the database schema is not the bottleneck, what can I do to speed up the data retrieval (besides going parallel)?

  • I was thinking of using another library/wrapper for SQLite
  • getting rid of pandas, but a lot of the computation is done later on with the DataFrame; would Dask eventually be faster?
  • wrapping it in Cython/Julia?
import sqlite3
import pandas as pd

db = sqlite3.connect(r'D:\data.db')  # raw string so the backslash is not treated as an escape
cur = db.cursor()

# 1. just pandas, 231 ms
t = pd.read_sql("SELECT * FROM daily d WHERE d.id = 1", db)

# 2. DB-API & pandas, 242 ms
# the query itself takes about half the time of 1., but building the DataFrame costs time
cur.execute("SELECT * FROM daily d WHERE d.id = 1")
rows = cur.fetchall()
cols = [desc[0] for desc in cur.description]  # column names taken from the cursor
t = pd.DataFrame(rows, columns=cols)


Solution 1:[1]

The question is very general, given that we know little about the processing you do with pandas.

The one suggestion, though, would be to move as much processing as possible (particularly anything that limits the size of the DataFrame) into SQL, for SQLite to process.

So, if there is any filtering done in pandas, I would strive to move it into SQL, even if the resulting SELECT or GROUP BY condition is somewhat cumbersome. There is a cost to copying the data into the Python realm, and pandas eats up memory and time.
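
As a rough illustration of that idea, here is a minimal sketch of pushing the filtering and aggregation into the query itself with a parameterised SELECT, so that only the rows and columns you actually need ever reach pandas. The column names (date, value) and the date filter are assumptions made up for the example, not part of your original schema.

import sqlite3
import pandas as pd

db = sqlite3.connect(r'D:\data.db')

# hypothetical columns and filter: select only what is needed and aggregate in SQL,
# so the DataFrame handed back to Python is already as small as possible
query = """
    SELECT d.date, SUM(d.value) AS total_value
    FROM daily d
    WHERE d.id = ? AND d.date >= ?
    GROUP BY d.date
"""
t = pd.read_sql(query, db, params=(1, '2020-01-01'))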

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 sophros