How to speed up a SQLAlchemy query?

I have a table with over 10 million rows and roughly 50+ columns. The table stores sensor data/parameters. Say I need to query data for a whole day, i.e. 86,400 seconds. The query takes roughly 20 seconds or more to complete.

I have added individual indices on a few columns: recordTimestamp (which stores when the data was captured), deviceId (the identifier of the sensor), and positionValid (whether the GPS geolocation is valid). I then added a composite index covering all three columns.
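For reference, such a composite index could be declared in SQLAlchemy roughly as follows; the index name and column order here are assumptions for illustration, not taken from the actual schema:

from sqlalchemy import Index

# Hypothetical composite index over the three filtered columns;
# the name and the column order are illustrative assumptions.
Index(
    'ix_datatable_device_ts_valid',
    Datatable.deviceId,
    Datatable.recordTimestamp,
    Datatable.positionValid,
)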

Below is my query:

import time

from sqlalchemy import and_, select

t1 = time.time()
conn = engine.connect()
select_statement = select([Datatable]).where(and_(
    Datatable.recordTimestamp >= start_date,
    Datatable.recordTimestamp <= end_date,
    Datatable.deviceId == device_id,
    Datatable.positionValid != None,  # rendered as IS NOT NULL in SQL
    Datatable.recordTimestamp % query_interval == 0))
lol_data = conn.execute(select_statement).fetchall()
conn.close()
t2 = time.time()
time_taken = t2 - t1
print('Select: ', time_taken)  # time_taken is a float, so pass it as a separate argument

Below is my EXPLAIN ANALYZE statement:

EXPLAIN ANALYZE SELECT datatable.id, datatable."createdAt", datatable."analogInput01", datatable."analogInput02", datatable."analogInput03", datatable."analogInput04", datatable."analogInput05", datatable."analogInput06", datatable."analogInput07", datatable."canEngineRpm", datatable."canEngineTemperature", datatable."canFuelConsumedLiters", datatable."canFuelLevel", datatable."canVehicleMileage", datatable."deviceId", datatable."deviceTemperature", datatable."deviceInternalVoltage", datatable."deviceExternalVoltage", datatable."deviceAntennaCut", datatable."deviceEnum", datatable."deviceVehicleMileage", datatable."deviceSimSignal", datatable."deviceSimStatus", datatable."iButton01", datatable."iButton02", datatable."recordSequence", datatable."recordTimestamp", datatable."accelerationAbsolute", datatable."accelerationBrake", datatable."accelerationBump", datatable."accelerationTurn", datatable."accelerationX", datatable."accelerationY", datatable."accelerationZ", datatable."positionAltitude", datatable."positionDirection", datatable."positionSatellites", datatable."positionSpeed", datatable."positionLatitude", datatable."positionLongitude", datatable."positionHdop", datatable."positionMovement", datatable."positionValid", datatable."positionEngine" FROM datatable WHERE datatable."recordTimestamp" >= 1519744521 AND datatable."recordTimestamp" <= 1519745181 AND datatable."deviceId" = '864495033990901' AND datatable."positionValid" IS NOT NULL AND datatable."recordTimestamp" % 1 = 0;

Below is the result of EXPLAIN ANALYZE for that SELECT:

Index Scan using "ix_dataTable_recordTimestamp" on dataTable (cost=0.44..599.35 rows=5 width=301) (actual time=0.070..10.487 rows=661 loops=1)
  Index Cond: (("recordTimestamp" >= 1519744521) AND ("recordTimestamp" <= 1519745181))
  Filter: (("positionValid" IS NOT NULL) AND (("deviceId")::text = '864495033990901'::text) AND (("recordTimestamp" % 1) = 0))
  Rows Removed by Filter: 6970
Planning time: 0.347 ms
Execution time: 10.658 ms

By contrast, below is the time taken as measured in Python:

Select:  47.98712515830994 
JSON:  0.19731807708740234

Below is the profiling output for my code:

10302 function calls (10235 primitive calls) in 12.612 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000   12.595   12.595 /Users/afeezaziz/Projects/Bursa/backend/env/lib/python3.6/site-packages/sqlalchemy/engine/base.py:882(execute)
    1    0.000    0.000   12.595   12.595 /Users/afeezaziz/Projects/Bursa/backend/env/lib/python3.6/site-packages/sqlalchemy/sql/elements.py:267(_execute_on_connection)
    1    0.000    0.000   12.595   12.595 /Users/afeezaziz/Projects/Bursa/backend/env/lib/python3.6/site-packages/sqlalchemy/engine/base.py:1016(_execute_clauseelement)
    1    0.000    0.000   12.592   12.592 /Users/afeezaziz/Projects/Bursa/backend/env/lib/python3.6/site-packages/sqlalchemy/engine/base.py:1111(_execute_context)
    1    0.000    0.000   12.590   12.590 /Users/afeezaziz/Projects/Bursa/backend/env/lib/python3.6/site-packages/sqlalchemy/engine/default.py:506(do_execute)
    1   12.590   12.590   12.590   12.590 {method 'execute' of 'psycopg2.extensions.cursor' objects}
    1    0.000    0.000    0.017    0.017 /Users/afeezaziz/Projects/Bursa/backend/env/lib/python3.6/site-packages/sqlalchemy/engine/result.py:1113(fetchall)
    1    0.000    0.000    0.017    0.017 /Users/afeezaziz/Projects/Bursa/backend/env/lib/python3.6/site-packages/sqlalchemy/engine/result.py:1080(_fetchall_impl)
    1    0.008    0.008    0.017    0.017 {method 'fetchall' of 'psycopg2.extensions.cursor' objects}


Solution 1:[1]

Try using Postgres' built-in COPY, or, if you really need to retrieve the result in Python (e.g. you can't write directly to disk via COPY), use COPY via psycopg2's copy_expert function:

cur = conn.cursor()  # conn is an open psycopg2 connection

# `query` is the SELECT statement to export, as a plain SQL string
outputquery = "COPY ({0}) TO STDOUT WITH CSV HEADER".format(query)

# stream the result straight to a CSV file on disk
with open('resultsfile', 'w') as f:
    cur.copy_expert(outputquery, f)

conn.close()

This should avoid serialization altogether.
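If the rows really are needed in Python rather than on disk, the same COPY trick works with an in-memory buffer. A minimal sketch, assuming conn is an open psycopg2 connection and query is the same SELECT string as above:

import csv
import io

cur = conn.cursor()
outputquery = "COPY ({0}) TO STDOUT WITH CSV HEADER".format(query)

# copy into an in-memory buffer instead of a file on disk
buf = io.StringIO()
cur.copy_expert(outputquery, buf)
buf.seek(0)

# parse the CSV back into plain Python dicts, one per row
rows = list(csv.DictReader(buf))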

Solution 2:[2]

  • According to your query and your indexes, you have likely already done what you can on the indexing side: the query filters on "recordTimestamp", "deviceId" and "positionValid", so just make sure the composite index over those 3 columns actually exists.
  • I think the problem is in "select([Datatable])": this selects all the columns, and as you describe, with 50+ columns it takes time to serialize the data and send it to the client. To be clear, the index only helps your "executing time" (the time to find the result) but does nothing for your "fetching time" (the time spent in "lol_data = conn.execute(select_statement).fetchall()").
  • Solution: if you don't want to change the structure of the table, just select the columns you need, as sketched below. Better still, split the table into 2 tables: one holding the parameters, the other holding "deviceId", "recordTimestamp" and the value. You can also replace the string "deviceId" with an integer key, since comparing and sending strings takes more time than integers.
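A minimal sketch of the narrowed query; the particular columns chosen here are an illustrative assumption, keep whichever ones you actually need:

from sqlalchemy import and_, select

# select only the columns that are actually needed instead of all 50+;
# this particular column list is an illustrative assumption
select_statement = select([
    Datatable.recordTimestamp,
    Datatable.positionLatitude,
    Datatable.positionLongitude,
    Datatable.positionSpeed,
]).where(and_(
    Datatable.recordTimestamp >= start_date,
    Datatable.recordTimestamp <= end_date,
    Datatable.deviceId == device_id,
    Datatable.positionValid != None,
    Datatable.recordTimestamp % query_interval == 0))

lol_data = conn.execute(select_statement).fetchall()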

Solution 3:[3]

SQLAlchemy is just the connector to the database; the query itself runs entirely on the database side.

You can achieve the optimization with the help of stored procedures together with SQLAlchemy. Here is a good read on improving the way you use it: the SQLAlchemy collection docs.
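As an illustration of the stored-procedure route: assuming a hypothetical Postgres function get_device_day that encapsulates the day query (the name and signature are invented for this sketch, not part of the question), calling it through SQLAlchemy could look like:

from sqlalchemy import text

# get_device_day is a hypothetical server-side function; its name and
# parameters are illustrative assumptions, not part of the question
stmt = text("SELECT * FROM get_device_day(:device_id, :start_ts, :end_ts)")
rows = conn.execute(stmt,
                    device_id=device_id,
                    start_ts=start_date,
                    end_ts=end_date).fetchall()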

If you are using a MySQL database, you should also try the MySQLdb API, which is a bit faster than SQLAlchemy, because MySQLdb is object-oriented specifically for MySQL operations and iterations.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: fny
Solution 2: Kien.N
Solution 3: LAMRIN TAWSRAS