Pandas to_sql() slow on one DataFrame but fast on others

Goal

I'm trying to use pandas DataFrame.to_sql() to send a large DataFrame (>1M rows) to an MS SQL Server database.

Problem

The command is significantly slower on one particular DataFrame, taking about 130 seconds to send 10,000 rows, whereas a similar DataFrame sends the same number of rows in just 7 seconds. The faster DataFrame actually has more columns and more data as measured by df.memory_usage(deep=True).
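For reference, the per-column comparison of dtypes and deep memory usage can be sketched like this (df_slow and df_fast are placeholder names for the two DataFrames):

import pandas as pd

# df_slow / df_fast are placeholder names for the two DataFrames.
# Object columns holding long or mixed-type values often dominate
# deep memory usage and can behave very differently during inserts.
def summarize(df: pd.DataFrame, label: str) -> None:
    summary = pd.DataFrame({
        'dtype': df.dtypes,
        'bytes': df.memory_usage(deep=True, index=False),
    })
    print(f'--- {label} ---')
    print(summary.sort_values('bytes', ascending=False))

summarize(df_slow, 'slow')
summarize(df_fast, 'fast')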

Details

The SQLAlchemy engine is created via

from sqlalchemy import create_engine

engine = create_engine(
    'mssql+pyodbc://@<server>/<db>?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True,
)

The to_sql() call is as follows, with chunksize = 10000:

df[i:i+chunksize].to_sql(table, conn, index=False, if_exists='replace')
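Roughly, the surrounding loop looks like this (a sketch; only the single call above is given, and the switch to if_exists='append' after the first chunk is an assumption):

chunksize = 10000
for i in range(0, len(df), chunksize):
    # Assumption: 'replace' drops and recreates the table, so it is
    # used for the first chunk only and later chunks append.
    mode = 'replace' if i == 0 else 'append'
    df[i:i+chunksize].to_sql(table, conn, index=False, if_exists=mode)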

I've attempted to locate the bottleneck via cProfile, but this only revealed that nearly all of the time is spent in pyodbc.Cursor.executemany.
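A minimal sketch of that profiling setup, assuming a single 10,000-row chunk:

import cProfile
import pstats

# Profile one chunk's insert; nearly all of the cumulative time
# is reported under pyodbc.Cursor.executemany.
with cProfile.Profile() as pr:
    df[:10000].to_sql(table, conn, index=False, if_exists='replace')

pstats.Stats(pr).sort_stats('cumulative').print_stats(20)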

Any tips for debugging would be appreciated!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
