Optimal way to store data from Pandas to Snowflake

The dataframe is huge (7-8 million rows). I tried to_sql with chunksize = 5000, but it never finished.

Using,

from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

# engine creation shown here with placeholder credentials
engine = create_engine(URL(account='<account>', user='<user>', password='<password>',
                           database='<database>', schema='<schema>', warehouse='<warehouse>'))

df.to_sql(snowflake_table, engine, if_exists='replace', index=False, index_label=None, chunksize=20000)

What are other optimal ways to store data in Snowflake from a Pandas DataFrame? Or what am I doing wrong here? The DataFrame is usually 7-10 million rows.



Solution 1:[1]

The optimal way, as ilja-everila pointed out, is "COPY INTO ...". Since Snowflake requires the CSV to be staged in cloud storage before loading, I was hesitant to do it, but it seems to be the only option given the performance: 5-10 minutes for 6.5 million records.
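For anyone wanting the shape of that approach, here is a minimal sketch using the Python connector; the credentials, table name, file path and file format options below are placeholders I am assuming, not details from the original post:

import snowflake.connector

# placeholder credentials -- fill in for your account
conn = snowflake.connector.connect(
    account='<account>', user='<user>', password='<password>',
    warehouse='<warehouse>', database='<database>', schema='<schema>')
cur = conn.cursor()

# upload the local CSV to the table's internal stage, then bulk load it
cur.execute("PUT file:///tmp/my_data.csv @%MY_TABLE AUTO_COMPRESS=TRUE")
cur.execute("""
    COPY INTO MY_TABLE
    FROM @%MY_TABLE
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")
conn.close()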

Solution 2:[2]

The least painful way I can imagine is to dump the file to S3 and have Snowpipe load it into Snowflake automatically. With that set up, you don't have to execute any COPY command or make any Snowflake calls at all.

Refer to the Snowflake documentation for details on how to set up Snowpipe for S3. In short, you need to create a stage, a target table, a file format (I guess you already have these in place, though) and a pipe. Then set up SQS notifications for your bucket that the pipe will listen to.
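As a rough sketch of that one-time setup, run through the Python connector (the object names, bucket and credentials below are placeholders of mine, not taken from the question):

import snowflake.connector

conn = snowflake.connector.connect(
    account='<account>', user='<user>', password='<password>',
    database='<database>', schema='<schema>')
cur = conn.cursor()

# external stage pointing at the S3 prefix the CSV chunks are written to
cur.execute("""
    CREATE STAGE IF NOT EXISTS my_s3_stage
    URL = 's3://mybucket/test/'
    CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# pipe that copies new files from the stage into the target table;
# AUTO_INGEST = TRUE makes it listen for the S3 event notifications
cur.execute("""
    CREATE PIPE IF NOT EXISTS my_pipe AUTO_INGEST = TRUE AS
    COPY INTO my_table FROM @my_s3_stage
""")

# SHOW PIPES exposes the pipe's notification_channel (an SQS ARN),
# which is what the bucket's event notifications should point at
cur.execute("SHOW PIPES")
conn.close()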

Snowflake suggests having files sized around 10-100 MB, so it is likely a good idea to split the file.

import numpy as np
import s3fs

# set up credentials (s3fs is built on boto, hence this is AWS specific)
fs = s3fs.S3FileSystem(key=key, secret=secret)

# number of files to split into
n_chunks = 2

# loop over dataframe and dump chunk by chunk to S3
# (you likely want to expand file naming logic to avoid overwriting existing files)
for f_name, chunks in enumerate(np.array_split(np.arange(df.shape[0]), n_chunks)):
    bytes_to_write = df.iloc[chunks].to_csv(index=False).encode()
    with fs.open('s3://mybucket/test/dummy_{}.csv'.format(f_name), 'wb') as f:
        f.write(bytes_to_write)

For reference, I tried this with a 7M row dataframe split into 5 files of around 40 MB each. It took around 3 minutes and 40 seconds from starting to split the dataframe until all rows had arrived in Snowflake.

Solution 3:[3]

When using SQLAlchemy, could you also add paramstyle=qmark to the connection parameters, so that data is bound rather than inlined? This is also referenced here: https://github.com/snowflakedb/snowflake-connector-python/issues/37#issuecomment-365503841
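As a sketch of what that could look like (the connection details are placeholders, and passing paramstyle through connect_args is my assumption about how to hand it to the underlying connector; see the linked issue for the exact recommendation):

from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

# hand paramstyle='qmark' to the Snowflake connector so that values are
# bound instead of being inlined into the generated INSERT statements
engine = create_engine(
    URL(account='<account>', user='<user>', password='<password>',
        database='<database>', schema='<schema>', warehouse='<warehouse>'),
    connect_args={'paramstyle': 'qmark'})

df.to_sql(snowflake_table, engine, if_exists='replace', index=False, chunksize=20000)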

After this change, if you feel it appropriate, it may be a good idea to do a performance comparison between the SQLAlchemy approach and the bulk-load approach of writing the large DF to files and using COPY INTO to load the files into the Snowflake table.

Solution 4:[4]

pandas does an 'insert into ...' with multiple values behind the scenes. Snowflake has a restriction of 16,384 records per ingestion. Please change your chunksize to 16384.
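Applied to the call from the question, that would be roughly:

df.to_sql(snowflake_table, engine, if_exists='replace', index=False, index_label=None, chunksize=16384)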

Solution 5:[5]

Snowflake provides the write_pandas and pd_writer helper functions to manage that:

from snowflake.connector.pandas_tools import pd_writer

df.to_sql(snowflake_table, engine, index=False, method=pd_writer)
#                                                      ^ here

The pd_writer() function uses write_pandas():

write_pandas(): Writes a Pandas DataFrame to a table in a Snowflake database

To write the data to the table, the function saves the data to Parquet files, uses the PUT command to upload these files to a temporary stage, and uses the COPY INTO command to copy the data from the files to the table.
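If you would rather call write_pandas() directly instead of going through to_sql(), a minimal sketch (the connection details and table name below are placeholders) could look like this:

import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account='<account>', user='<user>', password='<password>',
    warehouse='<warehouse>', database='<database>', schema='<schema>')

# stages the DataFrame as Parquet files and COPYs them into MY_TABLE
success, num_chunks, num_rows, _ = write_pandas(conn, df, 'MY_TABLE')
print(success, num_chunks, num_rows)
conn.close()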

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 abdthe1
Solution 2
Solution 3 Seeling Cheung
Solution 4 Mr-_-Boy
Solution 5 Didi Bear