'Handling UUID values in Arrow with Parquet files

I'm new to Python and Pandas - please be gentle!

I'm using SqlAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file:

  engine = sal.create_engine(connectionString)

  conn = engine.connect()
  df = pd.read_sql(query, con=conn)
  df.to_parquet(outputFile)

The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:

pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')

Is there any way I can force all UUIDs to strings at any point in the above chain of events?

A few extra notes:

  • The goal for this portion of code was to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
  • I realise I can do something like df['rowguid'] = df['rowguid'].astype(str), but it relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object and each query will be different.
  • I also know I can convert it to a char(36) in the SQL query itself, however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem accidentally all the time / doesn't have to remember to always convert the datatype.

Any ideas?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source