Setting a divisible index column in dask dataframe

What's the best way to set an index column in the dask read_sql_table function if I have no integer or time series data to partition with? I have a table with just strings in the columns.

I thought about using ROW_NUMBER() in a query to create an incrementing number that dask can use to divide the large dataset, and passing that query to dask's read_sql_query() function, but it doesn't work.

My code is:

import dask.dataframe as dd
from sqlalchemy import text

sql_query = text("SELECT ROW_NUMBER() OVER(ORDER BY(Select 0)) row_num, * FROM my_table")
    
df = dd.read_sql_query(sql_query, con="mssql+pyodbc://username:password@server/database?driver", index_col='row_num')

However, dask raises this error:

AttributeError: 'TextClause' object has no attribute 'limit'

Can I make this work, or is there a way to use string columns for partitioning?
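
For context on the error: the dask documentation describes the sql argument of read_sql_query as a SQLAlchemy Selectable and notes that TextClause is not supported, which matches the AttributeError above. On the second half of the question, read_sql_table accepts a divisions argument, and its documentation gives a string example (list('acegikmoqsuwz')), so a string index column can in principle be partitioned directly. The block below is only a rough sketch of one way to choose such boundaries. The helper string_divisions, the column name "name", and the sample keys are all hypothetical and not from the original post, and the keys are assumed to have been fetched from the database already sorted (for example with ORDER BY), because dask compares against the boundaries inside SQL rather than in Python.

```python
def string_divisions(keys_sorted_by_db, npartitions):
    """Pick up to npartitions + 1 boundary values from already-sorted string keys.

    Hypothetical helper: keys_sorted_by_db must be in the database's own
    sort order, since the boundaries end up in SQL WHERE clauses.
    """
    if npartitions < 1:
        raise ValueError("npartitions must be at least 1")
    n = len(keys_sorted_by_db)
    if n == 0:
        raise ValueError("need at least one key")
    # Evenly spaced positions from the first key to the last key.
    positions = [i * (n - 1) // npartitions for i in range(npartitions + 1)]
    # Drop repeated boundaries while keeping their order.
    bounds = list(dict.fromkeys(keys_sorted_by_db[p] for p in positions))
    if len(bounds) < 2:
        raise ValueError("need at least two distinct keys to form a partition")
    return bounds


# Made-up sample of index values, as if returned by
# SELECT DISTINCT name FROM my_table ORDER BY name
sample_keys = ["adams", "baker", "clark", "davis", "evans",
               "frank", "green", "hill", "irwin", "jones"]
divisions = string_divisions(sample_keys, 3)
print(divisions)

# Assumed usage with dask (not executed here); "name" is a hypothetical
# string column and con_uri stands for the connection string:
# df = dd.read_sql_table("my_table", con_uri, index_col="name", divisions=divisions)
```

The first and last boundaries should be the minimum and maximum of the index column, otherwise rows outside that range would be left out. If the keys are heavily skewed, evenly spaced positions over a distinct-key sample may still give uneven partitions, so this is a starting point rather than a tuned solution.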



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow