Setting a divisible index column in dask dataframe
What's the best way to set an index column in the dask read_sql_table function if I have no integer or time series data to partition with? I have a table with just strings in the columns.
I thought about using ROW_NUMBER() in a query to create an incrementing number that dask can use to divide the large dataset, and passing that query to dask's read_sql_query() function, but it doesn't seem to work:
My code is:
import dask.dataframe as dd
from sqlalchemy import text
sql_query = text("SELECT ROW_NUMBER() OVER(ORDER BY(Select 0)) row_num, * FROM my_table")
df = dd.read_sql_query(sql_query, con="mssql+pyodbc://username:password@server/database?driver", index_col='row_num')
However, dask gives me the error:
AttributeError: 'TextClause' object has no attribute 'limit'
Can I make this work or is there a way to use string columns for partitioning?
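The error suggests that dask calls `.limit()` on the query object, which a `text()` clause does not have; the dask documentation describes the `sql` argument of `read_sql_query` as a SQLAlchemy Selectable. Separately, a range filter on a generated row number only works if the window function sits in a subquery, because `row_num` cannot be referenced in the `WHERE` clause of the same `SELECT` that defines it. The sketch below illustrates that partitioning idea with the standard-library `sqlite3` module instead of dask or SQL Server. The table contents, the column names `name` and `city`, the helper `read_partition`, and the partition count are all assumptions made for illustration, and SQLite 3.25 or newer is assumed for window function support.

```python
import sqlite3

# Hypothetical stand-in for my_table: string columns only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, city TEXT)")
rows = [(f"name{i}", f"city{i % 3}") for i in range(10)]
conn.executemany("INSERT INTO my_table VALUES (?, ?)", rows)

# The window function lives in a subquery so the outer query can filter on
# row_num. A deterministic ORDER BY is used here (assumption: name is unique),
# because every partition query recomputes the numbering.
numbered = (
    "SELECT ROW_NUMBER() OVER (ORDER BY name) AS row_num, name, city "
    "FROM my_table"
)


def read_partition(lower, upper):
    """Return the rows whose row_num satisfies lower <= row_num < upper."""
    query = (
        f"SELECT * FROM ({numbered}) AS numbered "
        "WHERE row_num >= ? AND row_num < ?"
    )
    return conn.execute(query, (lower, upper)).fetchall()


# Split the row_num range 1..total into roughly equal half-open intervals.
total = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
npartitions = 3
step = -(-total // npartitions)  # ceiling division
bounds = [
    (start, min(start + step, total + 1))
    for start in range(1, total + 1, step)
]

partitions = [read_partition(lo, hi) for lo, hi in bounds]
for (lo, hi), part in zip(bounds, partitions):
    print(f"row_num in [{lo}, {hi}): {len(part)} rows")
```

With dask, the analogous step would presumably be to build the same shape with SQLAlchemy's `select()` over a subquery and pass that object, rather than `text()`, to `read_sql_query`; that variant is not tested here. Note that `ORDER BY (SELECT 0)` gives no guaranteed ordering, so separate partition queries could number the rows differently and return overlapping or missing rows.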
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
