Pandas SQLite read_sql_query (too many columns)
I have a dataset with 566,000 rows and 25,000 columns, and I would like to quickly read sections of it with the sqlite3 Python package. All the information I can find covers only tables with a few columns.
import sqlite3
import pandas as pd

# Create a new database file:
db = sqlite3.connect("db.sqlite")

# Load the CSV in chunks:
for c in pd.read_csv("dataframe.csv",  # usecols=['id_col'],
                     chunksize=1000):
    # Append each chunk to a table named 'db':
    c.to_sql('db', db, if_exists="append")

# Add an index on 'id_col':
db.execute("CREATE INDEX id_col ON db(id_col)")
db.close()
Running this gives this error:
sqlite3.OperationalError: too many columns on HumanProtAtlas_scRNA
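For context, SQLite enforces a compile-time column limit (`SQLITE_MAX_COLUMN`, 2000 by default), so creating a 25,000-column table fails no matter how pandas issues the statement. A minimal sketch reproducing the limit, assuming a default SQLite build (the table name `wide` is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Build a CREATE TABLE statement with more columns than the
# default SQLITE_MAX_COLUMN limit of 2000:
cols = ", ".join(f"c{i} INTEGER" for i in range(3000))
try:
    conn.execute(f"CREATE TABLE wide ({cols})")
except sqlite3.OperationalError as e:
    print(e)  # reports "too many columns"
conn.close()
```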
Running it with the commented-out parameter usecols=['id_col'] succeeds and creates an indexed SQL table that can be read with the following:
def read_filtered_df(id):
    conn = sqlite3.connect("db.sqlite")
    # Parameterized query instead of string formatting:
    q = "SELECT * FROM db WHERE id_col = ?"
    return pd.read_sql_query(q, conn, params=(id,))
The subsetting by the desired id is correct. However, the resulting dataframe contains only the id_col column and not the other 25,000 columns. Is there a way to use this index to read the original dataframe? Or to include all columns in the indexed SQL table?
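One possible workaround (a sketch, not from the original post) is to store the data in long format, which sidesteps the column limit entirely: melt each chunk into (id_col, column, value) rows, index id_col, and pivot back to wide after filtering. A tiny in-memory CSV stands in for the real 25,000-column file here:

```python
import io
import sqlite3
import pandas as pd

# Hypothetical small wide CSV standing in for the real file:
csv = io.StringIO("id_col,g1,g2,g3\na,1,2,3\nb,4,5,6\n")

db = sqlite3.connect(":memory:")
for chunk in pd.read_csv(csv, chunksize=1):
    # Melt the wide chunk into long (id_col, column, value) rows:
    long = chunk.melt(id_vars="id_col", var_name="column", value_name="value")
    long.to_sql("long_db", db, if_exists="append", index=False)
db.execute("CREATE INDEX idx_id ON long_db(id_col)")

# Filter by id, then pivot back to the original wide shape:
rows = pd.read_sql_query(
    "SELECT * FROM long_db WHERE id_col = ?", db, params=("a",))
wide = rows.pivot(index="id_col", columns="column", values="value")
db.close()
```

The trade-off is a much taller table (rows × columns entries), but every query stays well under SQLite's column limit and the id_col index is still used.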
Kind regards,
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow