'Yet another SQL on Pandas DFs question. Help me end these once and for all

I frequently pull data from different databases into pandas dataframes, do some processing, and then eventually hit a processing step that I would strongly prefer to write in SQL (often involving joins across DFs). I figured I'd write a quick function which would:

  • accept an arbitrary sql query and set of relevant dataframes
  • write the dataframes to a tmp sqlite3 db
  • execute the SQL query in sqlite3 and return results as a dataframe

The function I came up with looks like this:

def SQL_on_DFs(query:str, dfs=[]) -> pd.DataFrame:
    """convenience function to write sql on arbitrary DFs
        ...execute in a temp sqlite db and return results as DF
    """
    # validation - every item in dfs list must be a dataframe
    assert [isinstance(x, pd.DataFrame) for x in dfs].all()

    # if no temp.db exists here, sqlite3 will create one
    sql3conn = sqlite3.connect("temp.db")
    
    # write dfs to sqlite3 db to perform sql query on them there
    for df in dfs:
        df.to_sql(sql3conn)
    
    result = pd.read_sql_query(query, sql3conn)
    return result

# TEST
df_ab = pd.DataFrame({"a": [1,2,3], "b": [2,2,2]})
df_ac = pd.DataFrame({"a": [1,2,3], "c": [3,3,3]})
df_abc = pd.DataFrame({"a": [1,2,3], "b": [2,2,2], "c": [3,3,3]})

q = "select ab.*, ac.c from df_ab ab left join df_ac ac on ab.a = ac.a"
assert SQL_on_DFs(q, [df_ab, df_ac]).equals(df_abc)

Super useful, super easy. Right?! Wrong! Don't you feel like a sucker. Definitely you and not me.

df.to_sql(sql3conn) throws an error. It says it's missing 1 required positional argument: 'con', but we've passed a con. What it's missing is the other required positional parameter, name of the table to write to sqlite3 as. So I need a way to get that the "name" of df_ab should be "df_ab", and the "name" of df_ac should be "df_ac".

Now I know what you're thinking: Max Power, why don't you just change your second function parameter from dfs = [df_ab, df_ac], to [(df_ab, "df_ab"), (df_ac, "df_ac")]?

Please don't make me do that. This is supposed to be a convenience function and that looks inconvenient



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source