'Loading "pivoted" data with pyarrow (or, "stack" or "melt" for pyarrow.Table)

I have large-ish CSV files in "pivoted" format: rows and columns are categorical, and values are a homogeneous data type.

What's the best (memory- and compute-efficient) way to load such a file into a pyarrow.Table with an "unpivoted" schema? In other words, given a CSV file with n rows and m columns, how do I get a pyarrow.Table with n*m rows and a single value column (plus columns identifying the original row and column)?

In terms of pandas, I think I want the pyarrow equivalent of pandas.DataFrame.melt() or .stack().

For example...

  • given this CSV file
item,A,B
item_0,0,0
item_1,370,1
item_2,43,0
  • I want this pyarrow.Table
  item  group  value
item_0      A      0
item_0      B      0
item_1      A    370
item_1      B      1
item_2      A     43
item_2      B      0
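
For reference, here is a rough pandas sketch of the transformation I'm after (assuming the CSV above is saved as data.csv):

import pandas as pd
import pyarrow as pa

# read the pivoted CSV and melt it: one row per (item, group) pair
df = pd.read_csv("data.csv")
melted = df.melt(id_vars="item", var_name="group", value_name="value")

# hand the result to Arrow
table = pa.Table.from_pandas(melted, preserve_index=False)

This works, but it materializes everything in pandas first, which is what I'd like to avoid for large files.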


Solution 1:[1]

PyArrow has some limited compute functionality, but it does not support melt at the moment. You can see what is available here: https://arrow.apache.org/docs/python/api/compute.html#

One alternative is to create the melted table yourself:

import pyarrow as pa
import pyarrow.csv

table = pa.csv.read_csv("data.csv")

# build one small table per value column, then concatenate them
tables = []
for column_name in table.schema.names[1:]:
    tables.append(pa.Table.from_arrays(
        [
            table.column(0),                                        # the "item" column
            pa.array([column_name] * table.num_rows, pa.string()),  # the column name, repeated
            table[column_name],                                     # that column's values
        ],
        names=[
            table.schema.names[0],
            "group",
            "value",
        ],
    ))

result = pa.concat_tables(tables)
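
Note that concatenating the per-column tables gives all of column A's rows followed by all of column B's rows, not the interleaved ordering shown in the question. If that ordering matters, the result can be sorted afterwards (a sketch, assuming pyarrow >= 7.0, where Table.sort_by is available):

# optional: order rows as in the question's expected output
result = result.sort_by([("item", "ascending"), ("group", "ascending")])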

Another option is to use Polars (pola-rs), which is similar to pandas but uses Arrow as its back end. Unlike PyArrow, it has many more compute functions, including melt: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.melt.html
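
For example, a minimal sketch (the file name data.csv and the column names are taken from the question):

import polars as pl

# melt with Polars, then hand the result back as a pyarrow.Table
df = pl.read_csv("data.csv")
melted = df.melt(id_vars="item", variable_name="group", value_name="value")
table = melted.to_arrow()  # columns: item, group, value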

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: 0x26res