How to read a C struct (or NumPy record array) into a Polars DataFrame?

I have a binary file containing records from a C struct. I would like to read that file into a Polars DataFrame.

I can do this as shown below, but is there a more direct path?

My current solution involves:

  • Reading the file into a NumPy record array (see below) using np.fromfile()
  • Converting that into a Pandas DataFrame
  • Converting that to a Polars DataFrame

import numpy as np
import pandas as pd
import polars as pl

# Data read in from file using np.fromfile()
data = np.array([(1, 2002, 2, 13, 0.3),
                 (2, 2005, 1, -10, 1.5),
                 (3, 2004, 2, 54, -0.12)],
    dtype=[("id", "<i4"),("yr", "<u2"),("sex", "<u2"),("val1", "<i2"),("val2", "<f4")]
)
df = pl.from_pandas(pd.DataFrame(data))
df
 id   yr    sex val1    val2
i32  u16    u16  i16     f32
  1 2002      2   13     0.3
  2 2005      1  -10     1.5
  3 2004      2   54   -0.12
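
The comment in the snippet above says the array comes from np.fromfile(). A minimal sketch of that step follows, assuming the C struct is packed (no compiler padding) and little-endian, matching the dtype in the question. The temporary file and the struct.pack format string "<iHHhf" are stand-ins for the real binary file and are not part of the original question.

```python
import os
import struct
import tempfile

import numpy as np

# Field layout taken from the question; "<" means little-endian.
record_dtype = np.dtype(
    [("id", "<i4"), ("yr", "<u2"), ("sex", "<u2"), ("val1", "<i2"), ("val2", "<f4")]
)

rows = [(1, 2002, 2, 13, 0.3), (2, 2005, 1, -10, 1.5), (3, 2004, 2, 54, -0.12)]

# Stand-in for the real file (assumption): write packed records the way a
# C program with a packed {int32, uint16, uint16, int16, float} struct would.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as fh:
    for row in rows:
        fh.write(struct.pack("<iHHhf", *row))
    path = fh.name

try:
    # One array element per record; each field becomes a named column.
    data = np.fromfile(path, dtype=record_dtype)
finally:
    os.remove(path)

print(data["yr"])
```

If the record size in the file does not equal record_dtype.itemsize, the fields will be misread, so it is worth comparing the two before trusting the result.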

I've tried loading the data directly into Polars with pl.DataFrame(data) or pl.from_records(data), but in both cases I get a single-column DataFrame of type "object", which I can't work out how to split into separate columns or convert to a struct.



Solution 1:[1]

import numpy as np
import polars as pl

data = np.array([(1, 2002, 2, 13, 0.3),
                 (2, 2005, 1, -10, 1.5),
                 (3, 2004, 2, 54, -0.12)],
    dtype=[("id", "<i4"),("yr", "<u2"),("sex", "<u2"),("val1", "<i2"),("val2", "<f4")]
)

# Build one Polars column per field of the structured array.
df = pl.DataFrame(
    {
        field_name: data[field_name]
        for field_name in data.dtype.fields
    }
)
df
┌─────┬──────┬─────┬──────┬───────┐
│ id  ┆ yr   ┆ sex ┆ val1 ┆ val2  │
│ --- ┆ ---  ┆ --- ┆ ---  ┆ ---   │
│ i32 ┆ u16  ┆ u16 ┆ i16  ┆ f32   │
╞═════╪══════╪═════╪══════╪═══════╡
│ 1   ┆ 2002 ┆ 2   ┆ 13   ┆ 0.3   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 2005 ┆ 1   ┆ -10  ┆ 1.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 2004 ┆ 2   ┆ 54   ┆ -0.12 │
└─────┴──────┴─────┴──────┴───────┘
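
The reason the per-field dictionary works is that indexing a structured array by field name returns an ordinary one-dimensional array with a plain dtype. A small NumPy-only sketch of what each data[field_name] looks like follows; the layout dictionary is a hypothetical name used only for illustration.

```python
import numpy as np

data = np.array(
    [(1, 2002, 2, 13, 0.3), (2, 2005, 1, -10, 1.5), (3, 2004, 2, 54, -0.12)],
    dtype=[("id", "<i4"), ("yr", "<u2"), ("sex", "<u2"), ("val1", "<i2"), ("val2", "<f4")],
)

# For each field, record the plain dtype and the stride of the extracted array.
# The stride equals the full record size because the values are interleaved
# with the other fields in memory.
layout = {
    name: (str(data[name].dtype), data[name].strides)
    for name in data.dtype.names
}

for name, (dtype_name, strides) in layout.items():
    print(name, dtype_name, strides)
```

Each field array is a strided view into the records rather than a contiguous buffer, so some copying is probably unavoidable when the columns are handed to Polars. Note that data.dtype.names is a tuple in field order and can be used in place of data.dtype.fields when iterating.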

To convert back to a numpy struct array, assign a numpy array per field:

# Create numpy struct array of the correct size.
numpy_struct_array = np.empty(df.height, data.dtype)

# Fill in the correct values.
for field, col in zip(data.dtype.fields, df.columns):
    numpy_struct_array[field] = df.get_column(col).to_numpy()

numpy_struct_array
array([(1, 2002, 2,  13,  0.3 ), (2, 2005, 1, -10,  1.5 ),
       (3, 2004, 2,  54, -0.12)],
      dtype=[('id', '<i4'), ('yr', '<u2'), ('sex', '<u2'), ('val1', '<i2'), ('val2', '<f4')])
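
One caveat when the file really comes from a C struct: the dtype used here is packed, at 14 bytes per record. If the C compiler inserted padding, the dtype has to account for it. The sketch below assumes typical C alignment rules and uses NumPy's align=True option to compare the two layouts; whether the real file is packed or padded depends on how the struct was compiled, which the question does not state.

```python
import numpy as np

fields = [("id", "<i4"), ("yr", "<u2"), ("sex", "<u2"), ("val1", "<i2"), ("val2", "<f4")]

packed = np.dtype(fields)               # no padding between fields
aligned = np.dtype(fields, align=True)  # padding as a C compiler would typically add

# Byte offset of each field within one record, for both layouts.
packed_offsets = {name: packed.fields[name][1] for name in packed.names}
aligned_offsets = {name: aligned.fields[name][1] for name in aligned.names}

print(packed.itemsize, packed_offsets)
print(aligned.itemsize, aligned_offsets)
```

In the aligned layout the float is moved to a 4-byte boundary, so the record grows from 14 to 16 bytes. Checking that the file size is a multiple of the chosen itemsize is a quick sanity test.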

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1