How can I filter or select sub-fields of StructType columns in PyArrow?

I'm looking for a way to filter and/or select sub-fields of StructType columns. For example in this table:

import pyarrow as pa

pylist = [
    {'int': 1, 'str': 'a', 'struct': {'sub': 1, 'sub2': 3}},
    {'int': 2, 'str': 'b', 'struct': {'sub': 2, 'sub2': 3}}
]
my_table = pa.Table.from_pylist(pylist)

my_table["struct"]

I want a way to select struct.sub. Is this possible?

Ideally, I'd like to be able to filter based on values in the sub-field. Something like this:

my_table.filter(pa.compute.equal(my_table.column('struct').field('sub'), 1))


Solution 1:[1]

Would flattening the table work for your use case?

>>> my_table.flatten()
pyarrow.Table
int: int64
str: string
struct.sub: int64
struct.sub2: int64
----
int: [[1,2]]
str: [["a","b"]]
struct.sub: [[1,2]]
struct.sub2: [[3,3]]

You can then do something like this:

>>> my_table.flatten()["struct.sub"]
<pyarrow.lib.ChunkedArray object at 0x7fac31ff9b20>
[
  [
    1,
    2
  ]
]

Solution 2:[2]

In PyArrow 7.0.0 you can use the struct_field compute kernel, which selects a child of the struct by its positional index:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>> 
>>> pa.__version__
'7.0.0'
>>> 
>>> pylist = [
...     {'int': 1, 'str': 'a', 'struct':{'sub': 1, 'sub2':3}}, 
...     {'int': 2, 'str': 'b', 'struct':{'sub': 2, 'sub2':3}}
... ]
>>> my_table = pa.Table.from_pylist(pylist)
>>> 
>>> # Select
>>> pc.struct_field(my_table['struct'], [0])
<pyarrow.lib.ChunkedArray object at 0x7fec2f499cb0>
[
  [
    1,
    2
  ]
]
>>> pc.struct_field(my_table['struct'], [1])
<pyarrow.lib.ChunkedArray object at 0x7fec2f499d50>
[
  [
    3,
    3
  ]
]
>>> 
>>> # Filter
>>> my_table.filter(pc.equal(pc.struct_field(my_table['struct'], [0]), 1))
pyarrow.Table
int: int64
str: string
struct: struct<sub: int64, sub2: int64>
  child 0, sub: int64
  child 1, sub2: int64
----
int: [[1]]
str: [["a"]]
struct: [
  -- is_valid: all not null
  -- child 0 type: int64
    [
      1
    ]
  -- child 1 type: int64
    [
      3
    ]]

In PyArrow 8.0.0 you can also use the dataset query engine, where ds.field('struct', 'sub') refers to the nested field by name:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>> 
>>> pa.__version__
'8.0.0.dev477'
>>> 
>>> pylist = [
...     {'int': 1, 'str': 'a', 'struct':{'sub': 1, 'sub2':3}}, 
...     {'int': 2, 'str': 'b', 'struct':{'sub': 2, 'sub2':3}}
... ]
>>> my_table = pa.Table.from_pylist(pylist)
>>> 
>>> # Select
>>> ds.dataset(my_table).to_table(columns={'sub': ds.field('struct', 'sub')})
pyarrow.Table
sub: int64
----
sub: [[1,2]]
>>> 
>>> # Filter
>>> ds.dataset(my_table).to_table(filter=(ds.field('struct', 'sub') == 1))
pyarrow.Table
int: int64
str: string
struct: struct<sub: int64, sub2: int64>
  child 0, sub: int64
  child 1, sub2: int64
----
int: [[1]]
str: [["a"]]
struct: [
  -- is_valid: all not null
  -- child 0 type: int64
[1]
  -- child 1 type: int64
[3]]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  raulcumplido
Solution 2  li.davidm