How can I filter or select sub-fields of StructType columns in PyArrow?
I'm looking for a way to filter and/or select sub-fields of StructType columns. For example, in this table:
import pyarrow as pa

pylist = [
    {'int': 1, 'str': 'a', 'struct': {'sub': 1, 'sub2': 3}},
    {'int': 2, 'str': 'b', 'struct': {'sub': 2, 'sub2': 3}}
]
my_table = pa.Table.from_pylist(pylist)
my_table["struct"]
I want a way to select struct.sub. Is this possible?
Ideally, I'd like to be able to filter based on values in the sub-field. Something like this:
my_table.filter(pa.compute.equal(my_table.column('struct').field('sub'), 1))
Solution 1:[1]
Would flattening the table work for your use case?
>>> my_table.flatten()
pyarrow.Table
int: int64
str: string
struct.sub: int64
struct.sub2: int64
----
int: [[1,2]]
str: [["a","b"]]
struct.sub: [[1,2]]
struct.sub2: [[3,3]]
You can then do something like this:
>>> my_table.flatten()["struct.sub"]
<pyarrow.lib.ChunkedArray object at 0x7fac31ff9b20>
[
[
1,
2
]
]
Solution 2:[2]
In PyArrow 7.0.0 you can use the struct_field compute kernel, which selects a child field by its index:
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>>
>>> pa.__version__
'7.0.0'
>>>
>>> pylist = [
... {'int': 1, 'str': 'a', 'struct':{'sub': 1, 'sub2':3}},
... {'int': 2, 'str': 'b', 'struct':{'sub': 2, 'sub2':3}}
... ]
>>> my_table = pa.Table.from_pylist(pylist)
>>>
>>> # Select
>>> pc.struct_field(my_table['struct'], [0])
<pyarrow.lib.ChunkedArray object at 0x7fec2f499cb0>
[
[
1,
2
]
]
>>> pc.struct_field(my_table['struct'], [1])
<pyarrow.lib.ChunkedArray object at 0x7fec2f499d50>
[
[
3,
3
]
]
>>>
>>> # Filter
>>> my_table.filter(pc.equal(pc.struct_field(my_table['struct'], [0]), 1))
pyarrow.Table
int: int64
str: string
struct: struct<sub: int64, sub2: int64>
child 0, sub: int64
child 1, sub2: int64
----
int: [[1]]
str: [["a"]]
struct: [
-- is_valid: all not null
-- child 0 type: int64
[
1
]
-- child 1 type: int64
[
3
]]
In PyArrow 8.0.0 you can also use the dataset query engine, which accepts nested field references:
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.dataset as ds
>>>
>>> pa.__version__
'8.0.0.dev477'
>>>
>>> pylist = [
... {'int': 1, 'str': 'a', 'struct':{'sub': 1, 'sub2':3}},
... {'int': 2, 'str': 'b', 'struct':{'sub': 2, 'sub2':3}}
... ]
>>> my_table = pa.Table.from_pylist(pylist)
>>>
>>> # Select
>>> ds.dataset(my_table).to_table(columns={'sub': ds.field('struct', 'sub')})
pyarrow.Table
sub: int64
----
sub: [[1,2]]
>>>
>>> # Filter
>>> ds.dataset(my_table).to_table(filter=(ds.field('struct', 'sub') == 1))
pyarrow.Table
int: int64
str: string
struct: struct<sub: int64, sub2: int64>
child 0, sub: int64
child 1, sub2: int64
----
int: [[1]]
str: [["a"]]
struct: [
-- is_valid: all not null
-- child 0 type: int64
[1]
-- child 1 type: int64
[3]]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | raulcumplido |
| Solution 2 | li.davidm |
