Chaining filters with pyarrow

I am trying to search a table in pyarrow using multiple parameters. It looks like filters can be chained, but I am missing the magical incantation to make it actually work.

The table is loaded from CSV and its structure is fine: filtering on a single condition returns the expected results.

Chaining the filters:

table.filter(
   compute.equal(table['a'], a_val)
).filter(
   compute.equal(table['b'], b_val)
).filter(
   compute.equal(table['c'], c_val)
)

Results in an error:

pyarrow.lib.ArrowInvalid: Filter inputs must all be the same length
    

I suspect the issue is that the second filter is on the original table and not the filtered output of the first filter.



Solution 1:[1]

You can combine the two conditions into a single mask with and_ and filter once:

import pyarrow as pa
import pyarrow.compute as compute


table = pa.Table.from_arrays(
    [
        pa.array([1,2,2], pa.int32()),
        pa.array(["foo","bar","hello"], pa.string())
    ],
    ['a', 'b']
)


compute.filter(
    table,
    compute.and_(
        compute.equal(table['a'], 2),
        compute.equal(table['b'], 'hello'),
    )
)

Solution 2:[2]

I believe your suspicion is correct.

The first call to table.filter returns a table that is smaller than the original, but the expression in the second filter call is still computed from the original table, so its mask is now too long for the table being filtered.
To fix this, assign the result back to a variable after the first call, so that the next mask is built from the filtered table.
For instance:

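# Rebind the name so that later masks are computed from the filtered table.
table = table.filter(
   compute.equal(table['a'], a_val)
)
# table['b'] now refers to the filtered table, so the mask length matches.
result = table.filter(
   compute.equal(table['b'], b_val)
)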

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  0x26res
Solution 2  Jonas V