Iterating over pandas DataFrame rows pairwise

Is there a faster way to iterate over the rows of a pandas DataFrame pairwise to do some calculations? My code below is not fast enough. I wonder if pandas has a workaround for this.

I started with iterrows, then found itertuples to be faster, but it is still not fast enough.


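def pairwisecalculate(df):
    sim = []
    # Compare every row with every row (including itself): O(n^2) pairs.
    for row_1 in df.itertuples():
        for row_2 in df.itertuples():
            matches = 0.
            # Note: itertuples() puts the index at position 0, so this loop
            # compares the index and every column except the last one.
            for i, c in enumerate(df.columns):
                if row_1[i] == row_2[i]:
                    matches += 1
            sim.append(matches / (len(df.columns) - 1))
    return sim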


Solution 1:[1]

You can try:

df.rolling(2).sum() / (len(df.columns) - 1)
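
Note that rolling(2).sum() adds each row to the one before it, column by column. If you need the all-pairs comparison from the question instead, NumPy broadcasting is one possible route. The following is only a sketch under some assumptions: pairwise_similarity is a hypothetical helper name, it compares the actual columns (not the index), and it divides by the number of columns rather than len(df.columns) - 1, so adjust the denominator if one column should be excluded. It builds an n x n x m boolean array, so memory can become a limit for large frames.

```python
import numpy as np
import pandas as pd


def pairwise_similarity(df: pd.DataFrame) -> np.ndarray:
    """Fraction of equal column values for every pair of rows (hypothetical helper)."""
    values = df.to_numpy()
    # Broadcasting (n, 1, m) against (1, n, m) yields an (n, n, m) boolean array
    # where entry [i, j, k] is True when rows i and j agree in column k.
    equal = values[:, None, :] == values[None, :, :]
    # Count the matching columns per pair and normalise by the column count.
    return equal.sum(axis=2) / values.shape[1]


# Small made-up frame, for illustration only.
example_df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 3]})
sim = pairwise_similarity(example_df)
print(sim)
```

The result is an n x n matrix rather than the flat list returned in the question; sim.ravel().tolist() gives the same row-major ordering as the nested loops.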

Solution 2:[2]

You can also try Polars (https://www.pola.rs/), for example its rolling functions such as https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Series.rolling_var.html

Source: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Series.rolling_apply.html#polars.Series.rolling_apply

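import polars as pl
s = pl.Series("A", [1.0, 2.0, 9.0, 2.0, 13.0])
# rolling std over 3 values; later Polars releases rename this method to rolling_map
s.rolling_apply(function=lambda s: s.std(), window_size=3)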
shape: (5,)
Series: 'A' [f64]
[
    null
    null
    4.358898943540674
    4.041451884327381
    5.5677643628300215
]

Other Apache Arrow implementations (https://arrow.apache.org/docs/python/pandas.html) are also worth a look if you are aiming for speed.
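
For comparison, the rolling standard deviation from the Polars example can also be written with pandas' built-in rolling window, without a Python lambda. This is a minimal sketch using the same made-up series. Both libraries default to the sample standard deviation (ddof=1), so the numbers should match the Polars output above, with NaN in place of null.

```python
import pandas as pd

# Same values as the Polars example.
s = pd.Series([1.0, 2.0, 9.0, 2.0, 13.0], name="A")

# Built-in rolling standard deviation over a window of 3 values.
# The first two entries are NaN because the window is not yet full.
rolling_std = s.rolling(window=3).std()
print(rolling_std)
```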

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Corralien
Solution 2