Xarray: grouping by contiguous identical values
In pandas, it is simple to split a series (or array) such as [1,1,1,1,2,2,1,1,1,1] into groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:
datagroups= df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())
...where the individual groups are selected by df[key][variable] == some condition. Runs that satisfy some condition but are not contiguous become separate groups. If the condition were x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
I am attempting to do the same thing in the xarray package, because I am working with multidimensional data, but the above syntax doesn't work there.
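As a minimal, self-contained sketch of the pandas approach described above: a plain Series and the condition x < 2 stand in for df[key][variable] and some condition, both of which are assumptions made purely for illustration.

```python
import pandas as pd

s = pd.Series([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])

# Keep only the rows that satisfy the condition (here: value < 2).
selected = s[s < 2]

# A jump other than 1 in the surviving integer index marks the start of a new run.
run_id = selected.index.to_series().diff().ne(1).cumsum()

# Collect each contiguous run as a plain list.
runs = [group.tolist() for _, group in selected.groupby(run_id)]
print(runs)  # [[1, 1, 1, 1], [1, 1, 1, 1]]
```

This relies on the index being a default integer range, so that a gap in the index corresponds to rows removed by the condition.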
What I have been successful doing so far:
a) apply some condition to separate the values I want by NaNs:
datagroups_notsplit = df[key].where(df[key][variable] == some condition)
So now I have groups as in the example above, [1,1,1,1,NaN,NaN,1,1,1,1] (if some condition was x < 2). The question is, how do I cut these groups so that the result becomes [1,1,1,1], [1,1,1,1]?
b) Alternatively, group by some condition...
datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])
But then, following the example above, I end up with groups [1,1,1,1,1,1,1,1], [2,2]. Is there a way to then split these groups wherever the index values are non-contiguous?
Solution 1:[1]
Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just work around the missing pandas functionality, with something like:
import pandas as pd
import xarray as xr
dat = xr.DataArray([1,1,1,1,2,2,1,1,1,1], dims='x')
# Use `diff()` to get groups of contiguous values
(dat.diff('x') != 0)
# ...prepend a leading 0 (pedantic syntax for xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')
# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum() )
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
The xarray "How do I ..." page could use some recipes like this ("Group contiguous values"); I suggest contacting the maintainers to have one added.
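The steps above can be collected into a small helper. This is only a sketch: the function name contiguous_run_ids and the group name "run_id" are hypothetical, it assumes a 1-D DataArray, and it computes the change flags with NumPy rather than xr.concat to keep the example short.

```python
import numpy as np
import xarray as xr


def contiguous_run_ids(da, dim):
    """Label each run of identical consecutive values along `dim` (1-D input assumed)."""
    values = da.values
    # True wherever a value differs from its predecessor; the first element starts run 0.
    changed = np.concatenate([[False], values[1:] != values[:-1]])
    # The cumulative sum of the change flags gives one integer label per run.
    return xr.DataArray(changed.cumsum(), dims=dim, name="run_id")


dat = xr.DataArray([1, 1, 1, 1, 2, 2, 1, 1, 1, 1], dims="x")
run_id = contiguous_run_ids(dat, "x")

# Iterating over the groupby object yields (label, group) pairs.
runs = [group.values.tolist() for _, group in dat.groupby(run_id)]
print(runs)  # [[1, 1, 1, 1], [2, 2], [1, 1, 1, 1]]
```

Giving the grouping array an explicit name avoids relying on whatever default name groupby assigns to an unnamed group.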
Solution 2:[2]
My use case was a bit more complicated than the minimal example I posted, due to the use of time-series indices and the need to subselect certain conditions. However, I was able to adapt smci's answer above in the following way:
(1) create indexnumber variable:
df = Dataset( data_vars={ 'some_data' : (('date'), some_data), 'more_data' : (('date'), more_data), 'indexnumber' : (('date'), arange(0,len(date_arr))) }, coords={ 'date' : date_arr } )
(2) get the indices for the groupby groups:
ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes
(3) get the cumsum field:
sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()
(4) reconstitute a new df:
df2 = df.loc[ind_slice]
(5) add the cumsum field:
df2['sumcum'] = sumcum
(6) groupby:
groups = df2.groupby(df2['sumcum'])
Hope this helps anyone else out there looking to do this.
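For reference, here is a runnable sketch of the same idea on synthetic data. The arrays, the condition more_data == 1, and the name run_id are all assumptions made for illustration. It selects rows with isel instead of where/dropna, which avoids casting integers to float, and it prepends a 0 to the run labels because diff shortens the array by one element and would otherwise drop the first selected date.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-ins for `some_data`, `more_data` and `date_arr`.
date_arr = pd.date_range("2020-01-01", periods=10, freq="D")
more_data = np.array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])
ds = xr.Dataset(
    data_vars={
        "some_data": ("date", np.arange(10.0)),
        "more_data": ("date", more_data),
        "indexnumber": ("date", np.arange(len(date_arr))),
    },
    coords={"date": date_arr},
)

# Keep only the dates that satisfy the condition (here: more_data == 1).
kept = ds.isel(date=np.flatnonzero(ds["more_data"].values == 1))

# A gap other than 1 between surviving positions starts a new run.
positions = kept["indexnumber"].values
run_id = np.concatenate([[0], np.cumsum(np.diff(positions) != 1)])
kept = kept.assign_coords(run_id=("date", run_id))

# Group on the run labels; each group is a Dataset covering one contiguous run.
runs = [group["some_data"].values.tolist() for _, group in kept.groupby("run_id")]
print(runs)  # [[0.0, 1.0, 2.0, 3.0], [6.0, 7.0, 8.0, 9.0]]
```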
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | smci |
| Solution 2 | EJSABOLK |
