TypeError after vectorizing a function
I wrote a function that takes a dataframe (df) storing every combination of keys together with its minimal and maximal values, followed by the keys themselves and an amount. The function returns whether the amount lies within or outside the expected range for that combination of keys.
def within_range(df,CoCode,Lease_type,
                 Position,Mvt_Type,BKPF_WAERS,
                 BSEG_BSCHL,COBL_KOSTL,Amount):
    print(CoCode,Lease_type,Position,Mvt_Type,BKPF_WAERS,BSEG_BSCHL,COBL_KOSTL,Amount)
    mask=(df['CoCode']==CoCode)&(df['Lease_type']==Lease_type)&\
         (df['Position']==Position)&(df['Mvt_Type']==Mvt_Type)&\
         (df['BKPF-WAERS']==BKPF_WAERS)&(df['BSEG-BSCHL']==BSEG_BSCHL)&\
         (df['COBL-KOSTL']==COBL_KOSTL)
    mini = float(df.loc[mask,'min'].values)
    maxi = float(df.loc[mask,'max'].values)
    if mini <= Amount <= maxi:
        return 'OK, within range'
    else:
        return f'{str(Amount)} is outside range [{str(mini)};{str(maxi)}]'
If I test it with the following values:
within_range(df=df_3,CoCode='2510',Lease_type='1',Position='17310C',Mvt_Type='F30',BKPF_WAERS='HUF',BSEG_BSCHL='50',COBL_KOSTL='2510DDA-612121-01.C',Amount=2442000.0)
I get exactly the expected output: 'OK, within range'
Next, I vectorized the function with np.vectorize and applied it to a second dataframe that I need to check. Its first row corresponds exactly to the case tested successfully above.
This is how I called the function:
df_test['in_range']=np.vectorize(within_range)(df=df_3,
    CoCode=df_test['BKPF-BUKRS'],
    Lease_type=df_test['COBL-AUFNR'].str[5:6],
    Position=df_test['BSEG-HKONT'].str[0:6],
    Mvt_Type=df_test['BSEG-HKONT'].str[6:],
    BKPF_WAERS=df_test['BKPF-WAERS'],
    BSEG_BSCHL=df_test['BSEG-BSCHL'],
    COBL_KOSTL=df_test['COBL-KOSTL'],
    Amount=df_test['BSEG-WRBTR'],
    )
From the embedded print, I can see that the first row corresponds exactly to the test above:
2510 1 17310C F30 HUF 50 2510DDA-612121-01.C 2442000.0
The problem is that, instead of populating the new column 'in_range' with the result of the function (within range or outside range), I get a long TypeError message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-50-0d714785e5d5> in <module>
----> 1 df_test['in_range']=np.vectorize(within_range)(df=df_3,
2 CoCode=df_test['BKPF-BUKRS'].values,
3 Lease_type=df_test['COBL-AUFNR'].str[5:6].values,
4 Position=df_test['BSEG-HKONT'].str[0:6].values,
5 Mvt_Type=df_test['BSEG-HKONT'].str[6:].values,
c:\users\forszpaniak\appdata\local\programs\python\python39\lib\site-packages\numpy\lib\function_base.py in __call__(self, *args, **kwargs)
2106 vargs.extend([kwargs[_n] for _n in names])
2107
-> 2108 return self._vectorize_call(func=func, args=vargs)
2109
2110 def _get_ufunc_and_otypes(self, func, args):
c:\users\forszpaniak\appdata\local\programs\python\python39\lib\site-packages\numpy\lib\function_base.py in _vectorize_call(self, func, args)
2184 res = func()
2185 else:
-> 2186 ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
2187
2188 # Convert args to object arrays first
c:\users\forszpaniak\appdata\local\programs\python\python39\lib\site-packages\numpy\lib\function_base.py in _get_ufunc_and_otypes(self, func, args)
2144
2145 inputs = [arg.flat[0] for arg in args]
-> 2146 outputs = func(*inputs)
2147
2148 # Performance note: profiling indicates that -- for simple
c:\users\forszpaniak\appdata\local\programs\python\python39\lib\site-packages\numpy\lib\function_base.py in func(*vargs)
2101 the_args[_i] = vargs[_n]
2102 kwargs.update(zip(names, vargs[len(inds):]))
-> 2103 return self.pyfunc(*the_args, **kwargs)
2104
2105 vargs = [args[_i] for _i in inds]
<ipython-input-47-ef44db83b86c> in within_range(df, CoCode, Lease_type, Position, Mvt_Type, BKPF_WAERS, BSEG_BSCHL, COBL_KOSTL, Amount)
1 def within_range(df,CoCode,Lease_type,Position,Mvt_Type,BKPF_WAERS,BSEG_BSCHL,COBL_KOSTL,Amount):
2 print(CoCode,Lease_type,Position,Mvt_Type,BKPF_WAERS,BSEG_BSCHL,COBL_KOSTL,Amount)
----> 3 mask=(df['CoCode']==CoCode)&(df['Lease_type']==Lease_type)&\
4 (df['Position']==Position)&(df['Mvt_Type']==Mvt_Type)&\
5 (df['BKPF-WAERS']==BKPF_WAERS)&(df['BSEG-BSCHL']==BSEG_BSCHL)&\
TypeError: string indices must be integers
I looked at previous questions about similar TypeErrors, and I passed the values explicitly (e.g. CoCode=df_test['BKPF-BUKRS'].values) to get the actual values rather than a tuple. I still get the same message and don't see why.
Did I misunderstand how vectorizing works, or am I not allowed to build the 'mask' inside a vectorized function?
NOTE 30/04/2022:
I moved the mask and the computation of the mini and maxi values out of the vectorized function into a separate function that the vectorized one calls. This is what it looks like, and it works fine:
def get_minimax(df,CoCode,Lease_type,Position,Mvt_Type,BKPF_WAERS,BSEG_BSCHL,COBL_KOSTL):
    mask=(df['CoCode']==CoCode)&(df['Lease_type']==Lease_type)&\
         (df['Position']==Position)&(df['Mvt_Type']==Mvt_Type)&\
         (df['BKPF-WAERS']==BKPF_WAERS)&(df['BSEG-BSCHL']==BSEG_BSCHL)&\
         (df['COBL-KOSTL']==COBL_KOSTL)
    try:
        mini = float(df.loc[mask,'min'].values)
        maxi = float(df.loc[mask,'max'].values)
    except (TypeError, ValueError):
        # No single matching row: fall back to an unbounded range
        mini = 0.0
        maxi = np.inf
    return mini, maxi
def within_range(CoCode,Lease_type,Position,Mvt_Type,BKPF_WAERS,BSEG_BSCHL,COBL_KOSTL,Amount):
    print(CoCode,Lease_type,Position,Mvt_Type,BKPF_WAERS,BSEG_BSCHL,COBL_KOSTL,Amount)
    mini,maxi = get_minimax(df_3,CoCode,Lease_type,Position,Mvt_Type,BKPF_WAERS,BSEG_BSCHL,COBL_KOSTL)
    print (mini,maxi,Amount)
    if float(mini) <= float(Amount) <= float(maxi):
        return 'OK, within range'
    else:
        return f'{str(Amount)} is outside range [{str(mini)};{str(maxi)}]'
In the code above, only within_range is vectorized, not get_minimax. It looks like filters can't be vectorized. Is my assumption correct?
Solution 1:[1]
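Removing the dataframe from the arguments of the vectorized function makes the error disappear, which shows that the problem comes from that argument and not from the mask. np.vectorize treats every argument as an array to iterate over, so within_range receives a single element of df_3 (here a string) instead of the whole dataframe, and df['CoCode'] then raises "string indices must be integers".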
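The behaviour can be reproduced on a toy table. This is only a minimal sketch: the names lookup and lookup_min and the column layout are hypothetical stand-ins for df_3 and within_range, not the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lookup table df_3.
lookup = pd.DataFrame({"key": ["a", "b"], "min": [0.0, 10.0], "max": [5.0, 20.0]})

def lookup_min(df, key):
    mask = df["key"] == key  # fails here if df is a single string cell
    return float(df.loc[mask, "min"].iloc[0])

# Called directly, the function receives the whole table.
direct = lookup_min(lookup, "a")

# np.vectorize converts each non-excluded argument to an array, so the
# function is handed one element of the table rather than the table itself.
first_cell = np.asarray(lookup).flat[0]

try:
    np.vectorize(lookup_min)(lookup, np.array(["a", "b"]))
    error = None
except TypeError as exc:
    error = exc  # indexing the string cell with "key" fails

print(type(first_cell).__name__, repr(error))
```

The filter itself is not the issue: the same mask works when the function is called directly, and only breaks once the dataframe has been reduced to one of its cells.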
NumPy's documentation mentions the following:
> The excluded argument can be used to prevent vectorizing over certain arguments. This can be useful for array-like arguments of a fixed length [...]
Considering my dataframe as a 'fixed length array-like' argument, I changed the initial code as follows:
df_test['in_range2']=np.vectorize(within_range2,excluded=['df'])(
    df=df_3,
    CoCode=df_test['BKPF-BUKRS'],
    Lease_type=df_test['COBL-AUFNR'].str[5:6],
    Position=df_test['BSEG-HKONT'].str[0:6],
    Mvt_Type=df_test['BSEG-HKONT'].str[6:],
    BKPF_WAERS=df_test['BKPF-WAERS'],
    BSEG_BSCHL=df_test['BSEG-BSCHL'].values,
    COBL_KOSTL=df_test['COBL-KOSTL'].values,
    Amount=df_test['BSEG-WRBTR'].values,
    )
The function now works fine, and I find this exclusion much more elegant than moving the dataframe lookups into a separate function.
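For readers who want to try the pattern without the original data, here is a small self-contained sketch. The tables ranges and data, the function check and the "no range defined" fallback are all hypothetical. Passing otypes=[object] is an optional choice, not part of the fix: it tells np.vectorize the output type up front, so no extra trial call on the first element is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table standing in for df_3.
ranges = pd.DataFrame({
    "key": ["a", "b"],
    "min": [0.0, 10.0],
    "max": [5.0, 20.0],
})

def check(df, key, amount):
    # df arrives as the full DataFrame because it is excluded below.
    row = df.loc[df["key"] == key]
    if row.empty:
        return "no range defined"
    mini, maxi = float(row["min"].iloc[0]), float(row["max"].iloc[0])
    if mini <= amount <= maxi:
        return "OK, within range"
    return f"{amount} is outside range [{mini};{maxi}]"

# Hypothetical data to check, standing in for df_test.
data = pd.DataFrame({"key": ["a", "b", "c"], "amount": [3.0, 25.0, 1.0]})

# 'df' is excluded by name, so it must be passed as a keyword argument.
vec_check = np.vectorize(check, excluded=["df"], otypes=[object])
data["in_range"] = vec_check(df=ranges, key=data["key"], amount=data["amount"])
print(data)
```

Note that excluded matches keyword arguments by name and positional arguments by index, so a dataframe passed positionally as the first argument would need excluded=[0] instead.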
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | JCF |
