Scikit-learn with a custom scoring function using a 'feature'

I am trying to use a new metric called 'SERA' (Squared Error Relevance Area) as a custom scoring function for imbalanced regression, as described in this paper: https://link.springer.com/article/10.1007/s10994-020-05900-9

Here is what the paper says, in brief. To calculate SERA, a user-defined feature known as 'relevance' is required for each feature-label pair. Relevance ranges from 0 (not relevant) to 1 (highly relevant).

This is the procedure for calculating SERA. A relevance threshold (phi) is varied from 0 to 1 in small steps. For each value of phi (e.g. 0.45), the subset of the training dataset whose relevance is greater than or equal to that value is selected. For that subset, the sum of squared errors, i.e. sum((y_true - y_pred)**2), is calculated; this is known as the squared error relevance (SER). SER is then plotted against phi, and the area under that curve is SERA.
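The procedure above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's reference implementation: the function name `sera` and the parameter `n_steps` are hypothetical, the relevance values are assumed to be already aligned row by row with `y_true`, and the area is approximated with the trapezoidal rule rather than Simpson's rule.

```python
import numpy as np

def sera(y_true, y_pred, relevance, n_steps=1000):
    """Approximate SERA for aligned 1-D inputs (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relevance = np.asarray(relevance, dtype=float)

    # Per-sample squared errors, computed once.
    sq_err = (y_true - y_pred) ** 2

    # Relevance thresholds phi from 0 to 1 in equal steps.
    phis = np.linspace(0.0, 1.0, n_steps + 1)

    # SER(phi): sum of squared errors over samples with relevance >= phi.
    ser = np.array([sq_err[relevance >= phi].sum() for phi in phis])

    # Area under the SER-vs-phi curve (trapezoidal rule, an assumption here).
    return float(np.sum((ser[1:] + ser[:-1]) * np.diff(phis)) / 2.0)

# Only the third sample has an error, and its relevance is 1,
# so SER stays at 4 for every threshold.
example = sera([1, 2, 3], [1, 2, 5], [0.0, 0.5, 1.0])
```

Because each SER value is a single number, the integral is a single number too, which is the kind of value a scoring function is expected to return.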

Here is the code I have written in Python for sklearn using make_scorer. When I run it, I get the errors shown after the code.

import pandas as pd
from scipy.integrate import simps
from sklearn.metrics import make_scorer

def calc_sera(y_true, y_pred, x_relevance=None):

    # creating a list from 0 to 1 with 0.001 interval
    start_range = 0
    end_range = 1
    interval_size = 0.001

    list_1 = [round(val * interval_size, 3) for val in range(1, 1000)]
    list_1.append(start_range)
    list_1.append(end_range)
    epsilon = sorted(list_1, key=lambda x: float(x))

    # Initiating lists to store relevance(phi) and squared-error relevance (ser)
    relevance = []
    ser = []

    # Converting the dataframe to a numpy array
    rel_arr = x_relevance.to_numpy()
    # selecting a phi value
    for phi in epsilon:
        relevance.append(phi)
        error_squared_sum = 0
        for i in rel_arr:
            # Getting the subset of the training data
            if i >= phi:
                # Error calculation
                error_squared_sum += (y_true - y_pred)**2
        ser.append(error_squared_sum)

    # squared-error relevance area (sera)
    # numerical integration using simps(y, x)

    sera = simps(ser, relevance)

    return sera

score = make_scorer(calc_sera, x_relevance=df['Relevance'], greater_is_better=False)   

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  return array(a, dtype, copy=False, order=order)
job exception: scoring must return a number, got [.....] (<class 'numpy.ndarray'>) instead. (scorer=score)

Can anyone please help me with this?
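One plausible reading of the error messages: inside the inner loop, (y_true - y_pred)**2 is the whole error vector rather than the error of row i, so error_squared_sum turns into an array and the scorer ends up returning an array instead of a number. In addition, the relevance column passed through make_scorer covers the full DataFrame, while y_true during cross-validation is only one fold. The sketch below illustrates one way to handle both points for a single threshold. It assumes the target is passed as a pandas Series, so that each fold keeps its original index; the name `ser_at` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def ser_at(y_true, y_pred, relevance, phi):
    """Squared error relevance at one threshold phi (hypothetical helper).

    y_true    : pandas Series for the current fold (index = original rows)
    relevance : pandas Series for the full dataset, same index space
    """
    # Pick only the relevance values that belong to this fold's rows.
    rel = relevance.loc[y_true.index].to_numpy(dtype=float)

    # Per-sample squared errors for the fold.
    sq_err = (y_true.to_numpy(dtype=float) - np.asarray(y_pred, dtype=float)) ** 2

    # Masking and summing yields one scalar, not an array.
    return float(sq_err[rel >= phi].sum())

# Toy data: relevance for four rows, and a fold holding rows 11 and 13.
relevance = pd.Series([0.0, 1.0, 1.0, 0.0], index=[10, 11, 12, 13])
y_fold = pd.Series([1.0, 2.0], index=[11, 13])
y_pred = np.array([3.0, 5.0])

ser_low = ser_at(y_fold, y_pred, relevance, 0.0)   # both rows selected
ser_high = ser_at(y_fold, y_pred, relevance, 0.5)  # only row 11 selected
```

A full SERA function built on this idea would integrate these scalars over phi and could then be wrapped with make_scorer(..., greater_is_better=False), passing the relevance Series as an extra keyword argument as in the original code.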



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
