NDCG as scoring function with GridSearchCV and stratified data?
I'm working on a learning-to-rank task. The dataset has a column thread_id, which is a group label (stratified data).
In the evaluation phase I must take these groups into account, because my scoring function (e.g. nDCG) works on a per-thread basis.
Now, if I implement nDCG with the signature scorer(estimator, X, y), I can easily pass it to GridSearchCV as the scoring function, as in the example below:
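    from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

    def my_nDCG(estimator, X, y):
        # group by X['thread_id']
        # compute the result
        return result

    splitter = GroupShuffleSplit(...).split(X, groups=X['thread_id'])
    # param_grid: the hyperparameter grid to search (not shown)
    cv = GridSearchCV(clf, param_grid, cv=splitter, scoring=my_nDCG)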
GridSearchCV selects the model by calling my_nDCG().
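A minimal sketch of what the per-thread computation could look like once the group labels are available. It assumes linear gain with a log2 discount and ignores ties; the helper names `dcg` and `mean_ndcg_by_group` are hypothetical and not part of scikit-learn.

```python
import math
from collections import defaultdict


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    # Linear gain, log2 discount; i is the 0-based rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def mean_ndcg_by_group(y_true, y_score, groups):
    """Average nDCG over groups (threads); hypothetical helper."""
    by_group = defaultdict(list)
    for rel, score, g in zip(y_true, y_score, groups):
        by_group[g].append((score, rel))

    values = []
    for pairs in by_group.values():
        # Order the true relevances by descending predicted score.
        ranked = [rel for _, rel in sorted(pairs, key=lambda p: p[0], reverse=True)]
        ideal = dcg(sorted(ranked, reverse=True))
        if ideal > 0:  # skip threads that contain no relevant item
            values.append(dcg(ranked) / ideal)
    return sum(values) / len(values) if values else 0.0


threads = ["a", "a", "a", "b", "b"]
labels = [1, 0, 0, 0, 1]
perfect = mean_ndcg_by_group(labels, [0.9, 0.2, 0.1, 0.3, 0.8], threads)
worst = mean_ndcg_by_group(labels, [0.1, 0.2, 0.9, 0.8, 0.3], threads)
```

Skipping threads without any relevant item is a design choice made here for illustration; counting them as 0 instead would lower the average.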
Unfortunately, inside my_nDCG, X no longer has the thread_id column, because it must be dropped before X is passed to fit(); otherwise I'd train the model using thread_id as a feature.
    cv.fit(X.drop('thread_id', axis=1), y)
How can I do this without the terrible workaround of keeping thread_id aside as a global and merging it back into X inside my_nDCG()?
Is there any other way to use nDCG with scikit-learn? I see that scikit-learn supports stratified data, but proper support seems to be missing when it comes to model evaluation with such data.
Edit
I just noticed that GridSearchCV.fit() accepts a groups parameter; in my case it would still be X['thread_id'].
At this point I only need to read that parameter within my custom scoring function. How can I do that?
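One possible way to avoid a module-level global, sketched under several assumptions: X is a pandas DataFrame with a unique index, the index labels survive the cross-validation slicing, and the estimator exposes predict_proba. The scorer is built by a factory that closes over the group labels and looks them up through X.index. The factory name `make_grouped_ndcg_scorer`, the toy data, the LogisticRegression model, and the parameter grid are all illustrative choices, not something taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ndcg_score
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit


def make_grouped_ndcg_scorer(groups):
    """Build scorer(estimator, X, y) that averages nDCG over groups.

    `groups` is a pandas Series indexed like the full X (hypothetical helper).
    """
    def scorer(estimator, X, y):
        # Recover the group label of each test row from its index label.
        g = groups.loc[X.index].to_numpy()
        y_true = np.asarray(y)
        y_score = estimator.predict_proba(X)[:, 1]
        per_group = [
            ndcg_score([y_true[g == t]], [y_score[g == t]])
            for t in np.unique(g)
            if (g == t).sum() > 1  # nDCG needs at least two documents
        ]
        return float(np.mean(per_group)) if per_group else 0.0

    return scorer


# Toy data: 20 threads with 5 candidate answers each (made up for the sketch).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "thread_id": np.repeat(np.arange(20), 5),
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
})
y = (X["f1"] + 0.5 * rng.normal(size=100) > 0).astype(int)

features = X.drop(columns="thread_id")  # thread_id is not a training feature
splitter = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=splitter,
    scoring=make_grouped_ndcg_scorer(X["thread_id"]),
)
# groups is forwarded to the splitter; the scorer gets it via the closure.
search.fit(features, y, groups=X["thread_id"])
```

This still keeps thread_id outside X, but the labels are bound to the scorer explicitly instead of living in a global, and the index-based lookup avoids merging columns back into X.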
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
