XGBoost ranker is not smooth
I have the following toy problem centered around baseball.
In this baseball dataset, I have tournament results spanning many years, and for each result I can collect the following information:
- The final ranking of each team that participated
- The batting average of that team over the entire tournament
Say that 8 teams participated in one tournament, the 2005 World Cup. I can then generate the following dataframe:
Tourny, Team, Rank, Batting Avg,
2005-WC, Jays, 1, 0.45,
2005-WC, Cards, 2, 0.25,
2005-WC, Ravens, 3, 0.85,
2005-WC, Crows, 4, 0.23,
2005-WC, Jays, 5, 0.11,
...
With multiple tournaments from different years, I can extend this list and get lots of data.
I can then ask the question: is batting average useful in predicting the final rank of a team?
This seems like the kind of question that XGBRanker should be able to answer.
We can plug this data into a vanilla XGBRanker as follows:
import xgboost as xgb

model = xgb.XGBRanker(
    max_depth=10,
    learning_rate=0.01,
    n_estimators=100,
    objective='rank:pairwise',
    booster='gbtree',
    gamma=5,
    min_child_weight=1,
    subsample=0.1,
    colsample_bytree=1,
    reg_alpha=0.5,
    reg_lambda=0.5,
    base_score=0.5,
    seed=42,
)
# Double brackets keep the single feature as a 2-D (n_samples, 1) matrix
X_train = df[['Batting Avg']]
Y_train = df['Rank']
# Number of entries per tourney, in row order
# (assumes the rows of each tourney are contiguous in df)
groups = df.groupby('Tourny', sort=False).size().tolist()
model.fit(X_train.values, Y_train.values, group=groups)
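The `group` argument only lines up with the data if all rows of a tournament sit next to each other. As a minimal sketch (the helper name `group_sizes` and the toy ids are hypothetical, not part of my actual code), the sizes can be derived from the tournament column while checking that assumption:

```python
from itertools import groupby


def group_sizes(tourney_ids):
    """Return the size of each run of identical tournament ids, in row order.

    Raises ValueError if a tournament's rows are not contiguous, because the
    sizes would then no longer line up with the tournaments.
    """
    sizes = []
    seen = set()
    for tourney, rows in groupby(tourney_ids):
        if tourney in seen:
            raise ValueError(
                f"rows for {tourney!r} are not contiguous; sort by tournament first"
            )
        seen.add(tourney)
        sizes.append(sum(1 for _ in rows))
    return sizes


# Toy ids standing in for df['Tourny']; the values are made up for illustration.
example_ids = ["2005-WC"] * 5 + ["2006-WC"] * 3
example_groups = group_sizes(example_ids)
print(example_groups)
```

Since the helper only iterates over its input, it should also accept the dataframe column directly, e.g. `group_sizes(df['Tourny'])`.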
After training (which is very fast), we plot how the predicted rank changes with a team's batting average.
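import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 1, 0.001)
# predict expects a 2-D array of shape (n_samples, n_features)
y = model.predict(x.reshape(-1, 1))
plt.plot(x, y)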
Broadly, the plot shows that a better batting average gives a better rank, which is exactly what we expect. (The X values are normalized, which is why the axis does not run from 0 to 1.)
However, if you zoom in, you start finding extremely undesirable traits. Namely, a very small change in batting average can DRASTICALLY change the rank. While we do not expect a fully monotonic result, we do expect the curve to be far SMOOTHER, given that only one feature is being used.
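To put a number on "not smooth" rather than eyeballing the zoomed-in plot, here is a rough sketch of a jaggedness measure. The function name `roughness` and the two toy curves are hypothetical and only illustrate the idea. It counts how often the curve switches between going up and going down, and sums the absolute step sizes. Flat steps are ignored when counting, since a tree-based model's output is constant between split points.

```python
def roughness(y):
    """Summarise how jagged a sequence of predictions is.

    Returns (direction_changes, total_variation):
    - direction_changes: how often consecutive non-flat steps switch
      between going up and going down
    - total_variation: sum of the absolute step sizes
    """
    steps = [float(b) - float(a) for a, b in zip(y[:-1], y[1:])]
    moving = [s for s in steps if s != 0.0]  # drop flat segments
    direction_changes = sum(
        1 for s, t in zip(moving, moving[1:]) if (s > 0) != (t > 0)
    )
    total_variation = sum(abs(s) for s in steps)
    return direction_changes, total_variation


# Made-up curves: one steadily increasing, one that zig-zags.
smooth = [0.0, 0.1, 0.2, 0.3, 0.4]
jagged = [0.0, 0.5, 0.1, 0.5, 0.25]
print(roughness(smooth))
print(roughness(jagged))
```

Calling `roughness(y)` on the predictions from the plot above would give one way to compare different parameter settings: a monotonic curve has zero direction changes, and a jagged one has many.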
Can someone help me understand this? On the surface nothing is ABSOLUTELY wrong, but these are not desirable traits to have. I am not asking for monotonicity, but a smoother graph would be far more intuitive.
Which parameters do I need to tune to make it work better?
For reference, I am using Python.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow


