In contextual bandits for Vowpal Wabbit, does the --cb_explore option include training an optimum predictor (the --cb option) as well?
When using Vowpal Wabbit for contextual bandits, here is my understanding so far:
- We can build a predictor model for predicting the rewards
- We can then use an exploration strategy to choose actions (each action's predicted reward comes from the predictor model of #1 above)
I can use the --cb option to optimize a predictor based on already collected contextual bandit data. The --cb option is only for building a model that predicts the rewards; it does not perform any exploration in choosing actions (it always picks the action with the best prediction). Hence this is the functionality for #1 above. Doubly robust is the default for --cb, and you can specify other methods using the --cb_type flag.
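For concreteness, here is a minimal command-line sketch of #1. The file and model names are hypothetical; the action:cost:probability label format is VW's contextual bandit input format (note that VW works with costs to be minimized rather than rewards to be maximized):

```sh
# train.dat holds logged bandit data, one example per line:
#   action:cost:probability | features
# e.g.  1:2:0.4 | feature_a feature_b

# Train a predictor over 4 possible actions; dr (doubly robust) is the default
vw --cb 4 train.dat -f cb.model

# Same, with the estimator made explicit; ips or dm can be swapped in
vw --cb 4 --cb_type dr train.dat -f cb.model
```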
The --cb_explore option performs exploration over the actions (#2 above). What I am not sure about is which method is used for predicting the actions' rewards when I specify --cb_explore. All the examples refer to the exploration strategies and none specify the default prediction strategy used for --cb_explore.
Solution 1:
If no exploration strategy is provided, the default will be epsilon-greedy. You can see some of the other alternatives here.
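As a sketch of what this looks like on the command line (file names are hypothetical; the strategy flags shown are the documented --cb_explore alternatives):

```sh
# Default: epsilon-greedy exploration (epsilon defaults to 0.05)
vw --cb_explore 4 train.dat -f explore.model

# Epsilon-greedy with an explicit epsilon
vw --cb_explore 4 --epsilon 0.1 train.dat

# Alternative exploration strategies
vw --cb_explore 4 --first 10 train.dat             # tau-first: act uniformly for the first 10 examples
vw --cb_explore 4 --bag 5 train.dat                # bagging over 5 policies
vw --cb_explore 4 --cover 3 train.dat              # online cover with 3 policies
vw --cb_explore 4 --softmax --lambda 10 train.dat  # softmax over predicted scores
```

To my understanding, --cb_explore is implemented as a reduction on top of the --cb learner, so the --cb_type flag can be combined with it in the same way.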
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | olgavrou |
