Original theme by orderedlist (CC-BY-SA)
Where applicable, all content is licensed under a CC-BY-SA.
The canonical setup is that you have $K$ slot machines, each with a fixed but unknown probability of winning, $\theta _ k$, and the goal is to minimize regret after $T$ rounds ( $T \in [0,\infty)$ ).
Some common examples are:
| Name | Reward | Prior |
|---|---|---|
| Beta-Bernoulli Bandit | 0 or 1 | Beta distribution |
| Gaussian Bandit | $\mathbb{R}$ | Gaussian |
Some common algorithms are:
| Name | Notes |
|---|---|
| $\epsilon$-greedy | Simple but often not very good |
| UCB1 | Good, but often performs worse than Thompson sampling (?) |
| Thompson Sampling | Good (?) |
| EXP3 | Works in adversarial settings, at the cost of weaker regret guarantees |
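As a concrete baseline, $\epsilon$-greedy can be sketched in a few lines of Python (the arm probabilities, $\epsilon$, horizon, and seed below are all illustrative):

```python
import random

def epsilon_greedy(theta, T, eps=0.1, seed=0):
    """eps-greedy on K Bernoulli arms with true win probabilities `theta`."""
    rng = random.Random(seed)
    K = len(theta)
    counts = [0] * K    # pulls per arm
    means = [0.0] * K   # empirical mean reward per arm
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:
            a = rng.randrange(K)                       # explore uniformly
        else:
            a = max(range(K), key=lambda k: means[k])  # exploit current best
        r = 1.0 if rng.random() < theta[a] else 0.0
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]         # incremental mean update
        total += r
    return total

reward = epsilon_greedy([0.2, 0.5, 0.8], T=5000)
```

Note that $\epsilon$ stays fixed here, so the algorithm keeps paying a constant exploration tax even after the best arm is obvious; this is part of why it is "simple but often not very good".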
Extensions include delayed feedback, non-stationary rewards, and contextual information.
$$ \text{regret} _ t (\theta) = \max _ k ( \theta _ k - \theta _ {a _ t} ) $$
$K$ bandits, each with a fixed parameter $\theta _ k$ that determines the reward, $R(k) \in \{0,1\}$.
$$ \begin{array}{ll} \theta = ( \theta _ 0, \theta _ 1, \dots, \theta _ {K-1} ) & \\ R( k ) = \left\{ \begin{array}{ll} 1 & \text{w/ prob } \theta _ k \\ 0 & \text{w/ prob } 1-\theta _ k \\ \end{array} \right. & \\ \text{Beta}(x | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} & \\ \end{array} $$
$$ \begin{array}{ll} \text{BernoulliBanditThompsonSampling} & \\ \hline \begin{array}{ll} \text{for } t \in [0,T): & \\ \begin{array}{lll} & \text{for } k \in [0,K): \widehat{\theta} _ k \sim \text{beta}(\alpha _ k, \beta _ k) & \\ & a _ t = \arg\max _ k \widehat{\theta } _ k & \\ & r _ t = R(a _ t) & \\ & \alpha _ {a _ t} = \alpha _ {a _ t} + r _ t & \\ & \beta _ {a _ t} = \beta _ {a _ t} + (1-r _ t) & \\ \end{array} \end{array} \end{array} $$
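A minimal Python sketch of Bernoulli-bandit Thompson sampling, assuming uniform $\text{Beta}(1,1)$ priors and illustrative arm probabilities:

```python
import random

def thompson_bernoulli(theta, T, seed=0):
    """Thompson sampling with Beta(1,1) priors on K Bernoulli arms."""
    rng = random.Random(seed)
    K = len(theta)
    alpha = [1.0] * K   # posterior successes + 1
    beta = [1.0] * K    # posterior failures + 1
    total = 0.0
    for _ in range(T):
        # draw theta_hat_k ~ Beta(alpha_k, beta_k) for each arm, play the argmax
        samples = [rng.betavariate(alpha[k], beta[k]) for k in range(K)]
        a = max(range(K), key=lambda k: samples[k])
        r = 1.0 if rng.random() < theta[a] else 0.0
        alpha[a] += r          # conjugate posterior update, played arm only
        beta[a] += 1.0 - r
        total += r
    return total

reward = thompson_bernoulli([0.2, 0.5, 0.8], T=5000)
```

Exploration here is automatic: arms with few pulls have wide posteriors, so their samples occasionally exceed the current best arm's.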
Assume each bandit's reward is Gaussian, with a Normal-Inverse-Gamma prior over its mean and variance given by hyperparameters $w _ k = ( m _ k, \nu _ k, \alpha _k , \beta _ k )$, with:
$$ \begin{array}{ll} m _ k & \equiv \text{ prior mean } \\ \nu _ k & \equiv \text{ prior count } \\ \alpha _ k & \equiv \text{ prior shape } \\ \beta _ k & \equiv \text{ prior scale } \\ \end{array} $$
Below:
$$ \begin{array}{ll} N _ t ( k ) & = \sum _ {\tau = 0} ^ {t-1} \mathbb{1} _ {k = a _ \tau} \\ r _ t & = R ( a _ t ) \\ \bar{ x } _ {t,k} & = \frac{1}{N _ t ( k) } \sum _ {\tau = 0} ^ {t-1} \mathbb{1} _ { k = a _ \tau} r _ \tau \\ s _ {t,k} & = \sum _ {\tau = 0} ^ {t-1} \mathbb{1} _ {k = a _ \tau} ( r _ {\tau} - \bar{x} _ {t,k} ) ^2 \\ \end{array} $$
So $\bar{x} _ {t,k}$ and $s _ {t,k}$ are the observed mean and sum of squared deviations for bandit $k$ at time $t$, respectively.
$$ \begin{array}{ll} \text{GaussianBanditThompsonSampling} & \\ \hline \begin{array}{ll} \text{for } t \in [0,T): & \\ \begin{array}{lll} & \text{for } k \in [0,K): & \\ & \begin{array}{llll} & & \widehat{ \sigma } ^ 2 _ k \sim \mathcal{IG}(\tfrac{1}{2} N_t(k) + \alpha _ k, \beta _ {t,k} ) & \\ & & \widehat{ \mu } _ k \sim \mathcal{N}( \rho _ {t,k}, \frac{ \widehat{\sigma} ^ 2 _ k }{ N _ t ( k ) + \nu _ k } ) & \\ \end{array} & \\ & a _ t = \arg\max _ { k \in [0,K) } \mathbb{E} [ X _ t | \widehat{ \mu } _ k, \widehat{ \sigma } ^ 2 _ k ] = \arg\max _ {k \in [0,K)} \widehat { \mu } _ k & \\ & r _ t = R ( a _ t ) & \\ & N _ {t+1} (a _ t) = N _ t ( a _ t ) + 1 & \\ & \bar{x} _ {t+1,a _ t} = \bar{x} _ {t,a _ t} + \frac{1}{ N _ {t+1}(a _ t)} ( r _ t - \bar{x} _ {t,a _ t} ) & \\ & s _ {t + 1, a _ t} = s _ {t,a _ t} + r _ t ^ 2 + N _ t (a _ t) \bar{x} ^ 2 _ {t,a _ t} - N _ {t+1} (a _ t) \bar{x}^2 _ {t+1, a _ t} & \\ & \rho _ {t+1,a _ t} = \frac{ \nu _ {a _ t} m _ {a _ t} + N _ {t+1} ( a _ t ) \bar{x} _ {t+1,a _ t} }{ \nu _ {a _ t} + N _ {t+1} (a _ t) } & \\ & \beta _ { t+1, a _ t } = \beta _ {a _ t} + \frac{s _ {t+1,a _ t}}{2} + \frac{ N _ {t+1}(a _ t) \nu _ {a _ t} ( \bar{x} _ {t+1,a _ t} - m _ {a _ t}) ^2 }{ 2 (N _ {t+1} ( a _ t ) + \nu _ {a _ t} )} & \\ \end{array} \end{array} \end{array} $$
(src, inverse gamma)
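A minimal Python sketch, assuming unit Normal-Inverse-Gamma hyperparameters $(m, \nu, \alpha, \beta) = (0, 1, 1, 1)$ for every arm and illustrative arm means/variances. The inverse-gamma draw uses the fact that if $X \sim \text{Gamma}(a, \text{scale}=1/b)$ then $1/X \sim \mathcal{IG}(a, b)$, and the running sum of squared deviations is updated Welford-style, which is algebraically equivalent to the $s$ update in the pseudocode but numerically more stable:

```python
import random

def thompson_gaussian(mu, sigma, T, seed=0):
    """Thompson sampling on K Gaussian arms with a shared NIG prior."""
    rng = random.Random(seed)
    K = len(mu)
    m, nu, a0, b0 = 0.0, 1.0, 1.0, 1.0  # prior mean, count, shape, scale
    N = [0] * K       # pull counts N_t(k)
    xbar = [0.0] * K  # running means
    s = [0.0] * K     # running sums of squared deviations
    total = 0.0
    for _ in range(T):
        samples = []
        for k in range(K):
            # posterior NIG parameters for arm k
            a_post = a0 + N[k] / 2.0
            rho = (nu * m + N[k] * xbar[k]) / (nu + N[k])
            b_post = (b0 + s[k] / 2.0
                      + N[k] * nu * (xbar[k] - m) ** 2 / (2.0 * (N[k] + nu)))
            sigma2 = 1.0 / rng.gammavariate(a_post, 1.0 / b_post)  # IG draw
            samples.append(rng.gauss(rho, (sigma2 / (N[k] + nu)) ** 0.5))
        a = max(range(K), key=lambda k: samples[k])
        r = rng.gauss(mu[a], sigma[a])
        N[a] += 1                 # update sufficient statistics, played arm only
        delta = r - xbar[a]
        xbar[a] += delta / N[a]
        s[a] += delta * (r - xbar[a])
        total += r
    return total

reward = thompson_gaussian([0.0, 1.0], [1.0, 1.0], T=3000)
```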
$$ \begin{array}{ll} \text{EXP3}(\gamma \in [0,1]) & \\ \hline \begin{array}{ll} \text{for } k \in [0,K): w _ {0,k} = 1 & \\ \text{for } t \in [0,T): & \\ \begin{array}{lll} & \text{for } k \in [0,K): & \\ & \begin{array}{llll} & & p _ {t,k} = (1 - \gamma) \frac{ w _ {t,k} }{ \sum _ {a=0}^{K-1} w _ {t,a} } + \frac{\gamma}{K} & \\ \end{array} & \\ & a _ t \sim (p _ {t,0}, p _ {t,1} , \dots, p _ {t,K-1} ) & \\ & r _ t = R ( a _ t ) & \\ & w _ {t + 1, a _ t} = w _ {t,a _ t} \exp( \gamma r _ t / ( K p _ {t, a _ t} ) ) & \\ \end{array} \end{array} \end{array} $$
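A minimal Python sketch of EXP3, run here on illustrative Bernoulli arms (EXP3 itself makes no stochastic assumptions; rewards just need to lie in $[0,1]$):

```python
import math
import random

def exp3(theta, T, gamma=0.1, seed=0):
    """EXP3 with mixing parameter gamma, on K Bernoulli arms."""
    rng = random.Random(seed)
    K = len(theta)
    w = [1.0] * K   # exponential weights w_{t,k}
    total = 0.0
    for _ in range(T):
        W = sum(w)
        # mix the weight distribution with uniform exploration
        p = [(1.0 - gamma) * w[k] / W + gamma / K for k in range(K)]
        a = rng.choices(range(K), weights=p)[0]  # draw a_t ~ p_t
        r = 1.0 if rng.random() < theta[a] else 0.0
        # importance-weighted exponential update, played arm only
        w[a] *= math.exp(gamma * r / (K * p[a]))
        total += r
    return total

reward = exp3([0.2, 0.8], T=3000)
```

Dividing the reward by $p _ {t, a _ t}$ makes the update an unbiased estimate of each arm's full reward vector, which is what lets the guarantee hold even against an adversary; the $\gamma / K$ mixing term keeps every $p _ {t,k}$ bounded away from zero so those estimates cannot blow up.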