UCB algorithm

算法

UCBi(t1,δ)={ if Ti(t1)=0μ^i(t1)+2log(1/δ)Ti(t1) otherwise \begin{aligned} & \mathrm{UCB}_i(t-1, \delta)= \begin{cases}\infty & \text { if } T_i(t-1)=0 \\ \hat{\mu}_i(t-1)+\sqrt{\frac{2 \log (1 / \delta)}{T_i(t-1)}} & \text { otherwise }\end{cases} \end{aligned}

At=argmaxiUCBi(t1,δ)A_t=\operatorname{argmax}_i \mathrm{UCB}_i(t-1, \delta)

μ^i(t1)\hat{\mu}_i(t-1)很大或 Ti(t1)T_i(t-1)很小时,算法在直觉上都能满足

引理一

P(μμ^+2log(1/δ)n)δ for all δ(0,1)\mathbb{P}\left(\mu \geqslant \hat{\mu}+\sqrt{\frac{2 \log (1 / \delta)}{n}}\right) \leqslant \delta \quad \text { for all } \delta \in(0,1)

对于 XiμX_{i}-\mu服从 σsubgaussian\sigma-\text{subgaussian}分布

P(μ^μ+ϵ)enϵ22σ2 and P(μ^μϵ)enϵ22σ2\mathbb{P}(\hat{\mu} \geqslant \mu+\epsilon) \leqslant e^{-\frac{n \epsilon^2}{2 \sigma^2}} \quad \text { and } \quad \mathbb{P}(\hat{\mu} \leqslant \mu-\epsilon) \leqslant e^{-\frac{n \epsilon^2}{2 \sigma^2}}

(详见ETC算法的引理三证明)

对于上式,令 δ=enϵ22σ2\delta=e^{-\frac{n \epsilon^2}{2 \sigma^2}}ϵ=2log(1/δ)n\epsilon=\sqrt{\frac{2 \log (1 / \delta)}{n}}

Regret分析

假设 μ1=μ\mu_{1}=\mu^{*}

Rn=i=1kΔiE[Ti(n)]R_n=\sum_{i=1}^k \Delta_i \mathbb{E}\left[T_i(n)\right]

设定 GiG_{i}为一个“好事件”

Gi={μ1<mint[n]UCB1(t,δ)}{μ^iui+2uilog(1δ)<μ1}G_i=\left\{\mu_1<\min _{t \in[n]} \mathrm{UCB}_1(t, \delta)\right\} \cap\left\{\hat{\mu}_{i u_i}+\sqrt{\frac{2}{u_i} \log \left(\frac{1}{\delta}\right)}<\mu_1\right\}

可以得到“好事件”发生的两个性质

Ti(n)uiT_i(n) \leqslant u_i

假设在时刻 tt,满足 Ti(t1)=uiT_i(t-1)=u_iAt=iA_t=i

UCBi(t1,δ)=μ^i(t1)+2log(1/δ)Ti(t1)=μ^iui+2log(1/δ)ui<μ1<UCB1(t1,δ)\begin{aligned} \mathrm{UCB}_i(t-1, \delta) & =\hat{\mu}_i(t-1)+\sqrt{\frac{2 \log (1 / \delta)}{T_i(t-1)}} \\ & =\hat{\mu}_{i u_i}+\sqrt{\frac{2 \log (1 / \delta)}{u_i}} \\ & <\mu_1 \\ & <\mathrm{UCB}_1(t-1, \delta) \end{aligned}

At=argmaxjUCBj(t1,δ)=iA_t=\operatorname{argmax}_j \mathrm{UCB}_j(t-1, \delta)=i

以及

Gic 为小概率事件G_{i}^{c}\text{ 为小概率事件}

Gic={μ1mint[n]UCB1(t,δ)}{μ^iui+2uilog(1δ)μ1}G_i^c=\left\{\mu_1 \geqslant \min _{t \in[n]} \mathrm{UCB}_1(t, \delta)\right\} \cup\left\{\hat{\mu}_{i u_i}+\sqrt{\frac{2}{u_i} \log \left(\frac{1}{\delta}\right)} \geqslant \mu_1\right\}

对于左式

P(μ1mint[n]UCB1(t,δ))P(s[n]{μ1μ^1s+2log(1/δ)s})s=1nP(μ1μ^1s+2log(1/δ)s)nδ\begin{aligned} \mathbb{P}\left(\mu_1 \geqslant \min _{t \in[n]} \mathrm{UCB}_1(t, \delta)\right) & \leqslant \mathbb{P}\left(\bigcup_{s \in[n]}\left\{\mu_1 \geqslant \hat{\mu}_{1 s}+\sqrt{\frac{2 \log (1 / \delta)}{s}}\right\}\right) \\ & \leqslant \sum_{s=1}^n \mathbb{P}\left(\mu_1 \geqslant \hat{\mu}_{1 s}+\sqrt{\frac{2 \log (1 / \delta)}{s}}\right) \\ & \leqslant n \delta \end{aligned}

对于右式,μ1μi=Δi\mu_1-\mu_i=\Delta_i

P(μ^iui+2log(1/δ)uiμ1)=P(μ^iuiμiΔi2log(1/δ)ui)P(μ^iuiμicΔi)exp(uic2Δi22)\begin{aligned} \mathbb{P}\left(\hat{\mu}_{i u_i}+\sqrt{\left.\frac{2 \log (1 / \delta)}{u_i} \geqslant \mu_1\right)}\right. & =\mathbb{P}\left(\hat{\mu}_{i u_i}-\mu_i \geqslant \Delta_i-\sqrt{\frac{2 \log (1 / \delta)}{u_i}}\right) \\ & \leqslant \mathbb{P}\left(\hat{\mu}_{i u_i}-\mu_i \geqslant c \Delta_i\right) \\ & \leqslant \exp \left(-\frac{u_i c^2 \Delta_i^2}{2}\right) \end{aligned}

注意 0<c<10<c<1

对于最后一步,Xσ-subgaussian X \sim \sigma \text {-subgaussian }

P(Xε)exp(ε22σ2)\mathbb{P}(X \geqslant \varepsilon) \leqslant \exp \left(-\frac{\varepsilon^2}{2 \sigma^2}\right)

结合起来可得

P(Gic)nδ+exp(uic2Δi22)\mathbb{P}\left(G_i^c\right) \leq n \delta+\exp \left(-\frac{u_i c^2 \Delta_i^2}{2}\right)

根据“好事情”的两个性质,我们可以得到

E[Ti(n)]E[I{Gi}Ti(n)]+E[I{Gic}Ti(n)]ui+n(nδ+exp(uic2Δi22))\begin{aligned} \mathbb{E}\left[T_i(n)\right] & \leqslant \mathbb{E}\left[\mathbb{I}\left\{G_i\right\} T_i(n)\right]+\mathbb{E}\left[\mathbb{I}\left\{G_i^c\right\} T_i(n)\right] \\ & \leqslant u_i+n\left(n \delta+\exp \left(-\frac{u_i c^2 \Delta_i^2}{2}\right)\right) \end{aligned}

不妨取 ui=2log(1/δ)(1c)2Δi2,δ=1/n2u_i=\left\lceil\frac{2 \log (1 / \delta)}{(1-c)^2 \Delta_i^2}\right\rceil, \quad \delta=1 / n^2 以及 c=1/2c=1 / 2

E[Ti(n)]ui+n(nδ+exp(uic2Δi22))ui+1+n12c2/(1c)2=2log(n2)(1c)2Δi2+1+n12c2/(1c)23+16log(n)Δi2\begin{aligned} \mathbb{E}\left[T_i(n)\right] & \leqslant u_i+n\left(n \delta+\exp \left(-\frac{u_i c^2 \Delta_i^2}{2}\right)\right) \\ & \leqslant u_i+1+n^{1-2 c^2 /(1-c)^2}=\left\lceil\frac{2 \log \left(n^2\right)}{(1-c)^2 \Delta_i^2}\right\rceil+1+n^{1-2 c^2 /(1-c)^2} \\ & \leqslant 3+\frac{16 \log (n)}{\Delta_i^2} \end{aligned}

Regret的上界为

Rn=i=1kΔiE[Ti(n)]=i:Δi<ΔΔiE[Ti(n)]+i:ΔiΔΔiE[Ti(n)]nΔ+i:ΔiΔ(3Δi+16log(n)Δi)nΔ+16klog(n)Δ+3iΔi8nklog(n)+3i=1kΔi,\begin{aligned} R_n & =\sum_{i=1}^k \Delta_i \mathbb{E}\left[T_i(n)\right]=\sum_{i: \Delta_i<\Delta} \Delta_i \mathbb{E}\left[T_i(n)\right]+\sum_{i: \Delta_i \geqslant \Delta} \Delta_i \mathbb{E}\left[T_i(n)\right] \\ & \leqslant n \Delta+\sum_{i: \Delta_i \geqslant \Delta}\left(3 \Delta_i+\frac{16 \log (n)}{\Delta_i}\right) \leqslant n \Delta+\frac{16 k \log (n)}{\Delta}+3 \sum_i \Delta_i \\ & \leqslant 8 \sqrt{n k \log (n)}+3 \sum_{i=1}^k \Delta_i, \end{aligned}

运行结果