ETC algorithm

Regret定义

n轮游戏,参与者每次选择一个action,获取reward。参与者每次选择action时都要参考之前的reward,来使n轮的reward累积最大化,或者是让regret最小化,regret的定义为:

Rn=nmaxaAμaE[t=1nXt]R_n=n \max _{a \in \mathcal{A}} \mu_a-\mathbb{E}\left[\sum_{t=1}^n X_t\right]

算法

At={(tmodk)+1 if tmkargmaxiμ^i(mk) if t>mkA_t= \begin{cases}(t \bmod k)+1 & \text { if } t \leqslant m k \\ \operatorname{argmax}_i \hat{\mu}_i(m k) & \text { if } t>m k\end{cases}

引理一

引入

Ta=t=1nX{At=a},Δa=μμaT_a=\sum_{t=1}^n \mathcal{X}\left\{A_t=a\right\}, \quad \Delta_a=\mu^*-\mu_a

可以将regret函数改写成

Rn=aAt=1nE[(μXt)X{At=a}]=aAt=1nX{At=a}Δa=aAΔaE(Ta(n))\begin{aligned} R_n & =\sum_{a \in \mathcal{A}} \sum_{t=1}^n \mathbb{E}\left[\left(\mu^*-X_t\right) \mathcal{X}\left\{A_t=a\right\}\right] \\ & =\sum_{a \in \mathcal{A}} \sum_{t=1}^n \mathcal{X}\left\{A_t=a\right\} \Delta_a \\ & =\sum_{a \in \mathcal{A}} \Delta_a E\left(T_a(n)\right) \end{aligned}

对于上式的第二步的处理,需要考虑条件期望

E[(μXt)X{At=a}At]=X{At=a}E[(μXt)At]=X{At=a}(μμa)=X{At=a}Δa\begin{aligned} & E\left[\left(\mu^*-X_t\right) \mathcal{X}\left\{A_t=a\right\} \mid A_t\right] \\ = & \mathcal{X}\left\{A_t=a\right\} \mathbb{E}\left[\left(\mu^*-X_t\right) \mid A_t\right] \\ = & \mathcal{X}\left\{A_t=a\right\}\left(\mu^*-\mu_a\right) \\ = & \mathcal{X}\left\{A_t=a\right\} \Delta_a \end{aligned}

引理二

对于 XX服从 σsubgaussian\sigma-\text{subgaussian}分布,有

P(Xε)e2ελσ2\mathbb{P}(X \geqslant \varepsilon) \leqslant e^{-\frac{2 \varepsilon}{\lambda \sigma^2}}

亚高斯分布定义:

λRE(eλx)eλ2σ22\forall \lambda \in R \quad \mathbb{E}\left(e^{\lambda x}\right) \leqslant e^{\frac{\lambda^2 \sigma^2}{2}}

对于 λ>0\forall \lambda >0

P(Xϵ)=P(eλXeλϵ)E(eλX)eλϵ (Markov’s inequality) eλ2σ22λϵ (Def. of subgaussianity) e2ελσ2\begin{aligned} \mathbb{P}(X \geqslant \epsilon) & =\mathbb{P}\left(e^{\lambda X} \geqslant e^{\lambda \epsilon}\right) \\ & \leqslant \frac{\mathbb{E}\left(e^{\lambda X}\right)}{e^{\lambda \epsilon}} \quad \text { (Markov's inequality) } \\ & \leqslant e^{\frac{\lambda^2 \sigma^2}{2 \lambda \epsilon}} \quad \text { (Def. of subgaussianity) } \\ & \leqslant e^{-\frac{2 \varepsilon}{\lambda \sigma^2}} \end{aligned}

λ=4ε\lambda=\frac{4}{\varepsilon},即可得到

引理三

对于 XX服从 σsubgaussian\sigma-\text{subgaussian}分布

E(X)=0,V(X)σ2,cXcσ subguassian \mathbb{E}(X)=0, \quad \mathbb{V}(X) \leq \sigma^2, \quad c X \sim|c| \cdot \sigma-\text { subguassian }

X1,X2X_{1},X_{2}独立,服从 σ1subgaussian\sigma_1-\text{subgaussian}σ2subgaussian\sigma_2-\text{subgaussian}分布

X1+X2σ12+σ22 subguassian X_1+X_2 \sim \sqrt{\sigma_1^2+\sigma_2^2}-\text { subguassian }

对于 XiμX_{i}-\mu服从 σsubgaussian\sigma-\text{subgaussian}分布

P(μ^μ+ε)enε22σ2\mathbb{P}(\hat{\mu} \geqslant \mu+\varepsilon) \leqslant e^{-\frac{n \varepsilon^2}{2 \sigma^2}}

P(μ^με)=P(1ni=1n(Xiμ)ε)eε22nσ2\begin{aligned} & \mathbb{P}(\hat{\mu}-\mu \geqslant \varepsilon) \\ = & \mathbb{P}\left(\frac{1}{n} \sum_{i=1}^n\left(X_i-\mu\right) \geqslant \varepsilon\right) \\ \leqslant & e^{-\frac{\varepsilon^2}{2} \frac{n}{\sigma^2}} \end{aligned}

对于上式的最后一步的处理,需要考虑

1ni=1n(Xiμ)nσ2n subguassian \frac{1}{n} \sum_{i=1}^n\left(X_i-\mu\right) \sim \frac{\sqrt{n \sigma^2}}{n}-\text { subguassian }

Regret分析

Rnmi=1kΔi+(nmk)i=1kΔiexp(mΔi24)R_n \leqslant m \sum_{i=1}^k \Delta_i+(n-m k) \sum_{i=1}^k \Delta_i \exp \left(-\frac{m \Delta_i^2}{4}\right)

假设 μ1=μ\mu_{1}=\mu^{*},根据引理一

Rn=i=1kΔiE[Ti(n)]R_n=\sum_{i=1}^k \Delta_i \mathbb{E}\left[T_i(n)\right]

前mk轮选择一定

E[Ti(n)]=m+(nmk)P(Amk+1=i)m+(nmk)P(μ^i(mk)maxjFiμ^j(mk))m+(nmk)P(μ^i(mk)μ^1(mk))m+(nmk)P(μ^i(mk)μi(μ^1(mk)μ1)Δi)m+(nmk)exp(mΔi24)\begin{aligned} \mathbb{E}\left[T_i(n)\right] & =m+(n-m k) \mathbb{P}\left(A_{m k+1}=i\right) \\ & \leqslant m+(n-m k) \mathbb{P}\left(\hat{\mu}_i(m k) \geqslant \max _{j_F i} \hat{\mu}_j(m k)\right) \\ & \leqslant m+(n-m k) \mathbb{P}\left(\hat{\mu}_i(m k) \geqslant \hat{\mu}_1(m k)\right) \\ & \leqslant m+(n-m k) \mathbb{P}\left(\hat{\mu}_i(m k)-\mu_i-\left(\hat{\mu}_1(m k)-\mu_1\right) \geqslant \Delta_i\right) \\ & \leqslant m+(n-m k) \exp \left(-\frac{m \Delta_i^2}{4}\right) \end{aligned}

对于上式的最后一步的处理,需要考虑引理三及

μ^i(mk)μi(μ^1(mk)μ1)1m+1m subguassian \hat{\mu}_i(m k)-\mu_i-\left(\hat{\mu}_1(m k)-\mu_1\right) \sim \sqrt{\frac{1}{m}+\frac{1}{m}}-\text { subguassian }

如果k=2

RnmΔ+(n2m)Δexp(mΔ24)mΔ+nΔexp(mΔ24)R_n \leqslant m \Delta+(n-2 m) \Delta \exp \left(-\frac{m \Delta^2}{4}\right) \leqslant m \Delta+n \Delta \exp \left(-\frac{m \Delta^2}{4}\right)

给出一个m的选取

m=max{1,[4Δ2log(nΔ24)]}m=\max \left\{1,\left[\frac{4}{\Delta^2} \log \left(\frac{n \Delta^2}{4}\right)\right]\right\}

Regret的上界为

Rnmin{nΔ,Δ+4Δ(1+max{0,log(nΔ24)})}RnΔ+Cn\begin{gathered} R_n \leqslant \min \left\{n \Delta, \Delta+\frac{4}{\Delta}\left(1+\max \left\{0, \log \left(\frac{n \Delta^2}{4}\right)\right\}\right)\right\} \\ R_n \leqslant \Delta+C \sqrt{n} \end{gathered}

运行结果