Cost function properties in Multi-Armed Bandits

I've been reading Section 1.3, Proposition 1.3.1 in "Dynamic Programming and Optimal Control, Vol. 2" by Dimitri Bertsekas, where the author poses the proof of a proposition as an exercise for the reader. The proposition describes properties of the optimal reward function \(J(x, M)\), paraphrased here for reference:

Let \(B = \max_{l}\max_{x^{l}} \lvert R^{l}(x^{l}) \rvert \). For fixed \(x\), the optimal reward function \(J(x, M)\) has the following properties as a function of \(M\):

  1. \(J(x, M)\) is convex and monotonically nondecreasing.
  2. \(J(x, M)\) is constant for \(M \leq -B/(1-\alpha)\).
  3. \(J(x, M) = M\) for \(M \geq B/(1-\alpha)\).
The author proceeds with the proof by considering the reward function at timestep \(k\), \(J_{k}\), and using the classical DP iteration $$J_{k+1}(x, M) = \max\big(M, \max_{l} L^{l}(x^{l}, M, J_{k})\big),$$ where we choose between retiring and continuing to work on the projects. We know that \(J_{k} \to J\) as \(k \to \infty\). He proved the first property, and below I will prove the last two.
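To make the iteration concrete, here is a minimal sketch of one DP step for a single project (\(l = 1\)) with a finite state space. The states, rewards, and transition matrix below are made-up illustrations of mine, not the book's example:

```python
import numpy as np

# One DP step for a single project with three states (hypothetical numbers).
R = np.array([1.0, -0.5, 0.2])        # R(x) for states x = 0, 1, 2
P = np.array([[0.1, 0.6, 0.3],        # P[x, x'] = Prob(f(x, w) = x')
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])
alpha = 0.9                           # discount factor

def dp_step(Jk, M):
    """J_{k+1}(x, M) = max(M, L(x, M, J_k)), with L = R + alpha * E[J_k]."""
    L = R + alpha * P @ Jk            # value of continuing in each state
    return np.maximum(M, L)           # value of retiring is M

Jk = np.zeros(3)
for _ in range(5):
    Jk = dp_step(Jk, M=0.0)
```

Retiring yields \(M\) once and for all; continuing yields the current reward plus the discounted expected value of the next state.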

I will prove this by induction on \(k\). Properties 2 and 3 are evident for \(k = 0\): $$J_{0} = \max(0, M) = 0$$ for \(M \leq -B/(1-\alpha)\), and $$J_{0} = M$$ for \(M \geq B/(1-\alpha)\).

Recall that \(B = \max_{l}\max_{x^{l}} \lvert R^{l}(x^{l}) \rvert\). Essentially we want to bound \(L^{l}\): the infinite geometric series $$B + \alpha B + \alpha^{2}B + \dots = \frac{B}{1 - \alpha}$$ is an upper bound for \(\lvert L^{l} \rvert\).
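The bound can also be spelled out as a one-step recursion (a sketch, assuming inductively that \(\lvert J_{k} \rvert \leq B/(1-\alpha)\) in the region of interest; I write \(\sup \lvert J_{k} \rvert\) for the maximum over states):

$$\lvert L^{l}(x, M, J_{k}) \rvert \leq \lvert R^{l}(x^{l}) \rvert + \alpha \sup \lvert J_{k} \rvert \leq B + \alpha \frac{B}{1-\alpha} = \frac{B(1-\alpha) + \alpha B}{1-\alpha} = \frac{B}{1-\alpha}.$$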

Let's assume property 2 holds for some \(k\); in particular, \(J_{k}(x, M) = \mathrm{const}\) for \(M \leq -B/(1-\alpha)\). Then $$\begin{align} L^{l}(x, M, J_{k}) &= R^{l}(x^{l}) + \alpha E\{J_{k}(x^{1}, \dots, f^{l}(x^{l}, w^{l}), x^{l+1}, \dots, x^{n}, M)\} \\ &= R^{l}(x^{l}) + \mathrm{const} \\ &= \mathrm{const}, \end{align}$$ which is constant in terms of \(M\). Moreover, using the earlier bound, we know $$\begin{align} \lvert L^{l} \rvert &\leq B / (1 - \alpha) \\ -B/(1-\alpha) &\leq L^{l} \leq B/(1-\alpha). \end{align}$$ Therefore, $$\begin{align} J_{k+1} &= \max\big(M, \max_{l}L^{l}(x, M, J_{k})\big) \\ &= \max_{l}L^{l}(\cdot) \\ &= \mathrm{const}, \end{align}$$ where the second equality follows from \(-B/(1-\alpha) \leq L^{l}\) and \(M \leq -B/(1-\alpha)\). This proves property 2 for \(k+1\) and completes the induction. A similar argument applies to property 3, except in the last step $$L^{l} \leq B/(1-\alpha) \leq M,$$ so \(J_{k+1} = M\).

Q.E.D.
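As a numerical sanity check, here is a toy value iteration for a single two-state project (my own hypothetical rewards and a deterministic transition \(x \mapsto 1 - x\), not an example from the book). It verifies properties 2 and 3 directly:

```python
import numpy as np

# Toy single-project retirement problem (hypothetical numbers).
R = np.array([1.0, -0.5])           # R(x) for states x = 0, 1
alpha = 0.9                         # discount factor
B = np.abs(R).max()                 # B = max_x |R(x)|
bound = B / (1 - alpha)             # the threshold B/(1-alpha)

def J(M, iters=2000):
    """Value iteration for J(x, M) = max(M, R(x) + alpha * J(1 - x, M))."""
    Jk = np.zeros(2)
    for _ in range(iters):
        Jk = np.maximum(M, R + alpha * Jk[::-1])  # next state is 1 - x
    return Jk

# Property 2: J(x, M) is constant in M for M <= -B/(1-alpha).
assert np.allclose(J(-bound), J(-bound - 5.0))

# Property 3: J(x, M) = M for M >= B/(1-alpha).
M = bound + 2.0
assert np.allclose(J(M), M)
```

Below the lower threshold retiring is never optimal, so \(M\) drops out of the maximization; above the upper threshold retiring always dominates continuing, so \(J = M\).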
