Cost function properties in Multi-Armed Bandits

I've been reading Section 1.3, Proposition 1.3.1 in "Dynamic Programming and Optimal Control, Vol. 2" by Dimitri Bertsekas, where the author poses the proof of a proposition as an exercise for the reader. The proposition describes properties of the optimal reward function \(J(x, M)\), paraphrased here for reference:

Let \(B = \max_{l}\max_{x^{l}} \lvert R^{l}(x^{l}) \rvert \). For fixed \(x\), the optimal reward function \(J(x, M)\) has the following properties as a function of \(M\):

  1. \(J(x, M)\) is convex and monotonically nondecreasing.
  2. \(J(x, M)\) is constant for \(M \leq -B/(1-\alpha)\).
  3. \(J(x, M) = M\) for \(M \geq B/(1-\alpha)\).
The author proceeds with the proof by considering the reward function at timestep \(k\), \(J_{k}\), and using the classical DP iteration $$J_{k+1}(x, M) = \max\big(M, \max_{l} L^{l}(x^{l}, M, J_{k})\big),$$ where we choose between retiring and continuing to work on the projects. We know that \(J_{k} \to J\) as \(k \to \infty\). He proved the first property, and below I will prove the last two.
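To make the iteration concrete, here is a minimal sketch of one DP step for a single project (\(l = 1\)) with a finite state space. The states, rewards, and transition matrix below are made-up illustrations of mine, not the book's example:

```python
import numpy as np

# One DP step for a single project with three states (hypothetical numbers).
R = np.array([1.0, -0.5, 0.2])        # R(x) for states x = 0, 1, 2
P = np.array([[0.1, 0.6, 0.3],        # P[x, x'] = Prob(f(x, w) = x')
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])
alpha = 0.9                           # discount factor

def dp_step(Jk, M):
    """J_{k+1}(x, M) = max(M, L(x, M, J_k)), with L = R + alpha * E[J_k]."""
    L = R + alpha * P @ Jk            # value of continuing in each state
    return np.maximum(M, L)           # value of retiring is M

Jk = np.zeros(3)
for _ in range(5):
    Jk = dp_step(Jk, M=0.0)
```

Retiring yields \(M\) once and for all; continuing yields the current reward plus the discounted expected value of the next state.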

I will prove this by induction on \(k\). Properties 2 and 3 are evident for \(k = 0\): $$J_{0} = \max(0, M) = 0$$ for \(M \leq -B/(1-\alpha)\), and $$J_{0} = M$$ for \(M \geq B/(1-\alpha)\).

Recall that \(B = \max_{l}\max_{x^{l}} \lvert R^{l}(x^{l}) \rvert\). Essentially we want to bound \(L^{l}\): the infinite geometric series $$B + \alpha B + \alpha^{2}B + \dots = \frac{B}{1 - \alpha}$$ is an upper bound for \(\lvert L^{l} \rvert\).
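The bound can also be spelled out as a one-step recursion (a sketch, assuming inductively that \(\lvert J_{k} \rvert \leq B/(1-\alpha)\) in the region of interest; I write \(\sup \lvert J_{k} \rvert\) for the maximum over states):

$$\lvert L^{l}(x, M, J_{k}) \rvert \leq \lvert R^{l}(x^{l}) \rvert + \alpha \sup \lvert J_{k} \rvert \leq B + \alpha \frac{B}{1-\alpha} = \frac{B(1-\alpha) + \alpha B}{1-\alpha} = \frac{B}{1-\alpha}.$$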

Let's assume property 2 holds for some \(k\); in particular, \(J_{k}(x, M) = \mathrm{const}\) for \(M \leq -B/(1-\alpha)\). Then $$\begin{align} L^{l}(x, M, J_{k}) &= R^{l}(x^{l}) + \alpha E\{J_{k}(x^{1}, \dots, f^{l}(x^{l}, w^{l}), x^{l+1}, \dots, x^{n}, M)\} \\ &= R^{l}(x^{l}) + \mathrm{const} \\ &= \mathrm{const}, \end{align}$$ which is constant in terms of \(M\). Moreover, using the earlier bound, we know $$\begin{align} \lvert L^{l} \rvert &\leq B / (1 - \alpha) \\ -B/(1-\alpha) &\leq L^{l} \leq B/(1-\alpha). \end{align}$$ Therefore, $$\begin{align} J_{k+1} &= \max\big(M, \max_{l}L^{l}(x, M, J_{k})\big) \\ &= \max_{l}L^{l}(\cdot) \\ &= \mathrm{const}, \end{align}$$ where the second equality follows from \(-B/(1-\alpha) \leq L^{l}\) and \(M \leq -B/(1-\alpha)\). This proves property 2 for \(k+1\) and completes the induction. A similar argument applies to property 3, except in the last step $$L^{l} \leq B/(1-\alpha) \leq M,$$ so \(J_{k+1} = M\).

Q.E.D.
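As a numerical sanity check, here is a toy value iteration for a single two-state project (my own hypothetical rewards and a deterministic transition \(x \mapsto 1 - x\), not an example from the book). It verifies properties 2 and 3 directly:

```python
import numpy as np

# Toy single-project retirement problem (hypothetical numbers).
R = np.array([1.0, -0.5])           # R(x) for states x = 0, 1
alpha = 0.9                         # discount factor
B = np.abs(R).max()                 # B = max_x |R(x)|
bound = B / (1 - alpha)             # the threshold B/(1-alpha)

def J(M, iters=2000):
    """Value iteration for J(x, M) = max(M, R(x) + alpha * J(1 - x, M))."""
    Jk = np.zeros(2)
    for _ in range(iters):
        Jk = np.maximum(M, R + alpha * Jk[::-1])  # next state is 1 - x
    return Jk

# Property 2: J(x, M) is constant in M for M <= -B/(1-alpha).
assert np.allclose(J(-bound), J(-bound - 5.0))

# Property 3: J(x, M) = M for M >= B/(1-alpha).
M = bound + 2.0
assert np.allclose(J(M), M)
```

Below the lower threshold retiring is never optimal, so \(M\) drops out of the maximization; above the upper threshold retiring always dominates continuing, so \(J = M\).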
