Cost function properties in Multi-Armed Bandits

I've been reading Section 1.3, Proposition 1.3.1, in "Dynamic Programming and Optimal Control, vol 2" by Dimitri Bertsekas, where the author poses the proof of a proposition as an exercise to the reader. The proposition describes properties of the optimal reward function $J(x, M)$, paraphrased here for reference:

Let $B = \max_l \max_{x_l} |R_l(x_l)|$. For fixed $x$, the optimal reward function $J(x, M)$ has the following properties as a function of $M$:

  1. $J(x, M)$ is convex and monotonically nondecreasing in $M$.
  2. $J(x, M)$ is constant for $M \le -B/(1-\alpha)$.
  3. $J(x, M) = M$ for $M \ge B/(1-\alpha)$.
The author proceeds with the proof by considering the reward function at time step $k$, $J_k$, and the classical DP iteration
$$J_{k+1}(x, M) = \max\Big(M,\ \max_l L_l(x_l, M, J_k)\Big),$$
where at each step we choose between retiring (collecting $M$) and continuing to work on one of the projects. We know that $J_k \to J$ as $k \to \infty$. He proved the first property; below I prove the last two.
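Before the proof, a numerical sanity check of the DP iteration may help. The sketch below runs the iteration on a toy two-project problem; the discount factor, rewards, and deterministic transitions are my own hypothetical choices, not an example from Bertsekas:

```python
import itertools

# Toy instance: all concrete numbers below are illustrative assumptions.
ALPHA = 0.9                          # discount factor
REWARDS = [[1.0, 0.5], [2.0, 0.0]]   # REWARDS[l][x_l] = R_l(x_l), 2 projects x 2 states
NEXT = [[1, 1], [1, 1]]              # deterministic f_l(x_l): both projects absorb in state 1


def dp_iterate(M, iters=300):
    """Iterate J_{k+1}(x, M) = max(M, max_l L_l(x_l, M, J_k)) from J_0 = max(0, M)."""
    states = list(itertools.product(range(2), range(2)))  # joint states (x_1, x_2)
    J = {x: max(0.0, M) for x in states}
    for _ in range(iters):
        J_new = {}
        for x in states:
            Ls = []
            for l in range(2):        # work on project l: it advances, the other is frozen
                nxt = list(x)
                nxt[l] = NEXT[l][x[l]]
                Ls.append(REWARDS[l][x[l]] + ALPHA * J[tuple(nxt)])
            J_new[x] = max(M, max(Ls))  # retire (take M) or continue
        J = J_new
    return J
```

With these numbers $B = 2$, so $B/(1-\alpha) = 20$: `dp_iterate(25.0)` returns $25$ in every state (property 3), while `dp_iterate(-25.0)` and `dp_iterate(-30.0)` agree everywhere (property 2).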

I will prove properties 2 and 3 by induction on $k$. Both are evident for $k = 0$: $J_0 = \max(0, M) = 0$ for $M \le -B/(1-\alpha)$ and $J_0 = \max(0, M) = M$ for $M \ge B/(1-\alpha)$.

Here $B = \max_l \max_{x_l} |R_l(x_l)|$. Essentially we want to bound $L_l$: it is dominated by the infinite geometric series $B + \alpha B + \alpha^2 B + \cdots = B/(1-\alpha)$, so $B/(1-\alpha)$ is an upper bound for $|L_l|$.
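The geometric-series bound is easy to check numerically; the values of $\alpha$ and $B$ below are arbitrary illustrative choices:

```python
# Partial sums of B + alpha*B + alpha^2*B + ... approach B / (1 - alpha).
ALPHA, B = 0.9, 2.0
partial_sums = [sum(B * ALPHA**i for i in range(n)) for n in (10, 100, 1000)]
bound = B / (1 - ALPHA)  # = 20.0 here
```

The partial sums increase monotonically toward the bound and, for $n = 1000$, match it to within floating-point precision.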

Assume property 2 holds for some $k$; in particular, $J_k(x, M) = \text{const}$ for $M \le -B/(1-\alpha)$. Then
$$L_l(x_l, M, J_k) = R_l(x_l) + \alpha E\big\{J_k(x_1, \ldots, f_l(x_l, w_l), x_{l+1}, \ldots, x_n, M)\big\} = R_l(x_l) + \text{const},$$
which is constant in $M$. Moreover, by the earlier bound we know $-B/(1-\alpha) \le L_l \le B/(1-\alpha)$. Therefore
$$J_{k+1}(x, M) = \max\Big(M,\ \max_l L_l(x_l, M, J_k)\Big) = \max_l L_l(\cdot) = \text{const},$$
where the second equality follows from $M \le -B/(1-\alpha) \le L_l$. This proves property 2 for $k+1$ and completes the induction. A similar argument applies to property 3, except in the last step $L_l \le B/(1-\alpha) \le M$, so $J_{k+1} = M$.

Q.E.D.
