Posts

Showing posts from November, 2020

Cost function properties in Multi-Armed Bandits

MathJax TeX Test Page I've been reading Section 1.3, Proposition 1.3.1. in "Dynamic Programming and Optimal Control, vol 2" by Dimitri Bertsekas, where author posed a proof of a proposition as an exercise to a reader. The proposition described properties of the optimal reward function \(J(x, M)\), paraphrased here for reference: Let \(B = max_{l}max_{x^{l}} \mid R^{l}(x^{l}) \mid \). For fixed \(x\), the optimal reward function \(J(x, M)\) has the following properties as a function of M: \(J(x, M)\) is convex and monotinically nondecreasing. \(J(x, M)\) is constant for \(M \leq -B/(1-\alpha)\) \(J(x, M)\) = \(M\) for \(M \geq B/(1-\alpha)\) Author proceeded with the proof by considering the reward function at timestep \(k\): \(J_{k}\) and using classical DP iteration: \(