Posts

Obvious, but not so obvious, thing about GPT

Often, learning a subject really well requires looking at it over and over again. I have found this to be true in my personal hobbies and in studying ML. I first studied transformers 3 years ago, and what I learned was mostly the "mechanics" of the architecture -- the formulas for attention, the general encoder-decoder structure, etc. The next time I saw it, I learned about its applications - language modeling in particular. Following a similar pattern, I picked up new insights about the architecture over time. The latest insight came from Andrej Karpathy's lecture on nanoGPT. I knew that the transformer is an encoder-decoder architecture and that it is the backbone of modern language modeling. However, what Andrej explained was quite satisfying to me - GPT and most other modern LLMs use only the decoder part of the transformer. In his implementation of GPT, Andrej didn't use an encoder at all - the pipeline was as follows: tokenize the input, pass it through the transformer, un-
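To make the decoder-only idea concrete, here is a minimal sketch in PyTorch (this is my own illustration, not Karpathy's nanoGPT code; names like `TinyDecoderOnlyLM`, `n_embd`, and `block_size` are placeholders). Token ids go in, a stack of causally masked self-attention blocks transforms them, and a final linear layer maps hidden states back to vocabulary logits -- no encoder and no cross-attention anywhere.

```python
import torch
import torch.nn as nn

class TinyDecoderOnlyLM(nn.Module):
    """Sketch of a decoder-only language model: token ids in, next-token logits out."""

    def __init__(self, vocab_size=50257, n_embd=128, n_head=4, n_layer=2, block_size=64):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # token embedding table
        self.pos_emb = nn.Embedding(block_size, n_embd)   # learned positional embeddings
        # PyTorch's TransformerEncoderLayer is just self-attention + MLP;
        # adding a causal mask turns it into a GPT-style "decoder" block
        # (there is no cross-attention, hence no encoder is needed).
        block = nn.TransformerEncoderLayer(d_model=n_embd, nhead=n_head,
                                           dim_feedforward=4 * n_embd,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layer)
        self.lm_head = nn.Linear(n_embd, vocab_size)       # "un-embed": hidden state -> vocab logits

    def forward(self, idx):                                 # idx: (batch, seq) of token ids
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)           # (batch, seq, n_embd)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        x = self.blocks(x, mask=causal_mask)                # masked self-attention blocks
        return self.lm_head(x)                              # (batch, seq, vocab_size) logits


# Usage: logits for the next token at every position of a toy sequence.
model = TinyDecoderOnlyLM()
tokens = torch.randint(0, 50257, (1, 16))                   # pretend these came from a tokenizer
logits = model(tokens)
print(logits.shape)                                         # torch.Size([1, 16, 50257])
```

The only thing separating this from a full encoder-decoder transformer is the causal mask and the absence of a second (cross-attention) input stream, which is exactly the point of the lecture.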

Cost function properties in Multi-Armed Bandits

I've been reading Section 1.3, Proposition 1.3.1 in "Dynamic Programming and Optimal Control, vol 2" by Dimitri Bertsekas, where the author poses the proof of a proposition as an exercise to the reader. The proposition describes properties of the optimal reward function \(J(x, M)\), paraphrased here for reference: Let \(B = \max_{l}\max_{x^{l}} \lvert R^{l}(x^{l}) \rvert\). For fixed \(x\), the optimal reward function \(J(x, M)\) has the following properties as a function of \(M\): (1) \(J(x, M)\) is convex and monotonically nondecreasing; (2) \(J(x, M)\) is constant for \(M \leq -B/(1-\alpha)\); (3) \(J(x, M) = M\) for \(M \geq B/(1-\alpha)\). The author proceeds with the proof by considering the reward function at timestep \(k\), \(J_{k}\), and using the classical DP iteration: \(
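For intuition on property (1), here is a minimal sketch of the standard argument, assuming the retirement-option formulation of the bandit problem (retiring at a stopping time \(\tau\) yields the lump-sum reward \(M\), discounted by \(\alpha^{\tau}\)); this framing is my assumption for illustration, not a quote from the book:

\[
J(x, M) \;=\; \sup_{\pi}\; \mathbb{E}\!\left[\,\sum_{t=0}^{\tau - 1} \alpha^{t}\, R^{l_t}\!\big(x^{l_t}_{t}\big) \;+\; \alpha^{\tau} M \right],
\]

where the policy \(\pi\) chooses which arm \(l_t\) to play and when to retire. For each fixed policy, the expression inside the supremum is affine and nondecreasing in \(M\), and a pointwise supremum of affine nondecreasing functions is convex and nondecreasing, which gives property (1).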