An obvious, but not so obvious, thing about GPT
Learning a subject really well often requires looking at it over and over again. I have found this to be true both in my personal hobbies and in studying ML. I first studied transformers 3 years ago, and what I learned then was mostly the "mechanics" of the architecture: the formulas for attention, the general encoder-decoder structure, etc. The next time I came across it, I learned its applications, language modeling in particular. Following a similar pattern, I kept picking up new insights about the architecture over time. The latest insight came from Andrej Karpathy's lecture on nanoGPT. I knew that the transformer is an encoder-decoder architecture and that it is the backbone of modern language modeling. However, what Andrej explained was quite satisfying to me: GPT and most other modern LLMs train only the decoder part of the transformer. In his implementation of GPT, Andrej didn't use an encoder at all. The pipeline was as follows: tokenize the input, pass it through the transformer, un-