An obvious, but not so obvious, thing about GPT

Often, learning a subject really well requires looking at it over and over again. I have found this to be true in my personal hobbies and in studying ML. I first studied transformers 3 years ago, and what I learned then was mostly the "mechanics" of the architecture -- the formulas for attention, the general encoder-decoder structure, etc. The next time I saw them, I learned about their applications -- language modeling in particular. Following a similar pattern, I picked up new insights about the architecture over time. The latest insight came from Andrej Karpathy's lecture on nano-GPT.

I knew that the transformer is an encoder-decoder architecture and that it is the backbone of modern language modeling. However, what Andrej explained was quite satisfying to me: GPT and most other modern LLMs use only the decoder part of the transformer. In his implementation of GPT, Andrej didn't use an encoder at all -- the pipeline was as follows: tokenize the input, pass it through the transformer, un-tokenize the output. This was quite unexpected to me, because I thought that learning a representation of the text (i.e. encoding) was crucial to making it work. However, Andrej seemed to get away without it.
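
To make this concrete, here is a minimal sketch of that decoder-only pipeline in PyTorch. This is not Andrej's actual nano-GPT code -- the class name, layer sizes, and character-level tokenizer are my own illustrative assumptions -- but it has the same three-step shape: tokenize, run through a causally-masked transformer, un-tokenize. (A "decoder-only" model is commonly built from plain self-attention layers plus a causal mask, which is what the sketch does.)

```python
# A minimal sketch of the decoder-only pipeline, NOT Andrej's actual code.
# Sizes, names, and the character-level tokenizer are illustrative assumptions.
import torch
import torch.nn as nn

text = "hello world"
chars = sorted(set(text))                      # toy character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Tokenize: string -> list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Un-tokenize: integer token ids -> string."""
    return "".join(itos[i] for i in ids)

class TinyDecoderOnlyLM(nn.Module):
    """Token + position embeddings, causally-masked self-attention, LM head."""
    def __init__(self, vocab_size, d_model=32, nhead=4, num_layers=2, block_size=16):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=64, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):                    # idx: (batch, time) token ids
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # The causal mask is what makes this "decoder-only": position t
        # may attend only to positions <= t, never to the future.
        mask = torch.triu(torch.full((T, T), float("-inf"), device=idx.device),
                          diagonal=1)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)                 # (batch, time, vocab) logits

@torch.no_grad()
def generate(model, prompt, max_new_tokens=20):
    idx = torch.tensor([encode(prompt)])                  # 1. tokenize
    for _ in range(max_new_tokens):
        logits = model(idx[:, -model.block_size:])        # 2. transformer
        probs = torch.softmax(logits[:, -1, :], dim=-1)   # next-token distribution
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return decode(idx[0].tolist())                        # 3. un-tokenize

model = TinyDecoderOnlyLM(vocab_size=len(chars))
model.eval()  # untrained, so the sample below will be gibberish
print(generate(model, "hello"))
```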

This makes me wonder whether using pre-trained encoder representations, such as those from BERT, would make any difference for the use case of dialog (i.e. text generation conditioned on user input). This is, perhaps, a different use case from the plain text generation that nano-GPT tries to solve. Still, this is rather mind-boggling to me -- how is nano-GPT able to converse with the user without the knowledge of the language that an encoder would provide?
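
One mechanical detail worth noting: in a decoder-only model, conditioning on user input doesn't require an encoder pass at all -- the user's text simply becomes the beginning of the decoder's context, and the "reply" is whatever continuation the model samples after it. With the hypothetical sketch above, a dialog turn would look like this:

```python
# Conditioning without an encoder (using the sketch above): the user's
# message is just the prompt; the reply is the sampled continuation.
user_message = "hello "
reply = generate(model, user_message, max_new_tokens=30)
print(reply)  # prompt + continuation (gibberish until the model is trained)
```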
