Yeah, that analogy is fairly poor. You have to think about it in terms of a probability distribution that the model samples from at each step. Given the text so far, the model assigns a probability `P(next | prefix)` to every token in its vocabulary, and the sampler picks from the top n most likely candidates. This is made tractable by encoding the tokens as vector embeddings inside the statistical model.
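To make the "sample from the top n" step concrete, here is a minimal sketch of top-k sampling. The token names and scores in `toy_logits` are made up for illustration; a real model would produce scores over its whole vocabulary, and real samplers usually add things like temperature on top of this.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_k_sample(logits, k, rng):
    """Keep the k most likely tokens and sample one in proportion to its probability."""
    probs = softmax(logits)
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token after the prefix "the cat sat on the"
toy_logits = {"mat": 2.0, "sofa": 1.5, "moon": 0.1, "carburetor": -3.0}

rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
```

With `k=2`, only "mat" and "sofa" can ever come out, and "mat" comes out more often because it has the higher probability.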

Things that look like Q-and-A transcripts do exist in the training set (think interviews, books, stage plays, etc.), and at a different layer of abstraction the rules of English text in general are very well represented. What RLHF does is slightly shift the shape of the probability distribution so that outputs look more like the desired Q-and-A format. They build a large dataset of human-tagged good and bad outputs, then use reinforcement learning techniques to make the model generate outputs that look more like the good examples and less like the bad ones.
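One way to picture "slightly shifting the distribution" is the following toy sketch. It reweights a base distribution by `exp(reward / beta)` and renormalizes, which is the closed-form solution to maximizing expected reward under a KL penalty toward the base distribution. This is an illustration only: actual RLHF training updates the model weights with gradient steps rather than reweighting a table, and the outputs, probabilities, and rewards below are invented.

```python
import math

def shift_distribution(base_probs, rewards, beta):
    """Reweight base probabilities by exp(reward / beta), then renormalize.

    A larger beta keeps the result closer to the base distribution.
    """
    weights = {x: p * math.exp(rewards[x] / beta) for x, p in base_probs.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

# Hypothetical base model behavior and human-derived rewards
base = {"helpful answer": 0.2, "rambling aside": 0.5, "rude reply": 0.3}
reward = {"helpful answer": 1.0, "rambling aside": 0.0, "rude reply": -1.0}

shifted = shift_distribution(base, reward, beta=1.0)
```

Nothing new is added to the distribution here; the outputs the base model could already produce just get more or less probability mass.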

This typically involves training a separate reward model to discriminate good outputs from bad ones, learning to mimic the human tagging. It is often smaller than the main model, though in published setups it is usually still a language model itself. There are some published papers on this.
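As a rough sketch of what "learning to mimic the human tagging" can look like, here is a tiny linear reward model trained on (good, bad) pairs with a pairwise logistic loss, `-log(sigmoid(score(good) - score(bad)))`. The feature vectors are hypothetical hand-made stand-ins; a real reward model would score the raw text with a neural network instead.

```python
import math

def score(w, feats):
    """Linear reward: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit weights so each human-preferred output scores above the rejected one."""
    w = [0.0] * dim
    for _ in range(epochs):
        for good, bad in pairs:
            margin = score(w, good) - score(w, bad)
            # Step size from the pairwise logistic loss: 1 - sigmoid(margin)
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                w[i] += lr * g * (good[i] - bad[i])
    return w

# Hypothetical features: [answers_question, is_polite, length_in_hundreds_of_words]
pairs = [
    ([1.0, 1.0, 0.5], [0.0, 1.0, 0.5]),  # answering beats not answering
    ([1.0, 1.0, 0.3], [1.0, 0.0, 0.3]),  # polite beats rude
    ([1.0, 0.0, 0.2], [0.0, 0.0, 0.9]),  # short answer beats long non-answer
]

w = train_reward_model(pairs, dim=3)
```

Once trained, `score(w, feats)` can stand in for the human taggers and rate new outputs during the reinforcement learning step.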

Here's one article from Hugging Face: https://huggingface.co/blog/rlhf