I think you're not looking at this from the right perspective. An LLM is designed to sample text that follows the training distribution. It is not designed to tell you the "most likely" text that follows, and we don't actually want that: it would mean you have no diversity in your outputs.
In your example, sampling a 0 in 40% of cases and a 1 in 60% of cases does make sense for chat applications.
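To make that concrete, here is a minimal sketch (toy numbers, no real model) contrasting sampling from that 40/60 distribution with always taking the argmax. The `probs` table and helper names are invented for illustration:

```python
import random

# Toy next-symbol distribution from the example above:
# P("0") = 0.4, P("1") = 0.6 (a binary alphabet standing in for a token vocabulary).
probs = {"0": 0.4, "1": 0.6}

def sample(probs, rng):
    """Draw one symbol according to its probability (inverse-CDF sampling)."""
    r = rng.random()
    cum = 0.0
    for sym, p in probs.items():
        cum += p
        if r < cum:
            return sym
    return sym  # guard against floating-point rounding

def greedy(probs):
    """Always pick the single most likely symbol."""
    return max(probs, key=probs.get)

rng = random.Random(0)
draws = [sample(probs, rng) for _ in range(10_000)]
print(draws.count("1") / len(draws))  # roughly 0.6
print(greedy(probs))                  # always "1": no diversity
```

Sampling reproduces the training distribution's variety; greedy decoding collapses every call to the same output.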
For applications where we do care about the most likely sentence (e.g. question answering), beam search helps, as others have mentioned.
Another thing to consider is that the model can "look ahead" and precompute what the future tokens might be, then use this to predict the current token. In fact, some work has investigated this, such as [1].
And a final note, predicting one token at a time is what we are doing as humans when we speak, so clearly it is not a wrong approach. We are doing this "look ahead" in our mind before speaking.
> It is not designed to tell you the "most likely" text that follows, and we don't actually want that. This would mean you have no diversity in your outputs.
No, we specifically do want "most likely" to follow; the goal is to approximate Solomonoff induction as well as possible. See this recent paper by Hutter's team: https://arxiv.org/pdf/2401.14953
Quote from the paper:
"LLMs pretrained on long-range coherent documents can learn new tasks from a few examples by inferring a shared latent concept. They can do so because in-context learning does implicit Bayesian inference (in line with our CTW experiments) and builds world representations and algorithms (necessary to perform SI [Solomonoff Induction]). In fact, one could argue that the impressive in-context generalization capabilities of LLMs is a sign of a rough approximation of Solomonoff induction."
> In your example, sampling a 0 in 40% of cases and a 1 in 60% of cases does[n't] make sense for chat applications.
I didn't say anything about sampling. A sequence prediction model represents a mapping between an input sequence and a probability distribution over all possible output sequences up to a certain length.
My example uses a binary alphabet, but LLMs use an alphabet of tokens. Any chat application that expresses its output as a string of concatenated symbols from a given alphabet has a probability distribution defined over all possible output sequences. I'm simply comparing the fundamental limitations of any approach to inference that restricts its outcome space to sequences consisting of one symbol (and then layers on a meta-model to generate longer sequences by repeatedly calling the core inference capability) vs an approach that performs inference over an outcome space consisting of sequences longer than one symbol.
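Here is a toy sketch of that limitation (the two-step model and all its numbers are invented for illustration): greedy one-symbol-at-a-time decoding picks a different sequence than exact inference over the space of full sequences.

```python
from itertools import product

# Hypothetical two-step model over the alphabet {"A", "B"}:
# first-symbol probabilities, then conditionals given the first symbol.
first = {"A": 0.4, "B": 0.6}
second = {
    "A": {"A": 0.9, "B": 0.1},  # after "A" the model is confident
    "B": {"A": 0.5, "B": 0.5},  # after "B" it is split
}

# Greedy, one symbol at a time:
s1 = max(first, key=first.get)            # "B"
s2 = max(second[s1], key=second[s1].get)  # "A" (tie broken by insertion order)
greedy_seq = s1 + s2

# Exact inference over all length-2 sequences:
seq_probs = {a + b: first[a] * second[a][b] for a, b in product("AB", repeat=2)}
best_seq = max(seq_probs, key=seq_probs.get)

print(greedy_seq, round(seq_probs[greedy_seq], 2))  # BA 0.3
print(best_seq, round(seq_probs[best_seq], 2))      # AA 0.36
```

Greedy commits to "B" (0.6 > 0.4) and ends up on a sequence with probability 0.30, while the most probable sequence overall, "AA", has probability 0.36; this is exactly the gap that decoding over longer outcome spaces (e.g. beam search) tries to close.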
> "It is not designed to tell you the "most likely" text that follows,"
It is exactly designed to do that. With a temperature of 0, this is what you are approximating. The crucial point, though, is that it is the most likely next word given the preceding multi-token context, not just the previous token.
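As a sketch of how temperature relates to that argmax behaviour (the logit values here are made up), this is the standard temperature-scaled softmax; as the temperature approaches 0, the distribution concentrates all its mass on the highest-scoring token:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over next-token logits (numerically stable)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0.01):
    print(t, [round(p, 4) for p in softmax_with_temperature(logits, t)])
# As t -> 0 the distribution collapses onto the argmax token,
# which is why temperature 0 approximates "most likely next word" decoding.
```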
Indeed. An interesting reference here is the work Milman Parry did to describe the key phrases in the Odyssey and the cues they gave to help someone memorize the poem.
Also, this is maybe a semantic point, but I am not predicting any words I speak, not in a statistical sense. I have intent behind my words, which means I have an abstraction of meaning that I want to convey, and I assemble the correct words to do that. No part of that is "predictive".
[1] https://arxiv.org/abs/2404.00859