I think you're not looking at this from the right perspective. An LLM is designed to sample text that follows the training distribution. It is not designed to tell you the "most likely" text that follows, and we don't actually want that: it would mean you have no diversity in your outputs.
In your example, sampling a 0 in 40% of cases and a 1 in 60% of cases does make sense for chat applications.
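To make that concrete, here is a minimal sketch (toy numbers, no real model) contrasting sampling from that 40/60 distribution with always taking the argmax. The `probs` table and helper names are invented for illustration:

```python
import random

# Toy next-symbol distribution from the example above:
# P("0") = 0.4, P("1") = 0.6 (a binary alphabet standing in for a token vocabulary).
probs = {"0": 0.4, "1": 0.6}

def sample(probs, rng):
    """Draw one symbol according to its probability (inverse-CDF sampling)."""
    r = rng.random()
    cum = 0.0
    for sym, p in probs.items():
        cum += p
        if r < cum:
            return sym
    return sym  # guard against floating-point rounding

def greedy(probs):
    """Always pick the single most likely symbol."""
    return max(probs, key=probs.get)

rng = random.Random(0)
draws = [sample(probs, rng) for _ in range(10_000)]
print(draws.count("1") / len(draws))  # roughly 0.6
print(greedy(probs))                  # always "1": no diversity
```

Sampling reproduces the training distribution's variety; greedy decoding collapses every call to the same output.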
For applications where we do care about the most likely sentence (e.g. question answering), beam search helps, as others have mentioned.
Another thing to consider is that the model can "look ahead" and precompute what the future tokens might be, then use this to predict the current token. In fact, some work has investigated this, such as [1].
And a final note, predicting one token at a time is what we are doing as humans when we speak, so clearly it is not a wrong approach. We are doing this "look ahead" in our mind before speaking.
> It is not designed to tell you the "most likely" text that follows, and we don't actually want that. This would mean you have no diversity in your outputs.
No, we specifically do want "most likely" to follow; the goal is to approximate Solomonoff induction as well as possible. See this recent paper by Hutter's team: https://arxiv.org/pdf/2401.14953
Quote from the paper:
"LLMs pretrained on long-range coherent documents can learn new tasks from a few examples by inferring a shared latent concept. They can do so because in-context learning does implicit Bayesian inference (in line with our CTW experiments) and builds world representations and algorithms (necessary to perform SI [Solomonoff Induction]). In fact, one could argue that the impressive in-context generalization capabilities of LLMs is a sign of a rough approximation of Solomonoff induction."
> In your example, sampling a 0 in 40% of cases and a 1 in 60% of cases does[n't] make sense for chat applications.
I didn't say anything about sampling. A sequence prediction model represents a mapping between an input sequence and a probability distribution over all possible output sequences up to a certain length.
My example uses a binary alphabet, but LLMs use an alphabet of tokens. Any chat application that expresses its output as a string of concatenated symbols from a given alphabet has a probability distribution defined over all possible output sequences. I'm simply comparing the fundamental limitations of any approach to inference that restricts its outcome space to sequences consisting of one symbol (and then layers on a meta-model to generate longer sequences by repeatedly calling the core inference capability) vs an approach that performs inference over an outcome space consisting of sequences longer than one symbol.
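Here is a toy sketch of that limitation (the two-step model and all its numbers are invented for illustration): greedy one-symbol-at-a-time decoding picks a different sequence than exact inference over the space of full sequences.

```python
from itertools import product

# Hypothetical two-step model over the alphabet {"A", "B"}:
# first-symbol probabilities, then conditionals given the first symbol.
first = {"A": 0.4, "B": 0.6}
second = {
    "A": {"A": 0.9, "B": 0.1},  # after "A" the model is confident
    "B": {"A": 0.5, "B": 0.5},  # after "B" it is split
}

# Greedy, one symbol at a time:
s1 = max(first, key=first.get)            # "B"
s2 = max(second[s1], key=second[s1].get)  # "A" (tie broken by insertion order)
greedy_seq = s1 + s2

# Exact inference over all length-2 sequences:
seq_probs = {a + b: first[a] * second[a][b] for a, b in product("AB", repeat=2)}
best_seq = max(seq_probs, key=seq_probs.get)

print(greedy_seq, round(seq_probs[greedy_seq], 2))  # BA 0.3
print(best_seq, round(seq_probs[best_seq], 2))      # AA 0.36
```

Greedy commits to "B" (0.6 > 0.4) and ends up on a sequence with probability 0.30, while the most probable sequence overall, "AA", has probability 0.36; this is exactly the gap that decoding over longer outcome spaces (e.g. beam search) tries to close.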
> "It is not designed to tell you the "most likely" text that follows,"
It is exactly designed to do that. With a temperature of 0, this is what you are approximating. The crucial point, though, is that it is the most likely next word given the preceding multi-token context, not just the previous token.
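As a sketch of how temperature relates to that argmax behaviour (the logit values here are made up), this is the standard temperature-scaled softmax; as the temperature approaches 0, the distribution concentrates all its mass on the highest-scoring token:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over next-token logits (numerically stable)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0.01):
    print(t, [round(p, 4) for p in softmax_with_temperature(logits, t)])
# As t -> 0 the distribution collapses onto the argmax token,
# which is why temperature 0 approximates "most likely next word" decoding.
```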
Indeed. An interesting reference here is the work Milman Parry did to describe the key phrases in the Odyssey and the cues they gave to help someone memorize the poem.
Also, this is maybe a semantic point, but I am not predicting any words I speak, not in a statistical sense. I have intent behind my words, which means I have an abstraction of meaning that I want to convey, and I assemble the correct words to do that. No part of that is "predictive".
[1] https://arxiv.org/abs/2404.00859