Hacker News | faabian's comments

Language models factor the joint probability as p(y, x) = p(y|x) p(x), which is exact. I.e., if you train a language model on your distribution and sample at temperature 1, you get exactly the same distribution out. If you sample at a lower temperature, or even greedily, you will of course get a different distribution.
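As a concrete sketch (plain Python, names mine): sampling at temperature T means sampling from softmax(logits / T), so T = 1 reproduces the model's own distribution, while T → 0 approaches greedy decoding.

```python
import math

def temperature_dist(logits, temperature=1.0):
    """Softmax of logits / temperature: the distribution actually sampled from."""
    z = [l / temperature for l in logits]
    m = max(z)                               # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
p_t1 = temperature_dist(logits, 1.0)    # the model's own distribution
p_cold = temperature_dist(logits, 0.1)  # sharply peaked -> near-greedy
```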


Thanks for the feedback! Let me try to state it better:

In the end, we only use the next-token head for generation. So which parts of the 2-token target H(X) + H(Y) are "auxiliary" in the sense that they help learning, and which are "wasted"? H(X | Y) and I(X; Y) are useful for next-token generation, while, by definition, H(Y | X) is the information not related to the next token X. So we could say: "multi-token prediction trades the useful information I(X; Y) from H(Y) for wasted computation on H(Y | X)". However, note that H(Y | X) is itself a next-token entropy: predicting Y from the prefix (C, X). If the attention mechanism allows computations already made for predicting Y | X to be transferred to the next step, those computations may not have been wasted at all -- they were just pre-computation.
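Spelled out, the decomposition behind that sentence (using H(X) = H(X|Y) + I(X;Y) and H(Y) = H(Y|X) + I(X;Y)) is:

```latex
\[
H(X) + H(Y) \;=\;
\underbrace{H(X\mid Y) + I(X;Y)}_{=\,H(X),\ \text{next-token target}}
\;+\; \underbrace{I(X;Y)}_{\text{extra useful signal from } H(Y)}
\;+\; \underbrace{H(Y\mid X)}_{\text{``wasted'' (or pre-computed)}}
\]
```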


Did you run any small toy experiments to prove this?


To some degree, attention is already a mechanism for making computations from previous tokens useful later. (You can think of the KV cache as a representation of the text so far and all the model's thoughts on it.) And since language models are trained on sequences end-to-end, I think this is likely to happen. Multi-token prediction encourages this behavior explicitly, but only within the small n-token window you define.

That said, there are many works attempting to increase the compute utilization of transformer language models (early exit, mixture of depths) and novel architectures (SSMs etc.).


Thanks for highlighting the KV cache, I’ve been wondering the same thing and hadn’t come across that or didn’t remember.


Transformers are still stateless; the KV cache is just a compute-saving measure (but otherwise you described it correctly).
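A minimal pure-Python sketch of why the cache changes compute but not output (the per-token keys/values here are made-up constants standing in for what the model would compute; in a causal transformer, a token's key/value depends only on its prefix, so cached rows equal recomputed ones exactly):

```python
import math

def attend(q, K, V):
    """Single-query dot-product attention over all (key, value) pairs."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    t = sum(w)
    w = [x / t for x in w]
    dim = len(V[0])
    return [sum(w[i] * V[i][j] for i in range(len(V))) for j in range(dim)]

# Per-token keys/values (illustrative constants).
Ks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
qs = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]

# Stateless decode: rebuild all K/V rows from scratch at every step.
stateless = [attend(qs[t], Ks[:t + 1], Vs[:t + 1]) for t in range(3)]

# KV-cache decode: append one row per step, never recompute old ones.
cache_K, cache_V, cached = [], [], []
for t in range(3):
    cache_K.append(Ks[t])
    cache_V.append(Vs[t])
    cached.append(attend(qs[t], cache_K, cache_V))

assert stateless == cached  # identical outputs; only the compute differs
```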


Oh huh. Why not make it stateful, like re-use and compute just the “diff” when you add a new token? Assuming it’s not that easy because each token can affect attention globally.

I think I’ve read something about this but I wonder if you could abstract attention to sentence/page levels and then only recalculate the parts that are relevant.


Because attention is all you need.

I.e., the KV cache is 'just' a time-saving measure because an LLM would otherwise go back and recalculate those values anyway. (Which is why per-token compute grows quadratically without it: step n would redo attention over all n previous tokens.)

You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that, but it would:

a) be far more compute-intensive to train and run (especially to train);

b) be susceptible to all of the issues that RNNs have;

c) most importantly, almost certainly just converge with transformers at scale. Labs run small-scale internal tests of architectures all the time, and most of them basically come to this conclusion and abandon it.


Vectors can do what one-hot vectors cannot -- no one said inputs need to be rows from a token_id -> vector embedding map. In fact, we are doing this already: moving from one-hot vectors to n-tuples of one-hot vectors increases the effective vocabulary size from V to V^n.
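A toy illustration of the n-tuple view for n = 2 (the vocabulary size and helper are made up): a pair of token ids can be packed bijectively into a single id in an effective vocabulary of size V^2.

```python
V = 50000  # base vocabulary size (illustrative)

def pair_id(t1, t2, V=V):
    """Map a 2-tuple of token ids to one id in a vocabulary of size V**2."""
    return t1 * V + t2

assert pair_id(0, 0) == 0
assert pair_id(1, 2) == 50002
assert pair_id(V - 1, V - 1) == V * V - 1  # largest id in the pair vocabulary
```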


Exactly -- but there is also a rejection-sampling-based method for speculative sampling: https://arxiv.org/abs/2302.01318
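The accept/reject rule from that line of work can be sketched in a few lines (a simplified single-token version; the names `p_target`/`q_draft` are mine, and both are full next-token distributions): accept the draft token with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q), which makes the combined procedure an exact sample from the target.

```python
import random

def speculative_accept(x, p_target, q_draft, rng=random):
    """Accept draft token x with probability min(1, p(x)/q(x)).
    On rejection, resample from the residual max(0, p - q), normalized.
    The combined procedure yields exact samples from p_target."""
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    z = sum(residual)
    residual = [r / z for r in residual]
    u, acc = rng.random(), 0.0
    for i, r in enumerate(residual):   # inverse-CDF sample from the residual
        acc += r
        if u < acc:
            return i
    return len(residual) - 1

# When the target likes the draft token at least as much as the draft
# model does, the ratio is >= 1 and the token is always accepted:
assert speculative_accept(0, [0.5, 0.5], [0.25, 0.75]) == 0
```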


Author here -- that's a very good point, and as I understand it, work in progress in different teams. Training autoencoders for language is actually super easy given the small amount of information contained in text (compared to vision/video); the hard part is making the model focus on the semantic part when all the signal we have comes from exact matches in token space. Hence Yann LeCun's ideas on joint embedding predictive architectures. Note also that there is always a trade-off: auxiliary tasks give more signal but shift the focus. In our case, we noticed degradation if the number of predicted tokens is too high. So latent prediction methods need to sort out what is useful.


Aren't the models already doing this, in a way? We know they can do things like write rhyming poems and song lyrics that do make perfect sense, so at some point the activations must be encoding some sort of overall plan for the upcoming sentences, even if maybe every word isn't predicted yet.


Yes. Otherwise next-token models wouldn't be nearly as good as they are. But the question is how to train these capabilities most efficiently! We had some interesting findings on how, with increasing model/dataset scale and data quality, capabilities can move from "only learnable with multi-token prediction" to "indifferent" to "multi-token prediction actually hurts". This depends on the capability itself; induction, for example, matures much earlier in this sense than code generation.


Is it possible that the anti-scaling effect occurs because you are removing some middle layers to free up parameters for the extra output heads? I only scanned the paper quickly, but what happens if you treat the technique as strictly additive and don't keep parameter counts fixed?


In case you’re thinking that rhyming requires planning, that’s just as silly as a rabbit tanning.

You can make things up as you go, and the constraints emerge from the flow.


great comment


> so at some point the activations must be encoding some sort of overall plan for the upcoming sentences

This isn't obviously the case. Compare this "intelligent designer" view with evolution: there was no prior plan for rabbits. It's sufficient, to create the appearance of design, that sequential steps are simply probabilistically modulated by prior ones.

Consider a continuation of "the cat...": merely a distribution over all possible next words suffices to create the illusion of a plan. Suppose "the cat sat...", then "on...", "the...", etc. follow from the training data.

I think there's a strong argument against trying to model entire sentences precisely because the system isn't modelling semantics: one should expect accuracy to fall off a cliff if there is no actual plan. I.e., predicting "sat on the mat" from "cat" shouldn't be a valid prediction, because of the infinite number of possible continuations that, as a whole, are terrible (e.g., what about "chased the mouse"?). The space of all possible sentences continuing "the cat" is infinite, with much of that space actually useful; whereas the number of next words is very small, very finite, and many of them not useful.

The only reason that "the cat sat..", "the cat sat on..." is reasonable is because each sequential word can be modulated by the prompt to seem as if planned.


The modelling is advanced enough that you can't fundamentally distinguish it from (lossy, limited) planning in the way you're describing.

If the KQV doesn't encode information about likely future token sequences then a transformer empirically couldn't outperform Markov text generators.


No one is spending $10-50M building a Markov text model of everything ever digitised; if they did, its performance would approach a basic LLM's.

Though, more simply, you can just take any LLM and rephrase it as a Markov model. All algorithms which model conditional probability are equivalent in this sense; you can even unpack a NN as a kNN model or a decision tree.

They all model 'planning' in the same way: P(C | A, B) is a 'plan' for C following A, B. There is no model of P("A B C" | "A B"). Literally, at inference time, no computation whatsoever is performed to anticipate any future prediction -- this follows trivially from the mathematical formalism (which no one seems to want to understand); or you can see it empirically: inference time is constant regardless of prompt/continuation.

The reason 'the cat sat...' is completed by 'on the mat' is that P(on | the cat sat...), P(the | the cat sat on...), and P(mat | the cat sat on the...) are each maximal.
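A toy sketch of that point (the conditional table and its probabilities are invented): greedy decoding picks the argmax continuation one word at a time, and no step computes anything about later words.

```python
# Toy conditional model: greedy decoding picks the argmax continuation
# at each step; no step looks ahead. Probabilities are made up.
cond = {
    "the cat": {"sat": 0.6, "ran": 0.4},
    "the cat sat": {"on": 0.9, "down": 0.1},
    "the cat sat on": {"the": 0.95, "a": 0.05},
    "the cat sat on the": {"mat": 0.7, "rug": 0.3},
}

def greedy(prefix, steps):
    """Extend the prefix word by word, taking the argmax at each step."""
    for _ in range(steps):
        nxt = max(cond[prefix], key=cond[prefix].get)
        prefix = f"{prefix} {nxt}"
    return prefix

print(greedy("the cat", 4))  # the cat sat on the mat
```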

Why they're maximal is not in the model at all, nor in the data. It's in the data-generating process, i.e., us. It is we who arranged text by these frequencies, and we did so because the phrase is a popular one for academic demonstrations (and so on).

As ever, people attribute to "the data", or worse, to "the LLM", properties it does not have... rather, it replays the data to us, and we suppose the LLM must have the property that generated the data originally. Nope.

Why did the tape recorder say, "the cat sat on the mat"? What, on the tape or in the recorder made "mat" the right word? Surely, the tape must have planned the word...


>Why it's maximal is not in the model at all, nor the data

>It replays the data to us and we suppose the LLM must have the property that generates this data originally.

So, to clarify, what you're saying is that under the hood an LLM is essentially just searching for similar strings in its training data and regurgitating the most commonly found one?

Because that is demonstrably not what's happening. If this were 2019 and we were talking about GPT-2, it would be more understandable, but SoTA LLMs can learn in-context and translate entire languages which aren't in their training data.

Also, re: inference time -- when you give transformers more compute for an individual token, they perform better: https://openreview.net/forum?id=ph04CRkPdC

