Hacker News | sasjaws's comments

A while ago I did the nanoGPT tutorial. I went through some of the math with pen and paper and noticed the loss function for 'predict the next token' and 'predict the next 2 tokens' (or n tokens) is identical.

That was a bit of a shock to me, so I wanted to share this thought. Basically, I think it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.

Hope this is useful to someone.
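[Ed.: the equivalence claimed above can be sketched numerically. The conditional probabilities below are made up for illustration; the point is only that summing per-token cross-entropy losses over every prefix is the same number as the negative log of the joint sequence probability.]

```python
import math

# Toy conditional next-token probabilities a model assigns along
# the ground-truth sequence [a, b, c] (made-up numbers):
p_a = 0.5            # P(a)
p_b_given_a = 0.4    # P(b | a)
p_c_given_ab = 0.25  # P(c | a, b)

# Per-position cross-entropy ("predict the next token" at every prefix):
per_token_loss = -(math.log(p_a) + math.log(p_b_given_a) + math.log(p_c_given_ab))

# Negative log of the joint probability ("predict the whole sequence"):
joint_loss = -math.log(p_a * p_b_given_a * p_c_given_ab)

# The two objectives coincide up to floating-point noise.
assert abs(per_token_loss - joint_loss) < 1e-12
```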


As an expert in the field: this is exactly right.

LLMs are trained to do whole-book prediction: at training time we throw in whole books at a time. It's only when sampling that we do one or a few tokens at a time.


where do you get these books?

honking intensifies

WHERE DO YOU GET THESE BOOKS?!


The local library.


We do things, but it doesn't feel right


Can anyone even say what a book really is at the end of the day? It's such an abstract concept. /s


Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?


There are many ways to describe how the model works in simpler terms. Next-word prediction is a useful characterization of how you do inference with the model. Maximizing mutual information, compression, gradient descent, ... are all useful characterizations of the training process.

But as stated above, next-token prediction is a misleading frame for the training process. While the sampling is indeed happening one token at a time, due to the training process, much more is going on in the latent space where the model has its internal stream of information.


Everything is the same as everything else. It's all just hydrogen and time mixed together.


Are you referring to this one?: https://github.com/karpathy/build-nanogpt


That's the one. Lots of fun and a great entry point for experimentation.


Isn't that why noise was introduced (seed rolling, temperature, top-p, etc.)? I mean, it is still deterministic given the same parameters.

But this might be misleadingly interpreted as an LLM having "thought out an answer" before generating tokens, which is an incorrect conclusion.

Not suggesting you did.
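[Ed.: for concreteness, temperature just rescales the logits before the softmax, and sampling stays reproducible once the seed is fixed. A minimal sketch in plain Python; the logits and vocabulary size are made up.]

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before normalizing: T < 1 sharpens, T > 1 flattens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for a 3-token vocabulary

sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)

# Low temperature concentrates probability mass on the argmax token.
assert sharp[0] > flat[0]

# With a fixed seed, sampling from the distribution is deterministic.
rng = random.Random(42)
draw1 = rng.choices(range(3), weights=sharp, k=5)
rng = random.Random(42)
draw2 = rng.choices(range(3), weights=sharp, k=5)
assert draw1 == draw2
```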


> this might be misleadingly interpreted as an LLM having "thought out an answer"

I'm convinced that that is exactly what happens. Anthropic confirms it:

"Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so."

https://www.anthropic.com/research/tracing-thoughts-language...


This is about reasoning tokens, right? I didn't mean that; nanoGPT doesn't do that. nanoGPT inference just outputs characters directly, no intermediate tokens.


No, this is about normal tokens. While a SOTA LLM outputs a token at a time, it already has a high level plan of what it is going to say many tokens ahead. This is in reply to the GP who thinks that an LLM can somehow produce coherent and thoughtful sentences while never seeing more than one token ahead.


That's actually an interesting way to look at it. However, I just posted that because I often see articles expressing amazement at how training an LLM on next-token prediction can take it so far, seemingly contrasting the simplicity of the training task with the complexity of the outcome. The insight is that the training task was in fact 'predict the next book' just as much as it was 'predict the next token'. So every time I see that 'predict the next token' representation of the training task it rubs me the wrong way. It's not wrong, but it is misleading.

I didn't mean to suggest that is how it 'thinks ahead', but I believe you can see it like that in a way, because it has been trained to 'predict all the following tokens'. So it learned to guess the end of a phrase just as much as the beginning. I consider the mechanism of feeding each output token back in to be an implementation detail that distracts from what it actually learned to do.

I hope this makes sense. FYI, I'm no expert in any way, just dabbling.


I'd like to explore this idea. Did you make a blog post about it? Is it simple enough to post in a reply?


No blog post. My LLM expert friend told me this was kind of obvious when I shared it with him, so I didn't think it was worth it.

I can tell you how I got there: I did nanoGPT, then tried to be smart and train a model with a loss function that targets the next 2 tokens instead of one. Calculate the loss function and you'll see it's exactly the same during training.

Sibling commenter also mentions:

> the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b), and then with cross-entropy loss, which optimizes for log likelihood, this becomes a summation.

Hope that helps.


Look up attention masks


Unless I've misunderstood the math myself, I don't think GP's comment is quite right if taken literally, since "predict the next 2 tokens" would literally mean predicting indices t+1 and t+2 off of the same hidden state at index t, which is the much newer field of multi-token prediction and not classic LLM autoregressive training.

Instead, what GP likely means is the observation that the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b), and then with cross-entropy loss, which optimizes for log likelihood, this becomes a summation. So training with teacher forcing to minimize "next token" loss simultaneously across every prefix of the ground truth is equivalent to maximizing the joint probability of that entire ground-truth sequence.

Practically, even though inference is done one token at a time, you don't do training "one position ahead" at a time. You can optimize the loss function for the entire sequence of predictions at once. This is due to the autoregressive nature of the attention computation: if you start with a chunk of text, as it passes through the layers you don't just end up with the prediction for the next word in the last token's final layer; all of the final-layer residuals for previous tokens will encode predictions for their following index.

So attention on a block of text doesn't give you just the "next token prediction" but the simultaneous predictions for each prefix, which makes training quite nice. You can just dump in a bunch of text and it's as if you trained the "next token" objective on all of its prefixes. (This is convenient for training, but wasted work for inference, which is what leads to KV caching.)
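[Ed.: the "one pass gives predictions for every prefix" property can be demonstrated without a real transformer. The toy model below is just a made-up prefix-frequency rule; what matters is the causal constraint, i.e. position i only sees tokens up to i, so its prediction is the same whether you run the full sequence or just that prefix.]

```python
import math

def toy_causal_model(tokens):
    """Toy stand-in for a causal LM: for each position i, return a
    probability distribution over a 3-token vocab computed ONLY from
    tokens[0..i] (the causal constraint). The 'model' is a made-up
    add-one-smoothed prefix-frequency rule, not a real transformer."""
    vocab = 3
    out = []
    for i in range(len(tokens)):
        prefix = tokens[: i + 1]
        counts = [1 + prefix.count(v) for v in range(vocab)]
        total = sum(counts)
        out.append([c / total for c in counts])
    return out

seq = [0, 2, 1, 1]

# One pass over the whole sequence yields a next-token distribution
# at every position...
full = toy_causal_model(seq)

# ...and, because of causality, each row equals what you'd get by
# running the model on that prefix alone.
for i in range(len(seq)):
    assert full[i] == toy_causal_model(seq[: i + 1])[i]

# So the total "next token" loss over all prefixes comes from one pass.
loss = -sum(math.log(full[i][seq[i + 1]]) for i in range(len(seq) - 1))
```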

Many people also know by now that attention is "quadratic" in nature (the hidden state of token i attends to the states of tokens 1...i-1), but they don't fully grasp the implication: even though for forward inference this means you only predict the "next token", for backward training it means the error for token i can backpropagate to tokens 1...i-1. This holds despite the causal masking, since token 1 doesn't attend to token i directly, but the hidden state of token 1 is involved in the computation of the residual stream for token i.

When it comes to the statement

> it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.

You have to be careful, since during training there is no actual sampling happening. We've optimized to maximize the joint probability of the ground-truth sequence, but this is not the same as maximizing the probability that the ground truth is generated during sampling. Consider that there could be many sampling strategies: greedy, beam search, etc. While the most likely next token is the "greedy" argmax of the logits, the most likely next N tokens is not always found by greedily sampling N times. It's thought that this is one reason why RL is so helpful, since rollouts do in fact involve sampling, so you provide rewards at the "sampled sequence" level, which mirrors how you do inference.

It would be right to say that they're trained to ensure the most likely next book is assigned the highest joint probability (not just the most likely next token is assigned highest probability).
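[Ed.: the greedy-vs-sequence gap above is easy to show with a toy two-step distribution. All numbers below are made up; the point is that the most likely sequence need not start with the most likely first token.]

```python
from itertools import product

# Made-up two-step token distributions for a 3-token vocab {0, 1, 2}.
p_first = {0: 0.40, 1: 0.35, 2: 0.25}
p_second = {  # P(second | first)
    0: {0: 0.34, 1: 0.33, 2: 0.33},  # after token 0, mass is spread out
    1: {0: 0.90, 1: 0.05, 2: 0.05},  # after token 1, one continuation dominates
    2: {0: 0.40, 1: 0.30, 2: 0.30},
}

# Greedy decoding: take the argmax at each step.
g1 = max(p_first, key=p_first.get)
g2 = max(p_second[g1], key=p_second[g1].get)
greedy_prob = p_first[g1] * p_second[g1][g2]  # 0.40 * 0.34 = 0.136

# Exhaustive search over all two-token sequences.
best = max(product(range(3), repeat=2),
           key=lambda s: p_first[s[0]] * p_second[s[0]][s[1]])
best_prob = p_first[best[0]] * p_second[best[0]][best[1]]  # 0.35 * 0.90 = 0.315

assert best[0] != g1           # the most likely sequence does not start greedily
assert best_prob > greedy_prob
```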


The idea I tried to express was purely the loss function thing you mentioned, and how both tasks (1 vs 2 vs n tokens) lead to identical training runs, at least with nanoGPT. I don't know if that extrapolates well to current LLM internals and current training.


Personalized audio streams for language learners. Ideal for listening while driving or doing chores.

https://listen.longyan.io/

At the intermediate level, lots of learners struggle to find suitable content that matches their level and interests. More than a few learners turn to NotebookLM podcasts to provide that, but that's a bit of a hassle to set up. So I built a platform that generates and manages infinite, shareable streams around your interests or specific vocabulary. It also provides live interactive transcripts (karaoke / teleprompter style) if you need them.

Core features work, but it's still rough around the edges. Happy to help you out with any issues you encounter, languages to add, feature requests, etc.


I'm building a service that generates audio streams about subjects and vocab of your choosing, currently NotebookLM-based. If you have intermediate listening skills, it's pretty useful for deepening regular vocab and acquiring specialized jargon.

I dumped my 400 hardest recurring Anki words into it and listen to the stream whenever I'm doing chores or driving. Then I sync with my deck again after a while.

Can you help me out and give it a try? You seem like the target audience and I'd value your feedback. If your target language is not available or you want to upload an Anki deck, I can help you out.

https://listen.longyan.io


I'll give this a go. My second TL is Lithuanian which is very difficult to find content in outside of state TV stuff.


I've added support for Lithuanian and created a stream about version control for you to try out. Just 'select language' -> Lithuanian -> Play.

If you find it useful, you can register for free and create new streams on any subject. Send me a mail at alex@longyan.io if you'd like more stream/content quota or if you want to try the Anki thing, and I'll gladly set it up for you.


I'm building a reader app that tries to solve this exact problem by providing a range of gradually simplified versions of each article to match your proficiency. So you can stay in the sweet spot, or work your way up version by version.

If your target language happens to be Chinese then you can give it a try at https://reader.longyan.io/landing

No login required. I'd love your feedback.


Sure. This kind of project seems to be pretty common. I'd strongly suggest using traditional characters as a base because it's very easy to map multiple characters into simplified forms but much harder to disambiguate simplified forms into the traditional versions.

Related comment on another app: https://news.ycombinator.com/item?id=43769831


Thanks for having a look. I actually started out with traditional characters, but once I realized >90% of the students only do simplified, I switched.

I also tend to believe that just converting between them is not the best approach. Better to find different content for both: if a student wants to learn the traditional script, they usually want content from Taiwan rather than China, and the other way round.


Almost anybody serious about learning Chinese is going to want to read some things written before the 1960s and for those things, people are reading the exact same books, essays, poems, speeches, etc. The simplified versions of all of those works are literally converted from the traditional versions. Ditto for all kinds of popular content that originated in HK, TW and overseas Chinese communities.

There is no long-term gain from storing "hair" and "emit" under the same entry in your database. Storing 髮 and 發 separately, along with 发 as the simplification of both is a small effort now that will constrain you a lot less in the future. I've literally seen this pitfall happen with about 40 different Chinese learning apps over the last 15 years. Only a few (like Du Chinese and Pleco) got it right early on.
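[Ed.: the pitfall above can be sketched as a toy mapping; the characters are the ones from the comment, and a real app would of course use a full conversion table rather than this four-entry dict.]

```python
# Many traditional characters collapse onto one simplified form,
# so simplified-as-base is a lossy encoding.
trad_to_simp = {
    "髮": "发",  # "hair"
    "發": "发",  # "to emit / issue"
    "後": "后",  # "after / behind"
    "后": "后",  # "empress" (unchanged by simplification)
}

# Traditional -> simplified is a straightforward function...
assert trad_to_simp["髮"] == trad_to_simp["發"] == "发"

# ...but inverting it is ambiguous: one simplified form maps back to
# several traditional candidates, so context is needed to pick one.
simp_to_trad = {}
for trad, simp in trad_to_simp.items():
    simp_to_trad.setdefault(simp, []).append(trad)

assert sorted(simp_to_trad["发"]) == sorted(["髮", "發"])
assert len(simp_to_trad["后"]) == 2
```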

