
EGreg · 2026-04-23T10:46:43 1776941203

Security by obscurity through morality? :)

The thing is, technology is either enabling something or not. The exploration space might be huge, but once an exploit is found, the exploitation code / strategy / plan can trivially proceed and be shared worldwide. So you have to deal with this when you design and patch systems.

Example: preserving paths in URLs. Safari ITP aggressively removes “utm_” and other well-known querystring parameters even in links clicked from email. Well, it is trivial to embed them in a path instead, so that first-party websites can track attribution, e.g. for campaign performance or email verification links. In theory, Apple and Mozilla could play a cat-and-mouse game with links across all their users and remove high-entropy path segments, or confuse websites so much that they give up on all attribution. Browser makers, email client makers or messenger makers could argue that users don’t want attribution of their link clicks tracked silently without their permission. They could then say that if users really wanted, they could manually enter a code (assisted by the OS or browser) into a website, or simply give interactive permission to be tracked after clicking a link; otherwise the website will receive some dummy results and break. Where is the line after all?
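To make the "embed it in a path" point concrete, here is a minimal sketch using only Python's standard `urllib.parse`. The function names, the `/c/` path marker and the `example.com` URL are all made up for illustration; this is not how any particular site does it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote, unquote

def embed_attribution_in_path(url, prefix="utm_"):
    """Move query parameters starting with `prefix` into a trailing path segment."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    tracked = [(k, v) for k, v in pairs if k.startswith(prefix)]
    kept = [(k, v) for k, v in pairs if not k.startswith(prefix)]
    if not tracked:
        return url
    # "/c/" is a hypothetical marker the first-party site agrees to look for.
    segment = "c/" + quote(urlencode(tracked), safe="")
    path = parts.path.rstrip("/") + "/" + segment
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment))

def extract_attribution(url):
    """Recover the embedded parameters on the receiving side."""
    path = urlsplit(url).path
    _, sep, tail = path.rpartition("/c/")
    if not sep:
        return {}
    return dict(parse_qsl(unquote(tail)))

original = "https://example.com/verify?utm_source=email&utm_campaign=spring&id=7"
rewritten = embed_attribution_in_path(original)
recovered = extract_attribution(rewritten)
```

After this runs, the query string of `rewritten` only carries `id=7`, and `recovered` holds the two `utm_` values, so a filter that only strips known querystring names would not touch them.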

EGreg · 2026-04-22T22:20:06 1776896406

Peter Ilyich Tchaikovsky was born in Votkinsk, May 7 1840.

When he was a little boy he never played out in the streets of Votkinsk like the other little children of Votkinsk, because when Tchaikovsky was one month old, his parents moved to St. Petersburg.

— Victor Borge

mjcohen · 2026-04-23T00:37:18 1776904638

As Victor said, his parents were very upset when they came home and found him in front of a roaring fire, because they did not have a fireplace.

_doctor_love · 2026-04-22T22:24:33 1776896673

Put up in a place

where it is easy to see

the cryptic admonishment

T.T.T

¨

When you feel how depressingly

slowly you climb

it's well to remember that

Things Take Time

-- Piet Hein

EGreg · 2026-04-21T14:47:56 1776782876

I would add this: https://magarshak.com/blog/perfection-is-the-enemy-of-the-go...

EGreg · 2026-04-21T03:42:46 1776742966

"This holds for almost all matrices" is actually something you'd want to know if we're talking about probabilities, no?

EGreg · 2026-04-21T03:31:45 1776742305

The prediction being used is the model's prediction of the next token's KV vector, given all previous KV vectors. Because the model was trained on language, it has strong priors about what comes next. The residual, i.e. the difference between the predicted next KV vector and the actual one, is much smaller in entropy than the raw vector, for the same reason language model perplexity is low on fluent text.

aesthesia · 2026-04-21T03:44:58 1776743098

What model is doing this prediction? The only way a transformer predicts the "next KV vector" is by sampling the next token and then running a forward pass with that token.

EGreg · 2026-04-21T04:12:36 1776744756

The predicted KV vector is the expected KV vector under the model's distribution over next tokens, i.e. a weighted average over the vocabulary, not an actual sampled token. So no forward pass with a sampled token is involved. Yes, the exact computation is expensive (one forward pass per vocabulary token), which the paper acknowledges, and the practical section covers top-k approximations that capture most of the probability mass cheaply. The entropy bound holds regardless of approximation scheme -- it's a statement about the theoretical floor. The residual is small whenever the model assigns high probability to the actual next token, which is exactly what low perplexity means.
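A toy numerical sketch of what "expected KV vector under the next-token distribution" and its top-k approximation mean. Everything here is an assumption for illustration: `kv_table` is a hypothetical array holding the KV vector each candidate token would produce at this position (in a real model, obtaining each row costs a forward step), and the numbers are random, not from any model or from the paper.

```python
import numpy as np

def expected_kv(probs, kv_per_token, k=None):
    """Probability-weighted average of candidate KV vectors.

    probs:        (V,) next-token probabilities
    kv_per_token: (V, d) hypothetical KV vector per candidate token
    k:            if given, use only the k most likely tokens (renormalized)
    """
    if k is None:
        return probs @ kv_per_token
    top = np.argsort(probs)[-k:]            # indices of the k largest probabilities
    weights = probs[top] / probs[top].sum() # renormalize over the kept tokens
    return weights @ kv_per_token[top]

rng = np.random.default_rng(0)
V, d = 50, 8                                # toy vocabulary size and vector width
logits = rng.normal(size=V) * 3
probs = np.exp(logits - logits.max())
probs /= probs.sum()
kv_table = rng.normal(size=(V, d))

exact = expected_kv(probs, kv_table)        # needs all V candidate vectors
approx = expected_kv(probs, kv_table, k=5)  # needs only 5 candidate vectors

actual_token = int(np.argmax(probs))        # pretend the likeliest token occurred
residual = kv_table[actual_token] - approx  # what would be stored instead of the raw vector
```

The residual shrinks as the distribution concentrates on the token that actually occurs; with a flat distribution the average is far from every individual candidate and the residual is not small.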

magicalhippo · 2026-04-21T04:27:22 1776745642

> the practical section covers top-k approximations that capture most of the probability mass cheaply.

You say cheaply, but top-k with k=20 still means 20 forward passes for each position in the predicted KV cache vector, no? So to compute the residual at position i+1 you need another 20 passes?

It's late, perhaps I'm missing something.

aesthesia · 2026-04-21T04:20:02 1776745202

A top-k approximation still requires k forward passes; that's k times as expensive as just computing the exact value. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and particularly for inner layers a significant amount of information left in the residual.

EGreg · 2026-04-21T04:35:56 1776746156

The k forward passes for different candidate tokens share all their prefix computation: the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.

EGreg · 2026-04-21T03:22:05 1776741725

You're right, I'm not a well-known researcher, simply an entrepreneur who started to publish academic papers.

However, I do have a long history of diving deep into fields and building practical, open-source solutions to major problems I perceive in the fields.

15 years ago I started with social networks and PHP: https://github.com/Qbix http://laweekly.com/restoring-healthy-communities/

8 years ago I got into smart contracts on EVM, which was the SOTA at the time: https://github.com/Intercoin https://intercoin.org/applications

About a year and a half ago I started teaching a course on AI at a university not far from NYU where I studied... and that's what got me into this: https://vimeo.com/1063008765/c7ef3abcc5

I try to document everything on GitHub and popular articles, but only recently started publishing academic papers on arXiv and plan to actually start submitting them for real publications. While I build, I realized that I should start publishing any novel theoretical results that underpin my work.

I plan to publish actual code in a few weeks. To be fair, TurboQuant is also a purely theoretical paper. I just wanted to get this out and share.

thethirdone · 2026-04-21T03:31:07 1776742267

> To be fair, TurboQuant is also a purely theoretical paper. I just wanted to get this out and share.

TurboQuant is not a purely theoretical paper. Section 4 "Experiments" (page 15) [0] has a bunch of figures based on actual GPU computations.

[0]: https://arxiv.org/abs/2504.19874

kumarhn · 2026-04-21T21:17:54 1776806274

TurboQuant looks like it has very serious research integrity issues.

https://openreview.net/forum?id=tO3ASKZlok

stingraycharles · 2026-04-21T03:44:43 1776743083

TurboQuant went through ICLR review, has multiple Google Research co-authors, open-source implementations, CUDA kernels, and LongBench benchmarks.

Contrast that with your paper: no experiments, no implementation, no empirical validation of any kind.

Did you try engaging with LLM researchers and get their feedback on your paper?

mskkm · 2026-04-21T21:14:34 1776806074

went through ICLR review: scores 4 4 6 10, serious? open-source implementations: where is the official code? CUDA kernels: where?

EGreg · 2026-04-22T00:30:34 1776817834

Since yesterday, I put up the source code btw:

https://github.com/Safebots/KV

EGreg · 2026-04-21T02:55:24 1776740124

Author here. Since starting to teach AI at IENYC, I recently started publishing my papers on arXiv, and am considering submitting them to a journal.

This is based on my original "PLT" paper: Probabilistic Language Tries (https://news.ycombinator.com/item?id=47743585). A "Trie" is basically a tree of prefixes. While working on https://safebots.ai I became obsessed with caching generated artifacts as a means to do a lot of things: extremely cheap inference, near-optimal compression, modeling decision trees for strategies, and so on.

The PLT model was about compression in general. My main insight there was that the LLM's own weights actually contain an incredibly detailed probability distribution of "the next token" in any sequence, which can therefore be very useful to supercharge statistical compression. Sequences which occur frequently in the domain of the model receive short codes. The other insight is that if we allowed lossy compression, we could compress well below the Shannon information limit, and just have an "overflow" bag for surprising sequences.

When TurboQuant came out, I realized we can also go way below the Shannon limit in the same way, and take advantage of PLT. In fact, I'm working on publishing a paper that generalizes this to robotics (which needs to do cheap fast on-board inference "in the field"). I also believe this is how animals actually learn. In other words, over time they learn overall "sequences" of actions and then can check whether they are "good enough" to solve the problem, or whether to switch to a full analysis -- this corresponds to System 1 and 2 of Daniel Kahneman's "Thinking Fast and Slow".

If you want more specific information, or want to see the code for a working prototype, you can write me at the email in the paper.

tomrod · 2026-04-22T16:09:25 1776874165

Can you show a working example/implementation of these theoretical improvements? Working code would also go far for replication.

mbernstein · 2026-04-21T03:26:17 1776741977

This is a compute/memory trade, not compression vs. TurboQuant? Lemma 1 is something like, "forward pass is deterministic because it's deterministic", which means the input tokens were always the lower bound... which isn't caching? Smells tautological. What am I missing?

EGreg · 2026-04-21T03:34:09 1776742449

Well yeah, I just wrote it as a lemma, but it's basically close to tautological. Its only job is to formally ground the entropy argument that follows it. The interesting claim is what comes after: because KV vectors are deterministic functions of tokens, and because the model is a near-optimal predictor of its own distribution, the conditional entropy of each new KV vector given all previous ones is bounded by token-level perplexity. TurboQuant compresses against the marginal distribution of each vector in isolation -- that's the gap.
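Spelled out, the argument has roughly this shape. This is a sketch in my own notation, not copied from the paper: $t_i$ are tokens, $\mathrm{KV}_i$ is the KV vector at position $i$, and $\mathrm{PPL}$ is the model's perplexity on the text. Conditioning is written on the token prefix; swapping that for the KV prefix needs the extra assumption that $\mathrm{KV}_{<i}$ determines $t_{<i}$.

```latex
% Assumption: KV_i = f_i(t_1, \dots, t_i) is a deterministic function of the tokens.
\begin{align*}
H(\mathrm{KV}_i \mid t_{<i})
  &\le H(t_i \mid t_{<i})
  && \text{given } t_{<i},\ \mathrm{KV}_i \text{ is a function of } t_i \\
\frac{1}{n}\sum_{i=1}^{n} H(t_i \mid t_{<i})
  &\le \log_2 \mathrm{PPL}
  && \text{entropy is at most the model's cross-entropy}
\end{align*}
```

Chaining the two lines gives an average of at most $\log_2 \mathrm{PPL}$ bits per position for the new KV vector, which is the sense in which the bound is "token-level".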

And yes, it's a compute/memory tradeoff, all caching is. The claim is just that the memory floor is much lower than anyone had formally established. Whether the compute cost of getting there is worth it is a fair open question the paper doesn't settle. But what if it is? Caching is the thread running through most of my work, and I intend to find out.

himata4113 · 2026-04-21T03:19:35 1776741575

The reasoning around the 900000x claim isn't sound and violates way too many information density principles.

I was incredibly curious since I had a pet theory in my mind about something extremely similar, but arrived at the conclusion that such a cache would end up being extremely slow.

This is like saying that you've achieved single-token compression when you're passing a single token into a model and letting it regenerate the entire output, since at the end of the day models are probabilistic stateless devices. At that point you don't have a cache: you are either just replaying the tokens, or you have a caching algorithm with a complexity similar to that of the model, which defeats the purpose of such a cache.

I've never considered that arXiv had a problem, now I do.

EGreg · 2026-04-21T03:27:50 1776742070

No, the 914,000x in the paper is talking about the ratio between two entropy floors, it's not a claim about practical compression. The point is that per-vector quantization has been chasing the wrong theoretical limit: the sequential entropy bound is just fundamentally lower, by that factor, because KV vectors aren't independent samples!

On complexity, that's a fair concern, and the paper doesn't fully resolve it. But the analogy to "replaying tokens through the model" isn't exactly right. The delta coding layer uses the model's own next-token prediction, which is already happening during normal autoregressive inference. You're not adding a forward pass, you're using the one already running and storing only the residual, which is much smaller than the raw vector -- precisely because the model is a good predictor of its own next state.

The trie index lookup is O(sequence length), not O(model forward pass). Whether that's fast enough in practice at scale is actually a legitimate open question and I'd be the first to admit the paper doesn't settle it. But the contribution here is simply establishing that the bound exists and is dramatically lower than what the field has been targeting. That's what I wanted to put out. The engineering question of how close you can get is the natural next step.
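For anyone unfamiliar with why a trie lookup is O(sequence length): here is a generic token-prefix trie, as a minimal sketch. It is not the paper's index; the class names and the string placeholders standing in for cached KV state are made up for illustration.

```python
class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}   # token id -> TrieNode
        self.value = None    # cached payload for the prefix ending here, if any

class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, value):
        """Store `value` under the exact token prefix `tokens`."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
        node.value = value

    def longest_prefix(self, tokens):
        """Return (matched_length, value) for the longest stored prefix.

        One dictionary lookup per token, so the walk is linear in len(tokens)
        and never touches the model.
        """
        node = self.root
        best_len, best_val = 0, None
        for i, t in enumerate(tokens, start=1):
            node = node.children.get(t)
            if node is None:
                break
            if node.value is not None:
                best_len, best_val = i, node.value
        return best_len, best_val

trie = TokenTrie()
trie.insert([1, 2], "kv-for-[1,2]")        # placeholder payloads
trie.insert([1, 2, 3], "kv-for-[1,2,3]")
hit = trie.longest_prefix([1, 2, 3, 4])
partial = trie.longest_prefix([1, 2, 9])
miss = trie.longest_prefix([7])
```

The lookup cost depends only on the number of tokens walked. Memory for the trie itself and the cost of whatever is stored at each node are separate questions.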

Your pet theory about time complexity sounds interesting actually, did you write it up anywhere?

usernametaken29 · 2026-04-21T03:13:07 1776741187

Kahneman's book is considered outdated by modern neuroscience.

stingraycharles · 2026-04-21T03:07:59 1776740879

Dropping a grand theory of animal cognition into a defense of a KV cache compression bound is not something I was anticipating. I don’t think it’s a great argument.

wholinator2 · 2026-04-21T03:19:07 1776741547

At least some random pseudo-crackpottery like that points in the direction of it being a human. There are some strange human tendencies that AI just doesn't usually replicate.

Rekindle8090 · 2026-04-21T03:00:50 1776740450

[flagged]

cristoperb · 2026-04-21T03:09:15 1776740955

I can't speak for the person you're replying to, but I use -- for emdash for two reasons: I never remember how to type an actual emdash in linux/X11, and more importantly, I do most of my writing in Asciidoc which converts -- to an emdash automatically. It's nothing to do with bot detection or whatever.

But it does get me confused sometimes because in LaTeX (and other markup languages) -- gets converted to an endash whereas it takes three hyphens --- to make an emdash.

rhet0rica · 2026-04-21T03:21:38 1776741698

you are hereby sentenced by the council of dashers to type "—" ten million times using Windows-1252 alt codes

you have 5 seconds to comply before your planet will be demolished to make room for a giant space-typographer's punctuation case

EGreg · 2026-04-21T03:01:43 1776740503

Haha, yes I always used -- when I typed an em-dash manually. What bot detection extensions? :-P

EGreg · 2026-04-20T02:00:31 1776650431

Is war2.ru next?

I'm glad Blizzard doesn't mess with servers of its older games. Warcraft 2 was such a classic! Even more than Starcraft. The original granddaddy that people play 25 years later. That, and Myth 2 TFL was my favorite.

EGreg · 2026-04-20T01:15:15 1776647715

Actually, prolly trees are probably best for intersections. You can use bloom filters as a first pass
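A minimal sketch of the "bloom filter as a first pass" part, using a hand-rolled filter built on `hashlib` from the standard library. The class, its sizes and the `intersect` helper are hypothetical and untuned. In a real setup the exact check at the end would be the expensive step (e.g. a remote lookup or a tree walk), which is what the filter saves you from doing for most non-members.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False means definitely absent; True means "maybe present".
        return all((self.bits >> p) & 1 for p in self._positions(item))

def intersect(a, b):
    """Intersection of a and b, using a Bloom filter of b as a cheap prefilter."""
    bloom = BloomFilter()
    for x in b:
        bloom.add(x)
    candidates = [x for x in a if bloom.might_contain(x)]  # cheap first pass
    exact = set(b)                                         # stand-in for the costly exact check
    return [x for x in candidates if x in exact]

common = intersect(["a", "b", "c"], ["b", "c", "d"])
```

Because a Bloom filter has no false negatives, the prefilter can only discard items that are truly absent, so the final result is the exact intersection.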

EGreg · 2026-04-16T18:27:41 1776364061

Can Cloudflare do an SMS service? That would be something :)