> Local models sound great until you realize you don't get a lot of the features that we implicitly expect from hosted models. Many things would require additional investment in operations and setup to get to a comparable system. We ended up wanting things that would have required us to roll our own memory system, model harnesses, compliance tooling, and security.
That's not local models vs hosted models, that's using the enterprise services from Anthropic. Any local LLM inference engine such as vLLM gives you an OpenAI-compatible API with the same features as a hosted model.
I'm not sure what your use case is, but I personally found Anthropic's offerings lacking and inferior to open source or custom-built solutions. I have yet to see any "memory" system that's better than markdown files or search, and harnesses for agentic AIs are a dime a dozen.
AGI means artificial general intelligence, as opposed to artificial narrow intelligence. General intelligence means being able to generalise to many tasks beyond the single narrow one that an AI has been designed/trained on, and LLMs fit that description perfectly, being able to do anything from writing poetry, programming, summarising documents, translating, NLP, and if multi-modal, vision, audio, image generation... not all to human-level performance, but certainly to a useful one. As opposed to previous AI that was able to do only a single thing, like play chess or classify images, and had no way of being generalised to other tasks.
LLMs aren't artificial superintelligence and might not reach that point, but refusing to call them AGI is absolutely moving the goalposts.
There's also a difference between having no immediate use and having no reason to exist. From what I understand, sexual differentiation works by having the Y chromosome act as a switch, and both sexes have to share the same blueprint, with hormones guiding the development of their organs.
For males not to have nipples, they'd need to be actively destroyed, which poses a risk for females to also not have nipples, which is much worse than males having harmless, inactive nipples.
That's true, but inactive nipples don't cost anything, which certainly isn't the case for an inactive uterus. I don't know how it works, but I assume that such developments follow some kind of cost-benefit function.
afaik they serve some purpose in regulating androgenic-estrogenic hormone production.
The amount of testosterone in women is not zero, and likewise the amount of estrogen in men is not zero, and breast tissue does serve some purpose in regulating hormone production, even in men.
Aren't nipples pretty recent? Eggs have been around for a very long time; nipples haven't been evolving nearly as long. Maybe in a few hundred million years we'll no longer have nipples.
Mammalian fetuses all start out the same and sexual dimorphism happens several weeks into development. The same structure that eventually develops into a uterus can instead develop into a penis/prostate. Testicles and ovaries are the same tissue early in development, just like the glans and clitoris.
Biology doesn't generally suppress one entire set of organs in favor of another. They're built from the same precursor tissue and only diverge after sex hormones are activated. Biology and evolution modify existing structures, it does not typically erase one structure and replace it with another.
In addition, intersex humans exist. There are documented instances of males born with uteri, and external genitals can form halfway between male and female. Biology can get very messy sometimes. Sex is not a hard binary switch, it's a sliding scale just like most biological features. It's just that most individuals are at one end or the other; there's a lot of room in between.
They would honestly have been better off refusing customers if compute is so limited. Degrading the quality leads to customers leaving in the short term, and ruins their long term reputation.
But in either case, if compute is so limited, they’ll have to compete with local coding agents. Qwen3.6-27B is good enough to beat having to wait until 5PM for your Claude Code limit to reset.
The recent Deepseek release probably has them more worried. But locally running these large models requires a lot of infra expertise. Market impact will be minimal. Not to mention the companies that can pull this off have enough cash to just pay Anthropic to begin with.
> Too bad "tiny screens" pretty much do not exist anymore. Screens with hundreds of pixels on each side are very cheap already.
Find me a 0.66" OLED display for ~$1 that has hundreds of pixels on each side then.
> It reminds me people who research "colorizing grayscale photos", which do not exist anymore either (if you want a color photo of someone you met in your life, there probably exists a color photo of that person).
What train of thought led you to think people are primarily researching colorising new B&W photos? As opposed to historical ones, or those of relatives taken when they were young? You can take a colour photo of granddad today but most likely the photos of him in his 20s are all in black and white.
If you know a person who is 70 years old, they were 20 in 1975 - color photos existed back then.
Every grayscale photo of someone famous has already been colorized during the past 50 years. If there are only grayscale photos of you, you were probably born before 1900, and all your friends or your children (who might want to colorize your photo) are probably dead, too.
1. Improving the colourisation algorithms has value, it might be that the available colourised photos of celebrities have inaccurate colours or are of poorer quality than say, one done with a diffusion model that can be instructed about the colours of certain objects
2. Don’t forget about B&W films! Getting automatic methods to be consistent over a long length is still not 100% solved. People are very interested in seeing films from WW1 and WW2 in colour, for instance.
3. Plenty of people (myself included) have relatives in their 80s or 90s. Or maybe someone wants to see their ancestors from the 19th century in colour for whatever reason?
Color photos existed, but color film and processing were very expensive (and while mono film development was something a middle school student could do at home for a generation, home color work wasn't a thing until the late 80s/early 90s, as far as I recall). So in practice, I personally have childhood pics of my dad with his mom and sister that were shot in black and white but colorized by being hand painted, and this was pretty common...
Hetzner also offers a VPS with superior specs to their old DO server for €374.99/month, or €0.6009/hour. They could just switch to a VPS temporarily while waiting for the hardware fix.
Although since they were running a LEMP server stack manually and did their migration by copying all files in /var/www/html via rsync and ad-hoc python scripts, even a DO droplet doesn't have the best guarantee. Their lowest-hanging fruit is probably switching to infrastructure as code, and dividing their stack across multiple cheaper servers instead of having a central point of failure for 34 applications.
I used the $60/mo subscription (and I bet most developers get access to AI agents via their company), and there was no difference. They should have reduced the rate limits, or offered a new model, anything except silently reducing the quality of their flagship product to cut costs.
The cost of switching is too low for them to be able to get away with the standard enshittification playbook. It takes all of 5 minutes to get a Codex subscription and it works almost exactly the same, down to using the same commands for most actions.
> 2017’s Attention is All You Need was groundbreaking and paved the way for ChatGPT et al. Since then ML researchers have been trying to come up with new architectures, and companies have thrown gazillions of dollars at smart people to play around and see if they can make a better kind of model. However, these more sophisticated architectures don’t seem to perform as well as Throwing More Parameters At The Problem. Perhaps this is a variant of the Bitter Lesson.
This is not true and unfortunately this significantly reduced the credibility of this article for me. Raw parameter counts stopped increasing almost 5 years ago, and modern models rely on sophisticated architectures like mixture-of-experts, multi-head latent attention, hybrid Mamba/Gated linear attention layers, sparse attention for long context lengths, etc. Training is also vastly more sophisticated.
The Bitter Lesson is misunderstood. It doesn't say "algorithms are pointless, just throw more compute at the problem", it says that general algorithms that scale with more compute are better than algorithms that try to directly encode human understanding. It says nothing about spending time optimising algorithms to scale better for the same compute, and attention algorithms and LLMs in general have significantly advanced beyond "moar parameters" since the time of Attention is All You Need/GPT2/GPT3.
Literally the paragraph right before the one you quote is this:
> I am generally outside the ML field, but I do talk with people in the field. One of the things they tell me is that we don’t really know why transformer models have been so successful, or how to make them better. This is my summary of discussions-over-drinks; take it with many grains of salt. I am certain that People in The Comments will drop a gazillion papers to tell you why this is wrong.
As I understand it, this article is basically a conglomeration of several attempts at an article that the author has attempted to make over the past decade or so considering the impacts of AI on society. In their own words:
> Some of these ideas felt prescient in the 2010s and are now obvious. Others may be more novel, or not yet widely-heard. Some predictions will pan out, but others are wild speculation. I hope that regardless of your background or feelings on the current generation of ML systems, you find something interesting to think about.
As for the "Bitter Lesson" part, they pretty much directly said that it wasn't the Bitter Lesson exactly, saying it might be a variant of it. Honestly, it felt more like a way of throwing in a reference to something that also might provoke thought, which was done throughout the piece (which again, is the entire point).
It's totally valid to say "this article didn't provoke much thought for me". I'm a bit confused at why you think a lack of specific domain knowledge in a domain that they literally state they are not an expert in would be disqualifying for that purpose though.
The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
> The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
Maybe some, but not that much given the disclaimers I cited above. There's value in a qualitative confidence level for a statement, and I'd argue that this is something that LLMs do not seem to produce in practice without someone explicitly asking for it. The human author's ability to anticipate potential mistakes in their logic and communicate those ahead of time is not equivalent to the type of fabrications that LLMs routinely make.
> If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
I don't know why an expert in LLM implementation would be inherently more qualified to analyze the second-order effects of their product than anyone else. There's precedent for people who are "too close" to something having biases that make them less effective at recognizing how tools will get used by non-experts, and society as a whole is largely composed of people who are not experts in LLM implementations. If you want to understand the net effect of everyone having access to LLMs, an understanding of people is probably more important than knowing exactly what an LLM does under the hood.
Might the conclusions be correct even if some of the facts are not? Even a stopped clock is right twice a day. And, "approximately correct" is still sometimes valuable.
The most obvious reason is that transformers accept a sequence as an input and produce a sequence as an output. The vast majority of pre-transformer architectures only accepted a fixed input and output size. Before 2016 I was somewhat interested in ML, but my curiosity vanished because of the fixed input and output size limitations.
RNNs including LSTMs at the time were pretty bad and difficult to train due to vanishing and exploding gradients at long sequence lengths and sequential training along the sequence length. Meanwhile transformers can be parallelized along the sequence length.
Then there are theoretical limitations. Transformers re-read the entire sequence for every output. This leads to quadratic attention. There are plenty of papers that tell you why it is impossible to replicate the properties of quadratic attention with linear attention.
The reason is blatantly obvious. If you want linear attention to have the same capability, you need to re-read the entire input sequence after every output. If you do this at the token level, then you have basically implemented quadratic attention.
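The "re-reads the entire sequence for every output" point can be made concrete with a minimal numpy sketch of single-head scaled dot-product attention (a generic illustration, not any particular model's implementation; the weight matrices here are random placeholders):

```python
import numpy as np

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n): every output position re-reads every input position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the full sequence
    return weights @ v                               # (n, d)

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) for _ in range(3)]
out = attention(x, *w)
```

The intermediate `scores` matrix is (n, n), which is exactly where the quadratic memory and compute cost comes from: doubling the sequence length quadruples that matrix.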
Transformers aren't a mystery success, they are using computational brute force, which is hard to beat with other architectures. If you go with a more efficient architecture, you are by definition giving up some non-zero capabilities. Nobody really cares about getting slightly worse results from a much more efficient architecture. In the current ML space, it's SOTA (state of the art) or go home.
Transformers do have a fixed input/output size though - that's what a context window is. It's just that, via scaling and algorithmic improvements, the length of usable context windows has increased to the point that they're much less of a bottleneck.
I think your points around parallelisation and the flexibility of quadratic attention are spot-on though.
transformers have a fixed input size (padding the unneeded context window with null tokens). Whether you put in a sequence of things or just random tokens is irrelevant. To the network it is just "one input"
They also have a fixed output of one probability distribution for the next one token.
running it in a loop does not mean it can work with sequences; by that definition, literally everything else can too
Sorry but that's false, you are confusing transformers as an architecture with auto-regressive generation and with padding during training.
Standard transformers take in an arbitrary input size and run blocks (self and possibly cross attention, positional encoding, MLPs) that don't care about its length.
> They also have a fixed output of one probability distribution for the next one token.
No, in most implementations, they output a probability distribution for every token in the input. If you input 512 tokens, you get 512 probability distributions. You can input however many tokens you want - 1, 2048, one million, it's the same thing (although since standard self-attention scales quadratically you'll eventually run out of memory). Modern relative embeddings like RoPE can support infinite length although the quality will degrade if you extrapolate too far beyond what the model saw during training.
For typical auto-regressive generation, they are trained with causal masking/teacher forcing, which makes it calculate the probability for the next token. During inference, you throw away all but the last probability distribution and use that to sample the next token, and then repeat. You also do this with an RNN. An autoregressive CNN (e.g. WaveNet) would be closer to what you described in that it has a fixed window looking backwards.
But a transformer doesn't have to be used for auto-regressive generation, you can use it for diffusion, as a classifier model, for embedding text. It doesn't even see a sequence as spatially organised - unlike a CNN or an RNN it doesn't have architectural intrinsic biases about the position of elements, which is why it needs positional embeddings. This lets you have 2D, 3D, 4D, or disordered elements in a sequence. You can even have non-regularly sampled sequences. (Again this is for a classic transformer without sliding window attention or any other special modifications).
> (padding the unneeded context window with null tokens).
To have efficient training, you pad all samples in a batch to have the same length (and maybe make it a power of two). But you are working with a single sequence, the length is arbitrary up to hardware limitations, and no padding is needed.
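The "one probability distribution per input token" point can be sketched end-to-end with a toy numpy model (random untrained weights, purely illustrative; `vocab`, `d`, and `n` are arbitrary sizes, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n = 10, 4, 5
tokens = rng.integers(0, vocab, size=n)           # any sequence length works

emb = rng.standard_normal((vocab, d))             # token embeddings
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
w_out = rng.standard_normal((d, vocab))           # projection to vocabulary logits

x = emb[tokens]                                   # (n, d)
scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(d)     # (n, n)
scores += np.triu(np.full((n, n), -1e9), k=1)     # causal mask: position i sees only positions <= i
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
logits = (weights @ (x @ w_v)) @ w_out            # (n, vocab): one distribution per position
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
next_token = int(probs[-1].argmax())              # generation keeps only the last row
```

Note that `probs` has one row per input token: during auto-regressive inference you sample only from the last row, but all rows exist and all are used during training.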
The network has a fixed number of input neurons. You have to put something in all of them.
If you enter "hello", the network might get "      hello", but all of its input slots need some value. It doesn't (and can't) process tokens one at a time.
"No, in most implementations, they output a probability distribution for every token in the input."
A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not to be rude, but you're arguing with a machine learning engineer about the basics of neural network architectures :P
> The network has a fixed number of input neurons. You have to put something in all of them.
The way transformers work is that they apply the same "input neurons" to each individual token! It's not:
Token 1 -> Neuron 1
Token 2 -> Neuron 2
Token 3 -> Neuron 3...
with excess neurons left unused. Instead, it's:
Token 1 -> Vector of dimensions N -> ALL neurons
Token 2 -> Vector of dimensions N -> ALL neurons
Token 3- > Vector of dimensions N -> ALL neurons
...
Grossly oversimplified, in a typical transformer layer, you have 3 distinct such "networks" of neurons. You apply each of them to each token, giving you, for each token, a "query", a "key", and a "value". You take the dot product of their query and key, apply softmax, then multiply it with the value, giving you the vector to feed into the next layer.
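A tiny numpy sketch of the shared-weights point (illustrative only; `d` and the lengths are arbitrary): the same projection matrix handles any sequence length, because it's applied per token rather than per position.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_q = rng.standard_normal((d, d))  # one shared projection, NOT one neuron per position

for n in (3, 7, 100):              # the same weights handle any sequence length
    x = rng.standard_normal((n, d))
    q = x @ w_q                    # applied independently to each token's vector
    assert q.shape == (n, d)
    # the batched matmul equals applying w_q to each token separately
    assert np.allclose(q[0], x[0] @ w_q)
```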
> A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
> In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not quite, the reason transformers train fast is because you can train on all columns at once.
For tokens 1, 2, 3, 4, ... you get predictions for tokens 2, 3, 4, 5... Typical autoregressive transformer training uses a causal mask, so that token 1 doesn't see token 2, enabling you to train on all the predictions at once.
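The shifted-targets setup can be sketched in a few lines of numpy (a generic illustration of causal language-model training, with a made-up token sequence):

```python
import numpy as np

tokens = np.array([3, 1, 4, 1, 5, 9, 2, 6])
inputs, targets = tokens[:-1], tokens[1:]           # predict token t+1 from tokens <= t
n = len(inputs)
causal_mask = np.tril(np.ones((n, n), dtype=bool))  # position i may attend only to j <= i
# one forward pass over `inputs` yields n predictions, one per position,
# so the training loss covers every next-token prediction simultaneously
assert not causal_mask[0, 1]                        # token 1 cannot see token 2
assert targets[0] == tokens[1]                      # position 0 is trained to predict token 2
```

This is why transformer training parallelises over the sequence length while RNN training does not: all n predictions come out of a single forward pass.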
>Raw parameter counts stopped increasing almost 5 years ago, and modern models rely on sophisticated architectures like mixture-of-experts, multi-head latent attention, hybrid Mamba/Gated linear attention layers, sparse attention for long context lengths, etc.
Agree, I recently updated our office's little AI server to use Qwen 3.5 instead of Qwen 3 and the capability has considerably increased, even though the new model has fewer parameters (32b => 27b)
Yesterday I spent some time investigating it:
- Gated DeltaNet (invented in 2024 I think) in Qwen3.5 saves memory for the KV cache so we can afford larger quants
- larger quants => more accurate
- I updated the inference engine to have TurboQuant's KV rotations (2026) => 8-bit KV cache is more accurate
Before, Qwen3 on this humble infra could not properly function in OpenCode at all (wrong tool calls, generally dumb, small context); now Qwen 3.5 can solve 90% of the problems I throw at it.
All that thanks to algorithmic/architectural innovations while actually decreasing the parameter count.
I agree the original poster exaggerated it. But generally models indeed have stopped growing at around 1-1.5 trillion parameters, at least for the last couple of years.
>Even now, I don't know if parameter count stopped mattering or just matters less
Models in the 20b-100b range are already very capable when it comes to basic knowledge, reasoning etc. Improving the architecture and having better training recipes helped decrease the required parameter count considerably (currently 8b models can easily beat the 175b-strong GPT3 from 3 years ago in many domains). What increasing the parameter count currently gives you is better memorization, i.e. better world knowledge without having to consult external knowledge bases, say, using RAG. For example, Qwen3.5 can one-shot compilable code, reason etc. but can't remember the exact API calls to many libraries, while Sonnet 4.6 can. I think what we need is to split models into 2 parts: "reasoner" and "knowledge base". I think a reasoner could be pretty static with infrequent updates, and it's the knowledge base part which needs continuous updates (and trillions of parameters). Maybe we could have a system where a reasoner could choose different knowledge bases on demand.
5 years ago was the beginning of 2021, just under a year after GPT3 was released (which was not good at doing anything useful). And that model was 175B params.
GPT4 has been widely rumored to have 1.8 trillion params, which is 10x more, and was released 2 years after this "5 years ago" date that you are using here.
So, to quote yourself here, "This is not true and unfortunately this significantly reduced the credibility of this article for me" /s/article/comment
In late 2021, GLaM had 1.2T parameters. It's difficult to find much use of it in the wild and while the benchmarks it uses are rather outdated, it has a HellaSwag score of 76.6% and WinoGrande of 73.5%. GPT3 had 64.3% and 70.2%.
Meanwhile, Gemma 2 9B, a model from July 2024 with 133x fewer parameters than GLaM, scores 82% and 80.6%. Hellaswag and WinoGrande aren't used in modern benchmarks, probably because they're too easy and largely memorised at this point.
And GPT-4 had 1.8T parameters sure, but it's noticeably worse than any modern model a fraction of the size, and the original incarnation was ridiculously expensive per token. And in any case, its number of parameters was only possible due to using mixture-of-experts, which I would definitely classify as a sophisticated architecture as opposed to just throwing more parameters at a vanilla transformer. Even in 2021, GLaM was a MoE because the limits of scaling dense transformers had already been hit.
MoE has made it vastly easier to increase total parameters (and recent open models are really quite large) but it's also hard to compare a MoE with an earlier dense model.
Yeah I also came here to be one of those People In The Comments the author refers to.
Transformers are not magical. They are just a huge improvement over other architectures at the time such as LSTMs and RNNs and even CNNs. They allowed us to throw more and more compute at the problem of next token prediction. And we’ve been riding that horse ever since.
Another big advancement that deserves mentioning is “reasoning” models that have the opportunity to spit out thinking tokens before giving a final answer.
None of this is to say transformers are the most principled approach. But they work.
Transformers' greatest improvement over RNN/LSTM was to enable better parallelization of large-scale training. This is what enabled language models to become "large". But when controlling for overall size, more RNN/LSTM-like approaches seem to be more efficient, as seen e.g. in state space models. The transformer architecture does add some notable capabilities in accounting for long-range dependencies and "needle in a haystack" scenarios, but these are not a silver bullet; they matter in very specific circumstances.
With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers or even better when it comes to enormous context lengths. Dot-product attention has better performance in a number of domains however (especially for exact retrieval) so the best architectures are likely to remain hybrid for now.
>With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers or even better when it comes to enormous context lengths.
That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrency during training) to get a RNN to scale well. None of the big labs seem to be bothered with hybrid approaches.
> That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrency during training) to get a RNN to scale well.
SSMs move the non-linearity outside of the recurrence which enables parallelisation during training. It is trivial to do this architectural change with an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
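The point about moving the non-linearity outside the recurrence can be sketched in numpy (a toy with a scalar hidden state; real SSMs use vector or matrix states, and the gates here are random placeholders): because the recurrence is linear in h, it has a closed form built from cumulative products and sums, which parallelise via associative scans, unlike a tanh recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
a = rng.uniform(0.5, 1.0, n)   # decay gates
b = rng.standard_normal(n)     # per-step inputs

# sequential RNN-style loop: h_t = a_t * h_{t-1} + b_t
h_loop = np.zeros(n)
h = 0.0
for t in range(n):
    h = a[t] * h + b[t]
    h_loop[t] = h

# same recurrence without the loop: since it is LINEAR in h,
# h_t = P_t * sum_{s<=t} b_s / P_s  with  P_t = prod_{r<=t} a_r,
# i.e. it decomposes into cumulative ops that can run as parallel scans
P = np.cumprod(a)
h_scan = P * np.cumsum(b / P)

assert np.allclose(h_loop, h_scan)
```

Wrap a tanh around the update and this closed form disappears, which is exactly the trade-off being discussed: linearising the recurrence buys parallel training at the cost of the non-linear state transition.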
> None of the big labs seem to be bothered with hybrid approaches.
Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
>SSMs move the non-linearity outside of the recurrence which enables parallelisation during training. It is trivial to do this architectural change with an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
Removing the non-linearity from the recurrence path is exactly what constitutes a "pretty big architectural divergence." A linear RNN is an RNN in a structural sense, certainly, but functionally it strips out the non-linear state transitions that made traditional LSTMs so expressive, entirely to enable associative scans. The inductive bias is fundamentally altered. Calling that simply 'modern training techniques' is disingenuous at best.
>But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
That does not scale anywhere near as well as transformers in compute spend. It's a paper/research novelty. Nobody will be doing this for production.
>Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
> That does not scale anywhere near as well as transformers in compute spend. It's a paper/research novelty. Nobody will be doing this for production.
What exactly makes you so confident?
The world is not just labs that can afford billion dollar datacentres and selling access to SOTA LLMs at $30/Mtokens. Transformers are highly unsuitable for many applications for a variety of reasons and non-linear RNNs trained via parallel methods are an extremely attractive value proposition and will likely feature in production in the next products I work on.
> I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
See the Qwen3.5 Huggingface description: https://huggingface.co/Qwen/Qwen3.5-27B
> Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
Existing research? If you want something that scales as well as transformers you have to make the divergences I was talking about. If you don't then it scales a lot worse. The Newton methods don't match transformer efficiency at scale. That's just a fact.
>The world is not just labs that can afford billion dollar datacentres and selling access to SOTA LLMs at $30/Mtokens.
Billion dollar labs want to save money too. If Modern RNNs were a massive unanimous win, they and everyone else would switch in a heartbeat, just like they did for transformers. The reason they don't is because these architectures at best simply match transformers, while introducing their own architectural issues.
The dotcom bubble burst and 26 years later we’re all hopelessly addicted to the internet and the top companies on the stock market are almost all what would have been called “dotcoms” then.
The railroad bubble burst in 1846 not because trains were a dead end - passenger numbers would increase more than 10x in the UK in the following 50 years.