> Local models sound great until you realize you don't get a lot of the features that we implicitly expect from hosted models. Many things would require additional investment in operations and setup to get to a comparable system. We ended up wanting things that would have required us to roll our own memory system, model harnesses, compliance tooling, and security.
That's not local models vs hosted models, that's using the enterprise services from Anthropic. Any local LLM inference engine such as vLLM gives you an OpenAI-compatible API with the same features as a hosted model.
I'm not sure what your use case is, but I personally found Anthropic's offerings lacking and inferior to open source or custom-built solutions. I have yet to see any "memory" system that's better than markdown files or search, and harnesses for agentic AIs are a dime a dozen.
AGI means artificial general intelligence, as opposed to artificial narrow intelligence. General intelligence means being able to generalise to many tasks beyond the single narrow one that an AI has been designed/trained on, and LLMs fit that description perfectly, being able to do anything from writing poetry, programming, summarising documents, translating, NLP, and if multi-modal, vision, audio, image generation... not all to human-level performance, but certainly to a useful one. As opposed to previous AI that was able to do only a single thing, like play chess or classify images, and had no way of being generalised to other tasks.
LLMs aren't artificial superintelligence and might not reach that point, but refusing to call them AGI is absolutely moving the goalposts.
There's also a difference between having no immediate use and having no reason to exist. From what I understand, sexual differentiation works by having the Y chromosome act as a switch, and both sexes have to share the same blueprint, with hormones guiding the development of their organs.
For males not to have nipples, they'd need to be actively destroyed, which poses a risk for females to also not have nipples, which is much worse than males having harmless, inactive nipples.
That's true, but inactive nipples don't cost anything, which certainly isn't the case for an inactive uterus. I don't know how it works, but I assume that such developments follow some kind of cost-benefit function.
afaik they serve some purpose in regulating androgenic-estrogenic hormone production.
The amount of testosterone in women is not zero, and likewise the amount of estrogen in men is not zero, and breast tissue does serve some purpose in regulating hormone production, even in men.
Aren't nipples pretty recent? Eggs have been around for a very long time; nipples haven't been evolving nearly as long. Maybe in a few hundred million years we'll no longer have nipples.
Mammalian fetuses all start out the same and sexual dimorphism happens several weeks into development. The same structure that eventually develops into a uterus can instead develop into a penis/prostate. Testicles and ovaries are the same tissue early in development, just like the glans and clitoris.
Biology doesn't generally suppress one entire set of organs in favor of another. They're built from the same precursor tissue and only diverge after sex hormones are activated. Biology and evolution modify existing structures, it does not typically erase one structure and replace it with another.
In addition, intersex humans exist. There are documented instances of males born with uteri, and external genitals can form halfway between male and female. Biology can get very messy sometimes. Sex is not a hard binary switch, it's a sliding scale just like most biological features. It's just that most individuals are at one end or the other; there's a lot of room in between.
They would honestly have been better off refusing customers if compute is so limited. Degrading the quality leads to customers leaving in the short term, and ruins their long term reputation.
But in either case, if compute is so limited, they’ll have to compete with local coding agents. Qwen3.6-27B is good enough to beat having to wait until 5PM for your Claude Code limit to reset.
The recent Deepseek release probably has them more worried. But locally running these large models requires a lot of infra expertise. Market impact will be minimal. Not to mention the companies that can pull this off have enough cash to just pay Anthropic to begin with.
> Too bad "tiny screens" pretty much do not exist anymore. Screens with hundreds of pixels on each side are very cheap already.
Find me a 0.66" OLED display for ~$1 that has hundreds of pixels on each side then.
> It reminds me people who research "colorizing grayscale photos", which do not exist anymore either (if you want a color photo of someone you met in your life, there probably exists a color photo of that person).
What train of thought led you to think people are primarily researching colorising new B&W photos? As opposed to historical ones, or those of relatives taken when they were young? You can take a colour photo of granddad today but most likely the photos of him in his 20s are all in black and white.
If you know a person who is 70 years old, they were 20 in 1975 - color photos existed back then.
Every grayscale photo of someone famous has already been colorized during the past 50 years. If there are only grayscale photos of you, you were probably born before 1900, and all your friends or your children (who might want to colorize your photo) are probably dead, too.
1. Improving the colourisation algorithms has value, it might be that the available colourised photos of celebrities have inaccurate colours or are of poorer quality than say, one done with a diffusion model that can be instructed about the colours of certain objects
2. Don’t forget about B&W films! Getting automatic methods to be consistent over a long length is still not 100% solved. People are very interested in seeing films from WW1 and WW2 in colour, for instance.
3. Plenty of people (myself included) have relatives in their 80s or 90s. Or maybe someone wants to see their ancestors from the 19th century in colour for whatever reason?
Color photos existed, but color film and processing were very expensive (and while mono film development was something a middle school student could do at home for a generation, home color work wasn't a thing until the late 80s/early 90s, as far as I recall). So in practice, I personally have childhood pics of my dad with his mom and sister that were shot in black and white but colorized by being hand painted, and this was pretty common...
Hetzner also offers a VPS with superior specs to their old DO server for €374.99/month, or €0.6009/hour. They could just switch to a VPS temporarily while waiting for the hardware fix.
Although since they were running a LEMP server stack manually and did their migration by copying all files in /var/www/html via rsync and ad-hoc python scripts, even a DO droplet doesn't have the best guarantee. Their lowest-hanging fruit is probably switching to infrastructure as code, and dividing their stack across multiple cheaper servers instead of having a central point of failure for 34 applications.
I used the $60/mo subscription (and I bet most developers get access to AI agents via their company), and there was no difference. They should have reduced the rate limits, or offered a new model, anything except silently reducing the quality of their flagship product to cut costs.
The cost of switching is too low for them to be able to get away with the standard enshittification playbook. It takes all of 5 minutes to get a Codex subscription and it works almost exactly the same, down to using the same commands for most actions.
> 2017’s Attention is All You Need was groundbreaking and paved the way for ChatGPT et al. Since then ML researchers have been trying to come up with new architectures, and companies have thrown gazillions of dollars at smart people to play around and see if they can make a better kind of model. However, these more sophisticated architectures don’t seem to perform as well as Throwing More Parameters At The Problem. Perhaps this is a variant of the Bitter Lesson.
This is not true and unfortunately this significantly reduced the credibility of this article for me. Raw parameter counts stopped increasing almost 5 years ago, and modern models rely on sophisticated architectures like mixture-of-experts, multi-head latent attention, hybrid Mamba/Gated linear attention layers, sparse attention for long context lengths, etc. Training is also vastly more sophisticated.
The Bitter Lesson is misunderstood. It doesn't say "algorithms are pointless, just throw more compute at the problem", it says that general algorithms that scale with more compute are better than algorithms that try to directly encode human understanding. It says nothing about spending time optimising algorithms to scale better for the same compute, and attention algorithms and LLMs in general have significantly advanced beyond "moar parameters" since the time of Attention is All You Need/GPT2/GPT3.
Literally the paragraph right before the one you quote is this:
> I am generally outside the ML field, but I do talk with people in the field. One of the things they tell me is that we don’t really know why transformer models have been so successful, or how to make them better. This is my summary of discussions-over-drinks; take it with many grains of salt. I am certain that People in The Comments will drop a gazillion papers to tell you why this is wrong.
As I understand it, this article is basically a conglomeration of several attempts at an article that the author has attempted to make over the past decade or so considering the impacts of AI on society. In their own words:
> Some of these ideas felt prescient in the 2010s and are now obvious. Others may be more novel, or not yet widely-heard. Some predictions will pan out, but others are wild speculation. I hope that regardless of your background or feelings on the current generation of ML systems, you find something interesting to think about.
As for the "Bitter Lesson" part, they pretty much directly said that it wasn't the Bitter Lesson exactly, saying it might be a variant of it. Honestly, it felt more like a way of throwing in a reference to something that also might provoke thought, which was done throughout the piece (which again, is the entire point).
It's totally valid to say "this article didn't provoke much thought for me". I'm a bit confused at why you think a lack of specific domain knowledge in a domain that they literally state they are not an expert in would be disqualifying for that purpose though.
The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
> The title of the article is “The Future of Everything is Lies, I Guess” and the first part is literally complaining about LLMs being bullshit machines, while the author proceeds to tell confabulations (or lies) of his own. Is there not a bit of irony in that?
Maybe some, but not that much given the disclaimers I cited above. There's value in a qualitative confidence level for a statement, and I'd argue that this is something that LLMs do not seem to produce in practice without someone explicitly asking for it. The human author's ability to anticipate potential mistakes in their logic and communicate those ahead of time is not equivalent to the type of fabrications that LLMs routinely make.
> If you’re a non-expert in a field, I don’t think it’s a good sign if you’re writing a 10 part article about that field’s impact on society and getting basic facts wrong. How can I trust that the conclusions will be any more credible?
I don't know why an expert in LLM implementation would be inherently more qualified to analyze the second-order effects of their product than anyone else. There's precedent for people who are "too close" to something having biases that make them less effective at recognizing how tools will get used by non-experts, and society as a whole is largely composed of people who are not experts in LLM implementations. If you want to understand the net effect of everyone having access to LLMs, an understanding of people is probably more important than knowing exactly what an LLM does under the hood.
Might the conclusions be correct even if some of the facts are not? Even a stopped clock is right twice a day. And, "approximately correct" is still sometimes valuable.
The most obvious reason is that transformers accept a sequence as an input and produce a sequence as an output. The vast majority of pre-transformer architectures only accepted a fixed input and output size. Before 2016 I was somewhat interested in ML, but my curiosity vanished because of the fixed input and output size limitations.
RNNs including LSTMs at the time were pretty bad and difficult to train due to vanishing and exploding gradients at long sequence lengths and sequential training along the sequence length. Meanwhile transformers can be parallelized along the sequence length.
Then there are theoretical limitations. Transformers re-read the entire sequence for every output. This leads to quadratic attention. There are plenty of papers that tell you why it is impossible to replicate the properties of quadratic attention with linear attention.
The reason is blatantly obvious. If you want linear attention to have the same capability, you need to re-read the entire input sequence after every output. If you do this at the token level, then you have basically implemented quadratic attention.
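The "re-reads the entire sequence for every output" point can be made concrete with a minimal numpy sketch of single-head scaled dot-product attention (a generic illustration, not any particular model's implementation; the weight matrices here are random placeholders):

```python
import numpy as np

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n, n): every output position re-reads every input position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the full sequence
    return weights @ v                               # (n, d)

rng = np.random.default_rng(0)
n, d = 8, 4
x = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) for _ in range(3)]
out = attention(x, *w)
```

The intermediate `scores` matrix is (n, n), which is exactly where the quadratic memory and compute cost comes from: doubling the sequence length quadruples that matrix.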
Transformers aren't a mystery success, they are using computational brute force, which is hard to beat with other architectures. If you go with a more efficient architecture, you are by definition giving up some non-zero capabilities. Nobody really cares about getting slightly worse results from a much more efficient architecture. In the current ML space, it's SOTA (state of the art) or go home.
Transformers do have a fixed input/output size though - that's what a context window is. It's just that, via scaling and algorithmic improvements, the length of usable context windows has increased to the point that they're much less of a bottleneck.
I think your points around parallelisation and the flexibility of quadratic attention are spot-on though.
transformers have a fixed input size (padding the unneeded context window with null tokens). Whether you put in a sequence of things or just random tokens is irrelevant. To the network it is just "one input"
They also have a fixed output of one probability distribution for the next one token.
running it in a loop does not mean it can work with sequences; by that definition, literally everything else can too
Sorry but that's false, you are confusing transformers as an architecture with auto-regressive generation and with padding during training.
Standard transformers take in an arbitrary input size and run blocks (self and possibly cross attention, positional encoding, MLPs) that don't care about its length.
> They also have a fixed output of one probability distribution for the next one token.
No, in most implementations, they output a probability distribution for every token in the input. If you input 512 tokens, you get 512 probability distributions. You can input however many tokens you want - 1, 2048, one million, it's the same thing (although since standard self-attention scales quadratically you'll eventually run out of memory). Modern relative embeddings like RoPE can support infinite length although the quality will degrade if you extrapolate too far beyond what the model saw during training.
For typical auto-regressive generation, they are trained with causal masking/teacher forcing, which makes it calculate the probability for the next token. During inference, you throw away all but the last probability distribution and use that to sample the next token, and then repeat. You also do this with an RNN. An autoregressive CNN (e.g. WaveNet) would be closer to what you described in that it has a fixed window looking backwards.
But a transformer doesn't have to be used for auto-regressive generation, you can use it for diffusion, as a classifier model, for embedding text. It doesn't even see a sequence as spatially organised - unlike a CNN or an RNN it doesn't have architectural intrinsic biases about the position of elements, which is why it needs positional embeddings. This lets you have 2D, 3D, 4D, or disordered elements in a sequence. You can even have non-regularly sampled sequences. (Again this is for a classic transformer without sliding window attention or any other special modifications).
> (padding the unneeded context window with null tokens).
To have efficient training, you pad all samples in a batch to have the same length (and maybe make it a power of two). But you are working with a single sequence, the length is arbitrary up to hardware limitations, and no padding is needed.
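The "one probability distribution per input token" point can be sketched end-to-end with a toy numpy model (random untrained weights, purely illustrative; `vocab`, `d`, and `n` are arbitrary sizes, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n = 10, 4, 5
tokens = rng.integers(0, vocab, size=n)           # any sequence length works

emb = rng.standard_normal((vocab, d))             # token embeddings
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
w_out = rng.standard_normal((d, vocab))           # projection to vocabulary logits

x = emb[tokens]                                   # (n, d)
scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(d)     # (n, n)
scores += np.triu(np.full((n, n), -1e9), k=1)     # causal mask: position i sees only positions <= i
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
logits = (weights @ (x @ w_v)) @ w_out            # (n, vocab): one distribution per position
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
next_token = int(probs[-1].argmax())              # generation keeps only the last row
```

Note that `probs` has one row per input token: during auto-regressive inference you sample only from the last row, but all rows exist and all are used during training.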
The network has a fixed number of input neurons. You have to put something in all of them.
If you enter "hello", the network might get "      hello", but all of its input slots need some value. It doesn't (and can't) process tokens one at a time.
"No, in most implementations, they output a probability distribution for every token in the input."
A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not to be rude, but you're arguing with a machine learning engineer about the basics of neural network architectures :P
> The network has a fixed number of input neurons. You have to put something in all of them.
The way transformers work is that they apply the same "input neurons" to each individual token! It's not:
Token 1 -> Neuron 1
Token 2 -> Neuron 2
Token 3 -> Neuron 3...
with excess neurons left unused. Instead, it's:
Token 1 -> Vector of dimensions N -> ALL neurons
Token 2 -> Vector of dimensions N -> ALL neurons
Token 3- > Vector of dimensions N -> ALL neurons
...
Grossly oversimplified, in a typical transformer layer, you have 3 distinct such "networks" of neurons. You apply each of them to each token, giving you, for each token, a "query", a "key", and a "value". You take the dot product of their query and key, apply softmax, then multiply it with the value, giving you the vector to feed into the next layer.
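A tiny numpy sketch of the shared-weights point (illustrative only; `d` and the lengths are arbitrary): the same projection matrix handles any sequence length, because it's applied per token rather than per position.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_q = rng.standard_normal((d, d))  # one shared projection, NOT one neuron per position

for n in (3, 7, 100):              # the same weights handle any sequence length
    x = rng.standard_normal((n, d))
    q = x @ w_q                    # applied independently to each token's vector
    assert q.shape == (n, d)
    # the batched matmul equals applying w_q to each token separately
    assert np.allclose(q[0], x[0] @ w_q)
```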
> A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.
> In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".
Not quite, the reason transformers train fast is because you can train on all columns at once.
For tokens 1, 2, 3, 4, ... you get predictions for tokens 2, 3, 4, 5... Typical autoregressive transformer training uses a causal mask, so that token 1 doesn't see token 2, enabling you to train on all the predictions at once.
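The shifted-targets setup can be sketched in a few lines of numpy (a generic illustration of causal language-model training, with a made-up token sequence):

```python
import numpy as np

tokens = np.array([3, 1, 4, 1, 5, 9, 2, 6])
inputs, targets = tokens[:-1], tokens[1:]           # predict token t+1 from tokens <= t
n = len(inputs)
causal_mask = np.tril(np.ones((n, n), dtype=bool))  # position i may attend only to j <= i
# one forward pass over `inputs` yields n predictions, one per position,
# so the training loss covers every next-token prediction simultaneously
assert not causal_mask[0, 1]                        # token 1 cannot see token 2
assert targets[0] == tokens[1]                      # position 0 is trained to predict token 2
```

This is why transformer training parallelises over the sequence length while RNN training does not: all n predictions come out of a single forward pass.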
>Raw parameter counts stopped increasing almost 5 years ago, and modern models rely on sophisticated architectures like mixture-of-experts, multi-head latent attention, hybrid Mamba/Gated linear attention layers, sparse attention for long context lengths, etc.
Agree, I recently updated our office's little AI server to use Qwen 3.5 instead of Qwen 3 and the capability has considerably increased, even though the new model has fewer parameters (32b => 27b)
Yesterday I spent some time investigating it:
- Gated DeltaNet (invented in 2024 I think) in Qwen3.5 saves memory for the KV cache so we can afford larger quants
- larger quants => more accurate
- I updated the inference engine to have TurboQuant's KV rotations (2026) => 8-bit KV cache is more accurate
Before, Qwen3 on this humble infra could not properly function in OpenCode at all (wrong tool calls, generally dumb, small context); now Qwen 3.5 can solve 90% of the problems I throw at it.
All that thanks to algorithmic/architectural innovations while actually decreasing the parameter count.
I agree the original poster exaggerated it. But generally models indeed have stopped growing at around 1-1.5 trillion parameters, at least for the last couple of years.
>Even now, I don't know if parameter count stopped mattering or just matters less
Models in the 20b-100b range are already very capable when it comes to basic knowledge, reasoning etc. Improving the architecture and having better training recipes helped decrease the required parameter count considerably (currently 8b models can easily beat the 175b-strong GPT3 from 3 years ago in many domains). What increasing the parameter count currently gives you is better memorization, i.e. better world knowledge without having to consult external knowledge bases, say, using RAG. For example, Qwen3.5 can one-shot compilable code, reason etc. but can't remember the exact API calls to many libraries, while Sonnet 4.6 can. I think what we need is to split models into 2 parts: "reasoner" and "knowledge base". I think a reasoner could be pretty static with infrequent updates, and it's the knowledge base part which needs continuous updates (and trillions of parameters). Maybe we could have a system where a reasoner could choose different knowledge bases on demand.
5 years ago was the beginning of 2021, just under a year after GPT3 was released (which was not good at doing anything useful). And that model was 175B params.
GPT4 has been widely rumored to have 1.8 trillion params, which is 10x more, and was released 2 years after this "5 years ago" date that you are using here.
So, to quote yourself here, "This is not true and unfortunately this significantly reduced the credibility of this article for me" /s/article/comment
In late 2021, GLaM had 1.2T parameters. It's difficult to find much use of it in the wild and while the benchmarks it uses are rather outdated, it has a HellaSwag score of 76.6% and WinoGrande of 73.5%. GPT3 had 64.3% and 70.2%.
Meanwhile, Gemma 2 9B, a model from July 2024 with 133x fewer parameters than GLaM, scores 82% and 80.6%. Hellaswag and WinoGrande aren't used in modern benchmarks, probably because they're too easy and largely memorised at this point.
And GPT-4 had 1.8T parameters sure, but it's noticeably worse than any modern model a fraction of the size, and the original incarnation was ridiculously expensive per token. And in any case, its number of parameters was only possible due to using mixture-of-experts, which I would definitely classify as a sophisticated architecture as opposed to just throwing more parameters at a vanilla transformer. Even in 2021, GLaM was a MoE because the limits of scaling dense transformers had already been hit.
MoE has made it vastly easier to increase total parameters (and recent open models are really quite large) but it's also hard to compare a MoE with an earlier dense model.
Yeah I also came here to be one of those People In The Comments the author refers to.
Transformers are not magical. They are just a huge improvement over other architectures at the time such as LSTMs and RNNs and even CNNs. They allowed us to throw more and more compute at the problem of next token prediction. And we’ve been riding that horse ever since.
Another big advancement that deserves mentioning is “reasoning” models that have the opportunity to spit out thinking tokens before giving a final answer.
None of this is to say transformers are the most principled approach. But they work.
Transformers' greatest improvement over RNN/LSTM was to enable better parallelization of large-scale training. This is what enabled language models to become "large". But when controlling for overall size, more RNN/LSTM-like approaches seem to be more efficient, as seen e.g. in state space models. The transformer architecture does add some notable capabilities in accounting for long-range dependencies and "needle in a haystack" scenarios, but these are not a silver bullet; they matter in very specific circumstances.
With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers or even better when it comes to enormous context lengths. Dot-product attention has better performance in a number of domains however (especially for exact retrieval) so the best architectures are likely to remain hybrid for now.
>With modern training techniques, RNNs (not just linear SSMs, potentially even vanilla LSTMs) can scale just as well as transformers or even better when it comes to enormous context lengths.
That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrency during training) to get a RNN to scale well. None of the big labs seem to be bothered with hybrid approaches.
> That's not true. Modern training techniques aren't enough. Vanilla RNNs with modern training techniques still scale poorly. You have to make some pretty big architectural divergences (throwing away recurrency during training) to get a RNN to scale well.
SSMs move the non-linearity outside of the recurrence which enables parallelisation during training. It is trivial to do this architectural change with an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
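The point about moving the non-linearity outside the recurrence can be sketched in numpy (a toy with a scalar hidden state; real SSMs use vector or matrix states, and the gates here are random placeholders): because the recurrence is linear in h, it has a closed form built from cumulative products and sums, which parallelise via associative scans, unlike a tanh recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
a = rng.uniform(0.5, 1.0, n)   # decay gates
b = rng.standard_normal(n)     # per-step inputs

# sequential RNN-style loop: h_t = a_t * h_{t-1} + b_t
h_loop = np.zeros(n)
h = 0.0
for t in range(n):
    h = a[t] * h + b[t]
    h_loop[t] = h

# same recurrence without the loop: since it is LINEAR in h,
# h_t = P_t * sum_{s<=t} b_s / P_s  with  P_t = prod_{r<=t} a_r,
# i.e. it decomposes into cumulative ops that can run as parallel scans
P = np.cumprod(a)
h_scan = P * np.cumsum(b / P)

assert np.allclose(h_loop, h_scan)
```

Wrap a tanh around the update and this closed form disappears, which is exactly the trade-off being discussed: linearising the recurrence buys parallel training at the cost of the non-linear state transition.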
> None of the big labs seem to be bothered with hybrid approaches.
Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
>SSMs move the non-linearity outside of the recurrence which enables parallelisation during training. It is trivial to do this architectural change with an LSTM (see the xLSTM paper). Linear RNNs are still RNNs.
Removing the non-linearity from the recurrence path is exactly what constitutes a "pretty big architectural divergence." A linear RNN is an RNN in a structural sense, certainly, but functionally it strips out the non-linear state transitions that made traditional LSTMs so expressive, entirely to enable associative scans. The inductive bias is fundamentally altered. Calling that simply 'modern training techniques' is disingenuous at best.
>But you can still keep the non-linearity by training with parallel Newton methods, which work on vanilla LSTMs and scale to billions of parameters.
That does not scale anywhere near as well as transformers in compute spend. It's a paper/research novelty. Nobody will be doing this for production.
>Does Alibaba not count? Qwen3.5 models are the top performers in terms of small models as far as my tests and online benchmarks go.
I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
> That does not scale anywhere near as well as transformers in compute spend. It's a paper/research novelty. Nobody will be doing this for production.
What exactly makes you so confident?
The world is not just labs that can afford billion dollar datacentres and selling access to SOTA LLMs at $30/Mtokens. Transformers are highly unsuitable for many applications for a variety of reasons and non-linear RNNs trained via parallel methods are an extremely attractive value proposition and will likely feature in production in the next products I work on.
> I guess there's some misunderstanding here because Qwen is 100% a transformer, not a hybrid RNN/LSTM whatever.
See the Qwen3.5 Huggingface description: https://huggingface.co/Qwen/Qwen3.5-27B
> Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
Existing research? If you want something that scales as well as transformers you have to make the divergences I was talking about. If you don't then it scales a lot worse. The Newton methods don't match transformer efficiency at scale. That's just a fact.
>The world is not just labs that can afford billion dollar datacentres and selling access to SOTA LLMs at $30/Mtokens.
Billion dollar labs want to save money too. If Modern RNNs were a massive unanimous win, they and everyone else would switch in a heartbeat, just like they did for transformers. The reason they don't is because these architectures at best simply match transformers, while introducing their own architectural issues.
The dotcom bubble burst and 26 years later we’re all hopelessly addicted to the internet and the top companies on the stock market are almost all what would have been called “dotcoms” then.
The railroad bubble burst in 1846 not because trains were a dead end - passenger numbers would increase more than 10x in the UK in the following 50 years.