idk, my 10 year old makerbot 2 has been pretty reliable. ever since PrusaSlicer came out and I tuned a profile for it maybe 6 years ago, it's been spitting out quick, dimensionally accurate prints. i use it all the time, probably go through a spool every month or two, and all i've had to replace is the extruder cooling fan, once
i'm mostly printing small mechanical parts and i can't say i have any complaints, i assume a modern prusa would be much better, surely there are other FDM printers that are good?
I hear this all the time. Why does it matter? Punishing a human for making a mistake does not prevent mistakes, nor does it undo the harm of the mistake. A human saying "my bad, I messed up" and an AI saying "my bad, I messed up" are equally worthless, in a functional sense.
"Punishing a human for making a mistake does not prevent mistakes" This statement suggests you don't believe in some combination of neuroplasticity as a concept or the arrow of time.
Tell the families of the people who died in the 737 MAX disasters: "Don't worry - everything's going to be okay! The engineers learned from their mistakes - accountability works, you have nothing to be sad about!"
Tell the family of the person killed by a semi truck driver who showed up to work drunk or high: "Don't worry - the driver went to jail! Accountability prevented anything bad from happening!"
Accountability alone fails to prevent deadly mistakes millions of times a day; millions of mistakes are avoided daily through process, redundancy, independent review, and formal methods.
"Accountability prevents mistakes" is a comforting delusion. In reality, accountability is only marginally related to whether or not mistakes are made.
"Accountability alone fails to prevent deadly mistakes millions of times a day"
...in his desperation to finally win an argument online our hero advanced, grimly ignoring the concept of Engineering.
"millions of mistakes are avoided daily through process, redundancy, independent review, and formal methods."
Ahh, spoke too soon, Engineering has finally joined the chat. So what mechanism do you propose led to the foundation of process, redundancy, independent review, and formal methods?
What are you even on about mate? Sure accountability doesn’t prevent all mistakes. Guess what, nothing prevents all mistakes. Accountability can help prevent some mistakes some of the time. It sounds like you’re suggesting getting rid of the concept of accountability because it doesn’t prevent ALL mistakes. Way to throw the baby out with the bath water.
>It sounds like you’re suggesting getting rid of the concept of accountability
Where on earth are you getting this?
Accountability alone is insufficient, and the things that actually prevent mistakes don't require it. The mechanisms that do work: independent reviews, redundancy, formal verification where applicable, staged rollouts, testing against adversarial inputs... these all function on the artifact, not on whether or not the entity producing the artifact can be held accountable. A formally verified proof is correct whether a human or an LLM generated it. A code review catches the same bug regardless of who's liable for shipping it.
The argument isn't "let's get rid of accountability", the argument is "it's ridiculous to suggest that the reason you shouldn't use AI is that AI lacks accountability - lacking accountability isn't the reason AI makes mistakes, and adding accountability to AI won't prevent AI from making mistakes - the answer to preventing mistakes with AI rests in process, and accountability does nothing to inherently ensure that".
Accountability is nothing but a transmission mechanism, and is blind to the values instilled through it. Accountability is literally what caused the 737 MAX disasters. The FAA decided it was more accountable to industry efficiency than it was to safety when it allowed Boeing to self-certify, which violated a process control of independent reviews. Boeing's board decided it wanted to be more accountable to shareholder value maximization than it was to safety when it allowed MCAS to experience scope creep without re-review, which violated a process control of formal verification. Boeing's designers and engineers decided to be accountable to shareholder value maximization when they decided to make MCAS rely on only one of two flight control computers, which violated a process control of redundancy. Engineers at Boeing flagged these failures, but they were ignored when management decided to be more accountable to shipping on time, which violated a process control of incorporating adversarial inputs and feedback.
Accountability did not prevent these mistakes, it caused them. Failure to abide by process controls caused the mistakes. Adding more accountability wouldn't have prevented these mistakes; maintaining strict adherence to the process controls that used to be in place would have.
If a human messes up enough, eventually they will get fired, fined, or jailed. An AI will not.
A human also knows they might get punished if they mess up badly enough, which might cause them to think twice before doing something bad. For an AI there is a reward, but there is no risk.
So while both might lie, only the human will be worried that it will be found out. That makes a difference.
You say that like all humans are alike: that they all care about getting fired, fined, or jailed; that they're even considering punishment when they're making their decisions; that risk factors into decision making.
What you are describing is a hypothetical "rational person". In real life, even the most rational people you know do completely irrational things routinely.
The Therac-25 engineers were accountable. The 737 MAX engineers were accountable. Accountability is doing much less work in the safety story than you seem to think.
The real work is done by process, redundancy, independent review, formal methods. None of these inherently require someone to be penalized for making mistakes, and penalizing people for making mistakes is a demonstrably, empirically unreliable mechanism for preventing mistakes.
I hear you, but isn't the human in the loop precisely the one who should be putting their foot down and saying "no, the AI shouldn't be writing the tests to begin with", which would bring us full circle?
Punishing humans does, in fact, prevent mistakes. Or rather, the threat of punishment causes people to be careful to avoid mistakes, and that prevents mistakes. Sure, this doesn't work 100% of the time, but it does work and has throughout human history. Meanwhile, there's no equivalent paradigm for LLMs.
Even if you could threaten an LLM with punishment for making mistakes, you might get longer CoTs, but that wouldn't prevent mistakes in LLMs. The lack of accountability isn't the reason that LLMs make mistakes - adding accountability wouldn't change anything.
Also, you can feed it ALL of your data willy-nilly without ever worrying about safety, because you can just do it with the LAN cable unplugged. For applications that demand data hygiene it's a cheat code that guarantees safety without any sort of data sanitization.
It's more complex than that, I think the reality is that there's a lot of code that's just not that deep bro. I have some purely personal projects with components that I don't understand anymore; I wrote that shit by hand, they still work, but I haven't touched that shit in years. There's a lot of code like that that AI can write for me, the stuff I would forget about even if I wrote it by hand. I think you have to have discipline in its use, it's a tool like any other.
AI, and especially agentic AI, can make you lose situational awareness over a codebase, and when you're doing deep work that SUUUUCKS, but it's not useless, you just have to play to its strengths. Though my favorite hill to die on is telling people not to underestimate its value as autocomplete. Turns out 40 gigabytes of autocomplete makes for a fucking amazing autocomplete. Try it with llama.vim + qwen coder 30b, it feels like the editor is reading your mind sometimes and the latency is so low.
Flat wrong. Q6 Gemma 31b feels a lot like Opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of use cases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now, and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails, and then having Claude do it, only to come away with something perfectly usable from the local model.
Your n=1 might not be very relevant outside your personal use. In less contaminated benchmarks Gemma 4 is way below Sonnet 4.5, let alone Opus models: https://swe-rebench.com/
Benchmarks only give you the roughest idea of how models compare in real world use. They're essentially useless beyond maybe classifying models into a few buckets. The only way you gain an understanding of something as complex as how an LLM integrates with your workflow is by doing it and measuring across many trials. I've been running Opus 4.7 in Claude Code and Gemma 4 31b in parallel on projects for hours a day this past week. Opus 4.7 is definitely better, but for many things they are roughly equivalent; there are some things on the edge that are just up to chance, where either model may stumble across the solution, and there are some areas of my work that reliably trip up both models, where I get better mileage out of writing code the old fashioned way. I understand that I'm just one data point, but I'm not writing CRUD apps here, I'm doing DSP and weird color math in shaders. I don't think any of it is hard, and the stuff that I think is hard none of the models are good at yet, but idk, they just don't seem that extremely disparate from one another.
FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet. idfk, maybe it's a skill issue, but I love Opus 4.7, undisputed king, while Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.
"essentially useless" is a gross overstatement. Your personal benchmarks will always provide you with the most value, but disregarding standardized benchmarks because you care more about vibes is not exactly scientific.
I'm building a pipeline and testing against gemma4 and Gemini's 3-1 flash. Both are very good on certain tasks, and even n-way clustering works almost perfectly, almost always.
But they diverge greatly on other particular ones whenever the ViT tower and a priori knowledge of the world are crucial. I wish Gemma was on par, but both Google and I know it's not.
You do need to ask whether or not Sonnet or Opus are overkill for a lot of work though. If Gemma4 with some human effort can achieve the same result as Sonnet then it's arguably a lot more cost effective as you're paying for the person to operate each one regardless.
I 100% agree with your philosophy but I wanna note that I genuinely find Gemma 4 31b to be better than Sonnet. To be clear, this makes NO sense to me, so I'm probably just high and making stuff up or just biased by a small sample size since I don't use Sonnet that often. I find that Gemma 4 makes the sort of "dumb AI" mistakes Sonnet makes less often, especially in agentic mode. I genuinely don't know how that can be true but Sonnet feels much more like "autocomplete" and Gemma 4 feels like "some facsimile of thought".
No, exactly the opposite actually. Qwen 3.6 is too imprecise for long running agentic tasks. It doesn't have the same ability to check itself as Gemma does in my testing. I keep Qwen MoE in vram by default because there are tons of tasks I trust it to oneshot and its 90 tok/sec is unparalleled, but for anything where I don't want to have to intervene too much, it can't be trusted.
Oh interesting. I've read that Gemma 4 is really good for creative stuff, but I'm mostly interested in agentic coding. Unfortunately, each time I use Gemma 4, I just get it stuck in loops.
Overall, using screen time as the metric, derived from some imperfect logging and vibes, it's about 50% OpenCode, 15% Continue, 15% my homebrew bullshit, 13% Claude Code, and 7% Cline. I've been deep on agentic stuff lately (1.3 wks, aka 3 months of AI time), and there are only so many hours in the day to duplicate work and A/B test, but in the past I've sworn by Qwen Coder + llama.vim, and I still enjoy that workflow for deep work far more than I like prompting agents; there's just a lot of dross I'm learning to delegate.
I stopped doing local stuff for a bit when I realised I didn't know how well it was supposed to work, so I've been on Claude for a few months now.
I think I'll try OpenCode this time.
Usually I do stuff in devcontainers; qwen code (non-local) was the only time I managed to lose some work, as it got confused when I ran out of tokens.
There's still quite a way to go - it does seem like Claude Code itself is pretty badly coded, so I think there is a space for open source to come in with a high quality harness at some point.
Sorry but you're just seeing what you want to see. The idea that a 31b model is anywhere even in the ballpark of something like Opus 4.5 is just absurd on its face.
False. The absolute capability is irrelevant; with the proper harness, 31b is more than adequate for a very large portion of the tasks I ask AI to do. The metric isn't how good the model is at Erdos problems, it's how reliably it can remove drudgery from my life. It just autonomously reverse engineered a bluetooth protocol with minimal intervention, and its ability to react to data and ground itself constantly impresses me. I do a ton of testing with these models; today I had Gemma answer a physics problem that Opus 4.7 gave up on. With a decent harness and context, the set of tasks where their capabilities are both good enough is very surprising. The tasks I have that stump Gemma often also stump Opus 4.7.
No, it isn't. I am saying that the set of tasks that can be completed by Opus 4.7 has a surprisingly large overlap with the set of tasks that can be completed by Gemma 31B. It is meaningfully equivalent in many cases.
(of course if i'm being honest 640kB is fine, i'm sure tons of the world's commerce is handled by less; the delta between a system with 640kB of ram and a modern one is near nil for many people, the UX on a PoS terminal does not require more than that for example, and the hacker news UX could also be roughly the same)
How refreshing to hear this kind of old-school hacker thinking, in a thread where most people have given up on local computing in exchange for convenience and permanent third-party dependency.
With embedded systems affordable and ubiquitous, hopefully a growing segment of the new generation will also learn to push the limits of available hardware and see how far we can take it. As an engineer, there's a satisfaction in solving things with what you've got.
There's a new technique, a 1-bit family of language models, that can achieve up to 9x memory efficiency compared to existing models. Still multiple gigabytes for practical use I imagine, but it's great progress toward local AI, which I believe will be common in the near future. https://prismml.com/news/ternary-bonsai
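Rough back-of-envelope for where a number like 9x could come from (my own arithmetic, not from the linked post): a ternary weight carries about log2(3) ≈ 1.58 bits versus 16 bits per weight for fp16, so for, say, a 31B-parameter model:

    31e9 weights * 16 bits   ≈ 62 GB  (fp16)
    31e9 weights * 1.58 bits ≈  6 GB  (ternary)

That's roughly a 10x reduction on the weights alone; activations, KV cache, and packing overhead eat into it, which is how you land in high-single-digit territory.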
If your laptop is a Strix Halo or a MacBook with a decent amount of RAM, that day arrived about 6 months ago, and today if you can run Gemma 31b, you're golden for your basic workslop code. You can do most of it with local models. Heck, for a lot of the tier of programming you might encounter in the average job, Qwen 35b MoE is good enough, and it can hit 100 tok/s on decent hardware.
Very different from my experience; Gemma 31b just solved a physics problem Opus 4.7 gave up on. I definitely don't think they're equivalent in general, Opus for sure is way smarter and way more likely to get things right on the edge, but it's still quite likely to get things wrong too, which doesn't make it that useful for a lot of stuff. Conversely, there are so many things that you would use an LLM for that they will both reliably oneshot. Especially in agentic mode, where you have ground truth feedback between turns, the difference gets quite small for a lot of tasks.
That all being said, I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years, so I don't see a lot of the rough edges. I really believe the capability is there: Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to oneshot, Qwen 3.6 35b MoE will handle at like 90 tok/sec, absolutely fantastic for tasks that don't require a huge amount of precision.
It may surprise you but over thousands of hours I have actually gathered more than one sample.
EDIT: Here's another sample for ya. I went to the store to buy mixers, and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics, and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off"; it read the directory, did a couple googles, and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought, I'm pretty sure it would have been able to get it eventually without googling.
idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
Not the person asked, but on a medium bug that would span a few Python files, I found the MoE to be too enthusiastic, trying things without first trying to understand the issue, whereas the dense model thought hard and added debug statements to understand how to fix it. But the dense model is quite slow (Q4KM quant, MI50 32GB, llama.cpp, pi).
Gemma 4 IS good. I've literally had it get a thing right that Opus 4.7 missed; the edges are ragged, but I'm reliably finding use cases where it's basically equivalent. Ultimately the metric is "what can I RELY on it to do". Opus definitely knows a lot more and can sometimes do much more complex tasks, but especially when you're good about feeding the context, Gemma is amazing. The difference between the sets of things I trust the two models to do is surprisingly small. I've had some insanely good runs recently working on my personal tooling as well as random projects. It's the first local model that can reliably be left to implement features in agentic mode on non-trivial projects.
This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.
Re-posting this from a buried comment for visibility because it's just so fucking impressive to me.
I went to the store to buy mixers, and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics, and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off"; it read the directory, did a couple googles, and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought, I'm pretty sure it would have been able to get it eventually without googling.
idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
Built a basic authentication handler for this test just so it wouldn't be in the training data of either model. It had deliberately planted bugs. One was a hardcoded secret, another was a wrap-on-0xFFFFFFFF bug as a result of a malloc(length+1).
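To make that second one concrete, here's a minimal C sketch of that class of bug (the function and names here are made up for illustration, not the actual handler from the test): if length arrives as 0xFFFFFFFF on a platform with a 32-bit size_t, length + 1 wraps to 0, malloc hands back a zero-size allocation, and the copy runs off the end of it.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative sketch only, not the actual test handler.
     * With a 32-bit size_t, length == 0xFFFFFFFF makes length + 1
     * wrap around to 0, so malloc(0) returns a zero-size allocation
     * and the memcpy below writes far past the end of it. */
    char *copy_token(const uint8_t *data, size_t length) {
        char *buf = malloc(length + 1);   /* wraps to malloc(0) when length == SIZE_MAX */
        if (buf == NULL)
            return NULL;
        memcpy(buf, data, length);        /* heap overflow on the wrapped allocation */
        buf[length] = '\0';
        return buf;
    }

The fix is a guard like if (length >= SIZE_MAX) return NULL; before the allocation (or, more realistically, capping length at a sane protocol maximum).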
Qwen 3.6 found both, alongside two other issues I hadn't even considered, and the location of the magic value. GPT-5.4, though, missed the malloc issue (flagging memory exhaustion as the only risk), missed a separate timing bug (it explicitly said the function was safe), and hallucinated the location of the magic value. Qwen correctly identified the integer overflow; GPT-5.4 did not.
I then compared basic research between them using SearXNG for web search. For example, the current status of MTP in llama.cpp. Qwen 3.6 27B found the current PR, but flagged a related issue that shows the current implementation can be slower than just using a draft model right now. GPT-5.5 Thinking found the same PR, but didn't flag the downsides.
In a similar comparison, I asked both models how I should get started with ESPHome as a total beginner. ChatGPT suggested an ESP32-S3 and a BME280, which is... just not a good idea. It also talked about the ESP32-P4 not having Wi-Fi, and about installing with HA or Docker. Meanwhile, Qwen 3.6 27B said regular ESP32, DHT22, and mentioned HA, Docker, and pip as installation methods. GPT's answer wasn't bad, but it was just throwing out jargon for a prompt that explicitly said it was from a beginner.
It kind of blew my mind that in all three of these, Qwen landed it better.
Someone forgot to put figurative weights and measures into the model's instructions again. Going to take twice as long to farm updoots with such lazy prompt engineering.