idk, my 10 year old makerbot 2 has been pretty reliable. ever since PrusaSlicer came out and I tuned a profile for it maybe 6 years ago, it's been spitting out quick, dimensionally accurate prints. i use it all the time, probably go through a spool every month or two, and all i've had to replace is the extruder cooling fan, once
i'm mostly printing small mechanical parts and i can't say i have any complaints, i assume a modern prusa would be much better, surely there are other FDM printers that are good?
I hear this all the time. Why does it matter? Punishing a human for making a mistake does not prevent mistakes, nor does it undo the harm of the mistake. A human saying "my bad, I messed up" and an AI saying "my bad, I messed up" are equally worthless, in a functional sense.
"Punishing a human for making a mistake does not prevent mistakes" This statement suggests you don't believe in some combination of neuroplasticity as a concept or the arrow of time.
Tell the families of the people who died in the 737 MAX disasters: "Don't worry - everything's going to be okay! The engineers learned from their mistakes - accountability works, you have nothing to be sad about!"
Tell the family of the person killed by a semi truck driver who showed up to work drunk or high: "Don't worry - the driver went to jail! Accountability prevented anything bad from happening!"
Accountability alone fails to prevent deadly mistakes millions of times a day; millions of mistakes are avoided daily through process, redundancy, independent review, and formal methods.
"Accountability prevents mistakes" is a comforting delusion. In reality, accountability is only marginally related to whether or not mistakes are made.
"Accountability alone fails to prevent deadly mistakes millions of times a day"
...in his desperation to finally win an argument online our hero advanced, grimly ignoring the concept of Engineering.
"millions of mistakes are avoided daily through process, redundancy, independent review, and formal methods."
Ahh, spoke too soon, Engineering has finally joined the chat. So what mechanism do you propose led to the foundation of process, redundancy, independent review, and formal methods?
What are you even on about mate? Sure accountability doesn’t prevent all mistakes. Guess what, nothing prevents all mistakes. Accountability can help prevent some mistakes some of the time. It sounds like you’re suggesting getting rid of the concept of accountability because it doesn’t prevent ALL mistakes. Way to throw the baby out with the bath water.
>It sounds like you’re suggesting getting rid of the concept of accountability
Where on earth are you getting this?
Accountability alone is insufficient, and the things that actually prevent mistakes don't require it. The mechanisms that do work: independent reviews, redundancy, formal verification where applicable, staged rollouts, testing against adversarial inputs... these all function on the artifact, not on whether or not the entity producing the artifact can be held accountable. A formally verified proof is correct whether a human or an LLM generated it. A code review catches the same bug regardless of who's liable for shipping it.
The argument isn't "let's get rid of accountability", the argument is "it's ridiculous to suggest that the reason you shouldn't use AI is that AI lacks accountability - lacking accountability isn't the reason AI makes mistakes, and adding accountability to AI won't prevent AI from making mistakes - the answer to preventing mistakes with AI rests in process, and accountability does nothing to inherently ensure that".
Accountability is nothing but a transmission mechanism, and is blind to the values instilled through it. Accountability is literally what caused the 737 MAX disasters. The FAA decided it was more accountable to industry efficiency than it was to safety when it allowed Boeing to self-certify, which violated a process control of independent reviews. Boeing's board decided it wanted to be more accountable to shareholder value maximization than it was to safety when it allowed MCAS to experience scope creep without re-review, which violated a process control of formal verification. Boeing's designers and engineers decided to be accountable to shareholder value maximization when they decided to make MCAS rely on only one of two flight control computers, which violated a process control of redundancy. Engineers at Boeing flagged these failures, but they were ignored when management decided to be more accountable to shipping on time, which violated a process control of incorporating adversarial inputs and feedback.
Accountability did not prevent these mistakes, it caused them. Failure to abide by process controls caused the mistakes. Adding more accountability wouldn't have prevented these mistakes; maintaining strict adherence to the process controls that used to be in place would have.
If a human messes up enough, eventually they will get fired, fined, or jailed. An AI will not.
A human also knows they might get punished if they mess up badly enough, which might cause them to think twice before doing something bad. For an AI there is a reward, but there is no risk.
So while both might lie, only the human will be worried that it will be found out. That makes a difference.
You say that like all humans are alike: that they all care about getting fired, fined, or jailed; that they're even considering punishment when they're making their decisions; that risk factors into decision making.
What you are describing is a hypothetical "rational person". In real life, even the most rational people you know do completely irrational things routinely.
The Therac-25 engineers were accountable. The 737 MAX engineers were accountable. Accountability is doing much less work in the safety story than you seem to think.
The real work is done by process, redundancy, independent review, formal methods. None of these inherently require someone to be penalized for making mistakes, and penalizing people for making mistakes is a demonstrably, empirically unreliable mechanism for preventing mistakes.
I hear you, but isn't the human in the loop precisely the one who should be putting their foot down and saying "no, the AI shouldn't be writing the tests to begin with", which would bring us full circle?
Punishing humans does, in fact, prevent mistakes. Or rather, the threat of punishment causes people to be careful to avoid mistakes, and that prevents mistakes. Sure, this doesn't work 100% of the time, but it does work and has throughout human history. Meanwhile, there's no equivalent paradigm for LLMs.
Even if you could threaten an LLM with punishment for making mistakes, you might get longer CoTs, but that wouldn't prevent mistakes in LLMs. The lack of accountability isn't the reason that LLMs make mistakes - adding accountability wouldn't change anything.
Also, you can feed it ALL of your data willy-nilly without ever worrying about safety, because you can just do it with the LAN cable unplugged. For applications that demand data hygiene it's a cheat code that guarantees safety without any sort of data sanitization.
It's more complex than that, I think the reality is that there's a lot of code that's just not that deep bro. I have some purely personal projects with components that I don't understand anymore; I wrote that shit by hand, they still work, but I haven't touched that shit in years. There's a lot of code like that that AI can write for me, the stuff I would forget about even if I wrote it by hand. I think you have to have discipline in its use, it's a tool like any other.
AI, and especially agentic AI, can make you lose situational awareness over a codebase, and when you're doing deep work that SUUUUCKS, but it's not useless, you just have to play to its strengths. Though my favorite hill to die on is telling people not to underestimate its value as autocomplete. Turns out 40 gigabytes of autocomplete makes for a fucking amazing autocomplete. Try it with llama.vim + qwen coder 30b, it feels like the editor is reading your mind sometimes and the latency is so low.
Flat wrong. Q6 Gemma 31b feels a lot like Opus 4.5 to me when run in a harness so it can retrieve information and ground itself. The gap is not that big for a lot of use cases. Qwen MoE is fast as fuck locally for things that are oneshottable. I have subscriptions to all the major providers right now, and since Gemma 4 and Qwen 3.6 came out I haven't hit limits a single time. I'm actually super surprised by the number of things I try with Gemma 4 with the intent of seeing how it fails, and then having Claude do it, only to come away with something perfectly usable from the local model.
Your n=1 might not be very relevant outside your personal use. In less contaminated benchmarks Gemma 4 is way below Sonnet 4.5, let alone Opus models: https://swe-rebench.com/
Benchmarks only give you the roughest idea of how models compare in real world use. They're essentially useless beyond maybe classifying models into a few buckets. The only way you gain an understanding of something as complex as how an LLM integrates with your workflow is by doing it and measuring across many trials. I've been running Opus 4.7 in Claude Code and Gemma 4 31b in parallel on projects for hours a day this past week. Opus 4.7 is definitely better, but for many things they are roughly equivalent; there are some things on the edge that are just up to chance, where either model may stumble across the solution, and there are some areas of my work that reliably trip up both models, where I get better mileage out of writing code the old fashioned way. I understand that I'm just one data point, but I'm not writing CRUD apps here, I'm doing DSP and weird color math in shaders. I don't think any of it is hard, and the stuff that I think is hard none of the models are good at yet, but idk, they just don't seem that extremely disparate from one another.
FWIW I think Gemma 4 31b is more likely to be of use to me than Sonnet. idfk, maybe it's a skill issue, but I love Opus 4.7, undisputed king, while Sonnet seems borderline useless and I basically think of it as on the same level as Qwen 35b MoE.
"essentially useless" is a gross overstatement. Your personal benchmarks will always provide you with the most value, but disregarding standardized benchmarks because you care more about vibes is not exactly scientific.
I'm building a pipeline and testing against gemma4 and Gemini's 3-1 flash. Both are very good on certain tasks, and even n-way clustering works almost perfectly, almost always.
But they diverge greatly on other particular ones whenever the ViT tower and a priori knowledge of the world are crucial. I wish Gemma was on par, but both Google and I know it's not.
You do need to ask whether or not Sonnet or Opus are overkill for a lot of work though. If Gemma4 with some human effort can achieve the same result as Sonnet then it's arguably a lot more cost effective as you're paying for the person to operate each one regardless.
I 100% agree with your philosophy but I wanna note that I genuinely find Gemma 4 31b to be better than Sonnet. To be clear, this makes NO sense to me, so I'm probably just high and making stuff up or just biased by a small sample size since I don't use Sonnet that often. I find that Gemma 4 makes the sort of "dumb AI" mistakes Sonnet makes less often, especially in agentic mode. I genuinely don't know how that can be true but Sonnet feels much more like "autocomplete" and Gemma 4 feels like "some facsimile of thought".
No, exactly the opposite actually. Qwen 3.6 is too imprecise for long running agentic tasks. It doesn't have the same ability to check itself as Gemma does in my testing. I keep Qwen MoE in vram by default because there are tons of tasks I trust it to oneshot and its 90 tok/sec is unparalleled, but for anything where I don't want to have to intervene too much, it can't be trusted.
Oh interesting. I've read that Gemma 4 is really good for creative stuff, but I'm mostly interested in agentic coding. Unfortunately, each time I use Gemma 4, I just get it stuck in loops.
Overall, using screen time as the metric, derived from some imperfect logging and vibes, it's about 50% OpenCode, 15% Continue, 15% my homebrew bullshit, 13% Claude Code, and 7% Cline. I've been deep on agentic stuff lately (1.3 wks, aka 3 months of AI time), and there are only so many hours in the day to duplicate work and A/B test, but in the past I've sworn by Qwen Coder + llama.vim, and I still enjoy that workflow for deep work far more than I like prompting agents; there's just a lot of dross I'm learning to delegate.
I stopped doing local stuff for a bit when I realised I didn't know how well it was supposed to work, so I've been on Claude for a few months now.
I think I'll try OpenCode this time.
Usually I do stuff in devcontainers; qwen code (non-local) was the only time I managed to lose some work, as it got confused when I ran out of tokens.
There's still quite a way to go - it does seem like Claude Code itself is pretty badly coded, so I think there is a space for open source to come in with a high quality harness at some point.
Sorry but you're just seeing what you want to see. The idea that a 31b model is anywhere even in the ballpark of something like Opus 4.5 is just absurd on its face.
False. The absolute capability is irrelevant; with the proper harness, 31b is more than adequate for a very large portion of the tasks I ask AI to do. The metric isn't how good the model is at Erdos problems, it's how reliably it can remove drudgery from my life. It just autonomously reverse engineered a bluetooth protocol with minimal intervention, and its ability to react to data and ground itself constantly impresses me. I do a ton of testing with these models; today I had Gemma answer a physics problem that Opus 4.7 gave up on. With a decent harness and context, the set of tasks where their capabilities are both good enough is very surprising. The tasks I have that stump Gemma often also stump Opus 4.7.
No, it isn't. I am saying that the set of tasks that can be completed by Opus 4.7 has a surprisingly large overlap with the set of tasks that can be completed by Gemma 31B. It is meaningfully equivalent in many cases.
(of course if i'm being honest 640kB is fine, i'm sure tons of the world's commerce is handled by less; the delta between a system with 640kB of ram and a modern one is near nil for many people, the UX on a PoS terminal does not require more than that for example, and the hacker news UX could also be roughly the same)
How refreshing to hear this kind of old-school hacker thinking, in a thread where most people have given up on local computing in exchange for convenience and permanent third-party dependency.
With embedded systems affordable and ubiquitous, hopefully a growing segment of the new generation will also learn to push the limits of available hardware and see how far we can take it. As an engineer, there's a satisfaction in solving things with what you've got.
There's a new technique, a 1-bit family of language models, that can achieve up to 9x memory efficiency compared to existing models. Still multiple gigabytes for practical use I imagine, but it's great progress toward local AI, which I believe will be common in the near future. https://prismml.com/news/ternary-bonsai
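Rough back-of-envelope for where a number like 9x could come from (my own arithmetic, not from the linked post): a ternary weight carries about log2(3) ≈ 1.58 bits versus 16 bits per weight for fp16, so for, say, a 31B-parameter model:

    31e9 weights * 16 bits   ≈ 62 GB  (fp16)
    31e9 weights * 1.58 bits ≈  6 GB  (ternary)

That's roughly a 10x reduction on the weights alone; activations, KV cache, and packing overhead eat into it, which is how you land in high-single-digit territory.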
If your laptop is a Strix Halo or a MacBook with a decent amount of RAM, that day arrived about 6 months ago, and today if you can run Gemma 31b, you're golden for your basic workslop code. You can do most of it with local models. Heck, for a lot of the tier of programming you might encounter in the average job, Qwen 35b MoE is good enough, and it can hit 100 tok/s on decent hardware.
Very different from my experience; Gemma 31b just solved a physics problem Opus 4.7 gave up on. I definitely don't think they're equivalent in general, Opus for sure is way smarter and way more likely to get things right on the edge, but it's still quite likely to get things wrong too, which doesn't make it that useful for a lot of stuff. Conversely, there are so many things that you would use an LLM for that they will both reliably oneshot. Especially in agentic mode, where you have ground truth feedback between turns, the difference gets quite small for a lot of tasks.
That all being said, I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years, so I don't see a lot of the rough edges. I really believe the capability is there: Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to oneshot, Qwen 3.6 35b MoE will handle at like 90 tok/sec, absolutely fantastic for tasks that don't require a huge amount of precision.
It may surprise you but over thousands of hours I have actually gathered more than one sample.
EDIT: Here's another sample for ya. I went to the store to buy mixers, and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics, and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off"; it read the directory, did a couple googles, and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought, I'm pretty sure it would have been able to get it eventually without googling.
idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
Not the person asked, but on a medium bug that would span a few Python files, I found the MoE to be too enthusiastic, trying things without first trying to understand the issue, whereas the dense model thought hard and added debug statements to understand how to fix it. But the dense model is quite slow (Q4KM quant, MI50 32GB, llama.cpp, pi).
Gemma 4 IS good. I've literally had it get a thing right that Opus 4.7 missed; the edges are ragged, but I'm reliably finding use cases where it's basically equivalent. Ultimately the metric is "what can I RELY on it to do". Opus definitely knows a lot more and can sometimes do much more complex tasks, but especially when you're good about feeding the context, Gemma is amazing. The difference between the sets of things I trust the two models to do is surprisingly small. I've had some insanely good runs recently working on my personal tooling as well as random projects. It's the first local model that can reliably be left to implement features in agentic mode on non-trivial projects.
This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode where I manually intervened maybe only 4 times over the course of a few hours.
Re-posting this from a buried comment for visibility because it's just so fucking impressive to me.
I went to the store to buy mixers, and while I was out Gemma 4 31b got pretty far along with reverse engineering the bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics and had made a dump of the bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the bluetooth characteristics, and it got into an infinite loop. (local models aren't perfect and i never said they were) I turned on the websearch tool and told it to "pick up the project where it left off"; it read the directory, did a couple googles, and had a working script to print temperature, humidity and battery state in like 3 turns. Reading back through its chain of thought, I'm pretty sure it would have been able to get it eventually without googling.
idk, I thought I was a cool and smart engineer type for being able to do stuff like this, if my GPUs being able to do this more or less unsupervised isn't impressive I guess fuck me lol.
Built a basic authentication handler for this test just so it wouldn't be in the training data of either model. It had deliberately planted bugs. One was a hardcoded secret, another was a wrap-on-0xFFFFFFFF bug as a result of a malloc(length+1).
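To make that second one concrete, here's a minimal C sketch of that class of bug (the function and names here are made up for illustration, not the actual handler from the test): if length arrives as 0xFFFFFFFF on a platform with a 32-bit size_t, length + 1 wraps to 0, malloc hands back a zero-size allocation, and the copy runs off the end of it.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative sketch only, not the actual test handler.
     * With a 32-bit size_t, length == 0xFFFFFFFF makes length + 1
     * wrap around to 0, so malloc(0) returns a zero-size allocation
     * and the memcpy below writes far past the end of it. */
    char *copy_token(const uint8_t *data, size_t length) {
        char *buf = malloc(length + 1);   /* wraps to malloc(0) when length == SIZE_MAX */
        if (buf == NULL)
            return NULL;
        memcpy(buf, data, length);        /* heap overflow on the wrapped allocation */
        buf[length] = '\0';
        return buf;
    }

The fix is a guard like if (length >= SIZE_MAX) return NULL; before the allocation (or, more realistically, capping length at a sane protocol maximum).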
Qwen 3.6 found both, alongside two other issues I hadn't even considered, and the location of the magic value. GPT-5.4, though, missed the malloc issue (flagging memory exhaustion as the only risk), missed a separate timing bug (it explicitly said the function was safe), and hallucinated the location of the magic value. Qwen correctly identified the integer overflow; GPT-5.4 did not.
I then compared basic research between them using SearXNG for web search. For example, the current status of MTP in llama.cpp. Qwen 3.6 27B found the current PR, but flagged a related issue that shows the current implementation can be slower than just using a draft model right now. GPT-5.5 Thinking found the same PR, but didn't flag the downsides.
In a similar comparison, I asked both models how I should get started with ESPHome as a total beginner. ChatGPT suggested an ESP32-S3 and a BME280, which is... just not a good idea. It also talked about the ESP32-P4 not having Wi-Fi, and about installing with HA or Docker. Meanwhile, Qwen 3.6 27B said regular ESP32, DHT22, and mentioned HA, Docker, and pip as installation methods. GPT's answer wasn't bad, but it was just throwing out jargon for a prompt that explicitly said it was from a beginner.
It kind of blew my mind that in all three of these, Qwen landed it better.
Someone forgot to put figurative weights and measures into the model's instructions again. Going to take twice as long to farm updoots with such lazy prompt engineering.