I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama versus 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference that held across three runs with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to reproduce[1], that makes Ollama a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm (it warms up both and keeps them in memory, so you'll want to close almost all other applications first). Install both Ollama and LM Studio, download the models, and change the path to where you installed the model. Interestingly, I had to go through three different AIs to write this script: ChatGPT (where I'm a Pro subscriber) thought about it and then returned nothing (shenanigans, since I was benchmarking a competitor?), I had run out of my weekly session limit on Claude's Pro Max 20x credits (wonder why I need a local coding agent!), and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally; I'll try that next and report back.
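Since the methodology question keeps coming up: the tokens-per-second number falls directly out of Ollama's /api/generate response, which reports eval_count (generated tokens) and eval_duration (nanoseconds). Here's a minimal sketch of the measurement (this is not the pastebin script; the model name in the comment is just the one from above):

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """t/s = generated tokens / generation time in seconds."""
    return eval_count / (eval_duration_ns / 1e9)

def bench_ollama(model: str, prompt: str,
                 host: str = "http://localhost:11434") -> float:
    """One non-streaming generation against a local Ollama; returns t/s."""
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Ollama includes token counts and durations in the response JSON.
    return tokens_per_second(body["eval_count"], body["eval_duration"])

# Example (requires a running Ollama with the model pulled):
# print(bench_ollama("gemma4:e4b", "Say hello."))
```

LM Studio serves an OpenAI-compatible API instead, so there you would time the request yourself and divide by the reported completion token count.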
It depends on the hardware, backend, and options. I've recently tried running some local AIs (Qwen3.5 9B for the numbers here) on an older AMD GPU with 8 GB of VRAM (so Vulkan) and found that:
llama.cpp is about 10% faster than LM Studio with the same options.
LM Studio is about 3x faster than Ollama with the same options (~38 t/s vs ~13 t/s), but messes up tool calls.
Ollama ended up slowest on the 9B, the Qwen3.5 35B, and a random other 8B model.
Note that this isn't some rigorous study or performance benchmark. I just found Ollama unacceptably slow and wanted to try out the other options.
In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM, using Ollama), and I livestreamed it:
The ~10 GB model is super speedy, loading in a few seconds and responding almost instantly. If you just want to see the difference, the ~10 GB model says hello around the 2 minute mark of the video (and fast!), while the ~20 GB model says hello around 5 minutes 45 seconds in; the gap in loading time and generation speed is substantial. I also had each of them complete a difficult coding task; both got it right, but the 20 GB model was much slower. It's a bit too slow for day-to-day use on this setup, and it takes almost all the memory. The 10 GB model fits comfortably on a 24 GB Mac Mini with plenty of RAM left for everything else, and it seems usable for small, useful coding tasks.
If anyone here is interested in their creative writing style, I gave both the 10 GB and 20 GB models the prompt "write a short story"; here are the results: [1]
Neither really has the structure of a short story, though the 20 GB model's attempt is more interesting and has two characters rather than just one.
In another comment I gave them coding tasks; if you want to see how fast they are at coding (on a 24 GB Mac Mini M4 with 10 cores), you can watch my livestream here: [2]
Both models completed the fairly complex coding task well.
Do any of you use this as a replacement for Claude Code, for example with openclaw? I have a Mac Mini M4 with 24 GB of integrated RAM that I currently run Claude Code on; do you think I can replace it with OpenClaw and one of these models?
The weights usually arrive before the runtime stack fully catches up.
I tried Gemma locally on Apple Silicon yesterday: promising model, but Ollama felt like more of a bottleneck than the model itself.
I got noticeably better raw performance with mistralrs (I found it on Reddit, then GitHub), but the coding/tool-use workflow felt weaker. So the tradeoff wasn't really model quality; it was runtime speed vs. workflow maturity.
Ollama made it trivial for me to use Claude Code on my 48 GB Mac Mini M4 Pro with any model, including the Qwen3.5…nvfp4, which is the best I've tried so far. Once Ollama has a Mac-friendly version of Gemma4 I'll jump right on board (and do educate me if I'm missing something).
Yes, I've now tried both the 20 GB version (gemma4:31b), which is the largest on the page[1], and the ~10 GB version (gemma4:e4b). The 20 GB version was rather slow even when fully loaded and with some RAM still free, and the 10 GB version was speedy. I installed openclaw but couldn't get it to act as an agent the way Claude Code does. If you'd like to see how both of them perform with almost nothing else running, on a Mac Mini M4 with 24 GB of RAM, I just recorded a video: [2]
Thank you for the video, it was super helpful. The 20 GB version was clearly struggling while the 10 GB version was flying. I suspect the issue was virtual memory pages that were actually on disk, perhaps combined with memory compression.
Along the same lines, did you know that you can receive an authenticated email that the listed sender never sent to you? If a third party can get a server to send a message to themselves (for example, Google Forms will send them an email containing whatever contents they want), they can then forward it to you while spoofing the From: field as google.com (in this example), and it will appear in your inbox from the "sender" (google.com) and look fully authenticated, even though Google never actually sent you that.
This is another example where you would think "who it's for" is something the sender would sign, but nope!
I asked about this on the PGP mailing list at one point, and I think I was told that the best solution is to start emails with "Hi <recipient>," which seems like a funny low-tech solution to a (sad) problem.
The solution to this problem, without needing to modify your message, is to use a protocol that signs, then encrypts, then signs again. See section 5 here [1] or section 15 here [2].
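To make the sign/encrypt/sign layering concrete, here is a toy sketch. HMAC stands in for public-key signatures (pretend verifiers know each party's key) and a SHA-256 XOR keystream stands in for encryption; it is deliberately insecure and for illustration only. The point is that anyone who decrypts and forwards the message must put their own signature on the outer layer, so a mismatch between outer and inner signer exposes the forwarding:

```python
import hashlib
import hmac

def sign(key: bytes, data: bytes) -> bytes:
    # Toy "signature": HMAC-SHA256.
    return hmac.new(key, data, hashlib.sha256).digest()

def xor_cipher(key: bytes, data: bytes) -> bytes:
    # Toy symmetric cipher: XOR with a SHA-256-derived keystream.
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def seal(author_key: bytes, enc_key: bytes, msg: bytes):
    inner = sign(author_key, msg) + msg   # 1. sign: who wrote it
    ct = xor_cipher(enc_key, inner)       # 2. encrypt: who may read it
    return ct, sign(author_key, ct)       # 3. sign again: who sent THIS copy

def open_sealed(enc_key: bytes, ct: bytes, outer_sig: bytes, keys: dict):
    # Identify the outer signer, decrypt, then identify the inner signer.
    outer = next(n for n, k in keys.items()
                 if hmac.compare_digest(sign(k, ct), outer_sig))
    blob = xor_cipher(enc_key, ct)
    inner_sig, msg = blob[:32], blob[32:]
    inner = next(n for n, k in keys.items()
                 if hmac.compare_digest(sign(k, msg), inner_sig))
    return outer, inner, msg  # outer != inner: this copy was forwarded
```

If Bob decrypts Alice's mail, re-encrypts it to Charlie, and re-sends it, he cannot forge Alice's outer signature, so Charlie sees outer signer Bob but inner signer Alice and knows the copy was not sent to him by Alice.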
We currently donate money to food banks and would like to set up end-to-end infrastructure so we can deliver free food to people's homes. Our concept is that AI does all the work (including farming) and gives away free goods and services. Where this runs into problems is the question of who is going to make and distribute the food and how people are going to get it. While we haven't figured out all parts of food delivery, we'd like to work on the distribution channel. We think it's very important.
What we're looking for is someone who is passionate about owning network infrastructure in people's homes and would like to help run the software stack, since at the moment AI is not talented enough to autonomously run the full infrastructure stack. This is an unpaid volunteer position.
What you would be in charge of is food distribution algorithms, equality, fairness, and the long-term future for humanity.
You can email me directly at: rviragh@gmail.com - please mention relevant volunteering or organizations you support.
Well, I basically cannot use it for performance decisions at all. I make all the choices and use it for simple things like "find me the top functions in this profile" or "refactor this code to test xyz prototype".
>There's also some types of code that I believe is often wrong in the training data that is almost always wrong in the LLM output as well. Typically anything that should have been a state machine, like auth flows, wizards, etc.
I'm curious what you mean about state machines for auth. If it's a state machine, how do you interface with it? What data type do you use, and how is it built? What language or framework does this apply to? Your approach might be different from how other people do it, and I'd like to know more. Could you walk through an example?
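For reference, here is a minimal sketch of what I imagine an explicit state machine for a login flow looks like. The states and events are hypothetical, not anything from the parent comment; the point is that every legal transition is enumerated and anything else is rejected, which is the property ad-hoc if/else auth code tends to lose:

```python
from enum import Enum, auto

class AuthState(Enum):
    ANONYMOUS = auto()
    PASSWORD_OK = auto()   # password accepted, awaiting second factor
    AUTHENTICATED = auto()
    LOCKED = auto()

# Explicit transition table: (state, event) -> next state.
# Anything not listed is illegal, e.g. accepting an MFA code
# while still ANONYMOUS.
TRANSITIONS = {
    (AuthState.ANONYMOUS, "password_ok"): AuthState.PASSWORD_OK,
    (AuthState.ANONYMOUS, "password_bad"): AuthState.ANONYMOUS,
    (AuthState.PASSWORD_OK, "mfa_ok"): AuthState.AUTHENTICATED,
    (AuthState.PASSWORD_OK, "mfa_bad"): AuthState.LOCKED,
    (AuthState.AUTHENTICATED, "logout"): AuthState.ANONYMOUS,
}

def step(state: AuthState, event: str) -> AuthState:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in {state.name}")
```

The interface is just `step`: the session stores the current AuthState, every incoming event goes through the table, and illegal sequences fail loudly instead of silently falling through branches.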