"how does a smart car compare to a ford f150? its different in its intent and intended audience.
Ollama is someone who goes to walmart and buys a $100 huffy mountain bike because they heard bikes are cool. Torchchat is someone who built a mountain bike out of high quality components chosen for a specific task/outcome with the understanding of how each component in the platform functions and interacts with the others to achieve an end goal."
https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
Longer Answer with some more details is
If you don't care about which quant you're using, only use ollama and want easy integration with desktop/laptop based projects use Ollama.
If you want to run on mobile, integrate into your own apps or projects natively, don't want to use GGUF, want to do quantization, or want to extend your PyTorch based solution use torchchat
Right now Ollama (based on llama.cpp) is a faster way to get performance on a laptop desktop and a number of projects are pre-integrated with Ollama thanks to the OpenAI spec. It's also more mature with more fit and polish.
That said the commands that make everything easy use 4bit quant models and you have to do extra work to go find a GGUF model with a higher (or lower) bit quant and load it into Ollama.
Also worth noting is that Ollama "containerizes" the models on disk so you can't share them with other projects without going through Ollama which is a hard pass for any users and usecases since duplicating model files on disk isn't great.
https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
So you can also run on any LLM privately with ollama, lmstudio, and or LLamaSharp with windows, mac and iphone, all are opensource and customizable too and user friendly and frequently maintained.
Probably if you have any esoteric flags that pytorch supports. Flash attention 2, for example, was supported way earlier on pt than llama.cpp, so if flash attention 3 follows the same path it'll probably make more sense to use this when targeting nvidia gpus.
It would appear that Flash-3 is already something that exists for PyTorch based on this joint blog between Nvidia, Together.ai and Princeton about enabling Flash-3 for PyTorch: https://pytorch.org/blog/flashattention-3/
Olamma currently has only one "supported backend" which is llama.cpp. It enables downloading and running models on CPU. And might have more mature server.
How well does it run in AMD GPUs these days compared to Nvidia or Apple silicon?
I've been considering buying one of those powerful Ryzen mini PCs to use as an LLM server in my LAN, but I've read before that the AMD backend (ROCm IIRC) is kinda buggy