I'm not well versed in LLMs, can someone with more experience share how this com...

Star_Ship_1010 · on Aug 1, 2024

Best answer to this is from Reddit

"how does a smart car compare to a ford f150? its different in its intent and intended audience.

Ollama is someone who goes to walmart and buys a $100 huffy mountain bike because they heard bikes are cool. Torchchat is someone who built a mountain bike out of high quality components chosen for a specific task/outcome with the understanding of how each component in the platform functions and interacts with the others to achieve an end goal." https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...

Longer Answer with some more details is

If you don't care about which quant you're using, only use ollama and want easy integration with desktop/laptop based projects use Ollama. If you want to run on mobile, integrate into your own apps or projects natively, don't want to use GGUF, want to do quantization, or want to extend your PyTorch based solution use torchchat

Right now Ollama (based on llama.cpp) is a faster way to get performance on a laptop desktop and a number of projects are pre-integrated with Ollama thanks to the OpenAI spec. It's also more mature with more fit and polish. That said the commands that make everything easy use 4bit quant models and you have to do extra work to go find a GGUF model with a higher (or lower) bit quant and load it into Ollama. Also worth noting is that Ollama "containerizes" the models on disk so you can't share them with other projects without going through Ollama which is a hard pass for any users and usecases since duplicating model files on disk isn't great. https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...

dagaci · on Aug 1, 2024

If you running windows anywhere then you better off using ollama, lmstudio, and or LLamaSharp for coding these are all cross-platform too.

lostmsu · on Aug 2, 2024

I found LlamaSharp to be quite unstable with random crashes in the built-in llama.cpp build.

sunshinesfbay · on Aug 1, 2024

Pretty cool! What are the steps to use these on mobile? Stoked about using ollama on my iPhone!

dagaci · on Aug 2, 2024

>> "If running windows" << All of these have web interfaces actually, and all of these implement the same openai api.

So you get to browse locally and remotely if you are able to expose the service remotely adjusting your router.

Coudflare will also expose services remotely if you wishhttps://developers.cloudflare.com/cloudflare-one/connections...

So you can also run on any LLM privately with ollama, lmstudio, and or LLamaSharp with windows, mac and iphone, all are opensource and customizable too and user friendly and frequently maintained.

JackYoustra · on Aug 1, 2024

Probably if you have any esoteric flags that pytorch supports. Flash attention 2, for example, was supported way earlier on pt than llama.cpp, so if flash attention 3 follows the same path it'll probably make more sense to use this when targeting nvidia gpus.

sunshinesfbay · on Aug 1, 2024

It would appear that Flash-3 is already something that exists for PyTorch based on this joint blog between Nvidia, Together.ai and Princeton about enabling Flash-3 for PyTorch: https://pytorch.org/blog/flashattention-3/

JackYoustra · on Aug 1, 2024

Right - my point about "follows the same path" mostly revolves around llama.cpp's latency in adopting it.

jerrygenser · on Aug 1, 2024

Olamma currently has only one "supported backend" which is llama.cpp. It enables downloading and running models on CPU. And might have more mature server.

This allows running models on GPU as well.

Zambyte · on Aug 1, 2024

I have been running Ollama on AMD GPUs (which support for came after NVIDIA GPUs) since February. Llama.cpp has supported it even longer.

tarruda · on Aug 1, 2024

How well does it run in AMD GPUs these days compared to Nvidia or Apple silicon?

I've been considering buying one of those powerful Ryzen mini PCs to use as an LLM server in my LAN, but I've read before that the AMD backend (ROCm IIRC) is kinda buggy

SushiHippie · on Aug 1, 2024

I have an RTX 7900 XTX and never had AMD specific issues, except that I needed to set some environment variable.

But it seems like integrated GPUs are not supported

https://github.com/ollama/ollama/issues/2637

RealStickman_ · on Aug 2, 2024

Not sure about Ollama, but llama.cpp supports vulkan for GPU computing.

darkteflon · on Aug 1, 2024

Ollama runs on GPUs just fine - on Macs, at least.

Kelteseth · on Aug 1, 2024

Forks fine on Windows with an AMD 7600XT

amunozo · on Aug 1, 2024

I use it in Ubuntu and works fine too.

ekianjo · on Aug 1, 2024

it runs on GPUs everywhere. On Linux, on Windows...