Hacker News | new | past | comments | ask | show | jobs | submit | eis's comments

Apple Silicon uses unified memory, where the CPU and GPU use the exact same memory and no copies from RAM to VRAM are needed. The article opens by mentioning just that, and indeed it is the whole point of the article.

I am always a bit baffled why Apple gets credited with this. Unified memory has been a thing for decades. I can still load the biggest models on my 10th gen Intel Core CPU and the integrated GPU can run inference.

The difference being that modern integrated GPUs are just that much faster and can run inference at tolerable speeds.

(Plus NPUs being a thing now, but that also started much earlier. The 10th gen Intel Core architecture already had instructions to deal with "AI" workloads... just very preliminary ones.)


That’s shared, not unified: it’s partitioned, and the CPU and GPU copies are managed by the driver. Lunar Lake (2024) is getting closer but still not as tightly integrated as Apple, and it is capped at 32GB (Apple goes up to 512GB). AMD Ryzen AI Max is closer to Apple but still has roughly 3x slower memory.

Shared vs. unified is merely a driver implementation detail. Regardless, in practice (IIUC) data will still be copied if you perform a transfer using a graphics API, because the driver has no way of knowing what the host might do with the pointed-to memory after the transfer.

If you make use of host pointers and run on an iGPU, no copy will take place.


My last serious GPU programming was with OpenCL, and if my memory does not fail me, the API was quite specific about copying and/or sharing memory on a shared-memory system.

I am pretty sure that my old 10th gen CPU/GPU combo has the ability to use the "unified"/zero-copy access mode for the GPU.


I don't think people are crediting Apple with inventing unified memory - I certainly did not. There have been similar systems for decades. What Apple did was popularize this with widely available hardware: GPUs that don't totally suck for inference, combined with RAM that has decent speed at an affordable price. Before, you either had iGPUs which were slow (and paired with not exactly the fastest DDR memory) but at least sat on the same die, or you had fast dGPUs with their own limited amount of VRAM. So the choice was between direct memory access but not powerful, or powerful but strangled by having to go through the PCIe subsystem to access RAM.

The article is talking about one particular optimization that one can implement with Apple Silicon, and I at least wasn't aware that it is now possible to do so from WebAssembly - so completely dismissing it as if it had nothing to do with Apple Silicon is imho not fair.


Back in the 8- and 16-bit home computer days, or on game consoles for that matter, it was already popular enough.

And yes things like the Amiga Blitter, arcade or console graphics units were already baby GPUs.


It's irrelevant, because no matter whether the system memory is unified or not, the point of the article is whether WASM adds an extra memory copy.

That's the same no matter the physical memory system architecture.


Yes, but that is just a tiny part of the whole CF Workers ecosystem. The other services are not open source, so the lock-in is very, very real. There are no API-compatible alternatives that cover a good chunk of the services. If you build your application around Workers and make use of the integrated services and APIs, there is no way for you to switch to another provider because, well, there is none.

And now you've put everything on the equivalent of a single NodeJS process running on a tiny VM. Next step: spread out over multiple Durable Objects, but that means implementing sharding logic. Complexity escalates very fast once you leave toy-project territory.
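To illustrate the kind of sharding logic that suddenly lands on your plate: a minimal sketch of routing keys across a fixed set of Durable Objects. The shard count, the `COUNTER` binding name, and the naming scheme are all illustrative assumptions, not anything Cloudflare prescribes.

```javascript
// Hypothetical key-based sharding across Durable Objects.
// NUM_SHARDS and the binding name are assumptions for illustration.
const NUM_SHARDS = 16;

// Stable FNV-1a hash so the same key always maps to the same shard.
function shardFor(key) {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % NUM_SHARDS;
}

// Inside a Worker you would then route to the owning object, e.g.:
//   const id = env.COUNTER.idFromName(`shard-${shardFor(userId)}`);
//   const stub = env.COUNTER.get(id);
```

Note that even this toy version bakes in a fixed shard count; resharding later means migrating state between objects yourself.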

D1 reliability has been bad in our experience. We've had queries hanging on their internal network layer for several seconds, sometimes double digits over extended periods (on the order of weeks). Recently I've seen a few times plain network exceptions - again, these are internal between their worker and the D1 hosts. And many of the hung queries wouldn't even show up under traces in their observability dashboard so unless you have your own timeout detection you wouldn't even know things are not working. It was hard to get someone on their side to take a look and actually acknowledge and understand the problem.

But even without the network issues that have plagued it, I would hesitate to build anything for production on it because it can't even do transactions, and the product manager for D1 has openly stated they won't implement them [0]. Your only way to ensure data consistency is to use a Durable Object, which comes with its own costs and tradeoffs.

https://github.com/cloudflare/workers-sdk/issues/2733#issuec...

The basic idea of D1 is great. I just don't trust the implementation.

For a hobby project it's a neat product for sure.


> And many of the hung queries wouldn't even show up under traces in their observability dashboard

How did you work around this problem? As in, how do you monitor for hung queries and cancel them?

> D1 reliability has been bad in our experience.

What about reads? We use D1 in prod & our traffic pattern may not be similar to yours (our workload is async queue-driven & so retries last in order of weeks), nor have we really observed D1 erroring out for extended periods or frequently.


> How did you work around this problem? As in, how do you monitor for hung queries and cancel them?

You just wrap your DB queries in your own timeout logic. You can then continue your business logic, but you can't truly cancel the query because, well, the communication layer for it is stuck and you can't kill it via a new connection. Your only choice is to abandon that query. Sometimes we could retry and it would immediately succeed, suggesting the original query probably hit something like packet loss that wasn't handled properly by CF. That's easy when it's a read, but with writes it gets complicated fast and you have to ensure your writes are idempotent. And since they don't support transactions, it's even more complex.
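The timeout wrapper described above can be sketched with `Promise.race`. This is my own minimal version, not Cloudflare's API; the helper name, the default timeout, and the usage line are assumptions.

```javascript
// Sketch of a timeout wrapper: abandon (not cancel) a query that hangs.
// The underlying query keeps running server-side; we only stop waiting.
function withTimeout(promise, ms = 5000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`query timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical usage with a D1-style binding:
//   const row = await withTimeout(db.prepare("SELECT 123").first(), 2000);
```

On a timeout you then have to decide per call site whether to retry (safe for reads, only safe for idempotent writes) or surface the error.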

Aphyr would have a field day with D1 I'd imagine.

> What about reads? We use D1 in prod & our traffic pattern may not be similar to yours (our workload is async queue-driven & so retries last in order of weeks), nor have we really observed D1 erroring out for extended periods or frequently.

We have reads and writes which most of the time are latency sensitive (direct user feedback). A user interaction can usually involve 3-5 queries, and they might need to run in sequence. When queries take 500ms+ the system starts to feel sluggish. When they take 2-3s it's very frustrating. The high latencies happened for both reads and writes; you can do a simple "SELECT 123" and it would hang. You could even reproduce that from the Cloudflare dashboard when it's in this degraded state.

From the comments of others who had similar issues I think it heavily depends on the CF locations or D1 hosts. Most people probably are lucky and don't get one of the faulty D1 servers. But there are a few dozen people who were not so lucky, you can find them complaining on Github, on the CF forum etc. but simply not heard. And you can find these complaints going back years.

This long timeframe without fixes to their network stack (networking is CF's bread and butter!), the refusal to implement transactions, the silence in their forum to cries for help, the absurdly low 10GB limit for databases... it just all adds up. We made the decision to not implement any new product on D1 and just continue using proper databases. It's a shame because workers + a close-by read replica could be absolutely great for latency. Paradoxically it was the opposite outcome.


The blog post has a benchmark comparison table with these two in it.

Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 Plus was just marginally better than Kimi 2.5. Looking at the stats, though, I'll definitely give GLM 5.1 a try now. [edit: even though, looking at it, it's not cheap and has a much smaller context size. And I can't tell about tool use.]

Quite strong results in the benchmarks, but why Gemini 3 Pro instead of 3.1? Why only for a few of the benchmarks? Why is OpenAI not in the coding benchmarks? Why Opus 4.5 and not 4.6? It just jumps out at me as a bit strange.

As always, we'll have to try it and see how it performs in the real world, but Qwen's open-weight models were pretty decent for some tasks, so I'm still excited to see what this brings.


> These theorems apply to any system of axioms that are rich enough to state the liar's paradox.

Isn't that circular reasoning or tautological, though? Rephrased: any system that can state something these theorems apply to can have the theorems applied to it.

I think the word "rich" is too imprecise in this context. It is not clear why there can't be a "richer" system which does not suffer from this issue and cannot state the liar's paradox.


Yeah, "rich" is a vague word here. Really we're trying to say: 'if a set of axioms can express provability, and can get a sentence to refer to itself, then it can state the liar sentence. And once it can state the liar sentence, the rest of Gödel's argument follows.' There's no circularity.
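The self-reference step above is the diagonal lemma; a standard-notation sketch (assuming $T$ is the theory, $\mathrm{Prov}_T$ its provability predicate, and $\ulcorner\cdot\urcorner$ Gödel numbering):

```latex
\text{Diagonal lemma: for any formula } \varphi(x) \text{ there is a sentence } G \text{ with}
\quad T \vdash G \leftrightarrow \varphi(\ulcorner G \urcorner).

\text{Taking } \varphi(x) := \neg\,\mathrm{Prov}_T(x) \text{ gives the G\"odel sentence:}
\quad T \vdash G \leftrightarrow \neg\,\mathrm{Prov}_T(\ulcorner G \urcorner),
```

i.e. $G$ "says" of itself that it is not provable, which is the formal stand-in for the liar sentence.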


After all the AI slop from Cloudflare in recent months and the embarrassment that came with it, they dare to launch this vibe coded project with THAT name on April 1st? I'm really not sure what to think anymore. Reality became too absurd.


After https://news.ycombinator.com/item?id=46781516 ? Yes, one can unfortunately put that into the realm of reality.


Which link exactly did you try to use? Or what specific version on the Github releases page? I checked both the latest windows and macos versions against Google Safe Browsing and all were fine.


I can't reproduce this either, OP is light on details.


