Bombthecat's comments

It's cheap. That's all.

Both of them look pretty old?

Code Clash, I think, would be quite hard to game or contaminate unintentionally, considering that models need to compete against one another.

https://gertlabs.com already does this at scale.

An industry-standard benchmark shouldn't be hosted or designed by a lab producing the models, regardless.


I mean the data / benchmarks

It's more like no one cares about UX. People keep using the product and it keeps printing money. Why invest in a UX researcher or designer?

Google stated a while back that, with TPUs, they are able to sell at cost or with profit.

In other words: everyone who uses Nvidia can't sell at cost, because Nvidia is so expensive.


In six months DeepSeek won't be SOTA anymore and usage will be way down.

Only comparing on SOTA scores (ignoring price etc.) is like choosing your daily-driver by looking at who makes the fastest sports-car...

The constant improvements of SOTA are the main thing keeping the investment machine running. We can't really separate training costs from inference costs, because much of the funding and loans for the inference hardware only exist because of the promises that continuous training makes (or tries to deliver on).

Not really. SOTA vs non SOTA is "can I get my coding work actually done today" vs. "this can do customer support chat"

It is like car vs. kick scooter.


It really isn't. We get coding work actually done today on Opus 4.5. That's not SOTA any more, and anything proximate to that level, even quite loosely, is genuinely useful.

OK, we're at "Opus 4.5 is not SOTA". Right, by that definition... yes, you are right.

I mean, it's almost half a year, I think that counts?

Time wise you are correct.

> "can I get my coding work actually done today" vs. "this can do customer support chat"

I think you need to define "can get coding work done" for this to make sense. I was using GPT-3 back then for basic scripts; does that count? Or only Claude Code?

I also think this is a false dichotomy. If you look at Project Vend or Vending-Bench, customer support etc. is by no means trivial. (Old but great story: https://www.businessinsider.com/car-dealership-chevrolet-cha...)


This. I have been doing my side-hustle code with OpenCode and the 3.2 reasoner, and it is way better than what I have at my day job with Copilot and whatever models are there.

Copilot is a bad harness that squanders the productivity of models like GPT 5.5.

Tell me more please!

Not really. The current SOTAs are already at the point that they can do that. The following models will start to surpass the daily work level. It's a diminishing returns situation just like anything else in tech.

A huge proportion of those scores are gamed anyway. Use whatever works for you at the price and availability you can afford.

Or there will be DSv4.1/2/3 ;)

Definitely something in this realm, they call the models "preview" at a bunch of different points in the paper.

What I'm really hoping for is a double punch like with V3 -> R1.


Well, if they distilled once…

Or it could be something

X-Files music plays


GPT xhigh isn't that bad...

Or the realization that you will lose your job

You still need to hold the model in memory. If you have, for example, 16 GB of RAM, the gains aren't that much.

That's not what consumes the most memory at scale. The KV caches are per-user.
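The per-user KV-cache point can be sketched with back-of-the-envelope arithmetic. The model shape numbers here (layers, KV heads, head dim, context length) are illustrative assumptions, not any specific model's config:

```python
# Rough KV-cache size per user for a decoder-only transformer.
# All shape numbers below are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values, cached at every layer for every token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Example: a hypothetical mid-size model at fp16 with a 32k-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB per concurrent user")  # -> 4.0 GiB per concurrent user
```

The weights are shared across all users, but each concurrent user carries their own cache like this, which is why serving at scale pays for memory well beyond the model itself, while a single local user mostly just pays for the weights.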

And users won't need to visit your site.

Thank you

