Hacker News | RC_ITR's comments

Isn't the whole selling point of AI agents that you now can do things like scale 3x without scaling your team accordingly?

I haven't seen anyone claim that applies to infrastructure or compute.

Since apparently LLMs have also conquered physics, “Claude, transmute this lead to gold for me.”

Yeah, it's almost like the point I was making is that everyone is overselling AI agents' capabilities.

I’m sure someone is out there claiming that AI is going to solve all your business’s problems no matter what they are. Remotely sane people are saying it will solve (or drastically improve) certain classes of problems. 3x code? Sure. 3x the physical hardware in a data center? Surely not.

Implying that software is somehow divorced from infrastructure/compute efficiency and utilization isn't a claim I've seen many make either.

I assume so. They're doing it with around 99% uptime.

Or, to put it another way, almost 2 9s.
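(For anyone unfamiliar with the "nines" shorthand: availability is usually quoted as the number of leading nines in the uptime fraction, which is just a log. A quick sketch, with the function name `nines` my own:)

```python
import math

def nines(uptime: float) -> float:
    """Availability expressed as 'number of nines', e.g. 0.999 -> 3.0."""
    return -math.log10(1.0 - uptime)

print(nines(0.99))   # 99% uptime -> 2 nines
print(nines(0.9999)) # 99.99% uptime -> 4 nines
```

So 99% uptime is exactly two nines, which still allows roughly 3.65 days of downtime per year.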

This is a discussion with nearly unanimous agreement that poor ATC working conditions are causing Americans to die in preventable aviation accidents.

Maybe this is the one evidence-driven case where you can be open-minded about the value of a public employee union?


Nope. Public employee unions bring zero value and this incident is not evidence to support such unions. Relying on unions to act as ersatz safety regulators would be stupid, just completely the wrong approach. Decisions about things like ATC procedures, staffing levels, and training standards should be the responsibility of apolitical career bureaucrats.


Why would a career bureaucrat be a more efficient way to figure out how to attract and retain ATC workers, as opposed to a union representing those ATC workers?

Your proposal intentionally injects inefficiency and noise into the system because you don't like some political boogeyman.


I'm not positive this was a secret (See: Reddit post about it from 2018):

https://www.reddit.com/r/TheSilphRoad/comments/8i7byi/pokemo...


I’m sure all the Pokémon Go players caught that post!


Well, then we get into the area of 'How many people know Google is logging their searches to serve them more targeted YouTube ads?'


It's like that FT chart claiming that the rapid rise in iOS apps is evidence of an AI-fueled productivity boom.

I always ask people, in the past year, how many AI-coded apps have you 1) downloaded 2) paid for?


In addition to that, what they don’t mention is that:

1. Other app stores like Google Play and Steam haven’t seen this rapid rise.

2. There are thousands, maybe tens of thousands, of apps that are just wrappers calling OpenAI APIs, or similar low-effort AI apps, making up a large percentage of this increase.

3. There are billions of dollars pouring into AI startups and many of them launch an iOS app.


Has Steam not seen a rapid rise in AI-asset shovelware?

I'm not talking about the AAA or the AA or even the A space (where AI is being incorporated into dev processes with various degrees of both success and low effort slop), I'm talking about the actual bottom of the barrel.


You never needed AI to make shovelware; you've been able to make a shitty game over a weekend ever since RPG Maker came out, and there are still games made using it.

AI just helps create some assets for games; it doesn't really make it easier or faster to make games, but they might look a bit better.


I can’t speak to the quality of all the games released, but in January 2025 there were 1,413 games released on Steam and in January of this year there were 1,448.


> I always ask people, in the past year, how many AI-coded apps have you 1) downloaded 2) paid for?

In the past 5 years, the only "new" app I've added to my phone has been Claude.ai.

Before that I guess DoorDash. And that probably covers the past 7ish years of phone use.

There's just too much shit in the store, a lot of it is scammy or has dark patterns.

For me, "app stores" are largely dead.


> It's like that FT chart claiming that the rapid rise in iOS apps is evidence of an AI-fueled productivity boom.

I mean, there is evidence for some change. Personally, I'm sceptical of what this will amount to, but prior to EOY 2025, there really wasn't any evidence for an app/service boom, and now there's weak evidence, which is better than none.


Because so much technical functionality has been lost/paywalled/dark-patterned/enshittified, I've cut the number of apps I use. I've realized building core personal functionality around the whims of corporations eventually just gets weaponized against me, so I might as well start undoing that on my own terms. Who in 2026 is really bringing in a new app/SaaS to do much of anything, like we naively did a decade ago? No one I know; we've been shown we will be treated as suckers for doing that.


The bird not having wings, but all of us calling it a 'solid bird' is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs 'webbed feet' which are nowhere to be found in the image.

This pattern of considering 90% accuracy (like the level we've seemingly stalled out on for the MMLU and AIME) to be 'solved' is really concerning for me.

AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.


This test is so far beyond AGI. Try to spit out the SVG for a pelican riding a bicycle. You are only allowed to use a simple text editor. No deleting or moving the text cursor. You have 1 minute.


Sorry, is your definition of AGI "doing things worse than humans can do, but way faster?" because that's been true of computers for a long time.


I mean for this particular benchmark, yes.

You'd have to put it in an agentic loop to perform corrections otherwise.


MMLU performance caps out around 90% because there are tons of errors in the actual test set. There's a pretty solid post on it here: https://www.reddit.com/r/LocalLLaMA/comments/163x2wc/philip_...

As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025


Here's the score for new AIMEs, where we know the answers aren't in training:

https://matharena.ai/?view=problem&comp=aime--aime_2026

As for MMLU, is your assertion that these AI labs are not correcting for errors in these exams and then self-reporting scores less than 100%?

As implied by the video, wouldn't it then take one intern a week, at most, to fix those errors and allow any AI lab to become the first to consistently 100% the MMLU? I can guarantee Moonshot, DeepSeek, or Alibaba would be all over the opportunity to do just that if it were a real problem.


The benchmarks are harder than you might imagine and contain more wrong answers and terrible questions than you would expect.

You don't need to take my word for it, try playing MMLU yourself.

https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...

It's not MMLU-Pro, btw, which is considerably harder.


Sure and AGI will 100% it 100% of the time, even if it is hard.


Your definition of AGI must be absurd


It has a wing. Look at the code comments in the SVG!


I may not be AGI, but here's a $615 2 Queen bed hotel room for the dates he wants in exactly the location he wants (just not on Airbnb).

https://www.booking.com/Share-Wt9ksz

Maybe he really is tied to $600 as his absolute upper limit, but also seems like something a few years from AGI would think to check elsewhere.


Yeah, I've found AI 'miracle' use-cases like these are most obvious for wealthy people who stopped doing things for themselves at some point.

Typing 'Find me reservations at X restaurant' and getting unformatted text back is way worse than just going to OpenTable and seeing a UI that has been honed for decades.

If your old process was texting a human to do the same thing, I can see how Clawdbot seems like a revolution though.

Same goes for executives who vibecode in-house CRM/ERP/etc. tools.

We all learned the lesson that mass-market IT tools almost always outperform in-house, even with strong in-house development teams, but now that the executive is 'the creator,' there's significantly less scrutiny on things like compatibility and security.

There's plenty real about AI, particularly as it relates to coding and information retrieval, but I've yet to see an agent actually do something that even remotely feels like the result of deep and savvy reasoning (the precursor to AGI) - including all the examples in this post.


> Typing 'Find me reservations at X restaurant' and getting unformatted text back is way worse than just going to OpenTable and seeing a UI that has been honed for decades.

You're conflating the example with the opportunity:

"Cancel Service XXX", where the service is riddled with dark patterns. Giving everyone an "assistant" that can do this is a game changer. This is why a lot of people who aren't that deep in tech think OpenClaw is interesting.

> We all learned the lesson that mass-market IT tools almost always outperform in-house

Do they? Because I know a lot of people who have (as an example) terrible setups with Salesforce that they have to use.


I feel bad for whoever gets an oncall page that some executive's vibe coded app stopped working and needs to be fixed ASAP.


> We all learned the lesson that mass-market IT tools almost always outperform in-house,

Funny, I learned the exact opposite lesson. Almost all software sucks, and a good way for it not to suck is to know where the developer is and go tell them their shit is broken, in person.

If you want a large-scale example, one of the two main law enforcement agencies in France spun off LibreOffice into their own legal writing software. Developed by LEOs who can take up to two weeks a year to work on it. Awesome software. It would cost literally millions if bought on the market.


Speaking of suboptimal writing, why call it a 'gay' love affair, when he was openly gay?


One of the most important details of Sacks's life which dogged him nearly to the end (and which is important to this NY piece), was a minimization by Sacks of his own sexuality. He was not "openly gay" at all.


For most of his life, he was not openly gay.


One of the biggest problems frontier models will face going forward is how many tasks require expertise that cannot be achieved through Internet-scale pre-training.

Any reasonably informed person realizes that most AI start-ups looking to solve this are not trying to create their own pre-trained models from scratch (they will almost always lose to the hyperscale models).

A pragmatic person realizes that they're not fine-tuning/RL'ing existing models (that path has many technical dead ends).

So, a reasonably informed and pragmatic VC looks at the landscape, realizes they can't just put all their money into the hyperscale models (LPs don't want that), and looks for start-ups that take existing hyperscale models and expose them to data that wasn't in their pre-training set, hopefully in a way that's useful to some users somewhere.

To a certain extent, this study is like saying that Internet start-ups in the '90s relied on HTML and weren't building their own custom browsers.

I'm not saying that this current generation of start-ups will be as successful as Amazon and Google, but I just don't know what the counterfactual scenario is.


The question that isn't answered completely in the article is how useful the pipelines are for these startups. The article certainly implies that, for at least some of these startups, there's very little value added in the wrapper.


Got any links to explanations of why fine-tuning open models isn't a productive solution? Besides renting the GPU time, what other downsides exist with today's SOTA open models for doing this?


When the new pre-trained parameters come out in a new model generation, your old fine tuning doesn't apply to them.
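(A toy numerical sketch of that point, not anyone's actual training code; the additive-delta, LoRA-style framing is my own assumption here. A fine-tune is effectively an offset learned against one specific set of base weights, so transplanting it onto a retrained base recovers neither model's behavior:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation-1 base model weights (stand-in for pre-trained parameters).
base_v1 = rng.normal(size=(4, 4))

# Fine-tuning learned (roughly) as a small offset optimized against base_v1.
delta = 0.1 * rng.normal(size=(4, 4))
tuned_v1 = base_v1 + delta  # coherent: delta corrects base_v1 specifically

# Generation 2 is retrained from scratch: same shapes, unrelated parameters.
base_v2 = rng.normal(size=(4, 4))
tuned_v2 = base_v2 + delta  # delta encodes corrections to base_v1, not base_v2

# The transplanted fine-tune lands somewhere unrelated to the old tuned model.
print(np.linalg.norm(tuned_v1 - tuned_v2))
```

The shapes may even match across generations, but the offset carries no information about the new base's parameters, so the fine-tuning work has to be redone.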


I think the word "de-enshittify" is probably the least elegant piece of slang ever uttered.

I know linguistics is descriptive, not prescriptive, but the lengths people will go to in order to swear are truly amazing to me.


https://news.ycombinator.com/item?id=45918211

Blame Doctorow for swearing, not me!

