
Every week we get a new AI that according to the AI-goodness-benchmarks is 20% better than the old AI, yet the utility of these latest SOTA models is only marginally higher than the first ChatGPT version released to the public a few years back.

These things have the reasoning skills of a toddler, yet we keep fine-tuning their writing style to be more and more authoritative - this one is only missing the font and color scheme; otherwise the output is formatted exactly like a research paper.



Just yesterday I did my first Deep Research with OpenAI on a topic I know well.

I have to say I am really underwhelmed. It all sounds authoritative and the structure is good. It all sounds and feels substantial on the surface but the content is really poor.

Now people will blame me and say: you have to get the prompt right! Maybe. But then at the very least put a disclaimer on your highly professional-sounding dossier.


> It all sounds and feels substantial on the surface but the content is really poor.

They're optimizing for the sales demo. Purchasing managers aren't reading the output.


You didn't expect it to do the whole job for you at PhD level, did you? You did? Hmm... ;) They are not there yet but getting closer. Quite some progress for 3 years.


No :) the prompt was about a marketing strategy for an app. The output was very generic, and it got the category of the app completely wrong to begin with.

But I admit that I didn’t spend a huge amount of time designing the prompt.


I think what some people are finding is that it's producing superficially good results, but there are actually no decent 'insights' integrated with the words. In other words, it's just a super search on steroids. Which is kind of disappointing?


This sounds like a good thing! Sounds like “it’s professional sounding” is becoming less effective as a means of persuasion, which means we’ll have much less fallacious logic floating around and will ultimately get back to our human roots:

Prove it or fight me


I think it's bound to underwhelm the experts. What this does is go through a number of public search results (I think it's Google search for now; it could be an internal corpus), and hence it skips all the paywalled and proprietary data that is not directly accessible via Google. It can produce great output, but it's limited by the sources it can access. If you're an expert, you know more, because you understand the topic better and know sources that aren't indexed by Google yet. Moreover, there's a possibility that most Google-surfaced results are dumbed-down, simplified versions written to appeal to a wider audience.


What was the prompt?


There were two step changes: ChatGPT/GPT-3.5, and GPT-4. Everything after feels incremental. But that's perhaps understandable. GPT-4 established just how many tasks could be done by such models: approximately anything that involves, or could be adjusted to involve, text. That was the categorical milestone GPT-4 crossed. Everything since then has been about slowly increasing model capabilities, which translates into which tasks can be done in practice, reliably, to acceptable standards. Gradual improvement is all that's left now.

That's basically how progress on everything ever looks.

The next huge jump will have to again make a qualitative change, such as enabling AI to handle a new class of tasks - tasks that fundamentally cannot be represented in text form in a sensible fashion.


But they are already multimodal. The Google one can do live streaming video understanding with a conversational in-out prompt. You can literally walk around with your camera and just chat about the world. No text to be seen (although perhaps under the covers it is translating everything to text, but the point is the user sees no text).


Fair, but OpenAI was doing that half a year ago (though with limited access; I myself got it maybe a month ago), and I haven't yet seen it translate into anything in practice, so I feel like it (and multimodality in general) must still be at a GPT-3 level of ability at this point.

But I do expect the next qualitative change to come from this area. It feels exactly like what is needed, but it somehow isn't there just yet.


Not true at all. The original ChatGPT was useless other than as a curious entertainment app.

Perplexity, OTOH, has almost completely replaced Google for me now. I'm asking it dozens of questions per day, all for free because that's how cheap it is for them to run.

The emergence of reliable tool use last year is what has skyrocketed the utility of LLMs. That has made search and multi-step agents feasible, and by extension applications like Deep Research.
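
For the unfamiliar, "tool use" boils down to a loop roughly like the sketch below. Everything in it is a hypothetical stand-in (call_llm and web_search are not any vendor's actual API); it just shows why reliable tool calls make multi-step agents feasible:

    # Minimal tool-use loop: the model either answers or requests a tool
    # call; we run the tool and feed the result back into the context.
    # call_llm() and web_search() are stand-ins, not a real vendor API.
    def call_llm(messages):
        # stand-in: a real version calls a chat API with tools enabled;
        # here we fake one search round followed by a final answer
        if not any(m["role"] == "tool" for m in messages):
            return {"tool": "web_search", "args": {"query": messages[0]["content"]}}
        return {"tool": None, "content": "answer grounded in tool results"}

    def web_search(query):
        return f"top results for {query!r}"  # stand-in for a search backend

    TOOLS = {"web_search": web_search}

    def run_agent(question, max_steps=5):
        messages = [{"role": "user", "content": question}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            if reply["tool"] is None:  # model decided it's done
                return reply["content"]
            result = TOOLS[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": result})
        return "stopped after max_steps"

    print(run_agent("what changed in LLM tool use?"))

Products like Deep Research are presumably this loop run for many rounds, with some planning layered on top.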


If your goal is to replace one unreliable source of information (the Google first page) with another, sure - we may be there. I'd argue GPT-3.5 already outperformed Google for a significant number of queries. The only difference between then and now is that the context window is now large enough that we can afford to paste into the prompt what we hope are a few relevant files.

Yet what's essentially "cat [62 random files we googled] > prompt.txt" is now being confidently presented with academic language as "62 sources". This rubs me the wrong way. Maybe this time the new AI really is so much better than the old AI that it justifies using that sort of language, but I've seen this pattern enough times that I can be confident that's not the case.
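
To spell out the uncharitable reading as a toy sketch (fetch_top_results and call_llm are hypothetical stand-ins, not anyone's actual pipeline):

    # "62 sources", uncharitably: concatenate whatever search returned
    # into one big prompt and have the model write a confident report.
    def fetch_top_results(query, n=62):
        return [f"contents of search result {i} for {query!r}" for i in range(n)]  # stand-in

    def call_llm(prompt):
        return "authoritative, well-structured report"  # stand-in

    def deep_research(query):
        sources = fetch_top_results(query)
        prompt = f"Write a research report on {query}.\n\n" + "\n\n".join(sources)
        return call_llm(prompt) + f"\n\nSources: {len(sources)}"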


> Yet what's essentially "cat [62 random files we googled] > prompt.txt" is now being confidently presented with academic language as "62 sources".

That's not a very charitable take.

I recently quizzed Perplexity (Pro) on a niche political issue in my niche country, and it compared favorably with a purpose-built RAG on exactly that news coverage (rough shape sketched below): it was faster and more fluent, and the info content was the same. As I am personally familiar with these topics, I was able to manually verify that both were correct.

Outside these tests I haven't used Perplexity a lot yet, but so far it does look capable of surfacing relevant and correct info.
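
For context, by "purpose-built RAG" I mean roughly the shape below. This is a fully toy sketch: the hashed bag-of-words embed() stands in for a real sentence-embedding model, and the final LLM call is left out:

    # Toy RAG: embed a fixed news corpus, retrieve by cosine similarity,
    # and stuff the top hits into the prompt.
    import numpy as np

    def embed(texts, dim=512):
        # stand-in for a real embedding model: hashed bag-of-words
        vecs = np.zeros((len(texts), dim))
        for i, t in enumerate(texts):
            for tok in t.lower().split():
                vecs[i, hash(tok) % dim] += 1.0
        return vecs

    def retrieve(question, corpus, k=5):
        docs, q = embed(corpus), embed([question])[0]
        sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-9)
        return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

    def build_prompt(question, corpus):
        context = "\n\n".join(retrieve(question, corpus))
        return f"Answer using only these articles:\n\n{context}\n\nQ: {question}"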


Perplexity with DeepSeek R1 (they have the real thing running on Amazon servers in the USA) is a game changer. It doesn't just use the top results from a Google search; it considers what domains to search for information relevant to your prompt (roughly the two-phase shape sketched below).

I boycotted AI for about a year, considering it to be mostly garbage, but I’m back to perplexifying basically everything I need an answer for.

(That said, I agree with you that they’re not really citations, but I don’t think they’re trying to be academic. It’s just: here’s the source of the info.)
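
What "considers what domains to search" might look like, as a hypothetical sketch (none of these functions are Perplexity's actual pipeline):

    # Two-phase search: a reasoning model first plans targeted queries
    # (e.g. site-restricted ones), then a plain searcher executes them
    # and the model synthesizes. Every function here is a stand-in.
    def plan_queries(prompt):
        return [f"site:arxiv.org {prompt}", f"site:reuters.com {prompt}"]  # stand-in

    def search(query):
        return [f"result for {query!r}"]  # stand-in

    def synthesize(prompt, docs):
        return f"answer to {prompt!r}, citing {len(docs)} retrieved pages"  # stand-in

    def deep_search(prompt):
        docs = [d for q in plan_queries(prompt) for d in search(q)]
        return synthesize(prompt, docs)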


I'd love to read something on how Perplexity+R1 integrates sources into the reasoning part.


> all for free because that's how cheap it is for them to run.

No, these AI companies are burning through huge amounts of cash to keep the thing running. They're competing for market share - the real question is: will anyone ever pay for this? I'm not convinced they will.


> They're competing for market share - the real question is will anyone ever pay for this?

The leadership of every 'AI' company will be looking to go public and cash out well before this question ever has to be answered. At this point, we all know the deal. Once they're publicly traded, the quality of the product goes to crap while fees get ratcheted up every which way.


That's when the 'enshittification' engine kicks in. Pop-up ads on every result page, etc. It's not going to be pretty.


The question of "will people pay" is answered--OpenAI alone is at something like $4 billion in ARR. There are also smaller players (relatively) with impressive revenue, many of whom are profitable.

There are plenty of open questions in the AI space around unit economics, defensibility, regulatory risks, and more. "Will people pay for this" isn't one of them.


As someone who loves OpenAI’s products, I still have to say that if you’re paying $200/month for this stuff then you’ve been taken for a ride.


Honestly, I've not coded in 5+ years (RoR), and a project I'm involved with needed a few days' worth of TLC. A combination of Cursor, Warp, and OAI Pro delivered the results with no sweat at all: an upgrade of Ruby 2 to 3.7, a move to jsbundling-rails and cssbundling-rails, a Yarn upgrade, and an all-new pipeline. That's not trivial stuff for a production app with paying customers.

The obvious crutch of this new AI stack reduced go-live time from 3 weeks to 3 days. Well worth the cost IMHO.


Yeah, I'm skeptical about the price point of that particular product as well.


This is my first time using anything from Perplexity and I am liking this quite a bit.

There seems to be such variance in the utility people find with these models. It's like how Feynman wouldn't find much value in what a language model says about quantum electrodynamics, but neither would my mom.

I suspect there is a sweet spot of ignorance and curiosity.

Deep Research seems to be reading a bunch of arXiv papers for me, combining the results and then giving me the references. Pretty incredible.


It's not free because it's cheap for them to run. It's free because they are burning late-stage VC dollars. Despite what you might believe if you only follow them on Twitter, the biggest input to their product, i.e. a search index, is mostly based on Brave/Bing/SerpAPI, and those numbers are pretty tight. Big expectations for ads will determine what the company does.


Yeah, I don't get OP's take. ChatGPT 3.5 was basically just a novelty, albeit an exciting one. The models we've gotten since have ingrained themselves into my workflows as productivity multipliers. They are significantly better and more useful (and multimodal) than what we had in 2022, not just marginally better.


I use these models to aid bleeding-edge ML research every day. Sonnet can make huge changes and bug fixes to my code (which does stuff nobody else has tried this way before), whereas GPT-3.5 Turbo couldn't even repeat a given code block without dropping variables and breaking things. o1 can reason through very complex model designs and signal-processing stuff that even I have a hard time wrapping my head around.


On the other hand, if you try to solve some problem by creating the code using AI only, and it misses just one thing, it can take more time to debug that problem than it would have taken to write the code from scratch. Understanding a larger piece of AI-generated code is sometimes as hard as, or harder than, constructing the solution to your problem yourself.


Yes, it’s important to make sure it’s easy to verify that the code is correct.


As someone who's been using OpenAI's ChatGPT every day for work, I tested Perplexity's free Deep Research feature today and was blown away by how good it is. It's unlike anything I've seen over at OpenAI, and I have tested all of their models. I have canceled my OpenAI monthly subscription.


What did you ask it that blew you away?

Every time I see a comment about someone getting excited about some new AI thing, I want to go try it and see for myself, but I can't think of a real-world use case at the right level of difficulty that would impress me.


I asked it to expand an article with further information about the topic; it searched online and did exactly that.


It is ridiculous.

Many of the AI companies riding the hype are being overvalued on the idea that if we just fine-tune LLMs a bit more, a spark of consciousness will emerge.

It is not going to happen with this tech - I wish the LLM-AGI bubble would burst already.


If you don't realize how models like Gemini 2 and o3-mini are wildly better than GPT-4, then clearly you're not very good at using them.




