Hacker News | tempestn's comments

The one that always gets me is how they're insistent on giving 17-step instructions to any given problem, even when each step is conditional and requires feedback. So in practice you need to do the first step, then report the results, and have it adapt, at which point it will repeat steps 2-16. IME it's almost impossible to reliably prevent it from doing this, however you ask, at least without severely degrading the value of the response.

In my experience Gemini 3.0 Pro is noticeably better than ChatGPT 5.2 for non-coding tasks. The latter gives me blatantly wrong information all the time, the former very rarely.

I agree, and it has been my almost exclusive go-to ever since Gemini 3 Pro came out in November.

In my opinion Google isn't as far behind in coding as comments here would suggest. With Fast, it might already have edited 5 files before Claude Sonnet finished processing your prompt.

There is a lot of potential here, and with Antigravity as well as Gemini CLI (I did not test that one) they are working on capitalizing on it.


Strange that you say that because the general consensus (and my experience) seems to be the opposite, as well as the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.

Google actually has the BEST ratings in the AA-Omniscience Index: AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer.

Gemini 3.1 holds the top spot, followed by 3.0 and then Opus 4.6 Max.


This isn't actually correct.

Gemini 3.0 gets a very high score because it's very often correct, but it does not have a low hallucination rate.

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

It looks like 3.1 is a big improvement in this regard, it hallucinates a lot less.


Yes and no. The hallucination rate shown there is the percentage of time the model answers incorrectly when it should have instead admitted to not knowing the answer. Most models score very poorly on this, with a few exceptions, because they nearly always try to answer. It's true that 3.0 is no better than others on this. But given that it knows the correct answer much more often than, e.g., GPT 5.2, it does in fact give hallucinated answers much less often.

In short, its hallucination rate as a percentage of unknown answers is no better than most models', but its hallucination rate as a percentage of total answers is indeed better.
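To make the distinction concrete, here's a quick sketch with invented numbers (the fractions below are illustrative assumptions, not figures from the benchmark):

```python
# Invented, illustrative numbers -- not from the AA-Omniscience benchmark.
# "known" = fraction of questions each hypothetical model answers correctly.
known = {"model_a": 0.70, "model_b": 0.45}

# Suppose both models hallucinate on 90% of the questions they don't know,
# i.e. they almost never refuse to answer. That per-unknown rate is what
# the hallucination chart reports, and here it's identical for both models.
halluc_when_unknown = 0.90

for model, k in known.items():
    # Hallucinations as a fraction of ALL answers, not just unknown ones:
    total_halluc = (1 - k) * halluc_when_unknown
    print(f"{model}: {total_halluc:.1%} of all answers are hallucinated")
```

Both models look equally bad on the per-unknown metric (90%), but model_a hallucinates on about 27% of all its answers versus roughly 50% for model_b, which is the point being made above.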


I can only speak to my own experience, but for the past couple of months I've been duplicating prompts across both for high value tasks, and that has been my consistent finding.

> the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.

As sibling comment says, AA-Omniscience Hallucination Rate Benchmark puts Gemini 3.0 as the best performing aside from Gemini 3.1 preview.

https://artificialanalysis.ai/evaluations/omniscience


You are misreading the benchmark.

https://artificialanalysis.ai/#aa-omniscience-hallucination-...

If you look at the results, 3.0 hallucinates an awful lot when it's wrong.

It's just not wrong that often.

(And it looks like 3.1 does better on both fronts)


Google is good for answering questions but its writing is lacking. I’ve had to deal with Gemini slop and it’s worse than ChatGPT

Based on the self driving trials in my Model Y, I find it terrifying that anyone trusts it to drive them around. It required multiple interventions in a single 10-minute drive last time I tried it.

I'm using FSD for 100% of my driving and only need to intervene maybe once a week. It's usually because the car is not confident or is too slow, not because it's doing something dangerous. Two years ago it was very different; on almost every trip I needed to intervene to avoid a crash. The progress they have made is truly amazing.

Would you use FSD with your children in the car? I sure as hell wouldn’t. Progress is not safety.

Yes I do in fact use FSD with my children in the car.

I pray for you and them. You need it.

Oh well that's because you aren't using V18.58259a, I follow Elon's X and he said FSD is solved in that update. Clearly user error.

How long ago was that? I doubt it was the v14 software. The software has become scary good in the last few weeks, in my own subjective experience.

This exact sentence (minus the specific version) is claimed every single week.

No, you do not "become scary good" every single week for the past 10 years and yet still not be able to drive coast to coast all by itself (which Elon promised it would do a decade ago).

You are just human and bad at evaluating it. You might even be experiencing literal statistical noise.


I have not been proclaiming scary good every week for the last 10 years. In fact, I have cancelled my subscription at least two times, once on v13 and once on v14, with the reason ‘not good enough yet.’ I am telling you that for me personally it has crossed a threshold very recently.

It certainly wasn't in the past few weeks, but I've been hearing about how good it's gotten for years. Certainly not planning to pay to find out if it's true now, but I'll give it another try next free trial!

Make sure you are on AI4 hardware when you do. If you buy FSD on AI3 you’ll be limited to v13, which is terrible. I have used both and they are in different leagues altogether.

Because Opus 4.5 was released like a month ago and was state of the art, and now the significantly faster and cheaper version is already comparable.

"Faster" is also a good point. I'm using different models via GitHub copilot and find the better, more accurate models way to slow.

Opus 4.5 was November, but your point stands.

Fair. Feels like a month!

Would've been, once. These days I assume bentcorner asked their favourite LLM to generate a poem parodying Ozymandias about once-popular youtube videos.

It doesn't feel like it at all (I'd never expect an LLM to say 'pfp' like that, or 'lossly[sic] compressed', or to use ASCII instead of fancy quotes), but who knows at this point.

I may have gotten incredibly neurotic about online text since 2022.


Or you could get over it and still enjoy it anyway. Like how Coke Zero tastes.

That is a fair point. Especially since, assuming it was AI-generated, it presumably wouldn't have existed at all otherwise.

Brought to you by Carl's Jr

Nope, I hand wrote this.

I actually considered using an LLM, but in my experience they "warp" the content too much for anything like this. The effort required to get them to retain what I would consider something to my taste would take longer than just writing the poem myself. (Although tbf it's been a while since I've asked an LLM to do parody work, so I could be wrong.)


Ah, well, kudos then!

I think you're missing their point. The question you're replying to is: how do we know that this made-up content is a hallucination, i.e., as opposed to being made up by a human? I think it's fairly obvious via Occam's razor, but still, they're not claiming the quotes could be legit.

[dead]


You seem to be quite certain that I had not read the article, yet I distinctly remember doing so.

By what process do you imagine I arrived at the conclusion that the article suggested the published quotes were LLM hallucinations, when that was not mentioned in the article title?

You accuse me of performative skepticism, yet all I think is that it is better to have evidence over assumptions, and it is better to ask if that evidence exists.

It seems a much better approach than making false accusations based upon your own vibes. I don't think Scott Shambaugh went to that level, though.


https://news.ycombinator.com/item?id=47026071

https://arstechnica.com/staff/2026/02/editors-note-retractio...

>On Friday afternoon, Ars Technica published an article containing fabricated quotations generated by an AI tool and attributed to a source who did not say them. That is a serious failure of our standards. Direct quotations must always reflect what a source actually said.


I don't think the threat is to leave for 2 years then come back. He just doesn't want to commit to leaving forever. Who knows if in a decade it'll be Android with the shitty keyboard (or Apple will have the better Direct Brain Interface, or whatever). Most likely though, if someone switches ecosystems for 2+ years, they're going to get used to the new one and stay there.

I'm curious why my experience with Windows 11 is so different from what I regularly read. It was some years ago now, so I don't remember exactly what configuration steps I went through, but presumably I turned off ads when I first installed. And so, I don't get ads. I don't recall ever seeing an ad embedded in Windows. Are people talking about Edge (which I don't use) or inside the Microsoft Store (which I very rarely use, but I presume does have sponsored apps or whatever)? Or is this mostly people who don't use Windows, repeating what others have said? Or are these ads targeted at users who aren't me?

There is a setting that turns off many of the notifications that irritate people.

Settings -> System -> Notifications. Scroll to the bottom, expand Additional settings. Uncheck "Suggest ways to get the most out of Windows and finish setting up this device" and "Get tips and suggestions when using Windows".

I get more prompts from macOS about Apple products than I get from Windows about Microsoft products after unchecking those two settings.


Your Windows 11 experience strongly, strongly depends on where you are. Are you inside the EU? 90% of the crap people complain about is simply illegal and you don't see any of it.

In the US, of course, our government loves to let citizens be the product for corporations. America: by the corporations for the corporations.

Even more true now than it has been in maybe 100 years.


I'm in the US, and I never experience any of the issues people complain about. Just checked, and I don't have the setting disabled that the commenter upthread mentioned. But I do have all notifications off. Maybe that is why?

Pro/Enterprise Version?

Regular Joe version as far as I know.

I'm in Canada. I do have the pro version though; maybe that makes a difference.

People just like to hate on Windows. The home version has some issues and limitations, but if you are willing to invest in the Pro version, it's mostly fine, really.

There are still many complaints to be had, but the fact is that Windows does what it needs to do on a wide range of hardware without much hassle if you know what you are doing.


I've also never seen an ad in windows 11.

I did uninstall all of the weird apps like "News" "Weather" etc.


I've often found when receiving a clueless support response like this, it can be effective to just follow up with a polite request to forward the ticket to an engineer or developer. Usually the front-line csr simply hasn't understood the issue. In this case I would say something like,

"Yes, I managed to work around the issue by switching to my personal email address, but this bug is still preventing me from using my work email domain. If you could please forward the error log I included to a developer, it should help them resolve the issue. Thank you."


But imagine that you do that, and they solve the problem. What would you write in your blog about?

I think insiders would tell you off the record not to get an R1, but that the R2s should be much more robust. Of course it's untested at this point, but hopefully that's the case.
