The one that always gets me is how they're insistent on giving 17-step instructions to any given problem, even when each step is conditional and requires feedback. So in practice you need to do the first step, then report the results, and have it adapt, at which point it will repeat steps 2-16. IME it's almost impossible to reliably prevent it from doing this, however you ask, at least without severely degrading the value of the response.
In my experience Gemini 3.0 pro is noticeably better than chatgpt 5.2 for non-coding tasks. The latter gives me blatantly wrong information all the time, the former very rarely.
I agree, and it has been my almost exclusive go-to ever since Gemini 3 Pro came out in November.
In my opinion Google isn't as far behind in coding as comments here would suggest. With Fast, it might already have edited 5 files before Claude Sonnet finished processing your prompt.
There is a lot of potential here, and with Antigravity, as well as Gemini CLI (which I haven't tested), they are working on capitalizing on it.
Strange that you say that because the general consensus (and my experience) seems to be the opposite, as well as the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.
Google actually has the BEST ratings in the AA-Omniscience Index:
AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer.
Gemini 3.1 holds the top spot, followed by 3.0 and then Opus 4.6 Max.
Yes and no. The hallucination rate shown there is the percentage of the time the model answers incorrectly when it should have instead admitted to not knowing the answer. Most models score very poorly on this, with a few exceptions, because they nearly always try to answer. It's true that 3.0 is no better than others on this. But given that it does know the correct answers much more often than e.g. GPT 5.2, it does in fact give hallucinated answers much less often.
In short, its hallucination rate as a percentage of unknown answers is no better than most models', but its hallucination rate as a percentage of total answers is indeed better.
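To make that distinction concrete, here's a sketch with made-up numbers (nothing from the actual benchmark): two models can have the identical conditional hallucination rate while one of them hallucinates far less often overall, simply because it knows more answers.

```python
# Illustrative only: hypothetical question counts, not real benchmark data.

def hallucination_rates(total, known_correct, hallucinated):
    """Return (conditional, overall) hallucination rates.

    conditional: fraction of unknown-answer questions the model
                 answered wrongly instead of refusing.
    overall:     fraction of ALL questions answered wrongly.
    """
    unknown = total - known_correct
    return hallucinated / unknown, hallucinated / total

# Model A: knows 90% of the answers, almost never refuses on the rest.
cond_a, overall_a = hallucination_rates(total=1000, known_correct=900, hallucinated=95)

# Model B: knows only 60%, also almost never refuses on the rest.
cond_b, overall_b = hallucination_rates(total=1000, known_correct=600, hallucinated=380)

print(cond_a, overall_a)  # 0.95 0.095
print(cond_b, overall_b)  # 0.95 0.38
```

Both models hallucinate on 95% of the questions they don't know, so they look identical on the conditional metric, yet Model A gives a wrong answer only 9.5% of the time overall versus 38% for Model B.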
I can only speak to my own experience, but for the past couple of months I've been duplicating prompts across both for high value tasks, and that has been my consistent finding.
> the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.
As sibling comment says, AA-Omniscience Hallucination Rate Benchmark puts Gemini 3.0 as the best performing aside from Gemini 3.1 preview.
Based on the self driving trials in my Model Y, I find it terrifying that anyone trusts it to drive them around. It required multiple interventions in a single 10-minute drive last time I tried it.
I'm using FSD for 100% of my driving and only need to intervene maybe once a week. It's usually because the car is not confident or too slow, not because it's doing something dangerous. Two years ago it was very different: almost every trip I needed to intervene to avoid a crash. The progress they have made is truly amazing.
This exact sentence (minus the specific version) is claimed every single week.
No, you do not "become scary good" every single week for the past 10 years and yet still fail to drive coast to coast all by yourself (which Elon promised it would do a decade ago).
You are just human and bad at evaluating it. You might even be experiencing literal statistical noise.
I have not been proclaiming scary good every week for the last 10 years. In fact, I have cancelled my subscription at least two times, once on v13 and once on v14, with the reason ‘not good enough yet.’ I am telling you that for me personally it has crossed a threshold very recently.
It certainly wasn't in the past few weeks, but I've been hearing about how good it's gotten for years. Certainly not planning to pay to find out if it's true now, but I'll give it another try next free trial!
Make sure you are on AI4 hardware when you do. If you buy FSD on AI3 you'll be limited to v13, which is terrible. I have used both and they are in different leagues altogether.
Would've been, once. These days I assume bentcorner asked their favourite LLM to generate a poem parodying Ozymandias about once-popular youtube videos.
It doesn't feel like it at all (I'd never expect an LLM to say 'pfp' like that, or 'lossly[sic] compressed', ASCII instead of fancy quotes) but who knows at this point.
I may have gotten incredibly neurotic about online text since 2022.
I actually considered using an LLM, but in my experience they "warp" the content too much for anything like this. The effort required to get them to retain what I would consider something to my taste would take longer than just writing the poem myself. (Although tbf it's been a while since I've asked an LLM to do parody work, so I could be wrong.)
I think you're missing their point. The question you're replying to is: how do we know that this made-up content is a hallucination, i.e., as opposed to being made up by a human? I think it's fairly obvious via Occam's Razor, but still, they're not claiming the quotes could be legit.
You seem to be quite certain that I had not read the article, yet I distinctly remember doing so.
By what process do you imagine I arrived at the conclusion that the article suggested the published quotes were LLM hallucinations, when that was not mentioned in the article title?
You accuse me of performative skepticism, yet all I think is that it is better to have evidence over assumptions, and it is better to ask if that evidence exists.
It seems a much better approach than making false accusations based upon your own vibes. I don't think Scott Shambaugh went to that level, though.
>On Friday afternoon, Ars Technica published an article containing fabricated quotations generated by an AI tool and attributed to a source who did not say them. That is a serious failure of our standards. Direct quotations must always reflect what a source actually said.
I don't think the threat is to leave for 2 years then come back. He just doesn't want to commit to leaving forever. Who knows if in a decade it'll be Android with the shitty keyboard (or Apple will have the better Direct Brain Interface, or whatever). Most likely though, if someone switches ecosystems for 2+ years, they're going to get used to the new one and stay there.
I'm curious why my experience with Windows 11 is so different from what I regularly read. It was some years ago now, so I don't remember exactly what configuration steps I went through, but presumably I turned off ads when I first installed. And so, I don't get ads. I don't recall ever seeing an ad embedded in Windows. Are people talking about Edge (which I don't use) or inside the Microsoft Store (which I very rarely use, but I presume does have sponsored apps or whatever)? Or is this mostly people who don't use Windows, repeating what others have said? Or are these ads targeted at users who aren't me?
There is a setting that turns off many of the notifications that irritate people.
Settings -> System -> Notifications. Scroll to the bottom, expand Additional settings. Uncheck "Suggest ways to get the most out of Windows and finish setting up this device" and "Get tips and suggestions when using Windows".
I get more prompts from macOS about Apple products than I get from Windows about Microsoft products after unchecking those two settings.
Your Windows 11 experience strongly, strongly depends on where you are. Are you inside the EU? 90% of the crap people complain about is simply illegal and you don't see any of it.
I'm in the US, and I never experience any of the issues people complain about. I just checked, and I don't have that setting disabled that one commenter mentioned upthread. But I do have all notifications off. Maybe that is why?
People just like to hate on Windows.
The home version has some issues and limitations, but if you are willing to invest in the Pro version, it's mostly fine, really.
There are still many complaints to be had, but the fact is that Windows does what it needs to do on a wide range of hardware without much hassle if you know what you are doing.
I've often found when receiving a clueless support response like this, it can be effective to just follow up with a polite request to forward the ticket to an engineer or developer. Usually the front-line csr simply hasn't understood the issue. In this case I would say something like,
"Yes, I managed to work around the issue by switching to my personal email address, but this bug is still preventing me from using my work email domain. If you could please forward the error log I included to a developer, it should help them resolve the issue. Thank you."
I think insiders would tell you off the record not to get an R1, but that the R2s should be much more robust. Of course it's untested at this point, but hopefully that's the case.