The one that always gets me is how they're insistent on giving 17-step instructions to any given problem, even when each step is conditional and requires feedback. So in practice you need to do the first step, then report the results, and have it adapt, at which point it will repeat steps 2-16. IME it's almost impossible to reliably prevent it from doing this, however you ask, at least without severely degrading the value of the response.
In my experience Gemini 3.0 pro is noticeably better than chatgpt 5.2 for non-coding tasks. The latter gives me blatantly wrong information all the time, the former very rarely.
I agree, and it has been my almost exclusive go-to ever since Gemini 3 Pro came out in November.
In my opinion Google isn't as far behind in coding as comments here would suggest. With Fast, it might already have edited 5 files before Claude Sonnet finished processing your prompt.
There is a lot of potential here, and with Antigravity, as well as Gemini CLI (which I haven't tested), they are working on capitalizing on it.
Strange that you say that because the general consensus (and my experience) seems to be the opposite, as well as the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.
Google actually has the BEST ratings in the AA-Omniscience Index:
AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer.
Gemini 3.1 holds the top spot, followed by 3.0 and then Opus 4.6 Max.
Yes and no. The hallucination rate shown there is the percentage of the time the model answers incorrectly when it should have instead admitted to not knowing the answer. Most models score very poorly on this, with a few exceptions, because they nearly always try to answer. It's true that 3.0 is no better than others on this. But given that it does know the correct answers much more often than e.g. GPT 5.2, it does in fact give hallucinated answers much less often.
In short, its hallucination rate as a percentage of unknown answers is no better than most models', but its hallucination rate as a percentage of total answers is indeed better.
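To make that distinction concrete, here's a sketch with made-up numbers (nothing from the actual benchmark): two models can have the identical conditional hallucination rate while one of them hallucinates far less often overall, simply because it knows more answers.

```python
# Illustrative only: hypothetical question counts, not real benchmark data.

def hallucination_rates(total, known_correct, hallucinated):
    """Return (conditional, overall) hallucination rates.

    conditional: fraction of unknown-answer questions the model
                 answered wrongly instead of refusing.
    overall:     fraction of ALL questions answered wrongly.
    """
    unknown = total - known_correct
    return hallucinated / unknown, hallucinated / total

# Model A: knows 90% of the answers, almost never refuses on the rest.
cond_a, overall_a = hallucination_rates(total=1000, known_correct=900, hallucinated=95)

# Model B: knows only 60%, also almost never refuses on the rest.
cond_b, overall_b = hallucination_rates(total=1000, known_correct=600, hallucinated=380)

print(cond_a, overall_a)  # 0.95 0.095
print(cond_b, overall_b)  # 0.95 0.38
```

Both models hallucinate on 95% of the questions they don't know, so they look identical on the conditional metric, yet Model A gives a wrong answer only 9.5% of the time overall versus 38% for Model B.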
I can only speak to my own experience, but for the past couple of months I've been duplicating prompts across both for high value tasks, and that has been my consistent finding.
> the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.
As sibling comment says, AA-Omniscience Hallucination Rate Benchmark puts Gemini 3.0 as the best performing aside from Gemini 3.1 preview.
Based on the self driving trials in my Model Y, I find it terrifying that anyone trusts it to drive them around. It required multiple interventions in a single 10-minute drive last time I tried it.
I'm using FSD for 100% of my driving and only need to intervene maybe once a week. It's usually because the car is not confident or too slow, not because it's doing something dangerous. Two years ago it was very different: almost every trip I needed to intervene to avoid a crash. The progress they have made is truly amazing.
This exact sentence (minus the specific version) is claimed every single week.
No, you do not "become scary good" every single week for the past 10 years and yet still fail to drive coast to coast all by yourself (which Elon promised it would do a decade ago).
You are just human and bad at evaluating it. You might even be experiencing literal statistical noise.
I have not been proclaiming scary good every week for the last 10 years. In fact, I have cancelled my subscription at least two times, once on v13 and once on v14, with the reason ‘not good enough yet.’ I am telling you that for me personally it has crossed a threshold very recently.
It certainly wasn't in the past few weeks, but I've been hearing about how good it's gotten for years. Certainly not planning to pay to find out if it's true now, but I'll give it another try next free trial!
Make sure you are on AI4 hardware when you do. If you buy FSD on AI3 you'll be limited to v13, which is terrible. I have used both and they are in different leagues altogether.
Would've been, once. These days I assume bentcorner asked their favourite LLM to generate a poem parodying Ozymandias about once-popular youtube videos.
It doesn't feel like it at all (I'd never expect an LLM to say 'pfp' like that, or 'lossly[sic] compressed', ASCII instead of fancy quotes) but who knows at this point.
I may have gotten incredibly neurotic about online text since 2022.
I actually considered using an LLM, but in my experience they "warp" the content too much for anything like this. The effort required to get them to retain what I would consider something to my taste would take longer than just writing the poem myself. (Although tbf it's been a while since I've asked an LLM to do parody work, so I could be wrong.)
I think you're missing their point. The question you're replying to is: how do we know that this made-up content is a hallucination, i.e., as opposed to being made up by a human? I think it's fairly obvious via Occam's Razor, but still, they're not claiming the quotes could be legit.
You seem to be quite certain that I had not read the article, yet I distinctly remember doing so.
By what process do you imagine I arrived at the conclusion that the article suggested the published quotes were LLM hallucinations, when that was not mentioned in the article title?
You accuse me of performative skepticism, yet all I think is that it is better to have evidence over assumptions, and it is better to ask if that evidence exists.
It seems a much better approach than making false accusations based upon your own vibes. I don't think Scott Shambaugh went to that level, though.
>On Friday afternoon, Ars Technica published an article containing fabricated quotations generated by an AI tool and attributed to a source who did not say them. That is a serious failure of our standards. Direct quotations must always reflect what a source actually said.
I don't think the threat is to leave for 2 years then come back. He just doesn't want to commit to leaving forever. Who knows if in a decade it'll be Android with the shitty keyboard (or Apple will have the better Direct Brain Interface, or whatever). Most likely though, if someone switches ecosystems for 2+ years, they're going to get used to the new one and stay there.
I'm curious why my experience with Windows 11 is so different from what I regularly read. It was some years ago now, so I don't remember exactly what configuration steps I went through, but presumably I turned off ads when I first installed. And so, I don't get ads. I don't recall ever seeing an ad embedded in Windows. Are people talking about Edge (which I don't use) or inside the Microsoft Store (which I very rarely use, but I presume does have sponsored apps or whatever)? Or is this mostly people who don't use Windows, repeating what others have said? Or are these ads targeted at users who aren't me?
There is a setting that turns off many of the notifications that irritate people.
Settings -> System -> Notifications. Scroll to the bottom, expand Additional settings. Uncheck "Suggest ways to get the most out of Windows and finish setting up this device" and "Get tips and suggestions when using Windows".
I get more prompts from macOS about Apple products than I get from Windows about Microsoft products after unchecking those two settings.
Your Windows 11 experience strongly, strongly depends on where you are. Are you inside the EU? 90% of the crap people complain about is simply illegal and you don't see any of it.
I'm in the US, and I never experience any of the issues people complain about. I just checked, and I don't have that setting disabled that one commenter mentioned upthread. But I do have all notifications off. Maybe that is why?
People just like to hate on Windows.
The home version has some issues and limitations, but if you are willing to invest in the Pro version, it's mostly fine, really.
There are still many complaints to be had, but the fact is that Windows does what it needs to do on a wide range of hardware without much hassle if you know what you are doing.
I've often found when receiving a clueless support response like this, it can be effective to just follow up with a polite request to forward the ticket to an engineer or developer. Usually the front-line csr simply hasn't understood the issue. In this case I would say something like,
"Yes, I managed to work around the issue by switching to my personal email address, but this bug is still preventing me from using my work email domain. If you could please forward the error log I included to a developer, it should help them resolve the issue. Thank you."
I think insiders would tell you off the record not to get an R1, but that the R2s should be much more robust. Of course it's untested at this point, but hopefully that's the case.