
> Even then, there are situations where a portable program in a high level language has hand-written, per-processor assembler code (OpenSSL comes to mind). Why do you think that is?

It makes sense at times, in specific contexts. Which is a far cry from your "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine).

> nopw instructions, the idea being to give the processor time to prime the data cache.

I'm still only 95% sure what you mean by "priming the data cache", but if you mean prefetching, (a) nop doesn't prefetch; and (b) there are prefetch instructions for prefetching. And in any case prefetching wouldn't be cheating.

> If you do more comparisons over decades like I have

Compilers have gotten much better over the last few decades. Maybe the things you "know" are "always" true were true at some point in time and aren't true anymore.

> the far worse danger is that -ffast-math won't generate IEEE-754 compliant results

Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results. If you're OK with changing the semantics, you should use -ffast-math and let the compiler vectorize.

> There are two ways to do it fast: eor (or what the idiotic intel architecture calls "xor"), or the sub instruction

For whatever it's worth, I think sub would be incorrect if the original bit pattern in the register were a NaN.

> two instructions wasted with what could amount to a single subq. This isn't more efficient.

And yet, somehow, this version of the code runs faster than your version.

> testl is a total waste of processor cycles here. Which purpose does it serve? Then more cycles are wasted on jumping to an instruction to clear %xmm0

That's the code for testing whether the for loop should ever be entered. If n is less than or equal to zero (that's what's being tested by testl/jne), the function returns zero. Which is why xmm0 (the return register) needs to be zeroed in that case.

> why xorl %eax in a loop when it could be done once?

It's not in a loop. It's the "i = 0" setup code before the loop.

You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't.

Two last points before I drop this thread:

> you will see how silly it was trying to argue that compilers generate faster code than coders

I didn't argue that. I argued that your assertion "compile anything with gcc -S -O3 and behold the extra code which a human would never write" (emphasis mine, again) was incorrect. That doesn't mean that I think that gcc will always, or even sometimes, beat human coders. But it can match them very very often.

> you chose a dot product because you likely knew that the compiler would generate pretty fast code [...] More waste of processor cycles. [...] Idiotic in the extreme

You're contradicting yourself. And you are calling the many GCC developers and Intel/AMD microarchitecture experts idiots, people who have very likely pored over every single instruction of this very code and decided that this is the way it should be written for maximum performance.

I hope you have a wonderful day.



"(a) nop doesn't prefetch;"

I wrote nopsw and you are writing about nop. Are you doing this on purpose? nop doesn't prefetch, but on the intel family of processors, nopsw has the side effect of prefetching.

"Yes, that's what I said. You took my original function and wrote a version that didn't generate IEEE-754 compliant results."

Turns out, so did the GCC compiler, at least the one I have, so I'd say your point is moot.

Truth of the matter is, you picked a really bad example: to solve it correctly, one would have to implement at least a portion of the algorithms in the GNU multiple precision library ("GMP"). I suspect your picking a floating-point example was no accident.

"I think sub would be incorrect if the original bit pattern in the register were a NaN."

Even NaN has to be represented by a bit pattern, and an integer subtraction of the register from itself will yield zero regardless of that pattern.

"That's the code for testing whether the for loop should ever be entered."

And here we come back to my point: if you were coding this from scratch in assembler, you wouldn't write a generic function, and you'd know that n will never be zero. And the reason why you'd never write a generic function is because they lose you speed and increase code size. But a compiler cannot know that and cannot optimize for such a situation. It's just a dumb program.

"You have made it very clear that you don't feel qualified to write highly optimized x86-64 code. Neither are you qualified to judge the quality of x86-64 code if you can't tell what is inside a loop and what isn't."

I spent 30 seconds looking at assembler code for a processor family I have never coded on. I spent less than 15 minutes writing a piece of optimized assembler code for that family, using GNU as, an assembler I had never written code in. Now you judge me on misinterpreting one clumsily generated compiler instruction. By that logic, am I not qualified, considering I was able to do all of this in under 15 minutes? I'm very pleased with myself: for the time budget, an unknown processor, and an unknown assembler, I think I did very well. We will have to disagree, vehemently if you please.

I stand by my assertion that a compiler will never be able to beat a human at generating fast, optimized code, nor will it ever be capable of generating smaller code. In addition, I don't hold the GCC developers in high regard, considering how notoriously bad their compilers are when compared to, say, the intel or Sun Studio ones. Even the Microsoft compilers beat GCC at generating code which runs faster. In fact, pretty much every compiler beats GCC in performance, which means that the people working on the GCC compilers aren't good enough. GCC's only undisputed strength is its vast support for different processors. There, they are #1, but everywhere else they're last. The GCC developers just don't have what it takes to be the best in that business.

"And yet, somehow, this version of the code runs faster than your version."

I don't know that; you ran code which I wrote blindly; I was not even able to reproduce your output with my GCC. That it runs faster is just your assertion. Based on my experience, I have no reason to believe that.

"I hope you have a wonderful day."

As a matter of fact, I am about to go create an SVR4 OS package of GCC 9.2.0, which I patched and managed to bootstrap after a week's worth of work on Solaris 10 on sparc, so yes, I will have a wonderful day enjoying the fruits of my labors. I wish you a wonderful day as well.



