Hacker News | hubicka's comments

It is Google's blogger.com. I am not a very advanced user of it and generally just use whatever it provides for me.

The link is wrong. This one renders better in general: http://hubicka.blogspot.com/2018/12/even-more-fun-with-build...

PS: I found it too ironic to mention in the blog, but I had to uninstall mobile Firefox to make room for the two-factor authentication application needed to submit a patch :)). For some reason Google Authenticator requires over 30MB just to take a picture of a barcode.


Well, better published than nothing. I found the writeup very interesting and informative. Looking forward to seeing where this all leads.


GCC was designed to be extensible and portable (after all it ended up ported to more architectures than any other compiler). The political limitation was more subtle.

Originally the FSF did not want to make it easy for GCC to store its intermediate language and read it back, because that would let others build proprietary front ends and back ends, which is against RMS' vision. This has changed, and LTO does precisely that (and indeed, in addition to the technical difficulties, this political issue delayed its arrival in GCC). So this political block on extensibility is long gone (and I am happy about that).

A lot changed in compilation between the mid-1980s, when GCC was started, and the early 2000s; LLVM's design reflects its time. Almost 20 years have passed since then, and both projects need to evolve and develop strategies for doing so.


Regardless of the LTO relaxation, I believe it is still the position of the FSF and GCC that any middleware that links into GCC itself is GPL-infected. Obviously, this is not the case for LLVM/Clang.


I have re-tested on my Skylake notebook and updated the blog. It confirms the results from the darn old CPU I use as my benchmark machine; maybe it is a bit more sensitive to the difference, which is expected for a non-server CPU.

GCC does "almost full LTO" with partitioning, while Clang does ThinLTO, which makes most code size/speed tradeoffs without considering whole-program context, so it may be interesting to get the two alternatives closer in code size/performance metrics.

I have got a level 1 Firefox developer account and I am looking into the official benchmarking infrastructure, which I have now updated to GCC 8 with LTO+PGO.


I would be happy to help with solving GCC-related issues and to look into performance regressions relative to Clang (I am still in the process of looking into -O2 performance and plan to set up Talos next).


You need to explicitly ask for it via an attribute; no automatic multiversioning is done (yet), and it would be more for -Ofast than usual -O2 builds, I guess.


You can try the binary on your CPU.

This particular workload does not differ much between modern CPUs. I just tried the SunSpider benchmark on my Skylake and it has similar outcomes as reported, but there is more noise since it is a notebook.

What I got is:

- GCC 8 build: 333 ± 3.3%
- Tumbleweed distro Firefox: 352 ± 3.4%
- Firefox 63 (GCC) official binary: 346 ± 5.6%
- Firefox 64 (LLVM) official binary: 342 ± 5.1%

but I do not completely trust the numbers, as re-running the benchmark leads to a different outcome each time.


I would be interested to know what cache-aware code layout optimizations are available in LLVM; I personally know of none. GCC is a bit simplistic in this sense (it does reorder functions based on profile feedback and execution time) and I plan to change that for the next stage 1 (i.e. GCC 10).


Hey Jan, Long time ;)

LLVM will do the same kind of reordering.

(Both are, interestingly, well behind what commercial compilers do, and this is one of the very few areas where that is true. My suspicion is that it does not matter as much in practice as we want it to. Most forms of layout optimization are also very hard to perform on the C++ code you want to optimize, due to the inability to prove safety.)


Hehe, nice to see you :)

Yep, I have had a code layout pass in my tree for a while, but because I was never really able to measure above-noise improvements, it is not in the tree yet. I hope to make more sense of it with the help of CPU counters, which have improved over time.


I'm not familiar enough with LLVM to really say, so I was just speculating: I vaguely remembered some kind of talk about cache optimisation and LLVM, so it's possible it was talking about the LLVM codebase rather than the passes available in LLVM.


There are several independent things:

- first, how you set the -O2 defaults in your compiler. This is a delicate problem, since you need to find the right balance of code size, compile time, robustness of generated code (do not trigger undefined behavior in super evil ways) and, of course, runtime. In benchmarks I have found that Clang has a bit of an edge in runtime, which is mostly vectorization (on x86-64)

- selection of the minimal ISA you support. For GCC, x86-64 is still the original Opteron, but distributions can easily (and some do) decide on better. Indeed AVX is a big win, but for a general-purpose distribution it is still too aggressive. You can provide AVX-optimized libraries where it matters

- selection of CPU tuning (i.e. generic/intel)

So I consider it a mistake that GCC traded away vectorization for compile-time speed and reliability at -O2, because it can make an important difference in common workloads these days (though not 10 years ago, say).

It is also clearly a bug for GCC to produce AVX instructions when not explicitly asked to :)

I also do testing on Zen, Core and some PowerPC. For the Firefox machine I use a Bulldozer box, because I don't mind that it spends long nights running builds & benchmarks, and I think this particular problem is not very CPU-specific.

