I doubt the LLM here learned much from gcc. The size-optimization work is focused on learning phase orderings for LLVM passes/the LLVM pipeline, which wouldn't be applicable to gcc at all.
Additionally, they train approximately half on assembly and half on LLVM-IR. They don't say much about how the dataset was generated beyond that it came from the CodeLlama dataset, but my guess is that they compile as much code as they can to LLVM-IR and then lower that to assembly, leaving gcc out of the loop entirely for the vast majority of the compiler-specific training.
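For anyone curious, here's a minimal sketch of what that generation pipeline might look like, assuming standard clang/opt/llc invocations (the paper doesn't give exact commands; the file names and the random choice over -O pipelines are my own guesses):

```python
import random
import subprocess

def make_training_pair(src_path: str) -> tuple[str, str]:
    """Compile a source file to LLVM-IR, randomly optimise it, lower to assembly."""
    ir_path = src_path + ".ll"
    opt_ir_path = src_path + ".opt.ll"
    asm_path = src_path + ".s"

    # Emit unoptimised LLVM-IR straight from clang; gcc never enters the picture.
    subprocess.run(["clang", "-S", "-emit-llvm", "-O0", src_path, "-o", ir_path],
                   check=True)

    # Apply a randomly chosen optimisation pipeline for variety (my guess at
    # what "random optimisations" means; the paper doesn't specify).
    pipeline = random.choice(["default<O1>", "default<O2>", "default<O3>",
                              "default<Os>", "default<Oz>"])
    subprocess.run(["opt", f"-passes={pipeline}", "-S", ir_path, "-o", opt_ir_path],
                   check=True)

    # Lower the optimised IR to assembly with llc.
    subprocess.run(["llc", opt_ir_path, "-o", asm_path], check=True)

    return opt_ir_path, asm_path
```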
Yep! No GCC on this one.
And yep, that's not far off how the pretraining data was gathered - but with random optimisations to give it a bit of variety.
We plan to have a peer-reviewed version of the paper, which will probably have more details on that. Otherwise we can't give any more details than are in the paper or post, etc., without going through legal, which takes ages.
Science is getting harder to do :-(
This would be difficult to deploy as-is in production.
There are correctness issues mentioned in the paper around moving phase orderings away from the well-trodden O0/O1/O2/O3/Os/Oz path. Their methodology works quite well for a research project, but I personally wouldn't trust it in production. While some obvious issues can be caught by a small test suite and unit tests, others won't be, and that's really risky in production scenarios.
There are also practical software-engineering concerns, like deploying the model inside the compiler. There is actually tooling in upstream LLVM to do this (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running models on a GPU would be difficult, and I would expect CPU inference to massively blow up compile times.
I'm reasonably certain the authors are aware of alive2.
The problem with using alive2 to verify LLM-based compilation is that alive2 isn't really designed for that. It's an amazing tool for catching correctness issues in LLVM, but it's expensive to run and will time out reasonably often, especially on cases involving floating point. It's explicitly designed to minimize the rate of false positives so it can serve its primary purpose: alerting compiler developers to correctness issues that need to be fixed.
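For concreteness, translation validation with alive2 usually means handing alive-tv the IR from before and after the transformation; something like the sketch below (the file names and the crude summary-string check are mine, and you still have to budget for timeouts):

```python
import subprocess

def validate_with_alive_tv(before_ll: str, after_ll: str) -> str:
    """Ask alive-tv whether the optimised IR refines the original IR."""
    try:
        result = subprocess.run(["alive-tv", before_ll, after_ll],
                                capture_output=True, text=True, timeout=300)
    except subprocess.TimeoutExpired:
        # Common on big functions and anything floating-point heavy.
        return "timeout"
    # Crude check against alive-tv's textual summary; a real harness would
    # parse the per-function results properly.
    if "0 incorrect transformations" in result.stdout:
        return "ok"
    return "possible miscompile (or the solver gave up)"
```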
PGO can be used in such situations, but the profile needs to be checked in. Same code + same profile -> same binary (assuming the compiler is deterministic, which is tested quite extensively).
There are several big projects that use PGO (like Chrome), and you can get a deterministic build at whatever revision using PGO as the profiles are checked in to the repository.
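For anyone who hasn't done it, the instrumented PGO loop looks roughly like this (the clang/llvm-profdata flags are the standard ones; file names and the benchmark command are placeholders). The merged .profdata in step 3 is the artifact that gets checked in:

```python
import glob
import subprocess

def build_with_instrumented_pgo():
    # 1. Build with IR-level instrumentation.
    subprocess.run(["clang", "-O2", "-fprofile-generate=profiles/",
                    "app.c", "-o", "app-instrumented"], check=True)

    # 2. Run a representative workload to produce .profraw files.
    subprocess.run(["./app-instrumented", "--benchmark"], check=True)

    # 3. Merge the raw profiles; this .profdata is what you check in.
    raw = glob.glob("profiles/*.profraw")
    subprocess.run(["llvm-profdata", "merge", "-o", "app.profdata", *raw],
                   check=True)

    # 4. Rebuild with the checked-in profile: same code + same profile
    #    -> same binary (given a deterministic compiler).
    subprocess.run(["clang", "-O2", "-fprofile-use=app.profdata",
                    "app.c", "-o", "app"], check=True)
```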
It's not called AutoFDO. AutoFDO refers to a specific sampling-based profile technique out of Google (https://dl.acm.org/doi/abs/10.1145/2854038.2854044). Sometimes people will refer to that as PGO though (with PGO and FDO being somewhat synonymous, but PGO seeming to be the preferred term in the open source LLVM world). Chrome specifically uses instrumented PGO which is very much not AutoFDO.
I wasn’t trying to conflate the two. PGO traditionally meant an instrumented/trace build, but as a term it’s pretty generic; to me it refers to the general concept of “you have profile information that replaces the generically tuned heuristics the compiler uses.” AutoFDO I’d classify as an extension of that concept into a more general PGO technique, kind of like ThinLTO vs LTO. Specifically, it generates the “same” information to supplant compiler heuristics, but it’s more flexible in that the samples can be fed back into “arbitrary” versions of the code using normal sampling techniques instead of an instrumented trace. The reason sampling is better is that it fits much more easily into capturing data from production, which is much harder to accomplish with the tracing variant (due to perf overhead). Additionally, because it works across versions, the amortized compile cost drops from 2x to 1x, since you only need to reseed your profile data periodically.
I was under the impression they had switched to AutoFDO across the board, but maybe that’s just for their cloud stuff, and Chrome continues to run a representative workload since that path is more mature. I would guess that if it’s not being used already, they’re exploring how to make Chrome run AutoFDO for the same reason everyone started using ThinLTO: it brought most of the advantages while fixing the disadvantages that hampered adoption.
And yes, while PGO is available natively, AutoFDO isn’t quite as smooth.
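To make the contrast concrete, the sampling-based flow looks roughly like this; -fprofile-sample-use and -gline-tables-only are real clang flags, but treat the perf/create_llvm_prof details as a sketch from memory:

```python
import subprocess

def build_with_sampled_profile():
    # 1. Sample an ordinary optimised binary under a real workload with
    #    LBR-based profiling (no instrumented build required).
    subprocess.run(["perf", "record", "-b", "-o", "perf.data",
                    "--", "./app", "--real-workload"], check=True)

    # 2. Convert the samples into a profile the compiler understands,
    #    using the AutoFDO tooling.
    subprocess.run(["create_llvm_prof", "--binary=./app",
                    "--profile=perf.data", "--out=app.afdo"], check=True)

    # 3. Rebuild with the sampled profile. The profile tolerates some
    #    source drift, which is why it can be reseeded only periodically.
    subprocess.run(["clang", "-O2", "-gline-tables-only",
                    "-fprofile-sample-use=app.afdo",
                    "app.c", "-o", "app"], check=True)
```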
I'm not sure where you're getting your information from.
Chrome (and many other performance-critical workloads) is using instrumented PGO because it gives better performance gains, not because it's a more mature path. AutoFDO is only used in situations where collecting data with an instrumented build is difficult.
Last I looked, AutoFDO builds were similar in performance to PGO, much as ThinLTO is to LTO. I’d say that collecting data with an instrumented Chrome build is extremely difficult: you’re relying on your synthetic benchmark environment, which is very different from the real world (e.g. extensions aren’t installed, the pattern of websites being browsed isn’t realistic, etc.). There’s also a 2x compile cost, because you have to build Chrome twice in the exact same way, plus you have to run a synthetic benchmark on the instrumented build to generate the profile.
I’m just making an educated guess that at some point in the future Chrome will switch to AutoFDO, potentially using samples harvested from end-user computers (perhaps just from their employees, to avoid privacy complaints).
You can make the synthetic benchmarks relatively accurate, it just takes effort. The compile-time hit and additional effort is often worth it for the extra couple percent for important applications.
Performance is also pretty different at the scales performance engineers care about for these sorts of production codebases; the ThinLTO comparison is different, since ThinLTO gets close to LTO's performance without the build-system scalability problems that LTO has. The original AutoFDO paper shows an improvement of 10.5% -> 12.5% going from AutoFDO to instrumented PGO. That is a pretty big gap, and it's probably even bigger with newer instrumentation-based techniques like CSPGO.
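For reference, CSPGO layers a second, context-sensitive instrumentation round on top of ordinary IR PGO so the counts reflect post-inlining contexts; roughly this (clang's documented CSPGO flags, with placeholder file names and workload):

```python
import glob
import subprocess

def build_with_cspgo():
    # Round 1: ordinary IR PGO.
    subprocess.run(["clang", "-O2", "-fprofile-generate=prof1/",
                    "app.c", "-o", "app1"], check=True)
    subprocess.run(["./app1", "--benchmark"], check=True)
    subprocess.run(["llvm-profdata", "merge", "-o", "round1.profdata",
                    *glob.glob("prof1/*.profraw")], check=True)

    # Round 2: apply the first profile and instrument again, this time
    # after inlining, so the counters are context sensitive.
    subprocess.run(["clang", "-O2", "-fprofile-use=round1.profdata",
                    "-fcs-profile-generate=prof2/",
                    "app.c", "-o", "app2"], check=True)
    subprocess.run(["./app2", "--benchmark"], check=True)
    subprocess.run(["llvm-profdata", "merge", "-o", "final.profdata",
                    "round1.profdata", *glob.glob("prof2/*.profraw")], check=True)

    # Final build with the combined profile.
    subprocess.run(["clang", "-O2", "-fprofile-use=final.profdata",
                    "app.c", "-o", "app"], check=True)
```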
The AutoFDO paper also mentions the exact reasons sampling won't perform as well: issues with debug info and profile accuracy lost to sampling error.
I couldn't find any numbers for Chrome, but I am reasonably certain that they have tried both and continue to use instrumented PGO for the extra couple percent. There are other pieces of the Chrome ecosystem (specifically the ChromeOS kernel) that are already optimized using sampling-based profiling. It's been a while since I last talked to the Chromium toolchain people about this though. I also remember hearing them benchmark FEPGO vs IRPGO at some point and concluding that IRPGO was better.
In practice though, correctness is difficult even when you're only reordering hand-written passes. In the paper they describe a methodology for evaluating phase orderings against a small test set as a smoke test for correctness (PassListEval), and they observe that ~10% of the phase orderings result in assertion failures, compiler crashes, or correctness issues.
You will end up with a lot more correctness issues adjusting phase orderings like this than you would using one of the more battle-tested default optimization pipelines.
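Something in the spirit of that check, as a sketch (this is not their PassListEval; the pass list, test files, and pass/fail criterion are placeholders, and it only catches crashes and assertion failures, not silent miscompiles):

```python
import subprocess

# A model-suggested ordering; real new-pass-manager pass names, arbitrary order.
CANDIDATE_PASSES = "sroa,instcombine,simplifycfg,gvn,dce"

def smoke_test(test_ir_files: list[str]) -> list[str]:
    """Run a candidate phase ordering over a small IR test suite."""
    failures = []
    for path in test_ir_files:
        result = subprocess.run(
            ["opt", f"-passes={CANDIDATE_PASSES}", path,
             "-o", path + ".candidate.bc"])
        # A non-zero exit usually means an assertion failure or crash in a pass.
        if result.returncode != 0:
            failures.append(path)
        # Catching actual miscompiles still requires building and running the
        # suite's own tests, and even that misses plenty.
    return failures
```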
Correctness in a production compiler is a pretty hard problem.
LLVM doesn’t really spend any compile time solving the phase-ordering problem, since the pass pipelines are static. There have been proposals to dynamically adjust the pipeline based on various factors, but those are a ways out from happening.
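You can see how static the pipelines are by dumping them; newer opt builds can print the fully expanded ordering (flag from memory, so treat this as a sketch):

```python
import subprocess
import tempfile

def show_default_pipeline(level: str = "O2") -> str:
    """Print the fixed pass ordering baked into the default pipeline."""
    # An empty module is enough; we only want the expanded pipeline string.
    with tempfile.NamedTemporaryFile(suffix=".ll", mode="w") as empty:
        empty.flush()
        result = subprocess.run(
            ["opt", f"-passes=default<{level}>", "-print-pipeline-passes",
             "-disable-output", empty.name],
            capture_output=True, text=True, check=True)
    # The same ordering comes out regardless of the input module; capture
    # both streams since I don't remember which one the string lands on.
    return result.stdout + result.stderr
```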