boomanaiden154's comments | Hacker News

Quite a few patches have landed. A couple features using this have already shipped in Apple’s downstream clang.


I don't think it's likely that the LLM here learned from gcc. The size-optimization work here is focused on learning phase orderings for LLVM passes/the LLVM pipeline, which wouldn't be applicable to gcc at all.

Additionally, they train approximately half on assembly and half on LLVM-IR. They don't say much about how the dataset was generated other than that it came from the CodeLlama dataset, but I would guess they compile as much code as they can into LLVM-IR and then just lower that into assembly, leaving gcc out of the loop completely for the vast majority of the compiler-specific training.
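
Something roughly like this is what I have in mind, purely as a sketch (foo.c is a stand-in; their actual data-generation tooling isn't described):

    # Emit IR once, then derive both the optimized IR and the assembly from it,
    # so gcc never enters the picture.
    clang -S -emit-llvm -O0 -Xclang -disable-O0-optnone foo.c -o foo.ll
    opt -passes='default<Oz>' foo.ll -S -o foo.opt.ll   # or a randomized pass list
    llc foo.opt.ll -o foo.s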


Yep! No GCC on this one. And yep, that's not far off how the pretraining data was gathered - but with random optimisations to give it a bit of variety.


Do you have more information on how the dataset was constructed?

It seems like build systems were somehow invoked, given the different targets present in the final version?

Was it mostly C/C++ (if so, how did you resolve missing includes/build flags), or something else?


We plan to have a peer-reviewed version of the paper where we will probably have more details on that. Otherwise we can't give any more details than are in the paper or post, etc. without going through legal, which takes ages. Science is getting harder to do :-(


This would be difficult to deploy as-is in production.

There are correctness issues mentioned in the paper when adjusting phase orderings away from the well-trodden O0/O1/O2/O3/Os/Oz paths. Their methodology works quite well for a research project, but I personally wouldn't trust it in production. Some obvious issues can be caught by a small test suite and unit tests, but others won't be, and that's really risky in production scenarios.

There are also some practical software-engineering questions, like how to deploy the model inside the compiler. There is actually tooling in upstream LLVM to do this (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running models on a GPU would be difficult and I would expect CPU inference to massively blow up compile times.
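
For reference, the existing upstream hook is the MLGO infrastructure; enabling the size-focused inlining model looks roughly like this, assuming your clang was built with the model embedded (otherwise the flag won't do anything useful):

    # 'release' mode uses the ahead-of-time-compiled model baked into LLVM at
    # build time; no GPU involved, but it only covers specific decisions like
    # inlining rather than whole phase orderings.
    clang -O2 -c foo.c -o foo.o -mllvm -enable-ml-inliner=release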


Sure, performance is more interesting, but it's significantly harder.

With code size, you just need to run the code through the compiler and you have a deterministic measurement for evaluation.

Performance has no such metric. Benchmarks are expensive and noisy. Cost models seem like a promising direction, but they aren't really there yet.
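
For example, something as simple as this gives you a stable, repeatable reward signal (llvm-size here; GNU size works the same way, foo.c is a stand-in):

    clang -Oz -c foo.c -o foo.o
    llvm-size foo.o    # .text/.data/.bss byte counts, identical on every run

Getting an equally trustworthy number for performance means repeated benchmark runs on quiet machines, or a cost model you hope is accurate.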


I'm reasonably certain the authors are aware of Alive2.

The problem with using Alive2 to verify LLM-based compilation is that Alive2 isn't really designed for that. It's an amazing tool for catching correctness issues in LLVM, but it's expensive to run and will time out reasonably often, especially on cases involving floating point. It's explicitly designed to minimize false positives, since its primary purpose is alerting compiler developers to real correctness issues that need to be fixed.
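
For context, translation validation with it looks roughly like this (alive-tv ships with Alive2; the timeout option is the knob that ends up mattering, exact flag name from memory):

    # Check that the function in tgt.ll refines the same-named function in src.ll.
    alive-tv src.ll tgt.ll --smt-to=60000   # SMT timeout in milliseconds

Every query is a round trip through an SMT solver, which is why doing this for every function of every compilation is a non-starter.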


Yep, we tried it :-) These were exactly the problems we had with it.


PGO can be used in such situations, but the profile needs to be checked in. Same code + same profile -> same binary (assuming the compiler is deterministic, which is tested quite extensively).

There are several big projects that use PGO (like Chrome), and you can get a deterministic build at any revision because the profiles are checked into the repository.
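
The instrumented PGO flow with clang looks roughly like this (toy single-file example; default.profdata is the artifact that gets checked in):

    # 1. Build instrumented and run a representative workload.
    clang -O2 -fprofile-generate foo.c -o foo_instr
    ./foo_instr < representative_input.txt
    # 2. Merge the raw profiles into the file you commit to the repo.
    llvm-profdata merge -o default.profdata default_*.profraw
    # 3. Optimized build: same sources + same default.profdata -> same binary.
    clang -O2 -fprofile-use=default.profdata foo.c -o foo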


It’s called AutoFDO, although I’ve struggled to get it working well in Rust.


It's not called AutoFDO. AutoFDO refers to a specific sampling-based profiling technique out of Google (https://dl.acm.org/doi/abs/10.1145/2854038.2854044). Sometimes people will refer to that as PGO, though (PGO and FDO are somewhat synonymous, but PGO seems to be the preferred term in the open-source LLVM world). Chrome specifically uses instrumented PGO, which is very much not AutoFDO.

PGO works just fine in Rust and has support built into the compiler (https://doc.rust-lang.org/rustc/profile-guided-optimization....).
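
Roughly, following the rustc docs (binary name, workload, and paths are placeholders):

    # 1. Instrumented build plus a representative run.
    RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
    ./target/release/myapp --some-representative-workload
    # 2. Merge the raw profiles (llvm-profdata comes with the llvm-tools rustup component).
    llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
    # 3. Rebuild against the merged profile.
    RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release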


I wasn’t trying to conflate the two. PGO traditionally meant an instrumented (trace) build, but as a term it’s pretty generic, at least to me: the general concept of “you have profile information that replaces the generically tuned heuristics the compiler uses.” AutoFDO I’d classify as an extension of that concept into a more general PGO technique, kind of like ThinLTO vs LTO.

Specifically, it generates the “same” information to supplant compiler heuristics, but is more flexible in that the samples can be fed back into “arbitrary” versions of the code using normal sampling techniques instead of an instrumented trace. The reason sampling is better is that it fits more easily into capturing data from production, which is much harder to accomplish for the tracing variant (due to perf overheads). Additionally, because it works across versions, the amortized compile cost drops from 2x to 1x, since you only need to reseed your profile data periodically.

I was under the impression they had switched to AutoFDO across the board but maybe that’s just for their cloud stuff and Chrome continues to run a representative workload since that path is more mature. I would guess that if it’s not being used already, they’re exploring how to make Chrome run AutoFDO for the same reason everyone started using ThinLTO - it brought most of the advantages while fixing the disadvantages that hampered adoption.

And yes, while PGO is available natively, AutoFDO isn’t quite as smooth.
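
For comparison, the AutoFDO flow with clang is roughly this (create_llvm_prof is from the google/autofdo repo; the perf event is CPU-specific, this one is for Intel LBR; myapp/foo.c are placeholders):

    # 1. Sample a production-like run with branch records.
    perf record -b -e br_inst_retired.near_taken -- ./myapp
    # 2. Convert perf.data into an LLVM sample profile.
    create_llvm_prof --binary=./myapp --out=myapp.afdo
    # 3. Rebuild with the sample profile.
    clang -O2 -fprofile-sample-use=myapp.afdo foo.c -o myapp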


I'm not sure where you're getting your information from.

Chrome (and many other performance-critical workloads) is using instrumented PGO because it gives better performance gains, not because it's a more mature path. AutoFDO is only used in situations where collecting data with an instrumented build is difficult.


Last I looked, AutoFDO builds were similar in performance to PGO builds, much as ThinLTO is to LTO. I’d say that collecting data with an instrumented Chrome build is extremely difficult: you’re relying on your synthetic benchmark environment, which is very, very different from the real world (e.g. extensions aren’t installed, the pattern of websites being browsed is not realistic, etc.). There’s also a 2x compile cost, because you have to build Chrome twice in the exact same way, plus you have to run a synthetic benchmark on each build to generate the trace.

I’m just making an educated guess that at some point in the future Chrome will switch to AutoFDO, potentially using traces harvested from end-user computers (perhaps even just from their employees, to avoid privacy complaints).


You can make the synthetic benchmarks relatively accurate; it just takes effort. The compile-time hit and additional effort are often worth it for the extra couple of percent on important applications.

The performance difference is also significant at the scales performance engineers care about for these sorts of production codebases, and instrumented PGO comes without the build-system scalability problems that LTO has. The original AutoFDO paper shows an improvement of 10.5% -> 12.5% going from AutoFDO to instrumented PGO. That is pretty big, and it's probably even bigger with newer instrumentation-based techniques like CSPGO.

They also mention the exact reasons AutoFDO will not perform as well: issues with debug info and profile accuracy lost to sampling error.

I couldn't find any numbers for Chrome, but I am reasonably certain that they have tried both and continue to use instrumented PGO for the extra couple percent. There are other pieces of the Chrome ecosystem (specifically the ChromeOS kernel) that are already optimized using sampling-based profiling. It's been a while since I last talked to the Chromium toolchain people about this though. I also remember hearing them benchmark FEPGO vs IRPGO at some point and concluding that IRPGO was better.


I would not say we are anywhere close to perfect in compilation.

Even just looking at inlining for size, there are multiple recent studies showing ~10+% improvement (https://dl.acm.org/doi/abs/10.1145/3503222.3507744, https://arxiv.org/abs/2101.04808).

There is a massive amount of headroom, and even tiny wins still matter, as ~0.5% gains in code size, and especially in performance, can be huge.


Right, it's only solving phase ordering.

In practice though, correctness is difficult even when only reordering hand-written passes. In the paper they describe a methodology for evaluating phase orderings against a small test set as a smoke test for correctness (PassListEval; rough sketch of the idea below) and observe that ~10% of the phase orderings result in assertion failures, compiler crashes, or correctness issues.

You will end up with a lot more correctness issues adjusting phase orderings like this than you would using one of the more battle-tested default optimization pipelines.

Correctness in a production compiler is a pretty hard problem.
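
As a rough sketch of the flavour of check involved (my sketch, not the paper's actual PassListEval harness; corpus/ and the pass list are placeholders):

    # Run a candidate pass ordering over a small IR corpus and flag anything
    # that crashes, asserts, or fails the IR verifier. This does not catch
    # miscompiles that still produce verifier-clean IR.
    PASSES='sroa,instcombine,simplifycfg,gvn'   # candidate ordering under test
    for f in corpus/*.ll; do
        opt -passes="$PASSES" -verify-each "$f" -S -o /dev/null || echo "FAIL: $f"
    done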


There are two models.

- The foundation model is pretrained on asm and IR, then trained to emulate the compiler (IR + passes -> IR or asm).

- The FTD model is fine-tuned for solving phase ordering and disassembling.

FTD is there to demo capabilities. We hope people will fine-tune for other optimisations. It will be much, much cheaper than starting from scratch.

Yep, correctness in compilers is a pain. Auto-tuning is a very easy way to break a compiler.


LLVM doesn’t really spend any runtime solving the phase-ordering problem, since the pass pipelines are static. There have been proposals to dynamically adjust the pipeline based on various factors, but those are a ways out from happening.
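
Concretely, the O-level pipelines are just fixed pass strings; you can dump one and hand opt an arbitrary ordering instead (newer pass-manager syntax, flag names from memory):

    # Print the fixed -O2 pipeline as a -passes string.
    opt -passes='default<O2>' -print-pipeline-passes foo.ll -o /dev/null
    # Run a hand-picked (or model-picked) ordering instead.
    opt -passes='sroa,early-cse,instcombine,simplifycfg' foo.ll -S -o foo.opt.ll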


Pretty much this. It's called Alive2.

https://dl.acm.org/doi/abs/10.1145/3453483.3454030


Thanks!

