A Tourist’s Guide to the LLVM Source Code (regehr.org)
489 points by zdw on Jan 6, 2017 | hide | past | favorite | 44 comments


Awesome overview! The author threw a hilarious quip in at the end too:

Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.


I've heard many times that LLVM targets are really hard to write, but I don't really understand why. By the time code reaches the backend it has already been parsed, optimized, and lowered into a presumably clean intermediate form. What's the issue? Is the framework LLVM provides just crappy?


A while back I got about half of an LLVM backend for the VideoCore IV done. I eventually ground to a halt due to intractable bugs; someone else later picked it up and completed it.

This is from years ago, mind, but the pain points as I recall them were:

- enormously steep learning curve: very little of this layer is documented, and there are a lot of concepts hidden behind layers of abstraction stacked on further layers of abstraction, all of which you need to understand — and the dependencies between them are cyclic. There's nowhere to start; everything you look at depends on your understanding of something else.

- very hard to debug: the compiler's main reaction to things going wrong is to segfault, which means debugging through (and therefore understanding) the compiler internals to fix anything.

- holes in the tools. After the IR node tree is generated, the main piece of machinery in the compiler is a pattern matcher which finds machine code instructions for sets of nodes in the tree. This matcher is automatically generated by TableGen, but there are some patterns it can't express, so alongside the generated matcher you also have to write a manual matcher in C++ for the leftover patterns, which is very verbose and difficult to keep track of. The TableGen matcher is also pretty brittle: if you don't get a pattern just right, it has a tendency to either hang or match the wrong pattern, and understanding why it's doing what it's doing is non-trivial.
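For flavour, a TableGen instruction definition with a selection pattern looks roughly like this (the names here are made up for illustration, not taken from any real backend); the bracketed DAG pattern at the end is what the generated matcher tries to find in the node tree:

```tablegen
// Hypothetical target: select an add-immediate instruction
// for the DAG pattern (add register, small-immediate).
def ADDI : MyTargetInst<(outs GPR:$rd), (ins GPR:$rs, simm16:$imm),
                        "addi $rd, $rs, $imm",
                        [(set GPR:$rd, (add GPR:$rs, simm16:$imm))]>;
```

Patterns the generator can't express end up as hand-written C++ in the target's SelectionDAG instruction selector instead.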

The main problem, though, is that the pool of knowledge of how to write targets is restricted to a very small number of people --- it's all what an ex-boss of mine used to refer to as 'tribal knowledge'. So when you run into a problem, you're reliant on finding one of these people with enough time or incentive to help. Since the problems you run into tend to be complex, helping is hard, so getting good help isn't easy, and so on.

I should add that gcc suffers from exactly the same set of problems, except possibly a little more so. gcc has better target-layer documentation, but the LLVM community is much more helpful; gcc is simpler, but LLVM, I think, has the edge in design. They're both much of a muchness.

It's frustrating, as compiler backends aren't that hard. There should be a compiler somewhere which combines easy porting with adequate code generation[1], for people who don't want to spend the resources needed for an LLVM or gcc port.

[1] I'm actually working on this.


> There should be a compiler somewhere which combines easy porting with adequate code generation[1]

> [1] I'm actually working on this.

Anything to share yet? Will it take LLVM IR as input, or are you targeting a specific language? I'm very interested. So far, only the Plan 9 C compilers seemed to have relatively straightforward code, as they were designed for this.


I have mostly-working code --- but the register allocator I was using turns out to be pretty pants, so it's generating lousy code and is nowhere ready to look at yet. I'm taking time off to learn about graph colouring to try and do a proper job. (Turns out graph colouring on irregular architectures is kinda exciting, and not in a good way.)

It's not actually based on LLVM at all; I'm working with a completely different (and simpler) compiler suite, albeit using some of the same principles. The code quality is never going to be good, but I'm hoping the end result will allow a compiler port to any random relatively sane architecture in only two source files.


> Turns out graph colouring on irregular architectures is kinda exciting, and not in a good way.

Have you looked into PBQP register allocation? It's almost as simple as basic graph coloring but handles irregularities in a nice and disciplined way.
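For anyone wondering what "modelling irregularities" means here: PBQP frames allocation as picking one option (a register or a spill slot) per virtual register so that the sum of per-node costs plus pairwise interference costs is minimal, and irregular constraints become just cost entries. A tiny brute-force sketch of the problem formulation (real PBQP allocators, including LLVM's, use a much cleverer reduction-based solver):

```python
import itertools

INF = float("inf")

def solve_pbqp(node_costs, edge_costs):
    """Brute-force PBQP: choose one option per node minimizing the
    sum of node costs plus pairwise edge costs.
    node_costs: {node: [cost per option]}
    edge_costs: {(a, b): matrix[option_a][option_b]}
    Only sane for toy instances."""
    nodes = list(node_costs)
    best, best_assign = INF, None
    for choice in itertools.product(*(range(len(node_costs[n])) for n in nodes)):
        assign = dict(zip(nodes, choice))
        total = sum(node_costs[n][assign[n]] for n in nodes)
        for (a, b), m in edge_costs.items():
            total += m[assign[a]][assign[b]]
        if total < best:
            best, best_assign = total, assign
    return best, best_assign

# Toy instance: two interfering vregs, options = [r0, r1, spill].
# Spilling costs 10; an irregularity (say v1 can't live in r1) is
# just an INF entry in its cost vector -- no special-case code.
node_costs = {"v0": [0, 0, 10], "v1": [0, INF, 10]}
interfere = [[INF, 0, 0], [0, INF, 0], [0, 0, 0]]  # same reg forbidden
cost, assign = solve_pbqp(node_costs, {("v0", "v1"): interfere})
print(cost, assign)  # -> 0 {'v0': 1, 'v1': 0}  i.e. v0=r1, v1=r0
```

The appeal is exactly what the comment says: each new architectural quirk is a change to the cost tables, not to the solver.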


I helped build the one in libfirm, and I agree. The x86 architecture is quite regular, and in practice we have not used PBQP for other architectures. However, for every irregularity I have come across, I could immediately see how to model it in theory.


Oh, I thought PBQP was the only register allocator in libfirm. What do you use for other architectures?


The default is a graph coloring algorithm. Well, actually a "recoloring" algorithm, due to its SSA-based nature.

There is also an ILP one, which is only useful for comparison.
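As a reference point, the simplest end of this design space is plain greedy colouring of the interference graph. Here's a sketch (hypothetical helper, not libfirm's code): libfirm's SSA-based *re*colouring is considerably smarter because it can revisit earlier choices instead of spilling, but the basic shape is this:

```python
def greedy_color(interference, regs):
    """Greedy colouring of an interference graph.
    interference: {vreg: set of interfering vregs}; regs: register names.
    Returns {vreg: register or 'SPILL'}.  A naive sketch: once a node
    is coloured it is never revisited, so it spills whenever no
    colour happens to be free."""
    assignment = {}
    # Colour highest-degree nodes first (a common heuristic).
    for v in sorted(interference, key=lambda v: -len(interference[v])):
        taken = {assignment[n] for n in interference[v] if n in assignment}
        free = [r for r in regs if r not in taken]
        assignment[v] = free[0] if free else "SPILL"
    return assignment

# Three mutually-live vregs, only two registers: one must spill.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(greedy_color(g, ["r0", "r1"]))  # -> {'a': 'r0', 'b': 'r1', 'c': 'SPILL'}
```

A recolouring allocator would instead try to re-assign already-coloured neighbours before giving up and spilling.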


The Plan 9 C compilers are nice because they're so small. Still, the backend creation process would look roughly similar, at least from where I'm standing: copy a roughly similar one and beat on it for some time. The difference is in the duration, though: you're unlikely to pound on one of those for several years before getting something working.


Who completed it, and is it in a usable state? I have seen the GCC port used for the open firmware project, but for LLVM I only found a QPU backend used by Simon for the Raspberry Pi Quake challenge.


Christina Brooks:

https://github.com/christinaa/LLVM-VideoCore4

It's not entirely finished, but it's good enough to get real work done if you're careful.

Previous HN thread:

https://news.ycombinator.com/item?id=11612449


Ahh thanks, that's what they used before switching over to GCC (which seems to also be based on your first steps ;-)). Great work btw! It's kinda annoying that they haven't even released instruction set documentation yet.


As someone who's aiming to write an lx106 backend, thanks for the writeup and for the warning. I'll now think twice (and brace for a real challenge) before starting it.


Check out QBE and see how you like it vs LLVM:

http://c9x.me/compile/

It's designed to be simpler. One user told me it already compiles their project 4x faster than GCC, with output speed estimated between -O1 and -O2, despite so few lines of code. If its backends are easier to write, that could be a selling point on top of the simplicity.


It hasn't been "already optimized". The target's machine language affects optimization too.

An example of this is branch delay slots, which some RISCs have. With a branch delay slot, the instruction after a branch is unconditionally executed, which lets the processor pipeline the branch efficiently. If you want to make full use of your CPU cycles, you need to tweak your codegen so that some earlier instruction gets moved into the slot after the branch.

Each target arch has its own quirks when it comes to stuff like this, and there's a different way to write efficient code for each. So it's not correct that optimizations have already been done.

(This is not necessarily why TableGen is hard; it's just another factor.)
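To make the delay-slot point concrete, here's a toy filler in Python (hypothetical instruction tuples, nothing like real codegen): it moves an earlier instruction the branch doesn't depend on into the slot after the branch, or pads with a nop if none exists.

```python
def fill_delay_slot(insns):
    """Toy delay-slot filler for a MIPS-like ISA where the instruction
    after a branch always executes.  Instructions are (text, defs, uses)
    tuples and the last instruction is the branch.  We scan backwards
    for an instruction whose results the branch doesn't read and move
    it after the branch; otherwise we emit a nop.  (A real pass must
    also check dependencies with the instructions it moves past.)"""
    *body, branch = insns
    for i in range(len(body) - 1, -1, -1):
        text, defs, uses = body[i]
        if not (defs & branch[2]):  # branch doesn't read what this defines
            return body[:i] + body[i + 1:] + [branch, body[i]]
    return insns[:-1] + [branch, ("nop", set(), set())]

prog = [
    ("add r1, r2, r3", {"r1"}, {"r2", "r3"}),
    ("sub r4, r5, r6", {"r4"}, {"r5", "r6"}),
    ("beq r1, r0, L",  set(),  {"r1", "r0"}),
]
for text, _, _ in fill_delay_slot(prog):
    print(text)
# the sub lands after the beq: it executes "for free" in the delay slot
```

The add can't move into the slot (the branch reads r1, which it defines), so the scheduler has to find the sub instead — exactly the sort of target-specific reasoning a backend has to encode.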


Because emitting efficient target machine code is a non-trivial problem?

... but also because the LLVM framework is very generic. This is both a blessing and a challenge. You can implement a new target by writing much less code (due to those TableGen files), but this code will be trickier since you have to grok several generic metaprogramming mechanisms.


This quip is also quite funny:

> ObjectYAML seems to support encoding object files as YAML. I do not know why this is desirable.
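For what it's worth, ObjectYAML exists mainly for testing: it's the library behind the yaml2obj and obj2yaml tools, which let tests for binary utilities be written as readable, diffable text instead of checked-in binary files. A minimal ELF description looks something like this (illustrative, abbreviated):

```yaml
--- !ELF
FileHeader:
  Class:   ELFCLASS64
  Data:    ELFDATA2LSB
  Type:    ET_REL
  Machine: EM_X86_64
Sections:
  - Name:    .text
    Type:    SHT_PROGBITS
    Flags:   [ SHF_ALLOC, SHF_EXECINSTR ]
    Content: "C3"   # a single 'ret' instruction
```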


The beauty of LLVM is that developers of compilers built on it, like Pony or Julia, don't even have to think about the backends.


Unfortunately that's only superficially true. You have to care about GC interop, unwind and DWARF debug data, implementing part of the ABI in the frontend, and of course the occasional bugs when you exercise code paths that clang does not.

Because of this, enabling a new backend is a substantial amount of work even if the instruction selection/generation is indeed taken care of for you.


Sounds like you have some practical experience with that. What frontend was it? Are these problems frontend/language independent?


Same applies to JVM and CLR.


100+ points, first place on front page, and not a single comment.

Now that's rare.

I assume everyone is excitedly browsing https://github.com/llvm-mirror/llvm? ;) ( the official repo at http://llvm.org/git/llvm is down)


I have a theory that since the introduction of "favorites" on HN, people more often fave something as a "Read Later [Totally Some Day, Really, Only Just... Not Today... But I Will Soon Come Back To It]" bookmark. "Lists of 100 interesting books", "algorithms you must know", "write yourself an OS in 99 days", etc. all seem to behave like that. (The theory is based partially on observing myself ;). But please note it's totally just a loose theory: I don't have data or anything, nor would I judge this as bad or good. Only observing with curiosity.)


Anec-data: I favorite things as a "Read Later [Totally Some Day, Really, Only Just... Not Today... But I Will Soon Come Back To It]" bookmark.

Well, mostly. I do read some percentage of the articles and comments as I go, and I do go back to see updates to the comments... and, every once in a while, to read the fave'd article.

I feel like I ought to have a side project that uses my HN favorites, pinboard links, comments, etc to seed a recommendation engine of 'other stuff you should probably read'.


> I feel like I ought to have a side project that uses my HN favorites, pinboard links, comments, etc to seed a recommendation engine of 'other stuff you should probably read'.

I would totally use it. I considered writing something myself, but I have too many on-going side projects as it is unfortunately.


Guilty as charged! This is so, so welcome to an old-school compiler guy.


As someone who has never written a compiler but has an intense interest in doing so, this is also very welcome!


Now is a great time to write your first compiler. The wealth of tools and docs these days is breathtaking. Give it a go!


I wrote a toy compiler a very long time ago and I'd like to try again. Do you have any links/recommendations for resources that would be a good starting place?


LLVM Kaleidoscope, or if you'd like to start with your own toy CPU, then nand2tetris on Coursera.


I bet you have, but if not, do yourself a favour and grab a copy of "Compilers: Principles, Techniques and Tools" aka "The Dragon Book".

Still got my copy from uni, dog-eared and sticky-noted throughout.


I recall the classic 1986 edition of the Dragon Book spending waaaay too much time on boring details of lexing and parsing. I think it turned many people off the interesting aspects of compilers. It also had an interval-based formulation of data-flow analysis that nobody uses, possibly because it's (a) more complex than worklist iteration, and (b) not applicable to irreducible control flow graphs.

Are newer editions better and more focused on interesting solutions to interesting problems?


Try the Tiger Book instead (a.k.a. 'Modern compiler implementation in C', or there are Java or ML versions if those are more to your taste). It has a better balance: a more to-the-point treatment of lexing and parsing, leaving more space for a comprehensive treatment of optimization and code generation.


I second it. Even if the book has a few issues here and there, it was one of the best books I've read on the subject.


I recommend making sure to avoid reading this book: it spends all its time making mountains out of molehills in the easy parts and leaves out the hard parts. How did they write so many pages about text parsing?

The books by Appel and Louden are good (I forget the exact titles).



This is the kind of doc that should be written about work projects as well. I always love seeing great docs that make the job of understanding a codebase way simpler.


Regehr's blog is an invaluable source of practically relevant SE knowledge.


I've been putting it off, but I am definitely learning enough LLVM IR to code the 57 exercises for programmers.

Anyone interested in banging out the LLVM IR solutions to http://rosettacode.org? It looks like a few exist, but not many: http://rosettacode.org/wiki/99_Bottles_of_Beer/Assembly#LLVM
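If you want a taste before diving in, a small function in LLVM IR is quite readable — this is roughly the first thing the Kaleidoscope tutorial has you emitting:

```llvm
; Add two 32-bit integers. ('nsw' marks signed overflow as
; undefined, which is what C signed arithmetic lowers to.)
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add nsw i32 %a, %b
  ret i32 %sum
}
```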


Only problem is the CamelCase is incredibly verbose.


That's critical for enterprise adoption! More seriously, I guess some "design patterns" really ought to appear in a project this big.


I wouldn't call CamelCase naming a design pattern, but I agree that a large project should stick to consistent coding guidelines - which LLVM does [http://llvm.org/docs/CodingStandards.html]

Any particular style grows on you with time. It's more important to make sure the code is internally consistent.


Sorry, I didn't realize the OP was really criticizing the casing; I thought it was about AbstractFactoryGenerator kinds of names.



