Snowman native code to C/C++ decompiler for x86/x86_64/ARM (github.com/yegord)
83 points by pabs3 on April 22, 2022 | hide | past | favorite | 37 comments


How about this for an idea: a decompiler that uses Machine Learning to name the decompiled variables and functions. Would be nice even if it worked only sometimes.


For an AI-less solution there is IDA's Lumina, which works pretty well. There's also a reverse-engineered server for it [0], so you could build plugins for other disassemblers/decompilers to use with non-official servers.

It basically hashes machine code (with address parts removed) [1]. When reverse engineers label and push symbols to the server (or get them from some debug build), others can pull them and see what the functions are called in completely unrelated projects that use the same libraries / have the same functions.

[0] https://abda.nl/lumen/ [1] https://github.com/naim94a/lumen/issues/2


I'm surprised nobody has mentioned DIRE[0] yet. They did exactly this and got some very impressive results.

[0]: https://arxiv.org/abs/1909.09029 / J. Lacomis et al., "DIRE: A Neural Approach to Decompiled Identifier Naming," 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 628-639, doi: 10.1109/ASE.2019.00064.

[1]: https://github.com/pcyin/dire


This is awesome! Thanks for sharing!


It's certainly possible: compile all the C projects on GitHub with `gcc -O0`. Map statements, blocks, or functions to the asm output. Put everything in a giant SQL database. Repeat for all of gcc's compiler flags.

Wait, did I say it was possible? I'm curious what a neural netted compiler would produce. Probably your average CRUD software.


The thing is, a classifier is going to classify. So it'll create _some_ name… but will it be correct? Useful? Or just completely misleading? I think the misleading bits can be worse than no name at all.

Better would be a high level description of what’s happening. I have a feeling that would be easier to achieve.


Something very similar: a "decompiler" that decompiles minified javascript and guesses variables and functions: http://www.jsnice.org/


It's also something that could be somewhat easy to get a lot of learning material for.


Yes, but the problem is learning over the graph data. I.e., it is not a well-structured problem like an image, which will always have a certain size in pixels. Coming from a compiler background I am genuinely excited about the research on learning how to label nodes and edges in graphs, but I know little about the challenges on the ML side of bringing this kind of technology into reality.


Here is a talk by Dr. Yegor Derevenets about this project https://www.youtube.com/watch?v=f_0EF2BqeQ4


If we wrote some code in Rust, compiled it to, for example, x86_64, and then decompiled it to C, it would be perfectly memory-safe C code, right?


As Rust is LLVM-based, you don't need to go through machine code at all. Just write a backend that translates LLVM IR to C instead of x86_64. The IR is very C-looking, so the translation probably isn't overly complex.

Compiling down to asm, a lot of information is lost regarding memory layout etc., so it's not the best source for generating code.


https://github.com/JuliaComputingOSS/llvm-cbe

edit: you also have mrustc as a Rust to C compiler outright.


Hey, LLVM to C is what Rellic does! https://github.com/lifting-bits/rellic


Yes, but it wouldn't necessarily be readable. And it definitely wouldn't be portable!


Isn't Mozilla doing that? Rust to WASM to C.

Edit: Sorry I misremembered, they seem to compile C/C++ code to WASM then back to C: https://hacks.mozilla.org/2021/12/webassembly-and-back-again...

Although technically the plugins could be written in Rust.


Link?


Some architectural/program assumptions may not be encoded in assembly or preserved in the asm -> C -> asm roundtrip, especially if the assemblers are for different architectures. The obvious example is pointer word size and memory model.


Rust might work reasonably well though, given it's compiled with LLVM. If not, I'm sure whatever asm structures LLVM outputs for Rust all have C equivalents.

I'd be curious to see the output! It'd be funny decompiling the Rust compiler to C, then running it on another platform that way. (Though it would still be, for example, an x86 Rust compiler running on ARM.)


Strictly speaking, all x86 MOVs would have to decompile to some level of atomic loads/stores, because of the ordering guarantees on x86.

That's an example of "lost in translation" as we don't know if the original source required ordering or not. x86 cannot express the weakest memory model supported by C.


I would guess that writing a decompiler that gets C semantics 100% correct is also really hard. Think about inline assembly, memory barriers, etc.

I don't think this would work, unless the C file just contains inline assembly.


The simpler way to obtain C from Rust would be to use mrustc https://github.com/thepowersgang/mrustc

Not that the C will be much more readable than the disassembly, but there's a chance less information will be lost.


It wouldn't look much like what you'd expect C to look like. If it did decompile back to idiomatic C it could introduce some kind of aliasing bug along the way that'd make it not so memory safe.


Nothing is perfectly memory-safe. Also, not sure I would see the point of this translation?


> Nothing is perfectly memory-safe.

How about formally verified SPARK code?


It’s strange that people think if they put the word “formal” in front of something it means it’ll work perfectly.


The guidelines ask: "Please don't post shallow dismissals."

Are you aware of the scope of SPARK's assurances? You can't accidentally dereference a NULL pointer in verified SPARK code, for instance.


Yes, but that's a quality of SPARK, not of the category of "formal methods" by default.

For instance, if you write your own proof and then prove the program meets it, there could still be a logic error in your proof.

Also, "in verified code" is a big loophole enough to leave security issues in - for instance a web browser (probably the thing you'd most like to prove security-issue-free) can still overwrite its own memory through things like OS image and font code, JavaScript JITs generating code to an insecure ABI you don't have a model for, syscalls to kernel code that can write back into your memory, etc.

Believing "formal" makes things magically correct does seem to be a common problem; it also comes up when people say something has a "formal audit" or needs to get one from someone. How do they "formally" audit it? Are they wearing suits?


> For instance, if you write your own proof and then prove the program meets it, there could still be a logic error in your proof.

Sure. Even mathematics research journals sometimes publish erroneous proofs. Practical formal methods generally rely on automated provers.

You're right though that software development using formal methods can still have a non-zero number of defects. AdaCore use the term ultra-low-defect software rather than bug-free software. For an interesting case-study see [0].

Unfortunately, even automated provers can have bugs. To my knowledge, no prover suitable for practical use is itself formally verified. I don't think this is often an issue in practice, though. It remains that formal methods have an excellent real-world track record. The 'problems' with formal methods aren't effectiveness, but effort/price, and perhaps scalability.

> a web browser (probably the thing you'd most like to prove security-issue-free) can still overwrite its own memory through things like OS image and font code, JavaScript JITs generating code to an insecure ABI you don't have a model for, syscalls to kernel code that can write back into your memory, etc.

If you verify only certain parts of a software solution, then sure, you don't get formal assurances about its overall behaviour.

> How do they "formally" audit it? Are they wearing suits?

That's an entirely different use of the word, isn't it?

[0] https://www.adacore.com/tokeneer (see page 59 of the freely available full report for the Analysis section)


The main website seems to be dead, and there are no examples of its output.


A shame. My understanding is this was state of the art, for a free and open source decompiler.

Of course, times change and AFAICT Ghidra has taken up that mantle.


I'm surprised to hear that, I thought retdec was pretty well regarded.


Good point. I don't mean this as a slight against retdec. Simply, retdec was proprietary for quite a while, and it escaped my mind.

retdec hasn't tagged (or built) a release in a couple of years, unfortunately, but there is recent activity on their repo.



A critique for the author: include a picture of example decompiled output so people can get a feel for it and judge the quality. I wanted to check it out, but I was on my phone so I couldn't quite build and run the repo.


A few months ago I was looking for a 16-bit equivalent. Unfortunately I couldn't find a Snowman for Win3.1!


As someone who's used Snowman for CTFs, here's my recommendation for it.



