How about this for an idea: a decompiler that uses Machine Learning to name the decompiled variables and functions. Would be nice even if it worked only sometimes.
For an AI-less solution there is IDA's Lumina, which works pretty well. There's also a reverse-engineered server for it [0], so you could build plugins for other disassemblers/decompilers to use with non-official servers.
It basically hashes machine code (with the address parts removed) [1]. When reverse engineers label functions and push the symbols to the server (or extract them from some debug build), others can pull them and see what the functions are called in completely unrelated projects that use the same libraries or contain the same functions.
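Roughly, the trick looks like this (a minimal sketch of the general idea, not Lumina's actual hashing scheme; the span list and the FNV hash are just for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A byte span inside the function body that holds a relocated address
 * (e.g. a call target or a RIP-relative displacement). */
struct reloc_span { size_t offset; size_t len; };

/* Toy fingerprint: zero the address bytes, then FNV-1a hash the rest,
 * so the same code linked at different addresses hashes the same.
 * (Illustrative only; not Lumina's actual scheme.) */
uint64_t function_fingerprint(const uint8_t *code, size_t code_len,
                              const struct reloc_span *relocs, size_t n_relocs)
{
    uint8_t buf[4096];
    if (code_len > sizeof buf)
        code_len = sizeof buf;              /* keep the sketch simple */
    memcpy(buf, code, code_len);

    for (size_t i = 0; i < n_relocs; i++)
        for (size_t j = 0; j < relocs[i].len; j++)
            if (relocs[i].offset + j < code_len)
                buf[relocs[i].offset + j] = 0;

    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */
    for (size_t i = 0; i < code_len; i++) {
        h ^= buf[i];
        h *= 0x100000001b3ULL;              /* FNV-1a prime */
    }
    return h;
}
```

Two builds of the same library then produce the same fingerprint even though every call target and data reference differs, which is what lets labels travel between unrelated binaries.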
I'm surprised nobody has mentioned DIRE[0] yet. They did exactly this and got some very impressive results.
[0]: https://arxiv.org/abs/1909.09029 / J. Lacomis et al., "DIRE: A Neural Approach to Decompiled Identifier Naming," 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 628-639, doi: 10.1109/ASE.2019.00064.
It's certainly possible: compile all the C projects on GitHub with `gcc -O0`. Map statements, blocks, or functions to the ASM output. Put everything in a giant SQL database. Repeat for all of gcc's compiler flags. (A toy sketch of the first step is below.)
Wait, did I say it was possible? I'm curious what a neural netted compiler would produce. Probably your average CRUD software.
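For the non-joke part, the first step could look something like this toy snippet; `example.c` and the storage step are stand-ins, and a real corpus builder would walk whole checkouts and record the flags alongside each pair:

```c
#include <stdio.h>

/* Toy version of one corpus step: compile a single file at one
 * optimization level and capture the asm, ready to be stored next to
 * the source. "example.c" and the storage step are stand-ins. */
int main(void)
{
    /* -S stops after codegen and emits assembly; -o - writes it to stdout */
    FILE *asm_out = popen("gcc -O0 -S -o - example.c", "r");
    if (!asm_out) { perror("popen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, asm_out))
        fputs(line, stdout);    /* real version: INSERT INTO pairs (src, flags, asm) ... */

    return pclose(asm_out) == 0 ? 0 : 1;
}
```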
The thing is, a classifier is going to classify. So it'll create _some_ name… but will it be correct? Useful? Or completely misleading? I think a misleading name can be worse than no name at all.
Better would be a high-level description of what's happening. I have a feeling that would be easier to achieve.
Yes, but the problem is learning over graph data. That is, it is not a regularly structured input like an image, which always has a fixed size in pixels. Coming from a compiler background I am genuinely excited about the research on learning to label nodes and edges in graphs, but I know little about the challenges on the ML side of bringing this kind of technology into reality.
As Rust is LLVM-based, you don't need to compile it to C. Just write a backend that translates LLVM IR to C instead of to x86_64. The IR is very C-looking, though probably overly complex.
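To give a flavor of how direct the mapping could be, here's a trivial function in LLVM IR alongside hand-written C it might translate to. Both sides are illustrative rather than the output of any particular IR-to-C backend, though tools in that vein have existed (LLVM's old C backend, later revived as llvm-cbe):

```c
#include <stdint.h>

/* LLVM IR for a trivial add (illustrative):
 *
 *   define i32 @add(i32 %a, i32 %b) {
 *   entry:
 *     %sum = add nsw i32 %a, %b
 *     ret i32 %sum
 *   }
 *
 * A hypothetical IR-to-C backend could emit something like this: */
int32_t add(int32_t a, int32_t b)
{
    int32_t sum = a + b;  /* `nsw` (no signed wrap) lines up with C's UB on signed overflow */
    return sum;
}
```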
Compiling down to asm, a lot of information about memory layout etc. is lost, so asm isn't the best source for generating code.
Some architectural/program assumptions may not be encoded in the assembly, or may not be preserved in the asm -> C -> asm round trip, especially if the source and target assembly are for different architectures. The obvious examples are pointer word size and memory model.
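A contrived illustration of the pointer-size case (names hypothetical): the original 32-bit asm kept a pointer in a 4-byte slot, the reconstructed C faithfully reproduces that, and the hidden assumption only breaks when the C is rebuilt for a 64-bit target.

```c
#include <stdint.h>

/* Hypothetical reconstruction of 32-bit code: the original asm kept a
 * pointer in a 4-byte slot, so the recovered C naturally uses uint32_t.
 * Rebuilt for a 64-bit target, the cast silently truncates the pointer,
 * an assumption that was invisible in the original asm. */
uint32_t stash(void *p)
{
    return (uint32_t)(uintptr_t)p;      /* lossless on ILP32, lossy on LP64 */
}

void *restore(uint32_t bits)
{
    return (void *)(uintptr_t)bits;
}
```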
Rust might work reasonably well though, given it's compiled with LLVM. If not, I'm sure whatever asm structures LLVM outputs for Rust all have C equivalents.
I'd be curious to see the output! It'd be funny decompiling the Rust compiler to C, then running it on another platform that way. (Though it would still be, for example, an x86 Rust compiler running on ARM.)
Strictly speaking, all x86 MOVs would have to decompile to atomic loads/stores at some ordering level, because of the ordering guarantees x86 provides.
That's an example of "lost in translation": we don't know whether the original source required that ordering or not. x86 cannot express the weakest memory model C supports.
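A small sketch of that information loss, assuming C11 atomics: on x86-64 both of these functions compile to the same plain MOV, so a decompiler looking only at the MOV can't tell whether the source asked for relaxed or acquire ordering and has to emit the stronger one to stay correct.

```c
#include <stdatomic.h>

/* On x86-64 both of these compile to the same plain MOV (acquire is
 * "free" under x86's TSO ordering), so the MOV alone doesn't tell the
 * decompiler which ordering the original source actually asked for. */
int load_relaxed(_Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_relaxed);
}

int load_acquire(_Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}
```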
It wouldn't look much like what you'd expect C to look like. And if it did decompile back to idiomatic C, it could introduce some kind of aliasing bug along the way that would make it not so memory-safe.
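For example (a hypothetical reconstruction, not the output of any real decompiler): the machine code just copies 4 bytes, but the "idiomatic" pointer-cast version of that is undefined behaviour under C's strict-aliasing rules, while the memcpy version says the same thing safely.

```c
#include <stdint.h>
#include <string.h>

/* The machine code just moves 4 bytes. The pointer-cast version is what
 * a naive "idiomatic C" reconstruction might produce; it violates C's
 * strict-aliasing rules and is undefined behaviour the next compiler is
 * free to exploit. */
uint32_t float_bits_unsafe(float f)
{
    return *(uint32_t *)&f;             /* UB in C, even though the original asm was fine */
}

/* What a careful decompiler would have to emit instead. */
uint32_t float_bits_safe(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);           /* same single move after optimization, no UB */
    return u;
}
```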
Yes, but that's a property of SPARK, not something the category of "formal methods" gives you by default.
For instance, if you write your own proof and then prove the program meets it, there could still be a logic error in your proof.
Also, "in verified code" is a big loophole enough to leave security issues in - for instance a web browser (probably the thing you'd most like to prove security-issue-free) can still overwrite its own memory through things like OS image and font code, JavaScript JITs generating code to an insecure ABI you don't have a model for, syscalls to kernel code that can write back into your memory, etc.
Believing "formal" makes things magically correct does seem to be a common problem; it also comes up when people say something has a "formal audit" or needs to get one from someone. How do they "formally" audit it? Are they wearing suits?
> For instance, if you write your own proof and then prove the program meets it, there could still be a logic error in your proof.
Sure. Even mathematics research journals sometimes publish erroneous proofs. Practical formal methods generally rely on automated provers.
You're right though that software development using formal methods can still have a non-zero number of defects. AdaCore use the term ultra-low-defect software rather than bug-free software. For an interesting case-study see [0].
Unfortunately, even automated provers can have bugs. To my knowledge, none of the provers suitable for practical use are themselves formally verified. I don't think this is often an issue in practice, though. It remains that formal methods have an excellent real-world track record. The 'problems' with formal methods aren't effectiveness, but effort/price, and perhaps scalability.
> a web browser (probably the thing you'd most like to prove free of security issues) can still overwrite its own memory through things like OS image and font rendering code, JavaScript JITs generating code for an insecure ABI you don't have a model for, syscalls into kernel code that can write back into your memory, etc.
If you verify only certain parts of a software solution, then sure, you don't get formal assurances about its overall behaviour.
> How do they "formally" audit it? Are they wearing suits?
That's an entirely different use of the word, isn't it?
Critique for the author: include a picture of your decompiled output so people can get a feel for it and judge the quality. I wanted to check it out, but I was on my phone, so I couldn't really build and run the repo.