How about this for an idea: a decompiler that uses Machine Learning to name the decompiled variables and functions. Would be nice even if it worked only sometimes.
For an AI-less solution there is IDA's Lumina, which works pretty well. There's also a reverse-engineered server for it [0], so you could build plugins for other disassemblers/decompilers to use with non-official servers.
It basically hashes machine code (with the address parts removed) [1]. When reverse engineers label functions and push the symbols to the server (or extract them from some debug build), others can pull them and see what the functions are called in completely unrelated projects that use the same libraries or contain the same functions.
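Roughly, the trick looks like this (a minimal sketch of the general idea, not Lumina's actual hashing scheme; the span list and the FNV hash are just for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A byte span inside the function body that holds a relocated address
 * (e.g. a call target or a RIP-relative displacement). */
struct reloc_span { size_t offset; size_t len; };

/* Toy fingerprint: zero the address bytes, then FNV-1a hash the rest,
 * so the same code linked at different addresses hashes the same.
 * (Illustrative only; not Lumina's actual scheme.) */
uint64_t function_fingerprint(const uint8_t *code, size_t code_len,
                              const struct reloc_span *relocs, size_t n_relocs)
{
    uint8_t buf[4096];
    if (code_len > sizeof buf)
        code_len = sizeof buf;              /* keep the sketch simple */
    memcpy(buf, code, code_len);

    for (size_t i = 0; i < n_relocs; i++)
        for (size_t j = 0; j < relocs[i].len; j++)
            if (relocs[i].offset + j < code_len)
                buf[relocs[i].offset + j] = 0;

    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */
    for (size_t i = 0; i < code_len; i++) {
        h ^= buf[i];
        h *= 0x100000001b3ULL;              /* FNV-1a prime */
    }
    return h;
}
```

Two builds of the same library then produce the same fingerprint even though every call target and data reference differs, which is what lets labels travel between unrelated binaries.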
I'm surprised nobody has mentioned DIRE[0] yet. They did exactly this and got some very impressive results.
[0]: https://arxiv.org/abs/1909.09029 / J. Lacomis et al., "DIRE: A Neural Approach to Decompiled Identifier Naming," 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 628-639, doi: 10.1109/ASE.2019.00064.
It's certainly possible: compile all the C projects on GitHub with `gcc -O0`. Map statements, blocks, or functions to the ASM output. Put everything in a giant SQL database. Repeat for all of gcc's compiler flags. (A toy sketch of the first step is below.)
Wait, did I say it was possible? I'm curious what a neural netted compiler would produce. Probably your average CRUD software.
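For the non-joke part, the first step could look something like this toy snippet; `example.c` and the storage step are stand-ins, and a real corpus builder would walk whole checkouts and record the flags alongside each pair:

```c
#include <stdio.h>

/* Toy version of one corpus step: compile a single file at one
 * optimization level and capture the asm, ready to be stored next to
 * the source. "example.c" and the storage step are stand-ins. */
int main(void)
{
    /* -S stops after codegen and emits assembly; -o - writes it to stdout */
    FILE *asm_out = popen("gcc -O0 -S -o - example.c", "r");
    if (!asm_out) { perror("popen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, asm_out))
        fputs(line, stdout);    /* real version: INSERT INTO pairs (src, flags, asm) ... */

    return pclose(asm_out) == 0 ? 0 : 1;
}
```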
The thing is, a classifier is going to classify. So it'll create _some_ name… but will it be correct? Useful? Or completely misleading? I think a misleading name can be worse than no name at all.
Better would be a high-level description of what's happening. I have a feeling that would be easier to achieve.
Yes, but the problem is learning over graph data. That is, it is not a regularly structured input like an image, which always has a fixed size in pixels. Coming from a compiler background I am genuinely excited about the research on learning to label nodes and edges in graphs, but I know little about the challenges on the ML side of bringing this kind of technology into reality.
As Rust is LLVM-based, you don't need to compile it to C. Just write a backend that translates LLVM IR to C instead of to x86_64. The IR is very C-looking, though probably overly complex.
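To give a flavor of how direct the mapping could be, here's a trivial function in LLVM IR alongside hand-written C it might translate to. Both sides are illustrative rather than the output of any particular IR-to-C backend, though tools in that vein have existed (LLVM's old C backend, later revived as llvm-cbe):

```c
#include <stdint.h>

/* LLVM IR for a trivial add (illustrative):
 *
 *   define i32 @add(i32 %a, i32 %b) {
 *   entry:
 *     %sum = add nsw i32 %a, %b
 *     ret i32 %sum
 *   }
 *
 * A hypothetical IR-to-C backend could emit something like this: */
int32_t add(int32_t a, int32_t b)
{
    int32_t sum = a + b;  /* `nsw` (no signed wrap) lines up with C's UB on signed overflow */
    return sum;
}
```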
Compiling down to asm, a lot of information about memory layout etc. is lost, so asm isn't the best source for generating code.
Some architectural/program assumptions may not be encoded in the assembly, or may not be preserved in the asm -> C -> asm round trip, especially if the source and target assembly are for different architectures. The obvious examples are pointer word size and memory model.
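A contrived illustration of the pointer-size case (names hypothetical): the original 32-bit asm kept a pointer in a 4-byte slot, the reconstructed C faithfully reproduces that, and the hidden assumption only breaks when the C is rebuilt for a 64-bit target.

```c
#include <stdint.h>

/* Hypothetical reconstruction of 32-bit code: the original asm kept a
 * pointer in a 4-byte slot, so the recovered C naturally uses uint32_t.
 * Rebuilt for a 64-bit target, the cast silently truncates the pointer,
 * an assumption that was invisible in the original asm. */
uint32_t stash(void *p)
{
    return (uint32_t)(uintptr_t)p;      /* lossless on ILP32, lossy on LP64 */
}

void *restore(uint32_t bits)
{
    return (void *)(uintptr_t)bits;
}
```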
Rust might work reasonably well though, given it's compiled with LLVM. If not, I'm sure whatever asm structures LLVM outputs for Rust all have C equivalents.
I'd be curious to see the output! It'd be funny decompiling the Rust compiler to C, then running it on another platform that way. (Though it would still be, for example, an x86 Rust compiler running on ARM.)
Strictly speaking, all x86 MOVs would have to decompile to atomic loads/stores at some ordering level, because of the ordering guarantees x86 provides.
That's an example of "lost in translation": we don't know whether the original source required that ordering or not. x86 cannot express the weakest memory model C supports.
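A small sketch of that information loss, assuming C11 atomics: on x86-64 both of these functions compile to the same plain MOV, so a decompiler looking only at the MOV can't tell whether the source asked for relaxed or acquire ordering and has to emit the stronger one to stay correct.

```c
#include <stdatomic.h>

/* On x86-64 both of these compile to the same plain MOV (acquire is
 * "free" under x86's TSO ordering), so the MOV alone doesn't tell the
 * decompiler which ordering the original source actually asked for. */
int load_relaxed(_Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_relaxed);
}

int load_acquire(_Atomic int *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}
```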
It wouldn't look much like what you'd expect C to look like. And if it did decompile back to idiomatic C, it could introduce some kind of aliasing bug along the way that would make it not so memory-safe.
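For example (a hypothetical reconstruction, not the output of any real decompiler): the machine code just copies 4 bytes, but the "idiomatic" pointer-cast version of that is undefined behaviour under C's strict-aliasing rules, while the memcpy version says the same thing safely.

```c
#include <stdint.h>
#include <string.h>

/* The machine code just moves 4 bytes. The pointer-cast version is what
 * a naive "idiomatic C" reconstruction might produce; it violates C's
 * strict-aliasing rules and is undefined behaviour the next compiler is
 * free to exploit. */
uint32_t float_bits_unsafe(float f)
{
    return *(uint32_t *)&f;             /* UB in C, even though the original asm was fine */
}

/* What a careful decompiler would have to emit instead. */
uint32_t float_bits_safe(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);           /* same single move after optimization, no UB */
    return u;
}
```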
Yes, but that's a property of SPARK, not something the category of "formal methods" gives you by default.
For instance, if you write your own proof and then prove the program meets it, there could still be a logic error in your proof.
Also, "in verified code" is a big loophole enough to leave security issues in - for instance a web browser (probably the thing you'd most like to prove security-issue-free) can still overwrite its own memory through things like OS image and font code, JavaScript JITs generating code to an insecure ABI you don't have a model for, syscalls to kernel code that can write back into your memory, etc.
Believing "formal" makes things magically correct does seem to be a common problem; it also comes up when people say something has a "formal audit" or needs to get one from someone. How do they "formally" audit it? Are they wearing suits?
> For instance, if you write your own proof and then prove the program meets it, there could still be a logic error in your proof.
Sure. Even mathematics research journals sometimes publish erroneous proofs. Practical formal methods generally rely on automated provers.
You're right though that software development using formal methods can still have a non-zero number of defects. AdaCore use the term ultra-low-defect software rather than bug-free software. For an interesting case-study see [0].
Unfortunately, even automated provers can have bugs. To my knowledge, none of the provers suitable for practical use are themselves formally verified. I don't think this is often an issue in practice, though. It remains that formal methods have an excellent real-world track record. The 'problems' with formal methods aren't effectiveness, but effort/price, and perhaps scalability.
> a web browser (probably the thing you'd most like to prove free of security issues) can still overwrite its own memory through things like OS image and font rendering code, JavaScript JITs generating code for an insecure ABI you don't have a model for, syscalls into kernel code that can write back into your memory, etc.
If you verify only certain parts of a software solution, then sure, you don't get formal assurances about its overall behaviour.
> How do they "formally" audit it? Are they wearing suits?
That's an entirely different use of the word, isn't it?
Critique for the author: include a picture of your decompiled output so people can get a feel for it and judge the quality. I wanted to check it out, but I was on my phone, so I couldn't really build and run the repo.