When you launch a container (either through Docker or manually through namespaces), you are effectively presenting yourself to the kernel as a separate entity. This lets you construct a completely separate environment for interacting with the kernel, where none of your concerns leak out and nothing you don't care about leaks in.
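To make "a separate thing" a bit more concrete: on Linux, the namespaces a process belongs to show up as symlinks under /proc/<pid>/ns, and two processes are in the same namespace exactly when the link targets match. A minimal sketch, assuming Linux with procfs mounted; the function names `namespace_ids` and `shares_namespaces` are mine, not from any library.

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace_name: link_target} for a process.

    Reads the symlinks in /proc/<pid>/ns. Returns an empty dict if
    procfs is unavailable or the process cannot be inspected.
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids  # no procfs, no such pid, or not permitted
    for name in names:
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # skip entries we are not allowed to read
    return ids


def shares_namespaces(a, b):
    """Sorted names of the namespaces in which two results match."""
    return sorted(n for n in a.keys() & b.keys() if a[n] == b[n])


if __name__ == "__main__":
    # Print this process's own namespace memberships.
    for name, target in sorted(namespace_ids().items()):
        print(name, target)
```

Running this once on the host and once inside a container, then comparing the two outputs, shows which namespaces were actually unshared and which are still the host's.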
People who say that static executables would solve the problem are wrong. A static executable just means that you can skip constructing a separate filesystem inside your container, and you will probably need to populate some locations anyway.
Properly configured containers are actually supposed to be secure sandboxes, such that any violation is a kernel exploit. However, the Linux kernel attack surface is very large, so no one serious who offers multi-tenant hosting can afford to rely on containers for isolation. They have to assume that a container-escape 0day can be sourced. It may be more accurate to say that a general kernel 0day can be sourced, since the entire kernel surface area is open for anyone to poke. seccomp can reduce that surface area, but it also narrows what the container is useful for.
Not... really. The Linux kernel has no concept of a container; you have to be super careful to avoid "mixing" host stuff in. I have yet to see a case where "leaking in" is prevented by default. Docker "leaks in" as much as you want. Containers also do not nest gracefully (due to, e.g., UIDs), so they cannot be used as a software component. It's mostly a Linux sysadmin thing right now.
Docker has made some strange decisions for default behavior, but if you take a more hands-on approach, such as with bubblewrap (bwrap), nothing will leak in.
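For what "hands-on" can look like, here is a rough sketch of building a bwrap command line where everything the sandbox sees is listed explicitly. It only constructs the argument list and does not run anything. The helper name `bwrap_argv` and the choice of /usr as the only read-only bind are my assumptions for illustration; which paths you actually need depends on the distro and the program. The flags used (`--unshare-all`, `--die-with-parent`, `--new-session`, `--clearenv`, `--ro-bind`, `--proc`, `--dev`, `--tmpfs`, `--setenv`) are standard bwrap options.

```python
import shlex


def bwrap_argv(command, ro_binds=("/usr",), env=None):
    """Build a bwrap argument list with an explicit allow-list.

    command  -- program and arguments to run inside the sandbox
    ro_binds -- host paths to expose read-only at the same location
    env      -- environment variables to set after clearing the rest
    """
    argv = [
        "bwrap",
        "--unshare-all",      # new user, pid, net, ipc, uts, cgroup namespaces
        "--die-with-parent",  # kill the sandbox if the launcher dies
        "--new-session",      # detach from the controlling terminal
        "--clearenv",         # start from an empty environment
    ]
    for path in ro_binds:
        argv += ["--ro-bind", path, path]
    # Fresh proc, minimal dev, and a private tmp instead of the host's.
    argv += ["--proc", "/proc", "--dev", "/dev", "--tmpfs", "/tmp"]
    for key, value in (env or {}).items():
        argv += ["--setenv", key, value]
    argv += list(command)
    return argv


if __name__ == "__main__":
    argv = bwrap_argv(["/usr/bin/env"], env={"PATH": "/usr/bin"})
    print(shlex.join(argv))
```

Note this only controls what the sandboxed process can see. It says nothing about what the host can see of the sandbox.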
How would you do it? I'm quite interested! How can you hide container processes in host procfs using bwrap? And make sure no mounts stay mounted in the host?
The most "nothing leaks in" runtime I've seen is gVisor (before going VM). Attaining that with bwrap would be nice, but I'm sceptical.