The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
Although it does say that vfork() is difficult to use safely, while the gist recommends it, I think there is still some clarity needed around the use cases.
Fork today is a convenient API for a single-threaded process with a small memory footprint and simple memory layout that requires fine-grained control over the execution environment of its children but does not need to be strongly isolated from them. In other words, a shell. It’s no surprise that the Unix shell was the first program to fork [69], nor that defenders of fork point to shells as the prime example of its elegance [4, 7]. However, most modern programs are not shells. Is it still a good idea to optimise the OS API for the shell’s convenience?
As u/amaranth pointed out, my gist predates the MSFT paper, which mostly explains why I didn't reference it. Though, to be fair, I saw that paper posted here back in 2019, and I commented on it plenty (13 comments) then. I could have edited my gist to reference it, and, really, probably should have. Sometime this week I will add a reference to it, as well as to this and that HN post, since they are clearly germane and useful threads.
I vehemently disagree with those who say that vfork() is much more difficult to use correctly than fork(). Neither is particularly easy to use though. Both have issues to do with, e.g., signals. posix_spawn() is not exactly trivial to use, but it is easier to use it correctly than fork() or vfork(). And posix_spawn() is extensible -- it is not a dead end.
My main points are that vfork() has been unjustly vilified, fork() is really not good, vfork() is better than fork(), and we can do better than vfork(). That said, posix_spawn() is the better answer whenever it's applicable.
Note that the MSFT paper uncritically accepts the idea that vfork() is dangerous. I suspect that is because their focus was on the fork-is-terrible side of things. Their preference seems to be for spawn-type APIs, which is reasonable enough, so why bother with vfork() anyways, right? But here's the thing: Windows WSL can probably get a vfork() added easily enough, and replacing fork() with vfork() will generally be a much simpler change than replacing fork() with posix_spawn(), so I think there is value in vfork() for Microsoft.
Use cases for vfork() or afork()? Wherever you're using fork() today to then exec, vfork() will make that code more performant and it generally won't take too much effort to replace the call to fork() with vfork(). afork() is for apps that need to spawn lots of processes quickly -- these are rare apps, but uses for them do arise from time to time. But also, afork() should be easier to use safely than vfork(). And, again, for Microsoft there is value in vfork() as a smaller change to Linux apps so they can run well in WSL.
BTW, see @famzah's popen-noshell issue #11 [0] for a high-perf spawn use case. I linked it from my gist, and, in fact, the discussion there led directly to my writing that gist.
If you are going to edit, the google query links with the #q=xyz format no longer seem to work, so maybe update them to the ?q=xyz format which still works.
(Also this article and discussions on it now take up many of the top spots, which I guess is the disadvantage to linking to google for a topic)
You see, an operating system as commonly conceived has at least two major jobs:
- abstract away underlying hardware
- safely multiplex resources
And do the above with as little overhead as possible.
Now the thing is: whenever you have multiple goals, you need to make trade-offs, and you aren't as good at any one goal as you could be.
So the exokernel folks made a suggestion in the 90s: let the OS concentrate on safely multiplexing resources, and do all the abstracting in user level libraries.
Normal application programming would mostly look the same as before, your libraries just do more of the heavy lifting. But it's much easier to swap out different libraries than it is to swap out kernel-level functionality.
That vision never caught on with mainstream OSes. But: widespread virtualisation made it possible. You can see hypervisors like Xen as exokernel OSes that do the bare minimum required to safely multiplex, but don't provide (many) abstractions.
Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
Meanwhile, programs with more complex requirements have to work around these APIs. And many programs call other programs, or otherwise have to do tricky process lifecycle management.
The lowest-level APIs should, in theory, cater to the most complex cases, not to the simplest ones. This doesn't prevent a simpler API from existing, but catering to a simple use case in the primitives does hinder more complex needs.
(I think the more nuanced point is that the OS itself might not have a much better design available in any case. Unixes have a lot of neat stuff, but it's a lot of "design by user feature request", and "standardize 4 slightly different ways of doing things", so there is a lot of weirdness and it's hard to have The Perfect API in that case)
> Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
You'd think that, but implementing the UNIX shell and all of its semantics (piping, redirection, waiting, child reaping, jobs, foreground/background, prompting, etc.) using fork/clone + exec* is much simpler than it is on, say, Windows. Some API designs are better for that specific task.
> Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
True. Today anyways. Back in the 70s though, there was a lot of innovation going on around process spawning, and fork+exec almost certainly made it easy to play with those ideas. I'm referring to job control, for example. But also things like the parent-child relationships between the shell and all the processes in a pipeline -- not all shells have set those up the same way.
So, yeah, maybe we need not just posix_spawn() but posix_pipeline_spawn(), why not. Make it even easier to write a shell. After all, plumbing a complex pipeline with posix_spawn() requires a fair bit of code.
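For a sense of scale, here's roughly what even a two-stage pipeline (the equivalent of `ls | wc -l`) takes with posix_spawn(). This is just a sketch, with illustrative names and error handling omitted:

    /* Sketch: spawn the equivalent of "ls | wc -l" with posix_spawn(). */
    #include <spawn.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int run_pipeline(void) {
        int p[2];
        pid_t pids[2];
        posix_spawn_file_actions_t fa1, fa2;
        char *argv1[] = { "ls", NULL };
        char *argv2[] = { "wc", "-l", NULL };

        if (pipe(p) != 0)
            return -1;

        /* Producer: stdout -> write end of the pipe; close both pipe FDs. */
        posix_spawn_file_actions_init(&fa1);
        posix_spawn_file_actions_adddup2(&fa1, p[1], STDOUT_FILENO);
        posix_spawn_file_actions_addclose(&fa1, p[0]);
        posix_spawn_file_actions_addclose(&fa1, p[1]);

        /* Consumer: stdin <- read end of the pipe; close both pipe FDs. */
        posix_spawn_file_actions_init(&fa2);
        posix_spawn_file_actions_adddup2(&fa2, p[0], STDIN_FILENO);
        posix_spawn_file_actions_addclose(&fa2, p[0]);
        posix_spawn_file_actions_addclose(&fa2, p[1]);

        posix_spawnp(&pids[0], "ls", &fa1, NULL, argv1, environ);
        posix_spawnp(&pids[1], "wc", &fa2, NULL, argv2, environ);

        /* The parent must close its copies or the consumer never sees EOF. */
        close(p[0]);
        close(p[1]);

        waitpid(pids[0], NULL, 0);
        waitpid(pids[1], NULL, 0);
        return 0;
    }

And that's before job control, process groups, terminal handoff, or any error handling.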
Will any API do? Yes, provided it covers all the things Unix shells do nowadays. It's still easiest to get all the functionality (that a shell dev might want to build) with fork+exec though, especially since the shell author gets a great deal of control that way, though they get that at the price of having to know a great deal of stuff intimately. Arguably, anyone wishing to implement a posix_pipeline_spawn() would be like a shell developer.
The thing is that there are many other programs which require process control, which are not shells. Orders and orders of magnitude more programs which are not shells. So we can optimize an API for building shells, but it's not going to make writing those other programs easier.
Shells are cool and good, and I don't want to discount fork too much, just saying that the API design space isn't _only_ for shells.
> Yes, but why is this characterized as something negative?
Unfortunately, the text does not provide sufficient context. Shells are not properly supported in any OS (except perhaps Plan 9), since 1. the OS provides no enforced convention for the CLI interface (there is no enforced encoding standard or anything checkable), 2. the OS provides no rules keeping file names shell-friendly, and 3. there are no dedicated communication channels towards shells or between programs and shells.
So, all in all, shells remain a hack around a system that was "simple to implement initially" and is annoying to use and write in many corner cases.
> Shells simply developed features that users required of them.
Cross out "simply" and call it convenience+arbitrary complex scripting glue for 4 main goals:
1. piping
2. basic text processing
3. basic job control
4. path hackery
"The primary interface between the user and the OS" is the definition of "shell". That's why the Microsoft Windows process that draws the Start button and filesystem windows is called "the Windows shell".
I don't think OP meant shell as in the Windows shell, or Linux DEs. I mean, how many of those use fork() even on Linux, or would be easier to implement if they did?
Linux desktop environments do use fork(), and the Microsoft shell doesn't use fork() because Microsoft Windows doesn't have it.
In the Linux context, the fact that random things inherit a stdout appending to .xsession-errors and inherit environment variables is often useful. fork() also makes it fairly straightforward to do things like set a VM size limit or change an environment variable for a newly launched program, which is often useful when you're launching a program from just about anything. I don't know whether rearchitecting Microsoft Windows to work that way would have made the Windows Shell easier to write.
However, and this is the crucial point, fork() was impossible to support on Win16, because segment register values can be stashed anywhere in your 8086 program's memory, and they're just literally added to the offset address with a 4-bit shift, so there's no reliable way to make a copy of a running process elsewhere in memory that doesn't accidentally share segments with the original. You'd have to do what monocasa was saying old Unix did and checkpoint the process to disk. (I suspect Unix never did that, but it's similar to what PDP-11 Unix did do.)
Inheriting stdout etc does not require fork. It requires a spawn API that has a flag to inherit stdout, such as e.g. Win32 CreateProcess. Inheriting handles by default, on the other hand, is a recipe for hard-to-debug bugs.
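For comparison, a rough Win32 sketch of that explicit style (spawn_with_stdout is just an illustrative name, error handling mostly omitted). With bInheritHandles=TRUE, only handles explicitly marked inheritable get passed down:

    /* Sketch: pass stdout to a child explicitly rather than by default. */
    #include <windows.h>

    BOOL spawn_with_stdout(HANDLE out, char *cmdline) {
        STARTUPINFOA si = {0};
        PROCESS_INFORMATION pi;

        /* Mark just this handle as inheritable. */
        SetHandleInformation(out, HANDLE_FLAG_INHERIT, HANDLE_FLAG_INHERIT);

        si.cb = sizeof si;
        si.dwFlags = STARTF_USESTDHANDLES;
        si.hStdOutput = out;
        si.hStdInput  = GetStdHandle(STD_INPUT_HANDLE);
        si.hStdError  = GetStdHandle(STD_ERROR_HANDLE);

        if (!CreateProcessA(NULL, cmdline, NULL, NULL,
                            TRUE /* bInheritHandles */, 0, NULL, NULL, &si, &pi))
            return FALSE;

        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return TRUE;
    }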
Oh, I didn't mean without exec, but there are some programs like gnome-terminal that do that too. I just meant that forking, doing process configuration with system calls to open and close files and whatnot, and then running exec, is maybe a more convenient way to launch a program in a modified environment, than having a CreateProcess system call with fifty zillion flags.
Everything in Unix is a recipe for hard-to-debug bugs.
Sure. While clever and entertaining, I didn't find your comment to be a constructive contribution to the discussion. Also, I've found that attempts at humor on HN are often misinterpreted and can stir up trouble. (No, I did not downvote your comment.)
My comment contains more information more densely than what I could have stated flatly. This thread is the third longest on the post, and contains interesting and unique discussion. I don't see any troublesome misinterpretations.
Your concerns seem to be misplaced.
Emotionless propositional statements are not unconditionally better than other forms of writing.
In Ninja, which needs to spawn a lot of subprocesses but is otherwise not especially large in memory and which doesn't use threads, we moved from fork to posix_spawn (which is the "I want fork+exec immediately, please do the smartest thing you can" wrapper) because it performed better on OS X and Solaris:
The issue with posix_spawn is that you can't close all descriptors before exec. This is especially an issue as most libraries are still unaware they need to open every single handle with the close-on-exec flag set.
Indeed, it's very common to want to close all FDs other than 0, 1, and 2, of course, as well as a few other exceptions (e.g., a pipe a parent might read from, FDs on which flocks are held). The reason one often wants to close all open FDs besides those is simple: too many FDs that should be made O_CLOEXEC often aren't, and even when they are, too often there is a race to use fcntl() to do so on one thread while another one forks. Yes, there are new system calls that allow race-free setting of O_CLOEXEC on new FDs, but they will take a long time to be widely used.
I've implemented closefrom() type APIs more than once. Of course, I happen to know about Illumos', so there's that.
For implementations which don't have it, you can stuff, say, 4093 close-action entries into the file_actions, targeting descriptors 3 to 4095. This big file_actions object can be cached and re-used for multiple calls to posix_spawn.
It won't close descriptor 4096, but that's probably beyond giving a darn in most cases. If you have an application that opens high descriptor numbers, you probably know.
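A sketch of that trick (init_close_actions is an illustrative name). Note that whether a close action on a descriptor that isn't open is ignored or makes posix_spawn() fail has varied across implementations and versions, so check yours; newer glibc also has posix_spawn_file_actions_addclosefrom_np(), IIRC.

    /* Sketch: a reusable file_actions object that closes FDs 3..4095
     * in the child of every posix_spawn() call that uses it. */
    #include <spawn.h>

    static posix_spawn_file_actions_t close_high_fds;

    static int init_close_actions(void) {
        int rc = posix_spawn_file_actions_init(&close_high_fds);
        for (int fd = 3; rc == 0 && fd < 4096; fd++)
            rc = posix_spawn_file_actions_addclose(&close_high_fds, fd);
        return rc;   /* then pass &close_high_fds to posix_spawn() */
    }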
A better approach is to exec an intermediate helper program that will do it and then exec the actual intended program. One can also use this approach to do things like reset signal dispositions to SIG_IGN.
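A minimal sketch of such a helper (call it cleanexec; the name and the 4096 cutoff are arbitrary). Resetting signal dispositions would slot in the same way, just before the exec:

    /* Hypothetical "cleanexec" helper: close high FDs, then exec the real
     * target: cleanexec TARGET [ARGS...]. Exits 127 if the exec fails. */
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2)
            return 127;
        for (int fd = 3; fd < 4096; fd++)   /* or closefrom(3) where available */
            close(fd);
        execvp(argv[1], argv + 1);
        return 127;                          /* exec failed */
    }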
> Long ago, I, like many Unix fans, thought that fork(2) and the fork-exec process spawning model were the greatest thing, and that Windows sucked for only having exec() and _spawn(), the latter being a Windows-ism.
I appreciate this quite a bit. Vocal Unix proponents tend to believe that anything Unix does is automatically better than Windows, sometimes without even knowing what the Windows analogue is. Programming in both is necessary to have an informed opinion on this subject.
The one thing I miss most on Unix: the unified model of HANDLEs that enables you to WaitForMultipleObjects() with almost any system primitive you could want, such as an event with a socket (blocking I/O + a shutdown notification) in one call. On Unix, a flavor of select() tends to be the base primitive for waiting on things to happen, which means you end up writing adapter code for file descriptors to other resources, or need something like eventfd.
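A sketch of what that adapter code tends to look like on the Unix side, with an eventfd standing in for the Win32 event object (wait_socket_or_shutdown is an illustrative name):

    /* Sketch: wait on a socket OR a shutdown notification in one call. */
    #include <sys/eventfd.h>
    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    int wait_socket_or_shutdown(int sock, int shutdown_efd) {
        struct pollfd pfds[2] = {
            { .fd = sock,         .events = POLLIN },
            { .fd = shutdown_efd, .events = POLLIN },
        };

        if (poll(pfds, 2, -1) < 0)
            return -1;
        if (pfds[1].revents & POLLIN) {
            uint64_t v;
            read(shutdown_efd, &v, sizeof v);  /* consume the notification */
            return 1;                          /* told to shut down */
        }
        return 0;                              /* socket is readable */
    }

    /* Elsewhere, another thread requests shutdown with:
     *   uint64_t one = 1; write(shutdown_efd, &one, sizeof one);
     * where shutdown_efd = eventfd(0, EFD_CLOEXEC). */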
Things I don't miss from Windows at all: wchar_t everywhere. :)
- SIDs
- access tokens
(like struct cred / cred_t in Unix kernels,
but exposed as a first-class type to user-land)
- security descriptors
(like owner + group mode_t + ACL in Unix land,
but as a first-class type)
- HANDLEs, as you say
- HANDLEs for processes
Many other things, Windows got wrong. But the above are far superior to what Unix has to offer.
Superficial silliness like allocating 48 bits to encode integers in [0,18] aside, what problem do structured SIDs actually solve? I’ve been trying to figure that out for the last couple of days and I still don’t get it, possibly because the Windows documentation doesn’t seem to actually say it anywhere.
I completely agree with having UUIDs or something in that vein for user and group IDs and will not dismiss IDs for sessions and such in the same namespace (although haven’t actually seen a use case for those), but structured variable-length SIDs as NT defines them just don’t make sense to me.
While it's true that SIDs have too much structure, that's a lot better than a flat UID namespace that is also distinct from the also flat GID namespace.
The UID/GID namespace is strictly local in POSIX. There's no way to make any two systems agree on UIDs/GIDs other than by making them have the same /etc/passwd and /etc/group content. Sure, you can use LDAP, but still, that's just one domain. Come time to do a merger or acquisition, you can't just set up a trust between two domains and have it work -- you have to do a hard migration.
SIDs don't have that problem.
The 48-bit authority part of SIDs is silly.
And the domain SID prefix of SIDs is annoyingly large (20 bytes!).
However, they are very compressible. For example, ZFS stores them as "FUIDs", which are {interned_domain_sid_id, rid}, and in each dataset ZFS stores the table of interned domain SIDs. I.e., where NTFS needs 24 bytes to store any one domain user/group SID, ZFS uses 8, so a 67% savings.
Of course, MSFT should have applied that sort of compression much more aggressively early on. That would have reduced the sizes of PACs a great deal.
SIDs are a post-DCE evolution of UUIDs. SIDs differ from UUIDs in that they are hierarchical. In the context of the Windows domain model, they're split into a component which identifies the domain, and a "relative" component which identifies the security principal within the domain. Thus you can easily determine the domain authority to which a principal belongs (useful for filtering across trust boundaries), and you can also efficiently translate between SIDs and human-readable names (you don't need to ask every authority).
There is a good paper from Paul Leach which discusses what they learned from using UUIDs in DCE, but I've only ever sighted a paper copy and I don't have access to it anymore...
The hierarchical thing didn't really happen though -- there's no public SID registry. And machine/domain SIDs got pinned to 3 RIDs. So AD always had machine/domain SID conflict issues. It would only ever not have had SID conflict issues if they had had a public SID registry or if you had to install Windows as a domain member rather than install then join (and if there had never been a forest-of-forests feature).
Once you accept that machine/domain SID conflicts can happen, the value of having arbitrarily long SIDs goes away and you might as well use UUIDs to ID domains.
OK, perhaps hierarchical wasn't the correct word; it's not hierarchical in the sense of reflecting a (possibly global) domain hierarchy, but it does consist of a component that identifies the issuing authority and a component that identifies the principal relative to that authority.
So yes, a (UUID, RID) tuple would have worked just as well.
AFAIK WinNT consolidated a lot of ideas from VMS into more coherent constructs, pity that not all of them are exposed to developers (There is, for example, the option of using kernel upcalls in VMS style i.e. ASTs, but it's completely "private" API)
These decisions are all older than Windows and weren't a reaction to it. They were a reaction to the awful mainframe ways of spawning processes, like JCL.
We've sort of come back to that with Kubernetes YAML files describing how to launch an executable in a specific environment along with all of the resources it needs. The lineage can be traced explicitly: the Borg paper references mainframes and knowingly calls the language that would be replaced by Kubernetes's YAML files 'BCL', a nod to z/OS's JCL.
Plan9 is a lot older than Kubernetes and has the same namespacing of all processes. So it's not impossible to have a "*nix like" OS that still has mainframe-like separation of concerns to ease deployment.
If you want foolproof sandboxing, you need opt-out namespacing. Because there might be resource types that your version of the software doesn't know about, and these should really be namespaced by default.
Besides, what really matter is whether namespacing is idiomatic or not. It was always idiomatic in plan9, and containerization has certainly made it more idiomatic even on *nix systems.
Having written server software that had to work in both places, I always loved the simplicity of fork(2) / vfork(2) relative to Windows CreateProcess. Threading models in Win32 were always a pain. Which only got worse with COM (remember apartment threading? rental threading? ugh)
Back in the 90's, processes had smaller memory footprints, and every UNIX my software supported had COW optimizations. So the difference between fork(2) and vfork(2) was not very large in practice. Often, the TCP handshake behind the accept(2) call was of more concern than how long it would take fork(2) to complete. Of course, bandwidth has increased by a factor of 1000 since then, so considerations have changed.
It's how CreateProcess handles command-line arguments that infuriates me - not as an argv array but as one big string. It's so difficult to get the quoting right.
The problem with WaitForMultipleObjects (WFMO) is that it's limited to 64 handles, which basically makes it useless for anything where the number of handles is dynamic as opposed to static. There are ways to get around this limitation by grouping handles into trees, but it's tremendously clunky.
UCS-2 seemed like a good(ish) idea at the time when Unicode's scope didn't include every possible human concept represented in icon form and UTF-8 hadn't yet been spec'd on a napkin by the first adults to bother thinking about the problem.
Even in 1989, it should have been clear that 16 bits were not enough to encode all of the Chinese characters, let alone encoding all the human scripts. Unicode today encodes 92,865 Chinese characters (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs).
The only reason anybody would think UCS-2 was a good idea was that they did not consult a single Chinese or Japanese scholar on Chinese characters.
Nobody in 1989 expected to encode 92k Chinese characters into Unicode because none of the existing encodings were encoding 92k characters either. The most common encoding for Chinese, GB2312, only has 7k characters.
I recommend reading your own link, specifically the list of sources for the first CJK block to see how many characters were included and where they were sourced from.
Yes. I'm a bit surprised it took so long for someone to come up with something better. But if someone had tried and had come up with anything other than Rob Pike's UTF-8, we might still be sad. Sometimes you have to make mistakes before you know that's what they were.
The problem is that everyone wanted to keep simple array semantics for text, and that's not really workable with full scope of Unicode (even if you have 21-bit code points exposed, Runes, etc.)
On the plus side, because Unix was so ASCII-based, it couldn't easily make the jump to UCS-2/wchar_t. I suspect this was ultimately the motivation that led to UTF-8 (both, IBM's first attempt and Rob Pike's winner). Being late to the game sometimes means you're more prepared.
where `ptr` might be an index into a table (much like a file descriptor) or maybe a pointer in kernel-land (dangerous sounding!) and `verifier` is some sort of value that can be used by the kernel to validate the `ptr` before "dereferencing" it.
On Unix the semantics of file descriptors are dangerous. EBADF can be a symptom of a very dangerous bug where some thread closed a still-in-use FD, then an open() gets the same FD, and now maybe you get file corruption. This particular type of bug doesn't happen with HANDLEs.
> This particular type of bug doesn't happen with HANDLEs.
This does not match my experience at all. Just like what you said about EBADF, Win32 error code 6 (ERROR_INVALID_HANDLE) is a huge red flag for a race condition where a HANDLE gets re-used and inappropriately called upon in some invalid context, possibly even with security or stability concerns. I used to chase these bugs a lot when I worked on Win32 code bases.
If anything this class of bug is worse in Windows because (1) multi-threaded programs are way more common on Windows and (2) HANDLEs are used for more things than file descriptors.
I guess fd reuse is more likely because they tend to get handed out by the kernel as integers in increasing order. But handle reuse absolutely does happen, and if you have this class of bug in a process with a lot of concurrent handle creation in many threads and in a commonly used program it absolutely will bite as a bug at some point.
Gotcha. But it looks like file descriptors could be made almost as safe by avoiding index reuse. Is there any reason why it is not done? Hashtable too costly vs. an array?
File descriptor numbers have to be "small" -- that's part of their semantics. To ensure this, the kernel is supposed to always allocate the smallest available FD number. A lot of code assumes that FDs are "small" like this. Threaded code can't assume that "no FD numbers less than some number are available", but all code on Unix can assume that generally the used FD number space is dense. Even single-threaded code can't assume that "no FD numbers less than some number are available" because of libraries, but still, the assumption that the used FD number space is dense does get made. This basically forces the reuse of FDs to be a thing that happens.
For example, the traditional implementations of FD_SET() and related macros for select(3) assume that FDs are <1024.
Mind you, aside from select(), not much might break from doing away with the FDs-are-small constraint. Still, even so, they'd better be 64-bit ints if you want to be safe.
io_uring allows you to associate arbitrary 64-bit data with any operation and match it on completion, so it looks like it should address these concerns.
Since you said anything... This is not strictly related to the article but your expertise seems to be in the right area.
I have a process that executes actions for users. At the moment that process runs as root until it receives a token indicating an accepted user, then it fork()s and the child changes to the UID of the user before executing the action.
Is there a better way? I hadn't actually heard of vfork() before reading this article. I'm guessing maybe you could do a threaded server model where each thread vfork()s. I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
If the parent is threaded, then yes, vfork() will be better. You could also use posix_spawn().
As to "becoming a user", that's a tough one. There are no standard tools for this on Unix. The most correct way to do it would be to use PAM in the child. See su(1) and sudo(1), and how they do it.
> I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
Yes, fork() only copies the calling thread. The other threads' stacks also get copied (because, well, you might have pointers into them, who knows), but there will only be one thread in the child process.
vfork() also creates only one thread in the child.
There used to be a forkall() on Solaris that created a child with copies of all the threads in the parent. That system call was a spectacularly bad idea that existed only to help daemonize: the parent would do everything to start the service, then it would forkall(), and on the parent side it would exit() (or maybe _exit()). That is, the idea is that the parent would not finish daemonizing (i.e., exit) until the child (or grandchild) was truly ready. However, there's no way to make forkall() remotely safe, and there's a much better way to achieve the same effect of not completing daemonization until the child (or grandchild) is fully ready.
In fact, the daemonization pattern of not exiting the parent until the child (or grandchild) is ready is very important, especially in the SMF / systemd world. I've implemented the correct pattern many times now, starting in 2005 when project Greenline (SMF) delivered into OS/Net. It's this: instead of calling daemon(), you need a function that calls pipe(), then fork() or vfork(); on the parent side it then calls read() on the read end of the pipe and exits once that read completes, while on the child side it returns immediately so the child can do the rest of the setup work and then finally write one byte into the write side of the pipe to tell the parent it's ready, so the parent can exit.
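A minimal sketch of that pattern (illustrative names, error handling trimmed, and without the extra fork you'd add for a full daemonize()):

    /* Sketch: the parent doesn't exit until the child says it's ready. */
    #include <stdlib.h>
    #include <unistd.h>

    static int ready_fd = -1;

    void start_daemonize(void) {
        int p[2];
        char c;

        if (pipe(p) != 0)
            exit(1);
        switch (fork()) {
        case -1:
            exit(1);
        case 0:                      /* child: keep the write end, carry on */
            close(p[0]);
            ready_fd = p[1];
            setsid();
            return;
        default:                     /* parent: block until the child is ready */
            close(p[1]);
            if (read(p[0], &c, 1) != 1)
                exit(1);             /* child died before becoming ready */
            _exit(0);
        }
    }

    void finish_daemonize(void) {
        char c = 0;
        write(ready_fd, &c, 1);      /* tell the waiting parent to exit */
        close(ready_fd);
    }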
What about fork(2) for network servers? I've written parallel network servers two ways: open the socket to listen on and call fork() N times for the desired level of parallelism, or just create N processes and use SO_REUSEPORT. I prefer the former. I suppose there is a hidden option C of "have a simple process that opens the listening port and then vfork/execs each worker". I find that to be a bit strange because the code will be split into "things that happen before listening on the port" (which includes, e.g., reading configuration files) and "things that happen after listening on the port" (which also includes, e.g., reading configuration files).
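For reference, the former option in a minimal sketch (illustrative names, error handling omitted, serve() being the assumed per-worker accept loop):

    /* Sketch: open the listening socket once, fork N workers that share it. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void serve(int lfd);   /* assumed worker loop: accept() and handle */

    void start_workers(int nworkers, uint16_t port) {
        struct sockaddr_in sa = { 0 };
        int lfd = socket(AF_INET, SOCK_STREAM, 0);

        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        sa.sin_addr.s_addr = htonl(INADDR_ANY);

        bind(lfd, (struct sockaddr *)&sa, sizeof sa);
        listen(lfd, SOMAXCONN);

        for (int i = 0; i < nworkers; i++) {
            if (fork() == 0) {       /* each child inherits the listening FD */
                serve(lfd);
                _exit(0);
            }
        }
        close(lfd);                  /* parent just supervises/waits */
    }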
It's a bit opinionated. It's meant to get a reaction, but also to have meaningful and thought-provoking content, and I think it's correct in the main too. Anyways, hope you and others enjoy it.
They're software VMs. It's a lot like containers, yes.
The problem with containers is that the construction toolkit for them is subtractive ("start by cloning my environment, then remove / replace various namespaces"), while the construction toolkit for zones/jails is additive ("start with an empty universe, and add namespaces or share them with the parent").
Constructing containers subtractively means that every time there's a new kind of namespace to virtualize, you have to update all container-creating tools or risk a security vulnerability.
Constructing containers additively from an empty universe means that every time there's a new kind of namespace to virtualize, you have to update all container-creating tools or risk not getting sharing that you want (i.e., breakage).
I'm placing a higher value on security. Maybe that's a bad choice. It's not like breaking is a good thing -- it might be just as bad as creating a security vulnerability.
fork(2) makes a lot more sense when you realize its heritage. It came from a land before Unix supported full MMUs. In this model, to still have per-process address spaces and preemptive multitasking on what was essentially a PC-DOS level of hardware, the kernel would checkpoint the memory for a process, slurp it all out to dectape or some such, and load in the memory for whatever the scheduler wanted to run next. Its simplicity of being process-checkpoint based wasn't a reaction to Windows-style calls (which wouldn't exist for almost a couple decades), but instead to mainframe process-spawning abominations like JCL. The idea "you probably want most of what you have, so force a checkpoint, copy the checkpoint into a new slot, and continue separately from both checkpoints" was soooo much better than JCL and its tomes of incantations to do just about anything.
vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec(). All of the bugs that causes are super fun to chase, lemme tell you. AFAIC, about the only valid use for vfork now is nommu systems where fork() is incredibly expensive compared to what is generally expected.
clone(2) is great. Start from a checkpoint like fork, but instead of semantically copying everything, optionally share or not based on a bitmask. Share a tgid, virtual address space, and FD table? You just made a thread. Share nothing? You just made a process. It's the most 'mechanism, not policy' way I've seen to do context creation outside of maybe the l4 variants and the exokernels. This isn't an old holdover, this is how threads work today, processes spawned that happen to share resources. Modern archs on linux don't even have a fork(2) syscall; it all happens through clone(2). Even vfork is clone set to share virtual address space and nothing else that fork wouldn't share. Namespaces are a way to opt into not sharing resources that normally fork would share.
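A sketch of that flexibility using the glibc clone() wrapper (illustrative; a real pthread_create() passes several more flags and sets up TLS, so don't treat the thread case as production-ready):

    /* Sketch: the same clone(2) call gives you a "thread" or a "process",
     * depending only on the flag bitmask. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdlib.h>

    #define STACK_SIZE (1024 * 1024)

    static int child_fn(void *arg) { (void)arg; return 0; }

    int spawn(int as_thread) {
        char *stack = malloc(STACK_SIZE);
        int flags = as_thread
            /* share VM, FDs, filesystem info, signal handlers, thread group */
            ? CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND | CLONE_THREAD
            /* share nothing: an ordinary child process, parent gets SIGCHLD */
            : SIGCHLD;

        /* glibc's clone() takes the *top* of the child stack (stacks grow down) */
        return clone(child_fn, stack + STACK_SIZE, flags, NULL);
    }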
And I don't see what afork gets you that clone doesn't, except afork isn't as general.
> fork(2) makes a lot more sense when you realize its heritage.
I think it only makes sense when you consider its heritage. It has ALL the wrong defaults for what it's almost always used for these days: running a subprocess.
It copies "random" kernel data structures like open FDs, etc. and you have to be very careful about closing the ones you don't want to be inherited, etc. etc. It may copy things that weren't even a relevant concept when you wrote your program.
The correct thing to do is to be very explicit about what you want to pass on to the subprocess, and to choose safe defaults for programs compiled against the old API when you extend it. (Off the top of my head, the only things I'd want to be automatically inherited by default would be the environment and CWD.)
It's 100% the wrong API for spawning processes.
Now, I don't think afork() solves any of these problems, AFAICT. But my personal perspective is that fork() and its derivatives are the wrong starting point in the first place for what they are used for in 99% of all cases.
The behaviour of subprocesses inheriting resources like file descriptors is absolutely bizarre. Why on earth would you want this to be the default?! But we're so used to it, we think it's normal.
IMO clone looks a lot better than screwing with that giant struct and all of the kernel bugs that would exist from validating every goofy way those options could be set up wrong by user space.
The PDP-11 had segment base registers and memory protection, so it wasn't necessary to swap out one process to run another one at the same (virtual) address. It didn't have paging, so it couldn't swap out part of a segment. I think it's true that PDP-11 fork() would stop the process to make a copy of the writable segments, but it didn't have to "checkpoint" the process to a disk or tape. Are you talking about the PDP-7? I don't know anything about the PDP-7.
I agree about vfork(), since I haven't seen a system with segment base registers and no paging in a long time, and about clone(). Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
What's the L4 approach? Construct the state of the process you want to run in some memory and then use a launch-new-thread system call, then possibly relinquish access to that memory?
> Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
clone was literally designed to support posix threads.
> What's the L4 approach?
Capabilities over all of the kernel objects so user space can do safe brain surgery on them. Since everything is capability based, including the cap tables, you end up duping a cap table, allocating a non-running thread, setting registers, and attaching the duped cap table. Four syscalls in the minimal case, but it's L4 so they're fairly cheap. Full disclosure: one of my side projects is a kernel with caps and a first-class VM to do that in one syscall, amortized.
I see. Maybe that explains why on PDP-7 Unix programs would exec the shell instead of terminating the process; swapping your process out to disk or tape can't have been very fast. But without an MMU what else could you do?
Plan9 clone() was not designed to support POSIX threads; IIRC they didn't exist and Plan9 didn't support POSIX. Wasn't Linux clone() mostly a copy of it?
The L4 approach sounds pretty reasonable; not as convenient as fork() in the common case but not as much of a pain as, I don't know, opening a pty or opening an X11 window. I guess L4 syscalls are a bit pricier post-Spectre. How are you going to handle atomicity in your one syscall?
> Plan9 clone() was not designed to support POSIX threads; IIRC they didn't exist and Plan9 didn't support POSIX. Wasn't Linux clone() mostly a copy of it?
Plan9 doesn't have clone(). When they say clone was designed after plan 9, they just mean the general namespacing (which was not configured from their fork or new_thread equivalents). Linux clone was very much designed to support posix threads.
> The L4 approach sounds pretty reasonable; not as convenient as fork() in the common case but not as much of a pain as, I don't know, opening a pty or opening an X11 window. I guess L4 syscalls are a bit pricier post-Spectre.
Yeah, they got more expensive having to hide kernel address space layout.
> How are you going to handle atomicity in your one syscall?
Capabilities to bpf style programs that look like any other kernel objects and can call other kernel objects, combined with a scheme where mutex/spinlock wrapped objects have a locking order declared upfront that can be statically checked, combined with RCU primitives that the VM program verifier knows about and can make guarantees about. I'm not quite happy with the locking and RCU interfaces at the moment though, it feels like there's a more general solution, but each I've come up with has some real sharp edges. : \
Oh right, the Plan9 thing was called rfork(), and it only had the flags argument. Thank you for the correction.
The bpf approach sounds interesting! It sounds like you're going to significant effort with RCU to avoid mutexes (for performance I assume?), but there are a few places that you still feel like such optimistic synchronization approaches would be unacceptably costly. What are they?
If you could get rid of them, you wouldn't need a statically declared locking order (and what does "statically" mean in a kernel interface to poke code into the kernel at runtime?)
I've been thinking it would be fun to try a pure capability language along the lines of E, but using pure optimistic STM instead of single threading. That would eliminate three of the biggest theoretical weaknesses of E: malicious code can deny service by infinite-looping a vat, so in practice you have to put potentially untrusted code in its own vat; the error handling is ad hoc and therefore probably prone to the kinds of devastating problems we've seen in the DAO ecosystem; and it doesn't scale on multicore. The E design, meanwhile, eliminates shared mutable data, which avoids a plethora of bugs and security problems L4 userland programs are likely to include.
Such a system of course doesn't need a kernel, but also isn't very suitable for running malicious machine code, and its runtime overhead is likely to be a lot higher than a traditional memory-protection-based system.
> vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec().
What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you are the author of the code calling vfork() and you know not to do that, so why would that happen?
A: It just wouldn't happen.
And as to exec() failing, this is why the child side must end with a call to an exec function or _exit() (a failed exec being followed by _exit()), and this is true even if you use fork() instead of vfork(). I.e.:
    /* do a bunch of pre-vfork() setup */
    ...
    pid_t pid = vfork();
    if (pid == -1) err(1, "Couldn't vfork()");
    if (pid == 0) {
        /* do a bunch of child-side setup */
        execve(...);
        /* oops, ENOENT or something */
        _exit(1);
    }
    /* the child either exec'ed or exited */
    int status;
    if (waitpid(pid, &status, 0) != pid) err(1, "...");
    ...
How do you detect if the child exec'ed or exited? Well, you make a pipe before you vfork(), you set its ends to be O_CLOEXEC, then on the child side of vfork() you write one byte into it if the exec call fails. On the parent side you read from the pipe before you reap the child, and if you get EOF then you know the child exec'ed, and if you get one byte then you know the child exited. The one byte could be an errno value.
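A minimal sketch of that trick (illustrative names; it takes the position argued here that a write() on the child side of vfork() is fine in practice):

    /* Sketch: EOF on the CLOEXEC pipe means the child exec'ed; data means
     * the exec failed, and carries the child's errno. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    pid_t spawn_and_check(char *const argv[], int *exec_errno) {
        int p[2], err = 0;
        pid_t pid;

        if (pipe2(p, O_CLOEXEC) != 0)          /* or pipe() + fcntl(FD_CLOEXEC) */
            return -1;

        pid = vfork();
        if (pid == 0) {
            execvp(argv[0], argv);
            write(p[1], &errno, sizeof errno); /* exec failed: report why */
            _exit(127);
        }

        close(p[1]);                           /* parent keeps only the read end */
        if (pid > 0)
            read(p[0], &err, sizeof err);      /* zero bytes read: exec succeeded */
        close(p[0]);
        *exec_errno = err;                     /* 0 means the exec happened */
        return pid;                            /* caller still waitpid()s */
    }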
No, really, what you say about vfork() is lore, and very very wrong.
That said, vfork() blocks a thread in the parent. The point of my gist was to explain why fork() sucks, why vfork() is much better, and what would be better still.
> And I don't see what afork gets you that clone doesn't, except afork isn't as general.
afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
> What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you're the author of the code calling vfork() and you know not to do that
Within a sentence you described the stack modification. 'It's not a footgun, just don't make mistakes' doesn't hold a lot of water with me.
> No, really, what you say about vfork() is lore, and very very wrong.
Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".
> afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
Ok, but I'm not going to hold it against clone for being a more general solution.
> clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
I agree with this, but there are practical reasons why this isn't the case, mainly around how asking user space for every little thing is expensive, and large sparse structs to copy into kernel space covering basically everything in struct task sound like a special kind of security hell I would not want to be a part of.
A flag to clone to create an empty process and something like a bunch of io_uring calls or a box program to hydrate the new task state would be really neat, and has been kicked around a bunch. There's just a ton of corner cases that haven't been ironed out.
> Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".
> You're supposed to only use async-signal-safe functions on the child-side of fork().
Not practically; there's way more code out there designed from day one for fork(), and next to none designed for vfork() explicitly.
Signal safety has more to do with shared mutability, which isn't a concern for fork. You can get into gross situations mixing fork and threads, but that's equally true of vfork.
> Signal safety has more to do with shared mutability, which isn't a concern for fork.
And yet that's what the spec says about child-side code following fork(). There's a reason for that. It's not just about signals. Async-signal-safe means, yes, that you can use it in an asynchronous signal handler, but there are contexts other than async signal handlers that require async-signal-safe code.
> You can get into gross situations mixing fork and threads...
You can get into bad situations just using fork and no threads.
> Not practically, there's way more code out there designed day one for fork(). Next to none designed for vfork() explicitly.
> And yet that's what the spec says about child-side code following fork(). There's a reason for that. It's not just about signals. Async-signal-safe means, yes, that you can use it in an asynchronous signal handler, but there are contexts other than async signal handlers that require async-signal-safe code.
You cut off with the reason being threads and shared mutability.
In fact that's what the spec says too.
1003.1-2017 on fork()
> A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.
Practically if you don't use threads you can do anything in the child process you can do in the parent. Any env that doesn't support that breaks decades of important Unix software.
And what are you fixing by changing fork to vfork there?
> Practically if you don't use threads you can do anything in the child process you can do in the parent. Any env that doesn't support that breaks decades of important Unix software.
Not true. I mentioned PKCS#11 elsewhere in this post or thread. The PKCS#11 case is more generally about devices, or even TCP and other connections. You can't share, say, a file descriptor connected to an IMAP server (or whatever) between the parent and the child (not without adding synchronization, though that need not mean mutexes).
That's like saying you can't write to the same file willy nilly after any context creation. In context, I obviously meant that you can perform the same actions in the child or the parent, not that you somehow get free synchronization for accessing all kernel objects.
Also, you can specify CKF_INTERFACE_FORK_SAFE if you want a handle in PKCS#11 that handles synchronization enough internally to call from both the child and the parent simultaneously.
Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about forking. It's allowed to generate some non-obvious code so long as it acts "as if" it has the same behaviour - but only from the point of view of that theoretical C VM.
For example, in theory _exit(1) could be implemented as longjmp(...) up to a point in some compiler-created top-level function that wraps up main(). Then that wrapper function could perform some steps to communicate the return code to the OS that trashes the stack before actually exiting. After all, if the process is about to exit anyway, what difference does it make if a bunch of memory is fiddled with? We know the answer to this but, from the point of view of the C virtual machine, it's irrelevant.
That particular scenario is unlikely, but the point is that compiler implementations and optimisations are allowed to do very non-obvious things. You're only safe if you stick to the rules of the C standard, which this 100% does not.
> Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about about forking.
Luckily a C compiler that doesn't know about concepts outside of the C Virtual machine will not be able to compile a Linux executable or even dynamically load a library that exposes the vfork call (let alone try to execute the underlying system call directly).
That doesn't make sense. The C VM only affects how C code is understood by the compiler, in particular what optimisations are allowed. It doesn't stop the compiler from generating an executable or linking to libraries.
> It doesn't stop the compiler from generating an executable or linking to libraries.
The C standard claims multiple definitions result in undefined behavior. Dynamic libraries are filled to the brim with copies of symbols because it is impossible to tell in which library a symbol should be stored. Linking against a dynamic standard library cannot end well.
Stack manipulations are a real problem. Say some parameter to exec after vfork uses stack slots created by the compiler for temporary variables. Sure, you compute those before the call to vfork, but then the compiler applies code motion...
    pid = vfork();
    if (pid == 0) {
        int something;
        exec();
        // cleanup code that uses something
        _exit(1);
    }
Then the compiler (which knows `_exit` is noreturn) can conclude that if you enter the `if`, none of the existing stack slots will be read again, so it can reuse one of those stack slots for the `something` variable. But whoops, that means the original process has had its stack corrupted.
This applies even when the variable is declared at the start of the function, as compilers can perform equivalent variable-lifetime analysis to let them reuse the stack slot. This is exactly why the POSIX spec makes it undefined to write to any variable after vfork (except the pid return variable, obviously).
But even that is not strictly safe enough, since the compiler is allowed to introduce writes to the stack. This may for example, happen as part of calculating a temporary, if the compiler wants to use the register for something else, and decides against using some other register for storage, so spills to the stack.
Obviously your `afork` completely avoids all those sorts of concerns by using a separate stack.
If "[s]tack manipulations are a real problem" (I say there are none if you're writing the code and know not to add any problematic stack manipulations) then avfork() should satisfy that concern.
I'm still struggling to understand the point of vfork(). The whole point of fork is to offload work to a different part of your program so the original part can continue to do work. The entire idea fails if it halts the original program for the duration of the child's life. How is this better than just doing a regular function call?
vfork halts the parent until the child exits or calls exec, getting its own address space. In the normal case, you vfork and immediately exec, and the parent continues on with what it was doing. The time between vfork and exec is “special” in that the child is temporarily running in the parent’s address space, then it uses exec to separate and do its own thing.
Yeah, if you’re never planning on calling exec, vfork doesn’t make much sense.
Can I ask how you approach resource management and dependencies in that kind of code base? As the article briefly mentions, using fork without exec means you need to keep everything else in the process fork-safe, which I know can be a challenge in the presence of third-party code.
Not who you're replying to, but it's trivial as long as you don't use threads.
I suppose third-party code could be opening up file-descriptors behind your back and privately maintaining that state in private storage, but third-party code that does that without documenting it is relatively rare in the Unix/C world in my experience.
Historically getXbyY functions and the name service switch had a way of doing that, and that was one reason for nscd to come along (another was to cache better, naturally).
Most (all?) of the nsswitch functions were datagram based back in the day, so those would be safe.
I've certainly never had issues using e.g. getpwent on a NIS setup with forking and modern rpcbind may use TCP I believe. Maybe it opens a new connection each time?
Static file descriptors were a bit more common in the old days, but look horribly out of place in modern code. Keeping the code fork safe is easier than keeping it thread safe, at least with fork you aren't sharing the heap.
But you're sharing file descriptors, which might be for devices, or for SOCK_SEQ connections, etc, and you can't just have the parent and child step all over each other writing to them. Now, you wouldn't do that, but you might use a library that lets you end up doing that without noticing. Fork-safety is not trivial.
I've seen an argument for not marking the whole mutable process VA space as 'trap on write' (including the thread stack you're about to immediately write to) when you're just going to throw that work away and exec() right after. There's also 'I want to support cheap forks on a nommu system and vfork is easier to retrofit in'.
The code I currently work on actually has a use of `clone` with the `CLONE_VM` flag to create something that isn't a thread. Since `CLONE_VM` will share the entire address space with the child (you know, like a thread does!) a very reasonable response would be "WAT?!"
What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
We achieved this by using `CLONE_VM` (and a handful of other flags) to give the new "thread-like" entity access to the whole address space. But, we omitted `CLONE_THREAD`, as if we were making a new process. The new "thread-like" entity would not technically be part of the same thread group but would live in the same address space.
We also used two chained `clone()` calls (with the intermediate exiting, like when you daemonise) so that the new "thread-like" wouldn't be a child of the original process.
All this existed before I joined; it's just really cool that it works. I've never encountered such a non-standard use of clone before, but it was the right tool for this particular job!
> What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
Sure! I'll try to illustrate the general idea, though I'm taking liberties with a few of the details to keep things simple(r).
Our software (see https://undo.io) does record and replay (including the full set of Time Travel Debug stuff - executing backwards, etc) of Linux processes. Conceptually that's similar to `rr` (see https://rr-project.org/) - the differences probably aren't relevant here.
We're using `ptrace` as part of monitoring process behaviour (we also have in-process instrumentation). This reflects our origins in building a debugger - but it's also because `ptrace` is just very powerful for monitoring a process / thread. It is a very challenging API to work with, though.
One feature / quirk of `ptrace` is that you can't really do anything useful with a traced thread that's currently running - including peeking its memory. So if a program we're recording is just getting along with its day we can't just examine it whenever we want.
First choice is just to avoid messing with the process but sometimes we really do need to interact with it. We could just interrupt a thread, use `ptrace` to examine it, then start it up again. But there's a problem - in the corners of Linux kernel behaviour there's a risk that this will have a program-visible side effect. Specifically, you might cause a syscall restart not to happen.
So when we're recording a real process we need something that:
* acts like a thread in the process - so we can peek / poke its memory, etc via ptrace
* is always in a known, quiescent state - so that we can use ptrace on it whenever we want
* doesn't impact the behaviour of the process it's "in" - so we don't affect the process we're trying to record
* doesn't cause SIGCHLD to be sent to the process we're recording when it does stuff - so we don't affect the process we're trying to record
Our solution is double clone + magic flags. There are other points in the solution space (manage without, handle the syscall restarting problem, ...) but this seems to be a pretty good tradeoff.
I looked into something similar for implementing a concurrent GC. I ended up just using mmap() and ptrace(), since I did have to manipulate the process for certain barrier operations; I probably could have done it with non-ptrace system calls; there are tradeoffs to be made (either way you need to interrupt any pending system calls, but there are multiple ways of doing that).
The problem record and replay is expansions of languages and apis too. That is a good thing for some things but it needs to be reworded sometimes too and implementations of things aren't always newer versions of things either.
> The problem record and replay is expansions of languages and apis too. That is a good thing for some things but it needs to be reworded sometimes too and implementations of things aren't always newer versions of things either.
Changes to languages and APIs can be a problem to record/replay depending on exactly how they're implemented.
Undo's core tech, rr (and, arguably, GDB's built in record/replay) operate at the level of machine instructions and operating system calls, so changes to language and library behaviours don't generally affect us, outside of a few corner cases.
When you have that, you don't need to even know what the language is in order to operate - though if you want source-level debugging then it does matter as you have to be able to map from "your program counter is here" to "you're at this source line".
We occasionally need to add support for new system calls but an advantage of Linux is that the kernel ABI is very stable. New extensions to CPU instruction set also require work - these can be harder to support but they change more slowly.
Of course, operating at such a low level isn't the only way to record/replay - there are distinct costs and benefits to operating at a higher level in the stack.
2. Set it up from the parent process, it just lies on the operating table passively.
3. Submit it to the scheduler.
This is just... obviously correct. Totally flexible. Totally efficient. Hell, if you really want to fork anything, fork those embryonic processes which have no active threads! Much safer and easier to understand!
When I was first learning about UNIX and similar OSes I just assumed that this is how things worked because this is the obvious way of doing it. Why would you fork a process, then try to determine which of the two processes you are, then fix whatever the parent process messed up in your global state, and only then execute what you actually wanted to do? That seems insane (I guess until you realize that the main use case is creating /bin/sh).
But even when writing /bin/sh, I don't see why this would get in the way? I was once told that earlier Unix didn't even have fork, but had something more purpose-made for shells instead.
Sounds a bit like Fuchsia's launchpad library, where you create a launchpad object, do all the setup, and then call launchpad_go to actually start the process. Launchpad doesn't allow arbitrary syscalls in the setup, so in that sense it is maybe closer to a "spawn" interface, but with better ergonomics.
I was always disappointed by the performance of fork()/clone().
CompSci class told me it was a very cheap operation, because all the actual memory is copy-on-write, so its a great way to do all kinds of things.
But the reality is that duplicating huge page tables, and hundreds of file handles, is very slow. Like tens of milliseconds slow for a big process.
And then the process runs slowly for a long time after that because every memory access ends up causing lots of faults and page copying.
I think my CompSci class lied to me... it might seem cheap and a neat thing to do, but the reality is there are very few use cases where it makes sense.
CS classes (and, far too often, professional programmers) talk about computers like they're just faster PDP-11s with fundamentally the same performance characteristics.
Agreed that these costs can be larger than is perhaps implied in compsci classes (though it's possible that they've changed their message since I took them!)
I suppose it is still essentially free for some common uses - e.g. if a shell uses `fork()` rather than one of the alternatives it's unlikely to have a very big address space, so it'll still be fast.
My experience has been that big processes - 100+GB - which are now pretty reasonable in size really do show some human-perceptible latency for forking. At least tens of milliseconds matches my experience (I wouldn't be surprised to see higher). This is really jarring when you're used to thinking of it as cost-free.
The slowdown afterwards, resulting from copy-on-write, is especially noticeable if (for instance) your process has a high memory dirtying rate. Simulators that rapidly write to a large array in memory are a good example here.
When you really need `fork()` semantics this could all still be acceptable - but I think some projects do ban the use of `fork()` within a program to avoid unexpected costs. If you really have a big process that needs to start workers I guess it might be worth having a small daemon specifically for doing that.
Right, shells are not threaded and they tend to have small resident set sizes. Even in shells though, there's no reason not to use vfork(), and if you have a tight loop over starting a bunch of child processes, you might as well use it. Though, in a shell, you do need fork() in order to trivially implement sub-shells.
Also, mandating copy-on-write as an implementation strategy is a huge burden to place on the host. Now you’ve made the amount of memory a process is using unquantifiable.
It's not necessarily unquantifiable -- the kernel can count the not-yet-copied pages pessimistically as allocated memory, triggering OOM allocation failures if the amount of potential memory usage is greater than RAM. IIUC, this is how Linux vm.overcommit_memory[1] mode 2 works, if overcommit_ratio = 100.
However, if an application is written to assume that it can fork a ton and rely on COW to not trigger OOM, it obviously won't work under mode 2.
> 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM.
> Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
> Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.
You're right, "unquantifiable" was the wrong word here. I meant, a program has no real way of predicting/reacting to OOM. I didn't realize mode 2 with overcommit_ratio = 100 behaved that way, thanks for sharing.
Yeah I think in a practical sense you're right, since AFAIK using mode 2 is fairly rare because most software assumes overcommit, and even if a program is written with an understanding that malloc can return NULL, it's in the sense of...
POSIX doesn't require that fork() be implemented using copy-on-write techniques. An implementation is free to copy all of the parent's writable address space.
If the parent is a JVM, for sure. But a copy-on-write fork() still doesn't perform well. The point isn't to just copy the whole parent. The point is to stop copying at all.
Copy-on-write is supposed to be cheap, but in fact it's not. MMU/TLB manipulations are very slow. Page faults are slow. So the common thing now is to just copy the entire resident set size (well, the writable pages in it), and if that is large, that too is slow.
> clone() is stupid ... the clone(2) design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.
IMHO a bigger problem [2] in practice with clone is that (according to glibc maintainers) once your program calls it, you can't call any glibc function anymore. [1] Essentially the raw syscall is a tool for the libc implementation to use. The libc implementation hasn't provided a wrapper for programs to use which maintains the libc's internal invariants about things like (IIUC) thread-local storage for errno.
The author's aforkx implementation is something that glibc maintainers could (and maybe should) provide, but my understanding is that you can get in trouble by implementing it yourself.
[2] editing to add: or at least a more concrete expression of the problem. Wouldn't surprise me if they haven't provided this wrapper in part because the proliferation the author mentioned makes it difficult for them to do so.
It's really unfortunate that the sanctioned way to call Linux syscalls directly is via the syscall() function (previously the _syscallN macros), and both of those methods set errno on error, which fails in a clone() thread.
If only Glibc provided a syscall_r() or something that returns the raw return value whether it's an error or not.
It is possible to make syscall() (and regular libc syscalls like read()) work in a clone() thread. I use this in performance-optimised I/O code in a database engine, so I know it works, but it requires some ugly Glibc-and-architecture-specific things. Doing it portably doesn't seem to be an option.
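For the curious, the non-portable part boils down to something like this (an x86-64-only sketch; `raw_syscall3` and `raw_read` are made-up names, and real code needs variants for other argument counts and architectures):

```c
/* Sketch: a raw syscall wrapper for x86-64 Linux that returns the kernel's
 * return value directly (negative errno on failure) instead of writing to
 * the thread-local errno, so it is usable from a task created with a bare
 * clone() where the TLS/errno machinery can't be trusted. */
#include <sys/syscall.h>

static inline long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
}

/* Usage: a read() that never touches errno. */
static long raw_read(int fd, void *buf, unsigned long len)
{
    return raw_syscall3(SYS_read, fd, (long)buf, (long)len);
}
```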
The problem with this argument is that the set of programs that just fork() and then exec() is fairly small. Sure, shells are small and do this, but then the article argues that shells are a good use of fork().
In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done (maybe you want to create a new pid ns, you need a separate mm because you're going to allocate a bunch, whatever). Maybe the argument is that programs should never do this? I don't buy that. Then there's a lot of string-slinging through exec().
That's backwards from my experience, which is that most users of fork() only do "fork; child does small amount of setup, eg closing file descriptors; exec". Shells are one of the few programs that do serious work in the child, because the POSIX shell semantics surface "create a subshell and do..." to the shell user, and then the natural way to implement that when you're evaluating an expression tree is "fork, and let the child process continue evaluating as a long-lived process continuing to execute as the same shell binary". (Depending on what's in that sub-tree of the expression, it might eventually exec, but it equally might not.)
Many years back I worked on an rtos that had no fork(), only a 'spawn new process' primitive (it didn't use an MMU and all processes shared an address space, so fork would have been hard). Most unixy programs were easy to port, because you could just replace the fork-tweak-exec sequence with an appropriate spawn call. The shells (bash, ash I think were the two I looked at) were practically impossible to port -- at any rate, we never found it worth the effort, though I think with a lot of effort and willingness to carry invasive local patches it could have been done.
The vast majority of programs that fork are doing fork() followed almost immediately by exec(), to the extent that on macOS for example a process is only really considered safe for exec() after fork() happens. Pretty much nothing else is considered safe.
Yeah; that would be my assumption too. I worked one time on a significant project that benefited from fork() without exec(), and it was a monstrous pain - only if you own every single line of code in your project, have centralized resource management, and have no significant library dependencies should you ever consider doing this.
Yeah, you can't depend on pthreads or pthread mutexes (they're not defined as being fork safe).
The entirety of Foundation (so presumably anything in Swift) is not fork safe either.
To be clear: "not fork safe" in this case means "severely constrained environment": e.g. you can do things like set rlimits, set up pipes, etc., but good luck with much more. I guess morally similar to the restrictions you have in a signal handler, albeit with different restrictions.
Oh no, there's tons of ProcessBuilder type APIs in Java, Python, and... every major language you can think of.
The problems with fork() become very apparent in any Java apps that try to run external programs, especially in apps that have many threads and massive heaps and are very busy.
> In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done
That's usually going to be done with clone() instead, no? You'll likely want to fiddle with the various flags for those usages and are unlikely to be happy with what fork() otherwise does.
That paper smacks of a Chesterton Fence. They haven't come up with a tested replacement for many of the use cases, i.e.:
These designs are not yet general enough to cover all the use-cases outlined above, but perhaps can serve as a starting point...
yet bullet #1 in the next paragraph is
Deprecate Fork
I think this is a case of security guys being upset about fork gumming-up their experiments. I don't really care about their experiments. The security regime for the past 20 years may have bought us a little more security against eastern bloc hackers, but it hasn't done squat to protect us from Apple, Google, & Microsoft! I have never had a virus de-rail my computing life as much as the automatic Windows 10 upgrade. Robert Morris got 400 hours community service for a relatively benign worm. If that's the penalty scale, Redmond should get actual time in the slammer for Cortana, forced Windows Update, and adding telemetry to Calculator.
You fail to address any of the substance of their paper, or of my gist (TFA), then go on a rant about unrelated things. The authors of that paper deserve better treatment even if you hate Microsoft.
I did. Chesterton Fence. fork() has been in Unix from the beginning. Taking it out at this point will cause more problems than it solves. Until you have a working Unix distro (kernel AND common userland services) that elegantly covers all of the forkless cases, your paper and their paper are just opinions. Theirs is a formally written one. Yours is a clickbaity one. And casting vfork() as any kind of improvement here is just bonkers.
And the rant is totally related: i.e. devs breaking things that worked just fine to begin with for the sake of doctrinal purity. It is usually a false doctrine.
I'm not proposing that fork() be removed. Microsoft is much more interested in not ever implementing fork() than I am in removing it. So your dilapidated fence can stay up where it's up.
I have to disagree that fork is evil. fork is great because of copy-on-write. I guess my particular use case is not very typical/common though.
I'm running powerflow simulations on a power grid model (several GB of memory to store the model). Copy-on-write means I can make small modifications to this model and run simulations in parallel. Thanks to fork/copy-on-write, I can run 32 simulations in parallel, each with small modifications, without requiring 32 times as much memory.
I saw a bug once where an application would get way slower on MacOS after calling fork(). Not just temporarily either; many syscalls would continue to run slowly from the call to fork() until the process exited.
Looking on Stack Overflow, I see a few reports of this behavior[0][1].
I don't think containers should be like jails. Containers should be more like chroots than they are now.
Have you ever tried to run a modern X/whatever app with 3D graphics and audio and DBUS and God knows what else in a container and get it to show up on your desktop? It's a fucking nightmare. I spent over a week trying to get 1Password to run in a container. Somebody decided containers had to be "secure", even though they don't actually exist as a single concept and security was never their primary purpose. If instead containers were used only to isolate filesystem dependencies, we could actually pretend containers were like normal applications and treat them with the same lack of security concern that all the rest of our non-containerized programs are.
Firecracker is the correct abstraction for isolation: a micro-VM. That is the model you want if you want to run an app securely (not to mention reliably, as it can come with its own kernel, rather than needing you to run a compatible host kernel).
I... didn't mean that containers have to have a copy of the operating system inside them, systemd and many other things included. I meant only that they should be created in ways like how the BSDs and Illumos do it.
Is it a fair approach to implement first with fork() because of its memory protection, then optimize based on benchmarks, potentially switching to vfork() for speed? Benchmark areas could look at synchronous locks, copy-on-write memory, stack sharing, etc.
What are the good practices of security tradeoffs of fork() vs. vfork() especially in terms of ease of writing correct code? I'd thought that fork() + exec() tends to favor thinking about clearer separation/isolation. For example I've written small daemons using fork() + exec() because it seems safe and easy to do at the start.
In short, fork() mixes poorly with multi-threaded code, and it has some security footguns, like needing to explicitly unshare elements of the environment that may be sensitive, such as file descriptors (suddenly you need to know, from a single place in the code, every file descriptor used anywhere in the program). Here is a well-written comment about fork() from David Chisnall: <https://lobste.rs/s/cowy6y/fork_road_2019#c_zec42d>
Additionally, the fork()+exec() idiom practically forces OS designers into a corner where they simply have to implement Copy-on-Write for virtual memory pages, or otherwise the whole userspace using this idiom is going to be terribly slow. Without the fork()+exec() idiom you don't need CoW to be efficient.
Fork mixes so poorly with multithreaded code that a lot of modern languages that are built from the beginning with threads of one sort or another in mind, like Go, simply won't let you do it. There is no binding to fork in the standard library.
I think you could bash it together yourself with raw syscalls, because that can't really be stopped once you have a syscall interface, but basically the Go runtime is built around assuming it won't be forked. I have no idea what would happen to even a "single threaded" Go program if you forked it, and I have no intention of finding out. The lowest level option given in the syscall package is ForkExec: https://pkg.go.dev/syscall#ForkExec And this is a package that will, if you want, create new event loops outside of the Go runtime's control, set up network connections outside of the runtime's control, and go behind the runtime's back in a variety of other ways... but not this one. If you want this, you'll be looking up numbers yourself and using the raw Syscall or RawSyscall functions.
> I have no idea what would happen to even a "single threaded" Go program if you forked it, and I have no intention of finding out.
I'm not an expert on Go internals, but the GC in Go is multithreaded, so I would assume forking will kill the GC. Better hope it's not holding any mutexes.
TL;DR: if another thread is holding a lock when you fork, that lock will be stuck locked in the child, but the thread that was holding it no longer exists there.
So if your multi-threaded program uses malloc you may fork while a global allocation lock is being held and you won't be able to use malloc or free in the child (thread-local caches aside).
There are other problems but this is the basic idea. To be fork-safe you need to allow any thread to just disappear (or halt forever) at any point in your program.
malloc has to guard its locks against fork, probably using pthread_atfork, or some lower level internal API related to that.
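For concreteness, the usual pattern looks something like this (a generic sketch of a library guarding one internal lock, not glibc's actual malloc code): take the lock before the fork so no thread holds it mid-operation, then release it on both sides.

```c
/* Sketch of the standard pthread_atfork() pattern for keeping a library's
 * internal lock consistent across fork(). */
#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

static void lib_prepare(void) { pthread_mutex_lock(&lib_lock); }   /* before fork */
static void lib_parent(void)  { pthread_mutex_unlock(&lib_lock); } /* in parent   */
static void lib_child(void)   { pthread_mutex_unlock(&lib_lock); } /* in child    */

__attribute__((constructor))
static void lib_init(void)
{
    pthread_atfork(lib_prepare, lib_parent, lib_child);
}
```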
The problem with pthread_atfork is third party libs.
YOU will use it in YOUR code. The C library will correctly use it in its code. But you have no assurance that any other libraries are doing the right things with their locks.
Your "third party libs" includes system libraries like libdl.
We had a Python process using both threads (for stuff like background downloads, where the GIL doesn't hurt) and multiprocessing (for CPU-intensive work), and found that on Linux, the child process sometimes deadlocks in libdl (which Python uses to import extension modules).
The fix was to use `multiprocessing.set_start_method('spawn')` so that Python doesn't use fork().
Also if, for any reason, you end up doing a `fork()` syscall directly rather than via libc you'll still have a problem as appropriate cleanup won't happen.
Of course, the best answer to that is usually going to be "don't do that"!
Right, and so POSIX "fixed" that by standardizing posix_spawn. Thus fork is now mainly for those scenarios in which exec is not called, plus traditional coding that is portable to old systems.
Apologies if this is a silly question, but it seems like there's a false dichotomy here:
(1) You have separate fork() (etc.) and exec(), so that in the brief window in between you can set all the properties of the new process using APIs that exist anyway for controlling your own process.
(2) You have a single call to spawn a new process, but you have a million different options to control every aspect of the new process.
Why not do it this other way instead? Perhaps a bit late now but seems like in retrospect it would give the API simplicity of fork+exec without any of the complications.
(3) There are two steps to run a new process. The first fully sets up its memory and returns a PID, but doesn't start running it. The second call, unfreeze(), allows it to begin executing code. All the usual APIs that exist anyway for controlling your own process take an extra parameter specifying the PID of a frozen child (or -1 for the current process).
There is something about fork which I have never understood. Maybe someone here can explain it to me.
Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork). If you know that you're going to call exec immediately after fork, then all the issues of dealing with the (potentially large) address space of the parent just evaporate because the child process is just going to immediately discard it all.
So why is there not a fork-exec combo? And why has it not replaced fork for 99% of use cases?
And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
Because there are many, many use cases where you don't want to call exec() immediately after fork().
Want to constrain memory usage or CPU time of an arbitrary child process? You have to call setrlimit() before exec(). Privilege separation? Call setuid() before exec(). Sandbox an untrusted child process in some way? Call seccomp() (or your OS equivalent) before exec(). And so on and so forth. Any time you want to change what OS resources the child process will have access to, you'll need to do some set-up work before invoking exec().
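A minimal sketch of that pattern (error handling trimmed, and the specific limit and uid values are just examples, not recommendations):

```c
/* Sketch: fork, constrain the child, then exec; the parent waits. */
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int run_constrained(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                       /* child: set up, then exec */
        struct rlimit cpu = { 10, 10 };   /* e.g. 10 seconds of CPU time */
        setrlimit(RLIMIT_CPU, &cpu);
        setuid(1000);                     /* drop privileges (example uid) */
        execvp(argv[0], argv);
        _exit(127);                       /* exec failed */
    }

    int status;                            /* parent: wait for the child */
    waitpid(pid, &status, 0);
    return status;
}
```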
Windows solves this by adding a bunch of optional parameters to CreateProcess, as well as having two more variants (CreateProcessAsUser and CreateProcessWithLogon). Some of the arguments are complicated enough that they have helper functions to construct them.
I like the more composable fork()->modify->exec() approach of unix, but I wouldn't call either of them really elegant.
The one I've favored while reading these arguments has been the "suspended process" model. The primitives are CREATE(), which takes an executable as a parameter and returns the PID of a paused process, and START(), which allows the process to actually run.
Unix already has the concept of a paused executable, after all.
This model also requires all the process-mutation syscalls, like setrlimit(), to accept a PID as a parameter, but prlimit() wound up being created anyway, because the ability to mutate an already-running process is useful.
A third way is to grant the parent process access to the child such that they can use the child process handle to "remotely" set restrictions, write memory, start a thread, etc.
Practically, syscall overhead has gotten in the way of that being ubiquitous in the past. Here's to hoping that newer models of syscalls that reduce kernel/user overhead make such a thing possible.
To me this feels like a call for more powerful language primitives, i.e. a way to specify some action to take to "set up" the child process that's more explicit and readable than one special call behaving in a particularly odd way. I'm imagining closures with some kind of Rust-like move semantics, but not entirely sure.
(if we're speaking in terms of greenfield implementation of OS features)
"Process control in its modern form was designed and implemented within a couple of days. It is astonishing how easily it fitted into the existing system; at the same time it is easy to see how some of the slightly unusual features of the design are present precisely because they represented small, easily-coded changes to what existed. A good example is the separation of the fork and exec functions. The most common model for the creation of new processes involves specifying a program for the process to execute; in Unix, a forked process continues to run the same program as its parent until it performs an explicit exec. The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else."
OK, but why has it not been replaced with something better in the intervening 50 years? There have been a lot of improvements to Unix since 1970. Why not this?
I think the reason for fork() and exec() as primitives goes back to the early Unix design philosophy. Unix tends to favour "easy and simple for the OS to implement" rather than "convenient for user processes to use". (For another example of that, see the mess around EINTR.) fork() in early Unix was not a lot of code, and splitting into fork/exec means two simple syscalls rather than one needing a lot of extra fiddly parameters to set up things like file descriptors for the child.
There's a bit on this in "The Evolution of the UNIX Time-Sharing System" at https://www.bell-labs.com/usr/dmr/www/hist.html -- "The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else." It says the initial fork syscall only needed 27 lines of assembly code...
(Edit: I see while I was typing that other commenters also noted both the existence of posix_spawn and that quote...)
> Unix tends to favour "easy and simple for the OS to implement"
Well, yeah, but the whole problem here, it seems to me, is that fork is not simple to implement precisely because it combines the creation of the kernel data structures required for a process with the actual initiation of the process. Why not mkprocess, which creates a suspended process that has to be started with a separate call to exec? That way you never have to worry about all the hairy issues that arise from having to copy the parent's process memory state.
It was simple specifically for the people writing it at the time. We know this, because they've helpfully told us so :-) It might or might not have been harder than a different approach for some other programmers writing some other OS running on different hardware, but the accidents of history mean we got the APIs designed by Thompson, Ritchie, et al, and so we get what they personally found easy for their PDP7/PDP11 OS...
Long ago in the far away land of UNIX, fork was a primitive because the primary use of fork was to do more work on the system. You likely were one of three or four other people vying for CPU time at any given moment, and it wasn't uncommon to see loads of 11 on a typical university UNIX system.
> so why is there not a fork-exec combo
you're looking for system(3). Turns out, most people waitpid(fork()). Windows explicitly handles this situation with CreateProcess[0] which does a way better job of it than POSIX does (which, IMO, is the standard for most of the win32 API, but that's a whole can of worms I won't get into).
> why would anyone ever use vfork?
Small shells, tools that need the scheduling weight of "another process" but not for long, etc. See also, waitpid(fork()).
When you have something with MASSIVE page tables, you don't want to spend the time copying the whole thing over. There's a huge overhead to that.
system(3) is not a good alternative because it indirects through the shell, which adds the overhead of launching the shell as well as the danger of misinterpreting shell metacharacters in the command if you aren’t meticulous about escaping them correctly.
`fork` is a classic example, as others have mentioned, of something that was implemented because it was [at the time] easy, rather than because it was a good design. In the decades since, we've found there are issues that are caused by the semantics of fork, especially if the most common subsequent system call is `exec`.
If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
POSIX specifies a fork+exec combo called posix_spawn. This is actually used a fair amount, but the reason it isn't used more is because it doesn't support all of the eleventy-billion parameters governing all of the attributes of the spawned process. Instead, these parameters are usually set by calling system calls that change these parameters between fork and exec. These system calls might, for example, change the root directory of a process or attach a debugger. Neither of these are supported by posix_spawn, which only allows the common operations of changing the file descriptors or resetting the signal mask in the list of actions to do.
And this suggests why you might want vfork: vfork allows you to write something that looks like posix_spawn: you get to fork, do your new-process attribute setting, and then exec the new process image, all while being able to report errors in the same memory space.
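Roughly like this (a sketch that leans on Linux/BSD behaviour being more permissive than the letter of POSIX, which only blesses exec and _exit in the vfork child):

```c
/* Sketch: the vfork child shares the parent's memory until exec, so an
 * exec failure can be reported back through an ordinary variable. */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

pid_t spawn_with_vfork(char *const argv[], int *exec_errno)
{
    volatile int err = 0;            /* lives in memory shared with the child */
    pid_t pid = vfork();

    if (pid == 0) {                  /* child: minimal setup, then exec */
        execvp(argv[0], argv);
        err = errno;                 /* visible to the parent via shared memory */
        _exit(127);
    }

    if (exec_errno)
        *exec_errno = err;           /* parent resumes only after exec/_exit */
    return pid;
}
```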
> If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
Or if you happen to be sane you'll have a single, simple system call to create a blank, suspended child process, and all the regular system calls which operate on process state will take a handle or process "file descriptor" to indicate which process to modify rather than assuming the current process as the target.
This was the ultimate flaw of posix_spawn(). As you point out it doesn't support all the things you might want to tweak in the child process—a consequence of trying to capture every aspect of the initial process state in a single process-creation API rather than distributing the work through the normal system calls so that each new interface or state can be adjusted for child processes in the same way that it's adjusted for the current process.
Whatever you do, though, make sure it's possible to emulate fork() reliably with your "better" replacement. Consider the case of Cygwin where emulated fork() calls can (and frequently do) fail in bizarre ways because the "blank" child process was pre-loaded with some unexpected virtual memory mapping by AV software or other system tasks, with the result that a required DLL or private memory space can't be set up at same address used in the parent.
Most APIs can be extended. The problem is that when someone adds a new tunable parameter or resource that one might want to modify for a child process it doesn't automatically get added to posix_spawn()—that takes extra effort. Which is why I emphasized using the same APIs for the current process and child processes, rather than duplicating the work in two places.
fork() without exec() can make sense in the context of a process-per-connection application server (like SSH). I've also used it quite effectively as a threading alternative in some scripting languages.
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(). Like a lot of POSIX APIs, it's kind of overcomplicated, but it does solve a lot of the problems with fork/exec.
> And as long as I'm asking stupid questions, why would anyone ever use vfork?
For processes with a very large address space, fork() can be an expensive operation. vfork() avoids that, so long as you can guarantee that it'll immediately be followed by an exec().
fork with copy-on-write semantics avoids copying the whole address space. It does have to copy some data structures that manage virtual memory and maybe the first level of the paging structure(page directory or whatever).
Can you elaborate on this? I understand why copying a large address space might be slow, but how or why does the number of threads in a process affect this? Is it scheduling?
Copy-on-write means twiddling with the MMU, and TLB updates across cores ("TLB shootdowns") can be very expensive. If the process is not threaded, then the OS could make sure to schedule the child and parent on the same CPU to avoid needing TLB shootdowns, but if it's threaded, forget about it.
From "Operating Systems: Three Easy Pieces" chapter on "Process API" (section 5.4 "Why? Motivating The API") [1]:
> ... the separation of fork() and exec() is essential in building a UNIX shell, because it lets the shell run code after the call to fork() but before the call to exec(); this code can alter the environment of the about-to-be-run program, and thus enables a variety of interesting features to be readily built.
>
> ...
>
> The separation of fork() and exec() allows the shell to do a whole bunch of useful things rather easily. For example:
>
> prompt> wc p3.c > newfile.txt
>
> In the example above, the output of the program wc is redirected into the output file newfile.txt (the greater-than sign is how said redirection is indicated). The way the shell accomplishes this task is quite simple: when the child is created, before calling exec(), the shell closes standard output and opens the file newfile.txt. By doing so, any output from the soon-to-be-running program wc are sent to the file instead of the screen.
As an explanation it doesn't make much sense, because there are other ways to alter the environment of the about-to-be-run program (see any non-Unix OS for examples).
Because "fork" was easy to implement in UNIX on the PDP-11.
The original implementation was for a machine with very limited memory. So fork worked by swapping out the process. But then, instead of releasing the in-memory copy, the kernel duplicated the process table entry. So there were now two copies of the process, one in memory and one swapped out. Both were runnable, even if there wasn't enough memory for both to fit at once. Both executed onward from there.
And that's why "fork" exists. It was a cram job to fit in a machine with a small address space.
# function1 and function2 are shell functions
$ function1 | grep foo | function2
here, the shell must fork a process (without exec) to run one of these functions.
For instance function1 might run in a fork, the grep is a fork and exec of course, and function2 could be in the shell's primary process.
In the POSIX shell language, fork is so tightly integrated that you can access it just by parenthesizing commands:
$ (cd /path/to/whatever; command) && other command
Everything in the parentheses is a sub-process; the effect of the cd, and any variable assignments, are lost (whether exported to the environment or not).
In Lisp terms, fork makes everything dynamically scoped, and rebinds it in the child's context: except for inherited resources like signal handlers and file descriptors.
Imagine every memory location having *earmuffs* like a defvar, and being bound to its current value by a giant let, and imagine that being blindingly efficient to do thanks to VM hardware.
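To make that concrete, here is a toy illustration (not real shell code) of why the parenthesized sub-shell falls out of fork() for free, with no exec at all: the child keeps the whole interpreter state, and its cd and variable assignments die with it.

```c
/* Toy sketch of "( cd /path/to/whatever; command ) && other command":
 * the child is still the same shell, interpreting with modified state. */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                          /* the "( ... )" part */
        chdir("/path/to/whatever");          /* affects only this copy */
        setenv("VAR", "only-visible-here", 1);
        /* ...keep interpreting commands with the modified state... */
        _exit(0);
    }
    int status;                               /* parent shell: cwd/env untouched */
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```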
I use fork a lot in my Python science programs. It's really great - you can stick it in a loop and get immediate parallelism. It's much better than multiprocessing, etc, as you keep the state from just before the fork happened, so you can share huge data structures between the processes, without having to process the same data again or duplicate them. I've even written a module for processing things in forked processes: https://pypi.org/project/forkqueue/
Splitting fork and exec allows you to do stuff before calling exec, for example redirecting file descriptors (like stdin/out/err), creating a pipe, modifying the child's environment, and so on.
That would be the fugliest, most unwieldy API in history. In addition to the two most basic things I mentioned, there are namespaces, control groups, setuid/setgid, and probably a billion other things I can't think of.
There are so many variations to what you can do with fork+exec that designing a suitable "fork-exec combo" API is really difficult, so any attempts tend to yield a fairly limited API or a very difficult-to-use API, and that ends up being very limiting to its consumers.
On the flip side, fork()+exec() made early Unix development very easy by... avoiding the need to design and implement a complex spawn API in kernel-land.
Nowadays there are spawn APIs. On Unix that would be posix_spawn().
> And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
(Not a stupid question.)
You'd use vfork() only to finish setting up the child side before it execs, and the reason you'd use vfork() instead of fork() is that vfork()'s semantics permit a very high performance implementation, while fork()'s semantics preclude one altogether.
I think it's actually a pretty useful primitive for doing multiprocessing. Unlike threading, you have a completely separate memory space both for avoiding data races and performance (memory allocators still aren't perfect and weird stuff can happen with cache lines). Unlike exec after fork or anything equivalent, you still get to share things like file descriptors and read only memory for convenience.
> Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork).
If you eliminate fork, then what do you do for those 1% of cases where you actually do need it? I agree that it's uncommon, but I have written code before that calls fork() but then does not exec().
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(3).
> And why has it not replaced fork for 99% of use cases?
Even though it's been around for about 20 years, it's still newer than fork+exec, so I assume a) many people just don't know about it, or b) people still want to go for maximum compatibility with old systems that may not have it, even if that's a little silly.
Lacking fork(), if you want to multi-process a service, you have to spawn (vfork()+exec() or posix_spawn(), or whatever) the processes and arrange for them to get whatever state and resources they need to start up. It's a pain, but I've done it.
You might want to move around some file descriptors if you don't want the child process to inherit your stdin/stdout/stderr (e.g. if you want to read the stdout of the process you launched, or give it some stdin).
And there does exist such a fork-exec combo - posix_spawn. It allows adding some "commands" of what file descriptor operations to do between the fork & exec before they're ever done, among some other things. But, as the article mentions, using it is annoying - you have to invoke various posix_spawn_file_actions_* functions, instead of the regular C functions you'd use.
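For reference, the dance looks something like this (a sketch; the command, file name and mode are made up for illustration):

```c
/* Sketch: posix_spawn with file actions, redirecting the child's stdout
 * to a file without the caller ever calling fork() itself. */
#include <fcntl.h>
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int spawn_ls_to_file(void)
{
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    /* "Commands" recorded now, executed between the implicit fork and exec: */
    posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "out.txt",
                                     O_WRONLY | O_CREAT | O_TRUNC, 0644);

    char *argv[] = { "ls", "-l", NULL };
    pid_t pid;
    int rc = posix_spawnp(&pid, "ls", &fa, NULL, argv, environ);

    posix_spawn_file_actions_destroy(&fa);
    if (rc != 0)
        return rc;                   /* spawn failed; rc is an errno value */

    int status;
    waitpid(pid, &status, 0);
    return 0;
}
```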
The whole idea of fork is strange - the design pattern of "child process is executing exactly where the parent process is executing" is foreign to me. Don't we want to direct where the child process is executing? Like, when creating a thread? Why is fork() so conceptually orthogonal to that? Is there a good reason? A historical reason?
I don't find fork() to be obvious or useful or natural. I work hard to never do it.
Oh I understand how it works. I implemented it, in the first POSIX implementation. I just don't get how anybody wants to do that.
Yes, there's the example right there. But it shows the awkwardness immediately - decoding what the f happened by checking a side effect (is pid == 0? wtf?)
How about spoon(handle_connection, ...) or something like that? See how much better?
It makes it more difficult to pass context. You have to resort to the classical void * context, which is not handy to use. Or you have to use globals. The fork idea is more elegant to me: it duplicates the program flow execution in place.
If you want the child to start executing some other code but you have fork(), it's easy to do it yourself by calling that function.
But on the other hand, if you do want the child to execute code at the same place as the parent, but a hypothetical fork() asks you to provide a function pointer, it would be a bit more complicated.
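A sketch of that point: given fork(), a "spoon(fn, arg)"-style wrapper (the name borrowed from the hypothetical API upthread, not a real call) is only a handful of lines.

```c
/* Sketch: a spawn-a-function helper built on top of fork(). */
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t spoon(void (*fn)(void *), void *arg)
{
    pid_t pid = fork();
    if (pid == 0) {         /* child: run the requested function, then exit */
        fn(arg);
        _exit(EXIT_SUCCESS);
    }
    return pid;             /* parent: child's pid, or -1 on failure */
}
```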
It's a leaky abstraction and everything it does can be done manually, and possibly better. It exists purely because, at some point in the past, threads didn't exist.
If you design your program without fork, you'll probably end up with a cleaner and faster solution. Some things are best forgotten or never learned in the first place.
The beauty of (v)fork(+exec) is that it doesn't need a new interface for configuring the environment in whichever way you want before the other process starts. Instead you get to use the exact same means of modifying the environment to your needs, and once it's done, you can call exec and the new process inherits those things.
I mean, just look at the interface of posix_spawn.
I grant though that this isn't without its problems (including performance) and IMO e.g. FD_CLOEXEC is one example of how those problems can be patched up. It's like the reverse problem: the implicit interface is too wide, and then you need to come up with all these ways to be explicit about some things.
Add to that, fork is (was) very inefficient. You had to duplicate the entire process state (page tables etc). Then the damn program would exec(), and you would tear it all down again. Took 100ms on older computers. Complete waste.
We would resort to making a weak copy, with page tables faulting in only if you used them. A lot of drama, so the user could make a goofy call that they didn't really want most of the time.
Another option is to allow the parent to create an empty child process, and then make arbitrary system calls and execute code in the child, like a debugger does. In most cases the last "remote system call" would be exec.
One use case for fork()--which is used extensively on Android--is to build an expensive template process that can then be replicated for later work, which is exactly what people often want for the behavior with virtual machines. I wrote an article on the history of linking and loading optimizations leading up to how Android handles their "zygote" which touches on this behavior.
We had the case that some library we were using (OpenBLAS) used pthread_atfork. Unfortunately, the atfork handler was buggy in certain situations involving multiple threads and caused a crash. This was annoying because we basically did not need fork at all, but just fork+exec (for various other libraries spawning sub-processes), where those atfork handlers would not be relevant.
Our solution was to override pthread_atfork to ignore any functions, and in case this is not enough, also fork itself to just directly do the syscall without calling the atfork handlers.
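Not their actual code, but the shape of those two overrides might look roughly like this on x86-64 Linux (raw clone argument order differs on other architectures, and symbol-interposition details are glossed over). Note the caveat a sibling comment raises: going around libc's fork() also skips libc's own post-fork housekeeping.

```c
/* Sketch: (1) swallow pthread_atfork registrations so buggy library
 * handlers are never installed; (2) a fork() that goes straight to the
 * kernel, so no atfork handlers run at all. */
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int pthread_atfork(void (*prepare)(void), void (*parent)(void),
                   void (*child)(void))
{
    (void)prepare; (void)parent; (void)child;
    return 0;                        /* pretend registration succeeded */
}

pid_t fork(void)
{
    /* Raw clone with SIGCHLD as the termination signal and no flags
     * behaves like fork on x86-64, bypassing glibc's atfork machinery. */
    return (pid_t)syscall(SYS_clone, SIGCHLD, NULL, NULL, NULL, 0);
}
```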
posix_spawn() shouldn't call atfork handlers. It's allowed to call them or not call them because implementors can use fork(), which must call them, or they can use vfork(), which must not call them -- or they can make posix_spawn() a proper system call, too, or they can use clone(), or my putative avfork(), or whatever.
If you used vfork(), you wouldn't have had this problem.
Fork-safety issues arise mainly because of the sharing of resources between the parent and child. pthread_atfork() exists mainly to allow libraries to add a measure of fork-safety by letting them disable things on the child-side of fork() or re-set-up things on the child-side of fork(). For example, a PKCS#11 provider might need to create a new connection to the tokens and re-C_Login() to them (except, since it really can't quite do that, most likely it must render every session inoperable on the child-side). (Indeed, PKCS#11 specifically mandates that on the child-side of fork all sessions must be dead and must not be used.)
The good/evil/etc. here seem to be defined exclusively around "performance above all else", and - more specifically - performant primitives over performant application architecture.
It strikes me that performance gains associated with sharing address space & stack are similar to many performance gains: trade-offs. So calling them "good" and "evil" when performance is seemingly your sole goal and interest seems a bit forward.
In my world we often say things like "X is the moral equivalent of Y" where X and Y are just technologies and, clearly, are morally-neutral things.
Why do we do this? Well, because it adds emphasis, and a dash of humor.
Clearly fork() is neither Good nor Evil. It's morally neutral. It has no moral value whatsoever. But to say "fork() is evil" is to cause the audience to raise their eyebrows -"what, why would you say fork() is evil?!"- and maybe pay attention.
Yes, there is the risk that the audience might react dismissively because fork() obviously is morally-neutral, so any claim that it is "evil" must be vacuous or hyperbolic. It's a risk I chose to take.
Really, it's a rhetorical device. I think it's pretty standard. I didn't create that device myself -- I've seen it used before and I liked it.
Morally-neutral does not equate to neutral insofar as I think most technologists consider some tech to be "good" and some to be "bad" in a practical sense.
"Good -vs- evil" is obviously hyperbolic - particularly the latter - but outside of morals they still imply a tendency to be technically/practically good or bad in an objective sense. So discounting it as a mere rhetorical device seems overly dismissive.
Fork() is the second worst idea in programming, behind null pointers. Fork() is the reason overcommit exists, which is the reason my web browser crashes if I open too many tabs, and the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library. It's a clear example of "worse is worse", and we should have switched to the Microsoft Windows model decades ago.
Here's a paper from Microsoft Research supporting this point of view:
> the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library
Linux overcommitment is often cited as an argument for the "panic on OOM" design of the allocating parts of the Rust standard library, and it's an important part of the story. But I think even if the Linux defaults were different, Rust would still have gone with the same design. For example, here's Herb Sutter (who works for Microsoft) arguing that C++ would benefit from aborting on allocation failure: https://youtu.be/ARYP83yNAWk?t=3510. The argument is that the vast majority of allocations in the vast majority of programs don't have any reasonable options for handling an alloc failure besides aborting. For languages like C++ and Rust, which want to support large, high-level applications in addition to low-level stuff, making programmers litter their code with explicit aborts next to every allocation would be really painful.
I think it's very interesting that Zig has gone the opposite direction. It could be that writing big applications with lots of allocs ends up feeling cumbersome in Zig, or it could be that they bend the curve. Fingers crossed.
Why is overcommit a problem? A program is unlikely to use all the memory that it allocates, or it will use it only at a later time. It would be a waste not to have overcommit: it would mean having a ton of RAM that never gets used, because a lot of programs allocate more RAM than they will probably ever need. And it would be inefficient, costly and error prone to use dynamic memory allocation for everything.
The cause of your browser crash is not the overcommit, it is simply the fact that you don't have enough memory. If you disable overcommit (something you can do on Linux) you would see the same crash earlier, before you had allocated (not necessarily used) 100% of your RAM (because really no software handles the dynamic-memory failure condition, i.e. malloc returning null, which you can't handle reasonably anyway).
Null pointers are not a mistake, how do you signal the absence of a value otherwise? How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)? null makes a perfect sense to be used as a value to signal "this pointer doesn't point to something valid".
Microsoft saying that fork() was a mistake... well, of course, because Windows doesn't have it. fork was a good idea, and that is the reason why it's still used these days. Of course nowadays there are evolutions: in Linux there is the clone system call (the fork syscall is mostly kept for compatibility reasons; the glibc fork is implemented with the clone system call). But the concept of creating a process by cloning the resources of the parent is something that always seemed very elegant to me.
In reality fork is something that (if I remember correctly, I don't have that much experience programming on Windows) doesn't exist on Windows, and the only way to create a new process of the same program is to launch the executable and pass the parameters from the command line, which is not that great for efficiency at all, and also can have its problems (for example if the executable was deleted, renamed, etc. while the program was running). Also in Windows there is no concept of exec, though I think it can be emulated in software (while fork can't).
To me it makes perfect sense to separate the concept of creating a new process (fork/clone) and loading an executable from disk (exec). It gives a lot of flexibility, at a cost that is not that high (and there are alternatives to avoid it, such as vfork or variations of the clone system call, or directly higher level API such as posix_spawn).
I think much of the confusion around nulls stems from the fact that in mainstream languages pointers are overloaded for two purposes: for passing values by reference, and for optionality.
Nearly every pointer bug is caused by the programmer wanting one of these two properties, and not considering the consequences of the other.
Non-nullable references and pass-by-value optionals can replace many usages of pointers.
Yes, and they are just two usages of pointers. The fact is that, whatever you call it (null pointer, nullable reference, optional), you have to put into the language some concept of "a reference that might not refer to a valid object".
>How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)?
Rust does this with the Result and Option "enums", which are internally implemented as tagged unions. From my understanding the only overhead with this implementation is the size taken by the tag and then any padding required for alignment.
It also helps that references in Rust are not nullable and working with pointers is fairly rare, so the type system can do a lot of heavy lifting for you rather than putting null checks all over the place. When you have &T you never have to worry about handling null in the first place!
The inventor, Tony Hoare, famously called them his "billion-dollar mistake". The better way to do it is with nullable types (which could internally represent null as 0 as a performance optimization). This is something Rust gets right.
Nullable types... they have the same problems as null pointers: if you don't care about handling the case they are null the program will crash, if you handle it, you can handle it also for null pointers. Well, they have a nicer syntax, and that's it. How much Rust code is full of `.unwrap()` because programmers are lazy and don't want to check each optional to see if it's valid? Or simply don't care about it, since having the program crash on an unexpected condition is not the end of the world.
The Rust code using `.unwrap()` is explicitly testing for a missing value and signaling a well-defined error when the prerequisites are not met. Contrast this with dereferencing a null pointer in C, where doing so results in undefined behavior.
More importantly, in Rust you don't have to allow the value to be missing. What Rust has but C does not is not nullable pointer types, but rather non-nullable ones—in C all pointers are potentially null, or dangling, or referencing incorrectly aliased shared memory, etc. Barring a programming error in marked `unsafe` code, or a compiler bug, if you have a plain reference in Rust not wrapped in Option<T> then it can't possibly be null (or invalid or mutable through other references) so you don't need to check for that and your program is still guaranteed not to crash when you use it.
Nullable/option types are explicit. Every time you ignore null, you have to make a conscious choice to do so, and it's prominent in the source code forever after.
The problem with null pointers is that you have to remember to check for null. For OO languages specifically, the other problem is that null pointers violate the Liskov substitution principle.
More importantly, all syscalls also take a target process as an argument, making the Windows version both simpler and more powerful than can be done with fork. Spawn is also a lot slower on Windows, but that is an implementation issue.
> Spawn is also a lot slower on Windows, but that is an implementation issue.
afaik most of that slowdown is because malware scanners (including Windows Defender) hook spawn to do blocking verification of what to launch. Which is an issue also present on eg. MacOS, and why it's also kinda slow to launch new processes (and can be subject to extreme latencies): https://www.engadget.com/macos-slow-apps-launching-221445977...
Which is yes an implementation problem, but also a problem that potentially changes/impacts the design. Like maybe it'd make sense to get a handle to a pre-verified process so that repeated spawns of it don't need to hit that path (for eg. something like Make or Ninja that just spam the same executable over and over and over again). Or the kernel/trusted module needs to in some way be involved & can recognize that an executable was already scanned & doesn't need to be re-scanned.
Very true (hence "most" not "all" in my statement :) ), but with AV disabled it's more or less on par with MacOS: https://www.bitsnbites.eu/benchmarking-os-primitives/ (not the best comparison given the wide variety of hardware in play, but for orders of magnitude it's probably good enough)
File creation on Windows is similarly massively impacted by search & AV.
I don't think there's anything inherent to the semantics of Win32 CreateProcess that makes it slow. But there's clearly something inherent to NT architecture that does, because it was just as true 25 years ago as it is today.
Windows doesn't have fork as you know it. It has a POSIX-ish fork-alike for compliance, but under the hood it's CreateThread[0] with some Magic.
In Windows, you create the thread with CreateThread, then are passed back a handle to that thread. You then can query the state of the thread using GetExitCodeThread[1], or if you need to wait for the thread to finish, you call WaitForSingleObject[2] with an Infinite timeout.
Aside: WaitForSingleObject is how you track a bunch of stuff: semaphores, mutexes, processes, events, timers, etc.
The flipside of this is that Windows processes are buckets of handles: a Process object maintains a series of handles to (threads, files, sockets, WMI meters, etc), one of which happens to be the main thread. Once the main thread exits, the system goes back and cleans up (as it can) the rest of the threads. This is why sometimes you can get zombie'd processes holding onto a stuck thread.
This is also how it's a very cheap operation to interrogate what's going on in a process ala Process Explorer.
If I had to describe the difference between Windows and Linux at a process model level, I have to back up to the fundamental difference between the Linux and Windows programming models: Linux is a kernel that has to hide its inner workings for its safety and security, passing wrapped versions of structures back and forth through the kernel-userspace boundary; Windows is a kernel that considers each portion of its core separated, isolated through ACLs, and where a handle to something can be passed around without worry. The Windows ABI has been so fundamentally stable over 30 years now because so much of it is built around controlling object handles (which are allowed to change under the hood) rather than manipulation of kernel primitives through syscalls.
Early WinNT was very restrictive and eased up a bit as development continued so that win9x software would run on it under the VDM. Since then, most windows software insecurities are the result of people making assumptions about what will or won't happen with a particular object's ACL.
There's a great overview of windows programming over at [3]. It covers primarily Win32, but gets into the NT kernel primitives and how it works.
A lot of work has gone into making Windows an object-oriented kernel; where Linux has been looking at C11 as a "next step" and considering if Rust makes sense as a kernel component, Windows likely has leftovers of Midori and Singularity [4] lingering in it that have gone onto be used for core functionality where it makes sense.
Overcommit exists any time you can have a debugger anyway.
fork() was a brilliant way to make Unix development easy in the 70s: it made it trivial to move a lot of development activity out of the kernel and into user-land.
But with it came problems that only became apparent much later.
unpopular opinion: null pointers (in at least java and c) are the single greatest metaphor in software development, and are the CS analog to the invention of zero
There was an article about exceptions the other day that lamented that exceptions are high latency because the exceptional path will be paged out. I would assume overcommit is to blame for that too.
That's probably a caching issue, and caching issues are a fact of life for the foreseeable future. (Could also be a disk swap issue, but probably not.)
Well, it's Linux's whole memory philosophy, really: you ask for data storage that may or may not actually be memory. This ties in with overcommit, because if you promise more memory than you have, you need a contingency plan. That means flushing caches, swapping data to disk, and dropping executable code (it is file-backed, so it can just be read back in later).
This fuzziness about what is and isn't in memory is why stuff that is rarely needed has to hit disk, which means a latency spike.
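You can see the ask-now-pay-later behaviour with a tiny experiment like the following (a rough sketch; the 2 GiB figure is arbitrary, it assumes the machine can actually back that much, and ru_maxrss being in KiB is a Linux-ism):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    static long max_rss_kib(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_maxrss;            /* KiB on Linux */
    }

    int main(void)
    {
        size_t len = (size_t)2 * 1024 * 1024 * 1024;   /* 2 GiB of address space */
        char *p = malloc(len);
        if (p == NULL) { perror("malloc"); return 1; }

        printf("after malloc: max RSS ~ %ld KiB\n", max_rss_kib());
        memset(p, 1, len);              /* now the pages really have to be backed */
        printf("after memset: max RSS ~ %ld KiB\n", max_rss_kib());

        free(p);
        return 0;
    }

The malloc() succeeds immediately while the RSS stays tiny; only touching the pages makes them real, which is exactly the promise-now, deliver-later model described above.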
"I won't bother explaining what fork(2) is -- if you're reading this, I assume you know.", If that applied to everything I looked at from HN I'd read precious little.
I didn't write it for HN. It wasn't a paper to publish in some Computer Science journal. It was just a github gist. If you don't get the subject, it's not for you. I might well write a paper now based on it, and then it might be a good read for you, but I still won't be writing it for you, but for people who are interested in the topic. The intended audience is small, expert on the matter, and probably even more opinionated than I am.
I found the article well written and informative even though it's not my area of expertise, I intended my comment as a light hearted reflection of the fact that a lot of articles on HN go over my head but are still worth a read to me, just like your article.
For those saying to use posix_spawn: What am I supposed to make of the writeup in the posix_spawn manpage though?
"...specified by POSIX to provide a standardized method of creating new processes on machines that lack the capability to support the fork(2) system call. These machines are generally small, embedded systems lacking MMU support"
Is this why no one uses it? It has this gratuitous opinion piece at the beginning that makes people think it's just for embedded systems and my dad's Amiga?
That's just some injected opinion, I assume from a glibc contributor who doesn't like posix_spawn. In any case, it is wrong.
Don't assume what is written in man pages is the truth. Some of them have a lot of opinion added. It can be useful to cross-check man pages between systems - they don't always call out non-portable options or behavior.
On some kernels posix_spawn() is a system call, or it takes flags that make it more efficient than fork+exec. Darwin is one such system; there you can even use POSIX_SPAWN_SETEXEC if you want to replace the current process with a new executable rather than create a child.
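For anyone who hasn't used it, a minimal posix_spawn sketch looks roughly like this (spawning ls is just an arbitrary example; no file actions or attributes are set):

    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int main(void)
    {
        pid_t pid;
        char *argv[] = { "ls", "-l", "/tmp", NULL };

        /* posix_spawnp() searches PATH; the NULLs mean default file actions
           and attributes.  Note it returns an errno value, not -1/errno. */
        int err = posix_spawnp(&pid, "ls", NULL, NULL, argv, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
            return 1;
        }

        int status;
        waitpid(pid, &status, 0);
        printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
        return 0;
    }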
Hah, that's pretty funny. Regardless of the motivation as written, the motivation I surmise is:
- some systems (e.g., Windows) lack fork() for various reasons
- vfork() is baaaad
- I know, let's do something like WIN32's spawn() or CreateProcess(), but, like, better
The middle item I have good reason to think is very likely. vfork() still has a bad rap from that old "vfork() Considered Dangerous" paper. That paper circulated a lot way back when, and was the reason vfork() was removed from some Unixes for a while (well, it was left as an alias of fork()) before it was eventually re-added. The Open Group participants would have been very aware of that paper, and that is almost certainly the reason that POSIX says this about vfork():
> Conforming applications are recommended not to depend on vfork(), but to use fork() instead. The vfork() function may be withdrawn in a future version.
So if fork() can't perform well, and the committee won't recommend the use of vfork(), what shall the committee do? Answer: design and specify posix_spawn(). It's not an unreasonable answer. Though, IMO of course, they should have un-obsoleted vfork().
Meta comment: Github Gist seems to be great for blogging. Yeah, the UI is not very blog-specific, but it has all the useful features, and then some: markdown, comments, hosting, an index of all posts, some measure of popularity (stars), a very detailed edit history, etc.
All without having to pay for or set up anything yourself.
Unfortunately, there's no way to turn off comments on a Gist, which makes it not a viable replacement for anyone who doesn't want to spend a lot of time processing and moderating comments.
Good point. However, you need a GitHub account to post comments so everyone knows who you are. Your reputation might suffer if you constantly post comments that require moderation.
This does not, in practice, stop people. Both because it's possible to make throw-away accounts, and because some people don't have a reputation to care about to begin with.
This avfork implementation is poor. You don't want to make your single threaded programs multi-threaded. I don't really get the big benefit of afork over other existing mechanisms other than handwaving about things being evil.
Also,
> Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via libsocket on top of STREAMS proved to be a very long and costly mistake. Emulating one API from another API with impedance mismatches is difficult at best.
Linux does have a thread creation system call. It's clone(2). It literally creates new threads of execution with various properties. It does not "emulate" threads, it is threads.
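To illustrate, here is a sketch of the glibc clone() wrapper used directly as a thread-creation call (the flag combination and stack size are just one reasonable choice, not the only one):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int thread_fn(void *arg)
    {
        const char msg[] = "hello from a clone()d thread\n";
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        return 0;
    }

    int main(void)
    {
        const size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);
        if (stack == NULL) { perror("malloc"); return 1; }

        /* CLONE_VM and friends make the child share the parent's address space,
           file descriptors, filesystem info and signal handlers: a thread, not
           a copied process.  SIGCHLD as the exit signal lets waitpid() work. */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
        pid_t tid = clone(thread_fn, stack + stack_size, flags, NULL);
        if (tid == -1) { perror("clone"); free(stack); return 1; }

        waitpid(tid, NULL, 0);
        free(stack);
        return 0;
    }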
You do, but it's not a good implementation for a general API is all I was trying to say.
Do you really need an "asynchronous process creation" call? The rationale is that "blocking is bad", but a thread creation system call blocks the caller too until the thread is created. So it's not just "blocking", it's the amount of blocking if anything. Is posix_spawn or vfork+exec really too slow for your case?
Then multi-process and multi-threading seems like a reasonable solution. Asynchronous system calls are the exception not the rule in unix. So it wouldn't make sense as a traditional afork(2) system call. You could probably do a posix_spawn for io_uring, but do you really need to?
- @famzah's blog about fork vs vfork vs clone performance:
https://blog.famzah.net/tag/fork-vfork-popen-clone-performance/
- A very similar idea to my afork() idea, from 2 years earlier:
https://developers.redhat.com/blog/2015/08/19/launching-helper-process-under-memory-and-latency-constraints-pthread_create-and-vfork
- misc
https://inbox.vuxu.org/tuhs/CAEoi9W6HFL3UcnWkKoqka8Dt16MWskKd6yEJr3HYCcCT9pMTig@mail.gmail.com/T/
https://bugzilla.redhat.com/show_bug.cgi?id=682922 (see attachments)
The intent of fork() is to start a new process in its own address space. The *fork() variations that run in the SAME address space are confusing. A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages. But generally fork() is quite specific, from my recollection.
> The intent of fork() is to start a new process in its own address space.
True!
> The *fork() variations that run in the SAME address space are confusing.
Why is it confusing? They are distinct system calls with different semantics. They are also sufficiently similar that they are similarly named. But there's nothing confusing about their semantics. vfork() is not harder to use than fork() -- it's just subtly different.
> A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages.
I wouldn't expect that. Sandboxing is a large and complex topic.
Amusingly, vfork() semantics differ across OSes. This program prints 42 on Linux but 1 on Mac, because on Linux the parent and child share an address space: https://godbolt.org/z/jn7Gaf5Me
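The linked program is along these lines (a reconstruction, not the exact code; modifying anything other than the vfork() return value in the child is outside what POSIX guarantees, which is exactly why the behaviour diverges):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int x = 1;
        pid_t pid = vfork();
        if (pid == 0) {
            x = 42;          /* on Linux this writes into the parent's memory */
            _exit(0);
        }
        /* vfork() suspends the parent until the child _exit()s or exec*()s */
        printf("%d\n", x);   /* 42 where the address space is shared, 1 where it is not */
        return 0;
    }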
Unfortunately there was this paper from the 80s titled "vfork() Considered Dangerous", which led to BSDs removing vfork(), and then later it was re-added because that paper was clearly quite wrong. But the news hasn't quite filtered through to Apple, I guess.
I am pretty sure Mac OS doesn't COW fork(), and that the address space is copied. At least it was the last time I looked. FreeBSD and Linux both seem to COW.
My (very possibly wrong) understanding is that xnu does CoW fork but doesn't overcommit, meaning that memory must be reserved (perhaps in swap) in case the pages need to be duplicated.
There are other complications relating to inheriting Mach ports and the mach_task <-> BSD process "duality" in xnu, which Linux doesn't have. I'd love for someone to chime in who knows more about how this stuff works.
I started with DOS, where spawn() is the norm, so I've always considered the fork()-like behaviour to be unusual yet handy for certain use-cases. Perhaps a system call that offers a combination of the two behaviours should be named spork().
- vfork() is O(1)
- copying fork() is O(N), where N is the amount of writable memory in the parent's address space
- copy-on-write fork() is O(N), where N is the resident set size (RSS) of the parent
O(1) beats O(N).
And O(N) is just the complexity of fork() for a single-threaded parent process. Now imagine a very busy, threaded, large-RSS process that forks a lot. You get threads and child processes stepping all over each other's CoW mappings, causing lots of page faults and copies. Ok, that is still O(N), but users will feel the added pain of all those page faults and TLB shootdowns.
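If you want a rough feel for the numbers on your own machine, something like this will do (a crude sketch, not a rigorous benchmark; the iteration count and default size are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        size_t mib = argc > 1 ? strtoul(argv[1], NULL, 10) : 256;
        size_t len = mib * 1024 * 1024;

        char *buf = malloc(len);
        if (buf == NULL) { perror("malloc"); return 1; }
        memset(buf, 1, len);            /* touch every page so it counts toward RSS */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 100; i++) {
            pid_t pid = fork();
            if (pid == 0)
                _exit(0);               /* child does nothing at all */
            waitpid(pid, NULL, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total_us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                           (t1.tv_nsec - t0.tv_nsec)) / 1e3;
        printf("%zu MiB resident: %.1f us per fork+wait\n", mib, total_us / 100);
        free(buf);
        return 0;
    }

Run it with increasing sizes and the per-fork cost should grow roughly with the resident set, which is the O(N) claim above in concrete terms.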
Ok, but you're just repeating "it's inefficient" and not saying for what use case its inefficiency is even noticeable. I want to reason about when I would care. You see?
The first link didn't even have units on its numbers(!) I assume they're milliseconds. When does that scale become something one would care about at all? Not launching a gui process. Not a shell pipeline. So when is this issue arising at all? What is being done that makes fork inefficiency anything other than academic interest. Must be something, right? Forking webserver?
> When does that scale become something one would care about at all? Not launching a gui process. Not a shell pipeline.
Indeed, in those cases one just does not care about performance.
Yet there are cases where one does. Imagine an orchestration system written in Java -- with lots of threads (perhaps because it might be a thread-per-client affair, or maybe just NCPU threads), with a large heap (because Java), and launching lots of small tasks as external programs. Maybe those tasks are ssh commands (ok, sure, today you could use an SSH library in Java) or build jobs (maybe your app is a CI/CD orchestrator). Now launching external jobs is the core of what this does, and now the cost of fork() bites.
So for software architectures that separate concerns by spawning many short-lived processes and using message passing (which seems like a great idea, I just can't think of anything that does that, would love examples if they exist), it /could/ be a factor, but we have no numbers. Do you see it?
Let's just say I want to design a solution involving spawning a buttload of processes and passing messages back and forth. Roughly when does fork efficiency become something other than of academic concern? 10 processes per second, 1000, 100000? What does the inefficiency look like? Nothing? A stutter you might not notice? Or does everything grind to a halt, so you can't log in to the box and even the OOM killer won't help you?
That's a fair question. Basically, don't call fork() in Java (via JNI or the like), or use Java classes that do, and you might be fine; and if ever you're not, you'll know where to start looking.
Don't ever call fork from Java? Not even once? And what are the consequences of calling fork? A minor stutter? Halt and catch fire? I don't do Java, but it's hardly new tech. Surely someone has done some numbers on competing operating systems in the past couple of decades?
Until you quantify, even very roughly, what the observed issue is, when you see it, and how it degrades, whatever you're trying to optimize is just urinating into the breeze. "We might get lucky" is the best outcome. The chances of it being a really good outcome are pretty limited. Decrying something as "inefficient" based on big O or whatever is just meaningless until we actually measure it. [1]
[1] selection sort is O(n^2) and can totally dominate O(n log n) algorithms in actual time and cycles spent depending on circumstance. We have to specify, it's not something that can be shortcut because it will likely get a terrible result.
I have had to debug slow forking cases with Java. No I can't point you at data from those. I can point you to the Microsoft paper and @famzah's posts if you want data. For Microsoft this is an important topic: they don't want to have to implement a real fork(), and I fully understand why they don't want to. My guess is they will eventually buckle and do it. fork() is not easy to implement.
It's inherently inefficient because while the child process does its initialization (pre-exec) stuff, the parent gets page faults for every thread writing into the memory due to COW. This will basically stall the parent and can cause funny issues.
In another comment, I observe how Go doesn't even have a binding to fork.
Erlang is another example of that. There is no standard library binding to the fork function. If someone were to bash one into a NIF, I have no idea what would happen to the resulting processes, but there's no good that can come of it. (To use Star Trek, think less good and evil Kirk and more "What we got back, didn't live long... fortunately.") Despite the terminology, all Erlang processes are green threads in a single OS process.
> Despite the terminology, all Erlang processes are green threads in a single OS process.
The main Erlang runtime uses an M:N Erlang:native process model, not an N:1. So Erlang processes are like green threads (they are called processes instead of threads because they are shared-nothing), but not in a single process.
I mentioned this somewhere else but I thought Erlang does NOT share memory.
Doesn't that make Erlang a bit unique? It offers the ability to spawn a new process extremely fast AND memory isolation. This combination is what the OP was wanting to achieve.
Erlang mostly doesn't share memory between its Erlang processes, but it does this by making it so there's simply no way, at the Erlang level, of even writing code that refers to the memory in another Erlang process. It's an Erlang-level thing, not an OS-level thing.
If you write a NIF in C, it can do whatever it wants within that process.
The BEAM VM itself will share references to large binaries. Erlang, at the language level, declares those to be immutable so "sharing" doesn't matter. As an optimization, the VM could choose to convert some of your immutable operations into mutation-based ones, but if it does that, it's responsible for making the correct copies so you can't witness this at the Erlang level.
The Erlang spawn function spawns a new Erlang process. It does not spawn a new OS process. While BEAM may run in multiple OS processes per dragonwriter, the spawn function certainly isn't what starts them. The VM would.
So you cannot spawn a new Erlang process and then set its UID, priority, current directory, and all that other state that OS processes have, because an Erlang process is not an OS process. If the user wants to fork for some reason beyond simply running a program, because they want to change OS process attributes, Erlang is not a viable choice.
Erlang is not unique in that sense. It runs as a normal OS process. What abilities it has are implemented within that sandbox, no different than the JVM or a browser hosting a Javascript VM.
My reference to “fast” was in the context of creating a new process due to the OP post talking about how long fork/etc can take. Not in reference to executing code itself.
In that sense it's fast in the same way e.g. coroutines (/goroutines) are fast: it's just the Erlang scheduler performing some allocation (possibly from a freelist) and initialisation. Avoiding the kernel having to set things up, and the related context switches, makes for much better performance.
> I thought green threads share memory but Erlang processes do NOT share memory, which is what makes Erlang so unique.
Erlang processes don’t share memory because the language and vm don’t give primitives which let you do it. They all exist within the same address space (e.g. large binaries are reference-counted and stored on a shared heap, excluding clustering obviously).
> Did Erlang create a so called “green process”?
Yes.
> If so, why can’t this model be implemented in the kernel?
Because Erlang processes are not an antagonistic model, and the language restricts the ability to attack the VM (kinda, I'm sure you could write NIFs to fuck up everything, you just don't have any reason to as an application developer).
The problem is clone is more of a start phase after vfork but before fork regardless for github. So it's kind of a bit strange that we call vfork first but that is about templates too.
As for templates they need to be in different languages and in different formats for video games consoles, and so many other formats they port systems and games that sort of work digitally to certain things but not playable to certain things too.
The other problem is that clone is part of the syscall interface, part of various APIs, and part of a lot of other things too.
It's a rhetorical device. I didn't expect this to -years later- become a front-page item on HN. I wrote that to share with certain people.
And yes, clone() has some real problems, and if calling it "stupid" pisses off some people but also leads others to want to improve clone() or create a better alternative, then that's fine. If I'd wanted to write an alternative to Linux I'd probably have had to deal with the very, very fine language that Linus and others use on the Linux kernel mailing lists -- if you don't like my using the word "stupid", then you really shouldn't look there, because you're likely to be very disappointed. Indeed, not only would I have to accept colorful language from reviewers there, I'd probably have to employ some such language myself.
TL;DR: clone() came from Linux, where "stupid" is the least colorful language you'll find, and me calling it "stupid" is just a rhetorical device.