The dense fog lifts, tree branches part, a ray of light beams down on a pedestal revealing the hidden intentions of the ancients. A plaque states "The operational semantics of the most basic primitives of your operating system are designed to simplify the implementation of shells." You hesitantly lift your eyes to the item presented upon the pedestal, take a pause in respect, then turn away slumped and disappointed but not entirely surprised. As you walk you shake your head trying to evict the after image of a beam of light illuminating a turd.
Although it does say that vfork() is difficult to use safely, while the gist recommends it, I think there is still some clarity needed around the use cases.
Fork today is a convenient API for a single-threaded process with a small memory footprint and simple memory layout that requires fine-grained control over the execution environment of its children but does not need to be strongly isolated from them. In other words, a shell. It’s no surprise that the Unix shell was the first program to fork [69], nor that defenders of fork point to shells as the prime example of its elegance [4, 7]. However, most modern programs are not shells. Is it still a good idea to optimise the OS API for the shell’s convenience?
As u/amaranth pointed out, my gist predates the MSFT paper, which mostly explains why I didn't reference it. Though, to be fair, I saw that paper posted here back in 2019, and I commented on it plenty (13 comments) then. I could have edited my gist to reference it, and, really, probably should have. Sometime this week I will add a reference to it, as well as to this and that HN post, since they are clearly germane and useful threads.
I vehemently disagree with those who say that vfork() is much more difficult to use correctly than fork(). Neither is particularly easy to use though. Both have issues to do with, e.g., signals. posix_spawn() is not exactly trivial to use, but it is easier to use it correctly than fork() or vfork(). And posix_spawn() is extensible -- it is not a dead end.
My main points are that vfork() has been unjustly vilified, fork() is really not good, vfork() is better than fork(), and we can do better than vfork(). That said, posix_spawn() is the better answer whenever it's applicable.
Note that the MSFT paper uncritically accepts the idea that vfork() is dangerous. I suspect that is because their focus was on the fork-is-terrible side of things. Their preference seems to be for spawn-type APIs, which is reasonable enough, so why bother with vfork() anyways, right? But here's the thing: Windows WSL can probably get a vfork() added easily enough, and replacing fork() with vfork() will generally be a much simpler change than replacing fork() with posix_spawn(), so I think there is value in vfork() for Microsoft.
Use cases for vfork() or afork()? Wherever you're using fork() today to then exec, vfork() will make that code more performant and it generally won't take too much effort to replace the call to fork() with vfork(). afork() is for apps that need to spawn lots of processes quickly -- these are rare apps, but uses for them do arise from time to time. But also, afork() should be easier to use safely than vfork(). And, again, for Microsoft there is value in vfork() as a smaller change to Linux apps so they can run well in WSL.
BTW, see @famzah's popen-noshell issue #11 [0] for a high-perf spawn use case. I linked it from my gist, and, in fact, the discussion there led directly to my writing that gist.
If you are going to edit, the google query links with the #q=xyz format no longer seem to work, so maybe update them to the ?q=xyz format which still works.
(Also this article and discussions on it now take up many of the top spots, which I guess is the disadvantage to linking to google for a topic)
You see, an operating system as commonly conceived has at least two major jobs:
- abstract away underlying hardware
- safely multiplex resources
And do the above with as little overhead as possible.
Now the thing is: whenever you have multiple goals, you need to make trade-offs, and you aren't as good at any one goal as you could be.
So the exokernel folks made a suggestion in the 90s: let the OS concentrate on safely multiplexing resources, and do all the abstracting in user level libraries.
Normal application programming would mostly look the same as before, your libraries just do more of the heavy lifting. But it's much easier to swap out different libraries than it is to swap out kernel-level functionality.
That vision never caught on with mainstream OSes. But: widespread virtualisation made it possible. You can see hypervisors like Xen as exokernel OSes that do the bare minimum required to safely multiplex, but don't provide (many) abstractions.
Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
Meanwhile, programs with more complex requirements have to work around these APIs. And many programs call other programs, or otherwise have to do tricky process lifecycle management.
The lowest-level APIs should, in theory, cater to the most complex cases, not to the simplest ones. This doesn't prevent a simpler API from existing, but catering to a simple use case in the primitives does hinder more complex needs.
(I think the more nuanced point is that the OS itself might not have a much better design available in any case. Unixes have a lot of neat stuff, but it's a lot of "design by user feature request", and "standardize 4 slightly different ways of doing things", so there is a lot of weirdness and it's hard to have The Perfect API in that case)
> Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
You'd think that, but implementing the UNIX shell and all of its semantics (piping, redirection, waiting, child reaping, jobs, foreground/background, prompting, etc.) using fork/clone + exec* is much simpler than it is on, say, Windows. Some API designs are better for that specific task.
> Shells have relatively simple operational models, so _any_ API would probably be workable for shells.
True. Today anyways. Back in the 70s though, there was a lot of innovation going on around process spawning, and fork+exec almost certainly made it easy to play with those ideas. I'm referring to job control, for example. But also things like the parent-child relationships between the shell and all the processes in a pipeline -- not all shells have set those up the same way.
So, yeah, maybe we need not just posix_spawn() but posix_pipeline_spawn(), why not. Make it even easier to write a shell. After all, plumbing a complex pipeline with posix_spawn() requires a fair bit of code.
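For a sense of scale, here's roughly what even a two-stage pipeline (the equivalent of `ls | wc -l`) takes with posix_spawn(). This is just a sketch, with illustrative names and error handling omitted:

    /* Sketch: spawn the equivalent of "ls | wc -l" with posix_spawn(). */
    #include <spawn.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int run_pipeline(void) {
        int p[2];
        pid_t pids[2];
        posix_spawn_file_actions_t fa1, fa2;
        char *argv1[] = { "ls", NULL };
        char *argv2[] = { "wc", "-l", NULL };

        if (pipe(p) != 0)
            return -1;

        /* Producer: stdout -> write end of the pipe; close both pipe FDs. */
        posix_spawn_file_actions_init(&fa1);
        posix_spawn_file_actions_adddup2(&fa1, p[1], STDOUT_FILENO);
        posix_spawn_file_actions_addclose(&fa1, p[0]);
        posix_spawn_file_actions_addclose(&fa1, p[1]);

        /* Consumer: stdin <- read end of the pipe; close both pipe FDs. */
        posix_spawn_file_actions_init(&fa2);
        posix_spawn_file_actions_adddup2(&fa2, p[0], STDIN_FILENO);
        posix_spawn_file_actions_addclose(&fa2, p[0]);
        posix_spawn_file_actions_addclose(&fa2, p[1]);

        posix_spawnp(&pids[0], "ls", &fa1, NULL, argv1, environ);
        posix_spawnp(&pids[1], "wc", &fa2, NULL, argv2, environ);

        /* The parent must close its copies or the consumer never sees EOF. */
        close(p[0]);
        close(p[1]);

        waitpid(pids[0], NULL, 0);
        waitpid(pids[1], NULL, 0);
        return 0;
    }

And that's before job control, process groups, terminal handoff, or any error handling.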
Will any API do? Yes, provided it covers all the things Unix shells do nowadays. It's still easiest to get all the functionality (that a shell dev might want to build) with fork+exec though, especially since the shell author gets a great deal of control that way, though they get that at the price of having to know a great deal of stuff intimately. Arguably, anyone wishing to implement a posix_pipeline_spawn() would be like a shell developer.
The thing is that there are many other programs which require process control, which are not shells. Orders and orders of magnitude more programs which are not shells. So we can optimize an API for building shells, but it's not going to make writing those other programs easier.
Shells are cool and good, and I don't want to discount fork too much, just saying that the API design space isn't _only_ for shells.
> Yes, but why is this characterized as something negative?
Unfortunately, the text does not provide sufficient context. Shells are not properly supported in any OS (except perhaps Plan 9), since 1. the OS provides no enforced convention for the CLI interface (there is no enforced encoding standard or anything checkable), 2. the OS provides no rules keeping file names shell-friendly, and 3. there are no dedicated communication channels towards shells or between programs and shells.
So, all in all, shells remain a hack around a system that was "simple to implement initially" and is annoying to use and write in many corner cases.
> Shells simply developed features that users required of them.
Cross out "simply" and call it convenience+arbitrary complex scripting glue for 4 main goals:
1. piping
2. basic text processing
3. basic job control
4. path hackery
"The primary interface between the user and the OS" is the definition of "shell". That's why the Microsoft Windows process that draws the Start button and filesystem windows is called "the Windows shell".
I don't think OP meant shell as in the Windows shell, or Linux DEs. I mean, how many of those use fork() even on Linux, or would be easier to implement if they did?
Linux desktop environments do use fork(), and the Microsoft shell doesn't use fork() because Microsoft Windows doesn't have it.
In the Linux context, the fact that random things inherit a stdout appending to .xsession-errors and inherit environment variables is often useful. fork() also makes it fairly straightforward to do things like set a VM size limit or change an environment variable for a newly launched program, which is often useful when you're launching a program from just about anything. I don't know whether rearchitecting Microsoft Windows to work that way would have made the Windows Shell easier to write.
However, and this is the crucial point, fork() was impossible to support on Win16, because segment register values can be stashed anywhere in your 8086 program's memory, and they're just literally added to the offset address with a 4-bit shift, so there's no reliable way to make a copy of a running process elsewhere in memory that doesn't accidentally share segments with the original. You'd have to do what monocasa was saying old Unix did and checkpoint the process to disk. (I suspect Unix never did that, but it's similar to what PDP-11 Unix did do.)
Inheriting stdout etc does not require fork. It requires a spawn API that has a flag to inherit stdout, such as e.g. Win32 CreateProcess. Inheriting handles by default, on the other hand, is a recipe for hard-to-debug bugs.
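For comparison, a rough Win32 sketch of that explicit style (spawn_with_stdout is just an illustrative name, error handling mostly omitted). With bInheritHandles=TRUE, only handles explicitly marked inheritable get passed down:

    /* Sketch: pass stdout to a child explicitly rather than by default. */
    #include <windows.h>

    BOOL spawn_with_stdout(HANDLE out, char *cmdline) {
        STARTUPINFOA si = {0};
        PROCESS_INFORMATION pi;

        /* Mark just this handle as inheritable. */
        SetHandleInformation(out, HANDLE_FLAG_INHERIT, HANDLE_FLAG_INHERIT);

        si.cb = sizeof si;
        si.dwFlags = STARTF_USESTDHANDLES;
        si.hStdOutput = out;
        si.hStdInput  = GetStdHandle(STD_INPUT_HANDLE);
        si.hStdError  = GetStdHandle(STD_ERROR_HANDLE);

        if (!CreateProcessA(NULL, cmdline, NULL, NULL,
                            TRUE /* bInheritHandles */, 0, NULL, NULL, &si, &pi))
            return FALSE;

        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return TRUE;
    }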
Oh, I didn't mean without exec, but there are some programs like gnome-terminal that do that too. I just meant that forking, doing process configuration with system calls to open and close files and whatnot, and then running exec, is maybe a more convenient way to launch a program in a modified environment, than having a CreateProcess system call with fifty zillion flags.
Everything in Unix is a recipe for hard-to-debug bugs.
Sure. While clever and entertaining, I didn't find your comment to be a constructive contribution to the discussion. Also, I've found that attempts at humor on HN are often misinterpreted and can stir up trouble. (No, I did not downvote your comment.)
My comment contains more information more densely than what I could have stated flatly. This thread is the third longest on the post, and contains interesting and unique discussion. I don't see any troublesome misinterpretations.
Your concerns seem to be misplaced.
Emotionless propositional statements are not unconditionally better than other forms of writing.
In Ninja, which needs to spawn a lot of subprocesses but is otherwise not especially large in memory and which doesn't use threads, we moved from fork to posix_spawn (which is the "I want fork+exec immediately, please do the smartest thing you can" wrapper) because it performed better on OS X and Solaris:
The issue with posix_spawn is that you can't close all descriptors before exec. This is especially an issue as most libraries are still unaware they need to open every single handle with the close-on-exec flag set.
Indeed, it's very common to want to close all FDs other than 0, 1, and 2, of course, as well as a few other exceptions (e.g., a pipe a parent might read from, FDs on which flocks are held). The reason one often wants to close all open FDs besides those is simple: too many FDs that should be made O_CLOEXEC often aren't, and even when they are, too often there is a race to use fcntl() to do so on one thread while another one forks. Yes, there are new system calls that allow race-free setting of O_CLOEXEC on new FDs, but they will take a long time to be widely used.
I've implemented closefrom() type APIs more than once. Of course, I happen to know about Illumos', so there's that.
For implementations which don't have it, you can stuff, say, 4093 close-action entries into the file_actions, targeting descriptors 3 to 4095. This big file_actions object can be cached and re-used for multiple calls to posix_spawn.
It won't close descriptor 4096, but that's probably beyond giving a darn in most cases. If you have an application that opens high descriptor numbers, you probably know.
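A sketch of that trick (init_close_actions is an illustrative name). Note that whether a close action on a descriptor that isn't open is ignored or makes posix_spawn() fail has varied across implementations and versions, so check yours; newer glibc also has posix_spawn_file_actions_addclosefrom_np(), IIRC.

    /* Sketch: a reusable file_actions object that closes FDs 3..4095
     * in the child of every posix_spawn() call that uses it. */
    #include <spawn.h>

    static posix_spawn_file_actions_t close_high_fds;

    static int init_close_actions(void) {
        int rc = posix_spawn_file_actions_init(&close_high_fds);
        for (int fd = 3; rc == 0 && fd < 4096; fd++)
            rc = posix_spawn_file_actions_addclose(&close_high_fds, fd);
        return rc;   /* then pass &close_high_fds to posix_spawn() */
    }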
A better approach is to exec an intermediate helper program that will do it and then exec the actual intended program. One can also use this approach to do things like reset signal dispositions to SIG_IGN.
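A minimal sketch of such a helper (call it cleanexec; the name and the 4096 cutoff are arbitrary). Resetting signal dispositions would slot in the same way, just before the exec:

    /* Hypothetical "cleanexec" helper: close high FDs, then exec the real
     * target: cleanexec TARGET [ARGS...]. Exits 127 if the exec fails. */
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2)
            return 127;
        for (int fd = 3; fd < 4096; fd++)   /* or closefrom(3) where available */
            close(fd);
        execvp(argv[1], argv + 1);
        return 127;                          /* exec failed */
    }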
> Long ago, I, like many Unix fans, thought that fork(2) and the fork-exec process spawning model were the greatest thing, and that Windows sucked for only having exec() and _spawn(), the latter being a Windows-ism.
I appreciate this quite a bit. Vocal Unix proponents tend to believe that anything Unix does is automatically better than Windows, sometimes without even knowing what the Windows analogue is. Programming in both is necessary to have an informed opinion on this subject.
The one thing I miss most on Unix: the unified model of HANDLEs that enables you to WaitForMultipleObjects() with almost any system primitive you could want, such as an event with a socket (blocking I/O + a shutdown notification) in one call. On Unix, a flavor of select() tends to be the base primitive for waiting on things to happen, which means you end up writing adapter code for file descriptors to other resources, or need something like eventfd.
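A sketch of what that adapter code tends to look like on the Unix side, with an eventfd standing in for the Win32 event object (wait_socket_or_shutdown is an illustrative name):

    /* Sketch: wait on a socket OR a shutdown notification in one call. */
    #include <sys/eventfd.h>
    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    int wait_socket_or_shutdown(int sock, int shutdown_efd) {
        struct pollfd pfds[2] = {
            { .fd = sock,         .events = POLLIN },
            { .fd = shutdown_efd, .events = POLLIN },
        };

        if (poll(pfds, 2, -1) < 0)
            return -1;
        if (pfds[1].revents & POLLIN) {
            uint64_t v;
            read(shutdown_efd, &v, sizeof v);  /* consume the notification */
            return 1;                          /* told to shut down */
        }
        return 0;                              /* socket is readable */
    }

    /* Elsewhere, another thread requests shutdown with:
     *   uint64_t one = 1; write(shutdown_efd, &one, sizeof one);
     * where shutdown_efd = eventfd(0, EFD_CLOEXEC). */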
Things I don't miss from Windows at all: wchar_t everywhere. :)
- SIDs
- access tokens
(like struct cred / cred_t in Unix kernels,
but exposed as a first-class type to user-land)
- security descriptors
(like owner + group mode_t + ACL in Unix land,
but as a first-class type)
- HANDLEs, as you say
- HANDLEs for processes
Many other things, Windows got wrong. But the above are far superior to what Unix has to offer.
Superficial silliness like allocating 48 bits to encode integers in [0,18] aside, what problem do structured SIDs actually solve? I’ve been trying to figure that out for the last couple of days and I still don’t get it, possibly because the Windows documentation doesn’t seem to actually say it anywhere.
I completely agree with having UUIDs or something in that vein for user and group IDs and will not dismiss IDs for sessions and such in the same namespace (although haven’t actually seen a use case for those), but structured variable-length SIDs as NT defines them just don’t make sense to me.
While it's true that SIDs have too much structure, that's a lot better than a flat UID namespace that is also distinct from the also flat GID namespace.
The UID/GID namespace is strictly local in POSIX. There's no way to make any two systems agree on UIDs/GIDs other than by making them have the same /etc/passwd and /etc/group content. Sure, you can use LDAP, but still, that's just one domain. Come time to do a merger or acquisition, you can't just set up a trust between two domains and have it work -- you have to do a hard migration.
SIDs don't have that problem.
The 48-bit authority part of SIDs is silly.
And the domain SID prefix of SIDs is annoyingly large (20 bytes!).
However, they are very compressible. For example, ZFS stores them as "FUIDs", which are {interned_domain_sid_id, rid}, and in each dataset ZFS stores the table of interned domain SIDs. I.e., where NTFS needs 24 bytes to store any one domain user/group SID, ZFS uses 8, so a 67% savings.
Of course, MSFT should have applied that sort of compression much more aggressively early on. That would have reduced the sizes of PACs a great deal.
SIDs are a post-DCE evolution of UUIDs. SIDs differ from UUIDs in that they are hierarchical. In the context of the Windows domain model, they're split into a component which identifies the domain, and a "relative" component which identifies the security principal within the domain. Thus you can easily determine the domain authority to which a principal belongs (useful for filtering across trust boundaries), and you can also efficiently translate between SIDs and human-readable names (you don't need to ask every authority).
There is a good paper from Paul Leach which discusses what they learned from using UUIDs in DCE, but I've only ever sighted a paper copy and I don't have access to it anymore...
The hierarchical thing didn't really happen though -- there's no public SID registry. And machine/domain SIDs got pinned to 3 RIDs. So AD always had machine/domain SID conflict issues. It would only ever not have had SID conflict issues if they had had a public SID registry or if you had to install Windows as a domain member rather than install then join (and if there had never been a forest-of-forests feature).
Once you accept that machine/domain SID conflicts can happen, the value of having arbitrarily long SIDs goes away and you might as well use UUIDs to ID domains.
OK, perhaps hierarchical wasn't the correct word; it's not hierarchical in the sense of reflecting a (possibly global) domain hierarchy, but it does consist of a component that identifies the issuing authority and a component that identifies the principal relative to that authority.
So yes, a (UUID, RID) tuple would have worked just as well.
AFAIK WinNT consolidated a lot of ideas from VMS into more coherent constructs, pity that not all of them are exposed to developers (There is, for example, the option of using kernel upcalls in VMS style i.e. ASTs, but it's completely "private" API)
These decisions are all older than Windows and weren't a reaction to it. They were a reaction to the awful mainframe ways of spawning processes, like JCL.
We've sort of come back to that with Kubernetes YAML files describing how to launch an executable in a specific environment along with all of the resources it needs. The lineage can be traced explicitly: the Borg paper references mainframes and knowingly calls the language that would be replaced by Kubernetes's YAML files 'BCL', a nod to z/OS's JCL.
Plan9 is a lot older than Kubernetes and has the same namespacing of all processes. So it's not impossible to have a "*nix like" OS that still has mainframe-like separation of concerns to ease deployment.
If you want foolproof sandboxing, you need opt-out namespacing. Because there might be resource types that your version of the software doesn't know about, and these should really be namespaced by default.
Besides, what really matter is whether namespacing is idiomatic or not. It was always idiomatic in plan9, and containerization has certainly made it more idiomatic even on *nix systems.
Having written server software that had to work in both places, I always loved the simplicity of fork(2) / vfork(2) relative to Windows CreateProcess. Threading models in Win32 were always a pain. Which only got worse with COM (remember apartment threading? rental threading? ugh)
Back in the 90's, processes had smaller memory footprints, and every UNIX my software supported had COW optimizations. So the difference between fork(2) and vfork(2) was not very large in practice. Often, the TCP handshake behind the accept(2) call was of more concern than how long it would take fork(2) to complete. Of course, bandwidth has increased by a factor of 1000 since then, so considerations have changed.
It's how CreateProcess handles command-line arguments that infuriates me - not as an argv array but as one big string. It's so difficult to get the quoting right.
The problem with WaitForMultipleObjects (WFMO) is that it's limited to 64 handles, which basically makes it useless for anything where the number of handles is dynamic as opposed to static. There are ways to get around this limitation by grouping handles into trees, but it's tremendously clunky.
UCS-2 seemed like a good(ish) idea at the time when Unicode's scope didn't include every possible human concept represented in icon form and UTF-8 hadn't yet been spec'd on a napkin by the first adults to bother thinking about the problem.
Even in 1989, it should have been clear that 16 bits were not enough to encode all of the Chinese characters, let alone encoding all the human scripts. Unicode today encodes 92,865 Chinese characters (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs).
The only reason anybody would think UCS-2 was a good idea was that they did not consult a single Chinese or Japanese scholar on Chinese characters.
Nobody in 1989 expected to encode 92k Chinese characters into Unicode because none of the existing encodings were encoding 92k characters either. The most common encoding for Chinese, GB2312, only has 7k characters.
I recommend reading your own link, specifically the list of sources for the first CJK block to see how many characters were included and where they were sourced from.
Yes. I'm a bit surprised it took so long for someone to come up with something better. But if someone had tried and had come up with anything other than Rob Pike's UTF-8, we might still be sad. Sometimes you have to make mistakes before you know that's what they were.
The problem is that everyone wanted to keep simple array semantics for text, and that's not really workable with full scope of Unicode (even if you have 21-bit code points exposed, Runes, etc.)
On the plus side, because Unix was so ASCII-based, it couldn't easily make the jump to UCS-2/wchar_t. I suspect this was ultimately the motivation that led to UTF-8 (both, IBM's first attempt and Rob Pike's winner). Being late to the game sometimes means you're more prepared.
where `ptr` might be an index into a table (much like a file descriptor) or maybe a pointer in kernel-land (dangerous sounding!) and `verifier` is some sort of value that can be used by the kernel to validate the `ptr` before "dereferencing" it.
On Unix the semantics of file descriptors are dangerous. EBADF can be a symptom of a very dangerous bug where some thread closed a still-in-use FD, then an open() gets the same FD, and now maybe you get file corruption. This particular type of bug doesn't happen with HANDLEs.
> This particular type of bug doesn't happen with HANDLEs.
This does not match my experience at all. Just like what you said about EBADF, Win32 error code 6 (ERROR_INVALID_HANDLE) is a huge red flag for a race condition where a HANDLE gets re-used and inappropriately called upon in some invalid context, possibly even with security or stability concerns. I used to chase these bugs a lot when I worked on Win32 code bases.
If anything this class of bug is worse in Windows because (1) multi-threaded programs are way more common on Windows and (2) HANDLEs are used for more things than file descriptors.
I guess fd reuse is more likely because they tend to get handed out by the kernel as integers in increasing order. But handle reuse absolutely does happen, and if you have this class of bug in a process with a lot of concurrent handle creation in many threads and in a commonly used program it absolutely will bite as a bug at some point.
Gotcha. But it looks like file descriptors could be made almost as safe by avoiding index reuse. Is there any reason why it is not done? Hashtable too costly vs. an array?
File descriptor numbers have to be "small" -- that's part of their semantics. To ensure this, the kernel is supposed to always allocate the smallest available FD number. A lot of code assumes that FDs are "small" like this. Threaded code can't assume that "no FD numbers less than some number are available", but all code on Unix can assume that generally the used FD number space is dense. Even single-threaded code can't assume that "no FD numbers less than some number are available" because of libraries, but still, the assumption that the used FD number space is dense does get made. This basically forces the reuse of FDs to be a thing that happens.
For example, the traditional implementations of FD_SET() and related macros for select(3) assume that FDs are <1024.
Mind you, aside from select(), not much might break from doing away with the FDs-are-small constraint. Still, even so, they'd better be 64-bit ints if you want to be safe.
io_uring allows you to associate arbitrary 64-bit data with any operation and match it on completion, so it looks like it should address these concerns.
Since you said anything... This is not strictly related to the article but your expertise seems to be in the right area.
I have a process that executes actions for users. At the moment that process runs as root until it receives a token indicating an accepted user, then it fork()s and the child changes to the UID of the user before executing the action.
Is there a better way? I hadn't actually heard of vfork() before reading this article. I'm guessing maybe you could do a threaded server model where each thread vfork()s. I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
If the parent is threaded, then yes, vfork() will be better. You could also use posix_spawn().
As to "becoming a user", that's a tough one. There are no standard tools for this on Unix. The most correct way to do it would be to use PAM in the child. See su(1) and sudo(1), and how they do it.
> I'm not really aware what happens when threads and forks combine. Does the v/fork() branch get trimmed down to just that one thread? If so what happens to the other thread stacks? It feels like a can of worms.
Yes, fork() only copies the calling thread. The other threads' stacks also get copied (because, well, you might have pointers into them, who knows), but there will only be one thread in the child process.
vfork() also creates only one thread in the child.
There used to be a forkall() on Solaris that created a child with copies of all the threads in the parent. That system call was a spectacularly bad idea that existed only to help daemonize: the parent would do everything to start the service, then it would forkall(), and on the parent side it would exit() (or maybe _exit()). That is, the idea is that the parent would not finish daemonizing (i.e., exit) until the child (or grandchild) was truly ready. However, there's no way to make forkall() remotely safe, and there's a much better way to achieve the same effect of not completing daemonization until the child (or grandchild) is fully ready.
In fact, the daemonization pattern of not exiting the parent until the child (or grandchild) is ready is very important, especially in the SMF / systemd world. I've implemented the correct pattern many times now, starting in 2005 when project Greenline (SMF) delivered into OS/Net. It's this: instead of calling daemon(), you need a function that calls pipe(), then fork() or vfork(); on the parent side it then calls read() on the read end of the pipe and exits once that read completes, while on the child side it returns immediately so the child can do the rest of the setup work and then finally write one byte into the write side of the pipe to tell the parent it's ready, so the parent can exit.
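A minimal sketch of that pattern (illustrative names, error handling trimmed, and without the extra fork you'd add for a full daemonize()):

    /* Sketch: the parent doesn't exit until the child says it's ready. */
    #include <stdlib.h>
    #include <unistd.h>

    static int ready_fd = -1;

    void start_daemonize(void) {
        int p[2];
        char c;

        if (pipe(p) != 0)
            exit(1);
        switch (fork()) {
        case -1:
            exit(1);
        case 0:                      /* child: keep the write end, carry on */
            close(p[0]);
            ready_fd = p[1];
            setsid();
            return;
        default:                     /* parent: block until the child is ready */
            close(p[1]);
            if (read(p[0], &c, 1) != 1)
                exit(1);             /* child died before becoming ready */
            _exit(0);
        }
    }

    void finish_daemonize(void) {
        char c = 0;
        write(ready_fd, &c, 1);      /* tell the waiting parent to exit */
        close(ready_fd);
    }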
What about fork(2) for network servers? I've written parallel network servers two ways: open the socket to listen on and call fork() N times for the desired level of parallelism, or just create N processes and use SO_REUSEPORT. I prefer the former. I suppose there is a hidden option C of "have a simple process that opens the listening port and then vfork/execs each worker". I find that to be a bit strange because the code will be split into "things that happen before listening on the port" (which includes, e.g., reading configuration files) and "things that happen after listening on the port" (which also includes, e.g., reading configuration files).
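For reference, the former option in a minimal sketch (illustrative names, error handling omitted, serve() being the assumed per-worker accept loop):

    /* Sketch: open the listening socket once, fork N workers that share it. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void serve(int lfd);   /* assumed worker loop: accept() and handle */

    void start_workers(int nworkers, uint16_t port) {
        struct sockaddr_in sa = { 0 };
        int lfd = socket(AF_INET, SOCK_STREAM, 0);

        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        sa.sin_addr.s_addr = htonl(INADDR_ANY);

        bind(lfd, (struct sockaddr *)&sa, sizeof sa);
        listen(lfd, SOMAXCONN);

        for (int i = 0; i < nworkers; i++) {
            if (fork() == 0) {       /* each child inherits the listening FD */
                serve(lfd);
                _exit(0);
            }
        }
        close(lfd);                  /* parent just supervises/waits */
    }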
It's a bit opinionated. It's meant to get a reaction, but also to have meaningful and thought-provoking content, and I think it's correct in the main too. Anyways, hope you and others enjoy it.
They're software VMs. It's a lot like containers, yes.
The problem with containers is that the construction toolkit for them is subtractive ("start by cloning my environment, then remove / replace various namespaces"), while the construction toolkit for zones/jails is additive ("start with an empty universe, and add namespaces or share them with the parent").
Constructing containers subtractively means that every time there's a new kind of namespace to virtualize, you have to update all container-creating tools or risk a security vulnerability.
Constructing containers additively from an empty universe means that every time there's a new kind of namespace to virtualize, you have to update all container-creating tools or risk not getting sharing that you want (i.e., breakage).
I'm placing a higher value on security. Maybe that's a bad choice. It's not like breaking is a good thing -- it might be just as bad as creating a security vulnerability.
fork(2) makes a lot more sense when you realize its heritage. It came from a land before Unix supported full MMUs. In this model, to still have per-process address spaces and preemptive multitasking on what was essentially a PC-DOS level of hardware, the kernel would checkpoint the memory for a process, slurp it all out to dectape or some such, and load in the memory for whatever the scheduler wanted to run next. Its simplicity of being process-checkpoint based wasn't a reaction to Windows-style calls (which wouldn't exist for almost a couple decades), but instead to mainframe process-spawning abominations like JCL. The idea "you probably want most of what you have, so force a checkpoint, copy the checkpoint into a new slot, and continue separately from both checkpoints" was soooo much better than JCL and its tomes of incantations to do just about anything.
vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec(). All of the bugs that causes are super fun to chase, lemme tell you. AFAIC, about the only valid use for vfork now is nommu systems where fork() is incredibly expensive compared to what is generally expected.
clone(2) is great. Start from a checkpoint like fork, but instead of semantically copying everything, optionally share or not based on a bitmask. Share a tgid, virtual address space, and FD table? You just made a thread. Share nothing? You just made a process. It's the most 'mechanism, not policy' way I've seen to do context creation outside of maybe the l4 variants and the exokernels. This isn't an old holdover, this is how threads work today, processes spawned that happen to share resources. Modern archs on linux don't even have a fork(2) syscall; it all happens through clone(2). Even vfork is clone set to share virtual address space and nothing else that fork wouldn't share. Namespaces are a way to opt into not sharing resources that normally fork would share.
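A sketch of that flexibility using the glibc clone() wrapper (illustrative; a real pthread_create() passes several more flags and sets up TLS, so don't treat the thread case as production-ready):

    /* Sketch: the same clone(2) call gives you a "thread" or a "process",
     * depending only on the flag bitmask. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdlib.h>

    #define STACK_SIZE (1024 * 1024)

    static int child_fn(void *arg) { (void)arg; return 0; }

    int spawn(int as_thread) {
        char *stack = malloc(STACK_SIZE);
        int flags = as_thread
            /* share VM, FDs, filesystem info, signal handlers, thread group */
            ? CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND | CLONE_THREAD
            /* share nothing: an ordinary child process, parent gets SIGCHLD */
            : SIGCHLD;

        /* glibc's clone() takes the *top* of the child stack (stacks grow down) */
        return clone(child_fn, stack + STACK_SIZE, flags, NULL);
    }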
And I don't see what afork gets you that clone doesn't, except afork isn't as general.
> fork(2) makes a lot more sense when you realize its heritage.
I think it only makes sense when you consider its heritage. It has ALL the wrong defaults for what it's almost always used for these days: running a subprocess.
It copies "random" kernel data structures like open FDs, etc. and you have to be very careful about closing the ones you don't want to be inherited, etc. etc. It may copy things that weren't even a relevant concept when you wrote your program.
The correct thing to do is to be very explicit about what you want to pass on to the subprocess, and to choose safe defaults for programs compiled against the old API when you extend it. (Off the top of my head, the only things I'd want to be automatically inherited by default would be the environment and CWD.)
It's 100% the wrong API for spawning processes.
Now, I don't think afork() solves any of these problems, AFAICT. But my personal perspective is that fork() and its derivatives are the wrong starting point in the first place for what they are used for in 99% of all cases.
The behaviour of subprocesses inheriting resources like file descriptors is absolutely bizarre. Why on earth would you want this to be the default?! But we're so used to it, we think it's normal.
IMO clone looks a lot better than screwing with that giant struct and all of the kernel bugs that would exist from validating every goofy way those options could be set up wrong by user space.
The PDP-11 had segment base registers and memory protection, so it wasn't necessary to swap out one process to run another one at the same (virtual) address. It didn't have paging, so it couldn't swap out part of a segment. I think it's true that PDP-11 fork() would stop the process to make a copy of the writable segments, but it didn't have to "checkpoint" the process to a disk or tape. Are you talking about the PDP-7? I don't know anything about the PDP-7.
I agree about vfork(), since I haven't seen a system with segment base registers and no paging in a long time, and about clone(). Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
What's the L4 approach? Construct the state of the process you want to run in some memory and then use a launch-new-thread system call, then possibly relinquish access to that memory?
> Unfortunately it's true that clone() (which came from Plan9) has made POSIX threads difficult to support.
clone was literally designed to support posix threads.
> What's the L4 approach?
Capabilities over all of the kernel objects so user space can do safe brain surgery on them. Since everything is capability based, including the cap tables, you end up duping a cap table, allocating a non-running thread, setting registers, and attaching the duped cap table. Four syscalls in the minimal case, but it's L4 so they're fairly cheap. Full disclosure: one of my side projects is a kernel with caps and a first-class VM to do that in one syscall, amortized.
I see. Maybe that explains why on PDP-7 Unix programs would exec the shell instead of terminating the process; swapping your process out to disk or tape can't have been very fast. But without an MMU what else could you do?
Plan9 clone() was not designed to support POSIX threads; IIRC they didn't exist and Plan9 didn't support POSIX. Wasn't Linux clone() mostly a copy of it?
The L4 approach sounds pretty reasonable; not as convenient as fork() in the common case but not as much of a pain as, I don't know, opening a pty or opening an X11 window. I guess L4 syscalls are a bit pricier post-Spectre. How are you going to handle atomicity in your one syscall?
> Plan9 clone() was not designed to support POSIX threads; IIRC they didn't exist and Plan9 didn't support POSIX. Wasn't Linux clone() mostly a copy of it?
Plan9 doesn't have clone(). When they say clone was designed after plan 9, they just mean the general namespacing (which was not configured from their fork or new_thread equivalents). Linux clone was very much designed to support posix threads.
> The L4 approach sounds pretty reasonable; not as convenient as fork() in the common case but not as much of a pain as, I don't know, opening a pty or opening an X11 window. I guess L4 syscalls are a bit pricier post-Spectre.
Yeah, they got more expensive having to hide kernel address space layout.
> How are you going to handle atomicity in your one syscall?
Capabilities to bpf style programs that look like any other kernel objects and can call other kernel objects, combined with a scheme where mutex/spinlock wrapped objects have a locking order declared upfront that can be statically checked, combined with RCU primitives that the VM program verifier knows about and can make guarantees about. I'm not quite happy with the locking and RCU interfaces at the moment though, it feels like there's a more general solution, but each I've come up with has some real sharp edges. : \
Oh right, the Plan9 thing was called rfork(), and it only had the flags argument. Thank you for the correction.
The bpf approach sounds interesting! It sounds like you're going to significant effort with RCU to avoid mutexes (for performance I assume?), but there are a few places that you still feel like such optimistic synchronization approaches would be unacceptably costly. What are they?
If you could get rid of them, you wouldn't need a statically declared locking order (and what does "statically" mean in a kernel interface to poke code into the kernel at runtime?)
I've been thinking it would be fun to try a pure capability language along the lines of E, but using pure optimistic STM instead of single threading. That would eliminate three of the biggest theoretical weaknesses of E: malicious code can deny service by infinite-looping a vat, so in practice you have to put potentially untrusted code in its own vat; the error handling is ad hoc and therefore probably prone to the kinds of devastating problems we've seen in the DAO ecosystem; and it doesn't scale on multicore. The E design, meanwhile, eliminates shared mutable data, which avoids a plethora of bugs and security problems L4 userland programs are likely to include.
Such a system of course doesn't need a kernel, but also isn't very suitable for running malicious machine code, and its runtime overhead is likely to be a lot higher than a traditional memory-protection-based system.
> vfork(2) is an abomination. Even when the child returns, the parent now has a heavily modified stack if the child didn't immediately exec().
What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you are the author of the code calling vfork() and you know not to do that, so why would that happen?
A: It just wouldn't happen.
And as to exec() failing, this is why the child side must end with a call to an exec function or _exit() (a failed exec being followed by _exit()), and this is true even if you use fork() instead of vfork(). I.e.:
    /* do a bunch of pre-vfork() setup */
    ...
    pid_t pid = vfork();
    if (pid == -1) err(1, "Couldn't vfork()");
    if (pid == 0) {
        /* do a bunch of child-side setup */
        execve(...);
        /* oops, ENOENT or something */
        _exit(1);
    }
    /* the child either exec'ed or exited */
    int status;
    if (waitpid(pid, &status, 0) != pid) err(1, "...");
    ...
How do you detect if the child exec'ed or exited? Well, you make a pipe before you vfork(), you set its ends to be O_CLOEXEC, then on the child side of vfork() you write one byte into it if the exec call fails. On the parent side you read from the pipe before you reap the child, and if you get EOF then you know the child exec'ed, and if you get one byte then you know the child exited. The one byte could be an errno value.
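A minimal sketch of that trick (illustrative names; it takes the position argued here that a write() on the child side of vfork() is fine in practice):

    /* Sketch: EOF on the CLOEXEC pipe means the child exec'ed; data means
     * the exec failed, and carries the child's errno. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    pid_t spawn_and_check(char *const argv[], int *exec_errno) {
        int p[2], err = 0;
        pid_t pid;

        if (pipe2(p, O_CLOEXEC) != 0)          /* or pipe() + fcntl(FD_CLOEXEC) */
            return -1;

        pid = vfork();
        if (pid == 0) {
            execvp(argv[0], argv);
            write(p[1], &errno, sizeof errno); /* exec failed: report why */
            _exit(127);
        }

        close(p[1]);                           /* parent keeps only the read end */
        if (pid > 0)
            read(p[0], &err, sizeof err);      /* zero bytes read: exec succeeded */
        close(p[0]);
        *exec_errno = err;                     /* 0 means the exec happened */
        return pid;                            /* caller still waitpid()s */
    }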
No, really, what you say about vfork() is lore, and very very wrong.
That said, vfork() blocks a thread in the parent. The point of my gist was to explain why fork() sucks, why vfork() is much better, and what would be better still.
> And I don't see what afork gets you that clone doesn't, except afork isn't as general.
afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
> What stack modifications? Sure, the child can scribble over the stack frame, or worse, the child could do things like return -- but you're the author of the code calling vfork() and you know not to do that
Within a sentence you described the stack modification. 'It's not a footgun, just don't make mistakes' doesn't hold a lot of water with me.
> No, really, what you say about vfork() is lore, and very very wrong.
Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".
> afork()/avfork() is not meant to be as general as clone() but to be more performant than vfork() by not blocking a thread on the parent side.
Ok, but I'm not going to hold it against clone for being a more general solution.
> clone() needs some improvements. It should be possible to create a container additively. See elsewhere in the comments on this post.
I agree with this, but there are practical reasons why this isn't the case, mainly around how asking user space for every little thing is expensive, and large sparse structs to copy into kernel space covering basically everything in struct task sound like a special kind of security hell I would not want to be a part of.
A flag to clone to create an empty process and something like a bunch of io_uring calls or a box program to hydrate the new task state would be really neat, and has been kicked around a bunch. There's just a ton of corner cases that haven't been ironed out.
> Like I've said elsewhere in the comments, I've literally had to fix awful bugs, some security related, from how much vfork() is a preloaded foot gun with the safety off. Not everyone who has a bad impression of it is just following the "lore".
> You're supposed to only use async-signal-safe functions on the child-side of fork().
Not practically; there's way more code out there designed from day one for fork(), and next to none designed for vfork() explicitly.
Signal safety has more to do with shared mutability, which isn't a concern for fork. You can get into gross situations mixing fork and threads, but that's equally true of vfork.
> Signal safety has more to do with shared mutability, which isn't a concern for fork.
And yet that's what the spec says about child-side code following fork(). There's a reason for that. It's not just about signals. Async-signal-safe means, yes, that you can use it in an asynchronous signal handler, but there are contexts other than async signal handlers that require async-signal-safe code.
> You can get into gross situations mixing fork and threads...
You can get into bad situations just using fork and no threads.
> Not practically, there's way more code out there designed day one for fork(). Next to none designed for vfork() explicitly.
> And yet that's what the spec says about child-side code following fork(). There's a reason for that. It's not just about signals. Async-signal-safe means, yes, that you can use it in an asynchronous signal handler, but there are contexts other than async signal handlers that require async-signal-safe code.
You cut off with the reason being threads and shared mutability.
In fact that's what the spec says too.
1003.1-2017 on fork()
> A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.
Practically if you don't use threads you can do anything in the child process you can do in the parent. Any env that doesn't support that breaks decades of important Unix software.
And what are you fixing by changing fork to vfork there?
> Practically if you don't use threads you can do anything in the child process you can do in the parent. Any env that doesn't support that breaks decades of important Unix software.
Not true. I mentioned PKCS#11 elsewhere in this post or thread. The PKCS#11 case is more generally about devices, or even TCP and other connections. You can't share, say, a file descriptor connected to an IMAP server (or whatever) between the parent and the child (not without adding synchronization, though that need not mean mutexes).
That's like saying you can't write to the same file willy nilly after any context creation. In context, I obviously meant that you can perform the same actions in the child or the parent, not that you somehow get free synchronization for accessing all kernel objects.
Also, you can specify CKF_INTERFACE_FORK_SAFE if you want a handle in PKCS#11 that handles synchronization enough internally to call from both the child and the parent simultaneously.
Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about forking. It's allowed to generate some non-obvious code so long as it acts "as if" it has the same behaviour - but only from the point of view of that theoretical C VM.
For example, in theory _exit(1) could be implemented as longjmp(...) up to a point in some compiler-created top-level function that wraps up main(). Then that wrapper function could perform some steps to communicate the return code to the OS that trashes the stack before actually exiting. After all, if the process is about to exit anyway, what difference does it make if a bunch of memory is fiddled with? We know the answer to this but, from the point of view of the C virtual machine, it's irrelevant.
That particular scenario is unlikely, but the point is that compiler implementations and optimisations are allowed to do very non-obvious things. You're only safe if you stick to the rules of the C standard, which this 100% does not.
> Your code snippet assumes that your C compiler is just a high-level assembler. But it's not - it executes against a theoretical C virtual machine that doesn't know about about forking.
Luckily a C compiler that doesn't know about concepts outside of the C Virtual machine will not be able to compile a Linux executable or even dynamically load a library that exposes the vfork call (let alone try to execute the underlying system call directly).
That doesn't make sense. The C VM only affects how C code is understood by the compiler, in particular what optimisations are allowed. It doesn't stop the compiler from generating an executable or linking to libraries.
> It doesn't stop the compiler from generating an executable or linking to libraries.
The C standard claims multiple definitions result in undefined behavior. Dynamic libraries are filled to the brim with copies of symbols because it is impossible to tell in which library a symbol should be stored. Linking against a dynamic standard library cannot end well.
Stack manipulations are a real problem. Say some parameter to exec after vfork uses stack slots created by the compiler for temporary variables. Sure, you compute those before the call to vfork, but then the compiler applies code motion...
    pid = vfork();
    if (pid == 0) {
        int something;
        exec();
        // cleanup code that uses something
        _exit(1);
    }
Then the compiler (which knows `_exit` is noreturn) can conclude that if you enter the `if`, none of the existing stack slots will be read again, so it can reuse one of those stack slots for the `something` variable. But whoops, that means the original process has had its stack corrupted.
This applies even when the variable is declared at the start of the function, as compilers can perform equivalent variable-lifetime analysis to let them reuse the stack slot. This is exactly why the POSIX spec makes it undefined to write to any variable after vfork (except the pid return variable, obviously).
But even that is not strictly safe enough, since the compiler is allowed to introduce writes to the stack. This may for example, happen as part of calculating a temporary, if the compiler wants to use the register for something else, and decides against using some other register for storage, so spills to the stack.
Obviously your `afork` completely avoids all those sorts of concerns by using a separate stack.
If "[s]tack manipulations are a real problem" (I say there are none if you're writing the code and know not to add any problematic stack manipulations) then avfork() should satisfy that concern.
I'm still struggling to understand the point of vfork(). The whole point of fork is to offload work to a different part of your program so the original part can continue to do work. The entire idea fails if it halts the original program for the duration of the child's life. How is this better than just doing a regular function call?
vfork halts the parent until the child exits or calls exec, getting its own address space. In the normal case, you vfork and immediately exec, and the parent continues on with what it was doing. The time between vfork and exec is “special” in that the child is temporarily running in the parent’s address space, then it uses exec to separate and do its own thing.
Yeah, if you’re never planning on calling exec, vfork doesn’t make much sense.
Can I ask how you approach resource management and dependencies in that kind of code base? As the article briefly mentions, using fork without exec means you need to keep everything else in the process fork-safe, which I know can be a challenge in the presence of third-party code.
Not who you're replying to, but it's trivial as long as you don't use threads.
I suppose third-party code could be opening up file-descriptors behind your back and privately maintaining that state in private storage, but third-party code that does that without documenting it is relatively rare in the Unix/C world in my experience.
Historically getXbyY functions and the name service switch had a way of doing that, and that was one reason for nscd to come along (another was to cache better, naturally).
Most (all?) of the nsswitch functions were datagram based back in the day, so those would be safe.
I've certainly never had issues using e.g. getpwent on a NIS setup with forking and modern rpcbind may use TCP I believe. Maybe it opens a new connection each time?
Static file descriptors were a bit more common in the old days, but look horribly out of place in modern code. Keeping the code fork safe is easier than keeping it thread safe, at least with fork you aren't sharing the heap.
But you're sharing file descriptors, which might be for devices, or for SOCK_SEQ connections, etc, and you can't just have the parent and child step all over each other writing to them. Now, you wouldn't do that, but you might use a library that lets you end up doing that without noticing. Fork-safety is not trivial.
I've seen an argument for not marking the whole mutable process VA space as 'trap on write' (including the thread stack you're about to immediately write to) when you're just going to throw that work away and exec() right after. There's also 'I want to support cheap forks on a nommu system and vfork is easier to retrofit in'.
The code I currently work on actually has a use of `clone` with the `CLONE_VM` flag to create something that isn't a thread. Since `CLONE_VM` will share the entire address space with the child (you know, like a thread does!) a very reasonable response would be "WAT?!"
What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
We achieved this by using `CLONE_VM` (and a handful of other flags) to give the new "thread-like" entity access to the whole address space. But, we omitted `CLONE_THREAD`, as if we were making a new process. The new "thread-like" entity would not technically be part of the same thread group but would live in the same address space.
We also used two chained `clone()` calls (with the intermediate exiting, like when you daemonise) so that the new "thread-like" wouldn't be a child of the original process.
All this existed before I joined; it's just really cool that it works. I've never encountered such a non-standard use of clone before, but it was the right tool for this particular job!
> What led us here was a need to create an additional thread within an existing process's address space but in a way that was non-disruptive - to the rest of the process it shouldn't really appear to exist.
Sure! I'll try to illustrate the general idea, though I'm taking liberties with a few of the details to keep things simple(r).
Our software (see https://undo.io) does record and replay (including the full set of Time Travel Debug stuff - executing backwards, etc) of Linux processes. Conceptually that's similar to `rr` (see https://rr-project.org/) - the differences probably aren't relevant here.
We're using `ptrace` as part of monitoring process behaviour (we also have in-process instrumentation). This reflects our origins in building a debugger - but it's also because `ptrace` is just very powerful for monitoring a process / thread. It is a very challenging API to work with, though.
One feature / quirk of `ptrace` is that you can't really do anything useful with a traced thread that's currently running - including peeking its memory. So if a program we're recording is just getting along with its day we can't just examine it whenever we want.
First choice is just to avoid messing with the process but sometimes we really do need to interact with it. We could just interrupt a thread, use `ptrace` to examine it, then start it up again. But there's a problem - in the corners of Linux kernel behaviour there's a risk that this will have a program-visible side effect. Specifically, you might cause a syscall restart not to happen.
So when we're recording a real process we need something that:
* acts like a thread in the process - so we can peek / poke its memory, etc via ptrace
* is always in a known, quiescent state - so that we can use ptrace on it whenever we want
* doesn't impact the behaviour of the process it's "in" - so we don't affect the process we're trying to record
* doesn't cause SIGCHLD to be sent to the process we're recording when it does stuff - so we don't affect the process we're trying to record
Our solution is double clone + magic flags. There are other points in the solution space (manage without, handle the syscall restarting problem, ...) but this seems to be a pretty good tradeoff.
I looked into something similar for implementing a concurrent GC. I ended up just using mmap() and ptrace(), since I did have to manipulate the process for certain barrier operations; I probably could have done it with non-ptrace system calls; there are tradeoffs to be made (either way you need to interrupt any pending system calls, but there are multiple ways of doing that).
The problem record and replay is expansions of languages and apis too. That is a good thing for some things but it needs to be reworded sometimes too and implementations of things aren't always newer versions of things either.
> The problem record and replay is expansions of languages and apis too. That is a good thing for some things but it needs to be reworded sometimes too and implementations of things aren't always newer versions of things either.
Changes to languages and APIs can be a problem to record/replay depending on exactly how they're implemented.
Undo's core tech, rr (and, arguably, GDB's built in record/replay) operate at the level of machine instructions and operating system calls, so changes to language and library behaviours don't generally affect us, outside of a few corner cases.
When you have that, you don't need to even know what the language is in order to operate - though if you want source-level debugging then it does matter as you have to be able to map from "your program counter is here" to "you're at this source line".
We occasionally need to add support for new system calls but an advantage of Linux is that the kernel ABI is very stable. New extensions to CPU instruction set also require work - these can be harder to support but they change more slowly.
Of course, operating at such a low level isn't the only way to record/replay - there are distinct costs and benefits to operating at a higher level in the stack.
2. Set it up from the parent process, it just lies on the operating table passively.
3. Submit it to the scheduler.
This is just... obviously correct. Totally flexible. Totally efficient. Hell, if you really want to fork anything, fork those embryonic processes which have no active threads! Much safer and easier to understand!
When I was first learning about UNIX and similar OSes I just assumed that this is how things worked because this is the obvious way of doing it. Why would you fork a process, then try to determine which of the two processes you are, then fix whatever the parent process messed up in your global state, and only then execute what you actually wanted to do? That seems insane (I guess until you realize that the main use case is creating /bin/sh).
But even when writing /bin/sh, I don't see why this would get in the way? I was once told that earlier Unix didn't even have fork, but had something more purpose-made for shells instead.
Sounds a bit like Fuchsia's launchpad library, where you create a launchpad object, do all the setup, and then call launchpad_go to actually start the process. Launchpad doesn't allow arbitrary syscalls in the setup, so in that sense it is maybe closer to a "spawn" interface, but with better ergonomics.
I was always disappointed by the performance of fork()/clone().
CompSci class told me it was a very cheap operation, because all the actual memory is copy-on-write, so its a great way to do all kinds of things.
But the reality is that duplicating huge page tables, and hundreds of file handles, is very slow. Like tens of milliseconds slow for a big process.
And then the process runs slowly for a long time after that because every memory access ends up causing lots of faults and page copying.
I think my CompSci class lied to me... it might seem cheap and a neat thing to do, but the reality is there are very few use cases where it makes sense.
CS classes (and, far too often, professional programmers) talk about computers like they're just faster PDP-11s with fundamentally the same performance characteristics.
Agreed that these costs can be larger than is perhaps implied in compsci classes (though it's possible that they've changed their message since I took them!)
I suppose it is still essentially free for some common uses - e.g. if a shell uses `fork()` rather than one of the alternatives it's unlikely to have a very big address space, so it'll still be fast.
My experience has been that big processes - 100+GB - which are now pretty reasonable in size really do show some human-perceptible latency for forking. At least tens of milliseconds matches my experience (I wouldn't be surprised to see higher). This is really jarring when you're used to thinking of it as cost-free.
The slowdown afterwards, resulting from copy-on-write, is especially noticeable if (for instance) your process has a high memory dirtying rate. Simulators that rapidly write to a large array in memory are a good example here.
When you really need `fork()` semantics this could all still be acceptable - but I think some projects do ban the use of `fork()` within a program to avoid unexpected costs. If you really have a big process that needs to start workers I guess it might be worth having a small daemon specifically for doing that.
Right, shells are not threaded and they tend to have small resident set sizes. Even in shells though, there's no reason not to use vfork(), and if you have a tight loop over starting a bunch of child processes, you might as well use it. Though, in a shell, you do need fork() in order to trivially implement sub-shells.
Also, mandating copy-on-write as an implementation strategy is a huge burden to place on the host. Now you’ve made the amount of memory a process is using unquantifiable.
It's not necessarily unquantifiable -- the kernel can count the not-yet-copied pages pessimistically as allocated memory, triggering OOM allocation failures if the amount of potential memory usage is greater than RAM. IIUC, this is how Linux vm.overcommit_memory[1] mode 2 works, if overcommit_ratio = 100.
However, if an application is written to assume that it can fork a ton and rely on COW to not trigger OOM, it obviously won't work under mode 2.
> 2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM.
> Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
> Useful for applications that want to guarantee their memory allocations will be available in the future without having to initialize every page.
You're right, "unquantifiable" was the wrong word here. I meant, a program has no real way of predicting/reacting to OOM. I didn't realize mode 2 with overcommit_ratio = 100 behaved that way, thanks for sharing.
Yeah I think in a practical sense you're right, since AFAIK using mode 2 is fairly rare because most software assumes overcommit, and even if a program is written with an understanding that malloc can return NULL, it's in the sense of...
POSIX doesn't require that fork() be implemented using copy-on-write techniques. An implementation is free to copy all of the parent's writable address space.
If the parent is a JVM, for sure. But a copy-on-write fork() still doesn't perform well. The point isn't to just copy the whole parent. The point is to stop copying at all.
Copy-on-write is supposed to be cheap, but in fact it's not. MMU/TLB manipulations are very slow. Page faults are slow. So the common thing now is to just copy the entire resident set size (well, the writable pages in it), and if that is large, that too is slow.
> clone() is stupid ... the clone(2) design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.
IMHO a bigger problem [2] in practice with clone is that (according to glibc maintainers) once your program calls it, you can't call any glibc function anymore. [1] Essentially the raw syscall is a tool for the libc implementation to use. The libc implementation hasn't provided a wrapper for programs to use which maintains the libc's internal invariants about things like (IIUC) thread-local storage for errno.
The author's aforkx implementation is something that glibc maintainers could (and maybe should) provide, but my understanding is that you can get in trouble by implementing it yourself.
[2] editing to add: or at least a more concrete expression of the problem. Wouldn't surprise me if they haven't provided this wrapper in part because the proliferation the author mentioned makes it difficult for them to do so.
It's really unfortunate that the sanctioned way to call Linux syscalls directly is via the syscall() function (previously the _syscallN macros), and both of those methods set errno on error, which fails in a clone() thread.
If only Glibc provided a syscall_r() or something that returns the raw return value whether it's an error or not.
It is possible to make syscall() (and regular libc syscalls like read()) work in a clone() thread. I use this in performance-optimised I/O code in a database engine, so I know it works, but it requires some ugly Glibc-and-architecture-specific things. Doing it portably doesn't seem to be an option.
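For the curious, the non-portable part boils down to something like this (an x86-64-only sketch; `raw_syscall3` and `raw_read` are made-up names, and real code needs variants for other argument counts and architectures):

```c
/* Sketch: a raw syscall wrapper for x86-64 Linux that returns the kernel's
 * return value directly (negative errno on failure) instead of writing to
 * the thread-local errno, so it is usable from a task created with a bare
 * clone() where the TLS/errno machinery can't be trusted. */
#include <sys/syscall.h>

static inline long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "0"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
}

/* Usage: a read() that never touches errno. */
static long raw_read(int fd, void *buf, unsigned long len)
{
    return raw_syscall3(SYS_read, fd, (long)buf, (long)len);
}
```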
The problem with this argument is that the set of programs that just fork() and then exec() is fairly small. Sure, shells are small and do this, but then the article argues that shells are a good use of fork().
In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done (maybe you want to create a new pid ns, you need a separate mm because you're going to allocate a bunch, whatever). Maybe the argument is that programs should never do this? I don't buy that. Then there's a lot of string-slinging through exec().
That's backwards from my experience, which is that most users of fork() only do "fork; child does small amount of setup, eg closing file descriptors; exec". Shells are one of the few programs that do serious work in the child, because the POSIX shell semantics surface "create a subshell and do..." to the shell user, and then the natural way to implement that when you're evaluating an expression tree is "fork, and let the child process continue evaluating as a long-lived process continuing to execute as the same shell binary". (Depending on what's in that sub-tree of the expression, it might eventually exec, but it equally might not.)
Many years back I worked on an rtos that had no fork(), only a 'spawn new process' primitive (it didn't use an MMU and all processes shared an address space, so fork would have been hard). Most unixy programs were easy to port, because you could just replace the fork-tweak-exec sequence with an appropriate spawn call. The shells (bash, ash I think were the two I looked at) were practically impossible to port -- at any rate, we never found it worth the effort, though I think with a lot of effort and willingness to carry invasive local patches it could have been done.
The vast majority of programs that fork are doing fork() followed almost immediately by exec(), to the extent that on macOS for example a process is only really considered safe for exec() after fork() happens. Pretty much nothing else is considered safe.
Yeah; that would be my assumption too. I worked one time on a significant project that benefited from fork() without exec(), and it was a monstrous pain - only if you own every single line of code in your project, have centralized resource management, and have no significant library dependencies should you ever consider doing this.
Yeah, you can't depend on pthreads or pthread mutexes (they're not defined as being fork safe).
The entirety of Foundation (so presumably anything in Swift) is not fork safe either.
To be clear: "not fork safe" in this case means "severely constrained environment": e.g. you can do things like set rlimits, set up pipes, etc., but good luck with much more. I guess morally similar to the restrictions you have in a signal handler, albeit with different restrictions.
Oh no, there's tons of ProcessBuilder type APIs in Java, Python, and... every major language you can think of.
The problems with fork() become very apparent in any Java apps that try to run external programs, especially in apps that have many threads and massive heaps and are very busy.
> In larger programs, you're forking because you need to diverge the work that's going to be done and probably where it's going to be done
That's usually going to be done with clone() instead, no? You'll likely want to fiddle with the various flags for those usages and are unlikely to be happy with what fork() otherwise does.
That paper smacks of a Chesterton Fence. They haven't come up with a tested replacement for many of the use cases, i.e.:
These designs are not yet general enough to cover all the use-cases outlined above, but perhaps can serve as a starting point...
yet bullet #1 in the next paragraph is
Deprecate Fork
I think this is a case of security guys being upset about fork gumming-up their experiments. I don't really care about their experiments. The security regime for the past 20 years may have bought us a little more security against eastern bloc hackers, but it hasn't done squat to protect us from Apple, Google, & Microsoft! I have never had a virus de-rail my computing life as much as the automatic Windows 10 upgrade. Robert Morris got 400 hours community service for a relatively benign worm. If that's the penalty scale, Redmond should get actual time in the slammer for Cortana, forced Windows Update, and adding telemetry to Calculator.
You fail to address any of the substance of their paper, or of my gist (TFA), then go on a rant about unrelated things. The authors of that paper deserve better treatment even if you hate Microsoft.
I did. Chesterton Fence. fork() has been in Unix from the beginning. Taking it out at this point will cause more problems than it solves. Until you have a working Unix distro (kernel AND common userland services) that elegantly covers all of the forkless cases, your paper and their paper are just opinions. Theirs is a formally written one. Yours is a clickbaity one. And casting vfork() as any kind of improvement here is just bonkers.
And the rant is totally related: i.e. devs breaking things that worked just fine to begin with for the sake of doctrinal purity. It is usually a false doctrine.
I'm not proposing that fork() be removed. Microsoft is much more interested in not ever implementing fork() than I am in removing it. So your dilapidated fence can stay up where it's up.
I have to disagree that fork is evil. fork is great because of copy-on-write. I guess my particular use case is not very typical/common though.
I'm running powerflow simulations on a power grid model (several GB of memory to store the model). Copy-on-write means I can make small modifications to this model and run simulations in parallel. Thanks to fork/copy-on-write, I can run 32 simulations in parallel, each with small modifications, without requiring 32 times as much memory.
I saw a bug once where an application would get way slower on MacOS after calling fork(). Not just temporarily either; many syscalls would continue to run slowly from the call to fork() until the process exited.
Looking on Stack Overflow, I see a few reports of this behavior[0][1].
I don't think containers should be like jails. Containers should be more like chroots than they are now.
Have you ever tried to run a modern X/whatever app with 3D graphics and audio and DBUS and God knows what else in a container and get it to show up on your desktop? It's a fucking nightmare. I spent over a week trying to get 1Password to run in a container. Somebody decided containers had to be "secure", even though they don't actually exist as a single concept and security was never their primary purpose. If instead containers were used only to isolate filesystem dependencies, we could actually pretend containers were like normal applications and treat them with the same lack of security concern that all the rest of our non-containerized programs are.
Firecracker is the correct abstraction for isolation: a micro-VM. That is the model you want if you want to run an app securely (not to mention reliably, as it can come with its own kernel, rather than needing you to run a compatible host kernel).
I... didn't mean that containers have to have a copy of the operating system inside them, systemd and many other things included. I meant only that they should be created in ways like how the BSDs and Illumos do it.
Is it a fair approach to implement first with fork() because of its memory protection, then optimize based on benchmarks, potentially switching to vfork() for speed? Benchmark areas could look at synchronous locks, copy-on-write memory, stack sharing, etc.
What are the good practices of security tradeoffs of fork() vs. vfork() especially in terms of ease of writing correct code? I'd thought that fork() + exec() tends to favor thinking about clearer separation/isolation. For example I've written small daemons using fork() + exec() because it seems safe and easy to do at the start.
In short, fork() mixes poorly with multi-threaded code, and it has some security footguns, like needing to explicitly unshare elements of the environment that may be sensitive, such as file descriptors (suddenly you need to know, from a single place in the code, every file descriptor used anywhere in the program). Here is a well-written comment about fork() from David Chisnall: <https://lobste.rs/s/cowy6y/fork_road_2019#c_zec42d>
Additionally, the fork()+exec() idiom practically forces OS designers into a corner where they simply have to implement Copy-on-Write for virtual memory pages, or otherwise the whole userspace using this idiom is going to be terribly slow. Without the fork()+exec() idiom you don't need CoW to be efficient.
Fork mixes so poorly with multithreaded code that a lot of modern languages that are built from the beginning with threads of one sort or another in mind, like Go, simply won't let you do it. There is no binding to fork in the standard library.
I think you could bash it together yourself with raw syscalls, because that can't really be stopped once you have a syscall interface, but basically the Go runtime is built around assuming it won't be forked. I have no idea what would happen to even a "single threaded" Go program if you forked it, and I have no intention of finding out. The lowest level option given in the syscall package is ForkExec: https://pkg.go.dev/syscall#ForkExec And this is a package that will, if you want, create new event loops outside of the Go runtime's control, set up network connections outside of the runtime's control, and go behind the runtime's back in a variety of other ways... but not this one. If you want this, you'll be looking up numbers yourself and using the raw Syscall or RawSyscall functions.
> I have no idea what would happen to even a "single threaded" Go program if you forked it, and I have no intention of finding out.
I'm not an expert on Go internals, but the GC in Go is multithreaded, so I would assume forking will kill the GC. Better hope it's not holding any mutexes.
TL;DR: if another thread is holding a lock when you fork, that lock will be stuck locked in the child, but the thread that was holding it no longer exists there.
So if your multi-threaded program uses malloc you may fork while a global allocation lock is being held and you won't be able to use malloc or free in the child (thread-local caches aside).
There are other problems but this is the basic idea. To be fork-safe you need to allow any thread to just disappear (or halt forever) at any point in your program.
malloc has to guard its locks against fork, probably using pthread_atfork, or some lower level internal API related to that.
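For concreteness, the usual pattern looks something like this (a generic sketch of a library guarding one internal lock, not glibc's actual malloc code): take the lock before the fork so no thread holds it mid-operation, then release it on both sides.

```c
/* Sketch of the standard pthread_atfork() pattern for keeping a library's
 * internal lock consistent across fork(). */
#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

static void lib_prepare(void) { pthread_mutex_lock(&lib_lock); }   /* before fork */
static void lib_parent(void)  { pthread_mutex_unlock(&lib_lock); } /* in parent   */
static void lib_child(void)   { pthread_mutex_unlock(&lib_lock); } /* in child    */

__attribute__((constructor))
static void lib_init(void)
{
    pthread_atfork(lib_prepare, lib_parent, lib_child);
}
```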
The problem with pthread_atfork is third party libs.
YOU will use it in YOUR code. The C library will correctly use it in its code. But you have no assurance that any other libraries are doing the right things with their locks.
Your "third party libs" includes system libraries like libdl.
We had a Python process using both threads (for stuff like background downloads, where the GIL doesn't hurt) and multiprocessing (for CPU-intensive work), and found that on Linux, the child process sometimes deadlocks in libdl (which Python uses to import extension modules).
The fix was to use `multiprocessing.set_start_method('spawn')` so that Python doesn't use fork().
Also if, for any reason, you end up doing a `fork()` syscall directly rather than via libc you'll still have a problem as appropriate cleanup won't happen.
Of course, the best answer to that is usually going to be "don't do that"!
Right, and so POSIX "fixed" that by standardizing posix_spawn. Thus fork is now mainly for those scenarios in which exec is not called, plus traditional coding that is portable to old systems.
Apologies if this is a silly question, but it seems like there's a false dichotomy here:
(1) You have separate fork() (etc.) and exec(), so that in the brief window in between you can set all the properties of the new process using APIs that exist anyway for controlling your own process.
(2) You have a single call to spawn a new process, but you have a million different options to control every aspect of the new process.
Why not do it this other way instead? Perhaps a bit late now but seems like in retrospect it would give the API simplicity of fork+exec without any of the complications.
(3) There are two steps to run a new process. The first fully sets up its memory and returns a PID, but doesn't start running it. The second call, unfreeze(), allows it to begin executing code. All the usual APIs that exist anyway for controlling your own process take an extra parameter specifying the PID of a frozen child (or -1 for the current process).
There is something about fork which I have never understood. Maybe someone here can explain it to me.
Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork). If you know that you're going to call exec immediately after fork, then all the issues of dealing with the (potentially large) address space of the parent just evaporate because the child process is just going to immediately discard it all.
So why is there not a fork-exec combo? And why has it not replaced fork for 99% of use cases?
And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
Because there are many, many use cases where you don't want to call exec() immediately after fork().
Want to constrain memory usage or CPU time of an arbitrary child process? You have to call setrlimit() before exec(). Privilege separation? Call setuid() before exec(). Sandbox an untrusted child process in some way? Call seccomp() (or your OS equivalent) before exec(). And so on and so forth. Any time you want to change what OS resources the child process will have access to, you'll need to do some set-up work before invoking exec().
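A minimal sketch of that pattern (error handling trimmed, and the specific limit and uid values are just examples, not recommendations):

```c
/* Sketch: fork, constrain the child, then exec; the parent waits. */
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int run_constrained(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                       /* child: set up, then exec */
        struct rlimit cpu = { 10, 10 };   /* e.g. 10 seconds of CPU time */
        setrlimit(RLIMIT_CPU, &cpu);
        setuid(1000);                     /* drop privileges (example uid) */
        execvp(argv[0], argv);
        _exit(127);                       /* exec failed */
    }

    int status;                            /* parent: wait for the child */
    waitpid(pid, &status, 0);
    return status;
}
```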
Windows solves this by adding a bunch of optional parameters to CreateProcess, as well as having two more variants (CreateProcessAsUser and CreateProcessWithLogon). Some of the arguments are complicated enough that they have helper functions to construct them.
I like the more composable fork()->modify->exec() approach of unix, but I wouldn't call either of them really elegant.
The one I've favored while reading these arguments has been the "suspended process" model. The primitives are CREATE(), which takes an executable as a parameter and returns the PID of a paused process, and START(), which allows the process to actually run.
Unix already has the concept of a paused executable, after all.
This model also requires all the process-mutation syscalls, like setrlimit(), to accept a PID as a parameter, but prlimit() wound up being created anyway, because the ability to mutate an already-running process is useful.
A third way is to grant the parent process access to the child such that they can use the child process handle to "remotely" set restrictions, write memory, start a thread, etc.
Practically, syscall overhead has gotten in the way of that being ubiquitous in the past. Here's to hoping that newer models of syscalls that reduce kernel/user overhead make such a thing possible.
To me this feels like a call for more powerful language primitives, i.e. a way to specify some action to take to "set up" the child process that's more explicit and readable than one special call behaving in a particularly odd way. I'm imagining closures with some kind of Rust-like move semantics, but not entirely sure.
(if we're speaking in terms of greenfield implementation of OS features)
"Process control in its modern form was designed and implemented within a couple of days. It is astonishing how easily it fitted into the existing system; at the same time it is easy to see how some of the slightly unusual features of the design are present precisely because they represented small, easily-coded changes to what existed. A good example is the separation of the fork and exec functions. The most common model for the creation of new processes involves specifying a program for the process to execute; in Unix, a forked process continues to run the same program as its parent until it performs an explicit exec. The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else."
OK, but why has it not been replaced with something better in the intervening 50 years? There have been a lot of improvements to Unix since 1970. Why not this?
I think the reason for fork() and exec() as primitives goes back to the early Unix design philosophy. Unix tends to favour "easy and simple for the OS to implement" rather than "convenient for user processes to use". (For another example of that, see the mess around EINTR.) fork() in early Unix was not a lot of code, and splitting into fork/exec means two simple syscalls rather than one needing a lot of extra fiddly parameters to set up things like file descriptors for the child.
There's a bit on this in "The Evolution of the UNIX Time-Sharing System" at https://www.bell-labs.com/usr/dmr/www/hist.html -- "The separation of the functions is certainly not unique to Unix, and in fact it was present in the Berkeley time-sharing system [2], which was well-known to Thompson. Still, it seems reasonable to suppose that it exists in Unix mainly because of the ease with which fork could be implemented without changing much else." It says the initial fork syscall only needed 27 lines of assembly code...
(Edit: I see while I was typing that other commenters also noted both the existence of posix_spawn and that quote...)
> Unix tends to favour "easy and simple for the OS to implement"
Well, yeah, but the whole problem here, it seems to me, is that fork is not simple to implement precisely because it combines the creation of the kernel data structures required for a process with the actual initiation of the process. Why not mkprocess, which creates a suspended process that has to be started with a separate call to exec? That way you never have to worry about all the hairy issues that arise from having to copy the parent's process memory state.
It was simple specifically for the people writing it at the time. We know this, because they've helpfully told us so :-) It might or might not have been harder than a different approach for some other programmers writing some other OS running on different hardware, but the accidents of history mean we got the APIs designed by Thompson, Ritchie, et al, and so we get what they personally found easy for their PDP7/PDP11 OS...
Long ago in the far away land of UNIX, fork was a primitive because the primary use of fork was to do more work on the system. You likely were one of three or four other people vying for CPU time at any given moment, and it wasn't uncommon to see loads of 11 on a typical university UNIX system.
> so why is there not a fork-exec combo
you're looking for system(3). Turns out, most people waitpid(fork()). Windows explicitly handles this situation with CreateProcess[0] which does a way better job of it than POSIX does (which, IMO, is the standard for most of the win32 API, but that's a whole can of worms I won't get into).
> why would anyone ever use vfork?
Small shells, tools that need the scheduling weight of "another process" but not for long, etc. See also, waitpid(fork()).
When you have something with MASSIVE page tables, you don't want to spend the time copying the whole thing over. There's a huge overhead to that.
system(3) is not a good alternative because it indirects through the shell, which adds the overhead of launching the shell as well as the danger of misinterpreting shell metacharacters in the command if you aren’t meticulous about escaping them correctly.
`fork` is a classic example, as others have mentioned, of something that was implemented because it was [at the time] easy, rather than because it was a good design. In the decades since, we've found there are issues that are caused by the semantics of fork, especially if the most common subsequent system call is `exec`.
If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
POSIX specifies a fork+exec combo called posix_spawn. This is actually used a fair amount, but the reason it isn't used more is because it doesn't support all of the eleventy-billion parameters governing all of the attributes of the spawned process. Instead, these parameters are usually set by calling system calls that change these parameters between fork and exec. These system calls might, for example, change the root directory of a process or attach a debugger. Neither of these are supported by posix_spawn, which only allows the common operations of changing the file descriptors or resetting the signal mask in the list of actions to do.
And this suggests why you might want vfork: vfork allows you to write something that looks like posix_spawn: you get to fork, do your new-process attribute setting, and then exec the new process image, all while being able to report errors in the same memory space.
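Roughly like this (a sketch that leans on Linux/BSD behaviour being more permissive than the letter of POSIX, which only blesses exec and _exit in the vfork child):

```c
/* Sketch: the vfork child shares the parent's memory until exec, so an
 * exec failure can be reported back through an ordinary variable. */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

pid_t spawn_with_vfork(char *const argv[], int *exec_errno)
{
    volatile int err = 0;            /* lives in memory shared with the child */
    pid_t pid = vfork();

    if (pid == 0) {                  /* child: minimal setup, then exec */
        execvp(argv[0], argv);
        err = errno;                 /* visible to the parent via shared memory */
        _exit(127);
    }

    if (exec_errno)
        *exec_errno = err;           /* parent resumes only after exec/_exit */
    return pid;
}
```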
> If you're designing an OS from scratch, support for `fork` and `exec` as separate system calls is not what you want. Instead, you'd be likely to describe something in terms of a process creation system call, which will have eleventy billion parameters governing all of the attributes of the spawned process.
Or if you happen to be sane you'll have a single, simple system call to create a blank, suspended child process, and all the regular system calls which operate on process state will take a handle or process "file descriptor" to indicate which process to modify rather than assuming the current process as the target.
This was the ultimate flaw of posix_spawn(). As you point out it doesn't support all the things you might want to tweak in the child process—a consequence of trying to capture every aspect of the initial process state in a single process-creation API rather than distributing the work through the normal system calls so that each new interface or state can be adjusted for child processes in the same way that it's adjusted for the current process.
Whatever you do, though, make sure it's possible to emulate fork() reliably with your "better" replacement. Consider the case of Cygwin where emulated fork() calls can (and frequently do) fail in bizarre ways because the "blank" child process was pre-loaded with some unexpected virtual memory mapping by AV software or other system tasks, with the result that a required DLL or private memory space can't be set up at same address used in the parent.
Most APIs can be extended. The problem is that when someone adds a new tunable parameter or resource that one might want to modify for a child process it doesn't automatically get added to posix_spawn()—that takes extra effort. Which is why I emphasized using the same APIs for the current process and child processes, rather than duplicating the work in two places.
fork() without exec() can make sense in the context of a process-per-connection application server (like SSH). I've also used it quite effectively as a threading alternative in some scripting languages.
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(). Like a lot of POSIX APIs, it's kind of overcomplicated, but it does solve a lot of the problems with fork/exec.
> And as long as I'm asking stupid questions, why would anyone ever use vfork?
For processes with a very large address space, fork() can be an expensive operation. vfork() avoids that, so long as you can guarantee that it'll immediately be followed by an exec().
fork with copy-on-write semantics avoids copying the whole address space. It does have to copy some data structures that manage virtual memory and maybe the first level of the paging structure(page directory or whatever).
Can you elaborate on this? I understand why copying a large address space might be slow, but how or why does the number of threads in a process affect this? Is it scheduling?
Copy-on-write means twiddling with the MMU, and TLB updates across cores ("TLB shootdowns") can be very expensive. If the process is not threaded, then the OS could make sure to schedule the child and parent on the same CPU to avoid needing TLB shootdowns, but if it's threaded, forget about it.
From "Operating Systems: Three Easy Pieces" chapter on "Process API" (section 5.4 "Why? Motivating The API") [1]:
> ... the separation of fork() and exec() is essential in building a UNIX shell, because it lets the shell run code after the call to fork() but before the call to exec(); this code can alter the environment of the about-to-be-run program, and thus enables a variety of interesting features to be readily built.
>
> ...
>
> The separation of fork() and exec() allows the shell to do a whole bunch of useful things rather easily. For example:
>
> prompt> wc p3.c > newfile.txt
>
> In the example above, the output of the program wc is redirected into the output file newfile.txt (the greater-than sign is how said redirection is indicated). The way the shell accomplishes this task is quite simple: when the child is created, before calling exec(), the shell closes standard output and opens the file newfile.txt. By doing so, any output from the soon-to-be-running program wc are sent to the file instead of the screen.
As an explanation it doesn't make much sense, because there are other ways to alter the environment of the about-to-be-run program (see any non-Unix OS for examples).
Because "fork" was easy to implement in UNIX on the PDP-11.
The original implementation was for a machine with very limited memory. So fork worked by swapping out the process. But then, instead of releasing the in-memory copy, the kernel duplicated the process table entry. So there were now two copies of the process, one in memory and one swapped out. Both were runnable, even if there wasn't enough memory for both to fit at once. Both executed onward from there.
And that's why "fork" exists. It was a cram job to fit in a machine with a small address space.
# function1 and function2 are shell functions
$ function1 | grep foo | function2
here, the shell must fork a process (without exec) to run one of these functions.
For instance function1 might run in a fork, the grep is a fork and exec of course, and function2 could be in the shell's primary process.
In the POSIX shell language, fork is so tightly integrated that you can access it just by parenthesizing commands:
$ (cd /path/to/whatever; command) && other command
Everything in the parentheses is a sub-process; the effect of the cd, and any variable assignments, are lost (whether exported to the environment or not).
In Lisp terms, fork makes everything dynamically scoped, and rebinds it in the child's context: except for inherited resources like signal handlers and file descriptors.
Imagine every memory location having *earmuffs* like a defvar, and being bound to its current value by a giant let, and imagine that being blindingly efficient to do thanks to VM hardware.
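To make that concrete, here is a toy illustration (not real shell code) of why the parenthesized sub-shell falls out of fork() for free, with no exec at all: the child keeps the whole interpreter state, and its cd and variable assignments die with it.

```c
/* Toy sketch of "( cd /path/to/whatever; command ) && other command":
 * the child is still the same shell, interpreting with modified state. */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                          /* the "( ... )" part */
        chdir("/path/to/whatever");          /* affects only this copy */
        setenv("VAR", "only-visible-here", 1);
        /* ...keep interpreting commands with the modified state... */
        _exit(0);
    }
    int status;                               /* parent shell: cwd/env untouched */
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}
```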
I use fork a lot in my Python science programs. It's really great - you can stick it in a loop and get immediate parallelism. It's much better than multiprocessing, etc, as you keep the state from just before the fork happened, so you can share huge data structures between the processes, without having to process the same data again or duplicate them. I've even written a module for processing things in forked processes: https://pypi.org/project/forkqueue/
Splitting fork and exec allows you to do stuff before calling exec, for example redirecting file descriptors (like stdin/out/err), creating a pipe, modifying the child's environment, and so on.
That would be the fugliest, most unwieldy API in history. In addition to the two most basic things I mentioned, there are namespaces, control groups, setuid/setgid, and probably a billion other things I can't think of.
There are so many variations to what you can do with fork+exec that designing a suitable "fork-exec combo" API is really difficult, so any attempts tend to yield a fairly limited API or a very difficult-to-use API, and that ends up being very limiting to its consumers.
On the flip side, fork()+exec() made early Unix development very easy by... avoiding the need to design and implement a complex spawn API in kernel-land.
Nowadays there are spawn APIs. On Unix that would be posix_spawn().
> And as long as I'm asking stupid questions, why would anyone ever use vfork? If the child shares the parent's address space and uses the same stack as the parent, and the parent has to block, how is that different from a function call (other than being more expensive)?
(Not a stupid question.)
You'd use vfork() only to finish setting up the child side before it execs, and the reason you'd use vfork() instead of fork() is that vfork()'s semantics permit a very high performance implementation, while fork()'s semantics preclude one altogether.
I think it's actually a pretty useful primitive for doing multiprocessing. Unlike threading, you have a completely separate memory space both for avoiding data races and performance (memory allocators still aren't perfect and weird stuff can happen with cache lines). Unlike exec after fork or anything equivalent, you still get to share things like file descriptors and read only memory for convenience.
> Why would anyone ever want fork as a primitive? It seems to me that what you really want is a combination of fork and exec because 99% of the time you immediately call exec after fork (at least that's what I do 99% of the time when I use fork).
If you eliminate fork, then what do you do for those 1% of cases where you actually do need it? I agree that it's uncommon, but I have written code before that calls fork() but then does not exec().
> So why is there not a fork-exec combo?
There is; it's called posix_spawn(3).
> And why has it not replaced fork for 99% of use cases?
Even though it's been around for about 20 years, it's still newer than fork+exec, so I assume a) many people just don't know about it, or b) people still want to go for maximum compatibility with old systems that may not have it, even if that's a little silly.
Lacking fork(), if you want to multi-process a service, you have to spawn (vfork()+exec() or posix_spawn(), or whatever) the processes and arrange for them to get whatever state and resources they need to start up. It's a pain, but I've done it.
You might want to move around some file descriptors if you don't want the child process to inherit your stdin/stdout/stderr (e.g. if you want to read the stdout of the process you launched, or give it some stdin).
And there does exist such a fork-exec combo - posix_spawn. It allows adding some "commands" of what file descriptor operations to do between the fork & exec before they're ever done, among some other things. But, as the article mentions, using it is annoying - you have to invoke various posix_spawn_file_actions_* functions, instead of the regular C functions you'd use.
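For reference, the dance looks something like this (a sketch; the command, file name and mode are made up for illustration):

```c
/* Sketch: posix_spawn with file actions, redirecting the child's stdout
 * to a file without the caller ever calling fork() itself. */
#include <fcntl.h>
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int spawn_ls_to_file(void)
{
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    /* "Commands" recorded now, executed between the implicit fork and exec: */
    posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "out.txt",
                                     O_WRONLY | O_CREAT | O_TRUNC, 0644);

    char *argv[] = { "ls", "-l", NULL };
    pid_t pid;
    int rc = posix_spawnp(&pid, "ls", &fa, NULL, argv, environ);

    posix_spawn_file_actions_destroy(&fa);
    if (rc != 0)
        return rc;                   /* spawn failed; rc is an errno value */

    int status;
    waitpid(pid, &status, 0);
    return 0;
}
```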
The whole idea of fork is strange - the design pattern of "child process is executing exactly where the parent process is executing" is foreign to me. Don't we want to direct where the child process is executing? Like, when creating a thread? Why is fork() so conceptually orthogonal to that? Is there a good reason? A historical reason?
I don't find fork() to be obvious or useful or natural. I work hard to never do it.
Oh I understand how it works. I implemented it, in the first POSIX implementation. I just don't get how anybody wants to do that.
Yes, there's the example right there. But it shows the awkwardness immediately - decoding what the f happened by checking a side effect (is pid == 0? wtf?)
How about spoon(handle_connection, ...) or something like that? See how much better?
It makes it more difficult to pass context. You have to resort to the classical void * context, which is not handy to use. Or you have to use globals. The fork idea is more elegant to me: it duplicates the program flow execution in place.
If you want the child to start executing some other code but you have fork(), it's easy to do it yourself by calling that function.
But on the other hand, if you do want the child to execute code at the same place as the parent, but a hypothetical fork() asks you to provide a function pointer, it would be a bit more complicated.
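A sketch of that point: given fork(), a "spoon(fn, arg)"-style wrapper (the name borrowed from the hypothetical API upthread, not a real call) is only a handful of lines.

```c
/* Sketch: a spawn-a-function helper built on top of fork(). */
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t spoon(void (*fn)(void *), void *arg)
{
    pid_t pid = fork();
    if (pid == 0) {         /* child: run the requested function, then exit */
        fn(arg);
        _exit(EXIT_SUCCESS);
    }
    return pid;             /* parent: child's pid, or -1 on failure */
}
```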
It's a leaky abstraction and everything it does can be done manually, and possibly better. It exists purely because, at some point in the past, threads didn't exist.
If you design your program without fork, you'll probably end up with a cleaner and faster solution. Some things are best forgotten or never learned in the first place.
The beauty of (v)fork(+exec) is that it doesn't need a new interface for configuring the environment in whichever way you want before the other process starts. Instead you get to use the exact same means of modifying the environment to your needs, and once it's done, you can call exec and the new process inherits those things.
I mean, just look at the interface of posix_spawn.
I grant though that this isn't without its problems (including performance) and IMO e.g. FD_CLOEXEC is one example of how those problems can be patched up. It's like the reverse problem: the implicit interface is too wide, and then you need to come up with all these ways to be explicit about some things.
Add to that, fork is (was) very inefficient. You had to duplicate the entire process state (page tables etc). Then the damn program would exec(), and you would tear it all down again. Took 100ms on older computers. Complete waste.
We would resort to making a weak copy, with page tables faulting in only if you used them. A lot of drama, so the user could make a goofy call that they didn't really want most of the time.
Another option is to allow the parent to create an empty child process, and then make arbitrary system calls and execute code in the child, like a debugger does. In most cases the last "remote system call" would be exec.
One use case for fork()--which is used extensively on Android--is to build an expensive template process that can then be replicated for later work, which is exactly what people often want for the behavior with virtual machines. I wrote an article on the history of linking and loading optimizations leading up to how Android handles their "zygote" which touches on this behavior.
We had the case that some library we were using (OpenBLAS) used pthread_atfork. Unfortunately, the atfork handler was buggy in certain situations involving multiple threads and caused a crash. This was annoying because we basically did not need fork at all, but just fork+exec (for various other libraries spawning sub-processes), where those atfork handlers would not be relevant.
Our solution was to override pthread_atfork to ignore any functions, and in case this is not enough, also fork itself to just directly do the syscall without calling the atfork handlers.
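Not their actual code, but the shape of those two overrides might look roughly like this on x86-64 Linux (raw clone argument order differs on other architectures, and symbol-interposition details are glossed over). Note the caveat a sibling comment raises: going around libc's fork() also skips libc's own post-fork housekeeping.

```c
/* Sketch: (1) swallow pthread_atfork registrations so buggy library
 * handlers are never installed; (2) a fork() that goes straight to the
 * kernel, so no atfork handlers run at all. */
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int pthread_atfork(void (*prepare)(void), void (*parent)(void),
                   void (*child)(void))
{
    (void)prepare; (void)parent; (void)child;
    return 0;                        /* pretend registration succeeded */
}

pid_t fork(void)
{
    /* Raw clone with SIGCHLD as the termination signal and no flags
     * behaves like fork on x86-64, bypassing glibc's atfork machinery. */
    return (pid_t)syscall(SYS_clone, SIGCHLD, NULL, NULL, NULL, 0);
}
```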
posix_spawn() shouldn't call atfork handlers. It's allowed to call them or not call them because implementors can use fork(), which must call them, or they can use vfork(), which must not call them -- or they can make posix_spawn() a proper system call, too, or they can use clone(), or my putative avfork(), or whatever.
If you used vfork(), you wouldn't have had this problem.
Fork-safety issues arise mainly because of the sharing of resources between the parent and child. pthread_atfork() exists mainly to allow libraries to add a measure of fork-safety by letting them disable things on the child-side of fork() or re-set-up things on the child-side of fork(). For example, a PKCS#11 provider might need to create a new connection to the tokens and re-C_Login() to them (except, since it really can't quite do that, most likely it must render every session inoperable on the child-side). (Indeed, PKCS#11 specifically mandates that on the child-side of fork all sessions must be dead and must not be used.)
The good/evil/etc. here seem to be defined exclusively around "performance above all else", and - more specifically - performant primitives over performant application architecture.
It strikes me that performance gains associated with sharing address space & stack are similar to many performance gains: trade-offs. So calling them "good" and "evil" when performance is seemingly your sole goal and interest seems a bit forward.
In my world we often say things like "X is the moral equivalent of Y" where X and Y are just technologies and, clearly, are morally-neutral things.
Why do we do this? Well, because it adds emphasis, and a dash of humor.
Clearly fork() is neither Good nor Evil. It's morally neutral. It has no moral value whatsoever. But to say "fork() is evil" is to cause the audience to raise their eyebrows -"what, why would you say fork() is evil?!"- and maybe pay attention.
Yes, there is the risk that the audience might react dismissively because fork() obviously is morally-neutral, so any claim that it is "evil" must be vacuous or hyperbolic. It's a risk I chose to take.
Really, it's a rhetorical device. I think it's pretty standard. I didn't create that device myself -- I've seen it used before and I liked it.
Morally-neutral does not equate to neutral insofar as I think most technologists consider some tech to be "good" and some to be "bad" in a practical sense.
"Good -vs- evil" is obviously hyperbolic - particularly the latter - but outside of morals they still imply a tendency to be technically/practically good or bad in an objective sense. So discounting it as a mere rhetorical device seems overly dismissive.
Fork() is the second worst idea in programming, behind null pointers. Fork() is the reason overcommit exists, which is the reason my web browser crashes if I open too many tabs, and the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library. It's a clear example of "worse is worse", and we should have switched to the Microsoft Windows model decades ago.
Here's a paper from Microsoft Research supporting this point of view:
> the reason the "safe" Rust programming language leaves software vulnerable to DOS attacks if it uses the standard library
Linux overcommitment is often cited as an argument for the "panic on OOM" design of the allocating parts of the Rust standard library, and it's an important part of the story. But I think even if the Linux defaults were different, Rust would still have gone with the same design. For example, here's Herb Sutter (who works for Microsoft) arguing that C++ would benefit from aborting on allocation failure: https://youtu.be/ARYP83yNAWk?t=3510. The argument is that the vast majority of allocations in the vast majority of programs don't have any reasonable options for handling an alloc failure besides aborting. For languages like C++ and Rust, which want to support large, high-level applications in addition to low-level stuff, making programmers litter their code with explicit aborts next to every allocation would be really painful.
I think it's very interesting that Zig has gone the opposite direction. It could be that writing big applications with lots of allocs ends up feeling cumbersome in Zig, or it could be that they bend the curve. Fingers crossed.
Why is overcommit a problem? A program is unlikely to use all the memory that it allocates, or it will use it only at a later time. It would be a waste not to have overcommit: it would mean having a ton of RAM that never gets used, because a lot of programs allocate more RAM than they will probably ever need. And it would be inefficient, costly and error prone to use dynamic memory allocation for everything.
The cause of your browser crash is not the overcommit, it is simply the fact that you don't have enough memory. If you disable overcommit (something you can do on Linux) you would see the same crash earlier, before you had allocated (not necessarily used) 100% of your RAM (because really no software handles the dynamic-memory failure condition, i.e. malloc returning null, which you can't handle reasonably anyway).
Null pointers are not a mistake, how do you signal the absence of a value otherwise? How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)? null makes a perfect sense to be used as a value to signal "this pointer doesn't point to something valid".
Microsoft saying that fork() was a mistake... well, of course, because Windows doesn't have it. fork was a good idea, and that is the reason why it's still used these days. Of course nowadays there are evolutions: in Linux there is the clone system call (the fork syscall is mostly kept for compatibility reasons; the glibc fork is implemented with the clone system call). But the concept of creating a process by cloning the resources of the parent is something that always seemed very elegant to me.
In reality fork is something that (if I remember correctly, I don't have that much experience programming on Windows) doesn't exist on Windows, and the only way to create a new process of the same program is to launch the executable and pass the parameters from the command line, which is not that great for efficiency at all, and also can have its problems (for example if the executable was deleted, renamed, etc. while the program was running). Also in Windows there is no concept of exec, though I think it can be emulated in software (while fork can't).
To me it makes perfect sense to separate the concept of creating a new process (fork/clone) and loading an executable from disk (exec). It gives a lot of flexibility, at a cost that is not that high (and there are alternatives to avoid it, such as vfork or variations of the clone system call, or directly higher level API such as posix_spawn).
I think much of the confusion around nulls stems from the fact that in mainstream languages pointers are overloaded for two purposes: for passing values by reference, and for optionality.
Nearly every pointer bug is caused by the programmer wanting one of these two properties, and not considering the consequences of the other.
Non-nullable references and pass-by-value optionals can replace many usages of pointers.
Yes, and they are just two usages of pointers. The fact is that, whatever you call it (null pointer, nullable reference, optional), you have to put into the language some concept of "a reference that might not refer to a valid object".
>How do you signal the failure of a function that returns a pointer without having to return a struct with a pointer and an error code (which is inefficient since the return value doesn't fit a single register)?
Rust does this with the Result and Option "enums", which are internally implemented as tagged unions. From my understanding the only overhead with this implementation is the size taken by the tag and then any padding required for alignment.
It also helps that references in Rust are not nullable and working with pointers is fairly rare, so the type system can do a lot of heavy lifting for you rather than putting null checks all over the place. When you have &T you never have to worry about handling null in the first place!
The inventor, Tony Hoare, famously called them his "billion-dollar mistake". The better way to do it is with nullable types (which could internally represent null as 0 as a performance optimization). This is something Rust gets right.
Nullable types... they have the same problems as null pointers: if you don't care about handling the case they are null the program will crash, if you handle it, you can handle it also for null pointers. Well, they have a nicer syntax, and that's it. How much Rust code is full of `.unwrap()` because programmers are lazy and don't want to check each optional to see if it's valid? Or simply don't care about it, since having the program crash on an unexpected condition is not the end of the world.
The Rust code using `.unwrap()` is explicitly testing for a missing value and signaling a well-defined error when the prerequisites are not met. Contrast this with dereferencing a null pointer in C, where doing so results in undefined behavior.
More importantly, in Rust you don't have to allow the value to be missing. What Rust has but C does not is not nullable pointer types, but rather non-nullable ones—in C all pointers are potentially null, or dangling, or referencing incorrectly aliased shared memory, etc. Barring a programming error in marked `unsafe` code, or a compiler bug, if you have a plain reference in Rust not wrapped in Option<T> then it can't possibly be null (or invalid or mutable through other references) so you don't need to check for that and your program is still guaranteed not to crash when you use it.
Nullable/option types are explicit. Every time you ignore null, you have to make a conscious choice to do so, and it's prominent in the source code forever after.
The problem with null pointers is that you have to remember to check for null. For OO languages specifically, the other problem is that null pointers violate the Liskov substitution principle.
More importantly, all syscalls also take a target process as an argument, making the Windows version both simpler and more powerful than can be done with fork. Spawn is also a lot slower on Windows, but that is an implementation issue.
> Spawn is also a lot slower on Windows, but that is an implementation issue.
afaik most of that slowdown is because malware scanners (including Windows Defender) hook spawn to do blocking verification of what to launch. Which is an issue also present on eg. MacOS, and why it's also kinda slow to launch new processes (and can be subject to extreme latencies): https://www.engadget.com/macos-slow-apps-launching-221445977...
Which is yes an implementation problem, but also a problem that potentially changes/impacts the design. Like maybe it'd make sense to get a handle to a pre-verified process so that repeated spawns of it don't need to hit that path (for eg. something like Make or Ninja that just spam the same executable over and over and over again). Or the kernel/trusted module needs to in some way be involved & can recognize that an executable was already scanned & doesn't need to be re-scanned.
Very true (hence "most" not "all" in my statement :) ), but with AV disabled it's more or less on par with MacOS: https://www.bitsnbites.eu/benchmarking-os-primitives/ (not the best comparison given the wide variety of hardware in play, but for orders of magnitude it's probably good enough)
File creation on Windows is similarly massively impacted by search & AV.
I don't think there's anything inherent to the semantics of Win32 CreateProcess that makes it slow. But there's clearly something inherent to NT architecture that does, because it was just as true 25 years ago as it is today.
Windows doesn't have fork as you know it. It has a POSIX-ish fork-alike for compliance, but under the hood it's CreateThread[0] with some Magic.
In Windows, you create the thread with CreateThread, then are passed back a handle to that thread. You then can query the state of the thread using GetExitCodeThread[1], or if you need to wait for the thread to finish, you call WaitForSingleObject[2] with an Infinite timeout.
Aside: WaitForSingleObject is how you track a bunch of stuff: semaphores, mutexes, processes, events, timers, etc.
The flipside of this is that Windows processes are buckets of handles: a Process object maintains a series of handles to (threads, files, sockets, WMI meters, etc), one of which happens to be the main thread. Once the main thread exits, the system goes back and cleans up (as it can) the rest of the threads. This is why sometimes you can get zombie'd processes holding onto a stuck thread.
This is also how it's a very cheap operation to interrogate what's going on in a process ala Process Explorer.
If I had to describe the difference between Windows and Linux at a process model level, I have to back up to the fundamental difference between the Linux and Windows programming models: Linux is a kernel that has to hide its inner workings for its safety and security, passing wrapped versions of structures back and forth through the kernel-userspace boundary; Windows is a kernel that considers each portion of its core separated, isolated through ACLs, and where a handle to something can be passed around without worry. The Windows ABI has been so fundamentally stable over 30 years now because so much of it is built around controlling object handles (which are allowed to change under the hood) rather than manipulation of kernel primitives through syscalls.
Early WinNT was very restrictive and eased up a bit as development continued so that win9x software would run on it under the VDM. Since then, most windows software insecurities are the result of people making assumptions about what will or won't happen with a particular object's ACL.
There's a great overview of windows programming over at [3]. It covers primarily Win32, but gets into the NT kernel primitives and how it works.
A lot of work has gone into making Windows an object-oriented kernel; where Linux has been looking at C11 as a "next step" and considering if Rust makes sense as a kernel component, Windows likely has leftovers of Midori and Singularity [4] lingering in it that have gone onto be used for core functionality where it makes sense.
Overcommit exists any time you can have a debugger anyway.
fork() was a brilliant way to make Unix development easy in the 70s: it made it trivial to move a lot of development activity out of the kernel and into user-land.
But with it came problems that only became apparent much later.
unpopular opinion: null pointers (in at least java and c) are the single greatest metaphor in software development, and are the CS analog to the invention of zero
There was an article about exceptions the other day that lamented that exceptions are high latency because the exceptional path will be paged out. I would assume overcommit is to blame for that too.
That's probably a caching issue, and caching issues are a fact of life for the foreseeable future. (Could also be a disk swap issue, but probably not.)
Well, it's Linux's whole memory philosophy, really: you ask for data storage that may or may not actually be memory. This ties in with overcommit, because if you promise more memory than you have, you need a contingency plan. That means flushing caches, swapping data to disk, and dropping executable code (it is file-backed, so it can just be read back in later).
This fuzziness about what is and isn't in memory is why stuff that is rarely needed has to hit disk, which means a latency spike.
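You can see the ask-now-pay-later behaviour with a tiny experiment like the following (a rough sketch; the 2 GiB figure is arbitrary, it assumes the machine can actually back that much, and ru_maxrss being in KiB is a Linux-ism):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    static long max_rss_kib(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_maxrss;            /* KiB on Linux */
    }

    int main(void)
    {
        size_t len = (size_t)2 * 1024 * 1024 * 1024;   /* 2 GiB of address space */
        char *p = malloc(len);
        if (p == NULL) { perror("malloc"); return 1; }

        printf("after malloc: max RSS ~ %ld KiB\n", max_rss_kib());
        memset(p, 1, len);              /* now the pages really have to be backed */
        printf("after memset: max RSS ~ %ld KiB\n", max_rss_kib());

        free(p);
        return 0;
    }

The malloc() succeeds immediately while the RSS stays tiny; only touching the pages makes them real, which is exactly the promise-now, deliver-later model described above.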
"I won't bother explaining what fork(2) is -- if you're reading this, I assume you know.", If that applied to everything I looked at from HN I'd read precious little.
I didn't write it for HN. It wasn't a paper to publish in some Computer Science journal. It was just a github gist. If you don't get the subject, it's not for you. I might well write a paper now based on it, and then it might be a good read for you, but I still won't be writing it for you, but for people who are interested in the topic. The intended audience is small, expert on the matter, and probably even more opinionated than I am.
I found the article well written and informative even though it's not my area of expertise, I intended my comment as a light hearted reflection of the fact that a lot of articles on HN go over my head but are still worth a read to me, just like your article.
For those saying to use posix_spawn: What am I supposed to make of the writeup in the posix_spawn manpage though?
"...specified by POSIX to provide a standardized method of creating new processes on machines that lack the capability to support the fork(2) system call. These machines are generally small, embedded systems lacking MMU support"
Is this why no one uses it? It has this gratuitous opinion piece at the beginning that makes people think it's just for embedded systems and my dad's Amiga?
That's just some injected opinion, I assume from a glibc contributor who doesn't like posix_spawn. In any case, it is wrong.
Don't assume what is written in man pages is the truth. Some of them have a lot of opinion added. It can be useful to cross-check man pages between systems - they don't always call out non-portable options or behavior.
On some kernels posix_spawn() is a system call, or it takes flags that make it more efficient than fork+exec. Darwin is one such system; there you can even use POSIX_SPAWN_SETEXEC if you want to replace the current process with a new executable rather than create a child.
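For anyone who hasn't used it, a minimal posix_spawn sketch looks roughly like this (spawning ls is just an arbitrary example; no file actions or attributes are set):

    #include <spawn.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int main(void)
    {
        pid_t pid;
        char *argv[] = { "ls", "-l", "/tmp", NULL };

        /* posix_spawnp() searches PATH; the NULLs mean default file actions
           and attributes.  Note it returns an errno value, not -1/errno. */
        int err = posix_spawnp(&pid, "ls", NULL, NULL, argv, environ);
        if (err != 0) {
            fprintf(stderr, "posix_spawnp: %s\n", strerror(err));
            return 1;
        }

        int status;
        waitpid(pid, &status, 0);
        printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
        return 0;
    }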
Hah, that's pretty funny. Regardless of the motivation as written, the motivation I surmise is:
- some systems (e.g., Windows) lack fork() for various reasons
- vfork() is baaaad
- I know, let's do something like WIN32's spawn() or CreateProcess(), but, like, better
The middle item I have good reason to think is very likely. vfork() still has a bad rap from that old "vfork() Considered Dangerous" paper. That paper circulated a lot way back when, and was the reason vfork() was removed from some Unixes for a while (well, it was left as an alias of fork()) before it was eventually re-added. The Open Group participants would have been very aware of that paper, and that is almost certainly the reason that POSIX says this about vfork():
> Conforming applications are recommended not to depend on vfork(), but to use fork() instead. The vfork() function may be withdrawn in a future version.
So if fork() can't perform well, and the committee won't recommend the use of vfork(), what shall the committee do? Answer: design and specify posix_spawn(). It's not an unreasonable answer. Though, IMO of course, they should have un-obsoleted vfork().
Meta comment: Github Gist seems to be great for blogging. Yeah, the UI is not very blog-specific, but it has all the useful features, and then some: markdown, comments, hosting, an index of all posts, some measure of popularity (stars), a very detailed edit history, etc.
All without having to pay for or set up anything yourself.
Unfortunately, there's no way to turn off comments on a Gist, which makes it not a viable replacement for anyone who doesn't want to spend a lot of time processing and moderating comments.
Good point. However, you need a GitHub account to post comments so everyone knows who you are. Your reputation might suffer if you constantly post comments that require moderation.
This does not, in practice, stop people. Both because it's possible to make throw-away accounts, and because some people don't have a reputation to care about to begin with.
This avfork implementation is poor. You don't want to make your single threaded programs multi-threaded. I don't really get the big benefit of afork over other existing mechanisms other than handwaving about things being evil.
Also,
> Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via libsocket on top of STREAMS proved to be a very long and costly mistake. Emulating one API from another API with impedance mismatches is difficult at best.
Linux does have a thread creation system call. It's clone(2). It literally creates new threads of execution with various properties. It does not "emulate" threads, it is threads.
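To illustrate, here is a sketch of the glibc clone() wrapper used directly as a thread-creation call (the flag combination and stack size are just one reasonable choice, not the only one):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int thread_fn(void *arg)
    {
        const char msg[] = "hello from a clone()d thread\n";
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        return 0;
    }

    int main(void)
    {
        const size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);
        if (stack == NULL) { perror("malloc"); return 1; }

        /* CLONE_VM and friends make the child share the parent's address space,
           file descriptors, filesystem info and signal handlers: a thread, not
           a copied process.  SIGCHLD as the exit signal lets waitpid() work. */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD;
        pid_t tid = clone(thread_fn, stack + stack_size, flags, NULL);
        if (tid == -1) { perror("clone"); free(stack); return 1; }

        waitpid(tid, NULL, 0);
        free(stack);
        return 0;
    }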
You do, but it's not a good implementation for a general API is all I was trying to say.
Do you really need an "asynchronous process creation" call? The rationale is that "blocking is bad", but a thread creation system call blocks the caller too until the thread is created. So it's not just "blocking", it's the amount of blocking if anything. Is posix_spawn or vfork+exec really too slow for your case?
Then multi-process and multi-threading seems like a reasonable solution. Asynchronous system calls are the exception not the rule in unix. So it wouldn't make sense as a traditional afork(2) system call. You could probably do a posix_spawn for io_uring, but do you really need to?
- @famzah's blog about fork vs vfork vs clone performance:
https://blog.famzah.net/tag/fork-vfork-popen-clone-performance/
- A very similar idea to my afork() idea, from 2 years earlier:
https://developers.redhat.com/blog/2015/08/19/launching-helper-process-under-memory-and-latency-constraints-pthread_create-and-vfork
- misc
https://inbox.vuxu.org/tuhs/CAEoi9W6HFL3UcnWkKoqka8Dt16MWskKd6yEJr3HYCcCT9pMTig@mail.gmail.com/T/
https://bugzilla.redhat.com/show_bug.cgi?id=682922 (see attachments)
The intent of fork() is to start a new process in its own address space. The *fork() variations that run in the SAME address space are confusing. A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages. But generally fork() is quite specific, from my recollection.
> The intent of fork() is to start a new process in its own address space.
True!
> The *fork() variations that run in the SAME address space are confusing.
Why is it confusing? They are distinct system calls with different semantics. They are also sufficiently similar that they are similarly named. But there's nothing confusing about their semantics. vfork() is not harder to use than fork() -- it's just subtly different.
> A use case today for fork() might also be sandboxing apps. Certainly I expect browsers use this approach to spawn unique pages.
I wouldn't expect that. Sandboxing is a large and complex topic.
Amusingly, vfork() semantics differ across OSes. This program prints 42 on Linux but 1 on Mac, because on Linux the parent and child share an address space: https://godbolt.org/z/jn7Gaf5Me
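The linked program is along these lines (a reconstruction, not the exact code; modifying anything other than the vfork() return value in the child is outside what POSIX guarantees, which is exactly why the behaviour diverges):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int x = 1;
        pid_t pid = vfork();
        if (pid == 0) {
            x = 42;          /* on Linux this writes into the parent's memory */
            _exit(0);
        }
        /* vfork() suspends the parent until the child _exit()s or exec*()s */
        printf("%d\n", x);   /* 42 where the address space is shared, 1 where it is not */
        return 0;
    }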
Unfortunately there was this paper from the 80s titled "vfork() Considered Dangerous", which led to BSDs removing vfork(), and then later it was re-added because that paper was clearly quite wrong. But the news hasn't quite filtered through to Apple, I guess.
I am pretty sure Mac OS doesn't COW fork(), and that the address space is copied. At least it was the last time I looked. FreeBSD and Linux both seem to COW.
My (very possibly wrong) understanding is that xnu does CoW fork but doesn't overcommit, meaning that memory must be reserved (perhaps in swap) in case the pages need to be duplicated.
There are other complications relating to inheriting Mach ports and the mach_task <-> BSD process "duality" in xnu, which Linux doesn't have. I'd love for someone to chime in who knows more about how this stuff works.
I started with DOS, where spawn() is the norm, so I've always considered the fork()-like behaviour to be unusual yet handy for certain use-cases. Perhaps a system call that offers a combination of the two behaviours should be named spork().
- vfork() is O(1)
- copying fork() is O(N), where N is the amount of writable memory in the parent's address space
- copy-on-write fork() is O(N), where N is the resident set size (RSS) of the parent
O(1) beats O(N).
And O(N) is just the complexity of fork() for a single-threaded parent process. Now imagine a very busy, threaded, large-RSS process that forks a lot. You get threads and child processes stepping all over each other's CoW mappings, causing lots of page faults and copies. Ok, that is still O(N), but users will feel the added pain of all those page faults and TLB shootdowns.
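If you want a rough feel for the numbers on your own machine, something like this will do (a crude sketch, not a rigorous benchmark; the iteration count and default size are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        size_t mib = argc > 1 ? strtoul(argv[1], NULL, 10) : 256;
        size_t len = mib * 1024 * 1024;

        char *buf = malloc(len);
        if (buf == NULL) { perror("malloc"); return 1; }
        memset(buf, 1, len);            /* touch every page so it counts toward RSS */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 100; i++) {
            pid_t pid = fork();
            if (pid == 0)
                _exit(0);               /* child does nothing at all */
            waitpid(pid, NULL, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total_us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                           (t1.tv_nsec - t0.tv_nsec)) / 1e3;
        printf("%zu MiB resident: %.1f us per fork+wait\n", mib, total_us / 100);
        free(buf);
        return 0;
    }

Run it with increasing sizes and the per-fork cost should grow roughly with the resident set, which is the O(N) claim above in concrete terms.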
Ok, but you're just repeating "it's inefficient" and not saying for what use case its inefficiency is even noticeable. I want to reason about when I would care. You see?
The first link didn't even have units on its numbers(!) I assume they're milliseconds. When does that scale become something one would care about at all? Not launching a gui process. Not a shell pipeline. So when is this issue arising at all? What is being done that makes fork inefficiency anything other than academic interest. Must be something, right? Forking webserver?
> When does that scale become something one would care about at all? Not launching a gui process. Not a shell pipeline.
Indeed, in those cases one just does not care about performance.
Yet there are cases where one does. Imagine an orchestration system written in Java -- with lots of threads (perhaps because it might be a thread-per-client affair, or maybe just NCPU threads), with a large heap (because Java), and launching lots of small tasks as external programs. Maybe those tasks are ssh commands (ok, sure, today you could use an SSH library in Java) or build jobs (maybe your app is a CI/CD orchestrator). Now launching external jobs is the core of what this does, and now the cost of fork() bites.
So for software architectures that separate concerns by spawning many short-lived processes and using message passing (which seems like a great idea, I just can't think of anything that does that, would love examples if they exist), it /could/ be a factor, but we have no numbers. Do you see it?
Let's just say I want to design a solution involving spawning a buttload of processes and passing messages back and forth. Roughly when does fork efficiency become something other than of academic concern? 10 processes per second, 1000, 100000? What does the inefficiency look like? Nothing? A stutter you might not notice? Or does everything grind to a halt, so you can't log in to the box and even the OOM killer won't help you?
That's a fair question. Basically, don't call fork() in Java (via JNI or the like), or use Java classes that do, and you might be fine; and if ever you're not, you'll know where to start looking.
Don't ever call fork from Java? Not even once? And what are the consequences of calling fork? A minor stutter? Halt and catch fire? I don't do Java, but it's hardly new tech. Surely someone has done some numbers on competing operating systems in the past couple of decades?
Until you quantify, even very roughly, what the observed issue is, when you see it, and how it degrades, whatever you're trying to optimize is just urinating into the breeze. "We might get lucky" is the best outcome. The chances of it being a really good outcome are pretty limited. Decrying something as "inefficient" based on big O or whatever is just meaningless until we actually measure it. [1]
[1] selection sort is O(n^2) and can totally dominate O(n log n) algorithms in actual time and cycles spent depending on circumstance. We have to specify, it's not something that can be shortcut because it will likely get a terrible result.
I have had to debug slow forking cases with Java. No I can't point you at data from those. I can point you to the Microsoft paper and @famzah's posts if you want data. For Microsoft this is an important topic: they don't want to have to implement a real fork(), and I fully understand why they don't want to. My guess is they will eventually buckle and do it. fork() is not easy to implement.
It's inherently inefficient because while the child process does its initialization (pre-exec) stuff, the parent gets page faults for every thread writing into the memory due to COW. This will basically stall the parent and can cause funny issues.
In another comment, I observe how Go doesn't even have a binding to fork.
Erlang is another example of that. There is no standard library binding to the fork function. If someone were to bash one into a NIF, I have no idea what would happen to the resulting processes, but there's no good that can come of it. (To use Star Trek, think less good and evil Kirk and more "What we got back, didn't live long... fortunately.") Despite the terminology, all Erlang processes are green threads in a single OS process.
> Despite the terminology, all Erlang processes are green threads in a single OS process.
The main Erlang runtime uses an M:N Erlang:native process model, not an N:1. So Erlang processes are like green threads (they are called processes instead of threads because they are shared-nothing), but not in a single process.
I mentioned this somewhere else but I thought Erlang does NOT share memory.
Doesn't that make Erlang a bit unique? It offers the ability to spawn a new process extremely fast AND memory isolation. This combination is what the OP was wanting to achieve.
Erlang mostly doesn't share memory between its Erlang processes, but it does this by making it so there's simply no way, at the Erlang level, of even writing code that refers to the memory in another Erlang process. It's an Erlang-level thing, not an OS-level thing.
If you write a NIF in C, it can do whatever it wants within that process.
The BEAM VM itself will share references to large binaries. Erlang, at the language level, declares those to be immutable so "sharing" doesn't matter. As an optimization, the VM could choose to convert some of your immutable operations into mutation-based ones, but if it does that, it's responsible for making the correct copies so you can't witness this at the Erlang level.
The Erlang spawn function spawns a new Erlang process. It does not spawn a new OS process. While BEAM may run in multiple OS processes per dragonwriter, the spawn function certainly isn't what starts them. The VM would.
So you cannot spawn a new Erlang process and then set its UID, priority, current directory, and all that other state that OS processes have, because an Erlang process is not an OS process. If the user wants to fork for some reason beyond simply running a program, because they want to change OS process attributes, Erlang is not a viable choice.
Erlang is not unique in that sense. It runs as a normal OS process. What abilities it has are implemented within that sandbox, no different than the JVM or a browser hosting a Javascript VM.
My reference to “fast” was in the context of creating a new process due to the OP post talking about how long fork/etc can take. Not in reference to executing code itself.
In that sense it's fast in the same way e.g. coroutines (/goroutines) are fast: it's just the Erlang scheduler performing some allocation (possibly from a freelist) and initialisation. Avoiding the kernel having to set things up, and the related context switches, makes for much better performance.
> I thought green threads share memory but Erlang processes do NOT share memory, which is what makes Erlang so unique.
Erlang processes don’t share memory because the language and vm don’t give primitives which let you do it. They all exist within the same address space (e.g. large binaries are reference-counted and stored on a shared heap, excluding clustering obviously).
> Did Erlang create a so called “green process”?
Yes.
> If so, why can’t this model be implemented in the kernel?
Because Erlang processes are not an antagonistic model, and the language restricts the ability to attack the VM (kinda, I'm sure you could write NIFs to fuck up everything, you just don't have any reason to as an application developer).
The problem is clone is more of a start phase after vfork but before fork regardless for github. So it's kind of a bit strange that we call vfork first but that is about templates too.
As for templates they need to be in different languages and in different formats for video games consoles, and so many other formats they port systems and games that sort of work digitally to certain things but not playable to certain things too.
The other problem is that clone is part of the syscall interface, part of various APIs, and part of a lot of other things too.
It's a rhetorical device. I didn't expect this to -years later- become a front-page item on HN. I wrote that to share with certain people.
And yes, clone() has some real problems, and if calling it "stupid" pisses off some people but also leads others to want to improve clone() or create a better alternative, then that's fine. If I'd wanted to write an alternative to Linux I'd probably have had to deal with the very, very fine language that Linus and others use on the Linux kernel mailing lists -- if you don't like my using the word "stupid", then you really shouldn't look there, because you're likely to be very disappointed. Indeed, not only would I have to accept colorful language from reviewers there, I'd probably have to employ some such language myself.
TL;DR: clone() came from Linux, where "stupid" is the least colorful language you'll find, and me calling it "stupid" is just a rhetorical device.