
I worked at Linaro, which was contracting for Qualcomm. Qualcomm was pushing a protected hypervisor called Gunyah (which had its own Linux interface and needed a new QEMU port) that apparently no one liked. I tried to port it to KVM [1], but upstream folks (mostly Google) outright rejected the port. Otherwise KVM would have been available on QCOM boards. You can still try it; I have a Linux kernel tree and a QEMU port on my GitHub [2,3].

[1] https://lore.kernel.org/kvm/20250424141341.841734-1-karim.ma...

[2] https://github.com/karim-manaouil/linux-next/tree/gunyah-kvm

[3] https://github.com/karim-manaouil/qemu-for-gunyah


Upstream would accept a patchset that exposed an independent Gunyah-specific UAPI (why not the same one as downstream — crosvm already supports that) instead of pretending to be KVM (it's not a "port", you can't port a hypervisor to a hypervisor).

KVM is available on current compute platforms (laptops) if you escape to EL2 via slbounce; and on Glymur (X2E) it will be available by default (yay!).


That's not how operating systems work. KVM is both an interface and a hypervisor. Just as we have different hypervisor implementations for AMD, Intel, Arm and others, all abstracted behind the same KVM interface, there is no reason the same can't be done for Gunyah. Userspace does not have to know anything about that. KVM already supports SVM and VMX for AMD and Intel on x86. Why can't something similar be done for Arm? Plus now there is pKVM.

I just don't understand this argument for a separate interface. The only reason you would want to do that is to decouple from the KVM community, but that introduces a shit ton of duplicated effort and needless fragmentation into the virtualisation software ecosystem, hindering your users from enjoying the existing upstream tools they already know. In other words, vendor lock-in and a shitty downstream experience.


Linux kernel + bootloaders + firmware

The Linux kernel side is mostly device trees, device drivers and the like.

U-Boot is very famous as a bootloader in the embedded space

Firmware for board bring-up and devices


There are Qualcomm laptops now, I believe (at least that's what I heard when I was last working for them). NXP also makes some boxes (I own a bunch of them). The server market is also growing with Ampere and Cavium (now part of Marvell); I have machines from both.


Also AWS Graviton and Google Axion servers & VMs on those clouds


NUMA is only useful if you have multiple sockets, because then you have several I/O dies and you want your workload 1) to be closer to the I/O device and 2) to avoid crossing the socket interconnect. Within the same socket, all CPUs share the same I/O die, thus uniform latency.
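
If you do want to exploit that locality explicitly, libnuma is the usual tool. A rough sketch (link with -lnuma; the node number here is an assumption, in practice you'd read it from something like /sys/class/net/<dev>/device/numa_node):

    /* Pin a worker thread and its buffers to the node nearest the I/O device. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        int node = 0;                                  /* assumed: node closest to the NIC/NVMe */
        numa_run_on_node(node);                        /* keep this thread on that node's CPUs  */
        void *buf = numa_alloc_onnode(1 << 20, node);  /* 1 MiB allocated local to that node    */
        /* ... submit I/O using buf ... */
        numa_free(buf, 1 << 20);
        return 0;
    }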


I think Meta has already rolled out some CXL hardware for memory tiering. Marvell, Samsung, Xconn and many others have built various memory chips and switching hardware up to CXL 3.0. All recent Intel and AMD server CPUs support CXL.


CXL uses the PCIe physical layer, so you just need to buy hardware that understands the protocol, namely the CPU and the expansion boards. AMD Genoa (e.g. EPYC 9004) supports CXL 1.1, as do Intel Sapphire Rapids and all subsequent models. For CXL memory expansion boards, you can get them from Samsung or Marvell. I got a 128 GB model from Samsung with 25 GB/s read throughput.


It's not that deep. The futex was developed just to save you from issuing a special system call to ask the OS to put you on a wait queue.

The whole point is that implementing a mutex requires doing things that only the privileged OS kernel can do (e.g. efficiently blocking/unblocking processes). Therefore, for systems like Linux, it made sense to combine the features for a fast implementation.
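
A minimal sketch of what that combination looks like, roughly along the lines of Drepper's "Futexes Are Tricky" (0 = unlocked, 1 = locked, 2 = locked with waiters; untested, no error handling):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static atomic_int lock_word;   /* 0, 1 or 2 as above */

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static void mutex_lock(void)
    {
        int expected = 0;
        /* Fast path: one atomic compare-and-swap, no kernel involvement. */
        if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
            return;
        /* Slow path: mark the lock contended and let the kernel put us to
           sleep, but only while the word still holds the value we expect. */
        while (atomic_exchange(&lock_word, 2) != 0)
            futex(&lock_word, FUTEX_WAIT, 2);
    }

    static void mutex_unlock(void)
    {
        /* If the word was 2, someone may be asleep in the kernel; wake one. */
        if (atomic_exchange(&lock_word, 0) == 2)
            futex(&lock_word, FUTEX_WAKE, 1);
    }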


Also, I should say, in user-land you can efficiently enough save thread state, go off and do something else with that thread, then come back to it, never hitting the kernel while something blocks. That's pretty much async in a nutshell (or green threads).

The point of the article anyway is that it's inexcusable to have a modern concurrency textbook and not cover the futex, since it's at the core of any efficient primitive on modern hardware.


The problem with green threads, historically, was that there was no way to do arbitrary syscalls async; if your syscall blocks it blocks all your other green threads. Doh.

io_uring is supposed to be about solving this, but it's quite the kitchen sink so I have no idea how complete it is on the "arbitrary syscall async" front.
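
For what it's worth, the basic liburing pattern looks something like this (a plain read here; the same submit/complete loop applies to whatever operations io_uring has opcodes for, which is still per-operation support rather than literally any syscall):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        int fd = open("/etc/hostname", O_RDONLY);
        char buf[256];

        /* Queue an asynchronous read instead of calling read(2) directly. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        /* Reap the completion; res is the byte count or a negative errno. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }

(Compile with -luring.)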


Yes, it's gotten quite large, but I think with far fewer wrong turns in the API compared to the futex. Enough was available async via `epoll()` + having fd interfaces to things that I never was as worried about the arbitrary latency of syscalls, but it's still incredibly cool, especially in the number of calls it avoids outright.


`epoll` doesn’t actually do any IO though, so it doesn’t help with syscall latency. It just avoids the overhead of doing IO via a large number of threads (memory overhead, hard context switches, etc.).


No it doesn't, which is one key reason why I am a fan of `io_uring`. I brought `epoll` up because it does help with the blocking though, for most of the things that matter when it comes to async (at a cost to latency, of course).


You actually issue the `futex` system call to get yourself on the wait queue tied to the memory address. It separates out the waiting from the locking.

And that can absolutely save a bunch of system calls, especially vs. polling mixed with `sleep()` or similar.


> It separates out the waiting from the locking.

It does not, in fact the two are fundamentally inseparable and the state of the memory address must be treated atomically with the waiting state. The magic of futex is that you can use a hardware atomic operation (cf. lock cmpxchg on x86) to get the lock in the common/uncontended case, but if you have to wait you need to tell the kernel both that you need to wait and the address on which you're waiting, so it can use the same hardware interlocks along with its own state locking to put you to sleep race-free.


It quite does; the kernel is not the keeper of the lock, it only needs to detect the race condition that would result in a spurious sleep. It cares not one bit about the rest of your semantics.

It's true you could use it that way, but it's not the way it's meant to be used, defeating the purpose by requiring a system call even for uncontended locks.


I think you're misunderstanding how futexes work, or else making an essentially irrelevant semantic argument around a definition for "keeper". The kernel is, 100%, absolutely, the "keeper" of that lock data for the duration of the system call. It knows that (hardware) address and matches it to any other such syscall from any other process on the system. And that requires tracking and intelligence and interaction with the VM system and arch layer up and down the stack.

It just doesn't "allocate" it on its own and lets the process use its own mapped memory. But to pretend that it doesn't have to do any work or that the memory is "separated" betrays some misunderstandings about what is actually happening here.


The kernel is responsible for maintaining the wait queues, and making sure that there is no race condition on the state that should preclude queueing.

It does not care how you use the queue, at all. It doesn't have to be done with a locking primitive, whatsoever. You absolutely can use the exact same mechanism to implement a thread pool with a set of dormant threads, for instance.

The state check in the basic futex is only done to avoid a race condition. None of the logic of preventing threads from entering critical sections is in the purview of the kernel, either. That's all application-level.

And most importantly, no real lock uses a futex for the locking parts. As mentioned in the article, typically a mutex will directly try to acquire the lock with an atomic operation, like an atomic fetch-and-or, fetch-and-add, or even compare-and-swap.

A single atomic op, even if you go for full sequential consistency (which comes w/ full pipeline stalls), is still a lot better than a trip into the kernel when you can avoid it.

Once again, I'm not saying you couldn't use the futex state check to decide what's locked and what's not. I'm saying nobody should, and it was never the intent.

The intent from the beginning was to separate out the locking from the waiting, and I think that's pretty clear in the original futex paper (linked to in my article).
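
To make that concrete, here's a hypothetical sketch of the same wait-queue mechanism used to park worker threads rather than to lock anything; all the names are made up:

    #include <limits.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static atomic_int work_generation;   /* bumped every time new work arrives */

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* Worker: park until the generation moves past what we've already seen.
       The value check only guards against a missed wake-up; no locking here. */
    static void wait_for_work(int seen)
    {
        while (atomic_load(&work_generation) == seen)
            futex(&work_generation, FUTEX_WAIT, seen);
    }

    /* Producer: publish work, then wake every parked worker (or pass 1 to wake one). */
    static void publish_work(void)
    {
        atomic_fetch_add(&work_generation, 1);
        futex(&work_generation, FUTEX_WAKE, INT_MAX);
    }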


I like to think of a futex as the simplest possible condition variable, where the predicate is just the state of the memory word (note that a mutex guarding the predicate is unnecessary since the word can be read and written atomically). It turns out that this is simple enough to implement efficiently in the kernel, yet expressive enough to implement pretty much any userspace synchronization primitive over it.


You are of course completely right. In fact sometimes I wish that the kernel would do slightly more with the memory location, like optionally reserving a bit to show the empty/non-empty state of the queue: the kernel should be able to keep it up to date cheaply as part of the wait/wake operations, while it is more complicated for userspace.


This sort of thing can be implemented in the kernel without special hardware operations by adding the thread to the wake list, then suspending the thread, then checking the value. If the value has changed, undo that and return; else just return, leaving the thread suspended. Special hardware multi-atomic instructions are not required, but the details are very dependent on how the kernel's thread switching is designed.


Why is this gray!? This is absolutely correct. Futex was added as an ad hoc solution to the obvious needs of SMP processes communicating via atomic memory operations who still wanted blocking IPC. And it had to be updated and reworked repeatedly as it moved out of the original application (locks and semaphores) into stuff like condition variables and priority inheritance where it didn't work nearly as well.

In point of fact futex is really not a particularly simple syscall and has a lot of traps, see the man page. But the core idea is indeed "not that deep".


As the article says, the futex system call is overly complicated. But one shouldn't downplay its importance. Every major OS has had a slimmed down equivalent for about a decade, and the futex is at the core of any good modern lock.

Many things are obvious in hindsight, but there was plenty of time before for other people to do the same thing; it's not like we didn't know SysV semaphores didn't scale well.

"ad hoc" feels like an opinion here. My opinion is that when separation of concerns leads to big gains like the futex did, that's elegant, and an achievement. No need to diminish the good work done!


If this is an ad hoc solution, what's the "right" approach?


Futex is a fine solution for locks and semaphores (FUTEX_WAIT/WAKE operations). It's been extended repeatedly to handle the needs of condition variables, priority inheritance, timeouts, interop with file descriptors and async/io_uring, etc... with the result that a lot of the API exists to support newer operations with oddball semantics and not a few genuine mistakes and traps (often undocumented). See the glibc condition variable code for how complicated this can get.

Also, while googling for some examples for you I was reminded of this LWN article from a few years back that details some of the issues: https://lwn.net/Articles/823513/


Just because the Linux futex call is currently a Swiss Army knife with some parts that add no value (which I do say in the article) doesn't mean that it's not valuable, or important.

The fact that Linux has extended it in so many ways is, in fact, a testament to how impactful the futex concept has been to the world of concurrency.

The fact that it's also at the core of other OS primitives does as well. At least on the macOS side, those primitives have much simpler APIs. For instance, here's the main wait function:

`extern int os_sync_wait_on_address(void * addr, uint64_t value, size_t size, os_sync_wait_on_address_flags_t flags);`

There's also one with a timeout.

The wake side is equally simple, with two calls, one to wake one thread, one to wake all threads. No other number matters, so it's a great simplification in my view.

Your fundamental point is that the futex is actually a pretty unimportant construct. Clearly I don't agree, and it's okay not to agree, but I really am struggling to see your counter-argument.

If futexes aren't important to good locks, then, if modern OSes all felt compelled to jettison the futex for some reason, you'd have pthread implementations do ... what exactly??


They are so good that they keep being reinvented even in userspace, for example WTF::ParkingLot.

Or C++ adding std::atomic::wait, which is basically a thin [1] wrapper over futex.

[1] implementations still manage to mess it up.


The WTF::ParkingLot example is interesting because it shows that you don't actually need futexes in the kernel to implement them efficiently; you just need something like a spinlock (with sane backoff!) to guard the userspace wait queue and a per-thread eventfd to wake up waiters.


Yes, you can do a good futex impression in userspace and add any missing functionality you need. Most importantly for webkit, I think, you get portability.

The advantage of futex provided by the kernel is ABI stability and cross process support.
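
The cross-process part is easy to demo: put the futex word in shared memory and the default (non-private) futex ops match waiters across processes. A sketch, with error handling omitted:

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* One atomically accessed word visible to parent and child. */
        atomic_int *flag = mmap(NULL, sizeof(*flag), PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        atomic_store(flag, 0);

        if (fork() == 0) {
            /* Child: sleep in the kernel until the word stops being 0. */
            while (atomic_load(flag) == 0)
                syscall(SYS_futex, flag, FUTEX_WAIT, 0, NULL, NULL, 0);
            printf("child woken\n");
            return 0;
        }

        sleep(1);               /* Parent: pretend to do some work first. */
        atomic_store(flag, 1);  /* Publish the new state...               */
        syscall(SYS_futex, flag, FUTEX_WAKE, 1, NULL, NULL, 0);  /* ...and wake. */
        wait(NULL);
        return 0;
    }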


It gives you an implementation. The implementation does not have to be ideal, though.


This is mostly the fault of pthreads, where baroque POSIX semantics had to be shoehorned into the kernel for efficiency.


Wait/Wake are enough for the vast majority of the use cases. The rest is a mix of niche cases (PI, robust mutexes) and failed experiments.


>right-wing religious revival rooted in Christianity, combined with technological acceleration and a reimagined political order that prioritizes heroic individuals and hierarchical impulses

Sounds like a fast path to totalitarianism a la 1930.


Inference runs like a stateless web server. If you have 50K or 100K machines, each with tons of GPUs (usually 8 GPUs per node), then you end up with a massive GPU infrastructure that can run hundreds of thousands, if not millions, of inference instances. They use something like Kubernetes on top for scheduling, scaling and spinning up instances as needed.

For storage, they also have massive amounts of hard disks and SSDs behind planet-scale object file systems (like AWS's S3, Tectonic at Meta, or MinIO on-prem), all connected by massive numbers of switches and routers of varying capacity.

So in the end, it's just the good old Cloud, but also with GPUs.

Btw, OpenAI's infrastructure is provided and managed by Microsoft Azure.

And, yes, all of this requires billions of dollars to build and operate.


France has its own independent military production including jet fighters (Rafale), tanks, ballistic missiles, nuclear submarines and nuclear warheads.

