I looked into something similar when implementing a concurrent GC. I ended up just using mmap() and ptrace(), since I had to manipulate the process for certain barrier operations anyway. I probably could have done it with non-ptrace system calls, and there are tradeoffs to be made. Either way, you need to interrupt any pending system calls, but there are multiple ways of doing that.