
It seems like this is indeed possible using video codecs: https://arxiv.org/abs/2407.00467v1

The potato trick.

https://www.farmersalmanac.com/parmentier-made-potatoes-popu...

> Still, even after all of Parmentier’s work, the French feared and hated potatoes. But Parmentier was undeterred. Determined to prove to his people that potatoes were, in fact, good, he started holding publicity stunts that included potatoes. He hosted stylish dinners featuring the maligned tuber, inviting such celebrities as Benjamin Franklin and Antoine Lavoisier. Once, Parmentier made a bouquet of potato flowers to give to the King and Queen of France.

> With the publicity stunts failing to popularize potatoes, Parmentier tried a new tactic. King Louis XVI granted him a large plot of land at Sablons in 1781. Parmentier turned this land into a potato patch, then hired heavily armed guards to make a great show of guarding the potatoes. His thinking was that people would notice the guards and assume that potatoes must be valuable. Anything so fiercely guarded had to be worth stealing, right? To that end, Parmentier’s guards were given orders to allow thieves to get away with potatoes. If any enterprising potato bandits offered a bribe in exchange for potatoes, the guards were instructed to take the bribe, no matter how large or small.

> Sure enough, before too long, people began stealing Parmentier’s potatoes.

I've never known if the myth was true, but I've always wanted to try the trick with something to see if it works.


Which is why the unobserved portions of the universe do not exist.

How can you travel faster than light, if you haven't observed where you'll be? And if there are no particles before you arrive, what are you going to even observe?

No wonder the universe's expansion is accelerating, we keep looking at it!


A famous anecdote about Dirac (and Bohr and Rutherford) appears in Absolute Zero Gravity:

--

Young Dirac arrived at Niels Bohr’s institute with a glowing recommendation from the great experimentalist Ernest Rutherford. A few months later, Bohr remarked to Rutherford that this marvelous Dirac hardly seemed so special: he said nothing and he did nothing. Legend has it that Rutherford replied with the following story:

A man went to a pet shop to buy a parrot. There was a gray parrot that knew a few words selling for one hundred dollars. There was a blue parrot that could sing and tell stories for two hundred dollars. There was a beautiful green and purple bird that spoke several ancient languages for five hundred dollars. And there was a nondescript brown bird priced at a thousand dollars.

“A thousand dollars!” exclaimed the would-be buyer. “That must be some bird - how many languages does he speak?” “Just English,” admitted the shopkeeper.

“His vocabulary is extraordinary, perhaps?” The shopkeeper shrugged. “Not really”.

“Does he sing, then?” “No,” said the shopkeeper. “Most days this parrot doesn’t even talk”.

“Well, does he do acrobatic tricks or something? What on earth is so valuable about that parrot?”

“Sir, this parrot thinks”.

Rutherford concluded, “Dirac thinks”.


Some AMD 80386DX-40 drama:

> While the AM386 CPU was essentially ready to be released prior to 1991, Intel kept it tied up in court.[2] Intel learned of the Am386 when both companies hired employees with the same name who coincidentally stayed at the same hotel, which accidentally forwarded a package for AMD to Intel's employee.[3]


Unrelated, but on the topic of reducing power consumption, I want to once again note that both AMD and NVidia max out a CPU core per blocking API call, preventing your CPU from entering low power states even when doing nothing but waiting on the GPU, for no reason other than to minimally rice benchmarks.

Basically, these APIs are set up by default (!) to busy-spin while waiting for a bus write from the GPU, rather than using interrupts like every other hardware device on your system.

You turn it off with

NVidia: `cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)`

AMD: `hipSetDeviceFlags(hipDeviceScheduleBlockingSync)`

In PyTorch:

NVidia: `import ctypes; ctypes.CDLL('libcudart.so').cudaSetDeviceFlags(4)`

AMD: `import ctypes; ctypes.CDLL('libamdhip64.so').hipSetDeviceFlags(4)`

This saves me 20W whenever my GPU is busy in ComfyUI.

Every single device using the default settings for CUDA/ROCM burns a CPU core per worker thread for no reason.
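
For completeness, here's roughly what that flag does spelled out on the CUDA C side (my own sketch, not vendor sample code; the error check and setting the flag before the first CUDA call are just how I'd write it, and cudaDeviceScheduleBlockingSync is the constant 4 that the ctypes one-liners pass):

    /* Sketch: ask the CUDA runtime to sleep on GPU completion instead of
       busy-spinning. Set the flag before the first CUDA call so the
       context is created with it. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); /* == 4 */
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
            return 1;
        }
        /* ... allocate buffers, launch kernels ... */
        cudaDeviceSynchronize();   /* host thread now blocks in the driver
                                      instead of spinning on a memory flag */
        return 0;
    }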


With context switches becoming more and more expensive relative to faster and faster I/O devices (the two are now almost within the same order of magnitude), I believe that thread-per-core is where things are heading, because the alternative of not doing thread-per-core might literally mean halving your throughput.

That's also the most exciting thing about io_uring for me: how it enables a simple, single-threaded and yet highly performant thread-per-core control plane, outsourcing to the kernel thread pool for the async I/O data plane, instead of outsourcing to a user space thread pool as in the past. It's much more efficient and at the same time, much easier to reason about. There's no longer the need for multithreading to leak into the control plane.

My experience with io_uring has been mostly working on TigerBeetleDB [1], a new distributed database that can process a million financial transactions a second, and I find it's a whole new way of thinking... that you can now just submit I/O directly from the control plane without blocking and without the cost of a context switch. It really changes the kinds of designs you can achieve, especially in the storage space (e.g. things like LSM-tree compactions can become much more parallel and incremental, while also becoming much simpler, i.e. no longer any need to think of memory barriers). Fantastic also to now have a unified API for networking/storage.
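
To make that concrete, here's a minimal single-threaded liburing sketch (my own toy example, not TigerBeetle code; the file path and queue depth are arbitrary and error handling is omitted for brevity): the control plane queues a read without blocking, keeps running, and reaps the completion later on the same thread.

    /* Sketch: submit async I/O from a single-threaded control plane,
       then reap the completion on the same thread. Requires liburing. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int fd = open("/etc/hostname", O_RDONLY);

        io_uring_queue_init(8, &ring, 0);        /* 8-entry submission queue */

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
        io_uring_submit(&ring);                  /* one syscall flushes the whole queue */

        /* ... the control plane keeps driving its state machine here ... */

        io_uring_wait_cqe(&ring, &cqe);          /* or io_uring_peek_cqe to stay non-blocking */
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }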

So much good stuff in io_uring. Exciting times.

[1] https://www.tigerbeetle.com


I think the dependency situation is pretty rough, and very few folks want to admit it. An example I recently stumbled upon: the cargo-watch[0] crate.

At its core it's a pretty simple app: it watches for file changes and re-runs the compiler. The implementation is less than 1000 lines of code. But what happens if I vendor the dependencies? It turns out the deps add up to almost 4 million lines of Rust code, spread across 8000+ files. For a simple file-watcher.

[0] https://crates.io/crates/cargo-watch


The single inference as the simplest way to switch from Type 1 to Type 2...

(if we think of catechisms as being the canonical type 1 instruction, and textbooks the canonical type 2)

Consider the "Tigerfibel" (1943), many pages of which feature items for every sort of thinker, eg https://web.archive.org/web/20230617172820im_/http://www.ala... :

Type 2 thinkers have a thesis and antithesis: "Water is necessary to cool the engine, when it runs" and "Water is sufficient to destroy the engine, if it freezes"

Type 1 thinkers have a motto: (upper left)

    Water puts your tank at ease, don't forget the anti-freeze
Type 0 thinkers have a place to rest their eyes (upper right)

Power-off storage temperature is a problem for MLC NAND, but when powered on the NAND cells need to reach a high temperature to operate properly. If you are cooling the flash chips with a heatsink (rather than cooling the controller) you will be forcing the device to dump power into the cells to heat them to a temperature where they work properly.

3D NAND hits its best program time and raw bit error rate at about 70C.

Edit: See data table on page 27. Retention is directly proportional to device active temperature, i.e. higher cell temperature during programming leads to higher retention. https://www.jedec.org/sites/default/files/Alvin_Cox%20%5bCom...


If you want devtmpfs mounted automatically by the kernel within an initramfs, then you may need to patch your kernel slightly, like: https://lore.kernel.org/lkml/25e7e777-19f9-6280-b456-6c9c782...

By default the kernel will not mount devtmpfs automatically within an initramfs, even if you've configured your kernel to automatically mount devtmpfs, because it waits until the real rootfs has been mounted to do it. This is fine unless you never plan to actually mount a real rootfs as you've made your initramfs have everything a normal rootfs would have and you want devtmpfs to Just Work.


I designed a syntax for this: that everything is a state machine progression, a bit like sequence types in the article.

   state1a state1b state1c | state2a state2b state2c | state3a state3b state3c
This means wait for state1a, state1b, and state1c in any order, then move to the next sequence of things to wait for.

In a multithreaded server or multimachine distributed system, there are global states you want to wait for and then trigger behaviour. The communication can be inferred and optimised and scheduled.

It's BNF syntax - inspired by parsing technology for parsing sequences of tokens but tokens represent events.

If you use printf debugging a lot, you know the progression of what you see is what happened, and that helps you understand what went wrong. So why not write or generate the sequence of actions you want directly, and not worry about the details?

But wait! There's more. You can define movements between things.

So take an async/await thread pool; this syntax defines one:

  next_free_thread(thread:2);
  task(task:A) thread(thread:1) assignment(task:A, thread:1) = running_on(task:A, thread:1) | paused(task:A, thread:1);

  running_on(task:A, thread:1)
  thread(thread:1)
  assignment(task:A, thread:1)
  thread_free(thread:next_free_thread) = fork(task:A, task:B)
                                | send_task_to_thread(task:B, thread:next_free_thread)
                                |   running_on(task:B, thread:2)
                                    paused(task:A, thread:1)
                                    running_on(task:A, thread:1)
                                    assignment(task:B, thread:2)
                               | { yield(task:B, returnvalue) | paused(task:B, thread:2) }
                                 { await(task:A, task:B, returnvalue) | paused(task:A, thread:1) }
                               | send_returnvalue(task:B, task:A, returnvalue); 
  
Why not just write what you want to happen and let the computer work out how to schedule and parallelize it?

I think iteration/looping and state persistence and closures are all related.

I have a parser for this syntax and a multithreaded barrier runtime which I'm working on; I use liburing. I want to get to the 500 million requests per second and ~50 nanosecond latency of the LMAX Disruptor.

The notation could be used for business programming and low level server programming I think.


Apparently ECC does not prevent this

https://www.vusec.net/projects/eccploit/


The decoupling narrative is oversold for queues.

There's essential decoupling and accidental decoupling; decoupling you want, and decoupling which mostly just obscures your business logic.

Resilience in the face of failure, where multiple systems are communicating, or there's a lot of long-running work which you want to continue as seamlessly as possible, is the biggest essential decoupling. You externalize transitions in the state machine (edges in the state graph) of the logic as serialized messages, so you can blow away services and bring them back and the global state machine can continue.

Scaling from a single consumer to multiple consumers, from multiple CPUs to multiple machines, is mostly essential decoupling. Making decisions about how to parallelize subgraphs of your state machine, removing scaling bottlenecks, is an orthogonal problem to the correctness of the state machine on its own, and representing it as a state machine with messages in queues for edges helps with that orthogonality. You can scale things up without external persistent queues but you'll end up with queues somewhere, even if it's just worker queues for threads.

Accidental decoupling is where you have a complex state machine encapsulating a business procedure with multiple steps, and it's coordinated as messages between and actions in multiple services. The business logic might say something like: take order from user; send email notification; complete billing steps; remove stock from inventory system; schedule delivery; dispatch stock; etc.

All this logic needs to complete, in sequence, but without higher order workflow systems which encode the state machine, a series of messages and producers and consumers is like so much assembly code hiding the logic. It's easy to end up with the equivalent of COMEFROM code in a message system.

https://en.wikipedia.org/wiki/COMEFROM


Another heuristic is to ask yourself:

- does this have to be said

- does this have to be said now

- does this have to be said by me


What's really unfortunate is that compilers already perform this "control flow -> state machine" transformation, but almost none of them expose it to the user for direct manipulation - instead, they tightly couple it to a bunch of unrelated abstractions like event loop runtimes, async executors, managed stacks, etc. I'd kill for a compiler that can properly:

* Parse a coroutine that produces and consumes values

* Perform all the normal function-level optimizations to minimize frame space (stack slot reuse via liveness analysis of local variables, re-computing temporaries across yield points, etc.)

* Expose that coroutine frame as a fixed-size struct that I can explicitly resume and query

Zig is almost there, but suspend/resume cannot return values/take arguments, which requires some unergonomic workarounds. Rust is making some promising progress on unified coroutines (https://github.com/rust-lang/rfcs/pull/2781), but the generator types are opaque so you can't encapsulate them in a Sized struct or allocate an array of them. Not to mention that it's still extra-unstable, and last I checked, there were issues with generator size optimizations (https://github.com/rust-lang/rust/issues/59087). C++20 coroutines are similarly opaque and cannot be allocated as a contiguous array.
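
To illustrate what I mean by a fixed-size, explicitly resumable frame, here's a hand-rolled C sketch (my own toy, not what any compiler actually emits): the switch-based state machine is the transformation I'd like the compiler to do for me, with the frame laid out as a plain struct you can put in an array and resume with a value.

    /* Hand-written stand-in for what a compiler could emit: a resumable
       accumulator whose entire frame is an explicitly sized struct. */
    #include <stdio.h>

    typedef struct {
        int state;   /* which suspension point to resume at */
        int sum;     /* the only local that must live across yields */
    } acc_frame;

    /* Resume with an input value; returns the value the coroutine yields. */
    static int acc_resume(acc_frame *f, int input) {
        switch (f->state) {
        case 0:
            f->sum = 0;
            for (;;) {
                f->state = 1;
                return f->sum;       /* yield the running sum, suspend */
        case 1:
                f->sum += input;     /* resumed: consume the argument */
            }
        }
        return -1;
    }

    int main(void) {
        acc_frame frames[4] = {{0}};                /* contiguous array of frames, no heap */
        printf("%d\n", acc_resume(&frames[0], 0));  /* prime: yields 0 */
        printf("%d\n", acc_resume(&frames[0], 5));  /* yields 5 */
        printf("%d\n", acc_resume(&frames[0], 7));  /* yields 12 */
        return 0;
    }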


(This is more of a link-dump than a paper discussion --)

For the line of inquiry w.r.t tensor compilers and MLIR/LLVM (linalg, polyhedral, [sparse_]tensor, etc), I personally found the following really helpful: https://news.ycombinator.com/item?id=25545373 (links to a survey), https://github.com/merrymercy/awesome-tensor-compilers

I also have an interest in the community more widely associated with pandas/dataframes-like languages (e.g. modin/dask/ray/polars/ibis) with substrait/calcite/arrow their choice of IR. Some links: https://github.com/modin-project/modin, https://github.com/dask/dask/issues/8980, https://news.ycombinator.com/item?id=16510610, https://news.ycombinator.com/item?id=35521785

I broadly classify them as such since the former has a stronger disposition towards linear/tensor-algebra, while the latter towards relational algebra, and it isn't yet clear (to me) how well innovations in one carry over to the other (if they do), and hence I'm also curious to hear more about proposals for a unified language across linalg and relational alg (e.g. https://news.ycombinator.com/item?id=36349015).

I'm particularly interested in pandas precisely because it seems to be right at the intersection of both forms of algebra (and draws a strong reaction from people who are familiar/comfortable with one community and not the other). See e.g. https://datapythonista.me/blog/pandas-20-and-the-arrow-revol... and https://wesmckinney.com/blog/apache-arrow-pandas-internals/


One of the biggest interests & excitements I feel over QUIC & HTTP3 is the potential for something really different & drastically better in this realm. Right out of the box, QUIC is "connectionless", using cryptography to establish a session. I feel like there's so much more possibility for a data-center to move around who is serving a QUIC connection. I have a lot of research to do, but ideally that connection can get routed stream by stream, & individual servers can do some kind of Direct Server Return (DSR) for individual streams. But I'm probably pie in the sky with these over-flowing hopes.

Edit: oh here's a Meta talk on their QUIC CDN doing DSR[1].

The original "live migration of virtual machines"[2] paper blew me away & reset my expectations for computing & connectivity, way back in 2005. They live migrated a Quake 3 server. :)

[1] https://engineering.fb.com/2022/07/06/networking-traffic/wat...

[2] https://lass.cs.umass.edu/~shenoy/courses/spring15/readings/...


I'd love to have an open source CNC machine to design joinery with http://ma-la.com/tsugite.html. Ideally a whole house and most of the furniture...

If anyone has any ideas on how to accelerate build times of open hardware, that's something I'm trying to solve. Creating high-quality instructionals is a huge amount of work, and I think instructionals should be automatically generated by computer vision and have interactive elements, ideally AR, but even just highlighting wiring diagrams on hover would be hugely helpful. Even if things are well documented, replication is still insanely pyrrhic without economy of scale or universal fabrication. It's time-consuming because it's hard to replicate knowledge/tool environments quickly.

