Hacker News

As someone who got in early on the Ryzen AI 395+, is there any added value to the DGX Spark besides having CUDA (compared to ROCm/Vulkan)? I feel Nvidia fumbled the marketing, either making it sound like an inference miracle or a dev toolkit (and then not enough to differentiate it from the superior AGX Thor).

I am curious where you find its main value, how it would fit within your tooling, and what its use cases are compared to other hardware.

From the inference benchmarks I've seen, an M3 Ultra always comes out on top.



The M3 Ultra has a slow GPU and no hardware FP4 support, so its prompt processing (prefill) is going to be slow, practically unusable at 100k+ context sizes. For token generation, which is memory bound, the M3 Ultra will be much faster, but who wants to wait 15 minutes for the context to process? The Spark will be much faster at prompt processing, giving you a much better time to first token, but then ~3x slower (273 vs 800 GB/s) in token generation throughput. You need to decide which matters more to you. Strix Halo is IMO the worst of both worlds at the moment: the weakest specs on both dimensions and the least mature software stack.
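The decode-side tradeoff above can be sketched with back-of-envelope arithmetic: memory-bound token generation is roughly bounded by memory bandwidth divided by the bytes read per token (approximately the active model weights). The model size and quantization below are illustrative assumptions, not benchmarks of any specific setup.

```python
# Back-of-envelope decode throughput: token generation is memory bound,
# so tokens/s is roughly memory bandwidth / bytes read per token
# (approximately the active model weights). Illustrative numbers only.

def decode_tokens_per_s(bandwidth_gb_s: float, active_params_b: float,
                        bytes_per_param: float) -> float:
    """Rough upper bound on memory-bound token generation rate."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical 70B dense model at 4-bit (~0.5 bytes/param):
spark = decode_tokens_per_s(273, 70, 0.5)   # DGX Spark, 273 GB/s
m3u   = decode_tokens_per_s(800, 70, 0.5)   # M3 Ultra, 800 GB/s

print(f"Spark ~{spark:.1f} tok/s, M3 Ultra ~{m3u:.1f} tok/s "
      f"({m3u/spark:.1f}x)")
```

The ~3x gap tracks the 800 vs 273 GB/s bandwidth ratio, which is why decode speed follows bandwidth regardless of compute.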


This is 100% the truth, and I am really puzzled to see people push Strix Halo so hard for local inference. For about $1200 more you can build a DDR5 + 5090 machine that will crush a Strix Halo on just about every MoE model (equal decode and 10-20x faster prefill for the large ones, and huge gaps for any MoE that fits in 32GB of VRAM). I'd have a lot more confidence reselling a 5090 in the future than a Strix Halo machine, too.
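The "fits in 32GB VRAM" condition above amounts to a simple size check: quantized weight footprint plus some working overhead versus the card's memory. A minimal sketch, where the model sizes and the fixed overhead budget are illustrative guesses rather than measurements for any real model:

```python
# Rough check of whether a model's quantized weights fit in GPU VRAM.
# The overhead figure (KV cache, activations) and example model sizes
# are illustrative assumptions, not measurements.

def weights_gb(total_params_b: float, bits_per_param: float) -> float:
    """Weight footprint in GB for a parameter count given in billions."""
    return total_params_b * bits_per_param / 8  # 1e9 params cancels 1e9 bytes/GB

def fits_in_vram(total_params_b: float, bits_per_param: float,
                 vram_gb: float = 32, overhead_gb: float = 4) -> bool:
    """True if weights plus a fixed overhead budget fit in VRAM."""
    return weights_gb(total_params_b, bits_per_param) + overhead_gb <= vram_gb

# Hypothetical examples at 4-bit quantization:
print(fits_in_vram(30, 4))  # 15 GB weights + 4 GB overhead -> True
print(fits_in_vram(70, 4))  # 35 GB weights alone exceed 32 GB -> False
```

Models that fail this check spill onto DDR5, which is where the 5090 + CPU-offload setup still wins on prefill while matching Strix Halo on decode.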




