Hacker News

I think that's the craziest one I've heard so far. I can't imagine trying to reproduce a hardware bug in an FPGA with a logic analyzer.

How much time did you spend on the bug? Was it something that you ignored for a long time, and then you decided to dive in and spend a couple of weeks on it?



It was about 8 years ago, so my memory is a bit foggy. At one point we had everyone on our entire project working on it, which was about 20 people. We had daily calls with the customer since it was causing frequent outages on their network. I was full time on the bug for a while, and most of that time was spent writing meticulous documentation of what happens in every clock cycle in the FPGA.

I visited the customer's site about 3 times during that period with a very expensive logic analyzer. By that time I'd added a ton of debug code in the FPGA, and the logic analyzer was set to trigger if any of those counters hit. A big breakthrough came when we started counting packets at each point in the pipeline: that debug showed that at a certain point, the counters were off. This was huge since it was the first clue to what was actually happening. Prior to that we just knew that there was an exponential increase in retransmitted packets until the whole thing came crashing down.
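The counting-at-each-stage trick generalizes nicely: if every pipeline stage keeps its own packet counter, the first pair of adjacent stages whose counters disagree tells you where packets are going missing. Here's a minimal software sketch of the idea (the stage names, the drop behavior, and the counter layout are all my invention for illustration, not the actual debug logic from the story):

```python
# Toy model of per-stage packet counters for localizing packet loss.
# Stage names and the simulated fault are made up for illustration.

class Stage:
    def __init__(self, name, drops_every=0):
        self.name = name
        self.count = 0        # packets this stage successfully passed on
        self._seen = 0        # packets that arrived at this stage
        self.drops_every = drops_every  # simulate a buggy, lossy stage

    def accept(self, pkt):
        self._seen += 1
        if self.drops_every and self._seen % self.drops_every == 0:
            return None       # packet silently lost; counters now diverge
        self.count += 1
        return pkt

def run(pipeline, n_packets):
    for i in range(n_packets):
        pkt = i
        for stage in pipeline:
            pkt = stage.accept(pkt)
            if pkt is None:
                break

def find_mismatch(pipeline):
    """Return the first adjacent stage pair whose counters disagree."""
    for a, b in zip(pipeline, pipeline[1:]):
        if a.count != b.count:
            return (a.name, b.name, a.count - b.count)
    return None

pipeline = [Stage("ingress"), Stage("parser"),
            Stage("buffer", drops_every=100), Stage("egress")]
run(pipeline, 1000)
print(find_mismatch(pipeline))  # -> ('parser', 'buffer', 10)
```

In hardware the counters would be registers sampled by the logic analyzer rather than Python ints, but the localization logic is the same: the fault sits between the last matching pair and the first mismatching one.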

Looking back on it, I'm really happy I ended up writing a large post-mortem with screenshots of the logic analyzer and everything. I can go back and see exactly what caused it, which is really interesting even today.


Wow, that blows my mind. I don’t even know how to ask the right question, but is there some kind of “formal verification” or “type-checking” that could have caught this issue earlier? Or is that just too difficult, and reserved for medical equipment, nuclear reactors, etc.?


It was a problem with a clock domain crossing between IP from another company and ours. There were testbenches/simulations for verification, but they didn't test every packet size. In hindsight, everything can be caught; it's a matter of how extensive the testing is. For example, passing every packet size through would not have shown the symptom. Everything looked normal, and all packets succeeded. The key was that the metadata saying whether a buffer was free after transmitting was incorrect and never showed that the buffer was free. So to really see the issue you had to send enough packets of that size to totally deplete the buffers. With a standard packet distribution, such as IMIX, you never hit that size.
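A software analogue of that failure mode: a buffer pool where the "free" metadata update is lost for exactly one packet size, so every individual send succeeds while the pool slowly drains. This is just an illustrative sketch, not the actual hardware; the pool depth, the buggy size, and the traffic mix are all invented:

```python
# Toy model of the buffer-leak failure mode: buffers used for one
# particular packet size are never marked free again. All numbers
# here are invented for illustration.

POOL_SIZE = 64
BUGGY_SIZE = 1024  # the one size whose free-flag update is lost

free_buffers = list(range(POOL_SIZE))
in_use = {}

def send_packet(size):
    """Return True if the packet was transmitted, False if no buffer free."""
    if not free_buffers:
        return False          # pool depleted -> retransmits pile up
    buf = free_buffers.pop()
    in_use[buf] = size
    # transmit... then release the buffer, unless the metadata bug hits
    if size != BUGGY_SIZE:
        del in_use[buf]
        free_buffers.append(buf)
    return True

# An IMIX-style mix never includes the bad size, so it never depletes
# the pool -- every packet succeeds, everything looks healthy:
for _ in range(10000):
    assert send_packet(64) and send_packet(576) and send_packet(1500)

# But a sustained burst of the one bad size leaks a buffer per packet
# and eventually exhausts the pool:
results = [send_packet(BUGGY_SIZE) for _ in range(POOL_SIZE + 1)]
print(results.count(True), results[-1])  # -> 64 False
```

This is why per-packet testing can't see it: the observable failure only appears after the cumulative leak crosses the pool size, which a short test with a realistic traffic mix never reaches.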


Thanks a lot for sharing this story! I started reading about clock domain crossing and metastability issues, and for the past few days I've been thinking a lot about circuit designs and FPGAs.

I posted on /r/FPGA on Reddit [1], and I asked about the best development board to start with. I was wondering if you had any thoughts on that? I think I'm going to order a snickerdoodle black [2] and a breakout board, and start learning VHDL or Verilog. I think it would be a lot of fun to eventually design my own soft-core processor and compile a Rust program that runs on it. And after reading about clock domain crossing issues, I think it would also be really interesting to experiment with asynchronous circuits and clock-less CPUs. Thanks for introducing me to this!

[1] https://www.reddit.com/r/FPGA/comments/9yutk8/best_100300_fp...

[2] https://www.crowdsupply.com/krtkl/snickerdoodle


Sorry, I can't help much with the dev board side of things. I mostly work on software these days, and the FPGAs we use at work are some of the largest you can buy (Stratix 10). I can ask around at work if that would help.


Also I just found this hardware description language called Clash [1]. It's based on Haskell and compiles to VHDL.

That sounds really awesome to me! I'll learn VHDL as well, but this seems like a nicer high-level language with a really good type system.

[1] https://clash-lang.org


That looks really interesting. I've never used any high-level languages, though, so I can't comment. Most of the FPGA developers I know somewhat despise those higher-level languages, since you lose a lot of the control you have writing the HDL directly. They also tend to take up a lot more resources in terms of gates and memory, but things may have gotten better.


Wow, such an arcane bug causing such dramatic problems. My god, man.

There should be a blog for this kind of stuff.


I mentioned in a previous comment that I actually have an entire detailed write-up of the bug. If my work lets me release it, I wouldn't mind transferring it to a blog.



