Oohh, are we doing time bugs? Sun T4 servers in the early 2010s, on a Solaris release I've since forgotten, had a beaut of a random clock jump.

Every now and then, certain apps on a system would crash around the same time. We'd scrape through the logs and usually see that our app had cratered, along with some databases, ntpd and, naturally, sshd. Logging of timestamps was iffy, obviously.

ntpd was the obvious suspect because, despite keeping a nice low offset to its peers (a millisecond or two) for months at a time, out of the blue it would confess something like "time offset is too large, I can't fix that so I'll exit!". After chasing Sun ntpd bug reports[1] for a while, we ruled it out when we saw a pattern in the undamaged logs that looked like

    09:59:58.000 ...
    09:59:59.123 ...
    09:09:01.345 ...
    09:09:02.123 ...

Yep, the system clock really had jumped back almost an hour. That explained everything about userspace going nuts, with ntpd exiting as a symptom and not the culprit.
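For anyone hunting the same thing, here is a rough sketch of the kind of scan that would have saved us some eyeballing. It is not what we actually ran; `find_backward_jumps` and the `min_jump` threshold are names I made up for illustration, and it assumes each line starts with an `HH:MM:SS.mmm` stamp like the sample above. Since there is no date in the stamp, a log that crosses midnight would show up as a false positive.

```python
from datetime import datetime, timedelta

def find_backward_jumps(lines, min_jump=timedelta(seconds=1)):
    """Yield (previous, current) timestamps wherever the clock goes backwards."""
    prev = None
    for line in lines:
        # Assumption: the first whitespace-separated field is HH:MM:SS.fraction
        stamp = datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        if prev is not None and prev - stamp >= min_jump:
            yield prev, stamp
        prev = stamp

sample = [
    "09:59:58.000 ...",
    "09:59:59.123 ...",
    "09:09:01.345 ...",
    "09:09:02.123 ...",
]
jumps = list(find_backward_jumps(sample))
for before, after in jumps:
    print(f"clock went back {before - after} ({before.time()} -> {after.time()})")
```

The threshold is there so that ordinary out-of-order log writes of a few milliseconds don't get flagged.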

After some Sun support and some sunbugs searching [1 again] we found that the T4 in that Solaris rev had a hardware RTC with separate registers for H, M, S, etc., and a write mutex protecting them, but no read mutex. So it was possible to read the RTC while it was being updated, which happened when it was syncing the OS clock to the RTC, or something like that. Fixed in a later release.
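To make the failure mode concrete, here is a toy simulation of a torn read across separate H/M/S registers. Everything in it (`FakeRtc`, `tick_after_reads`, `naive_read`, `stable_read`) is hypothetical and has nothing to do with the actual Solaris code. The "read until two snapshots agree" trick shown here is a common workaround for this class of bug, not the fix Sun shipped, which as far as I recall was about locking the read path.

```python
class FakeRtc:
    """Hypothetical RTC with separate hour/minute/second registers."""

    def __init__(self, h, m, s):
        self.regs = {"h": h, "m": m, "s": s}
        self.tick_after_reads = None  # simulate an update landing mid-read
        self._reads = 0

    def read_reg(self, name):
        value = self.regs[name]
        self._reads += 1
        if self._reads == self.tick_after_reads:
            self.tick()  # the "writer" sneaks in between register reads
        return value

    def tick(self):
        total = (self.regs["h"] * 3600 + self.regs["m"] * 60 + self.regs["s"] + 1) % 86400
        self.regs = {"h": total // 3600, "m": total % 3600 // 60, "s": total % 60}

def naive_read(rtc):
    # Unprotected read: three separate register accesses, no consistency check.
    return tuple(rtc.read_reg(n) for n in ("h", "m", "s"))

def stable_read(rtc, max_tries=5):
    """Re-read until two consecutive snapshots agree."""
    prev = naive_read(rtc)
    for _ in range(max_tries):
        cur = naive_read(rtc)
        if cur == prev:
            return cur
        prev = cur
    raise RuntimeError("RTC never settled")

torn_rtc = FakeRtc(9, 59, 59)
torn_rtc.tick_after_reads = 1  # update lands right after the hour register is read
torn = naive_read(torn_rtc)    # old hour, new minute and second

safe_rtc = FakeRtc(9, 59, 59)
safe_rtc.tick_after_reads = 1
safe = stable_read(safe_rtc)

print("naive:", torn, "stable:", safe)
```

The naive read pairs the stale hour with the freshly rolled-over minute and second, so the result lands an hour behind the real time, which is the same flavour of jump we saw in the logs.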

1: RIP sunbugs database. It was such a mature relationship where Sun would let everyone see what they were working on and customers could participate or at least know about known issues. I would love to find an archive. Of course Oracle shut that off immediately so you had to open a ticket and ask.