I had a bug on a small system (basically an embedded computer with some custom I/O hardware) where the report was that a unit "wouldn't reboot if it was turned off for too long(!?)".
Focused testing easily reproduced the problem. If you power cycled the machine and the off time was less than 30 seconds, it always rebooted correctly. If the off time was greater than 2 minutes, it ALSO always rebooted correctly, but if the off time was around 1 minute, it would reliably hang during the boot process. This behavior was rapidly confirmed on a number of other units taken off the production line.
Needless to say, this was very confusing.
The eventual root cause was that during boot, a filesystem was created in DRAM, similar to what a modern system would use an initramfs for. The filesystem creation routine had originally been written for non-volatile memory, so by default it would check for a superblock to see whether a filesystem already existed at the creation location and, if one did, just reuse it (after unlinking all pre-existing files). If no pre-existing FS was found, it would create a new one.
But this was DRAM, not non-volatile memory, so:
- if the off time was less than 30 seconds, the filesystem was STILL IN THE DRAM and the ramfs_create() routine would happily re-use it.
- if the off time was over two minutes, there were enough bit errors in the DRAM that ramfs_create() would fail to recognize the superblock, overwrite it with a new one, and everything STILL worked fine.
- in the critical timing zone, the number of bit errors in the superblock in DRAM would be small enough that the FS creation would recognize that there was a pre-existing filesystem, but large enough to cause the ramfs_create() function to error out. The boot would then hang when the DRAM filesystem was accessed.
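To make the three cases concrete, here is a minimal C sketch of reuse-if-present creation logic. Everything in it is hypothetical: the post doesn't describe the real superblock format, so `struct superblock`, `FS_MAGIC`, the toy checksum, `fs_create_reuse()` and `flip_bit()` are illustrative stand-ins, and flipping individual bits is a simplification of how DRAM cells actually decay. The key assumption is that recognizing a filesystem (magic number) and validating it (checksum) are separate checks, which is what leaves room for a failing middle case.

```c
#include <stdint.h>

/* Hypothetical superblock; the real on-DRAM format is not described. */
#define FS_MAGIC 0x52414D46u /* assumed magic value ("RAMF") */

struct superblock {
    uint32_t magic;       /* used to *recognize* an existing filesystem */
    uint32_t block_count;
    uint32_t root_inode;
    uint32_t checksum;    /* used to *validate* a recognized filesystem */
};

enum fs_result { FS_REUSED, FS_CREATED, FS_ERROR };

/* Toy checksum for illustration only; not any real filesystem's algorithm. */
static uint32_t sb_checksum(const struct superblock *sb) {
    return (sb->magic ^ sb->block_count ^ sb->root_inode) + 0x9E3779B9u;
}

static void sb_format(struct superblock *sb, uint32_t block_count) {
    sb->magic = FS_MAGIC;
    sb->block_count = block_count;
    sb->root_inode = 1;
    sb->checksum = sb_checksum(sb);
}

/* Reuse-if-present logic, as it might look when written for non-volatile memory. */
enum fs_result fs_create_reuse(struct superblock *sb, uint32_t block_count) {
    if (sb->magic != FS_MAGIC) {           /* nothing recognizable: start fresh */
        sb_format(sb, block_count);
        return FS_CREATED;
    }
    if (sb->checksum != sb_checksum(sb))   /* recognized, but corrupted */
        return FS_ERROR;
    /* (the real routine would unlink pre-existing files here) */
    return FS_REUSED;
}

/* Model a single DRAM bit error by flipping bit `bit` (0..127) of the superblock. */
void flip_bit(struct superblock *sb, unsigned bit) {
    unsigned char *bytes = (unsigned char *)sb;
    bytes[bit / 8] ^= (unsigned char)(1u << (bit % 8));
}
```

With a freshly formatted superblock, a call with no bit flips reuses it (the short-off-time case). Flipping bit 40, which lands in `block_count`, leaves the magic intact but breaks the checksum, giving `FS_ERROR` (the critical zone). Flipping bit 0, which lands in the magic, makes the superblock unrecognizable, so it is simply reformatted (the long-off-time case).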
Of course, there was no error checking on the ramfs_create() return value.
The solution was to change the flags passed to ramfs_create() so that it would overwrite unconditionally. After that, booting was reliable with any amount of off time.
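The fix, together with the missing return check, might look like the following sketch. The names `fs_create()`, `FS_REUSE_EXISTING`, `FS_OVERWRITE` and `boot_mount_ramfs()` are hypothetical stand-ins, since the post doesn't give ramfs_create()'s real signature or flag names, and the superblock and checksum are toy versions for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flags; the real ramfs_create() flag names are not given. */
enum fs_flags { FS_OVERWRITE = 0, FS_REUSE_EXISTING = 1 };

#define FS_MAGIC 0x52414D46u /* assumed magic value */

struct superblock {
    uint32_t magic;
    uint32_t block_count;
    uint32_t checksum;
};

/* Toy checksum for illustration only. */
static uint32_t sb_checksum(const struct superblock *sb) {
    return (sb->magic ^ sb->block_count) + 0x9E3779B9u;
}

/* Stand-in for ramfs_create(): returns 0 on success, -1 if a recognized
 * superblock fails validation (only possible with FS_REUSE_EXISTING). */
int fs_create(struct superblock *sb, uint32_t block_count, unsigned flags) {
    if ((flags & FS_REUSE_EXISTING) && sb->magic == FS_MAGIC)
        return sb->checksum == sb_checksum(sb) ? 0 : -1;
    /* Lay down a fresh superblock, ignoring whatever the DRAM held. */
    sb->magic = FS_MAGIC;
    sb->block_count = block_count;
    sb->checksum = sb_checksum(sb);
    return 0;
}

/* Boot step with the fix applied: force a fresh filesystem and check the result. */
int boot_mount_ramfs(struct superblock *sb, uint32_t block_count) {
    int rc = fs_create(sb, block_count, FS_OVERWRITE);
    if (rc != 0) {
        /* Fail loudly instead of hanging later on first filesystem access. */
        fprintf(stderr, "ramfs create failed (%d), halting boot\n", rc);
        return rc;
    }
    return 0;
}
```

Passing `FS_REUSE_EXISTING` against a recognized-but-corrupt superblock returns -1, the same middle case as in the story. The boot path never takes that branch now, but it still checks the return value in case the real routine can fail for other reasons.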
The lesson: cold boot attacks on DRAM contents are real, and we managed to pull one off on ourselves by accident.