(10-22-2020, 01:47 AM)weblacky Wrote: You knew the "expected" 10 year retention period of modern NAND and you knew it needs to be rewritten (NOT READ) for flash to be refreshed.
I feel like the point I was making has been missed: as flash wears from being written to, the amount of charge that a cell can hold reduces. The cell has worn out when it can no longer hold enough charge to reliably signal a 0 or 1.
Therefore, a cell that's perhaps been rewritten a handful of times can be expected to have a much longer retention period than the quoted period.
It's likely that the flash used in SGI machines is NOR flash. The datasheets I've found didn't say, but it's more in line with the use case, pinout (for the ones we know) and so on. Happily data retention period is typically higher in NOR flash. For their recent NOR flash products, Atmel (now part of Microchip) cites 'greater than 100 years'.
(10-22-2020, 01:47 AM)weblacky Wrote: For older machines...it's in known IC PROMS (Indigo, Indigo2, O2, Indy)...maybe Octane...anyone know? Didn't someone have a PROM dump collection?
There are more programmable devices in SGI systems than that; I know of three in the O2 (not including the RTC), and I've only checked the motherboard module. As well as the flash chip for the PROM, there's also an XC1718 device on the processor module (which I assume is used in configuring (V)ICE) and a DS2502 on the PCI riser.
You can't reprogram the latter two of these: the XC1718 is OTP, and the DS2502 is an 'add only' EPROM.
Which leads to another concern about trying to refresh the PROM: it may not be possible at all. Certainly, it's not possible for any of the machines with a UV EPROM (so Indigo, Indigo2 and Indy are out). With the O2, referring to the AT29C040 datasheet, page 3 mentions a 'Boot Block Programming Lockout'. This allows regions of the flash chip to be permanently locked from reprogramming.
The expected use of this is that the machine starts by running code in this locked section. This code checks the integrity of the remaining flash (probably with a checksum), and if it's OK, jumps to it. If it's not OK, then it takes some corrective action, perhaps blinking an LED to signal an error. This is particularly attractive on systems which have dual flash images; if the newer image is damaged in some way, then the boot process can automatically try the older image.
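To make that boot flow concrete, here is a minimal Python sketch of what a locked boot block might do. It is purely illustrative: the CRC32 checksum, the newest-first image ordering and the names `image_ok` and `select_image` are my assumptions, not anything taken from an SGI PROM or the Atmel datasheet.

```python
import zlib

def image_ok(data: bytes, expected_crc: int) -> bool:
    """Return True if the image's CRC32 matches the stored value.

    CRC32 is an assumption for illustration; real firmware may use
    a different checksum or none at all.
    """
    return (zlib.crc32(data) & 0xFFFFFFFF) == expected_crc

def select_image(images):
    """Pick the first intact image from a newest-first list.

    `images` is a list of (name, data, expected_crc) tuples
    (a hypothetical layout). Returns the name of the image to boot,
    or None if every image fails its check, in which case the boot
    block would take corrective action (e.g. blink an LED).
    """
    for name, data, expected_crc in images:
        if image_ok(data, expected_crc):
            return name
    return None
```

The point of the sketch is that this selection code itself has to live somewhere that a reflash never touches, which is exactly what a locked boot block provides.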
(10-22-2020, 01:47 AM)weblacky Wrote: For newer...I thought it was buried inside a co-processor on like the L1 processor itself? We've never heard from anyone that proved where the firmware lives.
All of the SGIs I've looked at have mostly used off-the-shelf components, with the custom silicon reserved for where it really matters, e.g. the processors, graphics and memory interconnect. Outside of those things, rolling their own is just going to increase cost. Even in things produced in large volumes with security requirements, the flash chips still aren't usually integrated into the processors.
On the Tezro, for instance, looking at photos of the (I think) nodeboard (in this thread), there's what looks to be a Coldfire MCF5206 (m68k-ish) microprocessor next to two chips that look like memory. I would posit that this is the L1, and that one of those chips is RAM and the other flash ROM. They look like standard chips (one looks like it's made by Cypress and the other maybe TI), but unfortunately I cannot read the part numbers.
(10-22-2020, 01:47 AM)weblacky Wrote: Also we have no proof these chips are smart enough to manage or refresh their own data (move it around or refresh) while powered. So let's assume worst case scenario...they data rot from the inside out (behind a locked door, inside a Atmel chip or something).
The worst case scenario is surely failure of the entire chip? Even for partial failures, conventional wisdom is that the chip has degraded and so should be replaced. (Although, as hobbyists, we might try rewriting and hope it keeps working.)
For that replacement, of course, you need a dump, a compatible replacement IC and a device programmer. Whilst we're considering worst cases, it's not even known if reflashing will refresh all of the critical areas of the flash chips, and it's very common to intentionally design things so it won't. In addition, getting dumps of all the programmable devices is actually required, because there are known to be devices in SGI machines which cannot be reprogrammed from the command line.
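On the subject of dumps: if you do pull one with a device programmer, it's worth reading the chip more than once and checking that the reads agree before trusting the result. A minimal Python sketch of that check follows; the helper names `consistent_dump` and `differing_offsets` are hypothetical, and reading several times is my own suggestion rather than anything SGI-specific.

```python
import hashlib

def consistent_dump(reads):
    """Return the dump if every read is byte-identical, else None.

    `reads` is a list of bytes objects, one per read of the same chip.
    """
    if not reads:
        return None
    digests = {hashlib.sha256(r).hexdigest() for r in reads}
    return reads[0] if len(digests) == 1 else None

def differing_offsets(a: bytes, b: bytes):
    """List the byte offsets at which two equal-length reads disagree."""
    if len(a) != len(b):
        raise ValueError("reads differ in length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

Reads that disagree at a handful of offsets would suggest marginal cells (or a flaky programmer connection), which is useful to know before you rely on that dump for a replacement IC.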
Whilst the number of write cycles isn't a concern, losing power whilst flashing the PROM is. If there's a recovery mode, that recovery mode is going to have to be stored in an area of flash that doesn't get rewritten, and it will be the first bit of code that the processor runs on startup. If there's not a recovery mode, losing power during a PROM update means a dead computer.
In summary, if there's a protected boot block (pretty much a given for machines that can have more than one image, and not unlikely on single image machines), reflashing to refresh the PROM is a waste of time, because the boot block won't be rewritten. If there's not, there's an elevated risk of bricking the machine every time you reflash - which is a pretty major issue if you don't even know where the flash IC is.