Hi, I talked about this a bit on the Discord but wanted to go ahead and make a thread to get more insight.
I have an Octane2 system (R12K/400, V8) that I have had for about two years. It had been working fine, but I recently moved it to a new location and it no longer boots properly. This is usually just the RAM becoming unseated, so I removed the system board and reseated everything, but I still usually just get a blinking red lightbar. Using a serial console, I get messages like the following:
Code:
Bad or missing DIMM in bank 0, DIMM S2
Bad or missing DIMM in bank 3, DIMM S8
Bank 0: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0xaaaaaaaa00005450
Bad or missing DIMM in bank 0, DIMM S2
Bank 1: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0xaaaaaaaa00005450
Bad or missing DIMM in bank 1, DIMM S4
Bank 2: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0xaaaaaaaa00000000
Bad or missing DIMM in bank 2, DIMM S6
Bank 3: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0xaaaaaaaa00000000
Bad or missing DIMM in bank 3, DIMM S8
No usable memory found. Make sure you have at least one full bank (2 DIMMs).
I can reboot it multiple times and get slightly different messages and "Expected" values. It is not always the same DIMMs, and sometimes every single DIMM is marked bad:
Code:
Bank 0: Address: 0x90000000a0000000, Expected: 0x5555555555555555, Actual: 0x40016fc86fac82ea
Bank 0: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0x2000aec86facf7fe
Bank 1: Address: 0x90000000a0000000, Expected: 0x5555555555555555, Actual: 0x40017ff8efbc0be0
Bank 1: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0x00003ff8ffffffff
Bank 2: Address: 0x90000000a0000000, Expected: 0x5555555555555555, Actual: 0x4a016ff8ffbfffff
Bank 2: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0x2000aff8ffbffffb
Bank 3: Address: 0x90000000a0000000, Expected: 0x5555555555555555, Actual: 0x5e017ff8ffffffff
Bank 3: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0xb400bef8ffffffff
No usable memory found. Make sure you have at least one full bank (2 DIMMs).
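One way to read these lines more closely is to XOR each "Expected" value with its "Actual" value, so that every bit that came back wrong shows up as a 1. The snippet below is only a rough sketch for doing that with lines copied from the serial console; the helper name `failing_bits` and the regular expression are my own and assume the exact line format quoted above. It is a way of looking at the log, not a diagnosis.

```python
import re

# Assumed format: the "Bank N: Address: ..., Expected: ..., Actual: ..." lines quoted above.
LINE_RE = re.compile(
    r"Bank (\d+): Address: (0x[0-9a-f]+), "
    r"Expected: (0x[0-9a-f]+), Actual: (0x[0-9a-f]+)"
)


def failing_bits(line):
    """Return (bank, xor_mask, bad_bit_count) for one memory-test line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None  # not a memory-test line, e.g. "No usable memory found."
    bank = int(m.group(1))
    expected = int(m.group(3), 16)
    actual = int(m.group(4), 16)
    diff = expected ^ actual  # a 1 bit marks a position that read back wrong
    return bank, diff, bin(diff).count("1")


# Two lines taken from the first log above.
SAMPLE = [
    "Bank 0: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0xaaaaaaaa00005450",
    "Bank 2: Address: 0x90000000a0000008, Expected: 0xaaaaaaaaaaaaaaaa, Actual: 0xaaaaaaaa00000000",
]

for line in SAMPLE:
    result = failing_bits(line)
    if result is not None:
        bank, diff, count = result
        print(f"bank {bank}: diff=0x{diff:016x} bad_bits={count}")
```

For the two sample lines the XOR mask is zero in the upper 32 bits, meaning that in the first log only the lower half of each 64-bit word came back wrong, in every bank. In the second log the errors are spread across the whole word.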
Following the advice on the Discord, I did a deep clean of the board and DIMMs to rule out a contact issue, but it did not change anything significantly.
Strangely enough, every once in a while I will get the system to boot normally after reseating everything, but it then fails to boot into IRIX, either freezing or doing a core dump with an ECC error. I tried to run the diagnostics to see whether they would tell me which DIMM was at fault, but the system freezes immediately when diagnostics are started.
I have a total of twelve 128MB DIMMs available to me, which I have tried in various configurations (going down to just 2, swapping them around, putting only my 4 spare DIMMs in, etc.). No configuration works reliably any more. I have also tried a spare R10K/250 system board, but it has its own problems (it always detects a DIMM mismatch even if the installed memory is identical).
Does anyone know how to troubleshoot this issue more deeply? It seems rather strange (and unlikely) that moving the system killed all my RAM, but I'm not really sure what else to try.