I think mine may be on the way out, much like @jan-jaap mentioned. What's odd is that the environmental monitoring was working. Then again, the monitoring chip on my PIMM seems like it's bad or going bad. The machine has also recently been complaining about cache errors.
Monitoring chip:
Code:
ALERT: Error initializing the PIMM DS1780, no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM temperature monitoring: no acknowledge
Crash log:
Code:
serenity 12# tail fru.5
Dumpheader version 7, processor type IP35, running in M-mode
<6>++FRU ANALYSIS BEGIN
<6>Guess 0: 001c01/IP35/PIMM0 Likelyhood: 90 PIMM/CPU dcache error
<6>
Timeout Histogram is empty.
<6>++FRU ANALYSIS END
Which is also confirmed in the POD:
Code:
A 000 001c01: POD SysCt Dex> log
A 000 001c01: A Info : *** Local network link down
A 000 001c01: A Fatal: Serial #: 08006910413E
A 000 001c01: A Fatal: HARDWARE ERROR STATE:
A 000 001c01: A Fatal: + Errors on node Nasid 0x0 (0)
A 000 001c01: A Fatal: + IP35 in /hw/module/001c01/node [serial number MNW845]
A 000 001c01: A Fatal: + BEDROCK signalled following errors.
A 000 001c01: A Fatal: + BEDROCK PI 0 Error spool A:
A 000 001c01: A Fatal: + *** 12 Access Errors to unpopulated memory skipped
A 000 001c01: A Fatal: + CPU cache on cpu 0
A 000 001c01: A Fatal: + Cache error register: 0x48000018
A 000 001c01: A Fatal: + Primary dcache error: pidx 0x18 Way 1 addr 0x4f5cc018
A 000 001c01: A Fatal: + 27: D, Uncorrectable data array error, way 1
A 000 001c01: A Fatal: End Hardware Error State
A 000 001c01: A Fatal: ++FRU ANALYSIS BEGIN
A 000 001c01: A Fatal: Guess 0: 001c01/IP35/PIMM0 Likelyhood: 90 PIMM/CPU dcache error
A 000 001c01: A Fatal:
A 000 001c01: Timeout Histogram is empty.
A 000 001c01:
A 000 001c01: A Fatal: ++FRU ANALYSIS END
A 000 001c01: A Fatal: PANIC: Cache Error (unrecoverable - dcache data) Eframe = 0x838
A 000 001c01: A Fatal:
A 000 001c01: Dumping to /hw/module/001c01/Ibrick/xtalk/15/pci/1/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x2000 pages
A 000 001c01: A Info : System dump startedA Fatal: Dumping low memory...A Fatal:
A 000 001c01: A Fatal: umping static kernel pages...--More--
A 000 001c01: A Fatal: umping pfdat pages...--More--
A 000 001c01: A Fatal: umping backtrace pages...--More--
A 000 001c01: A Fatal: umping dynamic kernel pages...--More--
A 000 001c01: A Fatal: Dumping buffer pages...A Fatal:
A 000 001c01: A Fatal: Dumping remaining in-use pages...A Fatal:
A 000 001c01: A Fatal: Dumping free pages...A Fatal:
A 000 001c01: Available dump space depleted.
A 000 001c01: A Fatal:
A 000 001c01: A Fatal:
A 000 001c01: A Fatal: Restarting the machine...
A 000 001c01: A Info : Kernel log synchronized.
A 000 001c01: A Info : *** Local network link down
A 000 001c01: A Fatal: Serial #: 08006910413E
A 000 001c01: A Fatal: HARDWARE ERROR STATE:
A 000 001c01: A Fatal: + Errors on node Nasid 0x0 (0)
A 000 001c01: A Fatal: + IP35 in /hw/module/001c01/node [serial number MNW845]
A 000 001c01: A Fatal: + BEDROCK signalled following errors.
A 000 001c01: A Fatal: + BEDROCK PI 0 Error spool A:
A 000 001c01: A Fatal: + *** 14 Access Errors to unpopulated memory skipped
A 000 001c01: A Fatal: End Hardware Error State
A 000 001c01: A Fatal: ++FRU ANALYSIS BEGIN
A 000 001c01: A Fatal: No rules triggered: Insufficient data
A 000 001c01: A Fatal:
A 000 001c01: Timeout Histogram is empty.
A 000 001c01:
A 000 001c01: A Fatal: ++FRU ANALYSIS END
A 000 001c01: A Fatal: PANIC: Cache Error (CER=0x64000600 Multiple cache errors in same unit detected) Eframe = 0x838
A 000 001c01: A Info : *** Local network link down
A 000 001c01: A Fatal: Restarting the machine...
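For anyone who wants to pull the FRU guesses out of a pile of these logs instead of reading them by eye, here is a minimal Python sketch. It assumes the guess lines always look like the ones in my captures above (including the firmware's own "Likelyhood" spelling); the function name `parse_fru_guesses` and the dictionary keys are made up for illustration and are not part of any SGI tool.

```python
import re

# Matches guess lines as they appear in both fru.5 and the POD log, e.g.
#   <6>Guess 0: 001c01/IP35/PIMM0 Likelyhood: 90 PIMM/CPU dcache error
# The pattern is an assumption based only on the lines captured in this post.
GUESS_RE = re.compile(
    r"Guess (?P<rank>\d+): (?P<fru>\S+) Likelyhood: (?P<pct>\d+) (?P<reason>.+)$"
)


def parse_fru_guesses(text):
    """Return one dict per FRU guess line found in the log text."""
    guesses = []
    for line in text.splitlines():
        m = GUESS_RE.search(line)
        if m:
            guesses.append({
                "rank": int(m["rank"]),
                "fru": m["fru"],
                "likelihood": int(m["pct"]),
                "reason": m["reason"].strip(),
            })
    return guesses


# Lines copied from the captures above.
SAMPLE = """\
<6>++FRU ANALYSIS BEGIN
<6>Guess 0: 001c01/IP35/PIMM0 Likelyhood: 90 PIMM/CPU dcache error
A 000 001c01: A Fatal: Guess 0: 001c01/IP35/PIMM0 Likelyhood: 90 PIMM/CPU dcache error
A 000 001c01: A Fatal: No rules triggered: Insufficient data
"""

for guess in parse_fru_guesses(SAMPLE):
    print(guess)
```

Lines such as "No rules triggered: Insufficient data" simply produce no entry, so a crash with no guess shows up as an empty list.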
Doing the following gets rid of the error and allows me to boot into the system:
Code:
001c01-L1>debug 0x10d (<-Setting DEBUG mode)
debug switches set to 0x010d
001c01-L1>pwr u
001c01-L1> (<-Ctrl + D)
entering console mode 001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01: POD SysCt Cac> go cac
A 000 001c01: POD SysCt Cac> clearalllogs
A 000 001c01: POD SysCt Cac> initalllogs
A 000 001c01: POD SysCt Cac> flush
A 000 001c01: POD SysCt Cac>
escaping to L1 system controller
001c01-L1>debug 0 (<- Clear DEBUG flag)
debug switches set to 0x0000
returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01: POD SysCt Cac> reset
However, I can see that monitoring is now enabled but not running.
Code:
001a01-L1>env
Environmental monitoring is enabled, but not running (configuration error).
Description    State      Warning Limits    Fault Limits      Current
-------------- ---------- ----------------- ----------------- -------
12V            Enabled    10%  10.80/ 13.20 20%   9.60/ 14.40   11.94
12V IO         Enabled    10%  10.80/ 13.20 20%   9.60/ 14.40   12.00
5V             Enabled    10%   4.50/  5.50 20%   4.00/  6.00    5.07
3.3V           Enabled    10%   2.97/  3.63 20%   2.64/  3.96    3.34
2.5V           Enabled    10%   2.25/  2.75 20%   2.00/  3.00    2.47
1.5V           Enabled    10%   1.35/  1.65 20%   1.20/  1.80    1.47
5V aux         Enabled    10%   4.50/  5.50 20%   4.00/  6.00    4.99
3.3V aux       Enabled    10%   2.97/  3.63 20%   2.64/  3.96    3.30
PIMM0 12V bias Enabled    10%  10.80/ 13.20 20%   9.60/ 14.40    3.30
Asterix SRAM   Enabled    10%   2.25/  2.75 20%   2.00/  3.00    3.30
Asterix CPU    Enabled    10%   1.44/  1.76 20%   1.28/  1.92    3.30
PIMM0 1.5V     Enabled    10%   1.35/  1.65 20%   1.20/  1.80    3.30
PIMM0 3.3V aux Enabled    10%   2.97/  3.63 20%   2.64/  3.96    3.30
PIMM0 5V aux   Enabled    10%   4.50/  5.50 20%   4.00/  6.00    3.30
XIO 12V bias   Enabled    10%  10.80/ 13.20 20%   9.60/ 14.40   11.81
XIO 5V         Enabled    10%   4.50/  5.50 20%   4.00/  6.00    5.04
XIO 2.5V       Enabled    10%   2.25/  2.75 20%   2.00/  3.00    2.47
XIO 3.3V aux   Enabled    10%   2.97/  3.63 20%   2.64/  3.96    3.30

Description    State      Warning RPM Current RPM
-------------- ---------- ----------- -----------
FAN 0 EXHAUST  Enabled            920        1219
FAN 1 HD       Enabled           1560        2244
FAN 2 PCI      Enabled           1120        1573
FAN 3 XIO 1    Enabled           1600        2200
FAN 4 XIO 2    Enabled           1600        4611
FAN 5 PS       Enabled           1600        2259

                           Advisory  Critical  Fault     Current
Description    State       Temp      Temp      Temp      Temp
-------------- ---------- --------- --------- --------- ---------
NODE 0         Enabled     60C/140F  65C/149F  70C/158F  40C/104F
NODE 1         Enabled     60C/140F  65C/149F  70C/158F  37C/ 98F
NODE 2         Enabled     60C/140F  65C/149F  70C/158F  31C/ 87F
PIMM           Enabled     60C/140F  65C/149F  70C/158F   0C/ 32F
ODYSSEY        Enabled     60C/140F  65C/149F  70C/158F  40C/104F
BEDROCK        Enabled     60C/140F  65C/149F  70C/158F  43C/109F
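To make the odd readings easier to spot, here is a rough Python sketch that checks each voltage row of the `env` output against its own warning and fault limits. It assumes the row layout shown above (name, `Enabled`, two percent/low/high groups, then the current value); the regex and the function name `check_rails` are my own and not anything the L1 provides.

```python
import re

# One voltage row of the L1 "env" output, e.g.
#   PIMM0 12V bias Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 3.30
# Assumption: the state column is always "Enabled", as in the capture above.
ROW_RE = re.compile(
    r"^\s*(?P<name>.+?)\s+Enabled\s+"
    r"\d+%\s+(?P<warn_lo>[\d.]+)/\s*(?P<warn_hi>[\d.]+)\s+"
    r"\d+%\s+(?P<fault_lo>[\d.]+)/\s*(?P<fault_hi>[\d.]+)\s+"
    r"(?P<current>[\d.]+)\s*$"
)


def check_rails(text):
    """Return (name, current, status) for every voltage row in the text."""
    results = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue  # headers, separators, fan and temperature rows
        current = float(m["current"])
        if not float(m["fault_lo"]) <= current <= float(m["fault_hi"]):
            status = "fault"
        elif not float(m["warn_lo"]) <= current <= float(m["warn_hi"]):
            status = "warning"
        else:
            status = "ok"
        results.append((m["name"], current, status))
    return results


# Rows copied from the env output above.
SAMPLE = """\
Description State Warning Limits Fault Limits Current
12V Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 11.94
PIMM0 12V bias Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 3.30
Asterix CPU Enabled 10% 1.44/ 1.76 20% 1.28/ 1.92 3.30
PIMM0 3.3V aux Enabled 10% 2.97/ 3.63 20% 2.64/ 3.96 3.30
"""

for name, current, status in check_rails(SAMPLE):
    print(f"{name:<16} {current:6.2f}  {status}")
```

On these rows, PIMM0 12V bias and Asterix CPU come out as faults, while PIMM0 3.3V aux passes only because 3.30 happens to fall inside its limits. Every PIMM-side row in the table reads exactly 3.30.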
However, I get no more crashes. So perhaps the monitoring chip on the PIMM is dead but the module itself is still fine. I'll keep it running for a while to see. With the cache errors present, though, I feel like the PIMM will eventually die; fingers crossed it's fine. Thoughts? The temperatures are otherwise fine. It's just the PIMM itself not reporting, likely because the DS1780 is dead.
I did reach out to someone who still sells certain parts and am ready to replace it if need be. But I wanted to be 100% sure before I spend money on something that I generally don't use every day.