IRIXNetwork

s0ke · 09-05-2024, 09:28 PM

I think mine may be on the way out, much like @jan-jaap mentioned. What's odd is that the environmental monitoring was working. Then again, the chip on my PIMM seems to be bad or going bad. It has recently been complaining about cache errors.

Monitoring chip:

Code:
ALERT: Error initializing the PIMM DS1780, no acknowledge

ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge

ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge

ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge

ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge

ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge

ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge

ALERT: Error configuring PIMM temperature monitoring: no acknowledge

Crash log:

Code:
serenity 12# tail fru.5

Dumpheader version 7, processor type IP35, running in M-mode

        <6>++FRU ANALYSIS BEGIN

        <6>Guess 0:    001c01/IP35/PIMM0      Likelyhood: 90  PIMM/CPU dcache error

        <6>

        Timeout Histogram is empty.

        <6>++FRU ANALYSIS END

This is also confirmed in the POD log:

Code:
A 000 001c01: POD SysCt Dex> log

A 000 001c01: A Info : *** Local network link down

A 000 001c01: A Fatal: Serial #: 08006910413E

A 000 001c01: A Fatal: HARDWARE ERROR STATE:

A 000 001c01: A Fatal: +  Errors on node Nasid 0x0 (0)

A 000 001c01: A Fatal: +    IP35 in /hw/module/001c01/node [serial number MNW845]

A 000 001c01: A Fatal: +      BEDROCK signalled following errors.

A 000 001c01: A Fatal: +        BEDROCK PI 0 Error spool A:

A 000 001c01: A Fatal: +            *** 12 Access Errors to unpopulated memory skipped

A 000 001c01: A Fatal: +      CPU cache on cpu 0

A 000 001c01: A Fatal: +        Cache error register: 0x48000018

A 000 001c01: A Fatal: +          Primary dcache error: pidx 0x18  Way 1 addr 0x4f5cc018

A 000 001c01: A Fatal: +          27:  D, Uncorrectable data array error, way 1

A 000 001c01: A Fatal: End Hardware Error State

A 000 001c01: A Fatal: ++FRU ANALYSIS BEGIN

A 000 001c01: A Fatal: Guess 0: 001c01/IP35/PIMM0      Likelyhood: 90  PIMM/CPU dcache error

A 000 001c01: A Fatal:

A 000 001c01: Timeout Histogram is empty.

A 000 001c01:

A 000 001c01: A Fatal: ++FRU ANALYSIS END

A 000 001c01: A Fatal: PANIC: Cache Error (unrecoverable - dcache data) Eframe = 0x838

A 000 001c01: A Fatal:

A 000 001c01: Dumping to /hw/module/001c01/Ibrick/xtalk/15/pci/1/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x2000 pages

A 000 001c01: A Info : System dump startedA Fatal: Dumping low memory...A Fatal:

A 000 001c01: A Fatal:  umping static kernel pages...--More--

A 000 001c01: A Fatal:  umping pfdat pages...--More--

A 000 001c01: A Fatal:  umping backtrace pages...--More--

A 000 001c01: A Fatal:  umping dynamic kernel pages...--More--

A 000 001c01: A Fatal: Dumping buffer pages...A Fatal:

A 000 001c01: A Fatal: Dumping remaining in-use pages...A Fatal:

A 000 001c01: A Fatal: Dumping free pages...A Fatal:

A 000 001c01: Available dump space depleted.

A 000 001c01: A Fatal:

A 000 001c01: A Fatal:

A 000 001c01: A Fatal: Restarting the machine...

A 000 001c01: A Info : Kernel log synchronized.

A 000 001c01: A Info : *** Local network link down

A 000 001c01: A Fatal: Serial #: 08006910413E

A 000 001c01: A Fatal: HARDWARE ERROR STATE:

A 000 001c01: A Fatal: +  Errors on node Nasid 0x0 (0)

A 000 001c01: A Fatal: +    IP35 in /hw/module/001c01/node [serial number MNW845]

A 000 001c01: A Fatal: +      BEDROCK signalled following errors.

A 000 001c01: A Fatal: +        BEDROCK PI 0 Error spool A:

A 000 001c01: A Fatal: +            *** 14 Access Errors to unpopulated memory skipped

A 000 001c01: A Fatal: End Hardware Error State

A 000 001c01: A Fatal: ++FRU ANALYSIS BEGIN

A 000 001c01: A Fatal: No rules triggered:  Insufficient data

A 000 001c01: A Fatal:

A 000 001c01: Timeout Histogram is empty.

A 000 001c01:

A 000 001c01: A Fatal: ++FRU ANALYSIS END

A 000 001c01: A Fatal: PANIC: Cache Error (CER=0x64000600 Multiple cache errors in same unit detected) Eframe = 0x838

A 000 001c01: A Info : *** Local network link down

A 000 001c01: A Fatal: Restarting the machine...
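For anyone wanting to sanity-check the two cache error register values from the log, here is a minimal Python sketch that just lists which bits are set. It deliberately does not assign meanings to the bits, since I'm not reproducing the CPU's register layout here; `set_bits` is a made-up helper name. The only things taken from the log are the two hex values and the "27: D, Uncorrectable data array error" line, which is consistent with bit 27 being set in 0x48000018. The low bits of that value (0x18) also happen to match the reported pidx 0x18.

```python
def set_bits(value: int) -> list[int]:
    """Return the positions of all set bits in value, highest bit first."""
    return [bit for bit in range(value.bit_length() - 1, -1, -1)
            if (value >> bit) & 1]


# The two register values reported in the POD log above.
for cer in (0x48000018, 0x64000600):
    print(f"{cer:#010x}: bits set = {set_bits(cer)}")
```

The first value has bit 27 set and the second does not, which at least fits the second panic being reported differently ("Multiple cache errors in same unit detected") rather than as a dcache data error.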

Doing the following gets rid of the error and allows me to boot into the system:

Code:
001c01-L1>debug 0x10d (<-Setting DEBUG mode)

debug switches set to 0x010d

001c01-L1>pwr u

001c01-L1> (<-Ctrl + D)

entering console mode 001c01 CPU0, <CTRL_T> to escape to L1

A 000 001c01: POD SysCt Cac> go cac

A 000 001c01: POD SysCt Cac> clearalllogs

A 000 001c01: POD SysCt Cac> initalllogs

A 000 001c01: POD SysCt Cac> flush

A 000 001c01: POD SysCt Cac>

escaping to L1 system controller

001c01-L1>debug 0 (<- Clear DEBUG flag)

debug switches set to 0x0000

returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1

A 000 001c01: POD SysCt Cac> reset

However, I can see that monitoring is now enabled but not running:

Code:
001a01-L1>env

Environmental monitoring is enabled, but not running (configuration error).

Description    State      Warning Limits    Fault Limits      Current

-------------- ----------  -----------------  -----------------  -------

          12V    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  11.94

        12V IO    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  12.00

            5V    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    5.07

          3.3V    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.34

          2.5V    Enabled  10%  2.25/  2.75  20%  2.00/  3.00    2.47

          1.5V    Enabled  10%  1.35/  1.65  20%  1.20/  1.80    1.47

        5V aux    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    4.99

      3.3V aux    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.30

PIMM0 12V bias    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40    3.30

  Asterix SRAM    Enabled  10%  2.25/  2.75  20%  2.00/  3.00    3.30

  Asterix CPU    Enabled  10%  1.44/  1.76  20%  1.28/  1.92    3.30

    PIMM0 1.5V    Enabled  10%  1.35/  1.65  20%  1.20/  1.80    3.30

PIMM0 3.3V aux    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.30

  PIMM0 5V aux    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    3.30

  XIO 12V bias    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  11.81

        XIO 5V    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    5.04

      XIO 2.5V    Enabled  10%  2.25/  2.75  20%  2.00/  3.00    2.47

  XIO 3.3V aux    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.30

Description    State      Warning RPM  Current RPM

-------------- ----------  -----------  -----------

FAN 0  EXHAUST    Enabled          920        1219

FAN 1      HD    Enabled        1560        2244

FAN 2      PCI    Enabled        1120        1573

FAN 3    XIO 1    Enabled        1600        2200

FAN 4    XIO 2    Enabled        1600        4611

FAN 5      PS    Enabled        1600        2259

                          Advisory  Critical  Fault      Current

Description    State      Temp      Temp      Temp      Temp

-------------- ----------  ---------  ---------  ---------  ---------

NODE 0            Enabled  60C/140F  65C/149F  70C/158F  40C/104F

NODE 1            Enabled  60C/140F  65C/149F  70C/158F  37C/ 98F

NODE 2            Enabled  60C/140F  65C/149F  70C/158F  31C/ 87F

PIMM              Enabled  60C/140F  65C/149F  70C/158F    0C/ 32F

ODYSSEY          Enabled  60C/140F  65C/149F  70C/158F  40C/104F

BEDROCK          Enabled  60C/140F  65C/149F  70C/158F  43C/109F
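To make the odd readings easier to spot, here is a rough Python sketch that parses a few of the voltage rows from the `env` output above and flags any whose current value falls outside the fault limits. The regular expression is my own guess at the column layout based only on this paste, and `out_of_fault_limits` is a made-up helper name, so treat it as an illustration rather than a general L1 parser.

```python
import re

# A handful of rows copied from the env output above.
ENV_SAMPLE = """\
          12V    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  11.94
          3.3V    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.34
PIMM0 12V bias    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40    3.30
  Asterix CPU    Enabled  10%  1.44/  1.76  20%  1.28/  1.92    3.30
PIMM0 3.3V aux    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.30
  PIMM0 5V aux    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    3.30
"""

# Assumed layout: name, state, warning %, warn lo/hi, fault %, fault lo/hi, current.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+Enabled\s+"
    r"\d+%\s+[\d.]+/\s*[\d.]+\s+"                  # warning limits (not used here)
    r"\d+%\s+(?P<lo>[\d.]+)/\s*(?P<hi>[\d.]+)\s+"  # fault limits
    r"(?P<cur>[\d.]+)\s*$"
)


def out_of_fault_limits(text):
    """Return (name, current, lo, hi) for rows whose reading is outside the fault limits."""
    flagged = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers, separators and anything that doesn't fit
        lo, hi, cur = (float(m[k]) for k in ("lo", "hi", "cur"))
        if not lo <= cur <= hi:
            flagged.append((m["name"], cur, lo, hi))
    return flagged


for name, cur, lo, hi in out_of_fault_limits(ENV_SAMPLE):
    print(f"{name}: {cur:.2f} V is outside {lo:.2f}-{hi:.2f} V")
```

On these rows it flags PIMM0 12V bias, Asterix CPU and PIMM0 5V aux. Every PIMM-side rail reads exactly 3.30 no matter what its nominal voltage is, which looks more like a placeholder value than a real measurement. That is only my reading of the numbers, but it would fit the DS1780 not answering.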

However, I get no more crashes. So perhaps the monitoring chip on the PIMM is dead but the module itself is still fine. I'll keep it running for a while to see. With the cache errors present, though, I feel like the PIMM will eventually die; fingers crossed it's fine. Thoughts? The temperatures are otherwise fine. It's just the PIMM itself not reporting, likely because the DS1780 is dead.

I did reach out to someone who still sells certain parts and am ready to replace it if need be. But I wanted to be 100% sure before I spend money on something that I generally don't use every day.

