PIMM shot?
#11
RE: PIMM shot?
(02-03-2024, 04:03 PM)weblacky Wrote:  For posterity the issue ended up being…

What about commands used?

Indigo2 IMPACT  : R10K-195MHz, 1GB RAM, 146GB 15K, CD-ROM, AudioDAT, MaxImpact w/ TRAM.  IRIX 6.5.22

O2 : R12K-400MHz, 1GB RAM, 300GB 15K, DVD-ROM, CRM Graphics, AV1/2 Media Boards & O2 Cam, DV-Link, FPA & SW1600.  IRIX 6.5.30

 : 2 x R14K-600MHz, 6GB RAM, V12 Graphics, PCI Shoebox.  IRIX 6.5.30

IBM  : 7012-39H, 7043-140

(This post was last modified: 02-05-2024, 01:26 PM by chulofiasco.)
chulofiasco
Hardware Junkie

Trade Count: (0)
Posts: 328
Threads: 51
Joined: May 2019
Location: New York, NY
02-05-2024, 01:25 PM
#12
RE: PIMM shot?
All commands used were already in this thread at post #5: http://forums.irixnet.org/thread-4195-po...l#pid31127

Since Debug was never issued, it's not posted.

My advice was mostly physical, coupled with whatever terminal software was being used on the client side to see the serial output. Otherwise the commands in the link above were the only other ones utilized up to that point.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
02-05-2024, 02:15 PM
#13
RE: PIMM shot?
Hi Weblacky & Others,

I presume the machine was stopping in POD, as someone had set the debug flag to non-zero.

I see you clear the logs, but didn't you also reset the L1 debug flag to zero to stop the machine stopping at POD?

Cheers from Oz,

jwhat/John.
jwhat
Octane/O350/Fuel User

Trade Count: (0)
Posts: 513
Threads: 29
Joined: Jul 2018
Location: Australia
02-08-2024, 12:07 AM
#14
RE: PIMM shot?
(02-08-2024, 12:07 AM)jwhat Wrote:  Hi Weblacky & Others,

I presume the machine was stopping in POD, as someone had set the debug flag to non-zero.

I see you clear the logs, but didn't you also reset the L1 debug flag to zero to stop the machine stopping at POD?

Cheers from Oz,

jwhat/John.

Yeah, I had him do that (clearing the logs and resetting debug to 0 on the L1).  We assumed the previous owner may have commanded it into L1 debug during troubleshooting and then put it in storage. We saw no other error on the serial PROM console to explain why it had ended up in debug.  However, it could be due to a dead snaphat. I'm not sure whether a dead snaphat drops you into POD at first initialization, though. Maybe it does and I've forgotten?

Either way, after clearing debug, things seemed normal for now.   We haven't heard back about the full experience because at the time the user didn't have a DVI monitor and keyboard/mouse hooked up to the system.  They were installing IRIX solely via serial terminal.  Not my cup of tea, but they're at the full PROM so they should be OK.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
02-08-2024, 12:20 AM
#15
RE: PIMM shot?
The system kept stopping at:

Code:
A 000 001c01: POD SysCt Dex> why
A 000 001c01:  EPC    : 0xc00000001fc05c38 (0xc00000001fc05c38)
A 000 001c01:  ERREPC : 0x0100020000302000 (0x100020000302000)
A 000 001c01:  CACERR : 0x000000004a98110a
A 000 001c01:  Status : 0x0000000024407c80
A 000 001c01:  BadVA  : 0xc00000001fc05c38 (0xc00000001fc05c38)
A 000 001c01:  RA     : 0xffffffffbfc38a4c (0xc00000001fc38a4c)
A 000 001c01:  SP     : 0xa800000000103650
A 000 001c01:  A0     : 0x0000000000000000
A 000 001c01:  Cause  : 0x0000000000008008 (INT:8------- <Load TLB Miss>)
A 000 001c01:  Reason : 245 (Unexpected XTLB Refill Exception.)
A 000 001c01:  POD mode was called from: 0xc00000001fc01fe4 (0xc00000001fc01fe4)

Turns out the snaphat was dead and nvram was resetting back to
Code:
console g
instead of
Code:
console d
which is what I needed since the machine has no peripherals attached. Doh! Snaphat replaced, POD cleared, and debug set back to 0. No more issues. IRIX installed via bootp/tftp.
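For anyone else going the bootp/tftp route: the server side is just two inetd services plus the boot image in the tftp directory. A rough sketch only, with paths, filenames, and addresses purely illustrative rather than taken from this install:

```shell
# /etc/inetd.conf on the boot server: enable bootp and tftp
# (-s restricts tftpd to the given directory)
bootp dgram udp wait root  /usr/etc/bootp bootp
tftp  dgram udp wait guest /usr/etc/tftpd tftpd -s /tftpboot

# then restart inetd and, from the Fuel's PROM monitor, pull the
# standalone tools over the network (addresses/filenames illustrative):
#   >> setenv netaddr 192.168.1.50
#   >> boot -f bootp()192.168.1.10:/tftpboot/fx.64
```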

O2 Fuel
s0ke
O2

Trade Count: (1)
Posts: 29
Threads: 6
Joined: Jul 2020
02-08-2024, 03:53 PM
#16
RE: PIMM shot?
Code:
                          Advisory  Critical  Fault      Current
Description    State      Temp      Temp      Temp      Temp
-------------- ----------  ---------  ---------  ---------  ---------
NODE 0            Enabled  60C/140F  65C/149F  70C/158F  37C/ 98F
NODE 1            Enabled  60C/140F  65C/149F  70C/158F  32C/ 89F
NODE 2            Enabled  60C/140F  65C/149F  70C/158F  26C/ 78F
PIMM              Enabled  60C/140F  65C/149F  70C/158F    0C/ 32F
ODYSSEY          Enabled  60C/140F  65C/149F  70C/158F  28C/ 82F
BEDROCK          Enabled  60C/140F  65C/149F  70C/158F  38C/100F

Ugh, env monitoring on the PIMM is now dead. Will likely just replace it instead of trying to get a new chip. Just wish my O2 was this snappy; then I wouldn't have a need for these temperamental things.

O2 Fuel
(This post was last modified: 08-21-2024, 01:54 AM by s0ke.)
s0ke
O2

Trade Count: (1)
Posts: 29
Threads: 6
Joined: Jul 2020
08-21-2024, 01:52 AM
#17
RE: PIMM shot?
(08-21-2024, 01:52 AM)s0ke Wrote:  
Code:
                          Advisory  Critical  Fault      Current
Description    State      Temp      Temp      Temp      Temp
-------------- ----------  ---------  ---------  ---------  ---------
NODE 0            Enabled  60C/140F  65C/149F  70C/158F  37C/ 98F
NODE 1            Enabled  60C/140F  65C/149F  70C/158F  32C/ 89F
NODE 2            Enabled  60C/140F  65C/149F  70C/158F  26C/ 78F
PIMM              Enabled  60C/140F  65C/149F  70C/158F    0C/ 32F
ODYSSEY          Enabled  60C/140F  65C/149F  70C/158F  28C/ 82F
BEDROCK          Enabled  60C/140F  65C/149F  70C/158F  38C/100F

Ugh, env monitoring on the PIMM is now dead. Will likely just replace it instead of trying to get a new chip. Just wish my O2 was this snappy; then I wouldn't have a need for these temperamental things.

I can fix that! Replacing the ENV monitoring chip on the Fuel PIMM is very easy. If your processor is otherwise working, I'd be happy to replace it for a nominal fee; PM me if you're interested. Otherwise you can swap the chip out yourself if you're handy with SMD soldering. The tricky part is removing it from the board without hot air, which I don't use in any of my work; I use an extractor tip that matches that SMD package size.

Does ENV still show good voltages for your PIMM? Or are all the PIMM readings in ENV erroneous or zero?

PIMM VRM parts can fail, which will feed the wrong voltage to parts of the CPU and can kill the environmental monitoring chip. However, if your CPU is working normally and you're not seeing cache errors during PROM testing or operating system freezes, then the environmental monitoring chip simply went bad and a replacement may get you going again. I can quickly check parts of the VRM for you, but it won't be an exhaustive test unless I start desoldering things, which I don't normally do.
(This post was last modified: 08-21-2024, 02:27 AM by weblacky.)
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
08-21-2024, 02:10 AM
#18
RE: PIMM shot?
(08-21-2024, 02:10 AM)weblacky Wrote:  I can fix that! Replacing the ENV monitoring chip on the Fuel PIMM is very easy.
Aren't env monitoring and I2C handled by the same chip? My Fuel has a faulty I2C chip on the PIMM. At first it was an intermittent problem and the CPU itself would work fine, but now the L1 always tells me there's no CPU.

I have a dozen or so of these chips stashed somewhere
jan-jaap
SGI Collector

Trade Count: (0)
Posts: 1,048
Threads: 37
Joined: Jun 2018
Location: Netherlands
08-21-2024, 10:05 AM
#19
RE: PIMM shot?
Well, that’s why I asked about the CPU’s operation otherwise. If you remember my PIMM repair thread from oh so many years ago, with a 600MHz Fuel PIMM from user “noguri”, I found that the VRM was genuinely damaged and had taken the PIMM’s ENV chip with it.

https://forums.irixnet.org/thread-3238.html

I ended up repairing both. As you yourself suggested on that thread, the VRM had some sort of small damage causing it to operate improperly, and the 5V line was running high, which is what actually damaged the sensor to begin with. In my case the reported value was actually higher than the real voltage: the ENV chip could no longer report correctly but was still alive. So it was both VRM damage and incorrect reporting, with the CPU still present. That’s why I asked whether his other values are still there. It’s possible some voltage values appear OK, or at least within known range, while his temperature sensing has gone bad.

I’ve not heard of the CPU not being detected at all. I guess it depends on whether I2C is used for that. As you yourself know, if the ENV chip fails in a certain way it can definitely hold down the bus so that nothing else can communicate, but normally you then get a bunch of “node not acknowledged” errors from the L1 as an obvious sign of that collapse.

I’ve only done this once, so I have no other situations to draw from. But I know that for this user the system was booting normally, and he already had a third-party ATX power supply using Kuba’s power adapter, so I really doubt his power system went weird and took something out. Although given the age, certainly anything is possible.

It’s the same exact chip that’s on the Fuel main board for sensing, a DS1780. However, I’m not 100% sure whether the I2C bus is the sole domain of the environmental monitoring system or whether it is also used for other purposes.
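Incidentally, that flat 0C/32F PIMM temperature row is exactly what a zeroed sensor register decodes to. A quick sketch of how LM78/DS1780-class monitor chips’ raw bytes are typically interpreted; the scaling here is assumed from that chip family for illustration, not checked against the DS1780 datasheet:

```python
def decode_temp_c(raw: int) -> int:
    """Decode an 8-bit two's-complement temperature byte (1 degC per LSB),
    the usual scheme for LM78/DS1780-class hardware monitors."""
    return raw - 256 if raw & 0x80 else raw

def c_to_f(c: float) -> float:
    """Plain Celsius-to-Fahrenheit conversion, as shown in the L1 env table."""
    return c * 9 / 5 + 32

def decode_voltage(raw: int, lsb: float = 0.016) -> float:
    """Positive voltage inputs typically read as raw * LSB; the ~16 mV LSB
    here is an assumption for illustration, not a datasheet value."""
    return raw * lsb

# A dead or unresponsive sensor register reads 0x00, which decodes to
# 0C / 32F -- the telltale PIMM row in the env output.
print(decode_temp_c(0x00), c_to_f(decode_temp_c(0x00)))  # -> 0 32.0
print(decode_temp_c(0x28))                               # -> 40 (a healthy 40C)
```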

In the case of your CPU not even being detected, the only thing I understand that could mean is that either the identification information is unavailable, which could be due to the EPROM chip being damaged, or, much more likely, some sort of voltage failure on the board itself producing very wrong PIMM voltages. Either too high or too low is going to affect the CPU or its components, so I would assume a problem with that voltage would yield the mysteriously empty information: either permanent damage, a protection circuit, or not enough voltage to run the components. Remember that a voltage just beyond the maximum rating on the DS1780’s legs will damage that chip. But damage to the entire PIMM is a slightly different matter that I would need more samples to research further.

I guess, to put it another way: does it say you have no CPU because there isn’t the correct voltage, because there isn’t the correct identification, or because the CPU hardware is damaged and basically not there anymore?

Thanks.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
08-21-2024, 11:51 AM
#20
RE: PIMM shot?
I think mine may be on the way out, much like @jan-jaap mentioned. What's odd is that the environmental monitoring was working. Again though, the chip on my PIMM seems like it's bad or going bad, and it's recently been complaining about cache errors.

Monitoring chip:
Code:
ALERT: Error initializing the PIMM DS1780, no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM DS1780 power monitoring: no acknowledge
ALERT: Error configuring PIMM temperature monitoring: no acknowledge

Crash log:

Code:
serenity 12# tail fru.5

Dumpheader version 7, processor type IP35, running in M-mode
        <6>++FRU ANALYSIS BEGIN
        <6>Guess 0:    001c01/IP35/PIMM0      Likelyhood: 90  PIMM/CPU dcache error
        <6>
        Timeout Histogram is empty.
        <6>++FRU ANALYSIS END

Which is also confirmed in the POD:

Code:
A 000 001c01: POD SysCt Dex> log
A 000 001c01: A Info : *** Local network link down
A 000 001c01: A Fatal: Serial #: 08006910413E
A 000 001c01: A Fatal: HARDWARE ERROR STATE:
A 000 001c01: A Fatal: +  Errors on node Nasid 0x0 (0)
A 000 001c01: A Fatal: +    IP35 in /hw/module/001c01/node [serial number MNW845]
A 000 001c01: A Fatal: +      BEDROCK signalled following errors.
A 000 001c01: A Fatal: +        BEDROCK PI 0 Error spool A:
A 000 001c01: A Fatal: +            *** 12 Access Errors to unpopulated memory skipped
A 000 001c01: A Fatal: +      CPU cache on cpu 0
A 000 001c01: A Fatal: +        Cache error register: 0x48000018
A 000 001c01: A Fatal: +          Primary dcache error: pidx 0x18  Way 1 addr 0x4f5cc018
A 000 001c01: A Fatal: +          27:  D, Uncorrectable data array error, way 1
A 000 001c01: A Fatal: End Hardware Error State
A 000 001c01: A Fatal: ++FRU ANALYSIS BEGIN
A 000 001c01: A Fatal: Guess 0: 001c01/IP35/PIMM0      Likelyhood: 90  PIMM/CPU dcache error
A 000 001c01: A Fatal:
A 000 001c01: Timeout Histogram is empty.
A 000 001c01:
A 000 001c01: A Fatal: ++FRU ANALYSIS END
A 000 001c01: A Fatal: PANIC: Cache Error (unrecoverable - dcache data) Eframe = 0x838
A 000 001c01: A Fatal:
A 000 001c01: Dumping to /hw/module/001c01/Ibrick/xtalk/15/pci/1/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x2000 pages
A 000 001c01: A Info : System dump startedA Fatal: Dumping low memory...A Fatal:
A 000 001c01: A Fatal:  umping static kernel pages...--More--
A 000 001c01: A Fatal:  umping pfdat pages...--More--
A 000 001c01: A Fatal:  umping backtrace pages...--More--
A 000 001c01: A Fatal:  umping dynamic kernel pages...--More--
A 000 001c01: A Fatal: Dumping buffer pages...A Fatal:
A 000 001c01: A Fatal: Dumping remaining in-use pages...A Fatal:
A 000 001c01: A Fatal: Dumping free pages...A Fatal:
A 000 001c01: Available dump space depleted.
A 000 001c01: A Fatal:
A 000 001c01: A Fatal:
A 000 001c01: A Fatal: Restarting the machine...
A 000 001c01: A Info : Kernel log synchronized.
A 000 001c01: A Info : *** Local network link down
A 000 001c01: A Fatal: Serial #: 08006910413E
A 000 001c01: A Fatal: HARDWARE ERROR STATE:
A 000 001c01: A Fatal: +  Errors on node Nasid 0x0 (0)
A 000 001c01: A Fatal: +    IP35 in /hw/module/001c01/node [serial number MNW845]
A 000 001c01: A Fatal: +      BEDROCK signalled following errors.
A 000 001c01: A Fatal: +        BEDROCK PI 0 Error spool A:
A 000 001c01: A Fatal: +            *** 14 Access Errors to unpopulated memory skipped
A 000 001c01: A Fatal: End Hardware Error State
A 000 001c01: A Fatal: ++FRU ANALYSIS BEGIN
A 000 001c01: A Fatal: No rules triggered:  Insufficient data
A 000 001c01: A Fatal:
A 000 001c01: Timeout Histogram is empty.
A 000 001c01:
A 000 001c01: A Fatal: ++FRU ANALYSIS END
A 000 001c01: A Fatal: PANIC: Cache Error (CER=0x64000600 Multiple cache errors in same unit detected) Eframe = 0x838
A 000 001c01: A Info : *** Local network link down
A 000 001c01: A Fatal: Restarting the machine...

Doing the following gets rid of the error and allows me to boot into the system:

Code:
001c01-L1>debug 0x10d (<-Setting DEBUG mode)
debug switches set to 0x010d
001c01-L1>pwr u
001c01-L1> (<-Ctrl + D)
entering console mode 001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01: POD SysCt Cac> go cac
A 000 001c01: POD SysCt Cac> clearalllogs
A 000 001c01: POD SysCt Cac> initalllogs
A 000 001c01: POD SysCt Cac> flush
A 000 001c01: POD SysCt Cac>
escaping to L1 system controller
001c01-L1>debug 0 (<- Clear DEBUG flag)
debug switches set to 0x0000
returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01: POD SysCt Cac> reset

But I can see that monitoring is enabled but not running now.

Code:
001a01-L1>env
Environmental monitoring is enabled, but not running (configuration error).

Description    State      Warning Limits    Fault Limits      Current
-------------- ----------  -----------------  -----------------  -------
          12V    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  11.94
        12V IO    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  12.00
            5V    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    5.07
          3.3V    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.34
          2.5V    Enabled  10%  2.25/  2.75  20%  2.00/  3.00    2.47
          1.5V    Enabled  10%  1.35/  1.65  20%  1.20/  1.80    1.47
        5V aux    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    4.99
      3.3V aux    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.30
PIMM0 12V bias    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40    3.30
  Asterix SRAM    Enabled  10%  2.25/  2.75  20%  2.00/  3.00    3.30
  Asterix CPU    Enabled  10%  1.44/  1.76  20%  1.28/  1.92    3.30
    PIMM0 1.5V    Enabled  10%  1.35/  1.65  20%  1.20/  1.80    3.30
PIMM0 3.3V aux    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.30
  PIMM0 5V aux    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    3.30
  XIO 12V bias    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  11.81
        XIO 5V    Enabled  10%  4.50/  5.50  20%  4.00/  6.00    5.04
      XIO 2.5V    Enabled  10%  2.25/  2.75  20%  2.00/  3.00    2.47
  XIO 3.3V aux    Enabled  10%  2.97/  3.63  20%  2.64/  3.96    3.30

Description    State      Warning RPM  Current RPM
-------------- ----------  -----------  -----------
FAN 0  EXHAUST    Enabled          920        1219
FAN 1      HD    Enabled        1560        2244
FAN 2      PCI    Enabled        1120        1573
FAN 3    XIO 1    Enabled        1600        2200
FAN 4    XIO 2    Enabled        1600        4611
FAN 5      PS    Enabled        1600        2259

                          Advisory  Critical  Fault      Current
Description    State      Temp      Temp      Temp      Temp
-------------- ----------  ---------  ---------  ---------  ---------
NODE 0            Enabled  60C/140F  65C/149F  70C/158F  40C/104F
NODE 1            Enabled  60C/140F  65C/149F  70C/158F  37C/ 98F
NODE 2            Enabled  60C/140F  65C/149F  70C/158F  31C/ 87F
PIMM              Enabled  60C/140F  65C/149F  70C/158F    0C/ 32F
ODYSSEY          Enabled  60C/140F  65C/149F  70C/158F  40C/104F
BEDROCK          Enabled  60C/140F  65C/149F  70C/158F  43C/109F

However, I get no more crashes. So the monitoring chip on the PIMM is probably dead but the module itself is still fine. I'll keep it running for a while to see. With the cache errors present I feel like the PIMM will eventually die; fingers crossed it's fine. Thoughts? The temperatures otherwise are fine. It's just the PIMM itself not reporting, likely because the DS1780 is dead.
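Those rails all pinned at 3.30 are easy to flag mechanically, too. A small sketch that parses voltage rows out of a pasted env dump and reports anything outside the warning window; the column layout is assumed from the paste above, and real L1 output spacing may differ:

```python
import re

def check_env_voltages(env_text: str):
    """Parse L1 'env'-style voltage rows and flag readings outside the
    warning window. Returns (name, current, warn_low, warn_high) tuples."""
    row = re.compile(
        r'^\s*(?P<name>.+?)\s+Enabled\s+\d+%\s+'
        r'(?P<wlo>[\d.]+)/\s*(?P<whi>[\d.]+)\s+\d+%\s+'
        r'[\d.]+/\s*[\d.]+\s+(?P<cur>[\d.]+)\s*$')
    bad = []
    for line in env_text.splitlines():
        m = row.match(line)
        if not m:
            continue  # skip headers, fan rows, temperature rows
        lo, hi, cur = (float(m[g]) for g in ('wlo', 'whi', 'cur'))
        if not lo <= cur <= hi:
            bad.append((m['name'].strip(), cur, lo, hi))
    return bad

# Two rows lifted from the env output above: one healthy, one pinned at 3.30.
sample = """\
          12V    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40  11.94
PIMM0 12V bias    Enabled  10%  10.80/ 13.20  20%  9.60/ 14.40    3.30
"""
print(check_env_voltages(sample))  # -> [('PIMM0 12V bias', 3.3, 10.8, 13.2)]
```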

I did reach out to someone who still sells certain parts and am ready to replace it if need be. But I wanted to be 100% sure before I spend $ on something that I generally don't use every day.

[Image: dolph.jpg]

O2 Fuel
(This post was last modified: 09-05-2024, 10:00 PM by s0ke.)
s0ke
O2

Trade Count: (1)
Posts: 29
Threads: 6
Joined: Jul 2020
09-05-2024, 09:28 PM

