The start of a LONG Fuel repair thread...
#41
RE: The start of a LONG Fuel repair thread...
(11-07-2021, 08:35 PM)vvuk Wrote:  With my Fuel _with_ a V10 installed, when I look at env before booting, I see:
Code:
env
Environmental monitoring is enabled and running.

... (omitted power) ...

Description    State       Warning RPM  Current RPM
-------------- ----------  -----------  -----------
FAN 0  EXHAUST   Wait Pwr          920            0
FAN 1       HD   Wait Pwr         1560            0
FAN 2      PCI   Wait Pwr         1120            0
FAN 3    XIO 1   Wait Pwr         1600            0
FAN 4    XIO 2   Wait Pwr         1600            0
FAN 5       PS   Wait Pwr         1600            0

                           Advisory   Critical   Fault      Current
Description    State       Temp       Temp       Temp       Temp
-------------- ----------  ---------  ---------  ---------  ---------
NODE 0           Wait Pwr   60C/140F   65C/149F   70C/158F   21C/ 69F
NODE 1           Wait Pwr   60C/140F   65C/149F   70C/158F   20C/ 68F
NODE 2           Wait Pwr   60C/140F   65C/149F   70C/158F   21C/ 69F
PIMM             Wait Pwr   60C/140F   65C/149F   70C/158F   21C/ 69F
ODYSSEY          Wait Pwr   60C/140F   65C/149F   70C/158F   21C/ 69F
BEDROCK          Wait Pwr  Not currently available

Note the "Not currently available" on BEDROCK.  So that's normal.  My guess is the monitoring chip for BEDROCK just doesn't have power applied until the system is fully running.

I can pull my V10 temporarily if you'd like, but I'm not sure what it would tell you; the system would boot fine, like it did before.  I guarantee this system ran fine without a graphics board, and I can't imagine why SGI would want to limit the possibility to do that -- it would be a valuable option in case of graphics card damage when data needs to be recovered or similar.  There's no reason to explicitly forbid it.  I'm hesitant to upgrade my firmware because, well, things are working fine.

However I don't see what in your logs actually prevents things from booting; I see "ERROR: 001a01 auto power up error." but nothing beyond that.  I see from an earlier log something like "system failed pre-power check"; is that the current issue?  Are you seeing anything from the PROM after you start booting?  Issue a "pwr up" and then when the L1 prompt comes back, hit Control-D.

If that wasn't stated in my posting, let me see if I can describe it better.  To be 100% clear, since I cannot power up...I can't get verbose debug info...you have to "power up" for that :-)  So we cannot see anything because there's nothing to see (we're at too early a stage of "power on" to have any real info as a consumer).

When my card is inserted (after replacing the DS1780 and blowing open the 12V XIO short on the V10), I get dual white LED flashing on the front.  The entire ENV system claims to be running and ready...the BEDROCK temps DO APPEAR (they don't say unavailable anymore!)...the auto power-on proceeds to the next stage where my fans are running and the RPM is correct (my fans don't run now with the card removed...as I cannot go on to the next power stage).

At this point I have an L1 that loves me, ENV filled, temps filled, fan RPMs filled & running, ready for pwr up....so I type pwr up.


Nothing happens...a few seconds later I get that "pre-power check failed" and when I say "pwr down", the L1 responds with "aren't I already powered down?" (or something like that, it's in the logs).  I cannot get to the stage where I can attempt power on without the V10 installed (as I showed), as the system then will NOT go to the next stage (fans and ready to power on) and will complain of I2C missing.  So I HAVE to have a working graphics card to get to the stage I get to now...just ready for power up to PROM.

Now the original errors in my logs (posted earlier) were "Error reading monitor ODYSSEY interrupt status 1: no acknowledge" and sensor alerts.  Upon replacement of the V10 DS1780 I only got ALERT XIO 12V Bias low ~2.68v...which is how I found out I had a short on the card.  I accidentally blew the short open with a 1V DC 1.5A power injection while trying to locate it with my thermal imager, with the V10 outside the system on my bench.  So something went from dragging down the 12V XIO line (and originally taking the DS1780 with it) to now being open circuit...so it's disconnected inside the V10.


However, I no longer get the acknowledge error either...but I blew something open on the 12V XIO rail on the V10...so that rail isn't fully connected (I have yet to figure out where).

My working theory at this point is the original owner had the system die by the V10 shorting out, also killing the DS1780 (the thing that monitors voltages for the XIO).  With the DS1780 now working the env is happy but alerting to the original short (now that the DS1780 is there to MONITOR the XIO 12V line :-)).
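
For anyone following along, here is a minimal sketch of how a reading like that ~2.68V compares against the limits the L1 applies to a 12V rail. It is only an illustration: the function names are made up here, and it assumes the 10% and 20% limits that env prints are simple symmetric windows around the nominal voltage, which matches the 12V rows in the env output (10.80/13.20 and 9.60/14.40).

```python
def window(nominal, pct):
    """Return the (low, high) limits pct percent either side of nominal."""
    delta = nominal * pct / 100.0
    return round(nominal - delta, 2), round(nominal + delta, 2)


def classify(nominal, reading):
    """Classify a rail reading as 'ok', 'warning' or 'fault'."""
    warn_lo, warn_hi = window(nominal, 10)    # 10% warning limits
    fault_lo, fault_hi = window(nominal, 20)  # 20% fault limits
    if reading < fault_lo or reading > fault_hi:
        return "fault"
    if reading < warn_lo or reading > warn_hi:
        return "warning"
    return "ok"


print(window(12.0, 10), window(12.0, 20))
print(classify(12.0, 2.68))    # the XIO 12V bias reading with the shorted V10
print(classify(12.0, 11.88))   # a healthy XIO 12V bias reading
```

Under that assumption, 2.68V is far below the 9.60V low fault limit, so the alert is exactly what a rail being dragged down by a short would produce.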

So that short got blown open by me while I was trying to find it (my bad)...so my assumption (that's the biggie) at this time is that the part that shorted likely switches the card's 12V line on/off or is otherwise involved in signaling...and it's now disconnected inside the V10.

So the mainboard is ready for power-on...sends the power-on signal to the card...nothing happens...I believe (for lack of other evidence and damage outside the V10) this is why I get this pre-power check error: the V10 cannot be started because part of its 12V rail has become disconnected internally...so here I sit.

Since I'm 100% certain the original failure was the V10 in this case (as I found real damage) I believe it's isolated to the V10.

You're suggesting that there is in fact more damage and it's the mainboard as well as the V10 (because I already have proof of V10 damage)...well, we both have zero way of knowing that until I try another graphics card.  From what I can tell right now, when I complete the env I2C bus the L1 is happy and ready to go...the dead card is holding me back.  The originally shorted 12V part could have been a simple cap (not likely, I didn't get that lucky); more than likely it was a semiconductor and/or something important to the startup/operation signalling of the V10.

I do not see another way to break the stalemate other than by trying another card.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-07-2021, 10:43 PM
#42
RE: The start of a LONG Fuel repair thread...
(11-07-2021, 10:43 PM)weblacky Wrote:  Now the original errors in my logs (posted earlier) were "Error reading monitor ODYSSEY interrupt status 1: no acknowledge" and sensor alerts.  Upon replacement of the V10 DS1780 I only got ALERT XIO 12V Bias low ~2.68v...which is how I found out I had a short on the card.  I accidentally blew the short open with a 1V DC 1.5A power injection while trying to locate it with my thermal imager, with the V10 outside the system on my bench.  So something went from dragging down the 12V XIO line (and originally taking the DS1780 with it) to now being open circuit...so it's disconnected inside the V10.

Ah! This is the piece that I was missing. And makes sense.

Now that I understand the L1 flashing process a bit more -- and crucially, that I should be able to "temporarily" boot a new version -- I'm happy to upgrade to the same L1 version you have and let you know what happens if I try to boot without the V10 installed. Let me know if that would be useful and I can try to find time.
vvuk
O2

Trade Count: (0)
Posts: 43
Threads: 4
Joined: Aug 2021
Location: California
11-08-2021, 08:51 PM
#43
RE: The start of a LONG Fuel repair thread...
I honestly won’t ask you to do anything on your system. My system is currently an unknown (I won’t know if it’s acting normal until I put the replacement V10 in and things appear normal) and there is no way to flash backwards (I thought) due to possible mainboard revisions (to prevent flashing a later mainboard with earlier firmware). So even if I wanted to be sure you’d be stopped from running headless on new firmware, it wouldn’t do anyone much good unless they discovered a first-gen Fuel that seemingly never got an L1 update due to the lack of proper IRIX upgrades after purchase.

So it’d be interesting knowledge but not useable by most systems (if true).

The tracking number I got from Mashek claims the V10 shipment was handed over to USPS and is expected to arrive on Friday.

We’ll see over the weekend if all this basic system stuff gets resolved or not around that time, I’d imagine.

If it’s isolated to the V10, the Fuel should roar to life.
weblacky
I play an SGI Doctor, on daytime TV.

11-09-2021, 12:34 AM
#44
RE: The start of a LONG Fuel repair thread...
OH ...boy...this is bad.  I got the new card...SAME THING..SAME!!!!!  URGGGHHHHH

Okay, so I cobbled together a multi-headed hydra of serial adapters into an old laptop because I couldn't make console mode work on the Fuel before (switching serial ports did NOTHING).

I got two terminal windows running, one on the internal header, one on serial #1 at 9600 instead of the higher speed, and I was able to enter CAC and see stuff...I did an error_dump...OH MAN...I hope someone knows what the heck this is because it LOOKS LIKE A SHOW STOPPER.  This would mean the system had BOTH real V10 damage and real mainboard/PIMM damage!?  Wow, my luck is NOT going well on this Fuel.

Okay L1 log (changed to flash storage B just to see if that made a diff):
Code:
SGI SN1 L1 Controller

Firmware Image B: Rev. 1.28.3, Built 03/20/2004 00:01:57

001a01-L1>env

Environmental monitoring is enabled and running.



Description    State       Warning Limits     Fault Limits       Current

-------------- ----------  -----------------  -----------------  -------

           12V   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.25

        12V IO   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.25

            5V   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    0.08

          3.3V   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    0.62

          2.5V   Wait Pwr  10%   2.25/  2.75  20%   2.00/  3.00    0.00

          1.5V   Wait Pwr  10%   1.35/  1.65  20%   1.20/  1.80    0.00

        5V aux   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    5.04

      3.3V aux   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    3.29

PIMM0 12V bias   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.25

     Fuel SRAM   Wait Pwr  10%   2.25/  2.75  20%   2.00/  3.00    0.05

      Fuel CPU   Wait Pwr  10%   1.13/  1.38  20%   1.00/  1.50    0.01

    PIMM0 1.5V   Wait Pwr  10%   1.35/  1.65  20%   1.20/  1.80    0.04

PIMM0 3.3V aux   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    3.27

  PIMM0 5V aux   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    5.02

  XIO 12V bias   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.25

        XIO 5V   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    0.08

      XIO 2.5V   Wait Pwr  10%   2.25/  2.75  20%   2.00/  3.00    0.00

  XIO 3.3V aux   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    3.30



Description    State       Warning RPM  Current RPM

-------------- ----------  -----------  -----------

FAN 0  EXHAUST   Wait Pwr          920            0

FAN 1       HD   Wait Pwr         1560            0

FAN 2      PCI   Wait Pwr         1120            0

FAN 3    XIO 1   Wait Pwr         1600            0

FAN 4    XIO 2   Wait Pwr         1600            0

FAN 5       PS   Wait Pwr         1349            0



                              Advisory   Critical   Fault      Current

Description       State       Temp       Temp       Temp       Temp       

----------------- ----------  ---------  ---------  ---------  --------- 

0 NODE 0           Wait Pwr    [Autofan Control]    75C/167F   26C/ 78F

1 NODE 1           Wait Pwr    [Autofan Control]    75C/167F   25C/ 77F

2 NODE 2           Wait Pwr    [Autofan Control]    75C/167F   20C/ 68F

3 PIMM             Wait Pwr    [Autofan Control]    75C/167F   29C/ 84F

4 ODYSSEY          Wait Pwr    [Autofan Control]    75C/167F   23C/ 73F

5 BEDROCK          Wait Pwr  Not currently available



001a01-L1>INFO: 001a01 will power up system in  5 seconds...

INFO: 001a01 powering up the system.



001a01-L1>env

Environmental monitoring is enabled and running.



Description    State       Warning Limits     Fault Limits       Current

-------------- ----------  -----------------  -----------------  -------

           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.00

        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06

            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07

          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.35

          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.47

          1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.47

        5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.02

      3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.29

PIMM0 12V bias    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.00

     Fuel SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.51

      Fuel CPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.25

    PIMM0 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49

PIMM0 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.29

  PIMM0 5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.02

  XIO 12V bias    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.00

        XIO 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07

      XIO 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.48

  XIO 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30



Description    State       Warning RPM  Current RPM

-------------- ----------  -----------  -----------

FAN 0  EXHAUST    Enabled          920         1196

FAN 1       HD    Enabled         1560         2205

FAN 2      PCI    Enabled         1120         1534

FAN 3    XIO 1    Enabled         1600         2191

FAN 4    XIO 2    Enabled         1600         2057

FAN 5       PS    Enabled         1349         2122



                              Advisory   Critical   Fault      Current

Description       State       Temp       Temp       Temp       Temp       

----------------- ----------  ---------  ---------  ---------  --------- 

0 NODE 0            Enabled    [Autofan Control]    75C/167F   25C/ 77F

1 NODE 1            Enabled    [Autofan Control]    75C/167F   25C/ 77F

2 NODE 2            Enabled    [Autofan Control]    75C/167F   20C/ 68F

3 PIMM              Enabled    [Autofan Control]    75C/167F   30C/ 86F

4 ODYSSEY           Enabled    [Autofan Control]    75C/167F   23C/ 73F

5 BEDROCK           Enabled    [Autofan Control]    85C/185F   27C/ 80F


001a01-L1>log

11/12/21 15:15:14 power up (COMMAND)

11/12/21 15:15:21 reset again MIPS

11/12/21 15:17:40 SMP unregistering events

11/12/21 15:17:40 UNREG: 300064c4 0 4

11/12/21 15:17:41 SMP-R: UART:UART_NO_CONNECTION

11/12/21 15:18:30 3.3V low fault limit reached  1.926V.

11/12/21 15:19:02 L1 booting 1.28.3

11/12/21 15:19:02 PSC: 0x09

11/12/21 15:19:02 USB0: waiting on open

11/12/21 15:19:02 auto power up countdown initiated

11/12/21 15:19:08 auto power up initiated

11/12/21 15:19:08 power up (COMMAND)

11/12/21 15:19:14 reset again MIPS

11/12/21 15:19:20 power down (COMMAND)

11/12/21 15:19:20 power down (COMMAND)

11/12/21 15:20:36 SMP unregistering events

11/12/21 15:20:36 UNREG: 300064c4 0 4

11/12/21 15:20:37 SMP-R: UART:UART_NO_CONNECTION

11/12/21 15:21:11 L1 booting 1.28.3

11/12/21 15:21:11 PSC: 0x09

11/12/21 15:21:11 USB0: waiting on open

11/12/21 15:21:11 auto power up countdown initiated

11/12/21 15:21:16 auto power up initiated

11/12/21 15:21:16 power up (COMMAND)

11/12/21 15:21:23 reset again MIPS

11/12/21 15:22:35 PIMM0 1.5V low fault limit reached  1.142V.

11/12/21 15:23:06 L1 booting 1.28.3

11/12/21 15:23:07 PSC: 0x09

11/12/21 15:23:07 USB0: waiting on open

11/12/21 15:23:07 auto power up countdown initiated

11/12/21 15:23:12 auto power up initiated

11/12/21 15:23:12 power up (COMMAND)

11/12/21 15:23:18 reset again MIPS

11/12/21 15:23:29 12V IO low fault limit reached  9.500V.

11/12/21 15:23:35 L1 booting 1.28.3

11/12/21 15:23:35 PSC: 0x09

11/12/21 15:23:35 USB0: waiting on open

11/12/21 15:23:35 auto power up countdown initiated

11/12/21 15:23:40 auto power up initiated

11/12/21 15:23:40 power up (COMMAND)

11/12/21 15:23:47 reset again MIPS

11/12/21 15:23:58 1.5V low fault limit reached  1.199V.

11/12/21 15:24:34 L1 booting 1.28.3

11/12/21 15:24:35 PSC: 0x09

11/12/21 15:24:35 USB0: waiting on open

11/12/21 15:24:35 auto power up countdown initiated

11/12/21 15:24:40 auto power up initiated

11/12/21 15:24:40 power up (COMMAND)

11/12/21 15:24:47 reset again MIPS

001a01-L1>env

Environmental monitoring is enabled and running.



Description    State       Warning Limits     Fault Limits       Current

-------------- ----------  -----------------  -----------------  -------

           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.94

        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.00

            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07

          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.35

          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.47

          1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.47

        5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.02

      3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.29

PIMM0 12V bias    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.94

     Fuel SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.51

      Fuel CPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.25

    PIMM0 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49

PIMM0 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.29

  PIMM0 5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.02

  XIO 12V bias    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.88

        XIO 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07

      XIO 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.47

  XIO 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30



Description    State       Warning RPM  Current RPM

-------------- ----------  -----------  -----------

FAN 0  EXHAUST    Enabled          920         1196

FAN 1       HD    Enabled         1560         2205

FAN 2      PCI    Enabled         1120         1534

FAN 3    XIO 1    Enabled         1600         2191

FAN 4    XIO 2    Enabled         1600         2057

FAN 5       PS    Enabled         1349         2122



                              Advisory   Critical   Fault      Current

Description       State       Temp       Temp       Temp       Temp       

----------------- ----------  ---------  ---------  ---------  --------- 

0 NODE 0            Enabled    [Autofan Control]    75C/167F   27C/ 80F

1 NODE 1            Enabled    [Autofan Control]    75C/167F   25C/ 77F

2 NODE 2            Enabled    [Autofan Control]    75C/167F   20C/ 68F

3 PIMM              Enabled    [Autofan Control]    75C/167F   31C/ 87F

4 ODYSSEY           Enabled    [Autofan Control]    75C/167F   22C/ 71F

5 BEDROCK           Enabled    [Autofan Control]    85C/185F   29C/ 84F



001a01-L1>debug 0x010d

debug switches set to 0x010d

001a01-L1>cac

ERROR: command not found.

001a01-L1> 
entering console mode  001a01 CPU0, <CTRL_T> to escape to L1

The low voltage errors in the log are odd; they don't show in the env...I think that's because they are momentary, based on start-up load.  But odd anyway.
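
One way to test that guess is to look at the timing: how long after the last "power up" each low-fault entry was logged, and how soon the L1 rebooted afterwards. Below is a rough Python sketch using a few lines copied from the log above. The function names are made up for illustration, and the timestamps are assumed to be month/day/year.

```python
import re
from datetime import datetime

# A few entries copied from the L1 log above.
LOG = """\
11/12/21 15:15:14 power up (COMMAND)
11/12/21 15:15:21 reset again MIPS
11/12/21 15:18:30 3.3V low fault limit reached  1.926V.
11/12/21 15:19:02 L1 booting 1.28.3
11/12/21 15:23:12 power up (COMMAND)
11/12/21 15:23:29 12V IO low fault limit reached  9.500V.
11/12/21 15:23:35 L1 booting 1.28.3
"""

LINE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")
FAULT = re.compile(r"^(.+?) low fault limit reached\s+([\d.]+)V\.$")


def parse(log):
    """Turn the log text into a list of (timestamp, message) tuples."""
    entries = []
    for raw in log.splitlines():
        m = LINE.match(raw.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
            entries.append((ts, m.group(2)))
    return entries


def low_faults(log):
    """For each low-fault entry return (rail, volts, seconds since the last
    'power up', seconds until the next 'L1 booting')."""
    entries = parse(log)
    results = []
    last_power_up = None
    for i, (ts, msg) in enumerate(entries):
        if msg.startswith("power up"):
            last_power_up = ts
        m = FAULT.match(msg)
        if m:
            next_boot = next(
                (t for t, s in entries[i + 1:] if s.startswith("L1 booting")),
                None,
            )
            since = (ts - last_power_up).total_seconds() if last_power_up else None
            until = (next_boot - ts).total_seconds() if next_boot else None
            results.append((m.group(1), float(m.group(2)), since, until))
    return results


for rail, volts, since, until in low_faults(LOG):
    print(f"{rail}: {volts}V, {since}s after power up, {until}s before L1 reboot")
```

On these sample lines the 3.3V entry comes 196 seconds after the power up and the 12V IO entry 17 seconds after, and each is followed by an L1 reboot within about half a minute. Running the same thing over the full log would show whether the faults cluster right at start-up or closer to the moments power was removed.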

Okay, so the BIG guy...CAC.  I cleared all the logs and reset everything...the error_dump remained the same and it's pretty core...I really don't know how I'd look into this.  FYI, I see the memory layout error and I have corrected that; I don't want to upload a new log because it didn't change anything.  Also, I F*D with the debug mode and restarted a bunch of times, so the output will be odd:
Code:
IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
  built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc02ad4 (0xc00000001fc02ad4)
Configuring memory
*** Memory sizing failure:
*** Bank 0 (256 MB) and bank 1 (512 MB) sizes differ,
*** treating them both as 256 MB
Local memory configured: 1024 MB (standard)
*** Warning: System controller debug switches are non-zero (0x10d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ......................        Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30357 usec
Waiting for peers to complete discovery....        Discovery results:
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition:  +0
Regions in partition:  +0
Intializing any CPUless nodes..............        Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: FE
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition:  +0
Regions in partition:  +0

A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> go cac
Must be in Dex mode before switching to Cac or Unc.
A 000: POD IOC3 Cac> clearalllogs
*** This must be run only after NUMAlink discovery is complete.
*** This will clear all previous log variables such as:
*** moduleids, nodeids, etc. for all nodes.
Clear all logs? [n] n
Aborted
A 000: POD IOC3 Cac> y clearalllogs
*** This must be run only after NUMAlink discovery is complete.
*** This will clear all previous log variables such as:
*** moduleids, nodeids, etc. for all nodes.
Clear all logs? [n] y
Checking 1 entries for promlogs
.DONE
All PROM logs cleared!
A 000: POD IOC3 Cac> reset
Resetting the system...


IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
  built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc6b5e8 (0xc00000001fc6b5e8)
Configuring memory
*** Memory sizing failure:
*** Bank 0 (256 MB) and bank 1 (512 MB) sizes differ,
*** treating them both as 256 MB
Local memory configured: 1024 MB (standard)
*** Warning: System controller debug switches are non-zero (0x10d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ......................        Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30358 usec
Waiting for peers to complete discovery....        Discovery results:
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition:  +0
Regions in partition:  +0
Intializing any CPUless nodes..............        Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: FE
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition:  +0
Regions in partition:  +0

A 000: *** Entering POD mode on node 0       
A 000: POD IOC3 Cac> error_dump
Hardware Error State: (Forced error dump)
+  Errors on node Nasid 0x0 (0)
+    XBow in /hw/module/174562
+      BEDROCK signalled following errors.
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa018000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac> reset all
This will reset ALL partitions, OK? [n] y

Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30360 usec
Resetting the system...


IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
  built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc6b5e8 (0xc00000001fc6b5e8)
Configuring memory
*** Memory sizing failure:
*** Bank 0 (256 MB) and bank 1 (512 MB) sizes differ,
*** treating them both as 256 MB
Local memory configured: 1024 MB (standard)
*** Warning: System controller debug switches are non-zero (0x4d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
*** Bypassing first IO7
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ......................        Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30359 usec
Waiting for peers to complete discovery....        Discovery results:
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition:  +0
Regions in partition:  +0
Intializing any CPUless nodes..............        Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
*** Nasid 0: Memory bank 2 was previously Present & Enabled but is now Present & Disabled
*** Nasid 0: Memory bank 2 previously had 256 MB but now has 512 MB
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: FE
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition:  +0
Regions in partition:  +0

A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> error_dump
Hardware Error State: (Forced error dump)
+  Errors on node Nasid 0x0 (0)
+    XBow in /hw/module/174562
+      BEDROCK signalled following errors.
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa008000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac> error
A 000: POD IOC3 Cac>



Soooooo...is this my error?
Code:
  Errors on node Nasid 0x0 (0)
    XBow in /hw/module/174562
  BEDROCK signalled following errors.
  XBow Link a status register: 0xffffffff80020000
  17: Illegal destination
  XBow error command word register: 0xffffffffaa008000
  XBow error upper address register: 0x0
  XBow error lower address register: 0x0
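
For what it's worth, the status value can be picked apart by hand. A minimal sketch in Python, assuming the register is really 32 bits wide and the leading ffffffff is only sign extension in the PROM's print-out (that is my assumption, the dump doesn't say so); `set_bits` is a made-up helper name:

```python
def set_bits(value, width=32):
    """Return the positions of the bits that are set in the low `width` bits."""
    low = value & ((1 << width) - 1)
    return [bit for bit in range(width) if low & (1 << bit)]


status = 0xffffffff80020000  # "XBow Link a status register" from the dump
print(set_bits(status))      # bit 17 is the one the PROM labels "Illegal destination"
```

Bit 17 lines up with the "17: Illegal destination" line in the dump. What the other set bit means I can't say from the log alone.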



Also, while using "leds" on the L1 during all this, there was one instance where it said something like "invalid icache", but I didn't get it in a capture.

So, worst fears realized...damage to V10, damage to mainboard?  Possibly the PIMM...I could still stick in the low voltage PIMM I got from noguri as a test PIMM?

Can anyone shed light on this error?  I mean from a "modular" perspective...it's NOT the V10 holding me back (right now). So either mainboard or PIMM.  I have a sort-of-working PIMM I can test with...do you think that would yield any positive results?  I'm asking only due to the power log errors...as one was 12V IO but the other WAS PIMM 1.5V voltage.  Maybe the PIMM is shorted and during startup it's connected, then it gets disconnected due to not being needed yet and 1.5V returns to normal before any real OS boot?

Grasping at straws...but there's nothing else to grasp right now...
(This post was last modified: 11-12-2021, 10:52 PM by weblacky.)
weblacky
I play an SGI Doctor, on daytime TV.

11-12-2021, 10:50 PM
#45
RE: The start of a LONG Fuel repair thread...
Hi Weblacky,

really long post....

I have to take the kids to the museum, as looking at my old SGI boxes is not as exciting as DINOSAURS!! ;-)

I will review on return.

Cheers from Oz,

jwhat/John.
(This post was last modified: 11-12-2021, 11:05 PM by jwhat.)
jwhat
Octane/O350/Fuel User

11-12-2021, 11:04 PM
#46
RE: The start of a LONG Fuel repair thread...
(11-12-2021, 11:04 PM)jwhat Wrote:  Hi Weblacky,

really long post....

I have to take the kids to the museum, as looking at my old SGI boxes is not as exciting as DINOSAURS!! ;-)

I will review on return.

Cheers from Oz,

jwhat/John.

Thanks, it's as long as it is demoralizing :-)  Also, you could compromise and take the kids to see SGIs at a computer museum?

Also, assuming this HW graph node numbering ISN'T random, can someone use something like this to find out WHAT HW this number is: https://nixdoc.net/man-pages/IRIX/man1/xbstat.1.html

Okay...sh*t storm 2: the legend continues....

Just for yucks I tried the PIMM noguri gave me that's supposed to have a voltage error...well, it does still have the voltage error: it causes/reports the 5V PIMM AUX voltage as 6.22V and enters emergency shutdown.  The DS1780 on it isn't shorted, so I'm assuming the reading is true.  It also then rained cache errors!

Errors with the second PIMM, the one that has the 5V voltage issue (testing whether anything is different):
Code:
IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
Running in DDR mode
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        DONE



SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Full March: DATA
   Failure      : ECC Miscompare
   Address      : 0xa8000000000037f0 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 70  5555555555555555 0000000000000000  155
   Received     : 70  5555557555555555 0000000000000000  155
   Syndrome     : 70  0000002000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Full March: DATA
   Failure      : ECC Miscompare
   Address      : 0xa800000000003ab0 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 30  cccccccccccccccc 0000000000000000  0cc
   Received     : 30  ccccdceccccccccc 0000000000000000  0cc
   Syndrome     : 30  0000102000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]
     SCData<108>
       Asterix R14K   CPU  C3D1 [Pin AC27]   SRAM C8E6  [Pin K2]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Full March: DATA
   Failure      : ECC Miscompare
   Address      : 0xa800000000003790 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 10  0f0f0f0f0f0f0f0f 0000000000000000  30f
   Received     : 10  0f0f0f0f4f0f0f0f 0000000000000000  30f
   Syndrome     : 10  0000000040000000 0000000000000000  000

   Failing Bits
     SCData<94>
       Asterix R14K   CPU  C3D1 [Pin  AK1]   SRAM C8B7  [Pin D1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Base Address: DATA
   Failure      : Brother Double Word Not Zero
   Address      : 0x0000000000000008 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  0000000000000000 5555555555555555  155
   Received     : 00  0000000040000000 5555555555555555  155
   Syndrome     : 00  0000000040000000 0000000000000000  000

   Failing Bits
     SCData<94>
       Asterix R14K   CPU  C3D1 [Pin  AK1]   SRAM C8B7  [Pin D1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Base Address: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000000000 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  aaaaaaaaaaaaaaaa 0000000000000000  2aa
   Received     : 00  eaaaaaaaeaaaaaaa 0000000000000000  2aa
   Syndrome     : 00  4000000040000000 0000000000000000  000

   Failing Bits
     SCData<94>
       Asterix R14K   CPU  C3D1 [Pin  AK1]   SRAM C8B7  [Pin D1]
     SCData<126>
       Asterix R14K   CPU  C3D1 [Pin AK30]   SRAM C8E6  [Pin D1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Walking Address: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000002000 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  aaaaaaaaaaaaaaaa 0000000000000000  2aa
   Received     : 00  aaaaaaaaeaaaaaaa 0000000000000000  2aa
   Syndrome     : 00  0000000040000000 0000000000000000  000

   Failing Bits
     SCData<94>
       Asterix R14K   CPU  C3D1 [Pin  AK1]   SRAM C8B7  [Pin D1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Walking Address: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000001000 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  aaaaaaaaaaaaaaaa 0000000000000000  2aa
   Received     : 00  eaaaaaaaeaaaaaaa 0000000000000000  2aa
   Syndrome     : 00  4000000040000000 0000000000000000  000

   Failing Bits
     SCData<94>
       Asterix R14K   CPU  C3D1 [Pin  AK1]   SRAM C8B7  [Pin D1]
     SCData<126>
       Asterix R14K   CPU  C3D1 [Pin AK30]   SRAM C8E6  [Pin D1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Base + Walk/Inv: DATA
   Failure      : Brother Double Word Not Zero
   Address      : 0x0000000000000009 (Way 1)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  0000000000000000 0000000000000000  000
   Received     : 00  0000000040000000 0000000000000000  000
   Syndrome     : 00  0000000040000000 0000000000000000  000

   Failing Bits
     SCData<94>
       Asterix R14K   CPU  C3D1 [Pin  AK1]   SRAM C8B7  [Pin D1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Short March: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000000100 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  5555555555555555 0000000000000000  155
   Received     : 00  555555d555555555 0000000000000000  155
   Syndrome     : 00  0000008000000000 0000000000000000  000

   Failing Bits
     SCData<103>
       Asterix R14K   CPU  C3D1 [Pin AD30]   SRAM C8E6  [Pin H3]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Full March: DATA
   Failure      : ECC Miscompare
   Address      : 0xa800000000003810 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 10  5555555555555555 0000000000000000  155
   Received     : 10  55555d7555555555 0000000000000000  155
   Syndrome     : 10  0000082000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]
     SCData<107>
       Asterix R14K   CPU  C3D1 [Pin AA27]   SRAM C8E6  [Pin H1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Walking Address: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000002000 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  5555555555555555 0000000000000000  155
   Received     : 00  5555557555555555 0000000000000000  155
   Syndrome     : 00  0000002000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Walking Address: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000001000 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  5555555555555555 0000000000000000  155
   Received     : 00  55555d7555555555 0000000000000000  155
   Syndrome     : 00  0000082000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]
     SCData<107>
       Asterix R14K   CPU  C3D1 [Pin AA27]   SRAM C8E6  [Pin H1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Walking Address: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000000800 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  5555555555555555 0000000000000000  155
   Received     : 00  55555df555555555 0000000000000000  155
   Syndrome     : 00  000008a000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]
     SCData<103>
       Asterix R14K   CPU  C3D1 [Pin AD30]   SRAM C8E6  [Pin H3]
     SCData<107>
       Asterix R14K   CPU  C3D1 [Pin AA27]   SRAM C8E6  [Pin H1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Walking Address: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000000400 (Way 0)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  5555555555555555 0000000000000000  155
   Received     : 00  55555d7555555555 0000000000000000  155
   Syndrome     : 00  0000082000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]
     SCData<107>
       Asterix R14K   CPU  C3D1 [Pin AA27]   SRAM C8E6  [Pin H1]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Base + Walk/Inv: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000000001 (Way 1)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  5555555555555555 0000000000000000  155
   Received     : 00  5555557555555555 0000000000000000  155
   Syndrome     : 00  0000002000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]





SECONDARY CACHE DATA FAILURE: Module 001c01 CPU A
   Subtest      : Short March: DATA
   Failure      : Data Miscompare
   Address      : 0x0000000000000001 (Way 1)

                  Off ---------     Data     ----------  ECC
   Expected     : 00  5555555555555555 0000000000000000  155
   Received     : 00  55555d7555555555 0000000000000000  155
   Syndrome     : 00  0000082000000000 0000000000000000  000

   Failing Bits
     SCData<101>
       Asterix R14K   CPU  C3D1 [Pin AG33]   SRAM C8E6  [Pin H8]
     SCDat
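
A note on reading those failure blocks: the "Syndrome" row looks like a plain XOR of "Expected" and "Received", and the SCData<n> numbers look like the bit positions of that XOR plus 64. A minimal sketch in Python to check that against the log; the `base=64` offset is inferred from the numbers above, not from any SGI documentation, and `failing_scdata_bits` is a made-up name:

```python
def failing_scdata_bits(expected_hex, received_hex, base=64):
    """XOR expected vs. received and return (syndrome, SCData bit numbers).

    base=64 is an inference: it is the offset that makes the first data
    column line up with the SCData<n> numbers printed in the log.
    """
    syndrome = int(expected_hex, 16) ^ int(received_hex, 16)
    bits = [base + b for b in range(64) if syndrome & (1 << b)]
    return syndrome, bits


# First failure in the log: expected 5555555555555555, received 5555557555555555
syndrome, bits = failing_scdata_bits("5555555555555555", "5555557555555555")
print(f"{syndrome:016x}", bits)
```

If the inference holds, every failure in this capture sits in the same handful of data lines (94, 101, 103, 107, 108, 126), which is at least consistent with a marginal supply rather than randomly scattered bad cells, though it proves nothing on its own.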
 
L1 log with the different PIMM that has the voltage issue:
Code:
SGI SN1 L1 Controller
Firmware Image B: Rev. 1.28.3, Built 03/20/2004 00:01:57

001a01-L1>
001a01-L1>
001a01 ATTN: brick auto power down in 25 seconds

001a01 ATTN: brick auto power down in 20 seconds

001a01-L1>env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.94
        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.00
            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.35
          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.47
          1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.47
        5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.02
      3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30
PIMM0 12V bias      Fault  10%  10.80/ 13.20  20%   9.60/ 14.40    9.31
     Fuel SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.52
      Fuel CPU    Enabled  10%   1.44/  1.76  20%   1.28/  1.92    1.61
    PIMM0 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49
PIMM0 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.29
  PIMM0 5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    6.63
  XIO 12V bias    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.88
        XIO 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
      XIO 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.47
  XIO 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30

Description    State       Warning RPM  Current RPM
-------------- ----------  -----------  -----------
FAN 0  EXHAUST    Enabled          920         1188
FAN 1       HD    Enabled         1560         2191
FAN 2      PCI    Enabled         1120         1548
FAN 3    XIO 1    Enabled         1600         2177
FAN 4    XIO 2    Enabled         1600         2045
FAN 5       PS    Enabled         1349         2109

                              Advisory   Critical   Fault      Current
Description       State       Temp       Temp       Temp       Temp
----------------- ----------  ---------  ---------  ---------  ---------
0 NODE 0            Enabled    [Autofan Control]    75C/167F   17C/ 62F
1 NODE 1            Enabled    [Autofan Control]    75C/167F   17C/ 62F
2 NODE 2            Enabled    [Autofan Control]    75C/167F   16C/ 60F
3 PIMM              Enabled    [Autofan Control]    75C/167F   18C/ 64F
4 ODYSSEY           Enabled    [Autofan Control]    75C/167F   18C/ 64F
5 BEDROCK           Enabled    [Autofan Control]    85C/185F   16C/ 60F

001a01-L1>
001a01 ATTN: brick auto power down in 15 seconds

001a01 ATTN: brick auto power down in 10 seconds

001a01-L1>env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.94
        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.00
            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.35
          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.47
          1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.47
        5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.02
      3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.29
PIMM0 12V bias      Fault  10%  10.80/ 13.20  20%   9.60/ 14.40    9.38
     Fuel SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.52
      Fuel CPU    Enabled  10%   1.44/  1.76  20%   1.28/  1.92    1.61
    PIMM0 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49
PIMM0 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.27
  PIMM0 5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    6.63
  XIO 12V bias    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.88
        XIO 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
      XIO 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.47
  XIO 3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30

Description    State       Warning RPM  Current RPM
-------------- ----------  -----------  -----------
FAN 0  EXHAUST    Enabled          920         1188
FAN 1       HD    Enabled         1560         2191
FAN 2      PCI    Enabled         1120         1534
FAN 3    XIO 1    Enabled         1600         2177
FAN 4    XIO 2    Enabled         1600         2045
FAN 5       PS    Enabled         1349         2109

                              Advisory   Critical   Fault      Current
Description       State       Temp       Temp       Temp       Temp
----------------- ----------  ---------  ---------  ---------  ---------
0 NODE 0            Enabled    [Autofan Control]    75C/167F   18C/ 64F
1 NODE 1            Enabled    [Autofan Control]    75C/167F   17C/ 62F
2 NODE 2            Enabled    [Autofan Control]    75C/167F   16C/ 60F
3 PIMM              Enabled    [Autofan Control]    75C/167F   19C/ 66F
4 ODYSSEY           Enabled    [Autofan Control]    75C/167F   18C/ 64F
5 BEDROCK           Enabled    [Autofan Control]    85C/185F   17C/ 62F

001a01-L1>
001a01 ATTN: brick auto power down in 5 seconds

001a01 ATTN: brick is powering down now!

So because of the secondary cache failure I didn't get a chance to issue an error_dump to see if the old error went away or not.  So still unknown...but I kind of doubt it. Also, I'm sometimes seeing shorts on this "broken" PIMM's caps, so there is something wrong...it's possible the cache error is due to voltage and not really bad cache...but who knows, and it's likely a dead end to prove anything right now.
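
To take the eyeballing out of the L1 "env" voltage table, here is a minimal sketch in Python that checks one row against its own printed limits. The regex only covers the "Enabled" and "Fault" row layout seen in the capture above, and `ROW` and `classify` are names I made up for this post, not anything from the L1 firmware:

```python
import re

# One voltage row from `env`:
# name, state, warn %, warn low/high, fault %, fault low/high, current reading.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<state>Enabled|Fault)\s+"
    r"\d+%\s+(?P<wlo>[\d.]+)/\s*(?P<whi>[\d.]+)\s+"
    r"\d+%\s+(?P<flo>[\d.]+)/\s*(?P<fhi>[\d.]+)\s+(?P<cur>[\d.]+)\s*$"
)


def classify(line):
    """Return (rail name, reading, verdict) or None if the line is not a voltage row."""
    m = ROW.match(line)
    if m is None:
        return None
    cur = float(m["cur"])
    if not float(m["flo"]) <= cur <= float(m["fhi"]):
        verdict = "fault"      # outside the 20% fault window
    elif not float(m["wlo"]) <= cur <= float(m["whi"]):
        verdict = "warning"    # inside fault window, outside the 10% warning window
    else:
        verdict = "ok"
    return m["name"], cur, verdict


for row in (
    "PIMM0 12V bias      Fault  10%  10.80/ 13.20  20%   9.60/ 14.40    9.31",
    "  PIMM0 5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    6.63",
    "    PIMM0 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49",
):
    print(classify(row))
```

Run over the rows above, it flags two PIMM rails as outside their fault limits (12V bias low, 5V aux high) even though only one of them is shown with a "Fault" state, while PIMM0 1.5V is in range with this PIMM.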

Also I sometimes get the SAME TLB error as this other post (happened only once, likely not related): https://forums.irixnet.org/thread-3194-p...l#pid22723

Console from when I PUT BACK the PIMM (700MHz) I got with it:

Code:
IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
Running in DDR mode
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        DONE
Discovering local IO ......................        DONE
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30360 usec
Waiting for peers to complete discovery....        DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
\\\\Intializing any CPUless nodes..............        \\DONE
\Checking partitioning information .........        DONE
No other nodes present; becoming partition master
\Loading BASEIO prom .......................        DONE

BASEIO PROM Monitor SGI Version 6.210  built 02:30:38 PM Aug 26, 2004 (BE64)
1 CPUs on 1 nodes found.

NVRAM checksum is incorrect: reinitializing.
Automatic update of PROM environment disabled

PS/2 Keyboard & Mouse diagnostics
    No keyboard found, skipping keyboard test
    No mouse found, skipping mouse test

    Missing PS/2 device(s) AND console set to "g"
PS/2 Keyboard & Mouse diagnostics passed with a possible problem

Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
        Board version 1 - Buzz revision 2B
        On board sdram size: 32 Mb
        Cas latency: CAS 3
        2 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............            
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0, (pci id 1)
1timeout on adapter 0 target 1
   tm0=0xffffc4d34a373a5d, tm1=0xfffed7de39b49c50, timeout=0xb
- 2+ Device Vendor Product:
3+ Device Vendor Product:
4+ Device Vendor Product:
5+ Device Vendor Product:
6+ Device Vendor Product:
7+ Device Vendor Product:
8+ Device Vendor Product:
9+ Device Vendor Product:
10+ Device Vendor Product:
11timeout on adapter 0 target b
   tm0=0xfffed7de39b49c5b, tm1=0xfffed7de39b49c34, timeout=0x3
+ Device Vendor Product:
12+ Device Vendor Product:
13
A 000: *** TLB Refill Exception on node 0
A 000: *** EPC: 0xc00000001fc47e58 (0xc00000001fc47e58)
A 000: *** Press ENTER to continue.
A 000: POD IOC3 Dex> log
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : Memory bank 2 was previously Present & Enabled but is now Present & Disabled
A Info : Memory bank 2 previously had 256 MB but now has 512 MB
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : Memory bank 3 was previously Present & Enabled but is now Absent
A Info : Memory bank 3 previously had 256 MB but now has 0 MB
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A 000: POD IOC3 Dex> error_dump
Hardware Error State: (Forced error dump)
+  Errors on node Nasid 0x0 (0)
+    XBow in /hw/module/001c01
+      BEDROCK signalled following errors.
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa028000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Dex>

IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
  built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc02ad0 (0xc00000001fc02ad0)
Configuring memory
Local memory configured: 512 MB (standard)
*** Warning: System controller debug switches are non-zero (0x2d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
*** Ignoring env. vars/using defaults
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ......................        Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30360 usec
Waiting for peers to complete discovery....        Discovery results:
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition:  +0
Regions in partition:  +0
Intializing any CPUless nodes..............        Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: FE
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition:  +0
Regions in partition:  +0

A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> error_dump
Hardware Error State: (Forced error dump)
+  Errors on node Nasid 0x0 (0)
+    XBow in /hw/module/174562
+      BEDROCK signalled following errors.
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa000000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac> tst
                     ^ Syntax error
A 000: POD IOC3 Cac>

IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
  built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc02ac0 (0xc00000001fc02ac0)
Configuring memory
Local memory configured: 512 MB (standard)
*** Warning: System controller debug switches are non-zero (0x2d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
*** Ignoring env. vars/using defaults
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ......................        Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30359 usec
Waiting for peers to complete discovery....        Discovery results:
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition:  +0
Regions in partition:  +0
Intializing any CPUless nodes..............        Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: FE
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition:  +0
Regions in partition:  +0

A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> log
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : Memory bank 2 was previously Present & Enabled but is now Present & Disabled
A Info : Memory bank 2 previously had 256 MB but now has 512 MB
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : Memory bank 3 was previously Present & Enabled but is now Absent
A Info : Memory bank 3 previously had 256 MB but now has 0 MB
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A Info : SYS-DEGRADED: 512 MB Memory disabled
A 000: POD IOC3 Cac> error_dump
Hardware Error State: (Forced error dump)
+  Errors on node Nasid 0x0 (0)
+    XBow in /hw/module/174562
+      BEDROCK signalled following errors.
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa020000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac>

So I guess without another mainboard or PIMM I don't know.  This is also the LAST mainboard revision...hence the most expensive...geez.


UPDATE:  Okay still looking and grasping but I did find this: http://www.mit.edu/afs.new/sipb/project/...s/Makefile

It has an SGI-IP27 (Origin200/2000) section and it specifically calls out the address in my error; it says it belongs to a CONFIG_TOSHIBA_RBTX4927 section.  The file's includes lead to something like this: https://android.googlesource.com/platform/hardware/bsp/kernel/rockchip/rk-v4.4/+/refs/heads/master/arch/mips/Makefile.  I assume that is still a MIPS section of the Linux kernel, and it gives me the impression that the BEDROCK (rockchip) may be this Toshiba MIPS processor?

UPDATE 2:  Register for SDRAM CHANNEL CONTROL?  http://www.elektronikjk.pl/elementy_czyn...R4927A.pdf

SDRAM chips can be resoldered (hand reflowed) on this board...could this be an SDRAM issue?  I'm having trouble reading the address; could it also be an ECC Status Register (ECCSR)?
(This post was last modified: 11-13-2021, 01:13 AM by weblacky.)
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-12-2021, 11:08 PM
#47
RE: The start of a LONG Fuel repair thread...
######UPDATE 3:  THERE IS A PERSISTENT MEMORY SIZING FAILURE. No matter WHAT I insert, one of the slots fails to recognize the size and defaults to 256MB.  Also, as I change RAM sticks and densities, the XBOW Status register failure error address MOVES slightly! Even with totally different sticks in ALL slots (no original RAM), the error and sizing issues remain.

Now the real question: is this THE ISSUE or AN ISSUE?  Would a memory sizing failure resulting in an XBOW error prevent the system from booting?  Anyone know that?  I mean, I'd love to solve this but I'm going to throw a tantrum if I solve the memory sizing issue and it's NOT what's preventing boot!

Okay here is my evidence.  Notice the error status register address MOVES a tiny bit with each memory change! 
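
To compare dumps like the ones pasted in this post without eyeballing hex, the register lines can be pulled out mechanically. This is a minimal sketch, assuming the error_dump output has been saved as plain text; the regular expression and the function name `extract_registers` are illustrative, not part of any SGI tool. In the dumps pasted here, the value that changes between runs is the "XBow error command word register"; the link status register stays the same.

```python
import re

# Sample text copied from one of the error_dump outputs in this thread.
SAMPLE_DUMP = """\
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa018000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
"""

# Matches lines of the form "+   <name> register: 0x<hex>".
REGISTER_LINE = re.compile(
    r"^\+\s+(?P<name>.+? register):\s+(?P<value>0x[0-9a-fA-F]+)\s*$"
)


def extract_registers(dump_text):
    """Return {register name: integer value} for every register line in a dump."""
    registers = {}
    for line in dump_text.splitlines():
        match = REGISTER_LINE.match(line)
        if match:
            registers[match.group("name")] = int(match.group("value"), 16)
    return registers


if __name__ == "__main__":
    for name, value in extract_registers(SAMPLE_DUMP).items():
        print(f"{name}: {value:#018x}")
```

Running the same function over each saved dump gives dictionaries that can be compared key by key.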

Okay here is the POD boot info when I had JUST one interleaved bank of 512MB + 512MB = 1GB

Code:
IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
  built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc02ac0 (0xc00000001fc02ac0)
Configuring memory
Local memory configured: 1024 MB (standard)
*** Warning: System controller debug switches are non-zero (0x2d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
*** Ignoring env. vars/using defaults
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ......................        Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30360 usec
Waiting for peers to complete discovery....        Discovery results:
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition:  +0
Regions in partition:  +0
Intializing any CPUless nodes..............        Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
*** Nasid 0: Memory bank 0 previously had 256 MB but now has 512 MB
*** Nasid 0: Memory bank 1 previously had 256 MB but now has 512 MB
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: FE
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition:  +0
Regions in partition:  +0

A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> eror  ror_dump
Hardware Error State: (Forced error dump)
+  Errors on node Nasid 0x0 (0)
+    XBow in /hw/module/174562
+      BEDROCK signalled following errors.
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa018000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac>


Then I decided to add more of MY RAM (totally removing the old RAM) (same part number, same densities! ALL 512MB modules from a previous Tezro upgrade):
Code:
IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
  built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc02abc (0xc00000001fc02abc)
Configuring memory
*** Memory sizing failure:
*** Bank 2 (512 MB) and bank 3 (256 MB) sizes differ,
*** treating them both as 256 MB
Local memory configured: 1792 MB (standard)
*** Warning: System controller debug switches are non-zero (0x2d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
*** Ignoring env. vars/using defaults
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ......................        Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30358 usec
Waiting for peers to complete discovery....        Discovery results:
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
    NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition:  +0
Regions in partition:  +0
Intializing any CPUless nodes..............        Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
    NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
    MODULE=001c01 PARTITION=0 SPACE=RESET
    Port 1 connection: Not connected
    Port status: FE
Erecting partition fences ................                        DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition:  +0
Regions in partition:  +0

A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> error_dump
Hardware Error State: (Forced error dump)
+  Errors on node Nasid 0x0 (0)
+    XBow in /hw/module/174562
+      BEDROCK signalled following errors.
+        XBow Link a status register: 0xffffffff80020000
+          17: Illegal destination
+        XBow error command word register: 0xffffffffaa020000
+        XBow error upper address register: 0x0
+        XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac>


Notice once I've placed ALL 512MB sticks in the DIMM slots I get (Believe me I popped them out and label checked, I know all my sticks were Identical from my Tezro's previous memory layout...no way this is coincidence):


*** Memory sizing failure:
*** Bank 2 (512 MB) and bank 3 (256 MB) sizes differ,
*** treating them both as 256 MB
Local memory configured: 1792 MB (standard)
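
As a quick arithmetic check on that 1792 MB figure, here is a small sketch. It ASSUMES banks 0 and 1 were each detected as 512 MB in this run (the log only lists banks 2 and 3), and the helper `pair_downgraded` is purely illustrative; it is not how the PROM is known to work.

```python
# Bank sizes in MB for the all-512MB configuration.
# Banks 2 and 3 come from the PROM message; banks 0 and 1 are ASSUMED
# to be detected as 512 MB, since the log does not list them.
detected_mb = {0: 512, 1: 512, 2: 512, 3: 256}


def pair_downgraded(sizes, a, b):
    """Return a copy of sizes with banks a and b both set to the smaller size."""
    result = dict(sizes)
    smaller = min(sizes[a], sizes[b])
    result[a] = result[b] = smaller
    return result


sum_detected = sum(detected_mb.values())
sum_downgraded = sum(pair_downgraded(detected_mb, 2, 3).values())

print("sum of detected sizes:", sum_detected, "MB")
print("sum with banks 2/3 both at 256:", sum_downgraded, "MB")
```

Under that assumption the first sum is 1792 MB and the second is 1536 MB, so the reported total lines up with the detected sizes rather than with both banks counted at 256 MB. What the PROM does internally cannot be confirmed from these logs alone.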


No matter what I insert, Bank 3 cannot determine size!  If I only place in a single bank of modules (leaving bank 3 empty), I don't get a sizing error but the XBOW error remains.  And since the address of the error MOVES with memory layout...it's got to be ECC status registers or something regarding memory DIMMs? Right?
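
To see exactly how much the value "moves", the two command word values from the dumps above can be XORed. A minimal sketch: the upper 32 bits are all ones in every printed value, which would be consistent with a 32-bit register sign-extended to 64 bits, so the code masks to the low 32 bits; that interpretation is an assumption. The meaning of the individual command word bits is not known from this thread.

```python
MASK32 = 0xFFFFFFFF

# XBow register values copied from the error_dump outputs in this thread.
CMD_WORD_1GB = 0xffffffffaa018000      # two 512 MB DIMMs installed
CMD_WORD_ALL_512 = 0xffffffffaa020000  # four 512 MB DIMMs installed
LINK_STATUS = 0xffffffff80020000       # identical in both dumps


def low32(value):
    """Keep only the low 32 bits (assumes the upper half is sign extension)."""
    return value & MASK32


def set_bits(value):
    """Return the positions of all bits that are set, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


diff = low32(CMD_WORD_1GB) ^ low32(CMD_WORD_ALL_512)

print(f"command word differs in bits {set_bits(diff)} (mask {diff:#x})")
print(f"link status bits set: {set_bits(low32(LINK_STATUS))}")
```

The difference is confined to bits 15 through 17. The link status register has bit 17 set, which matches the "17: Illegal destination" line the PROM prints under it.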

Now RAM slots are normally through-hole technology, so a highly heated desoldering gun SHOULD have enough sink to reflow the pins from the underside of the memory slot.  Before I do that I could look for a break using the bottom test pads (I assume they are there) and the memory slot "fingers".  It will take a while.

The previous owner had an invalid memory config that the system dumbed down (I noticed this but couldn't boot so I had other issues).  He had 3 x 256MB and 1 x 512MB; oddly enough, my pics show he had the 512 module IN bank 3.  So did he use the memory sizing failure to get all 256MB DIMMs?

Given he had this layout though...I assumed he booted at one point...so I'm worried that even though I think I'm right on this sizing failure being a cracked solder joint...he may have been booting despite it this whole time!  Which would mean this pre-power check error is unrelated!

Well, I have NO OTHER ERRORS to go on...I can only work with what's right in front of me. So since I know the bank with the issue, I guess I'll look into taking the mainboard out again and doing a probe/reflow on the 3rd DIMM bank.

I'm still worried that I cannot find any reference to this pre-power check error...that simple error is causing me no end of sleeplessness....
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-13-2021, 02:18 AM
#48
RE: The start of a LONG Fuel repair thread...
Clean the slots? That fixed it in my old Fuel once.

I'm the system admin of this site. Private security technician, licensed locksmith, hack of a c developer and vintage computer enthusiast. 

https://contrib.irixnet.org/raion/ -- contributions and pieces that I'm working on currently. 

https://codeberg.org/SolusRaion -- Code repos I control

Technical problems should be sent my way.
Raion
Chief IRIX Officer

Trade Count: (9)
Posts: 4,240
Threads: 533
Joined: Nov 2017
Location: Eastern Virginia
11-13-2021, 02:34 AM
#49
RE: The start of a LONG Fuel repair thread...
I guess stranger things have happened. Given I've inserted modules in this slot 4 times and the old owner likely had this issue, I'd discounted that..but you're right..try the easy stuff first.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-13-2021, 02:52 AM
#50
RE: The start of a LONG Fuel repair thread...
Hi Weblacky,

I have not read all the log info...

But I did notice that you did a POD/CAC clearalllogs, but you did not seem to have followed with an "initalllogs" and a "flush".

I suggest you try the following (in POD/CAC):
Clear the logs: "clearalllogs"
Reinitialise the logs: "initalllogs"
Flush the buffers: "flush"
Then reset "debug" back to 0 (via L1), and
finally, via POD/CAC, do a "reset".

See if machine now boots up ok.

In reference to "megaimg" "Fuel TBL Refill Exception", I was able to reproduce this, on my working Fuel, by trying to do an install via Console port while the PROM environment variable is set to "console=g".

As per that thread, I have verified that you can boot the Fuel without using the graphics console (even if you have the V10/V12 plugged in).

If you want to boot via console, then keep the mouse and keyboard unplugged, as the PROM then treats the machine as headless.

I will also review your logs vs my known good ones to see if I can see anything.
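
For that kind of log-versus-log review, a plain diff gets noisy because some fields change on every boot (for example the "in 30359 usec" discovery timing seen in this thread). A minimal sketch, assuming both logs are saved as text: it masks that timing field and then uses Python's standard `difflib`. The list of volatile patterns and the sample snippets are illustrative; extend or trim them as needed.

```python
import difflib
import re

# Fields assumed to vary harmlessly between boots; extend as needed.
VOLATILE = [
    (re.compile(r"in \d+ usec"), "in <N> usec"),
]


def normalize(log_text):
    """Strip trailing whitespace and mask volatile fields, line by line."""
    lines = []
    for line in log_text.splitlines():
        line = line.rstrip()
        for pattern, placeholder in VOLATILE:
            line = pattern.sub(placeholder, line)
        lines.append(line)
    return lines


def diff_logs(reference_text, suspect_text):
    """Return unified-diff lines (no context) between two normalized logs."""
    return list(
        difflib.unified_diff(
            normalize(reference_text),
            normalize(suspect_text),
            fromfile="reference",
            tofile="suspect",
            lineterm="",
            n=0,
        )
    )


# Short excerpts taken from the two boots pasted earlier in this thread.
REFERENCE = """\
Configuring memory
Local memory configured: 1024 MB (standard)
Found 1 objects (1 hubs, 0 routers) in 30360 usec
"""
SUSPECT = """\
Configuring memory
*** Memory sizing failure:
Local memory configured: 1792 MB (standard)
Found 1 objects (1 hubs, 0 routers) in 30358 usec
"""

if __name__ == "__main__":
    for line in diff_logs(REFERENCE, SUSPECT):
        print(line)
```

With the timing masked, only the memory-related lines show up as differences between the two excerpts.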

Also FYI, I found that you can use the USB port to monitor L1 (via L2) while using the system board serial port to do console stuff.

Cheers from Oz.

jwhat/John
(This post was last modified: 01-02-2022, 10:59 PM by jwhat.)
jwhat
Octane/O350/Fuel User

Trade Count: (0)
Posts: 513
Threads: 29
Joined: Jul 2018
Location: Australia
11-13-2021, 03:49 AM

