The start of a LONG Fuel repair thread...
#51
RE: The start of a LONG Fuel repair thread...
HOLY H*LL...I just got to the graphical PROM (using the replacement V10...haven't tried the "repaired" V10 yet).  Also, no, I didn't touch the RAM slot...it's been broken all along (will show in the power-on log) so it still needs fixing.

Okay, NVRAM is corrupted on every power-on (not on a powered reboot). DigiKey just sold out of snaphats too...but it looks like Mouser has them...so off to reload on snaphats and DIP sockets!

So EVERY fresh boot resulted in a halt.  Every later test was done while stopped in POD mode.  Now that I have two terminals running for L1 and console I thought...why not just LET IT TRY TO GO without ANY POD mode debug diags.

The fresh boot COMPLAINED of LOTS of memory issues on bank #3, then invalid NVRAM, then forced itself into POD mode...so I hit the reset switch.  Now there's valid data in NVRAM...discovery normal...BOOTED RIGHT TO GRAPHICS!!!

The power button caused NMI errors and freezing, visible at the end of the log (I'm assuming that's normal for the ATX adapter?).  So the memory error and resulting XBOW error are REAL but NOT considered unbootable...I still need to look into this.

Bad NVRAM caused every fresh boot to fail anyway (visible only in console mode).  I still don't know about the "pre-power check" message, but that was INFO and not an error...so maybe that's the ATX adapter again?  I'll ask the creator.  The DS1780 replacement was still needed on the V10 to get ENV working.

So much to do but without further ado the VERY LONG boot log:

Code:
IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
Running in DDR mode
*** Memory sizing failure:
*** Bank 2 (512 MB) and bank 3 (256 MB) sizes differ,
*** treating them both as 256 MB
Testing/Initializing memory ...........DIRECTORY MEMORY ERROR: Location DIMMs ADDRESS (backdoor)  subtest
(dir. type is Standard)  001c01  2 & 3 0x90000001d8000000  Base Address
    expected: 0x00000000aaaaaaaa
      actual: 0x00000000aaa8aaa8
        diff: ^x...........2...2

suspect devices = [Dimm3:G1A7B] [Dimm3:G1A7B]

DIRECTORY MEMORY ERROR: Location DIMMs ADDRESS (backdoor)  subtest
(dir. type is Standard)  001c01  2 & 3 0x90000001d8000000  Base Address
    expected: 0x00000000ffffffff
      actual: 0x00000000fffdfffd
        diff: ^x...........2...2

suspect devices = [Dimm3:G1A7B] [Dimm3:G1A7B]

DIRECTORY MEMORY ERROR: Location DIMMs ADDRESS (backdoor)  subtest
(dir. type is Standard)  001c01  2 & 3 0x90000001d8000000  Walking Address
    expected: 0x00000000aaaaaaaa
      actual: 0x00000000aaa8aaa8
        diff: ^x...........2...2

suspect devices = [Dimm3:G1A7B] [Dimm3:G1A7B]

DIRECTORY MEMORY ERROR: Location DIMMs ADDRESS (backdoor)  subtest
(dir. type is Standard)  001c01  2 & 3 0x90000001d8000000  Walking Address
    expected: 0x00000000ffffffff
      actual: 0x00000000fffdfffd
        diff: ^x...........2...2

suspect devices = [Dimm3:G1A7B] [Dimm3:G1A7B]


MBIST ERROR 0
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x1

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa
expected data:  0x555555555 0x555555555 0x555555555 0x555555555

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]


MBIST ERROR 1
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x1

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa
expected data:  0x555555555 0x555555555 0x555555555 0x555555555

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]


MBIST ERROR 2
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x1

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa
expected data:  0x555555555 0x555555555 0x555555555 0x555555555

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]


MBIST ERROR 3
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x1

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa
expected data:  0x555555555 0x555555555 0x555555555 0x555555555

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]

************************************************

MBIST ERROR 0
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x0

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0x555555555 0x555555555 0x555555555 0x555555555
expected data:  0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]


MBIST ERROR 1
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x0

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0x555555555 0x555555555 0x555555555 0x555555555
expected data:  0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]


MBIST ERROR 2
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x0

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0x555555555 0x555555555 0x555555555 0x555555555
expected data:  0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]


MBIST ERROR 3
************************************************


            (DIRECTORY): Location:  001c01
                         DimmPair:  1
                         Bank:      1
                         RAS:       0x0
                         CAS:       0x0

actual data:    0x00000000aaa80000
expected data:  0x000000000000aaaa

suspect devices = [Dimm3:G1B6B] [Dimm3:G1A7B]
                  [Dimm3:G1B6B] [Dimm3:G1A7B]



            (MEMORY): Location:  001c01
                      DimmPair:  1
                      Bank:      1
                      RAS:       0x0
                      CAS:       0x0

actual data:    0x555555555 0x555555555 0x555555555 0x555555555
expected data:  0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa 0xaaaaaaaaa

suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 2 : F5A7B] [Dimm 2 : F5B6B]
                  [Dimm 2 : B5A7B] [Dimm 2 : B5B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
                  [Dimm 3 : E4A7B] [Dimm 3 : E4B6B]
                  [Dimm 3 : D9A7B] [Dimm 3 : D9B6B]
                  [Dimm 3 : C6A7B] [Dimm 3 : C6B6B]
                  [Dimm 3 : F5A7B] [Dimm 3 : F5B6B]
                  [Dimm 2 : C6A7B] [Dimm 2 : C6B6B]
                  [Dimm 2 : C1A7B] [Dimm 2 : C1B6B]
                  [Dimm 2 : B0A7B] [Dimm 2 : B0B6B]
                  [Dimm 2 : A4A7B] [Dimm 2 : A4B6B]
                  [Dimm 3 : A4A7B] [Dimm 3 : A4B6B]
                  [Dimm 3 : B0A7B] [Dimm 3 : B0B6B]
                  [Dimm 3 : B5A7B] [Dimm 3 : B5B6B]
                  [Dimm 3 : C1A7B] [Dimm 3 : C1B6B]
                  [Dimm 2 : E4A7B] [Dimm 2 : E4B6B]
                  [Dimm 2 : D9A7B] [Dimm 2 : D9B6B]

************************************************

    TOO MANY ERRORS IN DIRECTORY MEMORY TEST: DimmPair:  1
                                              DIMMs:     2 & 3
    There were 4 errors found.

mtest_bank_dir failed: Walking Address
RSLT mtest_bank_dir FAIL                diag_rc = 44  Walking Address          

    TOO MANY ERRORS DURING MEMORY BIST: DimmPair:  1
                                        DIMMs:     2 & 3
                                        Bank:      1
    There were 8 errors found.

mtest_bank_bist failed:
RSLT mtest_bank_bis FAIL                diag_rc = 48                           
    Disabling the failed bank
....        DONE
Copying PROM code to memory ...............        DONE
Discovering local IO ......................        DONE
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30359 usec
Waiting for peers to complete discovery....        DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
\\\\Intializing any CPUless nodes..............        \\DONE
\*** Nasid 0: Memory bank 3 was previously Present & Enabled but is now Present & Disabled
Checking partitioning information .........        DONE
No other nodes present; becoming partition master
\Loading BASEIO prom .......................        DONE

BASEIO PROM Monitor SGI Version 6.210  built 02:30:38 PM Aug 26, 2004 (BE64)
1 CPUs on 1 nodes found.

NVRAM checksum is incorrect: reinitializing.
Automatic update of PROM environment disabled

PS/2 Keyboard & Mouse diagnostics
    Found mouse on port 0
    Found keyboard on port 1
PS/2 Keyboard & Mouse diagnostics passed

Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
        Board version 1 - Buzz revision 2B
        On board sdram size: 32 Mb
        Cas latency: CAS 3
        2 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............            
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0, (pci id 1)
1timeout on adapter 0 target 1
   tm0=0xfffed7df04994b53, tm1=0xfffed7de39b49c50, timeout=0xb
- 2+ Device Vendor Product:
3+ Device Vendor Product:
4+ Device Vendor Product:
5+ Device Vendor Product:
6+ Device Vendor Product:
7+ Device Vendor Product:
8+ Device Vendor Product:
9+ Device Vendor Product:
10+ Device Vendor Product:
11timeout on adapter 0 target b
   tm0=0xfffed7de39b49c5b, tm1=0xfffed7de39b49c34, timeout=0x3
+ Device Vendor Product:
12+ Device Vendor Product:

A 000: *** TLB Refill Exception on node 0
A 000: *** EPC: 0xc00000001fc47e58 (0xc00000001fc47e58)
A 000: *** Press ENTER to continue.
A 000: POD IOC3 Dex>

IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
Running in DDR mode
*** Memory sizing failure:
*** Bank 2 (512 MB) and bank 3 (256 MB) sizes differ,
*** treating them both as 256 MB
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        DONE
Discovering local IO ......................        DONE
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30359 usec
Waiting for peers to complete discovery....        DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
\\\\Intializing any CPUless nodes..............        \\DONE
\Checking partitioning information .........        DONE
No other nodes present; becoming partition master
\Loading BASEIO prom .......................        DONE

BASEIO PROM Monitor SGI Version 6.210  built 02:30:38 PM Aug 26, 2004 (BE64)
1 CPUs on 1 nodes found.
Automatic update of PROM environment disabled

PS/2 Keyboard & Mouse diagnostics
    Found mouse on port 0
    Found keyboard on port 1
PS/2 Keyboard & Mouse diagnostics passed

Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
        Board version 1 - Buzz revision 2B
        On board sdram size: 32 Mb
        Cas latency: CAS 3
        2 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............            
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0, (pci id 1)
1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 0 device(s)


Walking SCSI Adapter 1, (pci id 1)
1- 2- 3- 4- 5- 6+ Device Vendor Product: TOSHIBA DVD-ROM SD-M1711
7- 8- 9- 10- 11- 12- 13- 14- 15- = 1 device(s)

Initializing PROM Device drivers ..........             DONE

A 000: *** NMI while in PROM on node 0
A 000: *** Error EPC: 0xc00000001306c038 (0xc00000001306c038)
A 000: *** Press ENTER to continue.

A 000: *** NMI while in PROM on node 0
A 000: *** Error EPC: 0xc00000001fc3ce08 (0xc00000001fc3ce08)
A 000: *** Press ENTER to continue.


Here's the HINV from PROM:
   

So I need to look through these errors, but it seems to fail the Bank 2 and Bank 3 segments on power-on and trim down working memory.  I'll try cleaning first, obviously.  But I'll need a new battery and possibly an RTC, and I'll need to look into this memory thing...but so far things are better.  No confidence tests yet though, so there could be more problems.
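
For anyone wading through a log this long, here is a minimal Python sketch that counts how often each DIMM slot is named in the "suspect devices" lines. The function name and the sample text are my own; the regular expression only assumes the two bracket spellings that appear in the log above ("[Dimm3:G1A7B]" and "[Dimm 2 : F0A7B]").

```python
import re
from collections import Counter

# Matches both spellings seen in the log: "[Dimm3:G1A7B]" and "[Dimm 2 : F0A7B]".
SUSPECT_RE = re.compile(r"\[Dimm\s*(\d+)\s*:\s*([0-9A-Za-z]+)\]")


def tally_suspects(log_text):
    """Count how often each DIMM slot number is named as a suspect device."""
    counts = Counter()
    for slot, _device in SUSPECT_RE.findall(log_text):
        counts[int(slot)] += 1
    return counts


# A few lines copied from the boot log as sample input.
sample = """\
suspect devices = [Dimm3:G1A7B] [Dimm3:G1A7B]
suspect devices = [Dimm 2 : F0A7B] [Dimm 2 : F0B6B]
                  [Dimm 3 : F0A7B] [Dimm 3 : F0B6B]
"""

for slot, hits in sorted(tally_suspects(sample).items()):
    print(f"DIMM {slot}: named {hits} time(s)")
```

A raw count like this only shows which slot the diagnostics keep pointing at; it says nothing about whether the DIMM, the slot, or the traces behind it are at fault.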

Upon further thought, the "pre-power" error COULD also be related to the fact that the power system cannot communicate with the PSU (it can no longer auto-turn off).
(This post was last modified: 11-13-2021, 04:05 AM by weblacky.)
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-13-2021, 04:02 AM
#52
Hi Weblacky,

good news.

I was going to suggest you just start with single DIMM...

EDIT #1: but I read the manual and, as per Raion's comment, they need to be installed in pairs (Socket 0 & Socket 2 for the first pair).

What you need to do now is clear all the errors...

If you can get to PROM Command Prompt then do an:
- "enableall" - this will clear errors and re-enable disabled components (as a result of errors)
- "update" - refresh inventory stored in PROM

Great that you have got a booting machine.

And I captured a log (right towards the end, "Sample POD/DEX/CAC Startup on Fuel") of a working Fuel debug boot, and the only significant difference I could see between yours and mine was the memory errors.

Cheers from Oz,

jwhat/John.
(This post was last modified: 11-13-2021, 07:06 AM by jwhat.)
jwhat
Octane/O350/Fuel User

Trade Count: (0)
Posts: 513
Threads: 29
Joined: Jul 2018
Location: Australia
11-13-2021, 05:31 AM
#53
You have to install RAM in pairs fyi.

I'm the system admin of this site. Private security technician, licensed locksmith, hack of a c developer and vintage computer enthusiast. 

https://contrib.irixnet.org/raion/ -- contributions and pieces that I'm working on currently. 

https://codeberg.org/SolusRaion -- Code repos I control

Technical problems should be sent my way.
Raion
Chief IRIX Officer

Trade Count: (9)
Posts: 4,244
Threads: 534
Joined: Nov 2017
Location: Eastern Virginia
11-13-2021, 05:34 AM
#54
Yeah, I knew about the interleaved memory because I re-read the manual so I wouldn't be wrong...though I hate how SGI documents refer to BANK as a pair of DIMMs while the Firmware refers to bank as a single DIMM slot!!!

But it's obvious that there is a shorted or perhaps open bus line on two lines between the two slots (the errors show a..pattern of disruption), so it's definitely not random, and my gut says THAT open or short on the bus lines IS the reason you get an XBOW error! If I can repair that, everything should clear!
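
To make the "pattern" concrete, here is a small Python sketch that XORs the expected and actual words from the DIRECTORY MEMORY ERROR entries in the boot log and lists which bit positions differ. The helper name is my own; the values are copied from the log.

```python
def differing_bits(expected, actual):
    """Return the bit positions where actual differs from expected."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# (expected, actual) pairs copied from the DIRECTORY MEMORY ERROR entries.
observations = [
    (0x00000000AAAAAAAA, 0x00000000AAA8AAA8),
    (0x00000000FFFFFFFF, 0x00000000FFFDFFFD),
]

for expected, actual in observations:
    bits = differing_bits(expected, actual)
    # Bits that should have been 1 but were read back as 0.
    read_low = [b for b in bits if (expected >> b) & 1]
    print(f"diff=0x{expected ^ actual:08x} bits={bits} read-low={read_low}")
```

Both test patterns disagree in exactly the same two positions, which matches the "diff: ^x...........2...2" lines in the log. That consistency is what argues against a random fault, though the arithmetic alone cannot tell an open line from a short.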

I'm not going to do the CAC erase stuff now because I already did that and I KNOW the error will come right back until the condition is actually fixed. But I think it's either a cracked solder joint, shavings/blockage in the DIMM slot, or some track issue...I'd have a hard time seeing a BGA issue with the Bedrock...so I assume the pads for the Bedrock ASIC are soldered well and connect...just the tracks are open or otherwise bridged.

I've just ordered the pricey (large) snaphats, and I KNOW the L1 time is accurate enough (within minutes of correct when I got it)...so I KNOW the L1 RTC is still good. Must be the snaphat!

Then I'll remove the mainboard again and inspect for this Memory BS; then, if that works, loading and a stress-test; if that works...I'll dare to try my "fixed" V10 and see if it's actually fixed! The voltage drop under load of both the old and new V10 cards appears the same in ENV...so it's still possible the old card's short was just a decoupling cap and it might fire right up after the console issues are cleared.


Wish me luck and fingers crossed!
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-13-2021, 06:27 AM
#55
Okay all, well GOOD News and some...we'll see news.

Okay, I removed the mainboard again, flipped it over and looked at the back of the DIMM slots. I found a SMALL, but present, amount of material that MAY have been conductive (unknown), so I used a USB microscope to go to EACH IC, and if I saw any hair, gunk, or bridges, I used a pin and cleared the legs. I inspected all ICs along the slots. I generally checked for any free legs (didn't find one) and removed any debris I found...but didn't see ANY scratches or cracked joints/movable legs.

However, two chips did have some kind of hair or the like, and one was IN the solder! I have no way of knowing if what I cleared made the difference or not...but I did it.

I examined the slots themselves...boy they look clean to me, no dust, no lint, no hairs. I didn't want to place a cotton swab in there as it would LEAVE hairs...so I didn't do much.

So, I put everything back and turned it on with serial terminals hooked up. It turned on and instantly complained on the terminal console that all the DIMM1 addressing didn't work, and since that didn't work it couldn't see ANY memory!!! It claimed DIMM1 had to work regardless of anything and it had too many issues.

WOW, okay...I wouldn't have thought things would get worse...but I then took all DIMMs OUT and reinstalled just two in the DIMM1 & DIMM3 slots. Started again...it said it saw RAM in disabled slots, so it re-enabled them, and reported 1024MB...Okay, better...added back the other modules...Same original issue...DIMM3 is 256MB instead of 512MB, cutting memory down.
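
As a rough illustration of why one mis-sized DIMM cuts total memory, here is a Python sketch of the rule the PROM message describes ("sizes differ, treating them both as 256 MB"). It is only a model of that log message, not PROM code. The function name is my own, the default slot pairing (0 & 1, 2 & 3) is an assumption based on the log's "DIMMs: 2 & 3" wording, and four 512 MB modules is inferred from the 1024MB reported with two DIMMs installed.

```python
def effective_sizes(dimm_mb, pairs=((0, 1), (2, 3))):
    """Model the sizing rule from the log: both DIMMs in a pair are
    treated as the size of the smaller one.

    The default pairing is an assumption taken from the log's
    "DimmPair: 1 / DIMMs: 2 & 3" wording; adjust it if your board differs.
    """
    sizes = list(dimm_mb)
    for first, second in pairs:
        smaller = min(sizes[first], sizes[second])
        sizes[first] = sizes[second] = smaller
    return sizes


healthy = [512, 512, 512, 512]   # assumed: four 512 MB modules
faulty = [512, 512, 512, 256]    # slot 3 detected as 256 MB, as in the boot log

print("healthy:", effective_sizes(healthy), "total", sum(effective_sizes(healthy)), "MB")
print("faulty: ", effective_sizes(faulty), "total", sum(effective_sizes(faulty)), "MB")
```

Under this model a single slot reporting half its size costs twice the missing amount, because its partner is clamped too. The first boot log goes further and disables the failed bank entirely, which this sketch does not model.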

Hmmmm...full circle, huh?

Okay, reinserted ALL DIMMs and played...wiggle wiggle wiggle...nope...wiggle wiggle wiggle yeah!

I then went to debug 0x010d, cac, clearalllogs, initilogs, reset...After that...I've never been able to see the console again from serial 1...gone. L1 works fine...console, no output...okay...moving on.

Still have a bad snaphat, so double-boot (boot/reset/boot) and I'm in graphical PROM = 2048MB....NO ISSUES! It sees ALL the RAM now. Memory inits and done....no more memory sizing issue (I still think it's a bad joint)...but I won't be removing these DIMMs as I don't care about a 4GB Fuel; 2GB is great, so I'm happy. But if it happens again I'll likely manually reflow the ICs under the DIMM3 slot and PERHAPS the slot itself, since it's ALSO an SMD leg-style component.

The error_dump error register didn't go away during this, same error.

Can someone with a working Fuel please do a debug 0x010d and a error_dump and see what comes up. What's it supposed to say?

Okay so, outside of snaphat issue...I'm mostly good...let's install Irix 6.5.30!


So first I hook up the ORIGINAL SGI drive...yeah, a ~18GB Quantum Atlas II that looks a heck of a lot like an IBM...got to PROM...all IDs filled with drives...great, the original owner never set the jumpers. Sigh....remove drive, look up pins, find ID 1, micro-jumper ID 1 (good thing I BOUGHT micro-jumpers last year). Reinstall...now I have a DVD and an HDD...great.

DVD doesn't read well while the system is on its side. Remove all serial cables, close up the panel, run the system, insert CD into the DVD drive.

Runs fine now. HDD was partially hosed SGI label... I don't care about what the previous owner had...given the poor condition of the system.

Okay, FXed without wiping since label was hosed anyway. Back to PROM, "install Software", from CDROM, GO!!!!!

progress...progress...done..Inst (would you like to format, yes!), XFS 4096, select disks...here you go...running.................::me feeding discs and standing:::: ......Done...ready to reboot, yes, reboots, I catch it on PROM before first boot. Turn it off...I'm done for tonight.

So it RAN and installed Irix without crashing?! That's progress! It passes startup memory tests, progress! So all I need to do is go through first boot and run diags...and we'll see where I am.

The PROM install claimed the PROM was already the same version that 6.5.30 wanted to install...so it wasn't installed. L1 is unknown (if it did anything), likely not. But I want to ensure a stable OS before I try the flashing dance.

I don't know if I got my serial console back...I'm hoping it magically comes back after updates and fixes...:-(

So, good news overall! Once things are 100% reliable, I'll try the original V10 card to see if it really does work.
(This post was last modified: 11-17-2021, 06:10 AM by weblacky.)
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-17-2021, 06:06 AM
#56
I don't recommend using cotton swabs to clean anything on the system boards. In the past I used old toothbrushes that I cleaned really well of human residue.

Nowadays I actually have nylon cleaning brushes that I can literally dip in rubbing alcohol and then run through almost any type of nook and cranny. I often will follow that up with a rinse if a board is dirty, followed by a drying session up in my water heater closet, then a hair dryer over everything to force all the remaining water out and then an alcohol bath from a parts washer to displace any water and then I put it back in the cupboard until it dries for about 24 hours.

Overkill perhaps but considering I don't have ultrasonic cleaners or anything this is the best way I've actually been able to clean off crap.

Raion
Chief IRIX Officer

Trade Count: (9)
Posts: 4,244
Threads: 534
Joined: Nov 2017
Location: Eastern Virginia
11-17-2021, 07:28 AM
#57
I've not found a good electronics swab yet; if you have a link or brand recommendation then please let me know. For surface stuff I use a lens wipe and soak it. But for getting INTO stuff, yeah, a toothbrush is my go-to...but I don't know about a memory slot. The DIMM contacts in the Fuel look FLIMSY, and even catching those on a toothbrush makes me nervous. Why did they need to make them so fragile, when a normal PC RAM slot has a much beefier contact!

I have access to a small and huge ultrasonic tank (size of a large bathtub) at work, but in reality it's only good for removing corrosion with mild organic acids, great for carburetors...not so much for electronics cleaning in general. The big issue is if you want to do electronics you NEED to have a sweep function to sweep frequencies for interaction. Most of the stuff on the market is single hertz and that actually doesn't clean as well as something that shifts up and down. Some corrosion responds better during frequency shift. You do see the function advertised sometimes.

But yes, that's why I don't tend to use cotton swabs on electronics...catches and tears more than it works.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-17-2021, 07:51 AM
#58
RE: The start of a LONG Fuel repair thread...
Hi Weblacky,

In reply to:
"Can someone with a working Fuel please do a debug 0x010d and a error_dump and see what comes up. What's it supposed to say?"

It's not very exciting....
>> A 000 001c01:
>> A 000 001c01: *** Entering POD mode on node 0
>> A 000 001c01: POD SysCt Cac> error_dump
>> A 000 001c01: Hardware Error State: (Forced error dump)
>> A 000 001c01: END Hardware Error State (Forced error dump)
>> A 000 001c01: POD SysCt Cac>
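
For anyone comparing their own capture against the above, here is a minimal sketch that checks whether an error_dump came back empty. It assumes you have the serial session saved as text; the function name is made up for illustration, and the two marker strings are simply the ones visible in this output.

```python
def error_dump_is_empty(log_text):
    """Return True if nothing sits between the start and END marker lines,
    False if there are lines in between, None if no complete dump is found."""
    start = end = None
    for i, line in enumerate(log_text.splitlines()):
        if "END Hardware Error State" in line:
            end = i
            break
        if "Hardware Error State:" in line:
            start = i
    if start is None or end is None:
        return None  # capture does not contain a complete dump
    # Empty dump: the END line directly follows the start line.
    return end == start + 1
```

It only looks at the first dump in the capture, so trim the log if you ran error_dump more than once.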

Also, I only get diags via the serial port on the system board (not via the external com1 port).

I have not done a full test, but I believe this is likely the right behaviour: with keyboard/mouse plugged in, it expects to send boot console output to the graphics "textport".

EDIT #1: If you boot without keyboard/mouse, then go into the command prompt and do printenv, you will see different values for ConsoleIn/ConsoleOut than if you boot into graphics mode. On my Fuel it has /dev/tty/hubtty0 or something like that (verified), which I presume is the system board serial port, as distinct from /dev/ttyd1 for instance, which is the external com1 port.

EDIT #2: If you do a graphics boot then ConsoleIn=/dev/input/ioc3pckm0 (keyboard, I expect) and ConsoleOut=/dev/graphics/textport. You can also log in via L2 by going into Console mode (Ctrl-D).
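
If you want to check this from a saved printenv capture rather than by eye, a rough sketch follows. It assumes printenv prints one NAME=value pair per line; the function name and the "graphics"/"serial" labels are my own, and the device paths are just the ones mentioned above.

```python
def console_mode(printenv_text):
    """Guess where the PROM console is going from captured printenv text."""
    env = {}
    for line in printenv_text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that are not NAME=value
            env[name.strip().lower()] = value.strip()
    out = env.get("consoleout", "")
    if "textport" in out:
        return "graphics"  # e.g. /dev/graphics/textport
    if "tty" in out:
        return "serial"    # e.g. /dev/tty/hubtty0 or /dev/ttyd1
    return "unknown"
```

For example, `console_mode("ConsoleOut=/dev/graphics/textport")` gives "graphics", while a capture containing the hubtty path gives "serial".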

Cheers from Oz,


jwhat/John
(This post was last modified: 11-17-2021, 11:20 PM by jwhat.)
jwhat
Octane/O350/Fuel User

Trade Count: (0)
Posts: 513
Threads: 29
Joined: Jul 2018
Location: Australia
11-17-2021, 12:31 PM
#59
RE: The start of a LONG Fuel repair thread...
(11-17-2021, 12:31 PM)jwhat Wrote:  Hi Weblacky,

In reply to:
"Can someone with a working Fuel please do a debug 0x010d and a error_dump and see what comes up. What's it supposed to say?"

It's not very exciting....
>> A 000 001c01:
>> A 000 001c01: *** Entering POD mode on node 0
>> A 000 001c01: POD SysCt Cac> error_dump
>> A 000 001c01: Hardware Error State: (Forced error dump)
>> A 000 001c01: END Hardware Error State (Forced error dump)
>> A 000 001c01: POD SysCt Cac>

Also, I only get diags via the serial port on the system board (not via the external com1 port).

I have not done a full test, but I believe this is likely the right behaviour: with keyboard/mouse plugged in, it expects to send boot console output to the graphics "textport".

Cheers from Oz,


jwhat/John

Hmmm…well I hope that’s right.  I’ll do more testing to see if I now magically have both L1 & console on the same serial!  I sure didn’t before!  Super irritating.  

Darn, so that error isn’t vestigial like that TLB one.

Well, if the tests prove sound I’ll start the firmware march, using your download; perhaps that will cure what ails me!

I’d sure like to stay clean of errors and not have any loose ends at the end of this. 

Thanks for confirming that output!
(This post was last modified: 11-17-2021, 12:44 PM by weblacky.)
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-17-2021, 12:43 PM
#60
RE: The start of a LONG Fuel repair thread...
Okay!  Things are really looking up!

I got my "extra beefy" snaphats yesterday and snapped one into the Fuel today (not as easy as I thought).

I went ahead and took jwhat's advice and assumed the console had moved back to the L1 port that you toggle with Ctrl+d & Ctrl+t (that turned out to be true, now).

All errors are now RESOLVED (this log shows booting with a NEW snaphat for the first time, then rebooting with the L1 pwr commands after the snaphat's RTC/NVRAM has been reinitialized to defaults):

Code:
SGI SN1 L1 Controller

Firmware Image B: Rev. 1.28.3, Built 03/20/2004 00:01:57

001a01-L1> INFO: 001a01 will power up system in  5 seconds...

INFO: 001a01 powering up the system.

WARNING: power appears off, console unavailable

entering console mode  001a01 console, <CTRL_T> to escape to L1


escaping to L1 system controller

001a01-L1> 

returning to console mode  001a01 CPU0, <CTRL_T> to escape to L1
.......        DONE
Copying PROM code to memory ...............        DONE
Discovering local IO ......................        DONE
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 5889 usec
Waiting for peers to complete discovery....        DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
\\\\Intializing any CPUless nodes..............        \\DONE
\Checking partitioning information .........        DONE
No other nodes present; becoming partition master
\Loading BASEIO prom .......................        DONE

BASEIO PROM Monitor SGI Version 6.210  built 02:30:38 PM Aug 26, 2004 (BE64)
1 CPUs on 1 nodes found.

NVRAM checksum is incorrect: reinitializing.
Automatic update of PROM environment disabled

PS/2 Keyboard & Mouse diagnostics
    Found mouse on port 0
    Found keyboard on port 1
PS/2 Keyboard & Mouse diagnostics passed

Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
        Board version 1 - Buzz revision 2B
        On board sdram size: 32 Mb
        Cas latency: CAS 3
        2 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............             
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0, (pci id 1)
1timeout on adapter 0 target 1
   tm0=0xfffed7de8c4f0e53, tm1=0xfffed7de35f21936, timeout=0xb
- 2+ Device Vendor Product:
3+ Device Vendor Product:
4+ Device Vendor Product:
5+ Device Vendor Product:
timeout on adapter 0 target 5
   tm0=0xfffed7de35f2193b, tm1=0xfffed7de35f21914, timeout=0xb
6Set Device queue Parameters failed
HCCR_HOST_INT never cleared
MBOX_CMD_SET_TARGET_PARAMETERS command failed
+ Device Vendor Product:
A 000 001c01:
A 000 001c01: *** TLB Refill Exception on node 0
A 000 001c01: *** EPC: 0xc0000000130dcf04 (0xc0000000130dcf04)
A 000 001c01: *** Press ENTER to continue.
A 000 001c01: POD SysCt Dex> error_dump
A 000 001c01: Hardware Error State: (Forced error dump)
A 000 001c01: END Hardware Error State (Forced error dump)
A 000 001c01: POD SysCt Dex>

escaping to L1 system controller

001a01-L1>debug 0

debug switches set to 0x0000

returning to console mode  001a01 CPU0, <CTRL_T> to escape to L1
             
A 000 001c01: POD SysCt Dex>
A 000 001c01: POD SysCt Dex> log
A 000 001c01: POD SysCt Dex>

escaping to L1 system controller

001a01-L1>pwr down

WARNING: power appears off, console unavailable

returning to console mode  001a01 CPU0, <CTRL_T> to escape to L1
WARNING: power on 001a01 appears off!
WARNING: power on 001a01 appears off!
WARNING: power on 001a01 appears off!


escaping to L1 system controller

001a01-L1>pwr up


returning to console mode  001a01 CPU0, <CTRL_T> to escape to L1
Starting PROM Boot process 


IP35 PROM SGI Version 6.210  built 02:33:51 PM Aug 26, 2004
Running in DDR mode
Testing/Initializing memory ...............        DONE
Copying PROM code to memory ...............        DONE
Discovering local IO ......................        DONE
Discovering NUMAlink connectivity .........        
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 5890 usec
Waiting for peers to complete discovery....        DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
\\\\Intializing any CPUless nodes..............        \\DONE
\Checking partitioning information .........        DONE
No other nodes present; becoming partition master
\Loading BASEIO prom .......................        DONE

BASEIO PROM Monitor SGI Version 6.210  built 02:30:38 PM Aug 26, 2004 (BE64)
1 CPUs on 1 nodes found.
Automatic update of PROM environment disabled

PS/2 Keyboard & Mouse diagnostics
    Found mouse on port 0
    Found keyboard on port 1
PS/2 Keyboard & Mouse diagnostics passed

Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
        Board version 1 - Buzz revision 2B
        On board sdram size: 32 Mb
        Cas latency: CAS 3
        2 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............             
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0, (pci id 1)
1+ Device Vendor Product: QUANTUM ATLAS10K2-TY184L
2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 1 device(s)


Walking SCSI Adapter 1, (pci id 1)
1- 2- 3- 4- 5- 6+ Device Vendor Product: TOSHIBA DVD-ROM SD-M1711
7- 8- 9- 10- 11- 12- 13- 14- 15- = 1 device(s)

Initializing PROM Device drivers ..........             DONE



IRIS console login:   
001a01 ATTN: Power Down: issue 'pwr d' again to immediately power down.


So error_dump is empty now, memory sizing consistently detects 2GB of memory, and it boots to graphics now.  The only things still left to do are:

A: Doing online diags (extensive mode)
B: Testing the original V10 graphics card.
C: Upgrading the L1
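
Since these boot logs are long, here is a small sketch for scanning a saved console capture for the trouble lines seen in this thread, which makes it easy to compare a first boot against a later one. The pattern list is only what showed up in my logs, not an exhaustive set of PROM errors, and the function name is just for illustration.

```python
import re

# Signatures taken from the logs in this thread; extend as needed.
ERROR_PATTERNS = [
    r"NVRAM checksum is incorrect",
    r"Memory sizing failure",
    r"MEMORY ERROR",
    r"TLB Refill Exception",
    r"timeout on adapter \d+ target \d+",
    r"command failed",
]

def find_boot_errors(log_text):
    """Return (line number, text) for each line matching a known signature."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if any(re.search(pat, line) for pat in ERROR_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

Run it once per boot section of the capture; an empty list for the second boot is what "all errors resolved" should look like.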

I'm trying to put together a CDROM with all this on it so I can just pop it in.

I'll keep you guys updated.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-20-2021, 01:37 AM