Follow me down the rabbit hole of Fuel PIMM repair.
#31
RE: Follow me down the rabbit hole of Fuel PIMM repair.
Hi Weblacky,

Sorry, I think I edited while you were responding...

If you turn on DEBUG you should see the memory location the PROM is being loaded to.

I am still a bit puzzled by the fact that it manages to load PROM...

But you are deep into the rabbit hole with this one ;-)

Cheers from Oz,


jwhat/John.
(This post was last modified: 08-18-2022, 01:04 PM by jwhat.)
jwhat
Octane/O350/Fuel User

Trade Count: (0)
Posts: 513
Threads: 29
Joined: Jul 2018
Location: Australia
08-18-2022, 01:04 PM
#32
Don’t confuse memory and cache. Memory is fine; the CPU cache (SRAM) is the problem. The PROM image loads into main RAM, while the CPU uses cache. You’d think they’d just cut it down and go without most of the cache to allow it to work, but apparently they demand it all as installed (unlike main memory).
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
08-18-2022, 03:20 PM
#33
Hi Weblacky,

looking at my Fuel boot log:

>> IP35 PROM SGI Version 6.211 built 04:16:18 PM Jan 25, 2008
>> built for bedrock rev. 1.1 or greater
>> running in IP34 mode
>> Running in DDR mode
>> Local master CPU A revision: f42
>> PROM length: 0x168648, BSS length: 0xa7a0, flash count: 16
>> Configured bedrock clock: 200.0 MHz
>> Status of local IO: 0x1 0x3fc0fff6403
>> Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
>> On PROM entry: ERR_EPC=0xc00000001fc02ac0 (0xc00000001fc02ac0)
>> Configuring memory
>> Local memory configured: 4096 MB (premium)
>> *** Warning: System controller debug switches are non-zero (0x10d)
>> *** Diag level set to None (2)
>> *** Info level set to verbose
>> *** Boot stop requested at Global (2)
>> before reading NICHub NIC: 0x52275dad
>> SR1 set to 0x0000081698349000
>> SR0 set to 0x0000000052275dad
>> Testing/Initializing memory ............... DONE
>> Copying PROM code to memory ............... Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x168648
>> Done
>> DONE
>> Skipping secondary cache diags
>> CPU A switching stack into UALIAS and invalidating D-cache
>> CPU A switching into node 0 cached RAM
>> CPU A running cached
>> Initializing kldir.
>> Done initializing kldir.
>> Initializing klconfig.
>> init_klcfg: nasid 0 start 9600000000030000 size 10000
>> Done initializing klconfig.
>> Discovering local IO ...................... Check_master: link 10 is master
>> Check_master: link 10 is master
>> DONE
>> CPU A initialized subnode
>> Discovering NUMAlink connectivity .........
>> Local hub NUMAlink is down.
>> *** Local network link down
>> DONE
>> Found 1 objects (1 hubs, 0 routers) in 5894 usec
>> Waiting for peers to complete discovery.... Discovery results:
>> ENTRY 0: HUB(52275dad)
>> NASID=-1 Mod=1 Flg=0x9500000 PROM=6.211 Route=N/A
>> MODULE=001c01 PARTITION=0 SPACE=RESET
>> Port 1 connection: Not connected
>> Port status: NF
>> DONE

I am still curious about exactly where/when it crashes.
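For anyone who wants to pull the PROM load address out of a captured console log rather than eyeballing it, here is a minimal Python sketch. It assumes the verbose "Copy PROM (...) to RAM (...), len ..." line looks exactly like the one in the log above; the function name `parse_prom_copy` is just made up for illustration.

```python
import re

# Matches the verbose copy line as it appears in the boot log above.
COPY_RE = re.compile(
    r"Copy PROM \((0x[0-9a-fA-F]+)\) to RAM \((0x[0-9a-fA-F]+)\), len (0x[0-9a-fA-F]+)"
)

def parse_prom_copy(line):
    """Return (source, destination, length) as ints, or None if the line does not match."""
    m = COPY_RE.search(line)
    if m is None:
        return None
    return tuple(int(g, 16) for g in m.groups())

line = ("Copying PROM code to memory ............... "
        "Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x168648")
src, dst, length = parse_prom_copy(line)
# Show the RAM range the PROM image occupies after the copy.
print(f"PROM copied from {src:#x} to {dst:#x}..{dst + length - 1:#x} ({length} bytes)")
```

Running it over each line of a saved log would show whether the copy step was reached at all before a crash.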

Even to load the PROM into memory, the processor would surely have to be running at a lower speed, as it needs a CPU to load the PROM into RAM.

I say this because if the machine started at 700 MHz when it has a 600 MHz CPU, then surely it would fail much, much sooner in the boot process.

I am wondering if there is a "DEBUG" switch setting that disables use of the cache... off to read the manual again.

EDIT: I did some further testing, and with debug flag "0x7890" you should be able to boot straight into POD/DEX and avoid access to cache... this flag value includes setting "Boot Stop Point" to "Memoryless". This avoids access to memory, and POD/DEX mode stops using cached access to RAM (unlike POD/CAC), I believe. I tested on my Fuel and will post logs showing the behaviour change after dinner.

Cheers from Oz,

jwhat/John.
(This post was last modified: 08-19-2022, 01:00 PM by jwhat.)
jwhat
08-19-2022, 06:51 AM
#34
I tried to dabble with the debug switch combos and got really irritated that nothing seemed to work (nothing changed in the PROM output that I could tell) when I ANDed or ORed the debug position values, so I stuck with what worked. I would be interested to know if "custom" debug switch combos actually work. I wasted like 30 mins and just called it... :-(
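As a sanity check on the arithmetic only (not on what the PROM does with the result), here is a small Python sketch of combining per-switch values. It assumes each DIP switch N contributes a single bit, bit N-1; the helper `switch_value` is hypothetical. Under that assumption, ORing builds the combined debug value, while ANDing two different single-bit values always gives zero.

```python
def switch_value(n):
    """Hypothetical helper: value of DIP switch n, assuming switch n maps to bit n-1."""
    return 1 << (n - 1)

# The switches whose bits add up to the 0x10d value used throughout this thread.
switches = [1, 3, 4, 9]

combined_or = 0
for n in switches:
    combined_or |= switch_value(n)   # OR accumulates one bit per switch

# AND of two different single-bit values has no bits in common.
combined_and = switch_value(1) & switch_value(3)

print(f"OR  of switches {switches}: {combined_or:#06x}")
print(f"AND of switches 1 and 3: {combined_and:#06x}")
```

So if the per-switch values were ANDed together, the L1 would have been handed 0x0000, which would explain seeing no change in behaviour for that case at least.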

As I mentioned, there is currently a low chance I might be getting some Fuel hardware to repair in the future. Should I get such projects, I'll be in a better position to try crazy stuff like this on a "beater" system. I'm still too scared to ruin my only Fuel with actual fiddling, but I think the debug mode checks and the memory struct dumps (and refinement of the direct struct access commands) are a safe and possibly helpful way of "getting closer" to getting there...

But I agree with you that there must be a place in the process where these can be changed during bootstrap, as you can obviously fiddle with the memory. I don't know whether the "flash" part is accessed directly or whether it's a shadow copy of the contents, but it seems like SGI MUST have had a way to solve this problem easily without handing out 800/900 MHz processors to techs :-). So altering the core values during init seems doable, as a temporary change to get into IRIX and then make the permanent flash change.
weblacky
08-19-2022, 09:49 AM
#35
Hi Weblacky,

here is my log and you can see the clear difference in behavior.

First with Debug: 0x7890:

>> 001a01:
>> debug switches set to 0x7890
>> ?-XXX.XXX.XXX.143-L2>l1 power up
>> ?-XXX.XXX.XXX.143-L2>
>> entering system console mode (001a01 CPU0), <CTRL_T> to escape to L2
>> *** DIP switch 15 set. Will skip IO and NUMAlink discovery
>> 
>> 
>> IP35 PROM SGI Version 6.211  built 04:16:18 PM Jan 25, 2008
>> Running in DDR mode
>> *** Warning: System controller debug switches are non-zero (0x7890)
>> *** Boot stop requested at Local (1)
>> *** Giving up global master status
>> Testing/Initializing memory ...............            DONE
>> Copying PROM code to memory ...............            DONE
>> Discovering NUMAlink connectivity .........
>> Local hub NUMAlink is down.
>> *** Local network link down
>> DONE
>> Found 1 objects (1 hubs, 0 routers) in 5886 usec
>> Waiting for peers to complete discovery....            DONE
>> No other nodes present; becoming global master
>> Global master is /hw/rack/001/bay/01
>> Intializing any CPUless nodes..............            DONE
>> Checking partitioning information .........            DONE
>> No other nodes present; becoming partition master
>> Suppressing error state display (system just powered on).
>> A 000 001c01:
>> A 000 001c01: *** Entering POD mode on node 0
>> A 000 001c01: POD SysCt Cac>

And now with what we have pretty much always used, debug 0x10d:

>> escaping to L2 system controller
>> ?-XXX.XXX.XXX.143-L2>debug 0x10d
>> 001a01:
>> debug switches set to 0x010d
>> 
>> re-entering system console mode (001a01 CPU0), <CTRL_T> to escape to L2
>> 
>> A 000 001c01: POD SysCt Cac> reset
>> A 000 001c01: Resetting the system...
>> Starting PROM Boot process
>> hubii_link_good: A-brick attached to module 001c01.
>> HUB at 0x0 attached as widget 0xa
>> 001c01/0xa/xbow_arb: nasid= 0x0 xbow_base= 0x9200000000000000
>> 001c01/0xa/xbow_arb: 622 master is 0xa
>> Check_master: link 10 is master
>> hubii_link_good: A-brick attached to module 001c01.
>> Check_master: link 10 is master
>> 
>> 
>> IP35 PROM SGI Version 6.211  built 04:16:18 PM Jan 25, 2008
>>  built for bedrock rev. 1.1 or greater
>> running in IP34 mode
>> Running in DDR mode
>> Local master CPU A revision: f42
>> PROM length: 0x168648, BSS length: 0xa7a0, flash count: 16
>> Configured bedrock clock: 200.0 MHz
>> Status of local IO: 0x1 0x3fc0fff6403
>> Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
>> On PROM entry: ERR_EPC=0xffffffffbfc00300 (0xc00000001fc00300)
>> Configuring memory
>> Local memory configured: 4096 MB (premium)
>> *** Warning: System controller debug switches are non-zero (0x10d)
>> *** Diag level set to None (2)
>> *** Info level set to verbose
>> *** Boot stop requested at Global (2)
>> before reading NICHub NIC: 0x52275dad
>> SR1 set to 0x0000081698349000
>> SR0 set to 0x0000000052275dad
>> Testing/Initializing memory ...............            DONE
>> Copying PROM code to memory ...............            Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x168648
>> Done
>> DONE
>> Skipping secondary cache diags
>> CPU A switching stack into UALIAS and invalidating D-cache
>> CPU A switching into node 0 cached RAM
>> CPU A running cached
>> Initializing kldir.
>> Done initializing kldir.
>> Initializing klconfig.
>> init_klcfg: nasid 0 start 9600000000030000 size 10000
>> Done initializing klconfig.
>> Discovering local IO ......................            Check_master: link 10 is master
>> Check_master: link 10 is master
>> DONE
>> CPU A initialized subnode
>> Discovering NUMAlink connectivity .........
>> Local hub NUMAlink is down.
>> *** Local network link down
>> DONE
>> Found 1 objects (1 hubs, 0 routers) in 5889 usec
>> Waiting for peers to complete discovery....            Discovery results:
>> ENTRY 0: HUB(52275dad)
>>    NASID=-1 Mod=1 Flg=0x9500000 PROM=6.211 Route=N/A
>>    MODULE=001c01 PARTITION=0 SPACE=RESET
>>    Port 1 connection: Not connected
>>    Port status: NF
>> DONE
>> No other nodes present; becoming global master
>> Global master is entry 0, NIC 0x52275dad, /hw/rack/001/bay/01
>> Global master is /hw/rack/001/bay/01
>> Global barrier (line 4315)Global barrier passed.
>> Global barrier (line 4348)Global barrier passed.
>> Master System Topology Graph (pre-nasid_assign):
>> ENTRY 0: HUB(52275dad)
>>    NASID=-1 Mod=1 Flg=0x9500000 PROM=6.211 Route=N/A
>>    MODULE=001c01 PARTITION=0 SPACE=RESET
>>    Port 1 connection: Not connected
>>    Port status: NF
>> Calculating NASIDs
>> num_routers is 0
>> Master System Topology Graph:
>> ENTRY 0: HUB(52275dad)
>>    NASID=0 Mod=1 Flg=0x9500000 PROM=6.211 Route=N/A
>>    MODULE=001c01 PARTITION=0 SPACE=RESET
>>    Port 1 connection: Not connected
>>    Port status: NF
>> Distributing routing tables
>> Distributing NASIDs
>> *** NASID assigned to 0
>> CPU A switching to UALIAS
>> CPU A running in UALIAS
>> Changing node ID to 0
>> Global barrier (line 4823)Global barrier passed.
>> CPU A Flushing and invalidating caches
>> Global barrier (line 4928)Global barrier passed.
>> CPU A switching to node 0 cached RAM
>> CPU A running cached
>> Nasids in partition:  +0
>> Regions in partition:  +0
>> Intializing any CPUless nodes..............            Global barrier (line 7714)Global barrier passed.
>> Global barrier (line 7715)Global barrier passed.
>> DONE
>> Global barrier (line 5089)Global barrier passed.
>> hubii_link_good: A-brick attached to module 001c01.
>> Checking partitioning information .........            DONE
>> No other nodes present; becoming partition master
>> *** After partitioning ***
>> ENTRY 0: HUB(52275dad)
>>    NASID=0 Mod=1 Flg=0x9500000 PROM=6.211 Route=N/A
>>    MODULE=001c01 PARTITION=0 SPACE=RESET
>>    Port 1 connection: Not connected
>>    Port status: FE
>> Erecting partition fences ................                        DONE
>> Update config for routers connected to hubs
>> Update config for hubs and hubless routers
>> CPU A flushing cache
>> check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
>> Global barrier (line 5300)Global barrier passed.
>> Nasids in partition:  +0
>> Regions in partition:  +0
>> A 000 001c01:
>> A 000 001c01: *** Entering POD mode on node 0
>> A 000 001c01: POD SysCt Cac>

You can clearly see all the extra stuff that is occurring with the more complete boot sequence.

EDIT #2: On further checking of the set values, the change is due to flags suppressing log output... I fixed the debug flag calculator to generate the right flags.

I have created a "debug flag" cheat sheet. I will post it once I can render it as an easily readable graphic:

[Image: chimera-debug-flags-08.png]

And here is my "Dip Switch Calculator" (excel spreadsheet).

EDIT #1: Looking at the more complete boot log, see "Configured bedrock clock: 200.0 MHz". This would appear to be the point where it gets data from the PROM speed configuration settings. Given the default values provided when doing a "flash" PROM update, if the machine does change clock as part of the boot process, then it is likely to start at 400 MHz (which is the default and a lower speed than any sold configuration of Fuel).

EDIT #3: I found an error in the DIP switch calculations, due to not ensuring the Least Significant Bit (LSB) order was correct (DIP switch IDs go LSB first from left to right, but binary convention has the LSB on the right, so the switches need to be read right to left). So I fixed that with v0.2 of the spreadsheet.
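The bit-order pitfall can be shown in a few lines of Python. This is only a sketch of the idea, not the spreadsheet itself: it assumes 16 switches, written as a string of 0/1 with switch 1 on the left, and the function names `dip_to_debug` and `debug_to_dip` are made up for illustration.

```python
def dip_to_debug(switches):
    """Convert a switch string (leftmost char = switch 1 = LSB) to a debug value."""
    # Reverse so switch 1 lands in the rightmost, least significant position.
    return int(switches[::-1], 2)

def debug_to_dip(value, width=16):
    """Inverse: debug value to a switch string with switch 1 first (width is an assumption)."""
    return format(value, f"0{width}b")[::-1]

# Switches 1, 3, 4 and 9 on, written left to right as they sit on the bank.
bank = "1011000010000000"

wrong = int(bank, 2)          # reading the bank as if it were ordinary binary
right = dip_to_debug(bank)    # reading it right to left, LSB first

print(f"naive left-to-right read: {wrong:#06x}")
print(f"LSB-first read:           {right:#06x}")
print(f"0x7890 as switches:       {debug_to_dip(0x7890)}")
```

The naive read gives 0xb080 where the LSB-first read gives 0x010d, which is the kind of mismatch a wrong bit order produces.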

EDIT #4: When I set the debug flag to boot "memoryless" (debug 0x011f) the boot hangs... I tried with keyboard/mouse plugged in and pulled out, and with console via L1 USB and the first serial port, but could not get the Fuel to boot directly into POD/DEX mode and so avoid cache. It might be that you actually have to have a machine with NO RAM for this to work.

Cheers from Oz,

jwhat/John.
(This post was last modified: 08-22-2022, 11:28 AM by jwhat.)
jwhat
08-19-2022, 10:13 AM
#36
(08-19-2022, 10:13 AM)jwhat Wrote:  And here is my "Dip Switch Calculator" (excel spreadsheet).


my man.. this is f*kn awesome...

riddle me this... any good docs on your side for POD exploring?  (any docs at all really, if they haven't seen the light of day..)

let me know!

Indigo2 IMPACT  : R10K-195MHz, 1GB RAM, 146GB 15K, CD-ROM, AudioDAT, MaxImpact w/ TRAM.  IRIX 6.5.22

O2 : R12K-400MHz, 1GB RAM, 300GB 15K, DVD-ROM, CRM Graphics, AV1/2 Media Boards & O2 Cam, DV-Link, FPA & SW1600.  IRIX 6.5.30

 : 2 x R14K-600MHz, 6GB RAM, V12 Graphics, PCI Shoebox.  IRIX 6.5.30

IBM  : 7012-39H, 7043-140

chulofiasco
Hardware Junkie

Trade Count: (0)
Posts: 328
Threads: 51
Joined: May 2019
Location: New York, NY
08-24-2022, 09:37 PM
#37
The only ones I've seen are the Onyx and Origin handbooks, which go DEEPLY into POD mode for those machines... nothing else.
weblacky
08-25-2022, 03:25 AM
#38
Hi Chulofiasco,

the main reference is on web archive: https://archive.org/details/Origin2k_Har...e/mode/2up

Then see “man prom”.

The only additional thing I found was the use of DIP switch 15 (bit 14), which disables NUMAlink discovery.

Once you know the uses, it gives you a bit more confidence to test the values, and decipher the behavior.

Cheers from Oz,

jwhat/John
jwhat
08-26-2022, 09:55 AM

