######UPDATE 3: THERE IS A PERSISTENT MEMORY SIZING FAILURE. No matter WHAT I insert, one of the slots fails to recognize the module size and defaults to 256MB. Also, as I change RAM sticks and densities, the XBOW status register failure error address MOVES slightly! Even with totally different sticks in ALL slots (no original RAM), I get the same error and sizing issues.
Now the real question: is this THE ISSUE or just AN ISSUE? Would a memory sizing failure resulting in an XBOW error stop the system from booting? Anyone know? I mean, I'd love to solve this, but I'm going to throw a tantrum if I solve the memory sizing issue and it's NOT what's preventing boot!
Okay, here is my evidence. Notice that the XBow error register value (the command word, 0xffffffffaa018000 vs. 0xffffffffaa020000 in the two dumps) MOVES a tiny bit with each memory change!
Okay, here is the POD boot info when I had JUST one interleaved bank of 512MB + 512MB = 1GB:
Code:
IP35 PROM SGI Version 6.210 built 02:33:51 PM Aug 26, 2004
built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc02ac0 (0xc00000001fc02ac0)
Configuring memory
Local memory configured: 1024 MB (standard)
*** Warning: System controller debug switches are non-zero (0x2d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
*** Ignoring env. vars/using defaults
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ............... DONE
Copying PROM code to memory ............... Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ...................... Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30360 usec
Waiting for peers to complete discovery.... Discovery results:
ENTRY 0: HUB(5455827f)
NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition: +0
Regions in partition: +0
Intializing any CPUless nodes.............. Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
*** Nasid 0: Memory bank 0 previously had 256 MB but now has 512 MB
*** Nasid 0: Memory bank 1 previously had 256 MB but now has 512 MB
Checking partitioning information ......... DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: FE
Erecting partition fences ................ DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition: +0
Regions in partition: +0
A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> error_dump
Hardware Error State: (Forced error dump)
+ Errors on node Nasid 0x0 (0)
+ XBow in /hw/module/174562
+ BEDROCK signalled following errors.
+ XBow Link a status register: 0xffffffff80020000
+ 17: Illegal destination
+ XBow error command word register: 0xffffffffaa018000
+ XBow error upper address register: 0x0
+ XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac>
Then I decided to add more of MY RAM, totally removing the old RAM (same part number, same densities! ALL 512MB modules from a previous Tezro upgrade):
Code:
IP35 PROM SGI Version 6.210 built 02:33:51 PM Aug 26, 2004
built for bedrock rev. 1.1 or greater
running in IP34 mode
Running in DDR mode
Local master CPU A revision: f41
PROM length: 0x1686a8, BSS length: 0xa7a0, flash count: 9
Configured bedrock clock: 200.0 MHz
Status of local IO: 0x1 0x3fc0fff6403
Bedrock Rev: 2, Module: 1 (001c01) from Sys Ctlr
On PROM entry: ERR_EPC=0xc00000001fc02abc (0xc00000001fc02abc)
Configuring memory
*** Memory sizing failure:
*** Bank 2 (512 MB) and bank 3 (256 MB) sizes differ,
*** treating them both as 256 MB
Local memory configured: 1792 MB (standard)
*** Warning: System controller debug switches are non-zero (0x2d)
*** Diag level set to None (2)
*** Info level set to verbose
*** Boot stop requested at Global (2)
*** Ignoring env. vars/using defaults
before reading NICHub NIC: 0x5455827f
SR1 set to 0x0000080690349000
SR0 set to 0x000000005455827f
Testing/Initializing memory ............... DONE
Copying PROM code to memory ............... Copy PROM (0x9000000018000000) to RAM (0x9600000001a00000), len 0x1686a8
Done
DONE
Skipping secondary cache diags
CPU A switching stack into UALIAS and invalidating D-cache
CPU A switching into node 0 cached RAM
CPU A running cached
Initializing kldir.
Done initializing kldir.
Initializing klconfig.
init_klcfg: nasid 0 start 9600000000030000 size 10000
Done initializing klconfig.
Discovering local IO ...................... Check_master: link 10 is master
Check_master: link 10 is master
DONE
CPU A initialized subnode
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30358 usec
Waiting for peers to complete discovery.... Discovery results:
ENTRY 0: HUB(5455827f)
NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: NF
DONE
No other nodes present; becoming global master
Global master is entry 0, NIC 0x5455827f, /hw/rack/001/bay/01
Global master is /hw/rack/001/bay/01
Global barrier (line 4315) \Global barrier passed.
Global barrier (line 4348) \Global barrier passed.
Master System Topology Graph (pre-nasid_assign):
ENTRY 0: HUB(5455827f)
NASID=-1 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: NF
Calculating NASIDs
num_routers is 0
Master System Topology Graph:
ENTRY 0: HUB(5455827f)
NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: NF
Distributing routing tables
Distributing NASIDs
*** NASID assigned to 0
CPU A switching to UALIAS
CPU A running in UALIAS
Changing node ID to 0
Global barrier (line 4823) \Global barrier passed.
CPU A Flushing and invalidating caches
Global barrier (line 4928) \Global barrier passed.
CPU A switching to node 0 cached RAM
CPU A running cached
Nasids in partition: +0
Regions in partition: +0
Intializing any CPUless nodes.............. Global barrier (line 7714) \Global barrier passed.
Global barrier (line 7715) \Global barrier passed.
DONE
Global barrier (line 5089) \Global barrier passed.
hubii_link_good: A-brick attached to module 001c01.
Checking partitioning information ......... DONE
No other nodes present; becoming partition master
*** After partitioning ***
ENTRY 0: HUB(5455827f)
NASID=0 Mod=1 Flg=0x1500000 PROM=6.210 Route=N/A
MODULE=001c01 PARTITION=0 SPACE=RESET
Port 1 connection: Not connected
Port status: FE
Erecting partition fences ................ DONE
Update config for routers connected to hubs
Update config for hubs and hubless routers
CPU A flushing cache
check_router_cfg: nasid 0 is_voyager 0 check_cfg = 0
Global barrier (line 5300) \Global barrier passed.
Nasids in partition: +0
Regions in partition: +0
A 000: *** Entering POD mode on node 0
A 000: POD IOC3 Cac> error_dump
Hardware Error State: (Forced error dump)
+ Errors on node Nasid 0x0 (0)
+ XBow in /hw/module/174562
+ BEDROCK signalled following errors.
+ XBow Link a status register: 0xffffffff80020000
+ 17: Illegal destination
+ XBow error command word register: 0xffffffffaa020000
+ XBow error upper address register: 0x0
+ XBow error lower address register: 0x0
END Hardware Error State (Forced error dump)
A 000: POD IOC3 Cac>
Notice what I get once I've placed ALL 512MB sticks in the DIMM slots. (Believe me, I popped them out and checked the labels. I know all my sticks were identical from my Tezro's previous memory layout...no way this is a coincidence.)
*** Memory sizing failure:
*** Bank 2 (512 MB) and bank 3 (256 MB) sizes differ,
*** treating them both as 256 MB
Local memory configured: 1792 MB (standard)
No matter what I insert, Bank 3 cannot determine its size! If I install only a single bank of modules (leaving bank 3 empty), I don't get a sizing error, but the XBOW error remains. And since the error value MOVES with the memory layout...it's got to be ECC status registers or something else related to the memory DIMMs? Right?
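To pin down exactly what "moves" between the two dumps, here is a minimal sketch that XORs the register values from both runs. The values are copied by hand from the logs above; the dictionary and function names are just made up for this illustration. I don't have the bit layout of the XBow error command word register, so this only shows WHICH bits differ, not what they mean.

```python
# Register values copied by hand from the two POD dumps above.
# RUN_1GB   = first dump (one interleaved bank, 2 x 512MB)
# RUN_4X512 = second dump (all slots filled with 512MB sticks)
RUN_1GB = {
    "ERR_EPC": 0xC00000001FC02AC0,
    "XBow link status": 0xFFFFFFFF80020000,
    "XBow error command word": 0xFFFFFFFFAA018000,
    "XBow error upper address": 0x0,
    "XBow error lower address": 0x0,
}
RUN_4X512 = {
    "ERR_EPC": 0xC00000001FC02ABC,
    "XBow link status": 0xFFFFFFFF80020000,
    "XBow error command word": 0xFFFFFFFFAA020000,
    "XBow error upper address": 0x0,
    "XBow error lower address": 0x0,
}


def changed_bits(before, after):
    """Return a mask with a 1 in every bit position where the values differ."""
    return before ^ after


def bit_positions(mask):
    """List the bit numbers (0 = least significant) that are set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def compare(run_a, run_b):
    """Map each register name to the XOR mask of its two recorded values."""
    return {name: changed_bits(run_a[name], run_b[name]) for name in run_a}


for name, mask in compare(RUN_1GB, RUN_4X512).items():
    state = "unchanged" if mask == 0 else f"differs in bits {bit_positions(mask)}"
    print(f"{name:28s} xor=0x{mask:x}  {state}")
```

Going by the values in the dumps, the link status register and both error address registers are identical across the two runs (the address registers read 0x0 both times). Only the command word changes (bits 15 through 17), plus ERR_EPC on PROM entry, which is 4 bytes lower in the second run. So strictly speaking it is the command word that moves, not the error address.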
Now, RAM slots are normally through-hole parts, so a sufficiently hot desoldering gun SHOULD be able to put enough heat into the joints to reflow the pins from the underside of the memory slot. Before I do that, I could look for a break by checking continuity between the bottom test pads (I assume they are there) and the memory slot "fingers". It will take a while.
The previous owner had an invalid memory config that the system dumbed down (I noticed this, but I couldn't boot, so I had other issues to deal with). He had 3 x 256MB and 1 x 512MB, and oddly enough my pics show he had the 512MB module IN bank 3. So did he use the memory sizing failure to make everything run as 256MB DIMMs?
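Here is a rough sketch of the downgrade rule as the PROM message words it ("sizes differ, treating them both as" the smaller size). It is NOT taken from PROM source. The pairing of banks (0,1) and (2,3) is an assumption based on the "Bank 2 ... and bank 3" message, and skipping empty banks is an assumption based on the fact that I get no sizing error with bank 3 empty. The function name is made up.

```python
def effective_bank_sizes(detected_mb):
    """Apply the 'sizes differ, treat both as the smaller' rule pairwise.

    detected_mb: sizes in MB as detected for banks 0, 1, 2, 3 (0 = empty).
    Assumption: banks are paired as (0, 1) and (2, 3).
    Assumption: a pair with an empty bank is left alone.
    """
    effective = list(detected_mb)
    for first in range(0, len(effective) - 1, 2):
        a, b = effective[first], effective[first + 1]
        if a and b and a != b:
            # Mismatched pair: both banks are treated as the smaller size.
            effective[first] = effective[first + 1] = min(a, b)
    return effective


# My all-512MB layout, with bank 3 mis-detected as 256MB (per the log).
print(effective_bank_sizes([512, 512, 512, 256]))

# Previous owner's layout: 3 x 256MB plus a 512MB module in bank 3.
print(effective_bank_sizes([256, 256, 256, 512]))
```

Under these assumptions, the previous owner's layout comes out as four 256MB banks whether or not bank 3 sizes correctly, so his setup would have hidden the fault. One thing I can't explain: the detected sizes in my run (512 + 512 + 512 + 256) add up to the 1792 MB the PROM printed, while the downgraded sizes would add up to 1536 MB. I don't know which figure the PROM actually uses afterwards.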
Given he had this layout, though...I assume he booted at one point...so I'm worried that even if I'm right about this sizing failure being a cracked solder joint...he may have been booting with that fault present this whole time! Which would mean this pre-power check error is unrelated!
Well, I have NO OTHER ERRORs to go on...I can only work with what's right in front of me. So since I know the bank with the issue, I guess I'll look into taking the mainboard out again and doing a probe/reflow on the 3rd DIMM bank.
I'm still worried that I cannot find any reference to this pre-power check error...that simple error is causing me no end of sleeplessness....