Need help troubleshooting broken SGI Tezro -
Dzeimis - 10-24-2018
Hi everyone
I posted about my problem on nekochan a year ago (which I was directed to by reddit where I asked about my issue first), but haven't found any solution since then and just put my problem away to deal with later...
Long story short, my Tezro doesn't boot and I have no idea why. The front panel showed only a solid red LED, which according to the manual means "System node board failure (failed to read PROM at power on)". There were some bent pins between the motherboard and the node board (pics in the reddit post), which I think I fixed, but I'm not 100% sure. It still doesn't boot.
I had another computer connected to serial console and tried several things which I don't remember exactly, but I found some saved snippets on my laptop which are attached to this post. I don't remember what I did to cause general exception or controller firmware panic...
Let me know if there's anything I can do to provide more information. I will try booting this system when I get back home this weekend.
Can anybody with a working Tezro tell me if they also get messages "Cannot enable VRM: x" in console when the system is just started (or plugged in to wall socket... Don't remember anymore)?
Edit: forgot to actually attach Tezro output that I had saved.
RE: Need help troubleshooting broken SGI Tezro -
mamorim01 - 10-24-2018
I think I saw your response on a reddit thread a few days ago, including some pictures from a much earlier thread you had posted. I reckon you will find this to be a much better place to seek help, but you should find a way to post those pictures so people can look into them in more detail. I do not believe those connectors you showed were compression connectors but, still, the pins looked somewhat deformed and I can see how a bad connection there could be causing havoc.
Best of luck bringing it back to life, hope it is something you can manage to fix!
Edit: Found the reddit thread, these are the pins I was referring to. If you check closely some of them seem to be misaligned. It *may* be nothing, though.
By the way, you may wish to paste the contents of the file you attached as it is not downloading properly.
RE: Need help troubleshooting broken SGI Tezro -
Dzeimis - 10-24-2018
Yep, it was my post. I tried straightening the pins as much as I could without risking breaking them. As far as I remember, the bent pins were still slightly noticeable afterwards, but not so much that I'd expect them to fail... I guess I'll see if I can straighten them further when I get to it.
Strange that the attachment doesn't work. It downloads fine for me. Anyway, just in case, copying snippets that are there:
Code:
INFO: Cannot enable VRM: 9
INFO: Cannot enable VRM: 10
INFO: Cannot enable VRM: 11
SGI SN1 L1 Controller
Firmware Image B: Rev. 1.26.5, Built 12/15/2003 12:58:02
Code:
CPU A: 0x00: PLED_RESET: Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU B: < CPU not present >
CPU C: 0x00: PLED_RESET: Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU D: < CPU not present >
Code:
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xc00000001fc41180 (0xc00000001fc41180)
A 000 001c01: *** Press ENTER to continue.
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xc00000001fc41148 (0xc00000001fc41148)
A 000 001c01: *** Press ENTER to continue.
Code:
INFO: Cannot enable VRM: 9
INFO: Cannot enable VRM: 10
INFO: Cannot enable VRM: 11
****************************************
controller firmware panic! resetting...
****************************************
IMAGE B: Rev. 1.26.5
[thread ID 30004b44 stack]
TR: fff6f158 fff6ea10 fff845d4 fff848ba fff84930 fff596f6 fff4f1a0
TR: fff6510a fff10e88 fff7fb7e fff131b0 fff13b20 fff3be50 00000000
(if you see this, please email ssh@sgi.com and include
the output from the 'log' command and a description of
what caused the problem)
SGI SN1 L1 Controller
Firmware Image B: Rev. 1.26.5, Built 12/15/2003 12:58:02
07/12/17 03:38:10 power up (COMMAND)
07/12/17 03:38:15 Node 0 IP53 XTalk clock 88
07/12/17 03:38:18 reset again MIPS
07/12/17 03:38:18 Cooling system stabilized
07/12/17 03:38:22 Node 0 IP53 XTalk clock 88
07/12/17 03:40:03 reset (COMMAND)
07/12/17 03:40:04 Node 0 IP53 XTalk clock 88
07/12/17 03:40:33 soft reset (COMMAND)
07/12/17 03:40:33 PANIC: ioExp.c line 571 ; Illegal I/O expander index: 35
07/12/17 03:40:34 L1 booting 1.26.5
07/12/17 03:40:35 CONTROLLER FIRMWARE PANIC!
07/12/17 03:40:35 IMAGE B: Rev. 1.26.5
07/12/17 03:40:35 [thread ID 30004b44 stack]
07/12/17 03:40:35 TR: fff6f158 fff6ea10 fff845d4 fff848ba fff84930 fff596f6 fff4f1a0
07/12/17 03:40:35 TR: fff6510a fff10e88 fff7fb7e fff131b0 fff13b20 fff3be50 00000000
07/12/17 03:40:36 Cooling system stabilized
07/12/17 03:40:36 USB0: waiting on open
Code:
001c01-L1>leds
CPU A: 0x2a: PLED_MAKESTACK
CPU B: < CPU not present >
CPU C: 0x2a: PLED_MAKESTACK
CPU D: < CPU not present >
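For anyone who wants to sift through a longer L1 `log` capture, here is a minimal Python sketch that pulls out the timestamped entries and lists the ones mentioning a panic. It assumes the `MM/DD/YY HH:MM:SS message` layout seen in the capture above; the function names (`parse_l1_log`, `find_panics`) are made up for illustration and are not part of any SGI tool.

```python
import re

# Assumed layout of an L1 log line: "MM/DD/YY HH:MM:SS message".
LOG_LINE = re.compile(r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def parse_l1_log(text):
    """Return a list of (date, time, message) tuples.

    Lines without a timestamp are treated as terminal-wrapped
    continuations and glued onto the previous message.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        match = LOG_LINE.match(line)
        if match:
            entries.append([match.group(1), match.group(2), match.group(3)])
        elif entries:
            entries[-1][2] += line.strip()
    return [tuple(entry) for entry in entries]


def find_panics(entries):
    """Return the entries whose message mentions a panic (case-insensitive)."""
    return [entry for entry in entries if "PANIC" in entry[2].upper()]


# A few lines copied from the capture in this post.
SAMPLE = """\
07/12/17 03:40:33 soft reset (COMMAND)
07/12/17 03:40:33 PANIC: ioExp.c line 571 ; Illegal I/O expander index: 35
07/12/17 03:40:34 L1 booting 1.26.5
07/12/17 03:40:35 CONTROLLER FIRMWARE PANIC!
07/12/17 03:40:35 TR: fff6f158 fff6ea10 fff845d4 fff848ba fff84930 fff596f6 f
ff4f1a0
"""

if __name__ == "__main__":
    for date, time, message in find_panics(parse_l1_log(SAMPLE)):
        print(date, time, message)
```

In the capture above, the first PANIC entry carries the same timestamp as the `soft reset (COMMAND)` entry, which is the kind of ordering a listing like this makes easier to spot.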
RE: Need help troubleshooting broken SGI Tezro -
mamorim01 - 10-24-2018
(10-24-2018, 09:18 PM)Dzeimis Wrote: Yep, it was my post. I tried straightening the pins as much as I could without risking breaking them. As far as I remember, the bent pins were still slightly noticeable afterwards, but not so much that I'd expect them to fail... I guess I'll see if I can straighten them further when I get to it.
I would wait until someone has looked at your logs and can make sense of them before trying to fix the alignment of those pins any further. There may be further troubleshooting procedures I'm not aware of; this is unfortunately above my pay grade, so wait until other Tezro owners notice your post and chime in.
Good luck!
RE: Need help troubleshooting broken SGI Tezro -
gijoe77 - 10-25-2018
I don't really know what the problem is, but I do wonder why it's complaining about the 3 VRM's...
Was this Tezro working one day then it stopped?
RE: Need help troubleshooting broken SGI Tezro -
mamorim01 - 10-25-2018
(10-25-2018, 05:02 PM)gijoe77 Wrote: I don't really know what the problem is, but I do wonder why it's complaining about the 3 VRM's...
Was this Tezro working one day then it stopped?
I am in way over my head, but it could be a communication issue, which would be consistent with the damage observed in the connector per the image above. Judging from documentation online, the VRMs seem to sit on the same board as the DIMMs and CPUs (the system node board). If there is no proper connection to this board, then any attempt to reach the VRMs or probe them for voltage readings would presumably result in some kind of error. This would also be consistent with the messages about absent CPUs, which are also mounted on the node board.
To quote the OP:
Quote:There were some bent pins between motherboard and node board (pics in reddit post) which I think I fixed, but not 100% sure about it.
RE: Need help troubleshooting broken SGI Tezro -
adlihajarat - 09-29-2019
Hello there,
The problem with the Tezro is not as obvious as many might think.
It has to do with the system serial number.
You need to get hold of an L2 controller and feed the Tezro a valid serial number.
When the system sees valid serial numbers (BSN and SSN), it will boot.
Before you boot, make sure the system has a valid date by issuing the date command on the L2 controller.
When the system boots, stop for maintenance and do the same on the Tezro.
Hope this helps
![[Image: lWpPaxm.png]](https://i.imgur.com/lWpPaxm.png)
![[Image: gb19D1J.png]](https://i.imgur.com/gb19D1J.png)
RE: Need help troubleshooting broken SGI Tezro -
jan-jaap - 09-29-2019
(10-25-2018, 05:02 PM)gijoe77 Wrote: I don't really know what the problem is, but I do wonder why it's complaining about the 3 VRM's...
Was this Tezro working one day then it stopped?
I'm with you on this one. You can sometimes see messages like that when optional hardware isn't present, but I don't think it applies to the Tezro.
Since you've got a working L1 connection, why not check voltages when the system is 'on'? It should be similar to this (if you don't have an IP59 at least):
Code:
001c01-L1> env
Environmental monitoring is enabled and running.
Description    State      Warning Limits    Fault Limits      Current
-------------- ---------- ----------------- ----------------- -------
1.8V           Enabled    10%  1.62/  1.98  20%  1.44/  2.16    1.791
12V            Enabled    10% 10.80/ 13.20  20%  9.60/ 14.40   12.063
12V #2         Enabled    10% 10.80/ 13.20  20%  9.60/ 14.40   12.188
3.3V           Enabled    10%  2.97/  3.63  20%  2.64/  3.96    3.337
2.5V           Enabled    10%  2.25/  2.75  20%  2.00/  3.00    2.509
12V IO         Enabled    10% 10.80/ 13.20  20%  9.60/ 14.40   12.063
5V AUX         Enabled    10%  4.50/  5.50  20%  4.00/  6.00    5.044
3.3V AUX       Enabled    10%  2.97/  3.63  20%  2.64/  3.96    3.268
5V             Enabled    10%  4.50/  5.50  20%  4.00/  6.00    5.070
XIO 12V BIAS   Enabled    10% 10.80/ 13.20  20%  9.60/ 14.40   12.000
XIO 5V         Enabled    10%  4.50/  5.50  20%  4.00/  6.00    5.044
XIO 2.5V       Enabled    10%  2.25/  2.75  20%  2.00/  3.00    2.457
XIO 3.3V AUX   Enabled    10%  2.97/  3.63  20%  2.64/  3.96    3.285
IP53 3.3V AUX  Enabled    10%  2.97/  3.63  20%  2.64/  3.96    3.302
IP53 5V AUX    Enabled    10%  4.50/  5.50  20%  4.00/  6.00    5.044
IP53 12V       Enabled    10% 10.80/ 13.20  20%  9.60/ 14.40   12.063
IP53 SRAM      Enabled    10%  2.25/  2.75  20%  2.00/  3.00    2.483
IP53 1.5V      Enabled    10%  1.35/  1.65  20%  1.20/  1.80    1.480
IP53 VCPU      Enabled    10%  1.13/  1.38  20%  1.00/  1.50    1.241

Description     State      Warning RPM Current RPM
--------------- ---------- ----------- -----------
FAN 0 NODE 1    Enabled    1800        2096
FAN 1 NODE 2    Enabled    1800        2096
FAN 2 NODE 3    Enabled    1800        2163
FAN 3 PCI 1     Enabled    1350        1486
FAN 4 PCI 2     Enabled    1350        1493
FAN 5 HD        Enabled    1620        3308
FAN 6 ODY 1     Enabled    1350        1704
FAN 7 ODY 2     Enabled    1350        1607

                             Advisory  Critical  Fault     Current
Description       State      Temp      Temp      Temp      Temp
----------------- ---------- --------- --------- --------- ---------
0 INTERFACE 0     Enabled    [Autofan Control]   76C/168F  35C/ 95F
1 INTERFACE 1     Enabled    [Autofan Control]   76C/168F  32C/ 88F
2 INTERFACE 2     Enabled    [Autofan Control]   76C/168F  30C/ 86F
3 INTERFACE 3     Enabled    [Autofan Control]   76C/168F  42C/107F
4 ODYSSEY         Enabled    [Autofan Control]   76C/168F  50C/122F
5 NODE            Enabled    [Autofan Control]   76C/168F  46C/114F
6 BEDROCK         Enabled    [Autofan Control]   85C/185F  47C/116F

                   Zone Temp    Target   Current  Zone Fan  Curr/Min
Zone Name State    Sensors      Average  Average  Index     Fan %
--------- -------- ------------ -------- -------- --------- ---------
Node      Enabled  5,6          62C/143F 46C/114F 0         46%/ 46%
PCI       Enabled  0,1,2,3      45C/113F 36C/ 96F 3,4       57%/ 57%
ODY       Enabled  4            50C/122F 50C/122F 6         65%/ 61%
HD        Enabled  5            40C/104F 44C/112F 5         52%/ 38%
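If you capture your own `env` output, a small script can compare each voltage against the listed limits instead of eyeballing every row. The following is only a sketch in Python: it assumes the voltage rows look like the ones above (name, state, warning percentage and low/high pair, fault percentage and low/high pair, current reading), and the names `parse_env` and `classify` are invented for this example. The last sample row is hypothetical and only there to show what a sagging rail would look like.

```python
import re

# Voltage row: name, state, warn %, warn low/high, fault %, fault low/high, current.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<state>\S+)\s+"
    r"\d+%\s+(?P<wlo>[\d.]+)/\s*(?P<whi>[\d.]+)\s+"
    r"\d+%\s+(?P<flo>[\d.]+)/\s*(?P<fhi>[\d.]+)\s+"
    r"(?P<cur>[\d.]+)\s*$"
)


def parse_env(text):
    """Return one dict per voltage row; other rows are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # headers, fan, temperature and zone rows do not match
        row = {"name": match["name"], "state": match["state"]}
        for key in ("wlo", "whi", "flo", "fhi", "cur"):
            row[key] = float(match[key])
        rows.append(row)
    return rows


def classify(row):
    """Compare the current reading against the fault and warning limits."""
    cur = row["cur"]
    if not row["flo"] <= cur <= row["fhi"]:
        return "FAULT"
    if not row["wlo"] <= cur <= row["whi"]:
        return "WARNING"
    return "ok"


# Rows copied from the output above, plus one non-voltage row that is ignored.
SAMPLE = """\
1.8V Enabled 10% 1.62/ 1.98 20% 1.44/ 2.16 1.791
12V #2 Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 12.188
IP53 VCPU Enabled 10% 1.13/ 1.38 20% 1.00/ 1.50 1.241
FAN 0 NODE 1 Enabled 1800 2096
"""

# HYPOTHETICAL row (not from any real capture): a 1.8V rail sagging to 1.22 V.
HYPOTHETICAL_SAG = "1.8V Enabled 10% 1.62/ 1.98 20% 1.44/ 2.16 1.220"

if __name__ == "__main__":
    for row in parse_env(SAMPLE + HYPOTHETICAL_SAG):
        print(f"{row['name']:<14} {row['cur']:>7.3f}  {classify(row)}")
```

The regular expression only requires whitespace between fields, so it should not matter whether the columns in your capture are aligned or collapsed to single spaces.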
The SSN (system serial number) can be checked from the L1 as well:
Code:
001c01-L1> serial all
Data                           Location     Value
------------------------------ ------------ --------
Local System Serial Number     NVRAM        P10003xx
Reference System Serial Number NVRAM        P10003xx
Local Brick Serial Number      EEPROM       NBY982
Reference Brick Serial Number  NVRAM        NBY982

EEPROM     Product Name   Serial        Part Number          Rev T/W
---------- -------------- ------------- -------------------- --- ------
INTERFACE  WS_INT_53      NBY982        030_1881_007         A   00
IO9        IO9            NFW612        030_1771_005         A   00
ODYSSEY    ODY128B1_2     NNB535        030_1884_005         B   00
SNOWBALL   no hardware detected
NODE       IP53_4CPU      NER934        030_1956_002         C   00
IO DGHTR   CHWS_IO_DAUG   NEB185        030_1875_003         A   00

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     CE000000000000000CA3E100 M3 46L2820ET3-CA0  3E   10.0   N/A
DIMM 2     CE000000000000000C9BE100 M3 46L2820ET3-CA0  3E   10.0   N/A
DIMM 4     CE000000000000000CDE4800 M3 46L2820BT2-CA0  2B   10.0   N/A
DIMM 6     CE000000000000000CE64200 M3 46L2820BT2-CA0  2B   10.0   N/A
DIMM 1     CE000000000000000C97E100 M3 46L2820ET3-CA0  3E   10.0   N/A
DIMM 3     CE000000000000000C8BE100 M3 46L2820ET3-CA0  3E   10.0   N/A
DIMM 5     CE000000000000000CF04200 M3 46L2820BT2-CA0  2B   10.0   N/A
DIMM 7     CE000000000000000CE64800 M3 46L2820BT2-CA0  2B   10.0   N/A
I wouldn't separate the nodeboard from the backplane unless there is no other option. There's no logical explanation for these connectors to get damaged unless you take the system apart.
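Since the serial number theory came up in this thread, here is a rough Python sketch for comparing the "Local" and "Reference" values in a saved `serial all` capture. It assumes the four "Serial Number" rows look like the ones in the output above; `serial_mismatches` is a made-up helper name, and whether a mismatch actually stops a Tezro from booting is not something this thread establishes.

```python
import re

# Rows such as "Local System Serial Number NVRAM P10003xx".
SERIAL_ROW = re.compile(
    r"^\s*(?P<scope>Local|Reference) (?P<kind>System|Brick) Serial Number\s+"
    r"(?P<location>\S+)\s+(?P<value>\S+)\s*$"
)


def serial_mismatches(text):
    """Return (kind, local, reference) for each kind whose values are missing or differ."""
    seen = {}
    for line in text.splitlines():
        match = SERIAL_ROW.match(line)
        if match:
            seen[(match["kind"], match["scope"])] = match["value"]
    mismatched = []
    for kind in ("System", "Brick"):
        local = seen.get((kind, "Local"))
        reference = seen.get((kind, "Reference"))
        if local is None or reference is None or local != reference:
            mismatched.append((kind, local, reference))
    return mismatched


# Rows copied from the `serial all` output above.
SAMPLE = """\
Local System Serial Number NVRAM P10003xx
Reference System Serial Number NVRAM P10003xx
Local Brick Serial Number EEPROM NBY982
Reference Brick Serial Number NVRAM NBY982
"""

if __name__ == "__main__":
    problems = serial_mismatches(SAMPLE)
    if problems:
        for kind, local, reference in problems:
            print(f"{kind}: local={local} reference={reference}")
    else:
        print("local and reference serial numbers agree")
```

An empty result only means the two copies agree with each other; it says nothing about whether the values themselves are valid.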
RE: Need help troubleshooting broken SGI Tezro -
weblacky - 10-01-2019
Hi Dzeimis,
It seems that to make any progress here, you need to remove some variables from the equation.
You're getting these error messages, you know you had/have this damaged connector. You think you've got it connecting, but you're not sure. Why don't we find a way for you to check that connection?
I can't find a good picture of the back of both mating connectors. The best I found are these two (they should still apply, as the node board is the same):
http://www.jarredcapellman.com/wp-content/uploads/2013/02/origin350_inside.jpg
https://gainos.org/~elf/sgi/nekonomicon/forum/14/16721099/1.html
It looks like the connector has pads on the back? I assume these may be for testing/QA purposes.
Why not try to probe the back of the damaged node board connector (while connected/seated to the mainboard) with a multimeter set to diode mode while the system is off and unplugged? It looks like the pads (if they are pads) correspond to each pin. Why not try using a multimeter to test pin to ground, or even pin to neighboring pin? If you have an oscilloscope, you can try probing pin to ground with the system on.
Any non-zero reading really means they are connected! 0 means there's no connection there... now that's not a guarantee it's not connected, it could just be that the two points you've selected have no common connection. But a positive reading DOES mean it's connected. So you may get lucky and prove to yourself that the bent pins are connected (or not) by testing.
Keep one probe on the position of the bent pin (pad) and try a few of the other pins around it and see what you hit... if you hit anything, that's the positive reading. You may find that some (but not all) of the pins you've straightened are connected, which would show you where you need to focus on the connector.
I'd think this is low risk with the system unplugged and a good multimeter set to diode mode. And at least any positive readings will tell you what's connected, and you can continue to fiddle or perhaps just solder a small wire to the back of each connector to replace the bent/damaged pin connection? Assuming you probe the node-board connector at both sides and prove a 1 to 1 correlation to the position of a pad and the pin underneath it.
Once you've cleared your doubts about the connection, you can move on. I personally think you'll find a couple of the connections still aren't connecting, but some of the ones you've corrected are. So you can further focus on trying to correct those few remaining pins.
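If you end up probing a lot of pins, it may help to write the readings down and let a few lines of Python keep track of which pins are proven and which are still open questions. This is just a bookkeeping sketch of the rule described above (any positive reading proves a connection, no reading proves nothing); the pin labels and readings are hypothetical placeholders, not real measurements, and `summarize` is a name invented for this example.

```python
def summarize(readings):
    """Map each suspect pin to 'connected' or 'inconclusive'.

    readings: dict of pin label -> list of meter readings taken from that
    pin to various neighbours or ground. Use None (or 0) for "no reading".
    One positive reading is enough to call the pin connected; the absence
    of readings is never treated as proof of a broken connection.
    """
    result = {}
    for pin, values in readings.items():
        confirmed = any(value is not None and value > 0 for value in values)
        result[pin] = "connected" if confirmed else "inconclusive"
    return result


# HYPOTHETICAL pin labels and readings, for illustration only.
EXAMPLE = {
    "pin_A": [None, 0.52, None],  # one positive reading: connection proven
    "pin_B": [None, None, None],  # nothing found: keep testing this one
    "pin_C": [0, 0],              # zeros only: still inconclusive
}

if __name__ == "__main__":
    for pin, verdict in summarize(EXAMPLE).items():
        print(pin, verdict)
```

The pins that stay "inconclusive" after trying several neighbours are the ones worth a closer look on the connector itself.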
Best of luck and keep us in the loop!
RE: Need help troubleshooting broken SGI Tezro -
weblacky - 07-15-2020
FYI, I just looked at old L1 logs from a deskside Tezro I have and I get the same 3 VRM warnings! I also have an issue, but it's not a kernel panic: my 1.8V line drops to 1.22V and initiates a shutdown. But I can get to the PROM and everything before that happens.
I notice two things. 1. You're running firmware from image B and it's really old. So what's in image A? Try swapping to image A from the L1 and booting to the PROM. It might be a bad flash the previous owner did. 2. That connector with the bent pins is known; it can be desoldered and replaced by a professional, but I'm not totally convinced the connection isn't working. The log says the firmware is crashing. So maybe try the other firmware bank and see?