Need help troubleshooting broken SGI Tezro
#1
Need help troubleshooting broken SGI Tezro
Hi everyone :)

I posted about my problem on nekochan a year ago (I was directed there from reddit, where I first asked about the issue), but I haven't found a solution since then and just put the problem aside to deal with later...

Long story short, my Tezro doesn't boot and I have no idea why. The front panel shows only a solid red LED which, according to the manual, means "System node board failure (failed to read PROM at power on)". There were some bent pins between the motherboard and the node board (pics in the reddit post), which I think I fixed, but I'm not 100% sure about it. It still doesn't boot.

I had another computer connected to the serial console and tried several things that I don't remember exactly, but I found some saved snippets on my laptop, which are attached to this post. I don't remember what I did to cause the general exception or the controller firmware panic...
Let me know if there's anything I can do to provide more information. I will try booting this system when I get back home this weekend.

Can anybody with a working Tezro tell me if they also get the messages "Cannot enable VRM: x" in the console when the system is just started (or plugged into the wall socket... I don't remember anymore)?

Edit: forgot to actually attach Tezro output that I had saved.


Attached Files
.txt TEZRO OUTPUT.TXT Size: 2.42 KB  Downloads: 327
(This post was last modified: 10-24-2018, 07:09 PM by Dzeimis.)
Dzeimis
O2

10-24-2018, 07:06 PM
#2
RE: Need help troubleshooting broken SGI Tezro
I think I saw your response on a reddit thread a few days ago, including some pictures from a much earlier thread you had posted. I reckon you will find this to be a much better place to seek help, but you should find a way to post those pictures so people can look into them in more detail. I do not believe those connectors you showed were compression connectors but, still, the pins looked somewhat deformed and I can see how a bad connection there could be causing havoc.

Best of luck bringing it back to life, hope it is something you can manage to fix!

Edit: Found the reddit thread, these are the pins I was referring to. If you check closely some of them seem to be misaligned. It *may* be nothing, though.

[Image: CskrtXA.jpg]

By the way, you may wish to paste the contents of the file you attached as it is not downloading properly.

:Indy: R4000 @100MHz, 96MB, 3GB, XL Newport | :O2: R5000 @180MHz, 256MB, 36GB, CRM | :Indigo2: R4400 @250MHz, 192MB, 36GB, Elan + XZ | :Octane: R10000 @250MHz, 1024MB, 8GB+2GB, ESSI.



(This post was last modified: 10-24-2018, 07:43 PM by mamorim01.)
mamorim01
Octane

10-24-2018, 07:34 PM
#3
RE: Need help troubleshooting broken SGI Tezro
Yep, it was my post. I tried straightening the pins as much as I could without risking breaking them. As far as I remember, the bent pins were still slightly noticeable afterwards, but not enough that I'd expect them to fail... I guess I'll see if I can straighten them further when I get to it.

Strange that the attachment doesn't work. It downloads fine for me. Anyway, just in case, copying snippets that are there:

Code:
INFO: Cannot enable VRM: 9
INFO: Cannot enable VRM: 10
INFO: Cannot enable VRM: 11


SGI SN1 L1 Controller
Firmware Image B: Rev. 1.26.5, Built 12/15/2003 12:58:02

Code:
CPU  A: 0x00: PLED_RESET:  Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU  B: < CPU not present >
CPU  C: 0x00: PLED_RESET:  Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU  D: < CPU not present >

Code:
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xc00000001fc41180 (0xc00000001fc41180)
A 000 001c01: *** Press ENTER to continue.
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xc00000001fc41148 (0xc00000001fc41148)
A 000 001c01: *** Press ENTER to continue.

Code:
INFO: Cannot enable VRM: 9
INFO: Cannot enable VRM: 10
INFO: Cannot enable VRM: 11


****************************************
controller firmware panic!   resetting...
****************************************

IMAGE B: Rev. 1.26.5
[thread ID 30004b44 stack]
  TR: fff6f158 fff6ea10 fff845d4 fff848ba fff84930 fff596f6 fff4f1a0
  TR: fff6510a fff10e88 fff7fb7e fff131b0 fff13b20 fff3be50 00000000

(if you see this, please email ssh@sgi.com and include
the output from the 'log' command and a description of
what caused the problem)


SGI SN1 L1 Controller
Firmware Image B: Rev. 1.26.5, Built 12/15/2003 12:58:02
07/12/17 03:38:10 power up (COMMAND)
07/12/17 03:38:15 Node 0 IP53 XTalk clock 88
07/12/17 03:38:18 reset again MIPS
07/12/17 03:38:18 Cooling system stabilized
07/12/17 03:38:22 Node 0 IP53 XTalk clock 88
07/12/17 03:40:03 reset (COMMAND)
07/12/17 03:40:04 Node 0 IP53 XTalk clock 88
07/12/17 03:40:33 soft reset (COMMAND)
07/12/17 03:40:33 PANIC: ioExp.c line 571 ; Illegal I/O expander index: 35
07/12/17 03:40:34 L1 booting 1.26.5
07/12/17 03:40:35 CONTROLLER FIRMWARE PANIC!
07/12/17 03:40:35 IMAGE B: Rev. 1.26.5
07/12/17 03:40:35 [thread ID 30004b44 stack]
07/12/17 03:40:35    TR: fff6f158 fff6ea10 fff845d4 fff848ba fff84930 fff596f6 fff4f1a0
07/12/17 03:40:35    TR: fff6510a fff10e88 fff7fb7e fff131b0 fff13b20 fff3be50 00000000
07/12/17 03:40:36 Cooling system stabilized
07/12/17 03:40:36 USB0: waiting on open

Code:
001c01-L1>leds
CPU  A: 0x2a: PLED_MAKESTACK
CPU  B: < CPU not present >
CPU  C: 0x2a: PLED_MAKESTACK
CPU  D: < CPU not present >
8
(This post was last modified: 10-24-2018, 09:19 PM by Dzeimis.)
Dzeimis
O2

10-24-2018, 09:18 PM
#4
RE: Need help troubleshooting broken SGI Tezro
I would wait until someone has looked at your logs and can make sense of them before trying to fix the alignment of those pins any further. There could be further troubleshooting procedures that elude me. This is unfortunately above my pay grade, but wait until some other Tezro owners notice your post and chime in.

Good luck!




mamorim01
Octane

10-24-2018, 09:43 PM
#5
RE: Need help troubleshooting broken SGI Tezro
I don't really know what the problem is, but I do wonder why it's complaining about the 3 VRM's...

Was this Tezro working one day then it stopped?
gijoe77
Tezro

10-25-2018, 05:02 PM
#6
RE: Need help troubleshooting broken SGI Tezro
(10-25-2018, 05:02 PM)gijoe77 Wrote:  I don't really know what the problem is, but I do wonder why it's complaining about the 3 VRM's...

Was this Tezro working one day then it stopped?

I am in way over my head, but it could be a communication issue, which would be consistent with the damage observed in the connector per the image above. Judging from documentation online, the VRMs seem to be placed on the same board as the DIMMs and CPUs (the system node board). If there is no proper connection to this board, then any attempt to reach them and/or probe them for voltage readings would presumably result in some kind of error. This would also be consistent with the errors referring to the absence of CPUs, which are also mounted on the node board.

To quote the OP:


Quote:There were some bent pins between motherboard and node board (pics in reddit post) which I think I fixed, but not 100% sure about it.


[Image: lWpPaxm.png]

[Image: gb19D1J.png]




(This post was last modified: 10-25-2018, 05:16 PM by mamorim01.)
mamorim01
Octane

10-25-2018, 05:09 PM
#7
RE: Need help troubleshooting broken SGI Tezro
Hello there,

The problem with the Tezro is not as obvious as many might think.
It has to do with the system serial number.
You need to get hold of an L2 controller and feed the Tezro a valid serial number.
Once the system sees valid serial numbers (BSN and SSN), it will boot.
Make sure the system has a valid date before you boot it by issuing the date command on the L2 controller.
When the system boots, stop for maintenance and do the same on the Tezro.

Hope this helps


adlihajarat
Tezro, IR4, O2, Fuel, Indigo I and II

09-29-2019, 03:38 AM
#8
RE: Need help troubleshooting broken SGI Tezro
(10-25-2018, 05:02 PM)gijoe77 Wrote:  I don't really know what the problem is, but I do wonder why it's complaining about the 3 VRM's...

Was this Tezro working one day then it stopped?

I'm with you on this one. You can sometimes see messages like that when optional hardware isn't present, but I don't think it applies to the Tezro.

Since you've got a working L1 connection, why not check voltages when the system is 'on'? It should be similar to this (if you don't have an IP59 at least):

Code:
001c01-L1> env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
          1.8V    Enabled  10%   1.62/  1.98  20%   1.44/  2.16    1.791
           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.063
        12V #2    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.188
          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.337
          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.509
        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.063
        5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.044
      3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.268
            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.070
  XIO 12V BIAS    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.000
        XIO 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.044
      XIO 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.457
  XIO 3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.285
 IP53 3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.302
   IP53 5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.044
      IP53 12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.063
     IP53 SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.483
     IP53 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.480
     IP53 VCPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.241

Description     State       Warning RPM  Current RPM
--------------- ----------  -----------  -----------
FAN  0   NODE 1    Enabled         1800         2096
FAN  1   NODE 2    Enabled         1800         2096
FAN  2   NODE 3    Enabled         1800         2163
FAN  3    PCI 1    Enabled         1350         1486
FAN  4    PCI 2    Enabled         1350         1493
FAN  5       HD    Enabled         1620         3308
FAN  6    ODY 1    Enabled         1350         1704
FAN  7    ODY 2    Enabled         1350         1607

                              Advisory   Critical   Fault      Current
Description       State       Temp       Temp       Temp       Temp
----------------- ----------  ---------  ---------  ---------  ---------
0 INTERFACE 0       Enabled    [Autofan Control]    76C/168F   35C/ 95F
1 INTERFACE 1       Enabled    [Autofan Control]    76C/168F   32C/ 88F
2 INTERFACE 2       Enabled    [Autofan Control]    76C/168F   30C/ 86F
3 INTERFACE 3       Enabled    [Autofan Control]    76C/168F   42C/107F
4 ODYSSEY           Enabled    [Autofan Control]    76C/168F   50C/122F
5 NODE              Enabled    [Autofan Control]    76C/168F   46C/114F
6 BEDROCK           Enabled    [Autofan Control]    85C/185F   47C/116F

                     Zone Temp     Target    Current   Zone Fan   Curr/Min
Zone Name  State     Sensors       Average   Average   Index      Fan %
---------  --------  ------------  --------  --------  ---------  ---------
Node        Enabled           5,6  62C/143F  46C/114F          0   46%/ 46%
PCI         Enabled       0,1,2,3  45C/113F  36C/ 96F        3,4   57%/ 57%
ODY         Enabled             4  50C/122F  50C/122F          6   65%/ 61%
HD          Enabled             5  40C/104F  44C/112F          5   52%/ 38%
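
If you capture your own `env` output over the serial console, a small script can flag any rail that is outside its limits. This is only a rough Python sketch: it assumes the voltage rows look like the listing above (name, state, warning percent with low/high, fault percent with low/high, current reading), and the names `parse_voltage_rows` and `classify` are made up for illustration, not part of any SGI tool.

```python
import re

# Matches one voltage row of the listing above, for example:
#   "IP53 VCPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.241"
# Fan, temperature and zone rows do not match and are skipped.
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<state>\S+)\s+"
    r"\d+%\s+(?P<warn_lo>[\d.]+)/\s*(?P<warn_hi>[\d.]+)\s+"
    r"\d+%\s+(?P<fault_lo>[\d.]+)/\s*(?P<fault_hi>[\d.]+)\s+"
    r"(?P<current>[\d.]+)\s*$"
)

NUMERIC = ("warn_lo", "warn_hi", "fault_lo", "fault_hi", "current")


def parse_voltage_rows(text):
    """Return one dict per voltage row found in a captured `env` listing."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        row = match.groupdict()
        for key in NUMERIC:
            row[key] = float(row[key])
        rows.append(row)
    return rows


def classify(row):
    """Compare the current reading against the fault and warning limits."""
    value = row["current"]
    if not row["fault_lo"] <= value <= row["fault_hi"]:
        return "FAULT"
    if not row["warn_lo"] <= value <= row["warn_hi"]:
        return "WARNING"
    return "ok"


if __name__ == "__main__":
    sample = "1.8V    Enabled  10%   1.62/  1.98  20%   1.44/  2.16    1.791"
    for row in parse_voltage_rows(sample):
        print(row["name"], row["current"], classify(row))
```

With the limits in the listing, a 1.8V rail would count as a warning below 1.62 V and as a fault below 1.44 V.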

The SSN (system serial number) can be checked from the L1 as well:

Code:
001c01-L1> serial all

Data                            Location      Value
------------------------------  ------------  --------
Local System Serial Number      NVRAM         P10003xx
Reference System Serial Number  NVRAM         P10003xx
Local Brick Serial Number       EEPROM        NBY982
Reference Brick Serial Number   NVRAM         NBY982


EEPROM      Product Name    Serial         Part Number           Rev  T/W
----------  --------------  -------------  --------------------  ---  ------
INTERFACE   WS_INT_53       NBY982         030_1881_007          A    00
IO9         IO9             NFW612         030_1771_005          A    00
ODYSSEY     ODY128B1_2      NNB535         030_1884_005          B    00
SNOWBALL    no hardware detected
NODE        IP53_4CPU       NER934         030_1956_002          C    00
IO DGHTR    CHWS_IO_DAUG    NEB185         030_1875_003          A    00

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     CE000000000000000CA3E100 M3 46L2820ET3-CA0   3E   10.0  N/A
DIMM 2     CE000000000000000C9BE100 M3 46L2820ET3-CA0   3E   10.0  N/A
DIMM 4     CE000000000000000CDE4800 M3 46L2820BT2-CA0   2B   10.0  N/A
DIMM 6     CE000000000000000CE64200 M3 46L2820BT2-CA0   2B   10.0  N/A
DIMM 1     CE000000000000000C97E100 M3 46L2820ET3-CA0   3E   10.0  N/A
DIMM 3     CE000000000000000C8BE100 M3 46L2820ET3-CA0   3E   10.0  N/A
DIMM 5     CE000000000000000CF04200 M3 46L2820BT2-CA0   2B   10.0  N/A
DIMM 7     CE000000000000000CE64800 M3 46L2820BT2-CA0   2B   10.0  N/A
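
If you want to compare the local and reference serial numbers without reading them off by eye, something like the Python sketch below would do it. It assumes the first table of `serial all` looks like my output above; `read_serials` and `mismatches` are hypothetical helper names, and whether a mismatch actually stops a Tezro from booting is not something this checks.

```python
import re

# Matches lines such as:
#   "Local System Serial Number      NVRAM         P10003xx"
LINE = re.compile(
    r"^(?P<kind>Local|Reference) (?P<what>System|Brick) Serial Number\s+"
    r"(?P<location>\S+)\s+(?P<value>\S+)\s*$"
)


def read_serials(text):
    """Map (kind, what) to the serial value, e.g. ("Local", "System")."""
    serials = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            serials[(match["kind"], match["what"])] = match["value"]
    return serials


def mismatches(serials):
    """List the serial types whose local and reference values differ or are missing."""
    bad = []
    for what in ("System", "Brick"):
        local = serials.get(("Local", what))
        reference = serials.get(("Reference", what))
        if local is None or reference is None or local != reference:
            bad.append(what)
    return bad


if __name__ == "__main__":
    capture = """\
Local System Serial Number      NVRAM         P10003xx
Reference System Serial Number  NVRAM         P10003xx
Local Brick Serial Number       EEPROM        NBY982
Reference Brick Serial Number   NVRAM         NBY982
"""
    print(mismatches(read_serials(capture)))
```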

I wouldn't separate the nodeboard from the backplane unless there is no other option. There's no logical explanation for these connectors to get damaged unless you take the system apart.
jan-jaap
SGI Collector

09-29-2019, 01:30 PM
#9
RE: Need help troubleshooting broken SGI Tezro
Hi Dzeimis,
It seems that to make any progress here, you need to remove some variables from this equation.

You're getting these error messages, you know you had/have this damaged connector. You think you've got it connecting, but you're not sure. Why don't we find a way for you to check that connection?

I can't find a good picture of the back of both these mating connectors (the best I found are http://www.jarredcapellman.com/wp-conten...inside.jpg and https://gainos.org/~elf/sgi/nekonomicon/...099/1.html, which show the same node board), but it looks like it has pads on the back of the connector? I assume these may be for testing/QA purposes?

Why not try to probe the back of the damaged node board connector (while it is seated in the mainboard) with a multimeter set to diode mode while the system is off and unplugged? It looks like the pads (if they are pads) correspond to each pin. You could use the multimeter to test pin to ground, or even pin to neighboring pin. If you have an oscilloscope, you can try probing pin to ground with the system on.

Any non-zero reading really means they are connected! 0 means there's no connection there... now, that's not a guarantee it's not connected; it could just be that the two points you've selected have no common connection. But a positive reading DOES mean it's connected. So you may get lucky and prove to yourself that the bent pins are connected (or not) by testing.

Keep one probe on the position of the bent pin (pad) and try a few of the other pins around it and see what you hit... if you hit anything, that's the positive reading. You may find that some (but not all) of the pins you've straightened are connected, which would show you where you need to focus on the connector.
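
If you end up probing a lot of pins, it helps to write the readings down and sort them afterwards. Here is a tiny Python sketch of the rule of thumb above: any non-zero reading counts as proof of a connection, while no reading at all is only "inconclusive". The pin labels and values are invented for illustration, and `summarize` is just a name I picked.

```python
def summarize(readings):
    """Classify each probed pin from its readings against neighbouring pins.

    readings maps a pin label to {other_pin: value}. A value of 0 or None
    means the meter showed nothing for that pair.
    """
    result = {}
    for pin, probes in readings.items():
        # Keep only the neighbours that produced a non-zero reading.
        hits = [other for other, value in probes.items() if value]
        if hits:
            result[pin] = ("connected", hits)
        else:
            # No reading is not proof of a bad pin: the two points may
            # simply share no common path.
            result[pin] = ("inconclusive", [])
    return result


if __name__ == "__main__":
    # Hypothetical pin labels and diode-mode readings.
    notes = {
        "A3": {"A2": 0.0, "A4": 0.612, "GND": 0.540},
        "B7": {"B6": 0.0, "B8": None, "GND": 0.0},
    }
    for pin, (status, hits) in summarize(notes).items():
        print(pin, status, hits)
```

The pins that come back "inconclusive" against every neighbour are the ones to look at first on the connector.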

I'd think this is low risk with the system unplugged and a good multimeter set to diode mode. And at least any positive readings will tell you what's connected, and you can continue to fiddle or perhaps just solder a small wire to the back of each connector to replace the bent/damaged pin connection? Assuming you probe the node-board connector at both sides and prove a 1 to 1 correlation to the position of a pad and the pin underneath it.

Once you've cleared your doubts about the connection, you can move on. I personally think you'll find a couple of the connections still aren't connecting, but some of the ones you've corrected are. So you can further focus on trying to correct those few remaining pins.

Best of luck and keep us in the loop!
(This post was last modified: 10-03-2019, 01:01 AM by weblacky.)
weblacky
I play an SGI Doctor, on daytime TV.

10-01-2019, 07:14 AM
#10
RE: Need help troubleshooting broken SGI Tezro
FYI, I just looked at old L1 logs from a deskside Tezro I have and I get the same 3 VRM warnings! I also have an issue, but it's not a kernel panic: my 1.8V line drops to 1.22V and initiates a shutdown. But I can get to the PROM and everything before that happens.

I notice two things. 1. You're running firmware from image B and it's really old. So what's in image storage A? Try doing a swap from the L1 to A and boot to the PROM. It might be a bad flash the previous owner did? 2. That connector with the bent pins is known; it can be desoldered and replaced by a professional, but I'm not totally convinced the connection isn't working. The log claims the firmware is crashing, so maybe try the other firmware bank and see?
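
If you save the serial console output to a file, a few lines of Python can pull out the firmware image and any PANIC entries so they're easy to compare between attempts. This is only a sketch: the regular expressions are written against the log lines pasted earlier in this thread, and `scan_l1_log` is a made-up name, not an L1 command.

```python
import re

# Matches log lines such as:
#   "07/12/17 03:40:33 PANIC: ioExp.c line 571 ; Illegal I/O expander index: 35"
PANIC = re.compile(
    r"^(?P<stamp>\d\d/\d\d/\d\d \d\d:\d\d:\d\d) PANIC: "
    r"(?P<file>\S+) line (?P<line>\d+) ; (?P<message>.*)$"
)

# Matches banner lines such as:
#   "Firmware Image B: Rev. 1.26.5, Built 12/15/2003 12:58:02"
IMAGE = re.compile(r"Firmware Image (?P<image>[A-Z]): Rev\. (?P<rev>[\d.]+)")


def scan_l1_log(text):
    """Return (panics, images) found in a captured L1 console log."""
    panics = []
    images = set()
    for raw in text.splitlines():
        line = raw.strip()
        match = PANIC.match(line)
        if match:
            panics.append(
                (match["stamp"], match["file"], int(match["line"]), match["message"])
            )
        match = IMAGE.search(line)
        if match:
            images.add((match["image"], match["rev"]))
    return panics, sorted(images)


if __name__ == "__main__":
    capture = """\
Firmware Image B: Rev. 1.26.5, Built 12/15/2003 12:58:02
07/12/17 03:40:33 soft reset (COMMAND)
07/12/17 03:40:33 PANIC: ioExp.c line 571 ; Illegal I/O expander index: 35
07/12/17 03:40:34 L1 booting 1.26.5
"""
    panics, images = scan_l1_log(capture)
    print(images)
    print(panics)
```

Running it on a capture from each firmware bank would show whether the same panic appears on both.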
weblacky
I play an SGI Doctor, on daytime TV.

07-15-2020, 12:34 AM

