sgi tezro L1 General Exception on node 0
#1
sgi tezro L1 General Exception on node 0
Hello everyone,

Unfortunately, my Tezro has recently started issuing an L1 error message and refuses to boot.


Code:
returning to console mode  001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xffffffffbfc01a94 (0xc00000001fc01a94)
A 000 001c01: *** Press ENTER to continue



Does anyone have an idea what that could be? Thank you, any help is welcome!


   
HarryT
tezro

Trade Count: (0)
Posts: 70
Threads: 18
Joined: Oct 2018
05-21-2022, 12:48 PM
#2
RE: sgi tezro L1 General Exception on node 0
Can you please post an “env” output from L1? I’m more concerned about the multiple errors before the exception.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
05-21-2022, 01:11 PM
#3
RE: sgi tezro L1 General Exception on node 0
(05-21-2022, 01:11 PM)weblacky Wrote:  Can you please post an “env” output from L1?  I’m more concerned about the multiple errors before the exception.
This is the output of "env":

Code:
001c01-L1>env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
          1.8V   Wait Pwr  10%   1.62/  1.98  20%   1.44/  2.16    0.08
           12V   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.06
        12V #2   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.06
          3.3V   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    0.00
          2.5V   Wait Pwr  10%   2.25/  2.75  20%   2.00/  3.00    0.00
        12V IO   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.00
        5V AUX   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    5.04
      3.3V AUX   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    3.29
            5V   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    0.00
  XIO 12V BIAS   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.06
        XIO 5V   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    0.00
      XIO 2.5V   Wait Pwr  10%   2.25/  2.75  20%   2.00/  3.00    0.00
  XIO 3.3V AUX   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    3.32
 NODE 3.3V AUX   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    3.32
   NODE 5V AUX   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    5.02
      NODE 12V   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.06
     NODE SRAM   Wait Pwr  10%   2.25/  2.75  20%   2.00/  3.00    0.00
     NODE 1.5V   Wait Pwr  10%   1.35/  1.65  20%   1.20/  1.80    0.00
     NODE VCPU   Wait Pwr  10%   1.13/  1.38  20%   1.00/  1.50    0.00

001c01-L1>env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
          1.8V    Enabled  10%   1.62/  1.98  20%   1.44/  2.16    1.88
           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
        12V #2    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.12
          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.47
          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.61
        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
        5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.04
      3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.29
            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
  XIO 12V BIAS    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
        XIO 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
      XIO 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.57
  XIO 3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.32
 NODE 3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.32
   NODE 5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.02
      NODE 12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.00
     NODE SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.57
     NODE 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.57
     NODE VCPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.30

Description     State       Warning RPM  Current RPM
--------------- ----------  -----------  -----------
FAN  0   NODE 1    Enabled         1800         2070
FAN  1   NODE 2    Ena[...]
FAN  2   NODE 3    Enabled         1800         2163
FAN  3    PCI 1    Enabled         1350         1527
FAN  4    PCI 2    Enabled         1350         1493
FAN  5       HD    Enabled         1620         2812
FAN  6    ODY 1    Enabled         1350         1555
FAN  7    ODY 2    Enabled         1350         1513

                              Advisory   Critical   Fault      Current
Description       State       Temp       Temp       Temp       Temp
----------------- ----------  ---------  ---------  ---------  ---------
0 INTERFACE 0       Enabled    [Autofan Control]    76C/168F   38C/100F
1 INTERFACE 1       Enabled    [Autofan Control]    76C/168F   34C/ 93F
2 INTERFACE 2       Enabled    [Autofan Control]    76C/168F   33C/ 91F
3 INTERFACE 3       Enabled    [Autofan Control]    76C/168F   41C/105F
4 ODYSSEY           Enabled    [Autofan Control]    76C/168F   34C/ 93F
5 NODE              Enabled    [Autofan Control]    76C/168F   45C/113F
6 BEDROCK           Enabled    [Autofan Control]    85C/185F   45C/113F

                     Zone Temp     Target    Current   Zone Fan   Curr/Min
Zone Name  State     Sensors       Average   Average   Index      Fan %
---------  --------  ------------  --------  --------  ---------  ---------
Node        Enabled           5,6  62C/143F  45C/113F          0   46%/ 46%
PCI         Enabled       0,1,2,3  45C/113F  36C/ 96F        3,4   57%/ 57%
ODY         Enabled             4  50C/122F  34C/ 93F          6   61%/ 61%
HD          Enabled             5  40C/104F  45C/113F          5   51%/ 38%


returning to console mode  001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xc00000001fc44018 (0xc00000001fc44018)
A 000 001c01: *** Press ENTER to continue.

And the "log":

Code:
001c01-L1>log
05/21/22 06:07:41 SMP unregistering events
05/21/22 06:07:41 UNREG: 30006838 0 4
05/21/22 06:07:42 SMP-R: UART:UART_NO_CONNECTION
05/21/22 06:18:58 power up (COMMAND)
05/21/22 06:19:03 Node 0 XTalk clock 88
05/21/22 06:19:05 reset again MIPS
05/21/22 06:19:09 Node 0 XTalk clock 88
05/21/22 06:19:10 Cooling system stabilized
05/21/22 06:22:22 SCC WR 6 (len=6) - UART:UART_TIMEOUT
05/21/22 06:22:24 SCC WR 6 (len=6) - UART:UART_TIMEOUT
05/21/22 06:25:30 SMP unregistering events
05/21/22 06:25:30 UNREG: 30006838 0 4
05/21/22 06:25:30 SMP-R: UART:UART_BREAK_RECEIVED
05/21/22 06:26:33 SMP unregistering events
05/21/22 06:26:33 UNREG: 30006838 0 4
05/21/22 06:26:34 SMP-R: UART:UART_BREAK_RECEIVED
05/21/22 06:29:03 power down (PANEL)
05/21/22 06:29:33 SMP unregistering events
05/21/22 06:29:33 UNREG: 30006838 0 4
05/21/22 06:29:34 SMP-R: UART:UART_NO_CONNECTION
05/21/22 07:12:56 SMP unregistering events
05/21/22 07:12:56 UNREG: 30006838 0 4
05/21/22 07:12:57 SMP-R: UART:UART_BREAK_RECEIVED
05/21/22 07:13:02 SMP unregistering events
05/21/22 07:13:02 UNREG: 30006838 0 4
05/21/22 07:13:03 SMP-R: UART:UART_NO_CONNECTION
05/21/22 07:18:52 SMP unregistering events
05/21/22 07:18:52 UNREG: 30006838 0 4
05/21/22 07:18:53 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:54 SMP unregistering events
05/21/22 07:18:54 UNREG: 30006838 0 4
05/21/22 07:18:55 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:56 SMP unregistering events
05/21/22 07:18:56 UNREG: 30006838 0 4
05/21/22 07:18:57 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:57 SMP unregistering events
05/21/22 07:18:57 UNREG: 30006838 0 4
05/21/22 07:18:58 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:59 SMP unregistering events
05/21/22 07:18:59 UNREG: 30006838 0 4
05/21/22 07:19:00 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:21:22 power up (COMMAND)
05/21/22 07:21:26 Node 0 XTalk clock 88
05/21/22 07:21:29 reset again MIPS
05/21/22 07:21:33 Node 0 XTalk clock 88

returning to console mode  001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xffffffffbfc01a94 (0xc00000001fc01a94)
A 000 001c01: *** Press ENTER to continue.
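The first EPC value above can be decoded with plain MIPS address arithmetic. This is a minimal sketch, assuming the standard MIPS layout in which kseg1 (0xA0000000 to 0xBFFFFFFF) is an unmapped, uncached window onto physical memory and the reset vector 0xBFC00000 corresponds to physical 0x1FC00000. The function name is hypothetical and nothing in it is Tezro-specific.

```python
KSEG1_BASE = 0xA0000000
KSEG1_END = 0xBFFFFFFF
BOOT_ROM_PHYS = 0x1FC00000  # physical address behind the MIPS reset vector 0xBFC00000


def decode_kseg1(epc):
    """Return (physical address, offset from the reset vector) for a
    sign-extended 32-bit kseg1 address, or None for anything else."""
    upper = epc >> 32
    low32 = epc & 0xFFFFFFFF
    if upper != 0xFFFFFFFF:
        return None  # not a sign-extended 32-bit compatibility address
    if not (KSEG1_BASE <= low32 <= KSEG1_END):
        return None  # outside kseg1
    phys = low32 & 0x1FFFFFFF  # kseg1 maps to physical by dropping the top 3 bits
    return phys, phys - BOOT_ROM_PHYS


result = decode_kseg1(0xFFFFFFFFBFC01A94)
if result is not None:
    phys, offset = result
    print(hex(phys), hex(offset))
```

Under those assumptions, 0xffffffffbfc01a94 decodes to physical 0x1fc01a94, which is offset 0x1a94 from the reset vector. That would suggest the exception is raised while boot PROM code is running rather than in Irix, but this is only a reading of the address, not a diagnosis. The 0xc000... form is not a kseg1 address, so the sketch returns None for it.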
(This post was last modified: 05-21-2022, 02:18 PM by HarryT.)
HarryT
05-21-2022, 01:24 PM
#4
RE: sgi tezro L1 General Exception on node 0
Thank you for posting that.

Okay, while I see no smoking guns here, I’ll ask the usual questions.

Do you have any saved log and env output from before this happened? Something from a year or two ago would help for comparison, as I only see one small thing that *may be* something.

Unless I’m misreading the table (others, please correct me if I am), your NODE VCPU voltage (1.30 V) is only 0.08 V away from the 10% out-of-tolerance warning limit (1.38 V). Am I right? If so, could this be a power issue (clean power tolerance)? I’ll assume that’s not it, though.
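The margin check can be repeated for every rail rather than by eye. The following is a rough sketch that assumes the voltage rows keep the column layout of the `env` output posted above; the regular expression and the `warning_margins` helper are illustrative, not part of any SGI tool.

```python
import re

# Matches voltage rows such as:
#      NODE VCPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.30
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+(?P<state>Enabled|Wait Pwr)\s+"
    r"\d+%\s+(?P<warn_lo>[\d.]+)/\s*(?P<warn_hi>[\d.]+)\s+"
    r"\d+%\s+(?P<fault_lo>[\d.]+)/\s*(?P<fault_hi>[\d.]+)\s+"
    r"(?P<current>[\d.]+)\s*$"
)


def warning_margins(env_text):
    """Map each rail name to its distance (in volts) from the nearest warning limit."""
    margins = {}
    for line in env_text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # fan, temperature and header rows do not match
        current = float(m["current"])
        low = float(m["warn_lo"])
        high = float(m["warn_hi"])
        margins[m["name"]] = min(current - low, high - current)
    return margins


SAMPLE = """\
           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
     NODE 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.57
     NODE VCPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.30
"""

# Print the rails closest to a warning limit first.
for name, margin in sorted(warning_margins(SAMPLE).items(), key=lambda kv: kv[1]):
    print(f"{name:>14}  {margin:.2f} V from warning limit")
```

Only rows in the powered-up state are meaningful here; the "Wait Pwr" rows parse too, but they show the rails before power-up and will report negative margins.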

Did you upgrade Irix, which would in turn upgrade your L1 firmware?

Were you experiencing any oddities in Irix or on the system before this happened (loss of date and time in Irix, or booting issues)?

Did you alter anything recently (even reseating hardware or memory DIMMs)?

Did you move the Tezro recently, for example moving residences by car or something similar?
(This post was last modified: 05-21-2022, 02:53 PM by weblacky.)
weblacky
05-21-2022, 02:52 PM
#5
RE: sgi tezro L1 General Exception on node 0
Thank you for your help!

In fact I haven't changed anything in the system or moved it in any way. Nothing was changed either on the hardware or on the software. The system was last in operation in December 2021. Unfortunately I don't have a log file from back then. Right now I'm guessing the Dallas chip's battery is dead.
HarryT
05-21-2022, 03:09 PM
#6
RE: sgi tezro L1 General Exception on node 0
No, don’t assume that. If anything, you can test the yellow Snaphat. The L1 RTC almost never goes dead, and if it had, you’d see other symptoms during boot.

The yellow Snaphat is in charge of the boot variables and the time and date under Irix. It’s on your IO9 card.

You can just pull it off and measure the pins on the end with a voltmeter. One pin pair is the crystal oscillator; the other pair is the lithium battery, which should measure around 3 V or more.

You can also pretty easily test the L1 RTC with the date command. Either set the date or note the date shown in the L1 terminal, unplug the system for a little while, then plug it back in. If the L1 date is still intact and makes sense, the L1 RTC battery is working just fine.
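The "still intact and makes sense" check amounts to comparing how far the L1 clock advanced against how long the machine was actually unplugged. Here is a small sketch of that comparison, assuming you transcribe the two L1 readings into Python `datetime` values by hand; the function name and the two-minute tolerance are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def rtc_kept_time(before, after, elapsed, tolerance=timedelta(minutes=2)):
    """True if the clock advanced by roughly the wall-clock time that passed."""
    drift = (after - before) - elapsed
    return abs(drift) <= tolerance


# Reading noted before unplugging, reading after replugging,
# and the time measured on a watch in between (all example values).
before = datetime(2022, 5, 21, 16, 30, 0)
after = datetime(2022, 5, 21, 16, 45, 20)
print(rtc_kept_time(before, after, elapsed=timedelta(minutes=15)))
```

A clock that resets to some default date while unplugged would fail this check by a wide margin, so the exact tolerance matters little.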
(This post was last modified: 05-21-2022, 04:22 PM by weblacky.)
weblacky
05-21-2022, 04:18 PM
#7
RE: sgi tezro L1 General Exception on node 0
(05-21-2022, 04:18 PM)weblacky Wrote:  
You can also pretty easily test the L1 RTC with the date command. Either set the date or note the date shown in the L1 terminal, unplug the system for a little while, then plug it back in. If the L1 date is still intact and makes sense, the L1 RTC battery is working just fine.

The date seems correct. What now?
HarryT
05-21-2022, 05:06 PM
#8
RE: sgi tezro L1 General Exception on node 0
Maybe try reseating your node board (Only attempt this if replacing the Snaphat doesn't work). It's unlikely, but there could be a bad connection there! (Two MEG-array connectors)

Just be careful not to force it back into position. (I've swapped out nodes in my Tezro a few times and it's quite easy to do!)

There appears to be a signalling fault. (UART - Universal asynchronous receiver-transmitter)
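To see how the UART statuses in the L1 log are distributed, they can be tallied by type. The snippet below is a simple sketch that assumes the `UART:UART_...` token format seen in the posted log; it only counts occurrences and says nothing about what each status means.

```python
import re
from collections import Counter

# Status tokens look like "UART:UART_OVERRUN_ERROR" in the posted log.
UART_STATUS = re.compile(r"UART:(UART_[A-Z_]+)")


def tally_uart_events(log_text):
    """Count each UART status name that appears in an L1 log dump."""
    return Counter(UART_STATUS.findall(log_text))


# A short excerpt of the log posted earlier in the thread.
SAMPLE_LOG = """\
05/21/22 06:07:42 SMP-R: UART:UART_NO_CONNECTION
05/21/22 06:22:22 SCC WR 6 (len=6) - UART:UART_TIMEOUT
05/21/22 06:22:24 SCC WR 6 (len=6) - UART:UART_TIMEOUT
05/21/22 06:25:30 SMP-R: UART:UART_BREAK_RECEIVED
05/21/22 07:18:53 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:55 SMP-R: UART:UART_OVERRUN_ERROR
"""

for status, count in tally_uart_events(SAMPLE_LOG).most_common():
    print(f"{count:3d}  {status}")
```

Running it over the full `log` output gives a quick view of which status dominates and whether the entries cluster around the power-up attempts.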
(This post was last modified: 05-21-2022, 05:36 PM by Irinikus.)
Irinikus
Hardware Connoisseur

Trade Count: (0)
Posts: 3,475
Threads: 319
Joined: Dec 2017
Location: South Africa
05-21-2022, 05:23 PM
#9
RE: sgi tezro L1 General Exception on node 0
We could try to use your backup L1 firmware bank just in case as well to see if anything changes.

But searching online for a general exception is not leading me anywhere much, so we’ll have to make our own luck!

Also see if there are any console messages on Ctrl+t via L1 while trying to boot to get us more output to inspect.

Otherwise I’d go to that Yellow Snaphat, remove it and measure voltage on it.

M4T28-BR12SH
M4T32-BR12SH

That’s your other (more important to system boot) RTC. If it’s low, order a new one (the larger M4T32-BR12SH) and install it. Both part numbers work on SGI systems.

It’s on your large IO9 card, the top-most mounted card with SCSI, Ethernet, etc. on it. It lifts straight out but takes a little force. Take your time.

Removal WILL RESET your Irix date and time and your NVRAM boot parameters! They will be set to factory default values when you install or reinstall the Snaphat battery. If you’ve not fiddled with special boot arguments, this should have no effect on booting (no warranty in that statement).

Obviously, do a full power disconnect (AC cord pulled out of the socket) before trying to remove that yellow battery/crystal assembly.

If it measures above 3 V, it’s good. If it’s right on the edge, it could still be under 3 V when under load, so that’s a touchy call.
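The rule of thumb above can be written down as a tiny classifier. This is a sketch only: the 3 V threshold comes from the post, while the 0.1 V "edge" band and the function name are assumptions made for illustration, not a manufacturer specification.

```python
def classify_snaphat(volts, threshold=3.0, edge_band=0.1):
    """Classify an open-circuit Snaphat battery reading.

    threshold: the 3 V figure from the thread.
    edge_band: hypothetical width of the "touchy call" zone above the threshold.
    """
    if volts >= threshold + edge_band:
        return "good"
    if volts >= threshold:
        return "marginal"  # may sag below the threshold under load
    return "low"


for reading in (3.2, 3.05, 2.4):
    print(reading, classify_snaphat(reading))
```

Make sure the reading is taken across the battery pin pair; the crystal pair will not show battery voltage.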

If you think it’s good, then reinstall it, and we’ll need to try more drastic measures like removing memory modules and resetting the board’s logs and configuration via L1.

If anyone else has other suggestions, I’d be interested to hear them. Normally there are only a few things to check, and if the system really hasn’t been touched, I’m sort of scratching my head on where the next reasonable step should come from.
weblacky
05-21-2022, 05:24 PM
#10
RE: sgi tezro L1 General Exception on node 0
(05-21-2022, 05:24 PM)weblacky Wrote:  We could try to use your backup L1 firmware bank just in case as well to see if anything changes. 

But searching online for a general exception is not leading me anywhere much, so we’ll have to make our own luck!

Also see if there are any console messages on Ctrl+t via L1 while trying to boot to get us more output to inspect.

Otherwise I’d go to that Yellow Snaphat, remove it and measure voltage on it.

M4T28-BR12SH
M4T32-BR12SH

That’s your other (more important to system boot) RTC. If it’s low, order a new one (the larger M4T32-BR12SH) and install it. Both part numbers work on SGI systems.

   

Where exactly do I measure? On one side I get 0.17 volts.

On the web, there are M4T32-BR12SH1 and M4T32-BR12SH6 modules. Do they also fit?
(This post was last modified: 05-21-2022, 06:50 PM by HarryT.)
HarryT
05-21-2022, 06:36 PM

