sgi tezro L1 General Exception on node 0 -
HarryT - 05-21-2022
Hello everyone,
Unfortunately, my Tezro recently issued an L1 error message and now refuses to boot.
Code:
returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xffffffffbfc01a94 (0xc00000001fc01a94)
A 000 001c01: *** Press ENTER to continue.
Does anyone have an idea what that could be? Thank you, any help is welcome!
RE: sgi tezro L1 General Exception on node 0 -
weblacky - 05-21-2022
Can you please post an “env” output from L1? I’m more concerned about the multiple errors before the exception.
RE: sgi tezro L1 General Exception on node 0 -
HarryT - 05-21-2022
(05-21-2022, 01:11 PM)weblacky Wrote: Can you please post an “env” output from L1? I’m more concerned about the multiple errors before the exception.
This is the output of "env":
Code:
001c01-L1>env
Environmental monitoring is enabled and running.
Description State Warning Limits Fault Limits Current
-------------- ---------- ----------------- ----------------- -------
1.8V Wait Pwr 10% 1.62/ 1.98 20% 1.44/ 2.16 0.08
12V Wait Pwr 10% 10.80/ 13.20 20% 9.60/ 14.40 0.06
12V #2 Wait Pwr 10% 10.80/ 13.20 20% 9.60/ 14.40 0.06
3.3V Wait Pwr 10% 2.97/ 3.63 20% 2.64/ 3.96 0.00
2.5V Wait Pwr 10% 2.25/ 2.75 20% 2.00/ 3.00 0.00
12V IO Wait Pwr 10% 10.80/ 13.20 20% 9.60/ 14.40 0.00
5V AUX Wait Pwr 10% 4.50/ 5.50 20% 4.00/ 6.00 5.04
3.3V AUX Wait Pwr 10% 2.97/ 3.63 20% 2.64/ 3.96 3.29
5V Wait Pwr 10% 4.50/ 5.50 20% 4.00/ 6.00 0.00
XIO 12V BIAS Wait Pwr 10% 10.80/ 13.20 20% 9.60/ 14.40 0.06
XIO 5V Wait Pwr 10% 4.50/ 5.50 20% 4.00/ 6.00 0.00
XIO 2.5V Wait Pwr 10% 2.25/ 2.75 20% 2.00/ 3.00 0.00
XIO 3.3V AUX Wait Pwr 10% 2.97/ 3.63 20% 2.64/ 3.96 3.32
NODE 3.3V AUX Wait Pwr 10% 2.97/ 3.63 20% 2.64/ 3.96 3.32
NODE 5V AUX Wait Pwr 10% 4.50/ 5.50 20% 4.00/ 6.00 5.02
NODE 12V Wait Pwr 10% 10.80/ 13.20 20% 9.60/ 14.40 0.06
NODE SRAM Wait Pwr 10% 2.25/ 2.75 20% 2.00/ 3.00 0.00
NODE 1.5V Wait Pwr 10% 1.35/ 1.65 20% 1.20/ 1.80 0.00
NODE VCPU Wait Pwr 10% 1.13/ 1.38 20% 1.00/ 1.50 0.00
001c01-L1>env
Environmental monitoring is enabled and running.
Description State Warning Limits Fault Limits Current
-------------- ---------- ----------------- ----------------- -------
1.8V Enabled 10% 1.62/ 1.98 20% 1.44/ 2.16 1.88
12V Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 12.06
12V #2 Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 12.12
3.3V Enabled 10% 2.97/ 3.63 20% 2.64/ 3.96 3.47
2.5V Enabled 10% 2.25/ 2.75 20% 2.00/ 3.00 2.61
12V IO Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 12.06
5V AUX Enabled 10% 4.50/ 5.50 20% 4.00/ 6.00 5.04
3.3V AUX Enabled 10% 2.97/ 3.63 20% 2.64/ 3.96 3.29
5V Enabled 10% 4.50/ 5.50 20% 4.00/ 6.00 5.07
XIO 12V BIAS Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 12.06
XIO 5V Enabled 10% 4.50/ 5.50 20% 4.00/ 6.00 5.07
XIO 2.5V Enabled 10% 2.25/ 2.75 20% 2.00/ 3.00 2.57
XIO 3.3V AUX Enabled 10% 2.97/ 3.63 20% 2.64/ 3.96 3.32
NODE 3.3V AUX Enabled 10% 2.97/ 3.63 20% 2.64/ 3.96 3.32
NODE 5V AUX Enabled 10% 4.50/ 5.50 20% 4.00/ 6.00 5.02
NODE 12V Enabled 10% 10.80/ 13.20 20% 9.60/ 14.40 12.00
NODE SRAM Enabled 10% 2.25/ 2.75 20% 2.00/ 3.00 2.57
NODE 1.5V Enabled 10% 1.35/ 1.65 20% 1.20/ 1.80 1.57
NODE VCPU Enabled 10% 1.13/ 1.38 20% 1.00/ 1.50 1.30
Description State Warning RPM Current RPM
--------------- ---------- ----------- -----------
FAN 0 NODE 1 Enabled 1800 2070
FAN 1 NODE 2 EnaODE 3 Enabled 1800 2163
FAN 3 PCI 1 Enabled 1350 1527
FAN 4 PCI 2 Enabled 1350 1493
FAN 5 HD Enabled 1620 2812
FAN 6 ODY 1 Enabled 1350 1555
FAN 7 ODY 2 Enabled 1350 1513
Advisory Critical Fault Current
Description State Temp Temp Temp Temp
----------------- ---------- --------- --------- --------- ---------
0 INTERFACE 0 Enabled [Autofan Control] 76C/168F 38C/100F
1 INTERFACE 1 Enabled [Autofan Control] 76C/168F 34C/ 93F
2 INTERFACE 2 Enabled [Autofan Control] 76C/168F 33C/ 91F
3 INTERFACE 3 Enabled [Autofan Control] 76C/168F 41C/105F
4 ODYSSEY Enabled [Autofan Control] 76C/168F 34C/ 93F
5 NODE Enabled [Autofan Control] 76C/168F 45C/113F
6 BEDROCK Enabled [Autofan Control] 85C/185F 45C/113F
Zone Temp Target Current Zone Fan Curr/Min
Zone Name State Sensors Average Average Index Fan %
--------- -------- ------------ -------- -------- --------- ---------
Node Enabled 5,6 62C/143F 45C/113F 0 46%/ 46%
PCI Enabled 0,1,2,3 45C/113F 36C/ 96F 3,4 57%/ 57%
ODY Enabled 4 50C/122F 34C/ 93F 6 61%/ 61%
HD Enabled 5 40C/104F 45C/113F 5 51%/ 38%
returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xc00000001fc44018 (0xc00000001fc44018)
A 000 001c01: *** Press ENTER to continue.
And the "log" output:
Code:
001c01-L1>log
05/21/22 06:07:41 SMP unregistering events
05/21/22 06:07:41 UNREG: 30006838 0 4
05/21/22 06:07:42 SMP-R: UART:UART_NO_CONNECTION
05/21/22 06:18:58 power up (COMMAND)
05/21/22 06:19:03 Node 0 XTalk clock 88
05/21/22 06:19:05 reset again MIPS
05/21/22 06:19:09 Node 0 XTalk clock 88
05/21/22 06:19:10 Cooling system stabilized
05/21/22 06:22:22 SCC WR 6 (len=6) - UART:UART_TIMEOUT
05/21/22 06:22:24 SCC WR 6 (len=6) - UART:UART_TIMEOUT
05/21/22 06:25:30 SMP unregistering events
05/21/22 06:25:30 UNREG: 30006838 0 4
05/21/22 06:25:30 SMP-R: UART:UART_BREAK_RECEIVED
05/21/22 06:26:33 SMP unregistering events
05/21/22 06:26:33 UNREG: 30006838 0 4
05/21/22 06:26:34 SMP-R: UART:UART_BREAK_RECEIVED
05/21/22 06:29:03 power down (PANEL)
05/21/22 06:29:33 SMP unregistering events
05/21/22 06:29:33 UNREG: 30006838 0 4
05/21/22 06:29:34 SMP-R: UART:UART_NO_CONNECTION
05/21/22 07:12:56 SMP unregistering events
05/21/22 07:12:56 UNREG: 30006838 0 4
05/21/22 07:12:57 SMP-R: UART:UART_BREAK_RECEIVED
05/21/22 07:13:02 SMP unregistering events
05/21/22 07:13:02 UNREG: 30006838 0 4
05/21/22 07:13:03 SMP-R: UART:UART_NO_CONNECTION
05/21/22 07:18:52 SMP unregistering events
05/21/22 07:18:52 UNREG: 30006838 0 4
05/21/22 07:18:53 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:54 SMP unregistering events
05/21/22 07:18:54 UNREG: 30006838 0 4
05/21/22 07:18:55 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:56 SMP unregistering events
05/21/22 07:18:56 UNREG: 30006838 0 4
05/21/22 07:18:57 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:57 SMP unregistering events
05/21/22 07:18:57 UNREG: 30006838 0 4
05/21/22 07:18:58 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:18:59 SMP unregistering events
05/21/22 07:18:59 UNREG: 30006838 0 4
05/21/22 07:19:00 SMP-R: UART:UART_OVERRUN_ERROR
05/21/22 07:21:22 power up (COMMAND)
05/21/22 07:21:26 Node 0 XTalk clock 88
05/21/22 07:21:29 reset again MIPS
05/21/22 07:21:33 Node 0 XTalk clock 88
returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1
A 000 001c01:
A 000 001c01: *** General Exception on node 0
A 000 001c01: *** EPC: 0xffffffffbfc01a94 (0xc00000001fc01a94)
A 000 001c01: *** Press ENTER to continue.
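A side note on the EPC values, offered as a reading aid and not a diagnosis: on MIPS, 0xBFC00000 is the reset vector in the unmapped, uncached kseg1 segment, which sits on physical address 0x1FC00000, and 0xffffffffbfc01a94 is a 32-bit kseg1 address sign-extended to 64 bits. That would place the first EPC 0x1a94 bytes past the reset vector, in boot PROM code, which fits a failure very early in startup but does not say what failed. A minimal Python sketch of that arithmetic follows. The function name is made up for illustration, and the 0xc0000000... form is deliberately left undecoded.

```python
KSEG1_START = 0xA0000000
KSEG1_END = 0xBFFFFFFF
# Physical address behind the MIPS reset vector 0xBFC00000.
BOOT_VECTOR_PHYS = 0x1FC00000

def decode_compat_kseg1(addr64):
    """Return (physical, offset_from_boot_vector) for a sign-extended
    32-bit kseg1 address, or None for anything else."""
    # A sign-extended 32-bit kernel address has all ones in the upper half.
    if addr64 >> 32 != 0xFFFFFFFF:
        return None
    low32 = addr64 & 0xFFFFFFFF
    if not KSEG1_START <= low32 <= KSEG1_END:
        return None
    # kseg1 maps to physical memory by dropping the top three bits.
    phys = low32 & 0x1FFFFFFF
    return phys, phys - BOOT_VECTOR_PHYS

# The two EPC values posted in this thread.
for epc in (0xFFFFFFFFBFC01A94, 0xC00000001FC44018):
    decoded = decode_compat_kseg1(epc)
    if decoded is None:
        print(f"{epc:#018x}: not a sign-extended kseg1 address, left as-is")
    else:
        phys, offset = decoded
        print(f"{epc:#018x}: physical {phys:#x}, {offset:#x} past the reset vector")
```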
RE: sgi tezro L1 General Exception on node 0 -
weblacky - 05-21-2022
Thank you for posting that.
Okay, while I see no smoking guns here, I’ll ask the usual questions.
Do you have any saved log and env output from before this happened? Like from a year or two ago, for comparison, as I only see one small thing that *maybe* is something.
Unless I’m misreading the table (others please correct me if I am), your NODE VCPU voltage (1.30 V) is only 0.08 V away from the 10% out-of-tolerance warning limit (1.38 V). Am I right? If so, could this be a power issue (clean power tolerance)? I’ll assume that’s not it for now, though.
Did you upgrade Irix, which would in turn upgrade your L1 firmware?
Were you experiencing any oddities in Irix or on the system before this happened (loss of date/time in Irix, or booting issues)?
Did you alter anything recently (even reseating hardware or memory DIMMs)?
Did you move the Tezro recently, like moving residences by car, or something similar?
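For anyone who wants to check the voltage margin across several rails instead of by eye, here is a minimal Python sketch. It assumes the row layout of the "env" output posted above; the regex and the name `warning_margin` are my own and not part of any SGI tool.

```python
import re

# Matches one voltage row of the L1 "env" table, e.g.
# "NODE VCPU Enabled 10% 1.13/ 1.38 20% 1.00/ 1.50 1.30"
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<state>Enabled|Wait Pwr)\s+"
    r"\d+%\s+(?P<warn_lo>[\d.]+)/\s*(?P<warn_hi>[\d.]+)\s+"
    r"\d+%\s+(?P<fault_lo>[\d.]+)/\s*(?P<fault_hi>[\d.]+)\s+"
    r"(?P<current>[\d.]+)$"
)

def warning_margin(line):
    """Return (rail name, distance in volts to the nearest warning limit),
    or None if the line is not a voltage row."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    lo, hi, cur = (float(m[k]) for k in ("warn_lo", "warn_hi", "current"))
    # Round to the two decimals the L1 prints, to hide float noise.
    return m["name"], round(min(cur - lo, hi - cur), 2)

# Three rows copied from the powered-up env output in this thread.
rows = [
    "3.3V Enabled 10% 2.97/ 3.63 20% 2.64/ 3.96 3.47",
    "NODE 1.5V Enabled 10% 1.35/ 1.65 20% 1.20/ 1.80 1.57",
    "NODE VCPU Enabled 10% 1.13/ 1.38 20% 1.00/ 1.50 1.30",
]
margins = [warning_margin(r) for r in rows]
for name, margin in margins:
    print(f"{name}: {margin:.2f} V from nearest warning limit")
```

On these rows it gives 0.08 V for NODE VCPU, matching the figure above, and the same 0.08 V for NODE 1.5V. All three readings are still inside their warning limits.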
RE: sgi tezro L1 General Exception on node 0 -
HarryT - 05-21-2022
Thank you for your help!
In fact I haven't changed anything in the system or moved it in any way. Nothing was changed either on the hardware or on the software. The system was last in operation in December 2021. Unfortunately I don't have a log file from back then. Right now I'm guessing the Dallas chip's battery is dead.
RE: sgi tezro L1 General Exception on node 0 -
weblacky - 05-21-2022
No, don’t assume that. If anything, you can test the yellow Snaphat. The L1 RTC almost never goes dead, and if it had you’d see other symptoms during boot.
The yellow Snaphat is in charge of boot variables and the time and date under Irix. It’s on your IO9 card.
You can just pull it off and measure at the pin end with a voltmeter. One pin pair is a crystal oscillator; the other pair is a lithium battery, which should measure around 3 V or more.
Also, you can pretty easily test the L1 RTC using the date command. Either set the date or note the date shown at the L1 terminal, unplug the system for a little while, then plug it back in. If the L1 date is still intact and makes sense, the L1 RTC battery is working just fine.
RE: sgi tezro L1 General Exception on node 0 -
HarryT - 05-21-2022
(05-21-2022, 04:18 PM)weblacky Wrote:
Also, you can pretty easily test the L1 RTC using the date command. Either set the date or note the date shown at the L1 terminal, unplug the system for a little while, then plug it back in. If the L1 date is still intact and makes sense, the L1 RTC battery is working just fine.
The date seems correct. What now?
RE: sgi tezro L1 General Exception on node 0 -
Irinikus - 05-21-2022
Maybe try reseating your node board (Only attempt this if replacing the Snaphat doesn't work). It's unlikely, but there could be a bad connection there! (Two MEG-array connectors)
Just be careful not to force it back into position. (I've swapped out nodes in my Tezro a few times and it's quite easy to do!)
There appears to be a signalling fault. (UART - Universal asynchronous receiver-transmitter)
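To see how often each UART status shows up, the L1 log can be tallied with a few lines of Python. This is only a sketch: the regex assumes the line format of the log posted earlier in the thread, `tally_uart` is a made-up name, and the sample is a short excerpt of that log. It counts events and says nothing about which serial link they came from.

```python
import re
from collections import Counter

# Matches L1 log lines that end in a UART status, e.g.
# "05/21/22 07:18:53 SMP-R: UART:UART_OVERRUN_ERROR"
UART_STATUS = re.compile(r"UART:(UART_[A-Z_]+)\s*$")

def tally_uart(log_lines):
    """Count UART status codes found at the end of L1 log lines."""
    counts = Counter()
    for line in log_lines:
        m = UART_STATUS.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Excerpt of the log posted in this thread.
sample = [
    "05/21/22 06:07:42 SMP-R: UART:UART_NO_CONNECTION",
    "05/21/22 06:18:58 power up (COMMAND)",
    "05/21/22 06:22:22 SCC WR 6 (len=6) - UART:UART_TIMEOUT",
    "05/21/22 06:22:24 SCC WR 6 (len=6) - UART:UART_TIMEOUT",
    "05/21/22 06:25:30 SMP-R: UART:UART_BREAK_RECEIVED",
    "05/21/22 07:18:53 SMP-R: UART:UART_OVERRUN_ERROR",
    "05/21/22 07:18:55 SMP-R: UART:UART_OVERRUN_ERROR",
]
counts = tally_uart(sample)
for status, n in counts.most_common():
    print(f"{status}: {n}")
```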
RE: sgi tezro L1 General Exception on node 0 -
weblacky - 05-21-2022
We could try to use your backup L1 firmware bank just in case as well to see if anything changes.
But a general exception is really not leading me anywhere much online, so we’ll have to make our own luck!
Also see if there are any console messages on Ctrl+t via L1 while trying to boot to get us more output to inspect.
Otherwise I’d go to that Yellow Snaphat, remove it and measure voltage on it.
M4T28-BR12SH
M4T32-BR12SH
That’s your other RTC (the one more important to system boot). If it’s low, order a new one (the larger M4T32-BR12SH) and install it. Both part numbers work on SGI systems.
It’s on your large IO9 card, the top-most mounted card with SCSI, Ethernet, etc. on it. It lifts straight out but takes a little force. Take your time.
Removal WILL RESET your Irix date and time and your NVRAM boot parameters! They will be at factory default values when you install or reinstall the Snaphat battery. If you’ve not fiddled with special boot arguments, this should have no effect on booting (no warranty in that statement).
Obviously, do a full power disconnect, with the AC cord pulled out of the socket, before trying to remove that yellow battery/crystal assembly.
If it measures above 3 V then it’s good. If it’s right on the edge, it could still be under 3 V when under load, so that’s a touchy call.
If you think it’s good, then reinstall it and we’ll need to try more drastic measures, like removing memory modules and resetting the board’s logs and configuration via L1.
If anyone else has other suggestions I’d be interested to hear them. Normally there’s only a few things and if the system really hasn’t been touched I’m sort of scratching my head on where the next reasonable step should come from.
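The Snaphat rule of thumb above (above 3 V is good, right at 3 V is a touchy call, low means replace) can be written down as a small Python sketch. The 3.0 V threshold comes from this post; the width of the "edge" band (0.1 V here) and the function name are assumptions of mine, so adjust to taste.

```python
def snaphat_verdict(measured_volts, good_above=3.0, edge_band=0.1):
    """Classify an open-circuit reading taken on the Snaphat battery pin pair.

    good_above is the 3 V figure from the post; edge_band is an assumed
    width for "right on the edge", not a datasheet value.
    """
    if abs(measured_volts - good_above) <= edge_band:
        # Could sag below the threshold under load, so not a clear pass.
        return "borderline"
    if measured_volts > good_above:
        return "good"
    return "low"

# Example readings (made up for illustration).
for volts in (3.2, 3.05, 2.5):
    print(f"{volts:.2f} V -> {snaphat_verdict(volts)}")
```

Make sure the reading is from the battery pin pair and not the crystal pair before trusting the verdict.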
RE: sgi tezro L1 General Exception on node 0 -
HarryT - 05-21-2022
(05-21-2022, 05:24 PM)weblacky Wrote: We could try to use your backup L1 firmware bank just in case as well to see if anything changes.
But a general exception is really not leading me anywhere much online, so we’ll have to make our own luck!
Also see if there are any console messages on Ctrl+t via L1 while trying to boot to get us more output to inspect.
Otherwise I’d go to that Yellow Snaphat, remove it and measure voltage on it.
M4T28-BR12SH
M4T32-BR12SH
That’s your other RTC (the one more important to system boot). If it’s low, order a new one (the larger M4T32-BR12SH) and install it. Both part numbers work on SGI systems.
Where exactly do I measure? On one side I get 0.17 volts.
On the web, there are M4T32-BR12SH1 and M4T32-BR12SH6 modules. Do they also fit?