Onyx: boot is incomplete, fault is no master
#11
RE: Onyx: boot is incomplete, fault is no master
You can re-enable CPUs from the system controller. The system controller will still talk on the SSE diagnostics port even if there's no CPU board in the system at all.

Regarding the power system: it's vulnerable due to the many DC-DC converters in the system. But if voltages go out of spec, it will show the dreaded POK-A or POK-B errors and not even begin booting. I'm sure that in theory a DC-DC converter on the CPU board can put out enough noise to make the L2 malfunction while the CPU still works and the voltages are within spec, but I doubt it's the most likely scenario here.

It's best to hook up a serial cable. We can speculate a lot here, but the system is probably telling us what's wrong on the serial line.
jan-jaap
SGI Collector

Trade Count: (0)
Posts: 1,048
Threads: 37
Joined: Jun 2018
Location: Netherlands
11-12-2021, 10:36 AM
#12
RE: Onyx: boot is incomplete, fault is no master
By now, this feels a lot like an adventure game where I have to look at a walkthrough for every second step. :-(

According to the diagnostic road map, re-enabling the CPUs is done by disabling the power-on diagnostics in the debug menu. So I loaded a save game (haha), did that and rebooted. Good news: CPU arbitration is no longer the reason the machine does not boot. Now it is PROM error 251: The PROM code took an unexpected exception.

To quote weblacky: "I guess console boot output is a must then..."

And I like the idea of blaming power. Stepping back a few steps and trying to apply some logic, what makes me wonder is the fact that the machine passed the power-on tests and booted into the management GUI just fine when I got it. Only since the first reset, both CPU caches have failed their self-tests and prevented the machine from booting. So both were fine. Now both are not.

In layman's terms, that sounds like a blown fuse to me. I did not see or smell any magic smoke escaping the machine at any time, and none of the parts on the CPU board looked defective, but I am not a power or electronics guy, I am a software dude. So if you could point me to the spots where I should attach my multimeter, please do. And you might also want to tell me what to set it to and what the expected values are, because, you know, I am a software dude. ;-)

Thanks again everyone, you're doing a great job on not letting me give up on the machine.
(And it is way too heavy to throw it out of the window, too.)

[Image: onyx.png] [Image: indigo.png] [Image: o2.png] [Image: indy.png]
capmilk
O2

Trade Count: (0)
Posts: 10
Threads: 1
Joined: Nov 2021
Location: Germany
11-13-2021, 01:22 PM
#13
RE: Onyx: boot is incomplete, fault is no master
I failed to get any output from the console. Probably because I failed to build a working cable. Or because I failed to properly set up Kermit on my NeXT. Or even both.

The user manual does not mention what kind of protocol the console port speaks, and the description of the 4D serial pinout on http://www.sgistuff.net/mirrors/4dfaq/#serialports does not match the one in the owner's manual (007-1733-070), so in the end I tried both of these and others I had found on the web before. Never got anything to appear in that Kermit window.

So I am done for the day. Very close to settling for "Yes, I own an Onyx. No, it does not run."
I just might not have the patience for this kind of tinkering without the slightest clue of what I am doing any more.

capmilk
O2

Trade Count: (0)
Posts: 10
Threads: 1
Joined: Nov 2021
Location: Germany
11-13-2021, 05:42 PM
#14
RE: Onyx: boot is incomplete, fault is no master
From experience, many of the serial terminal requirements are basic 3-wire systems. Yes, more is better but many will work and show console output with just the minimum.

When I do serial cables, I use pin numbers molded into the connector to keep my head straight (because it’s easy to confuse symmetry and mirror image going by eye).

Many of the manuals have a sample cable pin out with port numbers on each end. That’s how I make it through those.

So go back to the manual you quoted: page 106 describes what you need, a 3-wire system. Convert the DB25 pin positions (terminal side) to DB9 here: https://allpinouts.org/pinouts/cables/se...in-serial/

Basically 2 & 3 are reversed on a DB9 and ground is pin 5, not 7 like a DB25.
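To make the DB25-to-DB9 conversion concrete, here is a small Python sketch. It only encodes the standard PC-style (DTE) pin numbers for the three signals a 3-wire cable needs. The names `DB25`, `DB9` and `db25_to_db9` are made up for this illustration, and the pinout on the Onyx side still has to come from the owner's manual.

```python
# Standard RS-232 DTE pin numbers for the three signals of a 3-wire cable.
# These describe PC-style connectors only, NOT the Onyx ports (assumption:
# the terminal side is an ordinary PC serial port).
DB25 = {"TxD": 2, "RxD": 3, "GND": 7}
DB9 = {"TxD": 3, "RxD": 2, "GND": 5}


def db25_to_db9(pin):
    """Return the DB9 pin that carries the same signal as the given DB25 pin."""
    for signal, db25_pin in DB25.items():
        if db25_pin == pin:
            return DB9[signal]
    raise ValueError(f"DB25 pin {pin} is not part of a 3-wire cable")


if __name__ == "__main__":
    for signal in ("TxD", "RxD", "GND"):
        print(f"{signal}: DB25 pin {DB25[signal]} -> DB9 pin {DB9[signal]}")
```

Whether transmit and receive need to be crossed inside the cable depends on the ports at both ends, so check that against the sample cable in the manual.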

There is no “protocol” like a modem file transfer (zmodem, etc). The port just sends plain text and usually assumes a VT100-style terminal on the other end.

I always end up using an old Windows laptop, so I just fire up HyperTerminal and don’t worry about it.
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-13-2021, 06:41 PM
#15
RE: Onyx: boot is incomplete, fault is no master
Fired up the Onyx and you're right, it's completely silent on the SSE port. You have to unlock the special hidden debug menu first.

The Diagnostics Roadmap page 4-16 has the details.

With debug flag 7 set, I get this:

Code:
g System..

Welcome to Everest manufacturing mode.
Testing master IA chip registers (slot 03)...
Testing map RAM in master IO4's IA chip (slot 03)...
Testing master EPC (slot 03, adap 01)...
Initializing EPC UART...


IP25 SCC(E) SGI Version 6  built 10:11:49 AM May  8, 1996
R10000 3.1 194MHz BE (4-2-2/8) 2MB

-- USING SYS. CTLR. UART --
Initializing hardware inventory...              ...done.
    CPU 02/00 is bootmaster
Testing Secondary Cache... ...passed.
Testing and clearing bus tags... ...passed.
Configuring memory...
    Using standard interleave algorithm.
Running built-in memory test... 01
*** Self-test FAILED on slot 01, leaf 0, bank 0 (A)

*** Self-test FAILED on slot 01, leaf 1, bank 0 (B)

                                                ...passed.
Writing cfginfo to memory
Initializing MPCONF blocks
Checking slave processor diag results................
    Enabled 1536 Megabytes of main memory
    (Disabled 512 Megabytes of main memory)
    Enabled 4 processors
Downloading PROM header information...
Downloading PROM code...
Jumping into IO4 PROM.
IO4 PROM Monitor SGI Version 4.21 Rev A IP25,  Sep  3, 1996 (BE64)
Sizing caches...
Initializing exception vectors.
Initializing IO4 subsystems.
Fixing vpids...
Initializing environment
*** S/N      SYSCTLR(S47448) NVRAM(S47448)      ***
Piggyback reads enabled.
Initializing software and devices.
Cannot connect to keyboard -- check the cable.
Cannot open keyboard() for input
Cannot connect to keyboard -- check the cable.
Cannot open keyboard() for input
All initialization and diagnostics completed.
Bootmaster processor already started.
Starting processor #1
Starting processor #2
Starting processor #3
Checking hardware inventory...
***      Bank A on the MC3 in slot 1 failed diagnostics.
***        Reason: Memory built-in self-test failed.
***      Bank A on the MC3 in slot 1 is DISABLED.
***      Bank B on the MC3 in slot 1 failed diagnostics.
***        Reason: Memory built-in self-test failed.
***      Bank B on the MC3 in slot 1 is DISABLED.

Press <Enter> to continue%

A few things:

1. Seems my Onyx is sick too and I need to pull the MC3 and clean the SIMMs.
2. The initial output is missing. Reminds me a lot of the PowerSeries, where the initial console output goes missing if you don't have a console cable with hardware handshaking. I used a 3-wire cable here.
3. It asks for ENTER, but when I press it, nothing happens. Again, it may be the cable, but maybe this is just a write-only diagnostics output and not a full console.
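For anyone collecting these logs, the memory failures in the SSE output above can be pulled out mechanically. A minimal sketch, assuming the `*** Self-test FAILED on slot NN, leaf N, bank N (X)` line format shown in this post. `failed_banks` is a hypothetical helper written for this thread, not an SGI tool.

```python
import re

# Matches lines such as:
#   *** Self-test FAILED on slot 01, leaf 0, bank 0 (A)
# The format is taken from the POST output in this thread (assumption:
# other PROM versions print it the same way).
MEM_FAIL = re.compile(
    r"\*\*\* Self-test FAILED on slot (\d+), leaf (\d+), bank (\d+) \(([A-Z])\)"
)


def failed_banks(log):
    """Return one dict per failed memory bank found in the POST log text."""
    found = []
    for line in log.splitlines():
        match = MEM_FAIL.search(line)
        if match:
            slot, leaf, bank = (int(g) for g in match.group(1, 2, 3))
            found.append(
                {"slot": slot, "leaf": leaf, "bank": bank, "letter": match.group(4)}
            )
    return found


if __name__ == "__main__":
    sample = (
        "Running built-in memory test... 01\n"
        "*** Self-test FAILED on slot 01, leaf 0, bank 0 (A)\n"
        "*** Self-test FAILED on slot 01, leaf 1, bank 0 (B)\n"
    )
    for entry in failed_banks(sample):
        print(entry)
```

The letter in parentheses is the same bank name the IO4 PROM uses later in "Bank A on the MC3 in slot 1 failed diagnostics", so it is the handiest field to carry over to the board.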

When attached to the console port, I can get into the PROM menu and proceed:

Code:
System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option? 5
Command Monitor.  Type "exit" to return to the menu.
>> hinv
                   System: IP25
                Processor: 194 Mhz R10000, 2M secondary cache
                Processor: 194 Mhz R10000, 2M secondary cache, (cpu 1)
                Processor: 194 Mhz R10000, 2M secondary cache, (cpu 2)
                Processor: 194 Mhz R10000, 2M secondary cache, (cpu 3)
              Memory size: 1536 Mbytes
               SCSI CDROM: scsi(0)cdrom(3)
                SCSI Tape: scsi(0)tape(6)
                SCSI Disk: scsi(1)disk(1)
                SCSI Disk: scsi(1)disk(2)
                 Graphics: InfiniteReality Graphics
>> update
>>

NB: you don't need to power cycle the Onyx to get into the debug menu. Just key to the ON position, up+down, key to DIAG, up+down and you're in. It goes away again if you leave the menu, so you have to do the secret dance again to re-enable it.
(This post was last modified: 11-14-2021, 01:17 PM by jan-jaap.)
jan-jaap
SGI Collector

Trade Count: (0)
Posts: 1,048
Threads: 37
Joined: Jun 2018
Location: Netherlands
11-14-2021, 01:15 PM
#16
RE: Onyx: boot is incomplete, fault is no master
I saved this conversation for later - without these special procedures I would be lost on the older machines.

SGI - the legend will never die!!

Indy Indigo Crimson Indigo2 R10000/IMPACT Indigo2 R10000/IMPACT O2 O2 Octane Octane2 Octane2 Tezro
Geoman
Crimson to Tezro

Trade Count: (0)
Posts: 162
Threads: 13
Joined: May 2018
Location: Germany
11-14-2021, 06:29 PM
#17
RE: Onyx: boot is incomplete, fault is no master
You know the drill by now: I write a post, crying how nothing is working, one of you replies with some wisdom, I can make another step, say thank you and repeat from the start.

So here it is:
Thank you for the tip with "manu mode" in the debug menu, that did get console output going. I might even have gotten that NeXT and its cable to work, but I had already set up an XP machine with HyperTerminal and a different cable. With numbered pins, God bless. (The only DB9-DB9 serial cable I have is an original NeXT one, so this solution still has a little style, but not very much compared to before.)

Here's the output of the SSE port:

Code:
Welcome to Everest manufacturing mode.
Initializing master IO4...
Testing master IA chip registers (slot 03)...
Testing map RAM in master IO4's IA chip (slot 03)...
Testing master EPC (slot 03, adap 01)...
Initializing EPC UART...


IP19 PROM (BE) SGI Version 13  built 02:03:53 AM Aug 29, 1993
-- USING SYS. CTLR. UART --
Checking system endianess...                    Big endian
Initializing hardware inventory...              ...done.
    CPU 02/00 is bootmaster
Testing and clearing bus tags...                ...passed.
Configuring memory...
    Using standard interleave algorithm.
Running built-in memory test... 01
*** Self-test FAILED on slot 01, leaf 1, bank 3 (H)

                                                ...passed.
Writing cfginfo to memory
Initializing MPCONF blocks
Checking EAROM...                               ...passed.
Testing secondary cache...                      ...FAILED!
*** Current bootmaster (02/00) failed diagnostics.  Rearbitrating...

Welcome to Everest manufacturing mode.
Welcome to Everest manufacturing mode.
Testing master IA chip registers (slot 03)...
Testing map RAM in master IO4's IA chip (slot 03)...
Testing master EPC (slot 03, adap 01)...
Initializing EPC UART...


IP19 PROM (BE) SGI Version 13  built 02:03:53 AM Aug 29, 1993
-- USING SYS. CTLR. UART --
Checking system endianess...                    Big endian
Initializing hardware inventory...              ...done.
    CPU 02/01 is bootmaster
Testing and clearing bus tags...                ...passed.
Configuring memory...
    Using standard interleave algorithm.
Running built-in memory test... 01
*** Self-test FAILED on slot 01, leaf 1, bank 3 (H)

                                                ...passed.
Writing cfginfo to memory
Initializing MPCONF blocks
Checking EAROM...                               ...passed.
Testing secondary cache...                      ...FAILED!
*** Current bootmaster (02/01) failed diagnostics.  Rearbitrating...
 

This Self-test FAILED on slot 01, leaf 1, bank 3 (H) means 2nd level cache, right? Can I remove individual cache SIMMs, or do they need to go in pairs? Or none at all?
What confuses me is the fact that both CPUs fail arbitration because of the same SIMM, I would have expected each CPU to have their own cache.
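Read line by line, the log above contains two different kinds of failure per boot attempt: one under "Running built-in memory test" and one under "Testing secondary cache". The sketch below groups them by bootmaster. It is only an illustration that assumes the log format shown in this thread, and `summarize_attempts` is a hypothetical name.

```python
import re

# Line formats copied from the SSE output in this thread (assumption: they
# are stable across the IP19 and IP25 PROM versions shown here).
BOOTMASTER = re.compile(r"CPU (\d+/\d+) is bootmaster")
MEM_FAIL = re.compile(
    r"Self-test FAILED on slot (\d+), leaf (\d+), bank (\d+) \(([A-Z])\)"
)
CACHE = re.compile(
    r"Testing secondary cache\.\.\.\s+\.\.\.(passed\.|FAILED!)", re.IGNORECASE
)


def summarize_attempts(log):
    """Group memory and secondary-cache results by boot attempt."""
    attempts = []
    current = None
    for line in log.splitlines():
        match = BOOTMASTER.search(line)
        if match:
            # A new bootmaster line starts a new boot attempt.
            current = {"bootmaster": match.group(1), "memory_failures": [], "cache": None}
            attempts.append(current)
            continue
        if current is None:
            continue
        match = MEM_FAIL.search(line)
        if match:
            current["memory_failures"].append(match.group(4))
        match = CACHE.search(line)
        if match:
            failed = match.group(1).upper().startswith("FAILED")
            current["cache"] = "failed" if failed else "passed"
    return attempts


if __name__ == "__main__":
    sample = "\n".join([
        "    CPU 02/00 is bootmaster",
        "Running built-in memory test... 01",
        "*** Self-test FAILED on slot 01, leaf 1, bank 3 (H)",
        "Testing secondary cache...                      ...FAILED!",
        "    CPU 02/01 is bootmaster",
        "*** Self-test FAILED on slot 01, leaf 1, bank 3 (H)",
        "Testing secondary cache...                      ...FAILED!",
    ])
    for attempt in summarize_attempts(sample):
        print(attempt)
```

Grouped this way, the bank H line belongs to the memory test and repeats in both attempts, while each bootmaster (02/00 and 02/01) reports its own secondary cache result separately.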

Anyway, thanks again, I'll try to find a map of the cache SIMM positions in the manual(s). Otherwise, I'll start a 50/50 test.

capmilk
O2

Trade Count: (0)
Posts: 10
Threads: 1
Joined: Nov 2021
Location: Germany
11-14-2021, 06:32 PM
#18
RE: Onyx: boot is incomplete, fault is no master
I dared to try something without awaiting your instructions for once. Namely checking out the MC3 memory board. The SIMM I think might match the error message would be the front bottom one (H3 in Leaf 1 of Board 1). Pulled it out (this time breaking the plastic lever), cleaned it, put it back in again. Did not change a thing, though. Done for the day.

capmilk
O2

Trade Count: (0)
Posts: 10
Threads: 1
Joined: Nov 2021
Location: Germany
11-14-2021, 08:35 PM
#19
RE: Onyx: boot is incomplete, fault is no master
Decided to fix my own Onyx's memory problems.

When I pulled the MC3 I noticed that the offending memory sticks were all 'double decker' high density SIMMs.
Pulled, cleaned and re-seated everything.

This didn't change a thing.

Since I apparently had two faulty banks I decided to move sticks around in order to find the faulty one in each bank of four.

Oh, did I mention that in order to remove the MC3 in an Onyx IR you have to remove the protective metal bar across the board handles, and in order to remove that you need to remove the big 'jumper' across the IR board? And that last one is a major pain in the $$$ to remove. All for a dirty/faulty SIMM. At least I now know the system will POST without that jumper installed...

Anyway, my memory SIMM swapping didn't result in anything so I decided to take an empty MC3 and use that for my experiments. That's when I noticed little SMD capacitors on my desk. You see, there's a protective sheet of plastic on the backside of the IP25 CPU board in the next slot. The MC3 SIMMs were 'eating' the edge of it every time I reinstalled the MC3 so I probably pushed the MC3 a little too far left, and the backside of the MC3 rubbed the chassis gasket and lost some parts.

After that, the MC3 never worked again. It disables all memory banks at POST and the system panics.

Fortunately, I have spare MC3s. Unfortunately, they're all filled with low density memory (512MB) and I want my 2GB. Then again, one of them had the same protective plastic on the back that my IP25 has. No more rubbing against the chassis. Made sure the system booted with it (the skin of your fingers hates the procedure of removing 32 SIMMs and cleaning everything with pure alcohol so you don't want to do this and then find out the MC3 suffers from a bloody POK-A error).

Glued the protective plastic to the back of the IP25 with some superglue so the MC3 doesn't get stuck on it. Found 4 working SIMMs from the 2 disabled banks of high density memory using a spare MC3. Found one complete bank of high density SIMMs in my infinite collection of spares. Removed and cleaned all low density memory from the MC3-with-protective-backside-cover. Transplanted all working high density memory to it. Have system that POSTs once again. :-)
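The memory numbers in this post can be cross-checked with a bit of arithmetic. A rough sketch, assuming one MC3 that is fully populated with identical SIMMs; the 32 SIMMs per board and four SIMMs per bank come from the text above, and the function names are made up for the illustration.

```python
SIMMS_PER_BOARD = 32  # "removing 32 SIMMs" from one MC3
SIMMS_PER_BANK = 4    # "each bank of four"


def banks_per_board():
    """Number of banks on one fully populated board."""
    return SIMMS_PER_BOARD // SIMMS_PER_BANK


def simm_size_mb(board_total_mb):
    """Per-SIMM size, assuming the board holds 32 identical SIMMs."""
    return board_total_mb // SIMMS_PER_BOARD


def bank_size_mb(board_total_mb):
    """Size of one bank of four SIMMs on such a board."""
    return simm_size_mb(board_total_mb) * SIMMS_PER_BANK


if __name__ == "__main__":
    for total in (512, 2048):
        print(f"{total} MB board: {simm_size_mb(total)} MB per SIMM, "
              f"{bank_size_mb(total)} MB per bank, {banks_per_board()} banks")
    disabled = 2 * bank_size_mb(2048)  # banks A and B were disabled
    print(f"two disabled banks: {disabled} MB, leaving {2048 - disabled} MB")
```

Under these assumptions a 2 GB board works out to 64 MB per SIMM against 16 MB for the 512 MB boards, and two disabled banks come to 512 MB, which agrees with the "Enabled 1536 Megabytes" and "Disabled 512 Megabytes" lines in the earlier POST output.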

So there you have it. This is life with an Onyx1: a simple memory fault escalates into 2 or 3 evenings of disassembly, cleaning and more damage, and only a good stock of spares saved the day.

Next up: my Challenge L. It throws a -12VDC over-voltage and refuses to power on. The only place where -12VDC exists in these is the IO4 VCAM. I replaced the IO4 but the fault remained. Meh. I think I'll replace the system controller next (this means rear side disassembly...), maybe the monitoring circuit is bad. Otherwise it would mean I have not one but two broken VCAMs in my hands, and I'd have to attach some test wires to measure the actual voltages that the system complains about.

Finally, the Onyx isn't even fully fixed yet. I do not have a single RM6 board that will pass irsaudit, and the damage is beyond simple pixel faults on the screen: it crashes the X server in no time. If anyone reads this and has a good RM6 board (or an entire Onyx IR ...) that they want to get rid of, let me know.
(This post was last modified: 11-18-2021, 08:15 AM by jan-jaap.)
jan-jaap
SGI Collector

Trade Count: (0)
Posts: 1,048
Threads: 37
Joined: Jun 2018
Location: Netherlands
11-18-2021, 08:14 AM
#20
RE: Onyx: boot is incomplete, fault is no master
I have two RM8 boards, don't know if that would help you out... LMK.
indigofan
Tezro

Trade Count: (4)
Posts: 294
Threads: 43
Joined: Jun 2020
Location: Catskill Mountains, NY, USA
11-18-2021, 07:53 PM

