Troubleshooting my Indigo2 -
Trippynet - 04-14-2020
Given that I'm on holiday at the moment and am restricted to house jobs, I've been trying to troubleshoot my Indigo2.
In short, it runs for a while, then powers off. I've had issues like this on and off for ages with it, bought another PSU from Ian a few years back and initially thought the problem was gone, yet with prolonged running, it kicks in again. Also had one of my PSUs sent off to a repair firm who completely re-capped it and replaced 4 resistors as well, but I still have issues. So I've tried a number of combinations over the past few days:
PSU No. 1 (original internal components):
All main components installed (Max IMPACT, hard drives, etc): 28 minutes runtime before powering down.
Graphics card swapped with SolidIMPACT: 1hr 54 mins runtime before powering down.
No graphics card or hard drives (system chimes, then sits there): 6hr 30 mins and still going.
PSU No. 2 (re-capped PSU):
All main components installed (Max IMPACT, hard drives, etc): 57 minutes runtime before powering down.
Graphics card swapped with SolidIMPACT: 4hr 6 mins runtime before powering down.
No graphics card or hard drives (system chimes, then sits there): 6hr 30 mins+ and still going.
Should note I also swapped the fans back to the original Panaflo ones just to ensure poor cooling wasn't an issue. This had no major effect on timings; in fact it actually ran slightly longer with the un-modded PSU and quiet fans than it did when I swapped the Panaflo one into the card bay. All tests were run with the case fitted to ensure cooling would be effective.
So! Power draw definitely looks to affect things, and with lower power draw it does run longer before crapping out. Plus the re-capped supply does seem to survive longer before the system dies. However I'm still not fully sure what is causing the issues. Do I have two duff PSUs (possible due to the age of them)? Or is there something else at fault which is putting too much strain on the PSU until (presumably) the PSU's protection circuits kill the thing? Any ideas on how I can try to suss this finicky bugger out would be welcome at this stage!
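For anyone wanting to compare the timings side by side, here is a rough sketch that puts the runtimes listed above into minutes and works out how much longer the re-capped PSU lasted for each configuration. The dictionary layout and the function name are purely illustrative, and the "no graphics" runs are stored as None because they were still going when the test stopped.

```python
# Runtimes before shutdown, in minutes, taken from the test results above.
# None means the system was still running when the test ended (6 h 30 min+).
RUNTIMES = {
    "PSU 1 (original)": {"MaxIMPACT": 28, "SolidIMPACT": 114, "no graphics": None},
    "PSU 2 (re-capped)": {"MaxIMPACT": 57, "SolidIMPACT": 246, "no graphics": None},
}


def recap_ratio(config):
    """How many times longer PSU 2 ran than PSU 1 for one configuration."""
    original = RUNTIMES["PSU 1 (original)"][config]
    recapped = RUNTIMES["PSU 2 (re-capped)"][config]
    if original is None or recapped is None:
        return None  # no shutdown observed, so no ratio can be computed
    return recapped / original


if __name__ == "__main__":
    for config in ("MaxIMPACT", "SolidIMPACT", "no graphics"):
        r = recap_ratio(config)
        print(config, "n/a" if r is None else f"{r:.2f}x")
```

With both graphics cards the re-capped supply lasts roughly twice as long, which fits the observation that it survives longer but does not by itself say where the fault is.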
RE: Troubleshooting my Indigo2 -
weblacky - 04-15-2020
Hi Trippynet,
I've not gotten time to work with my SGIs for several months (real world BS) and I feel bad that we've really not been able to help you further on this. So I'm willing to propose some additional tests (see if something doesn't shake loose).
1. Thermal issues check before anything: Did you do all these tests one after the other, or did you do like one each day? Was the system cold (like an hour sitting off) before each was done?
2. Could you please perform the same graphics card tests again (MAX then Solid Impact and then no graphics) but this time leave out ALL optical and external/internal drives (remove sleds entirely). Just sit at PROM with the graphics. Please let us know what times you get. Are they the same? I've seen at least one shorted cap on a drive before (not SGI) and the like. I'd like to see if your draw theory pans out but in a certain section of the machine. So let's try no drives and see if nothing changes. Drives normally shouldn't draw more than 1 AMP per drive at start with 500 mA or less once spinning. So they aren't a huge source of draw, but if you observe a change then I'd go into rebuilding the caps and voltage regulators on the backs of the drive tray regions (where termination lives on each bay) and try a different drive and/or sled. Also confirm findings with an external enclosure drive, if you can source one.
3. If after all permutations, you only get a continually operating machine with no graphics set (no change from previously posted results) then:
Last time I was inside an Indigo2 IMPACT machine and was looking at points of failure and rebuild areas, what drew me was the "way too many" capacitors on the GIO-64 backplane and the back of the CDROM tray (terminator area) as well. So when you lean towards the lack of a problem by having no graphics cards at all...yes I agree on power draw being something...but perhaps another idea is literally the card interface?
I had always assumed the caps on the GIO-64 backplane were for communication, not so much for power draw smoothing/filtering. But regardless, what if that backplane needs new caps or something? I stared at mine pretty hard back when I was doing this and thought to myself: how would I know if these were the issue, what do we need all these caps to do, and in the end do they just improve something or are they vital?
Off the wall idea:
I'm assuming you don't have advanced electronic troubleshooting tools like a curve tracer to test caps in-circuit. If you want to try something odd, you can try a cheat similar (in a comparative sense only) to a curve tracer deflection check on the capacitor dielectric with a multimeter that has capacitance measurement on it. If you have a multimeter that can do capacitive farads measurement (and read small enough values) then you can try this:
Power everything off, remove your graphics set and the graphics backplane board. Check that the "tons of caps" on the backplane board have exposed metal tops and not fully sealed bodies. With the backplane board outside the machine and all slots on it empty, put your meter in capacitance measurement mode, touch the red lead to the top metal can of a cap on the backplane board and touch the black lead to the negative leg of the same cap on the other side of the board. You might also find that all negative leads are connected together. You can check that by leaving the black probe on a cap’s negative leg and trying the neighboring cap’s top to see if you get a reading. That might make the work much easier (if they are connected then leave the black probe at a negative leg location and just change the location of the red probe; can’t promise that they are connected like that though).
You should register a very small capacitance reading, much smaller than what the cap is rated for on the label on the side; that's expected (fractional). Now slowly do it for every similar cap on that backplane board. Unless your meter registers no reading for ALL caps while doing this, they should all be very similar. If you get one cap with no reading (while the others do read) or a reading that is MUCH smaller or MUCH bigger than the others, you've likely discovered a bad/failing cap. Assume all caps are aging and all caps have a lower capacitance than when new. But they should all be similar (that’s the trick). A cap that looks like the others, labelled the same, should read the same as the others. If not, it’s actually working against them.
This tests half the cap, as the metal lid isn't attached to either leg but is attached to the electrolytic material. So you are actually measuring from one "small plate - the cap’s top lid" to the negative large plate inside (which forms its own “disconnected from circuit” capacitor). This is a crude technique to allow you to see if the dielectric is problematic without removing the cap from the board. It's a comparative thing.
If you get no readings at all then I’d say the meter you’re using isn’t sensitive enough to pick up the very small values. I’ve found most high-end handheld meters do sense a value from caps at this size and I’d expect like 2-4 pF, but the value you get doesn’t matter so much as what values the group gets.
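Since the check is purely comparative, the readings could be jotted down and compared against the group median, something like the sketch below. The cap labels, the example values and the 50% tolerance are all made up for illustration; nothing here comes from a real board, and the threshold would need adjusting to whatever spread your meter actually gives.

```python
from statistics import median


def flag_suspect_caps(readings_pf, tolerance=0.5):
    """Return labels of caps whose reading stands out from the group.

    readings_pf maps a cap label to its lid-to-negative-leg reading in pF,
    or None if the meter showed nothing for that cap.
    Returns None if no cap gave a reading at all (meter likely too coarse).
    """
    valid = [v for v in readings_pf.values() if v is not None]
    if not valid:
        return None
    typical = median(valid)
    suspects = []
    for label, value in readings_pf.items():
        # A missing reading, or one far from the group median, is suspect.
        if value is None or abs(value - typical) > tolerance * typical:
            suspects.append(label)
    return suspects


if __name__ == "__main__":
    # Hypothetical readings, not measured on any real backplane.
    example = {"C1": 3.0, "C2": 3.1, "C3": 2.9, "C4": 9.0, "C5": None}
    print(flag_suspect_caps(example))
```

The median is used rather than the mean so that one wildly off cap does not drag the "typical" value towards itself.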
Perhaps you’ll find something about your backplane that leads you to believe the “no graphics set” solution really points to the interface slot? It would be much easier to fix if that’s the real issue.
But it’s another data point. Let us know what you find and if anything changes or pops out at you.
RE: Troubleshooting my Indigo2 -
nintendoeats - 04-15-2020
If this were a late-90s x86 machine, I would propose thermal shutoff. Does anybody know if the IMPACT boards have a thermal sensor?
I note that you list new/quiet fans. What are they? The stock fans are high static pressure, like 4 mmH2O. Even really good quiet fans tend to have half that rating (at best). Since the MaxIMPACT creates more of a restriction and generates more heat, it would make sense that it would shut down first if the fan produces insufficient airflow.
RE: Troubleshooting my Indigo2 -
jan-jaap - 04-15-2020
(04-15-2020, 02:30 AM)nintendoeats Wrote: If this were a late-90s x86 machine, I would propose thermal shutoff. Does anybody know if the IMPACT boards have a thermal sensor?
I note that you list new/quiet fans. What are they? The stock fans are high static pressure, like 4 mmH2O. Even really good quiet fans tend to have half that rating (at best). Since the MaxIMPACT creates more of a restriction and generates more heat, it would make sense that it would shut down first if the fan produces insufficient airflow.
To the best of my knowledge there is no thermal sensor on the IMPACT boards (or the CPU side for that matter). There might be one inside the PSU.
I have a MaxIMPACT Indigo2 myself and the graphics side air exhaust is like a hair dryer. I wouldn't dare replacing the fans with silent ones. I may have mentioned this before: I replaced the fans in my Cisco WS-C1400 FDDI concentrator with near equivalent (on paper) Noctuas and after a while some of the optical transceivers started to fail. Let it cool down and they worked again.
No way I'm going to risk my SGIs like that. Mine are usually 'loaded' anyway, the margins for error are thin.
RE: Troubleshooting my Indigo2 -
Trippynet - 04-15-2020
Thanks for the responses guys!
The quieter fans I installed some time back were still ones that shifted a significant volume of air, albeit not as much as the original Panaflo fans. They were also replaced particularly as I only had SolidIMPACT graphics originally and hence felt the stock fan was probably OTT for that. I did swap the original Panaflo fans back in for the testing (both in the card bay, and the PSU) to ensure excessive heat wasn't an issue.
Weblacky:
1) The system was left to cool after each test (usually at least an hour or more). I did try it once without cooling and it ran for barely 10 minutes before powering down again.
2) I did do one test with no drives at all (sleds removed for HDDs and optical drive), but with the MaxIMPACT card still fitted. The system sat at the PROM screen and then died after 36 minutes (with the original parts PSU). Not yet tried it with the SolidIMPACT card and the other PSU, I may try to do this over the coming days.
3) I did a visual inspection of the caps on the GIO riser board the other day, no sign of any leaking or damage - although that doesn't guarantee they're fine of course! Although I do have a couple of multimeters, annoyingly neither can measure capacitance. I might look into this to see if I could get such a multimeter as I'm sure it'd be handy in future!
RE: Troubleshooting my Indigo2 -
jan-jaap - 04-15-2020
Are you sure you have the correct PSU installed? There are a few different options. Most people can recognize the IMPACT capable PSU vs. the older model, but there were different versions for R10K vs. R4400, and there were dual-head R4400 and a dual-head R10K model as well.
RE: Troubleshooting my Indigo2 -
weblacky - 04-15-2020
Here is that info. Be aware that I asked Ian about this exact PSU topic and he claimed he's seen 060-0021-xxx PSUs running R10K machines without issue, so the two highest models might just be revisions?
There are five power supplies available for an Indigo2:
Part number: 9430814
Suitable for Extreme, XZ and XL graphics on R4400 and R8000 machines.
Part number: 060-8001-xxx
Suitable for single-head (Solid, High or Max) IMPACT graphics and
dual-head Solid/Solid IMPACT graphics on R4400 machines. Can also
be used in place of 9430814.
Part number: 060-0021-xxx
Suitable for dual-head (Max/Solid, High/High or High/Solid) IMPACT
graphics on R4400 machines. Can also be used in place of 9430814
or 060-8001-xxx.
Part number: 060-8002-xxx
Suitable for single-head IMPACT graphics on R10000 machines.
Part number: 060-0027-xxx
Suitable for dual-head IMPACT graphics on R10000 machines. Can
also be used in place of 060-8002-xxx.
But that brings up a good question: is the PSU correctly sized for a MaxIMPACT and more?
RE: Troubleshooting my Indigo2 -
Trippynet - 04-15-2020
The one I have with original components is part number 060-0021-002, which I understand to be a 48A IMPACT PSU. The second (re-capped) one is part number 060-0027-002, a newer 48A IMPACT PSU. Unless I'm wrong with anything here?
RE: Troubleshooting my Indigo2 -
jan-jaap - 04-15-2020
This post in comp.sys.sgi.hardware refers to SGI document 108-0149-001 (unfortunately 108-*-* numbers are internal & confidential) and quotes:
Quote:Subject: Re: Indigo2 power supply differences
Date: 05/21/1999
Author: Martin Leese - OMG <mleese@hudson.CS.unb.ca>
On 19 May 1999 14:39:05 GMT I wrote:
>> I have become confused over the differences in power supplies in an
>> Indigo2 to support the different graphics options. Could somebody
>> please point me at a reference that explains which power supply
>> supports what graphics. We run Maximum Impact graphics.
Many thanks for Reinhard Wolf for pointing me at SGI Document Number 108-0149-001.
I will give a brief summary so that it will be in Deja News should anubody else need it.
There are five power supplies available for an Indigo2:
Part number: 9430814
Suitable for Extreme, XZ and XL graphics on R4400 and R8000 machines.
Part number: 060-8001-xxx
Suitable for single-head (Solid, High or Max) IMPACT graphics and
dual-head Solid/Solid IMPACT graphics on R4400 machines. Can also
be used in place of 9430814.
Part number: 060-0021-xxx
Suitable for dual-head (Max/Solid, High/High or High/Solid) IMPACT
graphics on R4400 machines. Can also be used in place of 9430814
or 060-8001-xxx.
Part number: 060-8002-xxx
Suitable for single-head IMPACT graphics on R10000 machines.
Part number: 060-0027-xxx
Suitable for dual-head IMPACT graphics on R10000 machines. Can
also be used in place of 060-8002-xxx.
No wonder I was confused.
Regards,
Martin
E-mail: mleese@omg.unb.ca
Web: <http://www.omg.unb.ca/~mleese/>
Seems you need either 060-8002-xxx or 060-0027-xxx for an R10K Indigo2. If you have the latter, you should be fine.
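As a quick cross-check, the quoted list can be written down as a small lookup. This is only a sketch of the table as quoted: the load category names and function names are my own, and it deliberately knows nothing beyond that summary (for example, it will not reflect reports of a 060-0021 running an R10K machine).

```python
# Loads each supply is listed for directly, as (CPU, graphics load) pairs.
DIRECT = {
    "9430814": {("R4400", "non-IMPACT"), ("R8000", "non-IMPACT")},
    "060-8001": {("R4400", "single-head IMPACT"), ("R4400", "dual-head Solid/Solid")},
    "060-0021": {("R4400", "dual-head IMPACT")},
    "060-8002": {("R10000", "single-head IMPACT")},
    "060-0027": {("R10000", "dual-head IMPACT")},
}

# "Can also be used in place of ..." entries from the same list.
REPLACES = {
    "060-8001": ["9430814"],
    "060-0021": ["9430814", "060-8001"],
    "060-0027": ["060-8002"],
}


def base_part(part_number):
    """Strip the revision suffix, e.g. '060-0027-002' -> '060-0027'."""
    pieces = part_number.split("-")
    return "-".join(pieces[:2]) if len(pieces) == 3 else part_number


def covers(part):
    """All loads a supply handles, including those of parts it can replace."""
    loads = set(DIRECT[part])
    for other in REPLACES.get(part, []):
        loads |= covers(other)
    return loads


def suitable_parts(cpu, load):
    """Part numbers (without revision) listed as suitable for a CPU and load."""
    return sorted(p for p in DIRECT if (cpu, load) in covers(p))


if __name__ == "__main__":
    print(suitable_parts("R10000", "single-head IMPACT"))
    print(base_part("060-0027-002") in suitable_parts("R10000", "single-head IMPACT"))
```

For a single-head IMPACT R10000 machine this gives the same two part numbers as the conclusion above.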
RE: Troubleshooting my Indigo2 -
nintendoeats - 04-15-2020
(04-15-2020, 08:18 AM)Trippynet Wrote: 1) The system was left to cool after each test (usually at least an hour or more). I did try it once without cooling and it ran for barely 10 minutes before powering down again.
Does that not suggest that this is a cooling issue? I realize that "suddenly shuts down" doesn't seem like a very specific symptom, but so far there is a direct correlation between heat generated/removed and how long the system lasts. Everything about it sounds to me like the system is overheating.
Perhaps the original fans are worn out and not generating enough pressure anymore? It's also possible that something has degraded and is leaking more power, narrowing the margin on the existing cooling.
Note that there is a difference between static pressure (mmH2O or Pascals) and air flow volume (CFM); most PC fans are designed for PC cases with good airflow characteristics, so they have high max volume but low max static pressure. This means that they can move a lot of air so long as there is no restriction. Fans designed for radiators generally have higher static pressure, but nothing like the Panaflo fans SGI typically used. The design of the I2 case fundamentally depends on high static pressure fans that can pull air from the tiny vents on the other side of the case.
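Fan datasheets quote static pressure in either unit, so here is a tiny conversion sketch for comparing them. It uses the conventional factor of 9.80665 Pa per mmH2O; the 4 mmH2O figure is only the rough number mentioned earlier in the thread for the stock fans, not a datasheet value.

```python
PA_PER_MMH2O = 9.80665  # conventional millimetre of water, in pascals


def mmh2o_to_pa(mmh2o):
    """Convert a static pressure from mmH2O to pascals."""
    return mmh2o * PA_PER_MMH2O


def pa_to_mmh2o(pa):
    """Convert a static pressure from pascals to mmH2O."""
    return pa / PA_PER_MMH2O


if __name__ == "__main__":
    stock = 4.0        # rough figure quoted earlier in the thread
    quiet = stock / 2  # "half that rating (at best)"
    print(f"stock: {mmh2o_to_pa(stock):.1f} Pa, quiet: {mmh2o_to_pa(quiet):.1f} Pa")
```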