VPro / Odyssey Diagnostic Disk (broken V12?)
#11
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
A simple IR thermometer can "read" between the grilles: the chips sit at 35 to 43 °C with an ambient temperature of 23 °C. That is okay, I guess. On a hot summer day without AC, however, this can change for the worse. So it is better to run these machines on cool days or invest in an AC unit (mandatory for any server room nonetheless).

Update: After the machine has been running for some time, I get 51 °C on the chips on a day that is rather cool for this season. This is not so good.



SGI - the legend will never die!!

Indy Indigo Crimson Indigo2 R10000/IMPACT Indigo2 R10000/IMPACT O2 O2 Octane Octane2 Octane2 Tezro
(This post was last modified: 08-11-2021, 09:06 AM by Geoman.)
Geoman
Crimson to Tezro

Trade Count: (0)
Posts: 162
Threads: 13
Joined: May 2018
Location: Germany
08-11-2021, 09:01 AM
#12
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
(08-11-2021, 09:01 AM)Geoman Wrote:  A simple IR thermometer can "read" between the grilles: the chips sit at 35 to 43 °C with an ambient temperature of 23 °C. That is okay, I guess. On a hot summer day without AC, however, this can change for the worse. So it is better to run these machines on cool days or invest in an AC unit (mandatory for any server room nonetheless).

Update: After the machine has been running for some time, I get 51 °C on the chips on a day that is rather cool for this season. This is not so good.

Hey bud, how did you get your readings: with (l1cmd) env or with your heat reader?

Why do you say that 51 °C is not good?

thanks !

Indigo2 IMPACT  : R10K-195MHz, 1GB RAM, 146GB 15K, CD-ROM, AudioDAT, MaxImpact w/ TRAM.  IRIX 6.5.22

O2 : R12K-400MHz, 1GB RAM, 300GB 15K, DVD-ROM, CRM Graphics, AV1/2 Media Boards & O2 Cam, DV-Link, FPA & SW1600.  IRIX 6.5.30

 : 2 x R14K-600MHz, 6GB RAM, V12 Graphics, PCI Shoebox.  IRIX 6.5.30

IBM  : 7012-39H, 7043-140

chulofiasco
Hardware Junkie

Trade Count: (0)
Posts: 328
Threads: 51
Joined: May 2019
Location: New York, NY
08-12-2021, 12:21 AM
#13
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
Only using the heat reader. For now I'd say it's best to operate these machines in cold weather with open windows, or while using an AC system.

Geoman
08-19-2021, 09:53 AM
#14
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
Hi V12'ers,

After waiting over 2 months for the post, I finally got a replacement Octane V12.

So I swapped out the problem board, put in the new one and ran the diags from a dedicated 6.5.22 boot disk.

This now goes past the prior error, but I am getting another error from the "odydiags" tool:
>> INFO tile= 0x19800, base = 0x332000, pat=0, dx=0x300
>> INFO tile= 0x19980, base = 0x335000, pat=0, dx=0x300
>> INFO Printing first 20 errors
>>
>> INFO Expected vs Received
>> **** ERROR 000052
>> Exp=0xabcd vs Rcv=0xa3cd @ (735,123)
>> INFO
>> Pixel location information:
>> X Y Origin Stride fbdp lin_adr addr sdrm_idx chip data
>>
>> 000002df 0000007b 00335000 00000030 2 00019bfd 00000001 00000002 15 0000
>> INFO SDRAM chip # 15 is SDRAM Bank 0_A
>>
>> **** ERROR 000008 2bpp dma comparison failure
>>
>> RSLT patterns FAIL Board#0: errcode==SDRAMPT
>> INFO Maximum error count (1) reached
>> Sun Nov 7 21:50:49 2021
>> Loop 1: 9 tests PASSED, 1 tests FAILED, Sun Nov 7 21:57:38 2021
>> #######    #    ### #
>> #         # #    #  #
>> #        #   #   #  #
>> #####   #     #  #  #
>> #       #######  #  #
>> #       #     #  #  #
>> #       #     # ### #######
>>
>> Resetting graphics for diagnostics...
>>
>>
>> DIAGNOSTIC TEST RESULTS:
>>
>>
>> Summary is in summary.log, detailed info is in details.log
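
For anyone trying to read the failing comparison in the log above, here is a minimal Python sketch that does nothing more than the arithmetic on the quoted values: it XORs the expected and received words to see which data bits differ, and converts the hex X/Y fields to decimal to check them against the reported pixel position. How a data bit maps to a physical pin or chip on the V12 is not known from this thread, so the sketch makes no claim about that.

```python
# Values copied from the odydiags output quoted above.
expected = 0xABCD
received = 0xA3CD

# XOR leaves a 1 in every bit position where the two words disagree.
diff = expected ^ received
bad_bits = [bit for bit in range(16) if (diff >> bit) & 1]

for bit in bad_bits:
    wanted = (expected >> bit) & 1
    got = (received >> bit) & 1
    print(f"bit {bit}: expected {wanted}, received {got}")

# The "Pixel location information" row gives X and Y in hex.
x = int("000002df", 16)
y = int("0000007b", 16)
print(f"diff mask = {diff:#06x}, pixel = ({x},{y})")
```

Only one bit differs in this one sample, and the decimal coordinates agree with the "@ (735,123)" in the error line. A single mismatch cannot distinguish a bad cell from a marginal data line or solder joint; that would take more failing samples.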

So I swapped the diagnostic boot disk for my regular 6.5 boot disk, and even though I get the diagnostic error, my machine boots up OK and once again has working graphics.

So I now have a spare "dead" V12 which can be used for repair testing, though there is nothing apparent that could be the cause of the failure.
On this Octane V12, I cannot see any equivalent of the "SOT23" chip that Weblacky was testing on the Fuel V10.


Cheers from Oz,

jwhat/John
jwhat
Octane/O350/Fuel User

Trade Count: (0)
Posts: 513
Threads: 29
Joined: Jul 2018
Location: Australia
11-08-2021, 12:43 PM
#15
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
If you weren't located halfway across the world I would suggest you send it to weblacky or mopar5150

I'm the system admin of this site. Private security technician, licensed locksmith, hack of a c developer and vintage computer enthusiast. 

https://contrib.irixnet.org/raion/ -- contributions and pieces that I'm working on currently. 

https://codeberg.org/SolusRaion -- Code repos I control

Technical problems should be sent my way.
Raion
Chief IRIX Officer

Trade Count: (9)
Posts: 4,240
Threads: 533
Joined: Nov 2017
Location: Eastern Virginia
11-08-2021, 03:55 PM
#16
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
I doubt at this point I could do anything about the dead one (yet, maybe I’ll gain enough knowledge to be useful on graphics someday). I’m a little more curious about the replacement graphics error.

How would those of you who have actually run these tests interpret this error? Is this an SDRAM chip that needs swapping? Or perhaps a solder joint problem on the SDRAM IC that a simple “going back over” wouldn’t solve?

While it sounds really cool that they give you the location of the bad SDRAM IC does anyone actually know the layout to know the location of it in real life?

If that SDRAM chip is a TSSOP style, yeah, I’d give it a try; I’d just need a different extraction tool/method if it needs removal rather than reflow. But it sure would be interesting to see if it’s a chip issue or a communication issue.

This is all with that cool IST disk test suite for Octane graphics?
weblacky
I play an SGI Doctor, on daytime TV.

Trade Count: (10)
Posts: 1,716
Threads: 88
Joined: Jan 2019
Location: Seattle, WA
11-08-2021, 04:24 PM
#17
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
Hi Weblacky,

yep you hit the nail on the head: "While it sounds really cool that they give you the location of the bad SDRAM IC does anyone actually know the layout to know the location of it in real life?" ...

I think this is why the diags were not generally made available: they would only be useful to someone with all the details of the graphics boards. Without that knowledge there is not much you can do...

One of the things your work is doing is to uncover some of the internal details on these machines, so having this information could be helpful once more is known on them.

BTW for people who are interested the swmgr package for the IST 2.7 Octane Diags is: "IRIX based Online Odyssey Diagnostic Environment for IRIX 6.5.14+ (Version 2.3)"

Back to the original fault report, which was:

>> INFO Saving current register values
>> INFO Walking 1's
>> **** ERROR 003006 PBJ_I2C_opt_control exp 0x1 recv 0xff
>>
>> INFO Restoring original register values
>> RSLT i2c FAIL Board#0: errcode==I2CBUS
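
To illustrate what a "Walking 1's" register test does, here is a minimal Python sketch under stated assumptions. It is not the odydiags implementation: the two register classes are hypothetical stand-ins, and the 8-bit width is assumed only because the log shows "recv 0xff". The second class models a device that never drives the bus; because I2C lines are pulled high when nothing drives them, such a read typically comes back as all ones.

```python
def walking_ones(write, read, width=8):
    """Write a single 1 bit in each position and read it back.

    Returns a list of (expected, received) pairs for every mismatch.
    """
    failures = []
    for bit in range(width):
        pattern = 1 << bit
        write(pattern)
        got = read()
        if got != pattern:
            failures.append((pattern, got))
    return failures


class GoodRegister:
    """Hypothetical healthy register: reads back what was written."""

    def __init__(self):
        self.value = 0

    def write(self, value):
        self.value = value & 0xFF

    def read(self):
        return self.value


class SilentDevice:
    """Hypothetical device that never answers: every read is all ones."""

    def write(self, value):
        pass  # the write is lost, nothing acknowledges it

    def read(self):
        return 0xFF


good = GoodRegister()
silent = SilentDevice()
good_failures = walking_ones(good.write, good.read)
silent_failures = walking_ones(silent.write, silent.read)

print("good register failures:", good_failures)
exp, recv = silent_failures[0]
print(f"silent device first failure: exp {exp:#x} recv {recv:#x}")
```

The first failure of the silent model is "exp 0x1 recv 0xff", the same pair as in the log. That is consistent with a device that is not responding at all, but it does not prove it; a stuck-high data path would look the same.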

I found this web site that provides info on I2C: https://www.i2c-bus.org

As I2C is a general-purpose technology, could this help in trying to pinpoint the issue with my original V12?

Prior to this I always thought that I2C was only relevant for later model SGIs and not applicable to machines like the Octane.

Cheers from Oz,


jwhat/John.
(This post was last modified: 11-09-2021, 08:01 AM by jwhat.)
jwhat
11-09-2021, 07:59 AM
#18
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
I did a little work with I2C back in the day. What’s hard to know about this error is whether it is complaining (grammatically) about the entire bus, or simply about not being able to communicate with or set one specific IC on the bus.

Assuming you knew where the bus was, and that’s not exactly an easy feat unless you find some IC with marked CLK and SDA lines: if you think it’s complaining about the entire bus, then likely a single chip has shorted in a way that is pulling down the entire bus line. This would be something you could see with a multimeter, and if you had a multimeter sensitive down to milliohms, you might be able to track down the chip.

It’s not unusual to have multiple buses of related chips. After all, you certainly wouldn’t want them all on one big bus if you have more than a few, as that would cause contention or saturation.

Not hearing back from a chip on a functioning bus (which is what I think this is) is a totally separate issue. And unfortunately, since the bus addresses are set by each chip in different ways, there’s no way to tell which chip belongs to which address.

The one interesting thing is that you can actually interface with these buses using a normal parallel port. I’ve done this before; it’s a standard Linux hack. You can join a bus simply by adding yourself as a member, and since it’s a shared line you just listen to all the traffic.

In your case you’d be listening for something making a request and nothing ever answering it. That’s at least how you know you’re on the right bus. I don’t know much about the line length issue, but if you found the two tracks you needed, you could just solder thin wires to those points, run them outside the case, hook them to a parallel port and perform the investigation that way.
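
As a rough illustration of "listening for a request that nothing answers", here is a minimal Python sketch that decodes already-captured clock/data samples. It assumes the samples are a list of (scl, sda) logic levels taken fast enough to see every edge; how they get captured (parallel port or otherwise) is not covered. The function names, the test signal generator and the addresses 0x50 and 0x48 are all made up for the example.

```python
def decode_i2c(samples):
    """Turn a list of (scl, sda) samples into a list of transactions.

    Each transaction is a list of (byte, acked) tuples.
    """
    transactions, current, bits = [], None, []
    prev_scl, prev_sda = samples[0]
    for scl, sda in samples[1:]:
        if prev_scl and scl and prev_sda != sda:
            # SDA moving while SCL stays high: START (falling) or STOP (rising).
            if current:
                transactions.append(current)
            current = [] if sda == 0 else None
            bits = []
        elif not prev_scl and scl and current is not None:
            # Data bits are sampled on the rising edge of SCL.
            bits.append(sda)
            if len(bits) == 9:  # 8 data bits, then ACK (0) or NACK (1)
                byte = int("".join(str(b) for b in bits[:8]), 2)
                current.append((byte, bits[8] == 0))
                bits = []
        prev_scl, prev_sda = scl, sda
    return transactions


def unanswered_addresses(transactions):
    """7-bit addresses whose address byte was not acknowledged."""
    return [t[0][0] >> 1 for t in transactions if not t[0][1]]


def make_samples(frames):
    """Build an idealised test signal from (byte, acked) frames."""
    samples = [(1, 1), (1, 0)]  # idle bus, then START
    for byte, acked in frames:
        frame_bits = [(byte >> i) & 1 for i in range(7, -1, -1)]
        frame_bits.append(0 if acked else 1)
        for bit in frame_bits:
            samples += [(0, bit), (1, bit)]
    samples += [(0, 0), (1, 0), (1, 1)]  # STOP
    return samples


# Made-up traffic: a write to 0x50 that nobody acknowledges, followed by
# a one-byte read from 0x48 that is acknowledged.
capture = make_samples([(0x50 << 1, False)])
capture += make_samples([((0x48 << 1) | 1, True), (0x2A, False)])

transactions = decode_i2c(capture)
silent = unanswered_addresses(transactions)
print([hex(addr) for addr in silent])
```

Only the first byte of each transaction is checked, because a NACK on the last byte of a read is normal behaviour from the master rather than a sign of a missing chip.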

On the SDRAM error (replacement card), I think you might actually be able to use thermals. I mean, if the chip isn’t setting stuff, wouldn’t it be drawing less power and thereby producing less heat? I have no idea if we’re talking about an entire chip or a section of a chip being bad: is it a single memory address, or an entire memory segment represented by a single chip?

But there may be a temperature difference. If you had a thermal imager and could somehow get it out quickly enough, you might be able to spot a temperature that’s either too high or too low compared to all the neighboring SDRAM chips.

It’s a little bit of a longshot but it might be a viable idea.
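
If someone does collect per-chip temperatures, the comparison against the neighbours could be sketched like this in Python. The readings and chip labels are invented for illustration and the 5 °C threshold is an arbitrary assumption, not a known limit for these parts.

```python
from statistics import median


def thermal_outliers(readings, threshold_c=5.0):
    """Return chips whose temperature differs from the group median
    by more than threshold_c, with the signed difference in deg C."""
    mid = median(readings.values())
    return {
        chip: round(temp - mid, 1)
        for chip, temp in readings.items()
        if abs(temp - mid) > threshold_c
    }


# Invented example readings in deg C, NOT measured on a real V12.
readings = {
    "chip 12": 41.0,
    "chip 13": 42.5,
    "chip 14": 41.8,
    "chip 15": 33.9,
    "chip 16": 42.1,
}

outliers = thermal_outliers(readings)
print(outliers)
```

Using the median rather than the mean keeps one very cold or very hot chip from dragging the reference value toward itself.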

Also, without getting ahead of myself: I quickly reread your original posting, and what you’re describing is textbook solder joint cracking from thermal expansion. Most integrated circuits don’t just work fine one moment, then not work the next, then work fine again. Passive components don’t tend to do this either. One of the great things about electronics is that, outside of logic issues, it’s pretty binary, joke notwithstanding.

Now your first comment was about your DCD, then we started on this register error, correct?

I think that your DCD may work just fine, unless I misread something; you’d need to try it on another graphics card to know. I think you have a communication problem on your main card that is hindering the early stages of using the DCD, or is directly related to enabling or otherwise utilizing the DCD and related functions. This register setup is obviously relevant/related.

Unfortunately Octane is one of the worst systems for troubleshooting because, unlike other SGI models, all the cards slide in and are inaccessible during operation. Honestly, I think it depends on how much you value the card and/or want to take a risk.

If you consider this card nonworking and it’s really not going to get any worse, then I’d consider looking at ways to perform reflows. I don’t know the temperature of the solder used during production on these boards. Most small component boards use some form of paste or lower temperature wave soldering, right?

It doesn’t sound like some sort of big chip problem, like something that would be BGA or huge. I’d start looking at the little guys and maybe think about trying some controlled hot air reflows, like the kind of stuff you see on YouTube with people trying to reflow graphics chips. Not enough to actually melt most solder and have things drop off your board, but enough to liquefy small amounts of solder that have nothing with a huge thermal mass, like a chip, connected to them.

Obviously the huge question would be where to start. It’d be nice to have some indication of what board zero is in the error.

Any of the heavy hitters on the forum who have actually used reflow ovens to this effect should probably chime in on what’s really possible. If you were trying to reflow any of the boards in these graphics stacks, you’re really hoping not to disturb the large chips but to get at the smaller stuff. I could see maybe preheating the board and doing a little bit of hot air work in regional areas?

I know that most reflowing is a shotgun approach. Also, I have seen digital heat guns start to appear on the market. If you’re doing massive areas, it would be better to get a cheap digitally controlled paint stripping heat gun, set it to something like 400°F (about 204°C), wave it over a section of the board for a little while for good even heating, and see if you can make a difference with just that.
weblacky
11-09-2021, 09:28 AM
#19
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
Hi Weblacky,

many thanks for going into this.

It is really helpful to get your thoughts on this stuff.

FYI: Yes, you are right, the Octane was mostly sitting unused (but not unloved ;-) ) in my study.

First Failure Case:

The machine had just been sitting immobile for many months; when the problem first showed up I simply got no UI on boot.

So I did the usual thing, which was to pull out the board, give the compression connectors a blow out and put it back in.

Next I connected to the "analog" RGB monitor port (rather than the DCD DVI one) and found that it worked.

So I then took off the DCD, cleaned it up and put it back on, and it worked again after reseating.

So I let sleeping dogs lie and left it undiagnosed (as odydiags crashed on my IRIX 6.5.30 setup).


Second Failure Case:

The next time I had the problem (again after the machine had been sitting unused for a few months), it had the same symptoms, so I did the same thing. This time, however, I could not even get the console boot screen to come up with the DCD attached, so I removed it.

I also tested it with an alternate spare DCD I had, with no luck.

On receiving my replacement V12, I put the original DCD on it and put the new V12 board into the Octane.

While I do get the diagnostic error, I can use the computer perfectly and without any visible issues.

I was rather surprised to get the diagnostics error, as my motivation for running the diags was just to capture a "good run" in case someone else wanted to know what a good run looked like.

End of Failure Cases.


I will follow up on the "reflow" stuff, as John (mopar5150) has reported having had good results putting items through reflow ovens.

I believe that to do this with a V12 you would first have to remove all the heat sink stuff, which occupies a significant portion of the board space.

Cheers from Oz,


jwhat/John.
(This post was last modified: 11-09-2021, 12:34 PM by jwhat.)
jwhat
11-09-2021, 12:07 PM
#20
RE: VPro / Odyssey Diagnostic Disk (broken V12?)
Yo,
Okay, so you know your old DCD works! I’ll assume your “spare” works too; it’s a small, rigid board, and I assume it’s pretty darn tough.

I see what you mean on the surprise issue: either corrective action is taken on an SDRAM addressing failure, or the visual anomalies it may produce go unnoticed?

I wonder if it’s transient? I assume you ran the test repeatedly (in a row) and got the same results?

I wonder, with the additional passage of time, if you still get the same error at the same location?
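
One way to check "same error at the same location" across runs is to pull the key fields out of each log and compare them. Here is a minimal Python sketch, assuming the log lines look exactly like the ones quoted earlier in this thread (other odydiags versions may format them differently). The first sample is copied from that output; the second run is hypothetical, purely to show the comparison.

```python
import re

# Patterns based only on the log lines quoted in this thread.
ERROR_RE = re.compile(
    r"Exp=(0x[0-9a-fA-F]+) vs Rcv=(0x[0-9a-fA-F]+) @ \((\d+),(\d+)\)"
)
CHIP_RE = re.compile(r"SDRAM chip # (\d+) is (.+)")


def summarize(log_text):
    """Extract (expected, received, x, y) tuples and reported SDRAM chips."""
    errors = [
        (int(exp, 16), int(rcv, 16), int(x), int(y))
        for exp, rcv, x, y in ERROR_RE.findall(log_text)
    ]
    chips = [(int(num), name.strip()) for num, name in CHIP_RE.findall(log_text)]
    return {"errors": errors, "chips": chips}


run1 = """\
>> Exp=0xabcd vs Rcv=0xa3cd @ (735,123)
>> INFO SDRAM chip # 15 is SDRAM Bank 0_A
"""

# Hypothetical second run, invented to demonstrate the comparison.
run2 = """\
>> Exp=0xabcd vs Rcv=0xa3cd @ (735,123)
>> INFO SDRAM chip # 15 is SDRAM Bank 0_A
"""

first = summarize(run1)
second = summarize(run2)
print(first)
print("same failure in both runs:", first == second)
```

If the coordinates or chip number move around between runs, that would point more toward something intermittent than toward one fixed bad cell.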

If you do end up doing a reflow and have consistent results to show it worked, please publish the reflow profile used, temps over time, graph.

I certainly do not have the experience to know what temp I should be using other than watching closely for solder surface sheen.

Hot air and reflow genuinely scare me due to my lack of experience in controlling the process.
weblacky
11-09-2021, 02:14 PM

