RE: OpenBSD/sgi -
johnnym - 02-01-2023
Ok, I'm through, though I don't believe 4fc402b61836e8a270d48f5516496cfbfe57776b (0541ad29664165bece2a7957d02a188f9bafb73b in the official OpenBSD source) is the real culprit. I mean, yes, it is in the network code, so it could play a part in the problem (EDIT: and it introduced a bug for big-endian machines according to the next change to the same file), but why should it specifically affect IP28 and no other machine, not even the Indy, which uses the same NIC?
I rather believe this commit is the cause of the problem I mentioned earlier (hangs after the NFS mounts), which I also saw on the Octane but which was gone later on. EDIT: my memory played a trick on me here: according to my logs, the Octane didn't hang but took a long time around the NFS mounts (the solution - according to the commit dates - was to replay 85caa4b also for sgi with 1583f1e). Also, newer commits identified as bad did not hang on the Indigo². I'll probably look into that on the weekend.
Code:
$ git bisect log
git bisect start
# bad: [0d58b8b8b5e0621f84efa993ee9ef47605603beb] drop the -beta
git bisect bad 0d58b8b8b5e0621f84efa993ee9ef47605603beb
# good: [5b7ece61fa1aa6c1348e9b8f2e7b0863e6ea20e7] close enough to release, we drop -beta
git bisect good 5b7ece61fa1aa6c1348e9b8f2e7b0863e6ea20e7
# good: [9f850877c8e5a89e6bfb255f1f7026c00bb7875e] vmm(4): reference count vm's and vcpu's
git bisect good 9f850877c8e5a89e6bfb255f1f7026c00bb7875e
# bad: [970cf9c09324fa9781be02dc122de675098fbf1d] Don't yet configure smmu(4) on Qualcomm SoCs as used on the Lenovo x13s as it is still not ready for runtime use and probably needs further quirks.
git bisect bad 970cf9c09324fa9781be02dc122de675098fbf1d
# good: [a7cdf5850edf952aab05a1d12a910add326a1f7c] Make test table based, extend it a little
git bisect good a7cdf5850edf952aab05a1d12a910add326a1f7c
# bad: [a39c18f28d16b1a61658f6ce07a74bc58176db30] strlen was in v6 libc (s5/perror.c) but not documented till v7 ok schwarze@
git bisect bad a39c18f28d16b1a61658f6ce07a74bc58176db30
# good: [17fc9e5b1d3178a8d65cacff7114b83163f14a02] The IPv4 reassembly code is MP safe, so we can run it in parallel. Note that ip_ours() runs with shared netlock, while ip_local() has exclusive netlock after queuing. Move existing the code into function ip_fragcheck() and call it from ip_ours(). OK mvs@
git bisect good 17fc9e5b1d3178a8d65cacff7114b83163f14a02
# bad: [4fc402b61836e8a270d48f5516496cfbfe57776b] Checking the fragment flags of an incoming IP packet does not need the mutex for the fragment list. Move this code before the critical section. Use ISSET() to make clear which flags are checked. OK mvs@
git bisect bad 4fc402b61836e8a270d48f5516496cfbfe57776b
# good: [cd0dd8f18578e5b883f43aa8ea64d31710ca1159] Force disabling the use of delay slots. This is ugly but gets the compiler to produce 99+% correct code at all optimization levels, and can help people who would like to tinker a bit with the backend.
git bisect good cd0dd8f18578e5b883f43aa8ea64d31710ca1159
# good: [0e641b41fe54fcc8de2a3351ff974e0677060113] Remove bogus mtw_read_cfg.
git bisect good 0e641b41fe54fcc8de2a3351ff974e0677060113
# good: [5e24e96cb0c2a092c174a5e9f83d4cbadf271e3f] Zap prototypes for nonexistent nd6_setmtu() and in6_ifdel()
git bisect good 5e24e96cb0c2a092c174a5e9f83d4cbadf271e3f
# good: [e62afb52dea0a7b7d0c6c099652a54e60340a22d] Fix RFC number in comment
git bisect good e62afb52dea0a7b7d0c6c099652a54e60340a22d
# good: [61f35befa9a0619b3becea84efb445917c00389e] Add a second test to validate the tables in the library.
git bisect good 61f35befa9a0619b3becea84efb445917c00389e
# first bad commit: [4fc402b61836e8a270d48f5516496cfbfe57776b] Checking the fragment flags of an incoming IP packet does not need the mutex for the fragment list. Move this code before the critical section. Use ISSET() to make clear which flags are checked. OK mvs@
Each bisect step required the following steps for verification:
- tar repo state - seconds
- copy it over to the NFS server - about 1m
- untar it on the Octane to a 15K SCSI disk from NFS - about 20m
- compile it - about 28m
- test it on Indigo² - under 5m
...so roughly 50 minutes per step.
RE: OpenBSD/sgi -
johnnym - 02-02-2023
Couldn't let it rest today:
So I followed my suspicion from yesterday and indeed, with the changes from commit 1109691f1d2 applied on top of 4fc402b6183, the hangs after the NFS mounts are gone and the Indigo² happily boots to the login prompt and works correctly, so this is not the real problem. Checking out the whole repo at 1109691f1d2 and compiling it gives the same result.
So we now have a new good commit, and the last bad commit that didn't hang after the NFS mounts serves as the new bad commit for another round of bisecting. This time there are only 104 commits to search through, "roughly 7 steps" according to git bisect.
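As a rough cross-check of that estimate, the number of steps can be approximated by halving the range until a single commit is left. This is only a sketch: the helper names are made up for illustration, the loop approximates but is not git's internal formula, and the 50 minutes per step is the figure from the earlier post.

```shell
# Rough estimate only: halve the candidate range until one commit is left.
# This approximates, but is not, git's internal "roughly N steps" formula.
bisect_steps() {
    remaining=$1
    steps=0
    while [ "$remaining" -gt 1 ]; do
        remaining=$(( (remaining + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Hypothetical helper: total wall-clock minutes for one bisect round,
# given the number of commits and the minutes needed per step.
bisect_minutes() {
    echo $(( $(bisect_steps "$1") * $2 ))
}

bisect_steps 104        # steps for the 104 remaining commits
bisect_minutes 104 50   # total minutes at about 50 minutes per step
```

For 104 commits this prints 7 and 350, i.e. close to six hours of tar, copy, compile and test if every step takes the full 50 minutes.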
RE: OpenBSD/sgi -
johnnym - 02-04-2023
Feeling lucky today, so I'll first try to manually find one or two (if need be) commits - one good or one bad - close to the respective other end of the search range (same in OpenBSD's original source), in order to cut down the number of bisect steps needed. Fingers crossed.
RE: OpenBSD/sgi -
johnnym - 02-04-2023
Unfortunately the two commits selected and tried (3b3dd72256d and 805206b941e) didn't narrow the search range much, but they could at least be used as the new good and bad commits.
But when quickly scanning through the remaining commits I noticed a specific one (ae6cd46) that "benefits most mips64 platforms":
Code:
commit ae6cd4623ffb7b807b08788d8e53f7a9259c0c82
Author: miod <miod@openbsd.org>
Date: Sun Aug 7 19:40:48 2022 +0000
Use PMAP_PREFER_ALIGN() == 0 rather than !defined(PMAP_PREFER) to enable the
fast path in the pager code; this benefits most mips64 platforms.
ok kettenis@ mpi@
(cherry picked from commit d600f90f1a804e442018f93ce8ec61f99cd5fb69)
I checked its parent, which was still good, and then checked the commit itself, which resulted in a bad kernel producing the errors I described here earlier and also on GitHub for IP28.
The corresponding issue on GitHub was updated accordingly and has the details.
Along the way I could also cut down the time needed for compiling considerably by using git diff to create a patch, applying that patch to move between revisions, and letting make find out what to recompile. Not sure whether this will also be faster for commits that are further apart than the ones I was working with.
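The patch-based stepping could look roughly like the following sketch. The function names and arguments are hypothetical, and the exact commands used are not given in this thread; the sketch assumes the build tree is a plain copy of the sources and that the patch path is absolute.

```shell
# Hypothetical helpers; names and arguments are illustrative only.

# Write the difference between two revisions of a repository to a patch file.
make_step_patch() {
    repo=$1 from=$2 to=$3 out=$4
    git -C "$repo" diff "$from" "$to" > "$out"
}

# Apply the patch to a plain source tree (no .git directory needed).
# The patch path must be absolute because of the cd. Only the files named
# in the patch get new timestamps, so make rebuilds just those.
apply_step_patch() {
    tree=$1 patch=$2
    ( cd "$tree" && git apply "$patch" )
}
```

On a build host without git, `patch -p1` should be able to apply the same file, as long as the diff contains no binary changes. After applying, a plain `make` in the existing compile directory picks up only the files whose timestamps changed.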
RE: OpenBSD/sgi -
johnnym - 03-01-2023
Looks like I never posted that I found a fix/workaround for the problem with the IP28 kernel in OpenBSD/sgi 7.2:
Well, my "solution" was to extend an existing - i.e. still existing in OpenBSD/sgi - clause meant for R5000 and RM7000 processors in sys/arch/mips64/include/pmap.h to also trigger for IP28 (i.e. CPU_R10000 and TGT_INDIGO2):
Code:
diff --git a/sys/arch/mips64/include/pmap.h b/sys/arch/mips64/include/pmap.h
index 7cbac309a96..391e542797c 100644
--- a/sys/arch/mips64/include/pmap.h
+++ b/sys/arch/mips64/include/pmap.h
@@ -177,8 +177,11 @@ void pmap_page_cache(vm_page_t, u_int);
  * and many structures containing fields which will be used with
  * <machine/atomic.h> routines are allocated from pools, __HAVE_PMAP_DIRECT can
  * not be defined on systems which may use flawed processors.
+ *
+ * There could be a similar problem for the IP28 aka POWER Indigo2 R10000, so
+ * we exclude the definition of __HAVE_PMAP_DIRECT for these systems, too.
  */
-#if !defined(CPU_R5000) && !defined(CPU_RM7000)
+#if !( defined(CPU_R5000) || defined(CPU_RM7000) || ( defined(TGT_INDIGO2) && defined(CPU_R10000) ) )
 #define __HAVE_PMAP_DIRECT
 vaddr_t pmap_map_direct(vm_page_t);
 vm_page_t pmap_unmap_direct(vaddr_t);
...which prevents the definition of __HAVE_PMAP_DIRECT. With that macro defined, commit ae6cd46 in the sgi-never-retired branch (d600f90 in the official OpenBSD source code) leads to the described problems on IP28.
The corresponding issue on GitHub was updated accordingly, as was the IP28 kernel of OpenBSD/sgi 7.2 on GitHub.
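To see which combinations of kernel options the new guard excludes, its boolean logic can be replayed outside the preprocessor. This is only an illustration: `have_pmap_direct` is a made-up helper, and the macro sets passed to it are hypothetical inputs, not actual kernel configurations.

```shell
# Hypothetical helper: prints "yes" if the patched #if condition would
# define __HAVE_PMAP_DIRECT for the given set of macros, "no" otherwise.
have_pmap_direct() {
    r5000=0 rm7000=0 indigo2=0 r10000=0
    for macro in "$@"; do
        case "$macro" in
            CPU_R5000)   r5000=1 ;;
            CPU_RM7000)  rm7000=1 ;;
            TGT_INDIGO2) indigo2=1 ;;
            CPU_R10000)  r10000=1 ;;
        esac
    done
    # Same expression as in the patched pmap.h guard.
    if [ $(( !(r5000 || rm7000 || (indigo2 && r10000)) )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

have_pmap_direct TGT_INDIGO2 CPU_R10000   # the combination added by the patch
have_pmap_direct CPU_R10000               # R10000 without TGT_INDIGO2
```

The first call prints "no" and the second prints "yes", so an R10000 kernel built without TGT_INDIGO2 keeps the direct map while the IP28 combination loses it.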
RE: OpenBSD/sgi -
johnnym - 03-19-2023
Finally, but still earlier than expected, OpenBSD switched to 7.3:
Code:
commit 9a3badca5016bb6b6ce5e35f28496815da15afb9 (HEAD -> sgi-is-alive-at-7.3)
Author: deraadt <deraadt@openbsd.org>
Date: Fri Mar 17 22:52:22 2023 +0000
remove -beta tag
See
https://github.com/openbsd/src/blob/1750b2485245729867353d98b376ca12415da42b/sys/conf/newvers.sh
Got some work to do...
RE: OpenBSD/sgi -
johnnym - 03-19-2023
Now look at that, my Octane already got a new MP kernel running:
Code:
>> boot
Setting $netaddr to 172.16.2.51 (from server )
Obtaining /sash from server
7278928+720752 entry: 0xa800000020020000
ARCS64 Firmware
Found SGI-IP30, setting up.
Initial setup done, switching console.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.
Copyright (c) 1995-2023 OpenBSD. All rights reserved. https://www.OpenBSD.org
OpenBSD 7.3 (GENERIC-IP30.MP) #0: Sun Mar 19 21:29:24 CET 2023
root@octane.machine-hall.org:/usr/src/sys/arch/sgi/compile/GENERIC-IP30.MP
real mem = 2147483648 (2048MB)
rsvd mem = 1064960 (2MB)
avail mem = 2119352320 (2021MB)
warning: no entropy supplied by boot loader
random: boothowto does not indicate good seed
mainbus0 at root: Octane
cpu0 at mainbus0: MIPS R12000 CPU rev 2.3 300 MHz, R10000 FPU rev 0.0
cpu0: cache L1-I 32KB D 32KB 2 way, L2 2048KB 2 way
cpu1 at mainbus0: MIPS R12000 CPU rev 2.3 300 MHz, R10000 FPU rev 0.0
cpu1: cache L1-I 32KB D 32KB 2 way, L2 2048KB 2 way
clock0 at mainbus0: int 5
xbow0 at mainbus0: XBow revision 4
xheart0 at xbow0 widget 8: Heart revision 4
onewire0 at xheart0
owserial0 at onewire0 "16kb EPROM" sn xxxxxxxxxxxx
owserial0: "PM20300MHZ" p/n 030-1356-001, serial xxxxxx
xbridge0 at xbow0 widget 15: Bridge revision 3
xbpci0 at xbridge0 bus 0: 33 MHz PCI bus
pci0 at xbpci0 bus 0
qlw0 at pci0 dev 0 function 0 "QLogic ISP1020" rev 0x05: irq 0, xbow irq 14
qlw0: nvram corrupt
qlw0: firmware rev 4.66.0, attrs 0x0
scsibus0 at qlw0: 16 targets, initiator 0
sd0 at scsibus0 targ 1 lun 0: <HP 73.4G, ST373455LC, HPC8> xxxxxxxxxxxxxxxxxxxx
sd0: 70007MB, 512 bytes/sector, 143374738 sectors
sd1 at scsibus0 targ 2 lun 0: <COMPAQ, BD07288277, HPB0> xxxxxxxxxxxxxxxxxxxx
sd1: 69464MB, 512 bytes/sector, 142264000 sectors
qlw1 at pci0 dev 1 function 0 "QLogic ISP1020" rev 0x05: irq 1, xbow irq 13
qlw1: nvram corrupt
qlw1: firmware rev 4.66.0, attrs 0x0
scsibus1 at qlw1: 16 targets, initiator 0
ioc0 at pci0 dev 2 function 0 "SGI IOC3" rev 0x01
onewire1 at ioc0
owmac0 at onewire1 "1kb EPROM" sn xxxxxxxxxxxx
owmac0: Ethernet Address xx:xx:xx:xx:xx:xx
owserial1 at onewire1 "16kb EPROM" sn xxxxxxxxxxxx
owserial1: "FP1" p/n 030-0891-003, serial xxxxxx
owserial2 at onewire1 "16kb EPROM" sn xxxxxxxxxxxx
owserial2: "PWR.SPPLY.ER" p/n 060-0035-002, serial xxxxxxxxxx
ioc0: ethernet irq 2, xbow irq 12
ioc0: superio irq 4, xbow irq 11
com0 at ioc0 base 0x20178: ns16550a, 16 byte fifo
com0: console
com1 at ioc0 base 0x20170: ns16550a, 16 byte fifo
iockbc0 at ioc0
iec0 at ioc0: 128KB SSRAM, address xx:xx:xx:xx:xx:xx
icsphy0 at iec0 phy 1: ICS1890 10/100 PHY, rev. 3
lpt at ioc0 not configured
dsrtc0 at ioc0: DS1687
"SGI Rad1" rev 0xc0 at pci0 dev 3 function 0 not configured
power0 at mainbus0
/dev/ksyms: Symbol table not valid.
vscsi0 at root
scsibus2 at vscsi0: 256 targets
softraid0 at root
scsibus3 at softraid0: 256 targets
boot device: iec0
nfs_boot: using interface iec0, with revarp & bootparams
nfs_boot: client_addr=172.16.2.51
nfs_boot: server_addr=172.16.0.1 hostname=octane
root on 172.16.0.2:/srv/nfs/octane/root
WARNING: clock gained 99 days
WARNING: CHECK AND RESET THE DATE!
swap on 172.16.0.2:/srv/nfs/octane/swap
Automatic boot in progress: starting file system checks.
pfctl: DIOCADDRULE: Operation not supported by device
pf enabled
starting network
pfctl: DIOCADDRULE: Operation not supported by device
starting early daemons: syslogd pflogd ntpd.
starting RPC daemons:.
swapctl: adding 172.16.0.2:/srv/nfs/openbsd/7.2/octeon/hosts/octane2/swap as swap device at priority 0
kvm_mkdb: can't open /dev/ksyms
savecore: /bsd: kvm_read: version misread
checking quotas: done.
clearing /tmp
kern.securelevel: 0 -> 1
creating runtime link editor directory cache.
preserving editor files.
starting network daemons: sshd smtpd sndiod.
starting local daemons: cron.
Sun Mar 19 21:44:02 CET 2023
OpenBSD/sgi (octane.machine-hall.org) (console)
login: root
Last login: Mon Feb 6 13:08:06 on console
OpenBSD 7.3 (GENERIC-IP30.MP) #0: Sun Mar 19 21:29:24 CET 2023
Welcome to OpenBSD: The proactively secure Unix-like operating system.
Please use the sendbug(1) utility to report bugs in the system.
Before reporting a bug, please try to reproduce it with the latest
version of the code. With bug reports, please try to ensure that
enough information to reproduce the problem is enclosed, and if a
known fix for it exists, include that as well.
You have mail.
octane# machine
octeon
octane# sysctl hw
hw.machine=sgi
hw.model=IP30
hw.ncpu=2
hw.byteorder=4321
hw.pagesize=16384
hw.disknames=sd0:cff4231e147e67d8,sd1:87a703b75b1e1601
hw.diskcount=2
hw.cpuspeed=299
hw.vendor=SGI
hw.product=Octane
hw.physmem=2147483648
hw.usermem=2147450880
hw.ncpufound=2
hw.allowpowerdown=1
hw.ncpuonline=2
hw.power=1
octane#
Unfortunately it doesn't work with my OpenBSD/sgi 7.0 FS, so I booted an octeon 7.2 FS instead. But this also means I can't try out this kernel while compiling the 7.3 kernels for the other machines, something I usually do for testing the new kernel(s).
Well, we can't have everything at once.
Find
the new OpenBSD/sgi 7.3 branch on GitHub.
RE: OpenBSD/sgi -
johnnym - 03-31-2023
OpenBSD/sgi 7.3
I have by now created all kernels for OpenBSD/sgi 7.3 (incl. for R8000 Indigo² (IP26)) and tested all kernels I have machines for (please see the corresponding release page on GitHub for details).
An intro branch that gives an overview has also been available for a while.
Every machine was tested by successfully booting with an OpenBSD/octeon 7.3 FS snapshot - OpenBSD 7.3 hasn't been released yet! - and running a few benchmarks (7za, openssl) using MP operation where possible. The boot logs are linked from the above-mentioned release page, as are the new kernels.
Unfortunately, this release also comes with issues, this time for two machines:
- Indy (IP22)
- R10000 Indigo² (IP28)
I created an issue over at GitHub to track the process of finding the cause of this problem and hopefully solving it.
So it looks like it's time to bring the sgi-never-retired branch forward and do some bisecting, starting with the Indy (IP22) kernel.
RE: OpenBSD/sgi -
johnnym - 04-06-2023
There might be some confusion about what systems OpenBSD/sgi runs on. So let me clarify that by citing "official" information from
the intro(4/sgi) manpage of OpenBSD/sgi 6.9:
Quote:[...]
HARDWARE
The following systems are supported:
Hardware  Family  Kernel  Model
IP20      IP20    IP22    Indigo (R4k)
IP22      IP22    IP22    Indigo2, Challenge M (R4k)
IP24      IP22    IP22    Indy*, Challenge S
IP26      IP22    IP26    POWER Indigo2 (R8000)
IP27      IP27    IP27    Origin 2x00, Onyx 2
IP28      IP22    IP28    POWER Indigo2 (R10000)*
IP29      IP27    IP27    Origin 200
IP30      IP30    IP30    Octane*, Octane 2* (Speedracer)
IP31      IP27    IP27    Origin 200*/2x00, Onyx 2 (250+ MHz)
IP32      IP32    IP32    O2*, O2+ (Moosehead)
IP34      IP35    IP27    Fuel (Asterix)
IP35      IP35    IP27    Origin 3x00, Onyx 3x000, Onyx 3
IP39      IP35    IP27    Onyx 4
IP45      IP35    IP27    Origin 300, Onyx 300
IP53      IP35    IP27    Origin 350, Onyx 350, Tezro
IP59      IP35    IP27    Origin 350, Onyx 350, Tezro (1GHz)
[...]
I can confirm that the systems marked with a * work in principle - these are the systems I have at hand and have tested so far - for OpenBSD/sgi up to 7.3, with the exception of the Indy and the R10000 Indigo², for which I'm trying to track down the cause of the problem. For the Indy I could already clarify that the issue detected for OpenBSD/sgi 7.3 is not new but present in all versions since 6.9 (and maybe even earlier), see
https://github.com/the-machine-hall/openbsd-src/issues/2#issuecomment-1494425202 for details.
I provide kernels for all of the systems listed above on GitHub, and those can easily be tested by netbooting them from the PROM on the respective system.
RE: OpenBSD/sgi -
Raion - 04-06-2023
Thanks for the clarification. The last time I looked into OpenBSD for SGI was on a whim in 2015, and back then the support for anything that was Origin 300 or Chimera based (the Fuel and Origin 300 are more closely based on a different architecture than the later Chimera systems) was not particularly good.