Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Re: Help building an old kernel (5.19.0)

Reply #15 – 08 January 2023, 15:49:50

@jspaces OK you were right actually!

I was searching in the wrong place.

Ubuntu thinks it is a good idea to put udev rules in /lib/udev/rules.d ...

So, here is what I quickly found:

Code: [Select]

$ ack thunderbolt
90-bolt.rules
13:# start bolt service if we have a thunderbolt device connected
14:SUBSYSTEM=="thunderbolt", TAG+="systemd", ENV{SYSTEMD_WANTS}+="bolt.service"

90-nm-thunderbolt.rules
4:ACTION!="add|change|move", GOTO="nm_thunderbolt_end"
6:# Load he thunderbolt-net driver if we a device of type thunderbolt_xdomain
8:SUBSYSTEM=="thunderbolt", ENV{DEVTYPE}=="thunderbolt_xdomain", RUN{builtin}+="kmod load thunderbolt-net"
10:# For all thunderbolt network devices, we want to enable link-local configuration
11:SUBSYSTEM=="net", ENV{ID_NET_DRIVER}=="thunderbolt-net", ENV{NM_AUTO_DEFAULT_LINK_LOCAL_ONLY}="1"
13:LABEL="nm_thunderbolt_end"

I'm not really sure that I can reuse the rules, as there seems to be some systemd voodoo mixed in.

OK so the meaningful line is

Code: [Select]

SUBSYSTEM=="thunderbolt", TAG+="systemd", ENV{SYSTEMD_WANTS}+="bolt.service"

but this only starts the service "bolt" through which the user can allow or deny the connected device.
The only problem is that this service "bolt" depends heavily on systemd, so I won't try to use that.
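
For context, the allow/deny decision that bolt manages ends up in the kernel's `authorized` attribute of each device under /sys/bus/thunderbolt/devices. Here is a minimal sketch, independent of bolt and systemd, that lists the devices and their current state. The function name and the overridable directory argument are made up for illustration, and it has only been reasoned through, not run on this hardware:

```shell
# List Thunderbolt devices with their authorization state (0 = not authorized).
# $1: sysfs directory to scan; defaults to the kernel's real location.
tb_list_authorized() {
    root=${1:-/sys/bus/thunderbolt/devices}
    for dev in "$root"/*; do
        # Entries such as domain0 have no "authorized" attribute; skip them.
        [ -f "$dev/authorized" ] || continue
        printf '%s authorized=%s\n' "${dev##*/}" "$(cat "$dev/authorized")"
    done
}

# Authorizing a device by hand (as root); the device name 0-1 is only an example:
#   echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized
```

Calling `tb_list_authorized` with no argument reads the real sysfs tree; a device showing `authorized=0` is connected but not yet allowed.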

Re: Help building an old kernel (5.19.0)

Reply #16 – 08 January 2023, 15:56:31

Quote from: lq – on 08 January 2023, 14:03:07

I'm pretty sure it's not really a problem for you.

LOL believe it or not, it does bother me, because it was also an interesting problem to know what was preventing the kernel from booting...

Re: Help building an old kernel (5.19.0)

Reply #17 – 08 January 2023, 19:34:52

Try taking a look inside the systemd "bolt.service" unit file as a basis for creating a Thunderbolt service for the init that you use. It should be possible to start such a service from the udev rule.
Maybe this package is required:

Code: [Select]

galaxy/bolt 0.9.4-1 (160.7 KiB 453.6 KiB) ->     Thunderbolt 3 device manager

Re: Help building an old kernel (5.19.0)

Reply #18 – 09 January 2023, 00:26:04

@jspaces Sooo, I discovered that I had already installed a package named bolt (that provides boltctl).

I ran asp checkout bolt to look inside, and to my surprise (and horror) its PKGBUILD states

Code: [Select]

depends=('polkit' 'systemd')

How is that possible given that I'm pretty sure that I haven't installed systemd?
Does the PKGBUILD from asp not reflect what is actually built by Artix?

Also, I discovered the existence of a bolt-openrc package, but asp checkout bolt-openrc returns an error:

Code: [Select]

$ asp checkout bolt-openrc
error: unknown package: bolt-openrc

(and YEAH I know that I could just pacman it right away, but I really just want to see what it does beforehand)

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #19 – 09 January 2023, 00:47:55

Quote from: Ale – on 09 January 2023, 00:26:04

How is that possible given that I'm pretty sure that I haven't installed systemd?
Does the PKGBUILD from asp not reflect what is actually built by Artix?

Code: [Select]

pacman -Si asp
Repository      : extra
Name            : asp
Version         : 8-1
Description     : Arch Linux build source file management tool
Architecture    : any
URL             : https://github.com/falconindy/asp
Licenses        : MIT
Groups          : None
Provides        : None
Depends On      : awk  bash  jq  git  libarchive
Optional Deps   : None
Conflicts With  : None
Replaces        : None
Download Size   : 10.56 KiB
Installed Size  : 20.13 KiB
Packager        : Jelle van der Waa <[email protected]>
Build Date      : Wed 03 Nov 2021 09:23:18 GMT
Validated By    : MD5 Sum  SHA-256 Sum  Signature

asp is an Arch package, so you have the Arch bolt PKGBUILD (and package).

artix-archlinux-support provides a dummy systemd, so the (Arch) bolt package believes its systemd dependency is met when in reality it isn't.

I'd love to see an Artix version of asp, but it's not currently a thing.

Code: [Select]

git clone https://gitea.artixlinux.org/packagesB/bolt.git

For the artix version

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #20 – 09 January 2023, 01:20:52

@gripped Thanks a lot! This is spot on.

I did not know that asp does not provide Artix packages. I'll check whether I have inadvertently installed unintended dependencies because of that in the past...

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #21 – 09 January 2023, 05:26:35

The bolt packages are in Artix's repositories.

Code: [Select]

$ pacman -Ss bolt
galaxy/bolt 0.9.5-1
    Thunderbolt 3 device manager
galaxy/bolt-dinit 20211102-2 (dinit-galaxy)
    dinit service scripts for bolt
galaxy/bolt-openrc 20210506-1 (openrc-galaxy)
    OpenRC script for bolt
galaxy/bolt-runit 20210426-1
    runit service scripts for bolt
galaxy/bolt-s6 20210919-1 (s6-galaxy)
    s6-rc service scripts for bolt
community/bolt 0.9.5-1
    Thunderbolt 3 device manager

Try installing the bolt package from galaxy, as my prior post showed, together with the bolt-{openrc,runit,dinit,s6} package that matches your init, then start the service. Test whether the Thunderbolt eGPU hotplug is picked up and go from there.
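
Since both galaxy/bolt and community/bolt show up in the listing above, it may help to name the repository explicitly when installing. A small sketch that builds the package list for a given init; the function name is hypothetical and the four init names are taken from the pacman -Ss output:

```shell
# Print the packages to install for a given Artix init system.
# The repo-qualified names force the galaxy build rather than community.
bolt_packages() {
    case $1 in
        openrc|runit|dinit|s6)
            printf 'galaxy/bolt galaxy/bolt-%s\n' "$1" ;;
        *)
            echo "unsupported init: $1" >&2
            return 1 ;;
    esac
}
```

Usage would be something like `sudo pacman -S $(bolt_packages openrc)`. Enabling and starting the service afterwards depends on the init and on the service name shipped in the package, which is not shown in this thread.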

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #22 – 09 January 2023, 08:36:51

So, I have tried the bolt package, and removed all the udev stuff I had added manually.

It ends up giving me the same result: yes the device appears in lspci.

However, I have now noticed a problem that was also present with the manual udev editing method: the device shows up on PCI, but the hotplugging then fails somewhere in the NVIDIA driver.

In neither case (manual or bolt) does it fail when the eGPU is plugged in at boot time.

So, more details about how the hotplugging fails (both with the udev tricks and with bolt):

Code: [Select]

# At this point I plug in the eGPU which is powered on.
$ nvidia-smi
Mon Jan  9 16:33:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 42%   19C    P0    32W / 200W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ python -c "import torch; print(torch.cuda.is_available())"
False
$ nvidia-smi
No devices were found

Here is the dmesg:

Code: [Select]

[ 4193.491500] nvidia-nvlink: Nvlink Core is being initialized, major device number 509

[ 4193.491947] nvidia 0000:04:00.0: enabling device (0000 -> 0003)
[ 4193.492125] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4193.541270] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  525.60.11  Release Build  (archlinux-builder@mymachine)  
[ 4193.576480] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  525.60.11  Release Build  (archlinux-builder@mymachine)  
[ 4193.578450] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 4193.578452] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
[ 4195.048738] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[ 4197.079688] NVRM gpuInitOptimusSettings_IMPL: SBIOS did not acknowledge cfg space owner change
[ 4197.473074] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4197.473076] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4197.473080] NVRM nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:1926
[ 4210.197920] nvidia-uvm: Loaded the UVM driver, major device number 507.
[ 4210.632379] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4210.632382] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4210.643384] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4210.643387] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4210.643392] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4210.643396] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4210.643400] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4210.644922] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4210.645481] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4211.060534] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4211.060538] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4211.071629] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4211.071632] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4211.071637] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4211.071640] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4211.071645] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4211.073314] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4211.073892] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.101002] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.101005] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.112318] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.112321] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.112326] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.112330] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.112334] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.113864] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.114525] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.528551] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.528553] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.539882] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.539884] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.539889] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.539893] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.539897] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.541376] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.541898] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
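
The log above repeats the same failure block several times, which makes it hard to see how many distinct errors there are. A rough sketch that collapses it, assuming the usual "[ seconds.micros] message" dmesg format; the function name is made up:

```shell
# Read dmesg-style lines on stdin, strip the leading "[ time ]" stamp,
# keep only NVRM lines, and print each distinct message with its count.
nvrm_summary() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] //' \
        | grep '^NVRM' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Typical use is `dmesg | nvrm_summary` (reading dmesg may require root on some systems). Messages that differ only in the timestamp are counted together.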

Meanwhile, in XUbuntu 22.10, the hotplugging works like a charm even when testing pytorch with the same command...

So, there is probably some difference in the udev handling and the scripts behind...

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #23 – 09 January 2023, 19:00:19

Code: [Select]

[ 4195.048738] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!

Yep, this error is a doom-and-gloom, pictures-at-eleven situation.
Since the NVIDIA driver is not picking the device up, the only thing that comes to mind is to reload the nvidia modules once lspci shows the device connected, to see whether they pick up the new information from the system.

One would have to think a little outside the box to get it going. The NVIDIA driver internals are a blob and not something us mere mortals know much about; just look at how much reverse engineering nouveau has to do to provide an open-source video solution.

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #24 – 10 January 2023, 08:10:14

Actually now it works!! LOL

In this order:

* plug it in (hot)
* wait a handful of seconds (check using lsmod that the modules nvidia, nvidia_modeset and nvidia_drm have been loaded automatically as a result of the plugging-in)
* load manually the missing nvidia_uvm through

Code: [Select]

sudo modprobe nvidia_uvm

* check with

Code: [Select]

python -c "import torch; print(torch.cuda.is_available())"

* enjoy!!
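
The steps above could be scripted roughly as follows. This is a sketch, not a tested tool: the function names are made up, the 30-second limit is an arbitrary choice, and it assumes the three modules really are loaded automatically on plug-in as described:

```shell
# Succeed only if every named module is listed in a /proc/modules-style file.
# $1: listing file (normally /proc/modules); remaining args: module names.
modules_loaded() {
    listing=$1
    shift
    for mod in "$@"; do
        # The trailing space keeps "nvidia" from matching "nvidia_drm".
        grep -q "^$mod " "$listing" || return 1
    done
}

# Poll until the auto-loaded modules are present, then load nvidia_uvm.
load_uvm_after_hotplug() {
    tries=0
    until modules_loaded /proc/modules nvidia nvidia_modeset nvidia_drm; do
        tries=$((tries + 1))
        if [ "$tries" -ge 30 ]; then
            return 1
        fi
        sleep 1
    done
    sudo modprobe nvidia_uvm
}
```

Run `load_uvm_after_hotplug` right after plugging in the eGPU, then do the torch check as before.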

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #25 – 10 January 2023, 08:13:09

And if I take care to rmmod all 4 modules beforehand, I can even hot-unplug!!
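
A sketch of that unload step. The order is my reading of the module dependencies (nvidia_uvm, nvidia_drm and nvidia_modeset all sit on top of nvidia), not something spelled out in this thread; the function name and the DRY_RUN switch are made up:

```shell
# Unload the four NVIDIA modules before unplugging, dependents first.
# Set DRY_RUN=1 to print the commands instead of running them.
nvidia_unload() {
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "rmmod $mod"
        else
            sudo rmmod "$mod" || return 1
        fi
    done
}
```

`DRY_RUN=1 nvidia_unload` shows what would be run. A real run stops at the first module that is still in use, so close anything using the GPU first.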

Of course, Xorg is completely out of the picture in this setup.

Next step: it looks like XUbuntu (and probably Ubuntu in general) can attach the hotplugged GPU to XOrg!!
(it appears in

Code: [Select]

xrandr --listproviders

)

So I bet there would be a way to do it in Arch/Artix too, but the how is another level.

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #26 – 10 January 2023, 08:38:41

Awesome, I am glad that your eGPU is now recognized by the nvidia driver after the hot plug.
It is always a pleasure to get something working when the path to get there is not straightforward.

Quote

Next step: it looks like XUbuntu (and probably Ubuntu in general) can attach the hotplugged GPU to XOrg!!
So I bet there would be a way to do it in Arch/Artix too, but the how is another level.

What about checking the Xorg logs on Xubuntu? Maybe some hints exist there.

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #27 – 10 January 2023, 08:43:02

Quote from: Ale – on 10 January 2023, 08:10:14

Actually now it works!! LOL

In this order:

* plug it in (hot)
* wait a handful of seconds (check using lsmod that the modules nvidia, nvidia_modeset and nvidia_drm have been loaded automatically as a result of the plugging-in)
* load manually the missing nvidia_uvm through
Code: [Select]
sudo modprobe nvidia_uvm
* check with
Code: [Select]
python -c "import torch; print(torch.cuda.is_available())"
* enjoy!!

So to avoid the manual modprobe step, I just added this rule in /lib/udev/rules.d/60-nvidia.rules:

Code: [Select]

KERNEL=="nvidia_modeset", RUN+="/usr/bin/bash -c '/usr/bin/modprobe nvidia_uvm'"

I made it depend on nvidia_modeset and not nvidia because I wanted to be sure that all the actions triggered by nvidia were already done. Not sure if this is required though, if not, then make it depend on nvidia instead.
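
A possible variant of the rule, sketched under some assumptions: files under /lib/udev/rules.d belong to packages and can be overwritten on update, so this writes a separate file to /etc/udev/rules.d instead. The ACTION and SUBSYSTEM matches are additions of mine (assuming the module load is reported as an "add" event in the "module" subsystem), modprobe is called directly without the bash wrapper, and the file name is made up. It has not been tested on the setup from this thread:

```shell
# Write a local rule that loads nvidia_uvm once nvidia_modeset appears.
# $1: target directory; defaults to the location for local udev rules.
# The file name 61-nvidia-uvm.rules is hypothetical.
install_uvm_rule() {
    dir=${1:-/etc/udev/rules.d}
    cat > "$dir/61-nvidia-uvm.rules" <<'EOF'
# Load nvidia_uvm when the nvidia_modeset module is added (hot-plugged eGPU).
ACTION=="add", SUBSYSTEM=="module", KERNEL=="nvidia_modeset", RUN+="/usr/bin/modprobe nvidia_uvm"
EOF
}

# After installing (as root), reload the rules:
#   udevadm control --reload
```

If the extra matches turn out to be too strict, the original one-line rule remains the known-working version.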

So, now this works perfectly!

Not sure where I should send a pull request? Here? https://github.com/archlinux/svntogit-packages/blob/packages/nvidia-utils/trunk/nvidia.rules

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #28 – 10 January 2023, 08:52:11

Quote

Not sure where I should send a pull request? Here? https://github.com/archlinux/svntogit-packages/blob/packages/nvidia-utils/trunk/nvidia.rules

Well that link is for Arch and not Artix.
For Artix, I think I would open another thread to request that the rule be added to the nvidia-utils package in "Package management" and see what the devs have to say.

Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")

Reply #29 – 10 January 2023, 08:56:08

Quote from: jspaces – on 10 January 2023, 08:52:11

Well that link is for Arch and not Artix.
For Artix, I think I would open another thread to request that the rule be added to the nvidia-utils package in "Package management" and see what the devs have to say.

I mean, I'm pretty sure Arch has exactly the same problem...