Artix Linux Forum

Artix Linux => System => Topic started by: Ale on 08 January 2023, 10:06:38

Title: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 08 January 2023, 10:06:38
Hi,

in order to investigate some issue myself (https://forum.artixlinux.org/index.php/topic,4926.msg31528.html),
and knowing that Xubuntu 22.10 with kernel 5.19.0-28-generic gives satisfactory results (i.e. the eGPU is listed in lspci after a hotplug event), I decided to try kernel 5.19.0 on Artix.

My aim is to see if the eGPU hotplug works on Artix too, with the 5.19.0, and from there try to narrow down which kind of commit has led to this regression.

So I built it and installed it (after a download from the kernel.org archive), using the "traditional method" (with mkinitcpio). I had to replace the pahole-flags.sh file with a newer one.
But in the end I had a kernel and an initramfs image.

But when I try to boot this new kernel, I'm stuck very early in the boot sequence, at "Starting udevd" or something like that.

I'm aware that this could have something to do with

Code: [Select]
HOOKS=(base udev autodetect modconf kms keyboard keymap consolefont block filesystems fsck)

that is found in /etc/mkinitcpio.conf.

However I have no idea how to go from there.

Does that mean that autodetect has a problem?

By the way, the "fallback" initramfs 5.19.0 works (the boot sequence completes and I can use my system).
(and by the way, when using this fallback-5.19.0, the eGPU hotplugging does NOT work, which is a bit surprising)

Any hint welcome!
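
Editor's sketch for readers who hit the same early-boot hang: with the stock mkinitcpio preset, the fallback image is built without the autodetect hook, so a default image that hangs while the fallback boots suggests a module that autodetect left out. One way to look for it, assuming both images exist, is to diff their module lists. The function name and the image paths in the comments are hypothetical; lsinitcpio ships with mkinitcpio.

```shell
#!/bin/sh
# Sketch: list kernel modules that are in the fallback initramfs but not
# in the default one. Inputs are two text files holding image listings.

# modules_only_in_fallback DEFAULT_LIST FALLBACK_LIST
modules_only_in_fallback() {
    default_mods=$(mktemp)
    # Keep only module files (*.ko, *.ko.zst, ...) from each listing.
    grep '\.ko' "$1" | sort -u > "$default_mods"
    # comm -13 prints the lines found only in the second input.
    grep '\.ko' "$2" | sort -u | comm -13 "$default_mods" -
    rm -f "$default_mods"
}

# Example (image paths are hypothetical, adjust to your /boot):
#   lsinitcpio /boot/initramfs-5.19.0.img          > default.txt
#   lsinitcpio /boot/initramfs-5.19.0-fallback.img > fallback.txt
#   modules_only_in_fallback default.txt fallback.txt
```

Any module printed this way is a candidate for the MODULES=() array in /etc/mkinitcpio.conf; whether one of them is the culprit here is not something the thread confirms.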
Title: Re: Help building an old kernel (5.19.0)
Post by: gavincc on 08 January 2023, 12:13:08
as an idea, have you considered trying the lts kernel - that's on 5.15 and in the system repo.?
Title: Re: Help building an old kernel (5.19.0)
Post by: jspaces on 08 January 2023, 12:45:01
How about using the old PKGBUILD for the last 5.19 kernel?
Linux 5.19.12 PKGBUILD (https://gitea.artixlinux.org/packagesL/linux/src/commit/247bb0896aa3fdcf73dc7ab09ec6afc3f30b4c5c/x86_64/core/PKGBUILD)
linux-5.19.12 config file (https://gitea.artixlinux.org/packagesL/linux/src/commit/247bb0896aa3fdcf73dc7ab09ec6afc3f30b4c5c/x86_64/core/config)

Though the issue with eGPU detection may be more a matter of a udev rule that triggers the process, or something similar.
Maybe check the Xubuntu install where you observed hotplug working normally and see if there is a udev rule. If there is, migrate it over to /etc/udev/rules.d/, with path and name modifications if required.
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 12:57:43
How about using the old PKGBUILD for the last 5.19 kernel?
Linux 5.19.12 PKGBUILD (https://gitea.artixlinux.org/packagesL/linux/src/commit/247bb0896aa3fdcf73dc7ab09ec6afc3f30b4c5c/x86_64/core/PKGBUILD)
linux-5.19.12 config file (https://gitea.artixlinux.org/packagesL/linux/src/commit/247bb0896aa3fdcf73dc7ab09ec6afc3f30b4c5c/x86_64/core/config)

Though the issue with eGPU detection may be more a matter of a udev rule that triggers the process, or something similar.
Maybe check the Xubuntu install where you observed hotplug working normally and see if there is a udev rule. If there is, migrate it over to /etc/udev/rules.d/, with path and name modifications if required.

Nice idea, although, just by checking, there are only two files, /etc/udev/rules.d/70-snap.firefox.rules and /etc/udev/rules.d/70-snap.snapd.rules, the first one only deals with U2F, and in the second one there is no mention of NVidia GPUs or of Razer enclosures.

I'll try, though, but without great hopes!
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 13:00:38
as an idea, have you considered trying the lts kernel - that's on 5.15 and in the system repo.?

Thank you for the idea, but I have already tried that, and it still does not work with the 5.15 LTS of Artix, unfortunately...

Although what is funny is that it DOES work for Ubuntu's 5.15.0-43-generic... (but IN UBUNTU, as I have not tried this particular one in Artix)

I have already tried to recompile a 5.15 on Artix, but it was the 5.15.85, and the result is that hotplug still does not work.

So there are two possibilities: either Ubuntu introduces fixes in their -43-generic patch set, or the mainline kernel itself breaks the feature between the 5.15.0 and 5.15.85!
Title: Re: Help building an old kernel (5.19.0)
Post by: lq on 08 January 2023, 13:09:56
However I have no idea how to go from there.

How about reading Wiki?

https://wiki.archlinux.org/title/External_GPU
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 13:13:13
How about reading Wiki?

https://wiki.archlinux.org/title/External_GPU

Yeah, thanks. I had already read it many times, but I'd point out that the first paragraphs of the wiki already leave me feeling hopeless, as my problem is precisely that my eGPU does not appear at all in lspci when hotplugged.

(although all works well when the eGPU is plugged-before-boot)

EDIT: I am now seeing that there is something about Thunderbolt authorization. You may be onto something there!! If so, really really thank you...
Title: Re: Help building an old kernel (5.19.0)
Post by: lq on 08 January 2023, 13:19:25
Yeh, thanks, well I had already read it many times,

Have you read this too?

https://wiki.archlinux.org/title/Thunderbolt#User_device_authorization
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 13:20:58
Have you read this too?

https://wiki.archlinux.org/title/Thunderbolt#User_device_authorization

Nope, I hadn't read it!!

You beat me by a few seconds, but I had added an edit to my previous message to say exactly that!!

Thank you, I think this is the solution!

By the way, I'm pretty sure this is a recent addition to the wiki. I swear I've read this wiki hundreds of times!

Aaaand no, it is not a recent addition. I must be stupid!!
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 13:25:51
@lq And THERE IT IS!!! I just added the rule

Code: [Select]
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"

to my udev files, loaded

Code: [Select]
sudo udevadm control --reload

plugged the eGPU in, and it appears in lspci!!!

Thank you so much!!! And thinking I was getting lost in recompiling kernels dozens of times!!
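
Editor's sketch of the steps above, for anyone landing here from a search. It writes the rule into a rules directory and leaves the reload to the caller. The function name and the file name 99-thunderbolt-authorize.rules are arbitrary choices (assumptions, not from this thread); note that this rule authorizes every Thunderbolt device that gets plugged in.

```shell
#!/bin/sh
# Sketch: install the Thunderbolt auto-authorization rule from this post.
# The file name below is a hypothetical choice; any *.rules name works.

# write_tb_rule [DIR]: write the rule file into DIR and print its path.
write_tb_rule() {
    rules_dir=${1:-/etc/udev/rules.d}
    rule_file="$rules_dir/99-thunderbolt-authorize.rules"
    cat > "$rule_file" <<'EOF'
# Authorize every Thunderbolt device as soon as it is added (permissive).
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
EOF
    printf '%s\n' "$rule_file"
}

# Typical use, as root:
#   write_tb_rule /etc/udev/rules.d && udevadm control --reload
```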
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 13:27:04
Now the really weird situation, forum-wise, is that the solution of the other topic is here, although the title of this one is unrelated...
Title: Re: Help building an old kernel (5.19.0)
Post by: lq on 08 January 2023, 14:03:07
Now the really weird situation, forum-wise, is that the solution of the other topic is here, although the title of this one is unrelated...

I'm pretty sure it's not really a problem for you.  ;D
Title: Re: Help building an old kernel (5.19.0)
Post by: jspaces on 08 January 2023, 14:10:59
No udev rule exists, hmmm well it was worth a shot.
Title: Re: Help building an old kernel (5.19.0)
Post by: jspaces on 08 January 2023, 14:16:18
Quote
Code: [Select]
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
I guess I was right about it being an udev rule.
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 15:37:02
@jspaces Maybe! However there is no trace of this kind of rule in the two files I found in Ubuntu...

Moreover, this rule is too broad. I tried to narrow it down, but failed.

For example I tried:

Code: [Select]
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{vendor}=="0x10de", ATTR{authorized}=="0", ATTR{authorized}="1"
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{vendor}=="0x8086", ATTR{authorized}=="0", ATTR{authorized}="1"

with 8086 being Intel and 10de being NVidia, but when I do this it does not work...

I think it is a bad idea, security-wise, to keep a rule this lenient...
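
Editor's sketch: one possible reason the vendor match fails (an assumption, not verified on this hardware) is that the vendor attribute on the thunderbolt subsystem describes the Thunderbolt device itself, i.e. the enclosure, rather than the PCI vendor of the GPU inside it. Listing what the kernel exposes under /sys/bus/thunderbolt/devices shows which values a narrower rule could match on, for example ATTR{unique_id} or ATTR{vendor_name}. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the identifying sysfs attributes of each Thunderbolt device.

# list_tb_devices [ROOT]: ROOT defaults to the kernel's sysfs directory.
list_tb_devices() {
    tb_root=${1:-/sys/bus/thunderbolt/devices}
    for dev in "$tb_root"/*; do
        # Skip entries without an "authorized" attribute (e.g. domains).
        [ -f "$dev/authorized" ] || continue
        printf '%s:' "$(basename "$dev")"
        for attr in vendor vendor_name device_name unique_id authorized; do
            if [ -r "$dev/$attr" ]; then
                printf ' %s=%s' "$attr" "$(cat "$dev/$attr")"
            fi
        done
        printf '\n'
    done
}

# Usage: plug the enclosure in, then run
#   list_tb_devices
```

A rule restricted to one enclosure would then add a match such as ATTR{unique_id}=="<value printed above>" before the ATTR{authorized}="1" assignment.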
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 15:49:50
@jspaces OK you were right actually!

I was searching in the wrong place.

Ubuntu thinks it is a good idea to put udev rules in /lib/udev/rules.d ...

So, here is what I quickly found:

Code: [Select]
$ ack thunderbolt
90-bolt.rules
13:# start bolt service if we have a thunderbolt device connected
14:SUBSYSTEM=="thunderbolt", TAG+="systemd", ENV{SYSTEMD_WANTS}+="bolt.service"

90-nm-thunderbolt.rules
4:ACTION!="add|change|move", GOTO="nm_thunderbolt_end"
6:# Load he thunderbolt-net driver if we a device of type thunderbolt_xdomain
8:SUBSYSTEM=="thunderbolt", ENV{DEVTYPE}=="thunderbolt_xdomain", RUN{builtin}+="kmod load thunderbolt-net"
10:# For all thunderbolt network devices, we want to enable link-local configuration
11:SUBSYSTEM=="net", ENV{ID_NET_DRIVER}=="thunderbolt-net", ENV{NM_AUTO_DEFAULT_LINK_LOCAL_ONLY}="1"
13:LABEL="nm_thunderbolt_end"

I'm not really sure that I can reuse the rules, as there seems to be some systemd voodoo mingled inside.

OK so the meaningful line is

Code: [Select]
SUBSYSTEM=="thunderbolt", TAG+="systemd", ENV{SYSTEMD_WANTS}+="bolt.service"

but this only starts the service "bolt" through which the user can allow or deny the connected device.
The only problem is that this service "bolt" depends heavily on systemd, so I won't try to use that.
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 08 January 2023, 15:56:31
I'm pretty sure it's not really a problem for you.  ;D

LOL believe it or not, it does bother me, because it was also an interesting problem to know what was preventing the kernel from booting...
Title: Re: Help building an old kernel (5.19.0)
Post by: jspaces on 08 January 2023, 19:34:52
Try taking a look inside the systemd "bolt.service" unit file as a basis for creating a Thunderbolt service for the init that you use. It should be possible to start such a service for your init from the udev rule.
Maybe this package is required:
Code: [Select]
galaxy/bolt 0.9.4-1 (160.7 KiB 453.6 KiB) ->     Thunderbolt 3 device manager
Title: Re: Help building an old kernel (5.19.0)
Post by: Ale on 09 January 2023, 00:26:04
@jspaces Sooo, I discovered that I had already installed a package named bolt (that provides boltctl).

I ran asp checkout bolt to look inside, and to my surprise (and horror) its PKGBUILD states

Code: [Select]
depends=('polkit' 'systemd')

How is that possible given that I'm pretty sure that I haven't installed systemd?
Does the PKGBUILD from asp not reflect what is actually built by Artix?

Also I discovered the existence of a bolt-openrc package, but asp checkout bolt-openrc returns me an error

Code: [Select]
$ asp checkout bolt-openrc
error: unknown package: bolt-openrc

(and YEAH I know that I could just pacman it right away, but I really just want to see what it does beforehand)
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: gripped on 09 January 2023, 00:47:55
How is that possible given that I'm pretty sure that I haven't installed systemd?
Does the PKGBUILD from asp not reflect what is actually built by Artix?
Code: [Select]
pacman -Si asp
Repository      : extra
Name            : asp
Version         : 8-1
Description     : Arch Linux build source file management tool
Architecture    : any
URL             : https://github.com/falconindy/asp
Licenses        : MIT
Groups          : None
Provides        : None
Depends On      : awk  bash  jq  git  libarchive
Optional Deps   : None
Conflicts With  : None
Replaces        : None
Download Size   : 10.56 KiB
Installed Size  : 20.13 KiB
Packager        : Jelle van der Waa <[email protected]>
Build Date      : Wed 03 Nov 2021 09:23:18 GMT
Validated By    : MD5 Sum  SHA-256 Sum  Signature
asp is an arch package. You have the arch bolt PKGBUILD (and package)

artix-archlinux-support provides a dummy systemd. The (arch) bolt package believes the systemd dependency is met. But in reality it isn't.

I'd love to see an artix version of asp but it's not currently a thing.  :'(
Code: [Select]
git clone https://gitea.artixlinux.org/packagesB/bolt.git
For the artix version
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 09 January 2023, 01:20:52
@gripped Thanks a lot! This is spot on.

I did not know that asp does not provide Artix packages. I'll check whether I have inadvertently installed unintended dependencies because of that in the past...
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: jspaces on 09 January 2023, 05:26:35
The bolt packages are in Artix's repositories.
Code: [Select]
$ pacman -Ss bolt
galaxy/bolt 0.9.5-1
    Thunderbolt 3 device manager
galaxy/bolt-dinit 20211102-2 (dinit-galaxy)
    dinit service scripts for bolt
galaxy/bolt-openrc 20210506-1 (openrc-galaxy)
    OpenRC script for bolt
galaxy/bolt-runit 20210426-1
    runit service scripts for bolt
galaxy/bolt-s6 20210919-1 (s6-galaxy)
    s6-rc service scripts for bolt
community/bolt 0.9.5-1
    Thunderbolt 3 device manager
Try installing the bolt package from galaxy (as the line in my prior post showed, if you read it carefully) together with the bolt-{openrc,runit,dinit,s6} package for your init, then start the service. Test whether the Thunderbolt eGPU hotplug is picked up and go from there.
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 09 January 2023, 08:36:51
So, I have tried the bolt package, and removed all the udev stuff I had added manually.

It ends up giving me the same result: yes the device appears in lspci.

However, now I noticed a problem that was also present with the manual udev editing method: OK it passes PCI, but then the hotplugging fails somewhere with NVidia drivers.

It does not fail, in both cases (manual or bolt), when the eGPU is plugged-in at boot time.

So, more details about how the hotplugging fails (both with the udev tricks and with bolt):

Code: [Select]
# At this point I plug in the eGPU which is powered on.
$ nvidia-smi
Mon Jan  9 16:33:55 2023      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 42%   19C    P0    32W / 200W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ python -c "import torch; print(torch.cuda.is_available())"
False
$ nvidia-smi
No devices were found

Here is the dmesg:

Code: [Select]
[ 4193.491500] nvidia-nvlink: Nvlink Core is being initialized, major device number 509

[ 4193.491947] nvidia 0000:04:00.0: enabling device (0000 -> 0003)
[ 4193.492125] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4193.541270] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  525.60.11  Release Build  (archlinux-builder@mymachine) 
[ 4193.576480] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  525.60.11  Release Build  (archlinux-builder@mymachine) 
[ 4193.578450] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 4193.578452] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
[ 4195.048738] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[ 4197.079688] NVRM gpuInitOptimusSettings_IMPL: SBIOS did not acknowledge cfg space owner change
[ 4197.473074] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4197.473076] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4197.473080] NVRM nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:1926
[ 4210.197920] nvidia-uvm: Loaded the UVM driver, major device number 507.
[ 4210.632379] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4210.632382] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4210.643384] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4210.643387] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4210.643392] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4210.643396] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4210.643400] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4210.644922] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4210.645481] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4211.060534] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4211.060538] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4211.071629] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4211.071632] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4211.071637] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4211.071640] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4211.071645] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4211.073314] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4211.073892] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.101002] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.101005] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.112318] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.112321] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.112326] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.112330] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.112334] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.113864] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.114525] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.528551] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.528553] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.539882] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.539884] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.539889] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.539893] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.539897] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.541376] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.541898] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0

Meanwhile, in XUbuntu 22.10, the hotplugging works like a charm even when testing pytorch with the same command...

So, there is probably some difference in the udev handling and the scripts behind...
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: jspaces on 09 January 2023, 19:00:19
Code: [Select]
[ 4195.048738] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
Yep, this error is a doom-and-gloom, pictures-at-eleven situation.
Since the NVIDIA driver is not picking the card up, the only thing that comes to mind is to reload the nvidia modules once lspci shows the card connected, to see if they pick up the new information from the system.

One would have to think a little outside the box now to get it going. The NVIDIA driver internals are a blob, not something us mere mortals know much about; just look at how much reverse engineering nouveau has to do to provide an open source video solution.
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 08:10:14
Actually now it works!! LOL

In this order:

* plug it in (hot)
* wait a handful of seconds (check using lsmod that the modules nvidia, nvidia_modeset and nvidia_drm have been loaded automatically as a result of the plugging-in)
* load manually the missing nvidia_uvm through
Code: [Select]
sudo modprobe nvidia_uvm
* check with
Code: [Select]
python -c "import torch; print(torch.cuda.is_available())"
* enjoy!!
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 08:13:09
And if I take care to rmmod all 4 modules beforehand, I can even hot-unplug!!

Of course XOrg is completely out in this story.

Next step: it looks like XUbuntu (and probably Ubuntu in general) can attach the hotplugged GPU to XOrg!!
(it appears in
Code: [Select]
xrandr --listproviders
)

So I bet there would be a way to do it in Arch/Artix too, but the how is another level.
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: jspaces on 10 January 2023, 08:38:41
Awesome, I am glad that your eGPU is now recognized by the nvidia driver after the hot plug.
It is always a pleasure to succeed at getting something to work when the path to get there is not straightforward.

Quote
Next step: it looks like XUbuntu (and probably Ubuntu in general) can attach the hotplugged GPU to XOrg!!
So I bet there would be a way to do it in Arch/Artix too, but the how is another level.
What about checking the Xorg logs on Xubuntu? Maybe some hints exist there.
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 08:43:02
Actually now it works!! LOL

In this order:

* plug it in (hot)
* wait a handful of seconds (check using lsmod that the modules nvidia, nvidia_modeset and nvidia_drm have been loaded automatically as a result of the plugging-in)
* load manually the missing nvidia_uvm through
Code: [Select]
sudo modprobe nvidia_uvm
* check with
Code: [Select]
python -c "import torch; print(torch.cuda.is_available())"
* enjoy!!

So to avoid the manual modprobe step, I just added this rule in /lib/udev/rules.d/60-nvidia.rules:

Code: [Select]
KERNEL=="nvidia_modeset", RUN+="/usr/bin/bash -c '/usr/bin/modprobe nvidia_uvm'"

I made it depend on nvidia_modeset and not nvidia because I wanted to be sure that all the actions triggered by nvidia were already done. Not sure if this is required though, if not, then make it depend on nvidia instead.

So, now this works perfectly!

Not sure where I should send a pull request? Here? https://github.com/archlinux/svntogit-packages/blob/packages/nvidia-utils/trunk/nvidia.rules
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: jspaces on 10 January 2023, 08:52:11
Quote
Not sure where I should send a pull request? Here? https://github.com/archlinux/svntogit-packages/blob/packages/nvidia-utils/trunk/nvidia.rules
Well that link is for Arch and not Artix.
For Artix, I think I would open another thread to request that the rule be added to the nvidia-utils package in "Package management" and see what the devs have to say.
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 08:56:08
Well that link is for Arch and not Artix.
For Artix, I think I would open another thread to request that the rule be added to the nvidia-utils package in "Package management" and see what the devs have to say.

I mean, I'm pretty sure Arch has exactly the same problem...
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: jspaces on 10 January 2023, 09:06:08
Sure share it and maybe some other distros as well.
The spirit of sharing is why we love open source community so much.
Peace. :)
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 10:16:09
Sure share it and maybe some other distros as well.
The spirit of sharing is why we love open source community so much.
Peace. :)

Yep, I will share it on the Arch forum, and other Arch-derived distros including Artix will probably inherit the fix.

In Debian-based distros it is completely different and the fix does not apply, and is maybe not even required.
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 10:51:23
Wait... I claimed victory too early...

As soon as I use the card, e.g. with nvidia-smi, or with python and torch.cuda.is_available(), the next time I use it the GPU is unavailable...

And immediately after the first use, dmesg adds these lines:

Code: [Select]
[  592.012595] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[  594.048546] NVRM gpuInitOptimusSettings_IMPL: SBIOS did not acknowledge cfg space owner change
[  594.417099] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[  594.417102] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[  594.417107] NVRM nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:1926
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 11:03:31
And actually the situation is the same if I remove my "fix" in 60-nvidia.rules....

I was just too happy that pytorch was returning True, and I was not testing nearly thoroughly enough...
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 10 January 2023, 11:07:05
OK so the secret is to launch:

Code: [Select]
sudo rc-service nvidia-persistenced start

And it does not seem really necessary to launch the "missing" module nvidia_uvm (at least since I'm not using XOrg with the GPU).

NOTE: the exact order below is important:

* plug in the eGPU
* wait for the modules to load (no need for nvidia_uvm)
* sudo rc-service nvidia-persistenced start

For hot-unplugging (yes, it is possible as long as XOrg is not involved, as is the case here), use the reverse order:

* sudo rc-service nvidia-persistenced stop
* sudo rmmod all of the nvidia* modules
* unplug
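
Editor's sketch of the "rmmod all of the nvidia* modules" step: rmmod refuses to remove a module that another loaded module still uses, so the dependents have to go first. The helper below (hypothetical name) reads /proc/modules and prints the loaded nvidia modules in an order that respects that: nvidia_uvm and nvidia_drm before nvidia_modeset, and nvidia last.

```shell
#!/bin/sh
# Sketch: print the loaded nvidia modules in a safe unload order.

# nvidia_unload_order [FILE]: FILE defaults to /proc/modules.
nvidia_unload_order() {
    modules_file=${1:-/proc/modules}
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        # /proc/modules has one module per line, name in the first column.
        if grep -q "^$mod " "$modules_file"; then
            printf '%s\n' "$mod"
        fi
    done
}

# As root, before unplugging:
#   rc-service nvidia-persistenced stop
#   for m in $(nvidia_unload_order); do rmmod "$m"; done
```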
Title: Re: Thunderbolt on Artix, (renamed, was: "Help building an old kernel (5.19.0)")
Post by: Ale on 14 January 2023, 10:48:26
With the recent Artix upgrade (kernel 6.1.4-artix1-1, nvidia-open-dkms 525.78.01-8, etc.), I no longer need any trick other than installing and enabling boltd.

* I no longer need to start or stop the nvidia-persistenced service, which stays stopped at all times,
* I don't need to modify the udev rules
* the sequence for hotplug (for CUDA compute only) becomes: 1/ plug 2/ enjoy
* the sequence for hot-unplug (CUDA) becomes: 1/ rmmod all nvidia* drivers 2/ unplug