Hi,
in order to investigate some issue myself (https://forum.artixlinux.org/index.php/topic,4926.msg31528.html),
and knowing that a XUbuntu 22.10 with kernel 5.19.0-28-generic was giving satisfactory results (i.e. the eGPU being listed in lspci after a hotplug event), I decided to try a 5.19.0 kernel on Artix.
My aim is to see if eGPU hotplug works on Artix too with 5.19.0, and from there try to narrow down which commit led to this regression.
So I built and installed it (after downloading it from the kernel.org archive), using the "traditional method" (with mkinitcpio). I had to replace the pahole-flags.sh file with a newer one.
But in the end I had a kernel and an initramfs image.
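For the record, the build went roughly like this (a sketch from memory; the .config source, the vmlinuz-linux519 name and the 5.19.0 module directory are just what I happened to use):
# (inside the unpacked linux-5.19 source tree)
zcat /proc/config.gz > .config        # reuse the running kernel's config, if it is exposed
make olddefconfig                     # answer any new options with their defaults
make -j"$(nproc)"
sudo make modules_install
sudo cp arch/x86/boot/bzImage /boot/vmlinuz-linux519
sudo mkinitcpio -k 5.19.0 -g /boot/initramfs-linux519.img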
But when I try to boot this new kernel, I'm stuck very early in the boot sequence, at "Starting udevd" or something like that.
I'm aware that this could have something to do with
HOOKS=(base udev autodetect modconf kms keyboard keymap consolefont block filesystems fsck)
that is found in /etc/mkinitcpio.conf.
However I have no idea how to go from there.
Does that mean that autodetect has a problem?
By the way, the "fallback" initramfs 5.19.0 works (the boot sequence completes and I can use my system).
(and by the way, when using this fallback-5.19.0, the eGPU hotplugging does NOT work, which is a bit surprising)
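That makes me suspect autodetect indeed, since (as far as I understand) the fallback image is the same image built with that hook skipped. I guess I could test this with something like the following (assuming 5.19.0 is the module directory name):
# rebuild the main image with the autodetect hook skipped, as the fallback preset does
sudo mkinitcpio -k 5.19.0 -S autodetect -g /boot/initramfs-linux519-noautodetect.img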
Any hint welcome!
As an idea, have you considered trying the LTS kernel? That's on 5.15 and in the system repo.
How about using the old PKGBUILD for the last 5.19 kernel?
Linux 5.19.12 PKGBUILD (https://gitea.artixlinux.org/packagesL/linux/src/commit/247bb0896aa3fdcf73dc7ab09ec6afc3f30b4c5c/x86_64/core/PKGBUILD)
linux-5.19.12 config file (https://gitea.artixlinux.org/packagesL/linux/src/commit/247bb0896aa3fdcf73dc7ab09ec6afc3f30b4c5c/x86_64/core/config)
Though the issue with eGPU detection may be more a matter of a udev rule to trigger the process, or something similar.
Maybe check Xubuntu, where you observed the hotplug function normally, and see if there is a udev rule. If there is, migrate it over to /etc/udev/rules.d/ with path and name modifications if required.
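A quick way to check on the Xubuntu side could be something like this (just the locally-administered rules directory; adjust paths as needed):
# list local udev rules on Xubuntu and grep them for anything thunderbolt-related
ls /etc/udev/rules.d/
grep -ri thunderbolt /etc/udev/rules.d/ 2>/dev/null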
Nice idea, although, just by checking, there are only two files, /etc/udev/rules.d/70-snap.firefox.rules and /etc/udev/rules.d/70-snap.snapd.rules, the first one only deals with U2F, and in the second one there is no mention of NVidia GPUs or of Razer enclosures.
I'll try, though, but without great hopes!
Thank you for the idea, but I have already tried that, and it still does not work with the 5.15 LTS of Artix, unfortunately...
Although what is funny is that it DOES work for Ubuntu's 5.15.0-43-generic... (but IN UBUNTU, as I have not tried this particular one in Artix)
I have already tried to recompile a 5.15 on Artix, but it was the 5.15.85, and the result is that hotplug still does not work.
So there are two possibilities: either Ubuntu introduces fixes in their -43-generic patch set, or the mainline kernel itself breaks the feature somewhere between 5.15.0 and 5.15.85!
How about reading the wiki?
https://wiki.archlinux.org/title/External_GPU
Yeah, thanks, well I had already read it many times, but I'd point out that from the first paragraphs of the wiki I already feel hopeless, as my problem is precisely that my eGPU does not appear at all in lspci when hotplugged (although everything works well when the eGPU is plugged in before boot).
EDIT: I am now seeing that there is something about Thunderbolt authorization. You may be onto something there!! If so, really really thank you...
Have you read this too?
https://wiki.archlinux.org/title/Thunderbolt#User_device_authorization
Nope, I hadn't read it!!
You beat me by a few seconds, but I had added an edit to my previous message to say exactly that!!
Thank you, I think this is the solution!
By the way I'm pretty sure this is a recent addition to the wiki. I swear I've read this wiki hundreds of times! Aaaand no, it is not a recent addition. I must be stupid!!
@lq And THERE IT IS!!! I just added the rule
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1"
to my udev rules, ran
sudo udevadm control --reload
plugged the eGPU in,
and it appears in lspci!!! Thank you so much!!! And to think I was getting lost recompiling kernels dozens of times!!
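If the eGPU is already plugged in when the rule is added, I think something like this re-applies it without replugging (the 0-1 device path is just an example; check /sys/bus/thunderbolt/devices/):
# replay the add event so the new rule fires for the already-connected device
sudo udevadm trigger --subsystem-match=thunderbolt --action=add
# or authorize it by hand through sysfs
echo 1 | sudo tee /sys/bus/thunderbolt/devices/0-1/authorized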
Now the really weird situation, forum-wise, is that the solution of the other topic is here, although the title of this one is unrelated...
I'm pretty sure it's not really a problem for you. ;D
No udev rule exists, hmmm, well it was worth a shot.
I guess I was right about it being a udev rule.
@jspaces Maybe! However there is no trace of this kind of rule in the two files I found in Ubuntu...
Moreover this rule is too broad; I wanted to narrow it down, but failed.
For example I tried:
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{vendor}=="0x10de", ATTR{authorized}=="0", ATTR{authorized}="1"
ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{vendor}=="0x8086", ATTR{authorized}=="0", ATTR{authorized}="1"
with 8086 being Intel and 10de being NVidia, but when I do this it does not work...
I think it is a bad idea, security-wise, to keep a rule this lenient...
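If I ever retry narrowing it down, I suppose the first step would be to look at which attributes the thunderbolt device actually exposes for matching, e.g. (the 0-1 path is just an example):
# walk the attributes udev can match on, for the enclosure and its parent devices
udevadm info --attribute-walk --path=/sys/bus/thunderbolt/devices/0-1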
@jspaces OK you were right actually!
I was searching in the wrong place.
Ubuntu thinks it is a good idea to put udev rules in /lib/udev/rules.d ...
So, here is what I quickly found:
$ ack thunderbolt
90-bolt.rules
13:# start bolt service if we have a thunderbolt device connected
14:SUBSYSTEM=="thunderbolt", TAG+="systemd", ENV{SYSTEMD_WANTS}+="bolt.service"
90-nm-thunderbolt.rules
4:ACTION!="add|change|move", GOTO="nm_thunderbolt_end"
6:# Load he thunderbolt-net driver if we a device of type thunderbolt_xdomain
8:SUBSYSTEM=="thunderbolt", ENV{DEVTYPE}=="thunderbolt_xdomain", RUN{builtin}+="kmod load thunderbolt-net"
10:# For all thunderbolt network devices, we want to enable link-local configuration
11:SUBSYSTEM=="net", ENV{ID_NET_DRIVER}=="thunderbolt-net", ENV{NM_AUTO_DEFAULT_LINK_LOCAL_ONLY}="1"
13:LABEL="nm_thunderbolt_end"
I'm not really sure that I can reuse the rules, as there seems to be some systemd voodoo mixed in.
OK so the meaningful line is
SUBSYSTEM=="thunderbolt", TAG+="systemd", ENV{SYSTEMD_WANTS}+="bolt.service"
but this only starts the service "bolt" through which the user can allow or deny the connected device.
The only problem is that this service "bolt" depends heavily on systemd, so I won't try to use that.
LOL believe it or not, it does bother me, because it was also an interesting problem to know what was preventing the kernel from booting...
Try taking a look inside the systemd "bolt.service" unit as a basis for creating a thunderbolt service for the init that you use. It should be possible to start a thunderbolt service for your init with the udev rule.
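As a very rough, untested sketch (assuming the bolt-openrc package provides a service named boltd and that rc-service lives in /usr/bin):
# /etc/udev/rules.d/90-bolt-openrc.rules -- hypothetical file, untested
# kick the bolt daemon via OpenRC when a thunderbolt device appears
ACTION=="add", SUBSYSTEM=="thunderbolt", RUN+="/usr/bin/rc-service boltd start"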
Maybe this is required.
galaxy/bolt 0.9.4-1 (160.7 KiB 453.6 KiB) -> Thunderbolt 3 device manager
@jspaces Sooo, I discovered that I had already installed a package named bolt (which provides boltctl).
I ran asp checkout bolt to look inside, and to my surprise (and horror) its PKGBUILD states
depends=('polkit' 'systemd')
How is that possible, given that I'm pretty sure that I haven't installed systemd?
Does the PKGBUILD from asp not reflect what is actually built by Artix?
Also I discovered the existence of a bolt-openrc package, but asp checkout bolt-openrc returns an error:
$ asp checkout bolt-openrc
error: unknown package: bolt-openrc
(and YEAH I know that I could just pacman it right away, but I really just want to see what it does beforehand)
pacman -Si asp
Repository : extra
Name : asp
Version : 8-1
Description : Arch Linux build source file management tool
Architecture : any
URL : https://github.com/falconindy/asp
Licenses : MIT
Groups : None
Provides : None
Depends On : awk bash jq git libarchive
Optional Deps : None
Conflicts With : None
Replaces : None
Download Size : 10.56 KiB
Installed Size : 20.13 KiB
Packager : Jelle van der Waa <[email protected]>
Build Date : Wed 03 Nov 2021 09:23:18 GMT
Validated By : MD5 Sum SHA-256 Sum Signature
asp is an Arch package. You have the Arch bolt PKGBUILD (and package).
artix-archlinux-support provides a dummy systemd. The (arch) bolt package believes the systemd dependency is met. But in reality it isn't.
I'd love to see an artix version of asp but it's not currently a thing. :'(
git clone https://gitea.artixlinux.org/packagesB/bolt.git
For the artix version
@gripped Thanks a lot! This is spot on.
I did not know that asp does not give Artix packages; I'll check whether I have inadvertently installed unintended dependencies because of that in the past...
The bolt packages are in Artix's repositories.
$ pacman -Ss bolt
galaxy/bolt 0.9.5-1
Thunderbolt 3 device manager
galaxy/bolt-dinit 20211102-2 (dinit-galaxy)
dinit service scripts for bolt
galaxy/bolt-openrc 20210506-1 (openrc-galaxy)
OpenRC script for bolt
galaxy/bolt-runit 20210426-1
runit service scripts for bolt
galaxy/bolt-s6 20210919-1 (s6-galaxy)
s6-rc service scripts for bolt
community/bolt 0.9.5-1
Thunderbolt 3 device manager
Try installing the bolt package from galaxy, as my prior post showed if you read the listing carefully, along with the bolt-{openrc,runit,dinit,s6} package for your init, and start it. Test whether the Thunderbolt eGPU hotplug connection is picked up and go from there.
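Roughly like this on OpenRC (the boltd service name is my assumption; check what bolt-openrc actually installs):
sudo pacman -S bolt bolt-openrc
sudo rc-update add boltd default   # also start it at boot
sudo rc-service boltd start
boltctl list                       # should list the enclosure once it is plugged in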
So, I have tried the bolt package, and removed all the udev stuff I had added manually.
It ends up giving me the same result: yes the device appears in lspci.
However, now I noticed a problem that was also present with the manual udev editing method: the card shows up on PCI, but then the hotplugging fails somewhere in the NVidia drivers.
It does not fail, in either case (manual or bolt), when the eGPU is plugged in at boot time. So, here are more details about how the hotplugging fails (both with the udev tricks and with bolt):
# At this point I plug in the eGPU which is powered on.
$ nvidia-smi
Mon Jan 9 16:33:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11 Driver Version: 525.60.11 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:04:00.0 Off | N/A |
| 42% 19C P0 32W / 200W | 0MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ python -c "import torch; print(torch.cuda.is_available())"
False
$ nvidia-smi
No devices were found
Here is the dmesg:
[ 4193.491500] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 4193.491947] nvidia 0000:04:00.0: enabling device (0000 -> 0003)
[ 4193.492125] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4193.541270] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 525.60.11 Release Build (archlinux-builder@mymachine)
[ 4193.576480] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 525.60.11 Release Build (archlinux-builder@mymachine)
[ 4193.578450] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 4193.578452] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
[ 4195.048738] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[ 4197.079688] NVRM gpuInitOptimusSettings_IMPL: SBIOS did not acknowledge cfg space owner change
[ 4197.473074] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4197.473076] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4197.473080] NVRM nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:1926
[ 4210.197920] nvidia-uvm: Loaded the UVM driver, major device number 507.
[ 4210.632379] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4210.632382] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4210.643384] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4210.643387] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4210.643392] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4210.643396] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4210.643400] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4210.644922] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4210.645481] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4211.060534] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4211.060538] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4211.071629] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4211.071632] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4211.071637] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4211.071640] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4211.071645] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4211.073314] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4211.073892] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.101002] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.101005] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.112318] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.112321] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.112326] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.112330] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.112334] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.113864] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.114525] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.528551] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.528553] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.539882] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.539884] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.539889] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.539893] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.539897] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.541376] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.541898] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
Meanwhile, in XUbuntu 22.10, the hotplugging works like a charm even when testing pytorch with the same command...
So, there is probably some difference in the udev handling and the scripts behind...
[ 4195.048738] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
Yep, this error is a doom-and-gloom, pictures-at-eleven situation.
Since Nvidia is not picking it up, the only thing that comes to mind is whether you can reload the nvidia modules to see if they pick up the new information from the system after lspci shows it connected.
One would have to think a little outside the box now to get it going. Nvidia driver internals are a blob and not something us mere mortals know much about; just look how much work nouveau has to do to provide an open-source video solution, with all the reverse engineering that must be done to see what is going on.
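Something along these lines, maybe (unload in reverse dependency order, then let the driver re-probe; only the modules that are actually loaded, of course):
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe -a nvidia nvidia_modeset nvidia_drm nvidia_uvm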
Actually now it works!! LOL
In this order:
* plug it in (hot)
* wait a handful of seconds (check using lsmod that the modules nvidia, nvidia_modeset and nvidia_drm have been loaded automatically as a result of the plugging-in)
* manually load the missing nvidia_uvm with sudo modprobe nvidia_uvm
* check with python -c "import torch; print(torch.cuda.is_available())"
* enjoy!!
And if I take care to rmmod all 4 modules beforehand, I can even hot-unplug!!
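(For reference, the lsmod check and the unload-before-unplugging are just:)
lsmod | grep -E '^nvidia'                                 # nvidia, nvidia_modeset and nvidia_drm should appear
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia    # before unplugging, if all four are loaded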
Of course Xorg is completely out of the picture in this story.
Next step: it looks like XUbuntu (and probably Ubuntu in general) can attach the hotplugged GPU to Xorg!! (it appears in xrandr --listproviders)
So I bet there would be a way to do it in Arch/Artix too, but the how is another level.
Awesome, I am glad that your eGPU is now recognized by the nvidia driver after the hot plug.
It is always a pleasure to succeed in getting something you want working when the path to get there is not straightforward.
What about checking the Xorg logs on Xubuntu? Maybe some hints exist there.
So to avoid the manual modprobe step, I just added this rule in /lib/udev/rules.d/60-nvidia.rules:
KERNEL=="nvidia_modeset", RUN+="/usr/bin/bash -c '/usr/bin/modprobe nvidia_uvm'"
I made it depend on nvidia_modeset and not nvidia because I wanted to be sure that all the actions triggered by nvidia were already done. Not sure if this is required though; if not, then make it depend on nvidia instead.
So, now this works perfectly!
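(In hindsight it is probably cleaner to put the rule in its own file under /etc/udev/rules.d/, which is read alongside the packaged rules and survives package upgrades, rather than editing the package-owned file; the file name below is arbitrary:)
# /etc/udev/rules.d/61-nvidia-uvm.rules  -- arbitrary name, local admin rules live in /etc
KERNEL=="nvidia_modeset", RUN+="/usr/bin/bash -c '/usr/bin/modprobe nvidia_uvm'"
followed by sudo udevadm control --reload.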
Not sure where I should send a pull request? Here? https://github.com/archlinux/svntogit-packages/blob/packages/nvidia-utils/trunk/nvidia.rules
Well that link is for Arch and not Artix.
For Artix, I think I would open another thread to request that the rule be added to the nvidia-utils package in "Package management" and see what the devs have to say.
I mean, I'm pretty sure Arch has exactly the same problem...
Sure share it and maybe some other distros as well.
The spirit of sharing is why we love open source community so much.
Peace. :)
Yep, I will share it on the Arch forum, and other Arch-derived distros, including Artix, will probably inherit the fix.
In Debian-based distros things are completely different; the fix does not apply there, and is maybe not even required.
Wait... I claimed victory too early...
As soon as I use the card, e.g. with nvidia-smi, or with python and torch.cuda.is_available(), the next time I use it the GPU is unavailable...
And immediately after the first use, dmesg adds these lines:
[ 592.012595] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[ 594.048546] NVRM gpuInitOptimusSettings_IMPL: SBIOS did not acknowledge cfg space owner change
[ 594.417099] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 594.417102] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 594.417107] NVRM nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:1926
And actually the situation is the same if I remove my "fix" in 60-nvidia.rules....
I was just too happy that pytorch was returning True, and I was not testing thoroughly enough...
OK so the secret is to launch:
sudo rc-service nvidia-persistenced start
And it does not really seem necessary to load the "missing" nvidia_uvm module (at least as long as I'm not using Xorg with the GPU).
NOTE: the exact order below is important:
* plug in the eGPU
* wait for the modules to load (no need for nvidia_uvm)
* sudo rc-service nvidia-persistenced start
For hot-unplugging (yes, it is possible as long as Xorg is not involved, as is the case here), use the reverse order:
* sudo rc-service nvidia-persistenced stop
* sudo rmmod all of the nvidia* modules
* unplug
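In shell terms the whole round trip is roughly this (a sketch; OpenRC here, adapt the service commands to your init):
# --- hotplug (CUDA compute only, Xorg not involved) ---
# plug the eGPU in, then wait for the modules to auto-load:
lsmod | grep -E '^nvidia'
sudo rc-service nvidia-persistenced start
python -c "import torch; print(torch.cuda.is_available())"   # should print True

# --- hot-unplug ---
sudo rc-service nvidia-persistenced stop
for m in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do      # unload whatever is loaded
    lsmod | grep -q "^$m " && sudo rmmod "$m"
done
# now it is safe to unplug the enclosure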
With the recent Artix upgrade (kernel 6.1.4-artix1-1, nvidia-open-dkms 525.78.01-8, etc.), I no longer need any trick other than installing and enabling boltd.
* I no longer need to start or stop the nvidia-persistenced service, which stays stopped at all times,
* I don't need to modify the udev rules
* the sequence for hotplug (for CUDA compute only) becomes: 1/ plug 2/ enjoy
* the sequence for hot-unplug (CUDA) becomes: 1/ rmmod all nvidia* drivers 2/ unplug