So, I tried the bolt package and removed all the udev changes I had added manually.
It gives me the same result: the device does appear in lspci.
However, I have now noticed a problem that was also present with the manual udev method: the device shows up on the PCI bus, but hotplugging then fails somewhere in the NVIDIA driver.
In both cases (manual udev or bolt), it does not fail when the eGPU is already plugged in at boot time.
Here are more details on how the hotplugging fails (both with the udev tricks and with bolt):
# At this point I plug in the eGPU which is powered on.
$ nvidia-smi
Mon Jan 9 16:33:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
| 42%   19C    P0    32W / 200W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ python -c "import torch; print(torch.cuda.is_available())"
False
$ nvidia-smi
No devices were found
Here is the dmesg output:
[ 4193.491500] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 4193.491947] nvidia 0000:04:00.0: enabling device (0000 -> 0003)
[ 4193.492125] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4193.541270] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 525.60.11 Release Build (archlinux-builder@mymachine)
[ 4193.576480] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 525.60.11 Release Build (archlinux-builder@mymachine)
[ 4193.578450] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver
[ 4193.578452] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:04:00.0 on minor 1
[ 4195.048738] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[ 4197.079688] NVRM gpuInitOptimusSettings_IMPL: SBIOS did not acknowledge cfg space owner change
[ 4197.473074] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4197.473076] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4197.473080] NVRM nvAssertFailedNoLog: Assertion failed: rmStatus == NV_OK @ osinit.c:1926
[ 4210.197920] nvidia-uvm: Loaded the UVM driver, major device number 507.
[ 4210.632379] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4210.632382] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4210.643384] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4210.643387] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4210.643392] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4210.643396] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4210.643400] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4210.644922] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4210.645481] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4211.060534] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4211.060538] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4211.071629] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4211.071632] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4211.071637] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4211.071640] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4211.071645] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4211.073314] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4211.073892] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.101002] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.101005] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.112318] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.112321] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.112326] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.112330] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.112334] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.113864] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.114525] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 4213.528551] NVRM s_executeBooterUcode_TU102: Booter failed with non-zero error code: 0xa
[ 4213.528553] NVRM kgspExecuteBooterUnloadIfNeeded_TU102: failed to execute Booter Unload: 0xffff
[ 4213.539882] NVRM s_executeFwsec_TU102: failed to execute FWSEC for FRTS: FRTS error code 0xbe
[ 4213.539884] NVRM nvAssertOkFailedNoLog: Assertion failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from kgspExecuteFwsecFrts_HAL(pGpu, pKernelGsp, pKernelGsp->pFwsecUcode, pKernelGsp->pWprMeta->frtsOffset) @ kernel_gsp_ga102.c:164
[ 4213.539889] NVRM nvAssertFailedNoLog: Assertion failed: status == NV_OK @ kernel_gsp_ga102.c:235
[ 4213.539893] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0xffff
[ 4213.539897] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 4213.541376] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x62:0xffff:1622)
[ 4213.541898] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
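Since the same NVRM failure sequence repeats several times in the log above, it is easier to read once the timestamps are stripped and identical messages are counted. Here is a minimal sketch of how that could be done, using only the Python standard library. The helper name `summarize_nvrm` is hypothetical, and the sketch assumes every dmesg line starts with a bracketed timestamp like the ones above.

```python
import re
from collections import Counter

# Matches a dmesg line such as "[ 4197.473074] NVRM foo: bar"
# and captures everything after the bracketed timestamp.
_DMESG_LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+(.*)$")


def summarize_nvrm(lines):
    """Count distinct NVRM messages, ignoring timestamps.

    Hypothetical helper: `lines` is any iterable of dmesg lines.
    Lines without a timestamp or not starting with "NVRM" are skipped.
    """
    counts = Counter()
    for line in lines:
        match = _DMESG_LINE.match(line.strip())
        if match and match.group(1).startswith("NVRM"):
            counts[match.group(1)] += 1
    return counts
```

For example, after saving the log with `dmesg > dmesg.txt` (the file name is just an example), `summarize_nvrm(open("dmesg.txt")).most_common()` lists each distinct NVRM message with the number of times it occurred.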
Meanwhile, on Xubuntu 22.10, hotplugging works like a charm, including when testing PyTorch with the same command.
So there is probably some difference in the udev handling and the scripts behind it.