Topic: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #15
ok so in nano it looks like this repeated ad infinitum

Code: [Select]
Aug 30 11:00:11 mate-elitedesk syslog-ng[216]: syslog-ng starting up; version='4.2.0'
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:  device [8086:a298] error status/mask=00000001/00000000
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:  device [8086:a298] error status/mask=00000001/00000000
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:  device [8086:a298] error status/mask=00000001/00000000
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:  device [8086:a298] error status/mask=00000001/00000000
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:  device [8086:a298] error status/mask=00000001/00000000
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0:    [ 0] RxErr                  (First)
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
Aug 30 11:00:11 mate-elitedesk kernel: pcieport 0000:00:1d.0: AER: can't find device of ID00e8
Cat Herders of Linux

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #16
artix is on a 500gb wd black pcie4 nvme drive connected to a nvme slot on the mb which is pcie3.

windows exists on a faxang 500 gb pcie 3 nvme drive connected to an add in card on a x4 lane.

in case any of that matters.


i also dont have any drives in fstab that aren't mounted.  fstab looks just as it should.

so i need to know which item is ID00e8
Cat Herders of Linux

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #17
Check that your BIOS and CPU microcode are up to date.

Looks like this is the device throwing errors:
https://devicehunt.com/view/type/pci/vendor/8086/device/A298

Do some searching for the same device and errors. I still think if logrotate was working you wouldn't end up with 35GB of logs.
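
On the "which item is ID00e8" question: a rough sketch, assuming the number the kernel prints after "ID" is the usual 16-bit PCI requester ID (8 bits of bus, 5 bits of device, 3 bits of function). The function name is made up for illustration.

```python
def decode_requester_id(rid: int) -> str:
    """Split a 16-bit PCI requester ID into bus:device.function."""
    bus = (rid >> 8) & 0xFF      # upper 8 bits
    device = (rid >> 3) & 0x1F   # next 5 bits
    function = rid & 0x07        # lowest 3 bits
    return f"{bus:02x}:{device:02x}.{function:x}"


print(decode_requester_id(0x00E8))  # 00:1d.0
```

If that assumption holds, ID00e8 is just 00:1d.0 again, i.e. the same pcieport that shows up in every other line of the log, not some extra mystery device.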


Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #19
this is what each of those logs looks like now

https://pastebin.com/ecatBEEN

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

Code: [Select]
WORKAROUND: add pci=noaer to your kernel command line:

1) edit /etc/default/grub and add pci=noaer to the line starting with GRUB_CMDLINE_LINUX_DEFAULT. It will look like this:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
2) run "sudo update-grub"
3) reboot


i'll see in a few days if that fixes it or not i guess
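
For anyone who would rather script step 1 than edit by hand, here is a minimal sketch. It only works on text you pass in and never touches /etc/default/grub itself; the function name and the sample text are made up for illustration, and you would still need to regenerate the grub config and reboot afterwards.

```python
import re


def add_kernel_param(grub_text: str, param: str,
                     key: str = "GRUB_CMDLINE_LINUX_DEFAULT") -> str:
    """Return grub_text with param appended to the key="..." line, once."""
    pattern = re.compile(
        rf'^({re.escape(key)}=")([^"]*)(")[ \t]*$', re.MULTILINE
    )

    def repl(match: re.Match) -> str:
        params = match.group(2).split()
        if param not in params:  # do not add the parameter twice
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(repl, grub_text)


# Illustrative input, not copied from a real system
sample = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(sample, "pci=noaer"))
```

Running it a second time on its own output changes nothing, so it is safe to repeat.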
Cat Herders of Linux

 

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #20
Quote from Reply #17:
Check that your BIOS and CPU microcode are up to date.
Looks like this is the device throwing errors:
https://devicehunt.com/view/type/pci/vendor/8086/device/A298
Do some searching for the same device and errors. I still think if logrotate was working you wouldn't end up with 35GB of logs.

never heard of logrotate b4 this mess.  i just used the mate iso from the d/l section.  guess i'll be looking into that next?  maybe i should let it be because if logrotate had been working i'd have never known about this bug or its workaround?
root is now 12.8 gb  lets see if it can stay there for a while?
Cat Herders of Linux

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #21
and just some fun reading on aer as it relates to pci and such.



https://www.kernel.org/doc/html/latest/PCI/pcieaer-howto.html


8.3.2. Frequent Asked Questions
Q:
What happens if a PCIe device driver does not provide an error recovery handler (pci_driver->err_handler is equal to NULL)?

A:
The devices attached with the driver won't be recovered. If the error is fatal, kernel will print out warning messages. Please refer to section 3 for more information.
Cat Herders of Linux

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #22
Quote from Reply #20:
never heard of logrotate b4 this mess.  i just used the mate iso from the d/l section.  guess i'll be looking into that next?  maybe i should let it be because if logrotate had been working i'd have never known about this bug or its workaround?
root is now 12.8 gb  lets see if it can stay there for a while?

I'm fairly sure that logrotate should be working as standard on any artix iso but as I haven't installed my system from an iso I don't know for sure.
The artix logrotate installs a cron job into /etc/cron.daily. This runs logrotate based off the configuration in /etc/logrotate.conf which also loads further configurations in /etc/logrotate.d (where other packages put their logrotate configuration rules).

In a nutshell, what tends to happen is that the most recent log stays as plain text. Weekly, that log gets compressed and renamed to, for example, auth.log.1.gz, and a new empty auth.log is created. A week later the process happens again and auth.log.1.gz becomes auth.log.2.gz. Once an archive gets to auth.log.4.gz, it is simply deleted at the next rotation. That is why you should not normally be able to gather 35GB of logs if logrotate is installed and working. If you don't have a working cron, that would also prevent logrotate from working.
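
The rotation scheme described above can be sketched as a toy simulation. This is not logrotate's code, just the renaming logic with file names as strings; the function name and the keep=4 default are assumptions taken from the auth.log example.

```python
def rotate(existing: list[str], base: str = "auth.log", keep: int = 4) -> list[str]:
    """Return the file names present after one rotation."""
    after = [base]  # a fresh, empty live log is always created
    if base in existing:
        after.append(f"{base}.1.gz")  # the old live log gets compressed
    for n in range(1, keep):
        if f"{base}.{n}.gz" in existing:
            after.append(f"{base}.{n + 1}.gz")  # each archive shifts up by one
    # base.<keep>.gz is not carried over, which models its deletion
    return after


files = ["auth.log"]
for week in range(10):
    files = rotate(files)
print(files)
```

However many weeks you run it for, you never end up with more than the live log plus four archives. Note this only caps the number of files, not the size of each one, so a fast enough flood can still make a single log huge between rotations.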

You are right that you yourself would not have known about the device issue without your bloated /var/log directory (It does not hurt to check dmesg and logs once in a while looking for errors). But seriously in future if you have huge log files the first thought should be "What is creating all these messages?". Not just deleting them. You live and learn. Your root drive filling up could have ended up worse. Glad you got away with it.
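
To answer "What is creating all these messages?" quickly, something like this rough sketch would do. It assumes the traditional syslog line layout seen earlier in the thread (month, day, time, hostname, then the message); the function name is made up.

```python
from collections import Counter


def top_messages(lines, n=3):
    """Count identical messages, ignoring the timestamp and hostname."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 4)  # month, day, time, host, rest of line
        if len(parts) == 5:
            counts[parts[4].strip()] += 1
    return counts.most_common(n)


# Hypothetical usage; substitute whichever file under /var/log has grown large:
# with open("/var/log/some.log", errors="replace") as f:
#     for message, count in top_messages(f):
#         print(count, message)
```

On a log like the one in Reply #15, the pcieport AER lines would sit at the top of the list by a wide margin.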

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #23
I have logrotate running as a daily cron job, plus rsyslog has rate limiting enabled by default to prevent log flooding, which as well as being annoying is a security hole as it can be used in malware attacks. syslog-ng apparently has a throttle option but I've no idea if it's enabled by default, not in this case by the sound of it.
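
For anyone wondering what rate limiting means here, this is a generic interval/burst limiter as a minimal sketch. It is not rsyslog's actual implementation, and the class name and the numbers are made up; it only shows the idea of letting a few messages through per time window and dropping the rest.

```python
class BurstLimiter:
    """Allow at most `burst` messages per `interval` seconds."""

    def __init__(self, interval: float, burst: int):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.count = 0

    def allow(self, now: float) -> bool:
        # Start a new window if none is open or the current one has expired
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.burst


limiter = BurstLimiter(interval=5.0, burst=3)
print([limiter.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 0.4, 5.0)])
```

With a 5 second window and a burst of 3, the fourth and fifth messages in the same window are dropped and logging resumes when the next window starts.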

Incidentally, I get the same sort of errors on my laptop which has an nvme drive, but looking at the time stamps you can see rsyslog stops the log growing too much:
Code: [Select]
Sep  1 20:27:02 xyz kernel: [ 2427.941856] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Sep  1 20:27:02 xyz kernel: [ 2427.941874] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  1 20:27:02 xyz kernel: [ 2427.941881] nvme 0000:02:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
Sep  1 20:27:02 xyz kernel: [ 2427.941889] nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Sep  1 20:27:11 xyz kernel: [ 2437.156292] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Sep  1 20:27:11 xyz kernel: [ 2437.156303] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  1 20:27:11 xyz kernel: [ 2437.156307] nvme 0000:02:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
Sep  1 20:27:11 xyz kernel: [ 2437.156311] nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Sep  1 20:27:52 xyz kernel: [ 2478.116243] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Sep  1 20:27:52 xyz kernel: [ 2478.116264] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  1 20:27:52 xyz kernel: [ 2478.116271] nvme 0000:02:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
Sep  1 20:27:52 xyz kernel: [ 2478.116280] nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Sep  1 20:28:19 xyz kernel: [ 2504.745594] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Sep  1 20:28:19 xyz kernel: [ 2504.745615] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  1 20:28:19 xyz kernel: [ 2504.745623] nvme 0000:02:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
Sep  1 20:28:19 xyz kernel: [ 2504.745631] nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Sep  1 20:28:59 xyz kernel: [ 2545.191077] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Sep  1 20:28:59 xyz kernel: [ 2545.191100] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  1 20:28:59 xyz kernel: [ 2545.191108] nvme 0000:02:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
Sep  1 20:28:59 xyz kernel: [ 2545.191116] nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Sep  1 20:29:29 xyz kernel: [ 2575.276836] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Sep  1 20:29:29 xyz kernel: [ 2575.276864] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  1 20:29:29 xyz kernel: [ 2575.276875] nvme 0000:02:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
Sep  1 20:29:29 xyz kernel: [ 2575.276889] nvme 0000:02:00.0:    [ 0] RxErr                  (First)
Sep  1 20:29:40 xyz kernel: [ 2585.635256] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
Sep  1 20:29:40 xyz kernel: [ 2585.635283] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep  1 20:29:40 xyz kernel: [ 2585.635298] nvme 0000:02:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
Sep  1 20:29:40 xyz kernel: [ 2585.635310] nvme 0000:02:00.0:    [ 0] RxErr                  (First)

I found some advice online to ignore them when I searched a while back, so I did.  :D

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #24
so just to be clear, i got null errors on my nvme drive because it's pcie4 in a pcie3 slot?  i created this problem by using the wrong hardware and the null errors are the result of that hardware mismatch.  and while i was led to a workaround in grub, that isn't the ideal option here.  the ideal option would be to use a pcie3 nvme drive in a pcie3 slot.  is this the correct understanding?
Cat Herders of Linux

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #25
That helps a lot, knowing it is due to mixing PCIE3 and PCIE4. According to this, you should be able to use PCIE3 and PCIE4 together, so this is probably more of a "Linux" issue. I am sure there have been discussions in the past suggesting that NVME support in Linux was sometimes imperfect, although I guess it keeps improving:
https://www.quora.com/What-happens-if-you-use-a-PCIe-4-0-NVMe-SSD-in-a-PCIe-3-0-M-2-motherboard-slot
noaer would disable my internal wifi card, so I couldn't use that. Another suggestion:
"But the above solution of adding "pci=noaer" to boot I do not think really "solves" anything other than hiding the error.  The error is still happening, just not reporting it."
"... try pcie_aspm=off  .  This seems to disable power management mode which is throwing the error."
https://forums.unraid.net/topic/118286-nvme-drives-throwing-errors-filling-logs-instantly-how-to-resolve/
This option may affect sleep, but possibly only on the PCI bus. From a quick test it is working: there are no errors, and the wifi works too. I have no idea about the long term, or about sleep, which I don't normally use anyway as I just shut down fully.

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #26
TEAMGROUP MP33 2TB SLC Cache 3D NAND TLC NVMe 1.3 PCIe Gen3x4 M.2 2280 Internal Solid State Drive SSD (Read/Write Speed up to 1,800/1,500 MB/s) Compatible with Laptop & PC Desktop TM8FP6002T0C101 https://a.co/d/2HL3cnO


Coming in a few hours
Cat Herders of Linux

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #27
My drive which gives the same errors is a Gen 3.0 x4 like that Teamgroup one too, so I wonder if that will help?
https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/internal-drives/pc-sn720-ssd/data-sheet-pc-sn720-compute.pdf

I put:
Code: [Select]
GRUB_CMDLINE_LINUX="pcie_aspm=off"
in /etc/default/grub, then ran update-grub. Barring any future issues that occur to persuade me otherwise, that will probably do until either the kernel fixes the issue or I upgrade to a newer laptop sometime in the future. :D
(What these options do is turn kernel driver features on or off, so they are not necessarily bad to use. It is not really any different from choosing config options when building a kernel.)
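
After rebooting, it is worth confirming the option actually reached the kernel. A minimal sketch, assuming a Linux system where /proc/cmdline is readable; the function names and the sample string in the comment are made up.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, param: str) -> bool:
    """True if param appears as a whole word on the kernel command line."""
    return param in cmdline.split()


def running_cmdline() -> str:
    """Read the command line the running kernel was booted with."""
    return Path("/proc/cmdline").read_text()


# Example with an illustrative string rather than a real boot line:
# has_kernel_param("root=/dev/nvme0n1p2 rw pcie_aspm=off", "pcie_aspm=off")
```

If has_kernel_param(running_cmdline(), "pcie_aspm=off") comes back False, the grub config probably was not regenerated or the edit went on a line grub does not use.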

Re: ran out of space on my 50gb ssd root partition and then i did a baaaaaaad thing

Reply #28
i had added that to the other gen4 wd black.  i havent added it yet to this one.  i thought i had d/l and installed mate dinit iso onto the 2tb nvme but it turns out i installed openrc so i guess that's what i'm using this go around?  anyway all seems fine so far. and since i know about this error i set root to only 30 gb.  if it fills up it will do so much more quickly!
Cat Herders of Linux