Ryzen 1700 Idle Crash
When setting up my homelab server with an old desktop, I ran into some mysterious crashes that turned out to be a "known" issue with first-gen Ryzen processors on Linux.
It started right before Christmas. I have an old desktop that I had previously set up as a Proxmox server to host some VMs, including a NAS. The first time around I mostly set it up and then didn’t touch it again, and when we moved, it ended up in the garage, unplugged, disconnected, and gathering dust. On the first day of the holiday break, I decided to reconnect it.
The First Problem
The first thing I ran into was that virtualization was disabled. The server booted fine, but then the VMs refused to launch. I was confused, because they’d definitely worked before, so I disconnected everything, brought the server into the house, plugged in a spare GPU (since the Ryzen 1700 doesn’t have onboard graphics), connected it to my TV as a monitor, and booted into the BIOS. Yep, there it was: the setting was disabled.
Changed the setting, disconnected everything, removed the GPU, put it back in the garage, and started it back up. Then the VMs still failed to start.
So again: I disconnected everything, brought it into the house, put the GPU back in, etc. (this will become a theme), and found that the setting was still disabled. This is when I also noticed that the BIOS date showed 2017. It was a dead CMOS battery.
Thankfully, I had a spare CR2032 battery (our car key recently needed a replacement and it’s the same type), so I replaced the battery, reconfigured the settings, and confirmed they persisted after a power loss. With the settings correctly persisting and the VMs able to boot, I disconnected the server, removed the GPU, and put it back in the garage. Surely now everything would be fine.
The Second Problem
The next day we flew off to go see family for Christmas. During some idle time, I decided to test out Unifi’s Teleport VPN feature to see if I could remotely connect to the server.
The VPN seemed to work fine: I could ssh into the gateway and talk to other hosts on the network, but the VM server wasn’t responding. When I took a look at the Unifi app, it showed the server’s port as “disconnected”. Weird. Maybe the server was asleep for some reason? I tried ssh-ing into the gateway and sending some wake-on-lan packets, but that didn’t seem to help either.
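For reference, a wake-on-lan "magic packet" is nothing fancy: six 0xFF bytes followed by the target's MAC address repeated sixteen times, usually sent as a UDP broadcast. Here is a minimal Python sketch of that, assuming a placeholder MAC address and UDP port 9 (a common convention, not a requirement). It only helps if the machine is actually asleep with wake-on-lan enabled.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte wake-on-lan payload for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    # Six 0xFF bytes, then the MAC address repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is an assumption, see above)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling send_magic_packet("aa:bb:cc:dd:ee:ff") with the server's real MAC, from a host on the same LAN, would send one packet.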
I wasn’t able to really diagnose the issue until we came home later that week.
Lights On, Nobody’s Home
I found that the server was powered, fans spinning, lights on, but no life: still couldn’t connect, wouldn’t respond to pings, nothing. After a hard reboot, the system came up clean, and everything was working fine.
I searched the logs for any clues about what might have happened. Between when the logs stopped and when my router last saw the device, I knew it had crashed about two days after I left, but nothing else stood out. I figured it was a one-off crash and moved on.
The next morning, the server was dead again. Same thing: no response, but fans spinning and lights on. To confirm whether or not the system was actually dead, I plugged in a keyboard, blind-typed to log in as root (no GPU, remember?), and ran touch /jakemco-was-here in case it was just that the network was down but the machine was fine.
After a reboot and a quick ls / via ssh, it was clear that the machine had not been alive, and some quick googling suggested this was likely a kernel panic.
Debugging the Kernel
Scouring the logs, still nothing stood out (which is consistent with a kernel panic, as when the kernel is in a bad state, it’s not safe to interact with the filesystem to write logs to disk).
I needed a way to see what was happening and debug the kernel panic to figure out what was going on. So first things first: I disconnected, brought it inside, plugged in the GPU, etc. so that I could at least see what was going on if it crashed again.
Then I did some research: how do you debug a kernel panic? The two main answers I found:
- Use a Serial Port (which I don’t have handy)
- Set up netconsole
I found a great guide from apalrd on how to set up netconsole, and got it connected to my Windows desktop.
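The receiving side of netconsole is simple: the kernel sends its log lines as plain UDP datagrams, so anything that can listen on a UDP port can capture them. As a rough illustration (this is not the receiver from the guide), here is a Python listener sketch. It assumes the target port is 6666, which is netconsole's default, and the log file name is made up.

```python
import socket


def open_listener(host: str = "0.0.0.0", port: int = 6666) -> socket.socket:
    """Bind a UDP socket for incoming netconsole datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_line(sock: socket.socket, bufsize: int = 65535) -> str:
    """Block until one datagram arrives and return it tagged with the sender's IP."""
    data, (sender_ip, _sender_port) = sock.recvfrom(bufsize)
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"[{sender_ip}] {text}"


def listen_forever(port: int = 6666, logfile: str = "netconsole.log") -> None:
    """Print every message and append it to a local file (hypothetical name)."""
    with open_listener(port=port) as sock, open(logfile, "a", encoding="utf-8") as out:
        while True:
            line = receive_line(sock)
            print(line)
            out.write(line + "\n")
            out.flush()  # flush each line so nothing is lost if the listener dies
```

The point of writing on a second machine is that the logs survive even when the crashing kernel can no longer touch its own disk.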
Ironically, since setting that up, it hasn’t crashed again. But that’s because of what I learned next…
A Revelation!
While waiting several hours for a memtest to complete (spoiler: the RAM is fine), I did a bit more googling, and finally came across this reddit post. While the post is deleted, the comments had this useful information:
There are two problems with Ryzen under Linux: … 2. C6 State crashes under very light loads or sometimes coming out of sleep. See this link from the kernel bugzilla.
This was consistent with my crashes! Both of them happened when the machine was idle for a few days or even a few hours, and manifested as a kernel panic. I was fairly certain this was the issue.
The kernel thread was very very long, but led to a few possible fixes:
- Adding rcu_nocbs=0-15 to the kernel command line
- Disabling C6 power management in the BIOS
- Adding processor.max_cstate=1 or processor.max_cstate=5 to the kernel command line
(1) turned out to be a red herring. It masks the issue in some cases (something about the CPU being busier so it’s less likely to hit the idle states), but doesn’t fix it more generally.
(2) was more promising, but some folks reported that this setting only tells the BIOS to suggest this behavior to the OS, and the kernel doesn’t always respect it.
(3) similarly seemed promising, but I saw conflicting details about whether to set it to 1 or to 5. Given the problem was with C6, setting it to 5 made sense to me. However, some sources claimed the problem also occurred in states 2-5, others pointed out that setting it to 1 would lead to more power consumption, and still others claimed this wouldn’t fix it regardless.
There was also mention of a BIOS “power supply option”, but I couldn’t find it in my BIOS (even after updating to the latest version), so I ignored it.
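When experimenting with kernel command-line options like the ones above, it is easy to lose track of which ones actually made it into the running kernel. Here is a small Python sketch that parses /proc/cmdline and reports the parameters of interest. The function names and the WATCHED list are mine, and the parser is deliberately naive: it splits on whitespace and does not handle quoted values.

```python
from pathlib import Path
from typing import Dict, Optional

# Parameters discussed in this post; adjust to taste.
WATCHED = ("rcu_nocbs", "processor.max_cstate")


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Map each kernel parameter to its value (None for bare flags like 'quiet')."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def report(cmdline: str) -> Dict[str, str]:
    """Return the watched parameters with their value, or '(not set)'."""
    params = parse_cmdline(cmdline)
    return {name: (params[name] or "") if name in params else "(not set)" for name in WATCHED}


def report_running_kernel(path: str = "/proc/cmdline") -> Dict[str, str]:
    """Same as report(), but for the currently booted kernel."""
    return report(Path(path).read_text())
```

Running report_running_kernel() after a reboot confirms whether a bootloader change really took effect.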
All of this led me back to the original reddit post, where I found…
The Fix
It wasn’t until I started using the zenstates-linux script both on boot and coming out of sleep that my system became rock-solid
So as it turns out, this zenstates.py script is exactly what I needed to stabilize my system. It has the ability to report whether C6 is enabled or disabled, and then to toggle it on or off.
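Under the hood, reporting and toggling C6 comes down to reading and flipping a few bits in two model-specific registers (MSRs). The Python sketch below shows only that bit logic. The MSR addresses and bit positions are assumptions taken from my reading of zenstates.py, so double-check them against the script before relying on them. Actually reading or writing the registers (via /dev/cpu/N/msr, with the msr kernel module loaded and root privileges) is left out on purpose.

```python
from typing import Tuple

# Assumed values, based on my reading of zenstates.py; verify before use.
MSR_C6_PACKAGE = 0xC0010292
MSR_C6_CORE = 0xC0010296
C6_PACKAGE_BIT = 1 << 32
C6_CORE_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def c6_package_enabled(package_value: int) -> bool:
    """Package C6 counts as enabled when its single control bit is set."""
    return bool(package_value & C6_PACKAGE_BIT)


def c6_core_enabled(core_value: int) -> bool:
    """Core C6 counts as enabled only when all three control bits are set."""
    return (core_value & C6_CORE_BITS) == C6_CORE_BITS


def with_c6_disabled(package_value: int, core_value: int) -> Tuple[int, int]:
    """Return new register values with the C6 bits cleared, other bits untouched."""
    return package_value & ~C6_PACKAGE_BIT, core_value & ~C6_CORE_BITS
```

Because these are hardware registers rather than a saved setting, the change does not survive a reboot or a sleep cycle, which is why the script has to run both on boot and on resume.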
Since using the following fix to consistently disable C6 states, my system hasn’t crashed again:
Side Note: I also found this Gentoo wiki with references to this
and another bug. Given Ryzen errata 1109, I added idle=nomwait to my
kernel parameters as well.
So there you have it. Three days and countless hours of disconnecting and reconnecting later, I’ve got a stable system (so far).
Right now it’s still sitting inside the house, with GPU connected, and netconsole enabled to see if it crashes again and if I can get some logs. Assuming it doesn’t, I’ll probably move it back to the garage in the next few days. Though I might leave the GPU installed this time…