Recently, I investigated an incident related to a specific Linux kernel behavior, and I thought it would make a great topic to explore. So, today, let’s dive into soft and hard lockups. This post also marks the introduction of a new section, Systems, where we will cover more low-level concepts.
According to kernel.org:
A soft lockup is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds without giving other tasks a chance to run.
A hard lockup is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds without letting other interrupts have a chance to run.
To make sure these definitions are clear, let’s break down a few related concepts.
Tasks: Modern operating systems (OS) are multitasking, meaning they are designed to run multiple tasks (e.g., processes) seemingly simultaneously. The OS achieves this by giving each task a small time slice to execute and switching between tasks using context switching.
Kernel mode: Mainly to provide memory/hardware protection, most OSs divide memory into user space and kernel space:
User space: Where applications run, isolated from critical system resources.
Kernel space: Where the OS and device drivers reside, with full access to hardware and memory.
When the CPU is in user mode, code can only access its own user space memory. Yet, when the CPU is in kernel mode, code can access both kernel and user space.
Interrupts: An interrupt is a signal emitted either by hardware or software to the CPU to indicate an event needs attention:
Hardware interrupts: Triggered by physical devices, for example when the mouse moves or when a network interface card receives a packet.
Software interrupts: Triggered by system calls, such as creating a new process, or by exceptions such as a division by zero.
Interrupts can either be:
Maskable interrupts (MI): A type of interrupt that can be enabled or disabled (masked) by the CPU. A CPU can temporarily ignore maskable interrupts to ensure a critical task is not interrupted.
Non-maskable interrupts (NMI): A type of interrupt that cannot be disabled or ignored by the CPU. NMIs are used for events that require immediate attention.
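To make NMIs a bit more concrete, here is a small sketch that counts how many of them the CPUs have handled so far. It assumes a system where /proc/interrupts contains an "NMI:" row with one counter per CPU (this is the case on x86, other architectures may differ), and count_nmis is a helper name made up for this post.

```shell
#!/bin/sh
# Sum the per-CPU counters on the "NMI:" row of /proc/interrupts-style
# input read from stdin. count_nmis is a hypothetical helper name.
count_nmis() {
    awk '$1 == "NMI:" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) sum += $i
         }
         END { print sum + 0 }'
}

# Only query the real file when it exists and is readable.
if [ -r /proc/interrupts ]; then
    echo "NMIs handled so far: $(count_nmis < /proc/interrupts)"
fi
```

If the row is missing, the helper simply prints 0.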
Back to the concepts of lockups:
A soft lockup means that the kernel loops in kernel mode without letting any other tasks run (e.g., user processes), yet the system can still process interrupts. A system experiencing a soft lockup becomes slow (e.g., the mouse moves slowly).
A hard lockup, on the other hand, is more severe. The kernel is stuck to the point that it doesn’t even respond to interrupts, leaving the system completely unresponsive (e.g., the mouse doesn’t move at all).
What are some causes of lockups?
Kernel bugs: Defects in kernel code or device drivers.
Hardware failures: Issues such as a malfunctioning CPU or faulty RAM.
Virtualized environments: In virtual machines, CPU starvation due to hypervisor scheduling can lead to lockups. For example, if the hypervisor fails to schedule the guest VM’s virtual CPU, the guest OS may experience a soft lockup or even a hard lockup. Tracking these lockups is crucial to monitor the health of VMs.
The Linux kernel includes a watchdog [1] to detect soft and hard lockups. When one is detected, the watchdog sends an NMI [2] to all CPUs. We can configure the following parameters to tell the kernel to panic if lockups are detected:
$ cat /proc/sys/kernel/softlockup_panic
1 # Soft lockup panics are enabled
$ cat /proc/sys/kernel/hardlockup_panic
1 # Hard lockup panics are enabled
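To check both settings in one go, a small helper along these lines may be handy. It is only a sketch: panic_state and show_lockup_panic are names made up for this post, and the script assumes nothing about which files exist, since one of them may be missing depending on how the kernel was built.

```shell
#!/bin/sh
# Translate the raw sysctl value into a readable state.
# panic_state is a hypothetical helper name, not a kernel tool.
panic_state() {
    case "$1" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

# Print the state of both lockup panic settings, skipping files that
# are missing or unreadable on this machine.
show_lockup_panic() {
    for name in softlockup_panic hardlockup_panic; do
        file="/proc/sys/kernel/$name"
        if [ -r "$file" ]; then
            echo "$name: $(panic_state "$(cat "$file")")"
        else
            echo "$name: not available"
        fi
    done
}

show_lockup_panic
```

To change a value at runtime, something like `sysctl -w kernel.softlockup_panic=1` run as root should work on most distributions.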
If these parameters are disabled, the kernel watchdog will just log a message to /var/log/messages, for example:
BUG: soft lockup - CPU#13 stuck for 21s!
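When tracking lockups across many machines, it can help to pull the CPU number and the stuck duration out of such messages. Here is a minimal sketch, assuming the log lines follow the format shown above; extract_soft_lockups is a name made up for this post, and the exact message prefix may vary between kernel versions.

```shell
#!/bin/sh
# Read log lines on stdin and print "cpu=<n> stuck_seconds=<s>" for
# every soft lockup report. extract_soft_lockups is a hypothetical name.
extract_soft_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s!.*/cpu=\1 stuck_seconds=\2/p'
}

# Example with the message shown above.
echo 'BUG: soft lockup - CPU#13 stuck for 21s!' | extract_soft_lockups
```

On a real machine, the input would come from the kernel log instead, for instance `dmesg | extract_soft_lockups` (reading the kernel log may require root).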
We can also adjust the threshold:
$ cat /proc/sys/kernel/watchdog_thresh
10 # 10 seconds
A soft lockup is detected if a CPU has been running on the same task for at least 2 * watchdog_thresh (default: 20 seconds).
A hard lockup is detected if a CPU hasn’t responded to any interrupt for at least 1 * watchdog_thresh (default: 10 seconds).
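The arithmetic above can be sketched as a short script that derives both thresholds from the current watchdog_thresh value. The function names are made up for this post, and the fallback to 10 when the file is unreadable is an assumption based on the default mentioned above.

```shell
#!/bin/sh
# Derive both detection thresholds (in seconds) from watchdog_thresh.
# Both function names are hypothetical helpers for this post.
soft_lockup_threshold() { echo $(( 2 * $1 )); }
hard_lockup_threshold() { echo $(( 1 * $1 )); }

thresh_file=/proc/sys/kernel/watchdog_thresh
if [ -r "$thresh_file" ]; then
    thresh=$(cat "$thresh_file")
else
    thresh=10   # assumption: fall back to the default value
fi

echo "watchdog_thresh: ${thresh}s"
echo "soft lockup after: $(soft_lockup_threshold "$thresh")s"
echo "hard lockup after: $(hard_lockup_threshold "$thresh")s"
```

For example, raising watchdog_thresh to 30 would move the soft lockup threshold to 60 seconds and the hard lockup threshold to 30 seconds.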
Soft and hard lockups can signal critical issues in systems, from kernel bugs to hardware problems. Tracking and addressing these lockups is essential to ensure the reliability of Linux-based systems.
Explore Further
Have you ever encountered a soft or hard lockup in your system? Share your experiences in the comments.
[1] A system mechanism that monitors something.
[2] As we said, a non-maskable interrupt.