
A tale about fixing eBPF spinlock issues in the Linux kernel

We’ve been working on the Linux version of Superluminal (a CPU profiler) for a while now, and we’ve been in a private alpha with a small group of testers. That was going great, until one of our testers, Aras, ran into periodic full system freezes while capturing with Superluminal.

We always pride ourselves on Superluminal “Just Working”, and this was decidedly not that, so we of course went hunting for what turned out to be one of the toughest bugs we’ve faced in our careers.

The hunt led us deep into the internals of the Linux kernel (again), where we learned more about spinlocks in the kernel than we ever expected (or wanted) to know, and we ended up helping to find & fix a number of issues along the way.

Initial analysis

The problem he was running into was that on his Fedora 42 machine (kernel 6.17.4-200), the system would periodically freeze for short periods while a Superluminal capture was running:

It’s really difficult to remotely debug issues like this, so we first attempted to reproduce the issue in a VM. However, we were unable to after several attempts with various Fedora versions/kernels. We finally tried installing Fedora on a physical machine, and we were able to reproduce it there.

Now that we have a repro, we can start looking into the issue in earnest.

Since the machine is periodically freezing while capturing with Superluminal, we can start by looking at what such a capture contains. Opening one, we get this:

This is showing the timeline for each thread in the process. A green color means the thread is actively executing on a CPU (i.e. the thread is scheduled in); any other color means the thread is scheduled out and waiting for something. We can immediately spot some suspicious looking areas in the capture (marked in blue) where it appears as if each thread in the process is busy for roughly the same amount of time, across all threads, which doesn’t match the workload being profiled.

Each of these areas is 250+ milliseconds where the CPU appears to be fully busy. Zooming in on one of these sections and expanding a thread, we see this:

This indicates that despite this thread being reported as ‘busy’, there are no samples being collected during this period at all. This pattern holds across all threads.

Looking at dmesg output while reproducing the issue, we also get messages like the following, which closely matches the 250+ ms we’re seeing in Superluminal itself:

[  +0.014286] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.723 msecs
[  +0.232451] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.424 msecs
[  +0.000001] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.424 msecs
[  +0.250938] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 250.936 msecs

So all of this doesn’t tell us why the CPU is busy (we need callstacks/samples for that), but it does indicate that there is something happening in the kernel that takes 250+ milliseconds.

We “just” need to figure out what.

Debugging the kernel

Normally, when you get freezes like this in an application, it’s relatively simple to figure out what’s happening: simply attach a debugger, break when the freeze happens, and inspect the threads/stacks. In this case, we’re dealing with a kernel freeze so it’s not as simple as that.

But, for a previous change, we had already learned how to compile the kernel, and with a custom kernel it’s pretty simple to enable kernel debugging. You do need a separate machine to debug the kernel from, which is not too big of a problem.

The main problem was that our machines didn’t have one of these:

If you’re of a certain vintage, you’ll immediately recognize this. You may even have fond memories of soldering TX and RX lines together to be able to play Doom in multiplayer…

If you’re not that old (ahem): this is a COM or serial port.

Machines used to have one of these on the motherboard by default, but modern machines typically don’t have one anymore. The Linux kernel debugger uses it for communication between the debugger & debuggee, which means both machines need to have one of these; we had to get serial port PCIe cards as pictured above.

With this in place, we’re able to attach to the problematic machine using gdb and perform basic debugger commands such as breaking into the kernel and inspecting the kernel state. From here, finding the bug is simple: reproduce the freeze, then break into the kernel from the remote machine and see what’s going on.

… or it should have been as simple as that. Unfortunately, it seems that when the kernel is in this freezing state, even the debugger doesn’t respond anymore. Our attempts to break while freezing all resulted in gdb crashing and/or timing out and we were not able to get this to work.

So debugging is out of the window. Great. Now what?

Finding a minimal repro

When you have an issue like this, where it is completely unknown what code is causing the issue, it often helps to try to narrow it down to as minimal a reproduction case as you possibly can. That allows you to reason about a (hopefully) relatively small amount of code, instead of “the entire system” as is the case right now.

Our capturing backend on Linux is a significant chunk of complex code: ~2000 lines of eBPF code, and ~6000 lines of userspace controlling code. That is quite a big haystack to look in for this particular needle.

Given the nature of the issue (a kernel freeze), and the fact that the eBPF code is the only code running in the kernel, we’ll make the assumption that the problem is somewhere in our eBPF code. This reduces the code we need to investigate to ~2000 lines, which is better, but still a lot.

As described in previous articles, the eBPF code attaches to various tracepoints in the kernel to collect performance data. Aside from a few low-frequency bookkeeping events, there are three main types of performance events we capture in eBPF:

To narrow the issue down further, we created some debug options that allowed us to disable these event types individually. We then attempted to repro the issue using various combinations of enabled/disabled events. This resulted in the following observations:

This strongly points towards the issue being some kind of interaction between the eBPF code that runs when a sample event happens, and the eBPF code that runs when a context switch happens. In addition, reducing the sampling frequency from its default of 8 kHz to 1 Hz makes the freezes less frequent, but they still occur.

With this information, we disabled everything except sampling & context switch events in code, and stripped down the eBPF code for both further and further until the freezes stopped occurring. This finally led us to the following, very minimal, eBPF repro:

// Minimal placeholder event structs; the real layouts don't matter for the repro
struct CSwitchEvent { __u64 dummy; };
struct SampleEvent { __u64 dummy; };

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 512 * 1024 * 1024);
} ringBuffer SEC(".maps");

SEC("tp_btf/sched_switch")
int cswitch(struct bpf_raw_tracepoint_args* inContext)
{
    struct CSwitchEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct CSwitchEvent), 0);
    if (event == NULL)
        return 1;

    bpf_ringbuf_discard(event, 0);
    return 0;
}

SEC("perf_event")
int sample(struct bpf_perf_event_data* inContext)
{
    struct SampleEvent* event = bpf_ringbuf_reserve(&ringBuffer, sizeof(struct SampleEvent), 0);
    if (event == NULL)
        return 1;

    bpf_ringbuf_discard(event, 0);
    return 0;
}

This describes two eBPF programs: one that runs when a context switch happens, and one that runs on the sampling interrupt. The programs do nothing but reserve some space in a ring buffer (bpf_ringbuf_reserve) and then immediately discard the reserved space (bpf_ringbuf_discard).

Attaching these eBPF programs to the kernel independently of Superluminal (i.e. without any of our other code running) with a sufficiently high sampling frequency results in the periodic freezes.

Digging deeper

After all this stripping, these programs are barely doing anything, so the function most likely to be causing the issue is bpf_ringbuf_reserve, since that’s the bulk of the remaining work. So, let’s see if we can figure out what’s going on by using the ancient art of Just Reading The Code.

Here it is in full for the kernel we were running on (original source):

static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
{
	unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;
	struct bpf_ringbuf_hdr *hdr;
	u32 len, pg_off, tmp_size, hdr_len;

	if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
		return NULL;

	len = round_up(size + BPF_RINGBUF_HDR_SZ, 8);
	if (len > ringbuf_total_data_sz(rb))
		return NULL;

	cons_pos = smp_load_acquire(&rb->consumer_pos);

	if (raw_res_spin_lock_irqsave(&rb->spinlock, flags))
		return NULL;

	pend_pos = rb->pending_pos;
	prod_pos = rb->producer_pos;
	new_prod_pos = prod_pos + len;

	while (pend_pos < prod_pos) {
		hdr = (void *)rb->data + (pend_pos & rb->mask);
		hdr_len = READ_ONCE(hdr->len);
		if (hdr_len & BPF_RINGBUF_BUSY_BIT)
			break;
		tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;
		tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
		pend_pos += tmp_size;
	}
	rb->pending_pos = pend_pos;

	/* check for out of ringbuf space:
	 * - by ensuring producer position doesn't advance more than
	 *   (ringbuf_size - 1) ahead
	 * - by ensuring oldest not yet committed record until newest
	 *   record does not span more than (ringbuf_size - 1)
	 */
	if (new_prod_pos - cons_pos > rb->mask ||
	    new_prod_pos - pend_pos > rb->mask) {
		raw_res_spin_unlock_irqrestore(&rb->spinlock, flags);
		return NULL;
	}

	hdr = (void *)rb->data + (prod_pos & rb->mask);
	pg_off = bpf_ringbuf_rec_pg_off(rb, hdr);
	hdr->len = size | BPF_RINGBUF_BUSY_BIT;
	hdr->pg_off = pg_off;

	/* pairs with consumer's smp_load_acquire() */
	smp_store_release(&rb->producer_pos, new_prod_pos);

	raw_res_spin_unlock_irqrestore(&rb->spinlock, flags);

	return (void *)hdr + BPF_RINGBUF_HDR_SZ;
} 

Most of this code is fairly straightforward. It implements the producing side of a relatively standard multiple-producer single-consumer (MPSC) ring buffer.

There is one noteworthy thing about this code: it is guarded by a spinlock (raw_res_spin_lock_irqsave near the top, and raw_res_spin_unlock_irqrestore on each exit path). In addition, the code has been written such that it can deal with raw_res_spin_lock_irqsave failing. When that happens, bpf_ringbuf_reserve fails and returns NULL.

Given that we’re tracking down a freeze in this specific function, and that it’s taking locks, this is highly suspicious. Let’s look at the spinlock implementation next. Both the lock & unlock functions go through a few macros; folding them all together, it essentially boils down to the following:

static int raw_res_spin_lock_irqsave(lock, flags)
{
    local_irq_save(flags);
    preempt_disable();

    return res_spin_lock(lock);
}

static void raw_res_spin_unlock_irqrestore(lock, flags)
{
    res_spin_unlock(lock);

    preempt_enable();
    local_irq_restore(flags);
}

The lock function disables interrupts on the same CPU (local_irq_save), disables preemption (preempt_disable), and attempts to acquire the lock. The unlock function does the inverse. The goal here is to ensure that the code that’s running while the spinlock is held can’t be interrupted by other code.

However, from our previous tests, we saw that we can only reproduce the freezes if both sampling and context switches are enabled. This is interesting, because sampling interrupts are a special kind of interrupt called a “non-maskable interrupt”, or NMI. Maskable in this context simply means whether a hardware interrupt can be disabled by software (e.g. via local_irq_save) or not. An NMI can’t be disabled at all and can thus occur at any point.

So now a hypothesis begins to form: what happens when…

  1. A context switch happens and attempts to acquire the ring buffer spinlock via raw_res_spin_lock_irqsave
  2. At the same time, a non-maskable sampling interrupt happens on the same CPU
  3. The sampling interrupt then also tries to acquire the same ring buffer spinlock

The code for the actual spinlock implementation is too large to share here. But, just skimming through it, there are various code paths in it where something equivalent to the following happens:

if (val & _Q_LOCKED_MASK) {
    RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
    res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
}

The function res_smp_cond_load_acquire being called here implements a spin-wait for the condition (the second argument) to become true. The above snippet essentially boils down to:

if (val & _Q_LOCKED_MASK) {
    RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
    while (true)
    {
        if (!lock->locked || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK))
            break;
    }
}

So, it’s waiting for two conditions:

  1. For the locked flag of the lock to become false, indicating that the lock is no longer held
  2. Or for RES_CHECK_TIMEOUT to return true, indicating that the timeout for this wait has expired

In other words, this is simply a wait for the locked flag to become false, but with a timeout to prevent spinning forever. Looking at the definition for RES_DEF_TIMEOUT, which is the timeout this code is waiting for on various paths, we get:

/*
 * Default timeout for waiting loops is 0.25 seconds
 */
#define RES_DEF_TIMEOUT (NSEC_PER_SEC / 4)

Remember how at the very beginning we saw in the capture that the freezes were 250+ milliseconds, also known as 0.25 seconds? It’s probably not a coincidence that the default timeout for these spin-waits in the spinlock code is also 250 milliseconds. I believe this is what the professionals call ‘a smoking gun’.

At this point, we reported the issue to the eBPF kernel mailing list. This led to a productive back-and-forth with kernel maintainers Kumar Kartikeya Dwivedi and Alexei Starovoitov that exposed several issues.

You could read the mailing list discussion now, but that would spoil the fun. So instead, join me on a deep dive into the Linux kernel & spinlocks.

A relatively short long primer on spinlocks

(If you’re already familiar with spinlocks and common problems with them, you can skip to the next section)

You’ve probably come across spinlocks as an alternative to full mutexes/locks before. The most basic variant of it looks something like this:

typedef struct {
    volatile int locked;
} spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
    while (__sync_val_compare_and_swap(&lock->locked, 0, 1))
    {
        // Spin until we managed to take the lock
    }
}

static inline void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(&lock->locked);
}

And you’d use it like this:

spinlock_t lock;

spin_lock(&lock);

// Code guarded by the lock goes here; only one CPU can be executing this at a time

spin_unlock(&lock);

The spin_lock function performs an atomic compare-and-swap (CAS) via __sync_val_compare_and_swap to acquire the lock: it atomically changes the value of locked to 1. If that succeeds, the lock is now held by this thread.

It fails when the value of locked isn’t currently 0: in that case, another thread already had the lock, or has taken the lock while this thread was trying to acquire it. When that happens, it will keep retrying in a loop until it succeeds.
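Here’s the basic lock exercised in a self-contained userspace test (assuming GCC/Clang for the __sync builtins, plus pthreads): four threads each increment a shared counter 100,000 times under the lock. With the lock in place every increment survives; without it, updates would be lost to races.

```c
#include <pthread.h>

typedef struct {
    volatile int locked;
} spinlock_t;

static void spin_lock(spinlock_t *lock)
{
    while (__sync_val_compare_and_swap(&lock->locked, 0, 1))
        ; /* spin until we manage to take the lock */
}

static void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(&lock->locked);
}

static spinlock_t demo_lock;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock(&demo_lock);
        counter++;               /* guarded increment */
        spin_unlock(&demo_lock);
    }
    return NULL;
}

long run_demo(void)
{
    pthread_t threads[4];
    counter = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return counter;              /* 4 * 100000 when the lock works */
}
```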

There are various problems with this basic variant of a spinlock.

One of them is that the spinning wastes CPU cycles doing nothing useful: while the lock is held by another thread, the CPU this thread is running on will sit there in a loop burning CPU cycles until the lock is released.

Another, perhaps more serious, issue is that even disregarding the spinloop, a CAS by itself can be very expensive. The reason for that is an issue called cache line bouncing or ping-ponging.

CPUs use the MESI protocol to maintain memory cache coherence across cores. When a CPU wants to write to a value in memory, in this case the locked flag, it must first acquire the cache line containing that value in Modified state. This acquisition requires the CPU to first broadcast an invalidation to all other cores also holding that cache line, and then wait for each of them to acknowledge that invalidation. This inter-core communication over the memory bus is expensive.

This issue is not unique to spinlocks, but spinlocks are one of the places greatly affected by it because of the shared locked flag. For example, let’s say we have 4 threads on 4 CPUs all trying to acquire the basic spinlock above. The locked flag is shared and sits in a single cacheline.

  1. Let’s say that CPU 0 holds the lock. CPUs 1, 2 and 3 are all spinning, waiting for the lock
  2. Each spin loop on each CPU repeatedly uses a CAS (via __sync_val_compare_and_swap) to attempt to acquire the lock. The CAS counts as a write operation, and so each CPU must acquire the cache line in Modified state, which means sending an invalidation to and waiting for acknowledgement from all other cores
  3. Each waiting CPU is therefore constantly fighting with the other waiting CPUs to invalidate the cacheline and grab the cache line in Modified state, even though the CAS will always fail since CPU 0 is holding the lock. This causes a continuous storm of cacheline invalidations
  4. When CPU 0 finally releases the lock, it writes to the same cacheline to release the lock, so it also joins the queue of CPUs fighting over the cacheline
  5. One of the other CPUs finally wins the CAS, and the process starts all over again

So in this scenario, the cacheline is effectively bouncing between cores at a high frequency. And, since each CPU must invalidate/acknowledge each other CPU, this gets quadratically worse with the number of contenders for the lock. On high CPU count machines this can be very problematic.

Finally, a basic spinlock like this is “unfair”: when the lock is released, all waiters race to acquire it and the winner is essentially arbitrary. This means that it is possible for a thread to spin forever if other CPUs continuously win the race: there is no guarantee of forward progress. For example, in combination with the cache line bouncing problem above, a CPU that happens to be physically closer to the CPU currently holding the lock could repeatedly win the CAS race, simply because the latency over the memory bus between those CPUs is lower than CPUs further away.

In contrast, a fair spinlock ensures that threads always acquire the lock in the order that they arrived: if CPU 0 holds the lock, and then CPU 3 and finally CPU 2 try to take the lock, it is guaranteed that CPU 3 will get the lock before CPU 2.
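The classic textbook fair spinlock is the ticket lock. The sketch below is a userspace illustration (GCC builtins again; this is not the kernel’s implementation): each arriving thread atomically takes a ticket, and the lock serves tickets strictly in arrival order.

```c
typedef struct {
    volatile unsigned next_ticket;   /* number handed to the next arrival */
    volatile unsigned now_serving;   /* ticket currently allowed to hold the lock */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *lock)
{
    /* Take a ticket; our place in line is fixed from this point on. */
    unsigned my_ticket = __sync_fetch_and_add(&lock->next_ticket, 1);
    while (lock->now_serving != my_ticket)
        ; /* spin until it's our turn */
}

static void ticket_unlock(ticket_lock_t *lock)
{
    /* Hand the lock to whoever holds the next ticket. */
    __sync_fetch_and_add(&lock->now_serving, 1);
}
```

Note that all waiters still spin on the shared now_serving field, so a ticket lock fixes fairness but not cache line bouncing; solving both problems at once is what queued spinlock designs are for.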

In userspace, there are a host of other problems with spinlocks that don’t apply in the context of the kernel, because the kernel has more control over the system (such as being able to disable preemption). See this article for an excellent overview of potential issues with spinlocks in userspace.

The queued spinlock

Given the problems described in the previous section, it should be no surprise that more sophisticated spinlocks exist. The Linux kernel uses spinlocks extensively, and there are a number of spinlock variants in use in the kernel. The main variant in use is the ‘queued spinlock’, or qspinlock (code).

The qspinlock is a version of the MCS spinlock. The implementation in the Linux kernel has a number of specific optimizations applied to it compared to the regular MCS lock, but the basic operation of the MCS lock is fairly straightforward.

The MCS lock is defined by two structures:

// Each CPU has its own node
struct mcs_node {
    struct mcs_node* next;  // Points to the next waiter in the lock queue
    int locked;             // Indicates whether this CPU is waiting for the lock (set to 1 while waiting; 0 when not waiting)
};

// The lock itself; essentially just a tail pointer
struct mcs_lock {
    struct mcs_node* tail;  // Points to the last waiter in the lock queue; null if currently unlocked
};

The mcs_lock represents the actual lock. The mcs_node represents a lock holder or waiter.

The idea behind the MCS spinlock is that each CPU has its own mcs_node, typically via per-CPU storage. The mcs_nodes form a linked list of waiters: the next pointer of each node points to the next waiter that should be unlocked after this node (if any). In addition, there is no shared locked flag that all CPUs have to contend over. The locked flag is instead located in the per-CPU mcs_node, which means that each waiter spins on its own local flag.

The mcs_lock itself is just a tail pointer. It always points to the mcs_node of the last CPU that has acquired, or is in the process of acquiring the lock. It is null if nobody is currently holding the lock.

By representing a lock like this, it solves the two major issues of the previously described basic spinlock:

  1. No cache line bouncing: there is no shared locked flag that all waiters are contending over. Instead, each waiter (CPU) has its own locked flag and it spins on that local flag. The only shared state is the tail pointer on the lock, but nobody spins on it.
  2. Lock fairness: waiters are guaranteed to be released in the order that they arrived at the lock through the use of the queue formed by the next pointers, making the lock completely fair.

That’s all a bit abstract, so let’s work through an example to make this more concrete. In the following example, we start with an unlocked lock, and CPUs 0, 3 and 2 will attempt to acquire the lock in that order.

The lock initially starts in an empty state, meaning it’s unlocked. The tail pointer is null:

CPU 0 attempts to acquire the lock and atomically swaps the tail pointer of the lock with its own node. Because the tail pointer was null, we know that there was no previous lock holder, so we’re now the holder of the lock:

Now CPU 3 attempts to acquire the lock as well. It again atomically swaps the tail pointer of the lock with its own node. The previous value of the tail pointer was not null, so it knows that there was already somebody else holding the lock. It adds itself to the queue (by setting the next pointer of the previous value of tail) and starts spin-waiting on the locked flag to become 0:

CPU 2 attempts to acquire the lock as well, and follows the same process as CPU 3 before it. It also adds itself to the queue and spins on locked:

Finally, CPU 0 releases the lock. It sees that its node has a next pointer set, so it knows that there is another waiter behind it in the queue, in this case CPU 3. It sets the locked flag of the next waiter to 0, making that waiter the new lock holder and breaking it out of its spinloop:

This process continues, with each lock releasing & transferring ownership in turn, until the final waiter (CPU 2) releases and the lock is unlocked again, returning the lock to the empty state (tail == NULL):
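The walkthrough above maps to surprisingly little code. Here’s a userspace sketch of the MCS acquire/release paths (C11 atomics, with per-thread rather than per-CPU nodes; the kernel’s qspinlock is considerably more optimized than this):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
    struct mcs_node *_Atomic next;  /* next waiter in the queue */
    atomic_int locked;              /* 1 while waiting; 0 once we own the lock */
};

struct mcs_lock {
    struct mcs_node *_Atomic tail;  /* last waiter; NULL when unlocked */
};

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);

    /* Atomically swap ourselves in as the new tail. */
    struct mcs_node *prev = atomic_exchange(&lock->tail, me);
    if (prev == NULL)
        return;                      /* tail was NULL: lock was free, we own it */

    /* Queue behind the previous tail and spin on our OWN flag. */
    atomic_store(&prev->next, me);
    while (atomic_load(&me->locked))
        ;
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *next = atomic_load(&me->next);
    if (next == NULL) {
        /* No visible successor: try to swing tail back to NULL (unlocked). */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;                  /* nobody was waiting; lock is free again */
        /* A waiter is mid-enqueue; wait for its next-pointer store. */
        while ((next = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&next->locked, 0);  /* hand the lock to the next waiter */
}
```

The only shared write is the tail exchange on arrival; all spinning happens on each waiter’s own node, which is exactly the property that kills the cache line bouncing problem.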

Putting the ‘resilient’ in ‘resilient queued spinlocks’

The qspinlock described above is used extensively throughout the Linux kernel. It is written on the assumption that all usage is correct, and makes no attempt to handle incorrect usage: if a qspinlock is misused, it will simply deadlock and hang the system, just as any regular (user space) lock would.

This is fine for the kernel, because as a kernel developer you can track such deadlocks down relatively easily, similarly to how you would with any other locking issues in regular user space code. The problem appears when a system is introduced that allows for random (within reason) C code to be executed at arbitrary points during the kernel’s regular execution.

This system is called ‘eBPF’, and that random code takes the form of eBPF programs.

Writers of eBPF programs are not typically kernel developers themselves, and so cannot be expected to understand the intricacies of the locking strategy of the kernel system they’re hooking into. For a long time, the way eBPF dealt with this (via the verifier) was simple:

  1. It only allowed a single spinlock to be taken at a time in an eBPF program
  2. It only allowed spinlocks to be taken in select program types with well-known attach points in the kernel (i.e. where it was known up front that taking a spinlock would not pose a problem)
  3. It disallowed any functions to be called while the spinlock was held, thus preventing recursive locking situations

While this worked for simple cases, as the scope/usage of eBPF grew these restrictions became too limiting. It would be ideal if the verifier could statically prove that an eBPF program is deadlock-free, but that’s infeasible due to the combinatorial explosion: eBPF programs can attach to many different kernel hooks, which may (or may not) already be holding other locks.

To statically prove deadlock-freedom, the verifier would essentially need to answer: across all possible interleavings of all loaded BPF programs, on all CPUs, in all interrupt contexts, combined with whatever kernel locks are held at each attachment point, is there any locking violation?

To still be able to relax the restrictions, a new resilient queued spinlock, or rqspinlock was introduced for use in the eBPF system. This spinlock is, as the name implies, based on the existing qspinlock. But, where qspinlock expects usage to be correct, the rqspinlock takes a more defensive approach: it tolerates potential deadlocks and recovers from them dynamically.

The goal is to ensure the system will (eventually) make forward progress, regardless of locking errors. It achieves this by detecting & recovering from deadlocks in a few different ways:

  1. Active deadlock detection: Each CPU maintains a (fixed) table of currently held locks. On lock acquisition, when the lock is found to be already held (i.e. the fast path fails and the CPU must spin), and periodically during the spin-wait, rqspinlock performs the following checks. If the checks fail, lock acquisition fails with -EDEADLK:
    • AA check (recursive lock): the held lock table of the current CPU is checked for the current lock. If the lock is already held on the same CPU, this is a recursive lock, called ‘AA deadlock’ in the rqspinlock code
    • ABBA check (lock order inversion): the held lock table of other CPUs is checked for lock order inversion. In this case, CPU A holds lock X and wants lock Y; CPU B holds lock Y and wants lock X
  2. Timeout: The deadlock detection is not exhaustive; it only covers common cases that are cheap to detect. As a final defense, to deal with deadlocks that are not detected, rqspinlock’s spinloops have a maximum duration. If the maximum duration is exceeded, lock acquisition fails with -ETIMEDOUT
  3. Queue timeout: When the spinloop times out, the timeout cascades through the waiters behind the timed out node on the MCS queue, causing all waiters for the lock to exit in FIFO order with -ETIMEDOUT

All this together makes rqspinlock resilient against various common lock issues. While this is conceptually ‘simple’, as you might imagine, it requires quite a bit of complicated (and race-prone) machinery in rqspinlock to achieve all this.

And it’s in that machinery that the cause of our issue lies.

Race conditions on a single CPU

Now that we know much more about spinlocks in the Linux kernel than we ever expected, we can start to reason about the freeze we’re trying to track down, so let’s recap.

We know from our minimal repro that the freeze is caused by some interaction between two eBPF programs: one that runs on a context switch, and one that runs on the sampling interrupt. Both programs use bpf_ringbuf_reserve to reserve space in a ring buffer, and that function uses the rqspinlock (which we now understand very well) to protect the ring buffer state.

As described earlier in the article, our hypothesis based on our minimal repro is that the freeze is caused by the spinlock in bpf_ringbuf_reserve being recursively acquired: we suspect that the sampling program attempts to acquire the ring buffer spinlock while the context switch program is also acquiring it on the same CPU.

But, as described in the previous section, rqspinlock should theoretically be able to deal with this situation via its recursive (AA) lock detection, and yet we still seem to be hitting the full 250ms rqspinlock timeout. So there are two options: either our hypothesis is incorrect and the problem is somewhere else, or there is a bug in rqspinlock.

Spoiler: it’s a bug. Let’s dig into it.

We now know that the deadlock detection runs as part of the spinloop when the lock is contended, and we saw earlier that the spinwait in rqspinlock effectively looks like this:

while (true)
{
    if (!lock->locked || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK))
        break;
}

This is a loop that spins until either the locked flag is no longer set (meaning that the previous lock holder released, and this CPU is now the owner), or until the timeout is reached (RES_CHECK_TIMEOUT).

There is no deadlock detection happening in this loop, so we’ll make the educated guess that the deadlock detection happens in RES_CHECK_TIMEOUT. If you follow the code for RES_CHECK_TIMEOUT, you indeed eventually end up at the function check_deadlock_AA being called every 1ms as part of the timeout checking.

It looks like this:

static noinline int check_deadlock_AA(rqspinlock_t *lock, u32 mask,
				      struct rqspinlock_timeout *ts)
{
	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
	int cnt = min(RES_NR_HELD, rqh->cnt);

	/*
	 * Return an error if we hold the lock we are attempting to acquire.
	 * We'll iterate over max 32 locks; no need to do is_lock_released.
	 */
	for (int i = 0; i < cnt - 1; i++) {
		if (rqh->locks[i] == lock)
			return -EDEADLK;
	}
	return 0;
}

So, it retrieves the table of held locks for the current CPU (this_cpu_ptr(&rqspinlock_held_locks)). It then goes through the locks in the table, looking for the lock we’re currently trying to acquire. If that lock is found in the table, this is a recursive lock, and -EDEADLK is returned.

Looking at this code, there is not a lot that can go wrong, at least on this side where we only check the table. So let’s check the code that actually updates the table.

The function that adds locks to the table is called grab_held_lock_entry. I won’t share it here (code), but it simply appends the lock to a static per-CPU array by incrementing a per-CPU counter. Not a lot that can go wrong there either.

The next step is to look at the usages of this function. There are two places where this function is called. The first is in the fast path of lock acquisition, which is when the lock is currently not held at all (i.e. the tail == null case of the MCS lock).

The second is in the slow path, when the lock is already held, and we thus need to spinwait for the lock to become available. The spinloop code we showed above is from the slow path; we haven’t looked at the fast path yet. It looks like this:

static __always_inline int res_spin_lock(rqspinlock_t *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) {
		grab_held_lock_entry(lock);
		return 0;
	}
	return resilient_queued_spin_lock_slowpath(lock, val);
}

As mentioned before, the version of the MCS lock in the Linux kernel is more optimized than the basic MCS lock, but you can see the atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL) as equivalent to the exchange of the tail pointer in the basic MCS lock: if it succeeds, it means that the lock wasn’t locked yet and that we now hold it. The lock is then added to the held lock table for this CPU via grab_held_lock_entry.

It is in this code that the issue lies, as spotted by Alexei.

The problem is that this code assumes that there can’t be anybody acquiring the lock at the same time, on the same CPU, which is a fair assumption: as we saw before, the lock function for rqspinlock disables both interrupts and preemption on this CPU before attempting to acquire the lock to guarantee it can’t be interrupted.

However, we now also know that the sampling interrupt is an NMI (non-maskable interrupt), which, as the name implies, cannot be masked: if an NMI fires, it will interrupt whatever code is currently running on the CPU and take precedence.

And that’s the problem here: if the NMI happens after the compare-exchange that acquires the lock, but before grab_held_lock_entry has been called, the lock is (temporarily) in an inconsistent state. It is locked (i.e. acquired), but it’s not in the table of held locks (yet).

This means that any deadlock checks that happen while in that state will miss any deadlocks involving this lock. Take the following sequence of events, on a single CPU:

  1. The context switch eBPF program runs and attempts to reserve space in the ring buffer. It tries to acquire the lock via res_spin_lock and does so via atomic_try_cmpxchg_acquire; it now holds the lock
  2. Before it updates the held lock array via grab_held_lock_entry, a sampling NMI fires on the same CPU, interrupting the context switch eBPF program
  3. The sampling eBPF program also attempts to reserve space in the ring buffer. It tries to acquire the lock via res_spin_lock again, but sees the lock is currently held (i.e. the atomic_try_cmpxchg_acquire fails), in this case by the context switch program
  4. It enters the slow path (resilient_queued_spin_lock_slowpath), where it enters the spinloop to wait for the lock to be released.
  5. Because the context switch hasn’t updated the held lock array yet, the AA deadlock check doesn’t trigger, causing this sampling NMI to spin for the full 250ms timeout, preventing the CPU from making progress at all for that duration, thus freezing the system
  6. The sampling NMI lock acquisition fails after 250ms, which fails the ring buffer reservation, and eventually causes the program to stop, after which the context switch eBPF program continues to run. The context switch program now updates the held locks table via grab_held_lock_entry and runs as normal, thus unfreezing the system
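The sequence above can be reproduced deterministically in a small userspace model (entirely ours: the “NMI” is just a function call placed exactly in the window, and the per-CPU held lock table is a plain array):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model (ours) of one CPU's held-lock table and the pre-fix
 * acquisition order: cmpxchg first, table update second. */
#define MODEL_LOCKED 1

typedef struct { atomic_int val; } model_lock_t;

static void *held_locks[4];
static int held_cnt;

static void grab_held_lock_entry(void *lock)
{
	held_locks[held_cnt++] = lock;
}

/* The AA check: is this lock already in our held-lock table? */
static bool check_aa_deadlock(void *lock)
{
	for (int i = 0; i < held_cnt; i++)
		if (held_locks[i] == lock)
			return true;
	return false;
}

static bool nmi_ran, nmi_sees_deadlock;

/* "NMI handler": tries to take the same lock; if it's held, the slow
 * path runs the AA check against the table. */
static void simulated_nmi(model_lock_t *lock)
{
	int expected = 0;
	if (!atomic_compare_exchange_strong(&lock->val, &expected, MODEL_LOCKED)) {
		nmi_ran = true;
		nmi_sees_deadlock = check_aa_deadlock(lock);
	}
}

/* Pre-fix ordering, with the NMI landing inside the window. */
static void buggy_res_spin_lock(model_lock_t *lock)
{
	int expected = 0;
	if (atomic_compare_exchange_strong(&lock->val, &expected, MODEL_LOCKED)) {
		simulated_nmi(lock);        /* NMI fires between cmpxchg and table update */
		grab_held_lock_entry(lock); /* too late for the NMI's AA check */
	}
}
```

The model mirrors step 5: because the table update hasn’t happened yet, the interrupted context’s lock is invisible to the AA check, and the real NMI would spin until the 250ms timeout.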

The fix for this was made by Kumar, which came down to swapping the order: instead of only updating the held locks table when the acquire succeeds, the held locks table is now always updated before actual lock acquisition (via the compare-exchange) is attempted:

static __always_inline int res_spin_lock(rqspinlock_t *lock)
{
	int val = 0;

	/*
	 * Grab the deadlock detection entry before doing the cmpxchg, so that
	 * reentrancy due to NMIs between the succeeding cmpxchg and creation of
	 * held lock entry can correctly detect an acquisition attempt in the
	 * interrupted context.
	 *
	 * cmpxchg lock A
	 * <NMI>
	 * res_spin_lock(A) --> missed AA, leads to timeout
	 * </NMI>
	 * grab_held_lock_entry(A)
	 */
	grab_held_lock_entry(lock);

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return 0;
	return resilient_queued_spin_lock_slowpath(lock, val);
}
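Running the same simulated interleaving against the new ordering shows why it works: the table entry now exists before the lock can ever be observed as held. Again a userspace sketch of ours, simplified in that the NMI path doesn’t create and unwind its own table entry as the real slow path does:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model (ours) of the fixed ordering: table update first,
 * cmpxchg second. */
#define MODEL_LOCKED 1

typedef struct { atomic_int val; } model_lock_t;

static void *held_locks[4];
static int held_cnt;

static void grab_held_lock_entry(void *lock)
{
	held_locks[held_cnt++] = lock;
}

static bool check_aa_deadlock(void *lock)
{
	for (int i = 0; i < held_cnt; i++)
		if (held_locks[i] == lock)
			return true;
	return false;
}

static bool nmi_ran, nmi_sees_deadlock;

/* "NMI handler": tries the same lock; in the real slow path it would
 * also create (and later release) its own table entry first. */
static void simulated_nmi(model_lock_t *lock)
{
	int expected = 0;
	if (!atomic_compare_exchange_strong(&lock->val, &expected, MODEL_LOCKED)) {
		nmi_ran = true;
		nmi_sees_deadlock = check_aa_deadlock(lock);
	}
}

/* Fixed ordering: the table entry exists before the cmpxchg, so an NMI
 * landing anywhere after the acquisition can see it. */
static void fixed_res_spin_lock(model_lock_t *lock)
{
	grab_held_lock_entry(lock);
	int expected = 0;
	if (atomic_compare_exchange_strong(&lock->val, &expected, MODEL_LOCKED)) {
		simulated_nmi(lock); /* same interleaving as before the fix */
		return;
	}
	/* slow path would run here */
}
```

With the entry in place before the compare-exchange, the simulated NMI’s AA check now spots the recursion immediately instead of spinning.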

We grabbed the patch from Kumar and compiled our kernel with it to confirm it fixes the issue. We tried to repro again and the issue was gone: the system is no longer freezing. Yay! Job’s done.

Or is it?

While the system is no longer visibly freezing with the patch, we’re still seeing messages like this in dmesg:

[  +0.000002] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.584 msecs
[  +0.014555] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.701 msecs
[  +0.004845] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 26.229 msecs
[  -0.000001] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 15.120 msecs
[  +0.000000] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 24.805 msecs
[  +0.000000] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 6.855 msecs
[  +0.246851] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.974 msecs
[  +0.163526] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.981 msecs
[  +0.235492] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.996 msecs

The good news is that the 250ms warning messages we saw initially are indeed gone (as expected), but there are still cases where the NMI takes too long to run.

Given that we’re doing nothing but reserving space in a ring buffer in the NMI handler, you would never expect this message to be printed for our test case. The stalls are much shorter now, but they’re still stalls. Our suspicion is that these are still system freezes, but that they’re now too short to notice by eye.

Something is clearly still going on, so, let’s dig into that next.

Checking for deadlocks…too late

The first thing we can see from these warning messages is that the reported NMI runtimes fall into two bands: 1-2ms, and much larger ones of roughly 6-26ms.

The first band is quite close to (a multiple of) the 1ms interval at which deadlock checks are performed during the spinwait. This made us suspicious that there was some issue with how the deadlock check is triggered. So let’s look at the function that performs the timeout checking & deadlock detection:

static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
				  struct rqspinlock_timeout *ts)
{
	u64 time = ktime_get_mono_fast_ns();
	u64 prev = ts->cur;

	if (!ts->timeout_end) {
		ts->cur = time;
		ts->timeout_end = time + ts->duration;
		return 0;
	}

	if (time > ts->timeout_end)
		return -ETIMEDOUT;

	/*
	 * A millisecond interval passed from last time? Trigger deadlock
	 * checks.
	 */
	if (prev + NSEC_PER_MSEC < time) {
		ts->cur = time;
		return check_deadlock(lock, mask, ts);
	}

	return 0;
}

This function is called during the spinwait; it’s what the RES_CHECK_TIMEOUT macro we saw earlier boils down to. There is a rqspinlock_timeout object passed to it that represents the state of the timeout.

When this function is first called for a particular spinwait, it initializes the timeout and immediately returns; this is the if (!ts->timeout_end) branch.

For subsequent calls (i.e. when the timeout has already been initialized), two checks are performed:

  1. Whether the current time (from ktime_get_mono_fast_ns) is past the end of the timeout (the if (time > ts->timeout_end) check). In this case, the wait should time out and -ETIMEDOUT is returned
  2. If the wait hasn’t timed out yet, and at least 1ms has passed since the last time we checked (the if (prev + NSEC_PER_MSEC < time) check), the deadlock checks are performed

While this functionally works fine, the fact that the deadlock checks are only performed when at least 1ms has passed is what’s causing the 1-2ms NMI timeouts we’re seeing in dmesg.

It’s still essentially the same situation as the previous freezing issue that’s now fixed, but it expresses itself slightly differently:

  1. The context switch program runs and acquires the ring buffer spinlock
  2. While the context switch program holds the ring buffer spinlock, an NMI occurs on the same CPU that also tries to acquire the ring buffer spinlock
  3. The NMI enters the spinwait, but doesn’t immediately break out, even though this is guaranteed to be an AA deadlock situation, because the deadlock check is not triggered immediately (it takes at least 1ms)
  4. After at least 1ms has passed (possibly more, depending on timing), the deadlock check in check_timeout finally triggers and detects the deadlock we already knew was there; it breaks out of the spinwait with -EDEADLK

To verify this, we made a quick local patch that triggered check_deadlock on initialization of the timeout (i.e. in the if (!ts->timeout_end) branch) and attempted to repro again. With this fix in place, the timeouts in the 1-2ms band no longer occur; we’re only seeing the larger timeouts in dmesg now.
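Our local patch can be modeled in userspace with a fake clock (this is our sketch, not the upstream fix; the eager_first_check flag stands in for calling check_deadlock from the initialization branch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Userspace model (ours) of the rqspinlock timeout state. */
struct model_timeout {
	uint64_t cur, timeout_end, duration;
};

static int deadlock_checks; /* counts how often check_deadlock would run */

static int model_check_deadlock(void)
{
	deadlock_checks++;
	return 0; /* pretend no deadlock was found */
}

/* now_ns is a fake clock so the timing is deterministic. With
 * eager_first_check == false this mirrors the stock check_timeout:
 * no deadlock check can run until at least 1ms has elapsed. */
static int model_check_timeout(struct model_timeout *ts, uint64_t now_ns,
			       bool eager_first_check)
{
	uint64_t prev = ts->cur;

	if (!ts->timeout_end) {
		ts->cur = now_ns;
		ts->timeout_end = now_ns + ts->duration;
		return eager_first_check ? model_check_deadlock() : 0;
	}

	if (now_ns > ts->timeout_end)
		return -1; /* -ETIMEDOUT */

	if (prev + NSEC_PER_MSEC < now_ns) {
		ts->cur = now_ns;
		return model_check_deadlock();
	}

	return 0;
}
```

Spinning for just under 1ms with the stock logic never runs a deadlock check; with the eager variant, an AA deadlock is probed on the very first call.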

We reported this issue as well, and a proper fix was made.

Death by a thousand NMIs

We’re left with a single issue: the 6-26ms timeouts still being reported through dmesg after our previous fix. The cause of this issue was spotted by Kumar.

The issue here is that, even in the absence of recursive locks, NMIs can prevent the holder of a spinlock from making progress, starving the lock holder of CPU time.

To understand why, take the following situation:

  1. The context switch program runs on CPU 0 and acquires the ring buffer spinlock
  2. While CPU 0 has the ring buffer spinlock, an NMI happens on another CPU, let’s say CPU 3. That NMI tries to acquire the ring buffer spinlock as well, and enters the spinwait because CPU 0 has the lock. Note that because this is on another CPU, this is not a recursive lock situation, so the deadlock checks don’t trigger

So we’re now in a state where CPU 0 is holding the lock, and CPU 3 is waiting for it. In this situation, an NMI happens on CPU 0 and also tries to acquire the ring buffer spinlock. This is a recursive lock situation, and with the fixes previously made, it’s now handled correctly: the NMI fails to acquire the ring buffer spinlock. However, even though the NMI couldn’t get the lock, it did prevent CPU 0 from making progress in the meantime.

If NMIs keep getting spammed on CPU 0 while it holds the ring buffer spinlock, it prevents the context switch program from making progress and eventually releasing the ring buffer spinlock, which in turn prevents other CPUs that want the lock (like CPU 3) from progressing as well. At a high enough sampling frequency, this can lead to the longer stalls, especially in combination with the previous issue of a minimum 1ms time before deadlock checks are triggered.

The fix for this was made by Kumar and involved a number of quite subtle changes to the core of the rqspinlock slow path (i.e. when a lock is already locked when attempting to acquire it). While the changes themselves were relatively small, explaining them is out of the scope of this post, given the subtlety of the race conditions involved. If you’re interested in reading about it, see patch #3, #4 and #5.

Either way, we tested this fix together with all the other fixes and we’re finally in a state where we’re not seeing any freezes on the system when running our repro, and dmesg is free from timeout warnings. Yay!

Wrapping up

Phew! That was quite a trip, but we did learn a lot about the internals of locking in the Linux kernel.

The fixes for all issues have been rolled into a patch series by Kumar, which made it into the 6.19 kernel. The issues were deemed important enough that the fixes have also been backported to the 6.17 and 6.18 kernels. Thanks again to Kumar and Alexei for their help in tracking the various issues down!

And, now that we thoroughly understand the issue(s), we’ve been able to create a workaround on our end for older kernels that don’t have the fixes yet: we simply discard sampling NMIs that interrupt the context switch eBPF program.
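The workaround boils down to a per-CPU “busy” flag around the context switch program’s critical section; the sampling program drops its sample when it finds the flag set. A userspace model of the idea (names and structure are ours, not our shipping eBPF code; _Thread_local stands in for per-CPU data):

```c
#include <assert.h>
#include <stdbool.h>

/* Model (ours) of the workaround. In real eBPF code the flag would
 * live in a per-CPU array map shared by both programs. */
static _Thread_local bool in_context_switch;

static int samples_discarded;
static int samples_taken;

/* Sampling "NMI" program: discard the sample rather than contend for
 * the ring buffer while the context switch program is mid-reservation. */
static void sampling_prog(void)
{
	if (in_context_switch) {
		samples_discarded++;
		return;
	}
	samples_taken++;
	/* ... reserve ring buffer space and write the sample ... */
}

/* Context switch program, with an NMI landing in the middle of its
 * critical section. */
static void context_switch_prog(void)
{
	in_context_switch = true;
	sampling_prog(); /* simulated NMI during the critical section */
	in_context_switch = false;
}
```

Losing the occasional sample is a much better outcome than stalling the CPU inside the NMI handler.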

It was a fun ride, but here’s to hoping we don’t see any more kernel issues in the future! Ah, who are we kidding…

