CPU bound - are you sure?

TLDR

You have a bottleneck in your system, you run top and see your system’s process taking 100% CPU. You then run top -H and see there’s this one thread keeping 1 CPU core totally busy. You run perf and get a flamegraph, you see some hot functions, you optimize them, and you get slightly better performance - enough that your system can survive for a few more months and your customers are happy (for now). But you wonder: modern CPUs are supposed to be fast (billions of instructions per second per core), so why is my system’s performance nowhere near those ideal numbers?

This is a tale of how CPU bound workloads can be memory bound in disguise. Why? Because top is lying. Or maybe we just don’t realize what it’s actually measuring. This is the backstory of how I found out (the hard way) that the CPU utilization top reports includes stalled CPU cycles, e.g. cycles spent waiting on memory.

If you don’t have time to read this post, just remember this: next time you’re investigating a CPU bottleneck in your system, consider whether stalled CPU cycles (e.g. waiting for memory) are the actual root cause, rather than genuinely CPU intensive computation.

Fun fact: after I learnt about stalled cycles and CPU utilization, I searched around and sure enough, there’s a Brendan Gregg talk on it. I probably would have saved a few days of head scratching had I seen this video earlier, but it wouldn’t have been as fun!

Workload

As much as I like measuring performance against a real production workload, sometimes it’s fun to just make up a workload and speculate about performance numbers. You get to test how close your mental model of a computer is to its real performance - either you get validation that your mental model is solid, or you learn a thing or two about how computers actually work.

The workload I was thinking of: if I have a 64 bit counter, how fast can I increment it? The metric I was looking at was counts/sec.

My machine is running Linux kernel 5.15.0-88-generic. It has an AMD Ryzen 7 3700X CPU with 8 cores, 16 with hyperthreading. It’s clocked at 3.6GHz, but from historical timeseries stats of my system I observe anywhere from 2GHz (when the system is mostly idle) to 4GHz (very busy CPU, high temperatures). This is typical of modern CPUs, which adjust their clock rate to balance power consumption and performance.

4GHz is 4B (billion) cycles per second. Assuming that a counter increment takes 1 instruction (wrong, as we’ll see later), ideally we should be able to get a throughput of 4B counts/sec. Modern CPUs actually optimize a lot (out of order execution, branch prediction, caching, superscalar execution, etc.), so in practice you’d see an IPC (instructions per cycle) of more than 1 unless you have severe stalls on external resources like main memory, disk, etc.

I wrote a simple Rust program for incrementing a counter as fast as possible:

use std::time::{Duration, Instant};

fn main() {
    let report_interval_secs = 3;
    let mut counter: u64 = 0;
    let mut last_display_time = Instant::now();
    let mut last_counter_value = 0;
    loop {
        counter = counter + 1;
        if last_display_time.elapsed() >= Duration::from_secs(report_interval_secs) {
            println!(
                "counter throughput is {} counts/sec",
                (counter - last_counter_value) / report_interval_secs
            );
            last_display_time = Instant::now();
            last_counter_value = counter;
        }
    }
}

If I run this, I get 20M counts/sec on my machine. I ran top -H and saw that the main thread is consuming 100% CPU. I then ran pidstat -p pid -I 1 to confirm that almost all the time is spent in user space, as opposed to somewhere in the kernel.

Now you might say that the critical code path here (the loop) is doing more stuff (reporting) than just incrementing the counter, but not many CPU cycles are spent there since we only display the counter throughput every 3 seconds. I verified using strace that my binary, in steady state, doesn’t call the write syscall (used to print to stdout) frequently. When the binary starts there are some syscalls, e.g. rt_sigaction, and those come from the Rust runtime. If you didn’t know already: yes, Rust does have a runtime, albeit a light one which handles things like panics. I then disassembled the binary and looked at the relevant assembly code:

; Symbol: perf_test::main, mangled name=_ZN9perf_test4main17h0a2ede65c59bc4dcE
; Source: /home/code/play/perf-test/src/main.rs:3
55555555C000: 48 81 EC C8 00 00 00       subq   $0xc8, %rsp
55555555C007: 48 C7 84 24 C0 00 00 00 >  movq   $0x3, 0xc0(%rsp)
55555555C013: 48 C7 44 24 38 00 00 00 >  movq   $0x0, 0x38(%rsp)
55555555C01C: 48 8D 05 FD 5E 01 00       leaq   0x15efd(%rip), %rax  ; std::time::Instant::now at time.rs:275
55555555C023: FF D0                      callq  *%rax
55555555C025: 48 89 54 24 48             movq   %rdx, 0x48(%rsp)
55555555C02A: 48 89 44 24 40             movq   %rax, 0x40(%rsp)
55555555C02F: 48 C7 44 24 50 00 00 00 >  movq   $0x0, 0x50(%rsp)
55555555C038: 48 8B 44 24 38             movq   0x38(%rsp), %rax
55555555C03D: 48 83 C0 01                addq   $0x1, %rax
55555555C041: 48 89 44 24 30             movq   %rax, 0x30(%rsp)
55555555C046: 0F 92 C0                   setb   %al
55555555C049: A8 01                      testb  $0x1, %al
55555555C04B: 75 23                      jne    0x55555555c070  ; <+112> at main.rs:9:19
55555555C04D: 48 8B 44 24 30             movq   0x30(%rsp), %rax
55555555C052: 48 89 44 24 38             movq   %rax, 0x38(%rsp)
55555555C057: 48 8D 05 52 5F 01 00       leaq   0x15f52(%rip), %rax  ; std::time::Instant::elapsed at time.rs:378
55555555C05E: 48 8D 7C 24 40             leaq   0x40(%rsp), %rdi
55555555C063: FF D0                      callq  *%rax
55555555C065: 89 54 24 60                movl   %edx, 0x60(%rsp)
55555555C069: 48 89 44 24 58             movq   %rax, 0x58(%rsp)
55555555C06E: EB 1E                      jmp    0x55555555c08e  ; <+142> at main.rs
55555555C070: 48 8D 3D F9 4F 03 00       leaq   0x34ff9(%rip), %rdi  ; str.0
55555555C077: 48 8D 15 2A 41 04 00       leaq   0x4412a(%rip), %rdx
55555555C07E: 48 8D 05 7B F5 FF FF       leaq   -0xa85(%rip), %rax  ; core::panicking::panic at panicking.rs:41
55555555C085: BE 1C 00 00 00             movl   $0x1c, %esi
55555555C08A: FF D0                      callq  *%rax
55555555C08C: 0F 0B                      ud2    
55555555C08E: BF 03 00 00 00             movl   $0x3, %edi
55555555C093: E8 18 FE FF FF             callq  0x55555555beb0  ; core::time::Duration::from_secs at time.rs:210
55555555C098: 89 54 24 70                movl   %edx, 0x70(%rsp)
55555555C09C: 48 89 44 24 68             movq   %rax, 0x68(%rsp)
55555555C0A1: 48 8D 7C 24 58             leaq   0x58(%rsp), %rdi
55555555C0A6: 48 8D 74 24 68             leaq   0x68(%rsp), %rsi
55555555C0AB: E8 00 FD FF FF             callq  0x55555555bdb0  ; core::cmp::PartialOrd::ge at cmp.rs:1163
55555555C0B0: 88 44 24 2F                movb   %al, 0x2f(%rsp)
55555555C0B4: 8A 44 24 2F                movb   0x2f(%rsp), %al
55555555C0B8: A8 01                      testb  $0x1, %al
55555555C0BA: 75 05                      jne    0x55555555c0c1  ; <+193> at main.rs:13:18
55555555C0BC: E9 77 FF FF FF             jmp    0x55555555c038  ; <+56> at main.rs:9:19
55555555C0C1: 48 8B 44 24 38             movq   0x38(%rsp), %rax
55555555C0C6: 48 2B 44 24 50             subq   0x50(%rsp), %rax
55555555C0CB: 48 89 44 24 20             movq   %rax, 0x20(%rsp)
55555555C0D0: 0F 92 C0                   setb   %al
55555555C0D3: A8 01                      testb  $0x1, %al
55555555C0D5: 75 02                      jne    0x55555555c0d9  ; <+217> at main.rs:13:17
55555555C0D7: EB 1E                      jmp    0x55555555c0f7  ; <+247> at main.rs
55555555C0D9: 48 8D 3D D0 4F 03 00       leaq   0x34fd0(%rip), %rdi  ; str.1
55555555C0E0: 48 8D 15 F9 40 04 00       leaq   0x440f9(%rip), %rdx
55555555C0E7: 48 8D 05 12 F5 FF FF       leaq   -0xaee(%rip), %rax  ; core::panicking::panic at panicking.rs:41
55555555C0EE: BE 21 00 00 00             movl   $0x21, %esi
55555555C0F3: FF D0                      callq  *%rax
55555555C0F5: 0F 0B                      ud2    
55555555C0F7: 48 8B 44 24 20             movq   0x20(%rsp), %rax
55555555C0FC: B9 03 00 00 00             movl   $0x3, %ecx
55555555C101: 31 D2                      xorl   %edx, %edx
55555555C103: 48 F7 F1                   divq   %rcx
55555555C106: 48 89 84 24 B8 00 00 00    movq   %rax, 0xb8(%rsp)
55555555C10E: 48 8D BC 24 B8 00 00 00    leaq   0xb8(%rsp), %rdi
55555555C116: E8 C5 F9 FF FF             callq  0x55555555bae0  ; core::fmt::ArgumentV1::new_display at mod.rs:317
55555555C11B: 48 89 44 24 10             movq   %rax, 0x10(%rsp)
55555555C120: 48 89 54 24 18             movq   %rdx, 0x18(%rsp)
55555555C125: 48 8B 44 24 18             movq   0x18(%rsp), %rax
55555555C12A: 48 8B 4C 24 10             movq   0x10(%rsp), %rcx
55555555C12F: 48 89 8C 24 A8 00 00 00    movq   %rcx, 0xa8(%rsp)
55555555C137: 48 89 84 24 B0 00 00 00    movq   %rax, 0xb0(%rsp)
55555555C13F: 48 8D 8C 24 A8 00 00 00    leaq   0xa8(%rsp), %rcx
55555555C147: 48 8D 7C 24 78             leaq   0x78(%rsp), %rdi
55555555C14C: 48 8D 35 6D 40 04 00       leaq   0x4406d(%rip), %rsi
55555555C153: BA 02 00 00 00             movl   $0x2, %edx
55555555C158: 41 B8 01 00 00 00          movl   $0x1, %r8d
55555555C15E: E8 0D FA FF FF             callq  0x55555555bb70  ; core::fmt::Arguments::new_v1 at mod.rs:388
55555555C163: 48 8D 7C 24 78             leaq   0x78(%rsp), %rdi
55555555C168: FF 15 F2 6A 04 00          callq  *0x46af2(%rip)
55555555C16E: FF 15 C4 6B 04 00          callq  *0x46bc4(%rip)
55555555C174: 48 89 04 24                movq   %rax, (%rsp)
55555555C178: 48 89 54 24 08             movq   %rdx, 0x8(%rsp)
55555555C17D: 48 8B 44 24 08             movq   0x8(%rsp), %rax
55555555C182: 48 8B 0C 24                movq   (%rsp), %rcx
55555555C186: 48 89 4C 24 40             movq   %rcx, 0x40(%rsp)
55555555C18B: 48 89 44 24 48             movq   %rax, 0x48(%rsp)
55555555C190: 48 8B 44 24 38             movq   0x38(%rsp), %rax
55555555C195: 48 89 44 24 50             movq   %rax, 0x50(%rsp)
55555555C19A: E9 99 FE FF FF             jmp    0x55555555c038  ; <+56> at main.rs:9:19

This is not that much assembly. In fact, a fair amount of it is overflow-check and panic machinery that debug builds insert, e.g. leaq -0xaee(%rip), %rax ; core::panicking::panic at panicking.rs:41.

To be completely sure, I also trimmed down the code to not have any reporting in the critical path and only report the average throughput at the end, and I still saw 20M counts/sec.
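
For reference, here’s roughly what that trimmed-down variant looks like - a minimal sketch, assuming a fixed iteration count (the original count isn’t specified in the post) and using std::hint::black_box so the compiler can’t collapse the loop in release builds:

use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Iteration count is an assumption for illustration, not from the original run.
    const ITERATIONS: u64 = 1_000_000_000;
    let mut counter: u64 = 0;
    let start = Instant::now();
    for _ in 0..ITERATIONS {
        // black_box keeps the optimizer from reducing the loop to a single add.
        counter = black_box(counter + 1);
    }
    let elapsed = start.elapsed();
    println!(
        "average throughput: {:.0} counts/sec",
        counter as f64 / elapsed.as_secs_f64()
    );
}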

200x bad

20M counts/sec is really low - about 200x worse than expected. Admittedly, one of our assumptions was that a single cycle is spent on the increment; in reality more cycles are spent in this critical path (as can be seen in the assembly). But even accounting for that, and with all the CPU’s optimizations, 20M counts/sec is still two orders of magnitude off.

I was also just doing cargo run, which doesn’t produce the most optimized binary. If I do cargo run --release, I get around 30M counts/sec. That makes sense, since in the critical path there are fewer instructions for the CPU core to process. It’s still bad - more than 100x worse than the ideal result.

So where is this 100x+ gap coming from? We need to go deeper into low level CPU metrics like stalled cycles, branch misses, page faults, etc. For that, I first ran ps -T -p pid on my binary’s pid to get the corresponding thread id; in this case they are the same, since my application is single threaded and only the main thread is running. Then I ran perf stat --tid <tid> -I 999 and saw:

     7.994403883          877944718      branches                  #  878.600 M/sec                    (50.07%)
     7.994403883               9302      branch-misses             #    0.00% of all branches          (50.44%)
     8.993656455             999.24 msec task-clock                #    1.000 CPUs utilized
     8.993656455                  3      context-switches          #    0.003 K/sec
     8.993656455                  0      cpu-migrations            #    0.000 K/sec
     8.993656455                  0      page-faults               #    0.000 K/sec
     8.993656455         4297277580      cycles                    #    4.301 GHz                      (49.56%)
     8.993656455             167988      stalled-cycles-frontend   #    0.00% frontend cycles idle     (49.75%)
     8.993656455         3597991325      stalled-cycles-backend    #   83.73% backend cycles idle      (50.15%)
     8.993656455         4863283375      instructions              #    1.13  insn per cycle
     8.993656455                                                   #    0.74  stalled cycles per insn  (50.44%)

We see that the CPU core was indeed clocked at around 4GHz, and that our process’ thread is consuming the whole core. IPC is greater than 1, which is not terrible, but also not great, because 83% of the backend cycles are stalled. That’s the red flag. It indicates that the CPU can’t retire/complete instructions as fast as it could if it weren’t stalled on an external resource. What could that external resource be? It has to be main memory (RAM), because I’m not doing any I/O here (disk or network).

You might ask: well, don’t CPUs have L1/L2/L3 caches precisely to avoid main memory hits? For loads, yes, and I could potentially rewrite the code for better cache locality - although obsessing over that every time you write code is generally an anti-pattern (more on that in another post some day, maybe). Back to CPU caches: the problem is that in our workload we must increment the counter and write it back, so a store is required, and that’s what causes the hit to main memory. It’s unavoidable (unless I write custom assembly that just uses registers to hold and increment the count). This would’ve explained the stalls in the workload from Brendan Gregg’s talk, because there the root cause was KPTI patches (backported to older kernel versions) causing increased TLB flushes, which meant more round trips to main memory and therefore more stalled CPU cycles. In Brendan’s example, we could’ve used pmcarch to see the LLC hit ratio and point out the issue (he had a “good” and a “bad” example, so he could compare). In our workload, however, there’s no alternative to writing to main memory very frequently.

Memory bound

All of this exploration brings up an interesting point: what is it about main memory that causes the stalled CPU cycles? As is common in systems, it boils down to understanding the difference between bandwidth and latency:

  • Data is transferred between the CPU and main memory in cache-line-sized chunks (typically 64 bytes). Your memory module can only move a limited number of these chunks per second; this is known as the bandwidth of your memory, and its units are bytes/sec. The range is typically 25-50 GB/s, and my memory module should be at the higher end, almost 50GB/s. We’re talking billions of bytes per second - that’s nowhere near explaining the mere millions of counts/sec I’m seeing.
  • It also takes time for data to travel between the CPU and your memory module - you’re literally bounded by the speed of light. Typically this latency is on the order of 100 nanoseconds. Let’s say it’s 100ns for a cache-line-sized chunk to reach main memory. That means for my CPU, around 400 cycles are spent waiting on main memory (100ns divided by the 0.25ns cycle time at 4GHz). So even with the idealistic (and wrong) assumption that a count increment takes 1 CPU cycle, we get 4B/(400 + 1) ~ 10M counts/sec (see the back-of-the-envelope calculation below).
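
To make that arithmetic concrete, here’s a tiny back-of-the-envelope calculation. The 100ns latency and 4GHz clock are the assumed round numbers from above, not measured values:

fn main() {
    // Assumed round numbers from the estimate above, not measured values.
    let clock_hz: f64 = 4.0e9;       // 4GHz, i.e. 0.25ns per cycle
    let mem_latency_ns: f64 = 100.0; // assumed round trip to main memory

    let cycle_ns = 1e9 / clock_hz;                // 0.25ns
    let stall_cycles = mem_latency_ns / cycle_ns; // ~400 cycles per access
    // One (idealized) increment cycle plus the stall, per counter update.
    let counts_per_sec = clock_hz / (stall_cycles + 1.0);

    println!("{stall_cycles:.0} stalled cycles per access"); // 400
    println!("~{:.0}M counts/sec", counts_per_sec / 1e6);    // ~10M
}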

This makes more sense. There’s still a gap (I’m getting 30M counts/sec while my updated mental model suggests ~10M counts/sec), so there are probably optimizations going on at the CPU/memory level, or perhaps the latency is lower than 100ns.

What can we do better?

Clearly, relying on main memory stores at a very high frequency (basically as fast as possible) is not ideal, since you become bottlenecked on CPU-to-main-memory latency, which is bounded by the speed of light. We have to change our workload in some way and optimize the writes to main memory. There are two options, neither of which I’ve had time to explore yet:

  1. Don’t write to main memory at all: keep the counter in a register and increment it there. If I get time, I’ll probably write assembly by hand to see the performance numbers here; I expect counts/sec in the billions, not millions (a rough sketch follows after this list).
  2. Coalesce the data such that you write bigger chunks (multiples of the cache line size) to main memory, as opposed to writing 64 bits at a time as fast as you can. In the current approach we’re not utilizing the bandwidth of our memory module at all. If we batch the data (with the tradeoff that individual stores take a bit longer), perhaps we can make our workload memory bandwidth bound instead of memory latency bound. In fact, since the CPU clock is 4GHz (at 8 bytes per store per cycle that’s at most ~32GB/s, less than the ~50GB/s of memory bandwidth), perhaps we can make the workload CPU bound after all (in the truest sense, not in the stalled-cycles sense).
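
For option 1, here’s a rough sketch of what a register-only counter could look like using Rust’s inline assembly. The asm! block, the iteration count, and the x86-64 target are my own assumptions for illustration; I haven’t measured this variant:

use std::arch::asm;
use std::time::Instant;

fn main() {
    // Iteration count is an arbitrary choice for illustration.
    const ITERATIONS: u64 = 10_000_000_000;
    let mut counter: u64 = 0;
    let start = Instant::now();
    // x86-64 only: the counter lives entirely in a register, so the hot loop
    // never loads from or stores to memory.
    unsafe {
        asm!(
            "2:",
            "add {counter}, 1",
            "cmp {counter}, {limit}",
            "jb 2b",
            counter = inout(reg) counter,
            limit = in(reg) ITERATIONS,
        );
    }
    let elapsed = start.elapsed();
    println!(
        "~{:.0}M counts/sec",
        counter as f64 / elapsed.as_secs_f64() / 1e6
    );
}

If the add really retires about once per cycle, this should land in the billions of counts/sec, which would confirm that the memory stores were the bottleneck.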

Conclusion

Next time you see 100% CPU utilization, try to understand what it really means. You’d be surprised how often that “CPU time” is actually spent just waiting on an external resource (like main memory), showing up as stalled cycles. I would guess (based on empirical experience) that many CPU bound workloads are actually memory bound workloads in disguise - specifically, memory latency bound.