
Modern SSDs - how fast are they?

I was recently reading the paper: What Modern NVMe Storage Can Do, And How to Exploit It. It provides a good overview of the state of current storage engines in different databases, and talks about Linux I/O API options, trends in hardware, and how to leverage the full power of SSDs. It’s a familiar story: just like in the previous post, where we couldn’t saturate main memory bandwidth, here we aren’t able to saturate disk bandwidth. You should read it if you design storage engines in production.

But this paper got me thinking: how fast are modern NVMe SSDs really? Vendors claim numbers, but those are always based on their own specialized setups, with nuances around the tools used and the software configuration. I was curious how far I could push my NVMe SSD on my own machine, with no special settings. That’s what this post is about, along with some of my thoughts on tradeoffs and the factors affecting performance.

Setup

Relevant machine specs:

Based on the spec sheet, Samsung claims:

  • Sequential read bandwidth: 3500 MB/s
  • Sequential write bandwidth: 3300 MB/s
  • QD 1, Thread 1, Random Read: 19K IOPS
  • QD 1, Thread 1, Random Write: 60K IOPS
  • QD 32, Thread 4, Random Read: 600K IOPS
  • QD 32, Thread 4, Random Write: 550K IOPS

QD here is queue depth. The storage device processes requests that the kernel hands it through a submission queue; QD measures the depth of that queue. QD 32 means there can be up to 32 IO requests in the queue at once, and the storage device is free to process them in whatever order it wants; exactly how the queue is drained depends on the device and its driver implementation.

Thread 1 means the workload was a single-threaded read or write. Thread 4 means the workload had 4 concurrent readers or writers. Thread 4 with QD 32 means those 4 threads could keep the submission queue filled with 32 IO requests. Async IO makes this possible.
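
To make the queue-depth idea concrete, here is a minimal sketch of keeping 32 reads in flight with libaio (my own illustration, not from the paper or the spec sheet; it assumes Linux, the libaio development headers, a test.data file of at least 32 × 4 KiB, and linking with -laio):

#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { QD = 32, BS = 4096 };   /* queue depth and block size (4 KiB assumed) */

int main(void) {
    int fd = open("test.data", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(QD, &ctx) != 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cbs[QD], *cbps[QD];
    void *bufs[QD];
    for (int i = 0; i < QD; i++) {
        /* O_DIRECT requires aligned buffers */
        if (posix_memalign(&bufs[i], BS, BS) != 0) { fprintf(stderr, "alloc failed\n"); return 1; }
        io_prep_pread(&cbs[i], fd, bufs[i], BS, (long long)i * BS);
        cbps[i] = &cbs[i];
    }

    /* Hand all 32 requests to the kernel at once; the device drains the
       queue in whatever order it likes. */
    if (io_submit(ctx, QD, cbps) != QD) { fprintf(stderr, "io_submit failed\n"); return 1; }

    struct io_event events[QD];
    int done = io_getevents(ctx, QD, QD, events, NULL);  /* wait for all 32 */
    printf("%d reads completed\n", done);

    io_destroy(ctx);
    close(fd);
    return 0;
}

io_uring exposes the same idea through a different pair of submission/completion queues.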

Another interesting thing to note is that for sequential workloads, MB/s are the units. This is because the driving use-case is streaming high volumes of data (e.g. streaming a big file from disk onto the network, creating a consistent snapshot of a database, range scan queries, etc.). For random workloads, however, IOPS are the units. To give some context, IOPS measures how many IO requests the storage device can complete per second. The catch is that IOPS only makes sense in the context of a block/page size, since not all IO requests are created equal. You can have 1 IO that writes a 1 GiB memory buffer, or ~262K IOs that each write a 4 KiB buffer - the end result in both cases is a 1 GiB file. Typically vendors use 4 KiB as the block size, so I’ll assume that here.
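
To connect the two units: bandwidth ≈ IOPS × block size. Using the vendor numbers above with a 4 KiB block, 600K random-read IOPS works out to roughly 600,000 × 4 KiB ≈ 2.4 GB/s, while the QD 1 figure of 19K IOPS is only about 78 MB/s - which is why a drive rated at 3500 MB/s sequentially can still feel slow for small random reads at low queue depth.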

Also, I can’t stress this enough: take these numbers with a grain of salt because they are a function of:

  • the exact workload
  • software (kernel code version, configurations, I/O api used, threading model)
  • the age/endurance of the drive, or even sound! To be fair, sound probably affects HDDs much more (they have moving parts), but I can imagine physical factors like temperature affecting flash drives as well.

It’s impossible to know what performance you will get with your setup unless you test and measure it yourself. Use these numbers as a starting point but end with concrete numbers from load tests or benchmarks, ideally against production workloads.

Results

I used fio to measure my NVMe SSD’s performance. There are different ways to run fio, but since I had a bunch of knobs to turn, I used job files - here’s an example:

[global]
; 1 GiB of IO per job, all jobs hitting the same file
size=1g
; O_DIRECT: bypass the kernel page cache
direct=1
filename=test.data
; write a separate log per job
per_job_logs=1
; async engine, queue depth 16, 4 concurrent workers per job section
ioengine=io_uring
iodepth=16
numjobs=4

[job_async_read_seq_small_thread=1]
stonewall
bs=512
rw=read

[job_async_read_seq_medium_thread=1]
stonewall
bs=1m
rw=read

[job_async_read_seq_large_thread=1]
stonewall
bs=4m
rw=read

[job_async_read_random_small_thread=1]
stonewall
bs=512
rw=randread

[job_async_read_random_medium_thread=1]
stonewall
bs=1m
rw=randread

[job_async_read_random_large_thread=1]
stonewall
bs=4m
rw=randread
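
For reference, a job file like this is passed straight to fio (the file name is whatever you saved it as), e.g. fio jobs.fio; when the run finishes, fio prints per-job IOPS, bandwidth, and latency percentiles, which is where numbers like the ones below come from.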

I had multiple such jobs. The async interface I used was libaio, although io_uring gave similar performance (more on the benefits of io_uring in a later post one day, but this provides a great summary). I also used direct I/O to bypass the kernel page cache. For both sync and async, I did write and read tests (sequential and random, each with different block sizes). The total file size to write/read was 4 GiB. Finally, for the async interface, I used a queue depth of 16 with 4 concurrent threads. As you can see, we are in an N-dimensional configuration space, and depending on which point you pick in that space, the performance results you get can be significantly different.
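
For contrast with the queue-depth sketch earlier, here is roughly what a single sync, direct-IO read boils down to (again my own illustration, not fio internals; it assumes Linux and a 4 KiB block size):

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const size_t bs = 4096;    /* assumed block size */
    int fd = open("test.data", O_RDONLY | O_DIRECT);  /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, bs, bs) != 0) { fprintf(stderr, "alloc failed\n"); return 1; }

    /* One blocking read: the calling thread waits until the device answers,
       so a single sync thread is effectively QD 1. */
    ssize_t n = pread(fd, buf, bs, 0);
    if (n < 0) perror("pread");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}

Getting concurrency out of this interface means running several such threads side by side, which is part of the sync vs async tradeoff discussed later.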

Here are the results I got with sync IO:

1. writes
	1. seq
		1. small block size (512 bytes)
			1. IOPS = 83K, write throughput = 40MiB/s, write latency p50 = 10us, p99 = 20us, p100=2ms
		2. medium block size (1 MiB)
			1. IOPS = 2900, write throughput = 2.9GiB/s, write latency p50 = 326us, p99 = 420us, p100 = 4ms
		3. large block size (4 MiB)
			1. IOPS = 780, write throughput = 3.1GiB/s, write latency p50 = 1.2ms, p99 = 1.7ms, p100 = 2ms
	2. random
		1. small block size (512 bytes)
			1. IOPS = 80K, write throughput = 38MiB/s, write latency p50 = 10us, p99 = 23us, p100 = 5ms
		2. medium block size (1 MiB)
			1. IOPS = 3000, write throughput = 3GiB/s, write latency p50 = 326us, p99 = 420us, p100 = 1ms
		3. large block size (4 MiB)
			1. IOPS = 775, write throughput = 3.1GiB/s, write latency p50 = 1.2ms, p99 = 1.7ms, p100 = 4ms
2. reads
	1. seq
		1. small block size (512 bytes)
			1. IOPS = 100K, read throughput = 49MiB/s, read latency p50 = 9us, p99 = 12us, p100 = 1.1ms
		2. medium block size (1 MiB)
			1. IOPS = 2850, read throughput = 2.8GiB/s, read latency p50 = 338us, p99 = 611us, p100 = 758us
		3. large block size (4 MiB)
			1. IOPS = 750, read throughput = 3GiB/s, read latency p50 = 1.3ms, p99 = 1.8ms, p100 = 1.9ms
	2. random
		1. small block size (512 bytes)
			1. IOPS = 20K, read throughput = 10MiB/s, read latency p50 = 50us, p99 = 58us, p100 = 10ms
		2. medium block size (1 MiB)
			1. IOPS = 2.6K, read throughput = 2.6GiB/s, read latency p50 = 375us, p99 = 420us, p100 = 700us
		3. large block size (4 MiB)
			1. IOPS = 775, read throughput = 3.1GiB/s, read latency p50 = 1.2ms, p99 = 1.5ms, p100=1.8ms

And here are the results with async IO (libaio interface), with a queue depth of 16 and 4 concurrent threads:

1. writes
	1. seq
		1. small block size (512 bytes)
			1. IOPS = 20K, write throughput = 9.7MiB/s, write latency p50 = 545us, p99 = 1.4ms, p100 = 128ms
		2. medium block size (1 MiB)
			1. IOPS = 800, write throughput = 800MiB/s, write latency p50 = 18ms, p99 = 20ms, p100 = 22ms
		3. large block size (4 MiB)
			1. IOPS = 200, write throughput = 840MiB/s, write latency p50 = 71ms, p99 = 75ms, p100 = 80ms
	2. random
		1. small block size (512 bytes)
			1. IOPS = 19K, write throughput = 9.7MiB/s, write latency p50 = 758us, p99 = 1.5ms, p100 = 13ms
		2. medium block size (1 MiB)
			1. IOPS = 800, write throughput = 800MiB/s, write latency p50 = 19ms, p99 = 33ms, p100 = 38ms
		3. large block size (4 MiB)
			1. IOPS = 200, write throughput = 800MiB/s, write latency p50 = 82ms, p99 = 148ms, p100 = 156ms
2. reads
	1. seq
		1. small block size (512 bytes)
			1. IOPS = 82K, read throughput = 40MiB/s, read latency p50 = 120us, p99 = 701us, p100 = 3.4ms	
		2. medium block size (1 MiB)
			1. IOPS = 823, read throughput = 823MiB/s, read latency p50 = 19ms, p99 = 31ms, p100 = 31ms
		3. large block size (4 MiB)
			1. IOPS = 253, read throughput = 1GiB/s, read latency p50 = 59ms, p99 = 100ms, p100 = 105ms
	2. random
		1. small block size (512 bytes)
			1. IOPS = 140K, read throughput = 68MiB/s, read latency p50 = 93us, p99 = 322us, p100 = 6ms
		2. medium block size (1 MiB)
			1. IOPS = 843, read throughput = 843MiB/s, read latency p50 = 17ms, p99 = 35ms, p100 = 36ms
		3. large block size (4 MiB)
			1. IOPS = 240, read throughput = 966MiB/s, read latency p50 = 64ms, p99 = 124ms, p100 = 125ms

The throughput numbers above (whether bytes/sec or IOPS) are per thread; multiply by 4, assuming all 4 threads deliver uniform performance, to get the aggregate - e.g. the 966 MiB/s large-block random-read figure is roughly 3.8 GiB/s in total.

Analysis

Here were some of my thoughts when I saw these numbers:

  • for sync + sequential writes, write throughput goes up to ~3 GiB/s, and IOPS depend on the block size. For smaller block sizes (<1 KiB) the latency is in the tens of microseconds, and it grows to low single-digit milliseconds as the block size increases (1-4 MiB range). As always, once you saturate the drive’s bandwidth you’ll see higher latencies, but under “normal conditions” expect microseconds to milliseconds, with latency going up as the block size increases, i.e. as you do larger IOs.
  • for sync + sequential reads, read throughput goes to 3GiB/s and latencies from us to ms, very similar to sync + sequential write above.
  • for sync + random writes and sync + random reads, it’s the same story: both throughput and latency are in the same ranges and are a function of block size.
  • now for async, it’s more complex because the async interface (libaio vs io_uring), queue depth, and number of threads come into the picture. The cardinality of the result space is much higher. With libaio, qd=16, threads=4, the aggregate throughput I see is around 3.5 to 4 GiB/s for medium/large block sizes (for both read vs write and seq vs random) - this is slightly higher than sync. Latencies, however, are much higher than sync, e.g. for the small block size, the p99 sequential write latency is 20us in sync and 1.4ms in async.
  • all these numbers are roughly in line with the vendor’s claims, which is a nice confirmation.

Thoughts on sync vs async: with async, we benefit because while the IO is being served, the CPU cycles can be used by the application for other work, since we’re not blocking. Concurrency is naturally higher too, although you can always get concurrency with sync (just run 4 sync writers or readers). The main downside of async is that its implementation, depending on the language and library you use, may involve locks and queues, plus the kernel scheduler context-switching the worker threads. All of this increases the number of stalled CPU cycles and thereby IO latency. It’s a tradeoff: giving more CPU time to the non-disk-IO parts of your code vs taking a latency hit on disk-IO calls because of the factors above. Another factor is the complexity async brings, e.g. code readability, dealing with exceptions, etc. Having said that, a lot of coroutine implementations try to bridge this gap (in my experience, they’re much more ergonomic than other ways of doing async).

Thoughts on where to start when designing performant storage systems: start from the workload (e.g. point reads vs large scans, etc.). If you want to maximize the utilization of an SSD, start by saturating (or getting close to) its bandwidth; for that you can use higher concurrency and/or bigger block sizes (if your block size is smaller than the disk page size, you’re wasting IO). Under these normal conditions, latency scales up roughly linearly as the block size increases. If you have very small reads, it will be hard to reach max throughput because of wasted IO, so consider maintaining an in-memory cache (at the cost of extra complexity). The cache can serve reads, and on writes you can update the cache but batch enough writes in the same offset range into fewer IO calls to maximize bandwidth; the “same offset range” part becomes trivial for sequential writes, e.g. the immutable append-only log an LSM tree uses at the disk layer. This batching of writes is sometimes referred to as coalescing, and even though it maximizes IO efficiency, you pay another cost: recoverability. Machines malfunction often at scale, and batched writes that haven’t yet reached the disk can be lost when a machine fails. Databases typically solve this with a write-ahead log (WAL), which comes with its own tradeoffs.
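
Here is a rough sketch of the coalescing idea (concept only, not a production design; the batch size, record format, and demo file descriptor are all made up):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BATCH_SIZE (1u << 20)   /* flush in 1 MiB chunks (arbitrary choice) */

struct coalescer {
    int    fd;                  /* append-only log file */
    size_t used;
    char   batch[BATCH_SIZE];
};

/* One large sequential write instead of many small ones. */
static void coalescer_flush(struct coalescer *c) {
    if (c->used == 0) return;
    if (write(c->fd, c->batch, c->used) < 0) perror("write");
    c->used = 0;
}

/* Buffer a small record (assumed <= BATCH_SIZE); the disk is only touched
   when the batch fills up. Anything still sitting in `batch` is lost on a
   crash unless it was also recorded in a WAL first. */
static void coalescer_append(struct coalescer *c, const void *rec, size_t len) {
    if (c->used + len > BATCH_SIZE) coalescer_flush(c);
    memcpy(c->batch + c->used, rec, len);
    c->used += len;
}

int main(void) {
    static struct coalescer c = { .fd = STDOUT_FILENO };  /* demo target only */
    for (int i = 0; i < 4; i++)
        coalescer_append(&c, "small-record\n", 13);
    coalescer_flush(&c);          /* force out the tail */
    return 0;
}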

Can we get more performance?

The paper I mentioned earlier claims that with modern hardware, storage/disk performance can reach DRAM performance - they’re talking about bandwidth/throughput here; I doubt we can ever beat DRAM latency (main memory fetches take 10-100s of ns, while with NVMe SSDs we are still talking microseconds to milliseconds). But just in terms of throughput, they do seem correct, and I wish I could test this one day. The key insight is that the older SATA interface for SSDs is being replaced with PCIe/NVMe, and with PCIe 4.0 a single SSD can use 4 lanes. You can have 128 PCIe lanes per CPU socket, which means you can attach multiple SSDs up to the point where a single socket can saturate all of them. They claim that 8 NVMe SSDs can be saturated with a single CPU socket, so if 1 SSD gives you 7 GiB/s of bandwidth, then with 8 SSDs you get on the order of 50 GiB/s (8 × 7 GiB/s ≈ 56 GiB/s), which is roughly the RAM bandwidth of my machine (based on its spec). Note that the 7 GiB/s is higher than the 3-4 GiB/s my SSD gives; that’s just because I didn’t buy the fastest SSD I could find at the time.

All of this is promising, and with PCIe 5.0, the trend toward extremely high disk throughput will probably continue. But to conclude, the question I had was: what about durability guarantees? If you have 8 SSDs, is it even possible to have atomic writes across those SSDs? Probably not. And if not, how do we change storage engine designs to accommodate multiple SSDs and the performance they offer? Anyway, I’ll end here; if you have questions or comments, feel free to contact me.