I’ve been learning Rust in my free time for the last 6 months. I read The Book and I’m in the middle of reading Rust for Rustaceans (my notes so far), but after doing a couple of very small exercises and projects I wanted to do something more rusty.
So on Sundays, in between espressos and with very good company, I started writing an external sort.
This is the first post in the series, covering the code that generates the data to be sorted. Stay tuned for the next one!
All the code & commands I’ll post here were run on my MacBook Pro M1 (32GiB RAM, 1TB SSD). This is just for fun. The code is hosted on GitHub.
The problem
External sorting algorithms are used when the data that must be sorted can’t fit in memory. For example, my Mac has 32GiB of RAM. How can I sort a larger file?
So the basic idea is:
- you get a big file (bigger than memory),
- you split it into smaller chunks that fit in memory,
- you sort those chunks and write them to disk,
- you merge the sorted chunks and write the final, sorted output.
I’ll go deeper into the details as we walk through my implementation.
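In the meantime, here’s a toy, fully in-memory illustration of that shape (just a sketch - in a real external sort each sorted “run” lives on disk, and `external_sort_shape` is a name I made up for illustration):

```rust
// Toy illustration: split, sort each chunk, then k-way merge the sorted
// "runs". Assumes chunk_size > 0. A real external sort writes each run
// to disk and streams the merge.
fn external_sort_shape(input: Vec<u32>, chunk_size: usize) -> Vec<u32> {
    // 1. Sort each memory-sized chunk individually.
    let runs: Vec<Vec<u32>> = input
        .chunks(chunk_size)
        .map(|c| {
            let mut run = c.to_vec();
            run.sort_unstable();
            run
        })
        .collect();

    // 2. Merge: repeatedly pick the smallest element among the run heads.
    let mut cursors = vec![0usize; runs.len()];
    let mut output = Vec::with_capacity(input.len());
    loop {
        let mut best: Option<(usize, u32)> = None;
        for (i, run) in runs.iter().enumerate() {
            if let Some(&v) = run.get(cursors[i]) {
                if best.map_or(true, |(_, b)| v < b) {
                    best = Some((i, v));
                }
            }
        }
        match best {
            Some((i, v)) => {
                cursors[i] += 1;
                output.push(v);
            }
            None => return output,
        }
    }
}
```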
The Solution
Goals
- Make this thing fast af.
- Drink good espresso.
- Go into rabbit holes, learn about hardware, the OS, and performance.
Non-goals
- To write a production-ready Rust crate. This is just a fun exercise.
Sorting lots of random data
Hold on, to sort lots of random data we need to start by -
Generating lots of random data
Soundtrack: Turnstile - Generator
Ok, so the first thing we need is a way to generate random data. For this project, we’ll consider a 4KiB (AKA a “page”) ASCII string a unit or block of data. This will be the resolution we’ll be working at. The source file will be composed of many 4KiB strings with no gaps or delimiters.
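In code, that unit is just a constant - later snippets reference it as `crate::BLOCK_SIZE`:

```rust
/// Our unit ("block") of data: one 4KiB page of ASCII bytes.
pub const BLOCK_SIZE: usize = 4096;
```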
Writing lots of random data *fast*
One of the Goals is to make this thing fast. So let’s think about that for a second - what does fast mean in this context? What determines the upper bound on this algorithm’s running time? It requires 3 things:
- Memory
- CPU
- IO (SSD)
If the input grows, we could add more CPU and memory, but IO speeds would remain constant - NVMe SSDs have a physical write speed limit, so we’ll probably be limited by it. Let’s see what that looks like.
How “fast” is “slow”?
There are some unofficial sources which have published benchmarks for Apple MacBook Pro SSDs, but I wanna check for myself.
To do so, I’ll use `fio` (the Flexible I/O tester, but nobody I know calls it that), a disk benchmark tool. It has a lot of knobs (it’s flexible!), and I encourage you to run `man fio` to check them out. However, we’re only interested in knowing the write rate (mebibytes per second), so this is how we’ll run it:
NOTE: The following command will create a 100GiB file in the directory where it’s run. Think before you run it!
```bash
fio --name=write \
    --ioengine=posixaio \
    --rw=write \
    --bs=1m \
    --numjobs=1 \
    --size=100g \
    --iodepth=16 \
    --end_fsync=1 \
    --direct=1
```
Flags explanation
- `--name=write`: Just a name for the fio “job”; it’s used in the output and as the name of the file it creates.
- `--ioengine=posixaio`: The IO engine for MacOS, POSIX Asynchronous IO.
- `--rw=write`: We’re only interested in measuring performance with sequential writes.
- `--bs=1m`: A block size of 1MiB. If it were smaller, `fio` would need to make lots of repeated `write` syscalls, adding a lot of noise to the test (since they’re expensive - more on this later).
- `--numjobs=1`: The number of parallel jobs performing IO (writing, in this case). Since we’re doing sequential IO, adding more jobs won’t do any good - they’ll contend for file access.
- `--size=100g`: A big file!
- `--iodepth=16`: The queue of in-flight IO operations against the file. The achievable depth is determined by a combination of the OS, the `fio` engine (`posixaio` in this case), and other factors. `fio`’s output shows a distribution of the queue size, which helps determine the maximum useful value here. If a smaller-than-possible value is used, the benchmark might spend lots of time waiting for IO operations to be acknowledged. A value of 1 would make it behave synchronously.
- `--end_fsync=1`: Only consider the test done when the file is flushed to disk.
- `--direct=1`: Use direct IO (as opposed to buffered IO). In MacOS, this is controlled by `fcntl`’s `F_NOCACHE` flag (see `man fcntl`). In Linux, this is equivalent to setting the `O_DIRECT` flag when opening a file.
I will omit most of the output, since right now we’re only looking for the average write rate, expressed by the `bw` (bandwidth) output field - in this case ~5736MiB/s (averaged over multiple runs). If you’re curious about what the rest of the output means, the official docs have a great explanation.
So now that we know the practical upper limit we can hit, we can move on to the generator itself.
Considerations
File size
Simple one. We’ll allow the file size to be configurable. However, since we’re treating a 4KiB block as our unit, the file size must be 4KiB-aligned (that is, we must check that it’s a multiple of 4KiB).
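That check is tiny; here’s a minimal sketch (`validate_size` is a name I made up for illustration):

```rust
use std::io::{Error, ErrorKind};

/// Reject sizes that aren't a positive multiple of BLOCK_SIZE (4KiB).
fn validate_size(size_bytes: usize) -> std::io::Result<()> {
    if size_bytes == 0 || size_bytes % BLOCK_SIZE != 0 {
        return Err(Error::new(
            ErrorKind::Other,
            format!("file size must be a positive multiple of {}B", BLOCK_SIZE),
        ));
    }
    Ok(())
}
```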
Memory limit
The reason for using an external sorting algorithm is not having enough memory to hold the data being sorted. Therefore, we need to take care not to go over the available memory. We’ll allow it to be configurable via a flag.
Parallelism
Generating random ASCII strings will be slow, so we’ll benefit from using multiple threads. We’ll use Tokio for that.
Saturation
The SSD must be saturated at all times, so that we don’t add artificial bottlenecks.
The code
Here’s the strategy I followed:
- Create the file. In MacOS, the file is opened with Direct IO by default. In Linux, you’ll need to be careful to open it with the `O_DIRECT` flag set.
- Spawn a “writer” thread. It will hold a handle to the open file and receive strings over a channel, which it will write to the file. Since the strings are random, it won’t do any ordering, writing them in a FIFO fashion.
- Start threads to generate the random strings, sending them over the channel. This is where memory is allocated, so we must have a way to enforce the configured limit.
Sidequest: limiting memory
To make it easier to enforce the memory limit, I wrote a simple rate-limiting token bucket. It’s initialized with a `capacity`, and threads can `take` tokens and `put` them back. If a thread tries to take a token when the bucket doesn’t have any, it will sleep until a token is put back by another thread.
The implementation isn’t perfect (e.g. it doesn’t prevent putting back more tokens than the initial capacity), but it’s enough.
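Here’s a minimal sketch of that idea using a `Mutex` and a `Condvar`, matching the `new`/`take`/`put` calls you’ll see below (the repo’s actual implementation may differ):

```rust
use std::sync::{Condvar, Mutex};

/// A blocking token bucket: `take` sleeps while the bucket is empty.
pub struct Bucket {
    tokens: Mutex<i32>,
    cond: Condvar,
}

impl Bucket {
    pub fn new(capacity: i32) -> Self {
        Bucket {
            tokens: Mutex::new(capacity),
            cond: Condvar::new(),
        }
    }

    /// Take a token, sleeping until one is available.
    pub fn take(&self) {
        let mut tokens = self.tokens.lock().unwrap();
        while *tokens == 0 {
            tokens = self.cond.wait(tokens).unwrap();
        }
        *tokens -= 1;
    }

    /// Put a token back and wake one waiting thread. Like the original,
    /// this doesn't guard against exceeding the initial capacity.
    pub fn put(&self) {
        let mut tokens = self.tokens.lock().unwrap();
        *tokens += 1;
        self.cond.notify_one();
    }
}
```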
Back to the main thing
Ok, back to random string generation. Here’s the code in full. I’ll go over each part.
First, the signature:
```rust
pub async fn generate_data(filepath: &str, size_bytes: usize, max_mem: usize) -> io::Result<()>
```
`generate_data` is an async function. It takes a `filepath` where the file will be created, the file’s size in bytes, `size_bytes`, and the program’s memory limit, `max_mem`. It returns an `io::Result` to communicate that it can fail.
```rust
if max_mem < crate::BLOCK_SIZE {
    return Err(Error::new(
        io::ErrorKind::Other,
        format!(
            "Max allowed memory must be at least {}B",
            crate::BLOCK_SIZE
        ),
    ));
}
```
The max memory must be at least 1 block (4096B) - otherwise we won’t be able to do anything! Here’s a good place to make it clear that `max_mem` doesn’t account for memory used on the stack - pointers, numbers, etc.
```rust
let file = File::create(filepath)?;
let num_cores = std::thread::available_parallelism()?.get();
let mem_per_core = max_mem / num_cores;
let b = Arc::new(bucket::Bucket::new(num_cores as i32));
let mut set = JoinSet::new();
let (tx, rx) = mpsc::channel();
```
Next up, the values used throughout the function are initialized:
- `file`: The file into which we’ll write the random strings.
- `num_cores`: The number of cores available on the CPU.
- `mem_per_core`: The maximum amount of memory (in bytes) that each thread will be able to use.
- `b`: The `Bucket` that will serve as the memory-limiting mechanism. It’s wrapped in an `Arc` so that it can be “shared” across threads.
- `set`: A Tokio `JoinSet`, which we’ll use to keep track of the threads and to join them when work is finished.
- `tx, rx`: The send and receive ends of a channel, which we’ll use to transmit strings from the “generator” threads to the “writer” thread.
About that…
```rust
let writer_handle = tokio::spawn(writer(file, b.clone(), rx));
// ...
async fn writer(
    mut file: File,
    b: Arc<Bucket>,
    rx: Receiver<String>,
) -> io::Result<()> {
    loop {
        let s = rx.recv().map_err(|e| {
            Error::new(
                ErrorKind::Other,
                format!("Could not receive from channel: {e}"),
            )
        })?;
        if s == POISON_PILL {
            return Ok(());
        }
        b.put();
        file.write_all(s.as_bytes())?;
    }
}
```
The `writer` thread writes the strings it receives over `rx` and `put`s a token back into the bucket, so any waiting generator threads are unblocked. When generation is finished, a `POISON_PILL` value is sent, signaling it to stop.
Now, onto the main loop in `generate_data`:
```rust
let mut remaining = size_bytes;
while remaining > 0 {
    let len = min(remaining, mem_per_core);
    remaining -= len;
    let tx = tx.clone();
    b.take();
    set.spawn_blocking(move || generate(tx, len));
}
```
Here, a new generator task is spawned for every chunk of string that must be generated, until there are no `remaining` bytes left. Each string will be at most `mem_per_core` bytes long. A token is taken from the bucket `b` to prevent the program from going over `max_mem`. `generate` is called, which generates a random string and sends it over `tx` so that the writer thread writes it to the file.
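For completeness, here’s a plausible sketch of what `generate` could look like, assuming rand 0.8’s `SmallRng` and `Alphanumeric` distribution (the actual code in the repo may differ):

```rust
use rand::{distributions::Alphanumeric, rngs::SmallRng, Rng, SeedableRng};
use std::io::{self, Error, ErrorKind};
use std::sync::mpsc::Sender;

// Sketch: build a random ASCII string of `len` bytes and hand it to the
// writer thread. Note the `from_entropy` syscall (relevant later).
fn generate(tx: Sender<String>, len: usize) -> io::Result<()> {
    let rng = SmallRng::from_entropy();
    let s: String = rng
        .sample_iter(&Alphanumeric)
        .take(len)
        .map(char::from)
        .collect();
    tx.send(s)
        .map_err(|e| Error::new(ErrorKind::Other, format!("Could not send to channel: {e}")))
}
```

With that, all that’s left in `generate_data` is to wrap things up: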
```rust
while let Some(res) = set.join_next().await {
    let _ = res??;
}
tx.send(String::from(POISON_PILL))
    .map_err(|e| io::Error::new(ErrorKind::Other, e))?;
writer_handle.await?
```
Finally, we join all the generator tasks, send the `POISON_PILL` message to the writer, and join it.
The moment of truth
According to the write rate reported by `fio` (5736MiB/s), just writing a 100GiB file should take around 17.9 seconds (102,400MiB ÷ 5736MiB/s ≈ 17.85s). We should expect some overhead from the random string generation, though.
I created a small CLI with the `clap` crate to make it easier to generate the file. Wrapping the call with `time` gives us:
```bash
time ./target/release/ext-sort gen --file data.txt --size 107374182400
./target/release/ext-sort gen --file data.txt --size 107374182400  176.07s user 21.10s system 924% cpu 21.323 total
```
That’s just over 3 seconds more than fio, which is a little bit of overhead. If you see any opportunities to reduce it, please let me know over at Twitter or Bluesky!
On the memory side, I set the default value for `max_mem` to 2GiB. Let’s see if this holds.
I used `htop` to get a snapshot of the CPU and memory before and during a run (I should just do this all on Linux already, with `free` and `heaptrack`).
5.67GiB - only Terminal (and 1000 other MacOS processes) running.
7.40GiB! A little bit under 2GiB over the baseline on average. And look at all those busy CPU cores. Looks like it’s working as intended - although the write speed looks like it could be better. I’ll try to figure it out later.
The real friends were the walls we crashed into along the way
Soundtrack: Turnstile - Endless
When I hit a wall I gotta BREAK. IN.
Before we say goodbye, here are some things I ran into, which could be useful for you too.
Always use --profile release
… before you test performance. `cargo build` on its own doesn’t apply all possible optimizations, in order to keep compilation fast. The result is abysmal:
```bash
cargo build
time ./target/debug/ext-sort gen --file data.txt --size 107374182400
./target/debug/ext-sort gen --file data.txt --size 107374182400  1825.21s user 28.37s system 963% cpu 3:12.48 total
```
3:12.48! That’s over 9x more than with the release build.
JoinSet::spawn_blocking vs JoinSet::spawn
Using `spawn_blocking` yields slightly faster run times (~600ms faster for a 100GiB file) than `spawn` in this case. This caught me off-guard at first but, upon inspection, it probably happens because in `generate`, `SmallRng::from_entropy()` issues a system call to `getentropy` (`getrandom` in Linux). A blocking syscall inside a regular `spawn` task ties up one of Tokio’s async worker threads, while `spawn_blocking` moves the work to the dedicated blocking thread pool.
File::write vs File::write_all
The `write` docstring says it clearly:

> This function will attempt to write the entire contents of `buf`, but the entire write might not succeed […]
So if you wanna make sure the whole buffer is written to the file, call `write_all` (or do what `write_all` does, which is to call `write` as many times as needed).
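For illustration, here’s roughly what that retry loop looks like (a simplified sketch of `write_all`’s logic; `write_fully` is a made-up name):

```rust
use std::io::{self, Write};

// Keep calling `write` until the whole buffer has been written.
fn write_fully<W: Write>(w: &mut W, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match w.write(buf) {
            // A write of 0 bytes means we can't make progress.
            Ok(0) => {
                return Err(io::Error::new(
                    io::ErrorKind::WriteZero,
                    "failed to write whole buffer",
                ))
            }
            Ok(n) => buf = &buf[n..],
            // Interrupted writes are safe to retry.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```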
Goodbye for now
This was the first in a series of posts. Next up, I’ll cover the actual sorting. The rest of the algorithm is already “finished” in the repo, but there are things I still wanna experiment with before I write a post about it.
I’m also thinking about doing some variations on all this, such as a Linux version using Glommio. We’ll see!
Please let me know if I missed anything that could make the code better in any way, and I’ll buy you some coffee if we ever meet.