Safety and Performance - Threadsafe Datastructures in C++


Multithreading is often misunderstood, even by senior programmers. I do a lot of interviewing and I have noticed that a large proportion of candidates have only vague notions of how to make a completely thread-safe datastructure.

Of course, the safest threaded code is code that does not share any data between threads at all. Sadly, for most problems it is impossible to avoid sharing data entirely, so we have to deal with concurrent access somehow.

This post demonstrates three simple techniques and is not the last word on the subject. Rather it is a general overview of the hows and whys of basic thread safety. The code examples are in C++ but the general techniques apply to any language that allows threading.

If you want to follow along at home see the corresponding GitHub Repo.

The Database

Let us start with a very simple class:

// A simple "database" with no locking that is dangerous to use across multiple threads
// The sleep_for statements simulate a more complex operation
class Database
{
public:
    Database()
    {
        std::fill(data_.begin(), data_.end(), 0);
    }

    // Obtain a value
    int64_t readValue(size_t index) const
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        return data_.at(index);
    }

    // Add a delta to a value
    void updateValue(size_t index, int64_t value)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        data_.at(index) += value;
    }

    // used for testing
    bool isAllZero()
    {
        return std::all_of(data_.begin(), data_.end(), [](const int64_t &x) -> bool
                           { return x == 0; });
    }

    static constexpr size_t DATABASE_SIZE = 10;
    static constexpr auto DATABASE_TYPE = "non-threadsafe database";

    friend std::ostream &operator<<(std::ostream &os, const Database &db);

protected:
    std::array<int64_t, DATABASE_SIZE> data_;
};

I hope this is somewhat self-explanatory. Database is a simple class (it hardly warrants the name) that wraps a std::array of 10 values. Users of this class can read and update values randomly without locking. The operations are much simpler than those of a real database, so I have added some sleep_for() statements to simulate a more complex datastructure (updates arbitrarily take 5 times longer than reads).

For the purposes of this blog post I am going to load up the database with two kinds of "clients", each running on its own thread: 100 threads performing updates and 1000 threads performing reads.

The idea here is that I am simulating a framework where reads are frequent and fairly quick, while updates are much rarer but take much longer. This is a fairly common situation in client-server applications.

template <typename DBTYPE>
void testDatabase()
{
    DBTYPE db;

    std::vector<std::thread> threads;

    auto startTime = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < 100; ++i)
    {
        // Snapshot the size before launching: calling threads.size() inside
        // the lambda would race with emplace_back on the main thread
        const size_t threadId = threads.size();
        threads.emplace_back([&, threadId]()
                             {
            updateAllInRandomOrder(db, 25, threadId);
            updateAllInRandomOrder(db, -40, threadId);
            updateAllInRandomOrder(db, 15, threadId); });
    }

    for (size_t i = 0; i < 1000; ++i)
    {
        const size_t threadId = threads.size();
        threads.emplace_back([&, threadId]()
                             { readAllInRandomOrder(db, threadId); });
    }

    std::for_each(threads.begin(), threads.end(), [&](std::thread &t)
                  {
        if (t.joinable())
        {
            t.join();
        } });

    auto endTime = std::chrono::high_resolution_clock::now();

    std::cout << "Results for " << DBTYPE::DATABASE_TYPE << ":\n"
              << "  Elapsed Time:      " << std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count() << "ms\n"
              << "  Database Contents: " << db << "\n"
              << "  All Zero:          " << (db.isAllZero() ? "Pass" : "FAILED!!!") << std::endl;
}

Let's see how it behaves on my M1 MacBook Pro:

Results for non-threadsafe database:
  Elapsed Time:      558ms
  Database Contents: -25 40 0 0 0 0 0 0 0 0 
  All Zero:          FAILED!!!

Here you can see that the non-threadsafe database ran its 1100 threads very quickly but failed the test due to unsafe updates (you will get different results if you run this locally, but it will almost certainly still fail). Because all the updates eventually sum to zero, any non-zero element in the array means a failed test (and data corruption in the real world!)

Why Is This Not Threadsafe?

There are two issues, one obvious and one less so.

The obvious one is in updateValue: data_.at(index) += value is a read-modify-write. If two threads interleave, both can read the same old value, and whichever writes back last silently discards the other thread's update.

The less obvious one is that even plain reads are unsafe. A read that runs concurrently with an unsynchronized write is a data race, which is undefined behaviour in C++ - the reader may observe a torn or stale value. This is why readValue needs protection too, not just updateValue.

Either way, data corruption is always possible.

Making the Database Threadsafe

The easiest way to make a datastructure threadsafe is to introduce a std::mutex.

// A database with simple locking using a mutex
// Threadsafe but slower
class SingleMutexDatabase : public Database
{
public:
    SingleMutexDatabase() : Database() {}
    int64_t readValue(size_t index) const
    {
        std::lock_guard<std::mutex> lock{mutex_};
        return Database::readValue(index);
    }

    void updateValue(size_t index, int64_t value)
    {
        std::lock_guard<std::mutex> lock{mutex_};
        Database::updateValue(index, value);
    }

    static constexpr auto DATABASE_TYPE = "single mutex database";

private:
    mutable std::mutex mutex_;
};

Nice and simple. Note that both the readValue and the updateValue methods need to be protected due to the issue discussed in the previous section.

Let's see how it works out:

Results for single mutex database:
  Elapsed Time:      31301ms
  Database Contents: 0 0 0 0 0 0 0 0 0 0 
  All Zero:          Pass

Nice. No data corruption. On the minus side, the operations took over 50 times longer!

Maybe that is acceptable - after all it is better to be slowly correct than speedily wrong. 95% of threadsafe datastructures I see in the wild rely on a simple mutex just like this and perform just fine.

Remember, this example is contrived to expose the performance problems. If performance meets requirements in the real world then we can stop here. But the mutex clearly puts a limit on the speed at which we can handle clients accessing our database. We might be able to do better.

Shared Mutex

Recall that we have a small number of clients writing to the database with a larger number of clients just reading. C++17 provides a specialized mutex for just this situation - std::shared_mutex.

// A database using a shared_mutex allowing multiple reads at the same time
class SharedMutexDatabase : public Database
{
public:
    SharedMutexDatabase() : Database() {}
    int64_t readValue(size_t index) const
    {
        // reads can occur simultaneously under a shared_lock
        std::shared_lock lock{shared_mutex_};
        return Database::readValue(index);
    }

    void updateValue(size_t index, int64_t value)
    {
        // writes require a unique_lock
        std::unique_lock lock{shared_mutex_};
        Database::updateValue(index, value);
    }

    static constexpr auto DATABASE_TYPE = "shared mutex database";

private:
    mutable std::shared_mutex shared_mutex_;
};

shared_mutex implements what is sometimes called a reader/writer lock. Multiple readers can lock the mutex with shared_lock and operate in the knowledge that no writer thread can change the datastructure. When a writer obtains the lock with std::unique_lock, the reader threads and all other writers are blocked.

This is exactly what we want - the readers can all read in parallel, only blocked during the relatively rare updates.

How does it perform?

Results for shared mutex database:
  Elapsed Time:      24058ms
  Database Contents: 0 0 0 0 0 0 0 0 0 0 
  All Zero:          Pass

A significant improvement, only 43 times slower than not locking.

And here we come to the major problem with shared_mutex. Although in theory it should lead to better performance, much of the gain is erased by the very significant overhead that shared_mutex introduces. I had to play around with the operations in my test program to tilt the numbers in shared_mutex's favor - for many workloads it was significantly slower than even a plain mutex.

That said, shared_mutex has its uses and will shine when writes are very infrequent and very expensive.

Breaking Up The Mutex

A more generally applicable way of getting better performance is to break up the mutex. It works well here because the entries in our database are completely independent, so having multiple mutexes each protecting part of the datastructure makes sense.

// A database where access is controlled by multiple mutexes allowing
// some degree of shared access
class SplitMutexDatabase : public Database
{
public:
    SplitMutexDatabase() : Database() {}
    int64_t readValue(size_t index) const
    {
        std::lock_guard<std::mutex> lock{
            mutexes_[index % mutexes_.size()]};
        return Database::readValue(index);
    }

    void updateValue(size_t index, int64_t value)
    {
        std::lock_guard<std::mutex> lock{
            mutexes_[index % mutexes_.size()]};
        Database::updateValue(index, value);
    }

    static constexpr auto DATABASE_TYPE = "split mutex database";

private:
    mutable std::array<std::mutex, DATABASE_SIZE / 2> mutexes_;
};

Instead of a single mutex, SplitMutexDatabase creates 5 mutexes each protecting two entries. In this case, the first mutex is locked for entries 0 and 5, the second for entries 1 and 6, etc.

There is a tradeoff between the performance gains for not having to lock the whole datastructure for every operation and the overhead of storing multiple mutexes. The optimum ratio of mutexes to entries is very dependent on your datastructure and access pattern.

Results for split mutex database:
  Elapsed Time:      6822ms
  Database Contents: 0 0 0 0 0 0 0 0 0 0 
  All Zero:          Pass

Still much slower than no locking but we have regained much of the performance lost by the single mutex and are now just 12 times slower.

Other Approaches to Thread Safety

There are other (more invasive) approaches to this problem. This is a non-exhaustive survey.

Atomic operations

Occasionally you can make every operation on your datastructure atomic, in which case you can get away with no explicit locking at all. Only the simplest datastructures can be handled this way though.

Copy-on-write

If you can tolerate reads not seeing the absolute latest information, and only ever have a single thread writing, then you can implement a copy-on-write datastructure. In this approach, your internal datastructure is accessed indirectly via a shared_ptr and is always const. Multiple readers can access it without locking. When the writer thread needs to make a change, it copies the entire internal datastructure, modifies the copy, and updates the shared_ptr to point to the new version.

This approach still requires a mutex, but it only needs to be held while a reader thread grabs the pointer to the internal datastructure and while the writer thread swaps in the new version. Note that assigning to a shared_ptr is not threadsafe while other threads are reading that same shared_ptr object (C++20 adds std::atomic<std::shared_ptr> for exactly this).

Copy-on-write works well when the datastructure can be copied easily, writes are infrequent but slow, and only a single thread can ever perform a write. None of the readers need to wait for the slow updates to finish.

If this applies then copy-on-write can provide a huge performance gain at the expense of more complex code.

Actors

Sometimes it is convenient not to lock the datastructure at all. In this approach, all operations are redirected to a queue that is processed by a single thread managed by the datastructure itself. This has significant overhead and defeats some of the purpose of multithreading, since only one thread ever accesses the datastructure.

Actors are useful if you have complex access patterns as implementing an actor can greatly simplify your code and ensure correctness.

Combinations

Some of these techniques can be used in combination. For instance, a copy-on-write database could allow reads from multiple threads but queue up writes on a single thread using a task queue like an Actor.