Deeper Look at Concurrency With Immutable Data

Immutable data is often cited as a way to solve problems when several threads access the same data in concurrent programming. In a recent article I included immutable data as a technique that prevents a whole class of bugs. Most articles on immutable data and concurrency are thin on facts though, so I will explore the nitty-gritty details of the topic.

Concurrency is easy if threads do not share data. As Brian Goetz says:

There are three ways to safely handle state ...

  • Don't share
  • Don't mutate
  • Coordinate access

The first two are far easier to get right than the third ...

Not sharing is the easiest way to safely handle state, but we go with 'don't mutate' here and this is where immutable data comes in.

We start by looking into common pitfalls of concurrency, then take a look into immutable data and data structures, explain how those solve the common concurrency pitfalls and finally use immutable data structures for concurrency.

Concurrency pitfalls

Let's start with common pitfalls when doing concurrency. Suppose we have an Employee class with

Employee {
   firstName: string
   lastName: string
   salary: int
   group: string
}

and a List of all employees. Now we want to change data from within two or more concurrent threads.

Dirty reads

When one thread (T1) reads the data and another thread (T2) writes to the data, the reading thread (T1) can get dirty reads: T2 changes the salary to one value and then, because of some later decision, to another value.

For example T2 changes salaries in two steps.

Employee e = ....
// read and write salary
e.salary = e.salary * 1.10
...
// read and write salary
if (e.salary < 10000) {
   e.salary = e.salary + 1000
}

In the first step it updates an employee's salary, giving everyone a 10% raise. Then, in a second step, if the salary is less than $10000 it adds another $1000. When T1 reads the salary in between these two changes it doesn't read the final value but a dirty (incorrect) value.
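
This interleaving can be replayed deterministically. The following Python sketch (the Employee class is illustrative, not from any library) performs T2's two steps with T1's read in between:

```python
from dataclasses import dataclass

# Hypothetical mutable Employee, mirroring the article's example.
@dataclass
class Employee:
    salary: int

e = Employee(salary=9000)

# T2, step one: give a 10% raise.
e.salary = int(e.salary * 1.10)   # salary is now 9900

# T1 reads here, between T2's two steps: a dirty read.
dirty_read = e.salary

# T2, step two: top up salaries below 10000.
if e.salary < 10000:
    e.salary = e.salary + 1000    # final salary is 10900

print(dirty_read, e.salary)       # 9900 10900
```

T1 acts on 9900, a value that never was a final, intended salary.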

Lost update

If two threads (T1,T2) write to the data, one of the changes might be lost. T1 copies an employee from the list and changes the firstName and lastName, then replaces the employee in the list. Meanwhile T2 also copies that employee and changes the lastName, then replaces the employee in the list. Now the firstName change and the lastName change from T1 are no longer there. This is no problem for lastName, as T2 would have overwritten it either way. But the update to firstName is lost.

This happens whenever a thread works on a copy, e.g. reading firstName into a string and then writing it back without changes. As immutable data structures use copying, this is also relevant to them (see below). Lost updates are also the classic concurrency example where two threads update a shared counter and some of the increments are lost, with the counter in the end not reflecting all of them.
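
The counter variant can be replayed deterministically as well. In this Python sketch both "threads" read the counter before either writes back:

```python
# Deterministic replay of the classic lost-update counter:
# both threads read the counter before either writes it back.
counter = 0

t1_read = counter      # T1 reads 0
t2_read = counter      # T2 reads 0

counter = t1_read + 1  # T1 writes back 1
counter = t2_read + 1  # T2 writes back 1 -- T1's increment is lost

print(counter)         # 1, although two increments happened
```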

Unrepeatable reads

If one thread (T1) reads the data and another thread (T2) writes to the data, the reading thread can get unrepeatable reads: if T1 reads the salary and reads it again after the write by T2, it gets two different values. Suppose T1 makes some decisions based on the first read and some decisions based on the second read; those decisions will be inconsistent.

// T1
Employee e = ...

if (e.salary > 120000) {
   doSomething(e)
                                <-- T2 writes
} else if (e.salary > 100000) {
   doSomethingElse(e)
}

Here, if T2 changes the value between the two reads of e.salary, the decision is wrong because e.salary does not provide repeatable reads.
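
A deterministic Python replay of the same effect; the thresholds and the decide helper are illustrative, not from the original code:

```python
from dataclasses import dataclass

@dataclass
class Employee:
    salary: int

def decide(first_read: int, second_read: int) -> str:
    # Two separate reads of the salary, assumed (wrongly) to be
    # consistent with each other.
    if first_read > 120000:
        return "doSomething"
    elif second_read > 100000:
        return "doSomethingElse"
    return "nothing"

e = Employee(salary=110000)
first = e.salary       # T1's first read: 110000
e.salary = 130000      # T2 writes in between
second = e.salary      # T1's second read: 130000

decision = decide(first, second)     # "doSomethingElse"
consistent = decide(second, second)  # "doSomething" for the real value
print(decision, consistent)
```

Based on the two reads T1 takes the doSomethingElse branch, although for the actual salary of 130000 it should have taken doSomething.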

Read inconsistent states

If two threads (T1,T2) write to the data, this can lead to inconsistent states. Suppose T1 writes the salary and then, if the new salary is in the top 10%, writes 'top-earner' to group, while T2 writes to salary, reducing it. Either the salary should be high and the group 'top-earner', or the salary should be low and the group unchanged. With the mixed changes the state is inconsistent.

T1                         T2
Gets employee
                            Gets employee
Writes top-earner
                            Reduces salary

Coordinate data access

Before we look at how to prevent these problems with immutable data, let's look at the classic way. The classic way with mutable data to prevent dirty reads, inconsistent data and lost updates is to coordinate access, often with locks.

To change an employee's salary safely with two threads (T1,T2) we would acquire a lock to protect our operations, execute the operations and then free the lock afterwards.

Employee e = ...
Lock lock = lock(e)

e.salary = e.salary * 1.10
...
// later
if (e.salary < 10000) {
   e.salary = e.salary + 1000
}
lock.free()

During the time we have the lock, no other thread can get a lock on employee e. If T1 executes the code and gets the lock, T2 will wait in the Lock lock = lock(e) line until T1 frees its lock. Then T2 can proceed.
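
In Python the same coordination could look like this, using threading.Lock; the salary figures mirror the earlier example and the Employee class is illustrative:

```python
import threading

class Employee:
    def __init__(self, salary: int):
        self.salary = salary

e = Employee(salary=9000)
lock = threading.Lock()

def raise_and_top_up(emp: Employee) -> None:
    # Both steps happen under one lock, so no other thread can
    # interleave with (or observe) the intermediate salary.
    with lock:
        emp.salary = int(emp.salary * 1.10)
        if emp.salary < 10000:
            emp.salary = emp.salary + 1000

threads = [threading.Thread(target=raise_and_top_up, args=(e,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(e.salary)  # 11990, whichever thread runs first
```

Because the two-step update is one critical section, the result is the same no matter which thread wins the lock first.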

This way our operations are coordinated and the concurrency problems from above can't occur. Locks are often a source of bugs though. If threads lock several resources in different orders, two threads can easily block each other because each waits for the other's lock; this is called a deadlock. It's also easy to forget one code path that accesses shared data and fail to protect it properly with locks. All in all, coordinating access with locks is error prone and needs a lot of attention.

What is immutable data or persistent data structures?

What is immutable data and how does it help? Immutable data means the data cannot be changed after creation. As data that can never change is not very useful outside of analytics, there needs to be a way to change it: to 'change' immutable data, it is copied and the changes are applied to the copy. So anyone holding a reference to the original data can be sure no one will change it.

With mutable data we would change an employee with

// get employee reference e
// from somewhere then change
// the data
Employee e = ...
e.firstName = 'First'
e.lastName = 'Last'
// Everyone sees these changes!

With immutable data we would get a reference and then create a copy because we cannot change the employee directly as in the former example.

// get employee reference e
//
Employee e = ...
Employee newE = e.copy({
   firstName: 'First',
   lastName: 'Last'
});
// No one sees changes

where all the fields we do not change (salary, group) are copied and lastName and firstName are changed.
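
In Python this copy-based 'change' can be sketched with frozen dataclasses, where dataclasses.replace plays the role of the copy method and keeps all unchanged fields:

```python
from dataclasses import dataclass, replace

# Immutable Employee: frozen=True makes field assignment an error.
@dataclass(frozen=True)
class Employee:
    firstName: str
    lastName: str
    salary: int
    group: str

e = Employee("Ada", "Lovelace", 9000, "engineering")

# "Changing" the employee creates a modified copy; e is untouched.
new_e = replace(e, firstName="First", lastName="Last")

print(e.firstName, new_e.firstName)   # Ada First
```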

You might wonder how one would change nested objects. When we make Salary a class with

Salary {
   value: int
   currency: string
}

Employee {
   ...
   salary: Salary
}

we can no longer easily change the salary value. We would need to copy Employee and then copy Salary.

Employee e = ...
e.copy({
   salary: e.salary.copy(
       { value: 10000 }
   )}
)
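
With frozen dataclasses the nested case looks like this; note the doubled replace, one per level of nesting:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Salary:
    value: int
    currency: str

@dataclass(frozen=True)
class Employee:
    firstName: str
    salary: Salary

e = Employee("Ada", Salary(9000, "USD"))

# Copy the outer object, and inside that copy, copy the inner one.
new_e = replace(e, salary=replace(e.salary, value=10000))

print(e.salary.value, new_e.salary.value)  # 9000 10000
```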

With deeper nested objects this creates a lot of unnecessary code. Some people found a way around it by inventing lenses. A lens is essentially an object with a reference to the attribute you are changing. You can think of a lens as a getter and setter that focuses on some detail inside a nested data structure.

Employee e = ...
// get reference to attribute
Lens l = Lens<Employee>(_.salary)
// change value of salary to 10000
Employee newE = l.set(e, 10000)

This does all the copying of the immutable data from above and changes the salary. Lenses work for deeply nested objects without the boilerplate.
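
No standard library ships this Lens type; here is a minimal sketch of the idea in Python, built on frozen dataclasses (all names are illustrative):

```python
from dataclasses import dataclass, replace
from typing import Any, Callable

@dataclass(frozen=True)
class Lens:
    # A lens pairs a getter with a copying setter for one attribute.
    get: Callable[[Any], Any]
    set: Callable[[Any, Any], Any]

def field_lens(name: str) -> Lens:
    return Lens(
        get=lambda obj: getattr(obj, name),
        set=lambda obj, value: replace(obj, **{name: value}),
    )

def compose(outer: Lens, inner: Lens) -> Lens:
    # Focus through outer, then inner: setting copies both levels.
    return Lens(
        get=lambda obj: inner.get(outer.get(obj)),
        set=lambda obj, value: outer.set(obj, inner.set(outer.get(obj), value)),
    )

@dataclass(frozen=True)
class Salary:
    value: int

@dataclass(frozen=True)
class Employee:
    salary: Salary

salary_value = compose(field_lens("salary"), field_lens("value"))

e = Employee(Salary(9000))
new_e = salary_value.set(e, 10000)   # copies Employee and Salary

print(salary_value.get(new_e))       # 10000
```

Composing field lenses gives one setter for an arbitrarily deep path, which is exactly the boilerplate the manual nested copies required.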

Immutable data structures

Now we can work with objects and nested objects of immutable data, but it gets more complex with data structures. We stick to our copy strategy to change immutable data. If we work with a list of employees we can create a copy of the list and hand out that copy to threads.

Threads can modify the list and its items without any of the concurrency problems if each one has its own copy. As copying lists or maps is memory intensive and bad for performance, developers use something else, called persistent data structures. If you modify a persistent data structure, the part you modify is changed only for you; all others that hold a reference to the data do not see the changes. They are called persistent because the parts that do not change are shared between the old and the new data structure and kept around.

We can also call these data structures copy-on-write (COW) data structures, and sometimes they are called purely functional data structures. I prefer copy-on-write because any developer will immediately understand what it is about, but I will use the other names interchangeably.

The simplest persistent data structure is a list where data can only be added to and removed from the head. One can represent this list with a recursive data structure where a List is constructed from an element and another List, the rest. That rest is again constructed from an element and another list. This goes on until all elements are represented in the data structure.

Take a list built by adding the items 1, 2 and then 3, represented as List(3, List(2, List(1, NIL))) (this is very similar to a singly linked list). Drawn with the most recently added element at the bottom, the list looks like

1
|
2
|
3

Now two threads (T1, T2) have a reference to that list.

1
|
2
|
3 <-- T1, T2

When T2 adds 4 to the list then it gets a reference to a new List List(4, originalList) while T1 still holds a reference to the original list.

1
|
2
|
3 <-- T1
|
4 <-- T2

When T1 adds 5 to its list, this results in List(5, originalList). Now the list has two ends, one for T1 and one for T2, while both share the unchanged original list.

1
|
2
|
3-----------
|          |
4 <-- T2   5 <-- T1

When T1 adds 6 it will look like this

1
|
2
|
3-----------
|          |
4 <-- T2   5
           |
           6 <-- T1

So each thread doesn't see the changes the other one made.

Removing an element means dropping it and returning the inner list. When T2 removes 4 from List(4, rest) (T1 doesn't even know about 4, and remember we can only add and remove at the most recently added end), it gets back rest, which is List(3, List(2, List(1, NIL)))

1
|
2
|
3 <-- T2
|
5
|
6 <-- T1

T2 can then remove 3 from its reference to List(3, rest) and gets

1
|
2 <-- T2
|
3
|
5
|
6 <-- T1

So both threads add and remove elements from the list without the other noticing.

If both T1 and T2 get a reference to 1-2-3 they can concurrently add to and remove from the list without most of the problems mentioned above. And all of this without the memory or performance overhead of copying the complete list every time it is handed to T1 and T2. The downside of our list is that we can only add or remove at one end, not at the other end or in the middle.
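
A minimal Python sketch of such a persistent cons list; the identity check at the end shows that both threads share the unchanged original instead of copying it:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class List:
    head: int
    rest: Optional["List"]

# The shared original list, built by adding 1, then 2, then 3.
original = List(3, List(2, List(1, None)))

# T2 adds 4 and T1 adds 5: each gets a new head cell,
# and both share the unchanged original list as the rest.
t2_list = List(4, original)
t1_list = List(5, original)

# Removing an element just returns the inner list.
t2_after_remove = t2_list.rest

print(t2_after_remove is original)   # True: nothing was copied
```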

Because this is the way Lisp (PDF) implements lists, persistent data structures were invented by accident; only later did people recognize their benefits for concurrent programming. The original idea of Lisp lists was driven by recursion. The paper by John McCarthy was called "Recursive Functions of Symbolic Expressions".

Immutable trees and maps

While the list is the simplest immutable model, this idea of copy-on-write data structures can be extended to many more. For example there exist copy-on-write data structures for maps. How is it done for maps (dictionaries)?

It's easy to see how to do it for a tree, so let's start with a tree. We have a simple tree where we store the data A, Ba, Be, Ca. It has a root node, an inner node (B) and leaves with the data A, Ba, Be, Ca.

root
|
|--> A
|
|--> (B)
|    |-> Ba
|    |-> Be
|
|--> Ca

Now we want to insert Cb. With a mutable version of our tree this results in a modified tree.

root
|
|--> A
|
|--> (B)
|    |-> Ba
|    |-> Be
|
|--> (C)
    |-> Ca
    |-> Cb

How would we make this persistent and immutable? The same way we did with the list: we share as much of the structure with the old tree as possible and only create the nodes that need to change for the new tree.

From the original tree

root
|
|--> A
|
|--> (B)
|    |-> Ba
|    |-> Be
|
|--> Ca

we create new nodes and reuse the old nodes that haven't changed. The new nodes are only those on the path to the new leaf Cb, in this case newRoot and (C).

T1 -> root        newRoot <- T2
      |                 |
      |---> A <---------|
      |                 |
      |-> (B) <---------|
      |   |-> Ba        |
      |   |-> Be        |
      |           (C) <-|
      |--> Ca <---|
           Cb <---|

Now our new tree has the Cb value stored without copying the whole tree and without changing the tree. If we have two threads (T1,T2) holding a reference to the tree and T2 inserts Cb then T1 doesn't see the change by T2.
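
This path copying can be sketched with a binary search tree (simpler than the n-ary tree above, but the same idea); only nodes on the path down to the new leaf are recreated, all other subtrees are shared:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(node: Optional[Node], key: str) -> Node:
    # Path copying: only the nodes on the way down to the new
    # leaf are recreated; untouched subtrees are shared as-is.
    if node is None:
        return Node(key)
    if key < node.key:
        return replace(node, left=insert(node.left, key))
    return replace(node, right=insert(node.right, key))

root = Node("Ba", left=Node("A"), right=Node("Ca"))
new_root = insert(root, "Cb")   # seen only by whoever holds new_root

print(new_root.left is root.left)   # True: the left subtree is shared
```

A thread that still holds root sees the tree without Cb; a thread that holds new_root sees it with Cb, and both share everything off the insertion path.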

Immutable maps (dictionaries) work the same way: they expose a map API but store the data internally in an immutable tree. For example Hash Array Mapped Tries (HAMTs) are trees used for maps that cleverly store the data in a way that leads to effectively constant lookup times. This way they can be used as map implementations.

Downsides of persistent data structures

Persistent data structures seem to have only benefits, so why doesn't every programming language use them by default? Because they have downsides that some consider important enough not to use them. The main downsides are memory usage, performance and specialization. Persistent (copy-on-write) data structures usually have higher memory usage than their mutable (change-in-place) cousins. Besides that, their performance is lower, as anyone using Scala has found out. And in the world of mutable data structures there are many specialized ones, as writing one is easy, while the number of COW data structures is much more limited.

That said, we've built a large marketplace with hundreds of thousands of users in Scala. We never had problems with memory or performance, customers were happy and we ranked very well on Google in a highly competitive space. In the context of web applications, where lists and maps are most often small, e.g. fewer than 100 elements, the performance and memory usage of COW structures doesn't matter. When developers get into larger data (hundreds of thousands of list items), this might become an issue for them.

How does immutable data help with concurrency?

How is 'Don't mutate' better than 'Coordinate access'? After our deep dive into immutable data and persistent data structures, how does immutable data help? In the case of one writer and many readers, immutable data makes locks unnecessary. Readers (T1,T2) get a reference to the data structure and then work on it. The one writer (T3) gets a reference to the data structure and changes it. Neither T1 nor T2 sees the changes; they still work on their copy of the data. There will be no inconsistencies or dirty reads. As the data doesn't change for them, there will also be no unrepeatable reads.

When the writer T3 is finished, it updates the shared reference to the data structure. If another thread now gets the reference, it will get the new data. The change happens atomically, so later threads do not get inconsistent data, dirty reads or partial changes.

It's a little bit more complicated with many writers. The first steps are the same, but if each writer changes the shared reference to the data structure to share its changes, later changes overwrite earlier ones. With writers T1 and T2, suppose T1 removes an employee from the persistent copy-on-write list. At the same time T2 adds an employee.

T1                  T2
Gets ref to list
                   Gets ref to list
Removes employee
                   Adds employee
Writes back ref
                   Writes back ref

Because both change operations create a new list, and changes by one thread can't be seen by the other, one set of changes is lost. In this case the removal of the employee is lost and we get a lost update anomaly.

The same happens with changing firstName and lastName

T1                    T2
Gets employee
                     Gets same employee
Sets firstName
                     Sets lastName
Writes back ref
                     Writes back ref

As setting lastName involves copying the data - including firstName - T2 writing back its reference creates a lost update anomaly: the firstName change from T1 is lost.

There are several ways to prevent lost updates with persistent data structures.

  • One thread to aggregate data

The lost update happens due to multiple writers, so we can replace the multiple writers with a single one. When T1 and T2 change the first and last name of an employee and write the shared reference, we get a lost update as described above. If instead T1 and T2 only describe their changes, e.g. as

T1: { firstName: "New first name" }
T2: { lastName: "New last name" }

and then pass those descriptions on to a third thread T3, which consolidates the changes into the employee

// T3
Employee e = ...
for (c in changes) {
   e = e.copy(c)
}

(which unrolled results in)

// T3
Employee e = ...
Employee t1 = e.copy({
   firstName: 'New first name'
});
Employee t2 = t1.copy({
   lastName: 'New last name'
});

and stores the shared reference, no update is lost. As we only have one writer in this architecture, we no longer have lost updates. This model is often used with actor based concurrency. This one thread is no longer multi-threaded though and can become a bottleneck in a multi-threaded application (this thread in essence holds an exclusive write lock or acts as a barrier). To get around this problem one needs to partition the data and have an aggregating thread for each partition (e.g. partition by class employees/customers/... or partition by employee id). Also, with map-reduce algorithms several aggregating threads work on different stages.

The model with one writing thread is implemented in languages and frameworks with actor based and message passing concurrency like Akka for Scala or Erlang. There one actor has exclusive write access to the data. Doesn't this exclusive write access make immutable data structures moot? No: you can still create the changes in multiple threads and safely read a shared data structure - while with mutable data structures you would need a read lock.
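
The one-writer model can be sketched in Python with a queue.Queue standing in for the actor's mailbox; the field names mirror the example above and the setup is illustrative:

```python
import queue
import threading
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Employee:
    firstName: str
    lastName: str

changes: "queue.Queue[dict]" = queue.Queue()

# T1 and T2 do not write the employee; they only describe changes.
t1 = threading.Thread(target=changes.put, args=({"firstName": "New first name"},))
t2 = threading.Thread(target=changes.put, args=({"lastName": "New last name"},))
t1.start(); t2.start(); t1.join(); t2.join()

# The single writer T3 consolidates the changes into the employee.
e = Employee("Old first", "Old last")
while not changes.empty():
    e = replace(e, **changes.get())

print(e.firstName, e.lastName)
```

Because only one thread ever writes the employee, both described changes are applied and neither is lost.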

  • Use atomic references

The problem with two writers is that they both update the same shared reference and overwrite each other's result. One way to fix this is to check whether the reference has been changed before we write it.

Employee initial = ref.get()
Employee newE = initial.copy({
   firstName: 'First',
   lastName: 'Last'
});
// check if the reference has been changed
// if not, we set our new reference
if (ref.get() == initial) {
   ref.set(newE)
}

Fine, so we do not overwrite changes. But what if someone else changed the data and we still want to apply our change? We try until we succeed.

bool changed = false
while (! changed) {
   Employee initial = ref.get()
   Employee newE = initial.copy({
      firstName: 'New First',
      lastName: 'New Last'
   });
   if (ref.get() == initial) {
      ref.set(newE)
      changed = true
   }
}

If we could not change the reference, we get the data again, make the changes again and then try to set the reference again. We repeat this until we succeed. This looks inefficient compared to surrounding the employee change with a lock, but it is often more performant than protecting data access with locks.

Sadly our if (ref.get() == initial) {...} block is not thread safe: different threads might change the employee reference between the check and the set. We could use a lock to secure the if block, but there is a faster and easier way called an atomic reference. With an atomic reference, testing whether the reference has changed and setting the new reference if it hasn't is one atomic, thread safe operation. It's supported natively on most CPU architectures and therefore very fast.

shared AtomicReference<Employee> ref =
   new AtomicReference<Employee>()

Employee initial = ref.get()
Employee newE = initial.copy({
   firstName: 'First',
   lastName: 'Last'
});
bool success = ref.compareAndSet(
   initial,
   newE
);

The compareAndSet method checks whether the reference still points to the initial value or was changed by someone else. If it has changed, compareAndSet returns false. If the reference hasn't changed, it sets the reference to newE and returns true for success.
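
Python has no built-in AtomicReference, so this sketch emulates one with a lock (in Java you would use java.util.concurrent.atomic.AtomicReference); the retry loop then applies copy-based changes until its compare-and-set wins:

```python
import threading
from dataclasses import dataclass, replace

class AtomicReference:
    # Sketch only: compare_and_set does the check and the set
    # as one step, guarded by an internal lock.
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new) -> bool:
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

@dataclass(frozen=True)
class Employee:
    salary: int

ref = AtomicReference(Employee(salary=0))

def give_raises(amount: int, times: int) -> None:
    for _ in range(times):
        while True:  # retry until our compare-and-set wins
            initial = ref.get()
            new_e = replace(initial, salary=initial.salary + amount)
            if ref.compare_and_set(initial, new_e):
                break

threads = [threading.Thread(target=give_raises, args=(1, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(ref.get().salary)  # 4000: no update was lost
```

Four threads perform 1000 copy-and-swap increments each, and because a losing compare-and-set triggers a retry on fresh data, the final salary reflects every update.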

Conclusion

There are three ways to safely handle state: don't share, don't mutate and coordinate access. As the third option is error prone and not sharing is often not an option, we've looked in detail at how immutable data solves the problems arising from concurrency. Using immutable data with one writer and many readers is trivial. For the remaining lost update problems with many writers there are easy ways to mitigate the problem and share results from many writers. Immutable data is an easy and safe way to share data in concurrent applications.

Thanks to Luke Seelenbinder and Thomas Stratmann for reading the draft and giving valuable feedback.

Preventing Classes of Bugs

Reducing bugs is a good thing. Bugs demotivate developers, alienate developers from their product and hurt brand perception. So preventing bugs from being made in the first place has interested many people in the past. There are technologies and techniques that can prevent whole classes of bugs.

  • Typed References Prevent bugs where some code is passed data that doesn't conform to its expectations about attributes or methods. There is a famous slide that says "38% Airbnb bugs preventable with TypeScript according to postmortem analysis". If a method expects a Dog and you call it with a Cat, the method will crash. With typed references every reference of local variables or method parameters is typed and the compiler prevents calls with the wrong variable type. Languages: e.g. Scala, Haskell, Java, TypeScript

  • Automatic Testing Prevents unfulfilled requirements. Automatic tests match the code against specifications and detect code that doesn't conform to them. Automatic testing also helps with regression bugs, where a bug is fixed but introduced again later, or where something worked and is broken by subsequent code changes. Tools: e.g. Django Testing, Rails Testing, Mocha, Jest, JUnit

  • Exploratory Testing Finds missing or incomplete requirements. When testers explore an application with the goal of improving customer perceived quality, they find missing requirements and quality improving features. Lately I had a form where I had to enter a tax ID which was 13 digits long. The UI told me it expected an 11 digit tax ID. A good UI would tell the customer "You've entered a 13 digit tax ID. Often this is the case when the first two digits represent the area. These are not needed, do you want to use the last 11 digits, which are a valid tax ID?"

  • Garbage Collection Prevents memory leaks from unfreed memory and bugs from double freeing memory. Manual memory management with malloc and free has the problem of forgetting to free some used memory, which leads to memory leaks that can crash the application. When two or more code paths free the same memory this leads to an undefined state and, beside security implications, can crash the application. Luckily nearly every programming language today has automatic memory management in one way or the other. Languages: e.g. JavaScript, Ruby, Python, Kotlin, Java

  • Generics Prevent ClassCastExceptions when one developer puts one type into a container while another developer expects other types in the container. One developer puts Dog objects into a bag and takes them out again. Another developer puts Cat objects into the bag, which do not have a bark method. If you pull an object out of the bag and call bark on it, it will fail. Generics type the Dog bag in a way that you can't put in Cat objects, so whatever you pull out of the bag will be a dog. Languages: e.g. Java, C++, Haskell, Scala

  • Option Monads Prevent uninitialized or non existing data from being accessed. With different code paths, data can often be uninitialized and therefore null. Calling a method on a null variable turns into a NullPointerException. Another way is to call a method with some data the method expects to exist, e.g. the lastname of a customer. If a very small percentage of customers don't have a lastname, the data might be null, and working with lastname then leads to an exception. Option prevents this because you need to check that the data is there before you can use it. Declaring the lastname optional prevents calling methods on null. Languages: e.g. Haskell, Scala

  • Tag Types Prevent a method from getting data that it doesn't expect and prevent IllegalArgumentExceptions. For example an integer could be tagged positive with - in TypeScript - int & Positive or in Scala Int @@ Positive. Then the method can't be called with negative numbers if it expects only positive integers. Tag types can also help with optional data, for example when a method expects a Customer with a lastname: Customer & HasLastName with Customer { lastname: string | undefined } prevents giving the method a customer that has an undefined lastname. Tools: e.g. Scala refined, TypeScript taghiro

  • Immutable Data Prevents data corruption and inconsistent data with persistent data structures. With persistent data structures, if the data is changed a copy of the changed data is created and only the code that changed the data sees the changes. If your concurrent code gets some data, the code can be sure that no other code manipulates it. This prevents data corruption where some code changes one part of the data and other code changes other parts of the shared data. Immutable data also prevents ConcurrentModificationExceptions. And in the case of wrong synchronisation of concurrent code you only get lost updates, where the changes of some code are overwritten by other changes, instead of much worse inconsistent data. Languages: e.g. Haskell, Scala, Elm

  • Code Review Prevents misunderstood requirements and wrong API usage. When more people look over written code, requirements are interpreted by more people and misunderstood requirements are found. If a developer is unfamiliar with some (internal) API they might use it the wrong way. Code reviews by more experienced developers detect wrong API usage.

Using technologies and techniques that prevent whole classes of bugs has the highest impact on the quality of products. It's good to choose processes, organizations and programming languages with these in mind.

If you have more techniques and technologies that prevent classes of bugs, I'd love to hear about them on Twitter @Codemonkeyism.