Essential storage tradeoff: Simple Reads vs. Simple Writes
Photo by Robert Scoble
When I studied computer science with a focus on databases, the holy scripture said only to use normalized data. For my exam I had to study 2NF and 3NF. And now it seems this is all wrong for large scale operations. From my education, I still have a strong feeling towards the pureness of normalized databases – and detect this feeling in many developers. In reality denormalized schemas are the way to go.
Why the former rise of normalized schemas for relational databases? Normalized data is pure. There are no repetitions, no duplication and data can easily be added, changed and removed. The database enforces a consistent state. But to read data from several tables you need to join them, which means complex and – often – slow reads. The tradeoff.
The rise of NoSQL storages
Joining doesn’t scale to millions of concurrent users, and it’s rumored that some companies ban joins completely. NoSQL databases, which include key/value stores and document databases, drop the notion of normalized data. Those with the biggest momentum are Cassandra, Project Voldemort, Tokyo Tyrant, Redis and CouchDB. They store data or documents in denormalized chunks which leads to easy reads. But if the data needs to be present in several documents, this means complex writes.
Essentially you exchange simple-writes/complex-reads with complex-writes/simple-reads. In most modern web architectures with many more reads than writes – depending on your business model – it’s no wonder people prefer the second tradeoff.
Let’s take a look at an – contrived – example. Suppose we want to store a shopping list. Each shopping list can be tagged and also have friends who can see the shopping list. With a normalized database one would perhaps model this problem like this:
Compare the normalized form with a document schema:
We added the tags and friends (in human readable form) to the shopping list, and the friends to the user document.
Read and write use cases
Let’s examine two use cases:
- Adding, removing and changing the name of a friend
- Reading the shopping list
With the normalized schema, adding and removing a friend to a shopping list is easy: Just insert a row or remove a row in the Friends table. Changing the name of a friend is also easy, just change it in the User table. Reading the shopping list at least involves a join with Tags and so isn’t a one-chunck read access. If we want to display the friends for a shopping list too, there even will be a second join.
With the document schema, adding and removing a friend is more complex. We need to read and write the shopping list document. We also need to read and write the user document. Reading is simple though. Just read the shopping list document, there is all the information to display a shopping list with the tags and the friends. Changing names is most difficult, we would need to change all shopping lists. This is the primary argument for normalized databases. But in reality, how often do users change their name compared to viewing a shopping list? And most of those slow operations can be done in a second, asynchronous step (see picture above).
Be aware of this tradeoff and plan or change your architecture accordingly. But what do you think?