Why NoSQL Will Not Die

Reading the recent flamory piece "I Can't Wait for NoSQL to Die" from Ted Dziuba, I thought the author is wrong on so many levels. Or as Jeremy writes:

Well done, Ted. I laughed to myself a few times reading your post.

Not that I'm a NoSQL zealot, see my The Dark Side of NoSQL, but Ted is hilarous. On to our first laugh:

Never mind of course that MySQL was the perfect solution to everything a few years ago when Ruby on Rails was flashing in the pan.

No it wasn't, without heavy memcaching MySQL never worked for websites. Or:

Well, no. Did you know that Cassandra requires a restart when you change the column family definition? Yeah, the MySQL developers actually had to think out how ALTER TABLE works, but according to Cassandra, that's a hard problem that has very little business value. Right.

It seems to me, Ted has never worked with real live MySQL applications. ALTER TABLE is a pain, for tables with several millions of rows it will take hours to alter a table, mostly because MySQL creates temp tables. Which is no problem if your domain and market is static - as I assume Teds is - or if your MySQL schema is meta. But for others this is hell. All the while it does locks and your website is heavily impaired during the change. Even dumping the table, recreating it and importing all data (which is faster than ALTER TABLE) takes usually hours. You can work around this with hardware, SSDs, a clever slave setup, but you need a MySQL wizard to get this working.

The real solution to schema changes with high volumes of data is not to have a schema at all in your store - something most NoSQL databases support. This is mostly done by storing XML or (B)JSON into the store, and the store does not care about your schema. Your app then needs to deal with different versions of a schema (with at least two) and migrate data from the old schema to the new one between reads and writes (NoSQL to the rescue: Store JSON data with a version string, read old version, change, write new version). Or deal with optional values from the beginning, something a lot of code already does with sparse filled social media data. A background job can also migrate data piece by piece to a new version. With this setup, schema changes are easy, without a complicated slave setup or downtimes.

The problem with RDBMS doesn't end there. In a post to High Scalability Joe Stump writes:

Precompute on writes, make reads fast. This is an oldie as a scaling strategy, but it's valuable to see how SimpleGeo is applying it to their problem of finding entities within a certain geographical region.

I wrote in more detail about this in "Essential storage tradeoff: Simple Reads vs. Simple Writes" and how RDBMS wrongly optimize wirtes (I know about materialized views).

m3mnoch speculates about the reasons for Teds laughable post:

it doesn’t look like he’s ever done anything for a large, mainstream audience. i bet he still thinks getting slashdotted or techcrunched is the definition of “a lot of users.” [...] my point is, this isn’t 1998 anymore.

Agreed, even my humble WordPress blog on one server survives this. The reason MySQL works for Ted is:

because i totally believe that google adwords runs on mysql. IT’S READ-ONLY! that’s what mysql is good for — lots of read-heavy, cacheable data you can map against other read-heavy cacheable data.

Back to Ted:

You Are Not Google. The sooner your company admits this, the sooner you can get down to some real work.

He's right. You are not Google, so you will not have those MySQL wizards around who write patches for InnoDB. And he's wrong. You will get into scaling troubles with MySQL far before you're as big as Google.

DBAs are a reason for NoSQL

DBAs should not be worried, because any company that has the resources to hire a DBA likely has decision makers who understand business reality.

Another real gem. One of the reasons people go for Cassandra is that they do not need as many DBAs as with MySQL. Clustering and Scaling works out of the box for a wide range of scenarios - cases which you would have needed a MySQL wizard to reach. In Joe Stump words, CTO and co-founder of SimpleGeo:

How much time are your DBAs spending administering your RDBMSs? How much time are they in the data centers? How much do those data centers cost? How much do DBAs cost a year? Let’s say you have 10 monster DB servers and 1 DBA; you’re looking at about $500,000 in database costs.

The cost of RDBMS operations

And more about the cost of operating RDBMS with large websites:

I’m running a 50 node cluster, which spans three data centers, on Amazon’s EC2 service for about $10,000 a month. Furthermore, this is an operational expense as opposed to a capital expense, which is a bit nicer on the books. In order to scale a RDBMS to 6,000 reads/second I’d need to spend on the order of five months of operation of my 50 node cluster. [...] I’m happy to put my $/write, $/read, and $/GB numbers for my NoSQL setup against anyone’s RDBMS numbers.

SQL databases will survive, but for a much smaller niche (transcational data) than today. For sure NoSQL will not die in the near future: They support schema changes better, they scale better for write heavy applications and they are cheaper to scale all in all.

Other NoSQL posts on CodeMonkeyism:

The dark side of NoSQL

There is a dark side to most of the current NoSQL databases. People rarely talk about it. They talk about performance, about how easy schemaless databases are to use. About nice APIs. They are mostly developers and not operation and system administrators. No-one asks those. But it's there where rubber hits the road.

The three problems no-one talks about - almost noone, I had a good talk with the Infinispan lead [1] - are:

  • ad hoc data fixing - either no query language available or no skills
  • ad hoc reporting - either no query language available or no in-house skills
  • data export - sometimes no API way to access all data

In an insightful comment to my blog post "Essential storage tradeoff: Simple Reads vs. Simple Writes", Eric Z. Beard, VP Engineering at Loop, wrote:

My application relies on hundreds of queries that need to run in real-time against all of that transactional data – no offline cubes or Hadoop clusters. I’m considering a jump to NoSql, but the lack of ad-hoc queries against live data is just a killer. I write probably a dozen ad-hoc queries a week to resolve support issues, and they normally need to run “right now!” I might be analyzing tens of millions of records in several different tables or fixing some field that got corrupted by a bug in the software. How do you do that with a NoSql system?

  1. Data export: NoSQL data bases are differently affected by those problems. Each of them is unique. With some it's easy to export all our data, mostly the non distributed ones (CouchDB, MongoDB, Tokyo Tyrant) compared to the more difficult ones (Voldemort, Cassandra). Voldemort looks especially weak here.
  2. Ad hoc data fixing: With the non-distributed NoSQL stores, which do posess a query and manipulation language, ad hoc fixing is easier, while it is harder with distributed ones (Voldemort, Cassandra).
  3. Ad hoc reporting: The same with ad hoc reporting. The better the query capabilities (CouchDB, MongoDB) the easier ad hoc reporting becomes. For some of those reporting woes Hadoop is a solution. But as the Scala Swarm author Ian Clarke notes, not every problem is applicable to map/reduce. Either way you need to train customers and their expectations as they have become addicted to ad hoc reporting. This is not only a technical question, but a cultural one.

One solution is to split data that needs to be queried or reported (User, Login, Order, Money) and data which needs best performance (app data, social network data). Use a tradition SQL database for the first kind of data, and a fast, distributed NoSQL store for the second kind of data. Joining will be difficult, you need to support more different systems and skills are an issue. But the three problems can be solved this way.

What is your NoSQL strategy? Please leave a comment, I would like to know.

[1] they plan a distributed query language for ad hoc reporting in distributed environments