The dark side of NoSQL

There is a dark side to most of the current NoSQL databases. People rarely talk about it. They talk about performance, about how easy schemaless databases are to use. About nice APIs. They are mostly developers and not operation and system administrators. No-one asks those. But it's there where rubber hits the road.

The three problems no-one talks about - almost noone, I had a good talk with the Infinispan lead [1] - are:

  • ad hoc data fixing - either no query language available or no skills
  • ad hoc reporting - either no query language available or no in-house skills
  • data export - sometimes no API way to access all data

In an insightful comment to my blog post "Essential storage tradeoff: Simple Reads vs. Simple Writes", Eric Z. Beard, VP Engineering at Loop, wrote:

My application relies on hundreds of queries that need to run in real-time against all of that transactional data – no offline cubes or Hadoop clusters. I’m considering a jump to NoSql, but the lack of ad-hoc queries against live data is just a killer. I write probably a dozen ad-hoc queries a week to resolve support issues, and they normally need to run “right now!” I might be analyzing tens of millions of records in several different tables or fixing some field that got corrupted by a bug in the software. How do you do that with a NoSql system?

  1. Data export: NoSQL data bases are differently affected by those problems. Each of them is unique. With some it's easy to export all our data, mostly the non distributed ones (CouchDB, MongoDB, Tokyo Tyrant) compared to the more difficult ones (Voldemort, Cassandra). Voldemort looks especially weak here.
  2. Ad hoc data fixing: With the non-distributed NoSQL stores, which do posess a query and manipulation language, ad hoc fixing is easier, while it is harder with distributed ones (Voldemort, Cassandra).
  3. Ad hoc reporting: The same with ad hoc reporting. The better the query capabilities (CouchDB, MongoDB) the easier ad hoc reporting becomes. For some of those reporting woes Hadoop is a solution. But as the Scala Swarm author Ian Clarke notes, not every problem is applicable to map/reduce. Either way you need to train customers and their expectations as they have become addicted to ad hoc reporting. This is not only a technical question, but a cultural one.

One solution is to split data that needs to be queried or reported (User, Login, Order, Money) and data which needs best performance (app data, social network data). Use a tradition SQL database for the first kind of data, and a fast, distributed NoSQL store for the second kind of data. Joining will be difficult, you need to support more different systems and skills are an issue. But the three problems can be solved this way.

What is your NoSQL strategy? Please leave a comment, I would like to know.

[1] they plan a distributed query language for ad hoc reporting in distributed environments