by Stephan Schmidt

The dark side of NoSQL

There is a dark side to most of the current NoSQL databases, and people rarely talk about it. They talk about performance, about how easy schemaless databases are to use, about nice APIs. The people talking are mostly developers, not operations staff and system administrators. No one asks those. But that is where the rubber hits the road.

The three problems no one talks about – almost no one, I had a good talk with the Infinispan lead [1] – are:

  • ad hoc data fixing – either no query language available or no in-house skills
  • ad hoc reporting – either no query language available or no in-house skills
  • data export – sometimes no API way to access all data

In an insightful comment to my blog post “Essential storage tradeoff: Simple Reads vs. Simple Writes”, Eric Z. Beard, VP Engineering at Loop, wrote:

My application relies on hundreds of queries that need to run in real-time against all of that transactional data – no offline cubes or Hadoop clusters. I’m considering a jump to NoSql, but the lack of ad-hoc queries against live data is just a killer. I write probably a dozen ad-hoc queries a week to resolve support issues, and they normally need to run “right now!” I might be analyzing tens of millions of records in several different tables or fixing some field that got corrupted by a bug in the software. How do you do that with a NoSql system?

  1. Data export: NoSQL databases are affected by these problems to different degrees; each one is unique. With some it’s easy to export all your data, mostly the non-distributed ones (CouchDB, MongoDB, Tokyo Tyrant); with the distributed ones (Voldemort, Cassandra) it is more difficult. Voldemort looks especially weak here.
  2. Ad hoc data fixing: With the non-distributed NoSQL stores, which do possess a query and manipulation language, ad hoc fixing is easier, while it is harder with the distributed ones (Voldemort, Cassandra).
  3. Ad hoc reporting: The same goes for ad hoc reporting. The better the query capabilities (CouchDB, MongoDB), the easier ad hoc reporting becomes. For some of those reporting woes Hadoop is a solution. But as the Scala Swarm author Ian Clarke notes, map/reduce is not applicable to every problem. Either way you need to train customers and manage their expectations, as they have become addicted to ad hoc reporting. This is not only a technical question, but a cultural one.
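To make the difference in point 2 concrete, here is a minimal sketch of the same ad hoc fix done twice: once as a single SQL statement and once against a key-value store. The SQL side uses an in-memory SQLite database; the key-value side is a plain Python dict standing in for a store, not the client API of Voldemort, Cassandra or any other product. The table, the keys and the "corrupted email" bug are made up for illustration. Note that the sketch assumes the store lets you iterate over all keys, which, as mentioned under data export, is not always the case.

```python
import json
import sqlite3

# Relational side: one declarative statement repairs every corrupted row.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
db.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, "B@EXAMPLE.COM "), (3, "c@example.com")],
)
cursor = db.execute(
    "UPDATE users SET email = lower(trim(email)) "
    "WHERE email != lower(trim(email))"
)
sql_fixed = cursor.rowcount  # number of rows the UPDATE touched

# Key-value side: a dict stands in for the store (hypothetical, illustration only).
kv_store = {
    "user:1": json.dumps({"email": "a@example.com"}),
    "user:2": json.dumps({"email": "B@EXAMPLE.COM "}),
    "user:3": json.dumps({"email": "c@example.com"}),
}


def fix_emails(store):
    """Scan every key, deserialize, repair in application code, write back."""
    fixed = 0
    for key in list(store):
        record = json.loads(store[key])
        cleaned = record["email"].strip().lower()
        if cleaned != record["email"]:
            record["email"] = cleaned
            store[key] = json.dumps(record)
            fixed += 1
    return fixed


kv_fixed = fix_emails(kv_store)
```

Both variants repair the same record, but the second one is a small program that has to be written, tested and run against every key, which is the skills and tooling gap described above.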

One solution is to split the data into data that needs to be queried or reported on (User, Login, Order, Money) and data that needs the best performance (app data, social network data). Use a traditional SQL database for the first kind and a fast, distributed NoSQL store for the second. Joining will be difficult, you need to support more systems, and skills are an issue. But the three problems can be solved this way.
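A rough sketch of what such a split could look like, under heavy assumptions: SQLite plays the SQL database, a plain dict plays the NoSQL store, and the class, table and key names (`HybridStore`, `orders`, `profile:<id>`) are invented for this example. It also shows why joining gets harder: the "join" between orders and profile data has to happen in application code.

```python
import json
import sqlite3


class HybridStore:
    """Orders go to SQL (reportable); profiles go to a key-value store (fast)."""

    def __init__(self):
        self.sql = sqlite3.connect(":memory:")
        self.sql.execute(
            "CREATE TABLE orders ("
            "id INTEGER PRIMARY KEY, user_id INTEGER, amount_cents INTEGER)"
        )
        self.kv = {}  # stand-in for a distributed key-value store

    def add_order(self, user_id, amount_cents):
        self.sql.execute(
            "INSERT INTO orders (user_id, amount_cents) VALUES (?, ?)",
            (user_id, amount_cents),
        )

    def set_profile(self, user_id, profile):
        self.kv[f"profile:{user_id}"] = json.dumps(profile)

    def revenue_by_user(self):
        # Ad hoc reporting stays a one-line SQL query.
        rows = self.sql.execute(
            "SELECT user_id, SUM(amount_cents) FROM orders "
            "GROUP BY user_id ORDER BY user_id"
        )
        return dict(rows.fetchall())

    def revenue_with_nickname(self):
        # The cross-store "join" is done by hand in application code.
        report = []
        for user_id, total in self.revenue_by_user().items():
            raw = self.kv.get(f"profile:{user_id}")
            nickname = json.loads(raw)["nickname"] if raw else None
            report.append((user_id, nickname, total))
        return report


store = HybridStore()
store.add_order(1, 1500)
store.add_order(1, 500)
store.add_order(2, 900)
store.set_profile(1, {"nickname": "ann"})
report = store.revenue_with_nickname()
```

Reports on orders stay plain SQL, while anything that needs data from both sides costs extra code and extra round trips.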

What is your NoSQL strategy? Please leave a comment, I would like to know.

[1] They plan a distributed query language for ad hoc reporting in distributed environments.

About the author: Stephan Schmidt is head of development at brands4friends. He has more than 15 years of internet technology experience and 10 years of experience in agile. He has been head of development, consultant and CTO, and is a speaker, author and blog writer. He specializes in organizing and optimizing software development, helping companies increase productivity with lean software development and agile methodologies. All views are only his own.

Comments

Ben Smith

“Adhoc” not “Add hoc”.
“problem is” not “problems”.

Aside from that, your third point is a bad one.

Whilst it may be true that your customers’ expectations are that ad hoc reporting is simple for you (when using traditional SQL, for instance) because each time they ask for something they get it within five minutes, and that they can be trained to expect it to take longer, it’s no good when your boss needs to make a business decision and it will take several days to get him the numbers he needs.

Certainly in the businesses I’ve been involved in, ad hoc reporting has been phenomenally important and to lose that capability would be a deal breaker when considering moving to a different platform.

@Ben: Thanks, fixed those, visual edit in wordpress removed all style attributes so I copied the mistakes back from my editor to wordpress :-) Fixed again.

“Certainly in the businesses I’ve been involved in, ad hoc reporting has been phenomenally important and to lose that capability would be a deal breaker when considering moving to a different platform.”

Yes, experienced that too. But sometimes ad hoc reporting is important (realtime); other times it’s just curiosity, like looking into your email inbox, Twitter, ... and a 2-hour gap due to Hadoop, for example, is tolerable (for the company, perhaps not for the curious employee).

I think if you ask most nosql advocates, you’ll find that they wouldn’t recommend mongo, couch, hadoop, etc for something like $-valued transactions or user signups. These key-value stores are designed with performance and scalability in mind, not 100% transactional reliability. In most real world applications you’ll find a combination of traditional relational databases and these schemaless DBs. You mention this at the end of your post as a possible solution, but I think it is the probable solution.

You gave the solution in your last paragraph: when you have strong querying/reporting needs, RDBMSs are still the way to go, even if you’re using a NoSQL store with query capabilities such as MongoDB.
RDBMSs are simply far better.

So, you may choose to split your data along (pseudo-)functional/semantic paths as you suggested, or just put everything in your NoSQL store and write behind to your RDBMS ... in both cases, RDBMSs will be part of your architecture for a long time.

That’s one reason why NoSQL is indeed a bad name: a NoSQL store doesn’t imply no RDBMS at all.

Just my two euro cents,
Cheers,

Sergio B.
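A write-behind setup like the one Sergio describes could look roughly like the following sketch. Everything in it is an assumption for illustration: a dict plays the NoSQL store, SQLite plays the RDBMS, the flush is triggered by hand instead of by a background worker, and the names (`WriteBehindStore`, `logins`) are made up.

```python
import json
import sqlite3
from collections import deque


class WriteBehindStore:
    """Writes land in the key-value store first and reach SQL on a later flush."""

    def __init__(self):
        self.kv = {}            # stand-in for the NoSQL store (hypothetical)
        self.pending = deque()  # writes not yet copied to the RDBMS
        self.sql = sqlite3.connect(":memory:")
        self.sql.execute("CREATE TABLE logins (user_id INTEGER, country TEXT)")

    def record_login(self, user_id, country):
        # Fast path: only the key-value store is touched while serving the request.
        key = f"login:{user_id}:{len(self.kv)}"
        self.kv[key] = json.dumps({"user_id": user_id, "country": country})
        self.pending.append((user_id, country))

    def flush(self):
        # Slow path: drain the queue into the reporting database in one transaction.
        batch = list(self.pending)
        self.pending.clear()
        with self.sql:
            self.sql.executemany("INSERT INTO logins VALUES (?, ?)", batch)
        return len(batch)


store = WriteBehindStore()
store.record_login(1, "DE")
store.record_login(2, "DE")
store.record_login(3, "US")
flushed = store.flush()

# Ad hoc reporting then runs against the RDBMS copy.
logins_by_country = store.sql.execute(
    "SELECT country, COUNT(*) FROM logins GROUP BY country ORDER BY country"
).fetchall()
```

The price of this design is that reports lag behind live data by however long the queue waits before a flush.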

Elliot

Err, “ad hoc”, not “adhoc”, right? Ad hoc is “To this” in latin; “adhoc” is nonsense.

“Ad Hoc”, not “Adhoc”

@Latinist and Elliot: Thanks, see Ben’s comment, changed again now to “ad hoc”. Especially annoying because I’ve learned – or so I thought – 5 years of Latin ;-)

Steve

If you have scaled up to the point where you need NoSQL, Ad Hoc queries are already out of the question.

@Steve: I beg to differ. Also see the quote in the post.

Tom

A couple concerns I have that you touched upon:

1. Agility: Are they agile? If a user at a later date wants a new query that spans across 3 or more entities can I as easily do that with noSql as I can with Sql?

2. Conventions over ad hoc design: If a developer can put anything into a database, they will. This means that as years go by the datastore becomes murky. I’ve worked with object-oriented databases, XML databases and old-school IMS hierarchical databases, and I can tell you that as time goes by these can start to look very ugly.

Tom

Steve said, “If you have scaled up to the point where you need NoSQL, Ad Hoc queries are already out of the question.”

I agree. I see noSql as something to start using when the existing relational datastore can’t handle the load or maybe for quick solutions.

Someone care to explain how NoSQL relates to RDF (and its triple stores)? *edit:* Not implying a connection here, just wondering how they compare to each other. *edit2:* Obviously, I’m mostly interested in query performance etc., not in inference or ontology support – NoSQL probably can’t do that :)

Steve

@Stephan 10 million rows? That’s still in the scope of any decent RDBMS. Why does he need to use NoSQL? It just sounds like someone who wants to make the jump to the next “latest” technology. When you have a billion rows, that’s a different problem. A normal RDBMS can’t handle that volume of data in a reasonable amount of time. Ad hoc queries aren’t possible. At all. The only way to get answers from a billion+ rows of data is to use a NoSQL system (Hadoop, Cassandra, whatever) or invest in an extremely expensive and extremely proprietary cluster RDBMS system (Netezza, Vertica, AsterData, etc).

10 million rows of normal data is not NoSQL territory.

Ad hoc reporting is very difficult in CouchDB, especially if you are dealing with a large amount of data.

CouchDB does support “temporary views”, which are ad hoc views that you can build and execute on the fly. However, temporary views are not recommended for production use, because they need to be built before they can get you the data you need. This could take hours or days, depending on the size of your database and the processing power of your database server. Temporary views are meant as a way to test new views in development which will eventually be saved into a design document, and not for running ad hoc queries.

Rick

I would like to point out that MongoDB, with its powerful JavaScript query system, handles all of these issues swimmingly. IMO, most complex operations on datasets are easier when using an imperative/iterative system instead of SQL.

@Rick: I mentioned MongoDB as one of the easier ones; the skill issue persists though. I know many excellent SQL admins who whip up complex data reports in minutes; they might not in MongoDB. The same goes for the expressive power of MongoDB JS (and I like MongoDB!) compared to SQL.

@John: Good points, thanks

@Steve: NoSQL is better suited to some applications than SQL is, no matter how many rows are involved.

The most common forms of web applications – blogs, forums, CMS, wikis – are all examples of structured document storage, and have a natural fit with NoSQL and an impedance mismatch with SQL.

In such cases key-value stores offer higher productivity in development, and higher runtime efficiency.

Steve

@Twylite I guess I was looking at it more from the OLAP side of things.

euromix

I guess it means NoSQL is a good fit for hierarchical data that does not need export or queries, like a blog.
Export and reporting were never a strong point in the web world anyway.

Our application, developed in Rails, was designed to use SQL Server because no other product does what we need the way Reporting Services does.

I would have loved to use an open source database and make the ecosystem more consistent, but really, I don’t know of any system that comes even close to competing with Reporting Services.

Colin

Column-oriented databases are another alternative to NoSQL. These nontraditional database systems do support SQL, provide low latency, and easily manipulate trillions of rows of data.

Some examples are OneTick, MonetDB, Vhayu, KDB.

@Colin: Some years back I’ve played with KDB (+K) and was amazed. Thanks for pointing them out.

I think ad hoc reporting is a problem in general, since you shouldn’t be doing that against your OLTP database either. Really, reporting data should be captured either in the data layer and queued for inserting into a reporting SQL DB, or exported via an ETL layer; of course that requires data export.

But in general, I agree that most developers jump into NoSQL without considering operations. Hopefully that’s just a symptom of the novelty of these data stores and will work itself out as they become more established.

As mentioned above already, MongoDB, while requiring a new set of skills, does provide all three of the capabilities you mention.

@Twylite – There are certain types of data that have an impedance mismatch to SQL (eg. graph data), but not the ones you mentioned. The relational model fits the things you mentioned perfectly well.

The impedance mismatch that exists is between sets and objects. There is nothing about a blog post that makes it more natural to be thought of as an object. Sets of blog posts and all the regular relational juice is still applicable here.

When you go NoSQL you are deciding to forgo flexibility with your data in favor of raw speed and scalability. Don’t fool yourself into thinking that NoSQL is more “natural” just because you don’t like the leaky ORM abstractions.

MemDude

Maybe I’m missing something in this argument, but it sounds like someone saying: I use “Tool XYZ” and have a lot of experience with it, so we should always use “Tool XYZ” since my team is so good at using it. If that was a good rule of thumb then we’d have a whole lot more people programming in COBOL using IMS databases.

@MemDude: I think you’re missing something. I came to this post because I have to deal with ad hoc reporting and, in the past, data fixing, and because I tried to get data out of Voldemort, which I was using.

@Arne: Yes, a good reporting solution and DWH for reporting is a good thing. Many companies do not have those though. And it doesn’t address the problem of ad hoc data fixing, which is sometimes needed due to software bugs.

Good discussion of running both types of DBs in one system:
http://johnpwood.net/2009/09/29/using-multiple-database-models-in-a-single-application/

“@Steve: I beg to differ.”

Beg away, but you need to refute the argument and not just handwave. Keeping OLAP/ad hoc queries off the active database is a best practice even for relatively small systems, never mind the outliers that non-relational systems target.

ad-hoc repair is another matter – does it not indicate you haven’t thought through the operational usecases on the data?

Arun Srini

NoSQL databases vs. OORDBMS are like Windows vs. Linux: with Windows you know you can solve problems up to a limit, but you can’t rely on it and it is not very stable or powerful, while Linux is hard to master and tough(er) to monitor and maintain, but no doubt the clear winner of the two.

Arun Srini

I failed to mention the mapping: NoSQL is Windows, OORDBMS is Linux :-[]

@Bill: Sorry again to differ. “[...] but you need to refute the argument”. I don’t need to refute every argument someone makes on one of my blog posts. But I could give you my opinion.

“If you have scaled up to the point where you need NoSQL, Ad Hoc queries are already out of the question.”

a.) Many companies need NoSQL early in their life cycle, and at that point still have not invested enough in a data warehouse and BI solution
b.) Many companies that are (too) large have complicated processes, leading business users to just go to the DB department requesting ad hoc reports
c.) If your CEO comes to you requesting some information NOW for the next investors call, you tell him “this is out of the question”? Kudos to you.

“ad-hoc repair is another matter – does it not indicate you haven’t thought through the operational usecases on the data?”

I don’t know about you, but I have strived for a zero bug strategy for the last years, and still there were bugs in the software we produced. Glad to hear that you do not have bugs because of edge cases no-one thought about or which are caused by the interference of different features in different departments meeting live data and users.

” Keeping olap/ad-hoc queries off the active database is a best practice for relatively small systems, [...]”

Not sure what your point is here.

Also not sure why you ignored “Also see the quote in the post.”

[...] Code Monkeyism: The dark side of NoSQL the tools definitely have their limitations, but a lot of that is lifecycle rather than intrinsic flaws in the approach (tags: nosql altdb database software cloud management databases couchdb criticism cassandra) [...]

I’ve had great success storing user and login information in MongoDB. In fact storage of user objects was always considered a core use case of the project from day one. I would be careful with things like orders though ... although that can work too.

What people wrote somewhere else:

New blog post: “The dark side of NoSQL” http://bit.ly/1W9NlW #NoSQL #Cassandra #Voldemort #CouchDB

This comment was originally posted on Twitter

Nice infinispan name check here http://bit.ly/1f1WRo (I assume he chatted to @maniksurtani).

“No silver bullet!!” RT @codemonkeyism: New blog post: “The dark side of NoSQL” http://bit.ly/1W9NlW #NoSQL #Cassandra #Voldemort #CouchDB

RT @esteban27: http://icio.us/kf1z44 (the dark side of NoSQL (that also means no nice schemas) )

A programmer outlines 3 perceived problems with #nosql datastores. I say: give them time. http://is.gd/3OPko

Dark side of #nosql: http://bit.ly/1f1WRo #sqlalchemy ’s answer to the “hybrid” suggestion in the last paragraph: http://bit.ly/PbEg2

RT @codemonkeyism Code Monkeyism: The dark side of NoSQL http://bit.ly/1W9NlW…; Not “dark”, but troublesome.

Stephan Schmidt on the potential downsides of NoSQL (aka non-relational data stores): NoSQL is getting very ver.. http://awe.sm/1qQh

sil

Reading http://bit.ly/mRA2n and being pleased that CouchDB is not susceptible to these issues :)

Great read on NoSQL. Don’t miss the comments. http://tr.im/AgDw

The RDMS Empire Strikes Back. RT @codemonkeyism Code Monkeyism: The dark side of NoSQL http://bit.ly/1W9NlW

RT @michaelneale “Nice infinispan name check here http://bit.ly/1f1WRo (I assume he chatted to @maniksurtani).” – I assume so too. :P

RT @OdeToCode: On the flip side, this post looks at the “dark side of #nosql” – http://bit.ly/cccMj

Dave

That Silverlight/Mess thing’s just an ad though isn’t it? No different from that 4 year-old with the Pony-filled presentation about the benefits of Windows 7 except spread out over various blog posts.

This comment was originally posted on tecosystems

@tobi Cassandra looks very interesting, but it could be some time yet for tooling support: http://tinyurl.com/yb6vjq2

I doubt nosql is that good, as in: http://bit.ly/cccMj RT @celso: nosql, nomoreservers, nowindows. the world is changed, and it feels nice.
