
NoSQL: The Dawn of Polyglot Persistence

For some developers, polyglot programming is already reality. I’m not a big fan of polyglot programming, that is, using many programming languages in one company. Especially for small companies there are hurdles, like turnover. I’ve seen projects stranded because no one understood a particular language. Or as Alex Ruiz writes:

I haven’t seen any practical evidence yet to convince me this is a good idea.

Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your use cases: for example file storages, SQL, graph databases, data warehouses, in-memory databases, network caches, NoSQL. Today mostly two storages are used, files and SQL databases. Neither is optimal for every use case. In the words of Ben Scofield:

Many applications may require a non-traditional data store (say, something like MongoDB) for their core domain, but have other features that fit perfectly into a relational database – say, a CMS that relies heavily on custom fields and has a traditional user management system. Just as polyglot programmers may use multiple languages in a single application, I think the future of the web is polyglot persistence: we should use the database that best represents our domain, even if that requires several distinct systems within a single application.
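To make the CMS example from the quote concrete, here is a minimal sketch. It assumes an in-memory SQLite database for the user management and a plain dict of JSON strings as a stand-in for a document store like MongoDB. The names `save_page` and `load_page_with_author` are made up for illustration and do not come from any real product.

```python
import json
import sqlite3

# Relational side: a traditional user table (in-memory SQLite as a stand-in).
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
users_db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
users_db.commit()

# Document side: a dict of JSON strings standing in for a document store.
# Each page can carry its own custom fields without a schema change.
pages = {}


def save_page(page_id, author_id, **custom_fields):
    """Store a page document with arbitrary custom fields."""
    pages[page_id] = json.dumps({"author_id": author_id, **custom_fields})


def load_page_with_author(page_id):
    """Read the document, then look up its author in the relational store."""
    doc = json.loads(pages[page_id])
    row = users_db.execute(
        "SELECT name FROM users WHERE id = ?", (doc["author_id"],)
    ).fetchone()
    doc["author_name"] = row[0] if row else None
    return doc


save_page("home", author_id=1, title="Welcome", hero_image="banner.png")
save_page("faq", author_id=1, title="FAQ", questions=["What is NoSQL?"])
```

The two pages have different fields, which is awkward in a fixed table but natural in a document store, while the users stay where constraints and SQL queries are available.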

SQL is just fine

But you might say: “SQL is working for me!” Yes, maybe. But in reality SQL storages are often problematic, not during development but during operations. SQL storages are hard to scale – not impossible, but scaling a MySQL database with master/slave and replication chains is no easy task. And when scaling, most companies drop SQL features like JOINs as they are slow and notoriously hard to scale.
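One common way to live without JOINs is denormalization, which is my assumption here since the text above does not name a technique. The sketch below uses in-memory SQLite and made-up table names to compare a normalized read with a read from a flattened copy.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    -- Denormalized copy: the user name is duplicated into every order row.
    CREATE TABLE orders_flat (id INTEGER PRIMARY KEY, user_name TEXT, total REAL);

    INSERT INTO users  VALUES (1, 'alice');
    INSERT INTO orders VALUES (10, 1, 19.99);
    INSERT INTO orders_flat VALUES (10, 'alice', 19.99);
    """
)

# Normalized read: needs both tables on the same server for the JOIN.
joined = db.execute(
    "SELECT o.id, u.name, o.total FROM orders o JOIN users u ON u.id = o.user_id"
).fetchall()

# Denormalized read: a single-table lookup that would still work if
# orders_flat lived on a different shard than users.
flat = db.execute("SELECT id, user_name, total FROM orders_flat").fetchall()
```

Both reads return the same rows. The price of the flat table is that a renamed user has to be updated in every copy, by your application.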

Many companies at the leading edge of the web have created their own storages to better suit their needs: Flickr, Facebook, Google and Amazon, to name only a few. The ones those companies built and partly open sourced are Cassandra, Dynamo, BigTable, HayStack, MapReduce/Hadoop – although all of them use SQL databases too for the right use cases.

There are 4 main use cases for storages:

  1. Storing data, read/write
  2. Searching data, mostly read
  3. Navigation over data, read
  4. Reporting, read

SQL storages can do all four of them, though SQL as a language and philosophy is mainly driven by “4. Reporting”. Or as Rickard tweeted:

@emileifrem yeah, but I have to do this relational database thing just once. It’s for…. reporting! :-) The one thing it does really well.

Current SQL databases therefore do not support all use cases well (the decision can be based on the CAP theorem and on BASE vs. ACID consistency). The one guy who drove this point home long ago is Rickard with Qi4j, where storing and querying are separated from each other:

In Qi4j we have explicit SPI support for storing and querying objects, separated from each other. Typically the EntityStore SPI will be implemented by a key-value implementation, the benefits of which was described in the previous post. The EntityFinder SPI is currently only implemented by an RDF repository extension based on Sesame.
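The separation of storing and querying can be sketched in a few lines. This is not the Qi4j SPI: the class names `KeyValueEntityStore` and `IndexEntityFinder` are hypothetical, and a dict-based inverted index stands in for the RDF repository mentioned in the quote.

```python
from collections import defaultdict


class KeyValueEntityStore:
    """Write/read side: stores whole entities by identity."""

    def __init__(self):
        self._data = {}

    def put(self, entity_id, entity):
        self._data[entity_id] = dict(entity)

    def get(self, entity_id):
        return self._data[entity_id]


class IndexEntityFinder:
    """Query side: maps (field, value) pairs to entity ids only."""

    def __init__(self):
        self._index = defaultdict(set)

    def index(self, entity_id, entity):
        for field, value in entity.items():
            self._index[(field, value)].add(entity_id)

    def find(self, field, value):
        return sorted(self._index.get((field, value), set()))


store = KeyValueEntityStore()
finder = IndexEntityFinder()


def save(entity_id, entity):
    """Application-level write: persist first, then update the query side."""
    store.put(entity_id, entity)
    finder.index(entity_id, entity)


save("u1", {"name": "alice", "city": "Berlin"})
save("u2", {"name": "bob", "city": "Berlin"})
save("u3", {"name": "carol", "city": "Hamburg"})

# Query returns ids only; the entities themselves come from the store.
berliners = [store.get(i)["name"] for i in finder.find("city", "Berlin")]
```

The finder never holds entity state, so either side could be swapped for a different backend. As a simplification, the sketch does not remove stale index entries when an entity is updated.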

Best of breed

There are some best-of-breed examples for the four use cases above:

  1. Storing: Key/Value stores, document DBs like CouchDB
  2. Searching: search engines, Solr
  3. Navigation: Either by hand in SQL, K/V stores, XML, JSON etc. or use a graph database like Neo4j (which also can do 1.)
  4. Reporting: SQL or structured stores like MongoDB, MapReduce with Hadoop and Pig

Using the most suitable storage for your use case will lead to a better fit and fewer problems ahead concerning data management, scalability and performance.
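As a rough illustration of the navigation use case from the list above, here is what "by hand" looks like: a breadth-first traversal over an adjacency list. The data and the function `reachable_within` are invented for this sketch. A graph database would offer this kind of traversal as a built-in operation.

```python
from collections import deque

# Adjacency list standing in for a graph store: who follows whom.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": ["alice"],
}


def reachable_within(graph, start, max_hops):
    """Breadth-first traversal: nodes reachable from start in <= max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, hops + 1))
    return found


two_hops = reachable_within(follows, "alice", 2)
```

Doing the same in SQL usually means one self-join per hop, which is why navigation is listed as its own use case.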

Problems with polyglot persistence?

Are there problems with polyglot persistence? Indeed there are some. It’s hard to join data across storages and to query data across storages, because there is no way for the different storages to interact, except through your application code. The same goes for aggregation of data (how many users have bought a product?) across storages. And out the window goes referential integrity. But if you scale with sharding, partitioning and forbidden joins, you need to live with the same restrictions.
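Here is a small sketch of what joining "through your application code" means in practice. It assumes users live in a relational database (in-memory SQLite) and purchases in a key/value store (a dict as stand-in). All names and data are made up.

```python
import sqlite3

# Store 1: users in a relational database (in-memory SQLite stand-in).
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany(
    "INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob"), (3, "carol")]
)

# Store 2: purchases in a key/value store (a dict stand-in), keyed by order id.
purchases = {
    "o-1": {"user_id": 1, "product": "book"},
    "o-2": {"user_id": 1, "product": "pen"},
    "o-3": {"user_id": 3, "product": "book"},
    "o-4": {"user_id": 99, "product": "book"},  # dangling: no such user
}


def buyers_of(product):
    """Join purchases to users by hand; neither store can do it for us."""
    buyer_ids = {p["user_id"] for p in purchases.values() if p["product"] == product}
    names = []
    for user_id in sorted(buyer_ids):
        row = users_db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        if row is not None:  # skip references that nothing enforces
            names.append(row[0])
    return names


book_buyers = buyers_of("book")
```

The dangling order `o-4` shows the lost referential integrity: no storage rejects it, so the application has to decide what to do with it. The count of buyers is simply `len(book_buyers)`, again computed in application code.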

Storage evolution

Storage systems develop over time. The evolutionary steps for storages are:

  1. Unstructured: file system
  2. Unification: Network databases, CODASYL, SQL RDBMS
  3. Specialization: NoSQL DBs together with SQL – What I call polyglot persistence
  4. Abstraction: Amazon Dynamo, will lead to “Give me a storage for optimal load/store, give me one for search, I want to report ABC”
  5. Automatic: Just like garbage collection (GC) in VMs like the Java VM, automatic adaptation and migration of data across storages. You no longer need to care

Currently we’re on level 3/4 with some storages and will move to 5. Developers and admins will need to make a big leap of faith, just as C developers needed to have faith in the GC of Java. You will no longer know what’s going on in detail.

The future

We can already see the beginning of automatic storages. Swarm is one example of an automatic storage engine that optimizes execution context and data locality:

  • Written using continuations in Scala
  • Execution context gets transferred to data, execution of code is always near data
  • Optimization of data distribution, to reduce moving execution contexts to a minimum
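The idea behind the points above, moving code to the data instead of data to the code, can be shown with a toy simulation. This is not how Swarm works internally (it uses Scala continuations). The `Node` class and both functions are hypothetical and only count how many items would cross the network.

```python
class Node:
    """A storage node holding a slice of the data (hypothetical)."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def run(self, task):
        """Execute a task locally, next to the data it reads."""
        return task(self.records)


nodes = [
    Node("node-a", [3, 8, 15]),
    Node("node-b", [4, 16, 23, 42]),
]


def fetch_then_compute(nodes):
    """Data moves to the code: every record crosses the network."""
    moved = [r for node in nodes for r in node.records]
    return sum(moved), len(moved)


def compute_at_data(nodes):
    """Code moves to the data: only one partial result per node crosses."""
    partials = [node.run(sum) for node in nodes]
    return sum(partials), len(partials)


total_a, items_moved_a = fetch_then_compute(nodes)
total_b, items_moved_b = compute_at_data(nodes)
```

Both variants compute the same total, but the second one moves two partial sums instead of seven records. An automatic storage would make that choice, and the placement of the data itself, without asking you.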

They have a video on their website which is worth watching. In general, future storages will self-optimize for different use cases, just like the Java GC optimizes memory management, the HotSpot VM optimizes compilation and Google App Engine optimizes execution context – make the leap of faith.

Conclusion

Storages are changing. You need to take action. Learn about NoSQL storages and polyglot persistence. It’s no longer enough to only know and use one “hammer” (SQL storages); not every storage problem is a nail. There will be optimizing storages for different use cases in the not so distant future.


About the author: Stephan Schmidt is head of development at brands4friends. He has more than 15 years of internet technology experience and 10 years of experience in agile. He was head of development, consultant and CTO and is a speaker, author and blog writer. He specializes in organizing and optimizing software development, helping companies increase productivity with lean software development and agile methodologies. All views are only his own.


Comments

I am not sure it’s the dawn yet. It’s probably more like 1 am with the promise of a dawn. And one of the reasons imo is that right now everyone is throwing ideas at the problem. That makes it seem like there are lots of specialised storage engines for lots of specialised use cases. I do believe there are at least two steps we need to go through for the dawn to occur.

a. More robustness and confidence building: Storage systems need to exhibit confidence. That’s one reason I think people flock to Oracle – that you will not lose data. And I think more production success stories need to emerge before such confidence can start becoming contagious.

b. IMO there’s way too much specialisation happening in this momentum. Eventually some degree of shakeout will be required. That’s still early work in progress I would imagine.

I do believe that going forward people will want to see some migration paths across such databases as a mechanism to protect their investments. That may slightly take the edge off the amount of polyglotism in the space, probably more towards some degree of standardisation.

I think #nosql is an extremely promising area with great potential. I believe the pace at which I imagine this potential will be converted into solutions and acceptance may not be as high as many of us might anticipate.

Not so sure if I agree. The dawn is before sunrise, which characterizes the situation quite well.

Many companies have been using diverse storages for years (Facebook, Flickr, Amazon, …) and others have followed with NoSQL in production environments (Reddit, Digg).

“Eventually some degree of shakeout will be required.”

Will happen for sure.

“I think #nosql is an extremely promising area with great potential.”

#NoSQL is only part of polyglot persistence.

I have to disagree about polyglot programming! Just like the right storage for the right job, the right language for the right purpose. We script in Bash, we query in SQL, but when we consider backend code, we talk “inversion of control” but do so in a language (Java) that abhors closures?

Developers would be much more effective if they more quickly embraced the right language for the right job with team buy-in. A team I’m on has already seen great benefits from attaching a scripting language to our Java runtimes to manually manipulate objects in the runtime context — absolutely invaluable! If we switched to a functional language we could also trim out about 30% Java-boiler-plate code, but the team isn’t ready for that yet.

Finally, there are just as many pitfalls with using a storage engine your team is not intimately familiar with! If we ever meet in a coffee shop I’ll have to give you the long list of Solr war stories. :)

Thanks for the post!

Tim

Like so many other things, these sorts of solutions are absolutely vital to those who need them, terribly dangerous to those who don’t, with a vast spectrum of grey in between.

One of the downsides that you haven’t mentioned is maintenance (admin)

I worked on a project in which we decided to use an LDAP directory to store the user accounts, some content on the filesystem, and the bulk of the data in an RDBMS.
In theory it was a good idea – each storage system aligned to the job it is designed for.

In practice, the directory has been troublesome, and we wouldn’t do it again.

Not because our directory server lacked features, but because it’s a specialised tool for a specialised task, and that requires specialised skills to maintain it.

We have a storage team that manage our filesystems, SAN infrastructure and file backups, and it all runs smoothly.
We have DBAs who manage our RDBMSs, and maintain backup schedules, check free space, and run integrity checks.

What we didn’t have was someone with the knowledge and responsibility to do that for a more specialised storage system. We could have hired someone. We could have trained someone. We actually would have needed multiple people in order to ensure we had support 365 days a year.
We also needed to train our developers and operations staff.
We needed to build integration from our site-wide security manager into the directory in order to manage admin passwords and permissions.

For a system that only provided marginal improvement over a relational database, the overhead of all that was too high.

In retrospect we were further down the “dangerous” end of the spectrum than we realised. In retrospect, putting in a specialised storage system for 1 part of the application had a negative ROI.

(For the record, we don’t regret putting things on the filesystem)

Some applications need specialised stores. Some applications are big enough that the teams running them can justify hiring a team of CouchDB admins.
Some applications *need* full text search, so you’re going to have to make Lucene/Solr work – Your RDBMS isn’t going to get the job done.

But for a lot of cases, it’s much like the polyglot programming scenario – a 10% performance improvement (on paper) isn’t a strong enough benefit to justify the investment you’ll need to get it running seamlessly. For the money that you would need to spend, you can probably solve it more easily with hardware.

@Tim: Thanks for your experience insights.

I found your idea that some day storage of data will be as transparent as garbage collection is in Java really intriguing.

I never even thought about the fact that this might happen some day.

I think we’re at least 10 years away from any one particular implementation of that idea gaining widespread use but the fact that it’s coming is interesting.

Thanks for the blog post :)

@Robert: Thanks – the idea came to me only recently – after seeing Swarm, it clicked how similar this will be to GC.

