NoSQL: The Dawn of Polyglot Persistence
For some developers, polyglot programming is already a reality. I’m not a big fan of polyglot programming, meaning the use of many programming languages in one company. Especially for small companies there are hurdles, like turnover. I’ve seen projects stranded because no one understood a particular language. Or as Alex Ruiz writes:
I haven’t seen any practical evidence yet to convince me this is a good idea.
Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your use cases: for example file storage, SQL, graph databases, data warehouses, in-memory databases, network caches or NoSQL. Today mostly two kinds of storage are used, files and SQL databases. Neither is optimal for every use case. In the words of Ben Scofield:
Many applications may require a non-traditional data store (say, something like MongoDB) for their core domain, but have other features that fit perfectly into a relational database – say, a CMS that relies heavily on custom fields and has a traditional user management system. Just as polyglot programmers may use multiple languages in a single application, I think the future of the web is polyglot persistence: we should use the database that best represents our domain, even if that requires several distinct systems within a single application.
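To make the CMS example from the quote concrete, here is a minimal sketch in Python. It assumes two stand-ins: sqlite3 plays the relational database for user management, and a plain dict plays the document store for pages with custom fields. All names (open_user_db, PageStore, publish_page) are made up for illustration and do not come from any real product.

```python
import sqlite3


def open_user_db():
    """Relational side: a traditional user table (sqlite3 as a stand-in)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT UNIQUE NOT NULL)"
    )
    return db


def add_user(db, name, email):
    """Insert a user and return the generated primary key."""
    cur = db.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))
    db.commit()
    return cur.lastrowid


class PageStore:
    """Document side: a dict as a stand-in for a document database.

    Every page may carry a different set of custom fields.
    """

    def __init__(self):
        self._docs = {}

    def save(self, slug, doc):
        self._docs[slug] = dict(doc)

    def load(self, slug):
        return dict(self._docs[slug])


def publish_page(db, pages, author_id, slug, fields):
    """Check the author in SQL, then store the page as a schemaless document."""
    row = db.execute("SELECT name FROM users WHERE id = ?", (author_id,)).fetchone()
    if row is None:
        raise ValueError(f"unknown author {author_id}")
    pages.save(slug, {"author_id": author_id, **fields})
```

The application code is the only place where both storages meet: the author check runs against the SQL side, the page with its arbitrary fields goes to the document side.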
SQL is just fine
But you might say: “SQL is working for me!” Yes, maybe. But in reality SQL storages are often problematic, not during development but during operations. SQL storages are hard to scale – not impossible, but scaling a MySQL database with master/slave and replication chains is no easy task. And when scaling, many companies drop SQL features like JOINs, as they are slow and notoriously hard to scale.
Many companies at the forefront of the web have created their own storages to better suit their needs: Flickr, Facebook, Google and Amazon, to name only a few. The systems those companies built and partly open sourced include Cassandra, Dynamo, BigTable, HayStack and MapReduce/Hadoop, although all of them also use SQL databases for the right use cases.
There are 4 main use cases for storages:
1. Storing data, read/write
2. Searching data, mostly read
3. Navigation over data, read
4. Reporting, read
SQL storages can do all four of them, though SQL as a language and philosophy is mainly driven by “4. Reporting”. Or as Rickard tweeted:
@emileifrem yeah, but I have to do this relational database thing just once. It’s for…. reporting! :-) The one thing it does really well.
Current SQL databases therefore do not support all use cases well (the decision can be based on the CAP theorem and on BASE vs. ACID consistency). The one who drove this point home long ago is Rickard with Qi4j, where storing and querying are separated from each other:
In Qi4j we have explicit SPI support for storing and querying objects, separated from each other. Typically the EntityStore SPI will be implemented by a key-value implementation, the benefits of which was described in the previous post. The EntityFinder SPI is currently only implemented by an RDF repository extension based on Sesame.
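The separation of storing and querying can be sketched in a few lines of Python. This is not Qi4j's API: the class names only borrow the EntityStore and EntityFinder terms from the quote, the key-value store is a plain dict, and the finder is a simple in-memory index instead of an RDF repository. Field values are assumed to be hashable.

```python
from collections import defaultdict


class KeyValueEntityStore:
    """Storing: load and save whole entities by identity, nothing else."""

    def __init__(self):
        self._data = {}

    def put(self, identity, entity):
        self._data[identity] = dict(entity)

    def get(self, identity):
        entity = self._data.get(identity)
        return None if entity is None else dict(entity)


class IndexEntityFinder:
    """Querying: a separate index that maps (field, value) to identities."""

    def __init__(self):
        self._index = defaultdict(set)

    def index(self, identity, entity):
        for field, value in entity.items():
            self._index[(field, value)].add(identity)

    def remove(self, identity, entity):
        for field, value in entity.items():
            self._index[(field, value)].discard(identity)

    def find(self, field, value):
        return sorted(self._index.get((field, value), set()))


class Repository:
    """Glue code: every save goes to the store and updates the finder."""

    def __init__(self, store, finder):
        self.store = store
        self.finder = finder

    def save(self, identity, entity):
        old = self.store.get(identity)
        if old is not None:
            # Drop stale index entries before indexing the new state.
            self.finder.remove(identity, old)
        self.store.put(identity, entity)
        self.finder.index(identity, entity)

    def query(self, field, value):
        return [self.store.get(i) for i in self.finder.find(field, value)]
```

Because the two sides only share identities, either one could be swapped out, for example a real key-value store for storing and a search engine for querying, without touching the other.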
Best of breed
There are some best of breed examples for the four use cases above:
1. Storing: Key/Value stores, document DBs like CouchDB
2. Searching: search engines, Solr
3. Navigation: Either by hand in SQL, K/V stores, XML, JSON etc. or use a graph database like Neo4j (which can also do 1., storing)
4. Reporting: SQL or structured stores like MongoDB, MapReduce with Hadoop and Pig
Using the most suitable storage for your use case will lead to a better fit and fewer problems ahead concerning data management, scalability and performance.
Problems with polyglot persistence?
Are there problems with polyglot persistence? Indeed there are some. It’s hard to join and query data across storages, because the different storages have no way to interact except through your application code. The same goes for aggregating data (how many users have bought a product?) across storages. And referential integrity goes out the window. But if you scale with sharding, partitioning and forbidden joins, you need to live with the same restrictions anyway.
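Here is a small sketch of what such a join in application code can look like, again in Python. The setup is an assumption for illustration only: users live in a SQL database (sqlite3 as a stand-in), orders live in some other store (a list of dicts as a stand-in), and make_user_db and buyers_of are made-up helper names.

```python
import sqlite3


def make_user_db(users):
    """Create an in-memory SQL user table from (id, name) pairs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO users (id, name) VALUES (?, ?)", users)
    db.commit()
    return db


def buyers_of(db, orders, product):
    """Answer 'which users have bought this product?' across two storages."""
    # Step 1: the order store only knows user ids, so collect the distinct ones.
    buyer_ids = {o["user_id"] for o in orders if o["product"] == product}
    if not buyer_ids:
        return []
    # Step 2: resolve the ids on the SQL side. Ids without a matching user row
    # silently drop out here, which is the missing referential integrity.
    placeholders = ",".join("?" * len(buyer_ids))
    query = f"SELECT id, name FROM users WHERE id IN ({placeholders}) ORDER BY id"
    return db.execute(query, sorted(buyer_ids)).fetchall()
```

The count the question asks for is then len(buyers_of(db, orders, "book")). Nothing stops an order from pointing at a user id that no longer exists in the SQL database, so the application has to decide how to treat such dangling references.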
Storage systems develop over time. The evolutionary steps for storages are:
1. Unstructured: file system
2. Unification: Network databases, CODASYL, SQL RDBMS
3. Specialization: NoSQL DBs together with SQL – What I call polyglot persistence
4. Abstraction: Amazon Dynamo, will lead to “Give me a storage for optimal load/store, give me one for search, I want to report ABC”
5. Automatic: Just like garbage collection (GC) in VMs like the Java VM, automatic adaptation and migration of data over storages. You no longer need to care.
Currently we’re on level 3/4 with some storages and will move to 5. Developers and admins will need to make a big leap of faith, just as C developers needed to have faith in the GC of Java. You will no longer know what’s going on in detail.
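To give a feeling for what “automatic” could mean, here is a toy sketch in Python. It is purely an illustration and not how any existing storage works: two dicts stand in for a slow and a fast storage, and TieredStore and promote_after are made-up names. The store watches reads and copies hot data to the fast tier on its own.

```python
class TieredStore:
    """Toy store that decides by itself which data lives in the fast tier."""

    def __init__(self, promote_after=3):
        self.slow = {}   # stand-in for a disk-backed or remote storage
        self.fast = {}   # stand-in for an in-memory storage
        self.reads = {}  # read counter per key since the last write
        self.promote_after = promote_after

    def put(self, key, value):
        # Writes go to the slow tier; a stale fast copy is dropped.
        self.fast.pop(key, None)
        self.slow[key] = value
        self.reads[key] = 0

    def get(self, key):
        if key in self.fast:
            return self.fast[key]
        value = self.slow[key]
        self.reads[key] = self.reads.get(key, 0) + 1
        if self.reads[key] >= self.promote_after:
            # The caller never asked for this: the store copies hot data itself.
            self.fast[key] = value
        return value
```

The caller only sees put and get. Where a value is served from is no longer its business, which is exactly the kind of control you give up.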
- Written using continuations in Scala
- Execution context gets transferred to data, execution of code is always near data
- Optimization of data distribution, to reduce moving execution contexts to a minimum
They have a video on their website which is worth watching. In general, future storages will self-optimize for different use cases, just like the Java GC optimizes memory management, the HotSpot VM optimizes compilation and Google App Engine optimizes execution context. Make the leap of faith.
Storages are changing. You need to take action. Learn about NoSQL storages and polyglot persistence. It’s no longer enough to know and use only one “hammer” (SQL storages); not every storage problem is a nail. There will be optimizing storages for different use cases in the not so distant future.