Sharding does destroy your relational database, and that is a good thing. The idea behind sharding is to distribute data across several databases based on certain criteria. This could, for example, be the primary key: all entities whose keys begin with 1 go to one database, those beginning with 2 to another, and so on. (Often a modulo function on the key is used, or groups based on business data such as customer location or function.) There are several reasons for sharding, the main two being better performance and a lower impact of crashed databases: if you shard by name, only persons whose name starts with S will be affected when that shard's database crashes.
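To make the routing idea concrete, here is a minimal sketch of how a key could be mapped to a shard. The shard count and the function names `shard_for_key` and `shard_for_name` are assumptions made up for this illustration; real setups often use more elaborate schemes.

```python
import zlib

# Assumption for this sketch: four shards, numbered 0 to 3.
SHARD_COUNT = 4


def shard_for_key(key: int, shard_count: int = SHARD_COUNT) -> int:
    """Route a numeric primary key to a shard with a modulo function."""
    return key % shard_count


def shard_for_name(name: str, shard_count: int = SHARD_COUNT) -> int:
    """Route a string key (e.g. a customer name) to a shard.

    crc32 is used instead of Python's built-in hash() because it returns
    the same value in every process, so a key always lands on the same shard.
    """
    return zlib.crc32(name.encode("utf-8")) % shard_count


if __name__ == "__main__":
    print(shard_for_key(10))        # 10 % 4 -> shard 2
    print(shard_for_name("Smith"))  # some shard between 0 and 3
```

Note that plain modulo routing moves most keys to a different shard when the number of shards changes, which is one reason grouping by business data is sometimes preferred.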
Relational databases have been the tool of choice for data storage for several decades. But they do more than store data. Even read operations can be split into several functions. There are at least three kinds of database read queries:
- Data graph building queries: With these you get your data out of the database, for example customers together with their addresses.
- Aggregation queries: How many orders were stored in August, aggregated by product category?
- Search queries: Give me all customers who live in New York
Sharding does away with the second and third kinds of query and reduces databases to data storage. Because the shards are different databases on different systems, you can’t aggregate across them (unlike in a cluster) without custom code, and you cannot search with one query, only with several, one to each database. Databases have led to the notion that search and retrieval are linked and should be dealt with together. Most people think of retrieval and search as the same thing. This has blocked the development of new technologies. Sharding, S3, Dynamo and Memcached have changed this perception recently. I’ve written about splitting search and retrieval in “The unholy legacy of databases”. There I quote Rickard of Qi4j fame:
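As an illustration of the custom code that aggregation and search across shards require, here is a rough sketch using two in-memory SQLite databases as stand-ins for shards. The `orders` table, its columns and the helper names are invented for this example; real shards would live on separate servers.

```python
import sqlite3
from collections import Counter


def make_shard(rows):
    """Create one in-memory database that plays the role of a shard."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, category TEXT, city TEXT)"
    )
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return db


def count_by_category(shards):
    """Aggregation: run the GROUP BY on every shard, then merge the partial counts."""
    totals = Counter()
    for db in shards:
        query = "SELECT category, COUNT(*) FROM orders GROUP BY category"
        for category, count in db.execute(query):
            totals[category] += count
    return dict(totals)


def search_by_city(shards, city):
    """Search: send the same query to every shard and combine the hits."""
    ids = []
    for db in shards:
        rows = db.execute("SELECT id FROM orders WHERE city = ?", (city,))
        ids.extend(row[0] for row in rows)
    return sorted(ids)


# Hypothetical data, sharded by order id modulo 2.
shards = [
    make_shard([(2, "books", "New York"), (4, "toys", "Boston")]),
    make_shard([(1, "books", "Boston"), (3, "books", "New York")]),
]

if __name__ == "__main__":
    print(count_by_category(shards))
    print(search_by_city(shards, "New York"))
```

Every query that used to be a single SQL statement now needs this kind of fan-out and merge step in application code.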
Entities are really cool. We have decided to split the storage from the indexing/querying, sort of like how the internet works with websites vs Google, which makes it possible to implement really simple storages. Not having to deal with queries makes things a whole lot easier.
and concluded:
Free your mind! Storage and search are two different things, if you split them, you gain flexibility.
People have talked about splitting storage and search for some time now. Search engines like Lucene have driven searching out of databases. But the notion of store-and-search is still prevalent. Sharding, as a mechanism for more performance and lower risk, will move into many web companies, reduce databases to a storage mechanism and drop the aggregation (data warehouse and reporting) and search parts. Those roles can be better filled by real data warehouse servers like Mondrian and by search services based on Lucene or semantic engines like Sesame. And storage might move from databases to simple stores like Amazon Elastic Block Store or JDBM.
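A toy sketch of what this split could look like: one component that only stores entities by key, and a separate one that only answers searches and returns keys. The classes `EntityStore` and `CityIndex` are invented for this illustration; in practice the index role would be played by something like Lucene, and the store by a plain key-value storage.

```python
from collections import defaultdict


class EntityStore:
    """Storage only: put and get by key, no queries."""

    def __init__(self):
        self._data = {}

    def put(self, key, entity):
        self._data[key] = entity

    def get(self, key):
        return self._data[key]


class CityIndex:
    """Search only: maps a city to the keys of the customers living there."""

    def __init__(self):
        self._keys_by_city = defaultdict(set)

    def index(self, key, entity):
        self._keys_by_city[entity["city"]].add(key)

    def find(self, city):
        # .get() avoids creating empty entries for unknown cities.
        return sorted(self._keys_by_city.get(city, ()))


store = EntityStore()
index = CityIndex()

# Hypothetical customers; every write goes to the store and to the index.
customers = [
    (1, {"name": "Smith", "city": "New York"}),
    (2, {"name": "Jones", "city": "Boston"}),
    (3, {"name": "Sanchez", "city": "New York"}),
]
for key, customer in customers:
    store.put(key, customer)
    index.index(key, customer)

# Search returns keys, retrieval turns keys into entities.
new_yorkers = [store.get(key) for key in index.find("New York")]
```

Because the store never sees a query, it can be sharded or swapped out without touching the search side, and the other way round.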
Thanks for listening, and think about your databases.