the blog for developers

Sharding destroys the goals of your relational database

Sharding does destroy your relational database – which is a good thing. The idea behind sharding is to distribute data to several databases based on certain criterias. This could for example be the primary key. All entities that keys begin with 1 go to one database, with 2 to another and so on (often modulo functions on the key are used, or groups based on business data like customer location, or function). Several reasons exists for sharding, the main two being better performance and lower impact of crashed databases – only persons with a name that starts with S will be affected by a database crash.

Relational databases were the tool of choice for several decades when it comes to data storage. But they do more than store data. Even reading operations can be split into several functions. There are at least three kinds of database read queries:

  • Data graph building queries: With these you get your data out of the database, customers together with adresses etc.
  • Aggregation queries: How many orders have been stored in the August, aggregated by product category
  • Search queries: Give me all customers who live in New York

Sharding now does away with the second and third query and reduces databases to data storage. Because the shards are different databases on different systems you can’t aggregate queries (compared to a cluster) without custom code across systems and you cannot search with one query (only several ones – one to each database). Databases have lead to the notion that search and retrieval are linked together and should be dealt together. Most people think as retrieval and search as the same thing. This has blocked development on technologies. Sharding, S3, Dynamo, Memcached have changed this preception recently. I’ve written about splitting search and retrieval in “The unholy legacy of databases”. There I quote Rickard from Qi4j fame:

Entities are really cool. We have decided to split the storage from the indexing/querying, sort of like how the internet works with websites vs Google, which makes it possible to implement really simple storages. Not having to deal with queries makes things a whole lot easier.

and have concluded

Free your mind! Storage and search are two different things, if you split them, you gain flexibility.

People talked about splitting storage and search for some time now. Search engines like Lucene have driven searching out of databases. But mainly the notion of store&search is prevalent. Sharding as a mechanism for more perfomance and lower risk will move into many web companies and reduce databases to storage mechanism and drop the aggreation (data warehouse and reporting) and search parts. Those can be better filled with real data warehouse servers like Mondrian and search services based on Lucene or semantic enginse like Sesame. And storage might move from databases to simple storages like Amazon Elastic Block Storage or JDBM.

Thanks for listening, and think about your databases.

You can leave a Reply here. Of course, you should follow me on twitter here.

You can share this post!
Do you want to tell others about this article? Use the social bookmark icons to submit this artice to the service of your choice. Thanks.

About the author: Stephan Schmidt has more than 15 years of internet technology experience and 10 years experience in agile. He was head of development, consultant and CTO and is a speaker, author and blog writer. He specializes in organizing and optimizing software development helping companies by increasing productivity with lean software development and agile methodologies. Want to know more? All views are only his own.
Leave a reply.

Comments

The reason for this is CAP theorem Consistency, Availability and Partition tolerance you can only have two of the three. traditionally DBMSs take the first two. Shards mean taking the last two

Arnon

stephan

@Arnon: I did know about CAP, but didn’t see it in the case of shards. Interesting, thanks.

todd

I think sharding and CAP come in because each shard can independently fail. So if 1 shard out of 10 fails your other 9 are still available, which increases the availability of your overall system at the expense of consistency.

Well, you just need some new layer to let you do summary queries across all your shards… Something like Map/Reduce. I can see LINQ going in this direction in the long term.

stephan

@website: You’re right, as I’ve said “without custom code across”. You always can write new code on top.

“! Storage and search are two different things, if you split them, you gain flexibility”

What a powerful concept.

Your post is worth reading just for that quote.

Thanks for other wonderful ideas, too.

stephan

Thanks.

Actually one context where this is a strength in a multitenancy configuration where the sharding is being implemented to isolate data storage for different companies to contain the risk of accidental data leakage.

While it can constrain the host from running queries of type two and three the way you described, not being able to run them in the application then becomes a desirable feature.

Is sharding the right way to describe this configuration. Probably not (at least since the intent is not the same as conventional sharding), even though the term does get used at times.

Hi Stephan,

pretty good post with interesting insights: on a related note, “splitting storage and searching” goes hand-in-hand with Command-Query Responsibility Segregation (see http://www.udidahan.com/2009/12/09/clarified-cqrs/ for an excellent post about it), one of the most interesting patterns recently emerged.

Cheers,

Sergio B.

you wrote about retrieval and searching, but what about aggregation?

What solutions do you see for this? A little more specific than ‘custom code’ please :)

@Sergio: “one of the most interesting patterns recently emerged.” 100% agree

@Jens: I’ve meant code that you need to write, like handling joins in your application layer or write some general middleware.

Leave a Reply

What people wrote somewhere else:

Additional comments powered by BackType

Guide to CodeMonkeyism

Over the last 4 years I wrote many articles on this blog. To make it easier for you to find the relevant ones, I've organized them into topics.

Top 10

6 reasons why my VC funded startup did fail

Go Ahead: Next Generation Java Programming Style

Java Interview questions: Write a String Reverser

The dark side of NoSQL

7 Bad Signs not to Work for a Software Company or Startup

Is Java dead?

Scala vs. Clojure

Never, never, never use String in Java

No future for functional programming in 2008 – Scala, F# and Nu

Clojure vs Scala, Part 2

Java Developer

Is Java Dead?

Go Ahead: Next Generation Java Programming Style

Be careful with magical code

All variables in Java must be final

Never, never, never use String in Java

Bending Java: More readable code with methods that do nothing?

NoSQL Guy

NoSQL: The Dawn of Polyglot Persistence

The dark side of NoSQL

Essential storage tradeoff: Simple Reads vs. Simple Writes

Sharding destroys the goals of your relational database

The unholy legacy of databases

Startup/CTO

Development Dream Teams

6 reasons why my VC funded startup did fail

American vs. European style of Software Development

12 Things to Reduce Your Lead Time and Time to Market

The high cost of overhead when working in parallel

Essential storage tradeoff: Simple Reads vs. Simple Writes

Job Seeker

Another Good (Java) Interview Question

7 Bad Signs not to Work for a Software Company or Startup

Java Interview questions: Write a String Reverser (and use Recursion!)

Java Interview questions: Multiple Inheritance

As a Manager: What I value in developers

Top 10 Tips (+1) to Get a Pay Raise

Agilist

What Developers Need to Know About Agile

5 Practices Better to Change in Your Scrum Implementation

Scrum is not about engineering practices

ScrumMaster and ZenMaster: The joke of certification

What is Trans-Scrum?