
The unholy legacy of databases

While reading about the status of Qi4j on Rickard's blog, I stumbled upon this:

Entities are really cool. We have decided to split the storage from the indexing/querying, sort of like how the internet works with websites vs Google, which makes it possible to implement really simple storages. Not having to deal with queries makes things a whole lot easier.

We had the same experience when we developed the SnipSnap wiki application several years ago. We split storage and search, each part with its own Java interface (a component could implement both, of course). This way we could have Lucene, database, and in-memory search, and database and file (XML, plain text) storage. This made us very flexible, and people could easily implement different storage backends because they were freed from implementing search. Rickard seems to have had the same experience:

We have one EntityStore based on JDBM (persistent binary hashmap), one on JGroups (replicated cluster hashmap), one on Amazon S3 (for global storage), and one on iBatis (for RDBMS storage)

So today SnipSnap could easily supply an S3 backend because of the split, whereas applications that rely on a combined storage/search layer have a much harder time supporting a storage-only backend. They struggle to support S3 or WebDAV out of the box.
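To make the split concrete, here is a minimal sketch of what two separate interfaces could look like. The names `Storage`, `Search`, `InMemoryStorage`, `InMemorySearch`, and `Wiki` are assumptions made up for this illustration; they are not the actual SnipSnap or Qi4j APIs, and the tiny inverted index only stands in for something like Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical interfaces for illustration; not the real SnipSnap or Qi4j API.
interface Storage {
    void store(String name, String content);
    Optional<String> load(String name);
}

interface Search {
    void index(String name, String content);
    List<String> find(String word);
}

// Storage-only backend: a hash map standing in for files, S3, WebDAV, ...
class InMemoryStorage implements Storage {
    private final Map<String, String> pages = new HashMap<>();

    public void store(String name, String content) {
        pages.put(name, content);
    }

    public Optional<String> load(String name) {
        return Optional.ofNullable(pages.get(name));
    }
}

// Search-only backend: a tiny inverted index from lowercased word to page names.
// Simplification: re-indexing a changed page does not remove its old words.
class InMemorySearch implements Search {
    private final Map<String, Set<String>> namesByWord = new HashMap<>();

    public void index(String name, String content) {
        for (String word : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                namesByWord.computeIfAbsent(word, w -> new TreeSet<>()).add(name);
            }
        }
    }

    public List<String> find(String word) {
        Set<String> names = namesByWord.getOrDefault(
                word.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
        return new ArrayList<>(names);
    }
}

// The application talks to both parts; each can be swapped independently.
class Wiki {
    private final Storage storage;
    private final Search search;

    Wiki(Storage storage, Search search) {
        this.storage = storage;
        this.search = search;
    }

    void save(String name, String content) {
        storage.store(name, content);
        search.index(name, content);
    }

    Optional<String> read(String name) {
        return storage.load(name);
    }

    List<String> find(String word) {
        return search.find(word);
    }
}

class SplitDemo {
    public static void main(String[] args) {
        Wiki wiki = new Wiki(new InMemoryStorage(), new InMemorySearch());
        wiki.save("start", "Storage and search are two different things");
        wiki.save("backends", "Storage backends: file, database, S3");
        System.out.println(wiki.find("storage"));
        System.out.println(wiki.read("start").orElse("(missing)"));
    }
}
```

In this sketch, a new storage backend only has to implement `store` and `load`; it never needs to know how queries work.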

Why don’t more people split the problem of persistence into storage and search? After some contemplation, I think it is perhaps the unholy legacy of databases. Databases make it easy to solve both search and storage with a single technology. After 30 years of databases, the two problems have merged to the point that most developers think of them as one. By splitting them again, projects are freed to choose better backends and better search solutions. Open source projects will emerge that address each problem better than current databases do.

This of course breaks the DAO pattern and the use of the EntityManager as a DAO replacement; both should be replaced by a Storage and Search pattern. Free your mind! Storage and search are two different things; if you split them, you gain flexibility.

Thanks for listening.


About the author: Stephan Schmidt is head of development at brands4friends. He has more than 15 years of internet technology experience and 10 years of experience in agile. He has been head of development, consultant, and CTO, and is a speaker, author, and blog writer. He specializes in organizing and optimizing software development, helping companies increase productivity with lean software development and agile methodologies. All views are only his own.

Comments

I have tried this sort of thing as well and I really really like the concept. The only thing that stops my poor, slow, small brain from really seeing it through to its logical conclusion is the sometimes-requirement to join attributes from one Thing stored with one Storage mechanism with the attributes of another Thing stored with another Storage mechanism. Do you have suggestions here?

Obviously, a Compass/Lucene-type search handles a huge number of cases–people tend to like to search by keywords, and so a coarse-grained search/locate strategy like that makes a lot of sense. But in some of the applications I work on, careful targeted queries that join bits of two entities together–a classic SQL join–are also needed.

Have you found a convenient way to expose a *common* SQL-like query mechanism across items that use different Search implementations?

stephan

@Laird: I’m not sure if this is possible. You give something up when you try this approach, but you also gain something. I guess it depends on the application you have. If flexibility in the backend is needed, then this is a good approach. If an RDBMS is all you need, then this approach is overengineered.

Have you read the transaction apostate paper and the Amazon Dynamo paper? Sometimes it isn’t even possible to have all the data on one machine in order to join it.

http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf

But if you have new insights and a solution to the problem of data joining of disparate stores, please drop me a line.

I surely don’t. :-) My main problem is that I need both–the ability to join LDAP information (Student information, say), as it happens, with additional information relevant to it from a database (classes they’re taking).

The decidedly brute-force, ugly, smelly, hairy, nasty and yet strangely cool approach I took at one point was to make a kind of query builder that, in conjunction with some simple lookup and filtering facilities implemented on Storage instances–but again, not true searches–whittled down the sets of items from each Storage to be combined (so the Student Storage was able to filter using a simple where clause/predicate, and the Class Storage was able to do the same). Then I loaded those sets into a temporary database (H2–obviously could have been anything) and did the more complicated joining there (minus the simple where clauses/filters that were used to get me the candidate sets). (A tip of my hat to a former colleague for first exploring this approach.) The result, of course, was dog slow and could not be used on enormous datasets, but performance was never a priority and the client understood that they would pay dearly in performance costs for this approach. I was hoping that someone somewhere smarter than me had figured out how to bridge querying disparate systems in a better way.

Thanks for the links to the papers; very interesting reading.
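The filter-then-join idea described in this comment can be sketched roughly as follows. Everything here is an assumption for illustration: the `Student` and `Enrollment` types, the `ListStorage` class, and the field names are hypothetical, and the final join is done with an in-memory hash map rather than by loading the candidate sets into a temporary H2 database as the comment describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical entity kept in the "LDAP" store.
final class Student {
    final String id;
    final String name;
    final int year;

    Student(String id, String name, int year) {
        this.id = id;
        this.name = name;
        this.year = year;
    }
}

// Hypothetical entity kept in the "database" store.
final class Enrollment {
    final String studentId;
    final String course;

    Enrollment(String studentId, String course) {
        this.studentId = studentId;
        this.course = course;
    }
}

// A storage that offers simple predicate filtering but no real query language.
final class ListStorage<T> {
    private final List<T> items;

    ListStorage(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    List<T> filter(Predicate<T> where) {
        return items.stream().filter(where).collect(Collectors.toList());
    }
}

final class CrossStoreJoin {
    // Hash join of two pre-filtered candidate sets on the student id.
    // Result rows keep the order of the enrollment list.
    static List<String> join(List<Student> students, List<Enrollment> enrollments) {
        Map<String, Student> byId = new HashMap<>();
        for (Student s : students) {
            byId.put(s.id, s);
        }
        List<String> rows = new ArrayList<>();
        for (Enrollment e : enrollments) {
            Student s = byId.get(e.studentId);
            if (s != null) {
                rows.add(s.name + ": " + e.course);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        ListStorage<Student> ldap = new ListStorage<>(Arrays.asList(
                new Student("s1", "Ada", 2),
                new Student("s2", "Bob", 1)));
        ListStorage<Enrollment> db = new ListStorage<>(Arrays.asList(
                new Enrollment("s1", "Databases"),
                new Enrollment("s2", "Databases"),
                new Enrollment("s1", "Compilers")));

        // Step 1: each storage whittles down its own items with a simple filter.
        List<Student> seniors = ldap.filter(s -> s.year >= 2);
        List<Enrollment> all = db.filter(e -> true);

        // Step 2: the actual join happens outside both storages.
        System.out.println(join(seniors, all));
    }
}
```

As in the comment, the cost is that every candidate set has to be pulled out of its storage first, so this only works while the filtered sets stay small.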

stephan

@Laird: Your query builder doesn’t sound too ugly, but I haven’t seen the code :-)

“I was hoping that someone somewhere smarter than me had figured out how to bridge querying disparate systems in a better way.”

Uh, smarter? Then I’m most probably not the right person.

Perhaps the joins are only needed for reporting. If that is the case, it would be best to also write the data into an OLAP system for reports.

Like http://mondrian.pentaho.org/

Part of the point of splitting storage and query is that it becomes easier to do cross-storage queries. If you store objects in many places, but index/query them in one (again, the website vs Google analogy), it becomes supertrivial to query stuff in different places. The LDAP database problem you outline is one of the cases I had in mind when I designed these APIs in Qi4j, because I want to be able to do the same thing. In Qi4j our primary indexer is going to be Sesame2 (i.e. RDF), with SPARQL as the main query language (although it’s usually hidden under a domain-oriented Java API). It will be very interesting to see how it works out.
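The "store in many places, index in one" idea from this comment can be illustrated with a small sketch. It is not the Qi4j API and does not use Sesame or SPARQL; `EntityStore`, `MapStore`, `EntityRef`, and `CentralIndex` are hypothetical names, and the index is a plain map from `attribute=value` to references.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical storage interface: lookup by id only, no queries.
interface EntityStore {
    Map<String, String> get(String id); // attributes of one entity, or null
}

// Simple map-backed store standing in for LDAP, a database, S3, ...
final class MapStore implements EntityStore {
    private final Map<String, Map<String, String>> entities = new HashMap<>();

    void put(String id, Map<String, String> attributes) {
        entities.put(id, attributes);
    }

    public Map<String, String> get(String id) {
        return entities.get(id);
    }
}

// Points at an entity in a named store, like a search hit pointing at a site.
final class EntityRef {
    final String store;
    final String id;

    EntityRef(String store, String id) {
        this.store = store;
        this.id = id;
    }

    @Override
    public String toString() {
        return store + ":" + id;
    }
}

// One index over all stores, keyed by "attribute=value".
final class CentralIndex {
    private final Map<String, List<EntityRef>> refsByAttribute = new HashMap<>();

    void index(String store, String id, Map<String, String> attributes) {
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            refsByAttribute
                    .computeIfAbsent(a.getKey() + "=" + a.getValue(), k -> new ArrayList<>())
                    .add(new EntityRef(store, id));
        }
    }

    List<EntityRef> query(String attribute, String value) {
        return refsByAttribute.getOrDefault(
                attribute + "=" + value, Collections.<EntityRef>emptyList());
    }
}

final class CentralIndexDemo {
    public static void main(String[] args) {
        MapStore ldap = new MapStore();
        MapStore db = new MapStore();
        Map<String, EntityStore> stores = new HashMap<>();
        stores.put("ldap", ldap);
        stores.put("db", db);
        CentralIndex index = new CentralIndex();

        Map<String, String> student = new HashMap<>();
        student.put("studentId", "s1");
        student.put("name", "Ada");
        ldap.put("s1", student);
        index.index("ldap", "s1", student);

        Map<String, String> enrollment = new HashMap<>();
        enrollment.put("studentId", "s1");
        enrollment.put("course", "Databases");
        db.put("e7", enrollment);
        index.index("db", "e7", enrollment);

        // One query finds entities in both stores; each hit is resolved in its own store.
        for (EntityRef ref : index.query("studentId", "s1")) {
            System.out.println(ref + " -> " + stores.get(ref.store).get(ref.id));
        }
    }
}
```

The stores themselves stay trivial; all cross-store knowledge lives in the index, which is the part a real system would hand to a dedicated indexer.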

The unholy legacy of databases is that database professionals like myself have not adequately engaged the object-oriented programming community like yourselves.

There is nothing in the relational technology that obliges storage and search to be unified. It is the POJO/DAO mindset that combines them. This mindset is the very essence of the long-bemoaned object/relational “impedance mismatch.”

My preferred persistence framework is Oracle ADF Business Components for Java. Suggest you take a look at it for a database guy’s perspective on persistence programming.

Best,

Andrew
