The dark side of NoSQL

There is a dark side to most of the current NoSQL databases. People rarely talk about it. They talk about performance, about how easy schemaless databases are to use, about nice APIs. But the people talking are mostly developers, not operations staff and system administrators. No one asks those. Yet that is where the rubber hits the road.

The three problems no one talks about - almost no one, I had a good talk with the Infinispan lead [1] - are:

  • ad hoc data fixing - either no query language available or no in-house skills
  • ad hoc reporting - either no query language available or no in-house skills
  • data export - sometimes no API way to access all data

In an insightful comment to my blog post "Essential storage tradeoff: Simple Reads vs. Simple Writes", Eric Z. Beard, VP Engineering at Loop, wrote:

My application relies on hundreds of queries that need to run in real-time against all of that transactional data – no offline cubes or Hadoop clusters. I’m considering a jump to NoSql, but the lack of ad-hoc queries against live data is just a killer. I write probably a dozen ad-hoc queries a week to resolve support issues, and they normally need to run “right now!” I might be analyzing tens of millions of records in several different tables or fixing some field that got corrupted by a bug in the software. How do you do that with a NoSql system?

  1. Data export: NoSQL databases are affected by these problems to different degrees; each of them is unique. With some it's easy to export all your data, mostly the non-distributed ones (CouchDB, MongoDB, Tokyo Tyrant), while it is more difficult with the distributed ones (Voldemort, Cassandra). Voldemort looks especially weak here.
  2. Ad hoc data fixing: With the non-distributed NoSQL stores, which do possess a query and manipulation language, ad hoc fixing is easier, while it is harder with distributed ones (Voldemort, Cassandra).
  3. Ad hoc reporting: The same goes for ad hoc reporting. The better the query capabilities (CouchDB, MongoDB), the easier ad hoc reporting becomes. For some of those reporting woes Hadoop is a solution. But as the Scala Swarm author Ian Clarke notes, map/reduce is not applicable to every problem. Either way you need to train customers and manage their expectations, as they have become addicted to ad hoc reporting. This is not only a technical question, but a cultural one.
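Without a query language, an ad hoc data fix usually turns into a small script that scans every record. The following is only a sketch in Python under stated assumptions: a plain dict stands in for a key-value store that can iterate over all its keys, and the record layout, the corrupted `country` field and the name `fix_country_codes` are made up for illustration.

```python
import json

# Stand-in for a key-value store (assumption). A real store needs some way
# to iterate over all keys, which not every NoSQL API offers.
store = {
    "user:1": json.dumps({"name": "Ann", "country": "DE"}),
    "user:2": json.dumps({"name": "Bob", "country": "de"}),  # corrupted by a bug
    "user:3": json.dumps({"name": "Cid", "country": "us"}),  # corrupted by a bug
}


def fix_country_codes(kv):
    """Scan every record and upper-case the 'country' field where needed.

    Returns the keys that were rewritten, so the fix can be audited.
    """
    fixed = []
    for key in list(kv.keys()):
        record = json.loads(kv[key])
        country = record.get("country")
        if country is not None and country != country.upper():
            record["country"] = country.upper()
            kv[key] = json.dumps(record)
            fixed.append(key)
    return fixed


if __name__ == "__main__":
    print(fix_country_codes(store))
```

On a distributed store the same scan has to touch every node, which is part of why such fixes are harder there.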

One solution is to split your data into data that needs to be queried or reported on (User, Login, Order, Money) and data that needs the best performance (app data, social network data). Use a traditional SQL database for the first kind of data, and a fast, distributed NoSQL store for the second kind. Joining will be difficult, you need to support more systems, and skills are an issue. But the three problems can be solved this way.

What is your NoSQL strategy? Please leave a comment, I would like to know.

[1] They plan a distributed query language for ad hoc reporting in distributed environments.

Real JSON vs. XMLish JSON

Recently, while playing with data formats, I came to the conclusion that XML and JSON cannot be converted into each other nicely. Each format misses something the other has. Good JSON misses root types and types for arrays, both of which XML has, while XML misses list types, which JSON has. This leads to XMLish JSON when people transform JSON to XML and vice versa. With the advent of document stores and NoSQL, this means that you as a developer have to decide how to store your data. Let's explore this.

Suppose we have a short XML format for storing shopping lists. We have a list with a name, an id and a sub list of items.

<shoppinglist>
  <id>123</id>
  <name>Stephans List</name>
  <items>
    <item>
      <id>234</id><description>Apple</description>
    </item>
    <item>
      <id>233</id><description>Banana</description>
    </item>
  </items>
</shoppinglist>

From reading this XML it's easy to see that items is a list because it contains several entries of the same type. The XML can be transformed to proper JSON:

{
  "id": "123",
  "name": "Stephans List",
  "items": [
    { "id": 234, "description": "Apple" },
    { "id": 233, "description": "Banana" }
  ]
}

XMLish JSON

I call this proper JSON, compared to XMLish JSON:

{
  "shoppinglist": {
    "id": "123",
    "name": "Stephans List",
    "items": [
      { "item": { "id": 234, "description": "Apple" } },
      { "item": { "id": 233, "description": "Banana" } }
    ]
  }
}

We want proper JSON, because working with XMLish JSON looks ugly in code. To access an item id one would need to write

id = list.shoppinglist.items[0].item.id

compared to

id = list.items[0].id
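When the format is known, the conversion from the shopping list XML to proper JSON can be written by hand. The following is only a sketch in Python using the standard library; the function name `shoppinglist_to_dict` is made up, and the knowledge that items is a list and that item ids are numbers is hard-coded rather than read from the XML.

```python
import json
import xml.etree.ElementTree as ET

SHOPPING_LIST_XML = """
<shoppinglist>
  <id>123</id>
  <name>Stephans List</name>
  <items>
    <item>
      <id>234</id><description>Apple</description>
    </item>
    <item>
      <id>233</id><description>Banana</description>
    </item>
  </items>
</shoppinglist>
"""


def shoppinglist_to_dict(xml_text):
    """Convert the shopping list XML into the proper (non-XMLish) structure."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.findtext("id"),
        "name": root.findtext("name"),
        # The <item> wrapper is dropped: the array itself says "list".
        "items": [
            {
                "id": int(item.findtext("id")),
                "description": item.findtext("description"),
            }
            for item in root.findall("items/item")
        ],
    }


if __name__ == "__main__":
    data = shoppinglist_to_dict(SHOPPING_LIST_XML)
    print(json.dumps(data, indent=2))
    print(data["items"][0]["id"])
```

The price of the clean access path is that the list knowledge now lives in application code, not in the data.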

When seeing the following fragment, it's not easy to decide if items is a list of items or not.

<items>
  <item>
    <id>234</id><description>Apple</description>
  </item>
</items>

Why do we need to know? Because when transforming this XML to JSON, you need to decide between representing it as

{ "items": { "item": { "id": 234, "description": "Apple" } } }

or

{ "items": [{ "id": 234, "description": "Apple" }] }

XML can solve this decision problem with meta descriptions such as XSD and DTD. But when transforming XML to JSON in your code, evaluating a DTD or XSD description costs performance and needs a lot of ugly code.
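The decision problem shows up as soon as a generic converter has to guess. The following is only a sketch in Python; `naive_to_json` is a made-up name and the guessing rule (repeated tags become a list) is an assumption about how such a converter might work, not a description of any particular library.

```python
import xml.etree.ElementTree as ET


def naive_to_json(elem):
    """Convert an element without any schema: repeated child tags become a list."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = naive_to_json(child)
        if child.tag in result:
            # Second occurrence of the same tag: only now do we know it is a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


ONE_ITEM = "<items><item><id>234</id><description>Apple</description></item></items>"
TWO_ITEMS = (
    "<items>"
    "<item><id>234</id><description>Apple</description></item>"
    "<item><id>233</id><description>Banana</description></item>"
    "</items>"
)

if __name__ == "__main__":
    print(naive_to_json(ET.fromstring(ONE_ITEM)))
    print(naive_to_json(ET.fromstring(TWO_ITEMS)))
```

With one item the result holds a single object under "item", with two items it holds a list, so the shape of the JSON depends on how many entries the list happens to have.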

JSON to XML conversions

The same problem occurs when transforming JSON to XML. And with the upcoming NoSQL stores that focus on JSON, this will perhaps become a major problem in the future. Lots of semantic information is lost in the data format and is only present in application code. Take our example:

{
  "id": "123",
  "name": "Stephans List",
  "items": [
    { "id": 234, "description": "Apple" },
    { "id": 233, "description": "Banana" }
  ]
}

When we want to transform this to XML, we do not know the name of the root node. Without a root node the XML is not valid. An unsatisfying solution would be to create a generic root like <document>. We also have a problem with the entries of the items array. What names do those nodes have? <item-entry>? How ugly.
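To see what the unsatisfying solution looks like in practice, here is a sketch in Python. The function name `to_xml` is made up, and the root name `document` and the entry name `item-entry` are exactly the invented names discussed above, because the JSON carries no better ones.

```python
import xml.etree.ElementTree as ET

SHOPPING_LIST = {
    "id": "123",
    "name": "Stephans List",
    "items": [
        {"id": 234, "description": "Apple"},
        {"id": 233, "description": "Banana"},
    ],
}


def to_xml(value, tag, entry_tag="item-entry"):
    """Build an element for a JSON value; array entries all get the same made-up tag."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_xml(child, key, entry_tag))
    elif isinstance(value, list):
        for child in value:
            # JSON arrays carry no name for their entries, so one is invented.
            elem.append(to_xml(child, entry_tag, entry_tag))
    else:
        elem.text = str(value)
    return elem


if __name__ == "__main__":
    root = to_xml(SHOPPING_LIST, "document")  # generic root, also invented
    print(ET.tostring(root, encoding="unicode"))
```

Both invented names would have to be known again when reading the XML back, which is the semantic information that only lives in application code.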

Solutions

I've written about a solution - a third format from which to generate both JSON and XML. Most solutions without a higher-level format (this includes XSD and XMLish JSON) are not very satisfying. Badgerfish creates very XMLish JSON documents with @xmlns entries for namespaces and $ entries for text content, and loses a lot of appeal compared to lean JSON. The one solution I currently use is to store XML, not JSON, when storing data in a key value store. A "list:" namespace or a type attribute then lets us easily transform this XML to JSON.

  <items type="list">
  <list:items>
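A sketch in Python of how the type attribute variant could drive the conversion. The name `hinted_to_json` is made up, only the `type="list"` attribute form is handled (not the namespace form), and all leaf values stay strings.

```python
import xml.etree.ElementTree as ET


def hinted_to_json(elem):
    """Convert an element to JSON-like data, trusting a type="list" hint."""
    children = list(elem)
    if elem.get("type") == "list":
        # The hint decides: always a list, even with zero or one entries.
        return [hinted_to_json(child) for child in children]
    if not children:
        return elem.text
    return {child.tag: hinted_to_json(child) for child in children}


SINGLE_ITEM = (
    '<items type="list">'
    "<item><id>234</id><description>Apple</description></item>"
    "</items>"
)

if __name__ == "__main__":
    print(hinted_to_json(ET.fromstring(SINGLE_ITEM)))
    print(hinted_to_json(ET.fromstring('<items type="list"/>')))
```

Because the list information travels with the data, a single entry and an empty list both come out as JSON arrays without evaluating an XSD or DTD.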

A solution for JSON? When you need to store JSON, supplement your data with meta information on types. How are you gonna solve this?

Agile isn’t low quality – a rebuttal to Mike Brunt

Mike Brunt quotes an email on his blog. The mail expresses a very negative view of agile. Although the email was not written by Mike, he is "posting this here because it states my reservations very well". So I address him directly, because my experience with Agile over the last years is contrary to his.

1. Agile programming emphasizes programming over engineering. This results in software that does not have clean interfaces and is intertwined with other code. Of course, such code is difficult to maintain, debug, and replace. Expensive code bloat is the consequence.

The basic fallacy here is that agile equals chaos. When looking at the clean code movement, which has grown out of agile, there is a lot of emphasis on good code. With heavy unit testing and refactoring, especially when you do TDD, code has clean interfaces and is easy to debug, maintain and replace. The code is light and agile.

Many of these [bugs] are edge cases and not detectable by testing.

I think edge cases especially are found by testing. Agile emphasizes tight integration with testers and quality assurance. From my experience with agile, this leads to more test-aware thinking among developers, and so to fewer bugs. Thanks to continuous integration and automation, static analysis tools like FindBugs or PMD are also more accepted in agile circles, which results in higher quality as well.

An agile team, like a Scrum development team, has more control and more responsibility. They consider the code "their" code, which is often not the case with non-agile teams, who are pushed towards deadlines and low quality.

Everywhere I introduced Scrum or Agile, code quality went up, architecture quality went up and bugs went down.

4. Agile programming deemphasizes designing performance into products.

Thanks to continuous integration and tight integration with QA and operations, many agile teams have a clear grasp of performance. They include rigorous performance tests in their builds and deployment strategies. Especially with continuous deployment, designing for and measuring performance is a must.

5. Agile programming never views a program or project as complete. There's always room to tinker and add new levels of abstraction and modify the mechanics of a program. Expenses around programming become a sucking black hole.

On the contrary: Scrum emphasizes business value delivered, as does XP. No more tinkering, no exploding programming expenses. Some agile shops can clearly state the cost of each story point and calculate a detailed ROI.

6. Agile programming is a model that rewards software churn. It's a great model for building fiefdoms in a corporation and employ busy programmers; it's terrible for corporations who want to produce maintainable stable quality products that will not incure high overheads.

Not sure if this is ironic and the post a satire. Nothing could be further from the truth. Agile is about communication and cooperation. A Scrum of Scrums coordinates teams to solve problems together for the future.

7. Agile programming deemphasizes quality. Deploying software that works after a fashion "rather than waiting for perfection" introduces a dangerous slippery slope. I doubt that many managers can define "acceptable imperfection." Quality should be job one. Apple demonstrates that customers will pay a premium for well designed and implemented software.

As stated above, agile emphasizes quality on every level. Only agile teams consistently have a Level of Done agreement, under which jobs are only done when they are "done done". This often includes acceptance tests, documentation, clean code, unit tests, release notes, refactoring and code reviews (I'm reading from the Level of Done Agreement on the wall here).

8. Agile programming over emphasizes schedules. Production schedules and engineering requirements should be balanced by management.

While it's true that iteration cycles often coincide with releases to shorten time to market, the two are independent. In agile, product managers can release every minute, every day, every two weeks or every six months, just as they see fit.

9. When there are many projects to add assorted features to a product, code become difficult to manage. Code merges and inconsistencies become difficult to manage so all the pieces play together. Merging code down can take several days given high rates of code churn. Costs associated with code management are not linear as the number of projects increase. I suspect that the cost function is exponential.

Merging sometimes gets difficult, with agile or without. Many agile shops migrate to distributed version control systems like Git or Mercurial, which aid tremendously with merging. Having personally worked in environments of 5 to 70 developers, I haven't seen merges that "take several days". Distributed version control helps with keeping costs down.

10. Agile programming uses customers as the test bed. Customers don't appreciate being treated as guinea pigs.

Agile programming delivers low bug-count, production-ready code to customers. Due to late binding of requirements, active communication with customers and reviews, customers get what they need and want, not something they assumed they needed some months ago. Due to "done done", customers are not used as "guinea pigs", contrary to traditional development with "alpha", "beta" and "RC" releases.

11. The agile programming model creates an unstable expensive house of cards. The house of cards will eventually collapse despite efforts to keep it standing.

Huh?

I've been dabbling with agile since the beginning of XP and Kent's first book. Later I introduced Agile in some companies and processes and became a Scrum Master. Nowhere I've been have I seen the issues described. On the contrary, those issues were often rampant before the introduction of agile. This leads me to the conclusion that Mike never experienced agile, either by not doing agile or by falling into the traps of "let's do agile" management, ScrumBut or snake-oil-selling consultants.

My experience is very positive, as is the experience of all developers I've asked and worked with. But perhaps it's best to draw your own conclusion by listening to both sides.

How your application becomes enterprisy

Your application is going to be an enterprise application soon. Prepare for it. There is a certain disdain for enterprise applications in the new world of dynamic languages and frameworks like Ruby/Rails or Python/Django. Mostly the term is associated with the world of Java and C#. Developers think they are immune to enterprise woes. Think again.

Contrary to public opinion, enterprise applications are not defined as applications for the enterprise. What developers associate with enterprise applications, like untested code, tangled code, old frameworks and slow development, is not something that only happens in the enterprise. If you do not act, it happens to every application.

I've seen many applications becoming enterprise applications over time. Top driving forces for applications becoming "enterprise" are:

  1. A startup with few people becomes a company with many people - Craigslist being the exception
  2. A growing company with investors gains more goals, which results in more feature wishes. Often this means more developers. If recruitment isn't tough enough, this leads to a more average developer force. This in turn reduces code quality and leads to less testing, higher coupling and worse documentation.
  3. More employees lead to more wishes and more features. These features need more real estate on the website. Marketing wants banners, sales wants contact forms and customer support wants info boxes. Pages get cluttered, simple forms get complex.
  4. Marketing usually wants to store everything about your customers (and others want that too). This means more fields, more complex forms and more dependencies on third party services.
  5. Integration with other services is the most common enterprise "problem": Integration with mail services, backends, web tracking companies, financial systems, data warehouses or payment providers tangles your application. Deployment gets harder, as does testing. Everything takes longer.
  6. In the first few years a startup has very low turnover. But from my experience as a team manager, retention in our industry is not measured in decades. So as a startup ages, turnover increases: After some years the initial developers are no longer there, and others do not quite have their grasp of the system. Development, code quality and architecture deteriorate.
  7. Founders leave: After some years, founders often leave or are ousted by their VCs. Tech-savvy founding types are replaced by executives. Technology looks less important, quality goes down.

Law of software development: Greenfield becomes brownfield.

If you do not want this to happen, you need to fight it every step:

  • Have migration paths, upgrade paths and life cycle management for frameworks
  • Clean code
  • Architecture guidelines and architecture strategy
  • Someone who fights for clean web pages and forms
  • Strategy for integrating with many, many systems

What do you think? How do you prevent applications from becoming "enterprise"?