Scala Goodness: RichString

Scala is a marvelous beast. Fire up the Scala shell and enter:

scala> "capitalize"
res0: java.lang.String = capitalize

scala> "capitalize".capitalize
res1: String = Capitalize

How can that be? "capitalize" is of type java.lang.String, and the Java class does not have a capitalize method. But the Scala class RichString does (Scala 2.8 will use StringOps). And there are more methods on RichString, like reverse, drop, toDouble, toInt and toLong:

scala> "123".toInt
res2: Int = 123

scala> "123".toLong
res3: Long = 123

Another nice one is format. Not as nice as GStrings in Groovy, but nevertheless:

scala> "hello %s".format("stephan")
res6: String = hello stephan

How can Scala do this? There is an implicit conversion going on, which automatically converts your Java String to a RichString when a method cannot be found on String but exists on RichString. The relevant code is in Predef (look for yourself and see how deep the rabbit hole goes):

implicit def stringWrapper(x: String) = new runtime.RichString(x)
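
You can use the same mechanism to add methods of your own to String. A minimal sketch, where MyRichString and the shout method are made up for illustration:

object Enrich {
  class MyRichString(s: String) {
    def shout: String = s.toUpperCase + "!"
  }

  // the compiler applies this conversion whenever it sees a method
  // that String does not have but MyRichString has
  implicit def toMyRichString(s: String) = new MyRichString(s)

  def main(args: Array[String]): Unit =
    println("hello".shout)   // prints HELLO!
}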

Scala Glory!

NoSQL: The Dawn of Polyglot Persistence

For some developers polyglot programming is already a reality. I'm not such a big fan of polyglot programming, using many programming languages in one company. Especially for small companies there are hurdles, like staff turnover. I've seen projects stranded because no one understood a particular language. Or as Alex Ruiz writes:

I haven’t seen any practical evidence yet to convince me this is a good idea.

Contrary to that, I'm a big fan of polyglot persistence. This simply means using the right storage backend for each of your use cases: for example file storage, SQL, graph databases, data warehouses, in-memory databases, network caches, NoSQL. Today mostly two kinds of storage are used, files and SQL databases, and neither is optimal for every use case. In the words of Ben Scofield:

Many applications may require a non-traditional data store (say, something like MongoDB) for their core domain, but have other features that fit perfectly into a relational database – say, a CMS that relies heavily on custom fields and has a traditional user management system. Just as polyglot programmers may use multiple languages in a single application, I think the future of the web is polyglot persistence: we should use the database that best represents our domain, even if that requires several distinct systems within a single application.

SQL is just fine

But you might say: "SQL is working for me!" Yes, maybe. But in reality SQL storages are often problematic, not during development but during operations. SQL storages are hard to scale - not impossible, but scaling a MySQL database with master/slave setups and replication chains is no easy task. And when scaling, most companies drop SQL features like JOINs because they are slow and notoriously hard to scale.

Many companies on the web wave front have created their own storages to better suit their needs: Flickr, Facebook, Google and Amazon, to name only a few. The ones those companies built and partly open sourced are Cassandra, Dynamo, BigTable, Haystack and MapReduce/Hadoop - although all of these companies use SQL databases too, for the right use cases.

There are 4 main use cases for storages:

  1. Storing data, read/write
  2. Searching data, mostly read
  3. Navigation over data, read
  4. Reporting, read

SQL storages can do all four of them, though SQL as a language and philosophy is mainly driven by "4. Reporting". Or as Rickard tweeted:

@emileifrem yeah, but I have to do this relational database thing just once. It's for.... reporting! 🙂 The one thing it does really well.

Current SQL databases therefore do not support all use cases well (the decision can be based on the CAP theorem and BASE vs. ACID consistency). The one guy who drove home this point long ago is Rickard with Qi4j: storing and querying are separated from each other:

In Qi4j we have explicit SPI support for storing and querying objects, separated from each other. Typically the EntityStore SPI will be implemented by a key-value implementation, the benefits of which was described in the previous post. The EntityFinder SPI is currently only implemented by an RDF repository extension based on Sesame.
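
The idea of separating storing from querying can be sketched with two small interfaces. The traits below are illustrative stand-ins, not Qi4j's actual SPI:

// illustrative stand-ins, not Qi4j's actual SPI: one interface for
// storing entities (key/value style), one for finding them (query style)
trait EntityStore[K, E] {
  def put(key: K, entity: E): Unit
  def get(key: K): Option[E]
}

trait EntityFinder[K] {
  // returns matching keys; loading the entities goes through the store
  def find(query: String): Seq[K]
}

The store could be backed by any key/value implementation, while the finder could be backed by a search engine or, as in Qi4j, an RDF repository.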

Best of breed

There are some best-of-breed examples for the four use cases above:

  1. Storing: Key/Value stores, document DBs like CouchDB
  2. Searching: search engines, SOLR
  3. Navigation: either by hand in SQL, K/V stores, XML, JSON etc., or with a graph database like Neo4J (which can also do 1.)
  4. Reporting: SQL or structured stores like MongoDB, MapReduce with Hadoop and Pig

Using the most suitable storage for your use case will lead to a better fit and fewer problems ahead concerning data management, scalability and performance.

Problems with polyglot persistence?

Are there problems with polyglot persistence? Indeed there are some. It's hard to join and query data across storages, because there is no way for the different storages to interact except through your application code. The same goes for aggregating data across storages (how many users have bought a product?). And referential integrity goes out the window. But if you scale with sharding, partitioning and forbidden joins, you have to live with the same restrictions anyway.
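
To make that concrete: a join across two storages ends up as application code along these lines. The maps and lists below are stand-ins for whatever backends you actually use (Scala 2.8 collections assumed):

// stand-ins for two different storages, just to show the shape of an
// application-level "join": fetch from one store, look up in the other
val users  = Map(1 -> "alice", 2 -> "bob")           // e.g. from a key/value store
val orders = List((101, 1), (102, 1), (103, 2))      // e.g. from a document store: (orderId, userId)

// "how many orders has each user placed?" must be computed in application code
val ordersPerUser = orders.groupBy(_._2).map { case (userId, os) =>
  users(userId) -> os.size
}
// ordersPerUser: Map(alice -> 2, bob -> 1)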

Storage evolution

Storage systems develop over time. The evolutionary steps for storages are:

  1. Unstructured: file system
  2. Unification: Network databases, CODASYL, SQL RDBMS
  3. Specialization: NoSQL DBs together with SQL - what I call polyglot persistence
  4. Abstraction: Amazon Dynamo; this will lead to "Give me a storage for optimal load/store, give me one for search, I want to report ABC"
  5. Automatic: Just like garbage collection (GC) in VMs like the Java VM, automatic adaptation and migration of data over storages. You no longer need to care.

Currently we're on level 3/4 with some storages and will move to 5. Developers and admins will need to make a big leap of faith, just as C developers needed to have faith in the GC of Java. You will no longer know what's going on in detail.

The future

We can already see the beginnings of automatic storage. Swarm is one example of an automatic storage engine that optimizes execution context and data locality:

  • Written using continuations in Scala
  • The execution context gets transferred to the data, so code always executes near the data
  • Optimization of data distribution, to reduce moving execution contexts to a minimum

They have a video on their website which is worth watching. In general, future storages will self-optimize for different use cases, just like the Java GC optimizes memory management, the HotSpot VM optimizes compilation and Google AppEngine optimizes execution context - make the jump with faith.

Conclusion

Storages are changing. You need to take action. Learn about NoSQL storages and polyglot persistence. It's no longer enough to know and use only one "hammer" (SQL storages); not every storage problem is a nail. There will be self-optimizing storages for different use cases in the not-so-distant future.

Java Interview questions: Multiple Inheritance

Recruiting and interviewing is a never-ending task for development managers. Sometimes you get help from HR, but you're still on your own when deciding to hire or not to hire. Or as I wrote last time in "Java Interview questions: Write a String Reverser":

Interviewing developers for a programming job is hard and tedious. There are some excellent guides, like the Joel Guerrilla Guide to Interviewing, but in the end you need to decide yourself to hire or not to hire. To get a quick idea about their programming abilities I have considered asking the String reverse question.

Besides the String reverse question I have another favorite.

Does Java support multiple inheritance?

Well, obviously Java does not have multiple inheritance in the classical sense of the word. So the right answer should be "no", "no, but" or "yes, but". From there one can explore in several directions. Mostly I start by asking whether the Java language designers were too stupid to implement multiple inheritance. Why did the C++ guys implement it then? We usually land at the Diamond Anti-Pattern then:

In object-oriented programming languages with multiple inheritance and knowledge organization, the diamond problem is an ambiguity that arises when two classes B and C inherit from A, and class D inherits from both B and C. If a method in D calls a method defined in A (and does not override the method), and B and C have overridden that method differently, then from which class does it inherit: B, or C?

The other direction to explore is how Java "emulates" multiple inheritance. The answer, which might already have surfaced, is interfaces. We then usually discuss interfaces in Java: if, when and how the candidate has used them in the past. What are interfaces good for, does he like them? I can explore how good he is at modelling and sometimes make him draw a diagram with interfaces. We go on to problems of interfaces in Java, and what happens when two interfaces have the same static fields and a class implements both - some kind of "multiple inheritance" in Java:

public interface I1 {
   String NAME = "codemonkeyism";
}

public interface I2 {
   String NAME = "stephan";
}

public class C implements I1, I2 {
   public static void main(String[] args) {
      System.out.println(NAME);
   }
}

Staying consistent, the language designers decided that this does not work in Java:

C.java:3: reference to NAME is ambiguous, both variable NAME 
              in I1 and variable NAME in I2 match
      System.out.println(NAME);
                         ^
1 error

There are many more directions to explore with the candidate, e.g. what the modifiers of interface methods are. Are mixins or traits a better solution to the diamond problem than interfaces? Or are they just as bad as multiple inheritance? As I'm no longer very fond of inheritance and think most uses are a code smell, one can also discuss the downsides of inheritance - tight coupling for example - with the candidate.
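
For reference, Scala's answer to the diamond with traits is linearization. A minimal sketch, not something I expect a candidate to write on a whiteboard:

object Diamond {
  trait A { def name: String = "A" }
  trait B extends A { override def name: String = "B" }
  trait C extends A { override def name: String = "C" }

  // the diamond: D inherits name from both B and C; Scala's
  // linearization picks the rightmost trait, so name returns "C"
  class D extends B with C

  def main(args: Array[String]): Unit =
    println((new D).name)   // prints C
}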

Why?

Why do I ask this question and what do I learn from it? Inheritance is a very basic concept in OO and should be understood by every Java developer. Reflection on one's work and going beyond knowing the syntax is an essential trait too, hence the multiple inheritance question. I prefer questions that offer many opportunities to explore. The inheritance question is one of them: multiple inheritance, language design, code smells, solutions, interfaces, role-based development.

If you interview with me, you should know the answers, or tell me you've read my blog.

Scala Goodness: Tuples

Scala has a wonderful feature: tuples. As others have already written, tuples are very simple but powerful. Especially if you come from Java, they easily solve some problems that are ugly to solve in Java.

What are tuples? Tuples are containers for values. In Scala you create a Tuple with:

scala> val t = (1,2)
t: (Int, Int) = (1,2)

which is syntactic sugar for

scala> val t = new Tuple2(1,2)
t: (Int, Int) = (1,2)

as tuples are plain classes in the Scala library. Tuples are of type Tuple1, Tuple2, Tuple3 and so on. There is currently an upper limit of 22 in the Scala library for creating tuples, which should be enough (as is 640k of RAM). If you need more, then you probably really need a collection, not a tuple.

Values in a tuple don't need to be of the same type, as shown here:

scala> val t = (1, "Codemonkeyism")
t: (Int, java.lang.String) = (1,Codemonkeyism)

which is one reason you should not think of them as collections (see below).

After creating a tuple there are several ways of accessing the values:

scala> t
res2: (Int, Int) = (1,2)

scala> t._1
res3: Int = 1

scala> t._2
res4: Int = 2

Besides accessing the values "by index", it's most often more readable to unpack tuples into variables. Scala uses extractors for this:

scala> val (x,y) = (1,2)
x: Int = 1
y: Int = 2

Scala matches the unbound variables on the left, x and y, with the values contained in the tuple. If you need only one of the values, you can ignore the other with an underscore:

scala> val (x,_) = (1,2)
x: Int = 1

Tuples can be used for returning multiple values from a method - something which is often missed in Java. Side note: contrary to others, I think you should consider creating a class as the return type if it carries semantic value and you reuse the type (see the sketch after the next example). We define a method which returns two values. Using the unpacking from above, we assign each value to its own variable:

scala> def m(a:Int, b:Int) = { (a+b,a-b) }
m: (Int,Int)(Int, Int)

scala> val (s,d) = m(5,8)
s: Int = 13
d: Int = -3
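
As hinted at in the side note above, if the pair carries domain meaning, a small case class often reads better than a tuple. A sketch, where SumAndDiff is made up for illustration:

// a named type instead of a Tuple2; sum and diff say what each value means
case class SumAndDiff(sum: Int, diff: Int)

def sumAndDiff(a: Int, b: Int) = SumAndDiff(a + b, a - b)

// sumAndDiff(5, 8).sum is 13, sumAndDiff(5, 8).diff is -3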

Tuples are not collections. As Jesse Eichar writes:

Tuples are quite handy but a potential annoyance is that at a glance they seem list-like but appearances can be deceiving. The normal collection methods are not supported by Tuples.

He gives some examples of how to iterate through tuples though:

scala> (1,2,3).productIterator foreach {println _}            
1
2
3

Another nice hack with Tuples: 1->2 creates a tuple in Scala.

scala> 1->2
res0: (Int, Int) = (1,2)

This is used for creating maps - no need to handle this at the language level as in other languages. As before, this is defined in the Scala library, not the language.

scala> val m = Map(1->2, 3->4)
m: scala.collection.immutable.Map[Int,Int] = Map(1 -> 2, 3 -> 4)

Scala glory!