Saturday, December 15, 2012

RabbitMQ, ActiveMQ, ZeroMQ, HornetQ

Warning: In this post I'm going to compare RabbitMQ, ZeroMQ, ActiveMQ, and HornetQ. The basis of the comparison is not the performance, or the scalability, or any other serious feature. The comparison is done purely based on the popularity of those systems. Therefore, if you want to see some performance metrics, this post is not what you are looking for.

Note: To calculate popularity, I'm going to use MongoDB and Python, so if you don't care about message brokers, but you want to see some examples of MongoDB scripts, this post might be interesting to you.

Popularity

What is the best messaging system out there? If you read my blog regularly, you probably know my biased answer. But to give an objective answer, we have to compare the candidates based on some criteria. There are multiple criteria, some of which are more relevant to your project than others. One of them is how popular the candidate solutions are. In other words, if you choose a message broker and then you encounter a problem, how easy would it be to solve it? Is there anybody who can help you? One way to find out is to check how many people are interested in the same solution. And the obvious way to do it is to ask Google.

Here is the Google trend graph for the last five years. It turns out, my personal preferences coincide with the public interest.

At this point I can stop and say "Well, you see who's the winner". There are five times as many people interested in RabbitMQ as in HornetQ, so if you bet on Rabbit, you have a better chance of getting help from your fellow programmers when you need it.

But before we make the final decision, I want to hear another opinion about the popularity of our candidates. Where do people go nowadays when they have software-related problems? Right, they go to…

StackOverflow

The best thing about StackOverflow is their REST API. For our purposes we need two API queries: get all questions by a tag, and get all answers for the question. In fact, the second one is optional. Even the first query alone can give us most of what we want to know:

  • how many questions have been posted for every candidate on our list?
  • how many answers did those questions receive?
  • how many answers were accepted?
  • how many questions and answers were marked useful?

When we get all the numbers, we should know what people are actually using. We can also check if there is any correlation between Google data and StackOverflow.

So how do we proceed? We cannot use the API directly to run analytics, because we would quickly exhaust the daily quota. What we can do is fetch the data, save it locally, and run analytics against the local data. Here is another good thing about the StackOverflow API: it returns data in JSON format. What is the best way to analyze JSON data? Obviously, saving it in a JSON-oriented database that supports aggregation queries. And that's where MongoDB comes into play.

Here is the Python script that downloads all the questions for the specified tags from StackOverflow, and saves the results in the local MongoDB instance. I chose Python because I want to draw some graphs later, which is easy to do in Python. Plus, it's a simple and expressive language.
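The download script itself is not reproduced in this text. As a rough sketch of the idea only, assuming a hypothetical `fetch_page(tag, page)` callable that returns one page of question records (however the API is actually queried) and a `collection` object with a pymongo-style `replace_one(filter, doc, upsert=True)` method, the storing step might look like this. Reusing `question_id` as `_id` mirrors the sample record shown further down.

```python
def store_questions(tags, fetch_page, collection, max_pages=1000):
    """Upsert every question for the given tags; return the number of upserts.

    fetch_page(tag, page) is a hypothetical callable that returns a list of
    question dicts, or an empty list when there are no more pages.
    collection is assumed to expose replace_one(filter, doc, upsert=True),
    as pymongo collections do.
    """
    upserts = 0
    for tag in tags:
        for page in range(1, max_pages + 1):
            questions = fetch_page(tag, page)
            if not questions:
                break
            for question in questions:
                # Reuse the StackOverflow id as the MongoDB _id, so a question
                # that carries several of our tags is stored only once.
                doc = dict(question, _id=question["question_id"])
                collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
                upserts += 1
    return upserts
```

Because the write is an upsert keyed on `_id`, running the sketch twice does not create duplicates.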

After we run this script, we get all the questions we need in our database. The next step is to get all the answers for those questions. Here is the script that does that.

Depending on how many questions we saved in the first step, there might be quite a lot of queries to run to get all the answers. With my second script I exceeded the daily quota, so I had to wait until the next day to get the rest of the answers.
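One way to cope with the quota is to make the second step resumable. The sketch below is an illustration under assumptions, not the script used here: `fetch_answers`, `save`, and the `QuotaExceeded` exception are hypothetical names. It skips questions that already have an `answers` field, so it can simply be rerun the next day.

```python
class QuotaExceeded(Exception):
    """Hypothetical error raised by fetch_answers when the quota runs out."""


def attach_answers(questions, fetch_answers, save):
    """Add an 'answers' list to each question that lacks one.

    fetch_answers(question_id) is a hypothetical callable returning a list of
    answer dicts; save(question) is a hypothetical callable that persists the
    updated record. Returns how many questions are still waiting for answers.
    """
    pending = [q for q in questions if "answers" not in q]
    for done, question in enumerate(pending):
        if question.get("answer_count", 0) == 0:
            question["answers"] = []  # nothing to download, no quota spent
        else:
            try:
                question["answers"] = fetch_answers(question["question_id"])
            except QuotaExceeded:
                return len(pending) - done  # rerun later to finish the rest
        save(question)
    return 0
```

Calling the function again after the quota resets picks up exactly where the previous run stopped.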

Now that we have all the data, let's take a look at how we can use it. Here is a typical record. I highlighted the fields that might be useful for our analysis.

{
     "_id" : 269363,
     "accepted_answer_id" : 290764,
     "answer_count" : 4,
     "answers" : [
          ...
          {
               "view_count" : 0,
               "answer_comments_url" : "/answers/303710/comments",
               "answer_id" : 303710,
               "title" : "ActiveMQ .net client locks up",
               "community_owned" : false,
               "down_vote_count" : 0,
               "last_activity_date" : 1317300099,
               "creation_date" : 1227135282,
               "score" : 1,
               "up_vote_count" : 1,
               "owner" : {
                    "display_name" : "HitLikeAHammer",
                    "reputation" : 1152,
                    "user_id" : 35165,
                    "user_type" : "registered",
                    "email_hash" : "584cd9905db85f744e7e96740b11b7c0"
               },
               "accepted" : false,
               "last_edit_date" : 1317300099,
               "question_id" : 269363
          },
          ...
     ],
     "community_owned" : false,
     "creation_date" : 1225989513,
     "down_vote_count" : 0,
     "favorite_count" : 1,
     "last_activity_date" : 1317300112,
     "owner" : {
          "display_name" : "HitLikeAHammer",
          "reputation" : 1152,
          "user_id" : 35165,
          "user_type" : "registered",
          "email_hash" : "584cd9905db85f744e7e96740b11b7c0"
     },
     "question_answers_url" : "/questions/269363/answers",
     "question_comments_url" : "/questions/269363/comments",
     "question_id" : 269363,
     "question_timeline_url" : "/questions/269363/timeline",
     "score" : 1,
     "tags" : [
          ".net",
          "activemq"
     ],
     "title" : "ActiveMQ .net client locks up",
     "up_vote_count" : 1,
     "view_count" : 1183
}

First of all, we want to know how many questions were posted for each messaging system on our list. Here is the MongoDB query for that. The query follows the > prompt, and the results are printed right after it.

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{questions:-1}}
])['result'];
[
     {
          "_id" : "activemq",
          "questions" : 1039
     },
     {
          "_id" : "rabbitmq",
          "questions" : 988
     },
     {
          "_id" : "zeromq",
          "questions" : 373
     },
     {
          "_id" : "hornetq",
          "questions" : 185
     }
]
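For readers who are new to the aggregation framework, here is roughly what the pipeline above does, written as plain Python over a list of question dicts. This is only an illustrative equivalent; the function and constant names are made up for the example.

```python
from collections import Counter

BROKER_TAGS = ("activemq", "rabbitmq", "zeromq", "hornetq")


def questions_per_tag(questions, tags=BROKER_TAGS):
    """Count questions per tag, keep only our tags, sort by count descending."""
    counts = Counter()
    for question in questions:
        for tag in question["tags"]:  # $unwind: one (question, tag) pair each
            counts[tag] += 1          # $group with {$sum: 1}
    wanted = [(tag, counts[tag]) for tag in tags if counts[tag]]   # $match
    return sorted(wanted, key=lambda pair: pair[1], reverse=True)  # $sort


sample = [
    {"tags": [".net", "activemq"]},
    {"tags": ["rabbitmq", "activemq"]},
    {"tags": ["zeromq"]},
]
print(questions_per_tag(sample))
```

A question tagged with two brokers is counted once for each of them, exactly as in the MongoDB query.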

The next query gets the total number of answers by tag.

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', answers:{$sum:'$answer_count'}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{answers:-1}}
])['result'];
[
     {
          "_id" : "activemq",
          "answers" : 1382
     },
     {
          "_id" : "rabbitmq",
          "answers" : 1322
     },
     {
          "_id" : "zeromq",
          "answers" : 572
     },
     {
          "_id" : "hornetq",
          "answers" : 227
     }
]

It seems that the number of answers is proportional to the number of questions. With MongoDB we can quickly verify it.

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', answers:{$sum:'$answer_count'}, questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$project:{answers:1, questions:1, ratio:{$divide:['$answers', '$questions']}}},
     {$sort:{ratio:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "answers" : 572,
          "questions" : 373,
          "ratio" : 1.5335120643431635
     },
     {
          "_id" : "rabbitmq",
          "answers" : 1322,
          "questions" : 988,
          "ratio" : 1.3380566801619433
     },
     {
          "_id" : "activemq",
          "answers" : 1382,
          "questions" : 1039,
          "ratio" : 1.3301251203079885
     },
     {
          "_id" : "hornetq",
          "answers" : 227,
          "questions" : 185,
          "ratio" : 1.227027027027027
     }
]
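The ratios can be double-checked by hand from the totals in the two previous results. The short snippet below only redoes that arithmetic and also shows how far apart the highest and lowest ratios are.

```python
# (answers, questions) per tag, copied from the query results above
totals = {
    "zeromq": (572, 373),
    "rabbitmq": (1322, 988),
    "activemq": (1382, 1039),
    "hornetq": (227, 185),
}

ratios = {tag: answers / questions for tag, (answers, questions) in totals.items()}
spread = max(ratios.values()) / min(ratios.values())

for tag, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag:10s} {ratio:.2f}")
print(f"highest / lowest = {spread:.2f}")
```

The highest ratio (ZeroMQ) is about a quarter larger than the lowest (HornetQ), which is small compared with the more than fivefold difference in question counts.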

Indeed, the answers-per-question ratio is similar for every tag: it ranges from about 1.2 for HornetQ to about 1.5 for ZeroMQ. That means we can use just the number of questions for our analysis.

Here is the query that calculates the number of accepted answers by tag. Again, it correlates fairly well with the total number of answers and questions.

> db.stackoverflow.aggregate([
     {$match:{accepted_answer_id:{$ne:null}}},
     {$unwind:'$tags'},
     {$group:{_id:'$tags', accepted_answers:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{accepted_answers:-1}}
])['result'];
[
     {
          "_id" : "activemq",
          "accepted_answers" : 531
     },
     {
          "_id" : "rabbitmq",
          "accepted_answers" : 500
     },
     {
          "_id" : "zeromq",
          "accepted_answers" : 221
     },
     {
          "_id" : "hornetq",
          "accepted_answers" : 94
     }
]

The next query is more interesting. It calculates the number of question up-votes by tag. In other words, it shows how useful readers found the questions. If we divide it by the total number of questions, we can see which messaging system has a higher rate of useful questions than the others.

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', upvotes:{$sum:'$up_vote_count'}, questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$project:{upvotes:1, questions:1, ratio:{$divide:['$upvotes', '$questions']}}},
     {$sort:{ratio:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "upvotes" : 1078,
          "questions" : 373,
          "ratio" : 2.8900804289544237
     },
     {
          "_id" : "rabbitmq",
          "upvotes" : 1864,
          "questions" : 988,
          "ratio" : 1.8866396761133604
     },
     {
          "_id" : "activemq",
          "upvotes" : 1459,
          "questions" : 1039,
          "ratio" : 1.4042348411934553
     },
     {
          "_id" : "hornetq",
          "upvotes" : 233,
          "questions" : 185,
          "ratio" : 1.2594594594594595
     }
]

Interesting. The ZeroMQ users seem to ask more useful questions than the users of other brokers.

Let's do the same analysis for the answers. Here is the query that calculates the number of answer up-votes by tag.

> db.stackoverflow.aggregate([
     {$unwind:'$answers'},
     {$unwind:'$tags'},
     {$group:{_id:{question:'$_id', tag:'$tags'}, upvotes:{$sum:'$answers.up_vote_count'}}},
     {$group:{_id:'$_id.tag', upvotes:{$sum:'$upvotes'}, questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$project:{upvotes:1, questions:1, ratio:{$divide:['$upvotes', '$questions']}}},
     {$sort:{ratio:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "upvotes" : 1469,
          "questions" : 338,
          "ratio" : 4.346153846153846
     },
     {
          "_id" : "rabbitmq",
          "upvotes" : 2437,
          "questions" : 858,
          "ratio" : 2.84032634032634
     },
     {
          "_id" : "activemq",
          "upvotes" : 2199,
          "questions" : 902,
          "ratio" : 2.4379157427937916
     },
     {
          "_id" : "hornetq",
          "upvotes" : 262,
          "questions" : 156,
          "ratio" : 1.6794871794871795
     }
]
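The two `$group` stages in this query deserve a comment: the first one sums up-votes per (question, tag) pair, and the second one adds those sums up per tag while counting each question once. A plain-Python sketch of the same logic, with made-up function and variable names, could look like this. Note that `questions` here counts only questions that have at least one answer, which is presumably why the figures are lower than before (338 rather than 373 for zeromq): `$unwind` drops documents whose array is empty or missing.

```python
from collections import defaultdict


def answer_upvotes_per_tag(questions):
    """Return {tag: (answer upvotes, answered questions, ratio)}."""
    per_question = defaultdict(int)  # (question id, tag) -> answer upvotes
    for q in questions:
        for answer in q.get("answers", []):  # $unwind answers
            for tag in q["tags"]:            # $unwind tags
                per_question[(q["_id"], tag)] += answer["up_vote_count"]

    upvotes = defaultdict(int)
    answered = defaultdict(int)
    for (_, tag), votes in per_question.items():  # second $group, by tag
        upvotes[tag] += votes
        answered[tag] += 1  # each answered question is counted once
    return {tag: (upvotes[tag], answered[tag], upvotes[tag] / answered[tag])
            for tag in upvotes}
```

Without the first grouping step, a question with four answers would be counted four times in the denominator.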

Again, ZeroMQ users post more useful answers than others.

To complete the picture of typical users, let's run the following query, which calculates the average reputation of the people who post answers.

> db.stackoverflow.aggregate([
     {$unwind:'$answers'},
     {$unwind:'$tags'},
     {$group:{_id:{question:'$_id', tag:'$tags'}, reputation:{$avg:'$answers.owner.reputation'}}},
     {$group:{_id:'$_id.tag', reputation:{$avg:'$reputation'}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{reputation:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "reputation" : 10088.29552338687
     },
     {
          "_id" : "activemq",
          "reputation" : 7298.7539383380845
     },
     {
          "_id" : "rabbitmq",
          "reputation" : 6082.172231934734
     },
     {
          "_id" : "hornetq",
          "reputation" : 3472.9658119658116
     }
]
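One detail worth keeping in mind when reading these numbers: the query averages reputations per question first and then averages those per-question values, so every answered question carries the same weight regardless of how many answers it has. The following sketch (illustrative names, plain Python) reproduces that mean-of-means calculation.

```python
from statistics import mean


def reputation_by_tag(questions):
    """Average, per tag, of the per-question average answerer reputation."""
    per_tag = {}
    for q in questions:
        reps = [a.get("owner", {}).get("reputation") for a in q.get("answers", [])]
        reps = [r for r in reps if r is not None]  # skip answers without an owner
        if not reps:
            continue
        for tag in q["tags"]:
            per_tag.setdefault(tag, []).append(mean(reps))
    return {tag: mean(values) for tag, values in per_tag.items()}


sample_answers = [
    {"tags": ["zeromq"], "answers": [{"owner": {"reputation": 100}},
                                     {"owner": {"reputation": 300}}]},
    {"tags": ["zeromq"], "answers": [{"owner": {"reputation": 1000}}]},
]
print(reputation_by_tag(sample_answers))
```

In this toy input the mean of means is 600, whereas a plain average over the three answers would be lower, so a single high-reputation answerer on a question with few answers has a visible effect.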

Wow. ZeroMQ users not only ask more useful questions and give useful answers, they also have higher reputation on average in the StackOverflow community.

As a final exercise, I want to build a graph of question distribution over time. After all, ActiveMQ is the oldest broker, and it might have received more questions just because it was launched first. For this purpose I created this Python script that uses the amazing matplotlib library. And here is the result for the last 60 months.

It shows that the proportion of interest in the different messaging systems stayed approximately the same all the time. Furthermore, this year's StackOverflow statistics correlate well with the Google statistics.
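The plotting script is not included in this text. The counting step behind such a graph can be sketched as follows, assuming (as in the sample record) that `creation_date` is a Unix timestamp in seconds; the function name is made up for the example, and the plotting itself is left out.

```python
from collections import Counter
from datetime import datetime, timezone


def questions_per_month(questions, tag):
    """Return a sorted list of ((year, month), count) for one tag."""
    counts = Counter()
    for q in questions:
        if tag in q["tags"]:
            created = datetime.fromtimestamp(q["creation_date"], tz=timezone.utc)
            counts[(created.year, created.month)] += 1
    return sorted(counts.items())


records = [
    {"tags": [".net", "activemq"], "creation_date": 1225989513},
    {"tags": ["activemq"], "creation_date": 1227135282},
    {"tags": ["rabbitmq"], "creation_date": 1317300112},
]
print(questions_per_month(records, "activemq"))
```

The resulting month/count pairs for each tag can then be handed to a plotting library such as matplotlib, one line per broker.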

Conclusion

1. RabbitMQ and ActiveMQ are very popular. If you choose one of them for your messaging infrastructure, you shouldn't have any problem with community support. HornetQ might be a good message broker, but it definitely lacks community interest. Finally, as I suspected before, ZeroMQ is worth looking at. There is a bunch of smart and helpful people in the ZeroMQ community.

2. MongoDB rocks! Its aggregation framework is powerful and easy to use. It was fun playing with it.

Sunday, December 09, 2012

Code Retreat 2012

Yesterday was the Global Day of Code Retreat. Software engineers around the world met together to learn from each other.

There were several sessions where people were sitting in pairs, programming Conway's Game of Life.

Each session you choose a new partner, so that you both can learn something new.

1. During the first session my partner and I decided to implement the Game in Java, mainly because it was the language she was most comfortable with. We implemented a procedural solution using a two-dimensional array and nested loops. At that moment it was the only solution I could think of. The main challenge was to cover all the edge cases and fix all the ArrayIndexOutOfBoundsExceptions. Java is a fairly verbose language, and with nested loops and if-else statements the final solution was pretty hard to read. You can see here what it might look like.

2. The first session was a warmup, during which most people realized that programming with arrays is tedious work. For the second session my new partner suggested an object-oriented approach, where you operate on Cell objects that encapsulate coordinates on the grid. In this case you move the game logic from the grid to the cell, making it easier to calculate a new state. This was my first acquaintance with C#. Interesting language. Basically, Java with lambdas. Here is an example of a C# implementation. Our solution was very similar.

3. If the first session's data structure was an array of booleans, in the second session it was replaced by a list of objects. The next step was to relax the data structure even further. We decided to experiment with an unordered set of coordinate pairs. For the language we chose Clojure. Although we didn't finish the implementation, by the end of the session we had a clear picture of how to solve the problem in a functional style.

4. In the fourth session the facilitators imposed an interesting constraint: the coding must be done in absolute silence. That was the most amazing experience of the day. Before we started I thought we couldn't accomplish much without talking. As it turned out, we could. The key to silent coding is to use tools that both partners are familiar with. In our case we were both advanced users of Vim, and we both knew Lisp languages. Our Clojure implementation was based on the map/filter/reduce approach and spanned 20 lines of code. Later on, I found Christophe Grand's 7-line solution based on list comprehensions. It is so wonderful that I want to reproduce it here:

(defn neighbours [[x y]]
  (for [dx [-1 0 1] dy (if (zero? dx) [-1 1] [-1 0 1])]
    [(+ dx x) (+ dy y)]))

(defn step [cells]
  (set (for [[loc n] (frequencies (mapcat neighbours cells))
             :when (or (= n 3) (and (= n 2) (cells loc)))]
         loc)))

5. For the last session we chose Erlang. Because we already knew how to implement the functional solution, it was an exercise in translating Clojure code into Erlang. Unfortunately we didn't find an equivalent of Clojure's frequencies function, so we implemented it ourselves. Other than that, the Erlang code is identical to the Clojure version.

-import(lists, [flatmap/2]).
-import(sets, [from_list/1, to_list/1, is_element/2]).

neighbours({X, Y}) ->
    [{X + DX, Y + DY} || DX <- [-1, 0, 1], DY <- [-1, 0, 1], {DX, DY} =/= {0, 0}].

step(Cells) ->
    Nbs = flatmap(fun neighbours/1, to_list(Cells)),
    NewCells = [C || {C, N} <- frequencies(Nbs),
                     (N == 3) orelse ((N == 2) andalso is_element(C, Cells))],
    from_list(NewCells).

frequencies(List) -> frequencies(List, []).
frequencies([], Acc) -> Acc;
frequencies([X|Xs], Acc) ->
    case lists:keyfind(X, 1, Acc) of
        {X, F} -> frequencies(Xs, lists:keyreplace(X, 1, Acc, {X, F+1}));
        false  -> frequencies(Xs, lists:keystore(X, 1, Acc, {X, 1}))
    end.
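For comparison, the same set-of-coordinates idea can be sketched in Python, where `collections.Counter` plays the role of `frequencies`. This is my own illustrative translation of the approach above, not code written at the retreat.

```python
from collections import Counter


def neighbours(cell):
    """Return the eight cells surrounding the given (x, y) cell."""
    x, y = cell
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


def step(cells):
    """Compute the next generation from a set of live (x, y) cells."""
    # For every cell, count how many live cells it has as neighbours.
    counts = Counter(n for cell in cells for n in neighbours(cell))
    # A cell lives with exactly 3 neighbours, or with 2 if it is already alive.
    return {loc for loc, n in counts.items()
            if n == 3 or (n == 2 and loc in cells)}


blinker = {(0, 1), (1, 1), (2, 1)}
print(sorted(step(blinker)))
```

Since the grid is never materialized, there are no array bounds to check, which was the main pain point in the first session.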

Summary

In one day I learnt a lot: a new language, new abstractions, new techniques, new ways of communication, new ideas. I met a bunch of smart people. I was so overwhelmed with all this cool stuff that I had to write this blog post to offload it from my head.

If you are a programmer and you've never been to a Code Retreat, I strongly encourage you to go next year. It's an exciting experience.

And, of course, thanks to all the people who made it possible.