Sunday, June 09, 2013

Switching to Octopress

I’m switching to different blogging platform. Again. There are few reasons for that. The main reason is I want to own the sources of my articles. All platforms I used before save blog posts in HTML format, which is not the source format but rather one of the representations. If I want to update an article I have to edit HTML in some online editor which is not the best way of editing. It’s especially difficult on Blogspot, because after some time Blogspot converts the articles to a single-line of HTML which is very ugly.

I’d like to have my articles in a simple format like Markdown, and the blogging engine should just render them to HTML. I also like to keep the sources on my hard drive where I can quickly grep for phrases. Essentially what I want is this

  • Separate the source of the articles from their representation.
  • Being able to save sources locally on my disk and upload them if needed to the Internet.
  • Keep the history of article changes.
  • Have flexibility in representation, i.e. being able to change the layout of the blog to whatever I want.

Neither Blogspot nor the previous platforms I used to publish my articles to provide these features. Therefore I’m switching to Jekyll.

With Jekyll you write the article in Markdown, Textile, or Liquid, and it generates the static HTML which can be uploaded to any Jekyll-aware hosting. Github, for example, is one of them.

To get even more power, I’m going to use Octopress, which is built on top of Jekyll. Apart from all Jekyll features, it also provides syntax highlighting (which is very handy when you post code snippets), professional mobile layout, and tons of plugins you can use to customize the blog.

As a part of the migration I also want to revisit my old articles and update the interesting ones. You will find them in the archives section of my new blog.

Without further ado, here is the link to my new blog: blog.ndpar.com. The sources can be found on Github.

The Blogspot blog is now deprecated.

Thursday, May 09, 2013

Stuck with your first programming language

Barbara Liskov in her keynote presentation:

I'm a little dismayed what's happened in programming languages. And the reason I'm dismayed is because on the one hand we have the programming languages that experts use — I'm thinking of Java and C# but you can name bunch of others. And the problem with these languages is — they are powerful, you can build big systems in them — but they aren't very good for beginners. And what happens in MIT, and I think this is happening across the US anyway, is the people are no longer using those languages in their introductory programming language courses, because the amount of craft you have to go through in order to write the little print-loop is just too much, and the students just get lost in the process. So they've been switching to languages like Python.

Python is very simple and nice when you start to use it. But you don't get too far down the road, if you me, before you discover it has no data abstraction at all. That's not good because big programs require modularity and encapsulation, and you'd like the language that supports that. So my question is: Can we find a language that will work for both communities? And the reason it matters is because a lot of kids start off programming in their first programming language and that's it. They may have to change eventually, but they are building huge systems in languages that are really ill-suited for this. So it would be good if the language they started off with is the one they can grow with, and would be good for building big programs as well as small ones.

Saturday, March 09, 2013

Simple ZooKeeper Cluster

Sometimes I need to run ZooKeeper ensemble on my development box to test my application on the production-like environment. I found that recreating the whole ensemble from scratch is much faster than cleaning it up using ZooKeeper CLI tool. To automate this process I created a bash script which I want to share in this blog post. I hard-coded all the paths in the script using my regular conventions. You might need to change them to yours — it should be fairly straightforward.

Before you can use the script, you need to install ZooKeeper on your box. That's what I did on my machine

$ cd /opt
$ sudo mkdir zookeeper
$ sudo chown -R andrey:admin zookeeper
$ cd zookeeper
$ wget http://apache.mirror.rafal.ca/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
$ tar xf zookeeper-3.4.5.tar.gz
$ rm zookeeper-3.4.5.tar.gz
$ ln -s zookeeper-3.4.5 zookeeper

In the end you should have a ZooKeeper installed in /opt/zookeeper/zookeeper directory.

Now download, chmod, and run the script. It will create the following files

/opt/zookeeper/zookeeper/cluster
├── server1
│   ├── conf
│   │   ├── log4j.properties
│   │   └── zoo.cfg
│   ├── data
│   │   └── myid
│   └── logs
├── server2
│   ├── conf
│   │   ├── log4j.properties
│   │   └── zoo.cfg
│   ├── data
│   │   └── myid
│   └── logs
├── server3
│   ├── conf
│   │   ├── log4j.properties
│   │   └── zoo.cfg
│   ├── data
│   │   └── myid
│   └── logs
├── start.sh
└── stop.sh

This is the minimum configuration for 3-node ensemble (cluster), which is recommended for production. To start the cluster, run the following command

$ cd /opt/zookeeper/zookeeper
$ cluster/start.sh

Check the log files to see if the cluster is successfully started

$ tail -f cluster/server{1,2,3}/logs/zookeeper.out

When the cluster is up and running, you can test your application. After you are done, shutdown the cluster using the following command

$ cluster/stop.sh
$ ps -ef | grep java

To recreate a clean cluster, just run the script again

$ ./zookeeper-init-ensemble.sh

Saturday, December 15, 2012

RabbitMQ, ActiveMQ, ZeroMQ, HornetQ

Warning: In this post I'm going to compare RabbitMQ, ZeroMQ, ActiveMQ, and HornetQ. The basis of the comparison is not the performance, or the scalability, or any other serious feature. The comparison is done purely based on the popularity of those systems. Therefore, if you want to see some performance metrics, this post is not what you are looking for.

Note: To calculate popularity, I'm going to use MongoDB and Python, so if you don't care about message brokers, but you want to see some examples of MongoDB scripts, this post might be interesting to you.

Popularity

What is the best messaging system out there? If you read my blog regularly, you probably know my biased answer. But to give an objective answer, we have to compare the candidates based on some criteria. There are multiple criteria, some of which are more relevant to your project than the others. One of them is how popular the candidate solutions are. In other words, if you choose a message broker and then you encounter a problem, how easy would it be to solve it? Is there anybody who can help you? One way to find it out is to check how many people are interested in the same solution. And the obvious way to do it is to ask Google.

Here is the Google trend graph for the last five years. It turns out, my personal preferences coincide with the public interest.

At this point I can stop and say "Well, you see who's the winner". There are 5 times more people interested in RabbitMQ than HornetQ, so if you bet on Rabbit you have more chances to get the help from your fellow programmers, if you need to.

But before we make the final decision, I want to hear another opinion about the popularity of our candidates. Where do people go nowadays when they have software related problems? Right, they go to…

StackOverflow

The best thing about StackOverflow is their REST API. For our purposes we need two API queries: get all questions by a tag, and get all answers for the question. In fact, the second one is optional. Even the first query alone can give us most of what we want to know:

  • how many questions have been posted for every candidate on our list?
  • how many answers did those questions receive?
  • how many answers were accepted?
  • how many questions and answers were marked useful?

When we get all the numbers, we should know what people are actually using. We can also check if there is any correlation between Google data and StackOverflow.

So how do we proceed? We cannot use API directly to run analytics, because we would quickly exhaust the daily quota. What we can do is to fetch the data, save it locally, and run analytics against the local data. Here is another good thing about StackOverflow API: it comes in JSON format. What is the best way to analyze JSON data? Obviously, saving it in a JSON-oriented database that supports aggregated queries. And that's where MongoDB comes into play.

Here is the Python script that downloads all the questions for the specified tags from StackOverflow, and saves the results in the local MongoDB instance. I chose Python because I want to draw some graphs later, which is easy to do in Python. Plus, it's a simple and expressive language.

After we run this script, we get all the questions we need in our database. The next step is to get all the answers for those questions. Here is the script that does that.

Depends on how many questions we have saved on the first step, there might be quite a lot of queries to run to get all the answers. With my second script I exceeded the daily quota, so I had to wait for the next day to get the rest of the answers.

Now, when we have all the data, let's take a look how we can use it. Here is a typical record. I highlighted the fields that might be useful for our analysis.

{
     "_id" : 269363,
     "accepted_answer_id" : 290764,
     "answer_count" : 4,
     "answers" : [
          ...
          {
               "view_count" : 0,
               "answer_comments_url" : "/answers/303710/comments",
               "answer_id" : 303710,
               "title" : "ActiveMQ .net client locks up",
               "community_owned" : false,
               "down_vote_count" : 0,
               "last_activity_date" : 1317300099,
               "creation_date" : 1227135282,
               "score" : 1,
               "up_vote_count" : 1,
               "owner" : {
                    "display_name" : "HitLikeAHammer",
                    "reputation" : 1152,
                    "user_id" : 35165,
                    "user_type" : "registered",
                    "email_hash" : "584cd9905db85f744e7e96740b11b7c0"
               },
               "accepted" : false,
               "last_edit_date" : 1317300099,
               "question_id" : 269363
          },
          ...
     ],
     "community_owned" : false,
     "creation_date" : 1225989513,
     "down_vote_count" : 0,
     "favorite_count" : 1,
     "last_activity_date" : 1317300112,
     "owner" : {
          "display_name" : "HitLikeAHammer",
          "reputation" : 1152,
          "user_id" : 35165,
          "user_type" : "registered",
          "email_hash" : "584cd9905db85f744e7e96740b11b7c0"
     },
     "question_answers_url" : "/questions/269363/answers",
     "question_comments_url" : "/questions/269363/comments",
     "question_id" : 269363,
     "question_timeline_url" : "/questions/269363/timeline",
     "score" : 1,
     "tags" : [
          ".net",
          "activemq"
     ],
     "title" : "ActiveMQ .net client locks up",
     "up_vote_count" : 1,
     "view_count" : 1183
}

First of all, we want to know how many questions are posted for each messaging system on our list. Here is the MongoDB query for that. The query itself is in blue and the results are in black.

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{questions:-1}}
])['result'];
[
     {
          "_id" : "activemq",
          "questions" : 1039
     },
     {
          "_id" : "rabbitmq",
          "questions" : 988
     },
     {
          "_id" : "zeromq",
          "questions" : 373
     },
     {
          "_id" : "hornetq",
          "questions" : 185
     }
]

The next query is to get the total number of answers by tag

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', answers:{$sum:'$answer_count'}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{answers:-1}}
])['result'];
[
     {
          "_id" : "activemq",
          "answers" : 1382
     },
     {
          "_id" : "rabbitmq",
          "answers" : 1322
     },
     {
          "_id" : "zeromq",
          "answers" : 572
     },
     {
          "_id" : "hornetq",
          "answers" : 227
     }
]

It seems that the number of answers is proportional to the number of questions. With MongoDB we can quickly verify it.

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', answers:{$sum:'$answer_count'}, questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$project:{answers:1, questions:1, ratio:{$divide:['$answers', '$questions']}}},
     {$sort:{ratio:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "answers" : 572,
          "questions" : 373,
          "ratio" : 1.5335120643431635
     },
     {
          "_id" : "rabbitmq",
          "answers" : 1322,
          "questions" : 988,
          "ratio" : 1.3380566801619433
     },
     {
          "_id" : "activemq",
          "answers" : 1382,
          "questions" : 1039,
          "ratio" : 1.3301251203079885
     },
     {
          "_id" : "hornetq",
          "answers" : 227,
          "questions" : 185,
          "ratio" : 1.227027027027027
     }
]

Indeed, the answers/question ratio is almost the same for every tag. That means we can use just the number of questions for our analysis.

Here is the query that calculates the number of accepted answers by tag. Again, it correlates fairly well with the total number of answers and questions.

> db.stackoverflow.aggregate([
     {$match:{accepted_answer_id:{$ne:null}}},
     {$unwind:'$tags'},
     {$group:{_id:'$tags', accepted_answers:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{accepted_answers:-1}}
])['result'];
[
     {
          "_id" : "activemq",
          "accepted_answers" : 531
     },
     {
          "_id" : "rabbitmq",
          "accepted_answers" : 500
     },
     {
          "_id" : "zeromq",
          "accepted_answers" : 221
     },
     {
          "_id" : "hornetq",
          "accepted_answers" : 94
     }
]

The next query is more interesting. It calculates the number of question up-votes by tag. In other words, it shows the number of useful questions. If we divide it by the total number of questions, we should see which messaging system has bigger rate of useful questions than others

> db.stackoverflow.aggregate([
     {$unwind:'$tags'},
     {$group:{_id:'$tags', upvotes:{$sum:'$up_vote_count'}, questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$project:{upvotes:1, questions:1, ratio:{$divide:['$upvotes', '$questions']}}},
     {$sort:{ratio:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "upvotes" : 1078,
          "questions" : 373,
          "ratio" : 2.8900804289544237
     },
     {
          "_id" : "rabbitmq",
          "upvotes" : 1864,
          "questions" : 988,
          "ratio" : 1.8866396761133604
     },
     {
          "_id" : "activemq",
          "upvotes" : 1459,
          "questions" : 1039,
          "ratio" : 1.4042348411934553
     },
     {
          "_id" : "hornetq",
          "upvotes" : 233,
          "questions" : 185,
          "ratio" : 1.2594594594594595
     }
]

Interesting. The ZeroMQ users seem to ask more useful questions than the users of other brokers.

Let's do the same analysis for the answers. Here is the query that calculates the number of answer up-votes by tag.

> db.stackoverflow.aggregate([
     {$unwind:'$answers'},
     {$unwind:'$tags'},
     {$group:{_id:{question:'$_id', tag:'$tags'}, upvotes:{$sum:'$answers.up_vote_count'}}},
     {$group:{_id:'$_id.tag', upvotes:{$sum:'$upvotes'}, questions:{$sum:1}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$project:{upvotes:1, questions:1, ratio:{$divide:['$upvotes', '$questions']}}},
     {$sort:{ratio:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "upvotes" : 1469,
          "questions" : 338,
          "ratio" : 4.346153846153846
     },
     {
          "_id" : "rabbitmq",
          "upvotes" : 2437,
          "questions" : 858,
          "ratio" : 2.84032634032634
     },
     {
          "_id" : "activemq",
          "upvotes" : 2199,
          "questions" : 902,
          "ratio" : 2.4379157427937916
     },
     {
          "_id" : "hornetq",
          "upvotes" : 262,
          "questions" : 156,
          "ratio" : 1.6794871794871795
     }
]

Again, ZeroMQ users post more useful answers than others.

To complete the picture of typical users, let's run the following query that calculates an average reputation of people that post answers

> db.stackoverflow.aggregate([
     {$unwind:'$answers'},
     {$unwind:'$tags'},
     {$group:{_id:{question:'$_id', tag:'$tags'}, reputation:{$avg:'$answers.owner.reputation'}}},
     {$group:{_id:'$_id.tag', reputation:{$avg:'$reputation'}}},
     {$match:{_id:{$in:['activemq', 'rabbitmq', 'zeromq', 'hornetq']}}},
     {$sort:{reputation:-1}}
])['result'];
[
     {
          "_id" : "zeromq",
          "reputation" : 10088.29552338687
     },
     {
          "_id" : "activemq",
          "reputation" : 7298.7539383380845
     },
     {
          "_id" : "rabbitmq",
          "reputation" : 6082.172231934734
     },
     {
          "_id" : "hornetq",
          "reputation" : 3472.9658119658116
     }
]

Wow. ZeroMQ users not only ask more useful questions and give useful answers, they also have higher reputation on average in the StackOverflow community.

As a final exercise, I want to build a graph of question distribution over time. After all, ActiveMQ is the oldest broker, and it might have got more questions just because it was launched first. For this purpose I created this Python script that uses amazing matplotlib library. And here is the result for the last 60 months

It shows that the proportion of interest in different massaging systems was approximately the same all the time. Furthermore, the StackOverflow statistics of this year correlates well with the Google statistics.

Conclusion

1. RabbitMQ and ActiveMQ are very popular. If you choose one of them for your messaging infrastructure, you shouldn't have any problem with the community support. HornetQ might be a good message broker but it definitely lacks the community interest. Finally, as I suspected before, ZeroMQ is worth looking at. There are bunch of smart and helpful people in ZeroMQ community.

2. MongoDB rocks! Its aggregation framework is powerful and easy to use. It was fun playing with it.

Sunday, December 09, 2012

Code Retreat 2012

Yesterday was the Global Day of Code Retreat. Software engineers around the world met together to learn from each other.

There were several sessions where people were sitting in pairs, programming Conway's Game of Life.

Each session you choose a new partner, so that you both can learn something new.

1. During the first session my partner and I decided to implement the Game in Java, mainly because it was the language she was most comfortable with. We implemented the procedural solution using two-dimensional array and nested loops. At that moment that was the only solution I could think of. The main challenge was to cover all edge cases and fix all ArrayIndexOutOfBoundsExceptions. Java is fairly verbose language, and with nested loops and if-else statements the final solution was pretty hard to read. You can see here how it might look like.

2. First session was a warmup, during which most people realized that programming arrays is a tedious work. For the second session my new partner suggested an object-oriented approach, where you would operate on Cell objects that would encapsulate coordinates on the grid. In this case you move the game logic from the grid to the cell, making it easier to calculate a new state. This was my first acquaintance with C#. Interesting language. Basically, Java with lambdas. Here is an example of C# implementation. Our solution was very similar.

3. If the first session's data structure was array of booleans, at the second session it was replaced by a list of objects. The next step would be to relax the data structure even further. We decided to experiment with un-ordered set of coordinate pairs. For language we chose Clojure. Although we didn't finish the implementation, by the end of the session we had a clear picture how to solve the problem in functional style.

4. On the fourth session the facilitators put an interesting constraint: the coding must be done in absolute silence. That was the most amazing experience of the day. Before we started I thought we couldn't accomplish much without talking. As it turned out, we could. The key in silent coding is to use the tools which both partners are familiar with. In our case we both were advanced users of Vim, and we knew Lisp languages. Our Clojure implementation was based on map/filter/reduce approach and spanned 20 lines of code. Later on, I found Christophe Grand's 7-line solution based on list comprehensions. It is so wonderful that I want to reproduce it here

(defn neighbours [[x y]]
  (for [dx [-1 0 1] dy (if (zero? dx) [-1 1] [-1 0 1])]
    [(+ dx x) (+ dy y)]))

(defn step [cells]
  (set (for [[loc n] (frequencies (mapcat neighbours cells))
             :when (or (= n 3) (and (= n 2) (cells loc)))]
         loc)))

5. For the last session we chose Erlang. Because we already knew how to implement the functional solution, that was an exercise of translating Clojure code into Erlang. Unfortunately we didn't find an equivalent of frequencies() function, so we implemented it ourselves. Other than that, the Erlang code is identical to Clojure.

-import(lists, [flatmap/2]).
-import(sets, [from_list/1, to_list/1, is_element/2]).

neighbours({X, Y}) ->
    [{X + DX, Y + DY} || DX <- [-1, 0, 1], DY <- [-1, 0, 1], {DX, DY} =/= {0, 0}].

step(Cells) ->
    Nbs = flatmap(fun neighbours/1, to_list(Cells)),
    NewCells = [C || {C, N} <- frequencies(Nbs),
                     (N == 3) orelse ((N == 2) andalso is_element(C, Cells))],
    from_list(NewCells).

frequencies(List) -> frequencies(List, []).
frequencies([], Acc) -> Acc;
frequencies([X|Xs], Acc) ->
    case lists:keyfind(X, 1, Acc) of
        {X, F} -> frequencies(Xs, lists:keyreplace(X, 1, Acc, {X, F+1}));
        false  -> frequencies(Xs, lists:keystore(X, 1, Acc, {X, 1}))
    end.

Summary

During one day I learnt a lot: new language, new abstractions, new techniques, new ways of communication, new ideas. I met bunch of smart people. I was so overwhelmed with all this cool stuff that I had to write this blog post to offload it from my head.

If you are a programmer and you've never been at Code Retreat, I strongly encourage you to do it next year. It's exciting experience.

And, of course, thanks to all the people who made it possible.

Thursday, November 22, 2012

Flexible language

I've been learning Lisp for few years now, and every Lisp book I read keeps saying that Lisp is a flexible language that you can extend to the degree when it fits naturally to your domain. It's easy to say, but what exactly does this phrase mean? After all, when you program in your non-Lisp language, don't you modify it for your domain problem? I've been thinking about it for a long time, and only recently I started to understand what flexibility really means. There is a difference between using the language and changing the language to solve a problem. In this post I will try to show the difference based on a simple example.

Problem

Suppose you have a process that listens to a message queue. The messages are just ordinary maps. If the map contains certain keys, one or more handlers must be invoked. Here is a matrix that shows which handler is invoked for which key

For example, if the map has key a, then DocHandler and AlertHandler need to be called. If it has key b, then NoteHandler and AlertHandler are called. In reality there might be more keys and more handlers, but for simplicity we limit our example to three keys and three handlers.

Java

Let's see how this can be implemented in Java. I chose Java just as an example of non-Lisp language. You can pick any other non-Lisp language instead.

public class SimpleMessageListener {

    public List onMessage(Map message) {
        List result = new LinkedList();
        if (isDoc(message)) {
            result.add(handleDoc(message));
        }
        if (isNote(message)) {
            result.add(handleNote(message));
        }
        if (isAlert(message)) {
            result.add(handleAlert(message));
        }
        return result;
    }

    // Decision makers

    private boolean isDoc(Map message) {
        return message.containsKey("a") || message.containsKey("c");
    }

    private boolean isNote(Map message) {
        return message.containsKey("b") || message.containsKey("c");
    }

    private boolean isAlert(Map message) {
        return message.containsKey("a") || message.containsKey("b");
    }

    // Handlers

    private String handleDoc(Map message) {
        return String.format("Document:%s:%s", message.get("a"), message.get("c"));
    }

    private String handleNote(Map message) {
        return String.format("Note:%s:%s", message.get("b"), message.get("c"));
    }

    private String handleAlert(Map message) {
        return String.format("Alert:%s:%s", message.get("a"), message.get("b"));
    }
}

The internals of handle- methods might be very different in reality. Consider the fact they have the same structure as a coincidence. What is not coincidence though is the structure of is- methods. Those methods are identical indeed.

Is this code clean? I would say, no. The main issue is that it's split in three separate but closely related parts. If tomorrow I introduce another message key and a new handler, I have to change three places in the code. Another problem is the code duplication in two spots: a series of if-statements and a group of is- methods.

The last thing to notice about this code is that it's hard to see what kind of problem it's trying to solve. If I didn't provide a matrix which maps message keys to handlers, it would take even more time to figure out what the code is doing. Can we make this code better?

Let's rewrite it as follows:

public class FunctionalMessageListener {

    private interface Handler {
        void handle(Map message, List acc);
    }

    private class DocHandler implements Handler {
        private boolean isDoc(Map message) {
            return message.containsKey("a") || message.containsKey("c");
        }
        public void handle(Map message, List acc) {
            if (isDoc(message)) acc.add(String.format("Document:%s:%s", message.get("a"), message.get("c")));
        }
    }

    private class NoteHandler implements Handler {
        private boolean isNote(Map message) {
            return message.containsKey("b") || message.containsKey("c");
        }
        public void handle(Map message, List acc) {
            if (isNote(message)) acc.add(String.format("Note:%s:%s", message.get("b"), message.get("c")));
        }
    }

    private class AlertHandler implements Handler {
        private boolean isAlert(Map message) {
            return message.containsKey("a") || message.containsKey("b");
        }
        public void handle(Map message, List acc) {
            if (isAlert(message)) acc.add(String.format("Alert:%s:%s", message.get("a"), message.get("b")));
        }
    }

    private List<Handler> handlers() {
        return Arrays.asList(new DocHandler(), new NoteHandler(), new AlertHandler());
    }

    public List onMessage(Map message) {
        List result = new LinkedList();
        for (Handler handler : handlers()) {
            handler.handle(message, result);
        }
        return result;
    }
}

In this version we eliminated ugly if-series, and group together decision making logic and message handling. From that perspective the code became cleaner, but not necessarily clearer. Now it actually takes more effort to understand what the code is doing. Also, the duplication inside the is- methods is still there. We can fix it by extracting it to some abstract class or utility method. We can also use Java reflection within handlers() method to build a collection of handlers without explicitly specifying them. All these manipulations arguably make the code cleaner, but... one thing we'll never be able to fix is the separation between decision making logic and message handling. Whatever you do, there will always be the if-statement that checks if you need to process the message, and the message processing logic itself. Those two things will always be separate. Here is the point where we hit the language constraints.

Clojure

Now let's try to solve the same problem in Lisp and see if we can fix the language to eliminate the last issue from the paragraph above. Here is the direct translation of the previous Java snippet to Clojure dialect of Lisp

(defn- doc-handler [msg]
  (let [a (msg :a), c (msg :c)]
    (when (or a c)
      (format "Document:%s:%s" a c))))

(defn- note-handler [msg]
  (let [b (msg :b), c (msg :c)]
    (when (or b c)
      (format "Note:%s:%s" b c))))

(defn- alert-handler [msg]
  (let [a (msg :a), b (msg :b)]
    (when (or a b)
      (format "Alert:%s:%s" a b))))

(defn- handlers []
  [doc-handler note-handler alert-handler])

(defn on-message [msg]
  (letfn [(handle [acc h]
            (if-let [res (h msg)]
              (conj acc res)
              acc))]
    (reduce handle [] (handlers))))

This code is already easier to read, but we can do even better. The separation between decision making logic and message handling is still there. At this point we should ask the question: what kind of code do we want to see there? And my answer is: I want to replace the -handler methods above with the following code

(handler doc [a c]
  (format "Document:%s:%s" a c))

(handler note [b c]
  (format "Note:%s:%s" b c))

(handler alert [a b]
  (format "Alert:%s:%s" a b))

You see: no conditionals. Handlers are self-sufficient entities which know when they have to be applied and how. In their signatures they explicitly declare what kind of parameters they expect, and in the body they just use those parameters. No boilerplate, clean and simple. The beauty of Lisp is that you can actually implement that code. The way you do it is by creating a macro which will generate the appropriate functions. Creating a macro is not a simple task, I spent quite some time to get this one working, but it's worth of doing, because now the code is clean and clear.

We can make one additional step further by moving the handler declarations inside the handlers() method. (We need one small macro for that.) And here is the final solution

(defmacro handler [name args & body]
  `(fn [~'msg]
     (let [~@(interleave args (map (fn [x] `(get ~'msg ~(keyword x))) args))]
       (when (or ~@args)
         ~@body))))

(defmacro build-handlers [& body]
  `(defn- handlers []
     [~@body]))

(build-handlers

  (handler doc [a c]
    (format "Document:%s:%s" a c))

  (handler note [b c]
    (format "Note:%s:%s" b c))

  (handler alert [a b]
    (format "Alert:%s:%s" a b)))

(defn on-message [msg]
  (letfn [(handle [acc h]
            (if-let [res (h msg)]
              (conj acc res)
              acc))]
    (reduce handle [] (handlers))))

As I said, the first macro might be cryptic, but look at the highlighted part. This is the essence of our problem, and it cannot be done any simpler. Suppose, we need to implement a new handler which should be called if key c is present in the message. Here what we would need to add to build-handler's body to implement this new requirement

  (handler new [c]
    ...)

Simple, right? And what if a new key is added to the message which should be processed by document handler? Here is what we need to change

  (handler doc [a c d]
    ...)

We just add a new key to the function's parameter list. That's it — one-word change.

Summary

Lisp is the most powerful programming language. By that I mean you can change the language in such a way that the solution to any particular problem can be expressed in the simplest possible way. By changing the language, you can remove all the barriers between the language and the problem domain. I hope I demonstrated this in my simple example.

Resources

Clojure source code for this blog, along with the unit tests.

Monday, November 12, 2012

PyCon Canada 2012

This weekend I attended PyCon Canada, the first conference in Canada dedicated to Python ecosystem. As you might find from my blog, I'm not a Python guy. I've been using Python mostly as a scripting language. I went to this conference for fresh ideas, or, as Michael Feathers said, for cross-polination from Python community. This blog post is not a detailed review of the conference — I just want to share my impressions in general.

Organization

Considering how little time the organizers had for preparing this conference, 5 months I believe, they did amazing job. They invited great speakers. They kept people well informed using mailing list and Twitter. The official web site was clear and easy to navigate. The location was good. The food was decent. The only complaint I had is about the temperature in the rooms on the first day. It was so freezing cold inside that I had to wear my jacket all the time. But on the second day the problem was fixed.

Keynotes

Keynotes were absolutely fantastic. There were three of them. Jessica McKellar was talking about Python community. How they foster it, how they attract new people to programming in general and to Python in particular. She shared her experience from organizing Boston Python user group, the biggest Python user group in the world. The takeaway from her talk: Python community is big, welcoming, and well supported by Python Foundation.

Second keynote was Michael Feathers' Why You Should(n't) be Using a Functional Programming Language Instead. The main idea of his talk is: Don't lock yourself inside one language. Go outside of your community to see what other languages exist out there, how they solved the problems. Study those languages, learn their idioms and techniques, and then go back to your language and start using the ideas you've learnt. I completely agree with that, and that's why I went to this conference in the first place. He gave bunch of examples of functional programming in Haskell. Then he showed his Ruby code written in functional style, where you could see the influence of Haskell. I liked his presentation because he verbalized the ideas I myself have been thinking about for a while. When I started programming in Groovy my Groovy code was basically a Java code without semicolons. Now my Java code looks more like Groovy.

The closing keynote was by Fernando Pérez, the scientist from University of California, Berkeley, and the creator of IPython. The talk, titled Science and Python, was really mind blowing. When I was a student I did all my computations using mainly Fortran and some proprietary software I don't even remember the name of. Later, I played with Mathematica and Octave a little bit. But I didn't know that you can do very sophisticated scientific calculations using Python. Fernando gave some examples from neuroscience, astrophysics and biology, and it's really impressive. The discovery of Supernova PTF11kyl is especially astonishing. From now on, if I need to do some math, I'll be using Python libraries; no more proprietary expensive software. Another theme of the presentation was IPython. Initially I thought it's just a shell on top of standard Python, but it's actually the whole ecosystem. I cannot explain in few words how amazing it is. Just google for "ipython notebook" or read Fernando's blog.

Talks

As it happens on every conference, there were some great talks and some lousy talks, interesting talks and boring talks, geeky talks and academic talks. It's all normal and fine. The good thing about this conference though is that signal-noise ratio was pretty high; congratulations to the organizers for choosing talks. Another thing I like is the diversity of formats. There were 45-min presentation, 20-min talks, 5-min lightning talks, 90-min tutorials, and 3-hour workshop (there are also two full day coding sessions but I'm not attending them). This is a really good approach. Switching between different formats during the day helps your brain functioning more productive, in my opinion.

Pleasant discoveries

I found many projects presented at the conference are using RabbitMQ, and that's great. I wish in Java world people would use AMQP more frequently instead of blindly choosing JMS for every new project.

Many people are using MongoDB properly. Nowadays NoSQL is a very popular buzzword, and many projects are using various NoSQL databases for no other reason but fashion, even if it makes no sense for the project at all. It was nice to see that there are developers out there who do their homework and adopt NoSQL because it fits their domain.

Unpleasant discoveries

There seems to be a trend in Python community to despise Java. I actually see this trend in many communities outside Java, so it's not Python specific, but at this conference I've heard too many jokes about Java so it's not funny anymore, especially hearing them from the people who don't write a line in Java.

Another thing surprised me is the fanatic admiration of Mercurial and hate of Git from some Python programmers. I know lots of people who hate Git, mainly because they are confused and scared by Git. But dislike it for the reason not being written in Python is something new to me.

Problems in Python

Package and distribution management in Python is in pretty bad shape. Every person I talked to admitted that it's complete mess at the moment. I myself feel that pain every time I need to install a new library. Which tool should I use: pip, easy_install, pysetup? Some libraries installed using those tools don't work, or work partially. Many programmers use rpm or deb packages instead of Python tools, because OS packages usually work. I came to the same conclusion on my Mac OS. The only flawlessly working Python environment I have is that installed via mac ports. In Java we don't have those problems. Maven solved it once and for all long time ago. Now every JVM language benefits from it. Python community should clean up this mess and standardize their tools. I was told that with introducing PyPi and PEPs the situation is getting better, well, let's see if it resolves all the issues.

What I've learnt

Here is the list of things I found pretty interesting, in no particular order.

Python libraries to use

numpy, matplotlib, pandas, scipy, sympy, quantities, collections. Thanks to the people who told me about these libraries.

Cool Python stuff

RunSnakeRun — GUI for Python profiler. Check out the screenshots on their web site. I wish Java profilers could draw such nice graphs.

bpython — Python REPL for geeks written in Urwid. Thanks to Ian Ward for really nice presentation.

Interesting ideas

Print log statements in JSON format so that you can analyze them using powerful tools. You can also save logs in MongoDB, either offline or asynchronously, and do statistic analysis using MapReduce.

Write stored procedures in PostgreSQL in Python (and some other languages). They look much better in Python than plSQL.

Things to learn

Here are some technology and tools that have a great potential, in my opinion, and worthy of learning: ZeroMQ, IPython, OpenStack. Those were mentioned multiple times during the conference, and I need to check them out in more details.

Summary

The conference was great. I'm glad I attended it. The organizers did a great job. The conference was beneficial not only to Python community but to Toronto programming community in general. Thanks to all who made it happen.

P.S. Videos from the conference are available here.

Thursday, November 08, 2012

Simple web application in Clojure

This blog post is a micro-tutorial on how to build a simple web application in Clojure. The reason I call it micro will be clear when I introduce the framework we are going to use. This tutorial will be interesting to programmers relatively new to Clojure, but who have some experience with web frameworks in other languages, for instance Spring MVC. The goal of this tutorial is to help you get started with web development in Clojure. Also I want to share my approach to web development in general and in Clojure in particular. This approach is by no means a paragon of web development, but because I like to watch how other people write the software, I thought somebody might be interested to see how I do it.

Problem

So what are we going to develop? I don't want to build a simplistic web application for the sake of building the application. On the other hand, I want to constrain myself to small feature set to prevent this tutorial from sinking in too many details. After thinking a while I found a problem which looks pretty simple, but at the same time there is a good chance people (including myself in the first place) might actually use the program I'm going to create. And here is the problem.

I have a bunch of articles and e-books sitting in some directory on my home server. To be able to read those books from any computer in my home network, I run the simple Python web-server, which exposes the content of the directory via HTTP. If you are curious, here is the command I'm using:

$ cd /path/to/your/ebook/dir
$ python -m SimpleHTTPServer 3030
And here is how the "library" looks like in the browser

This library application is good enough for me, mainly because it's functional. I can easily find the book by skimming the page or using Find command in a browser. But for the purpose of this tutorial I want to make it slightly better. For example, I can split the file names and display the books in a table view, where I can see clearly what the name of the book is, who the author is, and when it was pablished. I can even add sorting as a bonus feature. Basically, I want something like this

As you can imagine, it shouldn't be hard to do this. The file names are already in the form that is easy to parse. So the question is really how to show the table data in a browser. Simple problem, minimum requirements. Let's see how to solve it in Clojure.

Tools

Clojure, being a Lisp descendant, is a powerful language. That means you can create your own web framework during a weekend, which many people actually do. But I think it's much better to take existing library, promote it, enhance it, fix the bugs if you like it. In Clojure I found such a framework, it's called Noir. This framework is very small, so small that their developers call it micro-framework, that's why I'm calling this tutorial micro-tutorial. Probably we shouldn't even call it framework at all, library would probably be a better name. The closest analog to Noir in other languages I know is probably Webmachine in Erlang, or Spring MVC in Java. I wouldn't compare it to Grails or Rails because those guys are huge.

Noir is not only small, it's also simple. You can look at their source code and understand how it works without any problem, provided you have some experience with Clojure. As a result, Noir is a perfect tool for the problem we are going to solve.

Without further ado let's see how Noir works. The easiest way to set up a scaffolding of our future application is by using Leiningen. Leiningen is a Clojure build tool, very similar to Maven. In Maven you would do mvn archetype:generate, in Leiningen you run

$ lein new noir bookshelf

where bookshelf is the name of our application. This creates a directory called bookshelf where you can find some Noir template files, plus Leiningen project descriptor

/.gitignore
/project.clj
/README.md
/resources/public/css/
/resources/public/css/reset.css
/resources/public/img/
/resources/public/js/
/src/bookshelf/models/
/src/bookshelf/server.clj
/src/bookshelf/views/common.clj
/src/bookshelf/views/welcome.clj
/test/bookshelf/

You can ignore .gitignore, it's already configured properly. Before we make our initial checkin, it's a good practice to edit project.clj and README.md to replace FIXME's. We will edit README file again later when we finish the development to provide more information on how to use the application. Before we move to Noir, let's quickly review project.clj

(defproject bookshelf "0.1.0-SNAPSHOT"
  :description "Bookshelf site"
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [noir "1.3.0-beta3"]]
  :main bookshelf.server)

project.clj is a Clojure version of pom.xml (in fact, you can use pom.xml if you want, Clojure perfectly understands it). First two lines are obvious. Dependency entry specifies which JAR files we need to make our application work. In our case we need only two JARs. Leiningen will check Maven central repository as well as Clojars to download the required JARs with all transitive dependencies. The last line in the project descriptor says which namespace contains the main method. In our case it is bookshelf.server. You can find the source of this namespace in /src/bookshelf/server.clj file. Let's take a look at this file

(ns bookshelf.server
  (:require [noir.server :as server]))

(server/load-views-ns 'bookshelf.views)

(defn -main [& m]
  (let [mode (keyword (or (first m) :dev))
        port (Integer. (get (System/getenv) "PORT" "8080"))]
    (server/start port {:mode mode
                        :ns 'bookshelf})))

The third line specifies the prefixes of the namespaces that will be loaded by Noir server. By default Leiningen generates two files satisfying this criterion: bookshelf/views/common.clj and bookshelf/views/welcome.clj, but you can create more if your project becomes more complex. Since Noir scans namespaces by prefixes, you can even put your files in the nested directories under bookshelf/views, no changes in server.clj required. To start Noir server, run

$ lein run

This will start Jetty web server bound to localhost at port 8080. If you want to change the default port, say to 3030, run the following command

$ export PORT=3030; lein run

Besides port, you can specify few other parameters such as :mode, :jetty-options, etc. (you can see all available options in the server source). I'll show below how to specify production mode, for example, when we deploy the final application.

If you started the server with the default port, open your browser at http://localhost:8080. You should see the Noir's start page

This page by itself contains all you need to get started with Noir, so you can safely stop reading this blog, and just follow the instructions on that page. Those who continue reading this tutorial and wondering where that start page is coming from, please open /src/bookshelf/views/welcome.clj

(ns bookshelf.views.welcome
  (:require [bookshelf.views.common :as common]
            [noir.content.getting-started])
  (:use [noir.core :only [defpage]]))

(defpage "/welcome" []
  (common/layout
    [:p "Welcome to bookshelf"]))

Third line tells us that the source of the start page is in noir/content/getting-started file. If you are curious where this file is, look here. Search for (defpage "/" [] …) to see how the start page is defined. On your web-site you probably want the start page to be different, so you can remove [noir.content.getting-started] from :require section.

The next thing to notice on the snippet above is (defpage "/welcome" [] …) function. That's how you define URL mappings (or routes, in Noir lingo) of your web-site. (Internally, Noir uses Compojure library to handle the routing.) It is similar to @RequestMapping annotations in Spring-MVC, where you specify which method is called when a user hits the given URL. As you can see, we have only one mapping at the moment, /welcome. Since we are building bookshelf application, let's rename it to /books. Also, to be even more explicit, let's rename the whole file to books.clj. Don't forget to update the namespace. Your books.clj file should now look like this

(ns bookshelf.views.books
  (:require [bookshelf.views.common :as common])
  (:use [noir.core :only [defpage]]))

(defpage "/books" []
  (common/layout
    [:p "Welcome to bookshelf"]))

If you go to http://localhost:8080/books in your browser, you should see this screen

By looking at the source of this page, you find it a proper HTML with head and body elements. Those are generated by Hiccup library, which we'll discuss in a moment. One thing I want to mention about defpage is that you can get the same result if you change /books route definition as follows

(defpage "/books" []
  "<html>
     <head>
       <title>bookshelf</title>
     </head>
     <body>
       <p>Welcome to bookshelf</p>
     </body>
   </html>")

It is just a theoretical exercise, in reality nobody hard-codes the entire HTML inside the Clojure code.

Now let's take look at how the page content is generated. If you look at the routing definition in books.clj, you see that the body of defpage function is a call to layout function defined in /src/bookshelf/views/common.clj. Let's open this file

(ns bookshelf.views.common
  (:use [noir.core :only [defpartial]]
        [hiccup.page :only [include-css html5]]))

(defpartial layout [& content]
  (html5
    [:head
     [:title "bookshelf"]
     (include-css "/css/reset.css")]
    [:body
     [:div#wrapper
      content]]))

defpartial is just a wrapper on top of hiccup.core/html function. Hiccup is an XML/HTML rendering library in Clojure. The idea behind it is pretty simple: You build a tree using Clojure vectors, and Hiccup transforms it to a valid HTML. If you are familiar with Groovy MarkupBuilder, it's the same idea. For example, let's define a couple of trees: head and body

(def head
  [:head
   [:title "bookshelf"]])
(def body
  [:body
   [:div
    [:p "Welcome to bookshelf"]]])

Here is what you see in REPL when it evaluates different Hiccup HTML formats (I pretty formatted them for visibility purposes)

(hiccup.page/html5 head body)
;=> "<!DOCTYPE html>
<html>
  <head>
    <title>bookshelf</title>
  </head>
  <body>
    <div><p>Welcome to bookshelf</p></div>
  </body>
</html>"

(hiccup.page/html4 head body)
;=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">
<html>
  <head>
    <title>bookshelf</title>
  </head>
  <body>
    <div><p>Welcome to bookshelf</p></div>
  </body>
</html>"

(hiccup.page/xhtml head body)
;=> "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">
<html xmlns=\"http://www.w3.org/1999/xhtml\">
  <head>
    <title>bookshelf</title>
  </head>
  <body>
    <div><p>Welcome to bookshelf</p></div>
  </body>
</html>"

Noir's default format is html5, the first example above. You can change it to any other format if needed.

The last piece of common.clj I want to mention is (include-css "/css/reset.css"). This is another function from Hiccup library. It generates <link href="/css/reset.css" rel="stylesheet" type="text/css"> element and inserts it in the head element of the page. If you recall the scaffolding we generated at the beginning, there is a directory called /resources/public where Noir keeps CSS files, JavaScript, and images required by your web-site. By default Noir creates reset.css in the corresponding subdirectory. Later we'll create other stylesheets and update common.clj appropriately.

Now, after we covered all the basics, we are ready to build our application.

Controller and View

Let's build our view and controller first. While doing that we'll figure out what data we need from the back-end. That's called top-down design.

We define controllers and views in books.clj file. For now I also include the model in this file. We'll move it to a proper namespace later when we are done with the front-end. Here is the new version of books.clj

(ns bookshelf.views.books
  (:require [bookshelf.views.common :as common])
  (:use [noir.core :only [defpage]]
        [hiccup.element :only [link-to]]))

(defn books []
  [{:author "Fogus M., Houser C."
    :title "The Joy Of Clojure"
    :year "2011"
    :format "pdf"
    :id 1}
   {:author "Fogus M., Houser C."
    :title "The Joy Of Clojure"
    :year "2011"
    :format "epub"
    :id 2}])

(defn- list-books []
  [:table
   [:thead
    [:tr
     [:th "Author"]
     [:th "Title"]
     [:th "Published"]
     [:th "Format"]]]
   (into [:tbody]
         (for [book (books)]
           [:tr
            [:td (:author book)]
            [:td (:title book)]
            [:td (:year book)]
            [:td (link-to (clojure.string/join "/" ["/books" (:id book) (:format book)])
                          (:format book))]]))])

(defpage "/books" []
  (common/layout
    (list-books)))

There are three functions in this file: books (model), list-books (view), and /books (controller/router). The controller is essentially the same as before and the model is simply a vector of maps. Each map contains the same keys as column names on the book table. View might look complicated, but there shouldn't be any surprise here either — it's again an ordinary tree. The only new element here is link-to function defined in hiccup.element namespace. We could really build it directly using [:a {:href …}] vector, but link-to is a standard way to do it in Hiccup.

If you refresh http://localhost:8080/books now (Leiningen/Jetty supports hot redeployment), you will see this screen

The important part of the view function is the format of the links. For the current book model the links are http://localhost:8080/books/1/pdf and http://localhost:8080/books/2/epub. You can verify it by hovering the mouse over them. Those links are actually the main goal of our application. By clicking the link I want to load (or download) the book into the browser and read it. To implement this feature add the following functions to books.clj

(defn get-file [id]
  nil)

(defn- ctype [format]
  (if (= "pdf" format) "application/pdf" "text/plain"))

(defpage "/books/:id/:format" {:keys [id format]}
  (content-type (ctype format) (java.io.ByteArrayInputStream. (get-file id))))

Ignore for a moment get-file function — it does nothing. Later we will move it to model namespace and implement it properly, but we need some placeholder now to compile the application. ctype is a helper function that maps the file format to HTTP content type. I have three ebook formats in my library: ePub, Mobi, and PDF. The first two are plain text formats from HTTP content type perspective. Only PDF requires special type.

The interesting part of the snippet is the controller definition. If you ever implemented REST in Spring MVC, you should see the familiar pattern here. Like Spring, Noir (via Compojure) supports path variables. If you click on http://localhost:8080/books/1/pdf link in the browser, Noir calls the corresponding controller and binds id variable to "1" and format variable to "pdf". When we implement get-file function, it should return the file with the given id as an array of bytes. Controller then wraps the array into a stream and Noir pushes it to HTTP response. content-type function is defined in noir.response namespace, so we need to add [noir.response :only [content-type]] to :use section at the top of books.clj.

That's it, in terms of functionality the controller and the view of our application are done.

Model

Now we need to extract the logic that creates a model from presentation tier to middle-tier. In Noir that's what models directory is for. In our case it's /src/bookshelf/models. Let's create a file called db.clj in that directory, and move there books and get-file functions from bookshelf.views.books namespace. We have to update books.clj to load the new namespace. It should now look like this

(ns bookshelf.views.books
  (:require [bookshelf.models.db :as db]
            [bookshelf.views.common :as common])
  (:use [noir.core :only [defpage]]
        [noir.response :only [content-type]]
        [hiccup.element :only [link-to]]))

(defn- list-books []
  [:table
   [:thead
    [:tr
     [:th "Author"]
     [:th "Title"]
     [:th "Published"]
     [:th "Format"]]]
   (into [:tbody]
         (for [book (db/books)]
           [:tr
            [:td (:author book)]
            [:td (:title book)]
            [:td (:year book)]
            [:td (link-to (clojure.string/join "/" ["/books" (:id book) (:format book)])
                          (:format book))]]))])

(defpage "/books" []
  (common/layout
    (list-books)))

(defn- ctype [format]
  (if (= "pdf" format) "application/pdf" "text/plain"))

(defpage "/books/:id/:format" {:keys [id format]}
  (content-type (ctype format) (java.io.ByteArrayInputStream. (db/get-file id))))

If you refresh your browser, nothing should change.

All right, now it's time to implement our model properly. Recall that our model should scan the directory it's running in for the files with the format Author-Title-Year.FileFormat, and convert each of those files to byte array. Here is how I implemented it

(ns bookshelf.models.db
  (:use [clojure.java.shell :only (sh)]))

(defn- list-files [dir]
  (clojure.string/split (:out (sh "ls" dir)) #"\n"))

(defn- parse [file]
  (when-let [match (re-matches #"([^-]+)-([^-]+)-(\d+)\.(\S+)" file)]
    (zipmap [:file :author :title :year :format] match)))

(defn- add-id [book]
  (assoc book :id (clojure.string/replace (:file book) #"[., ]" "")))

(defn books []
  (->> (list-files ".") (map parse) (remove nil?) (map add-id)))

(defn- file [id]
  (let [mapping (into {} (for [b (books)] [(:id b) (:file b)]))]
    (get mapping id)))

(defn get-file [id]
  (with-open [input  (java.io.FileInputStream. (file id))
              buffer (java.io.ByteArrayOutputStream.)]
    (clojure.java.io/copy input buffer)
    (.toByteArray buffer)))

Let's take a look what each of the functions does. Function list-files returns a vector of file names that reside in the given directory dir. To find all files in the directory I'm using clojure.java.shell/sh function which executes ls command. This works fine on Linux and Mac, but I'm not sure about Windows. Function parse checks if the given file name has the required format. If so, it returns a map {:file file, :author Author, :title Title, :year Year, :format FileFormat}, otherwise it returns nil. add-id function removes dots, commas, and spaces from the file name and add the result as a book ID to the book map. Function books is just a composition of those three functions, and it returns the result expected by the view.

Function file returns the file by given ID. The implementation above is not efficient, but my library is too small to notice any performance issues. Finally, get-file function finds the file by ID and returns it as a byte array. Those four lines is a pretty standard idiom which you can find in many Clojure source files.

Now we are ready to test our application. For testing purposes I'm going to copy a couple of e-books I recently received updates for to the project home directory. The content of this directory looks like this

.gitignore
README.md
Thomas D.-Programming Ruby 1.9-2010.epub
Thomas D.-Programming Ruby 1.9-2010.pdf
project.clj
resources
src
test

I refresh my browser and here I can see these two books

If I click on pdf, I can read the book in my browser

OK, the application is functional. The next step is to make it little bit prettier.

Styling

Since our application is written in Noir framework, let's make it look like Noir. First, I download Noir background image and save it to /resources/public/img directory. Second, I create a stylesheet /resources/public/css/noir.css which resembles Noir's original

body {
    background: #2a2b2b;
    color: #d1d9e1;
    background: url('/img/bg.png');
    padding: 60px 60px;
    font-family: 'Helvetica Neue',Helvetica,Verdana;
}
a {
    text-decoration: none;
    color: #d1d9e1;
}
a:hover {
    color: #6bffbd;
}
h1 {
    color: #6bffbd;
}

Then I update bookshelf.views.common namespace to include new CSS

(defpartial layout [& content]
  (html5
    [:head
     [:title "Bookshelf"]
     (include-css "/css/reset.css")
     (include-css "/css/noir.css")]
    [:body
     [:div#wrapper
      content]]))

Finally, I want to add a header to the page in bookshelf.views.books

(defpage "/books" []
  (common/layout
    [:h1 "Books"]
    (list-books)))

Refresh books web page on the browser to see the changes

The last thing left unstyled is the book table. I won't style it directly, because I want to add client-side sorting to it, and I happen to know that TableSorter JavaScript library provides its own style.

JavaScript

TableSorter is a jQuery plugin, so you need to download jQuery first. Grab the latest min and save it to /resources/public/js directory. Then, download tablesorter.zip that contains both JavaScript and stylesheet files. As before, JavaScript goes to /resources/public/js and stylesheets go to /resources/public/css directory. Here is the resources directory structure I have after everything is saved

resources/public/css/tablesorter/asc.gif
resources/public/css/tablesorter/bg.gif
resources/public/css/tablesorter/desc.gif
resources/public/css/tablesorter/style.css
resources/public/js/jquery-1.8.2.min.js
resources/public/js/jquery.tablesorter.js

If you are curious, here is my style.css. I changed the original tablesorter css a little bit to better fit Noir theme

table.tablesorter {
    margin:10px 0pt 15px;
    font-size: 10pt;
    width: 100%;
    text-align: left;
}
table.tablesorter thead tr th, table.tablesorter tfoot tr th {
    color: #000000;
    background-color: #b0b8c0;
    border: 1px solid #2a2b2b;
    font-size: 10pt;
    padding: 4px;
}
table.tablesorter thead tr .header {
    background-image: url(bg.gif);
    background-repeat: no-repeat;
    background-position: center right;
    cursor: pointer;
}
table.tablesorter tbody td {
    color: #b0b8c0;
    padding: 4px;
    background-image: url('/img/bg.png');
    vertical-align: top;
}
table.tablesorter tbody tr.odd td {
    background-color:#F0F0F6;
}
table.tablesorter thead tr .headerSortUp {
    background-image: url(asc.gif);
}
table.tablesorter thead tr .headerSortDown {
    background-image: url(desc.gif);
}
table.tablesorter thead tr .headerSortDown, table.tablesorter thead tr .headerSortUp {
    background-color: #6bffbd;
}

To enable tablesorter we have to update both view files. Few changes in bookshelf.views.common

(ns bookshelf.views.common
  (:use [noir.core :only [defpartial]]
        [hiccup.page :only [include-css include-js html5]]))

(defpartial layout [& content]
  (html5
    [:head
     [:title "Bookshelf"]
     (include-js "/js/jquery-1.8.2.min.js")
     (include-js "/js/jquery.tablesorter.js")
     (include-css "/css/reset.css")
     (include-css "/css/tablesorter/style.css")
     (include-css "/css/noir.css")]
    [:body
     [:div#wrapper
      content]]))

and few changes in bookshelf.views.books

(ns bookshelf.views.books
  (:require [bookshelf.models.db :as db]
            [bookshelf.views.common :as common])
  (:use [noir.core :only [defpage]]
        [noir.response :only [content-type]]
        [hiccup.element :only [link-to javascript-tag]]))

(defn- list-books []
  [:table.tablesorter {:id "bookTable"}
   [:thead
    [:tr
     [:th "Author"]
     [:th "Title"]
     [:th "Published"]
     [:th "Format"]]]
   (into [:tbody]
         (for [book (db/books)]
           [:tr
            [:td (:author book)]
            [:td (:title book)]
            [:td (:year book)]
            [:td (link-to (clojure.string/join "/" ["/books" (:id book) (:format book)])
                          (:format book))]]))])

(defpage "/books" []
  (common/layout
    [:h1 "Books"]
    (javascript-tag "$(document).ready(function() {$(\"#bookTable\").tablesorter();});")
    (list-books)))

(defn- ctype [format]
  (if (= "pdf" format) "application/pdf" "text/plain"))

(defpage "/books/:id/:format" {:keys [id format]}
  (content-type (ctype format) (java.io.ByteArrayInputStream. (db/get-file id))))

After we refresh the browser, we should see the final design and be able to sort the table

Ship it!

OK, we are ready to ship. But before we build a deployable artifact, we, as professional developers, should update documentation (README.md in our case) and finalize the version of the application (project.clj)

(defproject bookshelf "0.1.0"
  :description "Bookshelf site"
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [noir "1.3.0-beta3"]]
  :main bookshelf.server)

I want to run this application as a a standalone Java web application, without any dependency on Leiningen. Therefore I have to add :gen-class to server.clj

(ns bookshelf.server
  (:require [noir.server :as server])
  (:gen-class))

(server/load-views-ns 'bookshelf.views)

(defn -main [& m]
  (let [mode (keyword (or (first m) :dev))
        port (Integer. (get (System/getenv) "PORT" "8080"))]
    (server/start port {:mode mode
                        :ns 'bookshelf})))

Now package the application by running the following command

$ lein uberjar

As a result of this command, bookshelf-0.1.0-standalone.jar artifact is created in the target directory. I scp this file to my server, to the directory where my books are located, and start the app

$ export PORT=3030; nohup java -jar bookshelf-0.1.0-standalone.jar prod > nohup.out 2>&1 &

And that's basically it. We have created a simplest web application in Clojure, which might be useful by its own. But more importantly, you've learned how to build it. I hope you enjoyed reading this tutorial as I enjoyed writing it.

Recap

If you want to create web application in Clojure, try Noir. Noir is small and easy to pick up.

  • use defpage macro to define URL routes
  • use defpartial macro to build views
  • use Leiningen to run local web server
  • use REPL to experiment with business logic
  • have fun with Clojure

Resources

  1. Long in-depth Noir tutorial: http://yogthos.net/blog/22
  2. Source code of this tutorial: https://github.com/ndpar/bookshelf
  3. Noir: http://www.webnoir.org
  4. Hiccup: https://github.com/weavejester/hiccup
P.S. After I wrote the draft of this post I found another tutorial on Clojure web development, which even has the same name for the application! If I knew about it before, I wouldn't probably write mine. But since it's already typed, let it be published.

Monday, October 01, 2012

Exporting Solr documents

Recently I had to copy some documents from one Solr server to another. I expected Solr already had an interface that allowed me to extract documents in the same format they were inserted. In that case I would pipe an output of one curl command to another, and consider the job done. As it turned out, the format of Solr input document is different than the output format. Here is how input document looks like:

<add>
    <doc>
        <field name="id">12345</field>
        <field name="articlestate">published</field>
        <field name="articletype">news</field>
        <field name="body">Lorem ipsum dolor...</field>
        <field name="referenceid">175820</field>
        <field name="referenceid">163786</field>
        <field name="created">2011-02-15T14:57:54.766Z</field>
    </doc>
</add>
Notice the flat structure of this document: all element names are the same regardless of the filed type, and arrays (referenceid) are not grouped. Now compare it to the output format. Here is what you get when you execute a query against a Solr server:
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="q">id:12345</str>
        </lst>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <str name="id">12345</str>
            <str name="articlestate">published</str>
            <str name="articletype">news</str>
            <str name="body">Lorem ipsum dolor...</str>
            <arr name="referenceid">
                <str>175820</str>
                <str>163786</str>
            </arr>
            <date name="created">2011-02-15T14:57:54.766Z</date>
        </doc>
    </result>
</response>
Even if we ignore the response header, the structure of the response/result/doc is not the same as of input document: the element names reflect the types, the arrays are grouped. If you try to add this document to a Solr server, you will get an error "unexpected XML tag", obviously. I googled for couple hours on how to convert an output document to an input, and, to my surprise, didn't find any solution. (If you happen to know the solution, please leave a reference in the comments.)

Anyways, I implemented my own converter in Groovy, which solved the problem. I post it here in case somebody needs it.

Note: You can also use this script to re-index Solr.

Wednesday, November 23, 2011

Cygwin git-svn messed up

I upgraded my Cygwin from version 1.5 to 1.7 (finally), and found that git-svn command was broken

$ git svn rebase
Can't locate SVN/Core.pm in @INC (@INC contains:
/usr/lib/perl5/site_perl/5.10
/usr/lib/perl5/site_perl/5.14/i686-cygwin
/usr/lib/perl5/site_perl/5.14
/usr/lib/perl5/vendor_perl/5.14/i686-cygwin
/usr/lib/perl5/vendor_perl/5.14
/usr/lib/perl5/5.14/i686-cygwin
/usr/lib/perl5/5.14
/usr/lib/perl5/vendor_perl/5.10
/usr/lib/perl5/site_perl/5.8 .) at /usr/lib/git-core/git-svn line 42.

Usually this error indicates that subversion-perl package is not installed, but that was not the case as we can see it from cygcheck output:

$ cygcheck -c subversion-perl
Cygwin Package Information
Package Version Status
subversion-perl 1.7.1-1 OK

As it turned out, the problem was in the package dependency management. From the error above we saw that perl was looking for SVN/Core.pm in /usr/lib/perl5/vendor_perl/5.14/i686-cygwin directory. But the latest subversion-perl installed it in the different place

$ cygcheck -l subversion-perl
...
/usr/lib/perl5/vendor_perl/5.10/i686-cygwin/SVN/Core.pm
...

The problem was solved by downgrading perl from 5.14.x-x to 5.10.x-x.

If you have the same problem, check which version of perl package you have installed

$ cygcheck -f /usr/bin/perl
perl-5.10.1-5

It must be the same as the version used by subversion-perl.

Saturday, November 05, 2011

Ford Marbles

I found these marvelous renderings of Ford circles on flickr. I can't help but share them here.







As Thomae's function, Ford circles is another visual representation of rational numbers. You can investigate them here with interactive Wolfram demo.

Friday, September 16, 2011

Modulo who?

When programmer and mathematician are talking about modulus or modulo, there is often a confusion what this term means. For programmer modulo means an operator that finds the remainder of division of one number by another, e.g. 5 mod 2 = 1. For mathematician modulo is a congruence relation between two numbers: a and b are said to be congruent modulo n, written a ≡ b (mod n), if their difference a − b is an integer multiple of n.

These two definitions are not equivalent. The former is a special case of the latter: if b mod n = a then a ≡ b (mod n). The inverse is not true in general case. 5 mod 2 = 1, and 1 ≡ 5 (mod 2) because 1 - 5 = -4 is integer multiple of 2. Now 5 ≡ 1 (mod 2) because 5 - 1=4 is evenly divisible by 2, but 1 mod 2 = 1, not 5.

The biggest confusion happens when programmer and mathematician start arguing about Gauss' famous golden theorem where both definitions of modulus can be used.

Saturday, August 06, 2011

Thomae's function

Thomae's function (a.k.a. Riemann function) is defined on the interval (0, 1) as follows


Here is the graph of this function with some points highlighted as plus symbols for better view.


This function has interesting property: it's continuous at all irrational numbers. It's easy to see this if you notice that for any positive ε there is finite number of points above the line y = ε. That means for any irrational number x0 you can always construct a δ-neighbourhood that doesn't contain any point from the area above the line y = ε.


To generate the data file with point coordinates I used Common Lisp program:

(defun rational-numbers (max-denominator)
(let ((result (list)))
(loop for q from 2 to max-denominator do
(loop for p from 1 to (1- q) do
(pushnew (/ p q) result)))
result))

(defun thomae-rational-points (abscissae)
(mapcar (lambda (x) (list x (/ 1 (denominator x)))) abscissae))

(defun thomae (max-denominator)
(let ((points (thomae-rational-points (rational-numbers max-denominator))))
(with-open-file (stream "thomae.dat" :direction :output)
(loop for point in points do
(format stream "~4$ ~4$~%" (first point) (second point))))))

(thomae 500)

To create the images I used gnuplot commands:

plot "thomae.dat" using 1:2 with dots
plot "thomae.dat" using 1:2 with points

and Photoshop.

Thursday, June 30, 2011

Rod Johnson on Entrepreneurialism

One thing I think that you really need to be careful of as well, particularly if you, like me, are a programmer, is don’t get carried away writing code. Typically in my experience anyone who is a good programmer is pretty passionate about it, love writing code, get addicted to the process of writing code, fell pretty good about their code basis. As soon as you get down that path you are not thinking straight anymore and now you are increasing your emotional investment, you are having lots of fun writing interesting code and you are no longer in a place mentally where you are going to be trying to find some reason that you shouldn’t write that code. That has been a big lesson for me that the quicker I get to coding, the longer it takes me to ask the kind of questions I should ask upfront.
...
It is really, really hard to decide not to do things. One of the biggest killers of companies is trying to do too much. If you try to take on too many things you will assuredly fail, even if every one of those things is a good thing to do. It is incredibly hard to realize that a particular thing is a good idea, but you are not going to do it.
...
I think the biggest way to decide frankly if you are trading off business priorities is do the boring stuff like, look at the total addressable market, go and talk to customers, figure out what they will pay for. You really need to be guided by what the revenue is likely to be, and make sure you don’t just do something just because it’s cool.

Monday, June 27, 2011

Math and Physics of Benderama

The last episode of Futurama has interesting formula involved. The entire plot is based on the Professor's latest invention — Banach-Tarski Dupla-Shrinker — the machine that produces two copies of any object at a 60% scale. It was just a matter of time when Bender found a good usage of this machine: to replicate himself. Then two small copies of Bender replicated themselves making four smaller copies, and so forth. At some point the Professor horrified the crew that if they don't stop this unlimited growth, the total mass of all Benders will eventually be so big that the entire Earth will be consumed during the process of replication. As a proof he demonstrated this formula of the mass of all generations of Bender


This is a perfect toy for a science geek. The first obvious question it brings: is this formula mathematically correct? As it turns out, it is not. Considering the scale of 60%, the cubic dependency of volume on linear dimension, and the constant density of all copies, the formula should be the following


As you can see the total mass of infinite number of Benders actually converges to approximately 1.76 M0. So from Math perspective there is nothing to worry about. But what if our assumption of constant density is invalid. Would it be a problem from Physics perspective? Let's see.

Knowing that every new copy has a size of 0.6 of the original it was made from, we have the following formula for the size of Bender in the nth generation


This exponential function becomes very small pretty soon. In the 154th generation it already reaches the Planck length, after which the further replication is physically impossible. If we calculate the total mass of 154 Bender's generations using the Professor's formula, we get H(154) × 238 kg ≈ 1,337.56 kg, which is nothing comparing to the Earth mass.

So we have to admit that from both Math and Physics perspective the Professor was wrong, and there was no real threat to the Earth.

Although the Professor's formula doesn't describe the replication process adequately, it's still a beautiful piece of Math because it's a formula of harmonic series. If you want to know why harmonic series is beautiful and which real processes it describes, read this nice article of John H. Webb.

And don't miss the next episode of Futurama this Thursday :-)

Wednesday, June 08, 2011

Functional Groovy switch statement

In the previous post I showed how to replace chained if-else statements in Groovy with one concise switch. It was done for the special case of if-stement where every branch was evaluated using the same condition function. Today I want to make a generalization of that technique by allowing to use different conditionals.

Suppose your code looks like this:

if (param % 2 == 0) {
'even'
} else if (param % 3 == 0) {
'threeven'
} else if (0 < param) {
'positive'
} else {
'negative'
}

As soon as every condition operates on the same parameter, you can replace the entire chain with a switch. In this scenario param becomes a switch parameter and conditions become case parameters of Closure type. The only thing we need to do is to override Closure.isCase() method as I described in the previous post. The safest way to do it is to create a category class:

class CaseCategory {
static boolean isCase(Closure casePredicate, Object switchParameter) {
casePredicate.call switchParameter
}
}

Now we can replace if-statement with the following switch:

use (CaseCategory) {
switch (param) {
case { it % 2 == 0 } : return 'even'
case { it % 3 == 0 } : return 'threeven'
case { 0 < it } : return 'positive'
default : return 'negative'
}
}

We can actually go further and extract in-line closures:

def even = {
it % 2 == 0
}
def threeven = {
it % 3 == 0
}
def positive = {
0 < it
}

After which the code becomes even more readable:

use (CaseCategory) {
switch (param) {
case even : return 'even'
case threeven : return 'threeven'
case positive : return 'positive'
default : return 'negative'
}
}