Asynchronous JavaScript with Generators – An Experiment

I have recently added a third blade to my little asynchronous programming swiss army knife: the generator blade.

Looks sharp! Here are some details.

Generators

Generators are JavaScript’s flavor of coroutines. They solve one of computer science’s most important problems: generating the mysterious Fibonacci numbers!

function genFibos() {
  var f1 = 0, f2 = 1;
  while (true) {
    yield f1;
    var t = f1;
    f1 = f2;
    f2 += t;
  }
}

function printFibos() {
  var g = genFibos();
  for (var i = 0; i < 10; i++) {
    var num = g.next();
    print(num);
  }
}

printFibos();

Generators are characterized by the presence of a yield keyword. Any function that contains a yield, like the genFibos function above, returns a generator.

Generators have a somewhat unusual execution pattern. I’ll quickly describe it using the example above.

The first oddity is that the genFibos() call inside printFibos does not execute the body of the genFibos function. Instead, it just returns a generator which is assigned to the g variable.

The first call to g.next() in the printFibos loop starts the generator: the body of the genFibos function executes until it reaches the yield keyword for the first time. At this point, control is transferred from genFibos to the point where g.next() was called and the num variable receives the value of f1 (0).

Then, execution continues in printFibos’s for loop; i is incremented and g.next() is called a second time. This call transfers execution to the genFibos function again, at the point where we left it before, i.e. just after the first yield. The genFibos function loops and reaches yield a second time. As previously, control is transferred back to printFibos, and g.next() returns the value of f1 (1), which is assigned to num.

The for loop continues, hits another g.next() which transfers control to the yield point inside genFibos, which causes genFibos to loop again to the next yield, and so on, and so on.

What we have here is a little dance between printFibos and genFibos, using g.next() and yield f1 to jump from one to the other. Coroutines: functions that cooperate by yielding control to each other.

A bit disconcerting at first but it works, and it can help us deal with asynchronous code, as we’ll see shortly.
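Before moving on, here is a quick cheat sheet of the rest of the legacy generator API (the JavaScript 1.7 flavor implemented by Firefox) that will show up later in this post; the comments sketch the semantics as I understand them:

var g = genFibos();
g.next();          // runs to the first yield and returns the yielded value (0)
g.send(undefined); // resumes after the yield; equivalent to g.next()
try {
  g.throw(new Error("stop")); // resumes by throwing at the yield point
} catch (ex) {
  // genFibos does not catch, so the error propagates back to us and closes g
}
g.close(); // terminates a generator, running any pending finally clauses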

Generators in JavaScript

Generators are in the works for ECMAScript Harmony but they haven’t landed in V8 yet. So we cannot use them in node.js today.

On the other hand, they have been available for more than 5 years in Firefox (Firefox 2 and up). They are disabled by default but you can enable them by adding a version attribute to the <script> tag:

<script type="application/javascript;version=1.7">
function genFibos() { ... }
function printFibos() { ... }
printFibos();
</script>

And they are also available in luvmonkey, Tim Caswell’s port of libuv to SpiderMonkey. I’ve used both of these environments in my experiments.

Streamline.js

Streamline.js is my little language tool for asynchronous JavaScript programming. It does not help much with Fibonacci numbers but it does help a lot with code that sits on top of asynchronous APIs. No need to twist your brain any more; you can just write the code as if the APIs were synchronous, as long as you follow one simple rule:

Pass the _ marker everywhere a callback is expected.

Here is an example:

// waits one second (async/non blocking) and returns result
function delay(_, result) {
  setTimeout(_, 1000);
  return result;
}

function helloWorld(_) {
  print("starting ...");
  print(delay(_, 'hello ...'));
  print(delay(_, '... world'));
  print("done!");
}

helloWorld(function(err) {
  if (err) throw err;
  else print("continuing with callback");
});

It executes as follows:

node helloWorld._js
starting ...
(wait one second) hello ...
(wait one second) ... world
done!
continuing with callback

Streamline works by transforming the source, either to callback based code, or to code that uses Marcel Laverdet’s node-fibers library. You can see the callback transformation in action by cutting and pasting the example above in the online demo.

Yielding Hello World!

Supposedly, generators make it easier to write asynchronous code. And they do! Here is the generator version of my little hello world example:

function delay(_, result) {
  yield AsyncGen.invoke(this, setTimeout, [_, 1000], 0);
  yield result;
}

function helloWorld(_) {
  print("starting ...");
  print(yield delay(_, 'hello ...'));
  print(yield delay(_, '... world'));
  print("done!");
  yield;
}

function startDemo() {
  AsyncGen.run(helloWorld, [function(err) {
    if (err) alert(err);
    else print("continuing with callback");
  }], 0);
}

Quite nice! The general flow is similar to the synchronous flow. Instead of having to rewire everything with callbacks you just need to apply some rather simple rules:

  • First, keep the _ parameter (*) and add a yield keyword in front of all the calls that have this parameter.
  • Replace all return keywords with a yield.
  • Don’t forget to add a yield at the end of functions that don’t end with a return. This is because these functions do return at the end; it is just JavaScript that lets you omit the return when you don’t have a value to return.
  • Use the special AsyncGen.invoke function to call asynchronous APIs that expect a callback, like setTimeout.
  • Use the special AsyncGen.run function to call any of your generator functions as a node.js callback style function.

(*) You can rename the _ parameter though, as long as you do it consistently.

The other side of the mirror

Looks good. But how will it work? We now have generator functions that call other generator functions, and yield keywords all over the place in our asynchronous code. If you remember the little Fibonacci dance that we saw earlier, you can imagine that we’ll need a dance partner to interact with these generator functions. And this time, the dance is going to be wild. So we’ll need a very strong partner on the other side to keep it going!

This strong partner is the AsyncGen.run function, which gets a little help from AsyncGen.invoke. Here they come:

  var PENDING = {};

  window.AsyncGen = {
    run: function(fn, args, idx) {
a)    var cb = args[idx],
        g;

      function resume(err, val) {
        while (g) {
          try {
b)          val = err ? g.throw(err) : g.send(val);
            err = null;
            // if we get PENDING, the current call completed with a pending I/O
            // resume will be called again when the I/O completes. So just return here.
c)          if (val === PENDING) return;
            // if we get [PENDING, e, r], the current call invoked its callback synchronously
            // we just loop to send/throw what the callback gave us.
d)          if (val && val[0] === PENDING) {
              err = val[1];
              val = val[2];
            }
            // else, if g yielded a value which is not a generator, g is done. 
            // so we unwind it and send val to the parent generator (or through cb if we are at the top)
e)          else if (!isGenerator(val)) {
              g.close();
              g = g.prev;
            }
            // else, we got a new generator which means that g called another generator function
            // the new generator becomes current and we loop with g.send(undefined) (equiv. to g.next())
            else {
f)            val.prev = g;
              g = val;
              val = undefined;
            }
          } catch (ex) {
            // the send/throw call failed.
            // we unwind the current generator and we rethrow into the parent generator (or through cb if at the top)
g)          g.close();
            g = g.prev;
            err = ex;
            val = undefined;
          }
        }
        // we have exhausted the stack of generators. 
        // return the result or error through the callback.
h)      cb(err, val);
      }
      // set resume as the new callback
i)    args[idx] = resume;
      // call fn to get the initial generator
j)    g = fn.apply(this, args);
      // start the resume loop
k)    resume();
    },

    invoke: function(that, fn, args, idx) {
      // Set things up so that call returns:
      // * PENDING if it completes with a pending I/O (and cb will be called later)
      // * [PENDING, e, r] if the callback is called synchronously.
      var result = PENDING,
        sync = true;
l)    var cb = args[idx];
      args[idx] = function(e, r) {
m)      if (sync) {
n)        result = [PENDING, e, r];
        } else {
o)        cb(e, r);
        }
      };
p)    fn.apply(that, args);
q)    sync = false;
      return result;
    }
  };
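A small note: the code above relies on an isGenerator helper that I did not include. Here is a minimal sketch that would do the job, assuming that the only values carrying the next/send/close triple are legacy generator objects:

function isGenerator(val) {
  // treat any object exposing the legacy generator methods as a generator
  return val != null &&
    typeof val.next === 'function' &&
    typeof val.send === 'function' &&
    typeof val.close === 'function';
}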

To demonstrate how it works, I’ll use an annotated version of the hello world program:

    function delay(_, result) {
1)    yield AsyncGen.invoke(this, setTimeout, [_, 1000], 0);
2)    yield result;
    }

    function helloWorld(_) {
3)    print("starting ...");
4)    print(yield delay(_, 'hello ...'));
5)    print(yield delay(_, '... world'));
      print("done!");
6)    yield;
    }

    function startDemo() {
7)    AsyncGen.run(helloWorld, [function(err) {
8)      if (err) alert(err);
        else print("continuing with callback");
      }], 0);
9)  }

Here is how execution unfolds (buckle up and be ready for a little rodeo!):

  • 7) startDemo calls AsyncGen.run.
  • a) and i) run stores the callback in cb and replaces it by resume.
  • j) run calls helloWorld with resume as callback. Nothing gets executed in helloWorld but a helloWorld generator is returned and assigned to g.
  • k) run calls resume().
  • b) resume calls g.send(undefined), which is the same as calling g.next(). Execution jumps to 3) inside helloWorld.
  • 3) helloWorld prints "starting...".
  • 4) helloWorld calls delay. Nothing gets executed in delay but a delay generator is returned and yielded by helloWorld. This yield jumps back to b), where g.send returns the delay generator.
  • b) The delay generator is assigned to val and err is reset.
  • c) d) and e) The tests are false so these if branches are skipped.
  • f) The delay generator is chained with the helloWorld generator and is assigned to g. val is set to undefined.
  • b) resume calls g.send(undefined). Execution jumps to 1) inside delay.
  • 1) delay calls AsyncGen.invoke to invoke setTimeout with resume as callback.
  • l) invoke remembers the callback in cb and replaces it by its own callback.
  • p) invoke calls setTimeout.
  • q) sync is set to false and invoke returns PENDING.
  • 1) We are back inside delay, and PENDING is yielded to the run loop. Execution jumps to b) with PENDING as return value.
  • b) PENDING is assigned to val and err is reset.
  • c) The test is true and resume returns.
  • k) run returns.
  • 9) startDemo returns.
  • Sleeping one second…
  • m) Awaken in setTimeout’s callback. sync is false.
  • o) resume is called with both parameters undefined.
  • b) resume calls g.send(undefined). Execution jumps to 1) inside delay.
  • 2) delay yields result, which is "hello ...". Execution jumps to b) with "hello ..." as returned value.
  • b) "hello ..." is assigned to val and err is reset.
  • c) and d) The tests are false.
  • e) The test is true. The delay generator is closed and popped from the chain. Now, g points to the helloWorld generator which is where we left it, at 4).
  • b) resume calls g.send("hello ..."). Execution jumps to 4) inside helloWorld.
  • 4) helloWorld prints "hello ...".
  • 5) The same dance happens again with the second delay call, up to c), where the current resume call returns, and o), the end of the setTimeout callback.
  • Sleeping one second…
  • m) Awaken in setTimeout’s callback. Same dance as before until "... world" is returned at 5).
  • 5) helloWorld prints "... world" and then "done!".
  • 6) helloWorld yields. Execution jumps to b) with undefined as returned value.
  • b) undefined is assigned to val and err is reset.
  • c) and d) The tests are false.
  • e) The test is true. The helloWorld generator is closed and popped from the chain. Now, g is undefined.
  • h) The run loop is over. run calls its callback, which was passed by startDemo.
  • 8) startDemo’s callback prints "continuing with callback".

This was probably tedious and painful to follow but I think that it is worth going through it at least once, step by step, to understand how execution unfolds when mixing generators and asynchronous calls. I find it not very obvious, to say the least, and I had to take a few aspirins to get the code into a simple form that works.

Also, the step by step execution that I just described did not explore all the branches because helloWorld is a well behaved dancer that does not throw exceptions. But the run function is robust enough to cope with less well behaved dancers.

The nice part is that, with these two functions, we can now write async code with normal, sync-like control flow. For example, we can call the delay function in a loop as:

function printNumbers(_, min, max) {
  for (var i = min; i <= max; i++) print(yield delay(_, i));
  yield;
}

AsyncGen.run(printNumbers, [function(err) {
  if (err) alert(err);
  else print("continuing with callback");
}, 1, 100], 0);

The two functions, run and invoke, allow us to cross the mirror in both directions between the callbacks world and the generators world:

callbacks => run => generators => invoke => callbacks

The little trouble with finally

The run function is rather powerful and can handle all JavaScript constructs except one: finally clauses.

The problem is that all return statements are converted to yield statements and run assumes that the generator is done when it yields a value (e) rather than another generator (f). This works, except when the return was inside a try block with a finally clause. In that case, run must enter the generator again, to execute the finally clause after the return. And, unfortunately, g.send(val) cannot do the job because it would resume just after the return which has been turned into a yield, instead of resuming in the finally clause.

There is nevertheless a workaround, which imposes a small amount of code rewriting. The idea is to rewrite:

function f(_) {
  ...
  try {
    ...
      return x;
    ...
  } finally {
    ...
  }
  ...
}

as:

function f(_) {
  ...
  var result__ = null;
  finally__:
  do {
    try {
      ...
        { result__ = [x]; break finally__; }
      ...
    } finally {
      ...
    }
  } while (false);
  if (result__) return result__[0];
  ...
}

The do { ... } while (false); loop looks silly but it is labeled and acts as a goto which lets us move the return outside of the try/finally construct.
Once all the returns have been moved outside of the try/finally, they can be converted to yield and the run function is back on track.

Looking for the full stack trace, desperately

The second problem that I hit in this experiment was the difficulty of reconstructing the stack trace when catching an exception.

The exception carries the stack trace of the current resume call but I did not find a good way to reconstruct the stack trace that corresponds to the sequence of yield calls. All the information is in the generators that are in the g --> g.prev --> g.prev.prev ... chain but I could not find any API to access this information.

The situation could be improved by making every generator yield upon entry to pass stack trace information to the run function but this is inefficient and insufficient (it can give the function names but not the line numbers). Things would be much better if generators had an API that run could call to get stack information.

Wrapping up

As I mentioned in the introduction, I have added generators as a third blade to streamline.js. I was quite confident that streamline could be adjusted to generate code that takes advantage of generators but as usual, the proof of the pudding is in the eating. Good news! It works!

You can play with it here:

I’ll probably continue to use streamline, though, even when generators become widely available: the preprocessor handles all the ugly details and hides all the extra yield, invoke and run calls under the carpet. All that remains is the underscore in every spot where the code yields. Also, streamline provides other features, like futures, that would otherwise need to be hand-coded.

Having three blades makes it easy to compare callbacks, fibers and generators:

First, the generators and fibers transforms produce code that is much leaner and simpler than what the callback transform produces. This does not necessarily mean that they will outperform callbacks because they have their own runtime overhead (coroutines). I did not have time to do serious benchmarking but my experience with real code shows that callbacks win when there is a thin layer of logic on top of I/O calls but fibers can take the advantage when the logic becomes thicker and when caching comes into play.

I cannot benchmark fibers against generators today because they are supported by different engines (V8 and spidermonkey). All I can do is compare the generated code patterns. And here, the fibers transform wins because it does not need to introduce yield at all levels, only at the lowest level. So, fibers have an advantage with thick layers of code, which will need to be balanced with the optimizations that the JS engine may be able to apply to generators.

And I won’t close this post without thanking Marcel one more time. His implementation of the fibers transform gave me a big head start in this experiment.


Node.js: Awesome Runtime and New Age JavaScript Gospel

I am a big fan of node.js but I have a big problem with the core team. No need to hide it and pretend everything is OK. The fibers vs. callback war erupted again this week on the mailing list, with new ballistic benchmark missiles fired from both sides and showers of sarcasm.

This is ridiculous. Time to sit down and blog about it!

Awesome Runtime

Node is an awesome runtime. It is simple; it is fast; it is 100% asynchronous; it is JavaScript. These are the things that make it so attractive for people like me. I’m sold on the “end to end JavaScript” story and node is an important piece of that story. I’ve also done my share of classical thread programming (Java, .NET) and I’m now deeply convinced that async is the way to go.

Node also comes with a good package system that manages dependencies in a very decoupled way and a simple tool to publish, install and update packages. I also like the fact that most components are published under the very liberal MIT license and that Windows, which is important for us, is supported as a first class platform.

Definitely, node.js is an awesome runtime. I have not regretted a single day having chosen it for one of our projects.

New Age JavaScript Gospel

So the core team does a great job on the runtime but it seems invested with another mission: evangelizing a new way of programming, based on callbacks, streams and pipes.

To develop on node, you have to learn callbacks because this is how JavaScript deals with asynchronous code. You may find it hard but you are wrong:

It is not hard, you just need to learn callbacks.

Also, some people may tell you that they have found ways to ease your pain, but you should not listen to them; they are just heretics who are trying to divert you from the true and only way to node.

You *have* to learn callbacks!

So you will learn callbacks! And you’ll probably find out that they are not so hard after all. It is mostly a question of knowing a few patterns and applying them. It is more painful than hard. But it is also more error prone; it forces you to write more code; and the code is more difficult to read and modify because of all the “pattern noise”.

You’ll probably think that all this might be fine for a personal project but that it doesn’t scale well to large projects. Costs will rise at all levels: training costs to get people up to speed with callbacks, development costs because of extra code to write and of more difficult code reviews, quality costs because of more fragile code, maintenance costs, etc.

So, you may come back to the mailing list with a question like:

Fine, now I understand callbacks but I still have problems. Isn’t there a better way?

And you’ll get the same answer:

No. Callbacks are perfectly fine! You just need to refactor your logic and you should try to reformulate your problem with streams and pipes.

And don’t listen to the sirens who say that they have solutions for you and that you shouldn’t bother with callbacks.

You *have* to learn callbacks and streams!

Someone might add:

Your application is probably not a good fit for node. You should have chosen PHP or Ruby instead.

But you want to use node because you like JavaScript and node’s asynchronous model.

I don’t know what you’ll do at this point. One possibility is that you’ll follow the party line: you will write a stream. It might not solve the problem you were trying to solve in the first place but you’ll be able to post your success story on the mailing list and you’ll get plenty of kudos from the core team.

The worst part is that this is hardly a caricature. Node is not just a runtime; it comes with a gospel:

You have to program with callbacks and you have to rethink your application as streams and pipes.

The gospel is wrong!

Asynchronous !== Callbacks

Part of the problem comes from an unfortunate identification between asynchronous programming and callbacks. Asynchronous programming and callbacks are seen by many as one and the same thing. For them, programming asynchronously means programming with callbacks.

This is just plain wrong: asynchronism is the essence of node while callbacks are just an accident of node. They belong to different levels.

Asynchronism is a behavior. It is essential in node because node is all about performance and asynchronous I/O. Synchronous (blocking) I/O is a disaster for node.

Callbacks are just an artifact that JavaScript gives us to deal with asynchronous behaviors. Today, JavaScript only gives us this artifact, but tomorrow it will give us other artifacts: generators and yield. Callbacks are just an accident.

And, BTW, callbacks are probably not the best artifact to express asynchronism because they are *not* essentially asynchronous. Just consider the callbacks passed to Array.prototype.forEach: those are invoked synchronously.
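A quick illustration of this synchronous behavior:

var after = false;
[1, 2, 3].forEach(function(n) {
  // this callback runs synchronously, before the line below forEach executes
  console.log(n, after); // always logs false
});
after = true;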

IMO, the artifact that I have introduced in streamline.js (the _ marker), and that has been the target of so many sarcastic comments on the mailing list, is a better artifact because it captures the essence of asynchronism in code (the points where execution yields). So, in a sense, streamline.js provides a cleaner model for asynchronous programming than callbacks.

Every piece of software is not a proxy

The nirvana of node’s gospel is a program built entirely with streams and pipes. This comes from a vision that every piece of software can be thought of as some kind of proxy and that systems can be built by connecting streams with pipes. This is a very appealing vision and it would just be so great if it applied to every domain.

It does not!

The streams and pipes vision will likely work well in domains that are naturally event driven. This is the case for most front-end systems. But what about back-end systems? Most of the time these are more goal driven than event driven: they don’t push much data to other layers; instead, they respond to requests/questions by pulling data from other systems. I don’t see the streams and pipes model fitting too well in this context (but maybe it is just that my imagination is limited).

Does this mean that node.js should be ruled out for back-end systems and that I am just plain wrong because I’m trying to use it in areas where it does not fit? I don’t think so. Back-end systems pull data. This means I/O. And the best way to do I/O is the asynchronous way. So why not node.js?

Moving Forward

I wrote all this because I was getting really fed up with the childish attitude on the mailing list and the difficulty of getting into interesting discussions.

As I said in the introduction, I’m a big fan of node.js. It is a great runtime and I’ll continue to use it.

I don’t agree with the gospel and I think that it is totally counterproductive.

The simple fact that the “callback hell” question comes back so regularly on the mailing list should raise a red flag. There is a reality behind this question and this reality cannot be ignored forever.

The gospel is counterproductive because it slows down the adoption of node. The current technical entry ticket is just too high. The core team is young and is probably used to working with high profile software developers. I am working with a mix of high profile and normal developers (people who have very valuable domain knowledge but are less savvy technically) and for them “JavaScript with callbacks” is just a no go.

There is a big opportunity for node.js to compete with PHP and the like but it won’t succeed if it keeps the bar so high.

It is also counterproductive because people don’t use the tools that would allow them to develop 100% asynchronous modules without pain. So, what they end up doing instead is bury a few Sync calls in their modules, thinking that “well, this is just a little bit of file I/O; it won’t hurt” (I hit two cases like this recently and I had to fork and streamline them to eliminate the Sync calls).

And it is counterproductive because it pollutes the discussions, blurs the message and upsets everyone.


Fibers and Threads in node.js – what for?

I like node.js, and I’m not the only one, obviously! I like it primarily for two things: it is simple and it is very fast. I already said it many times but one more time won’t hurt.

Before working with node, I had spent many years working with threaded application servers. This was fun sometimes but it was also often frustrating: so many APIs to learn, so much code to write, so many risks with critical sections and deadlocks, such a waste of costly system resources (stacks, locks), etc. Node came as a breath of fresh air: a simple event loop and callbacks. You can do a lot with so little, and it really flies!

But it does not look like we managed to eradicate threads. They keep coming back. At the beginning of last year Marcel Laverdet opened Pandora’s box by releasing node-fibers: his threads are a little greener than our old ones but they have some similarities with them. And this week the box got wide open as Jorge Chamorro Bieling released threads_a_gogo, an implementation of real threads for node.js.

Isn’t that awful? We were perfectly happy with the event loop and callbacks, and now we have to deal with threads and all their complexities again. Why on earth? Can’t we stop the thread cancer before it kills us!

Well. First, things aren’t so bad because fibers and threads did not make it into node’s core. The core is still relying only on the event loop and callbacks. And it is probably better this way.

And then maybe we need to overcome our natural aversion to threads and their complexities. Maybe these new threads aren’t so complex after all. And maybe they solve real problems. This is what I’m going to explore in this post.

Threads and Fibers

The main difference between fibers and real threads is on the scheduling side: threads use implicit, preemptive scheduling while fibers use explicit, non-preemptive scheduling. This means that threaded code may be interrupted at any point, even in the middle of evaluating an expression, to give CPU cycles to code running in another thread. With fibers, these interruptions and context switches don’t happen randomly; they are in the hands of the programmer, who decides where his code is going to yield and give CPU cycles to other fibers.

The big advantage of fibers’ explicit yielding is that the programmer does not need to protect critical code sections as long as they don’t yield. Any piece of code that does not yield cannot be interrupted by other fibers. This means a lot less synchronization overhead.

But there is a flip side to the coin: threads are fair; fibers are not. If a fiber runs a long computation without yielding, it prevents other fibers from getting CPU cycles. This phenomenon, known as starvation, is not new in node.js: it is inherent to node’s event loop model; if a callback starts a long computation, it blocks the event loop and prevents other events from getting their chance to run.
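Here is a tiny illustration of starvation on the plain event loop, no fibers involved:

setTimeout(function() {
  console.log("I only run after the loop below completes");
}, 0);
for (var i = 0, s = 0; i < 1e9; i++) s += i; // starves the timer above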

Also, threads take advantage of multiple cores. If four threads compete for CPU on a quad-core processor, each thread gets 100% (or close) of a core. With fibers there is no real parallelism; at one point in time, there is only one fiber that runs on one of the cores and the other fibers only get a chance to run at the next yielding point.

Fibers – What for?

So, it looks like fibers don’t bring much to the table. They don’t allow node modules to take advantage of multiple cores and they have the same starvation/fairness issues as the basic event loop. What’s the deal then?

Fibers were introduced and are getting some love primarily because they solve one of node’s big programming pain points: the so-called callback pyramid of doom. The problem is best demonstrated by an example:

function archiveOrders(date, cb) {
  db.connect(function(err, conn) {
    if (err) return cb(err);
    conn.query("select * from orders where date < ?",  
               [date], function(err, orders) {
      if (err) return cb(err);
      helper.each(orders, function(order, next) {
        conn.execute("insert into archivedOrders ...", 
                     [order.id, ...], function(err) {
          if (err) return cb(err);
          conn.execute("delete from orders where id=?", 
                       [order.id], function(err) {
            if (err) return cb(err);
            next();
          });
        });
      }, function() {
        console.log("orders have been archived");
        cb();
      });
    });
  });
}

This is a very simple piece of business logic but we already see the pyramid forming. Also, the code is polluted by lots of callback noise. And things get worse as the business logic gets more complex, with more tests and loops.

Fibers, with Marcel’s futures library, let you rewrite this code as:

var archiveOrders = (function(date) {
  var conn = db.connect().wait();
  conn.query("select * from orders where date < ?",  
             [date]).wait().forEach(function(order) {
    conn.execute("insert into archivedOrders ...", 
                 [order.id, ...]).wait();
    conn.execute("delete from orders where id=?", 
                 [order.id]).wait();
  });
  console.log("orders have been archived");
}).future();

The callback pyramid is gone; the signal to noise ratio is higher, asynchronous calls can be chained (for example query(...).wait().forEach(...)), etc. And things don’t get worse when the business logic gets more complex. You just write normal code with the usual control flow keywords (if, while, etc.) and built-in functions (forEach). You can even use classical try/catch exception handling and you get complete and meaningful stack traces.

Less code. Easier to read. Easier to modify. Easier to debug. Fibers clearly give the programmer a better comfort zone.

Fibers make this possible because they solve a tricky topological problem with callbacks. I’ll try to explain this problem with a very simple example:

db.connect(function(err, conn) {
  if (err) return cb(err);
  // conn is available in this block
  doSomething(conn);
});
// Would be nice to be able to assign conn to a variable 
// in this scope so that we could resume execution here 
// rather than in the block above.
// But, unfortunately, this is impossible, at least if we 
// stick to vanilla JS (without fibers).

The topological issue is that the conn value is only accessible in the callback scope. If we could transfer it to the outer scope, we could continue execution at the top level and avoid the pyramid of doom. Naively we would like to do the following:

var c;
db.connect(function(err, conn) {
  if (err) return cb(err);
  c = conn;
});
// conn is now in c (???)
doSomething(c);

But it does not work because the callback is invoked asynchronously. So c is still undefined when execution reaches doSomething(c). The c variable gets assigned much later, when the asynchronous connect completes.

Fibers make this possible, though, because they provide a yield function that allows the code to wait for an answer from the callback. The code becomes:

var fiber = Fiber.current;
db.connect(function(err, conn) {
  if (err) return fiber.throwInto(err);
  fiber.run(conn);
});
// Next line will yield until fiber.throwInto 
// or fiber.run are called
var c = Fiber.yield();
// If fiber.throwInto was called we don't reach this point 
// because the previous line throws.
// So we only get here if fiber.run was called and then 
// c receives the conn value.
doSomething(c);
// Problem solved! 

Things are slightly more complex in real life because you also need to create a Fiber to make it work.
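For the record, here is roughly what the snippet looks like once wrapped in a fiber (db and doSomething being the placeholders from above):

var Fiber = require('fibers');

Fiber(function() {
  var fiber = Fiber.current;
  db.connect(function(err, conn) {
    if (err) return fiber.throwInto(err);
    fiber.run(conn);
  });
  // yields until fiber.run or fiber.throwInto is called
  var c = Fiber.yield();
  doSomething(c);
}).run();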

But the key point is that the yield/run/throwInto combination makes it possible to transfer the conn value from the inner scope to the outer scope, which was impossible before.

Here, I dived into the low level fiber primitives. I don’t want this to be taken as an encouragement to write code with these primitives because this can be very error prone. On the other hand, Marcel’s futures library provides the right level of abstraction and safety.

And, to be complete, it would be unfair to say that fibers solve just this problem. They also enable powerful programming abstractions like generators. But my sense is that the main reason why they get so much attention in node.js is because they provide a very elegant and efficient solution to the pyramid of doom problem.

Sponsored Ad

The pyramid of doom problem can be solved in a different way, by applying a CPS transformation to the code. This is what my own tool, streamline.js, does. It leads to code which is very similar to what you’d write with fiber’s futures library:

function archiveOrders(date, _) {
  var conn = db.connect(_);
  flows.each(_, conn.query("select * from orders where date < ?",  
                           [date], _), function(_, order) {
    conn.execute("insert into archivedOrders ...", 
                 [order.id, ...], _);
    conn.execute("delete from orders where id=?", 
                 [order.id], _);
  });
  console.log("orders have been archived");
}

The signal to noise ratio is even slightly better as the wait() and future() calls have been eliminated.

And streamline gives you the choice between transforming the code into pure callback code, or into code that takes advantage of the node-fibers library. If you choose the second option, the transformation is much simpler and preserves line numbers. And the best part is that I did not even have to write the fibers transformation; Marcel offered it on a silver platter.

Wrapping up on fibers

In summary, fibers don’t really change the execution model of node.js. Execution is still single-threaded and the scheduling of fibers is non-preemptive, just like the scheduling of events and callbacks in node’s event loop. Fibers don’t really bring much help with fairness/starvation issues caused by CPU intensive tasks either.

But, on the other hand, fibers solve the callback pyramid of doom problem and can provide a great relief to developers, especially those who have thick layers of logic to write.

Threads – What for?

As I said in the intro, threads landed in node this week, with Jorge’s threads_a_gogo implementation (and I had a head start on them because Jorge asked me to help with beta testing and packaging). What do they bring to the table? And this time we are talking about real threads, not the green kind. Shouldn’t we be concerned that these threads will drag us into the classical threading issues that we had avoided so far?

Well, the answer is loud and clear: there is nothing to be worried about! These threads aren’t disruptive in any way. They won’t create havoc in what we have. But they will fill an important gap, as they will allow us to handle CPU intensive operations very cleanly and efficiently in node. In short, all we get here is bonus!

Sounds too good to be true! Why would these threads be so good when we had so many issues with threads before? The answer is simple: because we had the wrong culprit! The problems that we had were not due to the threads themselves, they were due to the fact that we had SHARED MUTABLE STATE!

When you are programming with threads in Java or .NET or other similar environments, any object which is directly or indirectly accessible from a global variable, or from a reference that you pass from one thread to another, is shared by several threads. If this object is immutable, there is no real problem because no thread can alter it. But if the object is mutable, you have to introduce synchronization to ensure that the object’s state is changed and read in a disciplined way. If you don’t, some thread may access the object in an inconsistent state because another thread was interrupted in the middle of a modification on the object. And then things usually get really bad: incorrect values, crashes because data structures are corrupted, etc.

If you have shared mutable state, you need synchronization. And synchronization is a difficult and risky art. If your locks are too coarse you get very low throughput because your threads spend most of their time waiting on locks. If they are too granular, you run the risk of missing some edge cases in your locking strategy or of letting deadlocks creep in. And, even if you get your synchronization right, you pay a price for it because locks are not free and don’t scale well.

But threads a gogo (I’ll call them TAGG from now on) don’t share mutable state. Each thread runs in its own isolate, which means that it has its own copy of the Javascript code, its own global variables, its own heap and stack. Also, the API does not let you pass a reference to a mutable Javascript object from one thread to another. You can only pass strings (which are immutable in Javascript) (*). So you are on the safe side, you don’t run the risk of having one thread modify something that another thread is accessing at the same time. And you don’t need synchronization, at least not the kind you needed around shared mutable objects.

(*) It would be nice to be able to share frozen objects across threads. This is not available in the first version of TAGG but this may become possible in the future. TAGG may also support passing buffers across thread boundaries at some point (note that this may introduce a limited, but acceptable, form of shared state).

I hope that I have reassured the skeptics at this point. As Jorge puts it, these threads aren’t evil. And actually, they solve an important problem which was dramatized in a blog post a few months ago: node breaks on CPU intensive tasks. The blog post that I’m referring to was really trashy and derogatory and it made a huge fuss about a problem that most node applications won’t have. But it cannot be dismissed completely: some applications need to make expensive computations, and, without threads, node does not handle this well, to say the least, because any long running computation blocks the event loop. This is where TAGG comes to the rescue.

If you have a function that uses a lot of CPU, TAGG lets you create a worker thread and load your function into it. The API is straightforward:

var TAGG = require('threads_a_gogo');

// our CPU intensive function
function fibo(n) { 
  return n > 1 ? fibo(n - 1) + fibo(n - 2) : 1;
}

// create a worker thread
var t = TAGG.create();
// load our function into the worker thread
t.eval(fibo);

Once you have loaded your function, you can call it. Here also the API is simple:

t.eval("fibo(30)", function(err, result) {
  console.log("fibo(30)=" + result);
});

The function is executed in a separate thread, running in its own isolate. It runs in parallel with the main thread. So, if you have more than one core the computation will run at full speed in a spare core, without any impact on the main thread, which will continue to dispatch and process events at full speed.

When the function completes, its result is transferred to the main thread and dispatched to the callback of the t.eval call. So, from the main thread, the fibo computation behaves like an ordinary asynchronous operation: it is initiated by the t.eval call and the result comes back through a callback.

Often you’ll have several requests that need expensive computations. So TAGG comes with a simple pool API that lets you allocate several threads and dispatch requests to the first available one. For example:

var pool = TAGG.createPool(16);
// load the function in all 16 threads
pool.all.eval(fibo);
// dispatch the request to one of the threads
pool.any.eval("fibo(30)", function(err, result) {
  console.log("fibo(30)=" + result);
});

TAGG also provides support for events. You can exchange events in both directions between the main thread and worker threads. And, as you probably guessed at this point, the API is naturally aligned on node’s Emitter API. I won’t give more details but the TAGG module contains several examples.

A slight word of caution though: this is a first release so TAGG may lack a few usability features. The one that comes first to mind is a module system to make it easy to load complex functions with code split in several source files. And there are still a lot of topics to explore, like passing frozen objects or buffers. But the implementation is very clean, very simple and performance is awesome.

Wrapping up on threads

Of course, I’m a bit biased because Jorge involved me in the TAGG project before the release. But I find TAGG really exciting. It removes one of node’s main limitations, its inability to deal with intensive computations. And it does it with a very simple API which is completely aligned on node’s fundamentals.

Actually, threads are not completely new to node and you could already write addons that delegate complex functions to threads, but you had to do it in C/C++. Now, you can do it in Javascript. A very different proposition for people like me who have invested a lot in Javascript recently, and not much in C/C++.

The problem could also be solved by delegating long computations to child processes but this is costlier and slower.
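For comparison, here is a minimal sketch of the child process route, assuming a hypothetical fibo-worker.js script that computes fibo for every message it receives and posts the result back:

var cp = require('child_process');

// spawning a whole process is costlier than spawning a TAGG thread,
// and messages are serialized across the process boundary
var worker = cp.fork(__dirname + '/fibo-worker.js');
worker.on('message', function(result) {
  console.log('fibo(30)=' + result);
});
worker.send(30);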

From a more academic standpoint, TAGG brings a first bit of Erlang’s concurrency model, based on share nothing threads and message passing, into node. An excellent move.

Putting it all together

I thought that I was going to write a short post, for once, but it looks like I went overboard, as usual. So I’ll quickly recap by saying that fibers and threads are different beasts and play different roles in node.

Fibers introduce powerful programming abstractions like generators and fix the callback pyramid problem. They address a usability issue.

Threads, on the other hand, fix a hole in node’s story, its inability to deal with CPU intensive operations, without having to dive into C/C++. They address a performance issue.

And the two blend well together (and — sponsored ad — they also blend with streamline.js), as this last example shows:

var pool = TAGG.createPool(16);
pool.all.eval(fibo);
console.log("fibo(30)=" + pool.any.eval("fibo(30)", _));

Kudos to Marcel and Jorge for making these amazing technologies available to the community.


Yield – Resume vs. Asynchronous Callbacks – An Equivalence

I introduced streamline.js about four months ago and I never took the time to explain the theory (if it can be called a theory) behind it. I’ll try to do it in this post.

The material that follows is a lot less entertaining than the little tale I started with. The tale could give the wrong impression that streamline.js is just a cute hack. This post gives a different perspective.

Y-R Javascript

For this investigation, I won’t use streamline.js as is because its syntax does not differentiate it well enough from vanilla Javascript. Instead, I will introduce a fictional language that I will call Y-R Javascript (for Yield and Resume Javascript).

Y-R Javascript is a small extension to Javascript. It introduces a new operator that can be applied to function definitions and function calls:

function myFunc@(p1, p2) {
  // function body here
}

result = myFunc@(arg1, arg2);

The @ sign is the yield and resume operator. When present in a function definition, it means that the function executes asynchronously. Somewhere in its body, the function yields to the system and the system calls it back with a result or an error. It may actually yield and resume more than once before returning to its caller. I will call such a function an asynchronous function, in contrast with normal Javascript functions (defined without @) that I will naturally call synchronous functions.

Y-R Javascript gives you two possibilities to call an asynchronous function.

  • The first one is to call it as myFunc@(arg1, arg2). In this case, the function will yield and resume (maybe several times) and then return a result to its caller (or throw an exception).
  • The second one is to call it as myFunc(arg1, arg2) (without @). In this case, the function will synchronously return a future F that you may call later as F@() to obtain a result. Yielding will happen when you call F@(), not in the initial call to myFunc.

I will also be assuming that the runtime library of Y-R Javascript contains a mixture of synchronous and asynchronous functions. The developer who writes asynchronous programs with Y-R Javascript does not use callbacks. Instead, he uses the @ operator to call asynchronous functions very much like synchronous ones. For example:

function lineCount@(fname) {
  return readFile@(fname, "utf8").split("\n").length;
}

function compareLC@(fname1, fname2) {
  // serial version: start reading fname2 only after
  // getting the count for fname1
  return lineCount@(fname1) - lineCount@(fname2);
}

function fastCompareLC@(fname1, fname2) {
  // parallelized I/O with futures
  var f1 = lineCount(fname1);
  var f2 = lineCount(fname2);
  // both I/O operations have been started.
  // now we yield and resume to obtain the results.
  return f1@() - f2@();
}

Y-R Javascript also supports anonymous asynchronous functions. The syntax is naturally:

function@(args) {
  // body
}

Y-R Javascript imposes the following rule:

Asynchronous calls are contagious: if a function contains an asynchronous call (an @ call), then the function itself must be defined as an asynchronous function (defined with an @).

So, for example, a Y-R Javascript compiler would reject the following function definition:

function lineCount(fname) {
  return readFile@(fname, "utf8").split("\n").length;
}

On the other hand, it would accept the following code:

function getLineCountFunc() {
  return function@(fname) {
    return readFile@(fname, "utf8").split("\n").length;
  }
}

In this last sample, getLineCountFunc is a synchronous function that returns an asynchronous function. The rule given above is not broken because getLineCountFunc does not contain any asynchronous function calls. What it contains is an asynchronous function definition.

There is also an additional rule to deal with top level calls:

Y-R Javascript does not allow asynchronous calls at the top level of a script module because this would conflict with the fact that require works synchronously.

The only exception to this rule is a main script in which we allow top level asynchronous calls.

Equivalence

In this post I want to investigate the relationship between this fictional Y-R Javascript language on one side and Javascript with asynchronous callbacks on the other side.

I will show that any program written in Y-R Javascript can be mechanically translated into an equivalent Javascript program with asynchronous callbacks, and vice versa.

From Y-R to Callbacks

The translation from Y-R Javascript to Javascript with asynchronous callbacks requires many steps. I will not give a complete proof in this post but I will try to give enough information to convince the reader that this transformation is possible and that it preserves the semantics of the Y-R Javascript source.

Synchronous functions

The first thing to observe is that synchronous function definitions may only contain synchronous function calls. So synchronous functions do not need any translation at all.

If a synchronous function contains an asynchronous function definition, like getLineCountFunc above, then the synchronous function itself does not need any translation but the asynchronous function definition that it contains does need to be translated.

Main script with top level asynchronous calls

So we only have to translate asynchronous function definitions and top level asynchronous calls.

We can eliminate the case of top-level asynchronous calls. As mentioned above, these calls are only allowed in main scripts. Here is an example:

var count = readFile@(argv[1], "utf8").split("\n").length;
console.log(argv[1] + " has " + count + " lines.");

This script can be wrapped as follows:

(function@() {
  var count = readFile@(argv[1], "utf8").split("\n").length;
  console.log(argv[1] + " has " + count + " lines.");
})();

This transformation gives us a module that contains a single synchronous call to an anonymous asynchronous function.

The problem has been simplified. We don’t need to worry about top level asynchronous statements any more. If we know how to transform an arbitrary asynchronous function definition then we will have solved our problem.

Canonicalization steps

Our goal now is to eliminate all the @ calls in an arbitrary function definition and replace them by logic based on asynchronous callbacks.

But we are not going to introduce the callbacks right away. Before that, we are going to transform the Y-R Javascript code into a canonical form. The goal is to reduce the number of language constructs that we have to convert to callback form afterwards.

Each of these canonicalization steps is a simple transformation that preserves the semantics of the program. So the program that we obtain after applying all these steps will behave exactly like the original program.

Moving variable declarations

The first canonicalization step moves all the variable declarations to the beginning of the function body. Here is an example:

function f1@() {
  var v1 = 2;
  if (f2@()) {
    var v2 = f3@();
    return v1 + v2;
  }
  return v1;
}

becomes:

function f1@() {
  var v1, v2;
  v1 = 2;
  if (f2@()) {
    v2 = f3@();
    return v1 + v2;
  }
  return v1;
}

Moving function declarations

We apply the same treatment to function declarations (subject to hoisting). For example:

function f1@() {
  f2@(f3());
  function f3() {...}
}

becomes:

function f1@() {
  function f3() {...}
  f2@(f3());
}

Referencing this and arguments with variables

If the function body contains the this keyword, we introduce a new variable to reference this and we replace all occurrences of this by this variable. For example:

function f1@() {
  f2@(this.x);
  f3(this);
}

becomes:

function f1@() {
  var __this = this;
  f2@(__this.x);
  f3(__this);
}

The arguments variable is handled in a similar way.

These three steps will allow us to introduce anonymous functions when we convert the body, without having to worry about the scope of variables and functions defined inside the body, and without having to worry about the fact that function calls may modify the value of this.

Canonicalizing loops

The next step is to transform loops. Javascript has several loop constructs but the most general one is the for(init; cond; incr) loop and they can all be converted to this form. We also move the init clause (if any) before the loop. Here are some examples:

while (cond) {
  ...
}

becomes:

for (; cond; ) {
  ...
}

do {
  ...
} while (cond);

becomes:

var __1;
__1 = true;
for (; __1 || cond; ) {
  __1 = false;
  ...
}

for (init; cond; incr) {
  ...
}

becomes:

init;
for (; cond; incr) {
  ...
}

for (key in obj) {
  ...
}

becomes:

var __1, __2, __3;
__1 = __keys(obj);
__2 = __1.length;
__3 = 0;
for (; __3 < __2; __3++) {
  key = __1[__3];
  ...
}

Note: __keys is a little helper function that returns the properties of the object as an array. It is different from Object.keys because it also returns the prototype’s properties. To be completely rigorous we should generate an extra test in the loop to skip properties that may have been deleted during the loop.
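Here is a minimal sketch of what __keys could look like (without the extra deletion test):

function __keys(obj) {
  var result = [];
  for (var key in obj) result.push(key); // for..in includes prototype properties
  return result;
}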

If the loop condition is a complex condition containing one or more asynchronous calls, we transform it as follows:

for (; cond; incr) {
  ...
}

becomes:

for (; (function@() {
  return cond;
})@(); incr) {
  ...
}

This last transformation ensures that the whole condition will be evaluated at each iteration, even after it has been disassembled (see step below).

Splitting try/catch/finally

This step is very straightforward: we split try/catch/finally statements in two, as follows:

try {
  ...
} catch (ex) {
  ...
} finally {
  ...
}

becomes:

try {
  try {
    ...
  } catch (ex) {
    ...
  }
} finally {
  ...
}

This will allow us to handle try/catch and try/finally with independent transformations.

Eliminating lazy operators

The next canonicalization step will eliminate the three lazy operators of Javascript (&&, || and ?:) when they combine asynchronous expressions. These operators will be converted into function calls as described in the following table:

exp1 && exp2

becomes:

(function@() {
  var v = exp1;
  if (!v) return v;
  return exp2;
})@();

exp1 || exp2

becomes:

(function@() {
  var v = exp1;
  if (v) return v;
  return exp2;
})@();

cond ? exp1 : exp2

becomes:

(function@() {
  if (cond) return exp1;
  else return exp2;
})@();

Thanks to this transformation, we won’t have to worry about lazy operators any more. They have been converted to asynchronous functions that contain if/else statements.

Disassembling expressions

The next canonicalization step will disassemble all expressions that contain asynchronous calls. Here are three examples:

function f1@() {
  var x, y;
  x = f2@(f3() + f4@());
  y = f5@(x).f6@();
  ...
}

becomes:

function f1@() {
  var x, y, __1, ...
  __1 = f3();
  __2 = f4@();
  __3 = __1 + __2;
  x = f2@(__3);
  __4 = f5@(x);
  y = __4.f6@();
  ...
}

function f1@() {
  if (f1@() < f2() + f3@()) {
    ...
  }
}

becomes:

function f1@() {
  var __1, __2, ...
  __1 = f1@();
  __2 = f2();
  __3 = f3@();
  __4 = __2 + __3;
  __5 = __1 < __4;
  if (__5) {
    ...
  }
}

function f1@() {
  return f1@() + f2() + f3@();
}

becomes:

function f1@() {
  var __1, __2, ...
  __1 = f1@();
  __2 = f2();
  __3 = f3@();
  __4 = __1 + __2 + __3;
  return __4;
}

This transformation can be implemented by processing the parse tree in a bottom up fashion and introducing an intermediate variable (__1, __2, etc.) for every node.
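Here is a highly simplified sketch of this bottom-up walk, on a hypothetical mini-AST where leaves are function calls and inner nodes are binary operators:

function disassemble(node, statements, state) {
  var v;
  if (node.call) { // leaf: a function call
    v = '__' + (++state.count);
    statements.push(v + ' = ' + node.call + ';');
    return v;
  }
  // inner node: disassemble children first, then introduce a temp for this node
  var left = disassemble(node.left, statements, state);
  var right = disassemble(node.right, statements, state);
  v = '__' + (++state.count);
  statements.push(v + ' = ' + left + ' ' + node.op + ' ' + right + ';');
  return v;
}

// disassembling f3() + f4@():
var stmts = [];
disassemble({ op: '+', left: { call: 'f3()' }, right: { call: 'f4@()' } }, stmts, { count: 0 });
// stmts: [ '__1 = f3();', '__2 = f4@();', '__3 = __1 + __2;' ]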

As demonstrated by the second example, the transformation moves asynchronous calls outside of if conditions. It also moves them outside of switch, return and throw expressions, as demonstrated by the third example above. On the other hand, it does not move asynchronous calls outside of loop conditions but they will be moved a bit later (see loop transformation pattern below).

This transformation is valid because we have taken special care of loop conditions and lazy operators. The following transformation would have been incorrect:

if (f1@() && f2@()) {
  ...
}

becomes (incorrectly):

var __1, __2, __3;
__1 = f1@();
__2 = f2@(); // wrong! f2 must not be called when __1 is falsy
__3 = __1 && __2;
if (__3) {
  ...
}

Post-canonicalization state

We haven’t introduced any callbacks so far. We have just transformed our Y-R Javascript source into another Y-R Javascript source. It is relatively easy to verify that the individual transformations that we have applied preserve execution semantics. So, the transformed source that we have at this stage is another Y-R Javascript program which has the same observable behaviors as the original program.

But the new source will be easier to transform than the original one because all the complex expressions have been disassembled. Asynchronous calls will only be found in the following constructs:

  • func@(args) or obj.func@(args): asynchronous procedure call
  • v = func@(args) or v = obj.func@(args): asynchronous function call
  • for (; func@(args); incr) or for (; obj.func@(args); incr): loop condition

We are now ready to introduce the callbacks.

Transforming an asynchronous function definition

An asynchronous function definition is transformed as follows:

function f@(a1, a2) {
  ...
}

becomes:

function f(__cb, a1, a2) {
  if (!__cb)
    return __future.call(this,
      f, arguments);
  ...
  __cb();
}

The __cb argument is the callback that the transformed f function will use to return a value or an error. It has the standard node.js callback signature:

__cb(err, result)

The callback is passed as first rather than last argument. This choice is different from the standard node.js convention but it makes it completely trivial to handle optional parameters.
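A small illustration, with a hypothetical g that takes an optional second parameter:

// function g@(a, b) { ... } transforms to function g(__cb, a, b) { ... }
// and both call forms leave the callback at a fixed position:
g(__cb, a);    // b omitted
g(__cb, a, b); // b provided
// With the callback in last position its index would depend on whether
// b was passed (g(a, __cb) vs. g(a, b, __cb)), and the transformed
// function would have to sniff its arguments to find the callback.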

__future is a helper function similar to the one I described in my April post.

The transformation adds a __cb() call at the end of the function body because the callback must be called if execution reaches this point without hitting a return or a throw.

Transforming return and throw statements

The disassembly step has moved asynchronous calls outside of return and throw statements. So these statements can now be easily converted:

return exp;

becomes:

return __cb(null, exp);

And:

throw exp;

becomes:

return __cb(exp);

Transforming loops

All loops have already been transformed into a for loop. Loops that do not contain any asynchronous calls do not need to be transformed any further. Loops that contain asynchronous calls will be converted to callback form as follows:

for (; cond; incr) {
  loopBody
}
afterLoop

becomes:

var __beenHere = false;
(function(__break) {
  (function __loop() {
    if (__beenHere) {
      incr
    } else {
      __beenHere = true;
    }
    var __doIt = cond;
    if (__doIt) {
      loopBody;
      __loop();
    } else {
      __break();
    }
  })();
})(function() {
  afterLoop
});

break and continue statements that appear inside the loop will be transformed as follows:

break;

becomes:

return __break();

And:

continue;

becomes:

return __loop();

Transforming if statements

Similarly, if statements that do not contain any asynchronous calls do not need to be transformed. The ones that contain asynchronous calls will be transformed as follows:

if (cond) {
  thenBody
}
else {
  elseBody
}
afterIf

becomes:

(function(__then) {
  if (cond) {
    thenBody
    __then();
  }
  else {
    elseBody
    __then();
  }
})(function() {
  afterIf
});

Transforming switch statements

switch statements have a similar transformation pattern:

switch (exp) {
case v1:
  v1Body
  break;
case v2:
case v3:
  v2v3Body
  break;
default:
  defaultBody
}
afterSwitch

becomes:

(function(__break) {
  switch (exp) {
  case v1:
    v1Body
    return __break();
  case v2:
  case v3:
    v2v3Body
    return __break();
  default:
    defaultBody
    return __break();
  }
})(function() {
  afterSwitch
});

break statements contained in the branches of a switch will be transformed as follows:

break;

becomes:

return __break();

Here we assume that the branches of the switch are all terminated by a break, a return or a throw. If this is not the case and if one of the branches leaks into the next branch, the transformation is still possible but a bit more complex. I will not describe it here.

Transforming try/catch

try/catch statements are a bit more challenging. The pattern is the following:

try {
  tryBody
}
catch (ex) {
  catchBody
}
afterTryCatch

becomes:

(function(__then) {
  (function(__cb) {
    __tryCatch(__cb, function() {
      tryBody;
      __then();
    });
  })(function(ex, __result) {
    __tryCatch(__cb, function() {
      if (ex) {
        catchBody;
        __then();
      } else {
        __cb(null, __result);
      }
    });
  });
})(function() {
  __tryCatch(__cb, function() {
    afterTryCatch;
  });
});

The __tryCatch helper function is defined as follows:

function __tryCatch(__cb, fn) {
  try {
    fn();
  } catch (e) {
    try {
      __cb(e);
    } catch (ex) {
      console.error("UNCAUGHT EXCEPTION: " 
        + ex.message + "\n" + ex.stack);
    }
  }
}

The pattern is rather heavy but it is based on a simple idea: we execute the try body in a different scope which overrides __cb. The overridden __cb executes the catch body if called with an error. Otherwise, it forwards the result to the enclosing __cb callback.

The amount of try/catch logic is really overwhelming and may look paranoid. But it is not. The problem is that most of this code is executed in the context of a callback and very strange things (like jumping into catch clauses that belong to functions from which we have already returned) would happen if we did not plug all the holes and let some exceptions bubble up through the callback chain. This is why we invoke __cb via the __tryCatch helper function.
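Here is a small illustration of the hazard, assuming a g that invokes its callback asynchronously (the names are made up):

function f(__cb) {
  try {
    g(function(err, result) {
      // f has already returned when this callback runs: a throw here
      // would not be caught by the enclosing try/catch, it would unwind
      // through whatever code happens to be invoking the callback chain.
      throw new Error("boom");
    });
  } catch (ex) {
    // never reached for the asynchronous throw above
  }
}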

Transforming try/finally

try/finally is our best in show. The pattern is the following:

try {
  tryBody
}
finally {
  finallyBody
}
afterTryFinally

becomes:

(function(__then) {
  (function(__cb) {
    __tryCatch(__cb, function() {
      tryBody;
      __cb(null, null, true);
    });
  })(function(__e, __r, __cont) {
    (function(__then) {
      __tryCatch(__cb, function() {
        finallyBody;
        __then();
      });
    })(function() {
      __tryCatch(__cb, function() {
        if (__cont)
          __then();
        else
          __cb(__e, __r);
      });
    })
  });
})(function() {
  __tryCatch(__cb, function() {
    afterTryFinally;
  });
});

The idea is again to override __cb. The extra __cont flag distinguishes the case where the try body ran to completion (execution continues after the finally) from the case where it returned or threw (the pending result or error is then re-routed to the enclosing __cb once the finally body has run).

Transforming the remaining blocks

This is probably getting very boring but we are now getting close to the end of our transformation journey.

We have now transformed all the control flow statements and we are left with blocks that contain asynchronous function calls.

If you look carefully at the transformations that we have used in the callback generation phase, you will notice the following:

  • Every time a control flow statement was encountered, the control flow statement and the statements that followed it were replaced by a function call like (function(__then) { ... })(function() { ... }).
  • All the code blocks that were transferred inside these functions are now followed by a function call like __then(), __cb() or __break().

So, we are left with blocks that contain the asynchronous function calls produced by the disassembly pass, and that are terminated by a function call (the two cases above).

The blocks that contained a return or a throw statement before the transformation now have a return __cb(...) statement before their ending function call. These blocks can be simplified by removing all the statements that follow the return __cb(...) statement. After this simplification we can also remove the return keyword on the return __cb(...) statement.
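For example (an illustrative sketch, with __then() standing for the ending call introduced by the transformations above):

return __cb(null, x);
__then();

becomes:

__cb(null, x);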

So we are left with blocks that contain:

  • synchronous statements that we don’t need to transform.
  • func@(args); statements
  • obj.func@(args); statements
  • v = func@(args); statements
  • v = obj.func@(args); statements
  • a function call at the end.

If we manage to transform the statements that contain an asynchronous call we will be done!

Transforming asynchronous calls

The asynchronous calls that remain will be transformed with the following patterns:

func@(args);
afterCall

becomes:

func(function(__err) {
  if (__err) return __cb(__err);
  afterCall
}, args);

And:

v = func@(args);
afterCall

becomes:

func(function(__err, __result) {
  if (__err) return __cb(__err);
  v = __result;
  afterCall
}, args);

Note: the obj.func@(args) statements are transformed like the func@(args) statements.
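To recap, here is the complete process applied by hand to a tiny made-up function (with the trivial simplifications folded in):

function twice@(x) {
  var y = f@(x);
  return y + y;
}

becomes:

function twice(__cb, x) {
  if (!__cb) return __future.call(this, twice, arguments);
  var y;
  f(function(__err, __result) {
    if (__err) return __cb(__err);
    y = __result;
    return __cb(null, y + y);
  }, x);
}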

That’s it! We now have a complete mechanical process that can transform any Y-R Javascript program into an equivalent Javascript program with asynchronous callbacks.

From Callbacks to Y-R

If you read this far you probably deserve a medal. Unfortunately for you, we are only half-way through! What about the other direction? Can any Javascript program with asynchronous callbacks be converted into an equivalent Y-R Javascript program?

Fortunately the reverse proof will be much shorter. We just need to prove two things:

  • That a callback-based asynchronous function definition can be converted to a Y-R Javascript function definition.
  • That a callback-based asynchronous function call can be converted to a Y-R Javascript function call.

In this second proof, we will assume that the callback function has the default node.js format and we won’t worry about the callback’s position in the argument list. A complete proof would need to take these variations into account but these details don’t really bring anything fundamental to the table. So we will ignore them here.

Transforming a callback-based function definition

This part is completely trivial. The transformation is the following:

function f(a, b, callback) {
  ...
}

becomes:

function f@(a, b) {
  ...
}

Transforming a callback-based asynchronous call

This one is a little trickier. Here is a solution:

f(a, b, function(err, res) {
  callbackBody
})

becomes:

(function@() {
  var err, res;
  try {
    res = f@(a, b);
  }
  catch (ex) {
    err = ex;
  }
  callbackBody
})();

The try/catch part is not very interesting. We just need it to extract the error and assign it to an err variable.

The (function@() { ... })() wrapper is more interesting. What it does is wrap the function call into an anonymous asynchronous function which is then called synchronously (no @ operator before the parentheses). This call produces a future that we don’t even assign to a variable because we don’t really care about the termination of this anonymous call. What we care about is that the callback body will be called after f produces a result or an error, which is obviously the case!

Wrapping-up

This was a rather long and probably boring explanation and I haven’t even proved anything formally. But I hope that I have given enough elements to convince you that any Y-R Javascript program can be mechanically converted into an equivalent Javascript program with asynchronous callbacks, and vice versa.

You may wonder why this could be of any interest and why I wasted several hours to write this. The reason is that I think that Y-R Javascript is a much better language than Javascript with asynchronous callbacks when it comes to writing asynchronous code.

Y-R Javascript lets you write very lean and elegant code. You can express your logic with the native constructs of the language rather than with flow-control libraries; you can do proper exception handling with try/catch/finally; you can chain calls even if some of them are asynchronous, etc. In short, you feel like a Javascript programmer, not like a super-hero who spends his days taming callbacks.

Y-R Javascript has another quality: it makes it very easy to reason about race conditions. The yield and resume points are very easy to spot: they correspond to all the @ calls in your code. If you see a piece of code that does not contain any @ calls, you immediately know that this code never yields and cannot be subject to a race condition of any kind. I think that this is an important feature.

The first part of this dissertation proves that Y-R Javascript is not a leaky abstraction: Y-R programs can be faithfully translated to Javascript programs with asynchronous callbacks.

The second part proves that Y-R Javascript is not a crippled subset of Javascript with asynchronous callbacks. Anything that you can do with asynchronous callbacks can also be done with Y-R Javascript’s primitives (@ calls and futures).

At this point, I have probably exhausted you (and I got my dose too). I will write another post later to illustrate the differences between Y-R Javascript and asynchronous callbacks. The only remaining point that I want to cover quickly is the relationship of all this to my current pet project: streamline.js.

Y-R Javascript vs. streamline.js

Streamline.js is very similar to Y-R Javascript. The main difference is in the syntax: streamline.js uses an _ parameter in place of the @ operator.

I chose this syntax for streamline.js because I wanted to be able to edit and reformat my code with standard Javascript tools. It also made implementation easier because I did not have to modify parsers: I could use narcissus as is and I could piggyback the transformation to the CoffeeScript compiler. And, last but not least, it made it easier to call native node.js libraries because you can control where the callback goes in the argument list.

But I don’t have any special attachment to the _ parameter, nor to the @ operator. What I like in all this is the concept of a yield-resume marker that you can tag both onto function definitions and onto asynchronous function calls. I also like the idea of having asynchronous functions return a future when they are called without the marker. And I like the minimalist nature of this design: no extra libraries (promises, futures, tasks), no new keywords, just a small marker!

One last point: the CPS transform that I described here is similar to what I implemented in the streamline compiler. It is not completely identical though (and this article may have a few bugs that I introduced when adapting the patterns) but the general spirit is the same.

Asynchronous episode 3 – Adventures in event-land

From callbacks to events

This is my third post on asynchronous Javascript. In the first one, I explored callbacks and I explained how I ended up creating streamline.js, a tool that simplifies programming in callback-land. In the second one, I described how I introduced futures in streamline.js to enable and control parallel execution of asynchronous I/O operations. In this third one, I am going to talk about the other side of node’s asynchronous APIs: events.

But before that, I’ll quickly explain the process that drove me through these investigations: I did not build streamline.js just for the fun of it; I built it because I saw a big potential in node.js and I had a real project that I wanted to build on top of it. But I quickly realized that callbacks would be a showstopper for us. We are building business applications and we have to give our developers a simple and friendly programming environment that lets them concentrate on their problem domain. It does not make sense for us to try to turn our developers into callback ninjas. Also, readability is critical for us: business rules must be expressed in natural and understandable manner. Callbacks just add too much noise, even with libraries to ease the pain. So node.js would have been disqualified if we hadn’t found a way to drastically reduce the overhead introduced by callbacks.

Once I had the tool working, I started to convert several thousands of lines that we had written in callback style before. This went very smoothly actually and I am very pleased with the result: the code is leaner, simpler, much easier to write and, more important, much much easier to read, understand and maintain. Looks like node.js has now become a viable platform for us.

But the conversion left us with some modules that were written in event style rather than callback style, mostly middleware modules and specialized communication clients (HTTP and TCP). So I started to investigate these other modules. We had drastically improved the situation in callback-land. Maybe there was room for improvement in event-land too!

When events aren’t so rosy

When I first looked at node.js I thought that there was some beauty in all these observers and emitters. But when I started to investigate our event-style modules, which had been written by a brilliant Javascript programmer of our team, I started to cringe a bit. Of course, the code was correct and all the events that were emitted were getting handled properly by some observer that had been registered beforehand. But the logic was often hard to follow because the flow was jumping abruptly from one place to another instead of following natural call chains. With callbacks, I often had the feeling that the flow had been wired with GOSUB directives. Here, in event-land, it felt more like it had been wired with setjmp/longjmp calls. Node.js ninjas probably get a kick out of this, but is this really a viable approach for large industrial projects?

And then we ran into an interesting bug. We have an authentication module between our main module and the lower-level modules to which the requests are routed. We had chosen to let the lower-level modules register the data and end observers on requests so that they could stream posted data when necessary. It was all working fine, until we connected our authentication module to the database. Suddenly, our data and end event handlers were not getting called any more!

It did not take us long to figure out why: the authentication module had become asynchronous because of the database calls and consequently, the data and end events were being dequeued during authentication, before our lower level modules had got a chance to register their observers.

This raised a design issue: how should we fix this? One way would be to let the lower level modules invoke the authentication module after setting their handlers but then every low level module would be responsible for authentication and would have to hold its dispatch until it has received both the end event and the green light from the authentication module. This means a proliferation of non-obvious security sensitive code snippets. Not good! Another way to fix it would be to install the event handlers in the main module and let them buffer the data. Then we would probably need to introduce pause and resume calls to control the amount of buffering.

But to me, all this raised a red flag: this kind of event handling logic seems fundamentally fragile and we would need a great deal of discipline to write code that is robust and easy to maintain.

Funny enough, while I was writing this post, someone asked for help about request object losing events on the node.js forum. We are obviously not the only ones to have run into this problem.

Inverting the flow

So I started to question the beauty and relevance of all this. Is this event style somehow imposed upon us by the asynchronous nature of node.js? Or is it just an arbitrary API choice that shadows other possible designs?

What about trying to invert the flow around node’s stream objects? Instead of letting the stream push the data to its consumer through events, couldn’t we set up the API so that the consumer pulls the data from the stream by calling an asynchronous read method?

I gave it a shot and found out that this was actually fairly easy to do. The pattern is similar to the one I used for futures (and actually, I discovered the pattern when experimenting with streams, and applied it to futures afterwards). I packaged the solution as a streams module containing small wrappers around node streams, that I posted here.

To keep things simple, I chose to have the basic read method return the data one chunk at a time, and return null once all the chunks have been read and the end event has been received.
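Here is a minimal sketch of the inversion in plain callback style (the names are made up and this ignores pause/resume and the buffering limits described below):

function wrap(stream) {
  var chunks = [], error = null, ended = false, pending = null;

  function deliver(err, chunk) {
    var cb = pending;
    pending = null;
    cb(err, chunk);
  }
  // push side: buffer the events, or hand them to a waiting reader
  stream.on('data', function(chunk) {
    if (pending) deliver(null, chunk);
    else chunks.push(chunk);
  });
  stream.on('end', function() {
    ended = true;
    if (pending) deliver(null, null);
  });
  stream.on('error', function(err) {
    error = err;
    if (pending) deliver(err);
  });
  // pull side: cb(err, chunk), with chunk === null once the stream has ended
  return {
    read: function(cb) {
      if (chunks.length) cb(null, chunks.shift());
      else if (error) cb(error);
      else if (ended) cb(null, null);
      else pending = cb;
    }
  };
}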

With this low level method, I could easily write a readAll method that reads the stream to the end. Here is the streamlined source of this method:

this.readAll = function(_) {
  var chunk, chunks = [];
  while (chunk = this.read(_))
    chunks.push(chunk);
  return concat(chunks);
}

where concat is an auxiliary function that deals with the fact that chunks could be buffers or strings.
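A minimal sketch of such a helper (assuming the chunks are either all strings or all buffers):

function concat(chunks) {
  if (chunks.length > 0 && typeof chunks[0] === "string")
    return chunks.join("");
  // otherwise assume buffers and copy them into a single one
  var total = 0, offset = 0;
  chunks.forEach(function(chunk) { total += chunk.length; });
  var result = new Buffer(total);
  chunks.forEach(function(chunk) { chunk.copy(result, offset); offset += chunk.length; });
  return result;
}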

I also added lowMark and highMark options to the stream constructor so that you can control how much data will be buffered without having to fool around with pause and resume calls.

Pull-style streams in action

This streams module is still under development. I’ve only published wrappers for HTTP client and server requests at this time and I haven’t written extensive unit tests yet. But I published a small example that calls Google’s search web service. Here is a condensed version of this little Google client:

function google(str, _) {
  var json = streams.httpRequest({
    url: 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=' + str,
    proxy: process.env.http_proxy
  }).end().response(_).checkStatus(200).readAll(_);

  return JSON.parse(json).responseData.results.map(function(entry) {
    return entry.url + '\n\t' + entry.titleNoFormatting;
  }).join('\n');
}

This example demonstrates some of the benefits that we get by inverting the flow:

  • The code is smaller
  • Calls can be chained in a natural way (and even more so with streamline.js which prevents the callbacks from disrupting the chains).
  • The code is more robust: there is no risk of losing events because listeners were not set up early enough; exceptions are naturally propagated along call chains; pause/resume calls are hidden and buffering is controlled declaratively, etc.
  • The code should look familiar to people who have been programming with traditional I/O APIs (read functions).

Wrapping up

I don’t know how the idea of inverting the flow will be received (it may not actually be completely new either) but I anticipate that some node.js aficionados will treat it as heresy. Aren’t callbacks and events the very essence of node? Who dares promote alternate programming styles, first without callbacks (actually not quite), then without events?

To me, the essence of node.js is the combination of an incredible Javascript engine and a simple, fundamentally asynchronous runtime, which treats HTTP as a first class protocol. APIs are just enablers for innovations on top of this foundation.

What I’m looking for are tools and APIs that will allow us to deliver great products, and I feel that they are starting to shape up. The streams module is just one more piece of the puzzle. It blends very well with streamline.js but the transformed source can also be used as a normal standalone node.js module with a callback oriented API.

Currying the callback, or the essence of futures…

Wow, this one is not for dummies!!!

I decided to play it pedantic this time because I posted a layman explanation of this thing on the node.js forum and nobody reacted. Maybe it will get more attention if I use powerful words like currying in the title. We’ll see…

Currying

I’ll start with a quick explanation of what currying means so that Javascript programmers who don’t know it can catch up. After all, it is a bit of a scary word for something rather simple, and many Javascript programmers have probably already eaten the curry some day without knowing what it was.

The idea behind currying is to take a function like

function multiply(x, y) { return x * y; }

and derive the following function from it:

function curriedMultiply(x) {
  return function(y) { return x * y; }
}

This function does something simple: it returns specialized multiplier functions. For example, curriedMultiply(3) is nothing else than a function which multiplies by 3:

function(y) {
  return 3 * y;
}

Attention: curriedMultiply does not multiply because it does not return numbers. Instead, it returns functions that multiply.

It is also interesting to note that multiply(x, y) is equivalent to curriedMultiply(x)(y).

Currying the callback

Now, what happens if we apply this currying principle to node APIs, to single out the callback parameter?

For example, by applying it to node’s fs.readFile(path, encoding, callback) function, we obtain a function like the following:

fs.curriedReadFile(path, encoding)

The same way our curriedMultiply gave us specialized multiplier functions, curriedReadFile gives us specialized reader functions. For example, if we write:

var reader = fs.curriedReadFile("hello.txt", "utf8");

we get a specialized reader function that only knows how to read hello.txt. This function is an asynchronous function with a single callback parameter. You would call it as follows to obtain the contents of the file:

reader(function(err, data) {
  // do something with data
});

Of course, we have the same equivalence as we did before with multiply: fs.readFile(path, encoding, callback) and fs.curriedReadFile(path, encoding)(callback) are equivalent.

This may sound silly, and you may actually think that this whole currying business is just pointless intellectual masturbation. But it is not! The interesting part is that if we are smart, we can implement curriedReadFile so that it starts the asynchronous read operation. And we are not forced to use the reader right away. We can keep it around, pass it to other functions and have our program do other things while the I/O operation progresses. When we need the result, we will call the reader with a callback.

By currying, we have separated the initiation of the asynchronous operation from the retrieval of the result. This is very powerful because now we can initiate several operations in a close sequence, let them do their I/O in parallel, and retrieve their results afterwards. Here is an example:

var reader1 = curriedReadFile(path1, "utf8");
var reader2 = curriedReadFile(path2, "utf8");
// I/O is parallelized and we can do other things while it runs

// further down the line:
reader1(function(err, data1) {
  reader2(function(err, data2) {
    // do something with data1 and data2
  });
});

Futures

A future is a powerful programming abstraction that does more or less what I just described: it encapsulates an asynchronous operation and allows you to obtain the result later. Futures usually come with some API around them and a bit of runtime to support them.

My claim here is that we can probably capture the essence of futures with the simple currying principle that I just described. The reader1 and reader2 of the previous example are just futures, in their simplest form.

Implementation

This looks good but how hard is it to implement?

Fortunately, all it takes is a few lines of Javascript. Here is our curried readFile function:

function curriedReadFile(path, encoding) {
  var done, err, result;
  var cb = function(e, r) { done = true; err = e, result = r; };
  fs.readFile(path, encoding, function(e, r) { cb(e, r); });
  return function(_) { if (done) _(err, result); else cb = _; };
}

I won’t go into a line-by-line explanation, but the trick is in the cb variable: if the read completes before the returned function is called, the result is recorded (done, err, result) and delivered as soon as the caller asks for it; if the caller asks first, its callback replaces cb and gets invoked directly when the read completes.

Going one step further

Now, we can go one step further and create a generic utility that will help us currify any asynchronous function. Here is a simplified version of this utility (the complete source is on streamline’s GitHub site):

function future(fn, args, i) {
  var done, err, result;
  var cb = function(e, r) { done = true; err = e, result = r; };
  args = Array.prototype.slice.call(args);
  args[i] = function(e, r) { cb(e, r); };
  fn.apply(this, args);
  return function(_) { if (done) _(err, result); else cb = _; };
}

With this utility we can rewrite curriedReadFile as:

function curriedReadFile(path, encoding) {
  return future(fs.readFile, arguments, 2);
}

And then, we could even go one step further, and tweak the code of the original fs.readFile in the following way:

var original = fs.readFile;
fs.readFile = function(path, encoding, callback) {
  // note: assumes always called with encoding arg
  if (!callback) return future(original, arguments, 2);
  // delegate implementation to original function
  original.apply(this, arguments);
};

With this tweak we obtain a handy API that can be used in two ways:

  • directly, as a normal asynchronous call:
    fs.readFile("hello.txt", "utf8", function(err, data) { ... }
  • indirectly, as a synchronous call that returns a future:
    // somewhere:
    var reader = fs.readFile("hello.txt", "utf8");
    
    // elsewhere:
    reader(function(err, data) { ... });

Blending it with streamline.js

I have integrated this into streamline.js. The transformation engine adds the little if (!callback) return future(...) test in every function it generates. Then, every streamlined function can be used either directly, as an asynchronous call, or indirectly, to obtain a future.

Moreover, obtaining a value from a future does not require any hairy callback code any more because the streamline engine generates the callbacks for you. Everything falls into place very nicely, as the following example demonstrates:

function countLines(path, _) {
  return fs.readFile(path, "utf8", _).split('\n').length;
}

function compareLineCounts(path1, path2, _) {
  // parallelize the two countLines operations
  // with two futures.
  var n1 = countLines(path1);
  var n2 = countLines(path2);
  // get the results and combine them
  return n1(_) - n2(_);
}

Wrapping it up

I was looking for an elegant way to implement futures in streamline.js and I’m rather happy with this design. I don’t know if it will get wider adoption but I think that it would be nice if node’s API could behave this way and return futures when called without callback. Performance should not be an issue because all that is required is a simple test upon function entry.

I also find it cute that a future would just be the curried version of an asynchronous function.

Asynchronous Javascript – the tale of Harry

Catching Up

Before telling you about Harry, I’ll start with a short account of what I went through recently so that those who know me from long ago can understand how I ended up being involved in asynchronous Javascript development, and what that means.

After many years of programming with mainstream OO languages (Java and C#), I eventually decided to give up in the summer of 2009 and I switched to Javascript. The trigger was jQuery. I used it to experiment with HTML 5 and it helped me realize that the world had changed and that many of my previous beliefs were just wrong. Functional programming, which I had put aside 20 years ago to focus on mainstream OO, was coming back, with new ideas and a huge potential. I also realized that strong typing, which I had been worshiping for years, was somewhat of a neurotic thing. It makes you feel safe but you pay a high price for it (you write a lot of code just to please the compiler) and it actually misses the point (you should be relying on unit tests that check the semantics rather than on compilers that only check formal constraints). It also introduces tight coupling between code modules, and makes the code more rigid than necessary. jQuery really made me rediscover the pleasure of functional programming and convinced me that Javascript was not a toy language but most likely a very important language for the future.

Then, I thought that if I was going to invest a lot in Javascript and jQuery for the client side, I might as well try to use it also on the server side. This would make it possible to reuse code between client and server. It would also make the development process simpler: one language to learn, common methodologies and tools, etc. This is how I ended up in the SSJS (server side Javascript) world.

So, about 18 months ago, we (I had taken the lead of a new team for a new project in the meantime) started working with Helma NG (now RingoJS). We quickly switched to Narwhal which seemed to have more traction at the time. And we were keeping an eye on a blip that was getting bigger on our radar screens: node.js. It looked amazing but I wondered if it would be wise to drag a Sage project into it. Selling SSJS as a viable platform for future applications had already been a bold move but with node.js we were crossing the line between being leading-edge and bleeding-edge!

But as we were moving forwards with Narwhal, it became clearer every day that node.js was really where the future of SSJS was shaping up. There was a vibrant community around it. And an incredibly fast Javascript engine! So we ended up making the switch in the spring of last year.

Asynchronous Javascript in node.js

Node.js is really cool. Small, simple, very fast, etc. But it takes a radical approach on concurrency: no threads, asynchronous I/O instead. This has a profound impact on the way code is written.

Node.js provides two API styles to deal with asynchronous programming:

  • An event style
  • A callback style

The event style allows you to emit events from various points in your code, and set up listeners that catch these events and act upon them somewhere else. The flow of control is highly non-local and a bit similar to what we classically use for exception handling. It works really well for I/O handling (HTTP client and server for example) but I have a hard time imagining business application developers writing their mundane business logic with this kind of non-local flow of control.
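In a nutshell, with node's classic EventEmitter:

var EventEmitter = require('events').EventEmitter;
var emitter = new EventEmitter();

// somewhere: register the observers
emitter.on('data', function(chunk) { /* act on the chunk */ });
emitter.on('end', function() { /* wrap up */ });

// somewhere else: emit the events; control jumps to the observers
emitter.emit('data', 'hello');
emitter.emit('end');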

So we would likely end up writing most of our code in the callback style, which is what node.js proposes for local flows that call asynchronous functions. The emblematic pattern for an asynchronous call is the following:

asyncFunc(args, function(err, result) {
  if (err)
    // error: propagate it or handle it
  else
    // do something with result
});

Every asynchronous function takes an extra argument which is a callback. When the asynchronous operation completes, the callback is executed. If the operation fails, an error object (err) is passed as first argument to the callback. If the operation succeeds the first argument is set to null and an optional result may be passed through the second argument. A simple and straightforward design (like most things in node.js)! The new thing is that node.js is highly asynchronous so this pattern is not anecdotal as it may have been before. In node.js, it is omnipresent.

Harry’s first steps

So now comes the time to tell you my little story about Harry. Harry is an experienced programmer who is taking his first steps in node.js. He has done some async programming before but never with a system that is as pervasively asynchronous as node.

To get familiar with node’s APIs, Harry decides to implement a function that traverses directories to compute disk usage. In his previous life he would have written it as:

function du(path) {
  var total = 0;
  var stat = fs.stat(path);
  if (stat.isFile()) {
    total += fs.readFile(path).length;
  }
  else if (stat.isDirectory()) {
    var files = fs.readdir(path);
    for (var i = 0; i < files.length; i++) {
      total += du(path + "/" + files[i]);
    }
    console.log(path + ": " + total);
  }
  else {
    console.log(path + ": odd file");
  }
  return total;
}

In node.js, the fs.stat, fs.readFile and fs.readdir calls are asynchronous. So, the du function itself must be asynchronous too. Its signature becomes:

function du(path, callback)

where callback is a node.js callback function that du will use to return its result. The signature of the callback is the following:

callback(err, result)

So Harry tries to adapt his du implementation for node.js. He quickly reaches the following point:

function du(path, callback) {
  var total = 0;
  fs.stat(path, function(err, stat) {
    if (err) { callback(err); return; }
    if (stat.isFile()) {
      fs.readFile(path, function(err, data) {
        if (err) { callback(err); return; }
        total += data.length;
        // (a) what do I do here?
      });
      // (b) and here?
    }
    else if (stat.isDirectory()) {
      fs.readdir(path, function(err, files) {
        if (err) { callback(err); return; }
        // (c) is this right?
        for (var i = 0; i < files.length; i++) {
          du(path + "/" + files[i], function(err, len) {
            if (err) { callback(err); return; }
            total += len;
            // (d) what do I do here?
          });
          // (e) and here?
        }
        // (f) this does not sound right either!
        console.log(path + ": " + total);
      });
      // (g) what do I do here?
    }
    else {
      console.log(path + ": odd file");
      // (h) and here?
    }
  });
  // (i) sounds right, but not in the right place.
  callback(null, total);
}

He has started to introduce the callbacks but he is hitting some difficulties and a lot of questions arise. After a bit of thinking, Harry figures out the answers to many of them:

At spots (b), (e), (g) and (i) I should not have any code because these statements follow an async call.

(i) is misplaced. I need to make 3 copies of it. The first two will go to spots (a) and (h). The third copy will go somewhere in the for branch where I am a bit lost at this point. I note that I have been lucky in this example because there is a single statement after the if/else if/else branching sequence. If there had been more code, I would have had to copy a whole block 3 times. So I’ll probably need to package the trailing statements into a function the next time I hit this kind of branching code.

So Harry makes the changes and he is now left with (c), (d) and (f), that is the branch with the for loop. Let’s look at it again:

      fs.readdir(path, function(err, files) {
        if (err) { callback(err); return; }
        // (c) is this right?
        for (var i = 0; i < files.length; i++) {
          du(path + "/" + files[i], function(err, len) {
            if (err) { callback(err); return; }
            total += len;
            // (d) what do I do here?
          });
        }
        // (f) this does not sound right either!
        console.log(path + ": " + total);
      });

After a bit more investigation, Harry comes up with the following conclusions:

(c) is clearly wrong. du is an async function. So if I leave the loop like this, all the du calls will execute in parallel and I have no way of collecting all the results and continuing!

(d): this seems to be where I need to continue looping.

(f) is clearly misplaced as it will execute before any of the loop iterations get a chance to run. Maybe I’ll know what to do with it once I have fixed (c) and (d).

Fortunately, Harry is a smart guy. His observation at (d) leads him to conclude that he should restructure the loop and handle it recursively rather than iteratively. After a bit of time, he comes up with the following solution:

      fs.readdir(path, function(err, files) {
        if (err) { callback(err); return; }
        function loop(i) {
          if (i < files.length) {
            du(path + "/" + files[i], function(err, len) {
              if (err) { callback(err); return; }
              total += len;
              loop(i + 1);
            });
          }
          else {
            // loop is done. Execute last statement and then callback to return value
            console.log(path + ": " + total);
            callback(null, total);
          }
        }
        // start the loop
        loop(0);
      });

Bingo! Harry is happy. He can write simple algorithms like this with asynchronous functions. It just seems to require a few more brain cycles than before, but it is feasible.

Patterns

A few weeks have gone by. Harry has written a few modules in node.js and he starts to feel more comfortable with callbacks. He has actually come up with some patterns that help him write robust code without too many headaches.

The first pattern is actually a helper function for loops:

function asyncForEach(array, iterator, then) {
  function loop(i) {
    if (i < array.length) {
      iterator(array[i], function() {
        loop(i + 1);
      });
    }
    else {
      then();
    }
  }
  loop(0);
}

With this helper function, he can now write his loops as:

asyncForEach(array, function(item, next) {
  // body of my loop
  somethingAsync(function(err, result) {
    if (err) {callback(err); return; } // I'm starting to get tired of writing this one!
    // do something with item and result
    next(); // don't forget me at the end of every code path
  });
}, function() {
  // this is where execution resumes after the loop
});

He also came up with a small funny construct that helps him deal with branching code (he calls it the branch neutralizer):

(function(next) {
  // if or switch statement with branches that may mix sync and async calls.
  // All code paths must end up calling next() or callback(null, result)
})(function() {
  // this is where execution resumes after the branches
});
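For example, here is the neutralizer applied to an if whose branches mix synchronous and asynchronous code (cached and load are made up for the example):

var cached;
(function(next) {
  if (cached) {
    next(); // synchronous branch
  } else {
    load(function(err, result) { // asynchronous branch
      if (err) { callback(err); return; }
      cached = result;
      next();
    });
  }
})(function() {
  // execution resumes here in both cases
  callback(null, cached);
});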

Also, he is now mentally wired to automatically replace return statements by statements like { callback(err, result); return; } that he often simplifies as return callback(err, result); as this is more compact and nobody actually cares about the values returned by asynchronous functions.

Harry feels more relaxed. He now has patterns and a methodology that lets him deal with node.js APIs without too many headaches. He has also published these patterns on the team’s wiki to foster consistent coding practices inside his team.

Looking back

Then summer arrives and Harry takes a well deserved vacation break on the French Riviera. When he comes back, he has not lost his newly acquired programming skills but sometimes he wonders:

In my previous life, I could write algorithms in a very natural form. The code that I was writing was directly related to the problem I was trying to solve. There was no noise between the statements. Also, I could easily chain calls. For example I could write something as simple as:

total += fs.readFile(path).length;

Now, despite all the patterns that I have found, I still have to write something like:

fs.readFile(path, function(err, data) {
  total += data.length;
  // continue here ..
});

Isn’t there something wrong here? I can write the same kind of algorithms as before in this amazing node.js environment but the code that I write contains a lot more noise than before, natural chains are broken, etc.

Actually, I often feel like working with a crippled language. As soon as I have asynchronous calls in scope, I cannot use while and for loops any more. Of course, I cannot use break and continue either. I also need to neutralize my if and switch statements. And I have completely given up on try/catch/finally for now. I’ll see later if I can find a pattern for it. The only keyword which seemed to have survived (and actually prospered) is function.

Sounds like a regression. What am I getting in return?

Sure, I know. I may get a raise. I was a dull classical programmer and I am now part of an elite of bleeding edge programmers who know how to do asynchronous programming in this trendy node.js environment. My management needs to know that!

Also, I can write programs that run fast and consume very little system resources because they run in this amazing new server.

But instead of spending my time crafting clever algorithms with simple control flow statements, I now find myself spending a significant part of my time applying the same silly patterns over and over. The code that I write is harder to read and the beauty of my algorithms is buried under lots of callback noise. Also, I need a lot more concentration because as soon as I miss one of these next() calls, I break a callback chain and my code just goes nowhere without returning a result, and it can take time to find out where I have forgotten to call next().

Also, I had a bit of hope that async programming would allow me to write more parallel code: for example, I could take advantage of the (b), (e), (g) and (i) spots in my code to do something clever after firing the async calls. But I find it hard to exploit those spots. I haven’t yet found how I could take advantage of them. Maybe I’m not looking in the right place. I feel frustrated.

Haven’t I become some kind of slave programmer? Wouldn’t it make sense to have a machine take care of all the tedious callback patterns for me, so that I could go back to writing beautiful algorithms?

Harry also remembered something from his CS classes and he added the following:

Node.js is actually a new run-time environment that I am targeting when I am writing code. When I write a function like du, what the function does is no different than what my old du function did in my old environment. Also, the library functions that I am calling (fs.stat, fs.readdir and fs.readFile) are similar to the ones I had before (what they do, not how they do it). And I am using a high level programming language which is supposed to screen me from differences between execution engines. So why should I write different code? It looks like something is interfering when it should not and that I’m being asked to deal with a problem that the language tools should handle. Am I still programming in a high level language?

streamline.js

I’ll leave Harry to his thoughts for a moment.

As you can guess, I went through this too. And what I ended up finding is that Harry is right. He is writing code that a machine could write. So why not let the machine write it for him?

I was actually hoping that a solution would emerge from the community some day. The first signs of hope came from the promise pattern. This is an attempt to solve the problem with special libraries rather than by transforming code. The promise pattern actually improves the situation when the code contains a sequence of async calls that don’t return values. But as soon as the flow hits an asynchronous call that returns a value, a callback pops up. So it only solves part of the problem. I tried to understand where the limit was and a simple topological consideration convinced me that even the smartest of us would not be able to fully solve Harry’s problem and eliminate all the callback noise with any library-based approach. We had to step out and transform the code.

The second sign of hope came from Neil Mix’s narrative.js. This is a compiler tool that relies on a small extension to the language (a special yielding operator) to convert synchronous-looking code into its callback-based equivalent. Unfortunately it suffers from two flaws:

  • It introduces new syntax in the language: this makes the source incompatible with all sorts of tools that understand the Javascript syntax
  • It is not modular: The functional beauty and modularity of the original code is destroyed by the narrative.js compiler.

So the project did not gain traction and was abandoned.

More recently, I stumbled upon Oni Labs’ stratified.js. I was excited at first but I quickly realized that this project suffered from the same flaws as narrative.js: it extends the language and the compiler does not preserve the modular quality of the original code.

At this point it became very clear to me that the right way to help Harry was to create a tool that would just try to automate what Harry does by himself when he writes code. The tool would take the statements one by one and apply the patterns that Harry has found. This would preserve the modular beauty of Javascript. And if I could do this through a naming convention rather than through new syntax it would be less disruptive for Harry because he would be able to keep his favorite editing tools (in the long term it makes sense to have a language construct for this but in the short term a naming convention will work much better).

So I decided to give it a try. The last few weekends have been somewhat sacrificed and I had some sleepless nights too, but the tool started to shape up. It is not completely finished today but it now works well enough that I decided to publish it. I called it streamline.js and made it available on Sage’s GitHub site, here.

So I have good news for you Harry. You now have the best of both worlds. The old one because you can get back to writing real code with simple control flow and easy to read algorithms, and the new one because your code will run in this amazing and exciting node.js server. I’ve put your streamlined du function in diskUsage.js.

And by the way, I have a little extra goodie for you: you can now parallelize your code in a controlled way with two little functions that make parallel programming sound like gardening (or janitoring, your choice): spray and funnel. Just take a look at the diskUsage2.js example!

Happy programming!
