TJ leaving node.js

I just saw the news. TJ Holowaychuk, one of node’s important and respected contributors, is leaving node.js for Go. Of course, this is not good news, especially for people like me who have invested a lot into node.js and have bet an industrial project on it.

Why is TJ leaving? Part of it has to do with the intrinsic attractiveness of Go. But a large part is related to deficiencies on the node side. Usability and the lack of robust error handling come first:

Error-handling in Go is superior in my opinion. Node is great in the sense that you have to think about every error, and decide what to do. Node fails however because:

  • you may get duplicate callbacks
  • you may not get a callback at all (lost in limbo)
  • you may get out-of-band errors
  • emitters may get multiple “error” events
  • missing “error” events sends everything to hell
  • often unsure what requires “error” handlers
  • “error” handlers are very verbose
  • callbacks suck

TJ also complains about APIs, tooling, lack of conventions:

Streams are broken, callbacks are not great to work with, errors are vague, tooling is not great, community convention is sort of there, but lacking compared to Go. That being said there are certain tasks which I would probably still use Node for, building web sites, maybe the odd API or prototype. If Node can fix some of its fundamental problems then it has good chance at remaining relevant, but the performance over usability argument doesn’t fly when another solution is both more performant and more user-friendly.

I have been supervising a large node.js project at Sage. We started 4 years ago and we have faced the issues that TJ mentions very early in our project. After 6 months of experimentation in 2010, I was seriously questioning the viability of node.js for our project and I was contemplating a backtrack. The reasons were precisely the ones that TJ gives today: usability, maintainability, robustness.

Yet, we went ahead with node.js; we put more and more people on the project and we successfully released a new version of our product last month, with a new web stack based on node.js. Our developers are very productive and generally happy to work with node.js.

Why?

Simply because the problems that TJ mentions do not apply to us, nor to others who have chosen the same approach:

  • Error handling and robustness are not issues for us. We are writing all our code with streamline.js. This lets us use good old structured exception handling (see the sketch after this list). IMO this is even better than Go because you don’t have to check error codes after every call.
  • We never get duplicate callbacks; callbacks don’t get lost; errors are always reported in context, … All these problems are simply gone!
  • Debugging works and exceptions have understandable stacktraces.
  • We use an alternate streams library, based on callbacks rather than events, which keeps our code simple, robust and easy to understand.
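
To give an idea of what this looks like in practice, here is a minimal sketch in streamline syntax; readUser, saveUser and logger are hypothetical stand-ins for any functions that follow node’s callback convention:

function updateUser(_, id, changes) {
  try {
    var user = readUser(_, id); // asynchronous call, no callback to wire manually
    Object.keys(changes).forEach(function(key) {
      user[key] = changes[key];
    });
    return saveUser(_, user); // asynchronous again
  } catch (ex) {
    // any error raised by readUser or saveUser lands here, with a usable stack trace
    logger.error(ex);
    throw ex;
  }
}

One try/catch covers the whole operation; there is no error code to check after each call and, because the preprocessor generates the callback plumbing, no lost or duplicate callback in our own code.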

So let us not throw the baby out with the bathwater. The problems that TJ puts forward are very real, but they are not insurmountable. You can write robust, elegant and maintainable node.js code today!

Maybe it is time to reconsider a few things:

  • Stop being stubborn about callbacks and push one of the alternatives: generators, fibers, preprocessors (a la streamline) (*). Probably not for the core code itself because of the performance overhead, but as an option for userland code.
  • Investigate alternatives for the streams API. Libraries like Tim Caswell’s min-stream or my own ez-streams should be considered. My experience with ez-streams is that a simple callback-based API makes a huge difference on usability and robustness (**).

(*) I left promises out of the list. IMO, they don’t cut it on usability.
(**) ez-streams is partly implemented with streamline.js, which will probably be a showstopper for some but its API is a pure callback API and it could easily be re-implemented in pure callbacks.

As I said in my intro above, I have a very strong investment in node.js and I really want the platform to continue to grow and succeed. Three years ago, node.js was the coolest platform because of the unique blend of JavaScript and asynchronous I/O. But there are alternatives today and people are asking for more, especially on usability and robustness.

The problems raised by TJ cannot be ignored.


Easy node.js streams

JavaScript is a great playground for experimentation. After ES6 generators and Galaxy I went back to one of my pet topics: streams. The simple streams API that we have been using in our product works really well but I was getting a bit frustrated with it: too low level! Working with this simple read/write API felt a bit like working with arrays without the ES5 functional goodies (forEach, filter, map, reduce, etc.). You get the job done with loops but you lack the elegance of functional chaining. So, I decided to fix it and a new project was born: ez-streams.

I had been keeping an eye on streams2, the streaming API that got introduced in node 0.10.0, but I was not convinced: too complex, people seem to be struggling with it, exception handling has problems, etc. So I went with a different design. Compatibility with streams2 remained crucial, but it was just too hard to get where I wanted to go by building directly on top of it.

The ez-streams project is now starting to take shape and I’ve just published a first version to NPM. The README gives an overview of the API and I don’t want to repeat it here. You should probably glance through it before reading this post to get a feel for the API. Here I want to focus on API design issues and explain why I took this route.

This project is a natural continuation of my earlier work on streamline.js. So I will be using the streamline syntax for the examples in this post. But the ez-streams API is just a regular callback based API and you don’t have to write your code with streamline to use it. You can call it from regular JavaScript code. I have included pure callback versions of some of the examples, to show how the API plays with vanilla JavaScript.

Minimal essential API

The first idea in this project was to keep the essential API as small and simple as possible. The essential ez-streams API consists of two function signatures:

  • an asynchronous read(_) function which characterizes reader streams.
  • an asynchronous write(_, val) function which characterizes writer streams.

The complete reader API is much more sophisticated but all the other calls are implemented directly or indirectly around the read(_) call. This makes it very easy to implement readers: all you have to do is pass a read function to a helper that will decorate it with the rest of the API.

For example, here is how you can expose a mongodb cursor as an EZ stream:

var ez = require('ez-streams'); // assuming the module is installed from NPM

var reader = function(cursor) {
    return ez.devices.generic.reader(function(_) {
        var obj = cursor.nextObject(_);
        return obj == null ? undefined : obj;
    });
};

Also, there was no reason to limit this API to string and buffer types: read could very well return integers, Booleans, objects, even null or undefined. I decided to reserve undefined as the end-of-stream marker because I wanted streams to be able to transport all the values that are serializable in JSON (which includes null but not undefined). Symmetrically, undefined also acts as the end-of-stream marker for write, so there is no need for a separate end method: writer.write(_) does the job.
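
As an illustration, here is a minimal drain loop written against the essential API alone (streamline syntax; reader and writer stand for any EZ reader and writer):

function copy(_, reader, writer) {
    var val;
    while ((val = reader.read(_)) !== undefined) {
        writer.write(_, val);
    }
    writer.write(_); // writing undefined marks the end of the output stream
}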

As a consequence the API is not tainted by datatype specific issues. For example there is nothing in the reader and writer API about string encoding. This issue is handled in the devices that you put at both ends of your data processing chains. Better keep things orthogonal!

I may sound like an extremist in API minimalism here but I think that this is a very important point. A simple API is easier to wrap around existing APIs, it lends itself naturally to algebraic (monadic) designs, etc. This is probably the main reason why I did not go with node’s stream APIs (version 1 or 2).

Function application rather than pipes

The EZ streams design is directly influenced by ES5’s functional array API. It actually started as an attempt to mimic the ES5 design completely, and the rest of the API followed naturally. It is also more remotely influenced by jQuery.

There is a pipe function in the EZ streams API but it plays a less prominent role than in node’s standard stream API. The pipe calls do not appear between processing steps. Instead, pipe only appears at the end of the chains, to transfer the data to a writer. The typical structure of an EZ streams chain is:

reader . op1(fn1) . op2(fn2)  .... opN(fnN) . pipe(_, writer)

All operations produce a reader, except the last one which is a reducer. pipe is a reducer but it is not the only one: forEach, some, every and of course reduce are all reducers and you can end your chains with any of them.

Most operations take a callback function as parameter (fn1, fn2, etc. above). The callback depends on the operation. It can be a filter, a mapper, a complex transform, etc. These callbacks allow you to inject your own logic into the chain.
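
Here is a small hypothetical chain built from the pieces introduced above: a generic reader that counts to five, a filter, a map and the forEach reducer. As elsewhere in this post the code is in streamline syntax, and the map callback is assumed to take the same (_, item) shape as the filter callback:

var i = 0;
var numbers = ez.devices.generic.reader(function(_) {
    return i < 5 ? i++ : undefined; // undefined ends the stream
});

numbers.filter(function(_, n) {
    return n % 2 === 0; // keep even numbers
}).map(function(_, n) {
    return n * 10;
}).forEach(_, function(_, n) {
    console.log(n); // prints 0, 20, 40
});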

The classical node pattern is different. It is directly inspired from UNIX’s command piping:

source | op1 | op2 | ... | opN

which becomes:

source . pipe(stream1) . pipe(stream2) ... .pipe(streamN)

The node design forces you to package your logic as streams, usually duplex streams which receive data from one pipe, transform it and send their results to another pipe.

I find the EZ stream design more natural and easier to use: it does not force you to package your code as streams and handle low-level stream events. Instead, you just have to provide a set of callbacks that are specific to the operations that you apply. Moreover, the most basic operations, like filter and map are aligned on the ES5 array API. So you are using familiar patterns.

Mixing functional and imperative styles

The general structure of an ez-streams processing chain is very functional. The basic operations (filter, map, reduce) are directly modelled after the ES5 functional array API. They are applied to a reader, and they produce another reader on which other operations can be chained.

But there is one important operation that somehow violates this rule: transform. The transform call itself is functional and is chained exactly like the other operations. But its callback receives 3 parameters: a continuation callback, a reader and a writer. You write the body of your transformation as a function that reads its input from the reader parameter and writes its output to the writer parameter.

Let us look at the CSV parser that I used as example in the README:

var csvParser = function(_, reader, writer) {
	// get a lines parser from our transforms library
	var linesParser = ez.transforms.lines.parser();
	// transform the raw text reader into a lines reader
	reader = reader.transform(linesParser);
	// read the first line and split it to get the keys
	var keys = reader.read(_).split(',');
	// read the other lines
	reader.forEach(_, function(_, line) {
		// ignore empty line (we get one at the end if file is terminated by newline)
		if (line.length === 0) return;
		// split the line to get the values
		var values = line.split(',');
		// convert it to an object with the keys that we got before
		var obj = {};
		keys.forEach(function(key, i) {
			obj[key] = values[i];
		});
		// send the object downwards.
		writer.write(_, obj);
	});
};

This is clearly imperative style, full of calls like read, write or forEach that smell of side effects.

This could be seen as a weakness of the design. Why introduce imperative style in this wonderful functional world?

The reason is simple: because it is usually easier to write transforms in imperative style!

If instead you try to write your transforms directly as chainable functions, you have to write functions that transform a reader into another reader. These functions usually take the form of a state automaton. You have to write state machines!

Some folks find it natural and fun to write state machines. I don’t! I find it more difficult and more error prone than writing mundane loops with read and write calls. State machines are great but I’d rather let a program generate them than write them myself (I love regular expressions).
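
To make this concrete, here is a hedged little example: pairing adjacent items. As a transform body it is a plain loop; as a reader-to-reader function it would need explicit state to remember the pending item between read calls. The pairUp name and the surrounding reader/writer are illustrative only:

var pairUp = function(_, reader, writer) {
    var first;
    while ((first = reader.read(_)) !== undefined) {
        // the second item may be undefined if the stream has an odd length
        writer.write(_, [first, reader.read(_)]);
    }
};

// usage: someReader.transform(pairUp).pipe(_, someWriter);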

So the role of the transform function is simply to put the developers back into their imperative shoes (*).

(*) Fortunately I noticed my horrible mistake and rephrased this in gender-neutral form before publishing. I don’t want to be the next victim!

When it comes to programming styles my religion is that you should just be pragmatic and use the style that best fits your problem, instead of trying to fit everything into one style which has been arbitrarily designated as superior. Functional, imperative and object-oriented styles all have a role to play in modern programming, and a good developer is someone who uses the right style at the right moment, not someone who tries to force everything into a single style. I’d have a lot more to say but I’ll keep it for another post.

Exception handling

Exception handling works rather naturally with EZ streams: all processing chains are terminated by a reducer, and this reducer, unlike the previous chain elements, takes a continuation callback as first parameter. Exceptions that occur inside the chain are funneled through this continuation callback.

So, if you write your code with streamline.js, you can trap the exceptions with a try/catch around the whole chain. For example:

try {
    ez.devices.file.text.reader('users.csv').transform(ez.transforms.csv.parser())
        .filter(function(_, item) {
            return item.gender === 'F';
        })
        .transform(ez.transforms.json.formatter({ space: '\t' }))
        .pipe(_, ez.devices.file.text.writer('females.json'));
} catch (ex) {
    logger.write(_, ex);
}

If you use EZ streams with raw callbacks, you just need to test the first parameter of your continuation callback. The previous example becomes:

ez.devices.file.text.reader('users.csv').transform(ez.transforms.csv.parser())
    .filter(function(cb, item) {
        cb(null, item.gender === 'F');
    })
    .transform(ez.transforms.json.formatter({ space: '\t' }))
    .pipe(function(err) {
        if (err) logger.write(function(e) {}, err);
    }, ez.devices.file.text.writer('females.json'));

Of course, you can also trap exceptions in all the processing callbacks that you install in the chain (the callbacks that you pass to filter, map, transform, etc.). If you trap an exception in such callbacks and return a normal result instead, processing will continue as usual and your reducer callback will not receive the exception.
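
For example, here is a hedged sketch of trapping inside a map callback: a parse error is converted into a normal result, so downstream operations and the final reducer never see the exception (reader, writer and the record shape are illustrative, and the map callback is assumed to have the same (_, item) signature as the filter callbacks shown above):

reader.map(function(_, line) {
    try {
        return JSON.parse(line);
    } catch (ex) {
        // returning a normal value keeps the chain flowing
        return { error: ex.message, raw: line };
    }
}).pipe(_, writer);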

So you do not have to use domains or other advanced error handling techniques; the EZ streams API is just a regular API with continuation callbacks and exceptions are always propagated through these callbacks. If they are not, this is a bug.

Backpressure and buffering

Developers who implement node streams keep talking about backpressure. From what I understand, they have to write special code to inform their inputs when outputs are not processing data fast enough, so that the inputs get paused. Then, once the outputs get drained a bit, the inputs can be resumed.

Frankly, I do not understand any of this. We have been writing a lot of code with the low-level read/write API (the essential API) and we have never run into situations where we would need to worry about backpressure and write special code to pause inputs.

This is because EZ streams handle read and write operations in a decoupled way at the low level. When wrapping a node readable stream, our read function buffers a bit of input with low and high water marks. We pause the stream when the high mark is reached and we resume it when the buffer goes below the low mark. On the output side, our write wrapper handles the little drain event dance so that we don’t overflow the output buffers. There is no explicit coordination between inputs and outputs, it works almost magically, thanks to the event loop:

  • If the input is too fast when we pipe data, the input stream gets paused when its buffer hits the high mark. Then the output gets a chance to drain its output buffers and process the data which has been buffered upstream. When the input buffers fall below the low mark, the input stream is resumed and it will likely fill its input buffers again before the output gets drained. So the input will be paused again, and so on.
  • If, on the other hand, the output is faster, the input stream will have empty buffers at all times and the pipe will be waiting for input most of the time.

So backpressure is a non-issue with EZ streams. You don’t need to worry about it!

What you should worry about instead is buffering because it will impact the throughput of your stream chains. If you do not buffer at all, your pipeline will likely be inefficient because data will remain buffered at the source of the chain whenever some other operation is waiting for I/O further down the line. To keep the data flowing you need to inject buffers into the chain. These buffers will keep the upstream chain busy (until they fill, of course) while the downstream operations are waiting for I/O. Then, when the downstream operation will be ready to accept new input, it will get it from the buffer instead of having to pull it from the beginning of the chain.

The EZ streams API includes a buffer operation that you can inject in your processing chains to add buffering. The typical pattern is:

reader.transform(T1).buffer(N).transform(T2).pipe(_, writer);

This will buffer N items between transforms T1 and T2.

Note that buffering can become really tricky when you start to have diamond-shaped topologies (a fork followed by a join). If you are unlucky and one of the branches is systematically dequeued faster than the other, you will need unlimited buffering in the fork node to keep things flowing. I hit this in one of my unit tests but fortunately this was with a very academic example and it seems unlikely to hit this problem with real data processing chains. But who knows?

Streams and pipes revisited

I have been rather critical of node’s streaming/piping philosophy in the past. I just did not buy the idea that code would be packaged as streams and that you would assemble a whole application by just piping streams into each other. I think that my reluctance came primarily from the complexity of the API. Implementing a node.js stream is a real endeavor, and I just could not imagine our team using such a complex API as a standard code packaging pattern.

I’ve been playing with ez-streams for a few weeks now, and I’m starting to really like the idea of exposing a lot of data through the reader API, and of packaging a lot of operations as reusable and configurable filters, mappers, transforms, etc. So I feel that I’m getting more into the streams and pipes vision. But I only buy it if the API is simple and algebraic.

One word of caution to close this post: the ez-streams implementation is still immature. I have written basic unit tests, and they pass but I’m very far from having tested all possible edge conditions. So don’t expect everything to be perfectly oiled. On the other hand, I’m rather pleased with the current shape of the API and I don’t expect to make fundamental changes, except maybe in advanced operations like fork and join.


Bringing async/await to life in JavaScript

My dream has come true this week. I can now write clean asynchronous code in JavaScript: no callbacks, no intrusive control flow library, no ugly preprocessor. Just plain JavaScript!

This is made possible by a new feature of JavaScript called generator functions, which has been introduced by EcmaScript 6 and is now available in node.js (unstable version 0.11.2). I already blogged about generators a few times so I won’t get into the basics again here. The important thing is that ES6 introduces two small extensions to the language syntax:

  • function*: the functions that you declare with a little twinkling star are generator functions. They execute in an unusual way and return generators.
  • yield: this keyword lets you transfer control from a generator to the function that controls it.

And, even though these two language constructs were not originally designed to have the async/await semantics found in other languages, it is possible to give them these semantics:

  • The * in function* is your async keyword.
  • yield is your await keyword.

Knowing this, you can write asynchronous code as if JavaScript had async/await keywords. Here is an example:

function* countLines(path) {
    var names = yield fs.readdir(path);
    var total = 0;
    for (var i = 0; i < names.length; i++) {
        var fullname = path + '/' + names[i];
        var count = (yield fs.readFile(fullname, 'utf8')).split('\n').length;
        console.log(fullname + ': ' + count);
        total += count;
    }
    return total;
}

function* projectLineCounts() {
    var total = 0;
    total += yield countLines(__dirname + '/../examples');
    total += yield countLines(__dirname + '/../lib');
    total += yield countLines(__dirname + '/../test');
    console.log('TOTAL: ' + total);
    return total;
}

Here, we have two asynchronous functions (countLines and projectLineCounts) that call each other and call node.js APIs (fs.readdir, fs.readFile). If you look carefully you’ll notice that these functions don’t call any special async helper API. Everything is done with our two markers: the little * marks declarations of asynchronous functions and yield marks calls to asynchronous functions. Just like async and await in other languages.

And it will work!

Galaxy

The magic comes from galaxy, a small library that I derived from my earlier work on streamline.js and generators.

Part of the magic is that the fs variable is not the usual node.js file system module; it is a small wrapper around that module:

var galaxy = require('galaxy');
var fs = galaxy.star(require('fs'));

The galaxy.star function converts usual callback-based node.js functions into generator functions that play well with the generator functions that we have written above.

The other part of the magic comes from the galaxy.unstar function which converts in the other direction, from generator functions to callback-based node.js functions. This unstar function allows us to transform projectLineCounts into a callback-based function that we can call as a regular node.js function:

var projectLineCountsCb = galaxy.unstar(projectLineCounts);

projectLineCountsCb(function(err, result) {
    if (err) throw err;
    console.log('CALLBACK RESULT: ' + result);
});

The complete example is available here.

The whole idea behind this API design is that galaxy lets you write code in two different spaces:

  • The old callback space in which today’s node.js APIs live. In this space, you program with regular unstarred functions in continuation passing style (callbacks).
  • The new generator space. In this space, you program in synchronous style with starred functions.

The star and unstar functions allow you to expose the APIs of one space into the other space. And that’s all you need to bring async/await to life in node.js.
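
To make the two spaces tangible, here is a small hedged sketch: it stars one of our own callback functions so that starred code can yield it, then unstars the entry point to run it from plain node code. It assumes that galaxy.star also accepts a single function (the example above only applies it to a whole module); delayCb and main are made-up names:

var galaxy = require('galaxy');

// callback space: a plain node-style asynchronous function
function delayCb(ms, cb) {
    setTimeout(function() { cb(null, ms); }, ms);
}

// generator space: star it, then yield it like any other async call
var delay = galaxy.star(delayCb);

function* main() {
    console.log('waited ' + (yield delay(100)) + ' ms');
    return 'done';
}

// back to callback space to run it from regular node code
galaxy.unstar(main)(function(err, result) {
    if (err) throw err;
    console.log(result);
});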

Status

I assembled galaxy quickly from pieces that I had developed for streamline.js. So it needs a bit of polishing and the API may move a bit. Generator support in V8 and node.js is also brand new. So all of this is not yet ready for prime time but you can already play with it if you are curious.

I have introduced a galaxy.spin function to parallelize function calls. I’ll probably carry over some other goodies from the streamline project (funnel semaphore, asynchronous array functions, streams module, …).

I find it exciting that modules written in async/await style with galaxy don’t have any direct dependencies on the node.js callback convention. So, for example, it would be easy to write a browser variant of the star/unstar functions which would be aligned on the jQuery callback conventions, with separate callback and errback.

Also, another module was announced on the node.js mailing list this week: suspend. It takes the problem from a slightly different angle, by wrapping every generator function with a suspend call. It lets you consume node.js APIs directly and write functions that follow node’s callback pattern. This is an attractive option for library developers who want to stay close to node’s callback model. Take a look at the source; it’s really clever: only 16 locs! Galaxy is different in that it moves you to a different space where you can program in sync style with no additional API, just language keywords. Probably a more attractive option if you are writing applications because you’ll get leaner code if most of your calls are to your own APIs rather than to node’s APIs.

Happy */yield coding!


Harmony Generators in streamline.js

Harmony generators have landed in a node.js fork this week. I couldn’t resist, I had to give them a try.

Getting started

If you want to try them, that’s easy. First, build and install node from Andy Wingo’s fork:

$ git clone https://github.com/andywingo/node.git node-generators
$ cd node-generators
$ git checkout v8-3.19
$ ./configure
$ make
# get a coffee ...
$ make install # you may need sudo in this one

Now, create a fibo.js file with the following code:

function* genFibos() {  
  var f1 = 1, f2 = 1;  
  while (true) {  
    yield f1;  
    var t = f1;  
    f1 = f2;  
    f2 += t;  
  }  
}

function printFibos() {
    var g = genFibos();
    for (var i = 0; i < 10; i++) {
      var num = g.next().value;
      console.log('fibo(' + i + ') = ' + num);  
    }
}

printFibos();

And run it:

$ node --harmony fibo
fibo(0) = 1
fibo(1) = 1
fibo(2) = 2
fibo(3) = 3
fibo(4) = 5
fibo(5) = 8
fibo(6) = 13
fibo(7) = 21
fibo(8) = 34
fibo(9) = 55
$

Note that generators are not activated by default. You have to pass the --harmony flag to activate them.

Using generators with streamline.js

I had implemented generators support in streamline.js one year ago and I blogged about it but I could only test in Firefox at the time, with a pre-harmony version of generators. I had to make a few changes to bring it on par with harmony and I published it to npm yesterday (version 0.4.11).

To try it, install or update streamline:

$ npm install -g streamline@latest # you may need sudo

Then you can run the streamline examples:

$ cp -r /usr/local/lib/node_modules/streamline/examples .
$ cd examples
$ _node_harmony --generators diskUsage/diskUsage
./diskUsage: 4501
./loader: 1710
./misc: 7311
./streamlineMe: 13919
./streams: 1528
.: 28969
completed in 7 ms

You have to use _node_harmony instead of _node to activate the --harmony mode in V8. You also have to pass the --generators option to tell streamline to use generators. If you do not pass this flag, the example will still work but in callback mode, and you won’t see much difference.

To see what the transformed code looks like, you can just pass the -c option to streamline:

$ _node_harmony --generators -c diskUsage/diskUsage._js

This command generates a diskUsage/diskUsage.js source file containing:

/*** Generated by streamline 0.4.11 (generators) - DO NOT EDIT ***/var fstreamline__ = require("streamline/lib/generators/runtime"); (fstreamline__.create(function*(_) {var du_ = fstreamline__.create(du, 0); /*
 * Usage: _node diskUsage [path]
 *
 * Recursively computes the size of directories.
 *
 * Demonstrates how standard asynchronous node.js functions
 * like fs.stat, fs.readdir, fs.readFile can be called from 'streamlined'
 * Javascript code.
 */
"use strict";

var fs = require('fs');

function* du(_, path) {
  var total = 0;
  var stat = (yield fstreamline__.invoke(fs, "stat", [path, _], 1));
  if (stat.isFile()) {
    total += (yield fstreamline__.invoke(fs, "readFile", [path, _], 1)).length;
  } else if (stat.isDirectory()) {
    var files = (yield fstreamline__.invoke(fs, "readdir", [path, _], 1));
    for (var i = 0; i < files.length; i++) {
      total += (yield du(_, path + "/" + files[i]));
    }
    console.log(path + ": " + total);
  } else {
    console.log(path + ": odd file");
  }
  yield ( total);
}
try {
  var p = process.argv.length > 2 ? process.argv[2] : ".";

  var t0 = Date.now();
  (yield du(_, p));
  console.log("completed in " + (Date.now() - t0) + " ms");
} catch (ex) {
  console.error(ex.stack);
}
}, 0).call(this, function(err) {
  if (err) throw err;
}));

As you can see, it looks very similar to the original diskUsage/diskUsage._js source. The main differences are:

  • Asynchronous functions are declared with function* instead of function.
  • Asynchronous functions are called with a yield, and with an indirection through fstreamline__.invoke if they are not directly in scope (a small sketch follows this list).
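
To spell out these two rules on a tiny hypothetical function (hand-written and simplified; the real output also goes through the fstreamline__.create wrapper shown in the listing above):

// original _js source:
function countFileLines(_, path) {
  return fs.readFile(path, 'utf8', _).split('\n').length;
}

// rough shape after the --generators transform:
function* countFileLines(_, path) {
  return (yield fstreamline__.invoke(fs, "readFile", [path, 'utf8', _], 2)).split('\n').length;
}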

But otherwise, the code layout and the comments are preserved, like in --fibers mode.

You can execute this transformed file directly with:

npm link streamline # make streamline runtime available locally - may need sudo
node --harmony diskUsage/diskUsage

Benchmarks

Of course, the next step was to try to compare performance between the 3 streamline modes: callbacks, fibers and generators. This is a bit unfair because generators are really experimental and haven’t been optimized like the rest of V8 yet but I wrote a little benchmark that compares the 3 streamline modes as well as a raw callbacks implementation. Here is a summary of my early findings:

  • In tight benches with lots of calls to setImmediate, raw callbacks outperform the others by a factor of 2 to 3.
  • Fibers always outperforms the streamline callbacks and generators modes.
  • Fibers beats everyone else, including raw callbacks, when the sync logic dominates the async calls. For example, it is 4 times faster than raw callbacks in the n=25, loop=1, modulo=1000, fn=setImmediate case.
  • Streamline callbacks and generators always come up very close, with a slight advantage to callbacks.
  • The results get much closer when real I/O calls start to dominate. For example, all results are in the [243, 258] ms range with the simple loop of readMe calls.
  • The raw callbacks bench is more fragile than the others. It stack overflows when the modulo parameter gets close to 5000. The others don’t.
  • The generators bench crashed when setting the modulo parameter to values < 2.

My interpretation of these results:

  • The difference between streamline callbacks and raw callbacks is likely due to the fact that streamline provides some comfort features: long stack traces, automatic trampolining (avoids the stack overflow that we get with raw callbacks), TLS-like context, robust exception handling, etc. This isn’t free.
  • I expected very good performance from fibers when the sync/async code ratio increases. This is because the sync-style logic that sits on top of async calls undergoes very little transformation in fibers mode. So there is almost no overhead in the higher level sync-style code, not even the overhead of a callback. On the other hand fibers has more overhead than callbacks when the frequency of async calls is very high because it has to go through the fibers layer every time.
  • Generators are a bit disappointing but this is not completely surprising. First, they just landed in V8 and they probably aren’t optimized. But this is also likely due to the single frame continuation constraint: when you have to traverse several layers of calls before reaching the async calls, every layer has to create a generator and you need a run function that interacts with all these generators to make them move forward (see lib/generators/runtime.js). This is a bit like callbacks where the callbacks impact all the layers that sit on top of async APIs, but not at all like fibers where the higher layers don’t need to be transformed.
  • The fibers and generators benches are based on code which has been transformed by streamline, not on hand-written code. There may be room for improvement with manual code, although I don’t expect the gap to be in any way comparable to the one between raw callbacks and streamline callbacks. The fibers transformation/runtime is actually quite smart (Marcel wrote it). I wrote the generators transform and I think it is pretty efficient, but it would be interesting to bench it against other solutions, for example against libraries that combine promises and generators (I think that those will be slower because they need to create more closures/objects but this is just a guess at this point).
  • The crashes in generators mode aren’t really anything to worry about. I was benching with bleeding edge software and I’m confident that the V8 generators gurus will fix them.

So yes, generators are coming… but they may have a challenge to compete head to head with raw callbacks and fibers on pure performance.


Node’s social pariahs

I learned a new expression on the node mailing list this week: social pariahs. The node.js police is after them, and it looks like I’m on the black list. I should probably take it easy and just “roll in my grave”, like Marcel did :-).

But Mikeal followed up with a blog article and I’d like to respond. Unfortunately, comments are turned off on his blog so I’m responding here (BTW, I wonder how comments work with horizontal scrolling).

Compatibility

I’ll be quick on this one. Yes, compatibility is very important and you need some rules if you want to build a lively ecosystem. The module system and the calling conventions are key. I learned this 25 years ago, when designing APIs on VAX/VMS. VMS had this great concept of common calling conventions which made it possible to link together modules written in different languages. Nothing new under the sun here.

Promise libraries are problematic in this respect because they promote a different API style for asynchronous functions. The standard callback(err, result) pattern is replaced by a pair of callback and errback, plus an optional progress callback, with different signatures. So you need wrappers to convert between the two API styles. This is not a problem today, as the vast majority of node.js libraries stick to node’s callback style, but it could cause fragmentation if promises were to gain momentum.
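
To illustrate the kind of wrapper this implies, here is a hedged sketch that adapts a one-argument, promise-returning function to node’s callback style. It only assumes the then(callback, errback) shape discussed above, not any specific promise library:

function callbackify(promiseFn) {
    return function(arg, cb) {
        promiseFn(arg).then(function(result) {
            cb(null, result);
        }, function(err) {
            cb(err);
        });
    };
}

// hypothetical usage: var readFileCb = callbackify(readFilePromise);
// readFileCb('config.json', function(err, data) { /* ... */ });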

Streamline.js is a good node.js citizen

Mikeal is quite vocal against streamline.js but I doubt that he has even read the README file. He is missing some very important points:

  • Streamline is not a library; it is a language tool, a precompiler.
  • Streamline is fully aligned on node’s callback convention.
  • Streamline is not trying to disrupt the ecosystem; it is trying to help people consume compliant code, and also to produce compliant code.

To illustrate this, let me go back to the example that revived the debate this week on the mailing list. As I wrote in my post, streamline lets you chain the 3 asynchronous calls in a single line of code:

function computeAsyncExpression(_) {
  return (Object1.retrieveNum1(_) + Object2.retrieveNum2(_)) /  Object3.retrieveNum3(_);
}

The streamline preprocessor transforms this into (*):

function computeAsyncExpression(cb) {
  if (cb == null) return __future(computeAsyncExpression, 0);
  Object1.retrieveNum1(function(err, v1) {
    if (err) return cb(err);
    Object2.retrieveNum2(function(err, v2) {
      if (err) return cb(err);
      Object3.retrieveNum3(function(err, v3) {
        if (err) return cb(err);
        cb(null, (v1 + v2) / v3);
      });
    });
  });
}

(*) actual code is a bit different but the differences are irrelevant here.

So the computeAsyncExpression function generated by streamline is more or less what the OP posted on the mailing list. It is a regular node.js function with a callback. You can call it like any other node.js API that you would have implemented directly in JavaScript with callbacks.

Streamline.js does not try to enforce a new API style; it just helps you write functions that conform to node’s callback conventions. And for lazy people like me, writing one line instead of 10 is a big win.

I did not talk about the first line in the generated function:

  if (cb == null) return __future(computeAsyncExpression, 0);

This is not a standard node.js pattern! What does it do?

If you pass a null or undefined callback to a standard node API, you usually get an exception. This is considered to be a bug and you have to fix your code and pass a valid callback.

Streamline handles this case differently, by returning a future instead of throwing an exception. The returned future works very much like a promise but it does not come with a new API pattern. Instead, a streamline future is a function that takes a regular node.js callback as parameter. You typically use it as:

  var future = computeAsyncExpression(null);
  // code that executes in parallel with computeAsyncExpression
  ...
  // now, get a result from the future
  future(function(err, result) {
    // async computation is over, handle the result
  });

Streamline is not introducing a disruptive API pattern here. It is leveraging the existing callback pattern.

So far so good but streamline also supports a fibers mode, and experimental support for generators. Is this still aligned on node’s callback convention?

The answer may seem surprising but it is a YES. If you precompile the computeAsyncExpression(_) function with the --fibers option, what you get is still a regular asynchronous node.js function that you can call with a regular callback. This function uses fibers under the hood but it retains the standard callback signature. I won’t explain the technical details here because this would take us too far afield, but this is how it is.
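
In other words, whatever the compilation mode, callers only see a plain node.js function. For instance, regular JavaScript code can call the generated computeAsyncExpression above exactly like any other callback-based API:

computeAsyncExpression(function(err, result) {
  if (err) return console.error(err);
  console.log('result: ' + result);
});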

And when generators land into V8 and node, it will be the same: streamline will give you the option to use them under the hood but still produce and consume standard callback APIs!

Libraries and Applications

The second point I wanted to discuss in this response is the distinction between libraries and applications. The node.js ecosystem is not just about people who publish modules to NPM. There are also lots of people who are building applications and services with node.js. Maybe they do not directly contribute to the ecosystem because they do not share their code but they contribute to the success and visibility of the platform.

I did not write streamline.js because I wanted to flood NPM with idiosyncratic modules. I wrote it because I was developing an application with a team of developers and I wanted application code that is robust, easy to write and easy to maintain. I wrote it because we had started to develop our application in callback style and we had produced code that was too convoluted and too fragile. Also we had reached the end of our prototyping phase and were about to move to a more industrial phase, and the learning curve of callbacks was just too high.

If I were in the business of writing NPM modules, I would probably think twice before writing them with streamline: there is a slight overhead because of some of the comfort features that streamline gives you (robust exception handling, TLS-like context, long stack traces); it is a different language, like CoffeeScript, which may put off people who want to fork; etc. I would probably use it to write drivers for complex legacy protocols (we are doing this internally in our project) but I would probably stick to raw callbacks for creative, very lightweight modules.

But I’m not writing NPM modules; I’m writing applications and services. And if I post a link to streamline to the mailing list it is because I think that this tool may help other people who are trying to write applications and who are running into the problems that we ran into more than 2 years ago: robustness, maintainability, learning curve, etc. To plagiarize Mikeal:

I feel really bad for people that ask incredibly simple questions on this list and get these incredibly complex answers!

I may have been too vocal on the mailing list at some point but I am trying to be much more discreet these days. The streamline ecosystem is not very large but the feedback that I get is very positive. People like the tool and it solves their problem. So I don’t feel bad posting a link when the async topic comes back to the mailing list. Maybe a tool like streamline can help the OP solve his problem. And even if it does not, it won’t hurt the OP to take a look and discover that there is more than one way of dealing with async code. He’ll learn along the way and make up his own mind.


Node.js stream API: events or callbacks?

Last year, I wrote a blog post about events and node streams. In this post, I proposed an alternative API for streams: callback-oriented rather than event-oriented.

For readable streams, the proposal was to have a simple read(cb) call, where cb is a callback with a function(err, data) signature. A null data value signals the end of stream.

I did not discuss writable streams in this early post but shortly afterwards I implemented wrappers for both readable and writable streams in streamline.js’ streams module and I used a very similar design for the writable stream API: a simple write(data, cb) function (similarly, a null data ends the stream).

Note: the parameters are swapped in the streamline API (write(cb, data)) because it makes it easier to deal with optional parameters. In this post I will stick to the standard node.js convention of passing the callback as last parameter.

I have been using this callback-based API for more than a year and I have found it very pleasant to work with: it is simple and robust (no risk of losing data events); it handles flow control naturally and it blends really well with streamline.js. For example, I could easily re-implement the pump/pipe functionality with a simple loop:

function pump(inStream, outStream, _) {
  var data;
  do {
    data = inStream.read(_);
    outStream.write(data, _);
  } while (data != null);
}

State of affairs

I find the current node.js stream API quite hairy in comparison. On the read side we have three events (data, end, error) and two calls (pause and resume). On the write side we have two events (drain, error) and two calls (write and end).

The event-oriented API is also more fragile because you run the risk of losing events if you do not attach your event handlers early enough (unless you pause the stream immediately after creating it).

And from the exchanges that I see on the node mailing list, I have the impression that this API is not completely sorted out yet. There are talks about upcoming changes in 0.9.

I have tried to inject the idea of a callback based API into the debate but I’ve been unsuccessful so far. Discussions quickly turned sour. I got challenged on the fact that flow control would not work with such an API but I didn’t get any response when I asked for a scenario that would demonstrate where the potential problem would be.

Equivalence

So I’m writing this post to try to shed some light on the issue. What I’ll try to do in this post is prove that the two APIs are equivalent, the corollary being that we should then be free to choose whatever API style we want.

To prove the equivalence, I am going to create wrappers:

  • A first set of wrappers that transform streams with event-oriented APIs into streams with callback-oriented APIs.
  • A second set of wrappers that transform streams with callback-oriented APIs into streams with event-oriented APIs.

There will be three wrappers in each set: a Read wrapper for readable streams, a Write wrapper for writable streams, and a wrapper that handles both read and write.

After introducing these wrappers, I will demonstrate on a small example that we get an equivalent stream when we wrap a stream twice, first in callback style and then in event style.

In this presentation I will deliberately ignore peripheral issues like encoding, close events, etc. So I won’t deal with all the subtleties of the actual node.js APIs.

The callback read wrapper

The callback read wrapper implements the asynchronous read(cb) API on top of a standard node.js readable stream.

exports.CallbackReadWrapper = function(stream) {
  var _chunks = [];
  var _error;
  var _done = false;

  stream.on('error', function(err) {
    _onData(err);
  });
  stream.on('data', function(data) {
    _onData(null, data);
  });
  stream.on('end', function() {
    _onData(null, null);
  });

  function memoize(err, chunk) {
    if (err) _error = err;
    else if (chunk) {
      _chunks.push(chunk);
      stream.pause();
    } else _done = true;
  };

  var _onData = memoize;

  this.read = function(cb) {
    if (_chunks.length > 0) {
      var chunk = _chunks.splice(0, 1)[0];
      if (_chunks.length === 0) {
        stream.resume();
      }
      return cb(null, chunk);
    } else if (_done) {
      return cb(null, null);
    } else if (_error) {
      return cb(_error);
    } else _onData = function(err, chunk) {
      if (!err && !chunk) _done = true;
      _onData = memoize;
      cb(err, chunk);
    };
  }
}

This implementation does not make the assumption that data events will never be delivered after a pause() call, as this assumption was not valid in earlier versions of node. This is why it uses an array of chunks to memoize. The code could be simplified if we made this assumption.

The callback write wrapper

The callback write wrapper implements the asynchronous write(data, cb) API on top of a standard node.js writable stream.

exports.CallbackWriteWrapper = function(stream) {
  var _error;
  var _onDrain;

  stream.on('error', function(err) {
    if (_onDrain) _onDrain(err);
    else _error = err;
  });
  stream.on('drain', function() {
    _onDrain && _onDrain();
  });

  this.write = function(data, cb) {
    if (_error) return cb(_error);
    if (data != null) {
      if (!stream.write(data)) {
        _onDrain = function(err) {
          _onDrain = null;
          cb(err);
        };
      } else {
        process.nextTick(cb);
      }
    } else {
      stream.end();
      process.nextTick(cb); // signal completion so that callers (e.g. a pump loop) can finish
    }
  }
}

The process.nextTick call guarantees that we won’t blow the stack if stream.write always returns true.

The event read wrapper

The event read wrapper is the dual of the callback read wrapper. It implements the node.js readable stream API on top of an asynchronous read(cb) function.

var EventEmitter = require('events').EventEmitter; // needed by the event wrappers below

exports.EventReadWrapper = function(stream) {
  var self = this;
  var q = [],
    paused;

  function doRead(err, data) {
    if (err) self.emit('error', err);
    else if (data != null) {
      if (paused) {
        q.push(data);
      } else {
        self.emit('data', data);
        stream.read(doRead);
      }
    } else {
      if (paused) {
        q.push(null);
      } else {
        self.emit('end');
      }
    }
  }
  self.pause = function() {
    paused = true;
  }
  self.resume = function() {
    var data;
    while ((data = q.shift()) !== undefined) {
      if (data != null) self.emit('data', data);
      else self.emit('end');
    }
    paused = false;
    stream.read(doRead);
  }

  stream.read(doRead);
}

exports.EventReadWrapper.prototype = new EventEmitter();

The event write wrapper

The event write wrapper is the dual of the callback write wrapper. It implements the node.js writable stream API on top of an asynchronous write(data, cb) function.

exports.EventWriteWrapper = function(stream) {
  var self = this;
  var chunks = [];

  function written(err) {
    if (err) self.emit('error', err);
    else {
      chunks.splice(0, 1);
      if (chunks.length === 0) self.emit('drain');
      else stream.write(chunks[0], written);
    }
  }
  this.write = function(data) {
    chunks.push(data);
    if (chunks.length === 1) stream.write(data, written);
    return chunks.length === 0;
  }
  this.end = function(data) {
    if (data != null) self.write(data);
    self.write(null);
  }
}

exports.EventWriteWrapper.prototype = new EventEmitter();

The combined wrappers

The combined wrappers implement both APIs (read and write). Their implementation is straightforward:

exports.CallbackWrapper = function(stream) {
  exports.CallbackReadWrapper.call(this, stream);
  exports.CallbackWriteWrapper.call(this, stream);
}

exports.EventWrapper = function(stream) {
  exports.EventReadWrapper.call(this, stream);
  exports.EventWriteWrapper.call(this, stream);
}

exports.EventWrapper.prototype = new EventEmitter();

Equivalence demo

The demo is based on the following baseline program:

"use strict";
var http = require('http');
var zlib = require('zlib');
var util = require('util');
var fs = require('fs');

http.createServer(function(request, response) {
  response.writeHead(200, {
    'Content-Type': 'text/plain; charset=utf8',
    'Content-Encoding': 'deflate',    
  });
  var source = fs.createReadStream(__dirname + '/wrappers.js');
  var deflate = zlib.createDeflate();
  util.pump(source, deflate);
  util.pump(deflate, response);
}).listen(1337);
console.log('Server running at http://127.0.0.1:1337/');

This is a simple program that serves a static file in compressed form. It uses two util.pump calls. The first one pumps the source stream into the deflate stream and the second one pumps the deflate stream into the response stream.

Then we modify this program to wrap the three streams twice before passing them to util.pump:

"use strict";
var wrappers = require('./wrappers');
var http = require('http');
var zlib = require('zlib');
var util = require('util');
var fs = require('fs');

http.createServer(function(request, response) {
  response.writeHead(200, {
    'Content-Type': 'text/plain; charset=utf8',
    'Content-Encoding': 'deflate',    
  });
  var source = fs.createReadStream(__dirname + '/wrappers.js');
  var deflate = zlib.createDeflate();
  source = new wrappers.EventReadWrapper(new wrappers.CallbackReadWrapper(source));
  response = new wrappers.EventWriteWrapper(new wrappers.CallbackWriteWrapper(response));
  deflate = new wrappers.EventWrapper(new wrappers.CallbackWrapper(deflate));
  util.pump(source, deflate);
  util.pump(deflate, response);
}).listen(1337);
console.log('Server running at http://127.0.0.1:1337/');

This program works like the previous one (maybe just a little bit slower), which shows that the doubly wrapped streams behave like the original unwrapped streams:

EventWrapper(CallbackWrapper(stream)) <=> stream

Note that this program won’t exercise the full pause/resume/drain API with a small input file like wrappers.js. You have to try it with a large file to exercise all events.

The next demo is a streamline.js variant that transforms the three streams into callback-oriented streams and uses the pump loop that I gave in the introduction:

"use strict";
var wrappers = require('./wrappers');
var http = require('http');
var zlib = require('zlib');
var util = require('util');
var fs = require('fs');

http.createServer(function(request, response) {
  response.writeHead(200, {
    'Content-Type': 'text/plain; charset=utf8',
    'Content-Encoding': 'deflate',    
  });
  var source = fs.createReadStream(__dirname + '/wrappers.js');
  var deflate = zlib.createDeflate();
  source = new wrappers.CallbackReadWrapper(source);
  response = new wrappers.CallbackWriteWrapper(response);
  deflate = new wrappers.CallbackWrapper(deflate);
  pump(source, deflate);
  pump(deflate, response);
}).listen(1337);
console.log('Server running at http://127.0.0.1:1337/');


function pump(inStream, outStream, _) {
  var data;
  do {
    data = inStream.read(_);
    outStream.write(data, _);
  } while (data != null);
}

This program too behaves like the original one.

Conclusions

This experiment demonstrates that event-based and callback-based streams are equivalent. My preference goes to the callback version, as you may have guessed. I’m submitting this as I think that it should be given some consideration when discussing evolutions of the stream API.

Notes:

  • The APIs are not completely equivalent though. One difference is that the event-driven API supports multiple observers. But in most pumping/piping scenarios the stream has a single observer. And callback APIs can also be tweaked to support multiple observers (streamline’s futures support that; a minimal sketch follows these notes).
  • It is also important to verify that the flow control patterns are similar and that, for example, the callback version does not do excessive buffering. This is the case as the queues don’t hold more than two elements in pumping/piping scenarios.
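
Here is a hedged sketch of that tweak, the idea rather than streamline’s actual implementation: a tiny future that memoizes the outcome of a callback-style operation and replays it to any number of observers:

function future(fn) {
  var result, done = false, observers = [];
  fn(function(err, value) {
    result = [err, value];
    done = true;
    observers.splice(0).forEach(function(cb) { cb(err, value); });
  });
  return function(cb) {
    if (done) return cb(result[0], result[1]);
    observers.push(cb);
  };
}

// usage: var f = future(function(cb) { fs.readFile('foo.txt', 'utf8', cb); });
// f(cb1); f(cb2); // both observers receive the same (err, data) outcome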

Source code is available as a gist.


node.js for the rest of us

Simple things should be simple. Complex things should be possible.
Alan Kay

I published streamline.js 18 months ago but did not write a tutorial. I just took the time to do it.

The tutorial implements a simple search aggregator application. Here is a short spec for this application:

  • One page with a search field and a submit button.
  • The search is forwarded to Google and the results are displayed in the page.
  • A second search is run on the local tree of files. Matching files and lines are displayed.
  • A third search is run against a collection of movies in a MongoDB database. Matching movie titles and director names are displayed.
  • The 3 search operations are performed in parallel.
  • The file search is parallelized but limited to 100 simultaneous open files, to avoid running out of file descriptors on large trees.
  • The movies collection in MongoDB is automatically initialized with 4 entries the first time the application is run.

The implementation takes 126 lines (looks nicer in GitHub):

"use strict";
var streams = require('streamline/lib/streams/server/streams');
var url = require('url');
var qs = require('querystring');

var begPage = '<html><head><title>My Search</title></head><body>' + //
'<form action="/">Search: ' + //
'<input name="q" value="{q}"/>' + //
'<input type="submit"/>' + //
'</form><hr/>';
var endPage = '<hr/>generated in {ms}ms</body></html>';

streams.createHttpServer(function(request, response, _) {
  var query = qs.parse(url.parse(request.url).query),
    t0 = new Date();
  response.writeHead(200, {
    'Content-Type': 'text/html; charset=utf8'
  });
  response.write(_, begPage.replace('{q}', query.q || ''));
  response.write(_, search(_, query.q));
  response.write(_, endPage.replace('{ms}', new Date() - t0));
  response.end();
}).listen(_, 1337);
console.log('Server running at http://127.0.0.1:1337/');

function search(_, q) {
  if (!q || /^\s*$/.test(q)) return "Please enter a text to search";
  try {
    // start the 3 futures
    var googleFuture = googleSearch(null, q);
    var fileFuture = fileSearch(null, q);
    var mongoFuture = mongoSearch(null, q);
    // join the results
    return '<h2>Web</h2>' + googleFuture(_) //
    + '<hr/><h2>Files</h2>' + fileFuture(_) //
    + '<hr/><h2>Mongo</h2>' + mongoFuture(_);
  } catch (ex) {
    return 'an error occurred. Retry or contact the site admin: ' + ex.stack;
  }
}

function googleSearch(_, q) {
  var t0 = new Date();
  var json = streams.httpRequest({
    url: 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=' + q,
    proxy: process.env.http_proxy
  }).end().response(_).checkStatus(200).readAll(_);
  // parse JSON response
  var parsed = JSON.parse(json);
  // Google may refuse our request. Return the message then.
  if (!parsed.responseData) return "GOOGLE ERROR: " + parsed.responseDetails;
  // format result in HTML
  return '<ul>' + parsed.responseData.results.map(function(entry) {
    return '<li><a href="' + entry.url + '">' + entry.titleNoFormatting + '</a></li>';
  }).join('') + '</ul>' + '<br/>completed in ' + (new Date() - t0) + ' ms';
}

var fs = require('fs'),
  flows = require('streamline/lib/util/flows');
// allocate a funnel for 100 concurrent open files
var filesFunnel = flows.funnel(100);

function fileSearch(_, q) {
  var t0 = new Date();
  var results = '';

  function doDir(_, dir) {
    fs.readdir(dir, _).forEach_(_, -1, function(_, file) {
      var f = dir + '/' + file;
      var stat = fs.stat(f, _);
      if (stat.isFile()) {
        // use the funnel to limit the number of open files 
        filesFunnel(_, function(_) {
          fs.readFile(f, 'utf8', _).split('\n').forEach(function(line, i) {
            if (line.indexOf(q) >= 0) results += '<br/>' + f + ':' + i + ':' + line;
          });
        });
      } else if (stat.isDirectory()) {
        doDir(_, f);
      }
    });
  }
  doDir(_, __dirname);
  return results + '<br/>completed in ' + (new Date() - t0) + ' ms';
}

var mongodb = require('mongodb'),
  mongoFunnel = flows.funnel(1);

function mongoSearch(_, q) {
  var t0 = new Date();
  var db = new mongodb.Db('tutorial', new mongodb.Server("127.0.0.1", 27017, {}));
  db.open(_);
  try {
    var coln = db.collection('movies', _);
    mongoFunnel(_, function(_) {
      if (coln.count(_) === 0) coln.insert(MOVIES, _);
    });
    var re = new RegExp(".*" + q + ".*");
    return coln.find({
      $or: [{
        title: re
      }, {
        director: re
      }]
    }, _).toArray(_).map(function(movie) {
      return movie.title + ': ' + movie.director;
    }).join('<br/>') + '<br/>completed in ' + (new Date() - t0) + ' ms';
  } finally {
    db.close();
  }
}

var MOVIES = [{
  title: 'To be or not to be',
  director: 'Ernst Lubitsch'
}, {
  title: 'La Strada',
  director: 'Federico Fellini'
}, {
  title: 'Metropolis',
  director: 'Fritz Lang'
}, {
  title: 'Barry Lyndon',
  director: 'Stanley Kubrick'
}];

I organized the tutorial into 7 steps but I did not have much to say at each step because it all just felt like normal JavaScript code around cool APIs, with the little _ to mark the spots where execution yields.

I’m blogging about it because I think that there is a real opportunity for node.js to attract mainstream programmers. And I feel that this is the kind of code that mainstream programmers would feel comfortable with.
