
Intro

Error handling is a topic that is often forgotten and skipped over in favor of the “sunny day” scenarios, both in code and in tests. Some of it is due to our natural laziness, while some is due to the fact that error handling is hard. Hard in multiple dimensions: complex, tedious and, in a way, unrewarding.

For example, how do you decide whether an error should be taken care of locally or skipped over for a centralized error handler to deal with? How do you handle error propagation through multiple asynchronous service calls? How do you identify the session (or user) that it belongs to and provide context for further investigation?

This article is our attempt at making error handling a little easier or a little more straightforward.

Article Structure

Let’s set some expectations, to ease our journey through the article, shall we?

What it is?

The article is about the general approach to error handling in Node. While Node is undoubtedly a special snowflake of a platform, most of the concepts we are going to discuss have been around for quite some time. With that said, Node’s asynchronous inclinations put a spin on some of the approaches, mostly making error handling a more complex endeavor.

We’ll try to discuss these concepts through the prism of Node, where relevant, and will establish a number of rules (recommendations, really, but “rule” is a much shorter word to type) along the way.

Finally, there will be some code examples, mostly schematic, illustrating these concepts and paradigms.

What it is not?

The article is not intended as the ultimate resource, rather a breeding ground for further thoughts and ideas to apply in your applications.

Also, while Node imposes certain limitations on the way errors are propagated and reported, the major principles do not differ from other platforms and, in fact, rely heavily on the established patterns.

Additionally, with JavaScript being such a flexible language, some programming styles, like functional, might not fully (at all?) benefit from the article.

A Glossary

We, along with many other JavaScript “texts”, are going to use the terms exception and error somewhat interchangeably. In most cases, an error is something that describes an erroneous condition or state of a system, and it may become an exception when you throw (raise) it. In addition, exceptions are thrown by the engine or runtime environment, perhaps as a result of a lower-level error.

Some languages, like Java, distinguish between the two — exception is intended to be caught by the program, while error may remain unchecked.

JavaScript, on the other hand, doesn’t make such a distinction, even if some errors are called exceptions, like DOMException, and others keep their name.
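
To make the distinction concrete, here is a minimal sketch. The `save` and `summarize` functions are hypothetical and exist only for illustration. The same Error object is an ordinary value until it is thrown, at which point it travels as an exception:

```javascript
// An error is an ordinary value until it is thrown.
const error = new Error('disk is full');

// Hypothetical helper: inspects a value without throwing anything
function summarize(value) {
    return value instanceof Error ? `error: ${value.message}` : 'ok';
}

// Hypothetical operation: throwing turns the same object into an exception
function save() {
    throw error;
}

let caught = null;
try {
    save();
} catch (err) {
    caught = err;
}

console.log(summarize(error));   // error: disk is full
console.log(caught === error);   // true
```

The object that arrives in the catch clause is the very same one that was created, which is why the two terms blur so easily in JavaScript.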

Error Ergonomics

Software systems have a lot of users. The actual users and their actions (real or perceived) are, of course, the most prominent influencers on various architectural and technological decisions during the implementation process.

There is, however, another group of people that uses and impacts the software on a daily basis, but is rarely considered its users — developers and maintainers.

Similar to how “the code is written once, but read many times over”, these are the people that use the system the most, especially in its most crucial periods of creation and stabilization. They rely on various mechanisms to be able to maintain, fix and improve the system, and errors are one of the more important ones.

Sometimes, when discussing error handling or monitoring, development ergonomics is a secondary concern, if at all. We’ll try to change that, if only within this article, and address the maintenance/development impact of a specific decision, where possible and/or applicable.

Error Types

Not all errors are created equal. In fact, there are two very distinct groups of errors: technical errors (which comprise programmer and operational errors) and business logic errors. The two groups are different in nature and should be handled differently.

We’ll skip the word “logic” in business logic errors and refer to them as business errors, for brevity.

Programmer Errors

As you will witness, the conceptual differences between these errors are important enough to actually create, deliver and handle them in a different manner.

Most programmers, present company excluded, make mistakes. They forget to declare variables, invoke functions with parameters of incorrect types and generally stink up the joint with code smells.

These mistakes later, during build, testing or runtime, “become” bugs.

Here is an example of the last programmer error ever:

if( enemyMissiles.inFlight() = true ) {
    ourMissiles.launch();
}

Don’t destroy the world! Pay attention, lint and test!

Operational Errors

In contrast to programmer errors, operational errors are something that is a part of correct functioning of a system — something that is expected to happen. In most cases they are a result of an external system or entity being faulty, unavailable or non-existent.

Note the stipulation of “correct functioning” above. This is not an absolute term, rather something that should be defined for each system and each interface within it.

For example, if you provide your user with a form and she fills it incorrectly — this should not be considered an operational or programmer error, but a normal behavior. An example of handling would be to allow the user to re-submit, while pointing out incorrectly filled fields.

In contrast, if your user doesn’t have the ability to input free text and only selects a group of items, receiving them corrupted may indicate an operational error (network-level failure of some sort), a programmer error (incorrect parsing/serialization of the user’s selection) or even an attack (some sort of injection).

Additionally, the errors may be connected. A timeout during an HTTP call between two services in a system is something that is expected to happen and has to be handled. Same for a logging system being down when you send an error report — it produces an error (for example 503) that should be dealt with.

One can picture this as a chain of errors:

  1. programmer error caused memory leak in the logging system
  2. as a result the logging system had to be restarted to reset memory
  3. another programming/architectural/deployment error didn’t create any redundancy and as such the logging system became unavailable
  4. another system received an operational error (the 503 from above)
  5. another programmer error skipped on handling of such a case and the originating system crashed

Programmer errors don’t exist in a vacuum. They propagate, often becoming operational errors or, worse, business logic errors.

Business Logic Errors

The second group of errors, business logic errors, refers to erroneous conditions, usually caused by a combination of programmer and operational errors, that violate some set of business rules.

For instance, allowing withdrawal from an account with insufficient funds is a business logic error, most likely a result of some sort of underlying lower-level error:

  • operational error — network failure when trying to check the funds, and
  • programmer error — swallowing of the error and returning some default non-zero amount, or
  • programmer error — not checking for funds at all

It should be noted that in many cases what one would consider an error is a correct logical path in a system. A user in our online shopping application attempting to purchase a product that is only available in specific countries (I am looking at you, Amazon) performs an action that is erroneous from the business rules perspective, but this may not necessarily cause an error to be created or an exception to be thrown.

Error Creation & Delivery

Any design of error handling mechanisms should first establish the method in which errors are created and delivered throughout the system.

Note that in some texts, the word “reporting” is used instead of “delivery”. These two, however, carry different semantic meaning in today’s Node/Web development — reporting assumes interaction with a centralized system that collects the errors, while delivery talks about the methods in which errors are propagated within the system’s code/network/interfaces.

There are several paradigms, and, much like the size and type of indentation in your editor (it should be 4 spaces!) or whether you insert new line before opening brackets (you shouldn’t!), their proponents are zealously religious about them.

Even though we know (and will later tell you) which one is the best (confidence is always key in these types of discussions), it is absolutely worth discussing the options.

A Special Case of Libraries

Before diving into the “regular” cases, let’s discuss a special one — error delivery and creation from within library code.

A library, in the broadest sense, is a self-contained and well-encapsulated module that may be used in various places in the system, sometimes with vastly different error handling paradigms.

How should a library function handle or deliver errors?

The short answer is that it shouldn’t. It should defer judgement to the consuming system, on both how and whether the errors should be created and delivered. Any assumption about the way the consuming system wants to handle errors becomes more likely to break, the more successful and generic the library abstraction is.

The preferred way, especially within Node, should be to use inversion of control and require the consumer to provide, via settings or directly to a function, the decision-making and reporting callbacks.

An example, for a library that does some sort of computation, would be:

const lib = new Library({
    shouldReportError() {
        // logic that decides when and where to report
        return shouldReport;
    },

    onError() {
        // create the error
        ...
        return error;
    }
});

lib.someMethod(value);

As a library author, provide the client system with the ability to inject an error reporting callback and, in some cases, a callback to decide, perhaps on a case-by-case basis, whether an error should be created and delivered at all.
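
A fuller, runnable sketch of the same idea might look as follows. The `Library` class, its option names and the `details` object passed to the callbacks are all assumptions made for illustration, not an existing API:

```javascript
// Hypothetical library: class, option names and payload shape are illustrative only
class Library {
    constructor({ shouldReportError = () => true, onError = () => {} } = {}) {
        this.shouldReportError = shouldReportError;
        this.onError = onError;
    }

    // Returns the square root of a non-negative number.
    // On bad input it consults the consumer's callbacks and returns NaN.
    someMethod(value) {
        if (typeof value !== 'number' || value < 0) {
            const details = { method: 'someMethod', value };
            if (this.shouldReportError(details)) {
                this.onError(details);
            }
            return NaN;
        }
        return Math.sqrt(value);
    }
}

// Consumer side: decides what is worth reporting and how the error is created
const reported = [];
const lib = new Library({
    shouldReportError: ({ value }) => value !== undefined,
    onError: ({ method, value }) => {
        reported.push(new Error(`bad input for ${method}: ${value}`));
    },
});

lib.someMethod(16);   // 4
lib.someMethod(-1);   // NaN, and one error is reported
lib.someMethod();     // NaN, skipped by shouldReportError
console.log(reported.length); // 1
```

The library never decides on its own what an error looks like or where it goes; both decisions stay with the consuming system.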

Error Value Objects

These are sometimes, mistakenly, called Error Codes. We’ll expand on why it is a mistake later, in the Error Identification chapter, and accept Error Value as the name for the paradigm.

The idea is to return a normalized object from a function:

function foo() {
    ...
    if (someCondition) {
        return {
            error: true,
            errorCode: 'AAA',
            context: ...,
            result: ... // since the same object is used for success
        };
    }
    ...
}

// Consume:

const result = foo();

if (result.error) {
    // handle the error and exit? return? throw an exception?
} else {
    // continue the flow
}

Pros

It is hard to find any redeeming property of this pattern, aside from perhaps a uniformity in function results. Even then, it introduces a lot of unnecessary boilerplate code for checking whether the result is an error, unwrapping the object and then perhaps re-wrapping it again for further propagation.

Cons

The pattern:

  1. prevents functions from returning primitive values
  2. requires complex and repetitive handling of the error/success
  3. introduces error creation code that is not atomic error-wise and may introduce additional errors.
  4. most importantly, it doesn’t force the consumer to handle the error at all. The following consumer code will fail somewhere else, magnifying the problem:

foo();

The consumer relied on the function’s side effects (which is fine) and is not forced by the language, or in any other manner, to handle anything.

A Variation

A refinement of the pattern is that you should return a meaningful value from a function call, wherever possible, much like array.indexOf(item) returns -1 if no match is found.

While this kind of a suggestion makes sense, we’d argue that this kind of behavior falls under business error that should not be handled. It is an expected interface feature to not find an item in an array and is not at all exceptional or even erroneous!

An Honorable Mention

Using a return value to denote success, failure or an unexpected result is a common pattern in functional languages. While JavaScript has some functional features, functional is rarely the only style used in a typical JavaScript program.

So, unless you use JavaScript as a purely functional language throughout your system (kudos!), see further for the verdict:

Verdict

Avoid at all costs!

Error Handlers

Another way to deliver an error is by providing an error callback:

function foo(params, onError) {
    ...
    if (someCondition) {
        onError({
            errorCode: 'AAA',
            context: ...
        });
    }
    ...
}

// Consume:

foo(value, (err) => {
    ...
});

Pros

  1. This pattern doesn’t suffer from the same limitation as the Error Value Objects: it doesn’t force the function to return a predefined normalized result object.

Cons

The most glaring ones are:

  1. what happens if there is an error in the onError handler? Should it also somehow accept an error callback?
  2. you need to add an additional parameter (or resort to passing a God object) to all functions that may deliver an error; the question then becomes how do you know which function will? It will quickly derail into a common pattern where all meaningful functions take such a parameter
  3. there is no contract or enforcement to make sure that the handler is only invoked once

Verdict

Avoid.

Error-First Callbacks

The above pattern and the fact that Node is a mostly asynchronous-by-approach platform created a new pattern, often called error-first callback:

function foo(callback) {
    ...
    if (someCondition) {
        callback({
            errorCode: 'AAA',
            context: ...
        });
    } else {
        callback(null, ...);
    }
    ...
}

// Consume:

foo((err, result) => {
    if (err) {
        // handle failure
    } else {
        // handle success
    }
});

Pros

This clearly works, as many Node libraries and frameworks successfully use it, but that success may be, at least partially, in spite of the pattern and not because of it:

  1. It is a consistent, clearly defined pattern
  2. it works well for both synchronous and asynchronous code (which may be considered a drawback, actually)

Cons

  1. If used for both synchronous and asynchronous error delivery, makes it harder to distinguish between the two
  2. Creates a Pyramid of Doom (and we’ve all seen one, so no horrid illustration)
  3. Forces you to manually propagate the error stack and context through the Pyramid of Doom, whether you are going to handle the error in a specific callback or not
  4. No contract or enforcement to make sure that the handler is only invoked once

Verdict

Avoid. Transfer to the next pattern — Promises, where possible!
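
As a minimal sketch of such a transfer, Node’s built-in `util.promisify` wraps a function that follows the error-first convention and returns a Promise-based version of it. The `readSetting` function below is a hypothetical stand-in for a legacy API:

```javascript
const { promisify } = require('util');

// Hypothetical error-first function, standing in for a legacy API
function readSetting(name, callback) {
    setImmediate(() => {
        if (name !== 'port') {
            callback(new Error(`unknown setting: ${name}`));
        } else {
            callback(null, 8080);
        }
    });
}

// promisify maps callback(err) to a rejection and callback(null, value) to a resolution
const readSettingAsync = promisify(readSetting);

readSettingAsync('port').then((value) => console.log(value));        // 8080
readSettingAsync('host').catch((err) => console.log(err.message));   // unknown setting: host
```

This only works when the wrapped function really honors the convention: the callback is the last parameter and is invoked with the error first.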

Events

Generally speaking, events are a form of callback, with a difference: they are expected to happen an arbitrary number of times, thus invoking the callback multiple times and perhaps delivering multiple errors or other events.

While there are variations (and subtle differences between event emitters and observables), the common form is:

function foo(param) {
    ...
    return eventEmitter;
}

foo(value).on('error', (err) => {
    // handle failure
}).on('end', (result) => {
    // handle success
});

Pros

  1. Flexibility in various events, including various error events (although it’s rare)
  2. Consistency of consumption of success and failure
  3. Ability to be reused (as opposed to one-off nature of other patterns)

Cons

  1. Mostly reserved for asynchronous execution
  2. Creates a Pyramid of Doom (granted, a smaller one)
  3. Just like the error-first callback, forces you to manually propagate the error stack and context through the Pyramid of Doom, whether you are going to handle the error in a specific callback or not
  4. Non-deterministic, as the exact number of times an event fires, or even its interaction with other possible events, is not defined in an exact manner (imagine a stop-the-world error being triggered multiple times or being “outrun” by a different non-error event)
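
A runnable sketch of the pattern, using Node’s built-in EventEmitter, could look like this. The `processItems` function and its event payloads are assumptions made for illustration:

```javascript
const { EventEmitter } = require('events');

// Hypothetical producer: emits one 'error' per bad item, then a single 'end'
function processItems(items) {
    const emitter = new EventEmitter();
    // Defer the work so that the consumer has time to attach listeners
    setImmediate(() => {
        let processed = 0;
        for (const item of items) {
            if (item < 0) {
                emitter.emit('error', new Error(`negative item: ${item}`));
            } else {
                processed += 1;
            }
        }
        emitter.emit('end', processed);
    });
    return emitter;
}

const errors = [];
const done = new Promise((resolve) => {
    processItems([1, -2, 3, -4])
        .on('error', (err) => errors.push(err.message))
        .on('end', resolve);
});

// Two items are processed and two error messages are collected
done.then((processed) => console.log(processed, errors));
```

Note that the 'error' handler runs twice here, which illustrates the last drawback above. Also keep in mind that in Node, emitting 'error' on an EventEmitter that has no listener for it throws the error.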

Exceptions

Finally, one of the most fundamental ways to create and deliver an error is by raising or throwing an exception. Many languages provide a mechanism to do so, which in JavaScript looks like this:

function foo() {
    ...
    if (someCondition) {
        throw new Error('some condition is violated');
    }
    ...
}

One clear caveat is that in JavaScript you can throw anything:

function foo() {
    ...
    if (someCondition) {
        throw 'some condition is violated';
    }
    ...
}

as in the case of the string above. You shouldn’t, and the reason will become clear later in the article.
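
One practical difference can already be sketched here. The helper names are hypothetical, and the `stack` property, while available in Node, is not part of the language standard. A thrown string arrives in the catch clause with nothing but its text:

```javascript
function throwString() {
    throw 'some condition is violated';
}

function throwError() {
    throw new Error('some condition is violated');
}

// Hypothetical helper: runs fn and returns whatever it throws
function capture(fn) {
    try {
        fn();
    } catch (thrown) {
        return thrown;
    }
    return undefined;
}

const fromString = capture(throwString);
const fromError = capture(throwError);

console.log(typeof fromString);            // string
console.log(fromString.stack);             // undefined
console.log(fromError instanceof Error);   // true
console.log(typeof fromError.stack);       // string
```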

Another possible, but absolutely incorrect, usage for exceptions is similar to this:

function bar() {
    try {
        while(someCondition) {
            ...
            throw new Error('stop');
        }
    } catch (err) {
        // continue normal flow
    }
}

where exceptions are used as part of flow control mechanisms.

Pros

  1. A part of the language specification, thus subject to optimization and improvements by the engine, TC39 etc.
  2. A clear and distinct way to show that there is an exceptional logical path within the code
  3. Doesn’t impose any limitations on the throwing function
  4. Guaranteed to be executed once

Cons

  1. Doesn’t work for asynchronous code:

try {
    setTimeout(() => {
        ...
        throw new Error('failure');
        ...
    }, 100);
} catch (err) {
    // handle failure
}

The catch will never be executed, as the try block exits before the exception is thrown.

  2. Hidden/alternative (just like most other patterns) flow within the code

Verdict

Use in synchronous code where possible using the recommended approach below.

Also, consider Rule #1 below.


Rule #1. Do not use exceptions for flow control

Do not use exceptions where a standard language construct like break or return would do:

  • it is not immediately obvious to a programmer
  • it introduces unnecessary performance hit (break would work just as well)
  • it prevents compiler optimizations
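
As a small illustration, with a hypothetical `firstNegative` function, leaving a loop early needs nothing more than break:

```javascript
// Finds the first negative number; `break` ends the loop, no exception needed
function firstNegative(numbers) {
    let found;
    for (const n of numbers) {
        if (n < 0) {
            found = n;
            break;
        }
    }
    return found;
}

console.log(firstNegative([3, 1, -7, -2])); // -7
console.log(firstNegative([3, 1]));         // undefined
```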

Promises

With ES6, Promises, already popular, have become the de facto way to write asynchronous code, with good reason.

They allow you to write code in a much more concise manner. The future value reference, which a Promise represents, may be manipulated just like any other reference (passed to functions, put in arrays etc.):

function foo(param) {
    ...
    return promise;
}

// Consume: 

foo(value).then((result) => {
    // handle success
}).catch((err) => {
    // handle failure
});

Pros

  1. Guaranteed to execute callbacks (then or catch) once
  2. Errors in then are propagated to catch
  3. Guaranteed to be asynchronous
  4. Consistency of consumption of success and failure

Cons

  1. Different coding paradigm for synchronous and asynchronous code
  2. Just like all other cases, exception in error handling code (catch) is not caught in any way

Verdict

Use in asynchronous code where possible using the recommended approach below.
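
The propagation rules listed above can be sketched as follows. The `delayedFailure` helper is hypothetical; it wraps a timer in a Promise, so that the failure that try/catch could not see in the Exceptions section becomes a rejection that catch can handle:

```javascript
// Hypothetical helper: a timer-based operation wrapped in a Promise
function delayedFailure(ms) {
    return new Promise((resolve, reject) => {
        setTimeout(() => reject(new Error('failure')), ms);
    });
}

// A rejection skips then() and is delivered to catch()
const outcome = delayedFailure(10)
    .then(() => 'success')
    .catch((err) => `handled: ${err.message}`);

// An exception thrown inside then() is delivered to the next catch() as well
const chained = Promise.resolve(1)
    .then(() => { throw new Error('thrown in then'); })
    .catch((err) => err.message);

outcome.then((text) => console.log(text));   // handled: failure
chained.then((text) => console.log(text));   // thrown in then
```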

Hybrid (Recommended)

With the official introduction of async/await in Node version 7.6, the recommended approach is a hybrid of the Exception and Promise patterns.

Here is an example for such a combination:

try {
    const matches = await getMatches(text);
    const products = getProducts(matches, filters);
    ...
} catch (err) {
    // handle errors
}

You can see that the same block of code contains both asynchronous and synchronous code and is enclosed in one try/catch.

Pros

  1. All pros from Exception and Promise patterns
  2. Consistency for all code, asynchronous or synchronous

Cons

  1. Still hidden/alternative (just like most other patterns) flow within the code

 Rule #2. Use a combination of exceptions and promises

  • use Promises for asynchronous code
  • use async/await to handle asynchronous code
  • use throw/try/catch in all (asynchronous or synchronous) code
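
A runnable sketch of the rule follows. The `getMatches` and `getProducts` bodies are hypothetical stand-ins for the functions named in the example above, and `search` returns a description of the failure only to keep the outcome observable:

```javascript
// Hypothetical asynchronous step: a throw inside an async function rejects its promise
async function getMatches(text) {
    if (!text) {
        throw new Error('empty search text');
    }
    return [1, 2, 3];
}

// Hypothetical synchronous step: throws in the usual way
function getProducts(matches, filters) {
    if (!Array.isArray(filters)) {
        throw new TypeError('filters must be an array');
    }
    return matches.map((id) => ({ id, filters }));
}

async function search(text, filters) {
    try {
        const matches = await getMatches(text);
        return getProducts(matches, filters);
    } catch (err) {
        // Both the asynchronous and the synchronous failure land here
        return `handled: ${err.message}`;
    }
}

search('', []).then((result) => console.log(result));         // handled: empty search text
search('shoes', null).then((result) => console.log(result));  // handled: filters must be an array
```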

A Little Perspective on Performance

In conclusion, we’d like to bring some sense into a notion that seems to persist in the JavaScript community: that throwing and catching exceptions is a very costly operation.

We don’t claim to have perfected the measurement method, but using the well-supported and acclaimed benchmark.js library we’ve set up an execution test for two roughly equivalent pieces of code.

One for exception production and consumption:

function exceptionProducer(param) {
    if (param > 100) {
        throw new Error('param is larger than 100');
    }
}

function exceptionConsumer() {
    try {
        exceptionProducer(101);
    } catch (err) {
        // do something with error, just for
        // completeness' sake
        const {message} = err;
    }
}

exceptionConsumer();

and a similar one for error production and consumption via return:

function errorProducer(param) {
    if (param > 100) {
        return new Error('param is larger than 100');
    }
}

function errorConsumer() {
    const result = errorProducer(101);

    if (result instanceof Error) {
        // do something with error, just for
        // completeness' sake
        const {message} = result;
    }
}

errorConsumer();

After running these as a benchmark.js suite on Node, these are the results:

  • Error production & consumption x 450,317 ops/sec ±1.38%
  • Exception production & consumption x 318,017 ops/sec ±1.56%

At this point, it should be clear that while the error side “won”, the absolute values are staggeringly small: the mean run times for a single execution (benchmark.js provides statistical results as well) for the exception and error samples are 3.144 and 2.22 microseconds respectively.
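
The mean run times follow directly from the throughput figures, as the reciprocal of operations per second. A small sketch of that arithmetic, with a hypothetical helper name:

```javascript
// Convert throughput (ops/sec) into mean time per run (microseconds)
function meanMicroseconds(opsPerSecond) {
    return 1e6 / opsPerSecond;
}

const exceptionCost = meanMicroseconds(318017);
const errorCost = meanMicroseconds(450317);

console.log(exceptionCost.toFixed(3));                 // 3.144
console.log(errorCost.toFixed(3));                     // 2.221
console.log((exceptionCost - errorCost).toFixed(3));   // 0.924
```

In other words, the price of throwing and catching in this test is under one microsecond per error.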

While this test is very far from being scientific, it is also an excellent illustration of premature optimization.

There are quotes, by Harold Abelson:

“Programs must be written for people to read, and only incidentally for machines to execute.”

and Donald Knuth:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

that summarize the discussion above.


Rule #3. Do not optimize prematurely

Do not optimize at the cost of readability, structural soundness and developer ergonomics. Use exceptions and errors where appropriate and optimize on critical paths after measurements.


 Reporting & Tracking

While the bulk of the article is dedicated to error handling, it doesn’t exist (and is only half as useful) in a vacuum. Reporting and tracking the errors is paramount in ensuring system health in the long run. While the “if a tree falls in the forest…” approach may work at first, an unreported error will eventually float to the surface and crash the system in unpredictable ways. Worse yet, it may not crash the system at all, while corrupting unknown data or interacting with external systems in incorrect ways (and have the dreaded business logic errors running amok).

Usually the errors are reported, centrally or locally, to a remote collection system, where they are categorized, trigger notifications and are generally made available for discovery and analysis.

The information about an error’s call-site, context and root cause is crucial for finding and fixing these errors in a timely manner. This information should be a part of what the main system passes on upon error handling.

Error Identification

So, we have decided on the error handling pattern and have integrated with an error collection system.

How do we ensure that, when the need arises, we will be able to actually benefit from it?

Connecting the Dots

One of the more fundamental needs when analysing the tracking and error logs of a distributed system is to be able to follow the trail of a request, an isolated context.

Of course, these days it is infinitely easier to use an external error reporting/collection system. There are multiple excellent choices, like Logz.io or Rollbar.

Often it is far from a straightforward endeavor. Obstacles may include:

  • multiple requests arriving at the same time with no guaranteed order
  • identical (business-wise) requests originating from the same user session
  • error logs and other monitoring messages may not necessarily be collocated and permissions to access these logs may not be available to the same group of maintainers
  • load balancing may split parts of the same “session” across multiple nodes

In a poorly designed system, error forensics may be an extremely tedious task that demands a lot of skill and system understanding.

Consider that for different parts of the request chain (user → FE → BE → service → service):

  • machine names may be different
  • timestamps may not match due to difference in system time or even time zones and daylight saving time
  • business IDs may interweave due to multiple identical requests within the same “session”

The solution is to create a unique error ID, independent of any logical IDs, and return it to the caller through all the systems, while ensuring that it is included in any error report.

This brings us to not one, but two new rules:


Rule #4. Screw the timestamp

Use a unique, preferably globally unique, identifier for any error you report and return this identifier to the caller through the handling chain.


Keep in mind that timestamps still provide very useful information for forensic analysis. They just should not be the only way to identify a “distributed” error.

And on the matter of ID generation:


Rule #5. Don’t reuse IDs

Do not use any business logic identifier as a substitute for the generated one: identical or similar requests originating from the same “session” will make it unusable.
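
Rules #4 and #5 can be sketched together as follows, assuming a Node version that provides `crypto.randomUUID` (14.17 or later). The `report`, `identify` and `handleRequest` functions and the `errorId` property are hypothetical names chosen for illustration:

```javascript
const { randomUUID } = require('crypto');

// Hypothetical reporter; a real system would send this to a collection service
const reports = [];
function report(err) {
    reports.push({ errorId: err.errorId, message: err.message });
}

// Attach a generated ID once, at the first point of handling, and reuse it afterwards
function identify(err) {
    if (!err.errorId) {
        err.errorId = randomUUID();
    }
    return err;
}

function handleRequest(work) {
    try {
        return { status: 200, body: work() };
    } catch (err) {
        report(identify(err));
        // The caller receives the same ID that was reported
        return { status: 500, errorId: err.errorId };
    }
}

const response = handleRequest(() => {
    throw new Error('search system unavailable');
});

console.log(response.errorId === reports[0].errorId); // true
```

Because `identify` keeps an existing ID, an error that crosses several handling layers is reported under one identifier instead of a new one per layer.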


Error Context & Meta

Now that we deliver connected errors under the same umbrella ID, a question arises as to what to report. What kind of information, and how much of it, should be pushed across service boundaries and to an external error tracking and collection system?

Consider the users of the error — they include error collection and tracking system, layers of system code that called the “offending” function and in some cases the “real” users of the system.

This brings us to a recommendation to include, besides the unique ID, two types of information:

  1. some sort of normalized error code/status for automatic consumption by non-human users; this makes it easier to decide what to do with the error in higher levels of code and what kind of human-readable content (if any) to present to the user, and it enables proper tagging and sorting in the tracking system
  2. the context in which the erroneous state, which the error represents, occurred; this includes unstructured information, like various data items, and more structured information, like the call stack, mostly suited for human analysis

Thus, Rule #6:


Rule #6. Servant of two masters

Include, in addition to a unique ID, two types of information with any created and delivered error:

  • an error code, from a set of predefined codes, for automatic consumption
  • forensic information for humans to use that includes IDs, stack traces etc.

Unless there is some specific limitation on the size of information attached to an error (for example, there is some serialization), err on the side of being more verbose. This may include things like request parameters, calculated data or other contexts. More information would probably help in identifying the cause and context of the error in a more timely manner.
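
One possible shape for such an error is sketched below. The `AppError` class, the `ErrorCodes` set and the `toReport` method are assumptions made for illustration, and `crypto.randomUUID` requires Node 14.17 or later:

```javascript
const { randomUUID } = require('crypto');

// Hypothetical set of predefined codes for automatic consumption
const ErrorCodes = Object.freeze({
    SEARCH_UNAVAILABLE: 'SEARCH_UNAVAILABLE',
    UNKNOWN_USER: 'UNKNOWN_USER',
});

class AppError extends Error {
    constructor(code, message, context = {}) {
        super(message);
        this.name = 'AppError';
        this.id = randomUUID();   // unique ID, independent of business identifiers
        this.code = code;         // machine-readable, from the predefined set
        this.context = context;   // forensic data for human analysis
    }

    // Plain object suitable for a tracking system or a service boundary
    toReport() {
        const { id, code, message, context, stack } = this;
        return { id, code, message, context, stack };
    }
}

const err = new AppError(ErrorCodes.UNKNOWN_USER, 'user lookup failed', { userId: 'u-42' });

console.log(err.code);                      // UNKNOWN_USER
console.log(err.toReport().context.userId); // u-42
```

Higher-level code can branch on `code`, while the ID, context and stack travel along for whoever investigates the error later.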

Error Hierarchy

Some languages, notably Java, support type-checked exception handling via catch clauses, which makes it possible to write code like this:

try {
   int x = doSomething();
}
catch (SomeError e) {
    // handle some error
}
catch (SomeOtherError e){
    // handle some other error
}

and the rest of the errors are propagated to higher-level code.

Such an ability makes it possible to implement an Error Hierarchy, similar to how water bounces down waterfall stairs: smaller stairs (more specific Exceptions) first, larger (more general Exceptions) last.

Or, in our pragmatic Java-ish code way, how an exception bounces through catch clauses, from the most specific to the most generic one:

class SpecificError extends GenericError {}
class MoreSpecificError extends SpecificError {}
class SuperDuperSpecificError extends MoreSpecificError {}
try {
    this.doSomething();
} catch (SuperDuperSpecificError e) {
   // super duper specific handling
} catch (MoreSpecificError e) {
    // more specific handling  
} catch (SpecificError e) {
    // specific handling  
} catch (GenericError e) {
    // run-of-the-mill handling  
}

JavaScript, and by extension Node, however, is an all-or-nothing kind of language, so synchronous:

try {
   const x = doSomething();
} catch (err) {
    // handle failure
}

or, in Promise-land, asynchronous catch clauses:

doSomething().then((result) => {
    return doSomethingElse();
}).catch((err) => {
    // handle failure
});

will catch everything.

What’s a poor Node developer to do, then?

So far, the only way to actually establish Error Hierarchy, is to implement it using instanceof:

class SpecificError extends Error {}
class MoreSpecificError extends SpecificError {}
class SuperDuperSpecificError extends MoreSpecificError {}
try {
   const x = doSomething();
} catch (err) {
    if (err instanceof SuperDuperSpecificError) {
        // super duper specific handling
    } else if (err instanceof MoreSpecificError) {
        // more specific handling
    } else if (err instanceof SpecificError) {
        // specific handling
    } else if (err instanceof Error) {
        // run-of-the-mill handling
    }
}

with a similar approach in the case of Promises.
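
A sketch of the Promise variant is shown here, with a hypothetical `doSomething` that simply rejects with the error it is given. Errors that the handler does not recognize are re-thrown, so that they keep propagating to higher-level code:

```javascript
class SpecificError extends Error {}
class MoreSpecificError extends SpecificError {}

// Hypothetical asynchronous operation that rejects with the given error
function doSomething(error) {
    return Promise.reject(error);
}

function handle(error) {
    return doSomething(error).catch((err) => {
        if (err instanceof MoreSpecificError) {
            return 'more specific handling';
        } else if (err instanceof SpecificError) {
            return 'specific handling';
        }
        // Not ours to handle: re-throw to keep the promise rejected
        throw err;
    });
}

handle(new MoreSpecificError('a')).then((text) => console.log(text));   // more specific handling
handle(new SpecificError('b')).then((text) => console.log(text));       // specific handling
handle(new TypeError('c')).catch((err) => console.log(err.name));       // TypeError
```

The order of the checks matters: the most specific class has to be tested first, since an instance of a subclass also satisfies instanceof for every class above it.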

Curiously, Firefox was the only browser to support a non-standard conditional catch clause (it was removed in Firefox 59), which allowed this:

try {
   const x = doSomething();
} catch (err if err instanceof SuperDuperSpecificError) {
    // super duper specific handling
} catch (err if err instanceof MoreSpecificError) {
    // more specific handling  
} catch (err if err instanceof SpecificError) {
    // specific handling  
} catch (err if err instanceof Error) {
    // run-of-the-mill handling
}

To emphasize this, Error Identification/Meta and Error Hierarchy help to make the distinction between errors easier and more straightforward.

Error Handling

Now that we decided:

  • how we are going to create and identify errors
  • what kind of information we are going to attach to them, and
  • whether we will establish some kind of error hierarchy

we can start a discussion on how the delivered errors ought to be handled.

Separating the Errors

In a highly distributed system, like our application above, it’s not unlikely to hop between a business interface and a technical one.

For example, when the catalog system above receives a search request from the serving system (FE server), it is likely to do the following:

  • extract from the request search term, user context and possible filters
  • call search engine service to get a set of results, as close to the search term as possible
  • call the database with the IDs from search results and filters extracted from the request
  • call eligibility/availability service, with the user context and database results to filter out unavailable products
  • return the remaining products to the serving system to be presented to the user

This process may be, somewhat schematically and with a lot of simplifications for brevity, implemented like this:

app.get('/search', async (req, res) => {
    const {text, filters} = req.query;
    const {context} = req.body;
    try {
        const matches = await getMatches(text);
        const products = await getProducts(matches, filters);
        const availableProducts = await getAvailable(products, context);

        res.status(200).send(availableProducts);
    } catch (err) {
        // handle errors
    }
});

Consider the possibly erroneous/exceptional conditions that the above piece of code may encounter:

  • search system unavailable
  • no search matches
  • search index is not initialized
  • no matching (to IDs) products
  • unknown user

to name a few.

Some of these are possibly operational or programmer errors, like “search system unavailable” or “search index is not initialized”, caused by:

  • a network issue,
  • or a programmer/devop forgetting to initialize the search engine with a previously generated index.

Some are clear business errors, like “no search matches” for a specific search text.

Finally, some may be either, like “unknown user”. This erroneous condition may indicate that:

  • the user management system is down
  • the user context is being propagated incorrectly, or
  • there is an impersonation attempt, where the session (with the authentication token within) is copied from a different user than the currently logged in one

Consider how naive handling of these errors would look in the code above (considering we went the error hierarchy route):

app.get('/search', async (req, res) => {
    ...
    try {
        ...
    } catch (err) {
        if (err instanceof ConnectionRefused) {

        } else if (err instanceof NoIndexFound) {

        } else if (err instanceof NoMatch) {

        } else if (err instanceof IncorrectUserInfo) {

        } else if (err instanceof AuthenticationError) {

        } else if (err instanceof Error) {

        }
    }
});

These are the issues, among many others, that immediately arise with regard to these exceptions:

  • ConnectionRefused or NoIndexFound
    • how do you know it is a search engine exception?
    • how do you retry the attempt?
    • what kind of response do you return to the serving system?
  • NoMatch
    • is this an acceptable result from the catalog system?
  • IncorrectUserInfo or AuthenticationError
    • should this be reported to the serving system or failed silently and reported?
    • should a re-login request (in case the token has timed out) be transferred?

Despite the fact that the originating code is perfectly valid:

app.get('/search', async (req, res) => {
    ...
    try {
        const matches = await getMatches(text);
        const products = await getProducts(matches, filters);
        const availableProducts = await getAvailable(products, context);

        res.status(200).send(availableProducts);
    } catch (err) {
        // handle errors
    }
}); 

the handling of the errors violates at least two of the SOLID principles: single responsibility and dependency inversion.

Handling a refused connection is an entirely different responsibility and technical domain than handling a possible hacking attack or a lack of matches for a search. These may even require different people to handle them, both in code and in forensic analysis.

NoIndexFound is an implementation detail of the search engine, while NoMatch is an appropriate part of its interface, depending on input (more on that later).

We designed and coded against an implementation and not an abstraction, and thus also made the route handler do too much, on different levels of abstraction.

How do you approach such a problem?

Splitting (which is the go-to solution) is not really an option, nor is it really correct, as the route handler is expected to be a sort of hub that orchestrates these various activities. Besides, if you look at the call interfaces (getMatches, getProducts, getAvailable), there is an acceptable balance in abstraction.

The solution is to separate different levels of errors, much like you would with function calls. The search engine system should not expose all these technical errors, instead opting for a, possibly single, error that is global within our distributed system: SearchSystemError.

The more technical the system is, the fewer errors it should expose to a higher-level system.


Rule #7. Abstract Error Implementation Details

Abstract away implementation details in favor of a minimal API surface at the error level. A consuming system should only receive errors it would know how to handle.
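A minimal sketch of this rule, reusing the NoIndexFound and ConnectionRefused names from the naive example above. The shape of SearchSystemError (code, errorId) and the injected `query` and `report` functions are assumptions made for illustration, not a prescribed API:

```javascript
'use strict';

const crypto = require('node:crypto');

// Technical errors that are internal to the search system.
class ConnectionRefused extends Error {}
class NoIndexFound extends Error {}

// The single error the search system exposes to its consumers.
class SearchSystemError extends Error {
    constructor(errorId) {
        super('Search system failure');
        this.name = 'SearchSystemError';
        this.code = 'SEARCH_SYSTEM_ERROR';
        this.errorId = errorId; // lets personnel locate the detailed report
    }
}

// Wraps a low-level query function so that callers never see technical errors.
// `query` and `report` are hypothetical and injected to keep the sketch self-contained.
function createGetMatches(query, report) {
    return async function getMatches(text) {
        try {
            return await query(text);
        } catch (err) {
            const errorId = crypto.randomUUID();
            report(errorId, err); // full details stay inside the search system
            throw new SearchSystemError(errorId);
        }
    };
}
```

With this in place, the route handler only needs a single `instanceof SearchSystemError` branch for everything that can go wrong inside the search engine.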


Technical errors should be combined, in case of inescapable failure, into a single one (with code, context and ID as described earlier). In addition to letting as few implementation details as possible (and NoIndexFound is an implementation detail) spill into the consuming system, you should consider how these would be reported.

In the same manner, the reporting of errors should not be delegated to higher-level code, but rather done at the boundary, especially in case of technical errors.

Reporting may also contain quite a bit of information, so passing it over the network only to be reported doesn’t make sense.

Finally, different personnel may handle different levels of error in different subsystems, so an administrator with access to the search engine servers and logs may not have the same access rights for a higher-order system like catalog system above.

With that said, you most certainly should still include the generated error ID with the single abstract error the system reports.


Rule #8. Report at system’s edge

Report errors at the system’s edge (for example, before it returns an HTTP response), to ensure that:

  • they can be accessed by the correct personnel
  • they don’t leak implementation details of error context, codes etc. into a higher-level system
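One way this rule could look in practice is sketched below as an Express-style error-handling middleware (the four-argument form). The `log` function and the fields placed in the report are assumptions; the point is only that the detailed report stays local, while the response carries just the abstract code and the error ID:

```javascript
'use strict';

// Builds an Express-style error-handling middleware.
// `log` is a hypothetical reporting function, e.g. a logger local to this subsystem.
function createEdgeReporter(log) {
    return function reportAtEdge(err, req, res, next) {
        // Full details (stack, request URL) are reported here, inside the subsystem...
        log({
            errorId: err.errorId,
            code: err.code,
            message: err.message,
            stack: err.stack,
            url: req.url
        });
        // ...while the consuming system receives only the abstract code and the ID.
        res.status(500).json({
            code: err.code || 'INTERNAL_ERROR',
            errorId: err.errorId
        });
    };
}
```

In an Express application this would typically be registered after all routes, for example `app.use(createEdgeReporter(console.error));`.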

 

A Word on Business Errors

The example above, where we search for a product using text that a user may have entered, suggested that the NoMatch exception may actually not be an error at all.

Indeed, a text search that returns an empty result set is a valid business result and should not be treated like an error. There is nothing really exceptional about it and there should be no need to report it, especially considering that it may obfuscate the actual, important errors.

On the other hand, if a request to find a product was made by the ordering system, it is more likely to be made with a specific ID, as registered in the user’s cart. Receiving NoMatch for this kind of interaction would probably indicate an error, possibly an operational or programmer one, as having invalid IDs in the cart or missing IDs in the catalog database is rarely a business error.


Rule #9. Create exceptions for exceptional conditions

Create, deliver and report as errors/exceptions only those occurrences that are considered exceptional and are not a part of a normal business operation.
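The distinction between the two interactions can be sketched as follows. The in-memory `catalog` array and the ProductNotFound class are hypothetical stand-ins for the real catalog database and error hierarchy:

```javascript
'use strict';

class ProductNotFound extends Error {
    constructor(id) {
        super(`No product with ID ${id}`);
        this.name = 'ProductNotFound';
        this.productId = id;
    }
}

// Text search: an empty result is a valid business outcome, not an error.
function searchByText(catalog, text) {
    return catalog.filter((product) => product.name.includes(text));
}

// Lookup by ID (e.g. coming from a cart): a missing product is exceptional.
function getById(catalog, id) {
    const product = catalog.find((candidate) => candidate.id === id);
    if (!product) {
        throw new ProductNotFound(id);
    }
    return product;
}
```

The same "nothing found" condition is a plain return value in one function and an exception in the other, depending on what the caller could reasonably expect.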


 

A question arises, then, what to do with these non-exceptional “errors”? They probably carry some value and should probably be reported/collected somehow?

The answer is a resounding yes. They should be reported and collected in your data — along with all other information you use for BI.

To Crash or not to Crash?

Now that we’ve defined a clear separation between business and technical errors, let’s consider these technical errors again, especially the programmer errors.

Imagine having an unexpected condition that is the result of a programmer error. A bug.

There is no way it could have been anticipated or the code would have been fixed. There is no way to recover from the error, due to the state being likely corrupted or the control flow broken.

What kind of handling can we propose in such a case?

The most common suggestion is to crash the system as fast as possible:

  • to notify the necessary personnel as soon as possible, so the bug can be fixed
  • to prevent the bug from being propagated around the system, causing further and more difficult to diagnose issues in other, sometimes unrelated parts

The last part is especially important, since, in some cases, an unhandled bug may not actually result in an obvious error, but rather cause a business rule to be broken silently, resulting in an erroneous state that would be extremely difficult to trace and fix.


 

Rule #10. Crash on bugs, then crash the bugs

When a bug, a programmer error, is discovered:

  1. report the error
  2. crash the relevant part of the system as soon as possible
  3. restart the system automatically

Do not let the error propagate, contaminating the rest of the system.
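A minimal sketch of the first two steps, assuming a hypothetical `report` function. The `exit` function is injected only so that the handler can be exercised without terminating the process, and the automatic restart (step 3) is assumed to be handled by an external process supervisor:

```javascript
'use strict';

// Builds a last-resort handler: report the bug, then exit with a non-zero code.
function createCrashHandler(report, exit = process.exit) {
    return function crash(err) {
        try {
            report(err); // step 1: report the error
        } finally {
            exit(1); // step 2: crash, even if the reporting itself fails
        }
    };
}
```

Wiring it up could be a single line, such as `process.on('uncaughtException', createCrashHandler(console.error));`, with the same handler optionally registered for the `unhandledRejection` event.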


 

With that said, if we can’t handle programmer errors, and we don’t know when or why they happen ahead of time, what can we do, aside from crashing (which is the programming equivalent of taking the ball and going home)?

The first, expected answer, is simple – test, test and then test some more.

The second is a little more sophisticated — consider using assertions in addition to tests.

Async Context

Node is a non-blocking I/O platform, which means, in context of distributed applications, that each Node service should:

  • process an incoming request as quickly as possible
  • distribute the necessary computation to various subsystems
  • provide these subsystems with callbacks/receive promises/initialize generators
  • continue to process the next request
  • somewhere in the future, process callbacks, promise fulfillments or generator yields

This paradigm is what makes Node such a popular and powerful platform. It also introduces one of the major hardships in dealing with a distributed Node system — session isolation.

This is not a formal term, but the premise is relatively simple: since Node services a lot of interweaving requests, it is possible (and frequent) that data prepared by one is consumed by another, if we are not careful.

[Diagram: two interleaving requests, A and B, sharing a local variable X]

The example above shows how easily, and in a relatively non-obvious way, such an overlap can occur due to asynchrony:

  1. first client issues a request A
  2. Node processes it, updates some local variable X and issues a request A-1 to some subsystem
  3. second client issues a request B
  4. Node processes it, updates the same variable X and issues a request B-1 to another subsystem

At this point the request contexts (again, very informal name) are isolated. However:

  1. response to request A-1 returns, reads variable X
  2. Node returns the response to the first client, based on variable X

The value of variable X used to calculate the response for the first client is incorrect, which may in turn result in incorrect data or sensitive information of the second client being returned to the first one!

The issue is compounded further when competing requests result in similar paths through the distributed system with various asynchronous calls within each subsystem.
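The overlap described above can be reproduced in a few lines. In this sketch the module-level `currentUser` plays the role of variable X, and `delay` stands in for a call to a subsystem; both names are made up for the demonstration:

```javascript
'use strict';

// Simulates a subsystem call that takes `ms` milliseconds.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

let currentUser = null; // the shared "variable X"

async function handleRequest(user, subsystemLatency) {
    currentUser = user; // the request updates the shared variable
    await delay(subsystemLatency); // other requests run while we wait
    return `response for ${currentUser}`; // may belong to another request by now
}

async function demo() {
    // Request A is slow; request B arrives while A is still waiting.
    const [a, b] = await Promise.all([
        handleRequest('alice', 20),
        handleRequest('bob', 5)
    ]);
    return {a, b};
}
```

Calling `demo()` resolves to `{ a: 'response for bob', b: 'response for bob' }`: the first client receives a response computed from the second client’s data.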

Racing to Errors

What does that mean to us, in the context of error handling? Simply put, we have to take special care not to report errors for the wrong user (think of a user ID being placed in that variable X).

There is a need to connect each asynchronous callback to its origin, to create a sort of asynchronously propagating context.

The following schematic code shows a possible desired API:

function foo() {
    const {id} = ...
    ...
    process.context.set('id', id);
    ...
    bar(value, (err, result) => {
        ...
    });
}

// somewhere else, in a different file

function bar(param, callback) {
    const id = process.context.get('id');
    ...    
    callback(null, result);
}

Explicit Propagation of Context

One may claim that there are ways to achieve the above pattern, most commonly:

  • via Node global or some module-level closure or,
  • by using the Node module system to create a singleton context (cached by require) to access from any module

This ensures propagation of context, but leaves the issue of isolation unresolved. Now, each request needs to be separated in that context by some sort of ID. This probably means that the ID needs to be generated somewhere at the top of the asynchronous chain (likely in your Express route handler) and propagated through the entire web of the distributed system:

function foo() {
    const {id} = ...
    ...
    global[contextID] = {id}; // context
    ...
    bar(value, contextID, (err, result) => {
        ...
    });
}

// somewhere else, in a different file

function bar(param, contextID, callback) {
    const {id} = global[contextID];
    ...
    callback(null, result);
}

This is ugly and error-prone, to say the least.

Note that such a context ID will have to be passed explicitly across the distributed system boundaries (over HTTP or another network protocol) with any async context solution. There is no implicit way to propagate such an ID between different network endpoints.
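Passing the ID over HTTP usually boils down to a header. A small sketch of both directions follows; the header name `x-context-id` and the helper names are assumptions, and any name agreed upon across the subsystems would do:

```javascript
'use strict';

// Hypothetical header name shared by all subsystems.
const CONTEXT_HEADER = 'x-context-id';

// Outgoing call: copy the headers and attach the current context ID.
function withContextID(headers, contextID) {
    return {...headers, [CONTEXT_HEADER]: contextID};
}

// Incoming call: reuse the caller's context ID, or start a new chain.
// Node lowercases incoming header names, hence the lowercase constant.
function readContextID(headers, generateID) {
    return headers[CONTEXT_HEADER] || generateID();
}
```

The receiving service would call `readContextID(req.headers, ...)` at the very top of its route handler, and pass the result to `withContextID` for every downstream request it issues.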

So, is async context nothing but a dream?

Domains to the Rescue?

Node appeared to have a partial solution, at the error level: a built-in platform construct called Domains.

In some platforms and frameworks the concept is called zones. The idea is that you create an isolated domain (zone) that is accessible only to asynchronous (and synchronous) code originating from the same initial call stack.

An example would be:

const domain = require('domain');

app.get('/search/:term', (req, res) => {
    const {term} = req.params;
    const iid = generateImpressionID();
    // one domain per request, so that errors stay tied to this request's iid
    const d = domain.create();
    ...
    d.on('error', (err) => {
        report(iid, err);
    });
    d.run(() => {
        search(term, () => {
            ...
        });
    });
});

If any of the code enclosed in d.run throws an exception, it is caught in the defined error callback and is isolated to that specific session/isolated context, thus negating the need to propagate the IDs.

Alas, Domains have been deprecated for quite some time now, due to being error-prone and fragile. A number of implementations on top of Domains, or instead of Domains, exist. Most, if not all, of them monkey-patch all async methods, which is something we’d recommend not doing, if possible, in production systems.

For an example of an offender, see one of the more popular libraries that many such solutions rely upon — async-listener.

Horrible, horrible code.

Not all is lost, however. There is a candidate to replace Domains in Node core, a lower-level API that has been slowly brewing for more than 2 years but is expected to be much more robust: async-wrap.

Async Wrap

Using the magic of async-wrap, we would be able to write code like this:

app.get('/search/:term', (req, res) => {
    const {term} = req.params;
    const iid = generateImpressionID();
    createContext({iid});

    searchItemByText(term).then(({catalogID}) => {
        getProduct(catalogID).then((product) => {
            track(iid, 'product --', product);
            res.status(200).send(product);
        }).catch((err) => {
            report(iid, err);
            res.status(404).send(err);
        });
    }).catch((err) => {
        report(iid, err);
        res.status(404).send(err);
    });
});

and then use it, in other places, to retrieve the impression ID: 

function searchItemByText(term) {
    const {iid} = getContext();

    return searchText(term).then((results) => {
        const {ref: catalogID} = results[0];

        track(iid, `item by text: ${term} --`, {catalogID});
        return {catalogID};
    }).catch((err) => {
        report(iid, err);
        throw err; // rethrow as is, keeping the original type and stack
    });
}

 

function getProduct(catalogID) {
    const {iid} = getContext();

    return db.findOne({catalogID}).then((product) => {
        track(iid, `product by catalogID: ${catalogID} --`, product);
        return product;
    }).catch((err) => {
        report(iid, err);
        throw err; // rethrow as is, keeping the original type and stack
    });
}

which is precisely what we wanted to achieve.

For now, however, this is not a part of the recommended Node core API, but rather an experimental feature.

For the curious, see the async-hook library, which creates a thin wrapper around async-wrap and is the one the code above uses behind that createContext method.

An example schematic implementation may look similar to the following:

'use strict';

const asyncHook = require('async-hook');

// contexts, keyed by handle uid
const contexts = {};
// uid of the handle whose callback is currently being processed
let currentUid = null;

function init(uid, handle, provider, parentUid, parentHandle) {
    // a new handle descends from its explicit parent, if there is one,
    // or from the handle that is currently being processed
    const ancestorUid = parentUid != null ? parentUid : currentUid;
    // if the ancestor created a context or already has one
    if (contexts[ancestorUid] !== undefined) {
        // save its context for the new handle
        contexts[uid] = contexts[ancestorUid];
    }
}

function pre(uid, handle) {
    // do some state initialization
    // save currently "processed" handle
    currentUid = uid;
}

function post(uid, handle, didThrow) {
    // restore state
    currentUid = null;
}

function destroy(uid) {
    // remove context for the specific handle
    delete contexts[uid];
}

function createContext(contextData) {
    // add context for the current handle uid
    contexts[currentUid] = contextData;
}

function getContext() {
    return contexts[currentUid];
}

// register the hooks
asyncHook.addHooks({init, pre, post, destroy});
asyncHook.enable();

Clearly, this is potentially an excellent way to provide things like asynchronously propagated storage or long stack traces. For a deeper dive, look into one of the existing libraries the above code draws inspiration from.
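Newer Node versions ship an AsyncLocalStorage class in the core async_hooks module that provides this kind of propagating context. Below is a minimal sketch of a createContext/getContext pair built on top of it. Note that, unlike the schematic code above, createContext here takes a callback that delimits the context, which is a change in API shape made for this example, and `handle` is a hypothetical stand-in for a route handler:

```javascript
'use strict';

const {AsyncLocalStorage} = require('node:async_hooks');

const storage = new AsyncLocalStorage();

// Runs `callback` (and every async operation it starts) inside a new context.
function createContext(contextData, callback) {
    return storage.run(contextData, callback);
}

// Returns the context of the current async chain, or undefined outside of one.
function getContext() {
    return storage.getStore();
}

// Two overlapping "requests", each with its own impression ID.
async function handle(iid, latency) {
    return createContext({iid}, async () => {
        await new Promise((resolve) => setTimeout(resolve, latency));
        return getContext().iid; // still this request's ID, despite the overlap
    });
}
```

Running `Promise.all([handle('A', 20), handle('B', 5)])` resolves to `['A', 'B']`: each asynchronous chain sees its own context, with no ID passed explicitly between the functions.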

Summary

As we went through patterns for creating, delivering and handling errors in a distributed system, a set of high-level rules has been established. These should be, as we’ve mentioned before, considered recommendations and not hard and fast rules, if only because, for brevity and due to the format/size limitations of an article, we skipped over a number of interesting exceptions (ugh, we almost succeeded in avoiding this).

These exceptions may carry a different weight in your system and as such may emphasize or negate specific rules.

“Know the rules to break the rules”