We work on a large-scale system made up of many microservices, most of them running on Node.js. Services are deployed as Docker containers, and code reaches production on a daily basis.
Looking at NewRelic, we noticed that after each deploy of our services, memory started to rise immediately. It looked very much like a leak, but we wanted to make sure before we all started diving into heap dumps.
We tried reproducing the leak on a local machine with a heavy stress test, but memory was just fine.
So, we found a service that wasn’t deployed often, and started to dig into data with NewRelic.
By default, NewRelic shows the average memory use over a time period, but it can also show the data as a graph, which makes trends easy to see.
We selected a wide time range so we could see when the leak started. This is what we found:
In NewRelic you can zoom in or out by selecting different time ranges on the graph. Each time memory was cleared, it was due to a restart or a deploy. So we zoomed in on the point where memory first started to climb.
Now things got much clearer. Memory was stable until March 9th, and a leak clearly started right after the deployment made on that date (our CI reports these events to NewRelic, which marks them with a blue line).
From that point on, we just had to check GitHub for the commits under that tag. There were more than a few features added in that version, but one change in particular caught our eye.
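The same "find the deploy after which memory started to climb" check can be sketched in code. This is only an illustration of the reasoning, not something NewRelic provides: the sample format (`t` in hours, `mb` in megabytes), the function names, and the growth threshold are all assumptions made up for the example.

```javascript
// Least-squares slope of memory (MB) over time (hours) for a list of samples.
// Each sample is a hypothetical { t, mb } object.
function slope(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, p) => sum + p.t, 0) / n;
  const meanMb = samples.reduce((sum, p) => sum + p.mb, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.mb - meanMb);
    den += (p.t - meanT) * (p.t - meanT);
  }
  return den === 0 ? 0 : num / den;
}

// Return the time of the first deploy after which memory grows faster than
// `threshold` MB per hour, or null if no deploy window exceeds it.
function firstLeakyDeploy(samples, deployTimes, threshold) {
  const sorted = deployTimes.slice().sort((a, b) => a - b);
  for (let i = 0; i < sorted.length; i++) {
    const start = sorted[i];
    const end = i + 1 < sorted.length ? sorted[i + 1] : Infinity;
    // Only look at samples between this deploy and the next one,
    // because every deploy or restart resets memory.
    const windowSamples = samples.filter(p => p.t >= start && p.t < end);
    if (slope(windowSamples) > threshold) return start;
  }
  return null;
}

// Made-up data: flat after the deploy at t=0, climbing after the deploy at t=3.
const samples = [
  { t: 0, mb: 100 }, { t: 1, mb: 100 }, { t: 2, mb: 100 },
  { t: 3, mb: 100 }, { t: 4, mb: 110 }, { t: 5, mb: 120 },
];
console.log(firstLeakyDeploy(samples, [0, 3], 1)); // prints 3
```

Splitting the samples at deploy boundaries matters here: a slope computed across a restart would mix the climb with the drop and could hide the leak.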
We saw that we also changed our Dockerfile to use an image with Node version 7.6.
At that point Node was only a suspect, so we decided to check one more thing (which, in hindsight, we could have done earlier; we just didn't suspect that the leak came from Node itself).
NewRelic lets you add different metrics to dashboards, by allowing you to choose an application name and then a metric that you want to see. Luckily it tracks both heap and non-heap memory.
The leak was in non-heap memory, which made Node 7.6 the primary suspect.
We had upgraded to Node 7.6 so that we could use async/await. To verify our conclusion, we could either revert to a lower version of Node or move to a higher one.
In general we always like to move forward. We use TDD, and our code is covered with unit, system, integration, and e2e tests. So we would rather move forward, fail quickly, and stay up to date.
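For readers without a monitoring dashboard, the heap versus non-heap split can be roughly approximated from inside the process with Node's built-in `process.memoryUsage()`. The sketch below assumes that "non-heap" means resident memory (`rss`) minus the V8 heap (`heapTotal`); that definition and the helper names are our assumptions for illustration, not necessarily how NewRelic computes its metric.

```javascript
// Split a process.memoryUsage() result into V8 heap and "everything else".
// Assumption: non-heap is approximated as rss - heapTotal, clamped at zero.
function splitMemory(usage) {
  const heap = usage.heapTotal;
  const nonHeap = Math.max(0, usage.rss - usage.heapTotal);
  return { heap, nonHeap };
}

// Convert bytes to megabytes, rounded to one decimal place.
function toMB(bytes) {
  return Math.round((bytes / 1024 / 1024) * 10) / 10;
}

// Take one sample of the current process.
function sampleMemory() {
  const { heap, nonHeap } = splitMemory(process.memoryUsage());
  return { heapMB: toMB(heap), nonHeapMB: toMB(nonHeap) };
}

console.log(sampleMemory());
```

Logging `sampleMemory()` periodically (for example from a `setInterval`) and watching which of the two numbers keeps growing gives a first hint: a steadily rising `heapMB` points at JavaScript objects, while a rising `nonHeapMB` with a flat heap points at native code, as it did in our case.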
We updated our Dockerfile to use Node 7.10, and here is the result:
The memory leak was gone.
When we searched the interwebs for other clues about this memory leak, we found only one documented report of the bug that probably caused it: https://github.com/nodejs/node/issues/12019. The leak appears to have been in the crypto module (https://github.com/nodejs/node/pull/12089), where some certificate validity checks failed to free memory before returning.
We expected to find more evidence of this leak online, but it was hardly documented. So we posted our findings to a Node community group on Facebook and wrote this blog post in the hope that it will help others who run into the same problem.