Using child_process to run code inside a Cloud Function seems to produce significant latency spikes around require statements. I looked into it and came up with some stats.


_Code to reproduce this is available on GitHub._

The whole idea of Asserted is that it allows you to run custom test code against your application continuously. At the time I started building it, I figured the fastest and easiest way to do that was using GCP Cloud Functions. Cloud Functions have been around for years and have well-known performance and security characteristics, so it seemed like a safe bet.

At its core, the implementation was simple. Copy code into a Cloud Function and then use child_process to execute it safely with a timeout.
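Stripped down, it looked something like the sketch below. The handler name, paths, and timeout here are placeholders rather than the production code:

```js
const { exec } = require('child_process');
const { promisify } = require('util');

const execAsync = promisify(exec);

// Hypothetical HTTP Cloud Function entry point. Assume the user's test code
// has already been copied to /tmp/test before this runs.
exports.runTests = async (req, res) => {
  try {
    // Run the code in a separate Node process, killing it if it exceeds the
    // timeout so a hung test can't block the function forever.
    const { stdout, stderr } = await execAsync('node /tmp/test/index.js', {
      timeout: 30 * 1000, // placeholder timeout
    });
    res.status(200).send({ stdout, stderr });
  } catch (error) {
    // exec rejects on non-zero exit codes and when the timeout kills the child
    res.status(500).send({ error: error.message });
  }
};
```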

This seemed to work great at first. Relatively low-latency, and easy to maintain.

But this code runs continuously, as often as every minute, forever. In less than a day, I got a timeout on the child_process.exec call.

The Mystery Begins

Logically, I assumed it was my fault, because most things are.

The code I was executing was calling API endpoints, and maybe they were holding the connection open too long or something. I ignored it at first, but then I noticed that when I ran the code locally on my machine for extended periods, the timeouts didn’t happen. So it wasn’t the code exactly, and it wasn’t the API I was calling from inside that code.

I started investigating. I did the usual debugging: adding console.log statements everywhere to see where the holdup was, and setting the exec to inherit stdio so I could easily see the logs.

I added some timestamped logs around the child_process.exec:
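Simplified, the instrumentation looked roughly like this (the command and path are placeholders, and the child’s output is piped through so its logs land in the same place):

```js
const { exec } = require('child_process');

console.log(`${new Date().toISOString()} - starting child process`);

const child = exec('node /tmp/test/index.js', { timeout: 30 * 1000 }, (error) => {
  if (error) {
    console.log(`${new Date().toISOString()} - child process failed: ${error.message}`);
  } else {
    console.log(`${new Date().toISOString()} - child process finished`);
  }
});

// Forward the child's output so its log lines show up alongside the ones above
child.stdout.pipe(process.stdout);
child.stderr.pipe(process.stderr);
```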

After running the function a number of times, I looked in GCP Logging, where I could see the log lines and the times they occurred.

I was surprised to see that the delay wasn’t happening within the bulk of the user code; it was happening between the start of the exec and the completion of the require statements.
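For the require timing to show up in the logs at all, the executed script logged a line before and after its requires. The libraries below are stand-ins for the actual dependency list, but the shape is the same:

```js
// Top of the script that runs inside the child process
console.log(`${new Date().toISOString()} - requires starting`);

// Stand-ins for the real dependencies: common, widely used NPM libraries
const axios = require('axios');
const _ = require('lodash');
const { expect } = require('chai');

console.log(`${new Date().toISOString()} - requires finished`);

// ...the actual user test code runs after this point
```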

There was huge variance in how long the require statements took to finish. Sometimes they would complete within 100 ms; other times they would take over 2 seconds, or never complete before the timeout.

That definitely seemed weird. These aren’t esoteric dependencies. They are some of the most commonly used libraries on NPM.

Profiling these require statements on my own machine showed negligible impact, so maybe it was something about Cloud Functions itself that was weird?

I decided to come up with a more formal test to see if I could track it down.

The Experiment

Environments

I had tried out Cloud Run around the same time and knew that I hadn’t seen the issue there, only in Cloud Functions. So I decided to do a three-way comparison. I would run the same code in three environments and compare the results:

  • Cloud Function — 2048 MB Memory — single 2.4 GHz CPU
  • Cloud Run — 2048 MB Memory — single vCPU
  • Local Docker — 2048 MB Memory — single CPU
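The measurement itself was deliberately simple: spawn a fresh Node process that does nothing but require a few libraries, time it, and repeat. A minimal version of that benchmark might look like this (the libraries and iteration count are placeholders; the real code is in the GitHub repo linked above):

```js
const { execSync } = require('child_process');

// One-liner run in a fresh Node process each time, so the require cache is
// always cold. The libraries here are stand-ins for the real dependency list.
const script = "const s = Date.now(); require('axios'); require('lodash'); console.log(Date.now() - s);";

const results = [];
for (let i = 0; i < 100; i++) {
  results.push(parseInt(execSync(`node -e "${script}"`).toString().trim(), 10));
}

results.sort((a, b) => a - b);
console.log('min:', results[0], 'ms');
console.log('p50:', results[Math.floor(results.length * 0.5)], 'ms');
console.log('p95:', results[Math.floor(results.length * 0.95)], 'ms');
console.log('max:', results[results.length - 1], 'ms');
```

The same script runs in each environment, and the comparison is between the distributions rather than the averages, since the occasional spike is the whole problem here.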
