When your platform processes meal orders for hospital patients, downtime isn’t just inconvenient: people don’t get fed. High availability is a clinical requirement, not merely a product target.

The original deployment process was: SSH into the server, git pull, npm install, and pm2 restart all. This caused a 10–30 second outage per deploy, during which our API dropped orders and left terminals hanging.

Here is exactly how I set up graceful, zero-downtime deployments for our Node.js server cluster using PM2’s native clustering and NGINX upstream proxies.


1. PM2 Cluster Configuration

The foundation of zero-downtime is simple: never let the number of active app instances drop to zero. PM2’s cluster mode makes it straightforward to scale your process across all cores:

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'mealpe-api',
    script: './dist/server.js',
    instances: 'max', // Scale to all available CPU cores
    exec_mode: 'cluster',
    max_memory_restart: '500M',
    listen_timeout: 10000, // Wait up to 10s for a new process to start listening
    kill_timeout: 5000     // Allow 5s for a clean shutdown before SIGKILL
  }]
};
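
  • instances: 'max': Spawns one worker process per available CPU core.
  • listen_timeout: The maximum time (in ms) PM2 waits for a new process to start listening before it stops waiting and carries on with the reload. On its own it does not wait for a database connection; for that, set wait_ready: true and have the app call process.send('ready') once its dependencies are up, in which case listen_timeout becomes the deadline for that signal.
  • kill_timeout: Gives a process 5 seconds after the shutdown signal to finish in-flight requests before PM2 sends SIGKILL.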
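
PM2 only knows that a worker's dependencies are up if the app says so. Below is a minimal sketch of explicit readiness signalling, assuming wait_ready: true has been added to the ecosystem file (it is not part of the config above). The helper name startAndSignalReady and the injected connectDb and listen functions are hypothetical; only process.send('ready') is PM2's documented convention.

```javascript
// Hypothetical helper: signal PM2 that this worker is ready for traffic.
// Only has an effect when the ecosystem file sets `wait_ready: true`.
async function startAndSignalReady({
  connectDb, // assumed: async function that opens the database pool
  listen,    // assumed: async function that resolves once the HTTP server listens
  send = typeof process.send === 'function' ? process.send.bind(process) : undefined,
}) {
  await connectDb();
  await listen();

  if (typeof send === 'function') {
    send('ready'); // PM2 treats the process as online after this message
    return true;
  }
  return false; // no IPC channel, e.g. started with plain `node server.js`
}
```

Ordering matters here: the message is sent only after both the database and the listener are up, so a worker that cannot reach the database is never marked ready.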

2. Transitioning to Graceful Reloads

Instead of calling pm2 restart (which kills all instances simultaneously), we transition to pm2 reload.

Reload initiates a rolling update: it spawns a new instance, waits for it to become online, then safely turns down an old instance. It repeats this pattern process-by-process, maintaining maximum API capacity:

#!/bin/bash
# Production deploy script (deploy.sh)
set -e

echo "Pulling latest branch code..."
git pull origin main

echo "Installing production-only dependencies..."
npm ci --omit=dev  # replaces the deprecated --production flag (npm 7+)

echo "Executing rolling reload..."
pm2 reload ecosystem.config.js --update-env

echo "Deploy successfully completed!"
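
The rolling pattern described above can be pictured with a toy model. This is only an illustration of the invariant (capacity never drops below the starting worker count), not PM2's actual implementation; rollingReload and its spawn callback are hypothetical names.

```javascript
// Toy model of a rolling reload: replace workers one at a time,
// always starting the replacement before retiring the old worker.
function rollingReload(oldWorkers, spawn) {
  const live = [...oldWorkers];
  let lowestCapacity = live.length;

  for (const old of oldWorkers) {
    const fresh = spawn(old);           // 1. start the replacement
    live.push(fresh);                   // 2. it comes online alongside the old one
    live.splice(live.indexOf(old), 1);  // 3. only then retire the old worker
    lowestCapacity = Math.min(lowestCapacity, live.length);
  }

  return { live, lowestCapacity };
}

const result = rollingReload(['w1', 'w2', 'w3'], (old) => `${old}-new`);
console.log(result.live, result.lowestCapacity);
```

A plain restart would correspond to emptying the list before refilling it, which is exactly the window in which requests get dropped.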

3. Implementing Application Graceful Shutdowns

PM2 sends a SIGINT trigger to your process before shutting it down. If your application doesn’t handle this signal, it terminates instantly, dropping all connections mid-transaction.

You must catch the SIGINT event, close the HTTP port to block new inbound traffic, finish active queries, and release database pools:

// ✅ Professional graceful shutdown hook in server.js
process.on('SIGINT', () => {
  console.log('SIGINT signal received. Starting graceful shutdown sequence...');

  // Stop the HTTP server from accepting new socket sessions
  server.close(async () => {
    console.log('HTTP server successfully closed.');

    try {
      // Release database connection pools cleanly
      await db.end();
      console.log('Database pools released. Exiting cleanly.');
      process.exit(0);
    } catch (err) {
      console.error('Error during database teardown:', err);
      process.exit(1);
    }
  });

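  // Force exit after 4 seconds if connections hang. This must stay below
  // PM2's kill_timeout (5000 ms), otherwise PM2 sends SIGKILL first.
  setTimeout(() => {
    console.warn('Forced shutdown active: connections did not close in time.');
    process.exit(1);
  }, 4000).unref(); // unref() so the timer alone never keeps the process alive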
});
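
One detail worth sketching: NGINX holds idle keep-alive connections to the backend, and on older Node.js versions server.close() waits for those as well. The variant below is a sketch under assumptions, not the production handler: createGracefulShutdown is a hypothetical factory, exit is injected so the logic can be exercised without killing the process, the 4000 ms default is chosen to stay under the 5000 ms kill_timeout, and server.closeIdleConnections() is only available from Node.js 18.2.

```javascript
// Hypothetical factory: builds a shutdown handler from injected dependencies.
function createGracefulShutdown({ server, db, exit = process.exit, timeoutMs = 4000 }) {
  let shuttingDown = false;

  return function shutdown() {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Safety net, kept below PM2's kill_timeout.
    const timer = setTimeout(() => exit(1), timeoutMs);
    timer.unref(); // do not keep the event loop alive just for this timer

    // Stop accepting new connections; the callback runs once open ones finish.
    server.close(async () => {
      try {
        await db.end(); // release the database pool
        clearTimeout(timer);
        exit(0);
      } catch (err) {
        clearTimeout(timer);
        exit(1);
      }
    });

    // Idle keep-alive sockets would otherwise delay close() on older Node.js.
    if (typeof server.closeIdleConnections === 'function') {
      server.closeIdleConnections();
    }
  };
}

// Usage (assumes `server` and `db` exist as in server.js):
// process.on('SIGINT', createGracefulShutdown({ server, db }));
```

The shuttingDown flag makes the handler safe to call twice, which can happen when a process manager and an operator both send a signal.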

4. Configuring NGINX Load-Balancing

NGINX is our front-door gateway. The key settings for zero-downtime routing are an upstream pool with keep-alive connections and a retry policy for requests that fail while a worker is being replaced:

upstream mealpe_backend {
    server 127.0.0.1:3000;
    keepalive 64; # Keep up to 64 idle backend connections open per NGINX worker
}

server {
    listen 443 ssl http2;
    server_name api.mealpe.in;

    location / {
        proxy_pass http://mealpe_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # 🚀 Crucial: on errors, timeouts, or 502/503, retry on the next upstream server
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 3s;
        proxy_read_timeout 15s;
    }
}

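With proxy_next_upstream, when a request fails with a connection error, a timeout, or a 502/503 response, NGINX retries it on the next server in the upstream group instead of returning the failure to the client. Two caveats apply. First, the pool above lists a single server (127.0.0.1:3000, the port shared by all PM2 workers), so there is no second server to fail over to; during a reload it is PM2's cluster mode that keeps that port served, and the retry only takes effect once the upstream lists more than one server. Second, NGINX does not retry non-idempotent requests such as POST once they have been sent upstream unless non_idempotent is added to the directive, which is only safe if the endpoint tolerates duplicates. Combined with rolling reloads and graceful shutdown, the client should not notice the deploy.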


5. Webhook CI/CD Automation

To keep deploy execution secure, we run a lightweight deployment daemon on our EC2 instance that exposes a secured webhook route. When a PR merges into main on GitHub, our GitHub Actions pipeline runs tests, builds the TypeScript bundle, and pings the webhook:

# GitHub Actions CI deployment step
- name: Trigger Server Deploy Webhook
  run: |
    curl --fail -X POST \
      -H "X-Hub-Signature: sha256=${{ secrets.DEPLOY_SECRET }}" \
      https://api.mealpe.in/webhooks/deploy

This triggers the automated deploy.sh script locally on the server.
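
How the daemon checks the request is not shown here. The workflow step above sends the secret value itself in the header; a common alternative is to send an HMAC of the request body and verify it on the server, so the secret never travels over the wire. The sketch below assumes that HMAC scheme, which is a swapped-in technique rather than the setup described in this post. The helper names signBody and isValidSignature are hypothetical.

```javascript
const crypto = require('crypto');

// Hypothetical helper: compute "sha256=<hex digest>" for a raw request body.
function signBody(secret, rawBody) {
  const digest = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  return `sha256=${digest}`;
}

// Hypothetical helper: check a signature header against the raw body.
function isValidSignature(secret, rawBody, headerValue) {
  if (typeof headerValue !== 'string') return false;

  const expected = Buffer.from(signBody(secret, rawBody));
  const received = Buffer.from(headerValue);

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}
```

The constant-time comparison avoids leaking how many leading characters of a guessed signature were correct. Under this scheme the caller would compute the signature over the exact bytes it posts, and the daemon would run deploy.sh only when isValidSignature returns true.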


The Outcome

  • 0 seconds of user-facing downtime recorded.
  • 45-second deployment pipeline from git merge to production availability.
  • Confidence to deploy minor changes and hotfixes safely during standard hours.