The problem with naive deployments

The original deployment process was: SSH into the server, git pull, npm install, pm2 restart all. This caused 10–30 seconds of downtime per deploy. During lunch rush, that meant dropped orders and confused staff.

PM2 cluster mode

The foundation of zero-downtime deployment is running multiple instances of your app. PM2's cluster mode makes this straightforward:

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'mealpe-api',
    script: './src/server.js',
    instances: 'max',
    exec_mode: 'cluster',
    max_memory_restart: '500M',
    listen_timeout: 10000,
    kill_timeout: 5000
  }]
};

The key settings: instances: 'max' spawns one worker per CPU core, listen_timeout is how long PM2 waits for a new instance to start listening before treating it as failed, and kill_timeout is how long an old instance gets after SIGINT before PM2 sends SIGKILL.

Graceful reload instead of restart

The critical difference: pm2 reload instead of pm2 restart. Reload performs a rolling update — it starts new instances, waits for them to be ready, then gracefully shuts down old ones. No gap in service.

# Deploy script
#!/bin/bash
set -e

echo "Pulling latest code..."
git pull origin main

echo "Installing dependencies..."
npm ci --production

echo "Reloading application..."
pm2 reload ecosystem.config.js --update-env

echo "Deployment complete."

Graceful shutdown in the application

PM2 sends a SIGINT signal before killing a process. Your app needs to handle this properly — close database connections, finish in-flight requests, and clean up:

process.on('SIGINT', async () => {
  console.log('Graceful shutdown initiated');

  // Stop accepting new connections
  server.close(async () => {
    // Close database pool
    await db.end();
    process.exit(0);
  });

  // Force exit after timeout
  setTimeout(() => process.exit(1), 10000);
});

NGINX as the front door

NGINX acts as a reverse proxy in front of the app. It only needs to know about a single port: PM2's cluster master listens on 3000 and distributes connections across the workers. The key config for zero-downtime:

upstream mealpe_backend {
    server 127.0.0.1:3000;
    keepalive 64;
}

server {
    listen 443 ssl http2;
    server_name api.mealpe.in;

    location / {
        proxy_pass http://mealpe_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_next_upstream error timeout;
        proxy_connect_timeout 5s;
    }
}

The proxy_next_upstream directive tells NGINX to retry failed or timed-out requests instead of returning an error to the client. One caveat: with a single server entry there is no "next" upstream to try, so for retries to actually fire during a reload you can list the backend twice in the upstream block. Because every PM2 worker accepts connections on port 3000, a retried connection is picked up by a healthy worker.

CI/CD automation

The final piece: automating the whole process so deploys happen on git push to main. I use a simple webhook-based approach — GitHub sends a POST to a deploy endpoint on the server, which triggers the deploy script above.

For additional safety, the CI pipeline runs tests before the deploy webhook fires. If tests fail, the deploy never happens.

Results

  • Zero measured downtime during deployments.
  • Average deploy time: 45 seconds from push to live.
  • Confidence to deploy multiple times per day, even during peak hours.