When your platform processes meal orders for hospital patients, downtime isn’t just inconvenient - people don’t get fed. High availability is a clinical requirement, not just a product target.
The original deployment process was: SSH into the server, git pull, npm install, and pm2 restart all. This caused a 10–30 second outage per deploy, during which our API dropped orders and hung terminals.
Here is exactly how I set up graceful, zero-downtime deployments for our Node.js server cluster using PM2’s native clustering and NGINX upstream proxies.
1. PM2 Cluster Configuration
The foundation of zero-downtime is simple: never let the number of active app instances drop to zero. PM2’s cluster mode makes it straightforward to scale your process across all cores:
// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'mealpe-api',
    script: './dist/server.js',
    instances: 'max',          // Scale to all available CPU cores
    exec_mode: 'cluster',
    max_memory_restart: '500M',
    listen_timeout: 10000,     // Wait 10s for boot signal
    kill_timeout: 5000         // Wait 5s for clean close
  }]
};
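- instances: 'max': Spawns one worker process per CPU core.
- listen_timeout: How long PM2 waits (10 seconds here) for a new process to start listening before it carries on with the reload anyway. To hold the reload until a database connection or socket handshake has completed, PM2 also needs wait_ready: true, and the app must call process.send('ready') once it is up.
- kill_timeout: How long PM2 waits after sending SIGINT before it sends SIGKILL. Here, each process gets 5 seconds to finish in-flight requests before it is forced to close.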
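If you want PM2 to wait for the app's own readiness rather than just the listen event, the app has to send the 'ready' message itself, and the ecosystem file needs wait_ready: true. A minimal sketch of the app side, where connectDatabase, signalReady, and startServer are hypothetical names standing in for whatever your server.js actually does:

```javascript
const http = require('node:http');

// Hypothetical stand-in for the real database client's connect call.
async function connectDatabase() {
  return { connected: true };
}

// Under PM2 the process has an IPC channel, so process.send exists.
// When run directly with `node`, it is undefined and the signal is skipped.
function signalReady() {
  if (typeof process.send === 'function') process.send('ready');
}

async function startServer({ port = 3000, connect = connectDatabase, notify = signalReady } = {}) {
  await connect(); // bring up dependencies first

  const server = http.createServer((req, res) => res.end('ok'));
  await new Promise((resolve, reject) => {
    server.once('error', reject);
    server.listen(port, resolve);
  });

  notify(); // only now tell PM2 this worker can take traffic
  return server;
}

// In server.js this would be called once at startup: startServer({ port: 3000 });
```

With this in place, listen_timeout acts as the upper bound on how long PM2 waits for the 'ready' message before continuing the reload.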
2. Transitioning to Graceful Reloads
Instead of calling pm2 restart (which kills all instances simultaneously), we transition to pm2 reload.
Reload initiates a rolling update: it spawns a new instance, waits for it to become online, then safely turns down an old instance. It repeats this pattern process-by-process, maintaining maximum API capacity:
#!/bin/bash
# Production deploy script (deploy.sh)
set -e
echo "Pulling latest branch code..."
git pull origin main
echo "Installing production-only dependencies..."
npm ci --omit=dev
echo "Executing rolling reload..."
pm2 reload ecosystem.config.js --update-env
echo "Deploy completed successfully!"
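To see why the rolling pattern keeps capacity up, here is a toy model of the two strategies. It is only an illustration of the ordering described above, not PM2's actual implementation, and the function names are made up for this sketch:

```javascript
// Rolling reload: start a replacement first, then retire one old worker,
// and track the lowest number of live workers seen along the way.
function simulateRollingReload(workerCount) {
  let live = Array.from({ length: workerCount }, (_, i) => `old-${i}`);
  let minLive = live.length;
  for (let i = 0; i < workerCount; i++) {
    live.push(`new-${i}`);                          // new worker comes online
    live = live.filter((id) => id !== `old-${i}`);  // old worker is turned down
    minLive = Math.min(minLive, live.length);
  }
  return { live, minLive };
}

// Restart: every old worker stops before any replacement is up.
function simulateRestart(workerCount) {
  let live = Array.from({ length: workerCount }, (_, i) => `old-${i}`);
  live = [];                                        // all workers killed at once
  const minLive = live.length;
  live = Array.from({ length: workerCount }, (_, i) => `new-${i}`);
  return { live, minLive };
}

console.log(simulateRollingReload(4).minLive); // 4
console.log(simulateRestart(4).minLive);       // 0
```

The restart path passes through a moment with zero live workers, which is the window in which requests get dropped.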
3. Implementing Application Graceful Shutdowns
PM2 sends a SIGINT signal to your process before shutting it down. If your application doesn’t handle this signal, Node.js exits immediately, dropping all connections mid-transaction.
You must catch the SIGINT event, close the HTTP port to block new inbound traffic, finish active queries, and release database pools:
// ✅ Professional graceful shutdown hook in server.js
process.on('SIGINT', () => {
  console.log('SIGINT signal received. Starting graceful shutdown sequence...');
  // Stop the HTTP server from accepting new connections;
  // the callback runs once the open connections have finished
  server.close(async () => {
    console.log('HTTP server successfully closed.');
    try {
      // Release database connection pools cleanly
      await db.end();
      console.log('Database pools released. Exiting cleanly.');
      process.exit(0);
    } catch (err) {
      console.error('Error during database teardown:', err);
      process.exit(1);
    }
  });
  // Force exit after 4 seconds if connections hang. This must stay below
  // PM2's kill_timeout (5000 ms), otherwise PM2 sends SIGKILL first.
  setTimeout(() => {
    console.warn('Forced shutdown active: connections did not close in time.');
    process.exit(1);
  }, 4000).unref();
});
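One detail worth checking: if a reverse proxy holds keep-alive connections open to the app, server.close() on older Node.js versions waits for those idle sockets too, so the shutdown can run into the forced-exit timer. A hedged sketch of a helper that also drops idle connections; closeGracefully is a hypothetical name, and server.closeIdleConnections() only exists in Node.js 18.2 and later (from Node.js 19, server.close() closes idle connections on its own):

```javascript
const http = require('node:http');

// Resolves once the server has stopped listening and all sockets are closed.
function closeGracefully(server) {
  return new Promise((resolve, reject) => {
    // Stop accepting new connections; callback fires when none remain open
    server.close((err) => (err ? reject(err) : resolve()));

    // Drop keep-alive sockets that have no request in flight.
    // Requests still being served are left alone to finish.
    if (typeof server.closeIdleConnections === 'function') {
      server.closeIdleConnections();
    }
  });
}

// Usage inside a SIGINT handler (server is assumed to be an http.Server):
//   await closeGracefully(server);
//   await db.end();
```

The design choice here is to close only idle sockets rather than calling server.closeAllConnections(), which would also cut off requests that are still being processed.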
4. Configuring NGINX Load-Balancing
NGINX is our front-door gateway. The key configurations for zero-downtime routing include setting up an upstream pool and instructing NGINX to pass traffic to active processes on failures:
upstream mealpe_backend {
    server 127.0.0.1:3000;
    keepalive 64; # Keep connection channels open to reduce latency
}
server {
    listen 443 ssl http2;
    server_name api.mealpe.in;
    location / {
        proxy_pass http://mealpe_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # 🚀 Crucial: if one instance is reloading, try next active worker
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 3s;
        proxy_read_timeout 15s;
    }
}
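With proxy_next_upstream, a request that hits a connection error, a timeout, or a 502/503 response is passed to the next server in the upstream group instead of being returned to the client as a failure. Two caveats apply. First, the retry needs another server entry to go to: with a single address in the upstream block, as above, there is no next server, so continuity during a deploy comes from PM2 keeping live workers listening on port 3000 throughout the rolling reload. Second, NGINX does not retry non-idempotent requests such as POST once they have been sent to the backend unless non_idempotent is added to the directive, and that is only safe if order creation is itself idempotent. With the rolling reload and graceful shutdown in place, the client should have no awareness of the deploy event.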
5. Webhook CI/CD Automation
To trigger deploys securely, we run a lightweight deployment daemon on our EC2 instance that exposes a secured webhook route. When a PR merges into main on GitHub, our GitHub Actions pipeline runs the tests, builds the TypeScript bundle, and calls the webhook:
# GitHub Actions CI deployment step
- name: Trigger Server Deploy Webhook
  run: |
    # --fail makes the step fail if the webhook answers with an HTTP error
    curl --fail -X POST \
      -H "X-Hub-Signature: sha256=${{ secrets.DEPLOY_SECRET }}" \
      https://api.mealpe.in/webhooks/deploy
This triggers the automated deploy.sh script locally on the server.
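The daemon itself is not shown here, so the following is only a sketch of what the receiving side could look like. It assumes a stricter scheme than the curl call above: the CI step signs the request body with HMAC-SHA256, the way GitHub's own webhooks do, so the shared secret never travels over the wire. isValidSignature, createDeployHandler, and the ./deploy.sh path are all hypothetical:

```javascript
const crypto = require('node:crypto');
const { execFile } = require('node:child_process');

// True when `header` ("sha256=<hex>") is the HMAC-SHA256 of `body` under `secret`.
function isValidSignature(secret, body, header) {
  const expected = 'sha256=' + crypto.createHmac('sha256', secret).update(body).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(String(header || ''));
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// Default deploy action: run the deploy script (path is an assumption).
function runDeployScript() {
  execFile('./deploy.sh', (err, stdout, stderr) => {
    if (err) console.error('Deploy failed:', stderr || err.message);
    else console.log(stdout);
  });
}

// Returns a listener suitable for http.createServer().
function createDeployHandler(secret, runDeploy = runDeployScript) {
  return (req, res) => {
    const chunks = [];
    req.on('data', (chunk) => chunks.push(chunk));
    req.on('end', () => {
      const body = Buffer.concat(chunks);
      const signature = req.headers['x-hub-signature'];
      if (req.method !== 'POST' || !isValidSignature(secret, body, signature)) {
        res.writeHead(401).end('invalid signature');
        return;
      }
      res.writeHead(202).end('deploy started'); // reply before the deploy runs
      runDeploy();
    });
  };
}
```

Replying with 202 before the script runs keeps the CI step from waiting on the full deploy, at the cost of CI not seeing whether deploy.sh itself succeeded.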
The Outcome
- 0 seconds of user-facing downtime recorded.
- 45-second deployment pipeline from git merge to production availability.
- Confidence to deploy minor changes and hotfixes safely during standard hours.