The early cracks

At 1,000 users, most things "worked." But as we onboarded more hospital clients, cracks started showing. API response times crept up. Database queries that ran fine with 10,000 rows started timing out at 200,000. Concurrent meal orders during lunch rush (11:30 AM – 1:00 PM) caused request queuing.

Database first

Almost every scaling problem traced back to the database. The most impactful changes:

  • Indexing strategy overhaul. I audited every query using EXPLAIN ANALYZE and added composite indexes for the most common access patterns. This alone dropped average query time from 800ms to under 50ms for the order listing endpoints.
  • Schema normalisation. Some tables had been designed with flexibility in mind (JSON columns for "dynamic" data). I migrated these to proper relational columns with appropriate types and constraints.
  • Connection pooling. We were opening a new database connection on every request. Switching to a connection pool removed that per-request setup cost.
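The composite-index idea can be sketched in miniature. This sketch uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for the Postgres-style `EXPLAIN ANALYZE` mentioned above; the table, columns, and index name are hypothetical, not the real schema.

```python
import sqlite3

# In-memory stand-in for the orders table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        hospital_id INTEGER,
        status TEXT,
        created_at TEXT
    )
""")

# Composite index matching a common access pattern:
# "list one hospital's orders by status, newest first".
conn.execute(
    "CREATE INDEX idx_orders_hospital_status "
    "ON orders (hospital_id, status, created_at)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM orders "
    "WHERE hospital_id = ? AND status = ? "
    "ORDER BY created_at DESC",
    (1, "pending"),
).fetchall()
print(plan)  # plan shows a SEARCH using the index, not a full table scan
```

Column order matters here: the leading columns match the equality filters, and the trailing column lets the index also satisfy the sort.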

API architecture

The original API had grown organically — lots of endpoints doing too many things. I refactored with a focus on:

  • Pagination everywhere. No more returning 10,000 rows in a single response.
  • Response shaping. Different clients needed different data. Instead of sending everything, I implemented field selection to reduce payload sizes.
  • Caching hot data. Meal menus don't change every minute. Adding a cache layer for semi-static data eliminated thousands of unnecessary database hits per day.
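The pagination and field-selection ideas combine naturally in one handler. A minimal sketch, assuming a keyset (`after_id`) cursor; the `Order` shape and the `list_orders` helper are hypothetical, not the actual API.

```python
from dataclasses import dataclass, asdict

# Hypothetical order record; the real schema isn't shown in the post.
@dataclass
class Order:
    id: int
    patient: str
    meal: str
    status: str

ORDERS = [Order(i, f"patient-{i}", "lunch", "pending") for i in range(1, 101)]

def list_orders(limit=25, after_id=0, fields=None):
    """Keyset pagination plus optional field selection (?fields=id,status)."""
    page = [o for o in ORDERS if o.id > after_id][:limit]
    rows = [asdict(o) for o in page]
    if fields:  # send only the columns the client asked for
        rows = [{k: r[k] for k in fields if k in r} for r in rows]
    next_cursor = page[-1].id if page else None
    return {"data": rows, "next": next_cursor}

first = list_orders(limit=2, fields=["id", "status"])
print(first)  # two trimmed rows plus a cursor for the next page
```

A keyset cursor (`WHERE id > ?`) stays fast at any depth, unlike `OFFSET`, which forces the database to skip every earlier row.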

The lunch rush problem

Hospitals have extremely predictable usage patterns. 80% of meal orders come in during a 90-minute lunch window. This meant our infrastructure needed to handle peak load that was 10–15x the average.

The solution was a combination of request queuing for non-critical operations (logging, notifications) and pre-computing meal availability data before the rush period started.
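The queuing half of that can be sketched with a background worker: the order path returns as soon as the order is accepted, and side effects such as notifications drain from a queue off the critical path. Everything here (names, the stubbed persistence) is illustrative.

```python
import queue
import threading

side_effects = queue.Queue()
handled = []

def worker():
    # Drains non-critical work (notifications, logging) in the background.
    while True:
        task = side_effects.get()
        if task is None:  # sentinel to stop the worker
            break
        handled.append(f"sent: {task}")
        side_effects.task_done()

threading.Thread(target=worker, daemon=True).start()

def place_order(order_id):
    # Critical path: persist the order (stubbed), enqueue the side effect,
    # and return immediately instead of waiting on the notification.
    side_effects.put(f"notification for order {order_id}")
    return {"order_id": order_id, "status": "accepted"}

resp = place_order(42)
side_effects.join()  # in production the worker simply keeps draining
```

In a real deployment this role is usually played by a job queue or message broker rather than an in-process thread, but the shape is the same: accept fast, defer the rest.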

Monitoring that actually matters

I set up monitoring focused on the metrics that actually predicted problems: API p95 latency, database connection pool saturation, and order queue depth. Generic CPU/memory alerts were too noisy and too late.
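For reference, p95 is just a nearest-rank percentile, and it surfaces exactly the tail an average hides:

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

# Nine fast requests and one slow outlier (illustrative numbers):
latencies_ms = [40, 42, 45, 50, 48, 44, 41, 43, 46, 900]
print(p95(latencies_ms))  # 900 -- the mean (~130ms) would bury this
```

Alerting on the percentile catches the degradation your slowest users feel while the average still looks healthy.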

What I'd do differently

  • Invest in proper database indexing from day one, not as a retrofit.
  • Build pagination and caching into the API framework, not as afterthoughts.
  • Set up load testing that simulates the lunch rush pattern specifically.
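That last point can be made concrete: a load-test traffic profile matching the numbers above (80% of orders in a 90-minute window, peak 10–15x average) might be generated like this and fed to a load tool. The function and its defaults are illustrative, assuming the remaining 20% spreads evenly over a 24-hour day.

```python
def rush_profile(total_orders=10_000, day_minutes=24 * 60,
                 rush_start=11 * 60 + 30, rush_len=90, rush_share=0.8):
    """Requests-per-minute schedule: rush_share of traffic lands in the rush."""
    rush_rate = total_orders * rush_share / rush_len
    base_rate = total_orders * (1 - rush_share) / (day_minutes - rush_len)
    return [
        rush_rate if rush_start <= m < rush_start + rush_len else base_rate
        for m in range(day_minutes)
    ]

profile = rush_profile()
peak, avg = max(profile), sum(profile) / len(profile)
print(round(peak / avg, 1))  # about a 12.8x peak-to-average ratio
```

Replaying a flat average rate would pass tests that the real 11:30 spike fails; the profile keeps the test honest.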