When I took over the backend at MealPe, the platform had around 1,000 registered users across a handful of client sites. A year later, it serves 20,000+ users and processes 50,000 meals per month. This is what that scaling journey actually looked like: not the clean version, but the real one.


The Early Cracks

At 1,000 users, most things “worked.” But as we onboarded more hospital clients, cracks started showing.

API response times crept up. Database queries that ran fine with 10,000 rows started timing out at 200,000. Concurrent meal orders during our intense lunch rush (11:30 AM – 1:00 PM) caused request queuing and database locks that slowed the entire site to a crawl.


1. Database-First Optimization

Almost every scaling problem traced back to the database. When I audited the system, I focused on three immediate interventions:

  • Indexing Strategy Overhaul: I audited every query using EXPLAIN ANALYZE and added composite indexes for the most common access patterns (such as combining hospital_id with order_status and delivery_time). This alone dropped average query time from 800ms to under 50ms for the order listing endpoints.
  • Schema Normalization: Some tables had been designed for flexibility, using JSON columns for dynamically defined fields. I migrated these to proper relational columns with appropriate types and foreign key constraints. Filters on those fields no longer have to extract values from JSON for every row, which sharply cut the CPU cost of query evaluation.
  • Connection Pooling: We were opening a new database connection for every request, paying the setup cost (TCP handshake, authentication) each time. Switching to pooled connections with pg-pool removed most of that overhead and kept connection saturation safely below our thresholds.
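
To make the pooling idea concrete, here is a toy sketch in TypeScript. It is not the pg-pool API and not our production code: SimplePool, Conn, and withConn are hypothetical names, and the "connection" is a plain object standing in for an expensive connect. It only shows the core behaviour: reuse idle connections, cap the total, and make extra callers wait.

```typescript
// Toy connection pool: illustrates reuse and a hard cap only.
// Hypothetical names; this is NOT the pg-pool API.
interface Conn {
  id: number;
}

class SimplePool {
  private readonly max: number;
  private idle: Conn[] = [];
  private waiters: Array<(conn: Conn) => void> = [];
  private created = 0;

  constructor(max: number) {
    this.max = max;
  }

  // Reuse an idle connection, open a new one if under the cap, else wait.
  acquire(): Promise<Conn> {
    const reused = this.idle.pop();
    if (reused !== undefined) return Promise.resolve(reused);
    if (this.created < this.max) {
      this.created += 1;
      // Stands in for the expensive handshake + authentication step.
      return Promise.resolve({ id: this.created });
    }
    return new Promise<Conn>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Hand the connection to the oldest waiter, or park it as idle.
  release(conn: Conn): void {
    const next = this.waiters.shift();
    if (next !== undefined) next(conn);
    else this.idle.push(conn);
  }

  get totalCreated(): number {
    return this.created;
  }
}

// Run one unit of work with a pooled connection and always give it back.
async function withConn<T>(
  pool: SimplePool,
  work: (conn: Conn) => Promise<T>,
): Promise<T> {
  const conn = await pool.acquire();
  try {
    return await work(conn);
  } finally {
    pool.release(conn);
  }
}
```

With a cap of 2, a third concurrent acquire() does not open a third connection; it waits until one of the first two is released. That is the property that keeps the database from being flooded during a burst of requests.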

2. API Architecture Refinement

The original API had grown organically - lots of endpoints doing too many things. I refactored our core router with a focus on:

  • Pagination Everywhere: Enforced strict pagination (limit/offset and cursors) on every list endpoint, so that no response returns thousands of rows at once and no client tries to render them in a single pass.
  • Response Shaping: Different clients needed different data structures. Instead of sending full database models, I implemented lightweight view DTOs to shrink our network payload size.
  • Caching Hot Data: Hospital menus change once a week, not once a second. Adding a cache layer in front of dynamic menus eliminated thousands of database reads per hour.
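
As a rough illustration of the first two points, here is a minimal TypeScript sketch of cursor pagination combined with a slimmed-down DTO. The row shape, field names, MAX_LIMIT, and listOrders are assumptions made up for this example, and the rows live in an in-memory array rather than a database. In SQL the same idea would be a query that filters on id greater than the cursor, orders by id, and fetches one row more than the page size.

```typescript
// Hypothetical row shape; the field names are assumptions for illustration.
interface OrderRow {
  id: number;
  hospitalId: number;
  status: string;
  internalNotes: string; // present on the model, not needed by list views
}

// Lightweight DTO: only the fields a list view actually needs.
interface OrderListItem {
  id: number;
  status: string;
}

interface Page {
  items: OrderListItem[];
  nextCursor: number | null; // null means there are no further pages
}

const MAX_LIMIT = 100; // assumed hard cap on page size

// Cursor pagination: return rows with id > cursor, in ascending id order.
// Assumes ids are positive, so a missing cursor starts from 0.
function listOrders(rows: OrderRow[], cursor: number | null, limit: number): Page {
  const size = Math.min(Math.max(1, limit), MAX_LIMIT);
  const after = cursor ?? 0;
  const candidates = rows
    .filter((row) => row.id > after)
    .sort((a, b) => a.id - b.id)
    .slice(0, size + 1); // one extra row tells us whether another page exists
  const hasMore = candidates.length > size;
  const pageRows = hasMore ? candidates.slice(0, size) : candidates;
  return {
    // Shape the response: drop everything the client did not ask for.
    items: pageRows.map((row) => ({ id: row.id, status: row.status })),
    nextCursor: hasMore ? pageRows[pageRows.length - 1].id : null,
  };
}
```

The client passes nextCursor back on the following request. Unlike a large offset, the cursor lets the database seek straight to the next row through an index instead of skipping over everything before it.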

3. The Hospital Lunch Rush Problem

Hospitals have extremely predictable and highly congested usage patterns. Unlike standard food delivery, 80% of all hospital meal orders come in during a tightly bounded 90-minute lunch window. This meant our servers needed to handle peak loads that were 15x the daily average.

To survive these spikes without letting cloud costs grow unchecked:

  1. Decoupling Non-Critical Work: I offloaded tasks like push notifications, admin email alerts, and analytics log aggregation into a background queue processed after peak hours.
  2. Pre-Computing Availability: Instead of scanning order records to determine inventory limits on every checkout, we pre-computed meal item quotas every hour, converting an O(N) scan into an O(1) cache read.
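
The second item can be sketched roughly like this in TypeScript. The types and function names (Order, remainingByScan, buildQuotaSnapshot, canOrder) are hypothetical, and a Map stands in for whatever cache the real system uses. The point is only the shape of the change: one pass over the orders in a periodic job, then a single lookup on the checkout path.

```typescript
// Hypothetical order shape for illustration.
interface Order {
  mealId: string;
  quantity: number;
}

// Before: every checkout scans all orders for the meal, O(N) per request.
function remainingByScan(orders: Order[], mealId: string, limit: number): number {
  let used = 0;
  for (const order of orders) {
    if (order.mealId === mealId) used += order.quantity;
  }
  return limit - used;
}

// Periodic job: a single pass builds the remaining quota for every meal.
function buildQuotaSnapshot(
  orders: Order[],
  limits: Map<string, number>,
): Map<string, number> {
  const remaining = new Map(limits); // copy so the limits stay untouched
  for (const order of orders) {
    const left = remaining.get(order.mealId);
    if (left !== undefined) remaining.set(order.mealId, left - order.quantity);
  }
  return remaining;
}

// After: the checkout path is one map lookup, O(1) per request.
// Meals missing from the snapshot are treated as unavailable.
function canOrder(snapshot: Map<string, number>, mealId: string, quantity: number): boolean {
  return (snapshot.get(mealId) ?? 0) >= quantity;
}
```

One trade-off to keep in mind: a snapshot rebuilt on a schedule can lag behind orders placed since the last refresh, so a design like this needs either some tolerance for slightly stale quotas or a decrement on each successful order. This sketch leaves that choice open.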

Monitoring That Actually Matters

I set up monitoring dashboards that tracked metrics predictive of failure, rather than relying on CPU alerts alone:

  • API p95 Latency: Warns us when p95 response time exceeds 200ms.
  • Connection Pool Saturation: Alerts us when the share of available connections in the database pool drops below 15%.
  • Job Queue Depth: Helps us spot background processing lag.
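
For reference, here is a small TypeScript sketch of how checks like these can be evaluated. The 200ms and 15% thresholds come from the list above; everything else (the MetricsSnapshot shape, the alert names, the nearest-rank percentile method, and the queue depth limit passed in by the caller) is an assumption for illustration, not a description of our dashboards.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical metrics snapshot for one evaluation window.
interface MetricsSnapshot {
  latenciesMs: number[]; // API response times observed in the window
  poolTotal: number;     // configured pool size
  poolIdle: number;      // connections currently available
  queueDepth: number;    // background jobs waiting to run
}

const P95_LIMIT_MS = 200;        // from the list above
const MIN_IDLE_FRACTION = 0.15;  // from the list above

// Returns the names of the alerts that should fire for this snapshot.
// maxQueueDepth is supplied by the caller; no value is assumed here.
function evaluateAlerts(snapshot: MetricsSnapshot, maxQueueDepth: number): string[] {
  const alerts: string[] = [];
  if (percentile(snapshot.latenciesMs, 95) > P95_LIMIT_MS) alerts.push("p95-latency");
  if (snapshot.poolIdle / snapshot.poolTotal < MIN_IDLE_FRACTION) alerts.push("pool-saturation");
  if (snapshot.queueDepth > maxQueueDepth) alerts.push("queue-depth");
  return alerts;
}
```

The p95 is deliberately less sensitive than a maximum: a single slow request out of twenty does not trip the latency alert, but a second one does, which usually means something systemic rather than a one-off.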

What I’d Do Differently Next Time

  • Plan the Indexing Early: Invest in database indexes from day one rather than retrofitting them under pressure.
  • Bake in Caching First: Integrate caching and pagination middleware into the router framework rather than adding them as custom interventions.
  • Realistic Load Testing: Build synthetic load tests modeling the concentrated lunch rush early in staging.