How Ruangguru Scaled to 22M Students: The Tech Stack
About the Author
Emachalan is a Full-Stack Developer specializing in MEAN & MERN Stack, focused on building scalable web and mobile applications with clean, user-centric code.
Key Takeaways
- Scaling EdTech requires stabilization before growth — we spent the first 3 months fixing reliability before adding a single new feature.
- Monolith-to-microservices migration works best incrementally — migrating piece-by-piece, not in a "big bang," was the key to zero-downtime migration.
- Load testing at 3× expected peak before every major event was the single most important reliability practice across the entire 8-year partnership.
- Embedded team model builds lasting capability — today Ruangguru's internal team handles day-to-day development, exactly as planned from day one.
- COVID-19 proved the architecture worked — 10× traffic in 2 weeks with zero downtime validated every infrastructure decision made since 2016.
- Indonesia's infrastructure variety (users on everything from 5G to 2G) required locale-specific CDN and bandwidth adaptation decisions that global defaults could not provide.
- A well-architected event-driven platform can handle 2M+ concurrent users with peak load variations of 40× — if designed for it from the start.
Ruangguru by the Numbers
| Metric | Value |
|---|---|
| Active Students | 22 Million+ |
| Video Lessons | 65,000+ |
| Provinces in Indonesia | 34 |
| Scale Achieved | 100× |
From 100K to 22M users — this is how the technology scaled.
Ruangguru grew from a startup to Indonesia's largest education technology platform, serving 22+ million students. This is the story of how we partnered with them in 2016 — when they had 100,000 users — and helped build technology that scaled to 22 million.
Learn how AgileSoftLabs architects and builds enterprise-grade platforms for education, healthcare, logistics, and e-commerce — from early-stage startups to national-scale deployments.
The Challenge: Scaling Education Technology
When Ruangguru first engaged us in 2016, they were a fast-growing EdTech startup with 100,000 registered students and big ambitions. Their infrastructure was adequate for that scale but wasn't designed for the 100× growth they were targeting.
Initial State (2016)
| Dimension | Status |
|---|---|
| Users | ~100,000 registered students (early-stage growth) |
| Content | 10,000+ learning videos |
| Peak load | 5,000 concurrent users |
| Issues | Infrastructure not built for scale, monolithic architecture, no CDN strategy |
The Growth Trajectory
Explore AgileSoftLabs Education Platform Solutions — including Education Management and AI-Powered Academic Program Management Software — built on the same scalable architecture principles applied at Ruangguru.
Technical Partnership Approach
Our engagement evolved through several phases as Ruangguru's needs changed:
Phase 1: Stabilization (2016 — 3 Months)
Before we could scale, we had to stabilize.
Initial Issues Identified:
- Database bottleneck (single PostgreSQL instance)
- Video delivery (origin server overloaded)
- Session management (in-memory, not distributed)
- No auto-scaling (manual capacity management)
- Limited monitoring (reactive, not proactive)
Immediate Actions:
- Database read replicas + connection pooling
- CDN implementation for video content
- Distributed session management (Redis)
- Auto-scaling configuration
- Comprehensive monitoring setup
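Adding read replicas implies a routing decision in the data-access layer: writes go to the primary, reads rotate across replicas. The following is a minimal sketch of that idea, not the production data layer. The connection names (`primary`, `replica-1`, `replica-2`) and the rule that only `SELECT` statements count as reads are assumptions made for illustration.

```python
import itertools


class ReadWriteRouter:
    """Send reads to replicas in round-robin order and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._read_cycle = itertools.cycle(replicas or [primary])

    def target_for(self, sql):
        words = sql.split()
        # Assumption: only plain SELECT statements are safe to serve from a replica.
        is_read = bool(words) and words[0].upper() == "SELECT"
        return next(self._read_cycle) if is_read else self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM lessons"))        # replica-1
print(router.target_for("UPDATE users SET name = 'a'"))  # primary
```

Treating anything that is not a `SELECT` as a write is deliberately conservative: a misrouted read costs some primary capacity, while a misrouted write fails outright.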
Results (30 days):
- 99.5% uptime (from 94%)
- Page load time: 6s → 2.1s
- Video start time: 8s → 1.5s
- Zero exam-period outages
See how AgileSoftLabs Cloud Development Services stabilize infrastructure through CDN strategy, auto-scaling configuration, and distributed session management — the same interventions that transformed Ruangguru's reliability in 30 days.
Phase 2: Architecture Evolution (2017 — 6 Months)
With stability achieved, we rebuilt for scale.
Architecture Transformation
Before (Monolithic):
After (Microservices):
Key Technical Decisions
| Decision | Rationale | Result |
|---|---|---|
| Kubernetes for orchestration | Auto-scaling, self-healing, consistent deployment | Can scale to 10× in minutes |
| Multi-CDN strategy | Redundancy + regional optimization for Indonesia | 99.9% video availability |
| Event-driven architecture | Decouple services, handle spikes | 2M+ events/second capacity |
| Separate read/write paths | Optimize for different access patterns | 10× read throughput |
Explore how AgileSoftLabs Custom Software Development Services approach monolith-to-microservices migration — incremental, low-risk, and designed to build internal team capability throughout the process.
Phase 3: Feature Development (2018–Ongoing)
Beyond infrastructure, we built new capabilities:
Live Learning Platform:
- Real-time video streaming (100K+ concurrent viewers)
- Interactive Q&A during sessions
- Whiteboard collaboration
- Recording and playback
- Bandwidth adaptation for varied connections
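Bandwidth adaptation usually comes down to picking the best rendition a viewer's connection can sustain. The sketch below shows one common approach, assuming a hypothetical bitrate ladder and a safety margin of 0.8. The renditions, bitrates, and margin are illustrative values, not Ruangguru's actual encoding settings.

```python
# Hypothetical bitrate ladder, best quality first: (label, required kbit/s).
LADDER = [
    ("1080p", 4500),
    ("720p", 2500),
    ("480p", 1200),
    ("240p", 400),
    ("audio-only", 64),
]


def pick_rendition(measured_kbps, safety=0.8, ladder=LADDER):
    """Return the highest rendition that fits within a safety margin of measured throughput."""
    budget = measured_kbps * safety
    for label, required_kbps in ladder:
        if required_kbps <= budget:
            return label
    # Nothing fits: fall back to the lowest rung rather than stalling.
    return ladder[-1][0]


print(pick_rendition(10000))  # 1080p
print(pick_rendition(2000))   # 480p
print(pick_rendition(50))     # audio-only
```

The safety margin leaves headroom for throughput fluctuation, which matters most on the slower mobile connections.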
Adaptive Learning Engine:
- Student performance tracking
- Personalized content recommendations
- Difficulty adjustment based on progress
- Weakness identification and targeted practice
- Learning path optimization
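Difficulty adjustment based on progress can be illustrated with a simple rule over a rolling window of recent answers. This is a sketch under stated assumptions: the five-level scale, the window of 5 answers, and the 80% and 40% thresholds are hypothetical, and a production engine would likely use a richer model.

```python
from collections import deque


class DifficultyTuner:
    """Raise or lower difficulty based on accuracy over the last few answers."""

    def __init__(self, level=3, min_level=1, max_level=5, window=5):
        self.level = level
        self.min_level = min_level
        self.max_level = max_level
        self.recent = deque(maxlen=window)

    def record(self, correct):
        self.recent.append(bool(correct))
        if len(self.recent) < self.recent.maxlen:
            return self.level  # not enough answers yet to judge
        accuracy = sum(self.recent) / len(self.recent)
        if accuracy >= 0.8 and self.level < self.max_level:
            self.level += 1
            self.recent.clear()  # start a fresh window at the new level
        elif accuracy <= 0.4 and self.level > self.min_level:
            self.level -= 1
            self.recent.clear()
        return self.level


tuner = DifficultyTuner()
for _ in range(5):
    level = tuner.record(True)
print(level)  # 4
```

Clearing the window after each change keeps one good or bad streak from moving a student more than one level at a time.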
Assessment System:
- Large-scale exam delivery (500K simultaneous)
- Anti-cheating measures
- Instant grading and feedback
- Performance analytics for teachers
- Question bank management
See how AgileSoftLabs AI & Machine Learning Development Services build adaptive learning engines — personalization algorithms, recommendation systems, and real-time performance analytics at scale.
Results and Impact
Technical Metrics
| Metric | Before (2016) | After (2024) | Improvement |
|---|---|---|---|
| Peak concurrent users | 5,000 | 2,000,000+ | 400× |
| System availability | 94% | 99.95% | ~120× less downtime |
| Page load time | 6 seconds | 1.2 seconds | 5× faster |
| Video start time | 8 seconds | 0.8 seconds | 10× faster |
| API response time (p95) | 2.5 seconds | 200ms | 12× faster |
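The availability figures are easier to compare as downtime per year. A short worked calculation, assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours(availability_pct):
    """Hours per year a service is unavailable at the given availability."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


before = downtime_hours(94.0)   # about 525.6 hours, roughly 22 days
after = downtime_hours(99.95)   # about 4.38 hours
print(round(before, 1), round(after, 2), round(before / after))  # 525.6 4.38 120
```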
Business Impact
I. Growth Metrics:
- User base: 1M → 28M (28x growth)
- Content library: 100K → 1M+ items
- Live classes delivered: 10K/month → 500K/month
- Revenue growth: 15x over partnership period
- Market position: #1 EdTech in Indonesia
II. Student Outcomes:
- 10M+ students prepared for national exams
- 85% of users report improved grades
- 2M+ scholarship assessments processed
- 500K+ hours of live instruction delivered
Review more enterprise-scale technology outcomes in the AgileSoftLabs Case Studies — including platforms across healthcare, logistics, and consumer applications.
COVID-19 Response: 10× Traffic in 2 Weeks — Zero Downtime
When schools closed in March 2020, Ruangguru had to scale overnight:
March 2020 Scaling Event:
Before (Feb 2020):
- 200K daily active users
- 50K peak concurrent
After (April 2020):
- 2M daily active users (10x)
- 400K peak concurrent (8x)
- Required: 2-week timeline to scale
Our Response:
- Emergency capacity planning (48 hours)
- Additional infrastructure provisioning (72 hours)
- Performance optimization sprint
- Free tier launch for all Indonesian students
- Result: Zero downtime during transition
The COVID-19 response was the ultimate proof-of-concept for every architecture decision made since 2016. The event-driven, Kubernetes-orchestrated, multi-CDN infrastructure absorbed 10× normal traffic with no user-facing outages — a result that would have been impossible on the 2016 monolithic stack.
Lessons from the Partnership
What Worked
- Embedded team model: Our engineers worked alongside Ruangguru's team, building internal capability
- Incremental migration: Moved to microservices piece by piece, not a big bang
- Load testing obsession: Tested at 3x expected peak before every major event
- Local optimization: Indonesia-specific CDN and infrastructure choices
- Knowledge transfer: Documented everything, trained internal team
Challenges Overcome
- Indonesia's infrastructure variety: Users on everything from 5G to 2G connections
- Peak load unpredictability: Viral content could 10x traffic in hours
- Regulatory compliance: Data localization and content requirements
- Rapid feature demands: Business moved faster than typical enterprise
Technology Stack
| Layer | Technology | Why We Chose It |
|---|---|---|
| Container orchestration | Kubernetes (GKE) | Managed, auto-scaling, reliable |
| Backend services | Go, Node.js | Performance + developer productivity |
| Databases | PostgreSQL, MongoDB, Redis | Right tool for each data type |
| Message queue | Apache Kafka | High throughput, durability |
| Video delivery | Multi-CDN (Akamai, Cloudflare, local) | Redundancy + regional performance |
| Real-time | WebSocket + custom signaling | Low latency for live classes |
| Analytics | ClickHouse, Apache Spark | Fast queries on large datasets |
Explore AgileSoftLabs Web Application Development Services — our engineering teams apply the same Go, Node.js, Kubernetes, and Kafka stack principles across enterprise platform builds for global clients.
Partnership Evolution: 8 Years, 4 Phases
Engagement Model Over Time:
| Phase | Years | Mode | Primary Deliverable |
|---|---|---|---|
| Foundation & Stabilization | 2016–2017 | Active build | Infrastructure rebuild, CDN, first services |
| Embedded Team | 2018–2019 | Collaborative | Microservices migration, knowledge transfer |
| Scale for COVID-19 | 2020–2021 | Emergency + product | 10× scale, live class platform |
| Strategic Advisory | 2022–Present | Advisory | Architecture review, Southeast Asia expansion |
2016–2017: Foundation & Stabilization
- Infrastructure assessment and rebuild
- CDN strategy for Indonesia
- Monolith → first modular services
2018–2019: Embedded Team
- Engineers embedded in Ruangguru's team
- Microservices migration (piece by piece)
- Knowledge transfer and internal capability build
2020–2021: Scale for COVID-19
- Emergency capacity response (March 2020)
- 10x traffic in 2 weeks — zero downtime
- Live class platform for 100K+ concurrent viewers
2022–Present: Strategic Advisory
- Architecture reviews for new product lines
- Scaling guidance as they expand across Southeast Asia
- Ongoing support relationship
- Ruangguru's internal team handles day-to-day development
Conclusion
Ruangguru's journey from 100,000 to 28 million students — which we've been part of since 2016 — demonstrates what's possible when technology scales with business ambition. The keys to success were pragmatic architecture decisions, obsessive focus on reliability, and a partnership model that built lasting capability rather than lasting dependency.
Today, Ruangguru's internal team handles most development, exactly as planned from the beginning. Our ongoing role is supporting their continued growth and tackling new technical challenges as they expand across Southeast Asia.
The numbers — 400× peak concurrency, 99.95% availability, 12× API response improvement — are not the story. The story is of 22 million Indonesian students accessing quality education that wasn't previously available to them. The technology made that possible. The partnership made the technology sustainable.
Building an EdTech platform or scaling an existing one? AgileSoftLabs brings the same partnership model and architecture expertise to your platform. Browse our product portfolio, explore our case studies, and contact our team to discuss how we can help.
Frequently Asked Questions (FAQs)
1. How did Ruangguru grow from 100K to 22M students technically?
The platform started in 2016 with 100K users on a Node.js monolith serving 10K videos. It migrated to Kubernetes on GKE in 2018 to handle 40× exam spikes. By 2022 it reached 2M concurrent peaks through auto-scaling across multi-region clusters with intelligent load distribution.
2. What Kubernetes HPA settings managed Ruangguru's 40x spikes?
The Horizontal Pod Autoscaler targeted 70% CPU utilization and could scale pods 10× within 2 minutes. The Cluster Autoscaler provisioned nodes dynamically, and self-healing automatically replaced the roughly 5% of pods that failed daily during exam seasons.
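The Horizontal Pod Autoscaler's documented scaling rule is `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, with a default tolerance of 0.1 around the target. The sketch below applies that rule to a 70% CPU target. The replica bounds are hypothetical placeholders, not Ruangguru's configured values.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70,
                     min_replicas=1, max_replicas=1000, tolerance=0.1):
    """Apply the HPA scaling formula, clamped to hypothetical replica bounds."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(10, 140))  # 20: CPU at twice the target doubles the pods
print(desired_replicas(10, 72))   # 10: within tolerance, no change
```

Because each evaluation scales proportionally to the overshoot, a 10× jump takes either a very large overshoot or several consecutive evaluations.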
3. Why migrate 90% of the backend from Node.js to Go microservices?
Go delivered 10× the throughput per instance compared with the limits of the Node.js event loop. Single-binary deployments removed much of the Docker layer complexity, and goroutines handled 400K concurrent WebSocket connections efficiently.
4. How does Kafka handle Ruangguru's 2M events/second throughput?
A 12-node Kafka cluster runs with 3× replication across 3 availability zones. Exam analytics and transactional streams are kept separate. Consumer-lag alerts trigger automatic partition rebalancing, which keeps end-to-end latency below 100ms.
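Consumer lag is the gap between the latest offset written to a partition and the last offset a consumer group has committed. A minimal sketch of the calculation behind a lag alert follows. The offsets and the threshold are made-up values, and the dictionaries stand in for what a Kafka client or monitoring tool would report.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest log offset minus last committed offset."""
    return {
        partition: max(0, end - committed_offsets.get(partition, 0))
        for partition, end in end_offsets.items()
    }


def should_alert(end_offsets, committed_offsets, max_total_lag):
    """True when the summed lag across partitions exceeds the threshold."""
    return sum(partition_lag(end_offsets, committed_offsets).values()) > max_total_lag


end = {0: 1000, 1: 1500, 2: 900}        # illustrative offsets only
committed = {0: 990, 1: 1200, 2: 900}
print(partition_lag(end, committed))     # {0: 10, 1: 300, 2: 0}
print(should_alert(end, committed, 250)) # True
```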
5. What multi-CDN routing ensures 99.9% video delivery uptime?
Cloudflare, Akamai, and 3 Indonesian providers sit behind latency-based steering. Dynamic origin failover switches traffic in under 3s, and exam content is pre-cached regionally to prevent origin overload during peaks.
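Latency-based steering with failover reduces to choosing the fastest provider that is currently healthy. The sketch below assumes hypothetical provider names, probe results, and a 500ms latency cutoff. In practice this decision is typically made at the DNS or player level rather than in application code.

```python
def choose_cdn(probes, max_latency_ms=500):
    """Pick the healthy provider with the lowest probed latency.

    probes maps provider name -> (healthy, latency_ms).
    Returns None when no provider qualifies, leaving the fallback to the caller.
    """
    candidates = [
        (latency, name)
        for name, (healthy, latency) in probes.items()
        if healthy and latency <= max_latency_ms
    ]
    return min(candidates)[1] if candidates else None


probes = {"cdn-a": (True, 80), "cdn-b": (True, 45), "local-1": (False, 20)}
print(choose_cdn(probes))  # cdn-b: local-1 is faster but unhealthy
```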
6. How does Redis Cluster manage sessions for 22M distributed users?
A 6-node Redis Cluster (3 masters, each with a replica) distributes keys across hash slots. Sessions carry a 30-minute TTL and replicate asynchronously across regions, and cross-region read replicas keep Jakarta-Singapore reads under 50ms.
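Redis Cluster places each key in one of 16384 hash slots, computed as CRC16 of the key modulo 16384. If the key contains a non-empty `{...}` hash tag, only the tag is hashed. A sketch of that computation follows. The key names are hypothetical, and `binascii.crc_hqx` with an initial value of 0 is used here as the CRC16 (XMODEM) variant that the Redis Cluster specification describes.

```python
import binascii


def key_hash_slot(key):
    """Compute the Redis Cluster hash slot for a key, honoring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


# Hypothetical session keys: the shared tag keeps one user's keys in one slot.
print(key_hash_slot("{user1000}.session") == key_hash_slot("{user1000}.profile"))  # True
```

Keeping a user's keys in one slot allows multi-key operations on them, which Redis Cluster only permits when all keys share a slot.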
7. What sharding strategy supports Ruangguru's mixed read/write patterns?
PostgreSQL is sharded by user_id (28M registered users) with a 1:5 read replica ratio. A ClickHouse analytical cluster serves exam reports. Write throughput rose from 2K TPS to 20K TPS after sharding.
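Sharding by `user_id` needs a stable mapping from user to shard. The sketch below shows one simple option, hashing the ID and taking it modulo the shard count. The hash choice and the shard count of 8 are assumptions for illustration; the source does not state which mapping Ruangguru uses.

```python
import hashlib


def shard_for(user_id, shard_count):
    """Map a user ID to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARDS = 8  # hypothetical shard count
print(shard_for(42, SHARDS) == shard_for(42, SHARDS))  # True: same user, same shard
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings. Plain modulo mapping also means that changing the shard count moves most users, so schemes of this kind usually fix the count up front or add a lookup layer.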
8. How was page load reduced from 6s to 2.1s serving 400K DAU?
The front end uses React 18 micro-frontends with code splitting and critical CSS extraction, plus Cloudflare Polish for images and caching in the GKE service mesh. TTFB dropped 60% through edge compute and preconnect optimization.
9. What monitoring protected Ruangguru's 99.9% SLA?
Prometheus and Grafana scraped 10K metrics per second cluster-wide, and Datadog APM traced 90% of microservices. PagerDuty escalated any 5xx error rate above 5% to the on-call rotation within 2 minutes.
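An error-rate alert of this kind can be modeled as a sliding window over recent responses. In the sketch below, the 5% threshold comes from the answer above, while the 120-second window and the class itself are assumptions. A real deployment would express this as an alerting rule in the monitoring stack rather than in application code.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of 5xx responses over a sliding time window."""

    def __init__(self, window_seconds=120, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # (timestamp, is_5xx), oldest first

    def record(self, timestamp, status_code):
        self._events.append((timestamp, 500 <= status_code <= 599))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def should_page(self, now):
        self._evict(now)
        if not self._events:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events) > self.threshold


monitor = ErrorRateMonitor()
for second in range(94):
    monitor.record(second, 200)
for second in range(94, 100):
    monitor.record(second, 503)
print(monitor.should_page(100))  # True: 6 of 100 responses were 5xx
```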
10. How did Ruangguru survive exam-day 2M concurrent surges reliably?
Clusters were pre-scaled to 80% capacity during exam week. Per-user rate limits and progressive circuit breakers contained overload, the CDN was overprovisioned to 3× the forecast peak, and GKE preemptible nodes handled non-critical workloads cost-effectively.
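Per-user rate limits are commonly implemented with a token bucket: each user accrues tokens at a fixed rate up to a cap, and every request spends one. The sketch below is one such implementation, with a hypothetical rate and capacity. The source does not say which algorithm Ruangguru uses. The clock is passed in explicitly so the behavior is easy to verify.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_sec=1, capacity=2)
print([bucket.allow(0.0) for _ in range(3)])  # [True, True, False]
print(bucket.allow(1.0))                      # True: one token refilled after 1s
```

To apply this per user, keep one bucket per user ID, for example in a dictionary or in a shared store so that all service instances see the same counts.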









