ML System Design

Machine learning infrastructure: serving, training, monitoring, and production systems

100 minutes
9 Detailed Sections
Senior Level

Online inference: low-latency predictions for individual requests. Common transports: REST APIs (JSON over HTTP, simpler but higher latency) and gRPC (binary protobuf over HTTP/2, faster).
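A minimal sketch of the online-inference request path: one request in, one prediction out, with the latency measured per call. The model here is a hypothetical stand-in; a real server would load a trained artifact at startup and expose this handler through a framework such as FastAPI or a gRPC servicer.

```python
import json
import time

def model_predict(features):
    # Hypothetical toy model for illustration; a real deployment would
    # load a trained artifact (e.g. from a model registry) at startup.
    return sum(features) * 0.1

def handle_request(body: str) -> str:
    """Online inference: parse one request, run the model, return one prediction."""
    start = time.perf_counter()
    features = json.loads(body)["features"]
    score = model_predict(features)
    latency_ms = (time.perf_counter() - start) * 1000
    return json.dumps({"score": round(score, 4),
                       "latency_ms": round(latency_ms, 2)})

response = handle_request('{"features": [1, 2, 3]}')
```

The per-request latency measurement is what feeds the p50/p99 monitoring discussed later: every online handler should emit it.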

Batch inference: process large volumes of predictions asynchronously (e.g. daily recomputation of model scores). Key serving constraints: model size (often gigabytes) and the latency budget (sub-100ms for real-time use cases).
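The batch pattern can be sketched as: run the model over every entity offline, write the results to a key-value table, and turn online serving into a cheap lookup. The user IDs and toy scoring function below are illustrative, not a real pipeline.

```python
def model_score(features):
    # Same hypothetical toy model used for illustration throughout.
    return sum(features) * 0.1

def batch_score(user_features: dict) -> dict:
    """Nightly job: score every user and emit a prediction table."""
    return {user: model_score(f) for user, f in user_features.items()}

# Output of the offline job; in production this would land in a
# low-latency store (Redis, DynamoDB, etc.) rather than a dict.
predictions = batch_score({"u1": [1, 2], "u2": [3, 4]})

# Online serving is now a table lookup, not a model call:
score_for_u2 = predictions["u2"]
```

The trade-off: lookups are fast and cheap, but predictions go stale between batch runs, which is why model staleness appears in the monitoring list below.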

Solutions: quantization (reduce numerical precision for speed and size), distillation (train a smaller model to mimic a larger one), and caching predictions. Serving tools: TensorFlow Serving, KServe, Ray Serve, Seldon.
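To make the quantization idea concrete, here is the core arithmetic in pure Python: map float weights to int8 via a scale factor, accepting error of at most one quantization step. Real toolchains (TensorFlow Lite, PyTorch) do this per-tensor or per-channel with calibration data; this is only the underlying mechanism.

```python
def quantize(weights, num_bits=8):
    """Map float weights to signed integers with a shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]   # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer representation."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.031]
q, scale = quantize(weights)
approx = dequantize(q, scale)   # within one quantization step of weights
```

Each int8 weight needs 1 byte instead of 4 for float32, a 4x size reduction, and integer arithmetic is typically faster on serving hardware.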

Deployment: A/B test on a traffic subset, then canary rollout. Monitor latency, error rate, and throughput throughout.
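A canary rollout can be sketched with hash-based traffic bucketing: a small, sticky fraction of users is routed to the new model version, so each user consistently sees one variant while metrics are compared. The version names and 5% split below are assumptions for illustration.

```python
import hashlib

def route(user_id: str, canary_percent: int = 5) -> str:
    """Deterministically assign a user to the stable or canary model.

    Hashing the user ID (rather than random assignment) keeps each
    user pinned to one variant across requests.
    """
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "model_v2_canary" if bucket < canary_percent else "model_v1_stable"

# Roughly 5% of users exercise the canary; the rest stay on stable.
variants = {route(f"user-{i}") for i in range(1000)}
```

If the canary's error rate or latency regresses, `canary_percent` drops back to 0; if it holds, the fraction is ramped up until the new version takes all traffic.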

Real-world examples: Netflix precomputes movie recommendations in batch; Uber estimates ride ETAs with online inference; Amazon personalizes using both.

Key Takeaways

1. Online Inference: Per-request model execution; REST or gRPC; sub-100ms latency required
2. Batch Inference: Precompute predictions for many items; daily/hourly updates
3. Optimization: Quantization (reduce precision), Distillation (smaller model)
4. Latency Challenge: Model size and computation limit real-time speed
5. Deployment Strategy: A/B test subset, canary rollout, monitor inference quality
6. Monitoring: Latency (p50/p99), throughput, error rate, model staleness
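The latency metrics in the takeaways above can be computed from recorded request latencies; this sketch uses the nearest-rank percentile method on an illustrative sample. Production systems usually use streaming approximations (histograms, t-digests) instead of sorting raw samples.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))   # illustrative: 1..100 ms
p50 = percentile(latencies_ms, 50)   # median request latency
p99 = percentile(latencies_ms, 99)   # tail latency, the usual SLO target
```

p99 matters more than the mean for serving SLOs: a small fraction of slow requests can dominate user-perceived latency, especially when one page fans out to many model calls.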

Visual Diagram

Request -> Model Server (inference) -> Response (online) vs Batch job -> Precompute predictions
