ML System Design
Machine learning infrastructure: serving, training, monitoring, and production systems
Online inference: serve low-latency responses to individual requests. Transport options: REST/JSON APIs (simpler, higher per-request overhead) vs. gRPC (binary protocol over HTTP/2, lower latency).
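A minimal sketch of an online REST endpoint, assuming FastAPI; the model here is a stand-in loaded once at startup so each request pays only inference latency (a gRPC server would follow the same pattern with a compiled protobuf service instead of JSON):

```python
# Online-inference sketch (assumes FastAPI is installed; the model and
# feature schema are illustrative placeholders, not a specific system).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]  # hypothetical flat feature vector

def load_model():
    # Placeholder: in practice, deserialize a trained model once at startup.
    return lambda feats: sum(feats) / max(len(feats), 1)

model = load_model()  # loaded once, reused across requests

@app.post("/predict")
def predict(req: PredictRequest):
    return {"score": model(req.features)}
```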
Batch inference: process large volumes of inputs asynchronously (e.g., recomputing predictions daily). Key constraints: model size (often gigabytes) and latency budgets (sub-100ms for real-time requests).
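A rough sketch of the batch pattern: a scheduled job scores every entity up front and writes results to a store for cheap lookup at request time. The model, feature source, and output path below are illustrative assumptions.

```python
# Batch-inference sketch: precompute predictions on a schedule, then serve
# them as simple lookups. Model and data sources are stand-ins.
import json

def load_model():
    return lambda features: sum(features)  # stand-in for a real model

def load_user_features():
    # Stand-in for a warehouse / feature-store query.
    return {"user_1": [0.2, 0.5], "user_2": [0.9, 0.1]}

def run_batch_job(output_path="predictions.json"):
    model = load_model()
    features = load_user_features()
    predictions = {user: model(feats) for user, feats in features.items()}
    with open(output_path, "w") as f:
        json.dump(predictions, f)  # online layer only does key lookups

if __name__ == "__main__":
    run_batch_job()
```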
Solutions: quantization (reduce numerical precision for a smaller, faster model), distillation (train a compact student model to mimic a larger one), caching of frequent predictions. Tools: TensorFlow Serving, KServe, Ray Serve, Seldon.
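For example, post-training dynamic quantization in PyTorch stores Linear weights as int8, shrinking the model and speeding up CPU inference at a small accuracy cost (a minimal sketch, assuming torch is installed; the toy model is a placeholder):

```python
import torch
import torch.nn as nn

# Toy float32 model standing in for a real trained network.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# Quantize Linear layers' weights to int8; activations stay dynamic.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 256)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller/faster Linear layers
```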
Deployment: A/B test on a subset of traffic, then canary rollout to a small percentage before full release. Monitoring: latency, error rate, throughput.
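A sketch of canary routing with per-variant metrics: hash the request key so each user lands in a stable bucket, send a small fraction to the new model, and count requests, errors, and latency per variant. The 5% split and metric names are illustrative assumptions.

```python
import hashlib
import time
from collections import defaultdict

CANARY_FRACTION = 0.05  # assumed canary share of traffic
metrics = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_ms": 0.0})

def pick_variant(user_id: str) -> str:
    # Stable bucketing: the same user always hits the same variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_FRACTION * 100 else "stable"

def serve(user_id, features, stable_model, canary_model):
    variant = pick_variant(user_id)
    model = canary_model if variant == "canary" else stable_model
    start = time.perf_counter()
    try:
        return model(features)
    except Exception:
        metrics[variant]["errors"] += 1
        raise
    finally:
        metrics[variant]["requests"] += 1
        metrics[variant]["latency_ms"] += (time.perf_counter() - start) * 1000
```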
Real-world examples: Netflix precomputes movie recommendations in batch; Uber estimates ride ETAs with online inference; Amazon personalizes with both.
Key Takeaways
Use online inference for low-latency, per-request predictions and batch inference for large asynchronous workloads.
Meet size and latency constraints with quantization, distillation, and prediction caching.
Roll out new models gradually (A/B tests, canaries) and monitor latency, error rate, and throughput.
Visual Diagram
Online: Request -> Model Server (inference) -> Response
Batch:  Scheduled batch job -> Precompute predictions -> Store for lookup