CDO Inferencing Services
Enterprise AI Solution
Internal, scalable inference endpoints serving embedding models, classifiers, and LLMs for the Chief Data Office.
Overview
A centralized AI inference platform providing scalable endpoints for embedding models, classifiers, and large language models. This infrastructure serves as the backbone for AI capabilities across the AT&T Chief Data Office, enabling teams to deploy and consume AI models without managing infrastructure.
Challenges
- Supporting diverse model types (embeddings, classifiers, LLMs) with varying resource requirements
- Ensuring consistent API interfaces across different model serving frameworks
- Managing GPU resources efficiently across multiple concurrent model deployments
- Implementing enterprise-grade authentication and rate limiting
- Providing observability and monitoring for model performance and usage
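One of the challenges above, rate limiting, can be illustrated with a token-bucket sketch. This is a minimal, self-contained example of the general technique, not the platform's actual implementation; the rate and capacity values are illustrative.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Illustrative settings: 5 requests/sec with a burst of 2.
bucket = TokenBucket(rate=5.0, capacity=2)
results = [bucket.allow() for _ in range(4)]  # first two pass, the rest are throttled
```

In production this logic typically sits in the API gateway or proxy layer, keyed per client or API token rather than globally.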
Solutions
- Architected a unified API layer using LiteLLM Proxy Server for a consistent OpenAI-compatible interface
- Deployed KServe with custom inference services for flexible model deployment patterns
- Implemented NVIDIA NIM and vLLM for optimized LLM serving with dynamic batching
- Created Helm charts with Kustomize overlays for environment-specific configurations
- Built comprehensive monitoring stack with Prometheus metrics and Grafana dashboards
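The core idea behind the unified API layer is routing: clients address every model through one OpenAI-style interface, and the proxy maps the request's `model` field to the backend that serves it. The sketch below illustrates that dispatch in isolation; the model names and backend URLs are hypothetical placeholders, not the platform's real routes.

```python
# Hypothetical routing table: public model name -> backend serving endpoint.
MODEL_ROUTES = {
    "text-embedding-small": "http://kserve-embeddings.models.svc/v1",
    "intent-classifier":    "http://kserve-classifier.models.svc/v1",
    "llama-3-8b-instruct":  "http://vllm-llama.models.svc/v1",
}

def route(request: dict) -> tuple[str, dict]:
    """Pick a backend for an OpenAI-style request based on its `model` field."""
    model = request.get("model")
    if model not in MODEL_ROUTES:
        raise ValueError(f"unknown model: {model!r}")
    # The proxy forwards the payload unchanged, so clients see one consistent API
    # regardless of which serving framework sits behind each model.
    return MODEL_ROUTES[model], request

backend, payload = route({"model": "intent-classifier", "input": "cancel my plan"})
```

Keeping the routing table as configuration (rather than code) is what lets new models be added without changing any client.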
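Dynamic batching, mentioned above for LLM serving, amortizes GPU cost by grouping concurrent requests into one forward pass. Engines like vLLM implement this internally; the stdlib sketch below only shows the scheduling idea — wait briefly after the first request arrives, then cut a batch at either a size cap or a deadline. The batch size and wait time are illustrative assumptions.

```python
import queue
import time

def batch_requests(q: queue.Queue, max_batch: int = 8, max_wait_s: float = 0.01) -> list:
    """Collect up to `max_batch` queued requests, waiting at most `max_wait_s`
    after the first request arrives -- the core scheduling idea of dynamic batching."""
    batch = [q.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline hit: ship a partial batch rather than stall latency
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch

q = queue.Queue()
for prompt in ["summarize A", "summarize B", "summarize C"]:
    q.put(prompt)
batch = batch_requests(q)  # all three fit in one batch
```

The `max_wait_s` knob is the latency/throughput trade-off: larger windows yield fuller batches but add tail latency to the first request in each batch.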
Key Results
- Reduced model deployment time from weeks to hours with standardized pipelines
- Achieved a 10x cost reduction through GPU sharing and efficient resource allocation
- Served 50+ models across 20+ teams through unified API access
- Maintained 99.95% availability with automatic failover and scaling
- Processed over 1 million inference requests daily with p99 latency under 200ms
Technologies
KServe, NVIDIA Triton, vLLM, Kubernetes, Helm, Python