CDO Inferencing Services
Enterprise AI Solution
Internal, scalable inference endpoints serving embedding models, classifiers, and LLMs for the Chief Data Office.
Overview
A centralized AI inference platform providing scalable endpoints for embedding models, classifiers, and large language models. This infrastructure serves as the backbone for AI capabilities across the AT&T Chief Data Office, enabling teams to deploy and consume AI models without managing infrastructure.
Challenges
- Supporting diverse model types (embeddings, classifiers, LLMs) with varying resource requirements
- Ensuring consistent API interfaces across different model serving frameworks
- Managing GPU resources efficiently across multiple concurrent model deployments
- Implementing enterprise-grade authentication and rate limiting
- Providing observability and monitoring for model performance and usage
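One of the challenges above, rate limiting, can be illustrated with a token-bucket sketch. This is a minimal, self-contained example of the general technique, not the platform's actual implementation; the rate and capacity values are illustrative.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Illustrative settings: 5 requests/sec with a burst of 2.
bucket = TokenBucket(rate=5.0, capacity=2)
results = [bucket.allow() for _ in range(4)]  # first two pass, the rest are throttled
```

In production this logic typically sits in the API gateway or proxy layer, keyed per client or API token rather than globally.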
Solutions
- Architected a unified API layer using LiteLLM Proxy Server for a consistent OpenAI-compatible interface
- Deployed KServe with custom inference services for flexible model deployment patterns
- Implemented NVIDIA NIM and vLLM for optimized LLM serving with dynamic batching
- Created Helm charts with Kustomize overlays for environment-specific configurations
- Built comprehensive monitoring stack with Prometheus metrics and Grafana dashboards
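The core idea behind the unified API layer is routing: clients address every model through one OpenAI-style interface, and the proxy maps the request's `model` field to the backend that serves it. The sketch below illustrates that dispatch in isolation; the model names and backend URLs are hypothetical placeholders, not the platform's real routes.

```python
# Hypothetical routing table: public model name -> backend serving endpoint.
MODEL_ROUTES = {
    "text-embedding-small": "http://kserve-embeddings.models.svc/v1",
    "intent-classifier":    "http://kserve-classifier.models.svc/v1",
    "llama-3-8b-instruct":  "http://vllm-llama.models.svc/v1",
}

def route(request: dict) -> tuple[str, dict]:
    """Pick a backend for an OpenAI-style request based on its `model` field."""
    model = request.get("model")
    if model not in MODEL_ROUTES:
        raise ValueError(f"unknown model: {model!r}")
    # The proxy forwards the payload unchanged, so clients see one consistent API
    # regardless of which serving framework sits behind each model.
    return MODEL_ROUTES[model], request

backend, payload = route({"model": "intent-classifier", "input": "cancel my plan"})
```

Keeping the routing table as configuration (rather than code) is what lets new models be added without changing any client.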
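Dynamic batching, mentioned above for LLM serving, amortizes GPU cost by grouping concurrent requests into one forward pass. Engines like vLLM implement this internally; the stdlib sketch below only shows the scheduling idea — wait briefly after the first request arrives, then cut a batch at either a size cap or a deadline. The batch size and wait time are illustrative assumptions.

```python
import queue
import time

def batch_requests(q: queue.Queue, max_batch: int = 8, max_wait_s: float = 0.01) -> list:
    """Collect up to `max_batch` queued requests, waiting at most `max_wait_s`
    after the first request arrives -- the core scheduling idea of dynamic batching."""
    batch = [q.get()]  # block until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline hit: ship a partial batch rather than stall latency
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch

q = queue.Queue()
for prompt in ["summarize A", "summarize B", "summarize C"]:
    q.put(prompt)
batch = batch_requests(q)  # all three fit in one batch
```

The `max_wait_s` knob is the latency/throughput trade-off: larger windows yield fuller batches but add tail latency to the first request in each batch.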
Key Results
- Reduced model deployment time from weeks to hours with standardized pipelines
- Achieved a 10x cost reduction through GPU sharing and efficient resource allocation
- Served 50+ models across 20+ teams through unified API access
- Maintained 99.95% availability with automatic failover and scaling
- Processed over 1 million inference requests daily with p99 latency under 200ms
Technologies
KServe, NVIDIA Triton, vLLM, Kubernetes, Helm, Python