Lead Azure GenAIOps / LLMOps Engineer
Remote / Hybrid (India) · Full-Time · Lead
Salary not disclosed
Job Details
- Experience: 10–14+ years
- Required Skills: Docker, Microsoft Azure, FastAPI, Terraform, GitHub Actions, Datadog, Generative AI
Requirements
- B.Tech/M.Tech in Computer Science or related field (Ph.D. is a plus but not mandatory for this Ops-centric role).
- Expert-level Microsoft Azure (AI Foundry, Azure OpenAI, Azure ML).
- Deep experience with Azure Kubernetes Service (AKS), Docker, and KEDA for auto-scaling AI workloads.
- Mastery of LangGraph, LlamaIndex, and FastAPI for building high-concurrency AI backends.
- Hands-on experience with vector stores such as Azure AI Search, Pinecone, or Milvus.
- Proven experience with GitHub Actions or Azure Pipelines for ML/LLM CI/CD.
- Ability to explain latency-vs-accuracy trade-offs to non-technical business leaders.
- Experience leading a team of 4–6 engineers and setting the technical standard for code reviews and architectural blueprints.
- A track record of moving beyond simple RAG into advanced patterns such as GraphRAG and multi-modal pipelines.
- Azure Solutions Architect or Azure AI Engineer Associate certification preferred.
Responsibilities
- Architect and scale multi-agent systems using LangGraph, AutoGen, or Semantic Kernel.
- Implement persistent state management and deterministic fallback logic for autonomous agents.
- Design and manage a centralized AI Gateway (using Azure API Management) to handle request routing, rate limiting, and cost attribution across business units.
- Provision and manage Azure AI resources (Foundry, Search, CosmosDB) using Terraform or Bicep to ensure reproducible environments.
- Implement end-to-end distributed tracing for LLM calls using tools like Langfuse, Arize Phoenix, or LangSmith integrated with Azure Monitor/Datadog.
- Build automated "Evaluation-as-a-Service" pipelines and use "LLM-as-a-Judge" patterns to score groundedness, relevance, and faithfulness.
- Manage the lifecycle of models (GPT-4o, Llama 3.x, Phi-4) including versioning, blue-green deployments, and A/B testing of system prompts.
- Enforce Zero Trust security for AI by implementing Private Links, Managed Identities, and Virtual Network isolation for all LLM traffic.
- Deploy and tune Azure AI Content Safety and custom jailbreak detection layers to prevent prompt injection and PII leakage.
- Monitor token usage and latency metrics to provide FinOps insights and prevent "runaway" agent costs.