Senior DevOps Engineer
ISHIR
Digital Innovation, Enterprise AI Services
India / United States, EST Overlap | Full-Time | Senior
Salary not disclosed
Job Details
- Experience: 8+ years
- Required Skills: AWS, Docker, Python, Bash, Java, Kubernetes, Jira, Go, Prometheus, Linux, Terraform, ServiceNow, Helm
Requirements
- 8+ years in a platform/SRE/DevOps or infrastructure role, with a strong bias toward automation and support.
- Experience operating Kubernetes (or similar) and core ecosystem tools (Helm, Docker, Ingress NGINX, Argo Rollouts basics).
- Hands-on CI/CD experience (preferably GitLab CI): writing/modifying jobs, artifacts, environments, and basic deployment strategies.
- Scripting ability in Bash or Python (Go a plus) to automate repetitive tasks and improve runbooks.
- Familiarity with AWS fundamentals (e.g., IAM, EC2/EKS, S3, CloudWatch/CloudTrail, Parameter Store/Secrets Manager).
- Practical understanding of monitoring/observability (dashboards, logs, alerts) and how to use them for triage and remediation, including Prometheus/Alertmanager/Thanos and OpenTelemetry basics.
- Comfortable working from tickets (Jira/ServiceNow), following change-management practices, and communicating clearly with stakeholders.
- Terraform experience for infrastructure as code (highly preferred).
- API integration experience (Java, Python, or Go) to build small internal tools or glue code (highly preferred).
- Deeper Linux fundamentals and container runtime basics for effective debugging and performance tuning (highly preferred).
- Exposure to insurance/financial services environments, including awareness of compliance and operational controls (highly preferred).
Responsibilities
- Operate and improve the platform tools that teams rely on to ship products reliably: triage tickets, fix build issues, and handle routine service requests.
- Maintain and extend self-service workflows (templates, golden paths) by updating docs, examples, and guardrails.
- Perform day-to-day Kubernetes operations: deploy/update Helm charts, manage namespaces, diagnose rollout issues, and follow runbooks for incident response.
- Support CI/CD pipelines (e.g., GitLab CI): keep pipelines green, add/adjust jobs, implement basic quality gates, and help teams adopt safer deploy strategies.
- Monitor and operate the observability stack using Prometheus, Alertmanager, and Thanos; maintain alert rules, dashboards, and SLO/SLA indicators; help reduce alert noise and improve signal quality.
- Assist with service instrumentation across tracing, logging, and metrics using OpenTelemetry and related telemetry tooling.
- Contribute to and improve documentation: runbooks, FAQs, onboarding guides, and standard operating procedures.
- Participate in an on-call rotation as needed with a well-defined escalation path; assist during incidents, post small fixes, and capture learnings in docs.
- Help with cost- and performance-minded housekeeping: right-size workloads, prune unused resources, and automate routine tasks where appropriate.