Site Reliability Engineer

Posted about 1 month ago

💎 Seniority level: Senior, 7+ years

📍 Location: United States

🔍 Industry: Software Development

🏢 Company: Freed 👥 11-50 Health Care

🗣️ Languages: English

⏳ Experience: 7+ years

🪄 Skills: PostgreSQL, SQL, Bash, Cloud Computing, Git, Kubernetes, Azure, Redis, CI/CD, Terraform, Compliance

Requirements:

7+ years of experience in SRE, Production Engineering, Infrastructure Engineering, or related roles
Strong proficiency with SQL, Git, Kubernetes, Bash, and Networking (DNS, SSL, IP)
Familiarity with Azure, JavaScript/TypeScript, Python, GitHub, and VS Code
Security-oriented mindset and experience implementing security best practices

Responsibilities:

Manage and expand cloud infrastructure (Azure, Kubernetes), putting in place IaC best practices
Implement observability instrumentation, dashboards and alerts to monitor product health
Collaborate with engineers on product teams on new infrastructure needs or developer experience improvements, and make application-level changes that improve system reliability (e.g., database configuration, JS bundle delivery, caching, or network resiliency)
Implement security requirements (HITRUST and SOC 2) and collaborate with the Technical Program Manager on audits
Maintain databases (PostgreSQL, Redis) with backups, migrations, and security/privacy controls while monitoring performance and stability

Related Jobs

🔥 Senior Site Reliability Engineer

Posted about 1 hour ago

📍 USA

🧭 Full-Time

💸 140,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Juniper Square

🔧 Requirements

5+ years of experience in SRE, DevOps, or Infrastructure Engineering with a proven track record of ownership and initiative.
Strong experience with Kubernetes, Helm, and CNIs, including networking and security.
Proficiency in AWS services such as RDS, Aurora, IAM, VPC, EKS, and EC2.
Experience in PostgreSQL administration, including performance tuning and high availability in RDS/Aurora.
Hands-on experience with GitHub Actions and ArgoCD for secure and scalable CI/CD automation.
Strong background in Infrastructure as Code (IaC) with Crossplane and Terraform.
Deep understanding of observability and monitoring with Datadog.
Experience with Kyverno for Kubernetes policy-based security enforcement.
Proficiency in Python and Bash scripting for automation and system management.
Strong understanding of CI/CD security best practices and ability to implement controls for securing deployments.

💡 Responsibilities

Own reliability and scalability initiatives—identify, prioritize, and implement solutions before issues escalate.
Participate in an on-call rotation, responding to incidents, performing root cause analysis, and driving long-term fixes.
Design, deploy, and manage Kubernetes clusters using Helm charts, Cilium, and Karpenter to optimize performance and cost.
Architect and maintain AWS infrastructure with a focus on RDS/Aurora PostgreSQL, networking, and scaling best practices.
Implement GitHub Actions CI/CD pipelines, integrating security best practices and automation.
Define and enforce policy-based security for Kubernetes using Kyverno.
Automate infrastructure provisioning with Crossplane and Terraform to ensure consistency and scalability.
Enhance observability and monitoring using Datadog to proactively detect and resolve issues.
Improve security and reliability by identifying risks in CI/CD, cloud environments, and Kubernetes, then implementing necessary safeguards.
Lead post-incident reviews, drive lessons learned into long-term improvements, and document best practices in Confluence.

AWS, PostgreSQL, Python, Bash, Kubernetes, CI/CD, DevOps, Terraform

🔥 Site Reliability Engineer (CST or EST Remote)

Posted 3 days ago

📍 United States

🧭 Full-Time

💸 72,700 - 145,400 USD per year

🔍 Software Development

🏢 Company: careers

🔧 Requirements

3 years of experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer
Experience with AWS, Azure, or GCP cloud infrastructure
Experience with PHP and JavaScript/TypeScript
Bachelor's degree in business / information technology / computer science, or equivalent qualifications or experience.

💡 Responsibilities

Be the escalation point for problems and incidents for our Customer Support teams.
Triage problems and incidents, and either resolve them, or escalate them to the technical specialists that can resolve them.
Internally communicate the status of problems and incidents.
Generate Root Cause Analysis (RCA) statements for internal and external use.
Champion corrective and preventative actions internally to ensure similar problems and incidents don't happen again.
Design, implement, maintain, and continuously improve our monitoring and alerting mechanisms.
Proactively explore and drive improvements to the overall quality and reliability of our software platform.
Measure and report on the overall quality of service of the software platform, including incidents, actions, and SLA metrics.

AWS, PHP, Cloud Computing, GCP, JavaScript, TypeScript, Azure, CI/CD, DevOps

🔥 Senior Site Reliability Engineer, Performance

Posted 7 days ago

📍 United States, Canada, Mexico

🧭 Full-Time

🔍 Software Development

🏢 Company: Fleetio

🔧 Requirements

5+ years of Ruby/Rails experience
3+ years of AWS experience
Kubernetes experience
Experience with profiling and benchmarking source code
Effective at code review and at identifying potential performance problems before they reach production
Experience with Datadog or other APM tools
Excellent written and verbal communication skills

💡 Responsibilities

Proactively identify, triage, and resolve performance issues
Enhance system observability by monitoring performance metrics across Ruby, Rails, and database systems, including SLOs and SLIs
Guide product engineers on Ruby/Rails performance and database best practices through code reviews and pair programming
Optimize performance through instance configuration and monitoring
Collaborate with other SREs to proactively identify and address performance bottlenecks
Lead database capacity planning and upgrade initiatives
Manage the database-specific components of disaster recovery planning and execution
Oversee backup systems and pre-production databases
Create and maintain infrastructure and operations documentation
Participate in the on-call rotation

AWS, PostgreSQL, SQL, Cloud Computing, Kubernetes, Ruby, Ruby on Rails, CI/CD, Terraform

🔥 Site Reliability Engineer

Posted 8 days ago

📍 France, Germany, Spain, United Kingdom, United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Platform.sh 👥 251-500 💰 $140,000,000 Series D almost 3 years ago Internet Open Source PaaS Cloud Management Software

🔧 Requirements

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Site Reliability Engineer

Posted 9 days ago

📍 France, Germany, Spain, the United Kingdom, West Coast in the United States, Canada

🧭 Full-Time

🔍 Software Development

🏢 Company: Remote Woman

🔧 Requirements

A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

💡 Responsibilities

Refine Monitoring and Observability
Automate Deployments and Workflows
Optimize CI/CD Pipelines
Cloud Infrastructure Management
Incident Response and Post-Mortem
Collaborate with Cross-Functional Teams
Drive Technical Innovation

AWS, Docker, Python, Bash, Cloud Computing, GCP, Kubernetes, Azure, Go, Grafana, Prometheus, Collaboration, CI/CD, Problem Solving, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer - Identity Platform

Posted 9 days ago

📍 USA

🧭 Full-Time

💸 186,065 - 218,900 USD per year

🔍 Software Development

🏢 Company: Coinbase 👥 1000-5000

🔧 Requirements

5+ years of experience building, iterating upon, and maintaining corporate IAM systems
5+ years of experience with operational procedures and application development
Deep domain-knowledge with prominent cloud identity provider(s): Okta, Duo, Google Workspace, Azure AD, Ping, etc.
Demonstrated success developing and implementing tooling that solves problems related to identity lifecycle and provisioning, SSO, MFA, ABAC, RBAC, directory services, zero trust networking, PAM, PIM, and secrets management
Experience configuring and implementing modern open source tooling such as Terraform, Ansible, Kubernetes, and Docker
Fluency in a modern programming language (Golang, Python, Ruby, Java, C#, etc.)
Strong experience using and managing AWS, GCP, Azure, or another cloud environment with IaC
Strong understanding of CI/CD workflows, automation frameworks, and best practices
Clear communication—demonstrate ability to explain technical concepts simply
Self starter—possess a continuous learning mindset
Demonstrate critical thinking under pressure

💡 Responsibilities

Engage in a dynamic role that combines traditional operations responsibilities and active contributions to the development and deployment of cloud-native applications, fostering a DevOps culture that emphasizes collaboration and automation
Partner across Coinbase to design, implement, and maintain performant, reliable, and secure system architectures
Provide corporate IAM and DevOps tooling subject matter expertise to adjacent IT, Security, and Engineering teams
Implement automation tooling and scripts to eliminate manual, repetitive tasks and reduce inefficiencies in system operations
Create comprehensive documentation and runbooks that detail system configurations, operational procedures, and troubleshooting steps across system lifecycle
Build and maintain CI/CD pipelines for integrating changes and deploying to production in progressively tested environments
Deliver configurations and maintain state using configuration management tools
Facilitate incident response, and conduct root cause analysis and blameless retrospectives
Define metrics and bolster monitoring/observability across corporate IAM systems
Participate in regular on-call rotation to ensure 24x7 uptime for critical systems

AWS, Docker, Python, Cloud Computing, Kubernetes, LDAP, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Ansible, Scripting

🔥 Senior Site Reliability Engineer, EU, UK or Americas

Posted 10 days ago

📍 Americas, EU, UK

🔍 Cryptocurrency

🏢 Company: Auros 👥 11-50 💰 $17,000,000 about 2 years ago Cryptocurrency

🔧 Requirements

An SRE/DevOps professional with experience managing and optimising Linux systems in a high-performance 24 x 7 environment.
Cloud management using IaC, with experience in AWS, Azure or Google Cloud.
A background in container management, deployment, and orchestration. Kubernetes experience is good to have; strong Docker skills are required.
Knowledge and experience in managing configuration at scale.
Experience with CI/CD pipelines and version control best practices.
Experience with application and infrastructure instrumentation using tools like Prometheus, OpenTelemetry and eBPF.
Strong knowledge of cloud security and IAM policies is required.
SIEM and threat management experience.
Must know how to secure Mac and Linux endpoints.
Python and Bash experience is a must.

💡 Responsibilities

Participate in on-call roster to support our trading operations.
Maintain and improve our global infrastructure with high performance and reliability requirements.
Improve and update the security infrastructure of a widely distributed company that operates in a high-risk environment.
Engage and collaborate with other teams around system layout, rollout procedures and improving DevOps processes.
Develop internal tools and automation to accomplish the team’s goals.
Application tuning and troubleshooting; you will keep abreast of changes to trading system features and deployment, providing guidance to developers looking to improve their application performance or reliability.
Active participation in various trading and infrastructure projects.
Work closely with developers, traders and other staff to accomplish our firm’s goals.

AWS, Docker, Python, Bash, Cloud Computing, Cybersecurity, GCP, Kubernetes, Azure, Prometheus, CI/CD, Linux, DevOps, Terraform, Ansible

🔥 Site Reliability Engineer

Posted 13 days ago

📍 United States

🧭 Full-Time

💸 175,000 - 220,000 USD per year

🔍 Software Development

🏢 Company: Orca 👥 11-50 💰 $18,000,000 Series A over 3 years ago Cryptocurrency Blockchain Online Portals Information Technology

🔧 Requirements

Extensive experience with AWS services (e.g., ECS, Copilot, CloudWatch) and the ability to troubleshoot and optimize cloud-based systems.
Hands-on experience with tools like GitHub Actions for reliable and efficient deployment workflows.
Familiarity with tools like Datadog to build actionable monitoring and alerting systems.
Proficiency in infrastructure-as-code tools like Terraform, and containerization tools like Docker.
Experience with orchestrators like Kubernetes or Airflow is a plus.

💡 Responsibilities

Design, manage, and optimize AWS infrastructure with a focus on scalability, reliability, and cost efficiency.
Build and refine CI/CD processes using modern tools, ensuring seamless, secure, and efficient deployments.
Develop robust monitoring, logging, and alerting systems using tools like Datadog or Grafana to improve visibility and system performance.
Architect systems that handle growth effortlessly, minimize downtime, and maintain high performance.
Implement effective alerting mechanisms to prioritize and address critical issues proactively.
Optimize and document infrastructure processes, leveraging tools like Terraform, Docker, and Airflow to create scalable and maintainable systems.
Partner with engineering teams to design and refine infrastructure that powers features like real-time monitoring, automated transaction execution, and analytics.

AWS, Docker, PostgreSQL, Kubernetes, Airflow, Grafana, Rust, CI/CD, Linux, DevOps, Terraform

🔥 Senior Site Reliability Engineer

Posted 14 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

🔧 Requirements

5+ years of experience as a Site Reliability Engineer or similar role.
Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
Experience with monitoring and alerting tools like CloudWatch, Datadog, Prometheus, Grafana, Kibana, and PagerDuty.
Experience with GitHub Actions, CI/CD pipelines, and deployment strategies.
Strong problem-solving and analytical skills.
Excellent communication and collaboration skills.
Ability to work independently and take ownership of complex tasks.
Passion for technology and a desire to learn and grow.
Experience with Jenkins, PostgreSQL, and MongoDB.
Experience with cloud cost optimization, security best practices and tools.
Experience working in a fast-paced, agile environment.
Experience with Rancher, Cattleprod, and TeamCity is a plus.

💡 Responsibilities

Ensure service reliability and SLO/SLA adherence in production, preventing incidents by proactively conducting failure testing.
Implement automated monitoring and alerting systems for early detection of potential problems.
Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
Proactively identify and address potential security vulnerabilities to maintain compliance, IAM best practices, and secrets management.
Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
Help onboard and mentor junior team members, sharing your knowledge and expertise.
Stay up to date on the latest cloud technologies and best practices for SRE.
Participate in a well-structured on-call rotation with other Site Reliability Engineers.
Explore new technologies and innovative solutions to improve service quality and speed to market.
Participate in technical discussions and deep dives with the other engineering and product teams.

AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible

🔥 Staff Site Reliability Engineer

Posted 15 days ago

📍 Australia, Austria, Bangladesh, Belgium, Brazil, Canada, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, India, Indonesia, Ireland, Israel, Italy, Kenya, Mexico, Netherlands, Nigeria, Peru, Poland, Singapore, South Africa, Spain, Sweden, Switzerland, Uganda, United Arab Emirates, United Kingdom, United States of America, Uruguay

🧭 Full-Time

💸 129,347 - 200,824 USD per year

🔍 Software Development

🏢 Company: Wikimedia Foundation 👥 251-500 💰 $2,100,000 Grant over 5 years ago

🔧 Requirements

7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles, with substantial exposure to production-grade machine learning systems.
Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration, distributed training systems).
Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible, Helm, Argo CD).
Experience implementing observability, monitoring, and logging for ML systems (e.g., Prometheus, Grafana, ELK stack).
Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn).

💡 Responsibilities

Designing and implementing robust ML infrastructure used for training, deployment, monitoring, and scaling of machine learning models.
Improving reliability, availability, and scalability of ML infrastructure, ensuring smooth and efficient workflows for internal ML engineers and researchers.
Collaborating closely with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle.
Proactively monitoring and optimizing system performance, capacity, and security to maintain high service quality.
Providing expert guidance and documentation to teams across Wikimedia to effectively utilize the ML infrastructure and best practices.
Mentoring team members and sharing knowledge on infrastructure management, operational excellence, and reliability engineering.

AWS, Docker, Python, Cloud Computing, Elasticsearch, Kubernetes, Machine Learning, MLflow, NumPy, PyTorch, Grafana, Prometheus, TensorFlow, Linux, DevOps, Terraform, Ansible
