Site Reliability Engineer

Posted 3 days ago

💎 Seniority level: Senior, 5 years

📍 Location: AL, AZ, CA, CO, CT, FL, GA, IL, IN, MA, MI, NC, NJ, NV, NY, OH, PA, SC, TX, UT, PST

💸 Salary: 120,000 - 150,000 USD per year

🔍 Industry: Software Development

🏢 Company: Convoso 👥 251-500 · Internet, Computer, SaaS, Call Center, Brand Marketing, Telecommunications, Software

🗣️ Languages: English

⏳ Experience: 5 years

🪄 Skills: Python, Bash, Kubernetes, MySQL, Nginx, Zabbix, Grafana, Prometheus, Tomcat, Linux, DevOps, Scripting

Requirements:

At least 5 years of experience in configuring enterprise-level Linux systems within a highly networked environment.
Experience in using Chef for configuration management and automation, including the creation and management of Chef cookbooks and recipes.
Proficiency in scripting languages such as Python and Bash for automating tasks.
Familiarity with designing, implementing, and optimizing centralized logging solutions for effective log analysis, integration, security compliance, and incident response.
Familiarity with MySQL or similar database systems.
Experience with virtualization technologies such as Proxmox, KVM, QEMU, or VMware is beneficial.
Experience managing infrastructure with bare metal servers in data center environments.
Proactive, resourceful, and adept at managing projects and solving problems from start to finish.
Strong skills in diagnosing and resolving complex system issues while maintaining stability and reliability.
Experience working in a fast-paced startup environment.
Experience in administering and tuning application stacks such as Tomcat, Apache, Nginx, and HAProxy.
Experience with monitoring systems like Grafana, Prometheus, and Zabbix.
Flexible with shift work and available for overtime as needed.
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Responsibilities:

Manage and monitor installed systems, networks, and infrastructure in line with company guidelines and standard operating procedures (SOPs).
Ensure the highest level of systems and infrastructure availability.
Install, configure, and test operating systems, application software, and system management tools.
Design, plan, and automate deployment of network communications systems.
Provide documentation as required to ensure accurate and current information is available for each site network.
Implement warranty and support activities.
Plan and implement system automation as required for better efficiency.
Oversee the development of customized software and hardware requirements.
Stay up to date with security best practices and drive their implementation.
Collaborate with other professionals to ensure high-quality deliverables within organization guidelines, policies, and procedures.
Run diagnostics to resolve customer-reported issues.
Apply work-process, optimization, and risk-management tools on assigned projects to meet stakeholder requirements.

Related Jobs

🔥 Site Reliability Engineer

Posted 4 days ago

📍 United States of America

💸 63,000 - 108,675 USD per year

🏢 Company: vspvisioncareers

🔧 Requirements

Bachelor’s Degree in Computer Science or related field and/or equivalent experience
4+ years of related functional experience
Experience with both Windows and Linux, as well as containerization software products
Functional with continuous integration and continuous delivery
Experience with automation and orchestration using Chef, Puppet, Ansible and containers
Coding skills beyond simple scripts and knowledge of application architecture
Ability to program (structured and OO) with one or more high level languages, such as Python, Java, C/C++/C#, Ruby, and JavaScript
Understanding of distributed storage technologies like NFS, HDFS, Ceph, S3 as well as dynamic resource management frameworks (OpenShift, Kubernetes, Yarn)
Skilled in spotting problems and identifying performance bottlenecks, leading to problem and root cause analysis and risk mitigation
Capacity monitoring and performance planning experience with cloud solutions like AWS using applications such as Dynatrace, New Relic, and AppDynamics

💡 Responsibilities

Use engineering design concepts to recommend design or test methods for attaining or improving operational reliability in support of business objectives.
Develop and implement high-reliability tools, systems, and services using engineering methodologies and tools.
Determine reliability requirements and deliver insights from massive scale data in real time.
Propose changes in design or formulation to improve system and/or process reliability.
Utilize best practices and work with cross-functional teams to provide solutions and a positive user experience.
Improve reliability, quality, and time-to-market for suite of software solutions, through effective hosting, monitoring, operations, and automation
Develop proprietary tools to improve system reliability and mitigate weaknesses in incident management or software delivery
Collaborate with team members to troubleshoot and fix issues utilizing knowledge of problems to route support escalation issues to the appropriate teams
Add automation for improved real-time collaborative response, and update documentation, runbook tools, and modules to prepare teams for incidents
Support optimizing the software development life cycle to boost service reliability, based on post-incident reviews
Support system cost modeling for all hosted systems
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
Deliver primary operational support and engineering for distributed software applications
Implement guidelines and plans for automated systems delivery maintaining system and data security
Assist with impact analysis regarding enterprise-wide technology
Perform capacity monitoring with various monitoring tools (Splunk, Dynatrace, etc.) and make recommendations
Gather and analyze metrics from both operating systems and applications to assist in performance tuning, fault finding, and corrective action planning
Support system integration, software, and hardware at enterprise level for optimum performance
Partner with development teams to improve services through rigorous automated testing and release procedures
Contribute to system architecture planning, and policies and procedures surrounding enterprise-wide technology
Participate in system design consulting, platform management, and capacity planning
Stay abreast of new technologies; introduce applicable technology in alignment with business goals and for creative solutions

AWS, Docker, PostgreSQL, Python, SQL, Bash, Cloud Computing, Data Analysis, DynamoDB, ElasticSearch, Git, Java, Kafka, Kubernetes, MySQL, Oracle, Rabbitmq, Software Architecture, Zabbix, Algorithms, Cassandra, Data Structures, Prometheus, Redis, Spark, Communication Skills, Analytical Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Teamwork, Troubleshooting, JSON, Cross-functional collaboration, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer

Posted 4 days ago

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune 👥 101-250

🔧 Requirements

Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
Experience with infrastructure-as-code and orchestration tools.
Strong understanding of system performance, debugging, and optimization across diverse environments.
Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
Solid foundation in computer science fundamentals and system design.
Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
Experience with distributed systems and managing large-scale, high-availability environments.
Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
Experience with bare-metal infrastructure.
Proficiency in scripting or programming languages such as Python, Go, or Bash.
Experience with monitoring and observability tools for infrastructure performance.
Familiarity with cloud cost management and performance improvement strategies.
Strong analytical and troubleshooting skills.
Experience working across multiple time zones.

💡 Responsibilities

Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

🔥 Senior Site Reliability Engineer

Posted 4 days ago

📍 United States

🧭 Full-Time

💸 128,350 - 192,100 USD per year

🔍 Software Development

🏢 Company: ClickHouse 👥 101-250 💰 Series B over 2 years ago · Database, Artificial Intelligence (AI), Big Data, Analytics, Software

🔧 Requirements

At least 8 years of experience in Site Reliability Engineering or a related field.
Previous experience using ClickHouse in production.
Coding experience with Go and/or Python.
Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.

💡 Responsibilities

Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
Ensure all the infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane and ClickHouse Core) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud including working with the support team to communicate to the impacted customers.
Continuously improve the reliability and performance of our ClickHouse services.
Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

AWS, Docker, Python, SQL, Cloud Computing, Kubernetes, Cross-functional Team Leadership, Clickhouse, Go, REST API, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Excellent communication skills, Teamwork, Strong communication skills, Ansible, Debugging

🔥 Site Reliability Engineer

Posted 4 days ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: TCGPlayer_External_Career

🔧 Requirements

5+ years of experience in Site Reliability Engineering or related roles
Experience with an enterprise monitoring solution (New Relic, Scalyr, Datadog, etc.)
Experience managing Linux and/or Windows environments
Experience with IaaS and PaaS solutions (e.g., AWS, GCP, Azure)
Experience with Infrastructure as Code (Terraform or Helm)
Knowledge of Kubernetes / ECS orchestration, and containerization (e.g. Docker)
Demonstrable expertise around specifying, designing and/or implementing system health, performance monitoring tools and software management tools for 24x7 environments
Proficiency in writing code / scripts to automate tasks
Excellent critical-thinking and problem-solving skills

💡 Responsibilities

Innovate, build, and evangelize the practice of site reliability so that TCGPlayer can deliver excellent customer experiences.
Define and measure key performance metrics, such as SLAs and Mean Time Between Failures (MTBF), using those metrics to identify trends and measure the impact on the business.
Develop and maintain up-to-date operational procedures, including runbooks, to adapt to evolving needs.
Anticipate system failures through practices like chaos engineering and tabletop exercises, and establish processes to learn from operational incidents.
Foster strong relationships within the team and across departments while cultivating a communicative, supportive, and results-oriented culture.

AWS, Docker, Cloud Computing, GCP, Kubernetes, Azure, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

🔥 Senior Site Reliability Engineer - Midnight

Posted 5 days ago

📍 United States

🔍 Blockchain

🏢 Company: IO Global

🔧 Requirements

7+ years of experience in SRE, DevOps, or a related role.
Understanding of SRE best practices, architectures, and methods.
Good knowledge of resiliency patterns and cloud security.
Strong programming proficiency in Python, Golang, or JavaScript.
Demonstrated experience with AWS and modern cloud architectures.
Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
Hands-on experience with Kubernetes/EKS and GitOps methodologies.
Proven track record with monitoring tools such as Prometheus and OpenTelemetry, as well as familiarity with the LGTM stack or other comparable tools.
Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
Ability to engage in technical discussions and be part of the decision making process
Strong problem-solving skills and capability to work on complex systems
Experience in working within an Agile environment
Experience in working with a distributed team
Strong communication and collaboration abilities to work seamlessly across different teams.
A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.

💡 Responsibilities

Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
Leverage GitOps principles to automate deployments and manage container orchestration.
Implement and manage CI/CD pipelines ensuring seamless, high-quality deployments, finding and removing bottlenecks, improving performance and working alongside teams to refine feedback loops and automate toil away.
Develop automation tools and scripts to improve operational efficiency.
Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
Collaborate with dev teams to define and implement SLOs/SLIs
Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
Evaluate and adopt new technologies, with a special advantage for candidates with blockchain experience, to keep our systems at the cutting edge.
Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
Strive to strike a balance between effective delivery of goals and a measurable high standard of these goals. Always apply a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, Javascript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

🔥 Sr Site Reliability Engineer (SRE)

Posted 7 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl 👥 251-500 💰 $150,000,000 Series D almost 3 years ago · Real Time, Big Data, Information Technology, Software

🔧 Requirements

Extensive experience with enterprise scale continuous delivery environments
5+ years of experience with a DevOps or SRE job title
Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
Experience with sustainable incident response in a blameless environment
Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
Background in Linux Systems Engineering
Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless

💡 Responsibilities

Engage with teams and improve service delivery and reliability across their entire lifecycle
Measure and monitor all production systems with an eye towards availability, latency and overall system health
Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
Help identify and drive down toil with creative innovation and automation
On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, Javascript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

🔥 Senior Site Reliability Engineer

Posted 10 days ago

📍 United States, Canada, Latin America

🧭 Full-Time

💸 160,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Superhuman 👥 51-200 💰 $75,000,000 Series C over 3 years ago 🫂 Last layoff almost 3 years ago · Software Development

🔧 Requirements

6+ years of experience in SRE, DevOps, or systems engineering roles.
Proven experience managing high-availability, mission-critical systems.
Strong proficiency with cloud platforms (GCP, AWS, or Azure).
Hands-on experience with containers and orchestration tools (Docker, Kubernetes).
Expertise in monitoring, logging, and alerting tools (e.g., Metabase, Datadog, Prometheus, Grafana).
Proficiency in scripting/programming languages (Python, Go, Bash, etc.).
Knowledge of database management systems (SQL/NoSQL).
Strong knowledge of networking, security, and distributed systems.
Experience with Infrastructure as Code (Terraform, Ansible, Chef, or Puppet).
Familiarity with version control systems (Git) and CI/CD pipelines (Jenkins, GitLab CI, etc.).
Strong communication skills and ability to work collaboratively across teams.
Problem-solving mindset with a focus on root cause analysis.
Proactive, self-driven, and able to handle high-pressure environments.

💡 Responsibilities

Collaborate with software engineers to design scalable, fault-tolerant systems and services.
Proactively monitor service health, availability, and performance.
Respond to and troubleshoot production issues.
Perform capacity planning and scaling activities.
Automate repetitive tasks to enhance efficiency.
Design and implement disaster recovery plans and high availability strategies.
Collaborate with our security team to ensure infrastructure adheres to best practices and compliance requirements.
Build, maintain, and enhance CI/CD pipelines.
Manage and automate infrastructure provisioning and configuration.
Work closely with development teams to ensure best practices in deployment and release processes.
Champion DevOps culture by mentoring and guiding other engineers in the use of tools and best practices.

AWS, Docker, Python, SQL, Bash, Cloud Computing, GCP, Git, Jenkins, Kubernetes, Azure, Go, Grafana, Prometheus, Nosql, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Ansible, Scripting

🔥 Site Reliability Engineer

Posted 11 days ago

📍 United States, UK, Philippines, Poland, South Africa

🧭 Permanent

🔍 FinTech

🏢 Company: Zepz 👥 1001-5000 💰 $267,000,000 Series F 6 months ago 🫂 Last layoff over 1 year ago · Mobile Payments, Financial Services, Payments, FinTech

🔧 Requirements

At least 5 years in SRE, DevOps or Engineer role with a keen interest in solving problems using automation.
Understand SRE and DevOps methodologies.
Experience with Grafana, Loki and Prometheus.
Experience supporting or developing applications written in Java, Python, or Node.js.
Understanding of how to analyze and troubleshoot large-scale distributed systems.
Our Cloud Native platform is hosted on AWS.

💡 Responsibilities

Use code to solve problems.
Apply best practices and standards for observability, monitoring, alerting, capacity planning, availability, performance/latency, change, and troubleshooting across all our Tech services.
Work closely with feature teams to ensure that services are correctly monitored, change is delivered in a safe and secure way, resilience is built into our product, and our standards and best practices are adopted.
Lead or be involved in the troubleshooting of complex incidents and problems.
Maintain end-to-end visibility of the service our customers receive, and ensure their journey is stable and consistent across all microservices and third-party dependencies, using the observability tool you will have implemented with the Engineering teams.
Help the team meet its strategic goals: maintain the highest level of observability, maximize developer velocity while keeping our product reliable, and ensure that we can deliver the highest-quality experience to our customers.
Growing together. You’ll review others' work and happily seek feedback on yours to ensure we build a better codebase and sharpen each other's skills.

AWS, Node.js, Python, SQL, Agile, Bash, Cloud Computing, Git, Java, Kafka, Kubernetes, ActiveMQ, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Troubleshooting, JSON, Ansible, Scripting

🔥 Sr. Site Reliability Engineer

Posted 16 days ago

📍 United States, EU

🧭 Full-Time

🔍 Software Development

🏢 Company: AuthZed 👥 11-50 💰 $12,000,000 Series A 9 months ago · Information Technology, Cyber Security, Software

🔧 Requirements

Proven experience as a Site Reliability Engineer or in a similar role.
Strong understanding of networking, operating systems, and cloud infrastructure.
Experience with Site Reliability Engineering, System Design, and Distributed Computing.
Experience in various programming languages — we currently have SDKs for NodeJS, Java, Python, Ruby, and Go.
Experience with containerization technologies such as Docker and Kubernetes.
Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
Experience working with Git and GitHub.
Experience with continuous integration and deployment systems.
Strong problem-solving and troubleshooting skills.
Excellent communication and collaboration abilities.

💡 Responsibilities

Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
Automate infrastructure deployment and configuration management processes.
Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
Participate in on-call rotation and respond to production incidents in a timely manner.
Document system configurations, troubleshooting procedures, and operational guidelines.

Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting

🔥 Snr. Site Reliability Engineer (Remote)

Posted 17 days ago

📍 United States

🧭 Full-Time

💸 130,000 - 165,000 USD per year

🔍 Software Development

🏢 Company: KnowBe4 👥 1001-5000 💰 $300,000,000 Post-IPO Equity almost 2 years ago · Computer Security, Cyber Security, Network Security, Software

🔧 Requirements

BS/MS/Ph.D. or equivalent plus 5 years experience
Proficient in authoring scripts in one or more programming languages (e.g. Python, Ruby, JavaScript).
Experience designing and operating high-scale patterns in AWS
Experience building and designing repeatable workflows for continuous integration and continuous deployment (CI/CD) - GitLab is preferred
Excellent communication skills
Effectively able to self-manage your time across competing projects
Ability to quickly understand and debug complex distributed systems

💡 Responsibilities

Work with other Site Reliability Engineers to build highly scalable and resilient applications and infrastructure in AWS
Maintain and improve extensible infrastructure-as-code using Terraform
Learn, maintain, and improve our existing deployment strategies
Deliver effective observability, monitoring, and alerting patterns for KnowBe4’s applications and infrastructure
Act as an escalation point for identifying and resolving the root cause for production incidents
Provide assistance designing globally distributed systems and processes for the organization
Identify deficiencies in our current applications and infrastructure and correct them when found
Define new approaches and tailored solutions to complex technical problems
Act as a project leader with other Site Reliability Engineers and ensure progress is communicated effectively to project stakeholders

AWS, Docker, Python, SQL, AWS EKS, Cloud Computing, DynamoDB, Kubernetes, Algorithms, Data Structures, REST API, Rust, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, Scripting, Debugging
