Site Reliability Engineer

Posted about 21 hours ago

💎 Seniority level: Senior, 5 years

📍 Location: AL, AZ, CA, CO, CT, FL, GA, IL, IN, MA, MI, NC, NJ, NV, NY, OH, PA, SC, TX, UT, PST

💸 Salary: 120,000 - 150,000 USD per year

🔍 Industry: Software Development

🏢 Company: Convoso | 👥 251-500 | Internet, Computer, SaaS, Call Center, Brand Marketing, Telecommunications, Software

🗣️ Languages: English

⏳ Experience: 5 years

🪄 Skills: Python, Bash, Kubernetes, MySQL, Nginx, Zabbix, Grafana, Prometheus, Tomcat, Linux, DevOps, Scripting

Requirements:
  • At least 5 years of experience in configuring enterprise-level Linux systems within a highly networked environment.
  • Experience in using Chef for configuration management and automation, including the creation and management of Chef cookbooks and recipes.
  • Proficiency in scripting languages such as Python and Bash for automating tasks.
  • Familiarity with designing, implementing, and optimizing centralized logging solutions for effective log analysis, integration, security compliance, and incident response.
  • Familiarity with MySQL or similar database systems.
  • Experience with virtualization technologies such as Proxmox, KVM, QEMU, or VMware is beneficial.
  • Experience managing infrastructure with bare metal servers in data center environments.
  • Proactive, resourceful, and adept at managing projects and solving problems from start to finish.
  • Strong skills in diagnosing and resolving complex system issues while maintaining stability and reliability.
  • Experience working in a fast-paced startup environment.
  • Experience in administering and tuning application stacks such as Tomcat, Apache, Nginx, and HAProxy.
  • Experience with monitoring systems like Grafana, Prometheus, and Zabbix.
  • Flexible with shift work and available for overtime as needed.
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Responsibilities:
  • Manages and monitors installed systems, networks, and infrastructure so that they stay in line with company guidelines and standard operating procedures (SOPs).
  • Ensures the highest level of systems and infrastructure availability.
  • Installs, configures, and tests operating systems, application software, and system management tools.
  • Designs, plans, and automates the deployment of network communications systems.
  • Provides documentation as required to ensure accurate and current information is available for each site network.
  • Implements warranty and support activities.
  • Plans and implements system automation as required for better efficiency.
  • Oversees the development of customized software and hardware requirements.
  • Stays up to date with security best practices and drives their implementation.
  • Collaborates with other professionals to ensure high-quality deliverables within organization guidelines, policies, and procedures.
  • Runs diagnostics to resolve customer-reported issues.
  • Applies work processes, optimization methods, and risk management tools so that projects meet stakeholder requirements.

Related Jobs

🔥 Site Reliability Engineer
Posted about 24 hours ago

📍 United States of America

💸 63,000 - 108,675 USD per year

🏢 Company: vspvisioncareers

  • Bachelor's degree in Computer Science or related field and/or equivalent experience
  • 4+ years of related functional experience
  • Experience with both Windows and Linux, as well as containerization software products
  • Working knowledge of continuous integration and continuous delivery
  • Experience with automation and orchestration using Chef, Puppet, Ansible and containers
  • Coding skills beyond simple scripts and knowledge of application architecture
  • Ability to program (structured and OO) with one or more high level languages, such as Python, Java, C/C++/C#, Ruby, and JavaScript
  • Understanding of distributed storage technologies like NFS, HDFS, Ceph, S3 as well as dynamic resource management frameworks (OpenShift, Kubernetes, Yarn)
  • Skilled in spotting problems and identifying performance bottlenecks, leading to problem and root cause analysis and risk mitigation
  • Capacity monitoring and performance planning experience with cloud solutions like AWS using applications such as Dynatrace, New Relic, and AppDynamics
  • Use engineering design concepts to recommend design or test methods for attaining or improving operational reliability in support of business objectives.
  • Develop and implement high-reliability tools, systems, and services using engineering methodologies and tools.
  • Determine reliability requirements and deliver insights from massive scale data in real time.
  • Propose changes in design or formulation to improve system and/or process reliability.
  • Utilize best practices and work with cross-functional teams to provide solutions and a positive user experience.
  • Improve reliability, quality, and time-to-market for suite of software solutions, through effective hosting, monitoring, operations, and automation
  • Develop proprietary tools to improve system reliability and mitigate weaknesses in incident management or software delivery
  • Collaborate with team members to troubleshoot and fix issues, using knowledge of problems to route support escalations to the appropriate teams
  • Add automation for improved real-time collaborative response, and update documentation, runbook tools, and modules to prepare teams for incidents
  • Support optimizing the software development life cycle to boost service reliability, based on post-incident reviews
  • Support system cost modeling for all hosted systems
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
  • Deliver primary operational support and engineering for distributed software applications
  • Implement guidelines and plans for automated systems delivery maintaining system and data security
  • Assist with impact analysis regarding enterprise-wide technology
  • Perform capacity monitoring with various monitoring tools (Splunk, Dynatrace, etc.) and make recommendations
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning, fault finding, and corrective action planning
  • Support system integration, software, and hardware at enterprise level for optimum performance
  • Partner with development teams to improve services through rigorous automated testing and release procedures
  • Contribute to system architecture planning, and policies and procedures surrounding enterprise-wide technology
  • Participate in system design consulting, platform management, and capacity planning
  • Stay abreast of new technologies; introduce applicable technology in alignment with business goals and for creative solutions

AWS, Docker, PostgreSQL, Python, SQL, Bash, Cloud Computing, Data Analysis, DynamoDB, ElasticSearch, Git, Java, Kafka, Kubernetes, MySQL, Oracle, RabbitMQ, Software Architecture, Zabbix, Algorithms, Cassandra, Data Structures, Prometheus, Redis, Spark, Communication Skills, Analytical Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Teamwork, Troubleshooting, JSON, Cross-functional collaboration, Ansible, Scripting, Debugging

📍 United States, Europe

🧭 Full-Time

🔍 Software Development

🏢 Company: Dune | 👥 101-250

  • Proven expertise in managing and optimising bare-metal infrastructure and containerised environments.
  • Experience with infrastructure-as-code and orchestration tools.
  • Strong understanding of system performance, debugging, and optimization across diverse environments.
  • Ability to collaborate with interdisciplinary teams and communicate complex technical concepts clearly.
  • Solid foundation in computer science fundamentals and system design.
  • Ability to work collaboratively in a remote setting, contributing to a positive and inclusive team culture.
  • 5+ years of experience as a systems or infrastructure engineer in a collaborative, problem-solving environment.
  • Experience with distributed systems and managing large-scale, high-availability environments.
  • Hands-on experience with Nomad or Kubernetes for workload orchestration in production environments.
  • Proficiency in infrastructure-as-code tools like Ansible and Terraform, with a proven ability to automate and manage complex systems.
  • Experience with bare-metal infrastructure.
  • Proficiency in scripting or programming languages such as Python, Go, or Bash.
  • Experience with monitoring and observability tools for infrastructure performance.
  • Familiarity with cloud cost management and performance improvement strategies.
  • Strong analytical and troubleshooting skills.
  • Experience working across multiple time zones.
  • Collaborate closely with interdisciplinary teams to ensure the infrastructure meets the demanding performance, reliability, and scalability needs of our products.
  • Embrace the Platform team's mission to empower product teams with efficient, low-overhead services by developing and maintaining robust infrastructure and scalable services.
  • Design and maintain highly reliable containerized environments, ensuring seamless operation of our critical systems.
  • Analyze system performance to identify bottlenecks, proposing and implementing solutions to enhance infrastructure efficiency.
  • Contribute to maintaining high system reliability and scalability, focusing on unique and challenging technical problems.

Docker, Python, SQL, Bash, Cloud Computing, Git, Kubernetes, Go, REST API, CI/CD, Linux, DevOps, Terraform, Ansible, Scripting, Debugging

Posted 1 day ago

📍 United States

🧭 Full-Time

💸 128,350 - 192,100 USD per year

🔍 Software Development

🏢 Company: ClickHouse | 👥 101-250 | 💰 Series B over 2 years ago | Database, Artificial Intelligence (AI), Big Data, Analytics, Software

  • At least 8 years of experience in Site Reliability Engineering or a related field.
  • Previous experience using ClickHouse in production.
  • Coding experience with Go and/or Python.
  • Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
  • Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
  • Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
  • Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
  • Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  • Ensure all the infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane and ClickHouse Core) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
  • Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud including working with the support team to communicate to the impacted customers.
  • Continuously improve the reliability and performance of our ClickHouse services.
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

AWS, Docker, Python, SQL, Cloud Computing, Kubernetes, Cross-functional Team Leadership, ClickHouse, Go, REST API, Communication Skills, CI/CD, Problem Solving, Linux, DevOps, Terraform, Excellent communication skills, Teamwork, Strong communication skills, Ansible, Debugging

Posted 1 day ago

📍 United States

🧭 Full-Time

🔍 Software Development

🏢 Company: TCGPlayer_External_Career

  • 5+ years of experience in Site Reliability Engineering or related roles
  • Experience with an enterprise monitoring solution (New Relic, Scalyr, Datadog, etc.)
  • Experience managing Linux and/or Windows environments
  • Experience with IaaS and PaaS solutions (e.g., AWS, GCP, Azure)
  • Experience with Infrastructure as Code (Terraform or Helm)
  • Knowledge of Kubernetes / ECS orchestration, and containerization (e.g. Docker)
  • Demonstrable expertise around specifying, designing and/or implementing system health, performance monitoring tools and software management tools for 24x7 environments
  • Proficiency in writing code / scripts to automate tasks
  • Excellent critical thinking and problem-solving skills
  • Innovate, build, and evangelize the practice of site reliability so that TCGPlayer can deliver excellent customer experiences.
  • Define and measure key performance metrics, such as SLAs and Mean Time Between Failures (MTBF), using those metrics to identify trends and measure the impact on the business.
  • Develop and maintain up-to-date operational procedures, including runbooks, to adapt to evolving needs.
  • Anticipate system failures through practices like chaos engineering and tabletop exercises, and establish processes to learn from operational incidents.
  • Foster strong relationships within the team and across departments while cultivating a communicative, supportive, and results-oriented culture.

AWS, Docker, Cloud Computing, GCP, Kubernetes, Azure, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Posted 2 days ago

📍 United States

🔍 Blockchain

🏢 Company: IO Global

  • 7+ years of experience in SRE, DevOps, or a related role.
  • Understanding of SRE best practices, architectures, and methods.
  • Good knowledge of resiliency patterns and cloud security.
  • Strong programming proficiency in Python, Golang, or JavaScript.
  • Demonstrated experience with AWS and modern cloud architectures.
  • Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and Argo CD.
  • Hands-on experience with Kubernetes/EKS and GitOps methodologies.
  • Proven track record with monitoring tools such as Prometheus and OpenTelemetry, plus familiarity with the LGTM stack or comparable tools.
  • Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
  • Ability to engage in technical discussions and be part of the decision making process
  • Strong problem-solving skills and capability to work on complex systems
  • Experience in working within an Agile environment
  • Experience in working with a distributed team
  • Strong communication and collaboration abilities to work seamlessly across different teams.
  • A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
  • Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
  • Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
  • Leverage GitOps principles to automate deployments and manage container orchestration.
  • Implement and manage CI/CD pipelines ensuring seamless, high-quality deployments, finding and removing bottlenecks, improving performance and working alongside teams to refine feedback loops and automate toil away.
  • Develop automation tools and scripts to improve operational efficiency.
  • Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
  • Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
  • Collaborate with dev teams to define and implement SLOs/SLIs
  • Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
  • Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
  • Evaluate and adopt new technologies, with a special advantage for candidates with blockchain experience, to keep our systems at the cutting edge.
  • Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
  • Balance effective delivery of goals with a measurably high standard for the results. Always apply a layer of polish and due diligence when delivering.

AWS, Docker, Python, Agile, Blockchain, Cloud Computing, JavaScript, Kubernetes, Prometheus, Rust, Communication Skills, CI/CD, Problem Solving, RESTful APIs, Linux, DevOps, Terraform, Microservices, Scripting

Posted 2 days ago

📍 United States

🧭 Full-Time

💸 165,000 - 205,000 USD per year

🔍 Software Development

🏢 Company: Cribl | 👥 251-500 | 💰 $150,000,000 Series D almost 3 years ago | Real Time, Big Data, Information Technology, Software

  • Extensive experience with enterprise scale continuous delivery environments
  • 5+ years of experience with a DevOps or SRE job title
  • Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
  • Experience with Configuration Management Tools like Terraform (preferred) or Puppet, Chef, Ansible
  • Experience with sustainable incident response in a blameless environment
  • Knowledge of cloud platforms (prefer AWS) and container + orchestration technologies
  • Experience with APM and observability tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
  • Background in Linux Systems Engineering
  • Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
  • Engage with teams and improve service delivery and reliability across their entire lifecycle
  • Measure and monitor all production systems with an eye towards availability, latency and overall system health
  • Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
  • Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
  • Help identify and drive down toil with creative innovation and automation
  • On-call responsibilities

AWS, Docker, Node.js, Cloud Computing, JavaScript, Kibana, Kubernetes, TypeScript, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Terraform, JSON, Data management

Posted 4 days ago

📍 United States, Canada, Latin America

🧭 Full-Time

💸 160,000 - 185,000 USD per year

🔍 Software Development

🏢 Company: Superhuman | 👥 51-200 | 💰 $75,000,000 Series C over 3 years ago | 🫂 Last layoff almost 3 years ago | Software Development

  • 6+ years of experience in SRE, DevOps, or systems engineering roles.
  • Proven experience managing high-availability, mission-critical systems.
  • Strong proficiency with cloud platforms (GCP, AWS, or Azure).
  • Hands-on experience with containers and orchestration tools (Docker, Kubernetes).
  • Expertise in monitoring, logging, and alerting tools (e.g., Metabase, Datadog, Prometheus, Grafana).
  • Proficiency in scripting/programming languages (Python, Go, Bash, etc.).
  • Knowledge of database management systems (SQL/NoSQL).
  • Strong knowledge of networking, security, and distributed systems.
  • Experience with Infrastructure as Code (Terraform, Ansible, Chef, or Puppet).
  • Familiarity with version control systems (Git) and CI/CD pipelines (Jenkins, GitLab CI, etc.).
  • Strong communication skills and ability to work collaboratively across teams.
  • Problem-solving mindset with a focus on root cause analysis.
  • Proactive, self-driven, and able to handle high-pressure environments.
  • Collaborate with software engineers to design scalable, fault-tolerant systems and services.
  • Proactively monitor service health, availability, and performance.
  • Respond to and troubleshoot production issues.
  • Perform capacity planning and scaling activities.
  • Automate repetitive tasks to enhance efficiency.
  • Design and implement disaster recovery plans and high availability strategies.
  • Collaborate with our security team to ensure infrastructure adheres to best practices and compliance requirements.
  • Build, maintain, and enhance CI/CD pipelines.
  • Manage and automate infrastructure provisioning and configuration.
  • Work closely with development teams to ensure best practices in deployment and release processes.
  • Champion DevOps culture by mentoring and guiding other engineers in the use of tools and best practices.

AWS, Docker, Python, SQL, Bash, Cloud Computing, GCP, Git, Jenkins, Kubernetes, Azure, Go, Grafana, Prometheus, NoSQL, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices, Networking, Ansible, Scripting

Posted 7 days ago

📍 U.S., EU

🧭 Full-Time

🔍 Software Development

🏢 Company: AuthZed | 👥 11-50 | 💰 $12,000,000 Series A 9 months ago | Information Technology, Cyber Security, Software

  • Proven experience as a Site Reliability Engineer or in a similar role.
  • Strong understanding of networking, operating systems, and cloud infrastructure.
  • Experience with Site Reliability Engineering, System Design, and Distributed Computing.
  • Experience in various programming languages; we currently have SDKs for NodeJS, Java, Python, Ruby, and Go.
  • Experience with containerization technologies such as Docker and Kubernetes.
  • Knowledge of infrastructure-as-code tools like Terraform and Pulumi.
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
  • Experience with lower-level implementation details of relational databases (bonus if you have experience with distributed SQL databases like Google Cloud Spanner or CockroachDB).
  • Experience working with Git and GitHub.
  • Experience with continuous integration and deployment systems.
  • Strong problem-solving and troubleshooting skills.
  • Excellent communication and collaboration abilities.
  • Design, implement, and maintain highly available and scalable infrastructure solutions for our projects, products, and customers.
  • Monitor and analyze system performance, identifying and resolving bottlenecks and issues to ensure optimal performance and reliability.
  • Automate infrastructure deployment and configuration management processes.
  • Continuously improve system reliability, security, and efficiency through proactive monitoring, capacity planning, and performance tuning.
  • Troubleshoot and resolve complex infrastructure and application issues in production and test environments.
  • Collaborate with software engineering teams to design and implement systems that are resilient, scalable, and secure.
  • Participate in on-call rotation and respond to production incidents in a timely manner.
  • Document system configurations, troubleshooting procedures, and operational guidelines.

Docker, Python, SQL, Cloud Computing, Git, Java, Kubernetes, Go, Grafana, Prometheus, CI/CD, Problem Solving, Linux, Terraform, Networking, Troubleshooting, NodeJS, Scripting

Posted 13 days ago

📍 United States

🧭 Full-Time

💸 130,000 - 165,000 USD per year

🔍 Software Development

🏢 Company: KnowBe4 | 👥 1001-5000 | 💰 $300,000,000 Post-IPO Equity almost 2 years ago | Computer, Security, Cyber Security, Network Security, Software

  • BS/MS/Ph.D. or equivalent plus 5 years of experience
  • Proficient in authoring scripts in one or more programming languages (e.g. Python, Ruby, JavaScript).
  • Experience designing and operating high-scale patterns in AWS
  • Experience building and designing repeatable workflows for continuous integration and continuous deployment (CI/CD) - GitLab is preferred
  • Excellent communication skills
  • Effectively able to self-manage your time across competing projects
  • Ability to quickly understand and debug complex distributed systems
  • Work with other Site Reliability Engineers to build highly scalable and resilient applications and infrastructure in AWS
  • Maintain and improve extensible infrastructure-as-code using Terraform
  • Learn, maintain, and improve our existing deployment strategies
  • Deliver effective observability, monitoring, and alerting patterns for KnowBe4's applications and infrastructure
  • Act as an escalation point for identifying and resolving the root cause for production incidents
  • Provide assistance designing globally distributed systems and processes for the organization
  • Identify deficiencies in our current applications and infrastructure and correct them when found
  • Define new approaches and tailored solutions to complex technical problems
  • Act as a project leader with other Site Reliability Engineers and ensure progress is communicated effectively to project stakeholders

AWS, Docker, Python, SQL, AWS EKS, Cloud Computing, DynamoDB, Kubernetes, Algorithms, Data Structures, REST API, Rust, CI/CD, Problem Solving, Linux, DevOps, Terraform, Microservices, Excellent communication skills, Scripting, Debugging

Posted 15 days ago

📍 United States

💸 138,380 - 284,900 USD per year

🔍 Software Development

  • 4+ years of industry experience, building and operating large scale, high performance distributed systems
  • Experience programming with Python or Go
  • Strong knowledge of Linux/Unix/BSD internals and experience working with open source software (e.g. MySQL, Hadoop, Envoy, HAProxy, Nginx)
  • Experience with technologies such as ElasticSearch, ZooKeeper, HBase, Hadoop, Memcache and Kafka with a focus on reliability, automation, operability and performance
  • Infrastructure as code a plus (e.g. Terraform, Puppet, Chef, Ansible, Salt, Fabric, Docker, etc)
  • Bonus points if experienced with deploying web apps to cloud infrastructure (AWS, etc.) and working with distributed, service-oriented architecture
  • Develop software solutions to enable reliability and operability of large scale distributed systems handling petabytes of data and serving
  • Build a deep understanding of how Pinterest's systems behave, scale, interact and fail, and use that insight to identify risks and opportunities for remediation
  • Build tools and automation to eliminate toil and reduce operational overhead. Create frameworks, processes and best practices to be used across Pinterest Engineering
  • Build meaningful, insightful and actionable SLIs
  • Automate critical portions of Pinterest's engineering processes, to minimize risk and maximize the speed of innovation
  • Manage capacity and performance to help scale our infrastructure both on public and private clouds around the world

Docker, Python, SQL, Cloud Computing, ElasticSearch, Hadoop, Kafka, Kubernetes, MySQL, Nginx, Go, REST API, CI/CD, Linux, DevOps, Terraform, Microservices, Ansible, Scripting

Posted 15 days ago