Prometheus Jobs

Find remote positions requiring Prometheus skills. Browse through opportunities where you can utilize your expertise and grow your career.

Prometheus
255 jobs found.

Set alerts to receive daily emails with new job openings that match your preferences.

📍 Poland

🔍 Software Development

🏢 Company: Jamf

  • Minimum of 5 years experience managing large database clusters implemented using one of the most popular database technologies (MySQL Server, Oracle, PostgreSQL) (Required).
  • Minimum of 5 years experience managing NoSQL database instances and/or columnar-based databases (Preferred).
  • Minimum of 2 years Amazon RDS and EC2 knowledge (Required).
  • Minimum of 5 years of familiarity with Linux, Tomcat, and Apache (Required).
  • Minimum of 2 years experience with system and infrastructure configuration tools (such as Terraform, CloudFormation, Ansible, and Puppet).
  • Minimum of 2 years experience with modern monitoring tools (preferably Prometheus and Grafana, but others such as CloudWatch, New Relic, Datadog are also accepted).
  • Minimum of 2 years experience with major public providers: AWS (preferred), Azure, GCP.
  • Strong Communication Skills.
  • Excellent Interpersonal Skills.
  • Excellent Organizational Skills.
  • Proven Analytical Skills.
  • Ability to communicate complex technical terms in an easy to understand, non-technical manner.
  • Self-starter, energetic multi-tasker, highly motivated and team player.
  • Ability to engage with and establish trust and rapport with all levels of customers and employees.
  • Apple Platform.
  • Agile.
  • Ability to adapt and solve challenges quickly and efficiently.
  • Ability to work independently and as part of a team.
  • Provide technical guidance, strategic direction and operational mentorship.
  • Work with the rest of the DBA team and cloud leadership to design and implement the public cloud infrastructure that supports Jamf database systems.
  • Work with the rest of the DBA team and cloud leadership to design and implement Jamf database systems.
  • Participate in system administration duties, remote access, maintenance of server integrity, data backups and restoration, and offsite storage.
  • Proactively monitor all databases and related cloud infrastructure to detect and resolve problems and ensure uninterrupted operations.
  • Work closely with JAMF Software Development team on interrelated products.
  • Estimate accurately both time to complete tasks and impact of changes.
  • Drive continuous improvements within the JAMF Cloud infrastructure.
  • Work with JAMF Software’s technical team to bring current industry best practices to scale our systems to meet ever expanding demand.
  • Track and follow up on events and, when required, escalate them in a timely manner to the appropriate individual according to documented procedures.
  • Maintain the Online Services technical documentation of processes and procedures.
  • When required, participate in a 24x7 on-call rotation.
  • Identify techniques and tools that will enhance the capabilities of the team to be more effective.
  • Act as an active advisor to department management team.
  • Build a deep understanding of workflows that make a JAMF Software customer successful.
  • Other duties and special projects as assigned.

AWS, PostgreSQL, SQL, Amazon RDS, Apache Tomcat, Bash, Cloud Computing, Kubernetes, MySQL, Oracle, Grafana, Prometheus, NoSQL, CI/CD, Linux, Terraform

Posted 1 day ago

📍 France

🧭 Full-Time

🔍 Streaming Advertising

🏢 Company: Vibe

👥 101-250

💰 $22,500,000 Series A about 1 year ago

Internet, Advertising, TV, Marketing

  • 8+ years of experience in relevant technical roles (Software Engineering, Data Engineering, ML Ops, or Infrastructure), with a strong foundation to build and scale ML infrastructure
  • Strong coding skills in Python, CI/CD, and Infrastructure as Code (Terraform, Ansible)
  • Deep expertise in ML Infrastructure: training orchestration (Dagster, Airflow), feature stores, live model monitoring, and distributed/multi-GPU training (TensorFlow, PyTorch)
  • Extensive cloud & scalability experience: deploying ML models on AWS/GCP, optimizing real-time inference, handling large-scale data pipelines, and implementing cost-efficient FinOps strategies
  • Build and optimize automated ML training pipelines (MLflow, Dagster)
  • Improve scalability and performance (multi-GPU, caching, distributed architectures)
  • Deploy and optimize real-time inference systems to ensure sub-20ms latency at scale
  • Implement monitoring and observability for models (Prometheus, Grafana, Evidently AI)
  • Optimize cloud costs and resource management (AWS Spot Instances, auto-scaling Kubernetes, FinOps)

AWS, Python, GCP, Kubernetes, MLflow, PyTorch, Data engineering, Grafana, Prometheus, TensorFlow, CI/CD, Terraform, Ansible, Software Engineering

Posted 1 day ago

📍 Greece, Sweden

🏢 Company: InventYOU AB

  • Deep hands-on experience with AWS cloud services.
  • Strong skills in Kubernetes, Docker, and container orchestration.
  • Familiarity with GitOps tools (e.g., ArgoCD, Flux) and practices.
  • Proficiency with Terraform, Ansible, or similar IaC tools.
  • Experience with observability and monitoring stacks (e.g., Prometheus, Grafana, CloudWatch).
  • Operate and optimize large-scale AWS cloud environments with a strong focus on reliability, scalability, and performance.
  • Implement GitOps workflows for streamlined, version-controlled infrastructure changes.
  • Automate deployments and infrastructure management using Infrastructure as Code (IaC).
  • Support monitoring, alerting, incident response, and root cause analysis.
  • Collaborate with engineering, SRE, DevOps, and security teams to continuously improve operations and cloud practices.

AWS, Docker, Bash, Cloud Computing, Git, Kubernetes, Grafana, Prometheus, CI/CD, Linux, Terraform, Ansible

Posted 1 day ago

📍 Canada

💸 146,409 - 175,691 CAD per year

🔍 Software Development

  • Solid experience with at least one programming language. We use Go, but if you have familiarity with Python, C, C++, Rust or similar then that translates well
  • Some experience with delivering projects from gathering requirements, brainstorming ideas all the way to shipping a product to the customer’s hands in a self-driven way
  • Some experience with developing software that runs in the Cloud or some experience with systems engineering
  • Experience writing clean, robust, and performant software that is easily maintained by others
  • Take an active role in influencing our roadmap and your own career objectives
  • Work with your team to deliver new features, then use the results to iterate and improve.
  • Drive projects from initial ideation all the way to operations once it is in the hands of customers
  • Embrace our open-source culture and contribute to other projects that may not directly fall within your team’s scope
  • Design, build, operate, and maintain critical systems, owning the reliability, performance, and availability
  • Be a part of your team’s on-call rotations and take ownership of the services you’re running
  • Mentor and support other team members, participate in design discussions and collaborate with the team
  • Learn new skills by gaining a deeper understanding of our cloud product and our customers and getting to know the codebase of a large distributed system

Backend Development, Docker, Software Development, Cloud Computing, Design Patterns, Git, Kubernetes, Algorithms, Data Structures, Go, Grafana, Prometheus, REST API, Communication Skills, Analytical Skills, Collaboration, CI/CD, Problem Solving, Mentoring, Linux, Written communication, Microservices, Adaptability, Teamwork, Active listening, JSON, Software Engineering, Debugging

Posted 2 days ago

📍 United States

🧭 Full-Time

💸 120,000 - 150,000 USD per year

🔍 Software Development

🏢 Company: Echo360 Inc

  • 5+ years of experience as a Site Reliability Engineer or similar role.
  • Strong understanding of AWS cloud services, including DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, EKS, ECS and EC2.
  • Experience with infrastructure automation tools like Ansible, Terraform, or CloudFormation.
  • Experience with monitoring and alerting tools like CloudWatch, DataDog, Prometheus, Grafana, Kibana, and PagerDuty.
  • Experience with GitHub Actions, CI/CD pipelines and deployment strategies.
  • Strong problem-solving and analytical skills.
  • Excellent communication and collaboration skills.
  • Ability to work independently and take ownership of complex tasks.
  • Passion for technology and a desire to learn and grow.
  • Experience with Jenkins, PostgreSQL, and MongoDB.
  • Experience with cloud cost optimization, security best practices and tools.
  • Experience working in a fast-paced, agile environment.
  • Experience with Rancher, Cattleprod, and TeamCity is a plus.
  • Ensure service reliability and SLO/SLA adherence to production, preventing incidents by proactively conducting failure testing.
  • Implement automated monitoring and alerting systems for early detection of potential problems.
  • Collaborate with development teams to perform deployments and rollbacks with minimal disruption.
  • Optimize the performance and scalability of our AWS infrastructure, including RDS, DynamoDB, MySQL, S3, CloudSearch, OpenSearch, Kafka, Presto, SES, EKS, ECS, and EC2.
  • Automate infrastructure provisioning and deployment processes using Terraform, CI/CD pipelines, and configuration management tools.
  • Proactively identify and address potential security vulnerabilities to maintain compliance, IAM best practices, and secrets management.
  • Participate in incident response and post-mortem analysis activities to identify root causes and prevent future occurrences.
  • Help onboard and mentor junior team members, sharing your knowledge and expertise.
  • Stay up to date on the latest cloud technologies and best practices for SRE.
  • Participate in a well-structured on-call rotation with other Site Reliability Engineers.
  • Explore new technologies and innovative solutions to improve service quality and speed to market.
  • Participate in technical discussions and deep dives with the other engineering and product teams.

AWS, PostgreSQL, DynamoDB, Jenkins, Kafka, Kibana, MongoDB, MySQL, Grafana, Prometheus, CI/CD, Agile methodologies, Linux, DevOps, Terraform, Microservices, Ansible

Posted 2 days ago

📍 Worldwide

🧭 Full-Time

🔍 Hospitality

🏢 Company: Lighthouse

  • 3+ years Rust, including mastery of advanced Rust concepts and async Rust, or 1+ years Rust and 3+ years in any other systems programming language (e.g. C++).
  • High level understanding of networking protocols.
  • Experience in cloud environment and designing infrastructure architecture.
  • Eagerness to learn and explore the world of web scraping.
  • Strong problem-solving and analytical skills, with the ability to identify and troubleshoot complex technical issues.
  • Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
  • Drive the optimization and expansion of our crawling infrastructure, ensuring both high scalability and superior performance, while maintaining effective web scraping capabilities.
  • Drive developments on our man-in-the-middle proxy, written in Rust.
  • Introduce innovative methods for enhancing our monitoring and troubleshooting processes and tools.
  • Engineer scraping solutions by applying findings from our R&D.
  • Actively collaborate with diverse engineering teams to address challenges and advance initiatives related to backend infrastructure.
  • Serve as a subject matter expert, providing mentorship and support to team members while fostering an environment of continuous learning and development.

Backend Development, Python, Cloud Computing, Data Analysis, GCP, Kubernetes, Data engineering, Grafana, Prometheus, REST API, Rust, Microservices, Networking

Posted 3 days ago

📍 France

🧭 Full-Time

🏢 Company: Vibe

👥 101-250

💰 $22,500,000 Series A about 1 year ago

Internet, Advertising, TV, Marketing

  • Proven experience in designing, implementing, and maintaining highly available and fault-tolerant systems
  • Deep knowledge of AWS with Terraform, Kubernetes, and CI/CD pipeline management (GCP experience is a plus)
  • Strong understanding of security principles, IAM, RBAC, and network security
  • Proficiency in Python (Go or Rust as a plus)
  • Experience with Prometheus/Grafana (OpenTelemetry as a plus)
  • Ensure high availability (99.99%) and prevent service disruption
  • Automate deployment & infrastructure management (Terraform, Kubernetes, CI/CD)
  • Monitor system health & performance (Prometheus, Grafana, OpenTelemetry)
  • Optimize cloud costs & FinOps strategies (AWS, GCP, Spot Instances)
  • Implement security best practices and compliance policies

AWS, Python, Cloud Computing, Cybersecurity, Kubernetes, Grafana, Prometheus, CI/CD, RESTful APIs, Linux, DevOps, Terraform, Microservices

Posted 3 days ago

📍 United States

💸 148,505 - 178,206 USD per year

🔍 Software Development

  • Solid experience with at least one programming language
  • Some experience with delivering projects from gathering requirements, brainstorming ideas all the way to shipping a product to the customer’s hands in a self-driven way
  • Some experience with developing software that runs in the Cloud or some experience with systems engineering
  • Experience writing clean, robust, and performant software that is easily maintained by others
  • Take an active role in influencing our roadmap and your own career objectives
  • Work with your team to deliver new features, then use the results to iterate and improve.
  • Drive projects from initial ideation all the way to operations once it is in the hands of customers
  • Embrace our open-source culture and contribute to other projects that may not directly fall within your team’s scope
  • Design, build, operate, and maintain critical systems, owning the reliability, performance, and availability
  • Be a part of your team’s on-call rotations and take ownership of the services you’re running
  • Mentor and support other team members, participate in design discussions and collaborate with the team
  • Learn new skills by gaining a deeper understanding of our cloud product and our customers and getting to know the codebase of a large distributed system

Backend Development, Software Development, Cloud Computing, Kubernetes, Algorithms, Data Structures, Go, Grafana, Prometheus, REST API, CI/CD, Linux, DevOps, Microservices, JSON, Software Engineering, Debugging

Posted 3 days ago
🔥 Senior SRE / DevOps
Posted 3 days ago

📍 France, the UK, Italy, Spain, Portugal, Netherlands

🧭 Full-Time

🔍 FinTech or Crypto

🏢 Company: Kiln

👥 101-250

💰 $17,000,000 Series A about 1 year ago

Cryptocurrency, Blockchain

  • 5+ years of experience in Software or Infrastructure.
  • Proven experience as a Senior SRE with a very strong focus on Kubernetes.
  • Proficiency with IaC (Terraform/Terragrunt) and infrastructure automation (Helm, GitOps).
  • Familiar with Prometheus and PromQL.
  • Familiar with infrastructure and data security (KMS, HashiCorp Vault).
  • Ability to ship opinionated architectural choices and code, and to share software best practices.
  • Deploying new blockchain protocols in accordance with the Product team.
  • Architect, deploy and maintain our multi-cloud infrastructure.
  • Ensure that our services communicate with each other seamlessly, have minimal downtime, and recover quickly.
  • Make sure we respect any software security norms (Kiln is a SOC 2 Type 1 and Type 2 company).
  • Continuously support our Software/Smart Contract team to ship code of quality.
  • Actively suggest continuous improvement of Kiln's architecture.
  • Assess any protocol deployment risks.
  • Communicate with our Product & Sales team to make sure they understand any risk that may occur during protocol deployment.

AWS, PostgreSQL, Blockchain, GCP, Git, Kubernetes, TypeScript, Prometheus, Web3.js, Linux, DevOps, Terraform

Posted 3 days ago

📍 Western EU

🧭 Full-Time

🔍 Crypto

🏢 Company: Kiln

👥 101-250

💰 $17,000,000 Series A about 1 year ago

Cryptocurrency, Blockchain

  • 3+ years experience in the crypto industry.
  • 3+ years DevOps experience with a very strong focus on Kubernetes.
  • You have run validator nodes.
  • Doing technical due diligence on new protocols being evaluated for Kiln’s roadmap.
  • Running protocols in testnet on Kiln servers before handing them over to the Kiln infra team for mainnet / ongoing production.
  • Being a technical expert on relevant protocols to help answer internal and external questions on their operations. Writing educational content.

AWS, PostgreSQL, Blockchain, Ethereum, GCP, Kubernetes, TypeScript, Prometheus, REST API, Web3.js, CI/CD, Linux, DevOps, Terraform, JSON

Posted 3 days ago
Shown 10 out of 255