Site Reliability Engineer

Posted 12 days ago

Canada | Full-Time | Edge AI

Company: Maneva

Location: Canada

Languages: English

Seniority level: Senior

Skills:

Docker, Python, Bash, Embedded Systems, Grafana, Prometheus, CI/CD, Linux, DevOps, Networking, Troubleshooting

Requirements:

Strong Linux systems administration experience (Ubuntu, embedded Linux, ARM systems).
Proficiency in Python and/or Bash for scripting and operations automation.
Solid networking fundamentals: TCP/IP, routing, DNS, DHCP, VPNs, VLANs, firewall rules.
Familiarity with troubleshooting tools: tcpdump, nmap, iftop, netstat, etc.
Hands-on experience with Prometheus, Grafana, or similar monitoring/alerting platforms.
Experience with logging/observability stacks (ELK/EFK, Loki, Fluentd, etc.) is a plus.
Experience with Docker or containerized applications is desirable.
Comfort supporting distributed or remote device fleets.
Excellent diagnostic and analytical abilities under pressure.
Strong communication skills with both technical and non-technical stakeholders.
High ownership mentality and ability to follow issues through to resolution.
Comfortable working independently in a fully remote environment.
Willingness to participate in on-call rotation, including off-hours and weekends.
Experience supporting machine learning, computer vision, or GPU-accelerated systems.
Familiarity with NVIDIA Jetson or other embedded AI hardware.
Prior SRE/DevOps/Systems Engineer experience in a 24/7 operational environment.
Exposure to industrial IoT, manufacturing systems, or operational technology (OT).
Experience writing customer-facing operational documentation or SOPs.

Responsibilities:

Serve as a first responder for production issues, alarms, and system outages (24/7 rotation required).
Troubleshoot Linux system issues, hardware problems, networking connectivity, and edge-device performance.
Perform root-cause analysis (RCA) and implement corrective and preventive solutions.
Build and maintain robust monitoring dashboards and alerts using Prometheus, Grafana, and similar tools.
Continuously improve observability, including metrics, logs, traces, and health checks.
Analyze trends to proactively identify reliability risks before incidents occur.
Improve deployment workflows, CI/CD pipelines, configuration management, and automated provisioning.
Create tools and scripts in Python/Bash to streamline operational processes.
Understand and operate Maneva's end-to-end edge AI stack.
Create and maintain SOPs for on-site customer teams and internal engineering workflows.
Produce detailed incident reports and reliability documentation.
