Site Reliability Engineer

Posted 12 days ago
Canada, Full-Time, Edge AI
Company: Maneva
Location: Canada
Languages: English
Seniority level: Senior
Skills:
Docker, Python, Bash, Embedded Systems, Grafana, Prometheus, CI/CD, Linux, DevOps, Networking, Troubleshooting
Requirements:
Strong Linux systems administration experience (Ubuntu, embedded Linux, ARM systems).
Proficiency in Python and/or Bash for scripting and operations automation.
Solid networking fundamentals: TCP/IP, routing, DNS, DHCP, VPNs, VLANs, firewall rules.
Familiarity with troubleshooting tools: tcpdump, nmap, iftop, netstat, etc.
Hands-on experience with Prometheus, Grafana, or similar monitoring/alerting platforms.
Experience with logging/observability stacks (ELK/EFK, Loki, Fluentd, etc.) is a plus.
Experience with Docker or containerized applications is desirable.
Comfort supporting distributed or remote device fleets.
Excellent diagnostic and analytical abilities under pressure.
Strong communication skills with both technical and non-technical stakeholders.
High ownership mentality and ability to follow issues through to resolution.
Comfortable working independently in a fully remote environment.
Willingness to participate in on-call rotation, including off-hours and weekends.
Experience supporting machine learning, computer vision, or GPU-accelerated systems.
Familiarity with NVIDIA Jetson or other embedded AI hardware.
Prior SRE/DevOps/Systems Engineer experience in a 24/7 operational environment.
Exposure to industrial IoT, manufacturing systems, or operational technology (OT).
Experience writing customer-facing operational documentation or SOPs.
Responsibilities:
Serve as a first responder for production issues, alarms, and system outages (24/7 rotation required).
Troubleshoot Linux system issues, hardware problems, networking connectivity, and edge-device performance.
Perform root-cause analysis (RCA) and implement corrective and preventive solutions.
Build and maintain robust monitoring dashboards and alerts using Prometheus, Grafana, and similar tools.
Continuously improve observability, including metrics, logs, traces, and health checks.
Analyze trends to proactively identify reliability risks before incidents occur.
Improve deployment workflows, CI/CD pipelines, configuration management, and automated provisioning.
Create tools and scripts in Python/Bash to streamline operational processes.
Understand and operate Maneva's end-to-end edge AI stack.
Create and maintain SOPs for on-site customer teams and internal engineering workflows.
Produce detailed incident reports and reliability documentation.