Senior Solutions Architect, AI Factory Observability and Visualization
Canada, United States | Full-Time | Senior
Salary: USD 184,000 - 287,500 for Level 4, and USD 224,000 - 356,500 for Level 5
Job Details
- Experience: 6+ years
- Required Skills: Python, Bash, Grafana, Prometheus, Linux, Distributed Systems
Requirements
- Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
- 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
- Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
- Solid grasp of how HPC and AI factory systems fit together end to end.
- Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
- Practical experience with observability systems (e.g., Prometheus, Grafana, Loki).
- Experience building custom exporters or collectors, setting up alerts, and handling metric cardinality at scale.
- Experience transforming metrics, logs, and traces into actionable insight for distributed environments.
- Familiarity with GPU and fabric telemetry (e.g., DCGM, NVLink, InfiniBand/Ethernet fabric counters).
Responsibilities
- Run AI factory validation tools, microbenchmarks, and workloads, interpreting results to assess system health and performance.
- Establish metrics, logs, and signals to define healthy system states and identify performance thresholds.
- Build and extend telemetry across hardware, fabric, and workloads, including data collection and storage.
- Develop automation using Python and Shell for collecting, transforming, and presenting system data.
- Collaborate with hardware, software, networking, and datacenter teams to prepare HPC systems and AI factories for deployment.
- Act as the observability expert: investigate visibility gaps so that telemetry accurately represents system behavior.