Senior Site Reliability Engineer (SRE) - AI Infrastructure Job at Confidential, San Francisco, CA

  • Confidential
  • San Francisco, CA

Job Description

Join a stealth-mode startup building out their AI and cloud platform, powered by thousands of H100s, H200s, and B200s and ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support their new products as they come to market.

This is a rare opportunity to work at the intersection of infrastructure engineering and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today! 

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:

  • $300,000 gross per year 
  • Equity

Job Tags

Permanent employment
