Senior Platform Engineer - Cloud & Infrastructure
Overview: Architect the Infrastructure of MLOps
We are looking for a heavy-hitting infrastructure engineer who lives in Kubernetes but understands the reality of enterprise deployments. ZenML is an open-source MLOps framework, and as we scale our ZenML Cloud and Enterprise offerings, we need someone to own the "plumbing" that makes ML pipelines run anywhere.
This is a unique hybrid role. You won't just be maintaining internal clusters; you will be building core product features (like our new workload manager and scheduler) AND helping our most advanced customers architect their MLOps stacks.
Key Responsibilities
- Build "Infra-Heavy" Product Features: You will design and implement core features in ZenML Pro, such as native schedulers and the workload manager that triggers pipelines across hybrid clouds.
- Own the ZenML Pro (SaaS) Infrastructure: You will keep our managed control plane resilient, scalable, and secure using modern SRE practices (Grafana, Prometheus, alerting).
- Enterprise Architecture & PoCs: You will be the "Special Forces" engineer we send in when a major enterprise customer needs to deploy ZenML on a complex, air-gapped, or custom Kubernetes setup. You will unblock them and feed those learnings back into the product.
- Developer Experience: You will abstract the complexity of Kubernetes away from the Data Scientists who use our tool.
Tech You'll Work With
- The Core: Kubernetes (deep knowledge required, CKA level), Docker, Terraform, Helm.
- The Code: Python (for ZenML) and likely Go (for controllers/operators).
- The Clouds: AWS (EKS), GCP (GKE), Azure (AKS).
- The Stack: PostgreSQL, SQLModel, FastAPI.
What We're Looking For
- The K8s Native: You don't just use Kubernetes; you understand its internals. You’ve written Helm charts from scratch, debugged failed ingress controllers, and wrestled with VPC peering.
- Infrastructure as Code (IaC) Master: You hate clicking buttons in the AWS console. If it isn't in Terraform, it doesn't exist.
- Code + Ops: You are not just a SysAdmin. You can write production-quality code (Python or Go) to build features, not just scripts.
- Customer Empathy: You are comfortable jumping on a call with a customer’s DevOps team to debug a deployment. You can explain complex infra concepts to Data Scientists without overwhelming them.
- Problem Solver: You enjoy the detective work of figuring out why a pod is crashing in a customer's obscure private cloud environment.