Full-Time Manager, AI System Infrastructure and MLOps Engineering
Job Description
The Team
Across our work in Science, Education, and within our communities, we pair technology with grantmaking, impact investing, and collaboration to help accelerate the pace of progress toward our mission. Our Central team provides the support needed to push this work forward.
The Central team at CZI consists of our Finance, People & DEI, Real Estate, Events, Workplace, Facilities, Security, Brand & Communications, Business Systems, Central Operations, Strategic Initiatives, and Ventures teams. These teams provide strategic support and operational excellence across the board at CZI.
The AI/ML Infrastructure team works on building shared tools and platforms to be used across the Chan Zuckerberg Initiative, partnering and supporting the work of an extensive group of Research Scientists, Data Scientists, AI Research Scientists, as well as a broad range of Engineers focusing on Education and Science domain problems. Members of the shared infrastructure engineering team have an impact on all of CZI’s initiatives by enabling the technology solutions used by other engineering teams at CZI to scale.
The Opportunity
As a hands-on Manager of the AI System Infrastructure and MLOps Engineering team, you will be joining the AI/ML and Data Engineering team in CZI Central Tech, with the responsibility for the stability and scalable operations of our leading edge GPU Cloud Compute Cluster. This supports our AI Researchers in their development and training of state-of-the-art models in artificial intelligence and machine learning to solve important problems in the biomedical sciences aligned with CZI’s mission, contributing to greater understanding of human cell function.
As the Engineering Manager of the AI Infrastructure and MLOps Engineering team, you will be responsible for a variety of MLOps and AI development projects that empower our AI Researchers and help to accelerate Biomedical research across the whole of the AI lifecycle. You will guide our AI Systems Infrastructure and MLOps efforts focused on our GPU Cloud Cluster operations, ensuring that our systems are highly utilized, performant, and stable. You will work in collaboration with other members of our own AI Engineering team as well as the Science Initiative’s AI Research team as they iterate on and train their deep learning code, optimizing systems operations and helping to troubleshoot problems encountered by jobs running on the cluster.
What You’ll Do
- Help to build out the MLOps and Systems Infrastructure Engineering team, growing the team to support the large-scale capacity systems and AI training efforts we will be undertaking.
- Drive our MLOps processes and System Infrastructure Engineering efforts to ensure that our GPU Cloud computing systems are highly utilized and stable, and proactively guide our team in implementing the instrumentation and observability tooling integral to our AI Platform.
- Own the on-call efforts for our GPU Cloud computing systems, building out the MLOps and Systems Infrastructure Engineering alerting and monitoring efforts for our leading-edge Kubernetes-based AI platform, including troubleshooting problems encountered on the GPU platform infrastructure and with jobs running on the cluster and computing systems.
- Take responsibility for a variety of AI/ML development infrastructure, instrumentation, and telemetry projects that empower our team in supporting our users across the AI/ML lifecycle, taking a key role in simplifying and optimizing the systems and processes that are integral to our GPU Cloud Cluster operations, in a hybrid operations model where MLOps meets SRE.
- Mentor and manage your team in fulfilling their roles to the best of their abilities, provide skill and career coaching to help team members keep growing along their own career and life paths, and keep the team engaged in meaningful and interesting projects in service of our north star philanthropic mission.
What You’ll Bring
- Hands-on AI/ML Model Training Platform Operations experience in an environment with demanding data and systems platform challenges
- MLOps experience working with medium to large scale GPU clusters in Kubernetes, HPC environments, or large scale Cloud based ML deployments (Kubernetes Preferred)
- BS, MS, or PhD degree in Computer Science or a related technical discipline or equivalent experience
- 2+ years of experience managing MLOps teams
- 7+ years of relevant coding and systems experience
- 7+ years of systems Architecture and Design experience, with a broad range of experience across Data, AI/ML, Core Infrastructure, and Security Engineering
- Strong understanding of scaling containerized applications on Kubernetes or Mesos, including solid understanding of AI/ML training with containers using secure AMIs and continuous deployment systems that integrate with Kubernetes or Mesos (Kubernetes preferred)
- Proficiency with Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, and experience with On-Prem and Colocation Service hosting environments
- Solid coding ability with a systems language such as Rust, C/C++, C#, Go, Java, or Scala
- Extensive experience with a scripting language such as Python, PHP, or Ruby (Python Preferred)
- Working knowledge of Nvidia CUDA and AI/ML custom libraries.
- Knowledge of Linux systems optimization and administration
- Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms.
- PyTorch, Keras, or TensorFlow experience is a strong nice-to-have
How to Apply
https://grnh.se/dc9971171us