Director, MLOps Engineering
NYC Health + Hospitals · New York, US
Job description
About NYC Health + Hospitals
NYC Health + Hospitals is the largest public health care system in the United States. We provide essential outpatient, inpatient and home-based services to more than one million New Yorkers every year across the city’s five boroughs. Our large health system consists of ambulatory centers, acute care centers, post-acute care/long-term care, rehabilitation programs, Home Care, and Correctional Health Services. Our diverse workforce is uniquely focused on empowering New Yorkers.
At NYC Health + Hospitals, our mission is to deliver high-quality health care services, without exception. Every employee takes a person-centered approach that exemplifies the ICARE values (Integrity, Compassion, Accountability, Respect, and Excellence) through empathic communication and partnerships between all persons.
Work Shifts
9:00 A.M. – 5:00 P.M.
Duties & Responsibilities
Purpose of Functional Assignment:
The Director of Machine Learning Operations (MLOps) Engineering provides strategic and operational leadership for the end‑to‑end Machine Learning (ML) and agentic Artificial Intelligence (AI) operations platform. This role oversees the full lifecycle required to take AI prototypes—including ML models, Large Language Model (LLM) based systems, and agentic/Retrieval-Augmented Generation (RAG) pipelines—from development to production, including environment setup, pipeline engineering, integration, Quality Assurance (QA), deployment, and ongoing maintenance.
Essential Duties and Responsibilities:
1. Defines the multi-year technical roadmap for the ML platform, continuously evaluating emerging MLOps tools, LLM frameworks, and infrastructure innovations to maintain a cutting-edge and efficient platform, guiding long-term strategy for reliability, lifecycle automation, cost optimization, and scaling across the System.
2. Leads the setup, governance, and maintenance of Quality Assurance (QA), staging, and production environments for ML applications, LLM pipelines, and agentic AI systems.
3. Owns the transition from AI prototypes to production, including model refactoring, packaging, optimization, dependency management, and deployment readiness validation.
4. Modifies and operationalizes ML and LLM applications to run as scalable mini‑batch or streaming pipelines to meet clinical workflow requirements.
5. Establishes automated model re-training and re-deployment pipelines triggered by performance degradation, data drift, or scheduled intervals, ensuring continuous model improvement.
6. Integrates AI applications with enterprise data platforms, interface engines, cloud services, container orchestration environments, model tracking tools, and clinical workflow systems in support of end-to-end AI operations.
7. Collaborates with Data Platform and AI Governance teams to ensure compliant data and features are usable by ML/LLM pipelines; manages the infrastructure for the low-latency feature serving layer required for real-time inference.
8. Designs, manages, and maintains infrastructure for Retrieval-Augmented Generation (RAG) pipelines, vector databases, embedding generation, orchestration layers, and automated agentic tools.
9. Implements and enforces a Model Governance framework, including automated checks for model versioning, lineage tracking, reproducibility, model card generation, and secure model access controls across all environments.
10. Oversees end‑to‑end deployment workflows using Continuous Integration/Continuous Deployment (CI/CD), infrastructure‑as‑code, containerization, distributed compute, and Kubernetes/Azure Kubernetes Service (AKS) orchestration.
11. Establishes and executes robust QA processes including unit, functional, and integration testing to ensure AI applications behave consistently with validated prototypes prior to deployment.
12. Develops and manages reliability and observability frameworks covering logging, monitoring, alerting, data quality drift detection, and runtime monitoring.
13. Manages and optimizes compute resource utilization (e.g., CPU/GPU/TPU) and cloud spending related to model training, experimentation, and high-volume, real-time model serving.
14. Collaborates with Product Development, Platform Engineering, Interoperability, Cybersecurity, and clinical partners to ensure safe, integrated, and workflow‑appropriate deployment of AI tools.
15. Leads incident response, troubleshooting, root‑cause analysis, and continuous reliability improvement for production AI systems.
16. Translates AI governance policies into automated, auditable, and repeatable technical controls embedded within the MLOps pipelines to ensure compliance.
17. Manages a team of ML Engineers, QA Engineers, and LLM Engineers.
18. Performs other duties as assigned.
Minimum Qualifications
1. Master's Degree from an accredited college or university in Computer Science, Information Systems or Technology, Cybersecurity, Hospital Administration, Health Care Planning, Business Administration, Mathematics, Engineering or Public Administration; and three (3) years of progressively responsible experience in health care information security, multifaceted information technology, health and medical service administration, public administration, or a related discipline with an emphasis on systems programming, systems engineering, software development, or providing technical support as a specialist; two (2) years of which must have been in a related administrative, managerial or supervisory capacity; or,
2. Bachelor’s Degree from an accredited college or university in disciplines, as listed in “1” above; and five (5) years of progressively responsible experience in health care information security, multifaceted information technology, health and medical service administration, public administration, or a related discipline with an emphasis on systems programming, systems engineering, software development, or providing technical support as a specialist; two (2) years of which must have been in a related administrative, managerial or supervisory capacity.
Assignment Qualification Preferences:
1. Master’s degree from an accredited college or university in Computer Science, Computer Engineering, or a related technical discipline; and,
2. Five (5) years of experience in Machine Learning Operations (MLOps), Machine Learning (ML) engineering, Artificial Intelligence (AI) platform engineering, or production Machine Learning (ML)/Large Language Model (LLM) system operations; or ten (10) years of experience in Software Engineering and Data Engineering.
Certifications Preferred:
1. Professional certifications in cloud architecture, ML/AI engineering, or DevOps from leading cloud platforms.
Preferred Knowledge Areas, Skills, Abilities, and other Qualifications:
1. Deep experience with Databricks, Spark/Scala, MLflow, Azure cloud, Docker, Terraform, Continuous Integration and Continuous Deployment (CI/CD), and Kubernetes/Azure Kubernetes Service (AKS).
2. Experience deploying production ML systems, agentic AI systems, Retrieval-Augmented Generation (RAG) pipelines, vector databases (DBs), orchestration frameworks, or LLM applications.
3. Demonstrated ability to convert prototypes into production systems, including optimizing pipelines for streaming and mini‑batch.
4. Experience integrating AI applications with Electronic Health Record (EHR) and clinical data platforms (e.g., Epic, Health Level 7 (HL7)/Fast Healthcare Interoperability Resources (FHIR), Mirth Connect, Laboratory Information System (LIS)/Picture Archiving and Communication System (PACS)).
5. Strong background in observability, monitoring, drift detection, Quality Assurance (QA) automation, and ML system reliability.
6. Understanding of Health Insurance Portability and Accountability Act of 1996 (HIPAA), National Institute of Standards and Technology (NIST), responsible AI, and safety‑critical ML governance.
7. Proven track record leading engineering teams and collaborating across clinical, operational, and Information Technology (IT) domains.
8. Strong communication skills and ability to translate complex technical systems into clinically meaningful explanations.
9. Experience deploying AI systems in healthcare, public-sector environments, or other highly regulated systems.
10. Experience building or operating large‑scale RAG or agentic AI systems in production.
11. Familiarity with Plotly visualization, clinical note processing, or multimodal clinical models.
12. Experience Using the Following Software and/or Platforms:
- Python, Java, Scala, PySpark, Structured Query Language (SQL).
- Spark, FastAPI/Spring, Docker, Kubernetes, Kafka, microservices frameworks.
- Retrieval frameworks (RAG), vector databases (Pinecone, MongoDB, FAISS), LangChain, LlamaIndex, OpenAI/Azure tools.
- Containerization, Event-driven programming.
- Microsoft and Google operating systems.
Benefits
NYC Health + Hospitals offers a competitive benefits package that includes:
- Comprehensive Health Benefits for employees hired to work 20+ hrs. per week
- Retirement Savings and Pension Plans
- Paid Holidays and Vacation in accordance with employees' collectively bargained contracts
- Loan Forgiveness Programs for eligible employees
- College tuition discounts and professional development opportunities
- College Savings Program
- Union Benefits for eligible titles
- Multiple employee discount programs
- Commuter Benefits Programs
How To Apply
If you wish to apply for this position, please apply online by clicking the "Apply for Job" button.
Note: Candidates selected for a position are required to come to NYC as part of their onboarding.