Lead AI Infrastructure Engineer
Staffed4U · Baltimore, US
Job description
Location: Annapolis Junction, MD
Clearance: TS/SCI with Polygraph required
Work Type: On-site
Salary: $293,000-$306,000
Position Overview
We are seeking an experienced Lead AI Infrastructure Engineer to provide technical leadership for the design, deployment, and operation of enterprise artificial intelligence and machine learning platforms. This role will lead the development and sustainment of critical AI infrastructure components, with a focus on scalable model deployment, platform reliability, and support for AI-enabled applications and services.
The successful candidate will combine hands-on engineering expertise with team leadership responsibilities, serving as a technical lead for platform initiatives while supporting the professional development of engineering staff. This position requires strong cloud engineering, platform architecture, and organizational leadership skills to drive innovation, operational excellence, and technology adoption across multiple teams.
Key Responsibilities
- Design, implement, and optimize infrastructure supporting large-scale AI model deployment and inference services.
- Lead the development, deployment, and maintenance of production AI applications and platform services.
- Serve as the technical lead for AI infrastructure initiatives, coordinating activities across engineering teams and stakeholders.
- Provide mentorship, coaching, and professional development support to engineering team members.
- Support team operations, resource planning, and administrative coordination activities.
- Define technical solutions for complex and evolving requirements.
- Establish and maintain technical standards, policies, governance processes, and engineering best practices.
- Drive adoption of emerging technologies, automation capabilities, and platform modernization initiatives.
- Design, implement, and oversee monitoring, logging, alerting, and observability solutions.
- Ensure the reliability, availability, scalability, performance, and security of AI platform components.
- Communicate technical strategies, project status, and recommendations to stakeholders at multiple organizational levels.
- Lead troubleshooting, root cause analysis, and continuous improvement efforts for production systems.
Required Qualifications
Education and Experience
- Bachelor's degree in Computer Science, Software Engineering, Information Systems, Computer Engineering, or a related technical discipline and twelve (12) years of relevant experience; OR
- Four (4) additional years of directly related experience in lieu of the degree.
Technical Qualifications
- Extensive experience designing, building, deploying, and operating enterprise-scale production systems.
- Deep expertise in systems integration across diverse technologies, platforms, and cloud environments.
- Hands-on experience designing, deploying, and managing cloud infrastructure within Amazon Web Services (AWS).
- Advanced experience administering and deploying applications using Kubernetes.
- Strong software development skills using Python.
- Experience implementing and scaling observability solutions using technologies such as:
  - Application Performance Monitoring (APM) tools
  - OpenTelemetry
  - Grafana
  - Prometheus
- Experience developing and maintaining highly available, resilient, and secure distributed systems.
- Proven ability to lead complex technical initiatives and influence organizational technology adoption.
- Experience establishing technical standards, governance frameworks, and engineering policies.
- Excellent communication, stakeholder engagement, and leadership skills.
- Demonstrated ability to balance hands-on engineering responsibilities with leadership and team coordination duties.
Preferred Qualifications
- Experience supporting AI model serving and inference platforms.
- Experience integrating large language models (LLMs) and generative AI technologies into enterprise applications.
- Experience with AI orchestration and workflow frameworks, including LangChain or similar technologies.
- Knowledge of vector databases, embeddings, and semantic search technologies.
- Experience implementing Retrieval-Augmented Generation (RAG) architectures.
- Experience with distributed computing, high-performance computing, or large-scale processing environments.
- Demonstrated success leading technical transformation, modernization, or organizational change initiatives.
- Familiarity with autonomous agent frameworks and emerging AI technologies.
Knowledge, Skills, and Abilities
- Strong leadership and technical decision-making capabilities.
- Expertise in cloud-native architecture, platform engineering, and distributed systems.
- Ability to balance reliability, scalability, security, and performance requirements.
- Strong analytical and problem-solving skills.
- Ability to establish technical direction and influence engineering organizations.
- Excellent written and verbal communication skills.
- Strong mentoring, coaching, and team development abilities.
- Ability to work effectively across technical and non-technical stakeholder groups.
- Strong organizational skills and attention to detail.
Benefits
This position includes a competitive and flexible benefits package, including:
- Medical
Employer pays 100% of the monthly premium for the employee and 80% for the employee’s dependents.
- Health Savings Account (HSA)
Save for all medical, dental, vision and prescription expenses by contributing pre-tax money to an HSA account. Employer contributes 50% of the annual deductible (prorated to start date).
- Dental and Vision
Employer pays 100% of the monthly premium for the employee and 80% for dependents.
- Life Insurance
100% company-paid Life and Accidental Death & Dismemberment (AD&D) coverage offered to all full-time employees.
- Short-Term Disability
100% company-paid short-term disability. This benefit pays out 60% of earnings, with a $1,500 maximum for up to 12 weeks.
- Retirement Plan
Automatic 6% of salary contributed to the company 401(k) plan, fully vested. Employee match encouraged but not required.
- Paid Time Off (PTO) & Holidays
5–6 weeks of PTO based on tenure with the company, in addition to 11 paid holidays.
- Tuition Reimbursement
$5,000 annually for courses directly related to job role and responsibilities.
- Training Reimbursement
Paid training, certification courses, and conferences to support employee career growth.
We do not discriminate in employment on the basis of race, color, religion, sex (including pregnancy and gender identity), national origin, political affiliation, sexual orientation, marital status, disability, genetic information, age, membership in an employee organization, retaliation, parental status, military service, or other non-merit factor.