Lead AI Infrastructure Engineer

Location: Annapolis Junction, MD

Clearance: TS/SCI with Polygraph required

Work Type: On-site

Salary: $293,000-$306,000

Position Overview

We are seeking an experienced Lead AI Infrastructure Engineer to provide technical leadership for the design, deployment, and operation of enterprise artificial intelligence and machine learning platforms. This role will lead the development and sustainment of critical AI infrastructure components, with a focus on scalable model deployment, platform reliability, and support for AI-enabled applications and services.

The successful candidate will combine hands-on engineering expertise with team leadership responsibilities, serving as a technical lead for platform initiatives while supporting the professional development of engineering staff. This position requires strong cloud engineering, platform architecture, and organizational leadership skills to drive innovation, operational excellence, and technology adoption across multiple teams.

Key Responsibilities

Design, implement, and optimize infrastructure supporting large-scale AI model deployment and inference services.
Lead the development, deployment, and maintenance of production AI applications and platform services.
Serve as the technical lead for AI infrastructure initiatives, coordinating activities across engineering teams and stakeholders.
Provide mentorship, coaching, and professional development support to engineering team members.
Support team operations, resource planning, and administrative coordination activities.
Define technical solutions for complex and evolving requirements.
Establish and maintain technical standards, policies, governance processes, and engineering best practices.
Drive adoption of emerging technologies, automation capabilities, and platform modernization initiatives.
Design, implement, and oversee monitoring, logging, alerting, and observability solutions.
Ensure the reliability, availability, scalability, performance, and security of AI platform components.
Communicate technical strategies, project status, and recommendations to stakeholders at multiple organizational levels.
Lead troubleshooting, root cause analysis, and continuous improvement efforts for production systems.

Required Qualifications

Education and Experience

Bachelor's degree in Computer Science, Software Engineering, Information Systems, Computer Engineering, or a related technical discipline and twelve (12) years of relevant experience.
Four (4) additional years of directly related experience may be substituted for the degree requirement.

Technical Qualifications

Extensive experience designing, building, deploying, and operating enterprise-scale production systems.
Deep expertise in systems integration across diverse technologies, platforms, and cloud environments.
Hands-on experience designing, deploying, and managing cloud infrastructure within Amazon Web Services (AWS).
Advanced experience administering and deploying applications using Kubernetes.
Strong software development skills using Python.
Experience implementing and scaling observability solutions using technologies such as:
- Application Performance Monitoring (APM) tools
- OpenTelemetry
- Grafana
- Prometheus
Experience developing and maintaining highly available, resilient, and secure distributed systems.
Proven ability to lead complex technical initiatives and influence organizational technology adoption.
Experience establishing technical standards, governance frameworks, and engineering policies.
Excellent communication, stakeholder engagement, and leadership skills.
Demonstrated ability to balance hands-on engineering responsibilities with leadership and team coordination duties.

Preferred Qualifications

Experience supporting AI model serving and inference platforms.
Experience integrating large language models (LLMs) and generative AI technologies into enterprise applications.
Experience with AI orchestration and workflow frameworks, including LangChain or similar technologies.
Knowledge of vector databases, embeddings, and semantic search technologies.
Experience implementing Retrieval-Augmented Generation (RAG) architectures.
Experience with distributed computing, high-performance computing, or large-scale processing environments.
Demonstrated success leading technical transformation, modernization, or organizational change initiatives.
Familiarity with autonomous agent frameworks and emerging AI technologies.

Knowledge, Skills, and Abilities

Strong leadership and technical decision-making capabilities.
Expertise in cloud-native architecture, platform engineering, and distributed systems.
Ability to balance reliability, scalability, security, and performance requirements.
Strong analytical and problem-solving skills.
Ability to establish technical direction and influence engineering organizations.
Excellent written and verbal communication skills.
Strong mentoring, coaching, and team development abilities.
Ability to work effectively across technical and non-technical stakeholder groups.
Strong organizational skills and attention to detail.

Benefits

This position includes a competitive and flexible benefits package, including:

Medical

Employer pays 100% of the monthly premium for the employee and 80% for the employee’s dependents.

Health Savings Account (HSA)

Save for medical, dental, vision, and prescription expenses by contributing pre-tax money to an HSA. Employer contributes 50% of the annual deductible (prorated to start date).

Dental and Vision

Employer pays 100% of the monthly premium for the employee and 80% for dependents.

Life Insurance

100% company-paid Life and Accidental Death & Dismemberment (AD&D) coverage offered to all full-time employees.

Short-Term Disability

100% company-paid short-term disability. This benefit pays out 60% of earnings, with a $1,500 maximum for up to 12 weeks.

Retirement Plan

Automatic employer contribution of 6% of salary to the company 401(k) plan, fully vested. Employee contributions are encouraged but not required.

Paid Time Off (PTO) & Holidays

5–6 weeks of PTO based on tenure with the company, in addition to 11 paid holidays.

Tuition Reimbursement

$5,000 annually for courses directly related to job role and responsibilities.

Training Reimbursement

Paid training, certification courses, and conferences to support employee career growth.

We do not discriminate in employment on the basis of race, color, religion, sex (including pregnancy and gender identity), national origin, political affiliation, sexual orientation, marital status, disability, genetic information, age, membership in an employee organization, retaliation, parental status, military service, or other non-merit factor.


