
AI Systems Administrator

External
Draper · Cambridge, MA
Full-time · Hybrid · Today
Ansible · AWS · Azure · Bash · Documentation · Git

About the role

Draper is an independent, nonprofit research and development company headquartered in Cambridge, MA. The 2,000+ employees of Draper tackle important national challenges with a promise of delivering successful and usable solutions. From military defense and space exploration to biomedical engineering, lives often depend on the solutions we provide. Our multidisciplinary teams of engineers and scientists work in a collaborative environment that inspires the cross-fertilization of ideas necessary for true innovation. For more information about Draper, visit www.draper.com.

Job Description Summary

The AI Systems Administrator is instrumental in bringing AI to Draper. The incumbent implements a closed GPT environment at Draper in which several different LLMs are maintained and used throughout the organization. This role works with engineering to ensure that multiple LLMs are accessible through a chat interface, API, and assistive tools for general-purpose use across the organization. In addition, the incumbent will ensure the system health of the DraperGPT server so that additional AI infrastructure requiring large amounts of compute can be utilized without impacting the performance of other LLM resources. This will also include API interfaces with various software platforms across Draper (e.g., engineering, accounting, legal). This role helps Draper implement automation, streamline processes, and support mission-critical AI/ML workloads. Resource allocation is critical.

The role also involves traditional Linux admin duties (installing, configuring, and securing servers; scripting; monitoring) but with a strong focus on supporting and managing AI/ML (e.g., GPU servers, Kubernetes, data pipelines). The incumbent supports AI engineers, using their knowledge to guide them with solutions and recommendations.

The role is part of a team of Linux system administrators responsible for managing the functionality and efficiency of approximately 750 computers running primarily Oracle Linux. Knowledge of additional operating systems (e.g., Ubuntu and RHEL) may be necessary. The Systems Administrator maintains the integrity and security of servers and systems, serves as a front-line interface to end users and other IS teams, and makes recommendations for hardware and software purchases. The role interacts directly with vendors and VARs on proactive projects as well as in response to support issues. Duties may include installing, configuring, and maintaining new hardware/software; troubleshooting; managing permissions; and training other administrators. A solid understanding of UNIX-based operating systems is required.

This role will be hybrid (3 days/week) in Cambridge, MA and will require an Active Secret Clearance.

Job Description

Duties/Responsibilities

  • Build, operate, and troubleshoot RHEL/Oracle systems supporting GPU workloads (OS lifecycle, patching, performance, reliability).
  • Manage the GPU enablement layer: driver/toolkit lifecycle, kernel/driver compatibility, coordinated upgrades and rollback plans, and ongoing health monitoring.
  • Implement and maintain observability (metrics, logs, alerting) for system, GPU, and storage performance/health (e.g., Prometheus/Grafana and GPU telemetry such as DCGM/NVML or equivalent).
  • Couple the above observability with LLM performance and usage, and identify and warn users who over-allocate resources.
  • Maintain (i.e., reset or rebuild) LLM servers to ensure high uptime and usage capabilities across the organization.
  • Work with a team of engineers to allow for software upgrades (e.g., new models or additional AI software) to the server while maintaining security needs.
  • Partner with storage/network peers to baseline throughput/latency, identify bottlenecks, and tune the platform for predictable performance.
  • Automation & scripting: create and maintain automation for platform administration and broader Linux team workflows (provisioning/config enforcement, patch orchestration, reporting, routine maintenance), using Git-based practices (Python/Ansible).
  • Support various Linux and cloud (AWS/Azure) projects.
  • Lead projects, including large-scale migrations as well as platform redesign and implementation.
  • Utilize resources within the Linux team as well as across the IS department to reach goals.

Skills/Abilities

  • Strong production Linux administration experience (RHEL/Oracle preferred): systemd, networking, troubleshooting, performance analysis, patching, package management.
  • Strong automation skills: Bash and/or Python, plus Ansible (preferred) or equivalent configuration management; comfortable with CI/Git workflows.
  • Experience supporting enterprise platforms (incident response, root-cause analysis, postmortems, runbooks/documentation).
  • Security-minded operations in regulated environments; familiarity with CUI handling concepts and control expectations (audit logging, vulnerability remediation, change control).

Education

  • Bachelor's degree in Computer Science or a related field.

Requirements

  • 3 years' experien

Benefits

  • Health insurance
  • Vision insurance

Interested in this role?

Apply on the company's website.