
Software Engineer III - AI/ML Platform Operations - Remote

External
Aaaie · Arizona - Home Teleworkers
Full-time · Remote · Today
AWS · CI/CD · Generative AI · Leadership · Machine Learning · MLOps


Benefits

Health insurance · Remote work options

Additional Information

External candidates: For your application to be processed correctly, please sign in before you apply.
Internal candidates: Please go to Workday and click the "Find Jobs" link under Career.

Thank you for considering opportunities with us!

Job Title: Software Engineer III - AI/ML Platform Operations - Remote
Requisition Number: R7739 Software Engineer III - AI/ML Platform Operations - Remote (Open)
Location: Arizona - Home Teleworkers
Additional Locations:

Job Information

CSAA Insurance Group (CSAA IG), a AAA insurer, is one of the leading personal lines property and casualty insurance groups in the United States. Here, every employee shapes our mission. We build innovative, human-centered solutions that help AAA members prevent, prepare for, and recover from life's uncertainties. You will join a collaborative, inclusive culture where your strengths have room to grow and your ideas can drive real impact. Step into a role where you can contribute to our shared success through meaningful work.

We are actively hiring for a Software Engineer - AI/ML Platform Operations - Remote

Your Role:

We are seeking a Software Engineer - AI/ML Platform Operations to lead the operational excellence, reliability, and support of our enterprise AI and data platforms. This role is responsible for ensuring the stability, scalability, observability, governance, and operational readiness of AI/ML solutions that power critical business capabilities.

This is not a traditional software application development role. While strong software engineering skills are essential, the primary focus is on AI platform operations, MLOps, automation, reliability engineering, deployment support, observability, governance, and continuous improvement of enterprise AI capabilities.

Your Work:

You will work across a modern technology ecosystem that includes Palantir Foundry, AWS Bedrock, Amazon SageMaker, cloud-native services, and emerging Generative AI technologies.
You will partner with Data Engineering, Data Science, Architecture, Infrastructure, Security, and Product teams to support production AI workloads and enable the successful adoption of AI capabilities across the organization.

AI Platform Operations & Reliability

Provide technical leadership for AI/ML platforms including Palantir, AWS Bedrock, Amazon SageMaker, and related cloud-native technologies.
Ensure platform reliability, scalability, performance, security, and operational readiness for production AI workloads.
Support deployment, monitoring, maintenance, and lifecycle management of AI/ML solutions and Generative AI services.
Establish operational standards, support models, service-level objectives (SLOs), and platform governance practices.

MLOps, Automation & Observability

Design and implement automation, monitoring, observability, and operational tooling to improve platform reliability and efficiency.
Develop and maintain dashboards, health metrics, alerts, logging frameworks, and operational runbooks.
Enhance CI/CD pipelines, deployment automation, infrastructure-as-code, and model release processes.
Implement best practices for MLOps, model monitoring, model lifecycle management, and AI operational governance.

Incident Management & Problem Resolution

Serve as a senior escalation point for complex production issues involving AI platforms, machine learning workloads, cloud infrastructure, and data integrations.
Lead root cause analysis efforts and drive corrective and preventive actions to improve platform stability.
Solve performance, availability, deployment, and integration issues across AI and data ecosystems.
Partner with engineering and business teams to rapidly restore service and minimize operational risk.

Technical Leadership & Collaboration

Provide mentorship, technical guidance, and operational expertise to engineers and platform teams.
Influence platform strategy, architecture decisions, operational processes, and technology adoption.
Collaborate with team members to align platform capabilities with business priorities and AI adoption goals.
Communicate complex technical concepts effectively to both technical and non-technical audiences.

Continuous Improvement & Innovation

Remain current with advancements in AI/ML, Generative AI, cloud technologies, platform engineering, and reliability practices.
Identify opportunities to improve operational efficiency, governance, security, and developer experience.
Champion modern engineering practices including automation, observability, DevOps, Site Reliability Engineering (SRE), and AI Operations (AIOps).

Required Experience, Education and Skills

3+ years of progressive experience in software engineering, platform engineering, cloud operations, MLOps, DevOps, or related technical disciplines.
Bachelor's degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience.
Experience supporting production cloud-based applications and services in AWS environme



Interested in this role?

Apply on the company's website.
