Site Reliability Engineer - Platforms

External

Toyota · Plano, TX

Full-time · On-site · Today

Ansible · AWS · Bash · Capacity Planning · Compliance · Linux

About the role

Collaborative. Respectful. A place to dream and do. These are just a few words that describe what life is like at Toyota. As one of the world's most admired brands, Toyota is growing and leading the future of mobility through innovative, high-quality solutions designed to enhance lives and delight those we serve. We're looking for talented team members who want to Dream. Do. Grow. with us.

An important part of the Toyota family is Toyota Financial Services (TFS), the finance and insurance brand for Toyota and Lexus in North America. While TFS is a separate business entity, it is an essential part of this world-changing company, delivering on Toyota's vision to move people beyond what's possible. At TFS, you will help create best-in-class customer experience in an innovative, collaborative environment.

Toyota does not offer support or sponsorship of job applicants for employment-based visas or any other work authorization for this role now or in the future. You must have the right to work in the United States and not require Toyota support or sponsorship for immigration-related employment (e.g., H-1B, O-1, E-3, H-1B1, TN, F-1 OPT, F-1 STEM OPT, F-1 CPT, 'job flexibility benefits' (also known as I-140 or Adjustment of Status portability), etc.) now or in the future. You should not apply for this role if you will require Toyota to assist with immigration support or sponsorship now or in the future.

Who we're looking for

The Toyota Financial Services Technology Operations Center is looking for a passionate and highly motivated Site Reliability Engineer (SRE) - Platforms. The SRE - Platforms reports to the Manager of the SRE Department. In this role, you will apply software engineering principles to ensure the availability, performance, and stability of TFS's enterprise platforms and infrastructure services. You will play a key role in maintaining and modernizing our infrastructure platforms, including the AWS Cloud Platform and core operating platforms such as Linux and Windows.

What you'll be doing

Manage and maintain operating systems across Red Hat Enterprise Linux (RHEL), Amazon Linux, and Windows Server environments.
Perform OS-level configuration, hardening, and lifecycle management following industry best practices and organizational security standards.
Manage user access, permissions, file systems, storage, networking, and core OS services across platforms.
Coordinate with relevant teams for maintenance and change management processes as needed.
Build or update, own, and maintain the end-to-end patch management lifecycle across all supported operating systems.
Maintain tooling and workflows for automated patch scheduling, compliance reporting, and remediation tracking.
Ensure patch compliance targets are consistently met and documented.
Work with tools such as Red Hat Satellite, AWS Systems Manager (SSM), WSUS, Ansible, or similar patch management platforms.
Design and maintain observability setups, including metrics, logging, and alerting, for all managed systems.
Ensure all systems are instrumented with appropriate monitoring agents and are integrated into centralized observability platforms.
Define and maintain meaningful alerting thresholds, dashboards, and runbooks to provide operational visibility.
Proactively identify gaps in monitoring coverage and address them before they impact reliability.
Participate in incident triage and use observability data to drive faster resolution.
Manage and maintain backup and restore solutions, such as Cohesity and AWS backups, for operating systems and critical data.
Regularly test and validate restore procedures to ensure reliability and meet defined Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
Document backup policies, schedules, and recovery procedures.
Identify and remediate failures in backup jobs and ensure alerts are in place for backup health monitoring.
Write and maintain scripts and automation workflows to reduce manual toil and streamline operational tasks (e.g., provisioning, configuration management, log rotation, disk cleanup, service restarts).
Develop and implement self-healing mechanisms for common, well-understood system issues such as service crashes, disk space alerts, memory pressure, and connectivity failures.
Use tools such as Bash, Python, PowerShell, Ansible, or Terraform to automate repeatable operational workflows.
Contribute to internal automation libraries and maintain version-controlled infrastructure code.
Troubleshoot complex production issues and implement permanent fixes to improve reliability.
Build and maintain components required to automate operational workflows and reduce toil using Python or an equivalent scripting language.
Participate in capacity planning, disaster recovery, and business continuity exercises.
Define and manage SLIs/SLOs, health checks, and automated remediation processes.
Collaborate across teams to ensure service reliability,

Benefits

Health insurance · Vision insurance

Interested in this role?

Apply on the company's website.
