Site Reliability Engineer Tech Lead
Responsibilities
- System Reliability: Design, implement, and maintain automated solutions to ensure high availability, resiliency, and scalability of applications and services.
- Incident Management: Collaborate with stakeholders to respond to production incidents, develop protocols to minimize downtime, conduct postmortems, and implement preventive measures to avoid recurrence.
- Monitoring & Observability: Set up monitoring systems to track performance metrics, meet system health and performance targets, and address potential issues before they impact users.
- Performance Optimization: Analyze system performance, identify bottlenecks, and optimize for speed, scalability, and resource utilization.
- Automation: Leverage automation tools to reduce manual interventions in application management tasks and ensure efficiency, repeatability, and minimal human error.
- Collaboration: Work closely with stakeholders to support new features, deployments, and compliance initiatives.
- Capacity Planning: Forecast resource needs and plan for future growth to ensure system stability and scalability.
- Documentation: Create and maintain up-to-date documentation for systems, processes, and troubleshooting procedures.
- Continuous Improvement: Exhibit the intellectual curiosity to continuously learn emerging technologies and practices in order to design and deliver best-of-breed solutions for MF Technology.
Requirements
- Proven expertise in designing, developing, and maintaining automation frameworks for application operations, including infrastructure provisioning, deployment pipelines, monitoring, and incident response, using tools such as Ansible, Terraform, Jenkins, and related technologies.
- Extensive experience with observability and monitoring platforms (Elasticsearch Observability, Elasticsearch APM, OpenTelemetry), with a focus on automating system health checks, alerting, and root cause analysis.
- Strong proficiency in programming and scripting languages (e.g., Python, Go, Bash, Java), with a track record of automating repetitive operational tasks and building self-healing solutions.
- Hands-on experience with cloud infrastructure (AWS, Azure, GCP) and container orchestration (Docker, Kubernetes, EKS), including automated provisioning, scaling, and recovery of resources.
- Demonstrated ability to lead and implement transformative initiatives that reduce manual toil, streamline operational workflows, and drive continuous improvement in reliability and efficiency.
- Experience with CI/CD tools and configuration management for fully automated build, test, and deployment pipelines.
- Deep understanding of SRE principles such as SLIs, SLOs, error budgets, and applying automation to enforce and improve these metrics.
- Experience with data management platforms and automation of data workflows (e.g., MongoDB, Snowflake, SQL, Dremio, Qlik Replicate).
- Familiarity with enterprise job schedulers (Autosys, Control-M) and automation of batch processes and job orchestration.
- Solid foundation in networking, databases, and distributed systems, with experience automating troubleshooting and recovery procedures.
- Experience with agile and DevOps cultures, driving adoption of automation best practices across teams.
- Track record of championing automation-first initiatives that modernize legacy application operations and deliver measurable improvements in reliability, scalability, and team productivity.
- Ability to mentor and guide teams in adopting automation tools and practices, fostering a culture of continuous improvement and operational excellence.
- Relevant certifications in cloud, automation, or SRE/DevOps (e.g., AWS DevOps Engineer, Google SRE) are a plus.
- Bachelor's degree in computer science, information technology, or related field (or equivalent experience).
Keys to Success in this Role
- Demonstrate a sense of accountability and ownership.
Additional Information
At Freddie Mac, our mission of Making Home Possible is what motivates us, and it's at the core of everything we do. Since our charter in 1970, we have made home possible for more than 90 million families across the country. Join an organization where your work contributes to a greater purpose.

Position Overview: At Freddie Mac, you will do important work to build a better housing finance system, and you'll be part of a team helping to make rental housing more accessible and affordable across the nation. The Technology & Operational Risk department within the Multifamily (MF) division is seeking a Site Reliability Engineer (SRE) who will blend software engineering with IT operations to ensure the reliability, availability, scalability, and performance of key systems, services, and environments.