Senior SRE/Platform Engineer

External

Equifax · Canada

Full-time · Hybrid · 2w ago

ArgoCD · AWS · Azure · CI/CD · Datadog · Design Systems

Responsibilities

Kubernetes Management: Design, provision, and manage hardened, secure, cost-optimized GKE and AWS EKS production clusters.
Infrastructure as Code: Standardize automated, cross-cloud infrastructure delivery utilizing Terraform.
GitOps CD: Maintain a GitOps model via ArgoCD to match environment state directly to code repositories.
Deployment Strategies: Execute Canary deployments (online, live-traffic validation) and Blue-Green deployments (offline/batch, zero-downtime, instant rollback).
Cloud Networking: Architect complex topologies including VPCs, Shared VPCs, Peering, Transit Gateways, and Cloud Interconnect/Direct Connect.
Security & Connectivity: Manage cross-cloud connectivity and enforce zero-trust network policies within Kubernetes.
Observability: Implement end-to-end distributed tracing and infrastructure monitoring using Datadog.
Telemetry & Alerting: Build custom dashboards, monitors, and SLO/SLI alerts for deep visibility into app and infra health.
Architectural Partnership: Translate Enterprise Architects' high-level blueprints into automated, scalable, and secure technical implementations.
FinOps Governance: Drive AWS/GCP/Azure cost savings (rightsizing, Spot/Preemptible instances, storage tiers) and automated governance (tagging, lifecycle policies, budget alerts).
AI-Driven Automation: Leverage AI/ML frameworks to drive end-to-end automation across the infrastructure lifecycle, from automated IaC (Terraform) generation to predictive observability and self-healing systems with automated Root Cause Analysis (RCA).

What experience you need

Professional Experience: Requires 7-10+ years of enterprise-scale experience in Platform Engineering, Site Reliability Engineering (SRE), or DevOps.
Multi-Cloud Ecosystems: Proven mastery of managing production-grade environments across AWS and Google Cloud (GCP), plus Azure experience specifically for cost governance.
Deep Kubernetes Expertise: 4+ years of hands-on experience provisioning and managing EKS and GKE clusters, including production upgrades, hardening, and namespace isolation.
Infrastructure as Code (IaC): Advanced proficiency with Terraform for multi-cloud resource provisioning, utilizing modular, reusable code and state management.
GitOps & CI/CD Automation: Experience building declarative workflows using ArgoCD or Flux, alongside automated pipelines that integrate security scanning, testing, and validation.
Advanced Deployment Strategies: A proven track record of executing Canary deployments for high-traffic online services and Blue-Green deployments for large-scale batch/offline workloads.
Multi-Cloud Networking & Zero-Trust Security: Expertise in hybrid architectures (Transit Gateways, Shared VPCs, Direct Connect/Cloud Interconnect) combined with Kubernetes Network Policies and cloud IAM management.
Observability & Reliability: Hands-on experience with Datadog APM for distributed tracing, dashboard creation, defining SLIs/SLOs, and configuring alerting logic to reduce MTTR.
FinOps & Cost Governance: Capability to lead cloud financial initiatives through workload rightsizing, strategic use of Spot/Preemptible instances, and building automated policy enforcement for cloud spend.

What could set you apart

"Platform-as-a-Product" Mindset: Ability to treat infrastructure as a product to champion the developer experience, leveraging internal portals like Backstage.
Developer Tooling O

Benefits

Health insurance · Vision insurance

Additional Information

Synopsis of the role

Site Reliability Engineering (SRE)/Platform Engineering at Equifax is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to Equifax engineering principles. SRE is also an engineering approach to building and running production systems: we engineer solutions to operational problems. Our SREs are responsible for overall system operation, and we use a breadth of tools and approaches to solve a broad set of problems. Our practices include limiting time spent on operational work, holding blameless postmortems, and proactively identifying and preventing potential outages.

Our SRE culture of diversity, intellectual curiosity, problem solving, and openness is key to its success. Equifax brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn, grow, and take pride in our work.

