Senior Site Reliability Engineer
About the role
DigiCert is a global leader in intelligent trust. We protect the digital world by ensuring the security, privacy, and authenticity of every interaction. Our AI-powered DigiCert ONE platform unifies PKI, DNS, and certificate lifecycle management to secure infrastructure, software, devices, messages, AI content, and agents. Learn why more than 100,000 organizations, including 90% of the Fortune 500, choose DigiCert to stop today's threats and prepare for a quantum-safe future at www.digicert.com.
Job summary
The Site Reliability Engineer (SRE) collaborates with development teams to embed reliability, scalability, and performance best practices throughout the software development lifecycle. This role bridges software engineering and cloud operations, ensuring mission-critical systems remain highly available and resilient. By integrating reliability early, the SRE fosters a culture of shared responsibility while enabling rapid and safe feature delivery.
Responsibilities
- Design and build fault-tolerant, high-performing systems that meet Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Implement monitoring, alerting, distributed tracing, and logging to ensure real-time system health visibility and proactive issue resolution.
- Act as a first responder for production incidents, conduct blameless postmortems, and drive root cause analysis (RCA) and corrective actions.
- Develop self-healing systems, automated deployments, and scaling solutions to minimize toil and improve system efficiency.
- Improve continuous integration and deployment pipelines to enable safe, rapid, and reliable feature rollouts.
- Review code, debug issues, and perform quality assurance (QA) on software components to enhance system reliability and performance.
- Work closely with development teams to ensure best practices in software architecture, coding standards, and operational readiness.
- Forecast scalability needs and optimize cloud infrastructure costs while balancing performance and efficiency.
- Ensure production environments meet security and compliance requirements, collaborating with teams to mitigate vulnerabilities and enforce best practices.
- Work closely with development teams to embed reliability at every stage rather than treating it as an afterthought.
- Use error budgets to balance feature velocity with system stability.
- Implement observability and automation-first principles to measure system health and drive continuous improvement.
- Leverage game days, chaos engineering, and resilience testing to validate system robustness and refine operational processes.
What you will have
- Extensive experience in distributed systems, cloud-native architectures (AWS, GCP, Azure), and DevOps practices.
- Proficiency in Kubernetes, Terraform, CI/CD pipelines, and Infrastructure as Code (IaC).
- Strong scripting and automation skills in Python, Go, Bash, or similar languages.
- Expertise in observability tools such as Prometheus, Grafana, Datadog, Splunk, New Relic, and OpenTelemetry.
- Ability to troubleshoot complex production issues and drive scalable, resilient solutions.
- Experience reviewing code, debugging applications, and conducting software testing to ensure high reliability and quality.