Site Reliability Engineer

Yes Energy · Bucharest, Romania

Full-time · On-site · 3w ago

AWS · Azure · Bash · Bitbucket · CI/CD · CloudFormation

Requirements

Bachelor's or Master's degree in Computer Science, Information Technology, or a related field; or equivalent practical experience.
Minimum of five years of experience supporting mission-critical production infrastructure, SaaS platforms, web applications, or service-oriented systems.
Deep hands-on AWS experience, including production operations for compute, networking, IAM, storage, load balancing, monitoring, and troubleshooting; greater depth is strongly valued.
Proven incident management experience, including responding to pages, leading high-severity incidents, coordinating responders, writing postmortems and RCA, and driving corrective actions.
Experience with containers and Kubernetes, monitoring and alerting systems, CI/CD tooling such as Jenkins and Bitbucket, and operational automation or scripting.
Strong communicator and collaborator who can provide technical leadership, delegate effectively, mentor engineers, and unblock teams during high-pressure operational work.
Working knowledge of scripting and automation tools such as Python, PowerShell, Bash, Terraform, CloudFormation, Azure CLI, or AWS CLI.
Strong Linux and Windows systems administration and troubleshooting experience in production environments.

Key Competencies & Preferred Qualifications

Problem Solving: Frames and solves complex, ambiguous production issues; surfaces cross-system or systemic failure modes.
Systems Thinking: Takes a broad, holistic view of how software, infrast

Additional Information

Join the Market Leader in Electric Power Data and Analytics Solutions

The electrical grid is the largest and most complicated machine ever built. Yes Energy's industry-leading electric power trading analytics software provides real-time visibility into the massive amount of data generated by the North American electrical grid daily. Our unique and innovative view of the data informs real-time trading decisions and mid-to-long-term investment decisions that keep utility prices low, support the energy transition, and keep the grid running. It's both challenging work and work with a purpose. Be a part of our successful, growing business during international transformation.

Position Summary

We are hiring a Site Reliability Engineer to serve as a senior, hands-on reliability leader across all product lines. This role sits within the Systems Administration team, part of the Product Technology Services (PTS) group, and is focused squarely on operational excellence: incident response, systems availability, monitoring and alerting, release support, and reliability improvements across our production services.

During your working hours, you will be expected to take ownership of active incidents: respond to pages, coordinate response across engineering teams, diagnose production issues, restore service quickly, and drive clear communication through resolution. Incident response and operational readiness are central to the role, not occasional side responsibilities.

This is a senior individual contributor and team-lead role responsible for setting SRE standards, mentoring additional SREs as the function grows, unblocking engineering teams, and improving the systems, pipelines, and practices that keep Yes Energy products reliable at scale.

Position Details

Salary Range: Net 14.000 - 18.000 RON/month
Location: Hybrid (Bucharest, Romania)
Schedule: Full-time; 2-3 days in the office
Reporting to: Manager of Systems Administration

Primary Responsibilities

Respond to pages across all product lines and lead incident response from initial detection through mitigation and recovery, while driving root-cause remediation that reduces repeat incidents, prevents similar future alerts, and improves overall service reliability.
Serve as the incident owner when online, coordinating cross-functional responders and making clear decisions under pressure to restore service quickly.
Build and improve monitoring, alerting, dashboards, service-level objectives (SLOs), runbooks, and escalation processes so issues are detected quickly and responders have useful context.
Operate and troubleshoot Linux and Windows systems across AWS, Azure, OCI, and related hybrid or multi-cloud environments.
Support production web applications, containers, and Kubernetes workloads, with a focus on reliability, scalability, and availability.
Work with load balancers, forward and reverse proxies, DNS, networking, firewalls, security groups, and traffic-routing patterns to diagnose and resolve availability and performance issues.
Unblock engineering teams by diagnosing and fixing Jenkins jobs, CI/CD pipelines, deployment failures, environment issues, and release blockers.
Partner with Engineering, Security, DBA, and Product Technology Services teams to improve operational readiness, production support models, and reliability practices.
Mentor SRE and Systems team members, establish practical standards, and help lead the growth of a stronger site reliability function.

