Middle/High-Middle DevOps / SRE Engineer

Aghanim · Lisbon, Portugal

Full-time · On-site · 2mo ago

BigQuery · CDN · CI/CD · Cloudflare · Datadog · DevSecOps

Responsibilities

Platform Operations (GCP/GKE)
Operate and support production systems on GCP, primarily GKE and managed services.
Execute platform improvements and operational tasks delegated by Senior/Principal owners.
IaC & Delivery Enablement
Implement infrastructure changes via Terraform (and Terragrunt where used).
Maintain and evolve Helm charts and Kubernetes manifests.
Improve reliability of GitHub Actions / CI/CD workflows and deployment automation.
Observability & Monitoring (Datadog)
Build and maintain Datadog dashboards/monitors and keep alerting healthy.
Close monitoring gaps across critical components; reduce noisy alerts and improve signal quality.
Incident Response
Participate in incident response and operational support: triage, mitigation using runbooks, escalation, and follow-up fixes.
Contribute to postmortems with clear facts, timelines, and actionable remediation tasks.
Security Basics (DevSecOps)
Run/configure security tooling and monitoring, help triage findings, and implement fixes under guidance.
Support secure-by-default practices (secrets hygiene, access controls, baseline hardening).
Cost Awareness
Identify and implement cost optimizations (right-sizing, waste removal, efficiency improvements) without harming reliability.
Required Qualifications
Hands-on production experience with Kubernetes (ideally GKE) and basic cluster operations.
Working experience with Terraform and Helm in PR-based workflows.
Familiarity with GCP services used in SaaS operations (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Cloud Run, Memorystore).
Monitoring/alerting and troubleshooting skills (preferably Datadog).
Strong scripting/automation mindset to reduce manual work and prevent repetitive incidents.
Reliability awareness: understanding how changes affect availability/latency and how to operate under SLA constraints.

Requirements

Cloudflare basics (WAF/DNS, edge concepts; Workers/CDN is a plus).
Experience writing/maintaining runbooks and participating in postmortems.
Exposure to SOC 2 / PCI-DSS requirements or willingness to learn.
Experience in high-load consumer products or game dev.
What Success Looks Like
Improved monitoring coverage and healthier alerting (less noise, faster detection).
Faster, safer deployments with fewer manual steps and fewer production regressions.
Incidents are triaged effectively and resolved within expected timelines.
Platform reliability improves through steady delivery of operational fixes and automation.
Costs trend in the right direction thanks to recurring optimizations and guardrails.
Why Join Us
Cloud-only, high-load environment with real engineering challenges (not "just keep the lights on").
Small team with ownership, autonomy, and quick iteration.
Strong opportunity to grow into broader platform ownership and SRE leadership paths.
Direct impact on reliability, scalability, and developer velocity.
Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.

Benefits

Health insurance

Additional Information

We're looking for a Middle/High-Middle DevOps / SRE Engineer to help run and improve our production platform in GCP + GKE, fronted by Cloudflare, with observability in Datadog and CI/CD in GitHub Actions. You'll work closely with Senior/Principal engineers, implementing reliability improvements, expanding monitoring coverage, and reducing operational toil, which is especially important in a high-load system with sudden traffic spikes.
