Senior/Principal DevOps

Aghanim · Lisbon, Portugal
Full-time · On-site · 2mo ago
BigQuery · Capacity Planning · CDN · CI/CD · Cloudflare · Datadog

Responsibilities

  • Cloud Infrastructure Ownership
  • Own and evolve production infrastructure on GCP and Cloudflare (cloud-only, no on-prem).
  • Maintain high availability and performance for a SaaS platform serving both B2B and B2C use cases.
  • Scalability & High-Load Resilience
  • Design and operate for unpredictable spikes where load can jump 10-20× within seconds.
  • Build scaling strategies across compute, networking, and data layers (autoscaling, capacity planning, bottleneck removal, safe degradation patterns).
  • SLA/SLO & Incident Excellence
  • Be accountable for reliability outcomes: availability/latency/error rates tied to SLA/SLO.
  • Lead incident response practices: detection → mitigation → postmortem → permanent fixes (root cause elimination).
  • IaC & Kubernetes Platform Operations
  • Build and maintain Infrastructure as Code using Terraform (and Terragrunt where applicable).
  • Own Kubernetes operations on GKE: upgrades, scaling, operational hardening.
  • Write and maintain Helm charts and Kubernetes manifests where needed.
  • Observability (Datadog)
  • Build end-to-end observability using Datadog (metrics/logs/APM): dashboards, monitors, alert strategy.
  • Ensure critical system paths and dependencies are visible and actionable (reduce alert noise, increase signal).
  • DevSecOps Baseline
  • Configure and operate security tooling and monitoring (e.g., Security Command Center, scanners/analyzers).
  • Triage findings and either fix issues directly or delegate remediation to the right teams.
  • CI/CD Enablement
  • Collaborate with engineering to streamline and harden GitHub Actions / GitHub CI/CD pipelines.
  • Increase deployment safety and speed through automation and platform guardrails.
  • Cost Management
  • Own cost visibility and optimization: identify waste, right-size resources, and implement practical FinOps controls.

Required Qualifications

  • Strong production experience in DevOps/SRE (typically 5+ years, but we value impact over years).
  • Proven experience operating infrastructure for SaaS with explicit SLA commitments (B2B + B2C is a plus).
  • Hands-on expertise with GCP, especially GKE, plus relevant managed services (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Dataflow, Cloud Run, Cloud Deploy, Memorystore).
  • Strong Infrastructure-as-Code skills with Terraform (bonus: Terragrunt).
  • Strong Kubernetes operations background (GKE at scale, reliability practices, upgrades, scaling).
  • Experience with Cloudflare (WAF/DNS/edge basics; Workers/CDN is a plus).
  • Production observability experience with Datadog (or comparable), ideally including APM/logging.
  • Strong scripting/automation skills and a reliability-first mindset.

Requirements

  • Experience in game dev or similarly bursty high-load consumer products.
  • Familiarity with SOC 2 / PCI-DSS audits and security architecture requirements.
  • Service mesh experience (e.g., Cloud Service Mesh) in production.
  • Mature SRE practices: error budgets, on-call maturity, runbooks, proactive incident prevention.

What Success Looks Like

  • Platform consistently meets or exceeds SLA/SLO targets under bursty high load.
  • Incidents are detected early, mitigated quickly, and don't repeat due to strong postmortem follow-through.
  • Scaling events (10-50×) are routine rather than heroic.
  • Cloud spend is transparent, controlled, and optimized without harming reliability.
  • Engineering teams ship faster with safer, smoother CI/CD and fewer infrastructure bottlenecks.

Why Join Us

  • Cloud-only infrastructure (GCP) with meaningful scale and real reliability ownership.
  • Small team (15-20 engineers) with high autonomy and fast decision-making.
  • Direct impact on platform stability, scaling, and cost efficiency.
  • Opportunity to shape SRE culture, tooling, and operational standards in a fast-growing startup.

Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.

Benefits

Performance bonus

Additional Information

We're looking for a Senior/Principal DevOps engineer to own our cloud-only platform and keep it reliable under high-load, bursty traffic. Our services run entirely on GCP, fronted by Cloudflare, with deep observability in Datadog and CI/CD in GitHub Actions. This is a hands-on role with real ownership: ensuring we meet our SLAs/SLOs, scaling fast (10-50×), and keeping infrastructure efficient and cost-conscious as the company grows and microservices multiply.

