Middle/High-Middle DevOps / SRE Engineer
External | Full-time | On-site | 2mo ago
BigQuery, CDN, CI/CD, Cloudflare, Datadog, DevSecOps
Responsibilities
- Platform Operations (GCP/GKE)
  - Operate and support production systems on GCP, primarily GKE and managed services.
  - Execute platform improvements and operational tasks delegated by Senior/Principal owners.
- IaC & Delivery Enablement
  - Implement infrastructure changes via Terraform (and Terragrunt where used).
  - Maintain and evolve Helm charts and Kubernetes manifests.
  - Improve reliability of GitHub Actions / CI/CD workflows and deployment automation.
- Observability & Monitoring (Datadog)
  - Build and maintain Datadog dashboards/monitors and keep alerting healthy.
  - Close monitoring gaps across critical components; reduce noisy alerts and improve signal quality.
- Incident Response
  - Participate in incident response and operational support: triage, mitigation using runbooks, escalation, and follow-up fixes.
  - Contribute to postmortems with clear facts, timelines, and actionable remediation tasks.
- Security Basics (DevSecOps)
  - Run/configure security tooling and monitoring, help triage findings, and implement fixes under guidance.
  - Support secure-by-default practices (secrets hygiene, access controls, baseline hardening).
- Cost Awareness
  - Identify and implement cost optimizations (right-sizing, waste removal, efficiency improvements) without harming reliability.
Required Qualifications
- Hands-on production experience with Kubernetes (ideally GKE) and basic cluster operations.
- Working experience with Terraform and Helm in PR-based workflows.
- Familiarity with GCP services used in SaaS operations (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Cloud Run, Memorystore).
- Monitoring/alerting and troubleshooting skills (preferably Datadog).
- Strong scripting/automation mindset to reduce manual work and prevent repetitive incidents.
- Reliability awareness: understanding how changes affect availability/latency and how to operate under SLA constraints.
Requirements
- Cloudflare basics (WAF/DNS, edge concepts; Workers/CDN is a plus).
- Experience writing/maintaining runbooks and participating in postmortems.
- Exposure to SOC 2 / PCI-DSS requirements or willingness to learn.
- Experience in high-load consumer products or game dev.
What Success Looks Like
- Improved monitoring coverage and healthier alerting (less noise, faster detection).
- Faster, safer deployments with fewer manual steps and fewer production regressions.
- Incidents are triaged effectively and resolved within expected timelines.
- Platform reliability improves through steady delivery of operational fixes and automation.
- Costs trend in the right direction thanks to recurring optimizations and guardrails.
Why Join Us
- Cloud-only, high-load environment with real engineering challenges (not "just keep the lights on").
- Small team with ownership, autonomy, and quick iteration.
- Strong opportunity to grow into broader platform ownership and SRE leadership paths.
- Direct impact on reliability, scalability, and developer velocity.
Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.
Benefits
Health insurance
Additional Information
We're looking for a Middle/High-Middle DevOps / SRE Engineer to help run and improve our production platform on GCP and GKE, fronted by Cloudflare, with observability in Datadog and CI/CD in GitHub Actions. You'll work closely with Senior/Principal engineers, implementing reliability improvements, expanding monitoring coverage, and reducing operational toil, which is especially important in a high-load system with sudden traffic spikes.