Platform Reliability Engineer

Apify · Prague, Czechia

Full-time · Remote · 2d ago

ArgoCD · AWS · CI/CD · CloudFormation · Cypress · Documentation

Requirements

You have hands-on experience choosing what to measure in production - not just reading dashboards, but picking signals that reflect the customer experience.
You're comfortable with incidents and alerts, from early detection through resolution and follow-up so similar issues are less likely to recur.
You have hands-on experience with Prometheus, Grafana, OpenTelemetry, or similar, and with alert-routing tools such as PagerDuty.
You read and write code: you can follow services and pipelines across the stack and collaborate on technical details with the teams building them.
You know what good post-incident culture looks like in practice - blame-free, learning-focused, and actually used to make things better - even if your past title never mentioned reliability.
You can write clear, concise guidance that teams adopt, and you work constructively toward sound decisions.
You're driven to automate repetitive tasks and improve developer workflows.
Meaningful hands-on experience as an application or backend developer - you've built things that run in production and approach observability as someone who needs it as a "user," not just the person who sets it up.
Experience building and maintaining infrastructure on AWS (EC2, EKS, S3, CloudFormation, or similar), and hands-on experience with container technologies.
Some familiarity with CI/CD pipelines or release practices - enough to have an informed opinion on what makes deployments reliable and safe.
Don't worry if you don't meet all of the above criteria. We value diverse skills and experience and would love to hear from you.
Our tech stack
Infra: AWS Compute (Kubernetes (EKS), EC2, Lambda), Helm, ArgoCD, MongoDB, Redis, DynamoDB, S3, GitHub Actions
Monitoring: Grafana, Prometheus, OpenTelemetry, Mezmo, PagerDuty
Frontend: React.js, styled-components, Storybook, Chromatic, Cypress, Playwright
Backend: TypeScript/Node.js, Nest.js, Next.js, Express.js, Docusaurus, Vitest
Tools: GitHub, Notion, Google Workspace
Editor and AI assistant of your choice (GH Copilot, Cursor, Claude, Gemini, or JetBrains AI)
Process: two-week sprints, code reviews, tests, automating whatever we can, and deploying multiple times per day.
By the end of the first 3 months, we expect you to:
Have completed the general onboarding process.
Have built working relationships with platform engineers, engineering leads, and others involved in production response, and aligned on how you'll collaborate.
Understand, in principle, how the Apify platform works, and be able to handle smaller problems, incidents, or bugs on the infrastructure you work with most.
Have mapped how we handle monitoring, incidents, and alerts today - where the friction is and where a focused improvement would help.
Have published initial monitoring, observability, and alerting guidelines - covering signals, naming, key dashboards, and alerting principles (severity, routing, and noise reduction) - aligned with existing tooling.
Be participating in incident reviews and translating patterns into improved playbooks.
Be contributing actively in team ceremonies (planning, grooming) and technical discussions, and in touch with other teams to support their infrastructure needs.
By the end of the first 6 months, we expect you to:
Be working on bigger tasks mostly independently (while staying fearless about asking for help).
Have built a network across en

Additional Information

Apify is the largest marketplace of tools for AI, with 40,000+ Actors helping people and agents get real-time web data, track competitors, generate leads, or integrate their apps. Actors are built by a global creator community that now earns more than $1.2 million every month. Join us to help people put the web to work. Apify can find missing children, protect consumers from fake discounts across the EU, and feed data to AI chatbots.

To support our mission, we're looking for a Platform Reliability Engineer with a developer's mindset. You've shipped code and you care what happens when it runs in production (speed, failures, recovery). You'll help us strengthen how Apify monitors systems, handles incidents, and routes alerts so engineering teams can ship with confidence.

You won't be on-call. This role is focused on sustainable improvement, not after-hours emergency response.

What you'll be working on:
Monitoring & signals: Operate and improve our monitoring stack (Prometheus, Grafana, OpenTelemetry) - instrument services to expose the right metrics, define what we watch in production, and shape alerting so teams get actionable signals without the noise.
When things go wrong: Help define how we run incidents - clear communication, structured learning afterward, and supporting artifacts (status page, runbooks).
With the team: Work with platform and product engineers to make reliability standards practical - help teams adopt better tooling or practices when things change, and write documentation people actually use.
