Staff Engineer, Platform Engineering & Operational Health
About the role
At AudioEye, we believe access to digital content is a fundamental right, not a privilege. Our mission is clear: eliminate every barrier to digital accessibility so that everyone, regardless of ability, can experience the web without limitations. We are a team of passionate problem-solvers who are driven by purpose and impact. Every challenge we tackle moves us closer to a future where creating accessible experiences is the standard. If you're looking for meaningful work where you can drive real change, influence how people with disabilities experience the internet, and be part of a mission that matters, AudioEye is the place for you.

AudioEye is seeking a Staff-level Software Engineer to join the Platform Engineering team with a deep focus on operational health and reliability. The primary mission is to diagnose and fix foundational issues that cause incidents, slow down deployments, and burden on-call engineers. This role requires someone who thinks like an SRE, operates like a systems engineer, and builds like a software engineer, identifying root causes of operational pain and implementing systemic solutions. The successful candidate will own the technical strategy for our operational posture, set reliability standards, mentor peers in operational thinking, and drive architectural decisions with company-wide impact.

This is not a full-time on-call position. This is a strategic platform engineering role focused on making on-call sustainable and incidents rare.
How you'll contribute:
Conduct comprehensive audits of infrastructure, deployment processes, incident patterns, and on-call burden
Identify foundational issues causing operational pain: fragile systems, deployment friction, poor observability, and architectural weaknesses
Establish baseline metrics for system health and operational efficiency
Prioritize improvements systematically based on impact to on-call burden and reliability
Design and implement solutions that address the root causes of incidents
Eliminate single points of failure in critical paths
Implement patterns for graceful degradation and rapid recovery
Build comprehensive observability, logging, metrics, and tracing infrastructure
Identify and automate repetitive manual work that burdens operational staff
Establish organizational standards and expectations for reliability
Design and maintain runbooks, playbooks, and incident response processes
Create feedback loops from incidents to systemic improvements
Support CI/CD pipelines that enable safe, frequent deployments
Develop operational tooling (dashboards, alerts, automation)
Reduce friction in how engineers interact with infrastructure and deploy code
Scale infrastructure responsibly as the organization grows
Optimize platform reliability, performance, and cost through capacity planning, workload tuning, and architectural tradeoffs
Own the technical strategy and roadmap for infrastructure, reliability, and operational posture
Mentor engineers across the organization on operational thinking and reliability engineering
Lead architecture reviews and establish technical standards
Build organizational practices and knowledge that outlast any single engineer