
Senior Software Engineer, Robinhood Command Center

External
Robinhood · New York, NY
Full-time · On-site · 1w ago
Capacity Planning · Cross-functional Collaboration · Grafana · Incident Response · Leadership · Mentoring

Responsibilities

  • Serve as a senior technical leader driving the long-term reliability and observability strategy across Robinhood's infrastructure
  • Partner closely with engineers across many disciplines to raise the bar for operational excellence and incident response
  • Lead incident mitigation efforts by coordinating service owners, facilitating time-sensitive decisions such as rollbacks and traffic shifts, and maintaining a clear source of truth during active incidents
  • Develop and maintain incident management processes and procedures to ensure timely resolution and minimize customer impact
  • Own incident discovery at the company level by defining and maintaining global dashboards and alerts tied to critical user journeys (CUJs), availability, and business-impact metrics
  • Own and evolve incident response tooling and processes, including education, adoption, and measurement of MTTD/MTTR improvements
  • Drive post-incident governance and learning, defining standards for postmortems, SEV reviews, and follow-up tracking to ensure durable reliability improvements
  • Design and implement next-generation failure mitigation strategies that avoid full-region or full-datacenter failovers
  • Define and build frameworks to improve monitoring, alerting, and observability across hundreds of services and systems
  • Define and own the roadmap for bringing observability to critical user journeys across Robinhood's products
  • Deliver key insights and executive-level reporting to enable better business decisions around service quality and reliability
  • Act as a force multiplier through mentoring, technical influence, and contributions to hiring and engineering culture

What you bring

  • 5+ years of software engineering experience, including significant experience operating production systems
  • 2+ years focused on reliability engineering, infrastructure, distributed systems, or production operations
  • Hands-on experience serving in incident leadership roles (e.g., IMOC, incident commander, primary oncall)
  • Strong communication and cross-functional collaboration skills, especially during high-severity incidents
  • Deep knowledge of systems reliability, observability frameworks, and fault-tolerant architecture design
  • Experience with multi-region or multi-cluster architectures, capacity planning, and failover strategies
  • Familiarity with modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana)
  • Demonstrated ability to drive measurable improvements in MTTD, MTTR, availability, or customer impact

Benefits

  • Challenging, high-impact work to grow your career
  • Performance driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
  • Best in class benefits to fuel your work, including 100% paid health insurance for employees with 90% coverage for dependents
  • Lifestyle wallet - a highly flexible benefits spending account for wellness, learning, and more
  • Employer-paid life & disability insurance, fertility benefits, and mental health benefits
  • Time off to recharge

Health insurance · 401(k) · Flexible schedule · Equity / stock options · Performance bonus

Additional Information

Join us in building the future of finance. Our mission is to democratize finance for all. An estimated $124 trillion of assets will be inherited by younger generations in the next two decades. The largest transfer of wealth in human history. If you're ready to be at the epicenter of this historic cultural and financial shift, keep reading.

About the team & role

We are building an elite team, applying frontier technologies to the world's biggest financial problems. We're looking for bold thinkers. Sharp problem-solvers. Builders who are wired to make an impact. Robinhood isn't a place for complacency, it's where ambitious people do the best work of their careers. We're a high-performing, fast-moving team with ethics at the center of everything we do. Expectations are high, and so are the rewards.

The Robinhood Command Center (RCC) is a newly formed reliability team that serves as the front line for detecting, coordinating, and mitigating production incidents across Robinhood. As part of Robinhood's broader reliability initiative, RCC works closely with product engineering, reliability, observability, infrastructure, and business teams to reduce customer impact and shorten incident duration.

As a Senior Engineer, you will be part of the founding RCC team, helping define how Robinhood responds to and learns from incidents at scale. This is a highly visible role focused on incident leadership, operational excellence, and reliability tooling. You will not own product services or core infrastructure, but you will own the processes and tools that enable fast, high-quality incident response.

This role is based in our New York, New York office, with in-person attendance expected at least 3 days per week. At Robinhood, we believe in the power of in-person work to accelerate progress, spark innovation, and strengthen community. Our office experience is intentional, energizing, and designed to fully support high-performing teams.


Interested in this role?

Apply on the company's website.