
Senior Production Engineer

Veeam Software · Pune, India
Full-time · On-site · 2w ago
Azure · CI/CD · Compliance · Documentation · Grafana · Incident Response

About the role

As a Senior Production Engineer, you will play a leading role in designing and operating reliable, scalable systems for Veeam's Data Cloud platform. You will own high‑impact production efficiency, automation, and documentation initiatives, drive reliability and observability improvements, and own or participate in the full incident lifecycle, from on‑call response through mitigation to leading post‑incident reviews and driving improvements across support and development teams. You will work as part of a team of skilled engineers, acting as a senior bridge between support and development and as a driving force for change. You will communicate with product managers and security professionals to ensure our services are production‑ready, performant, and fault‑tolerant, and that we rapidly incorporate user feedback into improvements.

Responsibilities

Production

  • Own the reliability, performance, and operability of complex, business‑critical production services and workflows.
  • Own complex and escalated production issues from support, and drive long‑term fixes in collaboration with engineering, including code, configuration, and architecture changes.
  • Proactively address systemic risks uncovered during problem‑solving, and convert them into long‑term engineering improvements.
  • Lead production efficiency initiatives, and define, develop, and maintain processes, run‑books, and knowledge base integrity across multiple services or domains.

Operational Excellence

  • Define, build, and maintain production monitoring systems for critical services, ensuring deep visibility into system health and user experience.
  • Continuously improve alerting to minimize noise and ensure actionable, well‑documented runbooks with clearly owned responses.
  • Define and maintain SLIs/SLOs for key services, and use error budgets to guide operational and product decisions, influencing priorities where necessary.
  • Turn manual processes into robust automation, and champion automation patterns and tooling adoption across teams.
  • Own and drive the post‑mortem review process and actions arising from incident analysis, ensuring high‑quality follow‑up and measurable reliability improvements.

Team Collaboration

  • Collaborate with the support organization as a senior escalation point and systematically feed back knowledge, tooling enhancements, and improvement recommendations.
  • Collaborate with developers throughout the lifecycle of changes, from design through rollout and patch delivery, ensuring safe deployments and efficient incident mitigation.
  • Lead or significantly contribute to design reviews to ensure services are operable with minimal manual intervention in production (automation, safe deployments, clear run‑books, resilience patterns), and share learnings through documentation and feedback.
  • Mentor and coach other engineers in production engineering practices (observability, incident handling, automation, design for failure), helping to raise the operational bar across the organization.

Requirements

  • 5-8+ years of experience in software engineering, site reliability, production engineering, or senior technical support roles operating distributed systems.
  • Experience with log analysis and advanced troubleshooting in complex production environments.
  • Strong programming experience (e.g., JavaScript, Go, TypeScript, Java, or C#).
  • Experience deploying and troubleshooting systems on public cloud platforms (Azure preferred).
  • Strong familiarity with observability tooling (e.g., Elastic, Prometheus, Grafana, OpenTelemetry).
  • Solid understanding of distributed systems, networking, automation, and CI/CD.

Preferred

  • Prior on‑call or incident response experience, including leading significant incidents or problem‑management efforts.
  • Background in automation, performance testing, or service scalability, ideally at significant scale.
  • Familiarity with compliance or security best practices, and experience incorporating them into production design and operations.

Why Join Veeam?

  • Make a high‑impact contribution to the architecture and reliability of Veeam's first global SaaS product suite in a senior capacity.
  • Help shape a modern SRE / Production Engineering organization, influencing best practices, tooling, and culture.
  • Collaborate with highly skilled tea

Benefits

Health insurance

Additional Information

Veeam is the Data and AI Trust Company, specializing in helping organizations ensure their data and AI are fully understood, secured, and resilient to enable the acceleration of safe AI at scale. As the market leader in both data resilience and data security posture management, Veeam is built for the convergence of identity, data, security, and AI risk. Headquartered in Seattle with offices in more than 30 countries, Veeam protects over 550,000 customers worldwide, who trust Veeam to keep their businesses running. Join us as we go fearlessly forward together, growing, learning, and making a real impact for some of the world's biggest brands.

