Lead Site Reliability Engineer - Observability
WHAT MAKES US, US

Join some of the most innovative thinkers in FinTech as we lead the evolution of financial technology. If you are an innovative, curious, collaborative person who embraces challenges and wants to grow, learn and pursue outcomes with our prestigious financial clients, say Hello to SimCorp!

At its foundation, SimCorp is guided by our values: caring, customer success-driven, collaborative, curious, and courageous. Our people-centered organization focuses on skills development, relationship building, and client success. We take pride in cultivating an environment where all team members can grow, feel heard, valued, and empowered.

If you like what we're saying, keep reading!

WHY THIS ROLE IS IMPORTANT TO US

SimCorp's Observability strategy is to deliver a consistent and coherent Observability approach across the full SimCorp One ecosystem, which spans different technology stacks, products and services across the organization. This is required so that we can observe SimCorp products and services seamlessly, investigate emerging problems efficiently, provide high-quality software to our clients, and stay within agreed resolution times. It also provides insight into KPIs, SLOs, SLAs and cost attribution.

As a Lead Site Reliability Engineer - Observability, you will blend site reliability engineering principles with deep telemetry expertise to ensure system visibility, uptime, and performance. The candidate must possess in-depth knowledge and expertise in telemetry data collection, analysis, and implementation, and must fully understand the intricacies of different telemetry sources such as metrics, traces, logs and events and how to derive meaningful insights from them. The candidate will work closely with product management, architects and engineering teams to establish unified visibility across the full stack, from LLM-driven agents to backend services.
You won't just monitor systems; you'll define the patterns and tools that are a core part of empowering and driving SimCorp's engineering culture. Your contributions will drive stability, continuous improvement, and operational excellence in our Azure-based environments. This role blends hands-on engineering, incident response, platform configuration, and service quality, guided by ITIL and SRE best practices.

WHAT YOU WILL BE RESPONSIBLE FOR

Support the operation and enhancement of mission-critical environments for both new and existing Cloud Native products and services.
Deploy and manage instrumentation for applications to gain granular insights into service health.
Assist engineering teams in implementing and maintaining metrics, logs, and traces for applications and infrastructure.
Unify observability tooling across teams, ensuring metrics, logs, and traces flow into a central platform (e.g., Application Insights or equivalent).
Enable and configure OpenTelemetry-based data collection within Azure Monitor Application Insights by leveraging the Azure Monitor OpenTelemetry Distro.
Make sure AI agent frameworks adopt the semantic convention to ensure interoperability and consistency in observability data.
Work with product development teams to enable structured logging, basic distributed tracing, and core metrics.
Support incident response by gathering logs, metrics, and traces to perform root cause analysis using observability tools.
Build tools and automation to eliminate toil, improve engineering velocity and developer experience, and improve system reliability.
Define and manage SLOs and error budgets in partnership with Engineering teams.
Work flexibly in regular and evening shifts on a rotational basis, and provide weekend or on-call support as needed.
Collaborate with Agile teams and take part in design discussions with clients, vendors, and stakeholders.
Contribute to knowledge sharing across multiple Product Areas.
Leverage a strong foundation in ITIL practices, including problem, change, and incident management.

WHAT WE VALUE

Bachelor's degree in Computer Science or related field (Master's is a plus).
5+ years of experience in Site Reliability, Observability, DevOps, or Cloud Engineering roles.
Must have expertise with Microsoft Azure Cloud.
Must have experience working with observability frameworks like OpenTelemetry and distributed tracing systems.
Expertise in Infrastructure as Code (IaC) using Bicep, ARM and Terraform.
Strong understanding of instrumenting, tracing, and correlating AI/LLM workflows with infrastructure telemetry.
Solid experience in monitoring and logging tools (Azure Monitor, Application Insights, Datadog, Log Analytics).
Knowledge of AI/ML-based anomaly detection, log aggregation and analysis tools like Microsoft Azure Anomaly Detector or equivalent.
Experience with Agentic/LLM-based systems (like LangChain, Celery, OpenAI APIs, orchestration frameworks).
Experience working with application reliability platforms like Checkly or equivalent.
Experience setting up synthetic monitoring using Playwright or equival