Site Reliability Engineer (SRE) - Observability & Platform Operations

Omnissa · Sofia, Bulgaria

Full-time · On-site · 1w ago

Ansible · Apache · Budgeting · Capacity Planning · CI/CD · Compliance

About the role

We're seeking an SRE with deep observability expertise (Grafana, Loki, Prometheus, automation, and scripting) to maintain the reliability, performance, and operational integrity of our platforms. You'll work across planned and unplanned workstreams with engineering, incident management, and service owners. The role includes an on-call rotation covering nights and weekends.

Responsibilities

Design, deploy, and maintain Loki, Grafana, Prometheus, and observability pipelines; expand logging, metrics, and tracing coverage
Build and refine automation and AI workflows for incident analysis and auto-remediation
Drive reliability through capacity planning, performance optimization, SLIs/SLOs, and root cause analysis
Participate in the global on-call rotation; manage incidents and outages and lead post-mortem reviews
Use Atlassian tools (Jira, Confluence, Opsgenie) for task, change, and incident management
Operate and improve internal clouds (VCF, CloudStack, Proxmox), Kubernetes clusters, and S3-compatible storage

Required Skills

Hands-on expertise with Grafana, Loki, Tempo (or similar tracing), and Prometheus
At least one scripting/programming language
Configuration management tools (Ansible, SaltStack)
Strong Linux skills and experience operating large-scale, highly available distributed systems
Familiarity with Kubernetes, CI/CD, and Infrastructure as Code
Comfortable with on-call participation and incident leadership
Experience with Atlassian tools; proficiency in Linux and Windows

Requirements

Exposure to Ollama, n8n, or similar AI orchestration tooling
Experience with S3/open-source object stores (SeaweedFS, Ceph)
Knowledge of virtualization stacks (Proxmox, vSphere/VCF, CloudStack)
Background in SRE culture, including SLIs/SLOs and error budgeting

Benefits

Flexible schedule

Additional Information

Job Description: We Are Omnissa!

Omnissa is the first AI-driven digital work platform, built to support flexible, secure, work-from-anywhere experiences. We integrate industry-leading solutions, including Unified Endpoint Management, Virtual Apps and Desktops, Digital Employee Experience, and Security & Compliance, into a seamless, autonomous workspace that adapts to how people work. Our platform boosts employee engagement while optimizing IT operations, security, and cost. Guided by our Core Values (Act in Alignment, Build Trust, Foster Inclusiveness, Drive Efficiency, and Maximize Customer Value), we're growing rapidly and committed to delivering meaningful impact. If you're passionate about shaping the future of work, we'd love to hear from you.

The Team

Our internal Platform Engineering team architects and operates Omnissa's enterprise-grade infrastructure. Our environment includes:

Core platforms: VMware Cloud Foundation, Apache CloudStack, Proxmox, Kubernetes, and S3-compatible object storage
Observability: Prometheus, Grafana, Loki, and Ansible
AI-driven automation: An internal incident diagnosis platform built on Ollama, n8n, and MCP servers to reduce MTTD and MTTR
