SRE Leader

External
Bybit · Kuala Lumpur, Malaysia
Full-time · On-site · 2w ago
Capacity Planning · Chaos Engineering · Compliance · Web3 · Zero Trust

About the role

Established in 2018, Bybit is one of the world's leading cryptocurrency exchanges and digital financial platforms, serving over 80 million users across more than 200 countries and regions. Powered by world-class technology and a user-first mindset, Bybit delivers a seamless ecosystem across trading, payments, wealth management, custody, institutional services, and Web3, connecting users to the future of digital finance. Our core values define how we build: we listen, care, and improve to create products and experiences that put users first. Backed by a global team of ambitious builders, problem-solvers, and innovators, we foster a high-performance, fast-moving environment where talent is empowered to drive real impact at a global scale. Supported by 24/7 multilingual customer service and a strong commitment to innovation, we are shaping the future of finance through technology, collaboration, and bold execution. Today, Bybit is recognized as one of the most trusted and transparent platforms in the digital asset industry, continuing to expand its global presence while building the infrastructure for the next generation of financial services.
Core responsibilities

I. Reliability engineering system
- Establish a company-wide SLO/SLA system: define quantifiable reliability indicators (availability, latency, error rate) for each Line of Business, and drive change cadence and investment decisions based on Error Budget.
- Build an MTTD/MTTR measurement system, set tiered targets, and optimize continuously: P-1 target MTTD
- Build fault self-healing capabilities: an automated fault detection → diagnosis → recovery chain that reduces reliance on manual intervention.
- Promote chaos engineering practice: conduct regular fault drills and proactively uncover weak links in the system.
- Establish a change risk control system: standardized canary releases, pre-change impact assessment, and automatic rollback.

II. Cost governance system (key)
- Build a data-driven cost governance closed loop, automated end to end: cost visualization → attribution analysis → optimization decision → execution verification → continuous monitoring.
- Establish a scientific capacity planning model based on the correlation between business indicators (QPS/TPS/number of users) and resource consumption, rather than reserving N times the capacity on gut feeling.
- Promote the implementation of FinOps culture.
- Line of Business/application cost billing and showback.
- Define cost efficiency metrics ($/transaction, $/user, $/QPS) and benchmark them against the industry.
- Embed cost assessment into the resource request process so that 100% of new resources undergo capacity assessment.
- Automated cost optimization engine:
  - Automatic identification of low-load resources with scale-down recommendations (AI-based anomaly detection and prediction models)
  - Automated purchase decision system for Reserved Instances/Savings Plans
  - Optimized elastic scaling strategies: pre-scaling based on predictive models to reduce over-reservation
  - Automatic reclamation and lifecycle management of idle resources
- Goal: reduce annual cloud cost by 15-20% without affecting business SLOs.

III. Automated operation and maintenance (key)
- Toil elimination system: measure team toil ratio (target
- GitOps/IaC fully implemented: infrastructure 100% defined as code, with all changes executed through PR review and automated pipelines.
- Environment consistency guarantee: use IaC to detect and automatically repair configuration drift across dev/staging/prod.
- Intelligent operations and maintenance (AIOps):
  - AI-based alarm aggregation, root cause analysis, and repair suggestions
  - Automatic detection of log/metric anomalies, moving from passive alerting to proactive discovery
  - Knowledge base AI: query operational status and execute standard operations in natural language
- Self-service platform:
  - Business teams can complete more than 80% of routine operations (scaling up and down, configuration changes, permission requests) on their own
  - Target automated handling rate for operations tickets > 60%
- On-call system optimization:
  - Alarm accuracy > 95% (eliminating alarm fatigue)
  - Establish automated runbook execution capability
  - On-call quality measurement and continuous improvement

IV. Financial cloud isolation and multi-compliance-site deployment (key)
- Financial-grade network isolation architecture design and operations:
  - Design and implement network isolation strategies across multiple accounts, VPCs, and regions
  - Standardized management of security groups, endpoints, and dedicated lines across compliance sites
- Zero Trust network architecture rollout: micro-segmentation, least privilege, dynamic access control.
- Efficient compliance site build-out capability:
  - Goal: reduce deployment time for new compliance site infrastructure from weeks to hours

Benefits

Paid time off

Interested in this role?

Apply on the company's website.
