We're seeking a Sr. Engineer - ML Platform to maintain and optimize CrowdStrike's mission-critical ML infrastructure. You'll diagnose complex distributed systems issues and ensure platform reliability for infrastructure processing billions of events daily.
Responsibilities
System Optimization & Performance:
Profile and optimize Ray clusters and Spark jobs on K8s and Cloud (EMR/Dataproc)
Troubleshoot JupyterHub spawner issues, kernel crashes, and resource allocation
Optimize SLURM job scheduling, GPU allocation, and HPC cluster utilization
Infrastructure & Monitoring:
Build observability solutions and automated health checks
Develop runbooks, alerting workflows, and incident response procedures
Maintain platform stability metrics (SLAs, error rates, latency)
Collaboration:
Partner with ML and ML Platform engineers to resolve workflow issues
Conduct post-mortems and mentor on debugging techniques
Requirements
12+ years in distributed systems engineering
5+ years debugging ML platforms in production
Deep expertise in 3+ of: Ray, Spark, JupyterHub, SLURM, K8s
Performance profiling, optimization, and capacity planning
What Sets You Apart:
Open-source ML infrastructure contributions
Experience with high-throughput inference systems and reducing MTTR
Published debugging guides or tools
Chaos engineering and GPU/CUDA debugging experience
On-call and incident management experience
Benefits of Working at CrowdStrike:
Market leader in compensation and equity awards
Comprehensive physical and mental wellness programs
Competitive vacation and holidays to recharge
Paid parental and adoption leave
Professional development opportunities for all employees regardless of level or role
Employee Networks, geographic neighborhood groups, and volunteer opportunities to build connections
Vibrant office culture with world class amenities
Great Place to Work Certified™ across the globe
CrowdStrike is proud to be an equal opportunity employer. We are committed to fostering a culture of belonging where everyone is valued for who they are and empowered to succeed. We support veterans and individuals with disabilities through our affirmative action program.
If you need assistance accessing or reviewing the information
Benefits
Health insurance
Paid time off
Equity / stock options
Parental leave
Additional Information
As a global leader in cybersecurity, CrowdStrike protects the people, processes and technologies that drive modern organizations. Since 2011, our mission hasn't changed - we're here to stop breaches, and we've redefined modern security with the world's most advanced AI-native platform. We work on large-scale distributed systems, processing almost 3 trillion events per day, and this traffic is growing daily. Our customers span all industries, and they count on CrowdStrike to keep their businesses running, their communities safe and their lives moving forward. We're also a mission-driven company. We cultivate a culture that gives every CrowdStriker both the flexibility and autonomy to own their careers. We're always looking to add talented CrowdStrikers to the team who have limitless passion, a relentless focus on innovation and a fanatical commitment to our customers, our community and each other. Ready to join a mission that matters? The future of cybersecurity starts with you.