Hardware Systems Engineer - Data Center HWE
ExternalPrepare for this interview
EliteAI-generated questions, company research, and talking points tailored to this role
About the role
We are seeking an experienced Server Hardware Engineer with deep expertise in CPU architectures, compute performance analysis, and cloud-scale hardware optimization. In this role, you will evaluate next-generation server platforms, assess workload requirements, and guide strategic decisions on processor selection across x86, ARM, and accelerator-enabled systems. You will partner closely with SRE, platform engineering, and capacity planning teams to ensure our Compute cloud infrastructure runs on the most efficient and cost-effective compute platforms.","responsibilities":"CPU Architecture & Performance Engineering: Evaluate and compare CPU architectures (Intel, AMD, ARM, emerging, accelerators) for cloud and containerized workloads. Conduct microbenchmarking and workload-specific benchmarking (DB, caching, ML inference, microservices, network-heavy workloads). Analyze CPU metrics including IPC, cache hierarchy behavior, NUMA topology, SMT/Hyperthreading efficiency, turbo behaviors, and power/performance curves. Build performance models to predict workload scaling, throughput, and cost efficiency across CPU generations. Server Hardware & Platform Analysis: Assess server platforms including memory configurations, PCIe topology, I/O bandwidth, NIC configurations, and accelerator integration (GPU, SmartNIC, DPU). Optimize server configurations for Kubernetes clusters, virtualized environments, and bare-metal workloads. Collaborate with OEMs and silicon vendors on roadmap evaluations and feature testing. Cloud & Kubernetes Workload Profiling: Profile and analyze workload patterns across large-scale distributed systems. Evaluate CPU/memory utilization trends to identify bottlenecks, over-provisioning, or scaling inefficiencies. Provide input into cluster sizing, autoscaling strategies, and capacity planning based on real workload telemetry. Build dashboards and telemetry pipelines to visualize compute efficiency across clusters. Procurement & Cost Optimization: Lead CPU and server platform recommendations for procurement cycles. Model TCO (total cost of ownership) including performance per watt, performance per dollar, and rack-level density. Provide guidance on the right mix of CPU SKUs, instance types, and server configurations for various cloud services.