Principal Debug and SRE Lead
Requirements
- Experienced technical leader with 10+ years building and operating complex software, infrastructure, site reliability, or systems engineering environments.
- Strong Linux systems expert with deep experience debugging issues across operating systems, networking, distributed services, hardware, and firmware.
- Proficient in automation and software development using Python, C++, Go, Bash, or similar languages, with experience building scalable engineering tools.
- Familiar with observability platforms such as Prometheus, Grafana, OpenTelemetry, ELK, or similar technologies used to monitor large-scale production systems.
- Passionate about mentoring engineers, driving technical execution, and improving reliability through automation, operational excellence, and cross-functional collaboration.
What We Need
- Lead a team responsible for the reliability, observability, and operational health of engineering infrastructure supporting AI hardware and software development.
- Drive root-cause analysis and resolution of complex issues spanning silicon, firmware, operating systems, networking, distributed software, and development infrastructure.
- Build and improve debugging methodologies, monitoring systems, automation, and engineering workflows that increase productivity and reduce operational overhead.
- Partner closely with silicon, firmware, software, validation, and infrastructure teams to prioritize work, resolve critical issues, and improve platform reliability.
- Mentor engineers, establish technical direction, and drive execution across key initiatives that support long-term engineering success.
What You Will Learn
- How next-generation AI hardware and software platforms are developed, validated, and deployed at scale.
- Advanced debugging techniques across silicon, firmware, operating systems, networking, infrastructure, and distributed software.
- How custom RISC-V processors, AI accelerators, and large-scale AI compute clusters are monitored, operated, and optimized.
- How engineering organizations coordinate across hardware and software disciplines to deliver highly reliable AI infrastructure.
- How technical leadership influences the architecture, reliability, and operational strategy behind one of the industry's most ambitious AI computing platforms.
Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.
Benefits
Additional Information
Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.
Tenstorrent is building next-generation AI systems powered by custom silicon, large-scale distributed infrastructure, and advanced software platforms. The Debug & Site Reliability Engineering team is responsible for ensuring the reliability, observability, and operational excellence of the environments that enable AI hardware and software development. This team partners closely with silicon, firmware, software, validation, and infrastructure engineers to diagnose complex issues, improve engineering workflows, and keep critical systems operating at scale.
As the Principal Debug & Site Reliability Engineering Lead, you will combine deep technical expertise with engineering leadership to guide a team responsible for debugging complex hardware and software interactions, improving operational efficiency, and driving long-term reliability across development infrastructure. Your work will help accelerate engineering productivity while shaping the processes, tooling, and technical direction that support Tenstorrent's next-generation AI platforms.
This role is hybrid, based out of Kraków, Poland.
We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.