Member of Technical Staff - Data Ingestion Engineer
About the role
Data is playing an increasingly crucial role at the frontier of AI innovation. Many of the most meaningful advances in recent years have come not from new architectures, but from better data.

As a member of the Data Team, your mission is to build and operate the ingestion systems that turn the open web and other large-scale data sources into reliable, well-structured corpora for training frontier models. You will own the machinery that acquires, extracts, normalizes, versions, and delivers data to our pre-training pipelines. You'll work directly with world-class researchers to close the loop between what we collect and how it impacts model performance.

This role is ideal for engineers who love building robust distributed systems, but who also want to run experiments, reason about tradeoffs in data acquisition, and iterate quickly based on measurable impact.

Working closely with our pre-training and data quality teams, you will:
- Build and operate large-scale data ingestion systems for pre-training, including web crawling, extraction, and dataset delivery
- Run experiments to evaluate crawling strategies, extraction methods, and ingestion tradeoffs
- Analyze ingested data to identify gaps, redundancy, and areas to improve
- Build ingestion pipelines that scale reliably across large data campaigns
- Develop specialized crawlers for high-priority data sources
- Review code, debug production issues, and continuously improve ingestion infrastructure

About You
- Curious about how training data influences model capabilities, and can iterate quickly based on measurable downstream impact
- Able to collaborate tightly across functions: researchers, infra, operations, and external partners
- Enjoy working in a hybrid research-engineering role
Requirements
- Experience building web crawling, data ingestion, or large-scale data acquisition systems using Ray, Beam, Spark, or similar technologies.
- Familiarity with how LLMs are trained and evaluated, and an intuition for what makes data useful for training
- Comfortable working with very large datasets (multi-TB to PB scale) and building systems that are observable, testable, and maintainable
- Comfortable designing experiments and using data to guide system improvements
- Excellent communication skills: you can explain system behavior, and you weigh and communicate tradeoffs clearly
Additional Information
Our Mission
Reflection's mission is to build open superintelligence and make it accessible to all. We're developing open-weight models for individuals, agents, enterprises, and even nation states. Our team of AI researchers and company builders comes from DeepMind, OpenAI, Google Brain, Meta, Character.AI, Anthropic, and beyond.