Explore data-centric methods for improving coding LLMs, including data filtering, quality assessment, deduplication, data mixture, and diversity analysis.
Build synthetic data and evaluation pipelines for code generation, code editing, repo-level reasoning, tool use, and multi-step coding tasks.
Run experiments to analyze how data, model, and training strategies affect coding capabilities.
Work with large-scale code corpora, developer activity data, and agentic coding trajectories.
Who We Look For
Strong programming skills in Python.
Solid understanding of machine learning and large language models.
Familiarity with LLM pre-training, mid-training, code models, data curation, evaluation, agents, or tool use.
Strong experiment design, data analysis, and problem-solving skills.
Interest in code intelligence, software engineering automation, and agentic coding.
Requirements
Experience with code data processing, GitHub-scale data, synthetic data, LLM evaluation, semantic deduplication, or agentic coding.
Research experience, publications, or open-source projects in related areas are a plus.
Benefits
Access to large-scale real-world coding data and agentic trajectories.
Rich compute resources and model APIs for fast research iteration.
Opportunities to work on real-world coding model applications and the full model development loop.
Equal Employment Opportunity at Tencent
As an equal opportunity employer, we firmly believe that diverse voices fuel our innovation and allow us to better serve our users and the community. We foster an environment where every employee of Tencent feels supported and inspired to achieve individual and common goals.
Additional Information
Business Unit
Technology Engineering Group (TEG) is responsible for supporting the company and its business groups on technology and operational platforms, as well as the construction and operation of R&D management and data centers. TEG provides users with a full range of customer services. As the operator of the largest networking, devices, and data center in Asia, TEG also leads the Tencent Technology Committee in strengthening infrastructure R&D through internal and distributed open source collaboration, constructing new platforms and supporting business innovation.
What the Role Entails
We are looking for research interns to work on foundational areas for coding language models, including pre-training data, mid-training data, synthetic data generation, evaluation, and agentic coding.