PySpark Data Engineer - Big Data & Analytics

Synechron · Bengaluru - Ec-2 Gateway Campus

Full-time · On-site · 4d ago

Airflow · Apache · AWS · Azure · Cassandra · CI/CD

Requirements

Bachelor's or Master's degree in Computer Science, Data Science, Mathematics, or a related field
Relevant certifications in big data, cloud platforms, or analytics (preferred)
Strong portfolio showcasing data pipeline projects, analytics solutions, and ML workflows

Professional Competencies

Critical thinking and analytical problem-solving skills
Excellent communication skills for technical and non-technical audiences
Leadership qualities to guide project execution and mentor junior team members
Adaptability to new tools, frameworks, and evolving project requirements
Ability to handle multiple priorities under pressure with a focus on quality and deadlines

SYNECHRON'S DIVERSITY & INCLUSION STATEMENT

Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and is an affirmative action employer. Our Diversity, Equity, and Inclusion

Benefits

Health insurance
Vision insurance
Equity / stock options

Additional Information

Job Summary

Synechron is seeking an experienced PySpark Data Engineer / Data Scientist to lead data pipeline development and advanced analytics initiatives within our financial data and index analytics division. This role plays a crucial part in building scalable data processing solutions, enabling data-driven insights, and supporting machine learning workflows in both batch and streaming environments. The ideal candidate will possess a strong technical foundation in big data processing, analytics, and software engineering, along with leadership capabilities to drive impactful data projects.

Software Requirements

Required Skills:
Proven expertise in Python programming, emphasizing clean, maintainable, and scalable code
Hands-on experience with PySpark in both batch and streaming workflows
Deep knowledge of data manipulation and feature engineering, including Pandas, NumPy, and visualization libraries (matplotlib, seaborn)
Experience with Spark components like Spark SQL, DataFrames, and Spark MLlib
Familiarity with data storage solutions: SQL and NoSQL databases (e.g., Hive, Cassandra)
Knowledge of ETL scheduling and automation tools such as Apache Airflow, Jenkins, or GitHub Actions
Experience working with cloud environments, especially Azure or AWS, for big data processing

Preferred Skills:
Hands-on experience with containerization and orchestration (Docker, Kubernetes)
Exposure to distributed storage solutions like Hadoop HDFS or Azure Data Lake

Overall Responsibilities

Design, develop, and optimize large-scale data pipelines using PySpark for structured, semi-structured, and unstructured data (5 years of experience)
Lead the building of ML pipelines for training, validation, and deployment of models in streaming and batch modes (5 years of experience)
Write high-quality, efficient code that supports data transformation, cleaning, and feature engineering
Collaborate with data scientists, analysts, and stakeholders to understand data requirements and deliver actionable insights
Build and maintain a reusable code base and automation scripts for data processing and model validation
Monitor pipeline performance, troubleshoot issues, and implement improvements to ensure robustness and scalability
Stay up to date with the latest developments in big data processing, ML techniques, and analytics tools to improve system efficiency and analytics capabilities

Technical Skills (By Category)

Programming Languages: Python and PySpark (required); Scala, Java (preferred)
Databases & Data Management: SQL (MySQL, SQL Server), NoSQL (Cassandra, MongoDB), Hive, Data Lakes
Cloud Technologies: Azure Data Factory, Azure Synapse, AWS Glue, S3 (preferred)
Frameworks & Libraries: Spark MLlib, Pandas, NumPy, seaborn, matplotlib, scikit-learn (preferred)
Development Tools & Methodologies: Jupyter, PyCharm, VSCode, Git, CI/CD (Jenkins, GitHub Actions), Airflow
Security & Data Governance: Data privacy principles, secure data ingestion and output, compliance

Experience Requirements

7-12 years of experience in data engineering, analytics, or data science roles, with significant hands-on experience in big data processing and ML pipelines
Proven track record of building scalable data pipelines and supporting ML workflows in enterprise environments
Experience working with structured, semi-structured, and unstructured data across financial domains
Previous leadership or mentorship experience in a technical team is preferred

Day-to-Day Activities

Develop and optimize data pipelines for financial and index data using PySpark and related tools
Build ML workflows, feature engineering, and model deployment pipelines in both streaming and batch environments
Collaborate with business analysts and data scientists to refine data requirements and deliver insights
Automate data ingestion, transformation, and validation processes
Monitor system performance, troubleshoot issues, and implement tuning activities
Review code and pipeline health with peer teams, and uphold best practices in software development and data security

Interested in this role?

Apply on the company's website.