Pallavi Langhe

Data Engineer / ETL Developer / Big Data Developer

Pune, India

Data engineer with 2+ years of experience building distributed ETL pipelines and cloud-based data workflows using Python, PySpark, Scala, Airflow, Snowflake, and AWS.

Pune, India · 2+ years in data engineering · PySpark / Airflow / Snowflake / AWS
50m -> 7m execution time improvement on a key ETL workload
75% infrastructure savings through Spark optimization
Airflow automation, monitoring, and near real-time refreshes
About Me

Designing dependable data pipelines for analytics teams.

I build scalable data pipelines that are fast, dependable, and easy for analytics teams to use.

My work spans ETL design, workflow orchestration, validation, monitoring, performance tuning, and cloud data services, with a focus on moving data reliably from raw sources into analytics-ready systems.

My recent work includes distributed processing in PySpark and Scala, Airflow-based orchestration, Snowflake refresh workflows, and data quality checks that help downstream reporting stay dependable.

Highlights

A snapshot of recent results across ETL performance, cost efficiency, and data availability.

85% runtime reduction on a key ETL pipeline
75% infrastructure cost savings through Spark optimization
50m -> 7m execution time improvement on a production workload
Near real-time data refresh availability enabled through Airflow automation
Work Experience

Experience in ETL delivery, orchestration, and cloud data workflows.

Hands-on work across pipeline development, workflow monitoring, validation, and cloud-based data systems.

Oct 2024 - Present Infocepts Pune, India

Data Engineer

  • Designed and implemented distributed ETL pipelines using PySpark and Scala for high-throughput, fault-tolerant processing.
  • Developed proactive monitoring and alerting using Airflow, CloudWatch, and Python scripts to keep data workflows observable and recoverable.
  • Orchestrated complex data workflows in Apache Airflow, improving scheduling, dependency handling, and operational visibility.
  • Implemented validation and reconciliation frameworks that improved data accuracy and trust across analytics platforms.
  • Introduced incremental processing and partitioning strategies that improved efficiency and reduced compute costs.
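As an illustration of the validation and reconciliation checks described above, here is a minimal, self-contained sketch in plain Python. The row structures, column names, and tolerance are hypothetical examples, not the actual framework; a production version would run inside the pipeline against real source and target tables.

```python
# Minimal sketch of a source-vs-target reconciliation check.
# Row structures and the tolerance are hypothetical examples; in a
# real pipeline these rows would come from the source system and the
# analytics table being validated.
from decimal import Decimal

def reconcile(source_rows, target_rows, amount_key="amount",
              tolerance=Decimal("0.01")):
    """Compare row counts and summed amounts between two datasets.

    Returns a dict of check results rather than raising, so a caller
    (e.g. an orchestrator task) can decide whether to alert or fail.
    """
    src_count, tgt_count = len(source_rows), len(target_rows)
    src_sum = sum(Decimal(str(r[amount_key])) for r in source_rows)
    tgt_sum = sum(Decimal(str(r[amount_key])) for r in target_rows)
    return {
        "row_count_match": src_count == tgt_count,
        "amount_match": abs(src_sum - tgt_sum) <= tolerance,
        "source_rows": src_count,
        "target_rows": tgt_count,
    }

# Toy data standing in for a source extract and a loaded target table.
source = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "5.50"}]
target = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "5.50"}]
result = reconcile(source, target)
```

Returning structured results instead of raising keeps the check composable: an Airflow task can log the dict, push it as a metric, and fail the run only on the checks that matter.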
Apr 2024 - Sep 2024 NDSoftTech Solutions Pune, India

Data Engineer Intern

  • Cleaned and transformed raw data using Python and SQL for recurring reports and structured analysis.
  • Wrote SQL queries and simple ETL jobs to load reporting tables and support downstream business use cases.
  • Supported testing and monitoring of AWS-based workflows and helped resolve data quality issues.
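The kind of lightweight Python-plus-SQL ETL job mentioned above can be sketched as follows. sqlite3 stands in for the real database here, and the table and column names are illustrative, not taken from actual work.

```python
# Illustrative sketch of a small Python + SQL ETL job that cleans raw
# rows and loads a reporting table. sqlite3 stands in for the real
# warehouse; table and column names are hypothetical.
import sqlite3

def load_daily_report(conn, raw_rows):
    """Clean raw rows and load them into a reporting table."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS daily_report (day TEXT, orders INTEGER)"
    )
    # Basic cleaning: drop rows with missing values, normalise types.
    cleaned = [
        (str(r["day"]), int(r["orders"]))
        for r in raw_rows
        if r.get("day") and r.get("orders") is not None
    ]
    cur.executemany("INSERT INTO daily_report VALUES (?, ?)", cleaned)
    conn.commit()
    return len(cleaned)

conn = sqlite3.connect(":memory:")
raw = [
    {"day": "2024-05-01", "orders": "42"},
    {"day": None, "orders": "7"},        # dropped by the cleaning step
    {"day": "2024-05-02", "orders": 13},
]
loaded = load_daily_report(conn, raw)
```

Parameterised `executemany` keeps the load safe from injection and efficient for batch inserts, which is the usual shape for recurring report refreshes.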
Featured Project

Unified Semantic Layer

A project focused on improving runtime, reducing cost, and making data available faster for downstream analytics.

Media and Entertainment Client / Infocepts

What I delivered

  • Built high-performance ETL pipelines in PySpark and Scala to support a unified semantic layer initiative for a media and entertainment client.
  • Optimized Spark execution with better partitioning and parallel processing, dramatically reducing runtime on a critical workload.
  • Automated Snowflake refresh workflows with Python-based Airflow DAGs so data moved closer to near real-time availability.
  • Connected CI/CD into the delivery flow through GitHub Actions, shipping JAR artifacts to AWS S3 and working alongside EMR Serverless, Glue, Athena, SNS, and Secrets Manager.
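A CI flow like the one described in the last bullet might look roughly like this GitHub Actions sketch. The workflow name, build command, bucket, and artifact path are placeholders, not the actual project configuration.

```yaml
# Hypothetical sketch of a CI workflow that builds a Scala JAR and
# ships it to S3; names, paths, and secrets are placeholders.
name: build-and-ship
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build JAR
        run: sbt assembly
      - name: Upload artifact to S3
        run: aws s3 cp target/scala-2.12/app-assembly.jar s3://example-artifacts/jars/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```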

Impact

50m -> 7m pipeline runtime after optimization
75% infrastructure cost reduction
Airflow + Python automated refresh orchestration

Key outcomes

  • Faster delivery for downstream analytics consumers.
  • Lower compute spend through smarter workload design.
  • More dependable operational visibility through monitoring and validation.
Skills & Tools

Technologies I use across data engineering, orchestration, cloud infrastructure, and data warehousing.

Data Engineering

Apache Spark, PySpark, ETL, Data Modeling, Data Warehousing, Distributed Data Processing

Cloud and Big Data

AWS EMR Serverless, S3, Athena, Glue, EC2, SNS, IAM, Lambda, CloudWatch, Secrets Manager

Programming

Python, Scala, SQL, JavaScript

Orchestration and CI/CD

Apache Airflow, GitHub Actions, Jenkins, GitLab CI/CD, Workflow Automation

Databases

Snowflake, MySQL, PostgreSQL, DynamoDB, MongoDB, AWS Athena
Education & Contact

Open to data engineering roles, collaborations, and conversations around ETL, orchestration, and cloud data systems.

B.Tech. in Computer Science and Engineering

MIT-WPU, Pune

CGPA 8.6

Diploma in Information Technology

Government Polytechnic Awasari, Pune

88.56%

Get in touch

Feel free to reach out for data engineering roles, collaborations, or conversations around ETL, distributed processing, and cloud data platforms.