Developed data pipelines using Scala Spark and PySpark on GCP to ingest, transform, and load terabytes of data for near real-time analytics.
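A minimal PySpark sketch of such an ingest-transform-load flow; the GCS bucket paths and field names (`event_time`, `event_date`) are hypothetical placeholders, not the actual pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-ingest").getOrCreate()

# Ingest raw JSON events from a GCS landing bucket (paths are hypothetical).
raw = spark.read.json("gs://example-landing-bucket/events/*.json")

# Transform: parse timestamps, drop malformed rows, derive a partition column.
events = (
    raw.withColumn("event_ts", F.to_timestamp("event_time"))
       .filter(F.col("event_ts").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Load into a partitioned dataset for downstream near-real-time queries.
(events.write
       .mode("append")
       .partitionBy("event_date")
       .parquet("gs://example-curated-bucket/events/"))
```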
Created data models to organize and structure pipeline data for analytics use.
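One way such a data model can be expressed in Spark is an explicit schema; the `transaction_schema` fields below are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType
)

spark = SparkSession.builder.getOrCreate()

# Explicit schema acting as a lightweight data model: names, types, and
# nullability are enforced at read time instead of being inferred.
transaction_schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("account_id", StringType(), False),
    StructField("amount", DoubleType(), True),
    StructField("event_ts", TimestampType(), True),
])

txns = spark.read.schema(transaction_schema).json("gs://example-landing-bucket/txns/")
```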
Improved Spark job performance by 250% by optimizing resource allocation, refining query execution plans, and leveraging built-in Spark features, yielding notable reductions in processing time and resource usage.
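A sketch of representative tuning levers (adaptive query execution, shuffle parallelism, executor sizing, broadcast joins); the specific values and table paths are illustrative, not the actual production settings:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Tuning knobs: adaptive execution, right-sized shuffle parallelism,
# and explicit executor resources (values are illustrative).
spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.shuffle.partitions", "400")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

facts = spark.read.parquet("gs://example-curated-bucket/events/")
dims = spark.read.parquet("gs://example-curated-bucket/dim_accounts/")

# Broadcasting the small dimension table avoids shuffling the large side.
joined = facts.join(F.broadcast(dims), "account_id")
```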
Collaborated with the data science team to integrate machine learning models into data pipelines for fraud detection.
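A sketch of scoring a pre-trained Spark ML pipeline inside a batch job; the model path, input data, and the `prediction == 1.0` fraud convention are assumptions:

```python
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load a pre-trained Spark ML pipeline and score incoming transactions;
# rows predicted as class 1.0 are treated as suspected fraud here.
model = PipelineModel.load("gs://example-models/fraud_detector")

txns = spark.read.parquet("gs://example-curated-bucket/transactions/")
scored = model.transform(txns)

suspicious = scored.filter(scored.prediction == 1.0)
suspicious.write.mode("append").parquet("gs://example-curated-bucket/fraud_alerts/")
```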
Implemented robust data validation checks, cleansing procedures, and quality control measures to ensure data accuracy and integrity.
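A minimal example of the kind of validation gate this describes, assuming hypothetical column names and thresholds:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
txns = spark.read.parquet("gs://example-curated-bucket/transactions/")

# Quality gates: required fields present, amounts in a sane range, no duplicates.
validated = (
    txns.filter(F.col("transaction_id").isNotNull())
        .filter(F.col("amount").between(0, 1_000_000))
        .dropDuplicates(["transaction_id"])
)

# Quarantine anything the checks rejected for later inspection.
rejected = txns.subtract(validated)
rejected.write.mode("append").parquet("gs://example-curated-bucket/quarantine/")
```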
Developed efficient SQL queries and integrated table data into UI dashboards for real-time visualization and interactive exploration.
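A small Spark SQL sketch of a dashboard-feeding aggregation; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.read.parquet("gs://example-curated-bucket/events/").createOrReplaceTempView("events")

# Pre-aggregated hourly counts feeding a dashboard tile.
hourly = spark.sql("""
    SELECT date_trunc('HOUR', event_ts) AS event_hour,
           event_type,
           COUNT(*)                     AS event_count
    FROM events
    GROUP BY 1, 2
    ORDER BY 1 DESC
""")
hourly.write.mode("overwrite").parquet("gs://example-curated-bucket/dashboard/hourly_events/")
```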
Utilized Airflow DAGs to efficiently manage and orchestrate complex data pipelines.
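A skeletal Airflow DAG showing the orchestration pattern; the task commands are placeholders rather than the real job invocations:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal DAG wiring ingest -> validate -> load as sequential tasks.
with DAG(
    dag_id="hourly_events_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo ingest")
    validate = BashOperator(task_id="validate", bash_command="echo validate")
    load = BashOperator(task_id="load", bash_command="echo load")

    ingest >> validate >> load
```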
Migrated applications from the on-premises Hadoop ecosystem to GCP, saving $4,500 per month and improving performance by 2.5x.
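Because the GCS connector resolves `gs://` URIs directly, much of such a migration reduces to re-pointing storage paths while the job logic stays unchanged; a simplified illustration with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Before migration: reading from the on-premises HDFS cluster.
# df = spark.read.parquet("hdfs://namenode:8020/data/events/")

# After migration: the GCS connector resolves gs:// URIs directly,
# so only the storage URI moves.
df = spark.read.parquet("gs://example-curated-bucket/events/")
```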