ETL Processing on Google Cloud Using Dataflow and BigQuery

Hammad Tariq

Data Modelling Analyst
Data Engineer
Software Engineer
Google BigQuery
Google Cloud Dataflow
Google Cloud Platform

Scope

The scope of the project encompassed a comprehensive exploration and implementation of Google Cloud Platform's core services, tailored to specific data processing and analytics tasks. The work focused on Dataflow for orchestrating data pipelines, Cloud Storage for efficient data storage and retrieval, and BigQuery for executing data analytics. The project's milestones included setting up the Dataflow API, acquiring and configuring the starter code, ingesting data from Cloud Storage, executing data transformations, and generating analytics in BigQuery. Through structured tasks and workflows, the project emphasized end-to-end data processing, validation, and visualization, culminating in actionable insights derived from the processed datasets.

Key Outcomes of the Project

Dataflow API Setup and Management

Successfully enabled and managed the Dataflow API, ensuring seamless orchestration of data processing tasks.
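
As a sketch, the API can also be enabled programmatically rather than through the console. The example below uses the Service Usage API via the google-api-python-client library; the project ID is a hypothetical placeholder.

    # A minimal sketch of enabling the Dataflow API programmatically,
    # assuming google-api-python-client is installed and application
    # default credentials are configured.
    from googleapiclient import discovery

    PROJECT_ID = "my-project"  # hypothetical placeholder

    serviceusage = discovery.build("serviceusage", "v1")
    operation = serviceusage.services().enable(
        name=f"projects/{PROJECT_ID}/services/dataflow.googleapis.com"
    ).execute()  # returns a long-running operation describing the enablement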

Starter Code Acquisition and Configuration

Retrieved the Dataflow Python examples from Google Cloud's professional-services GitHub repository and configured the project environment for the subsequent tasks.
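
A minimal sketch of fetching the starter code, assuming the examples still live in the professional-services repository; the subdirectory noted in the comment reflects the layout at the time of the project and may differ in newer revisions.

    # Clone the repository holding the Dataflow Python examples and install
    # the Apache Beam SDK with the GCP extras.
    import subprocess

    subprocess.run(
        ["git", "clone",
         "https://github.com/GoogleCloudPlatform/professional-services.git"],
        check=True,
    )
    # The examples used in this project sit under examples/dataflow-python-examples.
    subprocess.run(["pip", "install", "apache-beam[gcp]"], check=True)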

Cloud Storage Operations

Created a regional Cloud Storage bucket tailored to the project's requirements, then stored the essential data files (usa_names.csv and head_usa_names.csv) in the designated bucket.
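
These operations can be scripted with the google-cloud-storage client library; the bucket name and region below are placeholders, not the project's actual values.

    # A minimal sketch: create a regional bucket and upload the data files,
    # assuming the CSVs are present in the working directory.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.create_bucket(
        f"{client.project}-dataflow-demo",  # hypothetical bucket name
        location="us-central1",             # regional bucket, as in the project setup
    )
    for filename in ["usa_names.csv", "head_usa_names.csv"]:
        bucket.blob(filename).upload_from_filename(filename)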

BigQuery Dataset Creation

Established a dedicated BigQuery dataset named lake to house the project's tables, showcasing effective data management practices.
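
For reference, a short google-cloud-bigquery snippet accomplishes the same step; the dataset location is an assumption.

    # A minimal sketch: create the `lake` dataset to hold the project's tables.
    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = bigquery.Dataset(f"{client.project}.lake")
    dataset.location = "US"  # assumed location; match the bucket's region as needed
    client.create_dataset(dataset, exists_ok=True)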

Data Ingestion and Transformation

Developed and executed Dataflow pipelines for data ingestion, transforming raw data from Cloud Storage and populating BigQuery tables (usa_names, usa_names_transformed, usa_names_enriched). Ensured data quality by filtering out headers, converting data formats, and enriching datasets as required.
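
The sketch below illustrates the shape of such an ingestion pipeline in Apache Beam. The column names, schema, and bucket path are assumptions for illustration, not the project's exact code; pass the usual --runner=DataflowRunner, --project, and --temp_location flags to execute it on Dataflow.

    # A minimal Beam sketch: read the CSV from Cloud Storage, skip the
    # header line, map each row to a dictionary, and load it into BigQuery.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    FIELDS = ["state", "gender", "year", "name", "number", "created_date"]
    SCHEMA = ",".join(f"{f}:STRING" for f in FIELDS)  # assumed columns

    def parse_row(line):
        return dict(zip(FIELDS, line.split(",")))

    options = PipelineOptions()  # add Dataflow flags here to run on GCP
    with beam.Pipeline(options=options) as p:
        (p
         | "Read CSV" >> beam.io.ReadFromText(
               "gs://my-bucket/usa_names.csv",  # placeholder bucket
               skip_header_lines=1)
         | "Parse rows" >> beam.Map(parse_row)
         | "Write to BigQuery" >> beam.io.WriteToBigQuery(
               "lake.usa_names",
               schema=SCHEMA,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))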

Advanced Data Processing

Implemented intricate Dataflow pipelines to perform tasks like data joins, facilitating the creation of the orders_denormalized_sideinput table in BigQuery.
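
A side-input join broadcasts the smaller relation to every worker as an in-memory dictionary, avoiding a full shuffle. The sketch below is a hypothetical illustration of the pattern; the file paths, field names, and join key are assumptions.

    # A minimal Beam sketch of a side-input join: key the smaller accounts
    # collection, materialize it as a dict, and use it to enrich each order.
    import json
    import apache_beam as beam

    def enrich_order(order, accounts):
        # Merge in the matching account record; fall back to an empty dict.
        return {**order, **accounts.get(order["acct_number"], {})}

    with beam.Pipeline() as p:
        accounts_kv = (p
            | "Read accounts" >> beam.io.ReadFromText("gs://my-bucket/accounts.json")
            | "Parse accounts" >> beam.Map(json.loads)
            | "Key by account" >> beam.Map(lambda a: (a["acct_number"], a)))
        (p
         | "Read orders" >> beam.io.ReadFromText("gs://my-bucket/orders.json")
         | "Parse orders" >> beam.Map(json.loads)
         | "Join" >> beam.Map(enrich_order, accounts=beam.pvalue.AsDict(accounts_kv))
         | "Print joined rows" >> beam.Map(print))
        # The joined PCollection would then be written to
        # lake.orders_denormalized_sideinput with WriteToBigQuery,
        # as in the ingestion sketch above.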

Validation and Monitoring

Monitored Dataflow jobs' progress through the Google Cloud Console, ensuring timely completion and data accuracy. Validated task completion by verifying the populated tables in BigQuery, confirming the successful execution of the data processing tasks.
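
A short script can complement the console checks by counting rows in each target table; the table list mirrors the names above.

    # A minimal sketch of validating the loads with the BigQuery client.
    from google.cloud import bigquery

    client = bigquery.Client()
    for table in ["usa_names", "usa_names_transformed",
                  "usa_names_enriched", "orders_denormalized_sideinput"]:
        query = f"SELECT COUNT(*) AS row_count FROM `lake.{table}`"
        row = next(iter(client.query(query).result()))
        print(f"{table}: {row.row_count} rows")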

Summary

The project was motivated by the increasing need to use cloud-based tools for data processing and analytics, aiming to extract actionable insights from large datasets more efficiently than traditional methods allow. By leveraging Google Cloud Platform (GCP) services, including Cloud Storage for scalable data storage, Dataflow for streamlined data pipelines, and BigQuery for high-speed analytics, the project sought to accelerate decision-making and strengthen data-driven strategies. The scope encompassed a comprehensive exploration and implementation of these GCP services, achieving milestones such as Dataflow API setup, Cloud Storage operations, BigQuery dataset creation, and advanced data processing tasks including data ingestion, transformation, and joins. The key outcomes highlighted the successful management of the Dataflow API, configuration of the starter code, creation of dedicated datasets, and validation through monitoring, underscoring the seamless integration and effectiveness of Cloud Storage, Dataflow, and BigQuery in end-to-end data processing and analytics.
