Leonardo Rickli
brazil-e-commerce-azure-databricks
In this project, we will be using ETL, analytics, and BI tools on the Microsoft Azure cloud together with Databricks and Power BI. We will use the Brazilian E-Commerce Public Dataset by Olist.
The Microsoft Azure platform was used to create the necessary infrastructure because I believe it offers the best integration with Databricks and Power BI. The following Azure tools, among others, were used:
Data Factory: Used for the ingestion of the raw files
Data Lake Gen 2: These will be the staging areas for the data. The first staging area will be the raw data that will be ingested by Databricks. The second staging area will be the transformed data by Databricks that will be ingested by Azure Synapse.
Databricks: Used for data cleaning and some EDA (Exploratory Data Analysis). This step is important because it produces the correct schema for further data analysis in Synapse.
Synapse: The analytics tool offered by Azure. We use the transformed data to build a relational database model and query it for analytics.
Power BI: Used for presenting dashboards built on the data in its relational format.
Using a fundamental Data Engineering method called ETL (Extract, Transform, and Load), we will extract the raw files, transform them with Databricks, and load them into Synapse.
Before we start deploying the resources, it's important to create a unique resource group for this project.
Extract
Using Data Factory, we establish a source (the raw files) and a sink (the first staging area in Data Lake Gen 2), then run a copy pipeline to ingest the data.
Transform
After the data has been inserted into the first staging area, we use Databricks to transform it through cleaning and EDA. Databricks handles this step through a notebook.

Most of the tables were already fairly clean; they only needed minor datatype changes for a better schema representation of the columns. The one exception was the "order_reviews" table: some rows contained double quotes, which produced error messages when Synapse tried to read the file, so I deleted all of those rows.
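The cleaning described above can be sketched as follows. This is a minimal illustration using pandas rather than the actual Databricks notebook, and the sample data and column names (`review_score`, `review_comment_message`) are assumptions modeled on the Olist schema:

```python
import pandas as pd

# Hypothetical sample of the order_reviews table; the real data is read
# from the first staging area in Data Lake Gen 2.
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "review_score": ["5", "1", "4"],
    "review_comment_message": ["Great product", 'Broke after a "week"', None],
})

# Minor datatype change for a better schema representation of the column.
reviews["review_score"] = reviews["review_score"].astype(int)

# Drop the problematic rows containing double quotes, which broke
# Synapse's reads of the file.
has_quotes = reviews["review_comment_message"].str.contains('"', na=False)
reviews = reviews[~has_quotes].reset_index(drop=True)
```

In the real notebook the same idea applies per table: cast columns to their proper types, then filter out the rows that Synapse could not parse.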
Once the data is transformed, it is then sent to another staging area where it will be ingested by Synapse.
Load
With the data loaded into Synapse, we make it relational by creating an ERD, so we can query some answers from it.
We can now query some answers using SQL.
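As an illustration of the kind of query this enables, here is a hedged sketch using SQLite in place of Synapse, with heavily simplified stand-ins for the Olist `orders` and `order_items` tables (the schemas and sample values are assumptions, not the actual dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified versions of two related tables from the ERD.
cur.executescript("""
CREATE TABLE orders (
    order_id    TEXT PRIMARY KEY,
    customer_id TEXT
);
CREATE TABLE order_items (
    order_id TEXT REFERENCES orders(order_id),
    price    REAL
);
INSERT INTO orders VALUES ('o1', 'c1'), ('o2', 'c2');
INSERT INTO order_items VALUES ('o1', 100.0), ('o1', 50.0), ('o2', 20.0);
""")

# Example analytics question: total revenue per order.
cur.execute("""
SELECT o.order_id, SUM(i.price) AS revenue
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.order_id
ORDER BY revenue DESC;
""")
rows = cur.fetchall()
```

The same join-and-aggregate pattern works against the Synapse serverless SQL pool, just with the real table names from the relational model.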
Dashboards
Before we start making our dashboards, we have to establish a connection between Power BI and Synapse. To do that, download Power BI Desktop and grab the "Serverless SQL endpoint" from the resource group you are using.
With our credentials entered, we can start making some dashboards.