Title: SQL Data Analysis Project by Mide BadusiTitle: SQL Data Analysis Project by Mide Badusi

Title: SQL Data Analysis Project

Mide Badusi

Mide Badusi

Title: SQL Data Analysis Project

·
10 min read
·
Jan 10, 2024
Hey everyone! 🎉 Just wrapped up a SQL data analysis project, and I’m excited to share the incredible journey and the goals I’ve achieved along the way.
Successfully completing the SQL data analysis project was a major milestone. It required dedication, problem-solving skills, and countless hours of digging into the datasets. One of my primary goals was to enhance my SQL skills, and I’m proud to say that I can now craft intricate queries with confidence. My SQL skills have leveled up significantly, opening doors to more challenging project. Also, Cleaning messy datasets, Dealing with missing values, outliers, and ensuring data integrity are skills I’ve honed during this project.
METHODOLOGY
Data Preparation and Cleaning
Data Importation: The data was firstly converted into CSV (comma separated value). Then it was imported into PostgreSQL
CREATE TABLE IF NOT EXISTS public.shg_booking(    booking_id integer,    hotel character varying(50) ,    booking_date date,    arrival_date date ,    lead_time integer,    nights integer,    guests smallint,    distribution_channel text ,    customer_type character varying(50) ,    country character varying(50),    deposit_type character varying(50),    avg_daily_rate text ,    status text ,    status_update date ,    cancelled smallint,    revenue text ,    revenue_loss text)
SELECT *FROM shg_booking
Data Cleaning steps
I removed unnecessary spaces and comma and also changed the data type of Avg_daily_rate , revenue and revenue_loss columns from text to numeric
-- DATA CLEANING STEPS FOR AVG_DAILY_RATE COLUMNUPDATE shg_bookingSET avg_daily_rate = REPLACE(avg_daily_rate, '$', '');UPDATE shg_bookingSET avg_daily_rate = REPLACE(avg_daily_rate, '(', '-');UPDATE shg_bookingSET avg_daily_rate = REPLACE(avg_daily_rate, ')', '');UPDATE shg_bookingSET avg_daily_rate = REPLACE(avg_daily_rate, ',', '');ALTER TABLE shg_bookingALTER COLUMN avg_daily_rateTYPE numericUSING avg_daily_rate::numeric;
-- DATA CLEANING STEPS FOR REVENUE COLUMNUPDATE shg_bookingSET revenue = REPLACE(revenue, '$', '');UPDATE shg_bookingSET revenue = REPLACE(revenue, ',', '');UPDATE shg_bookingSET revenue = REPLACE(revenue, '(', '-');UPDATE shg_bookingSET revenue = REPLACE(revenue, ')', '');ALTER TABLE shg_bookingALTER COLUMN revenueTYPE numericUSING revenue :: numeric;
-- DATA CLEANING STEPS FOR REVENUE LOSS COLUMNUPDATE shg_bookingSET revenue_loss = REPLACE(revenue_loss, '$', '');UPDATE shg_bookingSET revenue_loss = REPLACE(revenue_loss, ',', '');UPDATE shg_bookingSET revenue_loss = REPLACE(revenue_loss, '(', '-');UPDATE shg_bookingSET revenue_loss = REPLACE(revenue_loss, ')', '');ALTER TABLE shg_bookingALTER COLUMN revenue_lossTYPE numericUSING revenue_loss :: numeric;
The columns in the database include:
.Booking ID: Unique identifier for each booking.
•Hotel: Type or name of the hotel within the Splendor Hotel Group.
•Booking Date: Date when the booking was made.
•Arrival Date: Date when the guests are scheduled to arrive.
•Lead Time: Number of days between the booking date and arrival date.
•Nights: Number of nights the guests are booked to stay.
•Guests: Number of guests included in the booking.
•Distribution Channel: The channel through which the booking was made (e.g., Direct, Online Travel Agent, Offline Travel Agent).
•Customer Type: Type of customer making the booking (e.g., Transient, Corporate).
•Country: Country of origin of the guests.
•Deposit Type: Whether a deposit was made for the booking (e.g., No Deposit, Deposit).
•Avg Daily Rate: Average daily rate for the booking.
•Status: Status of the booking (e.g., Check-Out, Canceled).
•Status Update: Date of the last status update for the booking.
•Canceled (0/1): Binary indicator of whether the booking was canceled (1 if canceled, 0 if not canceled).
•Revenue: Revenue generated from the booking.
•Revenue Loss: Loss in revenue if the booking was canceled (negative value if the booking wasn’t canceled).
Table ERD
Case Study Questions
Booking Patterns:
•What is the trend in booking patterns over time, and are there specific seasons or months with increased booking activity?
Finding trends in booking patterns over time involves analyzing data to identify recurring patterns, fluctuations, or changes in booking activity. This includes understanding the peak seasons or months when booking activity is consistently higher. To do this I used the extract and count function. I extracted the month from the booking date as well as using the count function to get the total number of bookings for each month
SELECT COUNT (*),EXTRACT (month FROM booking_date) AS monthFROM shg_bookingGROUP BY monthORDER BY month;
Customer Behavior Analysis:
. Which distribution channels contribute the most to bookings, and how does the average daily rate (ADR) differ across these channels?
SELECT distribution_channel,        COUNT (booking_id) AS total_bookings,    COUNT (*) * 100.0 /SUM (COUNT(*)) OVER () AS contribution_percentageFROM shg_bookingWHERE country is not NULLGROUP BY distribution_channelORDER BY total_bookings DESC;
SELECT distribution_channel,        SUM(avg_daily_rate)FROM shg_bookingGROUP BY distribution_channel;
Cancellation Analysis:
How does the revenue loss from cancellations compare across different customer segments and distribution channels?
SELECT distribution_channel,       ABS (SUM(revenue_loss)) AS revenuelossFROM shg_bookingWHERE status = 'Canceled'GROUP BY distribution_channelORDER BY revenueloss ASC;
Revenue Optimization:
•What is the overall revenue trend, and are there specific customer segments or countries contributing significantly to revenue?
SELECT country,       SUM (revenue) AS totalcountryrevenueFROM shg_bookingGROUP BY countryORDER BY totalcountryrevenue DESCLIMIT 10;
SELECT EXTRACT (month FROM booking_date) AS months,SUM (revenue) AS totalrevenuepermonthFROM shg_bookingGROUP BY monthsORDER BY months;
-- Revenue Trend by Customer typeSELECT customer_type,         SUM(revenue) AS revenueFROM shg_bookingWHERE country IS NOT NULLGROUP BY customer_typeORDER BY revenue DESC;
Geographical Analysis:
•How does the distribution of guests vary across different countries, and are there specific countries that should be targeted for marketing efforts?
Identifying patterns in the distribution of guests based on their country of origin involves analyzing data to understand where guests predominantly come from
SELECT country,       SUM (guests) AS totalguestsFROM shg_bookingWHERE country IS NOT NULLGROUP BY countryORDER BY totalguests DESC;
Operational Efficiency:
What is the average length of stay for guests, and how does it differ based on booking channels or customer types?
--1 Total Avg length of stay SELECT AVG (nights) AS total_avglength_of_stay,       TRUNC( AVG (nights), 2)FROM shg_bookingWHERE country IS NOT NULL
-- Across distribution channelSELECT shg.distribution_channel,    AVG (nights) AS averagelengthday,    TRUNC (AVG (nights), 2)FROM shg_booking AS shgWHERE country IS NOT NULLGROUP BY shg.distribution_channel--Across customer typesSELECT shg.customer_type,    AVG (nights) AS averagelengthday,    TRUNC (AVG (nights), 2)FROM shg_booking AS shgWHERE country IS NOT NULLGROUP BY shg.customer_type
Are there patterns in check-out dates that can inform staffing and resource allocation strategies?
/* 2.  Are there patterns in check-out dates that can inform staffing and resource allocationstrategies? */SELECT status_update,        COUNT(booking_id) AS number_of_bookings,     SUM(guests) AS number_of_guestsFROM shg_bookingWHERE country IS NOT NULL AND status = 'Check-Out'GROUP BY status_updateORDER BY status_update;-- BY day of weekSELECT CASE WHEN EXTRACT(dow from status_update) = 0 THEN 'Sunday'   WHEN EXTRACT(dow from status_update) = 1 THEN 'Monday'   WHEN EXTRACT(dow from status_update) = 2 THEN 'Tuesday'   WHEN EXTRACT(dow from status_update) = 3 THEN 'Wednesday'   WHEN EXTRACT(dow from status_update)  = 4 THEN 'Thursday'   WHEN EXTRACT(dow from status_update)  = 5 THEN 'Friday'   WHEN EXTRACT(dow from status_update)  = 6 THEN 'Saturday'   END AS day_of_week,  COUNT(booking_id) AS number_of_bookings,  SUM(guests) AS number_of_guestsFROM shg_bookingWHERE country IS NOT NULL AND status = 'Check-Out'GROUP BY day_of_weekORDER BY COUNT(booking_id) DESC;  --> There seems to be more activities on Sundays
Impact of Deposit Types:
•How does the presence or absence of a deposit impact the likelihood of cancellations and revenue generation?
-- How does the presence/absence of a deposit impact the likelihood of cancellations and revenue generationSELECT deposit_type,       COUNT (*)FROM shg_bookingWHERE status = 'Canceled' AND country IS NOT NULLGROUP BY deposit_type ORDER BY COUNT (*) DESC;  --The absence of deposit leads to more cancelationSELECT deposit_type,    SUM (revenue) AS total_revenueFROM shg_bookingWHERE country IS NOT NULLGROUP BY deposit_typeORDER BY total_revenue; --The absence of Deposit also yielded more revenue compared to other deposit type
Analysis of Corporate Bookings:
•What is the proportion of corporate bookings, and how does their Average Daily Rate (ADR) compare to other customer types?
--Analysis of Corporate Bookings/* proportion of corporate bookings and how it affects ADR compare to other customer types */SELECT       SUM(CASE WHEN distribution_channel = 'Corporate' THEN 1 END) AS corporate_booking,       COUNT(booking_id) AS total_bookings,    (SUM(CASE WHEN distribution_channel = 'Corporate' THEN 1 END) * 100.0) / COUNT (booking_id) AS corporate_booking_proportionFROM shg_bookingWHERE country IS NOT NULL-- ii.Average daily rate with Customer TypeSELECT customer_type,       SUM(avg_daily_rate) as avg_daily_rateFROM shg_bookingWHERE distribution_channel = 'Corporate' AND country IS NOT NULLGROUP BY customer_typeORDER BY avg_daily_rate DESC;
/* 1. Are there specific trends or patterns related to corporate bookings that can informbusiness strategies? */-- Based on BookingSELECT distribution_channel,        COUNT(booking_id)FROM shg_bookingWHERE country IS NOT NULLGROUP BY distribution_channel;-- Based on RevenueSELECT distribution_channel, SUM(revenue)FROM shg_bookingWHERE country IS NOT NULLGROUP BY distribution_channel;-- Based on CancellationSELECT distribution_channel,        COUNT(*)FROM shg_bookingWHERE country IS NOT NULL AND status = 'Canceled'GROUP BY distribution_channelORDER BY COUNT(*);-- Based on lead timeSELECT distribution_channel,        AVG(lead_time) AS avg_leadtimeFROM shg_bookingWHERE country IS NOT NULLGROUP BY distribution_channelORDER BY avg_leadtime;
Time-to-Event Analysis:
•How does the time between booking and arrival date (lead time) affect revenue and the likelihood of cancellations?
Are there specific lead time ranges that are associated with higher customer satisfaction or revenue?
I organized the lead time into a four-month interval using the CASE function to determine which lead time ranges affects/ is associated with higher revenue and likelihood of cancelation.
SELECT        CASE        WHEN lead_time >= 0 AND lead_time <= 120  THEN '4months'      WHEN lead_time > 120 AND lead_time <= 240 THEN '8months'     WHEN lead_time > 240 AND lead_time <= 360 THEN '1yr'     WHEN lead_time > 360 AND lead_time <= 480 THEN '1yr 4 months'     WHEN lead_time > 480 AND lead_time <= 600 THEN '1yr 8 months'     WHEN lead_time > 600 THEN 'over 1yr 8 months'     END AS Lead_time_range,     SUM (revenue) AS Total_revenueFROM shg_bookingWHERE country IS NOT NULLGROUP BY lead_time_rangeORDER BY Total_revenue DESC;  --Leadtime between four months had the highest revenue
-- 2 Based on cancellationSELECT       CASE            WHEN lead_time >= 0 AND lead_time <= 120  THEN '4months'      WHEN lead_time > 120 AND lead_time <= 240 THEN '8months'     WHEN lead_time > 240 AND lead_time <= 360 THEN '1yr'     WHEN lead_time > 360 AND lead_time <= 480 THEN '1yr 4 months'     WHEN lead_time > 480 AND lead_time <= 600 THEN '1yr 8 months'     WHEN lead_time > 600 THEN 'over 1yr 8 months'     END AS Lead_time_range,     COUNT (*) AS canceled_bookingsFROM shg_bookingWHERE country IS NOT NULL AND status = 'Canceled'GROUP BY lead_time_rangeORDER BY canceled_bookings;
Comparison of Online and Offline Travel Agents:
•What is the revenue contribution of online travel agents compared to offline travel agents?
-- 1.  What is the revenue contribution of online travel agents compared to offline travel agents?SELECT distribution_channel,       SUM (revenue) AS total_revenueFROM shg_bookingWHERE country IS NOT NULL AND status = 'Canceled' AND (distribution_channel = 'Online Travel Agent' OR distribution_channel = 'Offline Travel Agent')GROUP BY distribution_channel;
•How do cancellation rates and revenue vary between bookings made through online and offline travel agents?
-- Revenue and cancellation rate across both channelSELECT distribution_channel,       SUM (revenue) AS total_revenue,    SUM(revenue) * 100.0 / SUM(SUM(revenue)) OVER () AS revenue_rateFROM shg_bookingWHERE country IS NOT NULL AND status = 'Canceled' AND (distribution_channel = 'Online Travel Agent' OR distribution_channel = 'Offline Travel Agent')GROUP BY distribution_channelORDER BY total_revenue;SELECT distribution_channel,       COUNT(booking_id) AS total_booking,   SUM(CASE WHEN status = 'Canceled' THEN 1 END) AS canceled_bookings,    (SUM(CASE WHEN status = 'Canceled' THEN 1 END) * 100.0) / COUNT(booking_id) AS cancelation_rateFROM shg_bookingWHERE country IS NOT NULL AND (distribution_channel = 'Online Travel Agent' OR distribution_channel = 'Offline Travel Agent')GROUP BY distribution_channel;
After using SQL to analyze, I imported my query into Power BI to visualize for better insight.
Key performance indicators: Total Revenue = $29.60M
January had the highest revenue generation and revenue loss While September had the lowest revenue generation, November had the lowest revenue loss.
The highest revenue was generated when the lead time ranged between 4 months (1–120 days). It can also be deduced that the higher the lead time range the lower the revenue.
Revenue peaked in 2016 with over 13M, while 2017 had a slight decrease generating 7M+ which is 6M less than 2016’s revenue.
Percentage of revenue loss: 30.7%
Total bookings: 119K
Percent of successful bookings: 74%
January had the most booking while June had the lowest booking.
The highest number of cancelations occurred for reservation without a deposit (No deposit)
Online Travel Agent was the distribution channel that contributed the most to bookings and cancellation.
Average length of stay: 3.43
Total number of Guests: 235K
Most guests were from Portugal
Embarking on this SQL data analysis journey has been incredibly rewarding This project has been a journey of growth, learning, and achievement. I’m excited to share my experience and showcase the results . Looking forward to applying these skills to new challenges and setting even more ambitious goals in the future! 💪🚀 #DataAnalysis #SuccessStory #Achievements #SQL #CareerGoals
Like this project

Posted Feb 27, 2025

Hey everyone! 🎉 Just wrapped up a SQL data analysis project, and I’m excited to share the incredible journey and the goals I’ve achieved along the way. Succes…