📊 [Data Modeling] Prediction Model for BigMart Product Sales

Pei-Han Hsu

Data Modelling Analyst

Data Visualizer

Data Analyst

Python

Project Overview

About the project

When managing a company, producing products is not enough; we also need to understand how each type of product sells across different regions, channels, and merchants, which makes dataset analysis important. This project therefore analyzes sales by tier, product type, and outlet, and builds models to predict sales.

The Key Question

Using several factors (item type, outlet, location tier, etc.), predict product sales.

Show total sales in each outlet

Predict the sales of different item types in each outlet next time

Show total sales in each tier

Predict the sales of different item types in each tier (location) next time

Analyze sales in different types of stores

Help the store owner find the problems

Approach

Data scraping

Model creation - KNN regression and Bayesian logistic regression

Data visualization, including heat maps, pie charts, and bar charts.

Data collecting

First, I used the data set sourced by The Devastator.

Features collected include:

Amount of item type, item identifier, outlet identifier, and outlet location type

Sales of the above types

Item weight, fat content, and visibility

In total, the collected data has 12 columns and 14204 rows.

Data cleaning

After collecting the data, I found that 2439 rows contain missing values, so I cleaned the data using the following steps:

Column elimination: 132 inessential columns omitted.

Handle null values: delete rows that contain nulls.

After cleaning, the final dataset contains 11765 rows and 12 features.
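The row-deletion step can be sketched on a toy frame as follows. This is a minimal illustration and not the project's actual cleaning code: the values are made up, and `Item_Weight` is an assumed column name (the write-up only says the item's weight was collected).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the BigMart data; all values are made up.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.9],        # assumed column name
    "Item_Visibility": [0.016, 0.019, np.nan, 0.0],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 994.7],
})

# Count rows that have at least one missing value before cleaning.
rows_with_nulls = int(toy.isna().any(axis=1).sum())

# Drop every row that contains a null, then renumber the index.
cleaned = toy.dropna().reset_index(drop=True)

print(rows_with_nulls, len(toy), len(cleaned))
```

The same bookkeeping applies to the figures above: 14204 rows minus 2439 rows with missing values leaves 11765 rows.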

Part 1

1. Show total sales in each outlet
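A total per outlet is a group-and-sum over the sales column. The sketch below assumes the `Outlet_Identifier` and `Item_Outlet_Sales` columns used later in this write-up; the outlet labels and numbers are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names come from the write-up.
toy = pd.DataFrame({
    "Outlet_Identifier": ["OUT_A", "OUT_A", "OUT_B", "OUT_B", "OUT_B"],
    "Item_Outlet_Sales": [100.0, 250.0, 80.0, 20.0, 50.0],
})

# Sum sales within each outlet and list the largest outlet first.
total_by_outlet = (
    toy.groupby("Outlet_Identifier")["Item_Outlet_Sales"]
    .sum()
    .sort_values(ascending=False)
)

print(total_by_outlet)
```

Swapping the grouping column for `Outlet_Location_Type` would give the per-tier totals in the same way.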

2. Predict the sales of different item types in each outlet next time

2.1 Use pie charts to show the proportion of sales for each item type in each outlet.
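The proportions behind such a pie chart could be computed roughly like this. It is a sketch on made-up data, not the original notebook code; only the column names are taken from the write-up.

```python
import pandas as pd

# Made-up rows for two hypothetical outlets.
toy = pd.DataFrame({
    "Outlet_Identifier": ["OUT_A", "OUT_A", "OUT_A", "OUT_B", "OUT_B"],
    "Item_Type": ["Snack Foods", "Soft Drinks", "Snack Foods", "Dairy", "Soft Drinks"],
    "Item_Outlet_Sales": [300.0, 100.0, 100.0, 60.0, 40.0],
})

# Total sales for every (outlet, item type) pair.
sales = toy.groupby(["Outlet_Identifier", "Item_Type"])["Item_Outlet_Sales"].sum()

# Divide each pair's sales by the total of its outlet to get a share.
share = sales / sales.groupby(level="Outlet_Identifier").transform("sum")

print(share)
```

With matplotlib installed, `share.loc["OUT_A"].plot.pie()` would draw one outlet's chart from these shares.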

2.3 Predict the sales of different item types in each outlet next time.

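import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# One-hot encode the outlet and item type; the target is the sales column
X_1 = pd.get_dummies(df_sort_IdItemType[['Outlet_Identifier', 'Item_Type']], columns=['Outlet_Identifier', 'Item_Type'])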
Y_1 = df_sort_IdItemType['Item_Outlet_Sales']

# Split data to training and test set
X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(X_1, Y_1, test_size=0.4, random_state=42)

# Remove features which don't match the training set or the test set
extraFeatureTrain = set(X_train_1.columns) - set(X_test_1.columns)
extraFeatureTest = set(X_test_1.columns) - set(X_train_1.columns)
X_train_1 = X_train_1.drop(columns=extraFeatureTrain)
X_test_1 = X_test_1.drop(columns=extraFeatureTest)

# Build KNN regressor model
knnRModel = KNeighborsRegressor(n_neighbors=5)

knnRModel.fit(X_train_1, Y_train_1)

Y_pred_1 = knnRModel.predict(X_test_1)
Y_pred_1

p_data1 = pd.DataFrame(index=pd.MultiIndex.from_product([l_OutId, l_ItemType], names=['Outlet_Identifier', 'Item_Type']))

p_data_encoded_1 = pd.get_dummies(p_data1.reset_index(), columns=[ 'Item_Type','Outlet_Identifier'])

# Align the prediction frame with the training columns (same names, same order)
p_data_encoded_1 = p_data_encoded_1.reindex(columns=X_train_1.columns, fill_value=0)

# Predict with the KNN model fitted above
p_sales = knnRModel.predict(p_data_encoded_1)

p_data1['Predicted_Sales'] = p_sales
#p_data
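The code above produces test-set predictions but does not score them. One way to check the fit is to compare predictions with the held-out sales, sketched here on synthetic data. The metrics (`mean_absolute_error`, `r2_score`) are a suggestion and not part of the original analysis, and the outlet labels and sales values are generated for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the sorted sales frame.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Outlet_Identifier": rng.choice(["OUT_A", "OUT_B"], size=n),
    "Item_Type": rng.choice(["Dairy", "Snack Foods", "Soft Drinks"], size=n),
})
base = toy["Outlet_Identifier"].map({"OUT_A": 1000.0, "OUT_B": 2000.0})
toy["Item_Outlet_Sales"] = base + rng.normal(0, 50, size=n)

# Same recipe as above: one-hot encode, split 60/40, fit KNN with k=5.
X = pd.get_dummies(toy[["Outlet_Identifier", "Item_Type"]], dtype=float)
y = toy["Item_Outlet_Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)
model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score the predictions against the held-out sales.
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.1f}, R^2={r2:.3f}")
```

Because the features are purely categorical, many rows are identical after encoding, so KNN effectively averages sales over rows with the same outlet and item type.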

2.4 Use bar charts to show the predicted sales of different item types in each outlet next time.

3. Total sales in each tier

The total sales of Tier 3 are clearly higher than those of Tier 1 and Tier 2. If BigMart would like to invest in further business, our suggestion is to set up more stores in Tier 3.

4. Predict the sales of different item types in each tier (location) next time

X_4 = pd.get_dummies(df_sort_LocItemType[['Outlet_Location_Type', 'Item_Type']], columns=['Outlet_Location_Type', 'Item_Type'])
Y_4 = df_sort_LocItemType['Item_Outlet_Sales']

# Split training set and test set
X_train_4, X_test_4, Y_train_4, Y_test_4 = train_test_split(X_4, Y_4, test_size=0.3, random_state=42)

# Remove features that do not match between the training set and the test set
extra_FeatureTrain_4 = set(X_train_4.columns) - set(X_test_4.columns)
extra_FeatureTest_4 = set(X_test_4.columns) - set(X_train_4.columns)
X_train_4 = X_train_4.drop(columns=extra_FeatureTrain_4)
X_test_4 = X_test_4.drop(columns=extra_FeatureTest_4)

# Build KNN regressor model
knnModel = KNeighborsRegressor(n_neighbors=5)

knnModel.fit(X_train_4, Y_train_4)

Y_pred_4 = knnModel.predict(X_test_4)
Y_pred_4

p_data_Location_4 = pd.DataFrame(index=pd.MultiIndex.from_product([l_LocId, l_ItemType], names=['Outlet_Location_Type', 'Item_Type']))

p_data_encoded_4 = pd.get_dummies(p_data_Location_4.reset_index(), columns=[ 'Item_Type','Outlet_Location_Type'])

common_features_4 = list(set(X_train_4.columns) & set(p_data_encoded_4.columns))

# Ensure that only the columns used during the training phase are used for prediction
p_data_encoded_4 = p_data_encoded_4.reindex(columns=X_train_4.columns, fill_value=0)

# Make predictions
p_sales_Location = knnModel.predict(p_data_encoded_4)

# Add predictions to the DataFrame
p_data_Location_4['Predicted_Sales'] = p_sales_Location

p_data_Location_4
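The `reindex(columns=..., fill_value=0)` step matters because a frame encoded separately can be missing dummy columns that the model saw during training. A small sketch of that alignment, with illustrative category names:

```python
import pandas as pd

# Columns the model was trained on (three item types).
train_cols = pd.get_dummies(
    pd.DataFrame({"Item_Type": ["Dairy", "Snack Foods", "Soft Drinks"]}),
    columns=["Item_Type"],
    dtype=int,
).columns

# New rows that happen to contain only two of the three item types.
new_rows = pd.DataFrame({"Item_Type": ["Soft Drinks", "Dairy"]})
encoded = pd.get_dummies(new_rows, columns=["Item_Type"], dtype=int)

# Reindex to the training layout: missing dummies are added as 0,
# and the column order matches what the model expects.
aligned = encoded.reindex(columns=train_cols, fill_value=0)

print(aligned)
```

Selecting columns through a set intersection, by contrast, does not guarantee the training column order, which is why reindexing against the training columns is the safer choice.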

Part 2 Analyzing sales in different types of stores

extra_FeaturesTrain = set(X_train.columns) - set(X_test.columns)
extra_FeaturesTest = set(X_test.columns) - set(X_train.columns)
X_train = X_train.drop(columns=extra_FeaturesTrain)
X_test = X_test.drop(columns=extra_FeaturesTest)

# Build KNN regressor model
knn_model_3 = KNeighborsRegressor(n_neighbors=5)

Part 3 Is it true that wealthier people eat healthier?

Snack foods outperform soft drinks in Tier 2. In contrast to soft drinks, snack foods also sell more in Tier 1 than in Tier 3. Perhaps soft drinks sell more in Tier 1 because stores there carry more regular-fat products, while Tier 3 stores carry more low-fat products, based on what we learned earlier.

with pm.Model() as Store_Compare_regression_model:
    Item_MRP = pm.Data("Item_MRP", Store_Compare_models["Item_MRP"])
    Item_Outlet_Sales = pm.Data("Item_Outlet_Sales", Store_Compare_models["Item_Outlet_Sales"])
    Item_Visibility = pm.Data("Item_Visibility", Store_Compare_models["Item_Visibility"])
    Low_Fat = pm.Data("Low_Fat", Store_Compare_models["Low Fat"])
    Regular = pm.Data("Regular", Store_Compare_models["Regular"])
    Tier_1 = pm.Data("Tier_1", Store_Compare_models["Tier 1"])
    Tier_2 = pm.Data("Tier_2", Store_Compare_models["Tier 2"])
    Tier_3 = pm.Data("Tier_3", Store_Compare_models["Tier 3"])
    
    # priors
    beta_i = pm.Normal("beta_i", mu=0, sigma=1)
    beta_Item_MRP = pm.Normal("beta_Item_MRP", mu=0, sigma=1)
    beta_Item_Outlet_Sales = pm.Normal("beta_Item_Outlet_Sales", mu=0, sigma=1)
    beta_Item_Visibility = pm.Normal("beta_Item_Visibility", mu=0, sigma=1)
    beta_Low_Fat = pm.Normal("beta_Low_Fat", mu=0, sigma=1)
    beta_Regular = pm.Normal("beta_Regular", mu=0, sigma=1)
    beta_Tier_1 = pm.Normal("beta_Tier_1", mu=0, sigma=1)
    beta_Tier_2 = pm.Normal("beta_Tier_2", mu=0, sigma=1)
    beta_Tier_3 = pm.Normal("beta_Tier_3", mu=0, sigma=1)
    
    # linear model
    mu = (
        beta_i
        + beta_Item_MRP * Item_MRP
        + beta_Item_Outlet_Sales * Item_Outlet_Sales
        + beta_Item_Visibility * Item_Visibility
        + beta_Low_Fat * Low_Fat
        + beta_Regular * Regular
        + beta_Tier_1 * Tier_1
        + beta_Tier_2 * Tier_2
        + beta_Tier_3 * Tier_3
    )
    p = pm.Deterministic("p", pm.math.invlogit(mu))
    
    # likelihood
    pm.Bernoulli("WL", p=p, observed=Store_Compare_models["Item_Outlet_Sales"])
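To make the last two steps of the model concrete, the NumPy sketch below shows what the inverse-logit link and the Bernoulli likelihood compute for fixed coefficients. It is an illustration only: the predictor values, the 0/1 outcomes, and the coefficients are hypothetical, and nothing is sampled here. Note that a Bernoulli likelihood expects a binary (0/1) outcome.

```python
import numpy as np

def invlogit(z):
    """Map a linear predictor onto a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_loglik(y, p):
    """Log-likelihood of binary outcomes y under success probabilities p."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical standardized predictor and binary outcomes.
x = np.array([-1.0, 0.0, 2.0])
y = np.array([0, 1, 1])

# Hypothetical coefficient values; in the model they get Normal(0, 1) priors.
beta_i, beta_x = 0.0, 1.0

mu = beta_i + beta_x * x   # linear model
p = invlogit(mu)           # same role as pm.math.invlogit(mu)
ll = bernoulli_loglik(y, p)

print(p, ll)
```

Sampling then explores coefficient values that make this log-likelihood, combined with the priors, large.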