Clusterization

Caio Sanches

Data Scientist

Data Analyst

Matplotlib

pandas

SQL

Libraries

import pandas as pd
from sqlalchemy import create_engine
import psycopg2
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.cluster import KMeans
import seaborn as sns
import numpy as np

Transferring from .csv to Postgres to perform SQL queries.

# create engine
# connection string: dialect+driver://user:password@server/database
# engine = create_engine('postgresql+psycopg2://postgres:123@localhost/dp6')
# engine
# hits=pd.read_csv('hits.csv')
# hits
# hits.head()
# hits.to_sql(name='hits_prod_session', con=engine,index=False)
# hits.info()
# prod=pd.read_csv('products.csv')
# prod
# prod.to_sql(name='prod', con=engine,index=False)
# sessions=pd.read_csv('sessions.csv')
# sessions
# sessions.to_sql(name='session', con=engine,index=False)
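The load-then-query round trip above can be tried without a Postgres server. The sketch below is not the author's setup: it swaps in an in-memory SQLite database through the standard-library sqlite3 module, and the toy columns session_id and product_id are hypothetical, since the real schema of hits.csv is not shown.

```python
import sqlite3

import pandas as pd

# Toy stand-in for hits.csv; the column names are hypothetical.
hits = pd.DataFrame({"session_id": [1, 1, 2], "product_id": [10, 11, 10]})

# In-memory SQLite replaces the Postgres engine for illustration only.
conn = sqlite3.connect(":memory:")
hits.to_sql(name="hits_prod_session", con=conn, index=False)

# Once the table exists, plain SQL can be run and read back into pandas.
per_session = pd.read_sql_query(
    "SELECT session_id, COUNT(*) AS n_hits "
    "FROM hits_prod_session GROUP BY session_id ORDER BY session_id",
    conn,
)
conn.close()
print(per_session)
```

With a real Postgres instance, the only change should be passing the SQLAlchemy engine as con instead of the SQLite connection.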
To cluster, we will use RFM analysis. According to IBM, RFM (Recency, Frequency, Monetary) is a method used to analyze and segment customers based on three key metrics:
Recency (R): How recently a customer has made a purchase.
Frequency (F): How often a customer makes a purchase.
Monetary (M): How much money a customer spends.
This analysis helps businesses identify different customer segments, such as loyal, at-risk, or new customers, and tailor marketing strategies accordingly.
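As a rough illustration of the three metrics, the sketch below derives recency, frequency, and monetary values from a toy purchase log and turns each into a 1 to 4 quartile score. It is one common way to build RFM scores, not necessarily the method used in this project: the orders table, its column names, and the quartile_score helper are all hypothetical.

```python
import pandas as pd

# Hypothetical purchase log: one row per order (names are illustrative).
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c", "d"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2023-11-20",
        "2024-02-10", "2024-02-25", "2024-03-10", "2023-09-15",
    ]),
    "amount": [50.0, 30.0, 120.0, 20.0, 25.0, 40.0, 300.0],
})

# Reference date: the day after the last observed purchase.
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    last_purchase=("order_date", "max"),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)
rfm["recency"] = (snapshot - rfm["last_purchase"]).dt.days


def quartile_score(values, higher_is_better=True):
    """Map values to scores 1-4 by quartile; ranking first avoids tied bin edges."""
    ranks = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, 4, labels=[1, 2, 3, 4]).astype(int)


# Fewer days since the last purchase is better, so recency is inverted.
rfm["R_Score"] = quartile_score(rfm["recency"], higher_is_better=False)
rfm["F_Score"] = quartile_score(rfm["frequency"])
rfm["M_Score"] = quartile_score(rfm["monetary"])
print(rfm[["recency", "frequency", "monetary", "R_Score", "F_Score", "M_Score"]])
```

Scoring on quartiles rather than raw values puts the three metrics on the same scale, which matters later because K-means is distance-based.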

Loading Data

df=pd.read_csv('tabela_final.csv')
df
df.info()
df.isnull().sum()

Null values distribution

(df.isnull().sum()/len(df))*100
Just over half of the users add a product to their cart, and a little over 1% of the base makes a purchase. That 1% is the conversion rate, if conversion is defined as making a purchase.
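The reading above treats a null as "the user never did this", so the share of non-null rows in a column is the share of users who performed the action. A minimal sketch of that arithmetic on made-up data follows; the added_to_cart column is hypothetical, and the numbers are toy values rather than the project's results.

```python
import numpy as np
import pandas as pd

# Toy data: NaN means the user never performed the action.
toy = pd.DataFrame({
    "added_to_cart": [1, np.nan, 2, np.nan, 1, 3, np.nan, 1, np.nan, 2],  # hypothetical column
    "qntd-transacoes": [np.nan] * 9 + [1],
})

# Percentage of nulls per column, as in the cell above.
null_pct = toy.isnull().sum() / len(toy) * 100

# The complement is the percentage of users who performed each action.
action_pct = 100 - null_pct
print(action_pct)
```

In this toy frame, 60% of users added something to the cart and 10% made a purchase.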

qntd_transacoes

df['qntd-transacoes'].describe()
...

Clusterization

X = final[['R_Score', 'F_Score', 'M_Score']]
# Calculate inertia (sum of squared distances) for different values of k
inertia = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
# Plot the elbow curve
plt.figure(figsize=(8, 6),dpi=150)
plt.plot(range(2, 11), inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Curve for K-means Clustering')
plt.grid(True)
plt.show()
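Once a value of k is read off the elbow curve, the model is refit and each row is labelled, which produces the Cluster column used in the segmentation step. The sketch below shows that step on random stand-in scores. The choice k = 4 is an assumption, inferred only from the four bar colors defined later, and the synthetic final frame is not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scored table: 200 customers, scores from 1 to 4.
rng = np.random.default_rng(42)
final = pd.DataFrame(
    rng.integers(1, 5, size=(200, 3)),
    columns=["R_Score", "F_Score", "M_Score"],
)

k = 4  # assumption: the value chosen from the elbow curve
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)

# fit_predict fits the model and returns one cluster label per row.
final["Cluster"] = kmeans.fit_predict(final[["R_Score", "F_Score", "M_Score"]])

# Cluster sizes are a quick sanity check before interpreting the segments.
sizes = final["Cluster"].value_counts().sort_index()
print(sizes)
```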

Segmentation

# Group by cluster and calculate mean values
cluster_summary = final.groupby('Cluster').agg({
    'R_Score': 'mean',
    'F_Score': 'mean',
    'M_Score': 'mean'
}).reset_index()
cluster_summary
colors = ['#3498db', '#2ecc71', '#f39c12','#C9B1BD']
# Plot the average final scores for each cluster
plt.figure(figsize=(10, 8),dpi=150)
# Plot Avg Recency
plt.subplot(3, 1, 1)
bars = plt.bar(cluster_summary.index, cluster_summary['R_Score'], color=colors)
plt.xlabel('Cluster')
plt.ylabel('Avg Recency')
plt.title('Average Recency for Each Cluster')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bars, cluster_summary.index, title='Clusters')
# Plot Avg Frequency
plt.subplot(3, 1, 2)
bars = plt.bar(cluster_summary.index, cluster_summary['F_Score'], color=colors)
plt.xlabel('Cluster')
plt.ylabel('Avg Frequency')
plt.title('Average Frequency for Each Cluster')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bars, cluster_summary.index, title='Clusters')
# Plot Avg Monetary
plt.subplot(3, 1, 3)
bars = plt.bar(cluster_summary.index, cluster_summary['M_Score'], color=colors)
plt.xlabel('Cluster')
plt.ylabel('Avg Monetary')
plt.title('Average Monetary Value for Each Cluster')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bars, cluster_summary.index, title='Clusters')
plt.tight_layout()
plt.show()
[Figure: Clusters]
To view the rest, please get in touch: this project is best viewed as a Jupyter Notebook, and the platform does not allow uploading that type of file.
Posted Jan 25, 2025

This work involves using RFM analysis for customer segmentation, transforming data, calculating recency, frequency, and monetary scores, and clustering with K-M
