Amazon Segmentation

This was my final project at RevoU and marked the conclusion of my studies there. I was free to choose my own dataset, understand its context, and identify possible business problems. I handled the end-to-end data analysis process: defining the project scope, establishing the analysis approach, carrying out the data analysis, and creating the data visualizations.

Client:

Self Project Amazon

Date:

June 27, 2025

Type:

Customer Segmentation

Role:

Data Analyst

About



  • Project Focus:

    • Analyzing Amazon sales data to identify distinct customer behavioral patterns and develop targeted marketing strategies. The analysis combines data science techniques, including K-Means clustering, feature engineering, and statistical analysis, to transform raw e-commerce data into actionable business insights.

  • Problem Statement:

    • How can we improve the effectiveness of Amazon’s marketing strategies by 15% within 12 months by segmenting customers on their spending behaviour, specifically their category-based spending patterns, discount sensitivity levels, and product rating influence, in order to enhance customer targeting?


Key Deliverables

  • Comprehensive customer segmentation model identifying 4 distinct behavioral personas

  • Interactive Tableau dashboard for dynamic data exploration

  • Strategic marketing recommendations projected to increase ROI by 15%

Essential Link

Screenshot 2025-07-08 at 17.57.32.png

Context

Amazon's vast marketplace generates enormous amounts of customer transaction data, creating both opportunities and challenges for businesses seeking to understand their customer base. With over 1,400 product records spanning multiple categories from Electronics to Home & Kitchen, the dataset represents a diverse customer ecosystem with varying purchasing behaviors, price sensitivities, and product preferences.

Business Environment

  • Multi-category e-commerce platform with diverse product offerings

  • Customers exhibiting varied purchasing patterns across different price points

  • Complex relationship between product ratings, discounts, and customer behavior

  • Need for data-driven marketing strategies to maximize customer lifetime value

Problem

Without a clear understanding of customer segments, businesses struggle to implement effective marketing strategies, leading to suboptimal resource allocation and missed revenue opportunities.

Screenshot 2025-07-08 at 18.19.25.png

Mutually Exclusive Collectively Exhaustive (MECE) Issue Tree

Screenshot 2025-07-07 at 20.22.09.png

Business Impact

  • Potential revenue loss from untargeted marketing campaigns

  • Suboptimal inventory planning without understanding customer preferences

  • Missed opportunities for personalized customer experiences

Objective

The objective was to develop a comprehensive customer segmentation model that enables data-driven marketing strategies and improves business performance.

Processes & Considerations

  1. End-to-End Data Cleaning Process

  2. Exploratory Data Analysis Process

  3. EDA CSV Data Sets Results

  4. Interactive Dashboard Tableau

Step 1

Data Cleaning

  • Identified the price columns (discounted_price, actual_price, discount_percentage) and changed their data types to match the expected output, which simplifies the price-related part of the exploratory data analysis

Before Changing the Data Type

Screenshot 2025-07-07 at 23.44.59.png

After Changing the Data Type

Screenshot 2025-07-07 at 23.47.42.png
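A minimal sketch of this kind of type conversion is shown here. It assumes the raw price columns are strings carrying a currency symbol and thousands separators, and that the discount is a string ending in "%". The sample rows and the helper name to_number are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample rows; the real raw formats may differ.
df = pd.DataFrame({
    "discounted_price": ["₹399", "₹1,299"],
    "actual_price": ["₹1,099", "₹2,499"],
    "discount_percentage": ["64%", "48%"],
})

def to_number(series: pd.Series) -> pd.Series:
    """Drop every character except digits and the decimal point, then cast to float."""
    cleaned = series.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    # Values that are still not parseable become NaN instead of raising an error.
    return pd.to_numeric(cleaned, errors="coerce").astype(float)

for col in ["discounted_price", "actual_price", "discount_percentage"]:
    df[col] = to_number(df[col])

print(df.dtypes)
```

Using errors="coerce" keeps the conversion from failing on unexpected values, so they can be inspected afterwards as missing values.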
  • Identified the rating columns (rating and rating_count), found an unusual string, inspected the affected row, and changed the data types to match the expected output, which simplifies the rating-related part of the exploratory data analysis

Before The Process

Screenshot 2025-07-07 at 23.51.29.png

After The Process

Screenshot 2025-07-07 at 23.54.36.png
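The rating clean-up can be sketched along these lines. The sample data is hypothetical, and the "|" value is only a stand-in for whatever unusual string appears in the real rating column. The sketch also assumes rating_count is stored as text with thousands separators.

```python
import pandas as pd

# Hypothetical sample; "|" stands in for the unusual string in the rating column.
df = pd.DataFrame({
    "rating": ["4.2", "|", "3.9"],
    "rating_count": ["24,269", "1,012", None],
})

# Inspect the rows whose rating cannot be parsed as a number.
bad_rows = df[pd.to_numeric(df["rating"], errors="coerce").isna()]
print(bad_rows)

# Convert: unparseable ratings become NaN, and thousands separators
# are removed from the counts before the numeric conversion.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df["rating_count"] = pd.to_numeric(
    df["rating_count"].str.replace(",", "", regex=False), errors="coerce"
)
```

Inspecting bad_rows before converting makes it possible to decide whether the strange row should be corrected by hand or left as a missing value.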
  • Checked for duplicates and filled missing values in the dataset

# Duplicates check: flag fully duplicated rows and display them
duplicates = df.duplicated()
df[duplicates]
# Missing values check: count NaN values per column
df.isna().sum()

Duplicates: there are no duplicated rows

Missing Values: 2 missing values were found in rating_count

  • Created a new Product Category data frame by identifying which columns belong to the product or category section, distinguishing the category names, and renaming Category_1 and Category_2 to Main-Category and Sub-Category

Before the Process

image.png

After the Process

Screenshot 2025-07-08 at 00.24.05.png
  • Grouped the rating scores into 5 ranking categories

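# Create the ranking categories from the numeric rating

rating_score = []

for score in df1['rating']:
    if score < 2.0: rating_score.append('Poor')
    elif score < 3.0: rating_score.append('Below Average')
    elif score < 4.0: rating_score.append('Average')
    elif score < 5.0: rating_score.append('Above Average')
    elif score == 5.0: rating_score.append('Excellent')
    else: rating_score.append(None)  # missing or out-of-range rating keeps the list aligned with df1

# Store the categories as a new column
df1['rating_score'] = rating_score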

Before Ranking Categories

There are only the rating and rating_count columns

After Ranking Process

Now it also contains rating_score, based on the rating criteria
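The same binning can also be expressed without a loop. The sketch below uses pandas.cut with left-closed bins on a few hypothetical ratings. It is an alternative illustration, not the code used in the project, and it differs in one detail: any rating above 5.0 would also be labelled Excellent.

```python
import pandas as pd

# Hypothetical ratings covering each category
df1 = pd.DataFrame({"rating": [1.5, 2.0, 3.7, 4.4, 5.0]})

# Left-closed bins: [0, 2), [2, 3), [3, 4), [4, 5), [5, inf)
bins = [0, 2.0, 3.0, 4.0, 5.0, float("inf")]
labels = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]

# right=False makes each bin include its lower edge, matching the "<" comparisons
df1["rating_score"] = pd.cut(df1["rating"], bins=bins, labels=labels, right=False)
print(df1)
```

Missing ratings stay missing in rating_score, so the new column always has the same length as the data frame.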

  • Created a Reviewers table from the user columns (user_id and user_name) by splitting the values so that each user_id belongs to exactly one user_name

Before The Identifying Process

The user columns are hard to work with because there is no one-to-one relationship between user_id and user_name

After The Analysing Process

Each user_id now maps one-to-one to a user_name
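One way to get this one-to-one structure is sketched below. It assumes, as an illustration only, that each row stores several reviewers as comma-separated strings in user_id and user_name, in matching order. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: several reviewers packed into one row per product
df = pd.DataFrame({
    "product_id": ["P1", "P2"],
    "user_id": ["U1,U2", "U3"],
    "user_name": ["Ana,Budi", "Citra"],
})

reviewers = df[["product_id", "user_id", "user_name"]].copy()
reviewers["user_id"] = reviewers["user_id"].str.split(",")
reviewers["user_name"] = reviewers["user_name"].str.split(",")

# Exploding both columns together pairs the i-th id with the i-th name.
# This needs pandas >= 1.3 and lists of equal length in every row.
reviewers = reviewers.explode(["user_id", "user_name"], ignore_index=True)
print(reviewers)
```

A user name that itself contains a comma would break the equal-length assumption, so such rows would need separate handling.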

Step 2

Exploratory Data Analysis - Customer Segmentation Analysis

  • Assigned a new data frame from the cleaning process and selected the features to be clustered for the product-based customer segmentation

  • Chose the scaler after inspecting the data. I used MinMaxScaler to scale the features to a common range because the data contains many identical values: there are many 0.0 entries, and many customers did not make a purchase

  • Filled the NaN values with 0 so that the analysis could use all the essential features. After scaling, I found 288 NaN values, each of which was imputed with 0


    Screenshot 2025-07-08 at 18.58.32.png
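The scaling and imputation steps can be sketched as follows. The feature names and values are hypothetical stand-ins for the cleaned data frame. MinMaxScaler ignores NaN values when fitting and keeps them as NaN in the output, which is why they are counted and filled after scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features with a few missing values
features = pd.DataFrame({
    "discounted_price": [399.0, 1299.0, np.nan, 199.0],
    "rating": [4.2, 3.9, 4.5, np.nan],
})

# Scale every feature to the [0, 1] range, keeping column names and index
scaler = MinMaxScaler()
scaled_df = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print("NaN values after scaling:", int(scaled_df.isna().sum().sum()))

# Impute the remaining NaN values with 0
scaled_df = scaled_df.fillna(0)
```

Note that on a min-max scale, filling with 0 places a missing value at the minimum of that feature, which is worth keeping in mind when interpreting the clusters.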
  • Plotted the Elbow and Silhouette methods on the data frame to determine the value of k

    • Elbow Method (Code and Results)

    plt.subplot(1, 2, 1)
    plt.plot(k_range, inertia, 'o-', markersize=8, c='royalblue')
    plt.grid(True)
    plt.xlabel('Number of clusters (k)', fontsize=12)
    plt.ylabel('Inertia', fontsize=12)
    plt.title('Elbow Method for Optimal k', fontsize=14)


    Screenshot 2025-07-08 at 19.00.55.png

    In the chart, inertia drops sharply from k = 2 to k = 3, and then the rate of decrease slows down. The "elbow" appears around k = 4, where the decrease in inertia starts to level off. This indicates k = 4 may be an optimal number of clusters.

    • Silhouette Method (Code and Results)

    plt.subplot(1, 2, 2)
    plt.plot(k_range, silhouette_scores, 'o-', markersize=8, c='green')
    plt.grid(True)
    plt.xlabel('Number of clusters (k)', fontsize=12)
    plt.ylabel('Silhouette Score', fontsize=12)
    plt.title('Silhouette Score for Optimal k', fontsize=14)


    Screenshot 2025-07-08 at 19.09.40.png

    In the chart, the highest silhouette score is observed at k = 4 (~0.65), suggesting the most natural clustering structure. Scores drop significantly after k = 4, so k = 4 provides the best cluster cohesion and separation according to the silhouette metric.
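The two plotting snippets rely on k_range, inertia, and silhouette_scores being computed beforehand. A minimal sketch of how these could be produced is shown here. The synthetic data from make_blobs is only a stand-in for the scaled data frame, and the range of k from 2 to 10 is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

k_range = range(2, 11)  # assumed range; the silhouette score needs k >= 2
inertia = []
silhouette_scores = []

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    inertia.append(km.inertia_)  # within-cluster sum of squared distances
    silhouette_scores.append(silhouette_score(X, labels))

# k with the highest silhouette score
best_k = k_range[int(np.argmax(silhouette_scores))]
print("k with highest silhouette score:", best_k)
```

Inertia always tends to fall as k grows, so the elbow is read from where the curve flattens, while the silhouette score can be compared directly across values of k.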

  • Based on the results of the Elbow and Silhouette methods, I chose k = 4 to separate the clusters (segments). To support the customer segmentation, I used a PCA visualization to inspect how the clusters are spread

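# Imports needed by the clustering and plotting code below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Apply K-means with the optimal K = 4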
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
# Fit kmeans and get predictions
cluster_labels = kmeans.fit_predict(scaled_df)

# Create a new column in scaled_df for clusters (don't try to assign to original df)
scaled_df_with_clusters = scaled_df.copy()
scaled_df_with_clusters['cluster'] = cluster_labels

# Visualize clusters using PCA for dimensionality reduction
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_df)
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
pca_df['cluster'] = cluster_labels

# Plot clusters
plt.figure(figsize=(10, 8))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='cluster', palette='viridis', s=100)
plt.title('Customer Segments Visualization using PCA', fontsize=15)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.legend(title='Cluster')
plt.show()
  • Plotted the customer segmentation results together with the characteristics of each cluster

# Analyzing the characteristics of each cluster
# First, add original features to the cluster dataframe
# Get the index from scaled_df to ensure we're matching the right rows
cluster_analysis_df = scaled_df_with_clusters.copy()

# Get the indices of rows that exist in scaled_df
valid_indices = scaled_df.index

# Use only matching rows from the original df for analysis
valid_df = df.loc[valid_indices].copy()
valid_df['cluster'] = cluster_labels

# Now perform the cluster analysis on valid_df
cluster_analysis = valid_df.groupby('cluster').agg({
    'discounted_price': 'mean',
    'actual_price': 'mean',
    'discount_percentage': 'mean',
    'rating': 'mean',
    'rating_count': 'mean',
    'difference_price': 'mean',
    'product_id': 'count'  # Number of purchases
}).reset_index()

print("Cluster Characteristics:")
print(cluster_analysis)

# Visualize cluster characteristics
plt.figure(figsize=(15, 10))
# Use only features that exist in cluster_analysis dataframe
available_features = [col for col in cluster_analysis.columns if col != 'cluster']

#Amazon Color Palette
amazon_palette = ['#FF9900', '#146EB4', '#232F3E', '#000000']

for i, feature in enumerate(available_features):
    if i < 9:  # Limit to 9 subplots
        plt.subplot(3, 3, i+1)
        sns.barplot(x='cluster', y=feature, data=cluster_analysis, palette=amazon_palette)
        plt.title(f'Average {feature} by Cluster')
        plt.xlabel('Cluster')
        plt.ylabel(feature)

plt.tight_layout()
plt.show()
Screenshot 2025-07-08 at 19.49.29.png

Visualization of Each Characteristic by Category

Analysis of the Segments' Price Sensitivity

  • To continue the customer spending behaviour analysis, I analyzed the category-based spending pattern: the average spending per main category and per sub-category

Average Spending Main Category

Screenshot 2025-07-08 at 20.17.49.png

Average Spending Sub Category

Screenshot 2025-07-08 at 20.18.24.png
  • Created the Discount Sensitivity Levels by grouping the discount percentage into 5 discount bins

Screenshot 2025-07-08 at 20.23.20.png

Number of Purchases and Average Price Reduction

Purchase-to-Customer Ratio and Price Compared to the Discount Percentage
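The discount binning can be sketched with pandas.cut. The bin edges and labels below are hypothetical, since the exact boundaries of the 5 discount levels are not listed here. The sketch also assumes that difference_price is the gap between actual_price and discounted_price, and that each row counts as one purchase.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "discount_percentage": [5, 18, 35, 52, 64, 90],
    "actual_price": [500.0, 1000.0, 800.0, 1500.0, 1099.0, 2000.0],
    "discounted_price": [475.0, 820.0, 520.0, 720.0, 399.0, 200.0],
})

# Assumed edges: 5 equal-width discount levels
bin_edges = [0, 20, 40, 60, 80, 100]
bin_labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]
df["discount_bin"] = pd.cut(
    df["discount_percentage"], bins=bin_edges, labels=bin_labels, include_lowest=True
)

# Assumed definition of the price reduction
df["difference_price"] = df["actual_price"] - df["discounted_price"]

# Number of purchases and average price reduction per discount level
summary = df.groupby("discount_bin", observed=False).agg(
    purchases=("discount_percentage", "size"),
    avg_price_reduction=("difference_price", "mean"),
)
print(summary)
```

Passing observed=False keeps empty discount levels in the summary, which makes gaps visible in later charts.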

  • After visualizing how discounts affected the price and the number of purchases, I examined the discount distribution per main category using a heatmap

Screenshot 2025-07-08 at 20.33.42.png
  • To gain a deeper understanding of customer spending behaviour, I analyzed how product ratings influence spending

Number of Purchases and Average Pricing by Rating Range

Price Compared to the Rating, and Distribution Percentage Compared to the Rating

  • Followed this with the rating bins, where I analyzed the relationship between the rating and the number of reviews to show how well separated each rating range is and to identify outlier ratings

Step 3

Insights and Recommendations

  • Returning from the Exploratory Data Analysis (EDA) to the problem statement and objective, which is to enhance the marketing strategies through customer segmentation based on spending behaviour, I proposed 4 phases to enhance the marketing strategies:

    1. Foundation - Data Driven Segmentation

      In the first phase, we transform raw customer data into a strategic asset by precisely defining four key customer segments: Balancers (32%), Validators (25%), Nobles (4%), and Bargainers (38%). This foundational stage focuses on creating a robust data infrastructure that enables accurate segment identification, implementing tracking systems, and establishing baseline performance metrics. The goal is to build a solid, data-driven framework that will support all subsequent marketing strategies.

      Screenshot 2025-07-08 at 21.00.43.png
    2. Activation - Personalized Engagement

      Building upon the foundational segmentation, this phase brings customer insights to life through hyper-personalized marketing approaches. We develop tailored messaging and creative assets for each segment, implementing multi-channel strategies that speak directly to unique customer preferences. By conducting rigorous A/B testing and creating segment-specific engagement tactics, we aim to dramatically improve customer resonance and initial conversion rates.

      Screenshot 2025-07-08 at 21.01.08.png
    3. Scaling - Advanced Optimization

      The scaling phase expands our segmentation strategy across all product categories and marketing channels. We dive deep into cross-category purchase patterns, develop sophisticated recommendation engines, and implement dynamic pricing strategies. This phase focuses on creating a comprehensive approach to customer lifecycle management, integrating insights across different segments and product lines to maximize marketing effectiveness and customer value.

      Screenshot 2025-07-08 at 21.01.31.png
    4. Refinement - Precision Marketing

      In the final phase, we leverage advanced AI-driven technologies to achieve unprecedented marketing precision. By developing predictive models, implementing real-time personalization engines, and conducting comprehensive performance analysis, we aim to fine-tune our targeting parameters. The ultimate objective is to achieve the targeted 15% improvement in marketing effectiveness, transforming our data-driven insights into a continuously evolving, highly adaptive marketing ecosystem.

      Screenshot 2025-07-08 at 21.02.03.png
  • As a last step, to measure whether everything goes as planned, I defined Key Performance Indicators (KPIs) to track, both segment-specific KPIs and overall business KPIs

Screenshot 2025-07-08 at 21.06.21.png

Screenshot 2025-07-08 at 21.06.56.png

Step 4

Interactive Dashboard - in Tableau

  • The dashboard presents all the results of my customer segmentation analysis. The analysis reveals four distinct customer clusters, each with unique characteristics

Screenshot 2025-07-08 at 21.10.09.png

Key Components of The Dashboard

  • Top Section: Main KPIs

    (5 Key Performance Indicators)

    • Total Customers

    • Total Revenue

    • Average Order Value

    • Average Customer Rating Score

    • Number of Products Sold

  • Cluster Filtering (Top Right)

    The colored buttons allow you to filter the entire dashboard by customer segment.

    • Orange = Validators

    • Black = Balancers

    • Gray = Nobles

    • Navy = Bargainers

  • Analysis Filtering Section (Center)

    Shows the selected analysis type with a dropdown menu

    Screenshot 2025-07-08 at 21.21.21.png
  • Cluster Distribution (Left Side)

    This pie chart shows customer segmentation into 4 groups

    • Balancers (52%): Largest segment, likely balanced shoppers

    • Bargainers (38%): Price-sensitive customers

    • Validators (6%): Smaller segment, possibly careful purchasers

    • Nobles (4%): Premium customers, smallest but likely high-value

  • Rating and Number of Reviews Distribution (Bottom Right)

    This histogram shows the distribution of customer ratings from 1-5 stars

    Most customers give ratings between 3 and 4 stars. The distribution appears relatively normal with a slight skew toward higher ratings, and it peaks around 4 stars with 98 reviews.

  • Purchase Count in Discount Distribution (Bottom Left)

    This table breaks down purchase behavior by product category and discount ranges (0-10%, 10-20%, etc.).

Copyright © Baskoroajii.2025. All rights reserved.
