Credit Product Revobank

A banking analytics project that used machine learning to segment 12,558 customer records into distinct behavioral personas.

Client:

Revobank

Date:

June 27, 2025

Type:

Customer Segmentation

Role:

Data Analyst

About

This project was conducted as part of the RevoU Python module and aimed to sharpen my technical expertise in data cleansing, exploratory data analysis, and segmentation using clustering methods.

Key Deliverables:

The project's primary focus was customer segmentation using RFM and K-means methodologies. My role involved understanding the business problem, extracting insights, categorizing customers by their distinct characteristics, tracking behavioral tendencies, and tailoring recommendations for each segment based on transaction trends.

Context

RevoBank is a European bank that offers credit card products to its customers. The bank aims to increase credit card usage among existing clients by analyzing customer behavior and sales performance.

In this project I acted as a member of the Performance Management (PM) team, working with a dataset from the MIS team that contains 36 months of sales data. The goal was to create user personas and insights based on client activity to support credit card usage strategies.

Problem

Transaction Behavior Analysis: RevoBank wanted to understand how customers transacted during promotional vs. non-promotional periods.
Promotion Effectiveness: There was a need to compare credit card usage across different customer groups and validate the effectiveness of past promotions.
Targeted Strategy Development: The bank aimed to design cost-efficient, segmented promotional strategies that resonated with core user clusters.

Objective

To analyze RevoBank credit card sales performance trends over the past three years and identify key growth patterns.
To develop user personas based on existing client data to understand customer behavior and segment characteristics.
To identify and prioritize business opportunities that will increase RevoBank credit card product usage among current customers.

Processes & Considerations

Step 1

Data Cleaning

Out of 12,558 rows, 72 exact duplicate rows were found and removed, leaving 12,486 rows (99.4% of the original data) for analysis.
Anomalous values that did not match the data dictionary were identified and handled as follows:

Before	After	Process
4 Account Activity Levels	3 Account Activity Levels	Deleted, because no information was available to impute the values or map them to another level
6 Customer Value Levels	5 Customer Value Levels	Reassigned to another value level according to the lower limit of that level

735 missing values were found in avg_sales_L36M and filled using mean imputation (the 5.89% missing share did not meet the low threshold for deletion).

In order to ensure data quality and reliability for the RevoBank credit card analysis, comprehensive data cleaning and preparation processes were implemented on the original dataset of 12,558 rows.
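
The cleaning steps above can be sketched on a small synthetic frame. This is a minimal illustration, not the project's actual notebook: the toy rows and values are invented, and only the column name `avg_sales_L36M` comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the RevoBank data (values are invented for illustration)
raw = pd.DataFrame({
    "account_id": [1, 2, 2, 3, 4],
    "avg_sales_L36M": [100.0, 250.0, 250.0, np.nan, 400.0],
})

# 1. Drop rows that are exact duplicates across all columns
deduped = raw.drop_duplicates().reset_index(drop=True)

# 2. Report how much of the original data is retained
retained_pct = 100 * len(deduped) / len(raw)

# 3. Mean-impute the missing values in avg_sales_L36M
missing_pct = 100 * deduped["avg_sales_L36M"].isna().mean()
mean_value = deduped["avg_sales_L36M"].mean()  # NaN values are skipped by default
cleaned = deduped.assign(
    avg_sales_L36M=deduped["avg_sales_L36M"].fillna(mean_value)
)

print(f"Rows kept: {len(deduped)} of {len(raw)} ({retained_pct:.1f}%)")
print(f"Missing before imputation: {missing_pct:.1f}%")
```

Computing the retained and missing percentages in code, rather than by hand, keeps the figures quoted in the write-up in step with the data.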

Step 2

Exploratory Data Analysis - Customer Segmentation

Set up the environment by initializing the Key Credit Card Usage Metrics and the Potential Profit Calculation from these features:

Key Credit Card Usage Metrics	Potential Profit Calculation
avg_sales_L36M	avg_sales_L36M
cnt_sales_L36M	cnt_sales_L36M
month_since_last_sales
count_direct_promo_L12M

Examine the distribution of credit card usage over the past 36 months by visualizing it directly:

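import matplotlib.pyplot as plt
import seaborn as sns

# df is the cleaned RevoBank dataset from Step 1
plt.figure(figsize=(12, 8))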
sns.histplot(df['cnt_sales_L36M'], kde=True)
plt.title('Distribution of Credit Card Product Usage Frequency (Past 36 Months)')
plt.xlabel('Number of Transactions')
plt.ylabel('Count of Customers')
plt.axvline(df['cnt_sales_L36M'].mean(), color='red', linestyle='--', label=f'Mean: {df["cnt_sales_L36M"].mean():.2f}')
plt.legend()
plt.show()

Because the data contains many outliers, I chose RobustScaler to scale the selected credit card usage features. RobustScaler uses the median and quantiles instead of the mean and standard deviation, which makes it more robust to outliers.

import pandas as pd
from sklearn.preprocessing import RobustScaler

# Select relevant features that indicate credit card product usage patterns
features = [
    'avg_sales_L36M',        # Average sales amount (monetization of products)
    'cnt_sales_L36M',        # Frequency of product usage
    'month_since_last_sales', # Recency of product usage
    'count_direct_promo_L12M' # Response to promotions
]

# Check if additional credit-related features exist in the dataset
credit_features = [col for col in df.columns if 'credit' in col.lower()]
if credit_features:
    print(f"Additional credit-related features found: {credit_features}")
    features.extend(credit_features)

X = df[features].copy()

# Handle missing values if any
X = X.fillna(X.median())  # Using median instead of mean due to outliers

# Apply RobustScaler to handle outliers
# RobustScaler uses medians and quantiles instead of means and standard deviations
# This makes it more robust to outliers in the data
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# Convert to DataFrame for better understanding
X_scaled_df = pd.DataFrame(X_scaled, columns=features)
print("\nRobustScaler-Transformed Data Preview:")
display(X_scaled_df.head())

# Compare original vs. scaled distributions
plt.figure(figsize=(15, 10))
for i, feature in enumerate(features):
    plt.subplot(2, len(features), i+1)
    sns.histplot(X[feature], kde=True)
    plt.title(f'Original: {feature}')

    plt.subplot(2, len(features), i+1+len(features))
    sns.histplot(X_scaled_df[feature], kde=True)
    plt.title(f'RobustScaled: {feature}')

plt.tight_layout()
plt.show()
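
To make the scaling step concrete, the sketch below reproduces what RobustScaler does with its default settings: it subtracts the median and divides by the interquartile range (75th minus 25th percentile). The toy array is invented, and its single large value stands in for an outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy feature with one extreme outlier (values invented for illustration)
x = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])

# Manual robust scaling: (x - median) / IQR
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

# Same transformation via scikit-learn (default quantile_range is (25.0, 75.0))
robust = RobustScaler().fit_transform(x)

# Standard scaling for comparison: mean and std are both pulled by the outlier
standard = StandardScaler().fit_transform(x)

print("Robust-scaled:  ", robust.ravel())
print("Standard-scaled:", standard.ravel())
```

With robust scaling the four typical values stay spread out around zero, whereas standard scaling squeezes them close together because the outlier dominates the mean and standard deviation.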

Implement K-Means clustering for the credit card usage analysis. I chose K-Means rather than RFM scoring because this is a sales dataset whose values can be spread anomalously, I wanted to segment on several features at once, and K-Means can create as many segments as needed.
Confirm the number of clusters (K) using the Elbow Method and the Silhouette Method. Reasoning:
Why the Elbow Method: To identify the optimal number of clusters by evaluating the Within-Cluster Sum of Squares (WCSS), a measure of how tightly grouped the data points are within each cluster.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Elbow Method with RobustScaler-transformed data
distortions = []
K_range = range(1, 10)
for k in K_range:
    kmeanModel = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeanModel.fit(X_scaled)
    # Distortion: mean Euclidean distance from each point to its nearest centroid
    distortions.append(
        sum(np.min(cdist(X_scaled, kmeanModel.cluster_centers_, 'euclidean'), axis=1))
        / X_scaled.shape[0]
    )

# Plot the Elbow Curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, distortions, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Distortion (Average within-cluster distance)')
plt.title('Elbow Method For Optimal k - RevoBank Credit Card Usage Segments')
plt.grid(True)
plt.show()
Why the Silhouette Method: To evaluate how well separated and well defined the clusters are, that is, not just compact but also distinct from one another.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette Analysis for additional validation
silhouette_scores = []
for k in range(2, 10):  # Silhouette score requires at least 2 clusters
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_scaled)
    silhouette_avg = silhouette_score(X_scaled, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    print(f"For n_clusters = {k}, the silhouette score is {silhouette_avg}")

# Plot Silhouette Scores
plt.figure(figsize=(10, 6))
plt.plot(range(2, 10), silhouette_scores, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis For RevoBank Customer Segments')
plt.grid(True)
plt.show()
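
The elbow code above plots the mean distance to the nearest centroid rather than WCSS itself. As a side note, scikit-learn exposes WCSS directly as the fitted model's `inertia_` attribute. The sketch below checks this on synthetic blobs (generated with `make_blobs`, not RevoBank data) and then picks the K with the highest silhouette score; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated groups (illustration only)
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 10], [0, 10], [10, 0]],
    cluster_std=0.5,
    random_state=42,
)

wcss, sil = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_demo)
    wcss[k] = model.inertia_  # within-cluster sum of squared distances
    sil[k] = silhouette_score(X_demo, model.labels_)

# Verify that inertia_ equals WCSS computed by hand for one model
model = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X_demo)
manual_wcss = ((X_demo - model.cluster_centers_[model.labels_]) ** 2).sum()

best_k = max(sil, key=sil.get)
print(f"Best K by silhouette: {best_k}")
```

On real data the two criteria often disagree, which is why the choice of K in this project also weighs business practicality.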
Taken together, the Elbow and Silhouette Methods indicate that K=4 (four clusters) is the best choice for this segmentation, for several reasons:
1. Elbow Point: It is the point where the marginal benefit of adding more clusters drops sharply
2. Balanced Trade-off: While not the absolute highest silhouette score, it provides reasonable cluster quality without over-segmentation
3. Business Practicality: Four segments are manageable for marketing strategies and customer relationship management
4. Natural Structure: The data appears to have inherent groupings that align well with four clusters
With K=4 chosen, we can profile the credit card user segments by aggregating the key metrics for each cluster:

# Group by cluster and calculate key credit card usage metrics
cluster_analysis = df.groupby('cluster').agg({
    'avg_sales_L36M': 'mean',
    'cnt_sales_L36M': 'mean',
    'account_id': 'count',
    'month_since_last_sales': 'mean',
    'count_direct_promo_L12M': 'mean',
    'potential_profit': 'mean'
}).rename(columns={'account_id': 'customer_count'})

cluster_analysis
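
The aggregation above relies on two columns that are not created in the snippets shown: `cluster` and `potential_profit`. A minimal sketch of how they might be derived follows. It uses K=4 as chosen above, and it assumes potential profit is the average sale amount times the transaction count times the 2.4% margin quoted in this write-up. The project's exact formula is not shown, so treat that as an assumption. The frame `df_demo` is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the cleaned dataset (values invented for illustration)
rng = np.random.default_rng(42)
n = 200
df_demo = pd.DataFrame({
    "account_id": np.arange(n),
    "avg_sales_L36M": rng.gamma(shape=2.0, scale=150.0, size=n),
    "cnt_sales_L36M": rng.poisson(lam=12, size=n),
    "month_since_last_sales": rng.integers(0, 36, size=n),
    "count_direct_promo_L12M": rng.integers(0, 6, size=n),
})
features = [
    "avg_sales_L36M",
    "cnt_sales_L36M",
    "month_since_last_sales",
    "count_direct_promo_L12M",
]

# Fit the final model with the chosen K and attach a label to each customer
optimal_k = 4
X_scaled = RobustScaler().fit_transform(df_demo[features])
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df_demo["cluster"] = kmeans.fit_predict(X_scaled)

# ASSUMPTION: profit = average sale amount x number of sales x 2.4% margin
MARGIN = 0.024
df_demo["potential_profit"] = (
    df_demo["avg_sales_L36M"] * df_demo["cnt_sales_L36M"] * MARGIN
)

print(df_demo.groupby("cluster")["potential_profit"].mean())
```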

Step 3

Data Visualization - By Key Metrics

Based on the key credit card usage metrics, several points need to be analyzed for each cluster:
1. Average Sales Per Client
(Understanding revenue potential and customer value)
# Visualize key metrics by cluster
plt.figure(figsize=(16, 12))

# Average sales per client
plt.subplot(2, 2, 1)
sns.barplot(x=cluster_analysis.index, y=cluster_analysis['avg_sales_L36M'])
plt.title('Average Credit Card Product Sales by Segment')
plt.xlabel('Customer Segment')
plt.ylabel('Average Sales (€)')
2. Average Transaction Frequency
(Understanding customer engagement patterns and loyalty)
# Average transaction frequency
plt.subplot(2, 2, 2)
sns.barplot(x=cluster_analysis.index, y=cluster_analysis['cnt_sales_L36M'])
plt.title('Credit Card Product Usage Frequency by Segment')
plt.xlabel('Customer Segment')
plt.ylabel('Avg Number of Transactions')
3. Total Profit by Cluster
(Understanding actual business value and ROI)
# Total profit by cluster (using 2.4% margin)
plt.subplot(2, 2, 3)
sns.barplot(x=cluster_analysis.index, y=cluster_analysis['total_profit'])
plt.title('Total Profit by Customer Segment (2.4% Margin)')
plt.xlabel('Customer Segment')
plt.ylabel('Total Profit (€)')
4. Active and Inactive Proportion
(Understanding segment health and engagement levels)
# Active vs inactive proportion
plt.subplot(2, 2, 4)
sns.barplot(x=cluster_analysis.index, y=cluster_analysis['active_proportion'])
plt.title('Proportion of Active Credit Card Product Users')
plt.xlabel('Customer Segment')
plt.ylabel('Active Proportion')

plt.tight_layout()
plt.show()
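
The last two panels read `total_profit` and `active_proportion` from `cluster_analysis`, but the aggregation shown in Step 2 does not produce them. One way they could be added is sketched below. Two assumptions are made: total profit is the sum of `potential_profit` within a cluster, and a customer counts as active if they transacted within the last 12 months. The cutoff `ACTIVE_WINDOW_MONTHS` is hypothetical, since the project's own definition of "active" is not shown.

```python
import pandas as pd

# Tiny synthetic frame with cluster labels already assigned (illustration only)
df_demo = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "potential_profit": [10.0, 30.0, 5.0, 5.0, 20.0],
    "month_since_last_sales": [1, 20, 3, 6, 30],
})

# ASSUMPTION: "active" means a transaction within the last 12 months
ACTIVE_WINDOW_MONTHS = 12
df_demo["is_active"] = df_demo["month_since_last_sales"] <= ACTIVE_WINDOW_MONTHS

# Named aggregation yields one row per cluster with the two extra metrics
extra_metrics = df_demo.groupby("cluster").agg(
    total_profit=("potential_profit", "sum"),
    active_proportion=("is_active", "mean"),
)
print(extra_metrics)
```

A frame like `extra_metrics` could then be attached to the per-cluster summary with `cluster_analysis.join(extra_metrics)`, since both are indexed by cluster.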

Step 4

Identify Business Opportunities

With all of this output, we can identify the user persona for each segment:
Segment 0 :
Segment 1 :
Segment 2 :
Segment 3 :

# Add demographic and behavior analysis
extended_cluster_analysis = df.groupby('cluster').agg({
    'avg_sales_L36M': 'mean',
    'cnt_sales_L36M': 'mean',
    'month_since_last_sales': 'mean',
    'flag_female': 'mean',  # Proportion of females
    'MOB': 'mean',  # Months on book
    'potential_profit': 'mean',
    'customer_value_level': lambda x: x.mode()[0] if not x.mode().empty else 'Unknown',
    'account_activity_level': lambda x: x.mode()[0] if not x.mode().empty else 'Unknown',
    'count_direct_promo_L12M': 'mean',
    'birth_date': lambda x: pd.to_datetime(x, errors='coerce').mean() if pd.to_datetime(x, errors='coerce').notna().any() else None
})

# If birth_date is datetime, calculate average age
if extended_cluster_analysis['birth_date'].dtype == 'datetime64[ns]':
    current_date = pd.Timestamp.now()
    extended_cluster_analysis['avg_age'] = (current_date - extended_cluster_analysis['birth_date']).dt.days / 365
    extended_cluster_analysis = extended_cluster_analysis.drop('birth_date', axis=1)

print("\nExtended Credit Card User Segment Profiles:")
display(extended_cluster_analysis)

# Identify most valuable segments for credit card product usage
sorted_clusters = cluster_analysis.sort_values('total_profit', ascending=False)
top_clusters = sorted_clusters.head(2).index.tolist()

print(f"\nTop 2 most valuable credit card user segments: {top_clusters}")

# Create credit card user personas
optimal_k = 4  # number of clusters chosen via the elbow and silhouette analysis
personas = {}
for cluster in range(optimal_k):
    data = extended_cluster_analysis.loc[cluster]

    # Determine persona characteristics relevant to credit card usage
    if data['avg_sales_L36M'] > extended_cluster_analysis['avg_sales_L36M'].mean():
        value_segment = "High Value"
    else:
        value_segment = "Low Value"

    if data['cnt_sales_L36M'] > extended_cluster_analysis['cnt_sales_L36M'].mean():
        usage_segment = "Active Users"
    else:
        usage_segment = "Inactive Users"

    if data['month_since_last_sales'] < extended_cluster_analysis['month_since_last_sales'].mean():
        recency_segment = "Recent Users"
    else:
        recency_segment = "Lapsed Users"

    # Create persona description focused on credit card product usage
    persona_name = f"Segment {cluster}: {value_segment}, {usage_segment}, {recency_segment}"

    # Additional details
    gender = "Predominantly Female" if data['flag_female'] > 0.5 else "Predominantly Male"

    details = f"""
    Credit Card Product Usage Profile:
    - Avg Product Sales: €{data['avg_sales_L36M']:.2f}
    - Usage Frequency: {data['cnt_sales_L36M']:.2f} transactions (past 36 months)
    - Months Since Last Usage: {data['month_since_last_sales']:.2f}
    - Avg Profit per Customer: €{data['potential_profit']:.2f} (at 2.4% margin)

    Customer Demographics:
    - Gender Distribution: {gender} ({100 * data['flag_female']:.1f}% Female)
    - Customer Tenure: {data['MOB']:.1f} months
    - Customer Value Level: {data['customer_value_level']}
    - Account Activity Level: {data['account_activity_level']}
    - Promotional Campaign Exposure: {data['count_direct_promo_L12M']:.2f} promotions (past 12 months)
    """

    if 'avg_age' in extended_cluster_analysis:
        details += f"- Average Age: {data['avg_age']:.1f} years\n"

    personas[persona_name] = details

# Print the personas
print("\nCredit Card User Personas:")
for persona, details in personas.items():
    print("\n" + "=" * 60)
    print(persona)
    print("-" * 60)
    print(details)

Insights and Recommendations

For Each Segment

User Segment 0

Maintain standard credit product communication
Include in general credit promotional campaigns
Monitor for signals of increased credit usage potential
Provide basic financial education on credit product benefits
Implement cost-effective service model for this segment

User Segment 1

Create "win-back" campaigns with special credit product incentives
Conduct surveys to understand barriers to credit product usage
Simplify credit application and usage processes
Offer "welcome back" bonuses for first new credit transaction
Develop streamlined credit products for inactive customers

User Segment 2

Implement tiered rewards based on credit transaction value
Provide financial education on premium credit products and their benefits
Create value-added service bundles around core credit products
Develop upgrade paths to higher-tier credit options
Use targeted marketing to showcase premium credit product benefits

User Segment 3

Create transaction-based incentives focused on credit products
Implement "milestone" bonuses for reaching credit usage targets
Develop mobile notifications for unused credit benefits
Launch targeted campaigns highlighting the benefits of regular credit product usage
Consider temporary rate improvements for increasing usage frequency