Exploratory Data Analysis, Feature Engineering & Hypothesis

Learn exploratory data analysis, feature engineering & hypothesis testing through Telco churn case study. Step-by-step Python guide for data science beginners.

Exploratory Data Analysis (EDA) can help solve a critical business problem: customer churn. Imagine you’re the CEO of a telecom company. Every month, some customers leave your service – a costly problem since acquiring new customers is often more expensive than retaining existing ones. But with EDA, we can predict who might leave and why.

That’s exactly what we’ll explore in this project! Using a real-world dataset from a telecom company, we’ll analyze customer behavior, identify patterns, and uncover the key factors that drive churn. By the end, you’ll understand how data science can help businesses reduce churn and improve customer retention.

To follow along with this article, you can find the full code on this GitHub Repo.

The Dataset

We’ll use the Telco Customer Churn Dataset from Kaggle, which includes:

  • Customer demographics (e.g., gender, senior citizen status).
  • Service details (e.g., internet service, phone service).
  • Billing information (e.g., monthly charges, contract type).
  • Churn status (whether the customer left the company).

Our goal is to clean, explore, and analyze this data to answer:

  • What factors are most strongly associated with churn?
  • Can we predict which customers are at risk of leaving?

Step 1: Setting Up the Tools

Before diving into the data, we need the right tools. Here’s what we’ll use:

  • Pandas: For loading and manipulating data (like a supercharged Excel).
  • NumPy: For numerical operations.
  • Matplotlib/Seaborn: For creating visualizations (graphs and charts).
  • SciPy: For statistical tests.
Python
import pandas as pd  # Data manipulation
import numpy as np   # Numerical operations
import matplotlib.pyplot as plt  # Basic plotting
import seaborn as sns  # Fancy statistical plots
from scipy import stats  # Hypothesis testing
import warnings  # Ignore annoying warnings
warnings.filterwarnings('ignore')  

# Set plot styles
sns.set(style='whitegrid')  
plt.rcParams['figure.figsize'] = (10, 6)  # Default figure size

Explanation:

  • We import libraries and configure settings (like plot size) for consistency.
  • warnings.filterwarnings('ignore') suppresses minor warnings to keep our output clean.

Step 2: Loading and Understanding the Data

As data scientists, a large part of our work is understanding the data before we can make any predictions. Loading and summarizing the data is the first step in this journey. It helps us see what kind of information we have, spot missing or unusual values, and get a feel for how the data is structured.

By carefully inspecting the dataset, we can catch potential issues early, decide what needs cleaning, and plan how to explore the data. This step is essential because any mistakes or overlooked problems here can affect the results of all our analysis and models later on.

In this section, we will:

  • Load the dataset and display a few rows to get a quick glimpse.
  • Check the size and structure of the dataset.
  • Look for missing values, duplicates, and basic statistics.

Doing this gives us a solid foundation for the rest of the project, ensuring our analysis is accurate and reliable.

2.1. Load and Preview the Dataset

Before we start analyzing, we first need to load the dataset into our workspace and take a quick look. Previewing a few rows helps us understand what kind of information each column holds and gives an initial feel for the data.

Python
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')  
data.head()  # Show the first 5 rows
Customer churn dataset

Observations:

  • Each row represents a customer.
  • Columns include gender, MonthlyCharges, Contract, and the target variable Churn (Yes/No).

2.2. Check the Shape of the Dataset

Knowing the dataset’s shape tells us how many rows and columns it has. This helps us understand the dataset’s size and structure before diving deeper.

Python
print("Shape:", data.shape)  # (7043 rows, 21 columns)

Key Insight:

  • The dataset has 7,043 customers (rows) and 21 features (columns).

2.3. Check the Dataset Information

It’s helpful to peek under the hood. Checking the dataset information shows the type of each column, whether it’s numbers or text, and how many non-missing values each has. This gives us a quick sense of what needs cleaning or conversion, and which features we can use for analysis.

Python
data.info()
Customer churn info
  • Shows column types (numbers or text).
  • Helps identify potential issues like non-numeric data, e.g., TotalCharges is stored as text but should be a numeric type.

2.4. Check for Missing Data

Even the cleanest-looking dataset can hide missing values. Checking for missing data helps us find gaps that could cause errors later. By identifying which columns have missing values and how many, we can plan how to fill or remove them, ensuring our analysis and models are accurate and reliable.

Python
print("Missing values:", data.isnull().sum())  # No missing values
Missing values

It is important to handle missing data before analysis.

TotalCharges being in the wrong data type could potentially hide missing values because empty strings are not detected as missing values.

So, they do need to be explicitly detected and handled, just like real NaN. We will start by checking these on the categorical columns, then proceed to check for null or missing values.

Python
# Filter the categorical columns values
categorical_columns = data.select_dtypes(include='object').columns

# Replace empty or whitespace-only strings with np.nan in all object columns
data[categorical_columns] = data[categorical_columns].replace(r'^\s*$', np.nan, regex=True)

# Check missing values
data.isnull().sum().to_frame(name='Missing Values')  # TotalCharges had 11 missing values.
Missing values
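To see why this check matters, here is a minimal sketch on a made-up three-row frame (the values and the `toy` name are illustrative assumptions, not taken from the Telco file). It shows that a whitespace-only string is invisible to isnull() until it is swapped for np.nan.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one TotalCharges entry is a blank string
toy = pd.DataFrame({
    'customer': ['A', 'B', 'C'],
    'TotalCharges': ['29.85', ' ', '108.15'],
})

# The blank string is a regular value, so pandas does not flag it as missing
missing_before = int(toy['TotalCharges'].isnull().sum())

# After swapping empty/whitespace-only strings for NaN, the gap becomes visible
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)
missing_after = int(cleaned['TotalCharges'].isnull().sum())

print(missing_before, missing_after)
```

The first count misses the gap and the second one catches it, which is the same effect we saw with TotalCharges.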

Since we have established that TotalCharges has missing values, we will start by fixing its data type to be of a numerical type. Then fill in the missing values with the median.

Python
# Convert TotalCharges from object to numeric values
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# Fill missing values in the 'TotalCharges' column with median
median_total_charges = data['TotalCharges'].median()
data['TotalCharges'] = data['TotalCharges'].fillna(median_total_charges)  # Assign back instead of a chained inplace fill

Why the median?

The median is less affected by extreme values than the mean, making it a robust choice for filling missing data.
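As a quick illustration of that robustness, here is a small sketch with invented numbers (the `charges` series is an assumption for demonstration only). One extreme value drags the mean far from the typical customer, while the median stays put.

```python
import pandas as pd

# Hypothetical charges: four typical customers and one extreme value
charges = pd.Series([100.0, 200.0, 300.0, 400.0, 8000.0])

mean_value = charges.mean()      # pulled upward by the 8000 entry
median_value = charges.median()  # middle value, unaffected by how large the extreme is

print(f"mean={mean_value:.1f}, median={median_value:.1f}")
```

Filling a gap with the mean here would assign a value larger than four of the five customers, which is why the median is often the safer default for skewed columns.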

2.5. Check for Duplicate Rows

Sometimes datasets have repeated rows due to data entry errors or merging issues. Duplicates can skew analysis and make results misleading. Checking for duplicates helps us identify and remove them, ensuring each customer or record is counted only once for accurate insights.

Python
print("Duplicates:", data.duplicated().sum())  # 0 duplicates
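Our dataset had no duplicates, so there was nothing to remove. For cases where the count is not zero, here is a minimal sketch on a made-up frame (the `toy` data is an assumption) showing one common way to drop repeated rows with drop_duplicates().

```python
import pandas as pd

# Toy frame (hypothetical values): the first and last rows are identical
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female'],
    'tenure': [1, 34, 1],
    'MonthlyCharges': [29.85, 56.95, 29.85],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Keep in mind that a unique identifier such as customerID makes every row distinct, so it can be worth repeating the check on the feature columns alone.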

2.6. Check Statistics

Before diving deeper, it’s helpful to look at basic statistics of numerical columns. This gives us a quick sense of averages, spreads, and unusual values. By checking statistics like mean, median, min, max, and quartiles, we can spot outliers, errors, or surprising patterns that might need attention during cleaning or analysis.

Python
data.describe()
Statistical information

Key Observations:

  • SeniorCitizen is a binary variable, where ~16.2% of customers are senior citizens.
  • tenure has a wide spread, ranging from 0 to 72 months, indicating variability in customer lifespans.
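  • TotalCharges is right-skewed (its mean sits well above its median), with some high-paying customers pulling the max value up. MonthlyCharges is less clear-cut, so compare its mean and median before assuming the same shape.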
  • The 75th percentile of TotalCharges is around 3,786, but the maximum charges are over 8,600, suggesting a small group of long-term, high-value customers.

These statistics help establish baselines for deeper analysis and are important for understanding feature distributions prior to modeling or hypothesis testing.
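If you want a number to back up what you read off describe(), one option is to compare the mean with the median and compute the sample skewness. The sketch below uses invented values (the `values` series is an assumption) and the pandas skew() method.

```python
import pandas as pd

# Hypothetical column with a long right tail
values = pd.Series([20.0, 25.0, 30.0, 35.0, 120.0])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()  # positive for a right tail, negative for a left tail

# A mean above the median together with positive skewness points to right skew
print(f"mean={mean_value:.1f}, median={median_value:.1f}, skew={skewness:.2f}")
```

Running the same three lines on each numerical column of the real data is a cheap way to confirm or correct a first impression.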

Step 3: Cleaning the Data

Data cleaning is where our raw dataset starts its transformation into something truly useful. As data scientists, we know that real-world data is rarely perfect. It often comes with missing values, duplicates, inconsistent formatting, or outliers that can skew our analysis.

In this stage, our goal is to polish and prepare the data so it accurately represents reality. We’ll fill or remove missing values, fix inconsistencies, and ensure that each column is in the correct format for analysis. Think of it like prepping ingredients before cooking. If you skip this part, no matter how skilled you are with the recipe (your machine learning model), the final dish will suffer.

A well-cleaned dataset not only makes analysis easier but also ensures more reliable insights and predictions, which is the heart of data science.

3.1. Standardizing Categorical Values

Sometimes, our dataset likes to be a little dramatic with its wording. For example, instead of simply saying “No,” it might say things like “No internet service” or “No phone service.” While these phrases are technically accurate, for our purposes they mean the exact same thing: No.

If we leave these inconsistencies as they are, they can throw off our analysis and confuse our encoding process later. By standardizing these values to a clean, consistent “No,” we make life easier for both us and our future machine learning model. Think of it as teaching the dataset to use its “indoor voice” so everyone understands it clearly.

Python
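# Categorical (text) columns at this point in the workflow
categorical_cols = data.select_dtypes(include='object').columns

# Columns that have 'No internet service'
no_internet_services_cols = [col for col in categorical_cols if 'No internet service' in data[col].unique().tolist()]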

# Replace 'No internet service' with 'No'
data[no_internet_services_cols] = data[no_internet_services_cols].replace('No internet service', 'No')
    
# Column 'MultipleLines' replace 'No phone service' with 'No'
data['MultipleLines'] = data['MultipleLines'].replace('No phone service', 'No')

3.2. Dropping Unnecessary Columns

Some columns don’t add real value to our analysis or predictions. For example, certain columns might simply act as identifiers, offering no meaningful insight into customer behavior. If we keep them, they can add unnecessary noise, make our models more complex than they need to be, and even slow things down. So, we need to drop such columns.

Python
# Let's drop the 'customerID' column
data = data.drop(columns=['customerID'])  # 'columns=' already implies axis=1

Step 4: Exploratory Data Analysis (EDA)

Now that our dataset is cleaned and in good shape, it’s time to roll up our sleeves and get to know it on a deeper level. EDA is like the “getting to know you” stage in a relationship with your data; you’re looking for patterns, trends, and quirks that will help guide the rest of your project.

In this step, we’ll explore the dataset visually and statistically to uncover insights that raw tables alone can’t show. We’ll examine how different features interact, spot potential outliers, and identify relationships that might influence customer churn.

EDA is important because it helps us:

  • Understand the underlying structure of the data.
  • Validate assumptions before modeling.
  • Discover hidden patterns that could be crucial for prediction.
  • Decide which features are worth keeping, transforming, or discarding.

By the end of this stage, we won’t just have numbers, we’ll have a story forming about our customers and their behavior.

4.1. Distribution of the Target Variable (Churn)

Before diving into all the other features, let’s first look at our main character, Churn. Seeing how many customers stayed versus those who left gives us a sense of the challenge ahead. If most customers stayed, we might be dealing with an imbalanced dataset, which can influence how we approach modeling later.

Python
# Visualize the distribution of the target variable 'Churn'
plt.figure(figsize=(6, 4))
sns.countplot(data=data, x='Churn', palette='Set2')
plt.title('Churn Distribution')
plt.xlabel('Churn')
plt.ylabel('No. of Customers')
plt.grid(axis='y', linestyle='--', alpha=0.75)
plt.show()
Churn distribution
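A count plot shows the imbalance visually, and the exact class shares are easy to compute alongside it. Here is a minimal sketch on a made-up series (the `churn` values are an assumption) using value_counts(normalize=True).

```python
import pandas as pd

# Hypothetical target column: three customers stayed, one left
churn = pd.Series(['No', 'No', 'No', 'Yes'], name='Churn')

# normalize=True returns shares instead of raw counts
proportions = churn.value_counts(normalize=True)

print(proportions)
```

On the real data you would call the same method on data['Churn'] to get the share of customers who left.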

4.2. Univariate Analysis of Numerical Features

Now that we’ve met our target variable, let’s turn our attention to the numerical features. By exploring each one individually, we can uncover its general shape: is it normally distributed, skewed, or full of outliers? This step helps us spot patterns, potential data quality issues, and hints about which features might play a big role in predicting churn.

Let’s extract the numerical columns from our dataset:

Python
# Define numerical columns
numerical_columns = data.select_dtypes(include='number').columns

# Remove 'SeniorCitizen' since it is just a binary 0/1 flag
numerical_columns = numerical_columns.drop(['SeniorCitizen'])

print(f'Numerical Columns: {numerical_columns.to_list()}')
  • Numerical Columns: ['tenure', 'MonthlyCharges', 'TotalCharges']

Let’s visualize the distribution of these columns using histograms and box plots.

Python
# Create histograms and box plots for each numeric column
for col in numerical_columns:
    plt.figure(figsize=(14, 5))

    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data[col], bins=20, kde=True, color='skyblue')
    plt.title(f"Distribution of {col}", fontsize=14)
    plt.xlabel(col)
    plt.ylabel('Frequency')

    # Box plot
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col], color='lightgreen')  # Pass the column as x so the box is horizontal and matches the x-label
    plt.title(f"Boxplot of {col}", fontsize=14)
    plt.xlabel(col)

    plt.tight_layout()
    plt.show()
Tenure distribution
Total charges distribution
Monthly charges distribution

Summary

  • The visualizations reveal the presence of skewness and outliers, especially in the TotalCharges variable.
  • These insights are important for future steps such as feature scaling, outlier handling, and model selection, particularly for algorithms sensitive to distribution shapes.

4.3. Univariate Analysis of Categorical Variables

Next, we’ll zoom in on the categorical features, things like gender, contract type, and payment method. Looking at their distributions helps us see how customers are spread across different categories and whether certain groups seem more churn-prone. It’s a simple step, but it often reveals surprisingly strong signals hiding in plain sight.

Python
# Categorical columns that remain after cleaning (customerID dropped, TotalCharges now numeric)
categorical_columns = data.select_dtypes(include='object').columns

# Set up the grid
n_cols = 3
n_rows = (len(categorical_columns) + n_cols - 1) // n_cols  # Compute number of rows needed
plt.figure(figsize=(n_cols * 6, n_rows * 4))

# Loop through each categorical variable and create a bar plot
for i, col in enumerate(categorical_columns, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.countplot(data=data, x=col, palette='Set2', order=data[col].value_counts().index)
    plt.title(f"Distribution of {col}", fontsize=14)
    plt.xticks(rotation=45)

plt.tight_layout()  # Adjust spacing once, after all subplots are drawn
plt.show()
Univariate categorical

Summary

  • The bar plots gave us a quick snapshot of our customer base.
  • We saw that the dataset is fairly balanced by gender, but most customers do not have dependents and many are without partners.
  • Phone service is common, and about half of those users have multiple lines.
  • Fiber optic and DSL dominate internet usage, while many skip extra services like online security and tech support.
  • Streaming services are used by a good portion of customers, though slightly fewer than half.
  • Most people choose month-to-month contracts, suggesting they prefer flexibility over long-term commitments.
  • Paperless billing is popular, and electronic checks lead payment preferences, with automatic payment methods seeing moderate use.
  • Finally, most customers stayed, but almost 1 in 4 left—meaning churn is a real issue and the data is imbalanced.

4.4. Bivariate Analysis of Numerical Variables vs. Target (Churn)

Now that we’ve explored our numerical features on their own, it’s time to see how they relate to the very thing we care about most, Churn. This is where patterns begin to emerge. By comparing each numerical variable with churn, we can spot early warning signs, like whether shorter tenure or higher monthly charges might be nudging customers toward leaving. Think of this step as connecting the dots between what customers look like and whether they stay or go.

Python
# Create box plots for each numerical column against the target 'Churn'
plt.figure(figsize=(18, 5))

for i, col in enumerate(numerical_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(data=data, x='Churn', y=col)
    plt.title(f"Boxplot of {col}", fontsize=14)

plt.tight_layout()
plt.show()
Numerical vs churn

Summary

  • The box plots suggest that shorter tenure and higher monthly charges are associated with customer churn.
  • These features could be especially useful predictors in a machine learning model and may also guide business strategies like targeted retention offers for new high-paying customers.
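Box plots are easier to interpret with a few numbers next to them. As a rough sketch, assuming a small invented frame (`toy` and its values are not from the Telco data), a groupby on Churn gives the median of each numerical feature per group.

```python
import pandas as pd

# Toy frame (hypothetical values): two churned and three retained customers
toy = pd.DataFrame({
    'Churn': ['Yes', 'Yes', 'No', 'No', 'No'],
    'tenure': [2, 5, 40, 60, 24],
    'MonthlyCharges': [90.0, 80.0, 50.0, 65.0, 20.0],
})

# Median of each numerical feature within each churn group
medians = toy.groupby('Churn')[['tenure', 'MonthlyCharges']].median()

print(medians)
```

The medians correspond to the center lines of the boxes, so a gap between the 'Yes' and 'No' rows here matches a visible shift in the plot.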

4.5. Bivariate Analysis of Categorical Variable vs. Target (Churn)

Numbers tell one side of the churn story, but categories, like contract type, payment method, or internet service, often reveal the why behind it. In this step, we’ll break down each categorical feature by churn status to see which groups are more likely to leave. This helps us spot trends that numbers alone might hide, like whether month-to-month customers are quietly leading the exodus.

Python
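import math

# Categorical features to compare against the target (Churn itself is excluded)
categorical_cols = data.select_dtypes(include='object').columns.drop('Churn')

# Set up plot
n_cols = 2
n_rows = math.ceil(len(categorical_cols) / n_cols)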
plt.figure(figsize=(6 * n_cols, 4 * n_rows))

# Plot variables on count plot
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.countplot(data=data, x=col, hue='Churn', palette='Set2')
    plt.title(f'{col} vs Churn', fontsize=16)
    plt.xlabel(col, fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(rotation=45)

plt.subplots_adjust(hspace=0.8, wspace=0.2)
plt.tight_layout()
plt.show()
Categorical vs churn

Summary

Several categorical variables reveal strong relationships with churn:

  • Contract, PaymentMethod, and InternetService are especially informative.
  • Features related to support services (OnlineSecurity, TechSupport) suggest that lack of added value may push users away.
  • We’ll explore these further in modeling and hypothesis testing stages.
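Raw counts can mislead when categories differ in size, so it often helps to look at the churn rate within each category. The sketch below uses an invented frame (`toy` is an assumption) and pd.crosstab with normalize='index', which turns each row into shares that sum to 1.

```python
import pandas as pd

# Toy frame (hypothetical values): four customers per contract type
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 4 + ['Two year'] * 4,
    'Churn': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes'],
})

# Share of churned vs. retained customers within each contract type
churn_rate = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')

print(churn_rate)
```

The 'Yes' column is the churn rate per category, which is directly comparable across groups even when one group is much larger than another.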

4.6. Multivariate Analysis

So far, we’ve looked at features in isolation, either on their own or paired with churn. But real customer behavior is rarely driven by just one factor at a time. In this step, we’ll combine multiple variables to uncover patterns that only emerge when features interact, giving us a fuller picture of what drives customers to stay or go.

We’ll use a correlation matrix to assess linear relationships between numerical features.

Python
# Correlation matrix heatmap for numerical features
plt.figure(figsize=(8, 5))

# Calculate correlation matrix
correlation_matrix = data[numerical_columns].corr()

# Plot Heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features', fontsize=14)
plt.show()
Correlation matrix

Summary

  • The correlation analysis shows that tenure and TotalCharges are strongly related (0.83), reflecting the natural accumulation of charges over time.
  • MonthlyCharges also correlates moderately with TotalCharges (0.65), while its link to tenure is weak (0.25), suggesting monthly rates are largely independent of customer longevity.
  • Overall, TotalCharges is essentially a combination of tenure and MonthlyCharges, so it may be partially redundant for modeling, though no severe multicollinearity issues are present.

Step 5: Feature Engineering

Now that we’ve explored our dataset and uncovered key patterns, it’s time to shape our data into a form that’s ready for modeling. Feature Engineering is where we transform raw information into meaningful features that help our model make better predictions.

In this step, we’ll create new variables, modify existing ones, and encode categorical data so that machine learning algorithms can understand it. Think of it as giving our model the best possible toolkit to work with.

Feature engineering is important because it can:

  • Improve model accuracy by adding more relevant information.
  • Reduce noise and irrelevant details.
  • Make different types of data usable for algorithms.

By the end of this stage, we’ll have a clean, well-structured dataset with features that are not just machine-readable, but also optimized for predictive power.

5.1. Encoding Categorical Variables

When working with machine learning models, numbers are the language they understand, but our dataset still has plenty of words in it. Encoding categorical variables is like translating those words into a numerical form so our algorithms can make sense of them. In this step, we’ll turn features like gender or contract type into meaningful numbers without losing the information they carry. It’s a small change in appearance, but a big step toward making our data model-ready.

5.1.1. Binary Encoding of Yes/No Features

Some columns in our dataset are simple “yes” or “no” answers—almost like light switches that are either on or off. Binary encoding flips these answers into 1s and 0s, giving our models a clear signal without overcomplicating things. It’s quick, tidy, and perfect for features like Partner or Dependents, where the story is already black and white.

Python
# List of binary categorical columns to encode
binary_cols = [
    'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies'
]

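# Apply binary encoding on a copy so the cleaned data stays untouched
encoded_data = data.copy()  # Copy once, outside the loop, so every column's encoding is kept
for col in binary_cols:
    encoded_data[col] = encoded_data[col].map({'Yes': 1, 'No': 0})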

5.1.2. Encoding Remaining Categorical Features

Not all categorical features are as simple as “yes” or “no.” Some have multiple categories, like InternetService with options such as DSL, Fiber optic, or No. For these, we can’t just flip a switch to 1 or 0. Instead, we use techniques like one-hot encoding, which creates a separate column for each category, marking it with a 1 if it’s present and 0 otherwise.

It’s like giving each category its own name tag so the model can clearly tell them apart without assuming one is “greater” than the other.

Python
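# Exclude already encoded binary columns
remaining_cat_cols = [col for col in encoded_data.select_dtypes(include='object').columns if col not in binary_cols]

# Perform one-hot encoding using Pandas 'get_dummies()'
encoded_cols = pd.get_dummies(encoded_data[remaining_cat_cols], drop_first=True)

# Combine the one-hot columns with the rest of the data
data_encoded = pd.concat([encoded_data.drop(columns=remaining_cat_cols), encoded_cols], axis=1)

# Convert boolean columns to integer (0/1), leaving the numeric charge columns as floats
bool_cols = data_encoded.select_dtypes(include='bool').columns
data_encoded[bool_cols] = data_encoded[bool_cols].astype(int)

# Display the encoded data first few rows
data_encoded.head()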
Encoded data
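To make the effect of drop_first=True concrete, here is a minimal sketch on a single made-up column (the `toy` frame is an assumption). With three categories, get_dummies keeps two indicator columns, and the dropped first category becomes the baseline that is represented by zeros in both.

```python
import pandas as pd

# Toy column (hypothetical values) with three categories
toy = pd.DataFrame({'InternetService': ['DSL', 'Fiber optic', 'No', 'DSL']})

# drop_first=True removes the first category (alphabetically 'DSL')
dummies = pd.get_dummies(toy, drop_first=True).astype(int)

print(dummies)
```

Rows where both indicators are 0 are the DSL customers, so no information is lost and the columns are not perfectly collinear.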

5.2. Creating New Features

Sometimes the data tells a richer story when we combine what we already have into something new. Creating new features, often called feature creation or feature construction, is like adding fresh ingredients to a recipe to enhance the flavor.

For example, we might combine tenure and MonthlyCharges to estimate TotalCharges, or group contract lengths into “short-term” vs. “long-term” commitments. These new features can highlight hidden patterns, making it easier for our model to pick up on signals that the raw data alone might not reveal.

5.2.1. Creating the TotalServices feature

Instead of looking at add-on services such as OnlineSecurity, TechSupport, and StreamingTV individually, we can bundle them into a single TotalServices feature. In the code below, the count covers the six add-on services; PhoneService and InternetService are not included. This gives us a quick snapshot of a customer’s engagement level: someone using six services is clearly more invested than someone with just one.

From a churn prediction standpoint, this matters because customers with more services might be either more loyal (harder to leave) or more at risk if they feel they’re paying too much. By creating this feature, we give our model a clearer way to capture that relationship.

Python
# List the services columns
services_columns = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

# Create a copy of the encoded data
data_engineered = data_encoded.copy()

# Create the 'TotalServices' feature
data_engineered['TotalServices'] = data_engineered[services_columns].sum(axis=1)

# Display first few rows services columns and the new 'TotalServices' to confirm
data_engineered[services_columns + ['TotalServices']].head()
Totalservices

5.2.2. Creating the tenure_group feature

Here, we’re taking the continuous tenure variable (how many months a customer has been with the company) and grouping it into ranges, like “New,” “Established,” or “Long-term.”

Why? Because raw month counts might not tell the whole story; humans tend to think in stages rather than exact numbers. Grouping tenure helps us spot patterns, like whether most churn happens early in the relationship or after several years.

This transformation also makes it easier for some models to work with tenure-related patterns, especially when combined with categorical analysis later on.

Python
# Create a new categorical column 'tenure_group' based on tenure ranges
data_engineered['tenure_group'] = pd.cut(
    data_engineered['tenure'],
    bins=[-1, 12, 24, 48, 72],  # Define bin edges in months
    labels=['0-1 year', '1-2 years', '2-4 years', '4-6 years']  # Labels for each bin
)

data_engineered[['tenure', 'tenure_group']].head()
Tenure group
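One detail worth checking is where the boundary months land. By default pd.cut uses right-closed intervals, so a tenure of exactly 12 falls in the first bin, and the -1 lower edge is what lets a tenure of 0 be included. The sketch below tests this on a handful of made-up tenure values (the `tenure` series is an assumption).

```python
import pandas as pd

# Hypothetical tenures chosen to sit on and around the bin edges
tenure = pd.Series([0, 12, 13, 24, 48, 49, 72])

groups = pd.cut(
    tenure,
    bins=[-1, 12, 24, 48, 72],  # intervals are (-1, 12], (12, 24], (24, 48], (48, 72]
    labels=['0-1 year', '1-2 years', '2-4 years', '4-6 years'],
)

print(pd.DataFrame({'tenure': tenure, 'tenure_group': groups}))
```

Any tenure above 72 would fall outside the last edge and come back as NaN, which is worth remembering if the bins are reused on newer data.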

5.2.3. Creating the AverageChargesPerService Feature

Here, we’re adding a new metric that takes a customer’s MonthlyCharges and divides it by the number of services they subscribe to.

\(AverageChargesPerService = \frac{MonthlyCharges}{TotalServices}\)

The idea is to get a clearer sense of how much each service is costing them on average. A high AverageChargesPerService might indicate premium or niche services, while a lower value could suggest bundled or discounted packages.

This feature can help us uncover whether customers paying more per service are more likely to churn, giving the model another useful perspective on customer behavior.

Python
# Create 'AverageChargesPerService' feature
data_engineered['AverageChargesPerService'] = data_engineered.apply(lambda row: row['MonthlyCharges'] / row['TotalServices'] if row['TotalServices'] > 0 else 0, axis=1)

# Verify the new column
data_engineered[['MonthlyCharges', 'TotalServices', 'AverageChargesPerService']].head()
Average charges per service
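Row-wise apply works fine at this size, but it gets slow on larger tables. As an alternative sketch (not the approach used above), the same rule can be written with column operations: mask zero service counts as NaN before dividing, then fill those rows with 0. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): the middle customer has no counted services
toy = pd.DataFrame({
    'MonthlyCharges': [60.0, 20.0, 90.0],
    'TotalServices': [3, 0, 6],
})

# Replace zero counts with NaN so the division never divides by zero
services = toy['TotalServices'].where(toy['TotalServices'] > 0)

# Rows with no services come out as NaN and are then set to 0, matching the rule above
toy['AverageChargesPerService'] = (toy['MonthlyCharges'] / services).fillna(0)

print(toy)
```

Both versions give the same values; the vectorized form simply avoids calling a Python function once per row.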

5.2.4. Creating the HasMultipleServices Feature

Here, we’re creating a simple flag that tells us whether a customer subscribes to more than one service.

\(HasMultipleServices = \left\{ \begin{array}{cl}
1, &\text{if } TotalServices > 1\\
0, &\text{otherwise}
\end{array} \right.\)

Why? Because customers with multiple services are often more “locked in” and might be less likely to churn. Think of it as having more ties that bind them to the company. On the other hand, it could also mean higher bills, which might prompt some to leave.

By turning this into a clear Yes/No feature, we provide our model with an easy way to determine if having multiple services plays a role in retention or churn.

Python
# Create 'HasMultipleServices' feature
data_engineered['HasMultipleServices'] = (data_engineered['TotalServices'] > 1).astype(int)

# Preview the new features
data_engineered[['TotalServices', 'HasMultipleServices']].head()

Step 6: Mini-EDA on Newly Engineered Features

Now that we’ve rolled up our sleeves and crafted some fresh new features, it’s time to give them a quick check-up before sending them into battle.

This “mini-EDA” is like a sneak peek, making sure our engineered features actually make sense, behave as expected, and add value to the dataset. We’ll look at their distributions, see how they relate to churn, and check if they’re capturing the patterns we hoped they would.

Doing this step is important because not all engineered features turn out to be useful (or even logical) once we see them in action. A quick analysis helps us spot duds early, confirm promising patterns, and refine any features that need a little extra polish.

6.1. Encode tenure_group as ordinal integers

Since tenure_group represents a natural order of customer loyalty over time, we’ll give each group a corresponding number. This way, our models can understand that “1-2 years” comes after “0-1 year” instead of treating them as unrelated categories.

Python
# Mapping for tenure buckets -> ordinal integers
tenure_map = {
    '0-1 year': 0,
    '1-2 years': 1,
    '2-4 years': 2,
    '4-6 years': 3,
}

# Create the encoded column (will be NaN where tenure_group is missing or unexpected)
data_engineered['tenure_group_encoded'] = data_engineered['tenure_group'].map(tenure_map)

# Let's convert 'tenure_group_encoded' to type integer
data_engineered['tenure_group_encoded'] = data_engineered['tenure_group_encoded'].astype(int)

# Preview the encoding
data_engineered[['tenure_group', 'tenure_group_encoded']].head()
Tenure group encoded
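Because pd.cut returns an ordered categorical, there is also a shortcut worth knowing: the .cat.codes accessor gives the position of each label in that order. This is an alternative sketch rather than the method used above, shown on made-up tenure values (the `tenure` series is an assumption).

```python
import pandas as pd

# Hypothetical tenures, one per bucket
tenure = pd.Series([3, 18, 30, 70])

tenure_group = pd.cut(
    tenure,
    bins=[-1, 12, 24, 48, 72],
    labels=['0-1 year', '1-2 years', '2-4 years', '4-6 years'],
)

# Integer position of each label in the category order (missing values would be -1)
codes = tenure_group.cat.codes

print(codes.tolist())
```

The explicit dictionary keeps the mapping visible to readers, while .cat.codes saves typing when the labels already carry the right order.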

6.2. Descriptive statistics for the new features

Before we move forward, let’s take a quick pulse check on our newly engineered features. By looking at their descriptive statistics, we can spot any unexpected values, strange distributions, or interesting trends, basically making sure our new creations are behaving as intended.

Python
# Descriptive stats for the numerical engineered features
numerical_engineered_features = [
    'TotalServices', 'AverageChargesPerService', 'HasMultipleServices', 'tenure_group_encoded'
]

data_engineered[numerical_engineered_features].describe()
Feature engineered stats

6.3. Visual checks (countplots & histogram)

Now it’s time to give our new features a visual health check. Using countplots and histograms, we can quickly see how these variables are distributed and whether any unusual patterns pop up. It’s like holding them up to the light to spot hidden quirks before we move on.

Python
plt.figure(figsize=(12, 10))

# 1. TotalServices countplot
plt.subplot(2, 2, 1)
order_ts = sorted(data_engineered['TotalServices'].dropna().unique())
sns.countplot(x='TotalServices', data=data_engineered, order=order_ts, palette='Set2')
plt.title('Count of TotalServices (number of services per customer)')
plt.xlabel('TotalServices (0 = none, up to 6)')
plt.ylabel('Count')

# 2. AverageChargesPerService distribution
plt.subplot(2, 2, 2)
sns.histplot(data_engineered['AverageChargesPerService'].replace(0, np.nan).dropna(), bins=30, kde=True)
# Note: replace(0, np.nan) is optional — zeros represent customers with no services
plt.title('Distribution of AverageChargesPerService')
plt.xlabel('AverageChargesPerService')
plt.ylabel('Frequency')

# 3. HasMultipleServices (binary)
plt.subplot(2, 2, 3)
sns.countplot(x='HasMultipleServices', data=data_engineered, palette='Set2')
plt.title('HasMultipleServices (0 = <=1 service, 1 = >1 service)')
plt.xlabel('HasMultipleServices')
plt.ylabel('Count')

# 4. tenure_group_encoded counts with readable labels
plt.subplot(2, 2, 4)
order_tg = [0, 1, 2, 3]
sns.countplot(x='tenure_group_encoded', data=data_engineered, order=order_tg, palette='Set2')
plt.title('Tenure groups (encoded)')
plt.xlabel('tenure_group_encoded')
plt.xticks(ticks=[0, 1, 2, 3], labels=['0-1y', '1-2y', '2-4y', '4-6y'])
plt.ylabel('Count')

plt.tight_layout()
plt.show()
Feature engineered viz

Mini-EDA Summary for Newly Engineered Features

  • TotalServices – Almost a third of customers (31.5%) have none of the six add-on services counted here, with steadily fewer customers as the number of services rises. This pattern hints that bundling services might help with retention.
  • HasMultipleServices – A slight majority (54.8%) have more than one service, suggesting they may be more committed and less likely to churn.
  • tenure_group_encoded – Two big groups stand out: very new customers (0–1 year) and long-term customers (4–6 years), aligning with typical churn patterns where newcomers leave more often.
  • AverageChargesPerService – Many customers pay nothing (no active services), while active ones spend a median of ~$20 per service, with a small set of high spenders who might behave differently.

Overall – These features capture behavior, lifecycle stage, and spending patterns, giving us promising variables to test for their impact on churn.

6.4. Correlation heatmap including engineered features

Let’s see how our newly engineered features mingle with the rest of the dataset. By adding them to a correlation heatmap, we can spot fresh relationships, confirm if they bring unique value, or catch any signs of redundancy early on.

Python
# Columns to include in correlation check
corr_cols = [
    'tenure', 'MonthlyCharges', 'TotalCharges',
    'TotalServices', 'AverageChargesPerService', 'HasMultipleServices',
    'tenure_group_encoded', 'Churn'
]

# Compute correlation matrix (drop rows with missing values among these columns)
corr_df = data_engineered[corr_cols].dropna()  # dropna to avoid issues
corr_matrix = corr_df.corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', center=0, linewidths=.5)
plt.title('Correlation matrix including engineered features')
plt.show()
[Figure: Correlation matrix including engineered features]

Correlation Insights – Engineered Features Included

  • Tenure & Churn (–0.35) – Moderate negative link: the longer a customer stays, the less likely they are to churn. A classic loyalty effect.
  • Tenure Group Encoded & Churn (–0.35) – Mirrors the raw tenure relationship, confirming our grouping didn’t lose the churn signal.
  • MonthlyCharges & Churn (+0.19) – Slight upward trend: higher bills are loosely tied to higher churn, hinting at some price sensitivity.
  • AverageChargesPerService & Churn (+0.24) – Weak-to-moderate link: paying more per service seems connected to higher churn risk, especially for customers with few services.
  • TotalCharges & Churn (–0.20) – Weak negative tie: big spenders (often long-tenured) churn less.
  • TotalServices (–0.09) & HasMultipleServices (–0.05) – Barely negative, suggesting service bundling offers only a small retention boost here.

Notable inter-feature relationships

  • Tenure & Tenure Group Encoded (0.96) – Almost identical; expected.
  • TotalServices & HasMultipleServices (0.85) – Strong link; the flag is basically a simpler version of service count.
  • MonthlyCharges & TotalServices (0.72) – More services generally mean higher bills.

Bottom line – Tenure-related measures remain the strongest churn indicators, with spending patterns coming next. Service counts may add value when paired with other features, but on their own they’re weaker predictors.
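
To make the redundancy check concrete, here is a minimal sketch that scans the inter-feature correlations reported above and flags pairs above a cutoff. The 0.8 threshold and the helper name `flag_redundant_pairs` are illustrative assumptions, not part of the original analysis.

```python
# Inter-feature correlations as reported in the heatmap discussion above
feature_corr = {
    ("tenure", "tenure_group_encoded"): 0.96,
    ("TotalServices", "HasMultipleServices"): 0.85,
    ("MonthlyCharges", "TotalServices"): 0.72,
}


def flag_redundant_pairs(corr_by_pair, threshold=0.8):
    """Return feature pairs whose absolute correlation meets the (assumed) cutoff."""
    return [pair for pair, r in corr_by_pair.items() if abs(r) >= threshold]


redundant = flag_redundant_pairs(feature_corr)
print(redundant)
```

Pairs flagged this way are candidates for keeping only one of the two features, although whether that helps depends on the model used later.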

Step 7: Hypothesis Testing

Now that we’ve explored our data and engineered new features, it’s time to turn our observations into testable ideas. Hypotheses are essentially our educated guesses: what we expect to be true about the factors driving churn.

By stating these upfront, we create a clear plan for what to investigate, making our analysis more focused and purposeful. Hypotheses help us:

  • Stay on track and avoid aimless exploration
  • Test assumptions with actual evidence
  • Connect data insights to real-world reasoning

By the end of this step, we’ll have a set of clear, testable statements that guide the rest of our analysis and help us understand whether our assumptions hold true.

7.1 Hypothesis 1: MonthlyCharges vs Churn – (t-Test)

Our first hunch is that customers who churn might be paying higher monthly charges than those who stay. To check this, we’ll compare the average monthly bill between the two groups using a t-test. If the difference is significant, it could mean pricing plays a role in whether customers stick around.

Question: Are average monthly charges different between customers who churned and those who did not?

Hypotheses (two-sided):

  • H₀: μ_churn = μ_no_churn (mean MonthlyCharges are equal)
  • H₁: μ_churn ≠ μ_no_churn (means differ)

Test choice & assumptions:

  • We’ll perform an independent samples t-test.
  • If both groups have similar spread (variance), we use the standard t-test.
  • If spreads differ, we use Welch’s t-test, which is more reliable.
  • Significance level: α = 0.05.

We will:

  1. Prepare data and get group stats.
  2. Run Levene’s test (H₀: variances equal).
  3. If Levene p < α → use Welch’s t-test (equal_var=False).
    Else → use standard t-test (equal_var=True).
  4. Report t-statistic, p-value, 95% CI for mean difference, and an effect-size (Cohen’s d).

7.1.1. Prepare Data & Descriptive Stats

Before running the t-test, we’ll first get our data in order. This means keeping only the columns we need, dropping any rows with missing values, and splitting customers into “churned” and “stayed” groups.

Python
# Keep only relevant columns and drop rows with missing values
df_h1 = data_engineered[['MonthlyCharges', 'Churn']].dropna()

# Split MonthlyCharges into churned (Churn == 1) and stayed (Churn == 0) groups
group_churn = df_h1[df_h1['Churn'] == 1]['MonthlyCharges'].astype(float)
group_no_churn = df_h1[df_h1['Churn'] == 0]['MonthlyCharges'].astype(float)

7.1.2. Levene’s Test for Equal Variances

Before we compare the means, we need to check if both groups have similar variance (spread of values). Levene’s test helps us do exactly that. If the variances are equal, we can use the standard version of the t-test; if not, we’ll adjust our approach to avoid misleading results.

Python
# Levene's test for equality of variances
lev_stat, lev_p = stats.levene(group_churn, group_no_churn, center='median')  # center='median' is robust

print("--- Levene's test for equal variances ---")
print(f"stat = {lev_stat:.4f}, p = {lev_p:.4f}")

alpha = 0.05
if lev_p < alpha:
    print(f"p < {alpha} → variances appear different (reject H0). We'll use Welch's t-test (equal_var=False).")
    equal_var_flag = False
else:
    print(
        f"p >= {alpha} → no evidence variances differ (fail to reject H0). We can use standard t-test (equal_var=True).")
    equal_var_flag = True
--- Levene's test for equal variances ---
stat = 362.1557, p = 0.0000
p < 0.05 → variances appear different (reject H0). We'll use Welch's t-test (equal_var=False).

7.1.3. Run the t-test (Standard or Welch based on Levene)

Now that we know whether the group variances are equal, we can run the appropriate t-test. If Levene’s test said the variances are similar, we’ll use the standard version. If not, we’ll switch to Welch’s t-test, which is better when spreads differ. Either way, this will tell us if the difference in average monthly charges between churned and non-churned customers is statistically significant.

Python
# Run the appropriate t-test
t_stat, p_value = stats.ttest_ind(group_churn, group_no_churn, equal_var=equal_var_flag, nan_policy='omit')

print("--- Independent samples t-test result ---")
print(f"t-statistic = {t_stat:.4f}")
print(f"p-value     = {p_value:.6f}")
print(f"test used   = {'standard t-test (equal var assumed)' if equal_var_flag else 'Welch t-test (unequal var)'}")
print("--- Decision ---")
if p_value < alpha:
    print(f"p-value < {alpha} → Reject H₀: there is a statistically significant difference in MonthlyCharges.")
else:
    print(f"p-value >= {alpha} → Fail to reject H₀: no statistical evidence of a difference in MonthlyCharges.")
--- Independent samples t-test result ---
t-statistic = 18.4150
p-value = 0.000000
test used = Welch t-test (unequal var)

--- Decision ---
p-value < 0.05 → Reject H₀: there is a statistically significant difference in MonthlyCharges.
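
The plan for this test also lists a 95% confidence interval for the mean difference and Cohen’s d. One way to compute them is sketched below using only the standard library. Several things here are assumptions for illustration: the helper name `welch_summary`, the made-up charge values, and the use of a normal critical value in place of the t quantile (reasonable only when both groups are large, as they are in this dataset).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_summary(a, b, conf=0.95):
    """Mean difference, Welch standard error, approximate CI, and Cohen's d."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 in the denominator)

    diff = mean(a) - mean(b)
    se = sqrt(va / na + vb / nb)  # Welch standard error (no equal-variance assumption)

    # Normal critical value: a large-sample approximation to the t quantile
    z_crit = NormalDist().inv_cdf(0.5 + conf / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # Cohen's d using the pooled standard deviation
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return {"diff": diff, "se": se, "ci": ci, "cohens_d": diff / pooled_sd}


# Made-up monthly charges, for illustration only (not values from the dataset)
churned = [70.0, 85.5, 99.9, 74.4, 80.1, 95.0]
stayed = [20.0, 55.3, 60.1, 45.0, 89.9, 25.2, 65.0]

result = welch_summary(churned, stayed)
print(result)
```

With the real data, `group_churn` and `group_no_churn` would be passed in instead of the toy lists. If SciPy is available, `stats.t.ppf` with the Welch–Satterthwaite degrees of freedom would give a more exact critical value than the normal approximation used here.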

Summary – Hypothesis Test 1: MonthlyCharges vs Churn

  • Customers who churned paid, on average, more per month than those who stayed. The difference is statistically significant, meaning it is very unlikely to be the result of random chance alone.
  • Business takeaway: Higher monthly charges appear linked to higher churn rates. This suggests pricing pressure may be pushing customers away. Possible strategies could include:
    • targeted retention discounts,
    • added perks for premium tiers, or
    • reviewing value for high-cost plans.

7.2. Hypothesis 2: Contract Type vs Churn – (Chi-square Test)

Now we shift our focus from monthly spending to contract type, to see if the length or flexibility of a customer’s plan relates to their likelihood of leaving. Using a Chi-square test, we’ll check whether churn rates differ significantly between customers on month-to-month, one-year, or two-year contracts. If a relationship exists, it could reveal how commitment levels influence retention.

Question: Is churn associated with contract type?

Hypotheses:

  • H₀: Churn is independent of Contract type (no association).
  • H₁: Churn is not independent of Contract type (there is an association).
  • Significance level: α = 0.05.

7.2.1. Prepare data & contingency table

First, we’ll tidy up the data so that contract type and churn status are ready for comparison. Then, we’ll create a contingency table, a simple grid that shows how many customers fall into each combination of contract type and churn outcome. This gives us a clear snapshot before running the Chi-square test.

Python
# Select only relevant columns
df_h2 = data[['Contract', 'Churn']]

# Build contingency table: rows = Contract, cols = Churn
contingency = pd.crosstab(df_h2['Contract'], df_h2['Churn'], margins=False)

# Run chi-square test 
chi2, p, dof, expected = stats.chi2_contingency(contingency)

# Print test stats
print("Chi-square statistic:", round(chi2, 4))
print(f"p-value: {p:.4f}")
print("Degrees of freedom:", dof)

print("--- Decision ---")
if p < alpha:  # use the chi-square p-value, not p_value left over from the t-test
    print(f"p-value < {alpha} → Reject H₀: Churn is related to Contract type.")
else:
    print(f"p-value >= {alpha} → Fail to reject H₀: No evidence of relationship between Churn and Contract type.")
Chi-square statistic: 1184.5966
p-value: 0.0000
Degrees of freedom: 2

--- Decision ---
p-value < 0.05 → Reject H₀: Churn is related to Contract type.

Summary – Hypothesis Test 2: Contract Type vs Churn

A Chi-square test was performed to examine the relationship between Churn and Contract type. The results showed a Chi-square statistic of 1184.60 with 2 degrees of freedom, and a p-value < 0.05. Since the p-value is below the significance threshold, we reject the null hypothesis and conclude that Churn is significantly related to Contract type.
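
To show what the Chi-square statistic is built from, the sketch below computes expected counts, the statistic, and the degrees of freedom by hand, and adds Cramér’s V as a rough effect size. The counts in `toy_table` are hypothetical and are not the real contingency table; the helper name `chi_square_summary` is also an assumption for illustration.

```python
from math import sqrt


def chi_square_summary(table):
    """Chi-square statistic, degrees of freedom, and Cramér's V for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2_value = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2_value += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(table[0])
    dof_value = (n_rows - 1) * (n_cols - 1)
    cramers_v = sqrt(chi2_value / (n * min(n_rows - 1, n_cols - 1)))
    return chi2_value, dof_value, cramers_v


# Hypothetical counts: rows = three contract types, columns = (stayed, churned)
toy_table = [
    [60, 40],
    [85, 15],
    [95, 5],
]

toy_chi2, toy_dof, toy_v = chi_square_summary(toy_table)
print(f"chi2 = {toy_chi2:.3f}, dof = {toy_dof}, Cramér's V = {toy_v:.3f}")
```

Because the p-value becomes tiny with thousands of customers, an effect size such as Cramér’s V (0 = no association, 1 = perfect association) can be a helpful complement when judging how strong the relationship is.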

7.3 Hypothesis 3: TechSupport vs Churn – (Two-Proportion z-Test)

Here, we’re looking at whether the churn rate differs meaningfully between customers who have tech support and those who don’t. It’s a straightforward way to check if providing (or not providing) this service is linked to customer retention. By running a two-proportion z-test, we can see if any observed difference in churn rates is statistically significant or just random noise, and that can guide whether tech support should be a bigger part of our retention strategy.

Question: Is the proportion of churn different between customers with and without TechSupport?

Hypotheses (two-sided):

  • H₀: p_with = p_without (churn proportion is equal for both groups)
  • H₁: p_with ≠ p_without (churn proportion differs between the two groups)

Test choice & assumptions:

  • We’ll use a two-proportion z-test because:
    • Outcome variable Churn is binary (0 = No, 1 = Yes).
    • Group variable TechSupport has exactly two categories (Yes/No).
    • We want to compare proportions across the two groups.
  • Requirements:
    • Each group’s sample size should be large enough for the normal approximation to be valid (rule of thumb: np ≥ 5 and n(1–p) ≥ 5 for each group).
  • Significance level: α = 0.05.

We will:

  1. Prepare data by ensuring Churn is coded as 0/1 and selecting TechSupport groups.
  2. Count churn cases and total customers in each group.
  3. Calculate proportions for each group.
  4. Calculate the pooled proportion and standard error.
  5. Compute the z-statistic and p-value (two-tailed).
  6. Interpret results and check if we reject the null hypothesis.

7.3.1 Prepare Data

In this step, we set up the data needed for our two-proportion z-test. That means:

  • Filtering and grouping customers based on whether they have TechSupport (“Yes” or “No”).
  • Counting churn events in each group so we know exactly how many customers left versus stayed.
  • Calculating churn proportions for both groups, which will give us the baseline comparison we’re testing.

By the end of this step, we’ll have a clean contingency-style summary that shows churn counts and rates for customers with and without tech support, ready for statistical testing.

Python
# Separate the two groups: With and Without Tech Support
group_with = data[data['TechSupport'] == 1]
group_without = data[data['TechSupport'] == 0]

n1 = len(group_with)     # total in group 1
n2 = len(group_without)  # total in group 2

x1 = group_with['Churn'].sum()     # churn count in group 1
x2 = group_without['Churn'].sum()  # churn count in group 2

n1, x1, n2, x2
(2044, np.int64(310), 4999, np.int64(1559))
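
The sample-size requirement listed earlier can be checked directly from these counts. This is a minimal sketch: the helper name `normal_approx_ok` is an assumption, and it treats the observed churn and non-churn counts as n·p and n·(1–p).

```python
def normal_approx_ok(events, n, minimum=5):
    """True if both the event count and the non-event count reach the minimum."""
    return events >= minimum and (n - events) >= minimum


# (churned, total) per group, taken from the counts printed above
groups = {
    "with_tech_support": (310, 2044),
    "without_tech_support": (1559, 4999),
}

checks = {name: normal_approx_ok(x, n) for name, (x, n) in groups.items()}
print(checks)
```

Both groups clear the threshold by a wide margin, so the normal approximation behind the z-test looks reasonable here.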

7.3.2 Calculate Sample Proportions

Now that we’ve separated customers into TechSupport: Yes and TechSupport: No groups, the next step is to figure out what proportion of each group actually churned.

We do this by dividing the number of churned customers in each group by the total number of customers in that group.

This gives us two key numbers:

  • The churn rate for customers with tech support.
  • The churn rate for customers without tech support.

These proportions are the heart of our two-proportion z-test: they let us see, in percentage terms, how much more (or less) likely customers without tech support are to leave compared to those who have it.

Python
p1 = x1 / n1
p2 = x2 / n2

print(f"Proportion churn WITH Tech Support: {p1:.4f}")
print(f"Proportion churn WITHOUT Tech Support: {p2:.4f}")
Proportion churn WITH Tech Support: 0.1517
Proportion churn WITHOUT Tech Support: 0.3119

7.3.3 Calculate the Pooled Proportion and Standard Error

Under the null hypothesis, both groups share the same churn rate, so we estimate that common rate by pooling the two groups: total churned customers divided by total customers. We then use this pooled proportion to compute the standard error of the difference between the two sample proportions, which tells us how much the gap would typically vary from sample to sample if tech support made no difference.

Python
from math import sqrt  # needed for the standard error below

# Pooled proportion
p_pool = (x1 + x2) / (n1 + n2)

# Standard error of the difference in proportions (under H₀)
se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))

p_pool, se
(np.float64(0.2653698707936959), 0.011591963489004459)

7.3.4 Compute z-Statistic and p-Value

Now that we know the churn proportions for each group, it’s time to see if the gap between them is just random noise or a real, statistically significant difference.

We’ll use the two-proportion z-test to crunch the numbers. The z-statistic measures how far apart the proportions are in standard error units, and the p-value tells us how likely a gap at least this large would be if the true churn rates were equal.

If that p-value is small enough (below our threshold), we can confidently say that tech support availability is linked to differences in churn rates.

Python
# z statistic
z_stat = (p1 - p2) / se

# Two-tailed p-value
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

z_stat, p_value
(np.float64(-13.81983022448171), np.float64(0.0))
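
The p-value prints as exactly 0.0 because, for a |z| this large, the normal CDF rounds to 1.0 in floating point, so `1 - cdf` loses all precision. The sketch below recomputes the test from the counts above and takes the tail area directly with `math.erfc`, which keeps the tiny value. The function name `two_proportion_z_test` is an assumption for illustration; SciPy’s `stats.norm.sf` is the equivalent survival function if you prefer to stay within SciPy.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
    p_two_sided = erfc(abs(z) / sqrt(2))
    return z, p_two_sided


# Counts from section 7.3.1: churned and total customers with / without tech support
z_check, p_check = two_proportion_z_test(310, 2044, 1559, 4999)
print(f"z = {z_check:.4f}, p = {p_check:.3e}")
```

The conclusion does not change, since the p-value is still far below α = 0.05, but reporting a very small number is more accurate than reporting zero.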

7.3.5 Decision

With the p-value in hand, we can finally make the call. If it’s below our chosen significance level, we reject the null hypothesis and conclude that churn rates truly differ between customers with and without tech support. If not, we keep the null and chalk up any difference to chance. This step is where all the number crunching pays off, turning statistical output into a clear “yes, there’s a link” or “no, there isn’t.”

Python
alpha = 0.05
if p_value < alpha:
    decision = "Reject H₀ — Statistically significant difference in churn proportions."
else:
    decision = "Fail to reject H₀ — No statistically significant difference."

print(f"Z = {z_stat:.4f}, p = {p_value:.4f} → {decision}")
Z = -13.8198, p = 0.0000 → Reject H₀ — Statistically significant difference in churn proportions.

Summary – Hypothesis Test 3: Tech Support vs Churn

Our analysis shows a clear link between tech support availability and customer churn. Customers without tech support churn at about 31%, which is roughly double the 15% churn rate of those who have it.

The two-proportion z-test confirmed this difference is statistically significant (z = -13.82, p < 0.0001), so we can confidently reject the idea that the gap is due to chance.

Business implication:

Lack of tech support appears to be a major churn risk factor. Offering or actively promoting tech support—especially to high-risk groups—could be an effective retention strategy.

Big picture:

Combined with earlier findings on contract length and monthly charges, it’s clear that churn is strongly associated with:

  • Higher monthly costs
  • Shorter contracts
  • No tech support

Targeting these factors could meaningfully reduce churn and improve customer lifetime value.

Step 8: Conclusion & Next Steps

We analyzed the Telco Customer Churn dataset to uncover key drivers of churn. After cleaning the data, exploring patterns, engineering new features, and running statistical tests, three clear insights emerged:

  • Contract Type shows a very strong association with churn: month-to-month customers churn far more than those with longer contracts.
  • Higher Monthly Charges are linked to higher churn.
  • Lack of Tech Support is associated with roughly double the churn rate (about 31% vs. 15%), suggesting a major retention opportunity.

Next Steps

  • Modeling: Build and evaluate churn prediction models (Logistic Regression, Random Forest, XGBoost) using ROC-AUC and feature importance to validate drivers.
  • Business Actions: Target high-risk customers with contract upgrade incentives, service bundles, and tech support offers.
  • Future Analysis: Incorporate behavioral data, run survival analysis, and track churn trends to measure impact over time.

Final Thought

By blending statistical rigor with clear insights, we turned churn data into practical strategies — an approach that can guide retention efforts in any subscription-based business.
