Predict Appointments — Logistic Regression with Scikit-Learn

Learn logistic regression with scikit-learn by predicting patient appointment no-shows. A beginner-friendly Machine Learning project with clear steps.

In this project, we use logistic regression with scikit-learn to predict whether a patient will show up for their medical appointment. It’s a simple but powerful example of how machine learning can help improve healthcare services.

Hospitals often face a common problem: patients who book appointments but never show up. These missed visits waste time, delay care, and cost money. By learning patterns from past appointments, we can predict who is more likely to miss theirs.

This project walks through every step:

  • cleaning and preparing real appointment data,
  • exploring key factors that affect attendance,
  • building and training a logistic regression model,
  • and finally, testing how well it predicts outcomes.

This project is beginner-friendly and practical, showing how machine learning can solve real problems in healthcare using scikit-learn, one of the most popular Python libraries for data science.

You can also find the complete Jupyter Notebook for this project on GitHub.

1. Importing the Libraries

Before working with the dataset, we start by importing all the Python libraries we’ll need. Each library plays a specific role in the machine learning workflow, from data handling to visualization and modeling.

Python
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Model building and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Statistical tests
from scipy.stats import ttest_ind, chi2_contingency, levene

Here’s what each of them helps us do:

  • pandas and NumPy: for loading, exploring, and preparing data.
  • matplotlib and seaborn: for visualizing trends and relationships.
  • scikit-learn (sklearn): for splitting the data, scaling features, training models, and measuring performance.
  • scipy.stats: for running basic statistical tests to compare variables.

These imports set up the foundation we’ll build on throughout the project.

2. Importing and Previewing the Dataset

Now that the libraries are ready, the next step is to load the dataset and take a first look at what we’re working with.

The dataset we’ll use is called “Healthcare Appointment Dataset”, available on Kaggle. It contains information about patient appointments, including demographics, health conditions, and whether or not they showed up.

We’ll start by loading it using pandas:

Python
# Load the dataset
data = pd.read_csv("healthcare_noshows.csv")

# Preview the first few rows
data.head()
Healthcare appointment dataset preview
  • The .head() function displays the first five rows of the dataset.
  • This gives us a quick look at the structure: the columns, data types, and a few sample records.

We can also check for missing values or basic info:

Python
# Quick overview of the dataset
data.info()
RangeIndex: 106987 entries, 0 to 106986
Data columns (total 15 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   PatientId       106987 non-null  float64
 1   AppointmentID   106987 non-null  int64
 2   Gender          106987 non-null  object
 3   ScheduledDay    106987 non-null  object
 4   AppointmentDay  106987 non-null  object
 5   Age             106987 non-null  int64
 6   Neighbourhood   106987 non-null  object
 7   Scholarship     106987 non-null  bool
 8   Hipertension    106987 non-null  bool
 9   Diabetes        106987 non-null  bool
 10  Alcoholism      106987 non-null  bool
 11  Handcap         106987 non-null  bool
 12  SMS_received    106987 non-null  bool
 13  Showed_up       106987 non-null  bool
 14  Date.diff       106987 non-null  int64
dtypes: bool(7), float64(1), int64(3), object(4)

From the summary above, we can see that the dataset contains 106,987 records and 15 columns. The features include a mix of numerical, categorical, and boolean variables:

  • PatientId and AppointmentID uniquely identify each record.
  • Gender, ScheduledDay, AppointmentDay, and Neighbourhood are categorical or date-based fields.
  • Age and Date.diff are numerical variables.
  • The remaining columns (like Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, and SMS_received) are boolean features representing patient conditions or actions.
  • The target variable is Showed_up, which indicates whether a patient attended their appointment.

This gives us a clear picture of what type of data we’re working with before we begin cleaning and exploration.

3. Data Cleaning

Before building our model, we need to prepare the dataset for analysis. This process involves a few key steps:

  • Removing duplicate records (if any).
  • Checking for missing values.
  • Dropping unnecessary columns.
  • Correcting data types where needed.
  • Checking for inconsistencies in column names and values.

By cleaning the data, we make sure it’s consistent, accurate, and ready for feature engineering.

3.1 Checking for Duplicates

We first check whether the dataset contains any duplicate records.

Python
print(f'Duplicates: {data.duplicated().sum()}')
Duplicates: 0
  • There are no duplicate entries, which is a good sign.

3.2 Checking for Missing Values

Next, we check if there are any missing or null values in the dataset.

Python
print(f'Missing Values: {data.isnull().sum().sum()}')
Missing Values: 0
  • No missing data means we can proceed without worrying about imputation or dropping records.

3.3 Dropping Unnecessary Columns

Some columns are identifiers and don’t contribute to prediction, such as PatientId and AppointmentID. We can safely remove them.

Python
# Drop the identifier columns
data = data.drop(columns=['PatientId', 'AppointmentID'])

3.4 Fixing Data Types

The columns AppointmentDay and ScheduledDay are stored as text (object type). We convert them to datetime format to make date-based analysis easier.

Python
# Convert date columns
date_cols = ['AppointmentDay', 'ScheduledDay']
data[date_cols] = data[date_cols].apply(pd.to_datetime, dayfirst=True)

3.5 Checking for Inconsistencies

We create a small utility function to inspect column types and unique values. This helps verify that categorical and boolean columns contain consistent entries.

Python
# Utility function
def get_column_types(data: pd.DataFrame):
    """
    Identify and separate columns of a DataFrame by data type.

    Parameters
    ----------
    data : pd.DataFrame
        The input DataFrame.

    Returns
    -------
    dict
        A dictionary with:
        - 'numeric' : list of numeric columns
        - 'categorical' : list of object/category columns
        - 'datetime' : list of datetime columns
        - 'boolean' : list of boolean columns
    """
    column_types = {
        'numeric': data.select_dtypes(include=['int64', 'float64']).columns.tolist(),
        'categorical': data.select_dtypes(include=['object', 'category']).columns.tolist(),
        'datetime': data.select_dtypes(include=['datetime64']).columns.tolist(),
        'boolean': data.select_dtypes(include=['bool']).columns.tolist()
    }
    
    return column_types

We then use this function to check each categorical and boolean column:

Python
# Retrieve categorical columns
column_types = get_column_types(data=data)
categorical_cols = column_types['categorical'] + column_types['boolean']

# Preview unique values
for col in categorical_cols:
    unique_values = data[col].unique()
    print(f'=== {col} ===')
    print(unique_values, '\n')
=== Gender ===
['F' 'M'] 

=== Scholarship ===
[False  True] 

=== Hipertension ===
[ True False] 
  
=== Diabetes ===
[False  True]

=== Alcoholism ===
[False  True]

=== Showed_up ===
[ True False] 
  
=== Handcap ===
[False  True] 

=== SMS_received ===
[False  True] 
  
  • All categorical and boolean variables appear consistent.
  • We only notice two misspelled column names that need correction (Hipertension, Handcap).

3.6 Renaming Columns

We rename the inconsistent column names for clarity.

Python
# Rename columns
col_names_dict = {
    'Hipertension': 'Hypertension',
    'Handcap': 'Handicap'
}

data = data.rename(columns=col_names_dict)

After this cleaning process, our dataset is consistent, well-formatted, and ready for Exploratory Data Analysis (EDA).

4. Exploratory Data Analysis (EDA)

Now that our data is clean, it’s time to explore it. EDA helps us understand the dataset: how different features behave, what patterns exist, and what factors might affect whether a patient shows up for their appointment.

We’ll start with the target variable, then move on to explore key features such as age, gender, medical conditions, and appointment timing. This step is crucial before training any model because it reveals insights that guide preprocessing and feature selection.

4.1 Target Variable Distribution

Our target variable is Showed_up, which tells us whether a patient attended their scheduled appointment. Here’s what each value means:

  • True → The patient showed up.
  • False → The patient did not show up.

Understanding how these two classes are distributed is important. If one class heavily dominates, the model might become biased toward that class during training, a common issue known as class imbalance.

Python
# Distribution of the target variable (Showed_up)
plt.figure(figsize=(6, 4))
sns.countplot(data=data, x='Showed_up', palette='Set2')
plt.title('Distribution of Target Variable (Showed_up)')
plt.xlabel("Showed_up")
plt.ylabel("Count")
plt.show()
Target distribution
  • The count plot shows that the dataset is imbalanced; most patients attended their appointments.

To confirm, let’s calculate the exact proportions.

Python
# Count summary of the target variable (Showed_up)
data['Showed_up'].value_counts().to_frame(name='Count').assign(
    Percent=lambda x: round((x['Count'] / x['Count'].sum()) * 100, 2))
           Count  Percent
Showed_up
True       85307    79.74
False      21680    20.26
  • So, about 80% of patients showed up for their appointments, while 20% did not.
  • This imbalance means we’ll later need to use careful evaluation metrics, not just accuracy, to make sure our model performs well for both groups.

4.2 EDA on Numerical Variables

Next, we explore the numerical features in our dataset. This helps us understand how they are distributed, whether they contain unusual values, and how they might relate to the target variable.

We’ll focus on two numerical columns, Age and Date.diff, and check their distributions, outliers, and correlations.

4.2.1 Univariate Numerical Analysis

We start with univariate analysis, where each variable is studied individually. Histograms show how values are distributed, and boxplots help us spot outliers.

Python
# Distribution of numerical variables
column_types = get_column_types(data=data)
numerical_cols = column_types['numeric']

for col in numerical_cols:
    plt.figure(figsize=(14, 4))

    # Histogram
    plt.subplot(1, 2, 1)
    sns.histplot(data=data, x=col, kde=True)
    plt.title(f'Distribution of {col}')

    # Boxplot
    plt.subplot(1, 2, 2)
    sns.boxplot(data=data, y=col, color='lightgreen')
    plt.title(f'Boxplot of {col}')
plt.show()

We then check some basic statistics:

Python
data.describe(include=['int64', 'float64'])
                 Age      Date.diff
count  106987.000000  106987.000000
mean       38.316085      10.166721
std        22.466214      15.263508
min         1.000000      -6.000000
25%        19.000000       0.000000
50%        38.000000       4.000000
75%        56.000000      14.000000
max       115.000000     179.000000
  • The Age variable ranges from 1 to 115 years, which is wide but realistic for a medical dataset.
  • The median of 38 shows most patients are middle-aged adults.
  • The Date.diff column ranges from -6 to 179. The negative values are interesting: they mean some appointments were apparently booked after the appointment date, a clear data issue (a possible fix is sketched below).
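
The cleanup itself isn’t shown in this excerpt; here is a minimal sketch, assuming we simply drop the rows with negative waiting times:

Python
# Hedged sketch (not from the original notebook): drop rows where the
# appointment was apparently scheduled after it took place.
invalid_rows = data['Date.diff'] < 0
print(f'Rows with negative Date.diff: {invalid_rows.sum()}')
data = data[~invalid_rows].reset_index(drop=True)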

Outliers and Skewness

We use the interquartile range (IQR) to find outliers and the skew() method to see if the data is symmetric or not.

Python
for col in numerical_cols:
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - iqr * 1.5
    upper = q3 + iqr * 1.5
    outliers = data[(data[col] < lower) | (data[col] > upper)][col]
    print(f'=== {col} ===')
    print(f'Outliers: {len(outliers)}')
    print(f'Skew: {data[col].skew()}', '\n')
=== Age ===
Outliers: 5
Skew: 0.12164402331150703

=== Date.diff ===
Outliers: 6489
Skew: 2.6901608800185848
  • The Age variable is clean: it has very few outliers (only 5) and is almost symmetric (skew ≈ 0.12), so we can safely use it as is.
  • On the other hand, Date.diff has many outliers (over 6,000) and is highly right-skewed (skew ≈ 2.69). This means most appointments are scheduled within a short period, but a few are booked very far in advance (one common remedy is sketched below).
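
The notebook keeps Date.diff as is, but a log transform is a common remedy for this kind of right skew. A hedged sketch, clipping negatives to zero first since log1p is undefined below -1:

Python
# Hedged sketch (not applied in the original notebook): compress the long
# right tail of Date.diff with log1p, clipping negative values to zero.
data['Date_diff_log'] = np.log1p(data['Date.diff'].clip(lower=0))
print(f"Skew after log1p: {data['Date_diff_log'].skew():.2f}")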

Correlation Between Numerical Variables

Finally, we check how these numerical variables relate to each other. This helps us avoid redundant information in modeling.

Python
corr_mat = data[numerical_cols].corr()
plt.figure(figsize=(5, 3))
sns.heatmap(corr_mat, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Collinearity of Numerical Variables')
plt.show()
Collinearity
  • The correlation heatmap shows no strong relationship between Age and Date.diff.
  • This means the two variables capture different aspects of patient behavior: one reflects demographics, the other scheduling patterns. Both can be useful in the prediction model.

4.2.2 Numerical Variables vs Target (Showed_up)

After exploring each numerical feature on its own, we now check how these variables differ between patients who showed up and those who didn’t. This helps us see which numerical patterns may explain appointment attendance.

Mean Comparison by Target Group

We start by visualizing and comparing the average values of each numerical variable between the two groups (Showed_up = True or False).

Python
# Plot mean of numerical variables vs Showed_up
plt.figure(figsize=(14, 4))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(1, 2, i)
    sns.barplot(data=data, x='Showed_up', y=col, palette='Set2', estimator='mean', ci=None)
    plt.title(f'Mean {col} vs Showed_up')
    plt.xlabel("Showed_up")
    plt.ylabel(col)
plt.show()
Numerical vs target
Python
# Calculate mean of numerical variables vs Showed_up
for col in numerical_cols:
    mean_show_up = data.groupby('Showed_up')[col].mean().to_frame(name='Mean')
    print(f'=== {col} ===')
    print(mean_show_up, '\n')
=== Age ===
                Mean
Showed_up
False      35.329151
True       39.075187

=== Date.diff ===
                Mean
Showed_up
False      15.789299
True        8.737794
  • Patients who showed up had an average age of 39 years, compared to 35 years for those who didn’t. This suggests older patients are slightly more likely to attend their appointments.
  • For Date.diff, those who missed appointments had an average waiting time of about 16 days, while those who attended waited only 9 days on average. Longer waiting times are clearly associated with skipped appointments.

Statistical Significance Testing

To confirm that these observed differences are not just random, we perform t-tests for both variables. We use Welch’s t-test when the variances differ between groups.

Python
# T-Test
alpha = 0.05
for col in numerical_cols:
    showed_up = data[data['Showed_up'] == True][col]
    no_showed_up = data[data['Showed_up'] == False][col]

    _, lev_p = levene(showed_up, no_showed_up, center='median')
    equal_var = lev_p >= alpha
    t_stat, p_value = ttest_ind(showed_up, no_showed_up, equal_var=equal_var)

    print(f'=== {col} ===')
    print(f'T-Stat: {t_stat:.2f}, P-Value: {p_value:.4f}')

    test = 'Standard' if equal_var else 'Welch'
    print(f'Test: {test}')

    print(f'Null Hypothesis (Ho): Mean {col} is the same in showed up or no show.')
    decision = 'Reject Hypothesis' if p_value < alpha else 'Fail to reject hypothesis.'
    print(f'Decision: {decision}', '\n')
=== Age ===
T-Stat: 22.68, P-Value: 0.0000
Test: Welch
Null Hypothesis (Ho): Mean Age is the same in showed up or no show.
Decision: Reject Hypothesis

=== Date.diff ===
T-Stat: -57.16, P-Value: 0.0000
Test: Welch
Null Hypothesis (Ho): Mean Date.diff is the same in showed up or no show.
Decision: Reject Hypothesis
  • Both Age and Date.diff have p-values < 0.05, meaning the differences between the two groups are statistically significant.
  • The test confirms what we observed earlier: older patients are more likely to attend, while longer wait times make patients more likely to skip their appointments.
  • These insights are strong indicators that both features should be kept for modeling.

4.3 EDA on Categorical Variables

Next, we explore the categorical features to see their distributions and how they might relate to whether patients showed up for their appointments. Understanding these variables helps reveal social or behavioral factors that affect attendance.

4.3.1 Univariate Categorical Analysis

We first check the frequency of each categorical variable to identify imbalances or dominant categories.

Python
# Distribution of Categorical Variables
column_types = get_column_types(data=data)
categorical_cols = column_types['categorical'] + column_types['boolean']
categorical_cols.remove('Neighbourhood')
categorical_cols.remove('Showed_up')

n_cols = 3
n_rows = 3

plt.figure(figsize=(n_cols * 6, n_rows * 5))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.countplot(data=data, x=col, palette='Set2')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.show()
Univariate categorical
Python
# Count summary of the categorical variables
for col in categorical_cols:
    counts = data[col].value_counts().to_frame(name='Count').assign(
        Percent=lambda x: round((x['Count'] / x['Count'].sum()) * 100, 2))

    print(f'=== {col} ===')
    print(counts, '\n')
=== Gender ===
   Count  Percent
F  70118    65.54
M  36869    34.46

=== Scholarship ===
       Count  Percent
False  96178     89.9
True   10809     10.1

=== Hypertension ===
       Count  Percent
False  85186    79.62
True   21801    20.38

=== Diabetes ===
       Count  Percent
False  99044    92.58
True    7943     7.42

=== Alcoholism ===
        Count  Percent
False  103627    96.86
True     3360     3.14

=== SMS_received ===
       Count  Percent
False  72402    67.67
True   34585    32.33

=== Handicap ===
        Count  Percent
False  104747    97.91
True     2240     2.09

Analysis of Categorical Variables

We explored the categorical features to understand their frequency distributions and identify any imbalances or dominant categories.

  • Gender: The majority of patients are female (65.5%), while males make up 34.5%.
  • Scholarship: Only 10.1% of patients are on a scholarship (likely indicating low-income support), showing a large imbalance.
  • Hypertension: About 20.4% of patients have hypertension, while the majority (79.6%) do not.
  • Diabetes: Only 7.4% of patients have diabetes, indicating it’s a less common condition in this dataset.
  • Alcoholism: Very few patients (3.1%) reported alcoholism, suggesting it may not be a strong factor in attendance.
  • Handicap: Only 2.1% of patients have some form of handicap, another minority group.
  • SMS_received: Around 32.3% of patients received an SMS reminder, while 67.7% did not.

Key Insights:

  • Most categorical features are highly imbalanced, with the “False” category dominating.
  • The imbalance may affect model learning, so encoding and class weighting should be considered later.
  • The SMS_received variable will be important to analyze against attendance, as reminders could influence show-up rates.

4.3.2 Categorical Variables vs Target (Showed_up)

After understanding the basic distributions of our categorical features, the next step is to see how these variables relate to appointment attendance (Showed_up).

We’ll visualize each category against the target and then use a Chi-Square test to determine which relationships are statistically significant.

Python
# Plot categorical variables vs the target (Showed_up)
n_cols = 3
n_rows = 3

plt.figure(figsize=(n_cols * 6, n_rows * 5))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.countplot(data=data, x=col, hue='Showed_up', palette='Set2')
    plt.title(f'{col} vs Showed_up')
    plt.xlabel(col)
    plt.ylabel('Count')
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.show()
Univariate categorical analysis
  • These bar plots give a quick visual comparison of attendance rates across different categories (e.g., male vs. female, diabetic vs. non-diabetic, etc.).
  • We can already see subtle variations, such as patients with hypertension or diabetes appearing more likely to show up, while those with scholarships or SMS reminders seem to have lower attendance.
Python
# Percentage Distribution by Category
for col in categorical_cols:
    summary = data.groupby(col)['Showed_up'].value_counts(normalize=True).unstack() * 100
    print(f'=== {col} ===')
    print(summary.round(2), '\n')
=== Gender ===
Showed_up  False  True 
Gender                 
F          20.36  79.64
M          20.08  79.92 

=== Scholarship ===
Showed_up    False  True 
Scholarship              
False        19.87  80.13
True         23.79  76.21 

=== Hypertension ===
Showed_up     False  True 
Hypertension              
False         21.02  78.98
True          17.30  82.70
  
=== Diabetes ===
Showed_up  False  True 
Diabetes               
False      20.45  79.55
True       18.00  82.00 

=== Alcoholism ===
Showed_up   False  True 
Alcoholism              
False       20.27  79.73
True        20.15  79.85
  
=== Handicap ===
Showed_up  False  True 
Handicap               
False      20.31  79.69
True       18.17  81.83

=== SMS_received ===
Showed_up     False  True 
SMS_received              
False         16.73  83.27
True          27.67  72.33
  

From these observations, we can infer that:

  • Scholarship, Hypertension, Diabetes, Handicap, and SMS_received variables exhibit meaningful behavioral differences.
  • Gender and Alcoholism show negligible impact on attendance.
  • The unexpected SMS reminder pattern is an interesting finding that warrants deeper statistical and model-based exploration later in the project; a quick significance check is sketched below.
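
The Chi-Square test itself isn’t shown in this excerpt. A minimal sketch using the chi2_contingency function we imported earlier, assuming each categorical feature is tested independently against the target:

Python
# Hedged sketch (assumed, not shown in this excerpt): Chi-Square test of
# independence between each categorical feature and Showed_up.
alpha = 0.05
for col in categorical_cols:
    contingency = pd.crosstab(data[col], data['Showed_up'])
    chi2, p_value, _, _ = chi2_contingency(contingency)
    decision = 'Significant' if p_value < alpha else 'Not significant'
    print(f'{col}: chi2={chi2:.2f}, p={p_value:.4f} -> {decision}')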

5. Feature Engineering

After performing exploratory data analysis, the next critical step is Feature Engineering: transforming and creating new variables to make the dataset more suitable for modeling.

The main goal here is to improve the predictive power of our logistic regression model by ensuring all features are clean, relevant, and machine-readable.

This process involves converting categorical data into numerical form, creating meaningful derived features, and removing redundant or less informative columns.

5.1 Converting Categorical and Boolean Variables to Numeric

Machine learning models, including logistic regression, work best with numerical input. Therefore, categorical and boolean features need to be encoded into numeric representations.

In this dataset:

  • Boolean variables (True/False) were replaced with 1 and 0.
  • Gender (M, F) was encoded as binary (M=1, F=0).
  • The target variable (Showed_up) was also converted to 1 (showed up) and 0 (no-show).
Python
engineered_data = data.copy()

# Binary Encoding
binary_cols = ['Scholarship', 'Diabetes', 'Alcoholism', 'Handicap', 'SMS_received', 'Gender', 'Hypertension',
               'Showed_up']
replace_dict = {
    True: 1,
    False: 0,
    'M': 1,
    'F': 0
}

engineered_data[binary_cols] = engineered_data[binary_cols].replace(replace_dict)
  • This ensures all key binary indicators (like whether a patient received an SMS or has hypertension) are numerically encoded and ready for model input.

5.2 Deriving New Features

Feature engineering isn’t just about converting existing data; it’s also about creating new variables that can capture hidden relationships.

Here, several new features were derived:

  1. Date-based features from ScheduledDay and AppointmentDay:
    • Scheduled_weekday, Scheduled_month
    • Appointment_weekday, Appointment_month
      These help capture temporal patterns, such as whether patients are more likely to attend appointments on specific weekdays or months.
  2. Condition_count — a new variable summarizing the total number of medical conditions a patient has (Hypertension, Diabetes, Handicap). This feature better represents the overall health burden of a patient.
Python
# Feature engineer and encode dates
engineered_data['Scheduled_weekday'] = pd.to_datetime(data['ScheduledDay']).dt.dayofweek
engineered_data['Scheduled_month'] = pd.to_datetime(data['ScheduledDay']).dt.month
engineered_data['Appointment_weekday'] = pd.to_datetime(data['AppointmentDay']).dt.dayofweek
engineered_data['Appointment_month'] = pd.to_datetime(data['AppointmentDay']).dt.month

# Feature engineer "Condition_count"
engineered_data['Condition_count'] = (
        engineered_data['Hypertension'] + engineered_data['Diabetes'] + engineered_data['Handicap'])

To understand how useful these new features are, we check their correlation with the target variable Showed_up:

Python
# Check added features importance
new_features = ['Scheduled_weekday', 'Scheduled_month', 'Appointment_weekday', 'Appointment_month', 'Showed_up', 'Condition_count']

corr_mat = engineered_data[new_features].corr()
plt.figure(figsize=(8, 4))
sns.heatmap(corr_mat, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Collinearity of New Variables')
plt.xticks(rotation=45)
plt.show()
New-features-correlations

Correlation Insights

After checking correlations between the newly engineered features and the target variable (Showed_up), we observed the following:

  • Scheduled_month → 0.16 correlation
    Shows a small positive correlation, indicating a slight seasonal trend — attendance rates may vary across different months.
  • Condition_count → 0.03 correlation
    Exhibits a weak but positive relationship, suggesting that patients with multiple health conditions tend to be slightly more consistent in attending appointments.
  • Scheduled_weekday, Appointment_weekday, and Appointment_month → near 0 correlation
    These features show no meaningful relationship with attendance and may only add noise to the model.

Interpretation:

  • We’ll keep Scheduled_month and Condition_count since they provide potential predictive value.
  • Features with near-zero correlation will be dropped to maintain a cleaner, more efficient dataset.

5.3 Dropping Less Informative or Redundant Columns

To simplify the dataset and reduce multicollinearity, we removed:

  • Categorical text columns (Neighbourhood).
  • Raw date columns (ScheduledDay, AppointmentDay).
  • Derived columns with negligible correlation (Scheduled_weekday, Appointment_weekday, Appointment_month).
  • Redundant medical indicators already represented by Condition_count (Hypertension, Diabetes, Handicap).
Python
drop_cols = ['Neighbourhood', 'Scheduled_weekday', 'Appointment_weekday', 'Appointment_month', 'ScheduledDay', 'AppointmentDay', 'Hypertension', 'Diabetes', 'Handicap']
engineered_data = engineered_data.drop(columns=drop_cols)
engineered_data.head()
Encoded

After feature engineering:

  • All categorical and boolean variables were converted to numeric form.
  • New temporal and health-related features were created to capture additional insights.
  • Redundant and weakly correlated features were removed to reduce noise.

Our dataset is now clean, compact, and model-ready, forming a solid foundation for building the logistic regression models that follow.

6. Implementing Logistic Regression with Scikit-Learn

After cleaning and engineering our dataset, it’s time to build a model that can predict whether a patient will show up for their scheduled medical appointment.

This stage involves:

  • Splitting the dataset into training and testing subsets.
  • Scaling the features for better model performance.
  • Training multiple logistic regression variants (baseline, L1, and L2 regularized).
  • Evaluating results using classification metrics such as accuracy, precision, recall, F1-score, and AUC-ROC.

6.1 Data Preparation

Before training, we separate the dataset into features (X) and the target variable (y):

Python
X = engineered_data.drop(columns=['Showed_up'])
y = engineered_data['Showed_up']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)
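
Given the roughly 80/20 class imbalance we saw earlier, a stratified split is also worth considering. A hedged variant, not used in the original notebook:

Python
# Hedged variant (not in the original notebook): stratify the split so the
# train and test sets both keep the ~80/20 show/no-show proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=y, random_state=42
)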

To ensure all numerical features contribute equally to model training, we standardize them using StandardScaler:

Python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Feature scaling is particularly important for models like logistic regression, which rely on gradient-based optimization; unscaled features can distort the learning process and slow convergence.

6.2 Model Training

We will train three logistic regression models:

  1. Baseline Logistic Regression – the default version without explicit regularization tuning.
  2. L1-Regularized Logistic Regression (Lasso) – encourages sparsity by shrinking less useful feature weights to zero.
  3. L2-Regularized Logistic Regression (Ridge) – penalizes large coefficients to prevent overfitting.
Python
# Fit Models
models = {}

# Baseline Logistic Regression
models['LogisticRegression'] = LogisticRegression(max_iter=1000, solver='liblinear', class_weight='balanced')
models['LogisticRegression'].fit(X_train_scaled, y_train)

# Logistic Regression with CV and regularization L1
models['LogisticRegression_L1'] = LogisticRegressionCV(max_iter=1000, solver='liblinear', class_weight='balanced', Cs=10, cv=5, penalty='l1')
models['LogisticRegression_L1'].fit(X_train_scaled, y_train)

# Logistic Regression with CV and regularization L2
models['LogisticRegression_L2'] = LogisticRegressionCV(max_iter=1000, solver='liblinear', class_weight='balanced', Cs=10, cv=5, penalty='l2')
models['LogisticRegression_L2'].fit(X_train_scaled, y_train)

We used the class_weight='balanced' parameter to handle the class imbalance in our target variable, ensuring the model gives appropriate importance to both show and no-show cases.
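
Concretely, scikit-learn’s 'balanced' mode weights each class by n_samples / (n_classes * class_count). A small hedged illustration of the weights this produces:

Python
# Illustration (not in the original notebook): the per-class weights that
# class_weight='balanced' derives from the training labels.
counts = np.bincount(y_train)           # [no-shows, show-ups]
weights = len(y_train) / (2 * counts)   # n_samples / (n_classes * count)
print(f'No-show weight: {weights[0]:.2f}, Show-up weight: {weights[1]:.2f}')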

6.3 Model Evaluation

Each model was evaluated using Accuracy, Precision, Recall, F1-score, and AUC-ROC.

Python
# Make Predictions and Evaluate
results = []

for name, model in models.items():
    # Predict hard class labels for the threshold-based metrics
    y_pred = model.predict(X_test_scaled)
    # Predicted probabilities give a more faithful AUC-ROC than hard 0/1 labels
    y_proba = model.predict_proba(X_test_scaled)[:, 1]

    # Evaluate
    acc = accuracy_score(y_test, y_pred)
    pre = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    
    results.append([name, acc, pre, rec, f1, auc])

results_df = pd.DataFrame(results, columns=['Name', 'Accuracy', 'Precision', 'Recall', 'F1', 'AUC-ROC Score'])
results_df
Evaluation

Results and Insights

  • All three logistic regression models produced identical performance, suggesting our preprocessing and feature selection were stable and consistent.
  • The model achieved about 66% accuracy, with high precision (~0.85) and moderate recall (~0.69).
  • This indicates that the model performs well at identifying patients who will attend, but has some difficulty capturing those who won’t.
  • Regularization (L1 or L2) had no noticeable impact, suggesting the model did not suffer from overfitting.

The logistic regression model provides a strong baseline for predicting medical appointment attendance. However, future improvements could include feature interaction terms, class rebalancing strategies, or exploring non-linear models such as decision trees or ensemble methods.
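
As a taste of that last suggestion, a tree-based model can be dropped into the same pipeline. A hedged sketch with hypothetical settings, not part of the original project:

Python
# Hedged sketch (not part of the original project): probe non-linear
# relationships with a random forest on the same train/test split.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)
rf.fit(X_train_scaled, y_train)
print(f'Random forest test accuracy: {rf.score(X_test_scaled, y_test):.3f}')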

Project Conclusion

This project demonstrated how logistic regression, one of the most fundamental classification algorithms in machine learning, can be applied to a real-world healthcare problem, predicting whether a patient will attend their scheduled medical appointment.

Through each stage, from data cleaning and exploratory analysis to feature engineering and model evaluation, we gained valuable insights into both the data and the modeling process.

Key Takeaways:

  • The logistic regression model served as a reliable baseline, achieving around 66% accuracy with strong precision and balanced performance across classes.
  • Careful feature engineering, especially the creation of the Condition_count variable, improved interpretability and provided meaningful context around a patient’s health profile.
  • Our analysis revealed that factors like age, waiting time, chronic conditions, and even SMS reminders influence attendance patterns, though some relationships were surprisingly counterintuitive.
  • Despite solid predictive power, human behavior in healthcare remains complex and partly unpredictable, likely influenced by personal, social, and logistical factors not captured in the dataset.

Next Steps for Improvement:

  • Experiment with tree-based algorithms (e.g., Random Forest, XGBoost) to capture non-linear relationships.
  • Introduce behavioral or spatial features, such as prior attendance history or distance to the clinic.
  • Apply hyperparameter tuning and cross-validation for optimal model configuration.
  • Explore model interpretability tools (like SHAP or LIME) to better explain predictions in a healthcare context.

In summary, this project illustrates how combining data-driven insights with responsible modeling can offer practical support for healthcare operations, improving scheduling efficiency and patient engagement. Logistic regression may be simple, but when applied thoughtfully, it can uncover powerful patterns in real-world human behavior.
