Logistic Regression for Email Spam Detection with NumPy

Build a spam detection model using logistic regression and NumPy. Learn how to process text data, apply the sigmoid function, and classify emails effectively.

If you’ve ever wondered how Gmail or Outlook decides which emails go to your inbox and which vanish into your spam folder, this article is for you.

In this tutorial, I’ll walk you through building a spam email classifier from scratch using only NumPy. That means no scikit-learn, no TensorFlow, no black boxes, just raw matrix math and a hands-on approach to machine learning.

You’ll learn:

  • How to preprocess messy email data.
  • How to convert text into machine-readable features.
  • What logistic regression is and how it works.
  • How to train your own classifier using gradient descent.
  • How to evaluate the model with real metrics.

Let’s get started!

What Is This Project?

We’re solving a binary classification problem: determining whether an email is spam (1) or not spam (ham, 0). We’re using the Spam Email Classification Dataset from Kaggle, which combines messages from:

  • The 2007 TREC Public Spam Corpus.
  • The Enron-Spam Dataset.

This is real, raw, messy email data, perfect for practicing text preprocessing and classification. If you want to follow along, the full implementation is available as a Jupyter notebook in this GitHub repo.

Step 1: Imports and Setup

We’ll use a few core Python libraries to help with data handling and preprocessing.

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

  • numpy – for matrix operations.
  • pandas – for loading and cleaning data.
  • matplotlib – for plotting the training cost curve.
  • re – for regex-based text cleaning.
  • train_test_split – for splitting the data into training and test sets.
  • CountVectorizer – for converting words into numerical features.

Step 2: Load and Preview the Dataset

We load the dataset and sample 1,000 emails for quick training.

Python
# Load dataset from csv file
df = pd.read_csv('combined-data.csv')

# Reduce the sample size for faster training
df = df.sample(n=1000, random_state=42).reset_index(drop=True)

# Preview the dataset
df.head()

Each row contains:

  • label – 1 if spam, 0 if ham
  • text – the raw email content

This is what the preview of the dataset looks like:

Spam email dataset

Step 3: Text Cleaning

Raw email text is messy. To prepare it for machine learning, I:

  • Lowercased everything (to ignore case sensitivity).
  • Removed punctuation/symbols (e.g., !, $).
  • Normalized whitespace (extra spaces → single space).

Python
def clean_text(text):
    """
    Cleans a raw email text string for use in machine learning models.
    
    This function applies basic text preprocessing steps:
    - Converts all characters to lowercase
    - Removes all non-alphanumeric characters (excluding spaces)
    - Replaces multiple spaces with a single space
    - Strips leading and trailing whitespace
    
    Parameters:
    -----------
    text : str
        The raw email text to be cleaned.
    
    Returns:
    --------
    str
        The cleaned and normalized text.
    """
    
    # Convert all characters to lowercase
    text = text.lower()
    
    # Remove all characters that are NOT a-z, 0-9, or whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    
    # Replace multiple spaces with a single space and strip leading/trailing space
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text


# Apply the cleaning function to the 'text' column
df['clean_text'] = df['text'].apply(clean_text)

Example:

Markdown
Original: "Hello! This is your LAST chance!!! $$$"
Cleaned: "hello this is your last chance"
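
A quick check, reproducing the example above with the clean_text function (the output is shown as a comment):

Python
# Verify the cleaning function on the example string
print(clean_text("Hello! This is your LAST chance!!! $$$"))
# Output: hello this is your last chance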

Step 4: Converting Text to Features (Bag-of-Words)

Since machine learning models need numbers, not words, we use the Bag-of-Words (BoW) model to convert each email into a numeric vector.

We use CountVectorizer to:

  1. Build a vocabulary of the 3,000 most frequent words.
  2. Count word appearances in each email.

What is Bag-of-Words (BoW)?

Bag-of-Words (BoW) is a simple and widely used method to convert text into numerical features that a machine learning model can understand.

How it works:

  1. Build a vocabulary of all unique words across the dataset.
  2. Count the frequency of each word in every document (e.g. email).
  3. Represent each document as a vector of word counts.

Each position in the vector corresponds to a word in the vocabulary, and the value is how many times that word appears in the document.

Example:

If your dataset has 3 emails:

Markdown
Email 1: "I love Python"
Email 2: "Python loves me"
Email 3: "I love machine learning"

BoW vocabulary (after processing):

Markdown
["i", "love", "python", "loves", "me", "machine", "learning"]

Then each email becomes:

Word       i   love   python   loves   me   machine   learning
Email 1    1   1      1        0       0    0         0
Email 2    0   0      1        1       1    0         0
Email 3    1   1      0        0       0    1         1
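
To make the counting concrete, here is a minimal from-scratch sketch of the same logic in plain Python. It reproduces the vocabulary and vectors in the table above; in the actual project, we let CountVectorizer do this work for us.

Python
import numpy as np

# Toy corpus from the example above (already lowercased and cleaned)
emails = ["i love python", "python loves me", "i love machine learning"]

# 1. Build the vocabulary of unique words, in order of first appearance
vocab = []
for email in emails:
    for word in email.split():
        if word not in vocab:
            vocab.append(word)

# 2. Count how often each vocabulary word appears in each email
bow = np.zeros((len(emails), len(vocab)), dtype=int)
for i, email in enumerate(emails):
    for word in email.split():
        bow[i, vocab.index(word)] += 1

print(vocab)  # ['i', 'love', 'python', 'loves', 'me', 'machine', 'learning']
print(bow)    # each row is one email's word-count vector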

Why use BoW?

  • Easy to implement.
  • Works well for simple text classification tasks like spam detection.
  • Doesn’t require deep linguistic knowledge.

With that quick overview of BoW, let’s circle back to our project. We’ll implement it using CountVectorizer from the scikit-learn library.

Python
# Initialize a CountVectorizer to extract the top 3,000 most frequent words.
vectorizer = CountVectorizer(max_features=3000)

# Fit the vectorizer on the cleaned email text and transform the text into a numeric feature matrix.
# Each row represents an email, and each column represents the count of a specific word.
X = vectorizer.fit_transform(df['clean_text']).toarray()

# Extract the target labels (0 = ham, 1 = spam) from the DataFrame.
y = df['label'].values

Example Output:

  • Email: "win free money now" → [0, 1, 0, …, 1, 0] (each position is the count of one vocabulary word; here 1 means the word appears once, 0 means it doesn’t appear)
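
To sanity-check the result, you can inspect the shape of the feature matrix and a few entries of the learned vocabulary (get_feature_names_out requires scikit-learn 1.0+; older versions expose get_feature_names instead):

Python
# The matrix has one row per email and one column per vocabulary word
print(X.shape)

# Peek at the first ten words in the vocabulary
print(vectorizer.get_feature_names_out()[:10])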

Step 5: Train-Test Split

We divide the data into training and testing sets:

Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This ensures our model is evaluated on unseen data.

Step 6: Logistic Regression — From Scratch

6.1 Sigmoid Function

Logistic regression uses the sigmoid function to convert linear output into a probability between 0 and 1.

\(\sigma(z)=\frac{1}{1+e^{-z}}\)

Python
def sigmoid(z):
    """
    Applies the sigmoid activation function to map input values 
    into a range between 0 and 1.

    This function is commonly used in binary classification problems
    (like logistic regression) to convert raw model outputs (logits)
    into interpretable probabilities.

    Mathematically:
        sigmoid(z) = 1 / (1 + e^(-z))

    Parameters:
    -----------
    z : float or np.ndarray
        A scalar or NumPy array of real values (the linear output of a model).

    Returns:
    --------
    float or np.ndarray
        The sigmoid-transformed value(s), ranging from 0 to 1.
    """

    # Prevent extreme values of z from causing numerical overflow in exp()
    # Clipping helps maintain numerical stability for very large or small z
    z = np.clip(z, -500, 500)

    # Compute the sigmoid: 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))
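
A quick sanity check of the sigmoid function (approximate outputs shown as comments):

Python
# Sigmoid maps any real number into the open interval (0, 1)
print(sigmoid(0))                           # 0.5
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # approx. [0.119 0.5 0.881]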

6.2 Binary Cross-Entropy Loss

We use binary cross-entropy to measure how far off the model’s predictions are:

\(J(w,b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(\hat{y}^{(i)}\right)+\left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\)

Python
def compute_cost(X, y, w, b, lambda_):
    """
    Computes the regularized binary cross-entropy loss (cost) 
    for logistic regression.

    This function measures how well the model's predicted probabilities
    match the actual labels, and includes an L2 regularization term 
    to help reduce overfitting.

    Parameters:
    -----------
    X : np.ndarray, shape (m, n)
        Feature matrix where m is the number of examples and 
        n is the number of features.

    y : np.ndarray, shape (m,)
        Actual binary labels (0 for ham, 1 for spam).

    w : np.ndarray, shape (n,)
        Weight vector (model parameters for each feature).

    b : float
        Bias term (intercept).

    lambda_ : float
        Regularization strength. Higher values penalize large weights.

    Returns:
    --------
    float
        The total cost: binary cross-entropy + L2 regularization.
    """

    # Number of training examples
    m = X.shape[0]

    # Compute the model output (z = X.w + b)
    z = np.dot(X, w) + b

    # Apply the sigmoid function to get predicted probabilities
    y_hat = sigmoid(z)

    # Clip predicted values to avoid log(0) errors during cost calculation
    y_hat = np.clip(y_hat, 1e-15, 1 - 1e-15)

    # Compute binary cross-entropy loss
    cost = -(1 / m) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Compute L2 regularization term (excluding the bias b)
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)

    # Total cost = loss + regularization
    return cost + reg
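
As a quick sanity check: with zero weights and zero bias the model predicts 0.5 for every email, so the cost should come out to roughly ln(2) ≈ 0.693 regardless of the labels, and the regularization term contributes nothing because all weights are zero:

Python
# Cost at the starting point of training (w = 0, b = 0)
initial_cost = compute_cost(X_train, y_train, np.zeros(X_train.shape[1]), 0.0, lambda_=10.0)
print(initial_cost)  # approx. 0.6931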

6.3 Training with Gradient Descent

We update the model’s weights w and bias b using the gradients of the loss.

Python
def train_logistic_regression(X, y, learning_rate=0.01, iterations=10000, lambda_=10.0):
    """
    Trains a logistic regression model using batch gradient descent 
    and L2 regularization (also known as Ridge regularization).
    
    Parameters:
    -----------
    X : np.ndarray, shape (m, n)
        Feature matrix where m is the number of examples and 
        n is the number of features (e.g., BoW word counts).

    y : np.ndarray, shape (m,)
        Binary labels (0 = ham, 1 = spam).

    learning_rate : float, optional (default=0.01)
        Step size for updating weights during training.

    iterations : int, optional (default=10000)
        Maximum number of training iterations.

    lambda_ : float, optional (default=10.0)
        Regularization strength to reduce overfitting by 
        penalizing large weight values (L2 regularization).
    
    Returns:
    --------
    w : np.ndarray, shape (n,)
        Final optimized weights after training.

    b : float
        Final optimized bias (intercept).

    costs : list of float
        Cost at each iteration — useful for plotting convergence.
    """

    m, n = X.shape  # m = number of samples, n = number of features

    # Initialize weights and bias to zero
    w = np.zeros(n)
    b = 0

    # Keep track of cost at each iteration for visualization
    costs = []

    # Begin gradient descent loop
    for i in range(iterations):
        # Compute model output (z = X.w + b)
        z = np.dot(X, w) + b

        # Apply sigmoid to get predicted probabilities (0 to 1)
        y_hat = sigmoid(z)

        # Compute gradients with respect to weights and bias
        # Includes L2 regularization: (lambda_ / m) * w
        dw = (1 / m) * np.dot(X.T, (y_hat - y)) + (lambda_ / m) * w
        db = (1 / m) * np.sum(y_hat - y)

        # Update parameters using gradient descent
        w -= learning_rate * dw
        b -= learning_rate * db

        # Compute current cost (loss + regularization)
        cost = compute_cost(X, y, w, b, lambda_)
        costs.append(cost)

        # Print progress every 100 steps
        if i % 100 == 0:
            print(f"Iteration {i}: Cost = {cost:.4f}")

    # Return the learned weights, bias, and the list of costs
    return w, b, costs

Key Concepts:

  • Gradient Descent: Optimizes weights and bias by iteratively reducing the cost function.
  • Regularization: Helps prevent overfitting by penalizing large weights.
  • Cost Tracking: Allows you to visualize and understand learning progress over time.
  • Early Stopping (optional): training can be halted once the improvement in cost becomes negligible; the loop above runs for a fixed number of iterations, but a sketch of this variation follows below.
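
The training function above always runs for the full number of iterations. If you want the early-stopping behaviour mentioned in the list, a minimal sketch could compare consecutive costs and stop once the improvement drops below a threshold (the tol parameter and the train_with_early_stopping name below are illustrative, not part of the original notebook):

Python
def train_with_early_stopping(X, y, learning_rate=0.01, iterations=10000,
                              lambda_=10.0, tol=1e-6):
    """Same gradient-descent loop as above, but stops early once the cost
    improvement between consecutive iterations falls below `tol`."""
    m, n = X.shape
    w, b, costs = np.zeros(n), 0.0, []

    for i in range(iterations):
        # Forward pass: linear score followed by sigmoid
        y_hat = sigmoid(np.dot(X, w) + b)

        # Gradients with L2 regularization on the weights
        dw = (1 / m) * np.dot(X.T, (y_hat - y)) + (lambda_ / m) * w
        db = (1 / m) * np.sum(y_hat - y)

        # Parameter update
        w -= learning_rate * dw
        b -= learning_rate * db

        costs.append(compute_cost(X, y, w, b, lambda_))

        # Stop once the cost barely changes between iterations
        if i > 0 and abs(costs[-2] - costs[-1]) < tol:
            print(f"Early stopping at iteration {i}")
            break

    return w, b, costs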

Step 7: Model Training and Evaluation

Let’s train the model and check how well it performs:

Python
w, b, costs = train_logistic_regression(X_train, y_train)

Visualizing Loss Over Time

Python
plt.plot(costs)
plt.title("Cost Over Iterations")
plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.grid(True)
plt.show()

Here’s what the graph looks like:

Logistic regression cost graph

You’ll see the cost decrease steadily, a sign that the model is learning.

Step 8: Making Predictions

We classify new emails using our learned weights and bias.

Python
def predict(X, w, b):
    """
    Predicts binary labels (0 or 1) for input feature vectors 
    using a trained logistic regression model.

    This function performs the following steps:
    1. Computes the linear combination of inputs and weights (z = X·w + b)
    2. Applies the sigmoid function to convert z into probabilities
    3. Converts probabilities into final class labels:
       - If probability ≥ 0.5, predict 1 (spam)
       - If probability < 0.5, predict 0 (ham)

    Parameters:
    -----------
    X : np.ndarray, shape (m, n)
        Feature matrix of input samples (e.g., Bag-of-Words vectors).

    w : np.ndarray, shape (n,)
        Trained weight vector from the logistic regression model.

    b : float
        Trained bias (intercept) from the model.

    Returns:
    --------
    np.ndarray, shape (m,)
        Predicted class labels (0 or 1) for each input example.
    """

    # Step 1: Compute the raw score (z = X·w + b)
    z = np.dot(X, w) + b

    # Step 2: Apply sigmoid to get probabilities between 0 and 1
    y_hat = sigmoid(z)

    # Step 3: Classify as 1 if probability ≥ 0.5, else 0
    return np.where(y_hat >= 0.5, 1, 0)
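
With the trained weights and bias, we can now generate label predictions for the held-out test set. This y_pred_test array is what the evaluation code in the next step uses:

Python
# Predict spam (1) or ham (0) for each email in the test set
y_pred_test = predict(X_test, w, b)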

Step 9: Evaluating Performance

Let’s measure accuracy:

Python
def accuracy(y_pred, y_true):
    """
    Calculates the classification accuracy of the model.
    
    Accuracy is the proportion of correct predictions out of 
    all predictions made. It's a basic but useful metric for 
    evaluating classification performance.

    Parameters:
    -----------
    y_pred : np.ndarray, shape (m,)
        Predicted labels (0 or 1) from the model.
    
    y_true : np.ndarray, shape (m,)
        Actual true labels (0 or 1) from the dataset.

    Returns:
    --------
    float
        Accuracy score between 0 and 1.
        Multiply by 100 for percentage.
    """
    return np.mean(y_pred == y_true)

Example Usage

Python
# Evaluate the model on the test set
# y_pred_test: model's predictions
# y_test: actual labels

acc = accuracy(y_pred_test, y_test)

# Print accuracy as a percentage with 2 decimal places
print(f"Test Accuracy: {acc * 100:.2f}%")

What This Does:

  • y_pred == y_true creates a Boolean array: [True, False, True, …]
  • np.mean(…) treats True as 1 and False as 0, giving the proportion of correct predictions.
  • Multiplying by 100 gives the percentage (e.g., 0.93 → 93.00%).

An accuracy in that range is a strong result, especially given that we used raw NumPy and a simple bag-of-words model!

Final Thoughts

In this project, we built a spam classifier from the ground up. Here’s what we did:

  • Cleaned and tokenized raw text data.
  • Turned emails into numeric features using Bag-of-Words.
  • Implemented logistic regression manually.
  • Trained with gradient descent and visualized the learning process.
  • Evaluated performance with test accuracy.

Key Learnings:

  • Machine learning isn’t just calling .fit(); it’s understanding every component.
  • Text data is messy; cleaning matters.
  • Regularization improves generalization.
  • Logistic regression is a great first model for classification.

Want to See the Full Code?

Check out the GitHub repository here: GitHub Repository Link
