Linear Regression With NumPy: House Price Prediction

Predict house prices using a linear regression model built entirely with NumPy. This beginner project covers data prep, cost function, and gradient descent.

When I first started learning machine learning, I was fascinated by how algorithms could predict outcomes from data. But like many beginners, I relied heavily on high-level libraries like scikit-learn, treating models as black boxes that magically produced results. I wanted to change that, to truly understand what was happening under the hood.

So, I decided to build a linear regression model from scratch, using only NumPy, to predict house prices in Seattle. No shortcuts, no model.fit(), just raw math, code, and a lot of learning along the way.

Here’s the story of how I did it, the challenges I faced, and the insights I gained.

Why Build from Scratch?

Before diving into the code, I asked myself: Why go through the trouble of implementing an algorithm manually when libraries like scikit-learn already do it efficiently?

The answer? Understanding.

Anyone can call .fit() and .predict(), but if you can’t explain how gradient descent updates weights or why feature scaling matters, you’re missing the foundation of machine learning. By building from scratch, I forced myself to:

  1. Truly grasp the math behind linear regression.
  2. Debug issues when the model didn’t converge.
  3. Appreciate the importance of preprocessing.

Plus, recruiters love seeing this kind of depth: it shows you’re not just a library user, but someone who understands the mechanics.

Step 1: Understanding the Data

I started with the Seattle House Price Prediction dataset from Kaggle. To follow along with this article, you can find the full code implementation as a Jupyter Notebook in this GitHub repo. The dataset contained features like:

  • size (square footage)
  • beds (number of bedrooms)
  • baths (number of bathrooms)
  • lot_size (property land size)
  • zip_code (location)

1.1 Initial Exploration

First, we load the data and check for missing values:

Python
import pandas as pd

# Loading the dataset from the csv file 
df = pd.read_csv('train.csv')

# Preview first few rows
df.head()

Here’s a sample of what the raw data looks like:

Linear regression house price data

Looks simple, right? Well, not quite. As we’ll soon see, real-world data is messy, and cleaning it up is a crucial part of any machine learning workflow.

After loading the data into a Pandas DataFrame, the first thing I did was examine its structure. Right away, a few issues became obvious:

  • Mixed units: The lot_size column contained both square feet and acres, inconsistent units that would confuse our model.
  • Missing values: About 347 rows had missing data for lot_size, which meant we had incomplete examples.
  • Redundant columns: The columns size_units and lot_size_units weren’t helpful once everything was standardized.
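
All three issues show up with a few standard pandas checks. Here is a quick sketch of the kind of inspection that surfaces them (the exact calls are illustrative, not copied from the notebook):

Python
# Column types and non-null counts (reveals the missing lot_size values)
df.info()

# Count missing values per column
print(df.isna().sum())

# See which units appear in the unit columns ('sqft' vs 'acre')
print(df['size_units'].value_counts())
print(df['lot_size_units'].value_counts())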

1.2 Data Cleaning

To fix these issues, we:

  1. Standardize units – Convert all lot sizes to square feet (1 acre = 43,560 sqft).
  2. Drop missing values – Critical for training stability.
  3. Remove unnecessary columns – Like size_units and lot_size_units. Keep only numerical features relevant to price prediction.

Python
# Convert acres to sqft (1 acre = 43,560 sqft)
df['lot_size_sqft'] = df.apply(
    lambda row: row['lot_size'] * 43560 if row['lot_size_units'] == 'acre' else row['lot_size'],
    axis=1
)

# Drop rows with missing values
df.dropna(inplace=True)

# Drop the now-redundant unit columns
df.drop(columns=['size_units', 'lot_size_units'], inplace=True)

1.3 Visualizing Relationships

Before jumping into modeling, I plotted the data to see trends:

Linear regression visualization

Key insights include:

  1. size (square footage) had the strongest linear relationship with price.
  2. beds and baths had step-wise relationships with price (less ideal for simple linear regression).
  3. lot_size_sqft was noisy: bigger lots didn’t always mean higher prices.

A correlation heatmap confirmed this:

Linear regression house price correlation table
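
For reference, plots and a correlation table like these take only a few lines of matplotlib and pandas. A rough sketch (the styling in the actual notebook may differ):

Python
import matplotlib.pyplot as plt

# Scatter plot of each candidate feature against price
for col in ['size', 'beds', 'baths', 'lot_size_sqft']:
    plt.figure(figsize=(6, 4))
    plt.scatter(df[col], df['price'], alpha=0.4)
    plt.xlabel(col)
    plt.ylabel('price ($)')
    plt.show()

# Pairwise correlations of the numerical columns (the basis of the heatmap)
print(df[['size', 'beds', 'baths', 'lot_size_sqft', 'price']].corr())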

Decision: We’d use size as the primary feature for the model.

Step 2: Implementing Linear Regression from Scratch

2.1 Feature Scaling: Why It Matters

Before training a model using gradient descent, we need to scale the feature values. Otherwise, gradient descent may take tiny steps and converge very slowly (or not at all).

I used Z-score normalization:

\(X_{scaled} = \frac{X - \mu}{\sigma}\)

  • Where μ is the mean and σ is the standard deviation of the feature.

We scaled both our input feature size and our target variable price using this method.

Python
# Select columns
X = df['size']
y = df['price']

# Scaling the data
X_scaled = (X - X.mean()) / X.std()
y_scaled = (y - y.mean()) / y.std()

Step 3: Implementing the Cost Function (MSE)

To evaluate our model’s predictions, we use the Mean Squared Error (MSE):

\(J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^{2}\)

  • m is the number of training examples
  • w is the weight (slope)
  • b is the bias (intercept)
  • ŷ is the predicted price
  • y is the actual price

The goal of training is to find values of w and b that minimize this cost function.

Python
import numpy as np

def compute_cost(X, y, w, b):
    """
    Compute the Mean Squared Error (MSE) cost for linear regression.

    Parameters:
    ----------
    X : numpy array of shape (m,)
        The input feature values (e.g., house sizes). Already normalized if needed.
    
    y : numpy array of shape (m,)
        The actual target values (e.g., house prices).
    
    w : float
        The weight (slope) of the model — how much price increases with size.
    
    b : float
        The bias (intercept) of the model — the predicted price when size = 0.
    
    Returns:
    -------
    float
        The computed cost (error) using the Mean Squared Error formula.
    """
    
    # Number of training examples
    m = len(y)

    # Calculate the model's predicted values for each input (ŷ = X * w + b)
    y_pred = np.dot(X, w) + b

    # Compute the squared difference between predicted and actual values
    squared_errors = (y_pred - y) ** 2

    # Compute the final cost using the MSE formula with a factor of 1/(2m)
    cost = (1 / (2 * m)) * np.sum(squared_errors)

    return cost

Step 4: Training with Gradient Descent

I implemented batch gradient descent manually. At each step, we updated the weight and bias using the gradients:

\(w = w - \alpha \cdot \frac{\partial J}{\partial w}\)

\(b = b - \alpha \cdot \frac{\partial J}{\partial b}\)
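
For completeness, the gradients themselves follow from the MSE cost defined above (the 1/(2m) factor cancels the 2 that comes down from the square):

\(\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})\,x^{(i)}\)

\(\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})\)

These are exactly what the dw and db lines in the code below compute.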

Python
def gradient_descent(X, y, w, b, learning_rate, iterations):
    """
    Performs batch gradient descent to optimize parameters w and b
    for a linear regression model.

    Parameters:
    ----------
    X : numpy array of shape (m,)
        Input features (e.g., house sizes), typically standardized.
    
    y : numpy array of shape (m,)
        Actual target values (e.g., house prices), also standardized if X is.
    
    w : float
        Initial weight (slope) of the model.
    
    b : float
        Initial bias (intercept) of the model.
    
    learning_rate : float
        Controls the step size during each iteration — how much to update w and b.
    
    iterations : int
        Total number of iterations to run gradient descent.

    Returns:
    -------
    w : float
        Final optimized weight after gradient descent.
    
    b : float
        Final optimized bias after gradient descent.
    
    cost_history : list of float
        Cost at each iteration (for visualization and analysis).
    """

    m = len(y)  # Number of training examples
    cost_history = []  # To keep track of cost over iterations

    for i in range(iterations):
        # STEP 1: Make predictions using the current w and b
        y_pred = np.dot(X, w) + b

        # STEP 2: Compute the error (difference between predictions and actual values)
        error = y_pred - y

        # STEP 3: Compute gradients for w and b
        dw = (1 / m) * np.dot(error, X)       # Gradient of cost w.r.t weight
        db = (1 / m) * np.sum(error)          # Gradient of cost w.r.t bias

        # STEP 4: Update the parameters using the gradients
        w -= learning_rate * dw
        b -= learning_rate * db

        # STEP 5: Calculate and store the current cost
        cost = compute_cost(X, y, w, b)
        cost_history.append(cost)

        # Print progress every 100 iterations
        if i % 100 == 0:
            print(f"Iteration {i}: Cost = {cost:.4f}")

    return w, b, cost_history

We chose:

  • Learning Rate (α) = 0.001
  • Iterations = 10,000 (but we stopped early upon convergence).
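
For context, here is a minimal sketch of what the training call might look like with these settings (the zero initialization and the names final_weight and final_bias are my assumptions; they are reused in the prediction step below):

Python
# Convert the scaled pandas Series to NumPy arrays for the NumPy routines
X_train = X_scaled.to_numpy()
y_train = y_scaled.to_numpy()

# Run batch gradient descent from a zero initialization (assumed)
final_weight, final_bias, cost_history = gradient_descent(
    X_train, y_train, w=0.0, b=0.0, learning_rate=0.001, iterations=10000
)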

After about 2,500 iterations, the cost stopped improving significantly — which meant the model had converged.
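
One way to see this convergence is to plot the recorded cost_history; a quick sketch (not from the original notebook):

Python
import matplotlib.pyplot as plt

# The cost drops steeply at first, then flattens once the model has converged
plt.figure(figsize=(8, 4))
plt.plot(cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.title('Gradient Descent Convergence')
plt.grid(True)
plt.show()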

Final parameters:

  • Weight (w) ≈ 0.3707
  • Bias (b) ≈ 0.0000

Step 5: Making Predictions

Using the final model parameters, we predicted prices for the houses in the dataset. Since the model was trained on scaled values, we had to rescale predictions back to their original dollar amounts using:

\(\hat{y}_{\text{original}} = \hat{y}_{\text{scaled}} \cdot \sigma_{y} + \mu_{y}\)

Python
# Mean and standard deviation of the original (unscaled) prices
y_mean, y_std = y.mean(), y.std()

# Predict scaled prices with the trained parameters
y_hat_scaled = final_weight * X_scaled + final_bias

# Rescale predictions to the original price range
y_hat = y_hat_scaled * y_std + y_mean

Step 6: Visualizing the Regression Line

We then plotted the predictions against the actual data to visualize the regression line. The result showed that our model captured the general upward trend between size and price, though not all points aligned perfectly.

Python
import matplotlib.pyplot as plt

# Plot actual data vs predicted line
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Actual Data', alpha=0.5)
plt.plot(X, y_hat, color='red', label='Regression Line', linewidth=2)
plt.title('Linear Regression: Predicted Price vs Size (sqft)')
plt.xlabel('Size (sqft)')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()

Linear regression line

The line captured the general trend, but some outliers remained.

Step 7: Evaluating the Model

To quantify the model’s performance, we used:

  1. Root Mean Squared Error (RMSE)

    Measures the average prediction error in the same units as the target variable (price in dollars). Lower RMSE = better fit.

\(RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^{2}}\)

  2. R² Score

    Represents the proportion of variance in the target explained by the model:

\(R^{2} = 1 - \frac{\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^{2}}{\sum_{i=1}^{m}(y^{(i)}-\bar{y})^{2}}\)

  • R² = 1.0: perfect prediction
  • R² = 0.0: model predicts the mean
  • R² < 0: worse than guessing the mean

Python
# Importing the scikit-learn metrics we will use for evaluation
from sklearn.metrics import mean_squared_error, r2_score

# Compute RMSE (in dollars, the same units as price)
rmse = np.sqrt(mean_squared_error(y, y_hat))

# Compute R² score
r2 = r2_score(y, y_hat)

print(f"RMSE: ${rmse:,.2f}")
print(f"R² Score: {r2:.4f}")

RMSE: $923,043.35
R² Score: 0.1609
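
Since the rest of the pipeline deliberately avoids scikit-learn, the same two metrics can also be computed directly with NumPy. A minimal sketch implementing the formulas above (the helper names are my own, not from the notebook):

Python
import numpy as np

def rmse_numpy(y_true, y_pred):
    # Square root of the mean squared difference between predictions and targets
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def r2_numpy(y_true, y_pred):
    # One minus the ratio of residual variance to total variance
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

print(f"RMSE (NumPy): ${rmse_numpy(y, y_hat):,.2f}")
print(f"R² (NumPy): {r2_numpy(y, y_hat):.4f}")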

The R² value isn’t high, and that’s expected. This is a very simple model using just one feature. The purpose here wasn’t to beat a leaderboard, but to understand and implement the full pipeline from scratch.

Final Summary & Reflections

In this project, we built a linear regression model from scratch using NumPy to predict house prices in Seattle. We worked step by step through the full machine learning pipeline, including:

  • Data cleaning & preprocessing (unit standardization, missing value handling).
  • Exploratory data analysis (scatter plots, correlation checks).
  • Feature scaling using Z-score normalization.
  • Model implementation: cost function (MSE), gradient descent.
  • Model training and prediction with visualization.
  • Model evaluation using RMSE and R² Score.

Key Learnings

  • Implementing algorithms from scratch is the best way to deeply understand how they work.
  • Feature selection significantly impacts model performance.
  • Standardizing data is crucial when using gradient descent.
  • Real-world data is messy — handling units, missing values, and scale differences is essential.
  • Evaluation metrics tell the real story about a model’s performance.

Want to See the Full Code?

Check out the GitHub repository here: GitHub Repository Link
