Linear Regression With NumPy: House Price Prediction

Predict house prices using a linear regression model built entirely with NumPy. This beginner project covers data prep, cost function, and gradient descent.

When I first started learning machine learning, I was fascinated by how algorithms could predict outcomes from data. But like many beginners, I relied heavily on high-level libraries like scikit-learn, treating models as black boxes that magically produced results. I wanted to change that, to truly understand what was happening under the hood.

So, I decided to build a linear regression model from scratch, using only NumPy, to predict house prices in Seattle. No shortcuts, no model.fit(), just raw math, code, and a lot of learning along the way.

Here’s the story of how I did it, the challenges I faced, and the insights I gained.

Why Build from Scratch?

Before diving into the code, I asked myself: Why go through the trouble of implementing an algorithm manually when libraries like scikit-learn already do it efficiently?

The answer? Understanding.

Anyone can call .fit() and .predict(), but if you can’t explain how gradient descent updates weights or why feature scaling matters, you’re missing the foundation of machine learning. By building from scratch, I forced myself to:

  1. Truly grasp the math behind linear regression.
  2. Debug issues when the model didn’t converge.
  3. Appreciate the importance of preprocessing.

Plus, recruiters love seeing this kind of depth; it shows you’re not just a library user, but someone who understands the mechanics.

Step 1: Import Libraries

The very first thing we always do in any machine learning project is import the libraries we need.

Since the goal of this project is to truly understand linear regression, I’ll be keeping things as minimal as possible. Instead of using powerful machine learning libraries, we’ll rely mainly on NumPy and Pandas.

Python
import numpy as np  # Numerical operations
import pandas as pd  # Data manipulations
import matplotlib.pyplot as plt  # Data visualizations
import seaborn as sns  # Advanced visualizations
import math  # Mathematical functions

Step 2: Load and Explore the Dataset

Now that we’ve set up our toolbox, the next step is to bring in the data we’ll be working with. For this project, we’re using the Seattle Housing Dataset from Kaggle, which contains information about houses in Seattle, USA.

Each row in the dataset represents a house, and each column describes a characteristic of that house, such as size, number of beds, number of baths, and, of course, the price. Our goal is to use these features to predict house prices.

To follow along with this article, you can find the code implementation in a Jupyter Notebook in this GitHub repo.

2.1 Loading the Data

Let’s load the dataset using Pandas. Assuming you’ve already downloaded the dataset as a CSV file, here’s how to get it into a DataFrame (a table-like structure in Pandas):

Python
# Loading the dataset from the csv file 
data = pd.read_csv('seattle-housing-data.csv')

# Preview first few rows
data.head()

Here’s a sample of what the raw data looks like:

Linear regression house price data

Looks simple, right? Well, not quite. As we’ll soon see, real-world data is messy, and cleaning it up is a crucial part of any machine learning workflow.

2.2 Understanding the Data

After loading the dataset, the next step is to understand its structure. Instead of checking the shape, columns, and data types separately, I prefer using the .info() method. It gives us a quick snapshot of everything we need to know: how many rows there are, the names of the columns, their data types, and whether or not they contain missing values.

Python
# Get a concise summary of the dataset
data.info()
RangeIndex: 505 entries, 0 to 504
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   beds            505 non-null    int64
 1   baths           505 non-null    float64
 2   size            505 non-null    float64
 3   size_units      505 non-null    object
 4   lot_size        428 non-null    float64
 5   lot_size_units  428 non-null    object
 6   zip_code        505 non-null    int64
 7   price           505 non-null    float64
dtypes: float64(4), int64(2), object(2)

Breaking this down:

  1. Number of rows and columns
    • The dataset has 505 rows and 8 columns.
    • Each row represents one house listing, and each column represents some information about that house.
  2. Column names and data types
    • beds, baths, size, and price are numerical (int or float). These will likely be useful predictors for our model.
    • size_units and lot_size_units are object (string) columns; they describe measurement units, which we’ll eventually need to clean or drop.
    • zip_code is numeric but actually represents categorical information (a location), not a continuous number.
  3. Missing values
    • Most columns have 505 non-null values, meaning they’re complete.
    • But lot_size and lot_size_units only have 428 non-null values, so 77 rows have missing data. That’s something we’ll need to handle later.

At this stage, we already see a few red flags that will affect our model later:

  • Mixed units in size_units and lot_size_units (a quick check is sketched after this list).
  • Missing values in lot_size.
  • Redundant columns that may not directly help in prediction.
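
To confirm the unit mix before standardizing, a quick look at the unit columns might look like the sketch below (the exact unit labels depend on the CSV):

Python
# Inspect how the units are recorded before standardizing (sketch)
print(data['size_units'].value_counts(dropna=False))
print(data['lot_size_units'].value_counts(dropna=False))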

Step 3: Data Cleaning

Real-world data is rarely perfect. In fact, one of the biggest surprises for beginners is just how much time data scientists spend cleaning and preparing datasets before even thinking about modeling. Machine learning algorithms expect clean, consistent, and numerical data — otherwise, the results can be misleading or the model might not work at all.

In this step, we’ll make sure our Seattle housing dataset is tidy and ready for analysis by checking for duplicates, missing values, and inconsistent categorical variables.

3.1 Checking for Duplicates

Sometimes, datasets contain the exact same row more than once; maybe a property was accidentally logged twice or included in multiple records. If we don’t remove these duplicates, our model will give extra weight to those repeated rows, biasing the results.

Python
# Check for duplicates
print(f'Duplicated rows: {data.duplicated().sum()}')
Duplicated rows: 1
Python
# Drop duplicates
data = data.drop_duplicates()

3.2 Fix Missing Values and Standardize Columns

Missing data is another common issue. In our dataset, we’ve already noticed that the lot_size and lot_size_units columns don’t have complete information for every house. If we ignore this, the model might crash or learn incorrect patterns.

There are several ways to deal with missing values:

  • Dropping rows that are incomplete.
  • Filling them in (imputation) using averages, medians, or even more advanced techniques.

The right choice depends on how important the column is and how much data is missing. In our case, the important column is lot_size. We will fill the missing values with the median, but first we need to make the column values consistent, since it contains a mix of sqft and acre units.

Python
# Remove empty strings on categorical variables which can be seen as null values
categorical_cols = data.select_dtypes(include='object').columns
data[categorical_cols] = data[categorical_cols].replace(r'^\s*$', np.nan, regex=True)

# Check for null values
print(data.isnull().sum())
beds               0
baths              0
size               0
size_units         0
lot_size          77
lot_size_units    77
zip_code           0
price              0
dtype: int64
Python
# Create column 'lot_size_sqft' with standardized sqft units
data['lot_size_sqft'] = data.apply(
  lambda row: row['lot_size'] * 43560 if row['lot_size_units'] == 'acre' else row['lot_size'], axis=1
)

# Fill missing values with median
lot_size_median = data['lot_size_sqft'].median()
data['lot_size_sqft'] = data['lot_size_sqft'].fillna(lot_size_median)

# Rename 'size' column
data = data.rename(columns={'size': 'size_sqft'})

# Drop the unnecessary columns
data = data.drop(columns=['lot_size','lot_size_units', 'size_units'])

Step 4: Exploratory Data Analysis (EDA)

Once our data is cleaned, the next step is to explore it. EDA is like detective work: we’re trying to uncover patterns, relationships, and potential problems in the data before we build any models.

The goal here isn’t just to generate pretty charts; it’s to understand how the features behave, how they relate to the target variable (price), and whether there are any issues (like outliers or skewed distributions) that could confuse our model.

We’ll look at the data in three stages: univariate (one variable at a time), bivariate (two variables), and multivariate (many variables together).

4.1 Univariate Analysis of the Target Variable (price)

We’ll start by focusing on our target variable: price. Since our whole project is about predicting house prices, it makes sense to deeply understand what those prices look like in the dataset.

4.1.1 Statistical Information

The .describe() method gives us basic summary statistics — mean, median, minimum, maximum, and quartiles. This quick snapshot helps us see whether house prices are generally high, low, or spread out across a wide range.

Python
data['price'].describe().to_frame().T
         count       mean        std       min       25%      50%        75%        max
price      504    979,542    609,080   170,000   619,742  840,000  1,156,250  6,250,000

Looking at the summary statistics for price, a few things stand out:

  • There are 504 houses in our dataset.
  • The average (mean) price is about $979,500, but the median (50%) price is lower at $840,000. This gap suggests that there are some very expensive homes pulling the mean upwards.
  • Most homes fall between $620,000 (25th percentile) and $1,156,000 (75th percentile).
  • The cheapest home is priced at $170,000, while the most expensive reaches an eye-popping $6.25 million.
  • The standard deviation is very high (~$609,000), which tells us house prices vary a lot across the dataset.

In short, Seattle’s housing market is highly varied, with most homes clustered under $1.2M, but a handful of luxury properties driving up the average.

4.1.2 Check for Outliers

Outliers are unusually high or low values that don’t fit the general trend. For house prices, these might be ultra-luxury homes worth millions, which could distort the scale of our model. Spotting and handling them is important because linear regression is sensitive to extreme values.

Python
q1 = data['price'].quantile(0.25)
q3 = data['price'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data['price'] < lower) | (data['price'] > upper)]['price']
print(f'Outliers: {len(outliers)}')
Outliers: 31

When we checked for outliers in the price column, we found 31 houses that stand far outside the typical price range. These are mostly ultra-luxury properties worth millions, compared to the majority of homes priced under $1.2M.

Outliers aren’t “bad data”; they’re real houses. But they can heavily influence a linear regression model, pulling the regression line towards them. That’s why it’s important to identify them now. Later, we’ll decide whether to keep them (if they’re relevant for prediction) or remove them (if they distort the model too much).

4.1.3 Check for skewness

Skewness tells us whether the distribution of prices leans heavily to the left (a few very cheap homes) or to the right (a few extremely expensive homes). Since linear regression assumes normally distributed errors, a highly skewed target variable could affect model performance.

Python
# Check for skewness
print(f"Skewness: {data['price'].skew():.4f}")
Skewness: 2.8905

Since our skewness is 2.89, that’s strong positive skewness. This means most Seattle homes are clustered in the lower-to-mid price ranges, but a handful of very expensive luxury houses stretch the distribution to the right.

This also explains why the mean price ($979K) is higher than the median price ($840K): those multi-million-dollar homes are pulling the average upward.

4.1.4 Distribution of the price variable

Finally, we’ll visualize house prices using a histogram (or histplot in Seaborn). This gives us a clear picture of how prices are spread out, whether most homes fall into a certain price range or whether the distribution is uneven.

Python
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
sns.histplot(data['price'], kde=True)
plt.title('Distribution of Price')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()
Distribution of house prices (histogram)

4.2 Bivariate Analysis of Numerical Values vs. Target (price)

Once we understand the target, the next step is to explore how other numerical variables (like beds, baths, size, lot_size) relate to price.

4.2.1 Correlation Heatmap

A correlation heatmap gives us a bird’s-eye view of relationships between variables. For instance, we might find that:

  • size has a strong positive correlation with price (bigger homes → more expensive).
  • beds and size are also strongly correlated (bigger houses usually mean more bedrooms).

This helps us identify which variables are likely to be the strongest predictors, and also warns us about multicollinearity (when two features are so closely related that they could confuse the model).

Python
# Drop zip_code from the numerical columns, as it is a categorical variable, not a continuous one.
numerical_cols = data.select_dtypes(include='number').columns.drop(['zip_code'])

# Plot correlation heatmap
corr_mat = data[numerical_cols].corr()
plt.figure(figsize=(10, 5))
sns.heatmap(corr_mat, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.show()
Correlation heatmap

Key insights include:

  1. size_sqft had the strongest linear relationship with price.
  2. beds and baths were step-wise (less ideal for simple linear regression).
  3. lot_size_sqft was noisy; bigger lots didn’t always mean higher prices.

4.2.2 Scatter plot

Scatter plots are one of the most powerful ways to see relationships between a single feature and the target. For example:

  • size vs. price: Do bigger houses generally cost more?
  • beds vs. price: Does adding bedrooms always increase price, or is there a limit?

By plotting these, we can spot patterns (linear or non-linear) and also detect if some variables don’t really influence price much at all.

Python
# Drop price from the feature list, since it is the target we plot against.
numerical_cols = numerical_cols.drop(['price'])

n_cols = 2
n_rows = math.ceil(len(numerical_cols) / n_cols)
plt.figure(figsize=(6 * n_cols, 4 * n_rows))

for i, col in enumerate(numerical_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.scatterplot(data=data, x=col, y='price')
    plt.title(f'{col} vs Price')
    sns.regplot(
        x=col,
        y="price",
        data=data,
        scatter=False,
        color="red",
    )

plt.subplots_adjust(hspace=0.5, wspace=0.2)
plt.show()
House price scatter plot

Decision: We can visually see that size_sqft has the strongest linear relationship with price, so we will select it as the primary feature for the model.

Step 5: Implementing Linear Regression from Scratch

5.1 Feature Scaling: Why It Matters

Before training a model using gradient descent, we need to scale the feature values. Otherwise, gradient descent may take tiny steps and converge very slowly (or not at all).

I used Z-score normalization:

\(X_{\text{scaled}} = \frac{X - \mu}{\sigma}\)

  • Where μ is the mean and σ is the standard deviation of the feature.

We scaled both our input feature (size_sqft) and our target variable (price) using this method.

Python
# Select columns
X = data['size_sqft']
y = data['price']

# Scaling the data
X_scaled = (X - X.mean()) / X.std()
y_scaled = (y - y.mean()) / y.std()
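
As a quick optional sanity check, the scaled values should now have a mean close to 0 and a standard deviation close to 1:

Python
# Sanity check (optional): z-score scaling should give mean ≈ 0 and std ≈ 1
print(f'X_scaled mean: {X_scaled.mean():.4f}, std: {X_scaled.std():.4f}')
print(f'y_scaled mean: {y_scaled.mean():.4f}, std: {y_scaled.std():.4f}')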

Step 6: Implementing the Cost Function (MSE)

To evaluate our model’s predictions, we use the Mean Squared Error (MSE):

\(J_{(w,b)} = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^{2}\)

  • m is the number of training examples
  • w is the weight (slope)
  • b is the bias (intercept)
  • ŷ is the predicted price
  • y is the actual price

The goal of training is to find values of w and b that minimize this cost function.

Python
def compute_cost(X, y, w, b):
    """
    Compute the Mean Squared Error (MSE) cost for linear regression.

    Parameters:
    ----------
    X : numpy array of shape (m,)
        The input feature values (e.g., house sizes). Already normalized if needed.
    
    y : numpy array of shape (m,)
        The actual target values (e.g., house prices).
    
    w : float
        The weight (slope) of the model — how much price increases with size.
    
    b : float
        The bias (intercept) of the model — the predicted price when size = 0.
    
    Returns:
    -------
    float
        The computed cost (error) using Mean Squared Error:
    """
    
    # Number of training examples
    m = len(y)

    # Calculate the model's predicted values for each input (ŷ = X * w + b)
    y_pred = np.dot(X, w) + b

    # Compute the squared difference between predicted and actual values
    squared_errors = (y_pred - y) ** 2

    # Compute the final cost using the MSE formula with a factor of 1/(2m)
    cost = (1 / (2 * m)) * np.sum(squared_errors)

    return cost
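
Before training, it can help to sanity-check the cost function with zero-initialized parameters. This is just a sketch; because y_scaled is standardized, the initial cost should come out close to 0.5 (half the variance of the scaled target):

Python
# Sanity check (sketch): cost with w = 0, b = 0 on the scaled data
initial_cost = compute_cost(X_scaled.to_numpy(), y_scaled.to_numpy(), 0.0, 0.0)
print(f'Initial cost: {initial_cost:.4f}')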

Step 7: Training with Gradient Descent

I implemented batch gradient descent manually. At each step, we updated the weight and bias using the gradients:

\(w = w - \alpha \cdot \frac{\partial J}{\partial w}\)

\(b = b - \alpha \cdot \frac{\partial J}{\partial b}\)

Python
def gradient_descent(X, y, w, b, learning_rate, iterations):
    """
    Performs batch gradient descent to optimize parameters w and b
    for a linear regression model.

    Parameters:
    ----------
    X : numpy array of shape (m,)
        Input features (e.g., house sizes), typically standardized.
    
    y : numpy array of shape (m,)
        Actual target values (e.g., house prices), also standardized if X is.
    
    w : float
        Initial weight (slope) of the model.
    
    b : float
        Initial bias (intercept) of the model.
    
    learning_rate : float
        Controls the step size during each iteration — how much to update w and b.
    
    iterations : int
        Total number of iterations to run gradient descent.

    Returns:
    -------
    w : float
        Final optimized weight after gradient descent.
    
    b : float
        Final optimized bias after gradient descent.
    
    cost_history : list of float
        Cost at each iteration (for visualization and analysis).
    """

    m = len(y)  # Number of training examples
    cost_history = []  # To keep track of cost over iterations

    for i in range(iterations):
        # STEP 1: Make predictions using the current w and b
        y_pred = np.dot(X, w) + b

        # STEP 2: Compute the error (difference between predictions and actual values)
        error = y_pred - y

        # STEP 3: Compute gradients for w and b
        dw = (1 / m) * np.dot(error, X)       # Gradient of cost w.r.t weight
        db = (1 / m) * np.sum(error)          # Gradient of cost w.r.t bias

        # STEP 4: Update the parameters using the gradients
        w -= learning_rate * dw
        b -= learning_rate * db

        # STEP 5: Calculate and store the current cost
        cost = compute_cost(X, y, w, b)
        cost_history.append(cost)

        # Print progress every 100 iterations
        if i % 100 == 0:
            print(f"Iteration {i}: Cost = {cost:.4f}")

    return w, b, cost_history

We chose:

  • Learning Rate (α) = 0.001
  • Iterations = 10,000 (but we stopped early upon convergence).

After about 2,500 iterations, the cost stopped improving significantly — which meant the model had converged.
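
For reference, here is a minimal sketch of how the training call might look (the zero initial parameters and the names final_weight and final_bias are assumptions, chosen to match the prediction step below):

Python
# Convert the scaled pandas Series to NumPy arrays for training
X_train = X_scaled.to_numpy()
y_train = y_scaled.to_numpy()

# Run batch gradient descent from zero-initialized parameters
final_weight, final_bias, cost_history = gradient_descent(
    X_train, y_train, w=0.0, b=0.0, learning_rate=0.001, iterations=10000
)

print(f'Final weight: {final_weight:.4f}, Final bias: {final_bias:.4f}')

# Plotting the cost history shows where the curve flattens out (convergence)
plt.plot(cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.title('Cost vs. Iterations')
plt.show()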

Final parameters:

  • Weight (w) ≈ 0.3707
  • Bias (b) ≈ 0.0000

Step 8: Making Predictions

Using the final model parameters, we predicted prices for the houses in the dataset. Since the model was trained on scaled values, we had to rescale predictions back to their original dollar amounts using:

\(\hat{y}_{\text{original}} = \hat{y}_{\text{scaled}} \cdot \sigma_{y} + \mu_{y}\)

Python
# Predict scaled prices using the learned parameters
y_hat_scaled = final_weight * X_scaled + final_bias

# Rescale predictions back to the original price range
y_hat = y_hat_scaled * y.std() + y.mean()

Step 9: Visualizing the Regression Line

We then plotted the predictions against the actual data to visualize the regression line. The result showed that our model captured the general upward trend between size and price, though not all points aligned perfectly.

Python
import matplotlib.pyplot as plt

# Plot actual data vs predicted line
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Actual Data', alpha=0.5)
plt.plot(X, y_hat, color='red', label='Regression Line', linewidth=2)
plt.title('Linear Regression: Predicted Price vs Size (sqft)')
plt.xlabel('Size (sqft)')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
Linear regression line

The line captured the general trend, but some outliers remained.

Step 10: Evaluating the Model

To quantify the model’s performance, we used:

  1. Root Mean Squared Error (RMSE)

    Measures the average prediction error in the same units as the target variable (price in dollars). Lower RMSE = better fit.

\(RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})^{2}}\)

  2. R² Score

    Represents the proportion of variance in the target explained by the model:

\(R^{2} = 1 - \frac{\sum_{i}(\hat{y}_{i}-y_{i})^{2}}{\sum_{i}(y_{i}-\bar{y})^{2}}\)

  • R² = 1.0: perfect prediction
  • R² = 0.0: model predicts the mean
  • R² < 0: worse than guessing the mean

Python
# Importing Scikit-Learn library that we will use for our evaluation
from sklearn.metrics import mean_squared_error, r2_score

# Compute RMSE
rmse = np.sqrt(mean_squared_error(y, y_hat))

# Compute R² score
r2 = r2_score(y, y_hat)

print(f'RMSE: ${rmse:,.2f}')
print(f'R² Score: {r2:.4f}')

RMSE: $923,043.35
R² Score: 0.1609
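
Since the goal of this project is to work from scratch, here is an equivalent NumPy-only sketch of the same metrics (assuming y and y_hat are aligned arrays or Series, as above):

Python
# NumPy-only versions of the same metrics (sketch)
rmse_np = np.sqrt(np.mean((y_hat - y) ** 2))

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
r2_np = 1 - ss_res / ss_tot

print(f'RMSE: ${rmse_np:,.2f}')
print(f'R² Score: {r2_np:.4f}')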

The R² value isn’t high, and that’s expected. This is a very simple model using just one feature. The purpose here wasn’t to beat a leaderboard, but to understand and implement the full pipeline from scratch.

Final Summary & Reflections

In this project, we built a linear regression model from scratch using NumPy to predict house prices in Seattle. We worked step by step through the full machine learning pipeline, including:

  • Data cleaning & preprocessing (unit standardization, missing value handling).
  • Exploratory data analysis (scatter plots, correlation checks).
  • Feature scaling using Z-score normalization.
  • Model implementation: cost function (MSE), gradient descent.
  • Model training and prediction with visualization.
  • Model evaluation using RMSE and R² Score.

Key Learnings

  • Implementing algorithms from scratch is the best way to deeply understand how they work.
  • Feature selection significantly impacts model performance.
  • Standardizing data is crucial when using gradient descent.
  • Real-world data is messy — handling units, missing values, and scale differences is essential.
  • Evaluation metrics tell the real story about a model’s performance.

Want to See the Full Code?

Check out the GitHub repository here: GitHub Repository Link
