When I first started learning machine learning, I was fascinated by how algorithms could predict outcomes from data. But like many beginners, I relied heavily on high-level libraries like scikit-learn, treating models as black boxes that magically produced results. I wanted to change that, to truly understand what was happening under the hood.
So, I decided to build a linear regression model from scratch, using only NumPy, to predict house prices in Seattle. No shortcuts, no model.fit(), just raw math, code, and a lot of learning along the way.
Here’s the story of how I did it, the challenges I faced, and the insights I gained.
Why Build from Scratch?
Before diving into the code, I asked myself: Why go through the trouble of implementing an algorithm manually when libraries like scikit-learn already do it efficiently?
The answer? Understanding.
Anyone can call .fit() and .predict(), but if you can’t explain how gradient descent updates weights or why feature scaling matters, you’re missing the foundation of machine learning. By building from scratch, I forced myself to:
- Truly grasp the math behind linear regression.
- Debug issues when the model didn’t converge.
- Appreciate the importance of preprocessing.
Plus, recruiters love seeing this kind of depth: it shows you’re not just a library user, but someone who understands the mechanics.
Step 1: Understanding the Data
I started with the Seattle House Price Prediction dataset from Kaggle. To follow along with this article, you can find the full code implementation as a Jupyter Notebook in this GitHub repo. The dataset contained features like:
- size (square footage)
- beds (number of bedrooms)
- baths (number of bathrooms)
- lot_size (property land size)
- zip_code (location)
1.1 Initial Exploration
First, we load the data and check for missing values:
import pandas as pd
# Loading the dataset from the csv file
df = pd.read_csv('train.csv')
# Preview first few rows
df.head()
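The missing-value check itself only takes a line or two; a quick sketch of what that could look like:
# Count missing values per column
print(df.isnull().sum())
# Overview of column types and non-null counts
df.info()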
Here’s a sample of what the raw data looks like:

Looks simple, right? Well, not quite. As we’ll soon see, real-world data is messy, and cleaning it up is a crucial part of any machine learning workflow.
After loading the data into a Pandas DataFrame, the first thing I did was examine its structure. Right away, a few issues became obvious:
- Mixed units: The lot_size column contained both square feet and acres, inconsistent units that would confuse our model.
- Missing values: About 347 rows had missing data for lot_size, which meant we had incomplete examples.
- Redundant columns: The columns size_units and lot_size_units weren’t helpful once everything was standardized.
1.2 Data Cleaning
To fix these issues, we:
- Standardize units – Convert all lot sizes to square feet (1 acre = 43,560 sqft).
- Drop missing values – Critical for training stability.
- Remove unnecessary columns – Like size_units and lot_size_units; keep only numerical features relevant to price prediction.
# Convert acres to sqft
df['lot_size_sqft'] = df.apply(
lambda row: row['lot_size'] * 43560 if row['lot_size_units'] == 'acre' else row['lot_size'], axis=1
)
# Drop missing values
df.dropna(inplace=True)
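The redundant unit columns can then be dropped; a minimal sketch, assuming we keep lot_size_sqft in place of the original column:
# Drop the unit columns and the original lot_size (now replaced by lot_size_sqft)
df = df.drop(columns=['size_units', 'lot_size', 'lot_size_units'])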
1.3 Visualizing Relationships
Before jumping into modeling, I plotted the data to see trends:

Key insights include:
- size (square footage) had the strongest linear relationship with price.
- beds and baths were step-wise (less ideal for simple linear regression).
- lot_size_sqft was noisy; bigger lots didn’t always mean higher prices.
A correlation heatmap confirmed this:

Decision: We’d use size as the primary feature for the model.
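For reference, here is a rough sketch of how these plots can be produced with Matplotlib (the column names assume the cleaned DataFrame from Step 1.2):
import matplotlib.pyplot as plt

features = ['size', 'beds', 'baths', 'lot_size_sqft']

# Scatter plot of each feature against price
fig, axes = plt.subplots(1, len(features), figsize=(18, 4))
for ax, col in zip(axes, features):
    ax.scatter(df[col], df['price'], alpha=0.4)
    ax.set_xlabel(col)
    ax.set_ylabel('price')
plt.tight_layout()
plt.show()

# Correlation matrix rendered as a simple heatmap
corr = df[features + ['price']].corr()
plt.imshow(corr, cmap='coolwarm')
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.title('Correlation heatmap')
plt.show()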
Step 2: Implementing Linear Regression from Scratch
2.1 Feature Scaling: Why It Matters
Before training a model using gradient descent, we need to scale the feature values. Otherwise, gradient descent may take tiny steps and converge very slowly (or not at all).
I used Z-score normalization:
\(X_{scaled} = \frac{X - \mu}{\sigma}\)
where μ is the mean and σ is the standard deviation of the feature.
We scaled both our input feature size and our target variable price using this method.
# Select columns
X = df['size']
y = df['price']
# Scaling the data
X_scaled = (X - X.mean()) / X.std()
y_scaled = (y - y.mean()) / y.std()
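NumPy works with pandas Series directly, but converting to plain arrays keeps the later training code simple; an optional step:
# Convert the scaled Series to plain NumPy arrays
X_scaled = X_scaled.to_numpy()
y_scaled = y_scaled.to_numpy()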
Step 3: Implementing the Cost Function (MSE)
To evaluate our model’s predictions, we use the Mean Squared Error (MSE):
\(J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})^{2}\)
where:
- m is the number of training examples
- w is the weight (slope)
- b is the bias (intercept)
- ŷ is the predicted price
- y is the actual price
The goal of training is to find values of w and b that minimize this cost function.
import numpy as np

def compute_cost(X, y, w, b):
    """
    Compute the Mean Squared Error (MSE) cost for linear regression.

    Parameters:
    ----------
    X : numpy array of shape (m,)
        The input feature values (e.g., house sizes). Already normalized if needed.
    y : numpy array of shape (m,)
        The actual target values (e.g., house prices).
    w : float
        The weight (slope) of the model: how much price increases with size.
    b : float
        The bias (intercept) of the model: the predicted price when size = 0.

    Returns:
    -------
    float
        The computed cost (error) using Mean Squared Error.
    """
    # Number of training examples
    m = len(y)

    # Calculate the model's predicted values for each input (ŷ = X * w + b)
    y_pred = np.dot(X, w) + b

    # Compute the squared difference between predicted and actual values
    squared_errors = (y_pred - y) ** 2

    # Compute the final cost using the MSE formula with a factor of 1/(2m)
    cost = (1 / (2 * m)) * np.sum(squared_errors)

    return cost
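As a quick sanity check before training, we can evaluate the cost at a zero initialization (the starting values below are simply my choice, not anything prescribed by the dataset):
# With w = 0 and b = 0, the model predicts 0 for every (standardized) example
initial_cost = compute_cost(X_scaled, y_scaled, w=0.0, b=0.0)
print(f"Initial cost: {initial_cost:.4f}")
Because the target is standardized, this starting cost lands near 0.5 (half the variance of y_scaled), which gives a baseline that training should comfortably beat.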
Step 4: Training with Gradient Descent
I implemented batch gradient descent manually. At each step, we updated the weight and bias using the gradients:
\(w = w - \alpha \cdot \frac{\partial J}{\partial w}\)
\(b = b - \alpha \cdot \frac{\partial J}{\partial b}\)
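Expanding those derivatives for the MSE cost gives the exact expressions computed as dw and db in the code below:
\(\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})\,x^{(i)}\)
\(\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})\)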
def gradient_descent(X, y, w, b, learning_rate, iterations):
    """
    Performs batch gradient descent to optimize parameters w and b
    for a linear regression model.

    Parameters:
    ----------
    X : numpy array of shape (m,)
        Input features (e.g., house sizes), typically standardized.
    y : numpy array of shape (m,)
        Actual target values (e.g., house prices), also standardized if X is.
    w : float
        Initial weight (slope) of the model.
    b : float
        Initial bias (intercept) of the model.
    learning_rate : float
        Controls the step size during each iteration: how much to update w and b.
    iterations : int
        Total number of iterations to run gradient descent.

    Returns:
    -------
    w : float
        Final optimized weight after gradient descent.
    b : float
        Final optimized bias after gradient descent.
    cost_history : list of float
        Cost at each iteration (for visualization and analysis).
    """
    m = len(y)          # Number of training examples
    cost_history = []   # To keep track of cost over iterations

    for i in range(iterations):
        # STEP 1: Make predictions using the current w and b
        y_pred = np.dot(X, w) + b

        # STEP 2: Compute the error (difference between predictions and actual values)
        error = y_pred - y

        # STEP 3: Compute gradients for w and b
        dw = (1 / m) * np.dot(error, X)   # Gradient of cost w.r.t. weight
        db = (1 / m) * np.sum(error)      # Gradient of cost w.r.t. bias

        # STEP 4: Update the parameters using the gradients
        w -= learning_rate * dw
        b -= learning_rate * db

        # STEP 5: Calculate and store the current cost
        cost = compute_cost(X, y, w, b)
        cost_history.append(cost)

        # Print progress every 100 iterations
        if i % 100 == 0:
            print(f"Iteration {i}: Cost = {cost:.4f}")

    return w, b, cost_history
We chose:
- Learning Rate (α) = 0.001
- Iterations = 10,000 (but we stopped early upon convergence).
After about 2,500 iterations, the cost stopped improving significantly, which meant the model had converged.
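Putting it together, the training call looks something like this (here I initialize both parameters at zero):
# Run gradient descent on the standardized data
final_weight, final_bias, cost_history = gradient_descent(
    X_scaled, y_scaled, w=0.0, b=0.0, learning_rate=0.001, iterations=10000
)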
Final parameters:
- Weight (w) ≈ 0.3707
- Bias (b) ≈ 0.0000
Step 5: Making Predictions
Using the final model parameters, we predicted prices for the houses in the dataset. Since the model was trained on scaled values, we had to rescale predictions back to their original dollar amounts using:
\(\hat{y}_{\text{original}} = \hat{y}_{\text{scaled}} \cdot \sigma_{y} + \mu_{y}\)
# Statistics of the original (unscaled) target, used to undo the scaling
y_mean, y_std = y.mean(), y.std()

# Predict scaled prices
y_hat_scaled = final_weight * X_scaled + final_bias

# Rescale predictions to original price range
y_hat = y_hat_scaled * y_std + y_mean
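As a quick sanity check, a single prediction works the same way; the 2,000 sqft value below is purely illustrative:
# Predict the price of a hypothetical 2,000 sqft house
size_new = 2000
size_new_scaled = (size_new - X.mean()) / X.std()
price_pred = (final_weight * size_new_scaled + final_bias) * y_std + y_mean
print(f"Predicted price for {size_new} sqft: ${price_pred:,.0f}")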
Step 6: Visualizing the Regression Line
We then plotted the predictions against the actual data to visualize the regression line. The result showed that our model captured the general upward trend between size and price, though not all points aligned perfectly.
import matplotlib.pyplot as plt
# Plot actual data vs predicted line
plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Actual Data', alpha=0.5)
plt.plot(X, y_hat, color='red', label='Regression Line', linewidth=2)
plt.title('Linear Regression: Predicted Price vs Size (sqft)')
plt.xlabel('Size (sqft)')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()

The line captured the general trend, but some outliers remained.
Step 7: Evaluating the Model
To quantify the model’s performance, we used:
- Root Mean Squared Error (RMSE)
Measures the average prediction error in the same units as the target variable (price in dollars). Lower RMSE = better fit.
\(RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_{i}-y_{i})^{2}}\)
- R² Score
Represents the proportion of variance in the target explained by the model:
\(R^{2} = 1 - \frac{\sum_{i}(\hat{y}_{i}-y_{i})^{2}}{\sum_{i}(y_{i}-\bar{y})^{2}}\)
- R² = 1.0: perfect prediction
- R² = 0.0: model predicts the mean
- R² < 0: worse than guessing the mean
# Importing the Scikit-Learn metrics we will use for evaluation
from sklearn.metrics import mean_squared_error, r2_score

# Compute RMSE
rmse = np.sqrt(mean_squared_error(y, y_hat))

# Compute R² score
r2 = r2_score(y, y_hat)

print(f"RMSE: ${rmse:,.2f}")
print(f"R² Score: {r2:.4f}")
RMSE: $923,043.35
R² Score: 0.1609
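For completeness, and in the spirit of doing things from scratch, the same two metrics can also be computed with plain NumPy:
# RMSE: average prediction error in dollars
rmse_np = np.sqrt(np.mean((y_hat - y) ** 2))

# R²: 1 minus the ratio of residual variance to total variance
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_np = 1 - ss_res / ss_tot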
The R² value isn’t high, and that’s expected. This is a very simple model using just one feature. The purpose here wasn’t to beat a leaderboard, but to understand and implement the full pipeline from scratch.
Final Summary & Reflections
In this project, we built a linear regression model from scratch using NumPy to predict house prices in Seattle. We worked step by step through the full machine learning pipeline, including:
- Data cleaning & preprocessing (unit standardization, missing value handling).
- Exploratory data analysis (scatter plots, correlation checks).
- Feature scaling using Z-score normalization.
- Model implementation: cost function (MSE), gradient descent.
- Model training and prediction with visualization.
- Model evaluation using RMSE and R² Score.
Key Learnings
- Implementing algorithms from scratch is the best way to deeply understand how they work.
- Feature selection significantly impacts model performance.
- Standardizing data is crucial when using gradient descent.
- Real-world data is messy — handling units, missing values, and scale differences is essential.
- Evaluation metrics tell the real story about a model’s performance.
Want to See the Full Code?
Check out the GitHub repository here: GitHub Repository Link