Implementing Simple Linear Regression Part II

Honey
5 min read · Feb 16, 2024

[Image: Linear Regression]

1. Import the Necessary Libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Q. For what purpose do we import the above libraries?

  • numpy : Since we will convert our data to arrays, we will have to perform array operations. NumPy provides support for large, multi-dimensional arrays and matrices, along with a wide collection of mathematical functions to operate on them efficiently.
  • pandas : Pandas is a powerful tool for data manipulation and analysis in Python. It offers functions to efficiently handle structured data, such as tables, making it ideal for data preprocessing, cleaning, and exploration. With Pandas, we can easily load data from various file formats (e.g., CSV, Excel), filter rows, select columns, aggregate data, perform joins and merges, handle missing values, and much more.
  • matplotlib : Matplotlib is a popular plotting library in Python. It enables us to create static, interactive, and animated visualizations. In this article, we use Matplotlib for data visualization tasks such as plotting the scatter graph, the best-fit line, and the cost function.

2. Read Data

data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Linear Regression/Salary_dataset.csv')
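Before going further, it is worth a quick sanity check that the file loaded as expected (an addition, not in the original post; it assumes the dataset contains the YearsExperience and Salary columns used below):

print(data.shape)  # (rows, columns) of the loaded table
print(data.head()) # first five rows: YearsExperience and Salary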

3. Assign the Dependent and the Independent Variables

X= pd.DataFrame(data.YearsExperience)
Y= pd.DataFrame(data.Salary)
m = len(Y) # Number of training examples (rows in the Salary column)
X

OUTPUT —

[Table: values in X — the YearsExperience column]

Similarly, Y will be the column of ‘Salary’.

Now, recall that in the hypothesis h(x) = θ₀X₀ + θ₁X₁, the intercept term X₀ = 1 for every training example. So, as our next step, we add an additional column titled 'intercept' to X and assign it the value 1 throughout.

X['intercept'] = 1 # X₀ = 1 always

OUTPUT —

[Table: X with the new 'intercept' column, every value set to 1]
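One detail worth verifying (an addition, not in the original post): since 'intercept' was appended after 'YearsExperience', the feature is the first column of X and the constant is the second. This ordering decides which entry of theta becomes the slope and which the intercept later on.

print(X.columns.tolist())  # expected: ['YearsExperience', 'intercept']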

4. Plot the scatter plot to visualize the data in the dataset.

plt.figure(figsize=(11,8))
plt.scatter(data.YearsExperience, data.Salary) # X now holds two columns, so plot the raw data columns
plt.xlabel('Years of Experience')
plt.ylabel('Salary')

OUTPUT —

[Scatter plot: Salary vs. Years of Experience]

5. Convert from table to arrays for further operations

X and Y are still DataFrames, so we convert their columns to NumPy arrays and store them in x and y. We also create a theta array of size 2 to hold θ₀ and θ₁, initialized with random values (or, alternatively, with zeros).

x = np.array(X)
y = np.array(Y).flatten()
# theta = np.array([0,0])
theta = np.random.rand(2)
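A quick shape check (an addition, not in the original post) confirms that the dot products in the gradient descent function below will line up:

print(x.shape)      # (m, 2): YearsExperience plus the intercept column
print(y.shape)      # (m,): flattened Salary values
print(theta.shape)  # (2,): one parameter per column of x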

6. Gradient Descent Function

def gradient_descent(x, y, theta, iterations, L):
    past_costs = []
    past_thetas = [theta]
    for i in range(iterations):
        prediction = np.dot(x, theta)                    # h(x) = x · θ
        error = prediction - y                           # Difference between predictions and actual values
        cost = 1/(2*m) * np.dot(error.T, error)          # Cost function J(θ)
        past_costs.append(cost)
        theta = theta - (L * (1/m) * np.dot(x.T, error)) # Gradient descent update step
        past_thetas.append(theta)

    return past_thetas, past_costs

Animation —

[Animation: Gradient Descent Animation Example]

Q. How do we form the code for the gradient descent function?
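The code follows directly from the math. Our hypothesis is h(x) = θ₀ + θ₁x, which in vectorized form is the dot product x · θ (each row of x holds YearsExperience and the constant 1). The cost function measures the average squared error of the predictions:

J(θ) = (1/2m) · Σᵢ (h(xᵢ) − yᵢ)²

which is exactly the line cost = 1/(2*m) * np.dot(error.T, error). Differentiating J(θ) with respect to θ gives the gradient (1/m) · xᵀ(xθ − y), so each iteration takes a small step of size L against the gradient:

θ := θ − L · (1/m) · xᵀ(xθ − y)

which corresponds to theta = theta - (L * (1/m) * np.dot(x.T, error)). Repeating this for the given number of iterations moves θ steadily toward the values that minimize J(θ).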

7. Call the function to calculate the coefficients θ₀ and θ₁.

L = 0.01
iterations = 1500
past_theta, past_costs = gradient_descent(x, y, theta, iterations, L)
theta = past_theta[-1]
print(theta[0], theta[1])

Here,

L = 0.01 :

'L' is the learning rate, a small number that controls how big a step we take in each iteration while searching for the best parameters θ₀ and θ₁ for our model.

iterations = 1500 :

This sets the number of times we update our parameters based on the data. In simpler terms, it is deciding how many times we adjust our guess for the best parameters to fit the data better.

past_theta, past_costs = gradient_descent(x, y, theta, iterations, L) :

This line calls the gradient_descent function and applies the gradient descent algorithm to our data. It takes the parameters below; a quick sanity check on the result follows the list.

  • x — Input feature (independent variable), including the intercept column
  • y — Output (dependent variable)
  • theta — Initial random values for our parameters
  • iterations — Number of iterations to perform
  • L — Learning rate

Note that because 'intercept' is the second column of x, theta[0] holds the slope θ₁ and theta[1] holds the intercept θ₀; this is why step 9 computes the best-fit line as theta[1] + theta[0]*x.
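As the sanity check mentioned above (an addition, not in the original post), we can compare the parameters found by gradient descent with the exact least-squares solution computed directly by NumPy; the two should be close once gradient descent has converged:

exact_theta, _, _, _ = np.linalg.lstsq(x, y, rcond=None)  # exact least-squares fit
print('gradient descent:', theta)
print('least squares:', exact_theta)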

8. Plotting the cost function

# Plotting the cost function
plt.title('Cost Function')
plt.xlabel('No. of iterations')
plt.ylabel('Cost')
plt.plot(past_costs)

OUTPUT —

[Plot: cost decreasing over iterations]
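We can also confirm convergence numerically (a small addition, not in the original post) by comparing the first and last recorded costs:

print('initial cost:', past_costs[0])
print('final cost:', past_costs[-1])  # should be far smaller once gradient descent has converged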

9. Plotting the best fit line

## Plotting the best fit line
best_fit_x = np.linspace(0, 10, 2)
best_fit_y = [theta[1] + theta[0]*xx for xx in best_fit_x]

plt.figure(figsize=(11,8))
plt.scatter(data.YearsExperience, data.Salary)
plt.plot(best_fit_x, best_fit_y, '-')
plt.axis([1,10,20000,120000]) #Axis
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Salary vs. Years of Experience with Linear Regression Line')
OUTPUT —

[Plot: best-fit line for prediction over the scatter of Salary vs. Years of Experience]
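Finally, the fitted parameters let us make a point prediction (an illustrative addition; the value 5 is just an example input, not from the original post). Recall that theta[1] is the intercept θ₀ and theta[0] is the slope θ₁:

years = 5  # hypothetical years of experience
predicted_salary = theta[1] + theta[0] * years  # h(x) = θ₀ + θ₁x
print(predicted_salary)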

SUMMARY

In summary, the implementation above applies the gradient descent algorithm to find the best parameters for a model given some data. It adjusts the parameters (i.e. θ) over many iterations to minimize the error function (also called the cost function), i.e. to minimize the difference between the model's predictions and the actual data points. Once the optimization is done, it prints the final parameters θ₀ and θ₁, with which we form the final line:

OUTPUT : h(x) = θ₀ + θ₁x
