Linear Neural Networks

Linear Regression

Posted by Leonard Meng on February 28, 2018

1 Linear Regression

1.1 Linear Model

Suppose we want to predict the price of a house from its area and age. The linearity assumption just says that the target (price) can be expressed as a weighted sum of the features (area and age):

\[\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.\]

Here, $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (also called an offset or intercept).

In machine learning, we usually work with high-dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of $d$ features, we express our prediction $\hat{y}$ (in general the “hat” symbol denotes estimates) as

\[\hat{y} = w_1 x_1 + ... + w_d x_d + b.\]

Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly using a dot product:

\[\hat{y} = \mathbf{w}^\top \mathbf{x} + b.\]

The vector $\mathbf{x}$ corresponds to features of a single data example.
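
For concreteness, here is a minimal sketch of this prediction in PyTorch; the weight, bias, and feature values below are made up purely for illustration:

import torch

w = torch.tensor([0.8, -0.05])   # illustrative weights: w_area, w_age
b = torch.tensor(10.0)           # illustrative bias
x = torch.tensor([120.0, 20.0])  # features of one example: area, age

y_hat = torch.dot(w, x) + b      # equivalent to w^T x + b
print(y_hat)                     # tensor(105.)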

1.2 Loss Function

Before we start thinking about how to fit data with our model, we need to determine a measure of fitness. The loss function quantifies the distance between the real and predicted value of the target. The loss will usually be a non-negative number where smaller values are better and perfect predictions incur a loss of 0. The most popular loss function in regression problems is the squared error. When our prediction for an example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by:

\[l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.\]

The constant $\frac{1}{2}$ makes no real difference but will prove notationally convenient, canceling out when we take the derivative of the loss. Since the training dataset is given to us, and thus out of our control, the empirical error is only a function of the model parameters.
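
For example, if the prediction is $\hat{y}^{(i)} = 3$ while the true label is $y^{(i)} = 1$, the loss is $\frac{1}{2}(3-1)^2 = 2$.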

Note that large differences between estimates $\hat{y}^{(i)}$ and observations $y^{(i)}$ lead to even larger contributions to the loss, due to the quadratic dependence. To measure the quality of a model on the entire dataset of $n$ examples, we simply average (or equivalently, sum) the losses on the training set.

\[L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.\]

When training the model, we want to find parameters ($\mathbf{w}^*, b^*$) that minimize the total loss across all training examples:

\[\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).\]

For brevity, we can absorb the bias $b$ into the weight vector by appending a constant feature $x_0 = 1$ to every example, so that \(\hat{y} = \mathbf{w}^\top \mathbf{x}.\)
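
A short sketch of this trick in PyTorch (the matrix X and the values of w and b below are illustrative): augment each example with a constant feature $x_0 = 1$ and prepend $b$ to the weight vector; the predictions are unchanged.

import torch

X = torch.randn(5, 2)                            # 5 examples, 2 features
w = torch.tensor([2.0, -3.4])
b = 4.2

# Absorb the bias: x_0 = 1 for every example, w_0 = b.
X_aug = torch.cat([torch.ones(5, 1), X], dim=1)  # shape (5, 3)
w_aug = torch.cat([torch.tensor([b]), w])        # shape (3,)

assert torch.allclose(X @ w + b, X_aug @ w_aug)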

1.3 Minibatch Stochastic Gradient Descent

The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called minibatch stochastic gradient descent.

In each iteration, we first randomly sample a minibatch $\mathcal{B}$ consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model parameters. Finally, we multiply the gradient by a predetermined positive value $\eta$ and subtract the resulting term from the current parameter values.

We can express the update mathematically as follows ($\partial$ denotes the partial derivative):

\[(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).\]

To summarize, steps of the algorithm are the following: (i) we initialize the values of the model parameters, typically at random; (ii) we iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient. For quadratic losses and affine transformations, we can write this out explicitly as follows:

\[\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}\]
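
To illustrate these update equations directly, here is a sketch of a single explicit parameter update on a random minibatch; eta, X, and y below are placeholders rather than the dataset built in the next section.

import torch

eta = 0.03                 # learning rate (illustrative)
X = torch.randn(10, 2)     # a minibatch of 10 examples with 2 features
y = torch.randn(10)        # corresponding labels (random, for illustration)
w = torch.zeros(2)
b = 0.0

# Residuals (w^T x^(i) + b - y^(i)) for the whole minibatch.
err = X @ w + b - y        # shape (10,)

# One minibatch SGD step, matching the explicit update rule above.
w = w - eta / len(X) * (X * err[:, None]).sum(dim=0)
b = b - eta / len(X) * err.sum()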

2 Linear Regression Implementation from Scratch

2.1 Generating the Dataset

To keep things simple, we will construct an artificial dataset according to a linear model with additive noise. Our task will be to recover this model's parameters using the finite set of examples contained in our dataset. We will keep the data low-dimensional so we can visualize it easily. In the following code snippet, we generate a dataset containing 1000 examples, each consisting of 2 features sampled from a standard normal distribution. Thus our synthetic dataset will be a matrix $\mathbf{X}\in \mathbb{R}^{1000 \times 2}$.

The true parameters generating our dataset will be $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$, and our synthetic labels will be assigned according to the following linear model with the noise term $\boldsymbol{\epsilon}$:

\[\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}.\]

You could think of $\epsilon$ as capturing potential measurement errors on the features and labels. We will assume that the standard assumptions hold and thus that $\epsilon$ obeys a normal distribution with mean of 0. To make our problem easy, we will set its standard deviation to 0.01. The following code generates our synthetic dataset.

import torch
import numpy as np  # array-related library
import matplotlib.pyplot as plt  
import random

def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
print('features:', features[0],'\nlabel:', labels[0])

plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy(), 1)
plt.show()
features: tensor([1.9754, 0.7641]) 
label: tensor([5.5482])

(Figure: scatter plot of the second feature against the labels.)

2.2 Reading the Dataset

def data_iter(batch_size, features, labels):
    """Yield minibatches of (features, labels) in random order."""
    num_examples = len(features)
    indices = [i for i in range(num_examples)]
    # The examples are read in random order.
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
tensor([[ 1.9568, -0.3131],
        [-0.0249, -0.2747],
        [-0.5916,  0.1276],
        [-0.6786,  1.0830],
        [ 0.2486,  0.7605],
        [ 0.5870, -0.6182],
        [-0.9922,  0.6241],
        [ 0.6596,  0.2618],
        [-0.7723, -1.5498],
        [-1.3238, -2.0346]]) 
 tensor([[ 9.1606],
        [ 5.0627],
        [ 2.5748],
        [-0.8464],
        [ 2.1123],
        [ 7.4709],
        [ 0.0914],
        [ 4.6239],
        [ 7.9244],
        [ 8.4662]])

2.3 Defining the Model

# Initialize the parameters: small random weights and a zero bias.
w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

def linreg(X, w, b):  #@save
    """The linear regression model."""
    return torch.matmul(X, w) + b

2.4 Defining the Loss Function

def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    # Reshape y to match y_hat before computing the elementwise squared error.
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

2.5 Defining the Optimization Algorithm

def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():
        for param in params:
            # The loss is a sum over the minibatch, so normalize the gradient by its size.
            param -= lr * param.grad / batch_size
            param.grad.zero_()

2.6 Training

Now that we have all of the parts in place, we are ready to implement the main training loop. It is crucial that you understand this code because you will see nearly identical training loops over and over again throughout your career in deep learning.

In each iteration, we will grab a minibatch of training examples, and pass them through our model to obtain a set of predictions. After calculating the loss, we initiate the backwards pass through the network, storing the gradients with respect to each parameter. Finally, we will call the optimization algorithm sgd to update the model parameters.

In summary, we will execute the following loop:

  • Initialize parameters $(\mathbf{w}, b)$
  • Repeat until done
    • Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
    • Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$

In each epoch, we will iterate through the entire dataset (using the data_iter function) once, passing through every example in the training dataset (assuming that the number of examples is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both hyperparameters, which we set here to 5 and 0.01, respectively. Unfortunately, setting hyperparameters is tricky and requires some adjustment by trial and error. We elide these details for now but revisit them later.

lr = 0.01
num_epochs = 5
net = linreg
loss = squared_loss
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)
        l.sum().backward()
        sgd([w, b], lr, batch_size)
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
epoch 1, loss 35.439228
epoch 2, loss 4.737120
epoch 3, loss 0.637469
epoch 4, loss 0.086473
epoch 5, loss 0.011858

3 Concise Implementation of Linear Regression

3.1 Generating and Reading the Dataset

from torch.utils import data
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
def load_array(features, labels, batch_size, is_train=True):  #@save
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(features, labels)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
batch_size = 10
data_iter = load_array(features, labels, batch_size)
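
As a quick check that the loader behaves like the data_iter function from Section 2.2, one way to peek at a single minibatch is next(iter(...)):

# Peek at the first minibatch: X has shape (batch_size, 2), y has shape (batch_size, 1).
X, y = next(iter(data_iter))
print(X, '\n', y)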

3.2 Defining the Model

from torch import nn

net = nn.Sequential(nn.Linear(2, 1))
# Initializing Model Parameters
net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)
tensor([0.])

3.3 Defining the Loss Function

loss = nn.MSELoss()

3.4 Defining the Optimization Algorithm

trainer = torch.optim.SGD(net.parameters(), lr=0.03)

3.5 Training

num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()  # nn.MSELoss already reduces the loss to a scalar (the mean), so sum() is not needed
        trainer.step()
    l = loss(net(features), labels)

    print(f'epoch {epoch + 1}, loss {l:f}')
epoch 1, loss 0.000375
epoch 2, loss 0.000105
epoch 3, loss 0.000105

Appendix A Linear Regression with sklearn

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit ordinary least squares on the synthetic dataset and report the mean squared error.
reg = LinearRegression().fit(features, labels)
print(f'loss {mean_squared_error(reg.predict(features), labels):f}')
loss 0.000105
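
The fitted parameters can also be compared against the generating ones; reg.coef_ and reg.intercept_ hold sklearn's estimated weights and bias (the exact printed values will vary with the random data):

# Compare the estimated parameters with the ones used to generate the data.
print('estimated w:', reg.coef_, 'true w:', true_w.numpy())
print('estimated b:', reg.intercept_, 'true b:', true_b)
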
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You are welcome to repost it; please credit the source (Lingjun Meng's Blog) and keep the article content and this statement intact.