import numpy as np
import scipy.linalg
import matplotlib.pyplot as plt
%matplotlib inline
Linear Regression Tutorial
by Marc Deisenroth
The purpose of this notebook is to practice implementing some linear algebra (equations provided) and to explore some properties of linear regression.
We consider a linear regression problem of the form \[ y = \boldsymbol x^T\boldsymbol\theta + \epsilon\,,\quad \epsilon \sim \mathcal N(0, \sigma^2) \] where \(\boldsymbol x\in\mathbb{R}^D\) are inputs and \(y\in\mathbb{R}\) are noisy observations. The parameter vector \(\boldsymbol\theta\in\mathbb{R}^D\) parametrizes the function.
We assume we have a training set \((\boldsymbol x_n, y_n)\), \(n=1,\ldots, N\). We summarize the sets of training inputs in \(\mathcal X = \{\boldsymbol x_1, \ldots, \boldsymbol x_N\}\) and corresponding training targets \(\mathcal Y = \{y_1, \ldots, y_N\}\), respectively.
In this tutorial, we are interested in finding good parameters \(\boldsymbol\theta\).
# Define training set
X = np.array([-3, -1, 0, 1, 3]).reshape(-1,1) # 5x1 vector, N=5, D=1
y = np.array([-1.2, -0.7, 0.14, 0.67, 1.67]).reshape(-1,1) # 5x1 vector

# Plot the training set
plt.figure()
plt.plot(X, y, '+', markersize=10)
plt.xlabel("$x$")
plt.ylabel("$y$");
1. Maximum Likelihood
We will start with maximum likelihood estimation of the parameters \(\boldsymbol\theta\). In maximum likelihood estimation, we find the parameters \(\boldsymbol\theta^{\mathrm{ML}}\) that maximize the likelihood \[ p(\mathcal Y | \mathcal X, \boldsymbol\theta) = \prod_{n=1}^N p(y_n | \boldsymbol x_n, \boldsymbol\theta)\,. \] From the lecture we know that the maximum likelihood estimator is given by \[ \boldsymbol\theta^{\text{ML}} = (\boldsymbol X^T\boldsymbol X)^{-1}\boldsymbol X^T\boldsymbol y\in\mathbb{R}^D\,, \] where \[ \boldsymbol X = [\boldsymbol x_1, \ldots, \boldsymbol x_N]^T\in\mathbb{R}^{N\times D}\,,\quad \boldsymbol y = [y_1, \ldots, y_N]^T \in\mathbb{R}^N\,. \]
Let us compute the maximum likelihood estimate for a given training set.
## EDIT THIS FUNCTION
def max_lik_estimate(X, y):
    # X: N x D matrix of training inputs
    # y: N x 1 vector of training targets/observations
    # returns: maximum likelihood parameters (D x 1)
    N, D = X.shape
    theta_ml = np.linalg.solve(X.T @ X, X.T @ y) ## <-- SOLUTION
    return theta_ml
# get maximum likelihood estimate
theta_ml = max_lik_estimate(X,y)
print(theta_ml)
[[0.499]]
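As a cross-check on the closed-form estimator, the sketch below solves the same toy problem twice: once through the normal equations and once with NumPy's least-squares routine `np.linalg.lstsq`. The names `X_demo`, `y_demo`, `theta_normal` and `theta_lstsq` are illustrative and not part of the notebook; the data are copied from the training set defined earlier.

```python
import numpy as np

# Same toy training set as in the notebook (N=5, D=1); names are illustrative
X_demo = np.array([-3, -1, 0, 1, 3], dtype=float).reshape(-1, 1)
y_demo = np.array([-1.2, -0.7, 0.14, 0.67, 1.67]).reshape(-1, 1)

# Normal equations: solve (X^T X) theta = X^T y
theta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver: minimizes ||X theta - y||^2 without forming X^T X
theta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(theta_normal.ravel(), theta_lstsq.ravel())
```

Both routes should agree up to floating-point error. For badly conditioned inputs the least-squares solver is usually the more robust choice, because it avoids squaring the condition number through \(\boldsymbol X^T\boldsymbol X\).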
Now, make a prediction using the maximum likelihood estimate that we just found.
## EDIT THIS FUNCTION
def predict_with_estimate(Xtest, theta):
    # Xtest: K x D matrix of test inputs
    # theta: D x 1 vector of parameters
    # returns: prediction of f(Xtest); K x 1 vector
    prediction = Xtest @ theta ## <-- SOLUTION

    return prediction
Now, let’s see whether we got something useful:
# define a test set
Xtest = np.linspace(-5,5,100).reshape(-1,1) # 100 x 1 vector of test inputs

# predict the function values at the test points using the maximum likelihood estimator
ml_prediction = predict_with_estimate(Xtest, theta_ml)

# plot
plt.figure()
plt.plot(X, y, '+', markersize=10)
plt.plot(Xtest, ml_prediction)
plt.xlabel("$x$")
plt.ylabel("$y$");
Questions
- Does the solution above look reasonable?
- Play around with different values of \(\theta\). How do the corresponding functions change?
- Modify the training targets \(\mathcal Y\) and re-run your computation. What changes?
Let us now look at a different training set, where we add 2.0 to every \(y\)-value, and compute the maximum likelihood estimate
ynew = y + 2.0

plt.figure()
plt.plot(X, ynew, '+', markersize=10)
plt.xlabel("$x$")
plt.ylabel("$y$");
# get maximum likelihood estimate
theta_ml = max_lik_estimate(X, ynew)
print(theta_ml)

# define a test set
Xtest = np.linspace(-5,5,100).reshape(-1,1) # 100 x 1 vector of test inputs

# predict the function values at the test points using the maximum likelihood estimator
ml_prediction = predict_with_estimate(Xtest, theta_ml)

# plot
plt.figure()
plt.plot(X, ynew, '+', markersize=10)
plt.plot(Xtest, ml_prediction)
plt.xlabel("$x$")
plt.ylabel("$y$");
[[0.499]]
Question:
- This maximum likelihood estimate doesn’t look too good: The orange line is too far away from the observations although we just shifted them by 2. Why is this the case?
- How can we fix this problem?
Let us now define a linear regression model that is slightly more flexible: \[ y = \theta_0 + \boldsymbol x^T \boldsymbol\theta_1 + \epsilon\,,\quad \epsilon\sim\mathcal N(0,\sigma^2) \] Here, we added an offset (bias) parameter \(\theta_0\) to our original model.
Question:
- What is the effect of this bias parameter, i.e., what additional flexibility does it offer?
If we now define the inputs to be the augmented vector \(\boldsymbol x_{\text{aug}} = \begin{bmatrix}1\\\boldsymbol x\end{bmatrix}\), we can write the new linear regression model as \[ y = \boldsymbol x_{\text{aug}}^T\boldsymbol\theta_{\text{aug}} + \epsilon\,,\quad \boldsymbol\theta_{\text{aug}} = \begin{bmatrix} \theta_0\\ \boldsymbol\theta_1 \end{bmatrix}\,. \]
N, D = X.shape
X_aug = np.hstack([np.ones((N,1)), X]) # augmented training inputs of size N x (D+1)
theta_aug = np.zeros((D+1, 1)) # new theta vector of size (D+1) x 1
Let us now compute the maximum likelihood estimator for this setting. Hint: If possible, re-use code that you have already written.
## EDIT THIS FUNCTION
def max_lik_estimate_aug(X_aug, y):
    theta_aug_ml = max_lik_estimate(X_aug, y) ## <-- SOLUTION

    return theta_aug_ml

theta_aug_ml = max_lik_estimate_aug(X_aug, y)
theta_aug_ml
array([[0.116],
[0.499]])
Now, we can make predictions again:
# define a test set (we also need to augment the test inputs with ones)
Xtest_aug = np.hstack([np.ones((Xtest.shape[0],1)), Xtest]) # 100 x (D + 1) vector of test inputs

# predict the function values at the test points using the maximum likelihood estimator
ml_prediction = predict_with_estimate(Xtest_aug, theta_aug_ml)

print(ml_prediction.shape)

# plot
plt.figure()
plt.plot(X, y, '+', markersize=10)
plt.plot(Xtest, ml_prediction)
plt.xlabel("$x$")
plt.ylabel("$y$");
(100, 1)
It seems this has solved our problem!
Question:
- Play around with the first parameter of \(\boldsymbol\theta_{\text{aug}}\) and see how the fit of the function changes.
- Play around with the second parameter of \(\boldsymbol\theta_{\text{aug}}\) and see how the fit of the function changes.
Nonlinear Features
So far, we have looked at linear regression with linear features. This allowed us to fit straight lines. However, linear regression also allows us to fit functions that are nonlinear in the inputs \(\boldsymbol x\), as long as the parameters \(\boldsymbol\theta\) appear linearly. This means, we can learn functions of the form \[ f(\boldsymbol x, \boldsymbol\theta) = \sum_{k = 1}^K \theta_k \phi_k(\boldsymbol x)\,, \] where the features \(\phi_k(\boldsymbol x)\) are (possibly nonlinear) transformations of the inputs \(\boldsymbol x\).
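To make the "nonlinear in the inputs, linear in the parameters" idea concrete, here is a minimal sketch with a hand-picked feature map \(\boldsymbol\phi(x) = [1, x, \sin x]^T\). The helper `feature_matrix` and the chosen features are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np

def feature_matrix(x, feature_fns):
    # Hypothetical helper: x is an N x 1 array of inputs, feature_fns a list
    # of K callables; returns the N x K matrix with entries phi_k(x_n)
    return np.hstack([fn(x) for fn in feature_fns])

# Illustrative feature choice (an assumption): phi(x) = [1, x, sin(x)]
feature_fns = [lambda x: np.ones_like(x), lambda x: x, np.sin]

x_demo = np.linspace(-3, 3, 7).reshape(-1, 1)
theta_demo = np.array([[0.5], [-1.0], [2.0]])

Phi_demo = feature_matrix(x_demo, feature_fns)   # 7 x 3
f_demo = Phi_demo @ theta_demo                   # nonlinear in x, linear in theta

# Linearity in theta: doubling the parameters doubles the function values
f_doubled = Phi_demo @ (2.0 * theta_demo)
```

Because the model stays linear in \(\boldsymbol\theta\), the closed-form estimators from the previous section carry over unchanged once \(\boldsymbol X\) is replaced by the feature matrix.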
Let us have a look at an example where the observations clearly do not lie on a straight line:
y = np.array([10.05, 1.5, -1.234, 0.02, 8.03]).reshape(-1,1)

plt.figure()
plt.plot(X, y, '+')
plt.xlabel("$x$")
plt.ylabel("$y$");
Polynomial Regression
One class of functions that is covered by linear regression is the family of polynomials because we can write a polynomial of degree \(K\) as \[ \sum_{k=0}^K \theta_k x^k = \boldsymbol \phi(x)^T\boldsymbol\theta\,,\quad \boldsymbol\phi(x)= \begin{bmatrix} x^0\\ x^1\\ \vdots\\ x^K \end{bmatrix}\in\mathbb{R}^{K+1}\,. \] Here, \(\boldsymbol\phi(x)\) is a nonlinear feature transformation of the inputs \(x\in\mathbb{R}\).
Similar to the earlier case we can define a matrix that collects all the feature transformations of the training inputs: \[ \boldsymbol\Phi = \begin{bmatrix} \boldsymbol\phi(x_1) & \boldsymbol\phi(x_2) & \cdots & \boldsymbol\phi(x_N) \end{bmatrix}^T \in\mathbb{R}^{N\times (K+1)} \]
Let us start by computing the feature matrix \(\boldsymbol \Phi\)
## EDIT THIS FUNCTION
def poly_features(X, K):
    # X: inputs of size N x 1
    # K: degree of the polynomial
    # computes the feature matrix Phi (N x (K+1))
    X = X.flatten()
    N = X.shape[0]

    #initialize Phi
    Phi = np.zeros((N, K+1))

    # Compute the feature matrix in stages
    for k in range(K+1):
        Phi[:,k] = X**k ## <-- SOLUTION
    return Phi
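The polynomial feature matrix is a Vandermonde matrix, so NumPy's `np.vander` with `increasing=True` should produce the same result as the column-by-column loop. The sketch below checks this on a small example; `x_demo`, `K_demo`, `Phi_loop` and `Phi_vander` are illustrative names.

```python
import numpy as np

x_demo = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
K_demo = 3  # polynomial degree

# Column-by-column construction, mirroring the loop used in this notebook
Phi_loop = np.zeros((x_demo.shape[0], K_demo + 1))
for k in range(K_demo + 1):
    Phi_loop[:, k] = x_demo**k

# Vandermonde matrix with columns x^0, x^1, ..., x^K
Phi_vander = np.vander(x_demo, N=K_demo + 1, increasing=True)

print(np.allclose(Phi_loop, Phi_vander))
```

Note that `np.vander` expects a one-dimensional input, so an N x 1 array would need to be flattened first.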
With this feature matrix we get the maximum likelihood estimator as \[ \boldsymbol \theta^\text{ML} = (\boldsymbol\Phi^T\boldsymbol\Phi)^{-1}\boldsymbol\Phi^T\boldsymbol y \] For reasons of numerical stability, we often add a small diagonal “jitter” \(\kappa>0\) to \(\boldsymbol\Phi^T\boldsymbol\Phi\) so that the matrix can be inverted without significant problems. The maximum likelihood estimate then becomes \[ \boldsymbol \theta^\text{ML} = (\boldsymbol\Phi^T\boldsymbol\Phi + \kappa\boldsymbol I)^{-1}\boldsymbol\Phi^T\boldsymbol y \]
## EDIT THIS FUNCTION
def nonlinear_features_maximum_likelihood(Phi, y):
    # Phi: features matrix for training inputs. Size of N x D
    # y: training targets. Size of N by 1
    # returns: maximum likelihood estimator theta_ml. Size of D x 1
    kappa = 1e-08 # 'jitter' term; good for numerical stability

    D = Phi.shape[1]

    # maximum likelihood estimate
    Pt = Phi.T @ y # Phi^T*y
    PP = Phi.T @ Phi + kappa*np.eye(D) # Phi^T*Phi + kappa*I

    # maximum likelihood estimate
    C = scipy.linalg.cho_factor(PP)
    theta_ml = scipy.linalg.cho_solve(C, Pt) # inv(Phi^T*Phi)*Phi^T*y

    return theta_ml
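The effect of the jitter term is easiest to see on a deliberately degenerate case. The sketch below uses a feature matrix with two identical columns (an artificial assumption, chosen so that \(\boldsymbol\Phi^T\boldsymbol\Phi\) is exactly singular) and compares the numerical rank with and without \(\kappa\). It uses `np.linalg.solve` rather than the Cholesky routines to keep the example NumPy-only; all names are illustrative.

```python
import numpy as np

# Degenerate feature matrix: two identical columns, so Phi^T Phi is singular
Phi_demo = np.ones((3, 2))
y_demo = np.array([[1.0], [2.0], [3.0]])
kappa = 1e-8  # same jitter magnitude as in the function above

PP_plain = Phi_demo.T @ Phi_demo
PP_jitter = PP_plain + kappa * np.eye(2)

rank_plain = np.linalg.matrix_rank(PP_plain)    # rank-deficient
rank_jitter = np.linalg.matrix_rank(PP_jitter)  # full rank

# With jitter the linear system has a unique solution
theta_jitter = np.linalg.solve(PP_jitter, Phi_demo.T @ y_demo)
print(rank_plain, rank_jitter, theta_jitter.ravel())
```

The jitter picks one solution out of the infinitely many that fit equally well, here the one that splits the weight evenly between the two identical features. The price is a tiny bias in the estimate, which is why \(\kappa\) is kept small.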
Now we have all the ingredients together: The computation of the feature matrix and the computation of the maximum likelihood estimator for polynomial regression. Let’s see how this works.
To make predictions at test inputs \(\boldsymbol X_{\text{test}}\in\mathbb{R}\), we need to compute the features (nonlinear transformations) \(\boldsymbol\Phi_{\text{test}}= \boldsymbol\phi(\boldsymbol X_{\text{test}})\) of \(\boldsymbol X_{\text{test}}\) to give us the predicted mean \[ \mathbb{E}[\boldsymbol y_{\text{test}}] = \boldsymbol \Phi_{\text{test}}\boldsymbol\theta^{\text{ML}} \]
K = 5 # Define the degree of the polynomial we wish to fit
Phi = poly_features(X, K) # N x (K+1) feature matrix

theta_ml = nonlinear_features_maximum_likelihood(Phi, y) # maximum likelihood estimator

# test inputs
Xtest = np.linspace(-4,4,100).reshape(-1,1)

# feature matrix for test inputs
Phi_test = poly_features(Xtest, K)

y_pred = Phi_test @ theta_ml # predicted y-values

plt.figure()
plt.plot(X, y, '+')
plt.plot(Xtest, y_pred)
plt.xlabel("$x$")
plt.ylabel("$y$")
Experiment with different polynomial degrees in the code above.
Questions:
- What do you observe?
- What is a good fit?
Evaluating the Quality of the Model
Let us have a look at a more interesting data set
def f(x):
    return np.cos(x) + 0.2*np.random.normal(size=(x.shape))

X = np.linspace(-4,4,20).reshape(-1,1)
y = f(X)

plt.figure()
plt.plot(X, y, '+')
plt.xlabel("$x$")
plt.ylabel("$y$");
Now, let us use the work from above and fit polynomials to this dataset.
## EDIT THIS CELL
K = 6 # Define the degree of the polynomial we wish to fit

Phi = poly_features(X, K) # N x (K+1) feature matrix

theta_ml = nonlinear_features_maximum_likelihood(Phi, y) # maximum likelihood estimator

# test inputs
Xtest = np.linspace(-5,5,100).reshape(-1,1)
ytest = f(Xtest) # ground-truth y-values

# feature matrix for test inputs
Phi_test = poly_features(Xtest, K)

y_pred = Phi_test @ theta_ml # predicted y-values

# plot
plt.figure()
plt.plot(X, y, '+')
plt.plot(Xtest, y_pred)
plt.plot(Xtest, ytest)
plt.legend(["data", "prediction", "ground truth observations"])
plt.xlabel("$x$")
plt.ylabel("$y$");
Questions:
- Try out different degrees of polynomials.
- Based on visual inspection, what looks like the best fit?
Let us now look at a more systematic way to assess the quality of the polynomial that we are trying to fit. For this, we compute the root-mean-squared-error (RMSE) between the \(y\)-values predicted by our polynomial and the ground-truth \(y\)-values. The RMSE is then defined as \[ \text{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^N(y_n - y_n^\text{pred})^2} \] Write a function that computes the RMSE.
## EDIT THIS FUNCTION
def RMSE(y, ypred):
    rmse = np.sqrt(np.mean((y-ypred)**2)) ## SOLUTION
    return rmse
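A small hand-checkable example helps confirm the RMSE formula, and it also exposes a common shape pitfall. In the sketch below, `rmse_demo` is an illustrative stand-in with the same formula as the function above, and the numbers are made up so the result can be verified by hand.

```python
import numpy as np

def rmse_demo(y_true, y_hat):
    # Root-mean-squared error; same formula as in the notebook
    return np.sqrt(np.mean((y_true - y_hat)**2))

y_true = np.array([0.0, 0.0, 0.0, 0.0])
y_hat = np.array([3.0, 4.0, 0.0, 0.0])

# Squared errors are 9, 16, 0, 0; their mean is 6.25; the square root is 2.5
value = rmse_demo(y_true, y_hat)

# Pitfall: mixing an (N, 1) column with an (N,) vector broadcasts to N x N,
# which silently averages over all pairs instead of matching entries
mismatched = y_true.reshape(-1, 1) - y_hat
print(value, mismatched.shape)
```

In this notebook both targets and predictions are N x 1 columns, so the subtraction is element-wise as intended.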
Now compute the RMSE for different degrees of the polynomial we want to fit.
## EDIT THIS CELL
K_max = 20
rmse_train = np.zeros((K_max+1,))

for k in range(K_max+1):
    # feature matrix
    Phi = poly_features(X, k)

    # maximum likelihood estimate
    theta_ml = nonlinear_features_maximum_likelihood(Phi, y)

    # predict y-values of training set
    ypred_train = Phi @ theta_ml

    # RMSE on training set
    rmse_train[k] = RMSE(y, ypred_train)

plt.figure()
plt.plot(rmse_train)
plt.xlabel("degree of polynomial")
plt.ylabel("RMSE");
Question:
- What do you observe?
- What is the best polynomial fit according to this plot?
- Write some code that plots the function that uses the best polynomial degree (use the test set for this plot). What do you observe now?
# WRITE THE PLOTTING CODE HERE
plt.figure()
plt.plot(X, y, '+')

# feature matrix
Phi = poly_features(X, 5)

# maximum likelihood estimate
theta_ml = nonlinear_features_maximum_likelihood(Phi, y)

# feature matrix for test inputs
Phi_test = poly_features(Xtest, 5)

ypred_test = Phi_test @ theta_ml

plt.plot(Xtest, ypred_test)
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.legend(["data", "maximum likelihood fit"]);
The RMSE on the training data is somewhat misleading, because we are interested in the generalization performance of the model. Therefore, we are going to compute the RMSE on the test set and use this to choose a good polynomial degree.
## EDIT THIS CELL
K_max = 20
rmse_train = np.zeros((K_max+1,))
rmse_test = np.zeros((K_max+1,))

for k in range(K_max+1):
    # feature matrix
    Phi = poly_features(X, k)

    # maximum likelihood estimate
    theta_ml = nonlinear_features_maximum_likelihood(Phi, y)

    # predict y-values of training set
    ypred_train = Phi @ theta_ml

    # RMSE on training set
    rmse_train[k] = RMSE(y, ypred_train)

    # feature matrix for test inputs
    Phi_test = poly_features(Xtest, k)

    # prediction
    ypred_test = Phi_test @ theta_ml

    # RMSE on test set
    rmse_test[k] = RMSE(ytest, ypred_test)

plt.figure()
plt.semilogy(rmse_train) # this plots the RMSE on a logarithmic scale
plt.semilogy(rmse_test) # this plots the RMSE on a logarithmic scale
plt.xlabel("degree of polynomial")
plt.ylabel("RMSE")
plt.legend(["training set", "test set"]);
Questions:
- What do you observe now?
- Why does the RMSE for the test set not always go down?
- Which polynomial degree would you choose now?
- Plot the fit for the “best” polynomial degree.
# WRITE THE PLOTTING CODE HERE
plt.figure()
plt.plot(X, y, '+')
k = 5

# feature matrix
Phi = poly_features(X, k)

# maximum likelihood estimate
theta_ml = nonlinear_features_maximum_likelihood(Phi, y)

# feature matrix for test inputs
Phi_test = poly_features(Xtest, k)

ypred_test = Phi_test @ theta_ml

plt.plot(Xtest, ypred_test)
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.legend(["data", "maximum likelihood fit"]);
Question
If you did not have a designated test set, what could you do to estimate the generalization error (purely using the training set)?
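One common answer is K-fold cross-validation: split the training set into folds, fit on all but one fold, measure the RMSE on the held-out fold, and average over the folds. The sketch below is one possible implementation under stated assumptions, not the notebook's own method: `poly_design` and `kfold_rmse` are hypothetical helpers, the fit uses `np.linalg.lstsq` instead of the jittered Cholesky solve, and the data are a freshly generated noisy cosine.

```python
import numpy as np

def poly_design(x, degree):
    # N x (degree+1) polynomial feature matrix with columns x^0, ..., x^degree
    return np.vander(x.ravel(), N=degree + 1, increasing=True)

def kfold_rmse(x, y, degree, num_folds=5, seed=0):
    # Hypothetical helper: average validation RMSE over num_folds random splits
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.shape[0]), num_folds)
    scores = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(x.shape[0]), val_idx)
        # Fit on the training folds only (least squares, an assumption)
        Phi_train = poly_design(x[train_idx], degree)
        theta, *_ = np.linalg.lstsq(Phi_train, y[train_idx], rcond=None)
        # Evaluate on the held-out fold
        resid = y[val_idx] - poly_design(x[val_idx], degree) @ theta
        scores.append(np.sqrt(np.mean(resid**2)))
    return float(np.mean(scores))

# Illustrative data: noisy cosine on 20 inputs
rng = np.random.default_rng(1)
x_cv = np.linspace(-4, 4, 20).reshape(-1, 1)
y_cv = np.cos(x_cv) + 0.2 * rng.normal(size=x_cv.shape)

cv_scores = [kfold_rmse(x_cv, y_cv, degree) for degree in range(9)]
best_degree = int(np.argmin(cv_scores))
print(best_degree, cv_scores[best_degree])
```

With only 20 points the estimate is noisy and depends on the random split, so it is best read as a rough guide to a sensible range of degrees rather than a precise ranking.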
2. Maximum A Posteriori Estimation
We are still considering the model \[ y = \boldsymbol\phi(\boldsymbol x)^T\boldsymbol\theta + \epsilon\,,\quad \epsilon\sim\mathcal N(0,\sigma^2)\,. \] We assume that the noise variance \(\sigma^2\) is known.
Instead of maximizing the likelihood, we can look at the maximum of the posterior distribution on the parameters \(\boldsymbol\theta\), which is given as \[ p(\boldsymbol\theta|\mathcal X, \mathcal Y) = \frac{\overbrace{p(\mathcal Y|\mathcal X, \boldsymbol\theta)}^{\text{likelihood}}\overbrace{p(\boldsymbol\theta)}^{\text{prior}}}{\underbrace{p(\mathcal Y|\mathcal X)}_{\text{evidence}}} \] The purpose of the parameter prior \(p(\boldsymbol\theta)\) is to discourage the parameters from attaining extreme values, which is a sign that the model overfits. The prior allows us to specify a “reasonable” range of parameter values. Typically, we choose a Gaussian prior \(\mathcal N(\boldsymbol 0, \alpha^2\boldsymbol I)\), centered at \(\boldsymbol 0\) with variance \(\alpha^2\) along each parameter dimension.
The MAP estimate of the parameters is \[ \boldsymbol\theta^{\text{MAP}} = (\boldsymbol\Phi^T\boldsymbol\Phi + \frac{\sigma^2}{\alpha^2}\boldsymbol I)^{-1}\boldsymbol\Phi^T\boldsymbol y \] where \(\sigma^2\) is the variance of the noise.
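For readers who want to see where this expression comes from, the following short derivation sketches the standard argument: write down the negative log-posterior for the Gaussian likelihood and the Gaussian prior \(\mathcal N(\boldsymbol 0, \alpha^2\boldsymbol I)\), set its gradient to zero, and rearrange.

```latex
\begin{align}
-\log p(\boldsymbol\theta \mid \mathcal X, \mathcal Y)
  &= \frac{1}{2\sigma^2}\,\lVert \boldsymbol y - \boldsymbol\Phi\boldsymbol\theta \rVert^2
   + \frac{1}{2\alpha^2}\,\lVert \boldsymbol\theta \rVert^2 + \text{const} \\
\boldsymbol 0
  &= -\frac{1}{\sigma^2}\,\boldsymbol\Phi^T(\boldsymbol y - \boldsymbol\Phi\boldsymbol\theta)
   + \frac{1}{\alpha^2}\,\boldsymbol\theta \\
\Big(\boldsymbol\Phi^T\boldsymbol\Phi + \frac{\sigma^2}{\alpha^2}\boldsymbol I\Big)\,\boldsymbol\theta^{\text{MAP}}
  &= \boldsymbol\Phi^T\boldsymbol y
\end{align}
```

The second line is the gradient with respect to \(\boldsymbol\theta\) set to zero. The result has the same form as the jittered maximum likelihood estimate, with \(\sigma^2/\alpha^2\) in place of \(\kappa\): a narrower prior (smaller \(\alpha^2\)) means stronger regularization.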
## EDIT THIS FUNCTION
def map_estimate_poly(Phi, y, sigma, alpha):
    # Phi: training inputs, Size of N x D
    # y: training targets, Size of N x 1
    # sigma: standard deviation of the noise
    # alpha: standard deviation of the prior on the parameters
    # returns: MAP estimate theta_map, Size of D x 1
    D = Phi.shape[1]

    # SOLUTION
    PP = Phi.T @ Phi + (sigma/alpha)**2 * np.eye(D)
    theta_map = scipy.linalg.solve(PP, Phi.T @ y)

    return theta_map
# define the function we wish to estimate later
def g(x, sigma):
    p = np.hstack([x**0, x**1, np.sin(x)])
    w = np.array([-1.0, 0.1, 1.0]).reshape(-1,1)
    return p @ w + sigma*np.random.normal(size=x.shape)

# Generate some data
sigma = 1.0 # noise standard deviation
alpha = 1.0 # standard deviation of the parameter prior
N = 20

np.random.seed(42)

X = (np.random.rand(N)*10.0 - 5.0).reshape(-1,1)
y = g(X, sigma) # training targets

plt.figure()
plt.plot(X, y, '+')
plt.xlabel("$x$")
plt.ylabel("$y$");
Xtest.shape
(100, 1)
theta_map
array([[-1.26518298],
[-0.01298677]])
Phi.shape
(20, 2)
# get the MAP estimate
K = 1 # polynomial degree

# feature matrix
Phi = poly_features(X, K)

theta_map = map_estimate_poly(Phi, y, sigma, alpha)

# maximum likelihood estimate
theta_ml = nonlinear_features_maximum_likelihood(Phi, y)

Xtest = np.linspace(-5,5,100).reshape(-1,1)
ytest = g(Xtest, sigma)

Phi_test = poly_features(Xtest, K)
y_pred_map = Phi_test @ theta_map

y_pred_mle = Phi_test @ theta_ml

plt.figure()
plt.plot(X, y, '+')
plt.plot(Xtest, y_pred_map)
plt.plot(Xtest, g(Xtest, 0))
plt.plot(Xtest, y_pred_mle)

plt.legend(["data", "map prediction", "ground truth function", "maximum likelihood"]);
print(np.hstack([theta_ml, theta_map]))
[[-1.49712990e+00 -1.08154986e+00]
[ 8.56868912e-01 6.09177023e-01]
[-1.28335730e-01 -3.62071208e-01]
[-7.75319509e-02 -3.70531732e-03]
[ 3.56425467e-02 7.43090617e-02]
[-4.11626749e-03 -1.03278646e-02]
[-2.48817783e-03 -4.89363010e-03]
[ 2.70146690e-04 4.24148554e-04]
[ 5.35996050e-05 1.03384719e-04]]
Now, let us compute the RMSE for different polynomial degrees and see whether the MAP estimate addresses the overfitting issue we encountered with the maximum likelihood estimate.
## EDIT THIS CELL
K_max = 12 # this is the maximum degree of polynomial we will consider
assert(K_max < N) # this is the latest point when we'll run into numerical problems

rmse_mle = np.zeros((K_max+1,))
rmse_map = np.zeros((K_max+1,))

for k in range(K_max+1):
    # feature matrix
    Phi = poly_features(X, k)

    # maximum likelihood estimate
    theta_ml = nonlinear_features_maximum_likelihood(Phi, y)

    # predict the function values at the test input locations (maximum likelihood)
    y_pred_test = 0*Xtest ## <--- EDIT THIS LINE

    ####################### SOLUTION
    # feature matrix for test inputs
    Phi_test = poly_features(Xtest, k)

    # prediction
    ypred_test_mle = Phi_test @ theta_ml
    #######################

    # RMSE on test set (maximum likelihood)
    rmse_mle[k] = RMSE(ytest, ypred_test_mle)

    # MAP estimate
    theta_map = map_estimate_poly(Phi, y, sigma, alpha)

    # Feature matrix
    Phi_test = poly_features(Xtest, k)

    # predict the function values at the test input locations (MAP)
    ypred_test_map = Phi_test @ theta_map

    # RMSE on test set (MAP)
    rmse_map[k] = RMSE(ytest, ypred_test_map)

plt.figure()
plt.semilogy(rmse_mle) # this plots the RMSE on a logarithmic scale
plt.semilogy(rmse_map) # this plots the RMSE on a logarithmic scale
plt.xlabel("degree of polynomial")
plt.ylabel("RMSE")
plt.legend(["Maximum likelihood", "MAP"])
C:\Users\HP\AppData\Local\Temp\ipykernel_30576\3627804172.py:13: LinAlgWarning: Ill-conditioned matrix (rcond=1.82839e-17): result may not be accurate.
theta_map = scipy.linalg.solve(PP, Phi.T @ y)
Questions:
- What do you observe?
- What is the influence of the prior variance on the parameters (\(\alpha^2\))? Change the parameter and describe what happens.
3. Bayesian Linear Regression
# Test inputs
Ntest = 200
Xtest = np.linspace(-5, 5, Ntest).reshape(-1,1) # test inputs

prior_var = 2.0 # variance of the parameter prior (alpha^2). We assume this is known.
noise_var = 1.0 # noise variance (sigma^2). We assume this is known.

pol_deg = 3 # degree of the polynomial we consider at the moment
Assume a parameter prior \(p(\boldsymbol\theta) = \mathcal N (\boldsymbol 0, \alpha^2\boldsymbol I)\). For every test input \(\boldsymbol x_*\) we obtain the prior mean \[ E[f(\boldsymbol x_*)] = 0 \] and the prior (marginal) variance (ignoring the noise contribution) \[ V[f(\boldsymbol x_*)] = \alpha^2\boldsymbol\phi(\boldsymbol x_*)^\top \boldsymbol\phi(\boldsymbol x_*) \] where \(\boldsymbol\phi(\cdot)\) is the feature map.
## EDIT THIS CELL
# compute the feature matrix for the test inputs
Phi_test = poly_features(Xtest, pol_deg) # N x (pol_deg+1) feature matrix SOLUTION

# compute the (marginal) prior at the test input locations
# prior mean
prior_mean = np.zeros((Ntest,1)) # prior mean <-- SOLUTION

# prior variance
full_covariance = Phi_test @ Phi_test.T * prior_var # N x N covariance matrix of all function values
prior_marginal_var = np.diag(full_covariance)

# Let us visualize the prior over functions
plt.figure()
plt.plot(Xtest, prior_mean, color="k")

conf_bound1 = np.sqrt(prior_marginal_var).flatten()
conf_bound2 = 2.0*np.sqrt(prior_marginal_var).flatten()
conf_bound3 = 2.0*np.sqrt(prior_marginal_var + noise_var).flatten()

plt.fill_between(Xtest.flatten(), prior_mean.flatten() + conf_bound1,
                 prior_mean.flatten() - conf_bound1, alpha = 0.1, color="k")
plt.fill_between(Xtest.flatten(), prior_mean.flatten() + conf_bound2,
                 prior_mean.flatten() - conf_bound2, alpha = 0.1, color="k")
plt.fill_between(Xtest.flatten(), prior_mean.flatten() + conf_bound3,
                 prior_mean.flatten() - conf_bound3, alpha = 0.1, color="k")

plt.xlabel('$x$')
plt.ylabel('$y$')
plt.title("Prior over functions");
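Only the diagonal of the N x N prior covariance is needed for the shaded bands, so the marginal variances can also be computed directly as row-wise squared norms of the feature matrix. The sketch below compares both routes on stand-in data; the names `Phi_demo`, `var_full` and `var_direct` are illustrative, and the cubic features are rebuilt with `np.vander` so that the example is self-contained.

```python
import numpy as np

prior_var_demo = 2.0                      # alpha^2, same value as above
x_demo = np.linspace(-5, 5, 200)
Phi_demo = np.vander(x_demo, N=4, increasing=True)  # 200 x 4 cubic features

# Route 1: build the full 200 x 200 covariance, then read off the diagonal
var_full = np.diag(prior_var_demo * Phi_demo @ Phi_demo.T)

# Route 2: alpha^2 * phi(x)^T phi(x) for each row, without the full matrix
var_direct = prior_var_demo * np.sum(Phi_demo**2, axis=1)

print(np.allclose(var_full, var_direct))
```

For a few hundred test points the difference is negligible, but the direct route needs O(N) memory instead of O(N^2), which matters on dense test grids.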
Now, we will use this prior distribution and sample functions from it.
## EDIT THIS CELL
# samples from the prior
num_samples = 10

# We first need to generate random weights theta_i, which we sample from the parameter prior
random_weights = np.random.normal(size=(pol_deg+1,num_samples), scale=np.sqrt(prior_var))

# Now, we compute the induced random functions, evaluated at the test input locations
# Every function sample is given as f_i = Phi * theta_i,
# where theta_i is a sample from the parameter prior
sample_function = Phi_test @ random_weights # <-- SOLUTION

plt.figure()
plt.plot(Xtest, sample_function, color="r")
plt.title("Plausible functions under the prior")
print("Every sampled function is a polynomial of degree "+str(pol_deg));
Every sampled function is a polynomial of degree 3
Now we are given some training inputs \(\boldsymbol x_1, \dotsc, \boldsymbol x_N\), which we collect in a matrix \(\boldsymbol X = [\boldsymbol x_1, \dotsc, \boldsymbol x_N]^\top\in\mathbb{R}^{N\times D}\)
N = 10
X = np.random.uniform(high=5, low=-5, size=(N,1)) # training inputs, size Nx1
y = g(X, np.sqrt(noise_var)) # training targets, size Nx1
Now, let us compute the posterior
## EDIT THIS FUNCTION
def polyfit(X, y, K, prior_var, noise_var):
    # X: training inputs, size N x D
    # y: training targets, size N x 1
    # K: degree of polynomial we consider
    # prior_var: prior variance of the parameter distribution
    # noise_var: noise variance

    jitter = 1e-08 # increases numerical stability

    Phi = poly_features(X, K) # N x (K+1) feature matrix

    # Compute maximum likelihood estimate
    Pt = Phi.T @ y # Phi^T*y, size (K+1,1)
    PP = Phi.T @ Phi + jitter*np.eye(K+1) # size (K+1, K+1)
    C = scipy.linalg.cho_factor(PP)
    # maximum likelihood estimate
    theta_ml = scipy.linalg.cho_solve(C, Pt) # inv(Phi^T*Phi)*Phi^T*y, size (K+1,1)

    # theta_ml = scipy.linalg.solve(PP, Pt) # inv(Phi^T*Phi)*Phi^T*y, size (K+1,1)

    # MAP estimate
    theta_map = scipy.linalg.solve(PP + noise_var/prior_var*np.eye(K+1), Pt)

    # parameter posterior
    iSN = (np.eye(K+1)/prior_var + PP/noise_var) # posterior precision
    SN = scipy.linalg.pinv(noise_var*np.eye(K+1)/prior_var + PP)*noise_var # posterior covariance
    mN = scipy.linalg.solve(iSN, Pt/noise_var) # posterior mean

    return (theta_ml, theta_map, mN, SN)

# note: alpha and sigma (both 1.0 here) are passed in the prior_var and noise_var slots
theta_ml, theta_map, theta_mean, theta_var = polyfit(X, y, pol_deg, alpha, sigma)

print(theta_mean, theta_var)
[[-0.59357667]
[ 0.41955968]
[ 0.01927393]
[-0.02591532]] [[ 0.31686871 -0.05423782 -0.03675352 0.0068937 ]
[-0.05423782 0.05899309 0.00762815 -0.00430896]
[-0.03675352 0.00762815 0.00680258 -0.00137103]
[ 0.0068937 -0.00430896 -0.00137103 0.00049154]]
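The posterior computed above follows the standard closed form for Bayesian linear regression with a Gaussian prior: \(\boldsymbol S_N = (\alpha^{-2}\boldsymbol I + \sigma^{-2}\boldsymbol\Phi^T\boldsymbol\Phi)^{-1}\) and \(\boldsymbol m_N = \sigma^{-2}\boldsymbol S_N\boldsymbol\Phi^T\boldsymbol y\). One consequence is that the posterior mean coincides with the MAP estimate. The sketch below checks this numerically on synthetic data; every name in it is illustrative, and the features and targets are random stand-ins rather than the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in problem: 8 inputs, quadratic features, random targets
Phi_demo = np.vander(rng.uniform(-2, 2, size=8), N=3, increasing=True)  # 8 x 3
y_demo = rng.normal(size=(8, 1))
prior_var_demo, noise_var_demo = 2.0, 0.5  # alpha^2 and sigma^2 (assumed values)

# Posterior precision, covariance and mean
S_N_inv = np.eye(3) / prior_var_demo + Phi_demo.T @ Phi_demo / noise_var_demo
S_N = np.linalg.inv(S_N_inv)
m_N = S_N @ Phi_demo.T @ y_demo / noise_var_demo

# MAP estimate: (Phi^T Phi + sigma^2/alpha^2 I)^{-1} Phi^T y
theta_map_demo = np.linalg.solve(
    Phi_demo.T @ Phi_demo + noise_var_demo / prior_var_demo * np.eye(3),
    Phi_demo.T @ y_demo,
)

print(np.allclose(m_N, theta_map_demo))
```

What Bayesian linear regression adds over MAP is the covariance \(\boldsymbol S_N\), which quantifies how uncertain the parameters remain after seeing the data.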
Now, let’s make predictions (ignoring the measurement noise). We obtain three predictors: \[\begin{align} &\text{Maximum likelihood: }E[f(\boldsymbol X_{\text{test}})] = \boldsymbol \phi(X_{\text{test}})\boldsymbol \theta_{ml}\\ &\text{Maximum a posteriori: } E[f(\boldsymbol X_{\text{test}})] = \boldsymbol \phi(X_{\text{test}})\boldsymbol \theta_{map}\\ &\text{Bayesian: } p(f(\boldsymbol X_{\text{test}})) = \mathcal N(f(\boldsymbol X_{\text{test}}) \,|\, \boldsymbol \phi(X_{\text{test}}) \boldsymbol\theta_{\text{mean}},\, \boldsymbol\phi(X_{\text{test}}) \boldsymbol\theta_{\text{var}} \boldsymbol\phi(X_{\text{test}})^\top) \end{align}\] We already computed all quantities. Write some code that implements all three predictors.
## EDIT THIS CELL
# predictions (ignoring the measurement/observations noise)
Phi_test = poly_features(Xtest, pol_deg) # N x (K+1)

# maximum likelihood predictions (just the mean)
m_mle_test = Phi_test @ theta_ml

# MAP predictions (just the mean)
m_map_test = Phi_test @ theta_map

# predictive distribution (Bayesian linear regression)
# mean prediction
mean_blr = Phi_test @ theta_mean
# variance prediction
cov_blr = Phi_test @ theta_var @ Phi_test.T
print(Xtest.shape, Phi_test.shape)
(200, 1) (200, 4)
print(mean_blr.shape, cov_blr.shape)
(200, 1) (200, 200)
# plot the posterior
plt.figure()
plt.plot(X, y, "+")
plt.plot(Xtest, m_mle_test)
plt.plot(Xtest, m_map_test)
var_blr = np.diag(cov_blr)
conf_bound1 = np.sqrt(var_blr).flatten()
conf_bound2 = 2.0*np.sqrt(var_blr).flatten()
conf_bound3 = 2.0*np.sqrt(var_blr + sigma).flatten()

plt.fill_between(Xtest.flatten(), mean_blr.flatten() + conf_bound1,
                 mean_blr.flatten() - conf_bound1, alpha = 0.1, color="k")
plt.fill_between(Xtest.flatten(), mean_blr.flatten() + conf_bound2,
                 mean_blr.flatten() - conf_bound2, alpha = 0.1, color="k")
plt.fill_between(Xtest.flatten(), mean_blr.flatten() + conf_bound3,
                 mean_blr.flatten() - conf_bound3, alpha = 0.1, color="k")
plt.legend(["Training data", "MLE", "MAP", "BLR"])
plt.xlabel('$x$');
plt.ylabel('$y$');