Chapter 12 - Multiple Regression
12.1 Introduction to the Multiple Regression Model
We use a multiple linear regression model when we consider two or more independent variables X to linearly explain a dependent variable Y. We can express the expected value of Y as:
E[Y]=f(X_1,X_2,...,X_p)

Since the model is linear, the function is linear in the coefficients, so it becomes a sum of products including a constant coefficient b_0:
E[Y_i]=\hat{Y_i}=b_0 + b_1X_{1i}+b_2X_{2i}+...+b_pX_{pi}

where p is the number of independent X variables.
The expected value of the dependent variable Y is also named the predicted value of Y, which is \hat{Y} here. \hat{Y} depends on the values of the independent X variables. As in simple linear regression, this expected value is also called the conditional mean of Y since it is conditional on specific values of the X variables.
We can also express the value of Y_i as the following multiple regression equation:
Y_i=E[Y_i|X's]+\varepsilon_i=\hat{Y_i}+\varepsilon_i
Y_i=b_0 + b_1X_{1i}+b_2X_{2i}+...+b_pX_{pi}+ \varepsilon_i

Most of what we learned in previous chapters about simple linear regression applies to multiple regression models. There is an important distinction when we add more than one X independent variable. In the simple regression model, we consider the effect of only one X variable. In the multiple regression model, we consider the effect of multiple X variables, and all the X variables included compete for explanatory power. What does this mean?
When we start our analysis of the dependent variable Y, we can begin with a simple regression model considering only X_1 as the independent variable. Assume that X_2 is an important explanatory variable that we failed to include in our first model. The omission of X_2 is problematic: the estimate of the b_1 coefficient might not be reliable, since the OLS method assumes that there is no other explanatory variable and tries to attribute any movement in Y exclusively to movements in X_1.
When we add X_2 to our simple regression model and keep using the OLS method, the beta coefficients now represent the partial effect of each X after considering the effect of the other X. In other words, the beta coefficients are called partial coefficients since their value and significance are estimated considering the effect of the other X variables. Then, if we have two X variables in the regression model, we can say that the partial coefficient b_2 represents how much more X_2 explains the movements of Y beyond what X_1 already explains.
Interestingly, the OLS method to estimate the optimal (BLUE) beta coefficients can be used for the multiple linear regression model, not only for the simple regression model. In the case of multiple regression we can use matrix algebra with the OLS method to simultaneously estimate the optimal beta coefficients that minimize the sum of squared errors of the model.
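To illustrate this idea of competing explanatory power, here is a minimal numpy sketch with simulated (made-up) data: Y depends on two correlated variables X1 and X2, and the simple regression of Y on X1 alone overstates b_1 because it absorbs part of X2's effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # X2 is correlated with X1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Simple regression of Y on X1 alone: the slope absorbs part of X2's effect
b1_simple = np.polyfit(x1, y, 1)[0]

# Multiple regression: partial coefficients close to the true 2.0 and 3.0
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b1_simple, b)
```

The simple-regression slope lands near 3.5 (the true 2.0 plus the indirect effect through X2), while the partial coefficient in the multiple regression recovers roughly 2.0.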
12.2 Setting the multiple regression equation with Matrix Algebra
With matrix algebra we can easily represent sums of products when we multiply two or more matrices. The multiple regression equation is itself a sum of products:
\hat{Y_i}=b_0*1 + b_1X_{1i}+b_2X_{2i}+...+b_pX_{pi}

I multiplied b_0 times 1 to express \hat{Y_i} as a sum of products; each product is the multiplication of a beta coefficient times its corresponding X variable, or times 1 in the case of b_0. One of the important advantages of using matrix algebra is that it is easy to represent sums of products.
We can represent a sum of products using matrix algebra as follows.
Let X be the matrix of the p independent variables with their respective sample values for n observations, along with a column vector of 1's:
\mathbf{X} = \begin{bmatrix}1 & X_{11} & . & . & . & X_{1p}\\ 1 & X_{21} & . & . & . & X_{2p}\\ . & . & . & . & . & .\\ . & . & . & . & . & .\\ 1 & X_{n1} & . & . & . & X_{np} \end{bmatrix}\\
Let Y be the vector of all n sample values of the dependent variable Y:
\mathbf{Y} = \begin{bmatrix}Y_{1}\\ Y_{2}\\ .\\ .\\ Y_{n} \end{bmatrix}\\
Now the vector of beta coefficients can be represented as the vector b:
\mathbf{b} = \begin{bmatrix}b_{0}\\ b_{1}\\ .\\ .\\ b_{p} \end{bmatrix}\\
I can express the predicted \hat{Y} as the regression equation in matrix algebra as:
\mathbf{\hat{Y}}=\mathbf{Xb}
\mathbf{\hat{Y}}=\begin{bmatrix}1 & X_{11} & . & . & . & X_{1p}\\ 1 & X_{21} & . & . & . & X_{2p}\\ . & . & . & . & . & .\\ . & . & . & . & . & .\\ 1 & X_{n1} & . & . & . & X_{np} \end{bmatrix}*\begin{bmatrix}b_{0}\\ b_{1}\\ .\\ .\\ b_{p} \end{bmatrix}=\begin{bmatrix}\hat{Y}_{1}\\ \hat{Y}_{2}\\ .\\ .\\ \hat{Y}_{n} \end{bmatrix}

When multiplying a matrix by another matrix or vector, we are calculating sums of products. In this case:
\hat{Y_i}=b_0*1 + b_1X_{1i}+b_2X_{2i}+...+b_pX_{pi}

If we remember matrix multiplication, we can see that the result of this matrix multiplication is the original regression equation for the case of p independent variables.
Now we can express the real vector \mathbf{Y} as the following equation using matrix algebra:
\mathbf{Y} = \mathbf{Xb}+\mathbf{\varepsilon}
Where \varepsilon is the vector of errors of the n sample observations:
\mathbf{\varepsilon}=\begin{bmatrix}\varepsilon_{1}\\ \varepsilon_{2}\\ .\\ .\\ \varepsilon_{n} \end{bmatrix}
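As a small numerical sketch of this setup (hypothetical data, n = 5 observations and p = 2 variables), we can build X with its column of 1's and compute \hat{Y} = Xb and the error vector in numpy:

```python
import numpy as np

# Hypothetical sample: n = 5 observations of p = 2 independent variables
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
Y = np.array([3.0, 4.0, 9.0, 10.0, 13.0])

# Prepend the column of 1's that multiplies b0
X = np.column_stack([np.ones(len(Y)), X_raw])   # shape (n, p+1)

b = np.array([0.5, 1.0, 1.5])   # some candidate betas b0, b1, b2
Y_hat = X @ b                   # row i: b0*1 + b1*X1i + b2*X2i
eps = Y - Y_hat                 # error vector
SSE = eps @ eps                 # sum of squared errors
```

Any candidate vector b produces a predicted vector and an error vector this way; the OLS derivation below finds the b that makes SSE as small as possible.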
12.3 Applying the OLS method and Matrix Algebra to find the optimal beta coefficients
As in the simple regression model, in the multiple regression model the OLS method searches for the beta coefficients that minimize the sum of squared errors:
Min \sum_{i=1}^{n}\varepsilon_{i}^{2}
In matrix algebra we can express the sum of squared errors (SSE) as the following matrix multiplication:
\mathbf{\varepsilon'\varepsilon} = \sum_{i=1}^{n}\varepsilon_{i}^{2}=SSE
Where \varepsilon' is the transpose of the \varepsilon vector of errors.
From the previous equation we can define the error vector as the difference between the Y vector -the actual, real Y values- and the predicted values \hat{Y}, which are computed by the Xb matrix multiplication:
\mathbf{\varepsilon=Y-\hat{Y}=Y-Xb}

Then, we need to minimize the sum of squared errors:
Min:
\mathbf{\varepsilon'\varepsilon} = \mathbf{\left(Y-Xb\right)'\left(Y-Xb\right)}
According to the transpose rule of matrix algebra, the transpose of a matrix multiplication is the multiplication of the transpose matrices in opposite order:
Transpose rule: \mathbf{\left(AB\right)'=B'A'}
Applying this rule to \varepsilon':\mathbf{\left(Y-Xb\right)'=(Y'-b'X')}
Then we can write the sum of squared errors as:
\varepsilon'\varepsilon=\mathbf{\left(Y'-b'X'\right)(Y-Xb)}
Applying the distributive law of matrix multiplication, we multiply each term of the first factor by each term of the second factor and sum them up:
\mathbf{\varepsilon'\varepsilon=Y'Y-Y'Xb-b'X'Y+b'X'Xb}
We can see that the matrix multiplication \mathbf{b'X'Y} is a scalar (a number). Looking at the dimensions of these matrices, we can see that the end result of this multiplication is a matrix of dimension \left(1x1\right), which is a number.
Remember that the dimension of the resulting matrix after multiplying 2 matrices is as follows:
\mathbf{A_{(rA,cA)}B_{(rB,cB)}=C_{(rA,cB)} } The resulting matrix C after multiplying A*B has the following number of rows and columns:
Number of rows of C = number of rows of the first matrix A
Number of columns of C = number of columns of the second matrix B
Check that the condition for matrix multiplication is that the number of columns of the first matrix A must be equal to the number of rows of the second matrix B: cA=rB
Here is an example of the same matrix multiplication with their elements:
\underbrace{ \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1c_A} \\ a_{21} & a_{22} & \cdots & a_{2c_A} \\ \vdots & \vdots & \ddots & \vdots \\ a_{r_A1} & a_{r_A2} & \cdots & a_{r_Ac_A} \end{bmatrix} }_{(r_A,\,c_A)} \times \underbrace{ \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1c_B} \\ b_{21} & b_{22} & \cdots & b_{2c_B} \\ \vdots & \vdots & \ddots & \vdots \\ b_{r_B1} & b_{r_B2} & \cdots & b_{r_Bc_B} \end{bmatrix} }_{(r_B,\,c_B)} = \underbrace{ \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1c_B} \\ c_{21} & c_{22} & \cdots & c_{2c_B} \\ \vdots & \vdots & \ddots & \vdots \\ c_{r_A1} & c_{r_A2} & \cdots & c_{r_Ac_B} \end{bmatrix} }_{(r_A,\,c_B)}
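This dimension rule is easy to verify numerically; a quick numpy check:

```python
import numpy as np

A = np.ones((2, 3))   # (rA, cA) = (2, 3)
B = np.ones((3, 4))   # (rB, cB) = (3, 4); cA = rB, so A @ B is defined
C = A @ B
print(C.shape)        # (2, 4) = (rA, cB)
```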
Then:
\mathbf{b'} is a \left(1x(p+1)\right) matrix, \mathbf{X'} is a \left((p+1)xn\right) matrix, and \mathbf{Y} is an \left(nx1\right) matrix. \mathbf{b'X'} is a \left(1xn\right) matrix, so \mathbf{b'X'Y} is a \left(1x1\right) matrix.
Since the transpose of a number is the number itself, then:
\left(\mathbf{b'X'Y}\right)=\left(\mathbf{b'X'Y}\right)'=\mathbf{Y'Xb}, which is a number. Then:
\varepsilon'\varepsilon=\mathbf{Y'Y-2b'X'Y+b'X'Xb}
Now we need to differentiate this sum of squared errors and set the derivative to zero in order to find the set of betas that minimize the sum of squared errors. According to the basics of differential calculus, when we take the derivative of a function and set this derivative equal to zero, we can identify a maximum or a minimum point of the function.
Intuitively, for a function of one variable x, the derivative at a specific point x is the slope of the line tangent to the function at that point. If this derivative (the slope) is equal to zero, then the function at that point is at the top or the bottom of a curve.
In the case of the sum of squared errors, we can be sure that if a point with zero derivative exists, then it is a minimum, not a maximum. The maximum of the sum of squared errors is actually infinite: for the case of one independent variable x, we can imagine a regression line very far from the actual points (x,y), and we can always imagine another regression line that is even further away.
Then, we can use differential calculus applied to matrix algebra. I will review the basic rules of matrix differentiation before we differentiate \varepsilon'\varepsilon.
Rule 1. If a is a column vector of numbers and x is a column vector of values of a variable X, then the derivative of a'x with respect to x will be equal to the column vector a:
a=\left[\begin{array}{c} a_{1}\\ a_{2}\\ .\\ .\\ a_{n} \end{array}\right] x=\left[\begin{array}{c} x_{1}\\ x_{2}\\ .\\ .\\ x_{n} \end{array}\right] \frac{\delta(a'x)}{\delta(x)}=a=\left[\begin{array}{c} a_{1}\\ a_{2}\\ .\\ .\\ a_{n} \end{array}\right]
Note that \frac{\delta(a'x)}{\delta(x)} represents a set of differential equations.
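Rule 1 can be checked numerically with finite differences (a small numpy sketch using arbitrary vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=4)
x = rng.normal(size=4)

f = lambda v: a @ v   # the scalar a'x as a function of x

# Central-difference approximation of the gradient of a'x
h = 1e-6
grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                 for e in np.eye(4)])
print(np.allclose(grad, a, atol=1e-4))   # True: the gradient is the vector a
```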
Rule 2. If A is a square symmetric matrix, then the derivative of \mathbf{x'Ax} with respect to x is equal to the vector \mathbf{\textrm{2}Ax}:
A=\left[\begin{array}{ccccc} a_{11} & a_{12} & . & . & a_{1n}\\ a_{21} & a_{22} & . & . & a_{2n}\\ . & . & . & . & .\\ . & . & . & . & .\\ a_{n1} & a_{n2} & . & . & a_{nn} \end{array}\right]

where a_{ij}=a_{ji}, and

x=\left[\begin{array}{c} x_{1}\\ x_{2}\\ .\\ .\\ x_{n} \end{array}\right]

Then:
\mathbf{x'Ax}=\left[\begin{array}{c} x_{1} & x_{2} & . & . & x_{n}\end{array}\right]\left[\begin{array}{ccccc} a_{11} & a_{12} & . & . & a_{1n}\\ a_{21} & a_{22} & . & . & a_{2n}\\ . & . & . & . & .\\ . & . & . & . & .\\ a_{n1} & a_{n2} & . & . & a_{nn} \end{array}\right]\left[\begin{array}{c} x_{1}\\ x_{2}\\ .\\ .\\ x_{n} \end{array}\right]
\mathbf{x'Ax}={\displaystyle \sum_{i=1}^{n}\sum_{j=1}^{n}x_{i}x_{j}a_{ij}}={\displaystyle \sum_{i=1}^{n}x_{i}^{2}a_{ii}+\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n}x_{i}x_{j}a_{ij}}
The set of derivatives with respect to each x_{i} is equal to :
\mathbf{\frac{\delta\left(x'Ax\right)}{\delta\left(x\right)}=\textrm{2}Ax}
But why is this true? Let's work through the set of derivatives:
\frac{\delta\left(\mathbf{x'Ax}\right)}{\delta\left(\mathbf{x}\right)}=\left[\begin{array}{c} \frac{\delta\left(x'Ax\right)}{\delta\left(x_{1}\right)}\\ \frac{\delta\left(x'Ax\right)}{\delta\left(x_{2}\right)}\\ .\\ .\\ \frac{\delta\left(x'Ax\right)}{\delta\left(x_{n}\right)} \end{array}\right]
\frac{\delta\left(\mathbf{x'Ax}\right)}{\delta\left(\mathbf{x}\right)}=\left[\begin{array}{c} \frac{\delta\left(\sum_{i=1}^{n}x_{i}^{2}a_{ii}+\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n}x_{i}x_{j}a_{ij}\right)}{\delta\left(x_{1}\right)}\\ \frac{\delta\left(\sum_{i=1}^{n}x_{i}^{2}a_{ii}+\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n}x_{i}x_{j}a_{ij}\right)}{\delta\left(x_{2}\right)}\\ .\\ .\\ \frac{\delta\left(\sum_{i=1}^{n}x_{i}^{2}a_{ii}+\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n}x_{i}x_{j}a_{ij}\right)}{\delta\left(x_{n}\right)} \end{array}\right]
Now taking the partial derivative for each equation (remember that matrix A is symmetric, so a_{ij}=a_{ji}):
\frac{\delta\left(\mathbf{x'Ax}\right)}{\delta\left(\mathbf{x}\right)}=\left[\begin{array}{c} 2x_{1}a_{11}+x_{2}a_{12}+x_{2}a_{21}+...+x_{n}a_{1n}+x_{n}a_{n1}\\ 2x_{2}a_{22}+x_{1}a_{21}+x_{1}a_{12}+...+x_{n}a_{2n}+x_{n}a_{n2}\\ .\\ .\\ 2x_{n}a_{nn}+x_{1}a_{n1}+x_{1}a_{1n}+...+x_{n-1}a_{n(n-1)}+x_{n-1}a_{(n-1)n} \end{array}\right]
\frac{\delta\left(\mathbf{x'Ax}\right)}{\delta\left(\mathbf{x}\right)}=\left[\begin{array}{c} 2x_{1}a_{11}+2x_{2}a_{12}+...+2x_{n}a_{1n}\\ 2x_{2}a_{22}+2x_{1}a_{21}+2x_{3}a_{23}+...+2x_{n}a_{2n}\\ .\\ .\\ 2x_{n}a_{nn}+2x_{1}a_{n1}+2x_{2}a_{n2}+...+2x_{n-1}a_{n(n-1)} \end{array}\right]
Then, we can group the elements with sums and express it as matrix notation:
\frac{\delta\left(\mathbf{x'Ax}\right)}{\delta\left(\mathbf{x}\right)} = \left[\begin{array}{c} 2{\displaystyle \sum_{i=1}^{n}a_{1i}x_{i}}\\ 2{\displaystyle \sum_{i=1}^{n}a_{2i}x_{i}}\\ .\\ .\\ 2{\displaystyle \sum_{i=1}^{n}a_{ni}x_{i}} \end{array}\right]=2\mathbf{Ax}
Interestingly, the derivative of \mathbf{x'Ax} (which is a scalar) results in a column vector. The vector \mathbf{2Ax} is the set of n partial derivatives of \mathbf{x'Ax} with respect to each x_{i}.
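Rule 2 can be verified the same way with finite differences, using a randomly generated symmetric matrix A:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2     # make A square and symmetric
x = rng.normal(size=4)

f = lambda v: v @ A @ v   # the quadratic form x'Ax

# Central-difference approximation of the gradient of x'Ax
h = 1e-6
grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                 for e in np.eye(4)])
print(np.allclose(grad, 2 * A @ x, atol=1e-4))   # True: the derivative is 2Ax
```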
Now I return to the sum of squared errors:
\varepsilon'\varepsilon=\mathbf{Y'Y-2b'X'Y+b'X'Xb}
I apply the partial derivative of the sum of squared errors with respect to the vector \mathbf{b}:
\mathbf{\frac{\delta(\varepsilon'\varepsilon)}{\delta(b)}}=\frac{\delta(\mathbf{Y'Y-2b'X'Y+b'X'Xb})}{\mathbf{\delta(b)}}=\mathbf{\frac{\delta(Y'Y)}{\delta(b)}-\frac{\delta(2b'X'Y)}{\delta(b)}+\frac{\delta(b'X'Xb)}{\delta(b)}} The derivative of \mathbf{Y'Y} with respect to \mathbf{b} is zero since \mathbf{Y'Y} is not a function of b. Then:
\mathbf{\frac{\delta(\varepsilon'\varepsilon)}{\delta(b)}}=-\frac{\delta(2\mathbf{b'X'Y})}{\delta(\mathbf{b})}+\frac{\delta(\mathbf{b'X'Xb})}{\delta(\mathbf{b})}

We can see that the matrix X'X is a square symmetric matrix, so we can apply Rule 2 to get the derivative of \mathbf{b'X'Xb} with respect to \mathbf{b}. If we consider \mathbf{X'X=A}, then:
\frac{\delta(\mathbf{b'X'Xb})}{\delta(\mathbf{b})}=2\mathbf{X'Xb} We can apply Rule 1 to get the derivative of 2\mathbf{b'X'Y} with respect to \mathbf{b}. We have shown that \mathbf{b'X'Y=Y'Xb} (which is a scalar). \mathbf{Y'X} is a row vector with dimensions (1x(p+1)). Then:
\frac{\delta(2\mathbf{b'X'Y})}{\delta(\mathbf{b})}=\frac{\delta((2\mathbf{Y'X})\mathbf{b})}{\delta(\mathbf{b})}=(2\mathbf{Y'X})'=2(\mathbf{Y'X})'=2\mathbf{X'Y}
Then:
\mathbf{\frac{\delta(\varepsilon'\varepsilon)}{\delta(b)}}=-2\mathbf{X'Y}+2\mathbf{X'Xb}
Now I make this derivative equal to zero to find the beta coefficients that will minimize the sum of squared errors:
\mathbf{\frac{\delta(\varepsilon'\varepsilon)}{\delta(b)}}=-2\mathbf{X'Y}+2\mathbf{X'Xb}=0
Then:
2\mathbf{X'Xb}=2\mathbf{X'Y}

\mathbf{X'Xb=X'Y}

Multiplying both sides by (X'X)^{-1}:
\mathbf{(X'X)^{-1}(X'X)b=(X'X)^{-1}X'Y}
\mathbf{(I)b=(X'X)^{-1}X'Y}
I is the identity matrix. Then:
\mathbf{b=(X'X)^{-1}X'Y}
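The closed-form solution can be checked against numpy's built-in least-squares routine (a sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# b = (X'X)^{-1} X'Y; solving the normal equations X'Xb = X'Y
# is numerically safer than forming the inverse explicitly
b = np.linalg.solve(X.T @ X, X.T @ Y)

# numpy's least-squares routine gives the same coefficients
b_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(b, b_lstsq))   # True
```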
This estimation of the vector of betas is a set of “BLUE” estimators: Best, Linear, Unbiased Estimator. This can be demonstrated using matrix algebra.
Now, I can estimate the variance-covariance matrix of the beta estimators. This matrix is important because it gives us the standard error of each beta; the standard error of a beta coefficient is its standard deviation. We need the standard errors to test whether each beta coefficient is significantly different from zero. When running a multiple regression model, the null hypothesis we look to reject is that the beta coefficient of an independent variable is equal to zero; in other words, that the independent variable is not linearly related to the dependent variable.
In order to run this hypothesis test we need to estimate the 95% confidence interval of each beta coefficient. A confidence interval is estimated by adding and subtracting about 2 times the standard error of the beta coefficient to the estimated beta coefficient. I have to calculate the variance of each beta coefficient to get its standard deviation, or standard error. Then, let's work with the variance-covariance matrix of the betas. This matrix can be expressed as the expected value of the squared differences between the sample betas and their respective population betas:
E[\mathbf{\left(b-\beta\right)\left(b-\beta\right)}']=Var(\mathbf{b}) = variance-covariance matrix of betas
\beta is the vector of population or theoretical betas. Remember that:
\mathbf{Y=X\beta+\varepsilon}
Substituting \mathbf{Y=X\beta+\varepsilon} into \mathbf{b=(X'X)^{-1}X'Y}:
\mathbf{b=(X'X)^{-1}X'(X\beta+\varepsilon)=(X'X)^{-1}X'(X\beta)+(X'X)^{-1}X'\varepsilon}=\mathbf{(X'X)^{-1}(X'X)}\beta+\mathbf{(X'X)^{-1}X'}\varepsilon
Since \mathbf{(X'X)^{-1}(X'X)=I}, then:
\mathbf{b=}\beta+\mathbf{(X'X)^{-1}X'}\varepsilon
Then I re-arrange the betas to have an equation for (\mathbf{b}-\beta):
\mathbf{b-\beta=}\mathbf{(X'X)^{-1}X'}\varepsilon
Then:
\mathbf{(b-\beta)'=}\mathbf{[((X'X)^{-1})(X'}\varepsilon)]'=\mathbf{(X'\varepsilon)'((X'X)^{-1})'=(\varepsilon'X)((X'X)^{-1})'}
Now I can get the expected value of the squared differences (\mathbf{b}-\beta):
E\mathbf{[(b-\beta)(b-\beta)']}=E[\mathbf{((X'X)^{-1}X'}\mathbf{\varepsilon)}\mathbf{((\varepsilon'X)((X'X)^{-1})')]}
\mathbf{E[(b-\beta)(b-\beta)']}=E[\mathbf{(X'X)^{-1}X'\varepsilon\varepsilon'X((X'X)^{-1})']}
Since X'X is symmetric, the transpose of its inverse is the inverse itself: \mathbf{((X'X)^{-1})'=(X'X)^{-1}}. Then:
\mathbf{E[(b-\beta)(b-\beta)']}=E[\mathbf{(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}]}
Only the errors are random; the X values are treated as fixed, so the expected value of any product of X matrices is just the product itself. Then:
\mathbf{E[(b-\beta)(b-\beta)']}=\mathbf{(X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1}}
We can see that \mathbf{E[\varepsilon\varepsilon']} is the variance-covariance matrix of the errors. The diagonal of this matrix will have the constant variance of the errors, and the off-diagonal entries will be zeros since the covariances between the errors are assumed to be zero. Then:
\mathbf{E[\varepsilon\varepsilon']}=\begin{bmatrix}E[\varepsilon_{1}^{2}] & E[\varepsilon_{1}\varepsilon_{2}] & . & . & E[\varepsilon_{1}\varepsilon_{n}]\\ E[\varepsilon_{2}\varepsilon_{1}] & E[\varepsilon_{2}^{2}] & . & . & E[\varepsilon_{2}\varepsilon_{n}]\\ . & . & . & . & .\\ . & . & . & . & .\\ E[\varepsilon_{n}\varepsilon_{1}] & . & . & . & E[\varepsilon_{n}^{2}] \end{bmatrix}=\begin{bmatrix}\sigma_{\varepsilon}^{2} & 0 & . & . & 0\\ 0 & \sigma_{\varepsilon}^{2} & . & . & 0\\ . & . & . & . & .\\ . & . & . & . & .\\ 0 & 0 & . & . & \sigma_{\varepsilon}^{2} \end{bmatrix}

The variance of the errors is also called the mean squared error (MSE), which can be calculated by dividing the sum of squared errors (SSE) by (n-p-1), where n is the number of observations and p is the number of independent variables.
I can express this matrix as a multiplication of the scalar variance of errors times the identity matrix:
\mathbf{E[\varepsilon\varepsilon']}=\sigma_{\varepsilon}^{2}\begin{bmatrix}1 & 0 & . & . & 0\\ 0 & 1 & . & . & 0\\ . & . & . & . & .\\ . & . & . & . & .\\ 0 & 0 & . & . & 1 \end{bmatrix}=\sigma_{\varepsilon}^{2}\mathbf{I}

Now I can express the variance-covariance matrix of the betas as:
\mathbf{E[(b-\beta)(b-\beta)']}=\mathbf{(X'X)^{-1}X'\sigma_{\varepsilon}^{2}IX(X'X)^{-1}}
I can move the scalar variance of the errors and drop the identity matrix to simplify the expression. If a is a scalar and X a matrix, then aX=Xa; and XI=IX=X:
\mathbf{E[(b-\beta)(b-\beta)']}=\mathbf{(X'X)^{-1}(X'X)\sigma_{\varepsilon}^{2}(X'X)^{-1}}
Since (X'X)^{-1}(X'X)=I, then:
\mathbf{E[(b-\beta)(b-\beta)']}=\mathbf{I\sigma_{\varepsilon}^{2}(X'X)^{-1}}
\mathbf{E[(b-\beta)(b-\beta)']}=\mathbf{\sigma_{\varepsilon}^{2}(X'X)^{-1}}

The error variance \sigma_{\varepsilon}^2 is also called the mean squared error and is calculated as:
\sigma_{\varepsilon}^2 = MSE = \frac{SSE}{(n-p-1)}=\frac{\mathbf{\varepsilon'\varepsilon}}{(n-p-1)}
Then the diagonal of this matrix will have the variances of the betas, and the off-diagonal terms will have the covariances between pairs of beta estimates, which will be close to zero when the independent X variables are not correlated with each other.
Then I can express the variance of the beta coefficients as the vector \sigma_{b}^2 resulting from the following:
\sigma_{b}^2=diagonal(\mathbf{\sigma_{\varepsilon}^{2}(X'X)^{-1}})

The beta standard errors are the square root of the beta variances:
\sigma_{b}=\sqrt{\sigma_{b}^2}
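Putting the whole section together, here is a numpy sketch (with simulated data) that computes b, the MSE, the variance-covariance matrix of the betas, and the standard errors:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ Y)   # b = (X'X)^{-1} X'Y
eps = Y - X @ b                         # error vector
MSE = (eps @ eps) / (n - p - 1)         # estimate of the error variance
var_b = MSE * np.linalg.inv(X.T @ X)    # variance-covariance matrix of b
se_b = np.sqrt(np.diag(var_b))          # standard error of each beta
print(b, se_b)
```

A rough 95% confidence interval for each beta is then b plus or minus about 2 times its standard error.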