Linear Algebra with NumPy

Linear algebra is essential for understanding ML for three main reasons. First, when you read a book or an article on ML, models are very often explained with linear algebra; this is largely a matter of mathematical convenience, as explained below. Second, many models are built on linear algebra methods. Third, deep learning uses vectors and matrices extensively. Either way, if ML interests you, you need to get through linear algebra. This article covers its most important notions with NumPy examples.

What is Linear Algebra?

Linear algebra is the area of mathematics that focuses on linear equations, linear maps and their representation in vector spaces through matrices. Not all of linear algebra is needed in ML, so I only cover the essential notions here: vectors and matrices, the operations involving them, and linear equations and their matrix representation.

What is NumPy?

NumPy is a fundamental Python package providing multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on them. So it is a perfect package for reviewing some linear algebra! Install it by the usual method: pip install numpy.

Vectors

A vector is an ordered array of numbers, and we can identify each element by its position. A vector can also identify a point in space: for instance, a three-element vector describes a point in 3D space, with each element giving the coordinate along the corresponding axis.

Let’s create a simple vector in NumPy:

import numpy as np
a_vector = np.array([1, 2, 3])
print(a_vector)
[1 2 3]

We can find the number of dimensions of an np.array through the ndim attribute, its shape through shape and its transpose through T. Note that the transpose of a 1-D array is the array itself, unchanged.

print(f"The vectors {a_vector}'s shape is: {a_vector.ndim} \n \
The vectors {a_vector}'s transpose is: {a_vector.T} \n \
The vectors {a_vector}'s shape is: {a_vector.shape}. ")
The vectors [1 2 3]'s shape is: 1 
The vectors [1 2 3]'s transpose is: [1 2 3] 
The vectors [1 2 3]'s shape is: (3,). 
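
A quick aside: a 1-D array is neither a row nor a column vector, which is why T left it unchanged above. Reshaping it into an explicit 2-D column vector (a small sketch, reusing a_vector from above) makes the transpose visible:

# Reshape the 1-D array into an explicit (3, 1) column vector
column_vector = a_vector.reshape(-1, 1)
print(column_vector)
print(column_vector.T)
[[1]
 [2]
 [3]]
[[1 2 3]]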

NumPy uses broadcasting for operations between an array and a scalar. For instance, adding the scalar 2 to a vector adds 2 to every element of the vector. The same goes for multiplication, division and subtraction:

print(a_vector+2)
[3 4 5]

We can add and multiply vectors and subtract one from another.

The multiplication of two vectors is called the dot product: we sum the products of the pairwise elements of the two vectors, that is

a · b = Σᵢ aᵢbᵢ = a₁b₁ + a₂b₂ + … + aₙbₙ

so vectors a and b can only be multiplied if their shapes match. In terms of matrix multiplication (explained below), the dot product can be seen as the product of a row vector with a column vector:

another_vector = np.array([4, 5, 6])
print(f"Sum of a_vector ({a_vector}) and another_vector ({another_vector}) is {a_vector+another_vector}.")
print(f"Subtract a_vector ({a_vector}) from another_vector ({another_vector}) is {another_vector+a_vector}.")
print(f"Dot product of a_vector ({a_vector}) and another_vector ({another_vector}) is {another_vector.dot(a_vector)}.")
Sum of a_vector ([1 2 3]) and another_vector ([4 5 6]) is [5 7 9].
Subtracting a_vector ([1 2 3]) from another_vector ([4 5 6]) gives [3 3 3].
Dot product of a_vector ([1 2 3]) and another_vector ([4 5 6]) is 32.
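
To make the row-vector-times-column-vector view concrete, here is a minimal sketch (the reshape calls only turn the 1-D arrays into explicit 1×3 and 3×1 matrices):

# The same dot product written as an explicit (1x3) times (3x1) matrix product
row = a_vector.reshape(1, -1)          # shape (1, 3)
col = another_vector.reshape(-1, 1)    # shape (3, 1)
print(row.dot(col))
[[32]]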

One of the useful properties of the dot product is its interpretation as a measure of vector length: the square root of the dot product of a vector with itself is its Euclidean norm:

# NumPy's linalg module provides ways to compute norms. If no norm is
# specified, the Euclidean norm is returned, otherwise the specified one.
print(f"Length of a_vector is: {np.sqrt(a_vector.dot(a_vector))}")
print(f"The same can be computed with the linalg module: {np.linalg.norm(a_vector)}")
np.linalg.norm(a_vector) == np.sqrt(a_vector.dot(a_vector))

# To call other norms: 
print(f"L1 norm of a_vector: {np.linalg.norm(a_vector, ord=1)}")
Length of a_vector is: 3.7416573867739413
The same can be computed with the linalg module: 3.7416573867739413
L1 norm of a_vector: 6.0

The cosine rule links the norms of two vectors with the angle between them:

a · b = ‖a‖ ‖b‖ cos θ
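
For instance, we can recover the angle between the two vectors from their dot product and their norms (a small sketch reusing a_vector and another_vector from above):

# cos(theta) = a.b / (||a|| * ||b||)
cos_theta = a_vector.dot(another_vector) / (np.linalg.norm(a_vector) * np.linalg.norm(another_vector))
theta = np.arccos(cos_theta)
print(f"Cosine of the angle: {cos_theta}")   # roughly 0.97
print(f"Angle in radians: {theta}")          # roughly 0.23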

I will talk more about norms in a separate post.

Matrices

Matrices are 2-dimensional ordered arrays of numbers; each element is identified by its row and column index. If a matrix A has m rows and n columns, we write A ∈ ℝᵐˣⁿ.

For instance, a 3×3 matrix A can be written as:

A = [[A₁,₁  A₁,₂  A₁,₃]
     [A₂,₁  A₂,₂  A₂,₃]
     [A₃,₁  A₃,₂  A₃,₃]]

Its diagonal elements are Aᵢ,ᵢ. The transpose of a matrix is its mirror image across the main diagonal, so the transpose of A satisfies (Aᵀ)ᵢ,ⱼ = Aⱼ,ᵢ.

In NumPy, the shape and the number of dimensions are given by the shape and ndim attributes respectively, and the transpose by the T attribute.

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"The matrix A : \n {A}")
print(f"Its shape : \n {A.shape}")
print(f"Its dimension : \n {A.ndim}")
print(f"Its transpose : \n {A.T}")
#print(f"Its inverse : \n {A^-1}")
The matrix A : 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Its shape : 
 (3, 3)
Its dimension : 
 2
Its transpose : 
 [[1 4 7]
 [2 5 8]
 [3 6 9]]

Adding a scalar to a matrix, subtracting one from it, or multiplying/dividing it by a scalar is simple with NumPy, which uses broadcasting:
print(f"Adding 2 to every element of A: \n {A+2}")
print(f"Multiplying every element of A by 3: \n {A*3}")
Adding 2 to every element of A: 
 [[ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
Multiplying every element of A by 3: 
 [[ 3  6  9]
 [12 15 18]
 [21 24 27]]
We can add, subtract and multiply matrices with compatible shapes. The first two operations require the matrices to have the same shape, while matrix multiplication requires that the first matrix has as many columns as the second has rows.

The order of matrix multiplication matters: in general, AB ≠ BA.

B = np.array([[11, 12, 13], [14, 15, 16], [17, 18, 19]])
print(f"A = {A}")
print(f"B = {B}")
print(f"Adding A and B: \n {A+B}")
print(f"Substract A from B: \n {B-A}")
print(f"Multiply A by B: \n {A.dot(B)}")
print(f"Multiply B by A: \n {B.dot(A)}")
A = [[1 2 3]
 [4 5 6]
 [7 8 9]]
B = [[11 12 13]
 [14 15 16]
 [17 18 19]]
Adding A and B: 
 [[12 14 16]
 [18 20 22]
 [24 26 28]]
Subtracting A from B: 
 [[10 10 10]
 [10 10 10]
 [10 10 10]]
Multiply A by B: 
 [[ 90  96 102]
 [216 231 246]
 [342 366 390]]
Multiply B by A: 
 [[150 186 222]
 [186 231 276]
 [222 276 330]]

Inverse of a matrix:

This notion is very important in ML. We say that a matrix A ∈ ℝⁿˣⁿ is invertible, with inverse B ∈ ℝⁿˣⁿ, if:

AB = BA = I

B is the inverse matrix of A and is denoted A⁻¹.

Not every matrix has an inverse, and we will shortly see the conditions for invertibility. But before going any further, let's see why matrix inversion is so important.

Let's suppose that you have a system of linear equations:

A₁,₁ x₁ + A₁,₂ x₂ + … + A₁,ₙ xₙ = b₁
A₂,₁ x₁ + A₂,₂ x₂ + … + A₂,ₙ xₙ = b₂
…
Aₘ,₁ x₁ + Aₘ,₂ x₂ + … + Aₘ,ₙ xₙ = bₘ

This can be rewritten as:

Ax = b

with A ∈ ℝᵐˣⁿ the matrix of coefficients, x ∈ ℝⁿ the vector of unknowns and b ∈ ℝᵐ the vector of right-hand-side constants. Now if A is invertible, the system has exactly one solution:

x = A⁻¹b

Therefore the system of linear equations has a unique solution whenever its matrix form can be written with an invertible A.
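
As a quick sketch of the two routes to a solution (the 2×2 system below is just an arbitrary example):

# The system  4x + 7y = 2,  2x + 6y = 4  in matrix form
A_sys = np.array([[4., 7.],
                  [2., 6.]])
b = np.array([2., 4.])

# Solve via the inverse ...
print(np.linalg.inv(A_sys).dot(b))    # [-1.6  1.2]
# ... or via np.linalg.solve, which avoids forming the inverse explicitly
print(np.linalg.solve(A_sys, b))      # [-1.6  1.2]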

It is also possible for a system to have no solution or infinitely many solutions, but it is not possible to have more than one but finitely many solutions.

In order to analyse how many solutions there are, we can think of the columns of A as vectors specifying the directions in which we can move from the origin. These directions (the columns of A) are multiplied by the vector x: each element xᵢ of x multiplies the corresponding column of A and tells us how far we must move in that direction to finally reach the location indicated by b. The set of points obtainable as linear combinations of a set of vectors (here, the columns of A) is called the span of these vectors. Therefore, our system of linear equations has a solution if b ∈ span(columns of A).

We assumed that b ∈ ℝᵐ, and we stated that we can only have a solution if b lies in the span of the columns of A. For the system to have a solution for every b, we therefore need ℝᵐ ⊆ span(columns of A), which means that A must have at least m linearly independent columns.

What we are really after is the rank of the matrix: by definition, the rank of a matrix is the number of its linearly independent columns.

All the criteria above mean that, for A to be invertible, it needs to be square (that is, m = n) and all its columns need to be linearly independent. A square matrix whose columns are linearly dependent is said to be singular, so an invertible matrix is a square, non-singular matrix.

Happily, NumPy makes matrix inversion quite simple:

B = np.array([[4,7], [2,6]]) 
print(f"Inverse of matrix B: \n {np.linalg.inv(B)}")
print(f"Multiplying B with its inverse gives: \n {np.linalg.inv(B).dot(B)}")
Inverse of matrix B: 
 [[ 0.6 -0.7]
 [-0.2  0.4]]
Multiplying B with its inverse gives: 
 [[1. 0.]
 [0. 1.]]
# Not every matrix has an inverse: NumPy raises a LinAlgError since A is a singular matrix. 
np.linalg.inv(A)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-108-f3038be86b87> in <module>
      1 # Not every matrix has an inverse: NumPy raises a LinAlgError since A is a singular matrix.
----> 2 np.linalg.inv(A)

<__array_function__ internals> in inv(*args, **kwargs)

~/PycharmProjects/Articles/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in inv(a)
    545     signature = 'D->D' if isComplexType(t) else 'd->d'
    546     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 547     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    548     return wrap(ainv.astype(result_t, copy=False))
    549 

~/PycharmProjects/Articles/venv/lib/python3.7/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     95 
     96 def _raise_linalgerror_singular(err, flag):
---> 97     raise LinAlgError("Singular matrix")
     98 
     99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix
Some important matrices:

1) Square matrices have the same number of rows and columns (A ∈ ℝⁿˣⁿ).

2) Diagonal matrices are matrices in which the entries outside the main diagonal are all zero (the matrix C in the code below is an example). Diagonal matrices show up often in linear algebra since multiplying by a diagonal matrix is efficient: we only need to scale the other vector/matrix by the diagonal elements. Computing the inverse of a diagonal matrix is also computationally cheap. The inverse exists only if all the diagonal elements are non-zero, and it is itself a diagonal matrix whose diagonal elements are the reciprocals of the original ones. Many ML algorithms are simplified or made computationally more efficient by using diagonal matrices. Not all diagonal matrices are square, but non-square diagonal matrices have no inverse.

3) The identity matrix I ∈ ℝⁿˣⁿ is a square diagonal matrix with all its diagonal elements equal to 1.

4) A symmetric matrix is a matrix that equals its own transpose: A = Aᵀ.

5) We say that the vectors x and y are orthogonal if their dot product is zero (xᵀy = 0). If both vectors have non-zero norms, they are at 90 degrees to each other. Furthermore, if they are orthogonal and both have unit norm, they are said to be orthonormal. An orthogonal matrix is a square matrix with mutually orthonormal rows and columns, so if A is orthogonal then AᵀA = AAᵀ = I and A⁻¹ = Aᵀ. Orthogonal matrices come up often in ML because their inverse is this cheap to compute.

# Create a diagonal matrix by specifying the main diagonal 
C = np.diag([4., 8., 9.])
print(C)
# The inverse of a diagonal matrix is itself diagonal:
print(np.linalg.inv(C))
[[4. 0. 0.]
 [0. 8. 0.]
 [0. 0. 9.]]
[[0.25       0.         0.        ]
 [0.         0.125      0.        ]
 [0.         0.         0.11111111]]
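
To illustrate the orthogonal-matrix property from point 5 above, here is a small sketch using a 2-D rotation matrix (an arbitrary but convenient example of an orthogonal matrix):

# A 2-D rotation by 45 degrees is an orthogonal matrix
angle = np.pi / 4
Q = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])

# For an orthogonal matrix the inverse is simply the transpose
print(np.allclose(np.linalg.inv(Q), Q.T))   # True
print(Q.T.dot(Q))                           # approximately the 2x2 identity matrix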
# Find the matrix rank 
np.linalg.matrix_rank(A)
2

Eigendecomposition

First, let me give some motivation for the eigendecomposition. When we have a problem to solve, it often helps to break it down into subproblems from which we can draw more general conclusions. For instance, you might not know offhand whether 672 288 278 can be divided by 18, but if you break 672 288 278 down into its prime factors, you can easily answer the question. Breaking a problem down into subproblems can thus help us understand it in more detail, at a much lower cost.

The eigendecomposition does the same thing for matrices. It allows us to study the properties of a matrix by breaking it down into another form, which we call its eigendecomposition.

First, let's define the eigenvector: an eigenvector of a square matrix A is a non-zero vector v that satisfies Av = λv, with λ a scalar.
This means that multiplying v by A only changes the scale of v. λ is called the eigenvalue corresponding to the eigenvector v.

If v is an eigenvector of A, then of course any non-zero scalar multiple of v is also an eigenvector of A, with the same corresponding eigenvalue.

Now suppose A has n linearly independent eigenvectors {v₁, v₂, …, vₙ} with associated eigenvalues {λ₁, λ₂, …, λₙ}. Let's form a matrix Q whose columns are these eigenvectors,

Q = [v₁, v₂, …, vₙ]

and similarly let's collect the eigenvalues into

Λ = diag(λ₁, λ₂, …, λₙ)

that is, the square diagonal matrix whose diagonal elements are the eigenvalues.

Then the eigendecomposition of A is:

A = Q Λ Q⁻¹

Now of course not every matrix has an eigendecomposition, but if A is real valued and symmetric it is guaranteed to have one. In that case Λ is the diagonal matrix of eigenvalues, and Q can be chosen to be an orthogonal matrix, since the eigenvectors of a real symmetric matrix can be taken to be orthonormal. The inverse of Q therefore equals its transpose:

A = Q Λ Q⁻¹ = Q Λ Qᵀ

By convention, the eigenvalues in Λ (and the corresponding columns of Q) are arranged in ascending order.
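
As a quick check of the symmetric case (the matrix S below is just an illustrative example; np.linalg.eigh is NumPy's routine for symmetric/Hermitian matrices and returns the eigenvalues in ascending order):

S = np.array([[2., 1.],
              [1., 2.]])                # a real symmetric matrix
eigenvalues, Q = np.linalg.eigh(S)
print(eigenvalues)                                             # [1. 3.]
print(np.allclose(Q.T.dot(Q), np.eye(2)))                      # True: Q is orthogonal
print(np.allclose(Q.dot(np.diag(eigenvalues)).dot(Q.T), S))    # True: S = Q Λ Qᵀ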

Now, just as the prime decomposition tells us a lot about a number, the eigendecomposition tells us a lot about the matrix. For example:
1) Whether the matrix is singular: if any of the eigenvalues is equal to 0, then the matrix is singular (its columns are not linearly independent).

2) A matrix whose eigenvalues are all positive/negative is called positive/negative definite respectively. A matrix whose eigenvalues are all greater/less than or equal to zero is positive/negative semi-definite. Positive semi-definite matrices guarantee that

∀ x :    xᵀAx ≥ 0 

while positive definite matrices additionally guarantee that:

xᵀAx = 0     ⇒       x = 0 
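
As a small illustration of definiteness (the matrix P and the vector x below are arbitrary examples):

# P is symmetric with both eigenvalues positive, so it is positive definite
P = np.array([[3., 1.],
              [1., 2.]])
print(np.linalg.eigvals(P))       # both eigenvalues are positive (about 3.62 and 1.38)

x = np.array([1., -2.])
print(x.dot(P).dot(x))            # 7.0 > 0, as guaranteed for any non-zero x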

Let’s see the Eigendecomposition of a matrix with NumPy:

A = np.array([[5, 1], [3, 3]])
print(A)
# np.linalg.eig returns the eigenvalues and the eigenvectors of A
w, v = np.linalg.eig(A)
print(f"The eigenvalues of A are: \n {w}")
print(f"The eigenvectors of A are: \n {v}")
[[5 1]
 [3 3]]

The eigenvalues of A are: 
 [6. 2.]

The eigenvectors of A are: 
 [[ 0.70710678 -0.31622777]
 [ 0.70710678  0.9486833 ]]
# Show that the eigendecomposition gives back the matrix A 
v.dot(np.diag(w)).dot(np.linalg.inv(v))
array([[5., 1.],
       [3., 3.]])

Singular Value Decomposition (SVD)

This method is similar to the eigendecomposition in principle: it reveals much the same information, but it is more generally applicable. Every real matrix has an SVD, while only some square matrices have an eigendecomposition (non-square matrices have none).
The SVD factorises the matrix into matrices of singular vectors and a diagonal matrix of singular values. The SVD of a matrix A is:

A = U D Vᵀ

with A ∈ ℝᵐˣⁿ, U ∈ ℝᵐˣᵐ, D ∈ ℝᵐˣⁿ and V ∈ ℝⁿˣⁿ, where U and V are orthogonal matrices, D is diagonal and its diagonal elements are the singular values of A.
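
Here is a minimal sketch with np.linalg.svd (the 2×3 matrix M is an arbitrary example; NumPy returns Vᵀ directly and the singular values as a 1-D array):

M = np.array([[3., 2., 2.],
              [2., 3., -2.]])          # a 2x3 (non-square) matrix

U, s, Vt = np.linalg.svd(M)
print(U.shape, s.shape, Vt.shape)      # (2, 2) (2,) (3, 3)

# Rebuild the 2x3 matrix D with the singular values on its diagonal
D = np.zeros(M.shape)
D[:len(s), :len(s)] = np.diag(s)
print(np.allclose(U.dot(D).dot(Vt), M))   # True: M = U D Vᵀ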

Conditioning

Conditioning refers to how much a function's output changes with respect to a small change in its input. Consider the function f(x) = A⁻¹x. When A ∈ ℝⁿˣⁿ has an eigendecomposition, the condition number of A is the ratio of the magnitudes of its largest and smallest eigenvalues:

κ(A) = maxᵢ |λᵢ| / minⱼ |λⱼ|

When this number is large, matrix inversion is very sensitive to small changes in the input and the matrix is said to be ill-conditioned. The condition number is formally defined as the asymptotic worst-case relative change in the output for a relative change in the input. Intuitively, a high condition number means that small changes in the input lead to large changes in the output, which makes the optimal solution hard to find.
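
A minimal sketch (the diagonal matrix below is an arbitrary example; np.linalg.cond uses the 2-norm by default, which for a symmetric matrix coincides with the eigenvalue ratio above):

# A diagonal (hence symmetric) matrix with eigenvalues 1 and 1000
A_ill = np.diag([1., 1000.])

eigenvalues = np.linalg.eigvals(A_ill)
print(np.abs(eigenvalues).max() / np.abs(eigenvalues).min())   # 1000.0
print(np.linalg.cond(A_ill))                                   # 1000.0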

Conclusion

This short blog post covered some of the most important notions of linear algebra. It is meant as a little recap; if you need to look at these notions in more detail, check out the references! And thanks for reading!

References

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

 
