The Winding Number: linear transformations

Showing posts with label linear transformations. Show all posts

Null, row spaces, transpose, fundamental theorem of linear algebra

This is an exciting article, so pay close attention.

In the last article, we introduced the column space, and we are lead to wonder about the vectors that are mapped to the zero vector (let's call the set of these vectors the "null space"). Based on some of the transformations we've seen, we might wonder if the null space is essentially the subspace perpendicular to the column space, i.e. the set of all vectors perpendicular to every vector in the column space (we call this the "orthogonal complement" of the column space).

However, this is demonstrably wrong. Suppose one rotates the co-ordinate system before collapsing it onto a lower dimension. The "collapsing" matrix, the singular matrix, may be $\left[ {\begin{array}{*{20}{c}}1&0\\0&0\end{array}} \right]$, and the rotation matrix may be $\left[ {\begin{array}{*{20}{c}}0&{ - 1}\\1&0\end{array}} \right]$, then the composite transformation is $\left[ {\begin{array}{*{20}{c}}1&0\\0&0\end{array}} \right]\left[ {\begin{array}{*{20}{c}}0&{ - 1}\\1&0\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}0&{ - 1}\\0&0\end{array}} \right]$

In general in $\mathbb{R}^2$, if the "simple collapse" transformation looks like this:

$$\left[ {\begin{array}{*{20}{c}}{{{\cos }^2}\phi }&{\cos \phi \sin \phi }\\{\cos \phi \sin \phi }&{{{\sin }^2}\phi }\end{array}} \right]$$
(Show that the non-rotating collapses can be written in this way), and the rotation matrix looks like this:

$$\left[ {\begin{array}{*{20}{c}}{\cos \theta }&{\sin \theta }\\{ - \sin \theta }&{\cos \theta }\end{array}} \right]$$
Then their product

$$\left[ {\begin{array}{*{20}{c}}{{{\cos }^2}\phi }&{\cos \phi \sin \phi }\\{\cos \phi \sin \phi }&{{{\sin }^2}\phi }\end{array}} \right]\left[ {\begin{array}{*{20}{c}}{\cos \theta }&{\sin \theta }\\{ - \sin \theta }&{\cos \theta }\end{array}} \right] \\= \left[ {\begin{array}{*{20}{c}}{\cos \phi \cos (\phi - \theta )}&{\cos \phi \sin (\phi - \theta )}\\{\sin \phi \cos (\phi - \theta )}&{\sin \phi \sin (\phi - \theta )}\end{array}} \right]$$
Which is also a singular matrix with the same rank and column space as the original, but its null space is a rotated version of the original, and therefore no longer perpendicular to the column space.

We make the following observations:

The rotation has introduced an "asymmetry" in the components of the matrix -- the matrix is no longer symmetric across the principal diagonal, the rows and columns are no longer the same. In our original terminology before we introduced matrices, the "contribution of y to x" is no longer the same as "the contribution of x to y.
The span of the column vectors is not dependent on $\theta$ (proving again that the column space is not the space orthogonal to the null space), but the span of the row vectors is -- we'll call this the row space. The row space used to be the same as the column space, but is now rotated clockwise by an angle of $\theta$. Since the space is rotated counter-clockwise by $\theta$ before being collapsed, this means the null space is $\theta$ clockwise to what its position would have been without the rotation.

Hence, the null space is perpendicular to the row space, at least if the pre-collapse transformation is a rotation. It would be a very educated guess to say that this applies generally, that the null space of a transformation is always perpendicular to the row space.

Indeed, this is the case, and is pretty easy to show:

$$\left[ {\begin{array}{*{20}{c}}{{r_1}}\\ \vdots \\{{r_n}}\end{array}} \right]\vec x = 0\,\,\, \Rightarrow \,\,\,{r_i} \cdot \vec x = 0\,\,\,\,\,\,\forall i \in [1,n] \cap \mathbb{Z}$$
More interestingly, however, we made a pretty important observation about the symmetry of a matrix: asymmetry in the matrix seems to be a measure of "rotation-ish" a matrix is. The reason this makes sense is that while the values on the principal diagonal talk about how much each component is scaled in the resulting image, the off-diagonal elements talk about how much contribution there is to one component of a vector from the component in another direction, something that happens during rotations.

If the contribution from x to y is exactly the same as the contribution from y to x, then the xy-plane isn't really being rotated, but the basis vectors are rather being pulled closer to/apart from each other. On the other hand, when they are different, the "orientation" of the axes changes, and an actual rotation is induced.

The most "rotation-ish" effect is created when the values of the matrix are the negative of the their reflection across the principal diagonal, because it means the basis vectors are being rotated by the same angle in the same direction.

Quick exercise: plot how such a transformation looks in $\mathbb{R}^2$. You'll notice that the rotation isn't as "perfect" as you might've hoped -- the vectors change length, etc., and it's the orthogonal displacement in the basis vectors that is the same in orthogonal directions for the basis vectors, not the angles.

These changes in length are a result of scaling, i.e. the values along the principal diagonal.

Well, not exactly, because the values along the principal diagonal scale the part of the basis vectors in their original direction, not the overall length of the basis vectors. This is why (i) the resulting matrix not only eliminates the scaling, but also ensures the rotation is in right-angles, (ii) the resulting matrix is not actually just a bunch of pure rotation, as its rotations are still scaled, and (iii) why some otherwise-antisymmetric matrices with a principal diagonal, like$\left[ {\begin{array}{*{20}{c}}
{\cos \theta }&{ - \sin \theta } \\
{\sin \theta }&{\cos \theta }
\end{array}} \right]$, can still be rotations in some angle other than a right angle. Highly recommended reading: my answer to "Intuition behind the Speciality of Symmetric Matrices".

Therefore, it is useful to extract from the matrix the purely anti-symmetric part, with a zero principal diagonal:

$$\left[ {\begin{array}{*{20}{c}}0&{{b_{12}}}& \ldots &{{b_{1n}}}\\{ - {b_{12}}}&0& \ldots &{{b_{2n}}}\\ \vdots & \vdots & \ddots & \vdots \\{ - {b_{1n}}}&{ - {b_{2n}}}& \ldots &0\end{array}} \right]$$
This matrix is called a skew-symmetric matrix or an anti-symmetric matrix, and in $\mathbb{R}^n$ is essentially a combinations of scaled rotations around different axes.

Meanwhile, one may define a symmetric matrix in the following way:

$$\left[ {\begin{array}{*{20}{c}}b_{11}&{{b_{12}}}& \ldots &{{b_{1n}}}\\{{b_{12}}}&b_{12}& \ldots &{{b_{2n}}}\\ \vdots & \vdots & \ddots & \vdots \\{{b_{1n}}}&{{b_{2n}}}& \ldots &b_{nn}\end{array}} \right]$$
These matrices essentially scale and skew vectors -- the principal diagonal components do the scaling in each direction, and the other components do the skewing, i.e. tilting the basis vectors towards each other.

One may wonder if it would have been better to simply deal with diagonal matrices instead of symmetric ones. However, as we will see, it turns out that any matrix can be written as the sum of a symmetric matrix and an anti-symmetric matrix. It is also worth noting that we will eventually see that a symmetric matrix is a generalisation of a scaling matrix, where the scaling is done in some arbitrary directions that need not be the basis vectors. These vectors are called "eigenvectors".

Looking at the definitions of symmetric and antisymmetric matrices, it is clear that any matrix may be written as the sum of a symmetric matrix and an antisymmetric one. Specifically, one may write:

$$A = \underbrace {\frac{1}{2}(A + {A^T})}_{\scriptstyle{\rm{symmetric }}\atop\scriptstyle{\rm{part}}} + \underbrace {\frac{1}{2}(A - {A^T})}_{\scriptstyle{\rm{antisymmetric }}\atop\scriptstyle{\rm{part}}}$$
Where $A^T$ is the "transpose" of $A$, referring to the matrix formed upon taking flipping $A$'s entries across the principal diagonal. E.g. the row space of $A$ is the column space of $A^T$.

Does this all remind you of something?

Any entity can be written in two "parts" of a specific nature, which look curiously like the exponential form of the cosine and sine functions
These two parts represent scaling and scaled $\pi/2$ rotation respectively.

It should remind you of the Cartesian form of a complex number.

A symmetric matrix is "like" a real number, and an anti-symmetric matrix is "like" an imaginary number. A complex number can be thought of as an object to be transformed (like a vector) as well as as a transformation itself (like a matrix).

However, it is important to highlight the differences: (i) while a real number only scales things in their own direction (like a multiple of $I$), a symmetric matrix can skew things, which is equivalent to scaling across a different set of axes. Therefore while complex numbers can encode spiral transformations, matrices encode all linear transformations. (ii) complex numbers transform on the complex plane, which is a two dimensional plane like $\mathbb{R}^2$, while linear transformations can operate in any number of dimensions, and not necessarily even just in $\mathbb{R}^n$. An example of a consequence of this would be that an anti-symmetric matrix can encode a series of right-angle rotations with corresponding scalings in dimensions greater than three, but an imaginary number can only correspond to a rotation around an axis pointing out of the plane.

It is fair to say that linear algebra generalises complex numbers with matrices. For some examples of correspondence between specific complex numbers, see Introduction to symmetry, section "symmetry on the complex plane". We will formalise this whole idea of matrices being "like real numbers" and "like imaginary numbers" when we do eigenvalues and eigenbases -- in fact, antisymmetric matrices have imaginary eigenvalues.

The complex conjugate, similarly, is generalised to the transpose. Explain how. (Hint: $A^T$ does precisely the opposite rotation-ish, i.e. antisymmetric actions as $A$ while preserving the symmetric actions. What does this tell you about matrices whose transpose and inverse are equal? What does it tell you about matrices that are equal to their transpose?)

Watch the above Khan Academy video.
Come up with an intuitive explanation for why the row space solution is the shortest.

The three following results:

Row space is perpendicular to the null space
Row rank equals column rank
Rank plus nullity equals dimension (Rank-nullity theorem)

Together imply what is called the fundamental theorem of linear algebra -- that the row space is the orthogonal complement of the null space. Why? To be the orthogonal complement, you need to be orthogonal (implied by 1), and your dimensions need to add up to the total dimension of the space. The latter is implied by 3, but our intuition applies perhaps better to the dimension of the column space, and 2 implies that this has the same dimension as the row space, and we're done.

Among these three results, this article provides fairly solid intuition for the first and the third. The second is really assumed throughout this article, but if you want really solid intuition for it, read on.

Consider an $m$ by $n$ matrix $A$ with linearly independent rows (i.e. its row rank is $m$) (we can always turn matrices to this form with a bunch of zero rows at the end via row operations). Then the nullity of $A$, i.e. the number of solutions to $Ax=0$ is $n-m$, as there are $m$ equations in $n$ variables. Thus we have that row rank plus nullity equals the dimension. But we know from our "collapsing intuition" that column rank plus nullity equals the dimension. And we're done.

An obvious consequence of rank-nullity is that injectivity is equivalent to surjectivity. However, this is not true for infinite-dimensional spaces. Why not? You might find it's related to Hilbert's hotel.

It's worth noting that while we've pretended that having the column space equal to the row space is equivalent to being symmetric, this is not really true. There's a broader set of matrices for which the column and row spaces are equal, called "rank-symmetric" or EP matrices.

This article is extended in SVD, polar decomposition, normal matrices; a re-look at transposes and FTLA.

Inverses, determinants, column spaces, non-square matrices

PART 1

First, let's look at a graphical way of expressing a linear transformation $L:V\to W$. We choose a set of parallel lines from $V$, transform them under $L$ and graph their images in $W$. In the following example, we take the linear transformation of a rotation of $-\pi/4$ (about the origin) and use a set of lines parallel to the x or y-axes with integer y and x-values respectively.

(Courtesy: Interactive Matrix Visualization)

If you've thought about linear transformations a bit, you might have realised that some linear transformations map distinct vectors onto the same line (and by linearity, this would imply that there are distinct vectors that are mapped to the same vector). This would only be possible if the transformation mapped the plane to a lower dimension (again, this is because we're dealing with a linear transformation). It seems that it would be useful to study such transformations in greater detail.

For one, such a transformation wouldn't have an inverse, because it is not possible to determine exactly which vector maps to a given vector in the target space.

If we have $Ax=v$ for some known $A$ and $v$, then we can find the inverse for an ordinary $A$. But if $A$ is of the form mentioned -- we call this a "singular matrix" -- this is not possible, and there is either an infinite number of solutions for $x$ (if $v$ lies in the target space/range of $A$) or no solution for $x$ (if it doesn't).

In terms of linear equations, these correspond to cases of the form $x + 2y = 4, x + 2y = 8$ and $x + 2y = 4, x + 2y = 9$ respectively. Indeed, these are the sort of linear equations that one obtains from expanding out $Ax=v$ and the equations in the components become the linear equations.

Think: functions, invertible functions, etc.

Let's think about how one can identify a singular matrix. In two dimensions, this is relatively obvious -- a matrix is singular if it maps the space to either 0 or 1 dimensions, and in both cases, the unit vectors are mapped onto the same line -- hence, the column vectors are multiples of each other.

In more than two dimensions, this is not necessary -- for example, the three basis vectors in $\mathbb{R}^3$ may be mapped onto a plane, which would cause the space to be transformed onto this plane.

Similarly, in $\mathbb{R}^4$, if four basis vectors are mapped to a 3-dimensional hypersurface, OR at least three of the basis vectors are mapped onto a 2-dimensional plane, OR at least two of the basis vectors are mapped onto a line, OR at least one of the basis vectors is mapped onto a point (the origin), the space is reduced to a space of lower dimension and the matrix is singular.

Each of these cases correspond to linear dependence between the column vectors. This is, in general, our criterion for singularity: a matrix with linearly dependent columns is singular.

Note that this statement can also be formulated based on rows, due to the connection with linear equations we mentioned earlier -- this should foreshadow a certain identity regarding the determinant that we will derive later.

Name the "kind" of singular matrices that each of the following is (i.e. what is its range?) without graphing or calculating anything. Consider where the basis vectors are mapped. $\vec a, \vec b, \vec c$ are linearly independent vectors in $\mathbb{R}^4$.

$$\left. \begin{array}{l}(a)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{2\vec a}&{\vec b}&{\vec c}\end{array}} \right]\\(b)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{\vec a + \vec b}&{\vec b}&{\vec c}\end{array}} \right]\\(c)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{\vec a + \vec b + 2\vec c}&{\vec b}&{\vec c}\end{array}} \right]\\(d)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{2\vec a}&{3\vec a}&{\vec c}\end{array}} \right]\\(e)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{2\vec a}&{2\vec b}&{\vec b}\end{array}} \right]\\(f)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{2\vec a}&{\vec a + 2\vec b}&{\vec b}\end{array}} \right]\\(g)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{\vec b}&0&{\vec c}\end{array}} \right]\\(h)\,\,\left[ {\begin{array}{*{20}{c}}{\vec a}&{\vec b}&0&{3\vec b}\end{array}} \right]\end{array} \right\}$$

(a) Two basis vectors are mapped onto a line, hence range is a three-dimensional subspace.
(b) Three basis vectors are mapped to the same plane, hence the range is a three-dimensional space.
(c) Four basis vectors are mapped onto a 3D subspace, hence range is a 3D subspace.
(d) Three basis vectors are mapped onto a line, hence range is a plane.
(e) Basis vectors are mapped onto one of two lines, hence range is a plane.
(f) Two basis vectors are mapped onto a line on the same plane as the other basis vectors, thus plane.
(g) A basis vector is mapped to zero, hence range is a 3D space.
(h) A basis vector is mapped to zero and two vectors to a line, hence range is range is a plane.

In general, the dimension of the range of the transformation is the number of vectors we're left with after removing the redundant ones, i.e. the range is the span of the column vectors.

Note that the zero vector is always redundant, as it doesn't span anything and is always a zero multiple of any other vector.

This "range" we've been talking about is called the column space of the matrix. The dimension of the column space is called the rank. This is true also for non-singular matrices, where the column space becomes the same space as the original, i.e. the space of all vectors the matrix can act on. Then the rank is at its maximum, and the transformation is called full-rank.

PART 2

We are interested in finding a quantity that, rather than taking discrete values and abruptly changing for singular matrices as the rank does, returns a continuous scalar value expressing how much a transformation smashes things to near zero.

It is reasonable to use the n-volume of the n-parallelepiped contained by the column vectors (i.e. by the images of the basis vectors) as this quantity. This volume is called the determinant of the transformation. It is clear that when the determinant is zero, the transformation is singular, and the reverse is true. Therefore the law to determine whether a matrix is singular applies to finding whether the determinant is zero.

Based on what we already know, it is possible to show, geometrically, that:

Multiplying a row or column by a scalar scales the determinant by the same scalar.
$\det(AB)=\det(A)\det(B)$ -- this essentially means that the determinant can be regarded as a "ratio" of volumes, i.e. it's a scaling factor, and measures the "amount" of the transformation, like an absolute value or magnitude.
If $A$ and $B$ have all the same elements except one row/column, then the determinant of another matrix $C$ which also has all the same elements except that row/column which is a sum of the corresponding row/column of $A$ and $B$, is equal to $\det(A)+\det(B)$.

Here's another property: you might recall that for some triangle whose sides are given by vectors $a$, $b$ and $c$, its area can be given by $\frac12 |a\times b| = \frac12 |a\times c| = \frac12 |b\times c|$. With our knowledge of determinants, we can write $|a\times b| = \left|\det \left[ {\begin{array}{*{20}{c}}
a \\
b
\end{array}} \right]\right|$, and analogous expressions for $a \times c$, etc. -- these expressions are all equal. Notice that $c=a-b$ or something to that effect -- this means the determinants $\det \left[ {\begin{array}{*{20}{c}}
a \\
b
\end{array}} \right] = \det \left[ {\begin{array}{*{20}{c}}
a \\
{b - a}
\end{array}} \right]$, etc.

This can be generalised to any number of dimensions: subtracting any scalar multiple of a row or column from another row or column (respectively) of a matrix leaves the determinant unchanged.

This is very useful in computing a determinant in practice.

We now direct our attention to actually computing an inverse of a matrix.

We know that $A^{-1}A=I$, by definition. Here's our strategy to find $A^{-1}$: we multiply $A$ by a series of matrices until we get $I$. In order to keep track of the matrices, we simultaneously multiply $I$ by the same matrices. Now since the product of these matrices with $A$ is $I$, this matrix is $A^{-1}$, and multiplying them with $I$ gives us $A^{-1}$.

What are some simple linear transformations we can apply on $A$ that we know how to manipulate to eventually yield $I$ (at least for a non-singular matrix)? Well, there is a set of such linear transformations, called row operations (or column operations, which serve the same purpose), which essentially add, subtract and scalar-multiply rows and columns as if they were simultaneous equations we are trying to solve.

Think: what would be the result of doing this on a singular matrix, besides the usual "universe blows up" stuff?

As a general rule: anything we can legitimately do to linear equations, is a legitimate row operation. This makes a lot of sense, given that the solution to $Ax=v$ is $v=A^{-1}v$. It is immediately clear that $\det(A^{-1})=\det(A)^{-1}$, which generalises our earlier identity about $\det(A^n)$ very nicely to negative integers.

More on determinants and inverses in the computational guide.

PART 3

A quick note on non-square matrices: it should be clear that non-square matrices are maps between vectors of different dimensions. Their determinant is not traditionally defined for them, but some generalisations do exist -- recommended reading: Generalisations of the determinant to interdimensional transformations: a review.

PART 4

You should make a list of things equivalent to invertibility for a square matrix. Here's a list -- some of the stuff here will make sense later:

Invertibility
Determinant non-zero
Linearly independent rows, columns
Full-rank
Zero-nullity
Injectivity
Surjectivity
Bijectivity
Image = Co-domain = Domain
RREF is the identity matrix
No zero eigenvalues
No zero singular values

Composition of linear transformations: matrix multiplication

Matrix multiplication is funny, because it's a more universally important and perhaps more fundamental operation than matrix addition is. This is often not emphasised in linear algebra textbooks, which treat matrices primarily as arrays of numbers with interesting properties, as opposed to something motivated directly out of the study of linear transformations, but should become pretty clear once we understand where matrix multiplication actually comes from.

The first, most important property of a linear transformation one needs to recognise in order to understand the significance of matrix multiplication is that the composition of two linear transformations is also a linear transformation. This will turn out to be quite significant when discussing group theory, and will actually be one of the fundamental requirements for identifying something as a group, but let's try to make this clearer right now.

The statement can be proven pretty easily as follows: suppose the transformation in question is $L = M \circ N$. Then:

$$\begin{array}{c}L(ax + by) = M(N(ax + by))\\ = M(aN(x) + bN(y))\\ = aM(N(x)) + bM(N(y))\\ = aL(x) + bL(y)\end{array}$$
It is trivial from this point (by induction) that the composition of any number of linear transformations is also a linear transformation.

Why this is so significant is that it means we can talk about compositions of transformations without talking about them in reference to a specific vector. Much like how the fact that linear transformations are linear allowed us to write them in matrix form (remember, the matrix form is just the image of the basis), the fact that compositions of linear transformations are linear allows us to do the same thing with them. In other words, the fact that compositions of linear transformations are linear means that a composition acts on every vector in exactly the same way, since the vector can be written as a linear combination of other vectors.

When thinking about a matrix as a set of vectors, it's important to realise that we're talking about an ordered set of vectors, i.e. it's possible to distinguish between the n basis vectors even after the transformation. Transforming (1, 0) to (2, 0) and (0, 1) to (3, 1) is not the same as transforming (1, 0) to (3, 1) and (0, 1) to (2, 0), i.e. $\left[ {\begin{array}{*{20}{c}}2&3\\0&1\end{array}} \right] \ne \left[ {\begin{array}{*{20}{c}}3&2\\1&0\end{array}} \right]$.

You might be thinking: aren't the axes symmetric? Isn't there no way to distinguish between the unit vector in the x-direction and the unit vector in the y-direction? This is true -- however, once you label them, they can be distinguished. I.e. you can pick out the image of the x-unit vector under the transformation and the x-component of a vector being transformed and point out that they are related. Another way of putting this is that the symmetry between axes means switching between x and y wherever they appear leaves a correct statement correct.

For example, we know that $\left[ {\begin{array}{*{20}{c}}3&2\\8&4\end{array}} \right]\left[ {\begin{array}{*{20}{c}}6\\5\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}{28}\\{68}\end{array}} \right]$. Now suppose we switch the x- and y- axes.

Then first, we would have to switch the image of the x-basis vector and the y-basis vector: $\left[ {\begin{array}{*{20}{c}}2&3\\4&8\end{array}} \right]$. That's not enough, however, because we also need to switch the x- and y- components of these images, so our matrix looks like this: $\left[ {\begin{array}{*{20}{c}}4&8\\2&3\end{array}} \right]$. Similarly, we must switch the x- and y-components of the input vector to make it look like this: $\left[ {\begin{array}{*{20}{c}}5\\6\end{array}} \right]$.

Now, if we compute the matrix vector product, we get $\left[ {\begin{array}{*{20}{c}}{68}\\{28}\end{array}} \right]$, which is precisely the same result with the x- and y- components inverted, telling us that the output transforms in the same way as the input when x and y are switched, or that matrix multiplication is invariant under such a transformation.

More on this at Introduction to symmetry (1201-001), under "Symmetry on the complex plane".

The point of this detour was to introduce an example of an idea involving composition of transformations and how they affect the output.

Now, let's try to find a general form for the matrix representing a composition of transformations. Once again, we start with two dimensions, and once again, we note that it is sufficient to calculate the image of the basis vectors after two transformations and put them together as a matrix.

The first transformation can be represented by the matrix $\left[ {\begin{array}{*{20}{c}}a&c\\b&d\end{array}} \right]$, which means that after the first transformation, this is what the basis vectors have been transformed to.

To see what the second transformation (i.e. the transformation on the left), say $\left[ {\begin{array}{*{20}{c}}p&r\\q&s\end{array}} \right]$ does to this transformed basis, just multiply this matrix with each column of $\left[ {\begin{array}{*{20}{c}}a&c\\b&d\end{array}} \right]$. In other words, to multiply two matrices, just multiply the matrix on the left (the transformation applied later) with each column of the matrix on the right (the transformation applied first).

The result should look like this:

$$\left[ {\begin{array}{*{20}{c}}{ap + br}&{cp + dr}\\{aq + bs}&{cq + ds}\end{array}} \right]$$
The same, of course, can be extended to more than two dimensions.

You should be referring to the computational reference guide as you read this -- often, linear algebra is learnt as if it were all about computation. While this is not true, it is useful to have a good command of the computation once you understand the insights.

However, I would advise you to stay away from using computational techniques for proving things in linear algebra -- it becomes tedious pretty quick -- for example, prove that $(AB)C=A(BC)$, i.e. that matrix multiplication is associative, in $\mathbb{R}^n$. It's possible to prove this pretty easily and intuitively from just the idea of a linear transformation -- it's unnecessarily tedious via computation.

Linear algebra is not limited to $\mathbb{R}^n$, which is why this "trivial" proof is really more rigorous than the computational one. Mathematics is really all about such heuristics, because it's ultimately these heuristics that are entailed by formal proofs, not unnecessarily complicated computation. The heuristic proofs deliver understanding much better than computational ones, and you'll find plenty of examples of this within linear algebra.

Why matrices?

This post is a bit more tedious than necessary, but I suggest you stick with it to the end and not skip to the simpler derivations at the end so you can appreciate the elegance of linear algebra. We will start with picking a few standard linear transformations in $\mathbb{R}^2$ and writing the components of the transformed vector in terms of the components of the initial vector.

Reflection
The most elementary reflection in $\mathbb{R}^2$ is a reflection across the origin. This is, in fact, the only point across which a reflection is a linear transformation, as reflecting about any other point violates the condition that the origin must remain fixed. The reflection of some vector $\left[ {\begin{array}{*{20}{c}}x\\y\end{array}} \right]$ across the origin is simply $\left[ {\begin{array}{*{20}{c}}-x\\-y\end{array}} \right]$

Next, we consider the reflection of a vector across a specific axis. It is clear that a reflection across the x-axis maps the given vector to $\left[ {\begin{array}{*{20}{c}}x\\-y\end{array}} \right]$, while a reflection across the y-axis maps it to $\left[ {\begin{array}{*{20}{c}}-x\\y\end{array}} \right]$. In addition, we observe that the reflection of the vector about the line $y=x$ is $\left[ {\begin{array}{*{20}{c}}y\\x\end{array}} \right]$, and that reflection about the line $y=-x$ maps it to $\left[ {\begin{array}{*{20}{c}}-y\\-x\end{array}} \right]$.

Now that we're done with the trivial ones, let's consider a reflection in the generic line $y=mx$ (having a non-zero y-intercept would mean the transformation isn't linear, although it would still be affine).

Visualisation of reflection transformation

We consider the problem in polar co-ordinates -- the original co-ordinates $(x,y)$ in $(r,\theta)$ form are $r=\sqrt{x^2+y^2}$, $\theta=\arctan(y/x)$. $(x',y')$ has the same value of $r'=r$ (prove it -- hint and [SPOILER ALERT]: congruent triangles), while $\theta'=2\arctan(m)-\arctan(y/x)$. Based on the identity for the sum of three inverse tangents, $\arctan a +\arctan b +\arctan c=\arctan\frac{a+b+c-abc}{1-ab-ac-bc}$ this turns into $\theta'=\arctan\frac{2m-(y/x)+m^2y/x}{1-m^2+2my/x}=\arctan\frac{2mx-y+m^2y}{2my+x-m^2x}$.

Converting back into Cartesian co-ordinates, $x'=r\cos\theta=r\cos\arctan\frac{2mx-y+m^2y}{2my+x-m^2x}$ and $y'=r\sin\theta=r\sin\arctan\frac{2mx-y+m^2y}{2my+x-m^2x}$. Recall that $\cos\arctan Y/X=X/R$ and $\sin\arctan Y/X=Y/R$ where $R=\sqrt{X^2+Y^2}$. In this case, it can easily be shown that $R=(m^2+1)\sqrt{x^2+y^2}=(m^2+1)r$, thus $x' = r\frac{2my+x-m^2x}{(m^2+1)r}$ and $y'=r\frac{2mx-y+m^2y}{(m^2+1)r}$. Hence

$$\left[ {\begin{array}{*{20}{c}}{x'}\\{y'}\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}{ - \frac{{{m^2} - 1}}{{{m^2} + 1}}x + \frac{{2m}}{{{m^2} + 1}}y}\\{\frac{{2m}}{{{m^2} + 1}}x + \frac{{{m^2} - 1}}{{{m^2} + 1}}y}\end{array}} \right]
$$
(You might notice some vague similarity to the half-angle tangent formula and complex numbers -- consider the corresponding complex number $x'+y'i$ and prove that it can be reduced to $x' + y'i = \frac{{1 + im}}{{1 - im}}x - \frac{{1 + im}}{{1-im}}yi$, i.e. $z'=\frac{1+im}{1-im}\bar{z}$)

You might be starting to notice a similarity among these transformations -- the resulting $x'$ and $y'$ are each pure linear forms in $x$ and $y$, i.e. of the form $ax+by$. How might this make our job easier?

Rotation
We know that a counter-clockwise rotation by an angle of $\pi/2$ maps the vector $\left[ {\begin{array}{*{20}{c}}x\\y\end{array}} \right]$ to $\left[ {\begin{array}{*{20}{c}}-y\\x\end{array}} \right]$. Similarly, a rotation by $\pi$ is $\left[ {\begin{array}{*{20}{c}}-x\\-y\end{array}} \right]$ (which can also be obtained by applying the $\pi/2$ transformation twice) and a rotation by $3\pi/2$ yields a transformed vector of $\left[ {\begin{array}{*{20}{c}}y\\-x\end{array}} \right]$.

We now consider a general rotation by $\phi$. Again, the problem becomes much simpler in polar co-ordinates, where it is clear that the transformation transforms $(r,\theta)$ to $(r',\theta')=(r,\theta+\phi)$. Then

$$\begin{array}{c}\left[ {\begin{array}{*{20}{c}}{x'}\\{y'}\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}{r\cos (\theta + \phi )}\\{r\sin (\theta + \phi )}\end{array}} \right]\\ = \left[ {\begin{array}{*{20}{c}}{r(\cos \theta \cos \phi - \sin \theta \sin \phi )}\\{r(\cos \theta \sin \phi + \sin \theta \cos \phi )}\end{array}} \right]\\ = \left[ {\begin{array}{*{20}{c}}{\frac{x}{{\cos \theta }}\cos \theta \cos \phi - \frac{y}{{\sin \theta }}\sin \theta \sin \phi }\\{\frac{x}{{\cos \theta }}\cos \theta \sin \phi + \frac{y}{{\sin \theta }}\sin \theta \cos \phi }\end{array}} \right]\\ = \left[ {\begin{array}{*{20}{c}}{x\cos \phi - y\sin \phi }\\{x\sin \phi + y\cos \phi }\end{array}} \right]\end{array}
$$
Matrices
Similarly, one could write dilation/scaling as $(x',y') = (ax,by)$, a shear parallel to the x-axis could be written as $(x',y')=(x+\lambda y,y)$ and a shear parallel to the y-axis could be written as $(x',y')=(x,y+\lambda x)$.

It seems that one needs four numbers to define any linear transformation from $\mathbb{R}^2\rightarrow\mathbb{R}^2$: the coefficient on $x$ in $x'$, the coefficient of $y$ in $x'$, the coefficient on $x$ in $y'$ and the coefficient of $y$ in $y'$. In general, a linear transformation $\mathbb{R}^n\rightarrow\mathbb{R}^n$ could be represented with $n^2$ real numbers.

$n^2$ real numbers can, of course, be arranged into an $n\times n$ array of real numbers, which is precisely what we mean by a "matrix".

The following array shows what each component of a matrix represents:

$$\left[ {\begin{array}{*{20}{c}}{x\backslash x}&{x\backslash y}&{x\backslash z}\\{y\backslash x}&{y\backslash y}&{y\backslash z}\\{z\backslash x}&{z\backslash y}&{z\backslash z}\end{array}} \right]$$
Where the notation $x \backslash y$ refers to the contribution of $y$ to $x'$, i.e. the coefficient on $y$ when $x'$ is written in terms of $x,y,z$. This can, of course, be generalised to any number of dimensions.

With matrices, we now have a way of writing down linear transformations in a clear and explicit way (a co-ordinate dependent one -- an idea which we will eventually come to). Instead of writing the components of the transformed entity (called the image) in terms of the components of the input, we can represent transformations on their own as matrices, which helps us greatly in algebraic manipulation.

Restricting ourselves to two dimensions for simplicity, let's compute the effect of a linear transformation on (1,0) and (0,1). We see that

$$\begin{array}{l}\left[ {\begin{array}{*{20}{c}}a&b\\c&d\end{array}} \right]\left[ {\begin{array}{*{20}{c}}1\\0\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}a\\c\end{array}} \right]\\\left[ {\begin{array}{*{20}{c}}a&b\\c&d\end{array}} \right]\left[ {\begin{array}{*{20}{c}}0\\1\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}b\\d\end{array}} \right]\end{array}$$
This is very important -- you can confirm this for more than two dimensions -- it tells you that the matrix is just made up of columns, each being the image of the basis vectors (arranged in the same order as the components of a vector) under the matrix. In other words, the matrix is the image of the basis under the transformation.

One may notice that a linear transformation on a general vector takes the following form:
$$\left[ {\begin{array}{*{20}{c}}a&b\\c&d\end{array}} \right]\left[ {\begin{array}{*{20}{c}}x\\y\end{array}} \right] = x\left[ {\begin{array}{*{20}{c}}a\\c\end{array}} \right] + y\left[ {\begin{array}{*{20}{c}}b\\d\end{array}} \right]$$
But since (a,c) and (b,d) are simply the images of the basis vectors under the transformation, the result is just a linear combination of the transformed basis vectors -- in fact, the same linear combination as the original vector was of the un-transformed basis vectors. In other words, the transformation $\left[ {\begin{array}{*{20}{c}}a&b\\c&d\end{array}} \right]$ does the following:

$$x\left[ {\begin{array}{*{20}{c}}1\\0\end{array}} \right] + y\left[ {\begin{array}{*{20}{c}}0\\1\end{array}} \right] \to x\left[ {\begin{array}{*{20}{c}}a\\c\end{array}} \right] + y\left[ {\begin{array}{*{20}{c}}b\\d\end{array}} \right]$$
This is, if you haven't recognised it yet, exactly why matrices work. It is exactly why the observation we made earlier -- that all linear transformations result in components that are some linear combination of the original components -- is true.

This means that to study any linear transformation -- write down its matrix form, etc. -- we only need to look at how it transforms the $n$ basis vectors. Since any vector can be written as a linear combination of these basis vectors (that's why they form a basis), this allows us to encode the entire transformation and decide what impact it has on any other vector. This follows directly from the definition of a linear transformation -- the image of a linear combination of (basis) vectors (i.e. the original vector itself) is equal to the same linear combination of the images of each (basis) vector.

Why can't you divide vectors?

The question "what does $\vec{a}/\vec{b}$ equal?" is equivalent to asking "What do you multiply with $\vec b$ to get $\vec a$?" -- the answer is a matrix, assuming the multiplication involves some contraction afterwards. Equivalently, you multiply and contract a (1, 0) tensor $b^\mu$ by a (1,1) tensor $A_\mu^\nu$ to get a (0,1) tensor $b^\nu$.

But there are multiple matrices you can multiply $\vec b$ by to get $\vec a$. In two dimensions, you need *two* sets of "this maps to this" (and the knowledge that the mapping is linear) to pin down what the linear mapping is. In general in $n$ dimensions, you need $n$ such vectors -- so instead of dividing vectors, you divide *sets of vectors* -- these are called matrices.

(Taken from my answer to "Why is division not defined for vectors?")

Linear sums and independence, span, and linear bases

The idea of linear dependence is an important one in linear algebra, because it's an important idea in defining things like the span of vectors (and by extension the dimensionality of a vector space), the basis of a vector space, etc. It is also the first idea we'll be discussing that gives us an appreciation of the importance the underlying scalar field.

A linear combination of some n vectors $v_1,...v_n$ is some combination of the vectors $a_1v_1+...+a_nv_n$. In other words, it's a vector that can be obtained by adding up any of these vectors scaled to arbitrary extents. We immediately observe a few properties of this idea:

If $x_1,...x_n$ are all linear combinations of a set of vectors $v_1...v_n$, then so is any linear combination of the x-vectors.
Every one vector $v_i$ of $v_1...v_n$ is a linear combination of the vectors, namely with the coefficients $(a_1,...a_i,...a_n)=(0,...1,...0)$.
The zero vector is always a linear combination of any set of vectors

We can state the idea of a linear combination more formally in terms of sets -- given some vector space $V$ and an underlying field of scalars $F$: for every finite set of vectors $X\in V$ with $n$ elements $v_1,...v_n$, we can construct a set $Y\in V$ such that (a) $\forall y\in Y$, there exists a finite sequence of $a_1,...a_n$ such that every $a_i\in F$ and $\sum a_iv_i=y$ (b) the converse: given any finite sequence $(a_1,...a_n)\in F^n$, the sum $\sum a_iv_i\in Y$.

The set Y in the above definition is known as the span of the set of vectors X. The span is, quite literally, the entire subspace of V that can be spanned by scaling and adding the vectors in X.

A linear subspace, or when the context is clear, "subspace", is a subset of a linear vector space (or simply a "linear space") that also carries with it the operations of addition and scalar multiplication and satisfies closure, i.e. it is also a linear space itself. The only subspaces of $\mathbb{R}^n$ are flat hypersurfaces of dimension n or lower passing through the origin (Can you explain why this is so? Prove it formally.). For instance in the case of $\mathbb{R}^3$, all planes and lines passing through the origin, as well as the set {0}, are linear subspaces.

The set $v_1,...v_n$ is said to be linearly independent if none of the vectors can be written as a linear combination of another. Some simple algebraic manipulation will show that as long as even one of the vectors can be written as a linear combination of the others, the set is not linearly independent (the converse does not hold true -- why?). This idea will be very useful to us later on when we talk about defining a basis for a vector space, etc. An alternative formulation of linear independence can be obtained through some more simpler algebraic manipulation:

If the only solution to the equation $a_1v_1+...+a_nv_n=0$ is $(a_1,...a_n)=(0,...0)$, then the vectors are linearly independent.

Explain why this works geometrically in $\mathbb R^n$. Keep in mind that the geometric explanation is usually closely related or equivalent to the algebraic one, much like formal proofs usually just formalise our intuition about something.

This idea of linear dependence and independence might remind you of systems of linear equations. Indeed, linear algebra provides us with several tools for solving such equations, which we will study in future.

If you return to this article once you have a good understanding of matrix multiplication, you'll notice that the condition $a_1v_1+...+a_nv_n=0$ for linear dependence is equivalent to the statement that the matrix-vector product $\left[\begin{array}{}
{v_1}&{\cdots}&{v_n}
\end{array}\right]\left[\begin{array}{}
{a_1}\\ {\cdots}\\{a_n}
\end{array}\right]=0$ (which we will write as $Bw=0$) has a solution other than the zero vector. A matrix that maps a non-zero vector to the zero vector is called a singular matrix, and has a determinant of zero. In other words, having linearly dependent rows is equivalent to being a singular matrix.

An important theorem with regard to span and linear independence is that the maximum cardinality of a linearly independent set of vectors in $\mathbb{R}^n$ is $n$. In fact, this definition is generally used as the very definition of the dimension of a linear space -- the maximum cardinality of a linearly independent set of vectors.

Another way of saying this is that the span of $n$ linearly independent vectors in $\mathbb{R}^n$ is always $\mathbb{R}^n$.

Try to examine why this property is true by graphing the span of two vectors in $\mathbb{R}^2$. The following diagram should be of help.

Span of two vectors – draw parallel versions of the span of $v_2$ at each point
on the span of $v_1$

Do the same for $\mathbb{R}^3$. Notice that if two vectors in $\mathbb{R}^2$ lie on the same line, they do not span the entire space. Similarly, if the three vectors in $\mathbb{R}^3$ lie on a plane, they do not span the entire space.

Compare this theorem with the idea of a system of $n$ linear equations only having a unique solution if there are $n$ variables. Indeed, this idea -- of switching between the "vector" interpretation and the "linear equation" interpretation -- is one that we will revisit later on with what is called the "row picture" and the "column picture".

We conclude with perhaps the most essential part of this article: if a vector space or subspace is the span of a linearly independent set of vectors, the set is called a basis of the vector space. For example, any two vectors in $\mathbb{R}^2$ that do not lie on a line can be used as a basis for $\mathbb{R}^2$. We will often use, by default, a basis of orthogonal unit vectors in $\mathbb{R}^n$, called the standard basis, for reasons that will eventually become apparent (although most of the linear algebra we'll study is true independent of the basis used).

However, it is important to understand the existence of multiple possible bases and therefore representations of vectors, as this idea will eventually be studied in detail and has important applications in physics.

Introduction to linear transformations

Linear algebra, and algebra in general (algebra is one of the trinity of pure mathematics -- algebra, analysis and geometry) concerns itself with mathematical objects and their transformations.

Linear algebra actually deals with a specific kind of active/passive transformations, called linear transformations, but let's remain general for now, as our intention is to introduce algebra in entirety and try to understand the motivation to study linear algebra in particular.

There are two broad categories of mathematical objects that we call transformations: active and passive transformations. An active transformation is basically a function that maps an element (called a "vector") to another -- it involves actually changing the mathematical object in concern. On the other hand, a passive transformation is a transformation of the co-ordinate basis so that the representation in the basis transforms in precisely the opposite way as the basis so that the object itself remains the same.

A couple of analogies help reinforce this classification:

Numerical bases − Consider the number "101" in base 10. If we wanted to convert this into base 2, we would write 1100101. This is a passive transformation -- the actual mathematical object (101) remains the same, while the representation changes. On the other hand, suppose we kept the representation "101" but instead talked about the number that was 101 in base 2 (i.e. 5). This is an active transformation, because the actual number has stayed the same -- it's just that our "basis vectors" (1, 10, 100...) have been transformed to a new basis (1, 2, 4...) and our mathematical object itself, which is always the linear sum of 1 times the third basis "vector" (100 or 4) plus 0 times the second basis vector (10 or 2) plus 1 times the first basis vector (1 or 1) has transformed under this transformation.
Definitions − Suppose I made the statement "Bananas are yellow". This statement is true. But now if I were to redefine the word "banana" to mean what we usually call (or to be more precise, called in our previous system of definitions) "apple", then the statement would be false (the property of truth is not invariant under the transformation). This was an active transformation. On the other hand, if we redefined "apples" to mean "bananas" and changed the form of the sentence to read "Apples are yellow", the statement remains the same, and it remains true. This is a passive transformation.

An example of a co-ordinate (passive) transformation would be $(x,y)\rightarrow(\sqrt{x^2+y^2},\arctan y/x)$, which converts from Cartesian to polar co-ordinates. Alternatively, one could create an active transformation where $r$ takes on the value of the former x-coordinate and $\theta$ takes on the value of the former y-coordinate. It would be clumsy and pointless in this case, but it could be done.

Both passive and active transformations find a variety of uses in physics -- passive transformations are of central importance in relativity, while active transformations are an important tool in quantum mechanics.

Linear algebra deals with a specific kind of transformations, called linear transformations. A linear transformation is a transformation that satisfies the property that $L(ax+by)=aL(x)+bL(y)$ where L is the linear transformation, a and b are objects called "scalars", and x and y are objects called "vectors". Another way of putting this is that a linear transformation commutes with every linear sum operator.

A linear sum operator, of course, is an operator that takes in some number of vectors, scales each by some scalar that depends on the linear sum operator and adds the scaled results. You will learn that this is equivalent to multiplying the matrix with these vectors as columns with a vector that represents the operator itself.

Then this linear transformation acts on elements from a set of vectors (called a "vector space"), over a field of scalars. Common choices for the vector space are $\mathbb R^n$, $\mathbb C^n$, etc. and common choices for the scalar field are $\mathbb R$, $\mathbb C$, etc. In these cases, the vectors can be represented as tuples of real or complex numbers, and (especially in the case of the reals) are often used to represent various physical quantities with a direction in Newtonian physics.

The formal definition of a vector space is given by the following axioms:

There exists an operation called "addition" on the vector space that satisfies the following properties:

associativity -- $(u+v)+w=u+(v+w)$
commutativity -- $u+v=v+u$
identity -- there exists a vector $0$ such that $v+0=v$
inverse -- there exists a vector $-v$ for every v such that $v+(-v)=0$

There exists an operation called scalar multiplication on the vector space that satisfies the following properties:

compatibility with field multiplication -- $a(bv)=(ab)v$
shared identity element with field multiplication -- $1v=v$ where 1 is also the identity of field multiplication
distributivity over vector addition -- $a(u+v)=au+av$
distributivity over scalar addition -- $(a+b)v=av+bv$

Therefore our vectors need not be anything of the sort you'd encounter in a high-school physics or math class -- they can be functions, SVG files, arrays in computer programming, whatever, and a lot of other mathematical objects (one such object you'll eventually encounter is the tensor).

One of the reasons linear algebra is taught early is to introduce how mathematics works, and axioms are at the very foundation of mathematics. To a pure mathematician, axiomatic systems represent entire universes -- a pure mathematician unknowingly adopts a Platonist metaphysics, where our universe, the domain of physics, is just one among all the other possible axiomatic systems, all within the domain of mathematics.

On a fundamental level, this is not a testable or even meaningful claim -- however, it's a useful picture to keep in mind while thinking about mathematics and about being precise when you hear people talking about "existence", etc.

An equally interesting picture, though, is that of the applied mathematician. The applied mathematician is concerned with the physical world as it is. He observes the universe and finds that certain rules of logic based on some starting points turn up all the time in the real world -- in physics, in computer science, in finance, and practically everything else. For example, kinematics and relativity, quantum mechanics, and computer programming all seem to make use of certain logical ideas -- certain mathematical structures -- in very different ways. This idea I'm talking about is a vector and its linear transformations. So the applied mathematician probes for some set of logical assumptions from which these conclusions can be logically derived, called axioms. In the case of linear algebra, the axioms are the eight listed above.

So to an applied mathematician, axioms are more like an "interface" between the mathematical theory and its various applications. From the axioms, one can derive all the "theorems" of the mathematical theory, all its features. Rather than re-discovering all these facts about the various physical phenomenon, if the physicist or engineer or computer scientist or whoever can show that one can make a precise correspondence from the things in his field to the objects in the mathematical theory (such as mapping the quantities of "velocity", "momentum", "position", etc. to "vectors", producing a sensible idea of addition and dot products, etc.), then all the results (theorems) involving these mathematical objects will also apply to the things in his field. This is why mathematics is often called the art of identifying different things with analogous logical structures.

For the record, this is also why mathematics -- or to be more accurate, applied mathematics, or the mathematics we study is so useful in describing things -- it's designed that way. "Things" produce the motivation for us to do math, and specifically the kind of math that is useful to describe these things. Think of some other mathematical theories you know of -- calculus, the elementary algebra of the real numbers, of the complex numbers, of the integers, of the natural numbers, of the rational numbers, etc. What physical phenomena do they describe?

The idea of an axiomatic basis -- whether interpreted in the pure-mathematician sense or in the applied-mathematician sense, for there is no real difference except in how humans think of it -- is central to modern mathematics. Richard Feynman called it the "Greek school" of mathematics and science, as opposed to the "Babylonian school", which does not isolate a set of axioms from all the statements of the theory. The Babylonian school tends to be how physics and the other sciences operate, but only under the assumption that the mathematicians will eventually bring rigor to their field (e.g. with mathematical physics).

A set of axioms is seldom unique. There are always multiple different possible sets of statements that one can choose from which all the other statements of the theory can be derived (e.g. in Euclidean geometry, the Pythagorean theorem can be an axiom, replacing the parallel postulate). It's much like the concept of a defining property -- e.g. the function "$\sin(\theta)" can be defined in terms of the unit circle, where its Taylor expansion is a theorem, or the latter can be considered a definition, with the unit circle property being derived from it as a theorem.

Another example would be the number e. One definition would be the real number approached by the limit of $(1+\frac1n)^n$ as $n\to\infty$, while the other would be as the value of $\exp(1)$ where $\exp(x)$ is the function such that $\frac{d}{dx}\exp(x)=\exp(x)$ and $\exp(0)=1$. Both are defining properties, and can be derived from one another.

Coming back to the topic of linear algebra, the axioms can also be thought of as saying: a linear transformation is a transformation such that if the transformation is applied to all points on the plane, all lines (e.g. gridlines) remain lines (they don't curve) and the origin does not move (think about why these two are equivalent). The former is equivalent to stating that the gridlines (any set of evenly-spaced parallel lines) not only remain lines, but also remain evenly spaced and parallel, otherwise some other line would have to curve (try it out).

In the case that the origin does move, the transformation is called an affine transformation, which is the combination of a linear transformation and a translation, and is the generalisation of $y=mx+c$ to vectors and their transformations.

Something of interest to note here is the different ways to generalise the same thing to a more general domain -- if one wants to generalise a function or some other mathematical object $f(x)$ from a domain $x\in X$ to some $F(y)$ for $y\in Y$ where $X\subset Y$, then we do it based on some property satisfied by $f(x)$, where we decide on $F(y)$ so that it also satisfies this property.

However, we can also make the generalisation on basis of some other property -- this is exactly what's happening here, we can say that an affine transformation is a generalisation of a linear relationship among the real numbers to vectors, or that a scaling plus translation is (the cases where the first has an effect similar to the latter will be studied later when we look at eignvalues and eigenvectors). In the former case, the generalisation is based on the standard properties of a linear transformation also satisfied by $y=mx+c$, whereas in the latter case, the generalisation is based on simply scaling by a real number.

This allows one to easily determine visually if a transformation is linear. For example, it becomes clear that the Cartesian-to-Polar thing above was not a linear transformation, because the axis that is mapped to the theta-axis becomes curved.

Here's a random thought about linearity being nice. Suppose you have a relationship between $X$ and $Y$. Now if you jiggle $X$ around a bit, $Y$ jiggles a bit too. The average value of $Y$ during this jiggling period corresponds exactly on the line to the average value of $X$ during the jiggling.

On the other hand, if you have a non-linear correlation -- say with a peak in the middle -- this is no longer true. The average x-co-ordinate might give you the peak, but the average value of the y-co-ordinate is certainly not the maximum value of it. You might wonder if we can approximate non-linear things with linear things, such as by zooming close into a curve -- indeed, this is the point of calculus.

Prove the relation between the two different axiomatic descriptions of a linear transformation described above.
Hence or otherwise, prove that the following are all linear transformations:

Rotation around the origin
Scaling (along any line passing through the origin, the x- or y- axis, the line y = x, whatever)
Shearing
A combination of two (and by induction, $n$) linear transformations

Prove that for all linear transformations $A,B$ there exists a unique linear transformation $C$ such that $Cv=ABv$ for all $v\in V$. We then say that $C=AB$, and can define transformation composition in this way for an arbitrarily number of compositions by induction.
Is $(AB)C=A(BC)$ for all linear transformations $A,B,C$? Prove your answer.
Is $AB=BA$ for all linear transformations $A,B$? Prove your answer.

Notes about associativity and matrix multiplication (added on 2018.10.31, related to 1103-004)

If you didn't try question 3 above -- or if you didn't try it before you tried question 4 -- go try it, go think about it. It's important.

You'll often see "proofs" of the associativity of matrix multiplication, where they write out the matrix in its full, unglorifiying form -- perhaps as a row of column vectors or a column of row vectors -- and they work out both products $(AB)C$ and $A(BC)$, then say "hey, look, they're the same!"

That's a nonsensical proof. It escapes every single insight behind matrix multiplication and why matrix multiplication is important, and pretends that the rules for matrix multiplication were simply "given to us" by some god-emperor who wrote some crappy textbook. That's not useful, and that gives you absolutely no insight as to why matrix products are defined the way they are.

First of all, it's important to recognise that there really are two separate questions: Does $A(Bv)=(AB)v$, and does $A(BC)=(AB)C$. From these two you can really figure out (by repetition/induction) all other bracket combinations for whatever number of matrices.

Let's consider the matrix-vector product first. And let's say we haven't yet defined matrix multiplication.

So how on earth could we talk about multiplying $A$ and $B$ first, if we haven't yet defined multiplication on matrices? We can talk about $A(Bv)$ (because nowhere here are you multiplying two matrices), but not $(AB)v$. The idea is that we make this the definition of matrix multiplication -- we say that matrix multiplication is the composition of the transformations associated with the matrices. But for this to still be a matrix, this requires that the composition of the two matrices always be the same matrix.

I.e. if we know that there exists some $C$, such that $A(Bv)=Cv$ for all $v$.

How do we go about proving this? Well, you might have a picture in your head of linear transformations/matrices transforming the entire space/entire vector field, rather than an individual vector, i.e. linear transformations transform all vectors in the "same way" in some sense (more formally, we're re-stating the definition of a linear transformation). And the statement above just means that the composition of two linear transformations is a linear transformation.

We can prove this easily (do it for yourself), and this implies our statement earlier, because we know all linear transformations are represented by a unique matrix.

In other words -- the images of the basis vectors under any transformation are independent (i.e. we need all of them to define the transformation) and also uniquely define the transformation (i.e. there is only one transformation that transforms all the basis vectors in this way), by definition of linearity, dimensionality and basis vectors. So we can determine $C$ from this independence, i.e. ensuring that $A(Bu)=Cu$ for all basis vectors $u$, and since any vector can be written as a linear combination of these basis vectors, i.e. the basis vector images determine the transformation, we can show $A(Bv)=Cv$ for all vectors $v$.

We move on to the second kind of associativity, $(AB)C=A(BC)$.

We've really just defined everything about matrices, including matrix multiplication, in terms of vectors so far, and all the intuition in our head is in terms of vectors and vector spaces (not the matrices themselves), so the best way to understand this identity, which contains only matrices, is to recognise both sides as linear transformations, i.e. explicitly write out the vector being operated on:

$$(AB)Cv=A((BC)v)$$
Recall our definition of matrix multiplication, $(AB)v:=A(Bv)$. We can apply this on both sides, rewriting the above as:

$$A(B(Cv))=A(B(Cv))$$
Which is obviously true.