The perspective transformation and collinearity

The Last Supper by Leonardo da Vinci

When working with the graphics pipeline, the perspective projection is commonly used to give a realistic depiction of a scene, where objects close to the camera appear larger than objects far away. The perspective transformation has the important feature of mapping 3D lines to 3D lines. In this post we will go into why this property is important and give a proof of it.

The reader should have some familiarity with the graphics pipeline and the role the projection transformation plays in it. The Sources section provides good resources on the subject.

Coordinates setup

In the following we will assume that our view space is a right-handed coordinate system, with X pointing right from the camera, Y pointing up, and Z pointing back from the camera. This is the traditional setup of the view coordinate system in OpenGL applications.

Within the view space we define our viewing frustum by a near plane orthogonal to the Z axis placed at z = -n, and a far plane orthogonal to Z placed at z = -f (where n and f are positive numbers). The near face of the frustum is delimited by X ranging from l to r, and Y ranging from b to t. Note this frustum is not necessarily symmetrical. In practice it is usual to use a symmetrical frustum, but that can easily be derived as a special case. The view frustum is illustrated in Figure 1.

Figure 1. The view frustum.

Our NDC (Normalized Device Coordinates) system will be right-handed with X pointing right, Y pointing down, and Z pointing into the screen. The canonical view volume is defined by X and Y ranging from -1 to 1, and Z ranging from 0 to 1. The canonical view volume is illustrated in Figure 2. This NDC setup is how Vulkan defines it in its spec.

Other rendering APIs use different NDC setups. OpenGL defines its NDC as a left-handed coordinate system, with X pointing right, Y pointing up, Z pointing into the screen, and all three coordinates ranging from -1 to 1. DirectX defines its NDC as left-handed, with X pointing right, Y pointing up, Z pointing into the screen, X and Y ranging from -1 to 1, and Z ranging from 0 to 1.

We have defined our view coordinate system with the approach usually employed in OpenGL applications. This should be natural for those accustomed to working with this API, and also has the advantage that the Y axis points up allowing us to think of Y as “height”. Note however, that since the Vulkan NDC has Y pointing down and Z pointing in the same direction as the camera, we will need to define our projection transformation with negative coefficients in Y and Z, in order to flip the direction of these axes.

Figure 2. The canonical view volume in Vulkan NDC.

The perspective transformation

With the above coordinates setup (mapping z to the interval [0, 1]) the perspective transformation can be described by the following matrix:

P = \begin{pmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{b-t} & \frac{b+t}{b-t} & 0 \\ 0 & 0 & \frac{f}{n - f} & \frac{fn}{n - f} \\ 0 & 0 & -1 & 0 \end{pmatrix}

Remember there is an implicit division by w after the matrix is applied. Note also that we have defined the fourth row of the matrix as (0, 0, -1, 0), so w' = -z. That is to say, when we divide by w in clip space we are actually dividing by the view space -z.
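
To make the construction concrete, here is a minimal C++ sketch (not part of the original derivation; it assumes the GLM library, whose matrices are column-major, so P[c][r] addresses column c, row r) that builds the matrix P above:

#include <glm/glm.hpp>

// Builds the perspective matrix P described above (Vulkan-style NDC, z' in [0, 1]).
// l, r, b, t delimit the near face of the frustum; n and f are the positive
// distances to the near and far planes.
glm::mat4 makePerspective(float l, float r, float b, float t, float n, float f) {
    glm::mat4 P(0.0f);                 // start with all zeros
    P[0][0] = 2.0f * n / (r - l);
    P[2][0] = (r + l) / (r - l);
    P[1][1] = 2.0f * n / (b - t);      // negative when t > b: flips Y for Vulkan NDC
    P[2][1] = (b + t) / (b - t);
    P[2][2] = f / (n - f);
    P[3][2] = f * n / (n - f);
    P[2][3] = -1.0f;                   // fourth row is (0, 0, -1, 0), so w' = -z
    return P;
}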

For a detailed derivation of this matrix, see Sources. The explanation on that page is oriented towards the OpenGL NDC system, so the matrix they build is different from our Vulkan-oriented matrix, but they show a generic method that can be used to derive the perspective matrix with any NDC setup.

Projection and the graphics pipeline

As a reminder, the following is a simplified summary of the steps of the graphics pipeline in a typical application and the role the projection matrix plays in it. Some details like the orientation of the framebuffer coordinate system vary depending on the graphics API. This summary assumes we are working with Vulkan. For a more detailed description see Sources.

  • The model space coordinates of each vertex are input into your vertex shader.
  • Your vertex shader takes the model space coordinates of the vertex, converts them to homogeneous coordinates by adding a w component with value 1, and multiplies the homogeneous vector by the 4×4 model matrix to obtain the world space coordinates.
  • The vertex shader takes the world space coordinates and multiplies them by the view matrix to obtain the view space coordinates.
  • The vertex shader takes the view space coordinates and multiplies them by the projection matrix P above to obtain the clip space coordinates. This is what the vertex shader returns. The vertex shader may combine the model, view and projection matrices into a single multiplication and return P * V * M * v_{model}.
  • The clip space coordinates returned from the vertex shader are divided by their w component to obtain Normalized Device Coordinates (NDC). This step is sometimes called perspective division. This is a fixed function step.
  • The NDC are converted to framebuffer space (called window space in OpenGL) with an affine transform. The framebuffer space coordinates describe the vertex position in a coordinate system that has its origin in the top left corner of the window, and is scaled so that pixels have width and height 1. This is a fixed function step.
  • Primitives are rasterized (converted to a set of samples called fragments). For each fragment, its barycentric coordinates are computed in framebuffer space, and then its framebuffer space coordinates x', y', z' are computed by linearly interpolating the framebuffer space coordinates of the primitive vertices, using the barycentric coordinates of the fragment as weights. The framebuffer space z coordinate in particular is relevant because it will be used in the z test. This is a fixed function step.
  • The fragment’s vertex attribute values (e.g. texture coordinates) are computed via perspective-correct interpolation of the vertex attributes of the vertices. This involves using the view space (pre-projection) z values at the vertices. This is a fixed function step.
  • Early z test: if the driver and GPU support it (most do) and our pipeline doesn’t write to the depth buffer or use discard in the fragment shader, the pipeline will probably do the early z test at this point. The fragment’s z coordinate in framebuffer space is compared against the value stored in the depth buffer for the current pixel. If the fragment’s z is greater than or equal to the value stored in the depth buffer, it means the fragment is occluded, so it is discarded. Otherwise, the depth buffer is updated with the framebuffer z value of the fragment and processing of the fragment continues. This is a fixed function step.
  • Your fragment shader is invoked with the framebuffer space coordinates of the fragment as input.
  • If we have a stencil buffer attached, the stencil test is performed.
  • If the pipeline wasn’t able to do the early z test, z testing is performed at this point (late z test). Otherwise the late z test is skipped. The late z test involves the same read and possible write of the depth buffer described above in the early z test step.
  • The color returned from the fragment shader is written to the framebuffer.

The use of a depth buffer and a stencil buffer is optional in the graphics pipeline, but its use is so common that we have included it in the summary.
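
As a rough illustration of the steps above, the following hypothetical C++ sketch (assuming GLM; fbWidth and fbHeight are made-up framebuffer dimensions, and the viewport transform is simplified) follows a single vertex through the vertex shader multiplication, the perspective division and the conversion to framebuffer coordinates:

#include <glm/glm.hpp>

// Follows one vertex through the transform steps summarized above (conceptually;
// in a real application the division and viewport transform are fixed function).
glm::vec3 vertexToFramebuffer(const glm::vec3& vModel,
                              const glm::mat4& M,   // model matrix
                              const glm::mat4& V,   // view matrix
                              const glm::mat4& P,   // projection matrix
                              float fbWidth, float fbHeight) {
    // Vertex shader: homogeneous model space position multiplied by P * V * M.
    glm::vec4 clip = P * V * M * glm::vec4(vModel, 1.0f);

    // Perspective division: divide by w to obtain NDC.
    glm::vec3 ndc = glm::vec3(clip) / clip.w;

    // Viewport transform: NDC x and y in [-1, 1] to pixels, origin at the top left;
    // z stays in [0, 1] and is the value used by the depth test.
    float x = (ndc.x * 0.5f + 0.5f) * fbWidth;
    float y = (ndc.y * 0.5f + 0.5f) * fbHeight;   // Vulkan NDC Y already points down
    return glm::vec3(x, y, ndc.z);
}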

The collinearity property

The perspective projection doesn’t preserve parallelism. Parallel lines in view space get mapped to lines that meet in a point, the projection of which over the near plane is called their vanishing point.

Parallel lines are projected to lines that intersect at a vanishing point. Photo by Joshua Burdick on Unsplash

Lengths are not preserved either. Points get shifted inwards and backwards towards the far plane, and the farther from the near plane we go the more shifted they are. This generates the effect of far away objects looking smaller than objects of the same size that are closer to the camera.

However, the perspective projection as defined above has the important property that 3D lines in view space get mapped (after perspective division) to 3D lines in framebuffer space. Note we are not only talking about lines being mapped to lines in the projection plane: the 3D coordinates after projection also remain aligned, and this is important. For this reason we say that perspective is a projective transformation (it maps lines to lines, without preserving parallelism or distances).

The key feature of the projection that is the cause of this property is the way the z coordinate is remapped in the third row of the projection matrix. When we look at the perspective matrix, we can see that after w division, the view space coordinates (x, y, z) will be mapped to framebuffer space coordinates (x', y', z') according to the following equations:

x' = (\frac{2n}{r-l})\frac{x}{-z} - \frac{r+l}{r-l}

y' = (\frac{2n}{b-t})\frac{y}{-z} - \frac{b+t}{b-t}

z' = (\frac{fn}{n - f})\frac{1}{-z} - \frac{f}{n - f}

Notice how x' and y' are hyperbolic functions of z. If we had left z' as a linear function of z, our straight lines would turn into hyperbolas after projection. For a good graphical depiction of what this would look like, see Sources.

The mapping of z' as another hyperbolic function of z has the effect of compensating for the curvature of x' and y' and producing a straight line. Note this preserves collinearity, but not lengths: points that are equally spaced along the line in view space will get shifted along the line towards the far plane in a non-linear fashion.
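
As a quick numeric illustration of this non-linear shift (with made-up values n = 1 and f = 100), the following sketch evaluates the z' formula above at equally spaced view space depths; the spacing of the resulting z' values shrinks as we approach the far plane:

#include <cstdio>

int main() {
    const float n = 1.0f, f = 100.0f;            // assumed near/far values
    // z' = (fn / (n - f)) * (1 / -z) - f / (n - f), as derived above.
    for (float z = -1.0f; z >= -100.0f; z -= 24.75f) {
        float zp = (f * n / (n - f)) * (1.0f / -z) - f / (n - f);
        std::printf("view z = %7.2f  ->  z' = %.4f\n", z, zp);
    }
    return 0;
}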

Remember that the graphics pipeline needs to determine the value of the framebuffer space z for each fragment, based only on the framebuffer space z values at the vertices of the primitive. This is done by doing a linear interpolation of the framebuffer space z values at the vertices (i.e. a linear combination of the vertices’ z values using the sample’s barycentric coordinates as weights). This is the reason the collinearity property is important: if straight lines in view space were turned into curves after projection, this linear interpolation would be impossible.

Note that here we are talking about the linear interpolation of the framebuffer space z values, which is used in depth testing to discard occluded fragments. This should not be confused with the perspective-correct interpolation of vertex attributes, which is done using the clip space w (which takes the value of the pre-projection view space -z) and is therefore independent of how the projection remaps z.

Formal statement of the collinearity property

For the purposes of stating the collinearity property in formal terms, there is an important observation that will allow us to present the theorem in a simpler and more elegant way. Note that the way we have defined the perspective projection can be divided into two steps:

  • Dividing input coordinates by z to make far away objects look smaller (projective component).
  • Multiplying by scalars and adding scalars in order to adjust the ranges and orientations of x', y' and z' to those expected by NDC (affine component).

The affine component doesn’t affect collinearity (we already know it will map lines to lines due to it being an affine transformation). In order to prove that perspective maps lines to lines, it’s enough to show that the projective component does. Note that the projective component doesn’t need to flip Y or Z; this is part of the job of the affine component.

Theorem. Let A = (x_A, y_A, z_A) and B = (x_B, y_B, z_B) be two points in \mathbb{R}^3, and S = \frac{A + B}{2} their midpoint. We define the perspective transformation as the function f : \mathbb{R}^3 \rightarrow \mathbb{R}^3 given by the formula f(x, y, z) = (\frac{x}{z}, \frac{y}{z}, \frac{1}{z}). Let A', B' and S' be the images of A, B and S through f respectively. Then the point S' lies on the line segment \overline{A'B'}. This is to say, there exists a \lambda \in \mathbb{R} such that S' = \lambda A' + (1 - \lambda) B'.

Proof

From the definition of f we have:

A' = (\frac{x_A}{z_A}, \frac{y_A}{z_A}, \frac{1}{z_A})

B' = (\frac{x_B}{z_B}, \frac{y_B}{z_B}, \frac{1}{z_B})

S' = (\frac{x_A + x_B}{z_A + z_B}, \frac{y_A + y_B}{z_A + z_B}, \frac{2}{z_A + z_B})

We can separate S' as:

S' = (\frac{1}{z_A + z_B})(x_A, y_A, 1) + (\frac{1}{z_A + z_B})(x_B, y_B, 1)

S' = (\frac{1}{z_A + z_B}) z_A (\frac{x_A}{z_A}, \frac{y_A}{z_A}, \frac{1}{z_A}) + (\frac{1}{z_A + z_B}) z_B (\frac{x_B}{z_B}, \frac{y_B}{z_B}, \frac{1}{z_B})

S' = (\frac{z_A}{z_A + z_B}) (\frac{x_A}{z_A}, \frac{y_A}{z_A}, \frac{1}{z_A}) + (\frac{z_B}{z_A + z_B}) (\frac{x_B}{z_B}, \frac{y_B}{z_B}, \frac{1}{z_B}) (1)

If we denote \lambda = \frac{z_A}{z_A + z_B} then it’s easy to see \frac{z_B}{z_A + z_B} = 1 - \lambda. So we can rewrite equation 1 as:

S' = \lambda A' + (1 - \lambda) B'

\square
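
As a sanity check (not part of the proof), the following sketch verifies the statement numerically for two arbitrary example points:

#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// f(x, y, z) = (x/z, y/z, 1/z), the projective component defined above.
Vec3 project(const Vec3& p) { return { p.x / p.z, p.y / p.z, 1.0 / p.z }; }

int main() {
    Vec3 A{ 1.0, 2.0, 3.0 }, B{ -2.0, 5.0, 7.0 };
    Vec3 S{ (A.x + B.x) / 2, (A.y + B.y) / 2, (A.z + B.z) / 2 };   // midpoint of A and B

    Vec3 Ap = project(A), Bp = project(B), Sp = project(S);

    double lambda = A.z / (A.z + B.z);   // the lambda from the proof
    // S' should equal lambda * A' + (1 - lambda) * B'.
    assert(std::abs(Sp.x - (lambda * Ap.x + (1 - lambda) * Bp.x)) < 1e-12);
    assert(std::abs(Sp.y - (lambda * Ap.y + (1 - lambda) * Bp.y)) < 1e-12);
    assert(std::abs(Sp.z - (lambda * Ap.z + (1 - lambda) * Bp.z)) < 1e-12);
    return 0;
}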

Alternative proof

The book “3-D Computer Graphics: A Mathematical Introduction with OpenGL” by Samuel Buss provides an alternative proof, with a more generic statement. The proof can be found in chapter II, “Transformations and Viewing”. The argument in broad strokes is the following.

Consider any function f : \mathbb{R}^3 \rightarrow \mathbb{R}^3 that can be represented by a 4×4 matrix. Given an input point P \in \mathbb{R}^3, its representation in the 3-dimensional projective space is a line through the origin in \mathbb{R}^4. If P varies on a line in \mathbb{R}^3 then its projective representation spans a plane through the origin in \mathbb{R}^4 (i.e. a 2-dimensional subspace). The image of this subspace by the matrix is another subspace whose dimension cannot be greater than 2. There are three possible cases then:

  • The image subspace is of dimension 0. Then the function is not defined for the point P.
  • The image subspace is of dimension 1. Then its points represent a point in \mathbb{R}^3 (this is what happens for example when we try to project a point that lies on the plane through the eye parallel to the near plane, the result is a point at infinity).
  • The image subspace is of dimension 2. Then its points represent points in \mathbb{R}^3 that vary over a line.

Sources

The math behind the lookAt() transform

Photo by ShareGrid on Unsplash

In computer graphics, one of the key elements of the graphics pipeline is the View transformation, which is used in the vertex shading stage to convert coordinates from World space to View space. The View transform is usually constructed using a utility function like glm::lookAt() from the GLM library, or D3DXMatrixLookAtLH() in DirectX. But what are these functions actually doing? In this post we will do a deep dive into the math behind the glm::lookAt() function. This will also serve as a way to understand and put into practice some important concepts of linear algebra and geometry.

Before we go into the actual explanation, we need to lay some mathematical groundwork first.

The change of basis matrix

Theorem: consider the \mathbb{R}^n vector space and two bases SRC and DST. The function that takes the coordinates of a vector in SRC and converts them into coordinates of the same vector in DST is a linear transformation and its associated matrix _{DST}M_{SRC} is composed of the coordinates of the vectors of SRC expressed in DST, placed as columns.

For the proof of this theorem see Sources. It’s the best explanation I’ve seen of the subject, rigorous and also elegant.

Key takeaways:

  • In the above theorem it’s irrelevant which of the two bases represents the source and which represents the destination. We can swap them and the statement still holds. This means that if we want to convert coordinates in DST to coordinates in SRC, the basis change matrix is built by taking the coordinates of the DST vectors expressed in SRC and placing them as columns.
  • The basis change matrix from DST to SRC _{SRC}M_{DST} can also be obtained as the inverse of the basis change from SRC to DST _{DST}M_{SRC}. That is, _{SRC}M_{DST} = {_{DST}M_{SRC}}^{-1}
  • An interesting special case is when SRC is the canonical basis and DST is orthonormal. In this case the basis change matrix _{DST}M_{SRC} can be built without doing any calculation at all. Indeed, let’s start the other way around and build _{SRC}M_{DST}. In order to do this, we need to obtain the coordinates of the DST vectors expressed in SRC and put them as columns. But since SRC is the canonical base, these coordinates are just the tuples of DST as column vectors. The matrix _{DST}M_{SRC} is the inverse of _{SRC}M_{DST}, but since DST is an orthonormal base, the matrix _{SRC}M_{DST} is an orthogonal matrix, and so its inverse is its transpose. In other words, in the special case when SRC is the canonical basis and DST is orthonormal, the basis change matrix _{DST}M_{SRC} can be built by taking the vectors of DST and placing them as rows. Keep this in mind for later.
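
As an illustrative sketch of the orthonormal special case (assuming GLM), the change of basis matrix can be built by transposing the matrix that has the DST vectors as columns:

#include <glm/glm.hpp>

// Change of basis matrix from the canonical basis SRC to an orthonormal basis
// DST = {d0, d1, d2}: the DST vectors end up as the rows of the result.
glm::mat3 canonicalToOrthonormal(const glm::vec3& d0,
                                 const glm::vec3& d1,
                                 const glm::vec3& d2) {
    glm::mat3 srcFromDst(d0, d1, d2);       // DST vectors as columns: _{SRC}M_{DST}
    return glm::transpose(srcFromDst);      // orthogonal matrix: inverse == transpose
}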

Geometric interpretation of the change of basis matrix

Although not directly related to the lookAt function, there is an interesting geometric observation that we can make from the above theorem.

Consider the change of basis matrix from DST to SRC, _{SRC}M_{DST}. Take the first vector of SRC, and take its coordinates in SRC. This is the vector (1, 0, ..., 0). If we multiply this vector by _{SRC}M_{DST}, what we get is a linear combination of the columns of the matrix using the elements of the vector as coefficients. But since the vector is all zeroes except for the first element, this matrix-vector multiplication will yield the first column of the matrix, which is made of the coordinates of the first vector of DST expressed in SRC. Following this argument for the rest of the vectors of SRC, we can see that _{SRC}M_{DST} is also the associated matrix of the transform T that takes SRC into DST. Symbolically:

_{SRC}M_{DST} = _{SRC}((T))_{SRC}

Taking inverse on both sides of the equation we get:

{_{SRC}M_{DST}}^{-1} = {_{SRC}((T))_{SRC}}^{-1}

_{DST}M_{SRC} = _{SRC}((T^{-1}))_{SRC}

It is worth noting that the two matrices in the equation are expressed in different bases: the left side is a matrix that takes coordinates in SRC and returns coordinates in DST, whereas the right hand side is a matrix that takes coordinates in SRC and returns coordinates in SRC.

This is the geometric interpretation of the change of basis transform: converting coordinates in SRC to coordinates in DST is equivalent to taking the vector represented by the coordinates in SRC, transforming it with the inverse of the transform that turns SRC into DST, and interpreting the resulting tuple as if they were coordinates in DST.

To get an intuitive understanding of this observation, let’s use an example. Consider \mathbb{R}^3, let SRC be the canonical basis and T be a rotation of 30 degrees counterclockwise around the -Y axis. The DST basis is the result of applying T to the vectors of SRC.

Say we want to compute the coordinates of i in DST. One way of doing it is to leave the vector i fixed, rotate the basis SRC so that it becomes DST and then project i onto the direction of the rotated basis vectors. The coordinates of i in DST, which we will denote as [i]_{DST}, are (\sqrt{3}/2, 0, -1/2). Figure 1 illustrates this approach. We are using a right-handed coordinate system (Y points into the screen). The vectors of the canonical basis SRC are represented as i, j, and k, shown in black. The rotated vectors of DST are i', j' and k', shown in blue (note the j vector remains fixed because the axis of the rotation is parallel to it).

Figure 1. Visualizing the coordinates of i in DST. The vector i is fixed and the basis is rotated counterclockwise

Alternatively, we can leave the basis fixed and rotate the vector i 30 degrees clockwise (i.e. applying the inverse of T). The coordinates we obtain are the same. Figure 2 illustrates this process.

Figure 2. The SRC basis is fixed and we rotate i clockwise.

Projection of a vector onto another vector

Given a vector space V and a vector a in V, the projection of a onto a nonzero vector b is the vector p collinear with b that minimizes the length of a - p.

The projection of a onto b can be computed as (\frac{a . b}{|| b ||^2}) b.

This is related to the concept of coordinates. Given an orthogonal basis B = \{v_1, v_2, ..., v_n\} of V, the coordinates of a with respect to B are the \lambda_i = \frac{a . v_i}{|| v_i ||^2}.

An interesting special case is when all the v_i have length 1 (i.e. B is orthonormal). In that case the coefficients become simply \lambda_i = a . v_i.
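
A small illustrative sketch of these formulas (assuming GLM):

#include <glm/glm.hpp>

// Projection of a onto a nonzero vector b: (a . b / ||b||^2) b.
glm::vec3 projectOnto(const glm::vec3& a, const glm::vec3& b) {
    return (glm::dot(a, b) / glm::dot(b, b)) * b;
}

// Coordinate of a along a unit length basis vector v: simply a . v.
float coordinateAlong(const glm::vec3& a, const glm::vec3& v) {
    return glm::dot(a, v);
}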

Affine spaces and affine frames

The affine space is an algebraic structure that provides a natural abstraction to represent physical space. Informally, an affine space is an extension of a vector space where we add a set of points to distinguish them from vectors. The set of points doesn’t have an origin, and points cannot be added together, but a point can be added to a vector (sometimes called displacement vector) to yield another point which represents the translation of the point by the vector. Similarly, two points can be subtracted to give a displacement vector.

Given two affine spaces X and Z, a function f: X \rightarrow Z is an affine map if there exists a linear map m_f such that f(x) - f(y) = m_f(x - y) for all x, y in X. Affine maps are functions that preserve lines and parallelism, while not necessarily preserving lengths and angles. All linear maps can be seen as special cases of affine maps, but not all affine maps are linear maps, because affine maps are not constrained to map the origin to the origin.

The set of points of an affine space doesn’t have an origin. In order to describe the coordinates of a point, we must arbitrarily define a point as origin and describe the coordinates of the displacement vectors relative to that origin. An affine frame is composed of a point o that we call origin and a basis B = \{v_1, v_2, ..., v_n\} of the vector space.

Given a frame (o, v_1, v_2, ..., v_n) for each point p there is a unique set of coefficients \lambda_i such that:

p - o = \lambda_1v_1 + \lambda_2v_2 + ... + \lambda_nv_n

The \lambda_i are called the affine coordinates of p in the frame (o, v_1, v_2, ..., v_n).

Note that although the point set of an affine space doesn’t have the concept of an origin point, we can always arbitrarily choose an affine frame, and this allows us to represent any point by its coordinates in that frame.

For a more formal and detailed description of the concepts of affine space and affine frame, see Sources.

Homogeneous coordinates

For an in-depth description of homogeneous coordinates and how they work, see Sources.

What follows are the key takeaways of the subject, without going into much formality.

Every affine map can be represented as the composition of a linear map and a translation.

An affine map from \mathbb{R}^3 \rightarrow \mathbb{R}^3 cannot in general be described by a 3×3 matrix, because it is not necessarily a linear map (it may have a translation component).

Homogeneous coordinates allow us to represent an affine map in \mathbb{R}^3 \rightarrow \mathbb{R}^3 as a 4×4 matrix.

Given an affine map f: \mathbb{R}^3 \rightarrow \mathbb{R}^3 which is a composition of a linear map r and a translation t by a vector (t_x, t_y, t_z) (that is, f = t \circ r), f can be represented by a 4×4 matrix M. To compute the matrix M, let

R = \begin{pmatrix} R_{11} & R_{12} & R_{13} & 0\\ R_{21} & R_{22} & R_{23} & 0\\ R_{31} & R_{32} & R_{33} & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}

(The upper-left components of R are taken from the associated matrix of r).

Let

T = \begin{pmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y\\ 0 & 0 & 1 & t_z\\ 0 & 0 & 0 & 1 \end{pmatrix}

Then

M = T * R = \begin{pmatrix} R_{11} & R_{12} & R_{13} & t_x\\ R_{21} & R_{22} & R_{23} & t_y\\ R_{31} & R_{32} & R_{33} & t_z\\ 0 & 0 & 0 & 1 \end{pmatrix}

(Note R is applied first, then T).

In order to transform a point p = (x, y, z) by an affine map f, first we convert p to an \mathbb{R}^4 column vector in homogeneous coordinates p_h = (x, y, z, 1) (setting its w component to 1), compute the matrix-vector multiplication p_h' = (x', y', z', w) = M * p_h, then we convert p_h' back to \mathbb{R}^3 by dividing its first three components by the fourth: f(p) = (x'/w, y'/w, z'/w).
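
The following sketch (illustrative only, assuming GLM) applies this construction: it assembles M = T * R from a 3×3 linear part and a translation vector, and uses it to transform a point:

#include <glm/glm.hpp>

// Builds the 4x4 matrix M = T * R for an affine map f = t o r and applies it to p.
glm::vec3 applyAffine(const glm::mat3& r, const glm::vec3& t, const glm::vec3& p) {
    glm::mat4 R(r);                          // upper-left 3x3 is r, rest is identity
    glm::mat4 T(1.0f);
    T[3] = glm::vec4(t, 1.0f);               // translation goes in the fourth column
    glm::mat4 M = T * R;

    glm::vec4 ph = M * glm::vec4(p, 1.0f);   // homogeneous point with w = 1
    return glm::vec3(ph) / ph.w;             // back to R^3 (w stays 1 for affine maps)
}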

Define and be aware of your coordinate systems’ orientation

In order to build the view matrix, you will first need to choose the handedness of the world and view coordinate systems and the orientation of the camera in the view system. In theory, this choice is up to the programmer and doesn’t depend on the graphics API you are using. Graphics APIs don’t care about what coordinate systems you use for the world and view spaces. What they specify is the handedness and range of the Normalized Device Coordinate (NDC) system and the orientation of the camera in it. You are free to choose the model, world and view coordinate systems however you want as long as you build the model, view and projection matrices in such a way that the P * V * M matrix multiplication maps the object onto its desired position and the frustum into the NDC frustum of the API you are using.

In practice, there may be external factors that constrain this choice though. For example, if you are working with OpenGL you will probably use the GLM library and build the view matrix using the glm::lookAt() function. If you use GLM, the library has already made the decision of world and view coordinate systems for you: both systems are right-handed, with Y pointing up and the camera looking down the negative Z axis. Historically, this has been the standard in OpenGL, from the times of the deprecated GLU library and the gluLookAt() function that glm::lookAt() is based on. The OpenGL NDC system is left-handed, with X pointing right, Y pointing up and the camera looking up the positive Z axis. All three coordinates X, Y and Z vary between -1 and 1. You may have noticed that this change from right-handed to left-handed when going from view space to NDC is inconsistent. This quirk is a historical holdover in OpenGL, and is usually worked around by building the projection matrix so that it flips the Z axis (glm::perspective() does this internally).

In DirectX the NDC system is left-handed, with X pointing right, Y pointing up, and the camera pointing up the positive Z axis. X and Y range from -1 to 1, while z ranges from 0 to 1. Regarding the world and view coordinate systems, the API provides utility functions for building the view matrix for either a left-handed or right-handed view system (these are the D3DXMatrixLookAtLH() and D3DXMatrixLookAtRH()). However, given that the NDC system is left-handed, it’s natural to make your world and view systems that way as well.

In the Vulkan API, the NDC system is defined differently than OpenGL in order to avoid the inconsistency of going from right-handed to left-handed: the NDC is right-handed, with X pointing right, Y pointing down and the camera looking up the positive Z axis. X and Y vary between -1 and 1, but Z varies between 0 and 1. Note how Y points down and not up as in OpenGL. You’re on your own regarding how to establish the other coordinate systems.

Without loss of generality and for convention only, for the rest of this post we will work with world and view coordinate systems which are both right-handed. Our view system will follow the historical OpenGL convention: X points right, Y points up and the camera looks down the negative Z axis.

The problem

Now we are ready to formally state the problem of building the view matrix. What we usually call world space is an affine space. We have a point in space which we establish as the origin for the vector space. We have a right-hand coordinate system where the X axis points east, the Y axis points up, and the Z axis points south. We have an affine frame fixed in this origin with the canonical basis as its basis (the world frame), and we describe points by their coordinates in this frame. These are the world coordinates of a point.

The camera’s position is defined by a point eye in world space. Its orientation can be described by an orthonormal basis \{right, up, back\} whose vectors are defined in world space and point right, up and back from the camera as their names imply. This defines another affine frame with its origin at eye and \{right, up, back\} as its basis. Let’s call this frame the view frame.

The job of the view matrix is to convert coordinates from the world frame to the view frame. This is an affine map, so we represent it with a 4×4 matrix. Our problem is: given the eye point, the point center we want the camera to look at, and an updir vector indicating which direction is up from the camera, find the matrix M that converts coordinates in the world frame to coordinates in the view frame. Note the vector updir is not necessarily a unit vector, and it is not necessarily normal to the vector center - eye which defines the camera’s direction. The purpose of updir is to define a plane together with the direction vector center - eye, which indirectly defines right as the normal to this plane.

Solution

The first thing we will need to do is compute the vectors \{right, up, back\} that make up the basis of the view frame. The back vector is easy to obtain by subtracting center from eye:

back = normalize(eye - center)

The back and updir vectors define a plane, and the right vector needs to be normal to that plane. So we can obtain right as the cross product of updir and back. Note updir is not necessarily a unit vector, so we need to normalize the result of the cross product.

right = normalize(updir \times back)

Having back and right, we compute up as the cross product of them:

up = normalize(back \times right)

It’s worth noting that the construction of the three vectors of the view base is the only step in this process that depends on the handedness of the world and view coordinate systems and the orientation of the camera in the view frame. If you are using OpenGL with glm::lookAt(), your view base will be \{right, up, back\}, because the Z axis of the view frame points back from the camera. If using DirectX with D3DXMatrixLookAtLH(), your view base will be \{right, up, forward\}, because the view frame is left-handed. Regardless of how you have chosen the coordinate systems, the procedure from this step onwards is the same.
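
As a sketch (assuming GLM), the construction of the view base described above looks like this:

#include <glm/glm.hpp>

// Computes the view base {right, up, back} from eye, center and updir,
// following the formulas above (OpenGL-style view frame, Z pointing back).
void viewBase(const glm::vec3& eye, const glm::vec3& center, const glm::vec3& updir,
              glm::vec3& right, glm::vec3& up, glm::vec3& back) {
    back  = glm::normalize(eye - center);
    right = glm::normalize(glm::cross(updir, back));
    up    = glm::normalize(glm::cross(back, right));   // already unit length, but this
                                                        // matches the formula above
}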

Now that we have the vectors of view frame, let’s go back to the definition of affine coordinates. Say that we have a point p expressed in the world frame and we want to compute its coordinates in the view frame. By the definition, we are looking for the coefficients \lambda_1, \lambda_2, \lambda_3 such that the displacement vector p - eye can be obtained as the linear combination \lambda_1 right  +\lambda_2 up + \lambda_3 back. So the first thing we need to do is compute the displacement vector p - eye. This is a translation by -eye, and its matrix is:

T = \begin{pmatrix} 1 & 0 & 0 & -eye_x \\ 0 & 1 & 0 & -eye_y\\ 0 & 0 & 1 & -eye_z\\ 0 & 0 & 0 & 1 \end{pmatrix}

Once we have that, we have the displacement vector expressed by its coordinates in the basis of the world frame. We need to convert it to coordinates in the basis of the view frame. This is a basis change, where our SRC basis is the canonical base, and our DST basis is \{right, up, back\}. Since SRC is the canonical base and DST is orthonormal, the change of basis matrix can be obtained simply by taking the vectors \{right, up, back\} and placing them as rows (see previous section on change of basis). So the change of basis matrix is:

R = \begin{pmatrix} right_x & right_y & right_z & 0 \\ up_x & up_y & up_z & 0 \\ back_x & back_y & back_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

Now all we need is to combine the two steps:

M = R * T = \begin{pmatrix} right_x & right_y & right_z & -right . eye \\ up_x & up_y & up_z & -up . eye \\ back_x & back_y & back_z & -back . eye \\ 0 & 0 & 0 & 1 \end{pmatrix}

That’s it! It’s a translation to compute the displacement vector with respect to the camera eye point, followed by a change of basis from the canonical base to the \{right, up, back\} base. This is what the glm::lookAt() and D3DXMatrixLookAtLH() functions are doing internally.
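
Putting the two steps together, here is a minimal sketch of a lookAt-style function following the derivation above (assuming GLM, which is column-major; the real glm::lookAt() may differ in implementation details):

#include <glm/glm.hpp>

// Builds the view matrix M = R * T derived above.
glm::mat4 myLookAt(const glm::vec3& eye, const glm::vec3& center, const glm::vec3& updir) {
    glm::vec3 back  = glm::normalize(eye - center);
    glm::vec3 right = glm::normalize(glm::cross(updir, back));
    glm::vec3 up    = glm::normalize(glm::cross(back, right));

    // M[c][r] is column c, row r. The rows of the upper-left 3x3 block are right,
    // up and back; the fourth column holds the dot products with -eye.
    glm::mat4 M(1.0f);
    M[0][0] = right.x; M[1][0] = right.y; M[2][0] = right.z; M[3][0] = -glm::dot(right, eye);
    M[0][1] = up.x;    M[1][1] = up.y;    M[2][1] = up.z;    M[3][1] = -glm::dot(up, eye);
    M[0][2] = back.x;  M[1][2] = back.y;  M[2][2] = back.z;  M[3][2] = -glm::dot(back, eye);
    return M;
}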

Geometric interpretation of the lookAt() matrix

The following figure illustrates the coordinate conversion from the world frame to the view frame. Following the previous argument, the geometric interpretation is that we are computing the vector p - eye expressed by its coordinates in the canonical base SRC, and then converting its coordinates to the base DST.

Figure 3. The p – eye vector is computed in SRC and then converted to DST

However, looking at the way the lookAt matrix is built, there’s an alternative geometric interpretation that we could make. Note that if we define the alternative translation matrix

T' = \begin{pmatrix} 1 & 0 & 0 & -right . eye \\ 0 & 1 & 0 & -up . eye \\ 0 & 0 & 1 & -back . eye \\ 0 & 0 & 0 & 1 \end{pmatrix}

Then the following matrix multiplication yields the same matrix M defined previously:

M = T' * R = \begin{pmatrix} right_x & right_y & right_z & -right . eye \\ up_x & up_y & up_z & -up . eye \\ back_x & back_y & back_z & -back . eye \\ 0 & 0 & 0 & 1 \end{pmatrix}

You can do the multiplication on a sheet of paper to confirm this. Note this is similar to the multiplication from before, except that now we apply the linear component first, and translate afterwards using a different translation vector. What does this mean? Remember the geometric meaning of the dot product. When we compute the dot product of a vector a and a unit vector b, this gives us the coordinate of a over b. The translation T' is a translation by the vector (-right.eye, -up.eye, -back.eye), but this vector is nothing but the coordinates of -eye in the orthonormal base \{right, up, back\}. In other words, the expression T' * R is equivalent to converting the coordinates of the point p from SRC to DST, and then adding -eye expressed in DST. The result is the same displacement vector p - eye that we computed earlier, expressed in DST. The following figure illustrates this. It’s similar to the previous figure, but now we show -eye instead of eye.

Figure 4. The -eye vector is converted to DST and then added to the DST-converted p to obtain p – eye in DST.
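
To confirm the equality numerically, the following illustrative sketch (assuming GLM, with arbitrary example values) builds both R * T and T' * R and checks that they agree:

#include <cassert>
#include <cmath>
#include <glm/glm.hpp>

int main() {
    glm::vec3 eye(1.0f, 2.0f, 3.0f), center(0.0f, 0.0f, 0.0f), updir(0.0f, 1.0f, 0.0f);

    glm::vec3 back  = glm::normalize(eye - center);
    glm::vec3 right = glm::normalize(glm::cross(updir, back));
    glm::vec3 up    = glm::normalize(glm::cross(back, right));

    // R: change of basis with right, up, back as rows; no translation.
    glm::mat4 R(1.0f);
    R[0][0] = right.x; R[1][0] = right.y; R[2][0] = right.z;
    R[0][1] = up.x;    R[1][1] = up.y;    R[2][1] = up.z;
    R[0][2] = back.x;  R[1][2] = back.y;  R[2][2] = back.z;

    // T: translation by -eye.  T': translation by -eye expressed in {right, up, back}.
    glm::mat4 T(1.0f), Tp(1.0f);
    T[3]  = glm::vec4(-eye, 1.0f);
    Tp[3] = glm::vec4(-glm::dot(right, eye), -glm::dot(up, eye), -glm::dot(back, eye), 1.0f);

    glm::mat4 M1 = R * T, M2 = Tp * R;
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            assert(std::fabs(M1[c][r] - M2[c][r]) < 1e-5f);
    return 0;
}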

Sources

The PCI DSS Standard

Photo by Towfiqu barbhuiya on Unsplash

In this post we will present an introduction to the PCI DSS Standard: what it is, what constraints it imposes on organizations, and why it matters.

What is the PCI DSS?

The acronym stands for Payment Card Industry Data Security Standard. It is a set of rules for organizations involved in the processing of card payments, which constrains the way payment card data is stored and transferred, with the goal of protecting customer data.

All merchants looking to accept card payments must comply with the standard in order to operate in the industry.

You may have noticed that the expression PCI DSS Standard is redundant, since the last S of the acronym already stands for the word Standard. From now on we will refer to it simply as PCI DSS.

The PCI DSS is written and maintained by the PCI Security Standards Council (PCI SSC). According to their website, the council was founded in 2006 by American Express, Discover, JCB International, Mastercard and Visa Inc. They share equally in ownership, governance, and execution of the Council’s work.

Who does the PCI DSS apply to?

The PCI DSS applies to all entities that process, store and / or transmit cardholder data.

Note that this means that even if an organization doesn’t store the data and only passes it along to an external provider, the organization must still comply with PCI DSS (although the compliance process would be greatly simplified in this case). This still holds, in particular, if the data the organization processes, stores and / or transmits is a token rather than a clear credit card number (see below for the concept of token).

What does PCI DSS mean for merchants?

There are several reasons why complying with PCI DSS is beneficial for businesses:

  • To protect the data of your customers. Implementing good security practices reduces the likelihood and impact of a potential breach.
  • Many acquiring banks demand the merchant be PCI DSS compliant in order to process credit card payments for them.
  • Many acquiring banks apply a fine to non-compliant merchants.
  • If a breach happens, the business may receive fines and lawsuits from customers and other organizations. Fines may come for example from credit card networks or the government. Complying with PCI DSS helps reduce these fines.

Terminology

Before we can get into the standard requirements themselves, we need to lay down some terms:

Primary Account Number (PAN)

The credit card number visible on the front of the card.

According to the rules of the PCI DSS, the PAN can be stored, but in that case it must be rendered unreadable via some mechanism such as encryption or tokenization.

Cardholder data (CHD)

This category contains the following items:

  • Primary Account Number.
  • Cardholder Name.
  • Expiration Date.
  • Service Code.

Sensitive Authentication Data (SAD)

This term includes:

  • Card verification code, also known as Card Verification Value (CVV) or Card Security Code (CSC).
  • Full track data (magnetic-stripe data or equivalent on a chip).
  • PINs/PIN blocks.

SAD like the CVV cannot be stored after the authorization completes, even if encrypted.

Cardholder Data Environment (CDE)

The people, processes and technology that store, process, or transmit cardholder data or sensitive authentication data, and any other systems that don’t but are on the same network as or have unrestricted connectivity to them.

Note the detail about unrestricted connectivity. A component that can connect to a component that stores, processes, or transmits CHD or SAD only because a specific firewall rule enables it is not part of the CDE. Such a component is part of what is called the connected-to systems (more on this below in the section about Scope).

It should be noted that systems outside of the CDE may still be relevant to a PCI DSS assessment, if they can connect to systems within the CDE (see concept of scope below).

Index Token

A non-sensitive replacement for the PAN. The mapping between the token and the PAN is kept in a secure index, which allows recovering the PAN (a sensitive value) from the token.

Scope

In the context of PCI DSS, the scope is the set of system components, people and processes that need to be included in the PCI DSS assessment. The first step of an assessment is to properly identify the scope of the review.

It’s important to understand that this doesn’t only involve the CDE. The PCI DSS scope is composed of:

  • The CDE.
  • Any system components with connectivity to or from the CDE. This category is sometimes referred to as “Connected-to” components.
  • Any system that can impact the security or configuration of the CDE (e.g. a host that cannot connect directly to the CDE but can access the CDE via a jump host). This category is sometimes referred to as “Security impacting” components.

Note that a workstation that cannot connect to the CDE but can log into it via a jump host (also known as bastion host) is not in the “Connected-to” category, but is still in the “Security impacting” components and therefore in scope.

Systems that, while outside of the CDE, have connectivity to or from the CDE or can impact the security of the CDE are impacted by PCI DSS and thus need to be secured. It is common for attackers to target systems outside of the CDE which have been considered of low importance and use them to gain access to systems inside the CDE.

A note about alternative terminology. The components in the CDE are sometimes referred to as “Category 1”. The components in the “Connected-to” and the “Security impacting” groups are sometimes referred to as “Category 2” components, and components that are not in scope are referred to as “Category 3”. The terms “Category 1”, “Category 2” and “Category 3” are not part of the standard but they are used in parts of the literature.

The following figure illustrates the elements that compose the scope in PCI DSS:

Image by PCI SSC

PCI Compliance process

Given an organization that processes, stores or transmits cardholder data, the process for certifying as PCI compliant involves three main elements:

  • Handling the ingress and transmission of cardholder data securely.
  • Storing cardholder data securely, which involves complying with the 12 requirements of PCI DSS about aspects like encryption and security testing.
  • Doing annual validations that the required security controls are in place. This can include forms, questionnaires and / or external audits.

In the following sections we will present an overview of the steps involved in PCI compliance.

Step 1: Determine your requirements

The requirements of the PCI DSS vary depending on the scale of the organization. There are four different Compliance Levels:

  • Level 1: Merchants that annually process over 6 million transactions of Visa or Mastercard, or more than 2.5 million of American Express, or have experienced a data breach, or are considered Level 1 by any card network (e.g. Visa, Mastercard).
  • Level 2: Merchants that process 1 to 6 million transactions annually.
  • Level 3: Merchants that process 20,000 to 1 million online transactions annually.
  • Level 4: Merchants that process fewer than 20,000 online transactions annually, or that process up to 1 million total transactions annually.

Level 1 merchants require:

  • Annual Report on Compliance (ROC) by a Qualified Security Assessor (QSA) – also commonly known as a Level 1 onsite assessment – or internal auditor if signed by an officer of the company. The assessor works on site reviewing documentation artifacts, evaluating the scope of the assessment and providing support along the compliance process. The assessor submits the ROC to the organization’s acquiring banks indicating its compliance.
  • Quarterly network scan by Approved Scan Vendor (ASV).
  • Attestation of Compliance (AOC) for Onsite Assessments.

For organizations in levels 2 to 4, compliance requires:

  • Annual Self-Assessment Questionnaire (SAQ).
  • Quarterly network scan by Approved Scan Vendor (ASV).
  • Attestation of Compliance (AOC) for the SAQ.

In addition to the above, the PCI SSC updates the standard every three years and releases incremental updates throughout the year, which also contributes to the complexity of the process.

Step 2: Map your data flows

This step involves identifying every application or system component where CHD is processed, transmitted or stored. This may require creating new diagrams or design artifacts, showing details like which network connections carry clear credit card numbers, which carry only tokens and which carry neither of those things.

In other words, you delineate the PCI DSS scope.

This is a team effort that requires collaboration across the organization.

Step 3: Check security controls and protocols

Once you have defined the scope, you need to check every system component in it to ensure the right security configurations and protocols are in effect according to the 12 requirements of the PCI DSS.

Step 4: Monitor and maintain compliance

Once you have achieved compliance with the standard, you will need to set up a regular process to monitor and ensure that you stay compliant across changes in the organization and the standard requirements.

Depending on the scale of the organization, this may involve submitting quarterly or annual reports, and may go as far as performing annual on-site assessments.

For more information about the PCI compliance process, see Prioritized Approach for PCI DSS and FAQ about requirements for merchants that develop applications for consumer devices that accept payment card data.

Segmentation

Segmentation is the act of isolating the CDE from the rest of the organization’s network, for example via a firewall. Segmentation is not a requirement of PCI DSS, but it is strongly recommended to reduce the:

  • Scope of the PCI DSS assessment.
  • Cost of the PCI DSS assessment.
  • Complexity of implementing PCI DSS controls.
  • Risk of payment data compromise.

It’s important to understand that if segmentation is not in place, the entire network is in scope for PCI DSS.

The separation of components into different networks is not enough to qualify as segmentation. Segmentation is achieved by having controls in place that enforce the separation and prevent the out-of-scope network from accessing CHD.

Segmentation example

As an example, consider the following problem, from Information Supplement: Guidance for PCI DSS Scoping and Network Segmentation:

  • Design a segmented network architecture that provides an administration workstation in the corporate LAN with administrative access to the CDE, while keeping the rest of the corporate LAN out of the scope of PCI DSS.

A possible solution to this problem is illustrated by the following figure:

Image by PCI SSC

The solution consists of the following elements:

  • The system is segmented into three networks:
    • One network for the CDE (protected by a firewall).
    • One network for components that are unrelated to CHD processing (the corporate LAN).
    • One network for services which are used by both the corporate LAN and the CDE (the “shared services” network).
  • A “jumpbox” (also commonly referred to as bastion host) is installed in the shared services network.
  • Connection to the CDE from the corporate LAN is denied. Only the jump host can connect to the CDE.
  • Connections from the admin workstation to the jump host are only allowed for designated users.
  • (Other security controls required in order to comply with PCI DSS requirements, not shown here).

With the above setup, the PCI DSS scope consists of:

  • The CDE.
  • The shared services network.
  • The jump box.
  • The administration workstation.

All the other components within the corporate LAN are out of scope.

Tokenization and scope

It is clear that system components that handle PAN are part of the CDE. But what about components that only handle tokens? It depends: components that only handle tokens are considered outside the CDE as long as they are properly isolated from the CDE.

Note that even if a component that only handles tokens is outside of the CDE, that doesn’t mean that it is not in scope: it may still be in scope if it connects to a component in the CDE.

In general the use of tokenization is recommended as a way to reduce the scope and simplify the compliance process, but it does not eliminate the need to comply with PCI DSS. From Information Supplement: PCI DSS Tokenization Guidelines:

“Tokenization solutions do not eliminate the need to maintain and validate PCI DSS compliance, but they may simplify a merchant’s validation efforts by reducing the number of system components for which PCI DSS requirements apply”.

PCI DSS and cache engines

If the set of components in scope includes a cache service, we need to ensure that the cache is PCI DSS compliant. This makes the choice of cache engine relevant, and should be taken into account when designing the architecture of the system.

As an example, if we use one of the hosted cache solutions provided by the AWS ElastiCache service, we have two choices of engine: Redis and Memcached. The ElastiCache Redis service is PCI DSS compliant, whereas ElastiCache Memcached is not.

PCI DSS requirements

The following is a summary of the requirements that PCI DSS establishes. While we have included the 12 requirements, the details we mention for each are only an overview and should not be taken as an exhaustive description. For the detailed description see the actual PCI DSS text in the sources.

Requirement 1: Build and Maintain a Secure Network and Systems

This requirement consists of the following sections:

  • Processes and mechanisms for installing and maintaining network security controls are defined and understood.
  • Network security controls (NSCs) are configured and maintained.
  • Network access to and from the cardholder data environment is restricted.
  • Network connections between trusted and untrusted networks are controlled.
    • This involves among other things ensuring that the components where cardholder data is stored are not directly accessible from untrusted networks.
  • Risks to the CDE from computing devices that are able to connect to both untrusted networks and the CDE are mitigated.

Requirement 2: Apply Secure Configurations to All System Components

This requirement consists of the following sections:

  • Processes and mechanisms for applying secure configurations to all system components are defined and understood.
  • System components are configured and managed securely.
  • Wireless environments are configured and managed securely.

Example recommendations:

  • Change all default passwords.

Requirement 3: Protect Stored Account Data

This requirement consists of the following sections:

  • Processes and mechanisms for protecting stored account data are defined and understood.
    • Storage of account data is kept to a minimum.
  • Sensitive authentication data (SAD) is not stored after authorization.
    • For example the CVV of a credit card cannot be stored after the authorization has completed, even if encrypted.
    • SAD that is stored prior to completion of the authorization must be encrypted using strong cryptography.
    • Note the category of SAD contains the CVV, but not the PAN or the expiration date.
    • To the effects of this rule, the authorization is considered to be complete when the merchant receives a response (for example approval or decline).
  • Access to displays of full PAN and ability to copy cardholder data are restricted.
    • The PAN must be masked when displayed (such as in UI or logs). The BIN and the last four digits are the maximum number of digits that can be displayed (a masking sketch is shown after the example recommendations below).
  • Primary account number (PAN) is secured wherever it is stored.
    • PAN is rendered unreadable anywhere it is stored by using any of the following approaches:
      • One-way hashes based on strong cryptography of the entire PAN.
      • Truncation, as long as hashing cannot be used to replace the truncated segment of PAN.
        • If hashed and truncated versions of the same PAN, or different truncation formats of the same PAN, are present in an environment, additional controls must be in place such that the different versions cannot be correlated to reconstruct the original PAN.
      • Index tokens.
      • Strong cryptography with associated key management processes and procedures.
  • Cryptographic keys used to protect stored account data are secured.
  • Where cryptography is used to protect stored account data, key management processes and procedures covering all aspects of the key life cycle are defined and implemented.

Example recommendations:

  • Try to avoid storing the PAN if at all possible. If you must store it, it must be encrypted.
  • Try to avoid storing the CVV at all. If you must store it, it can only be stored during the authorization and not after that.
  • Avoid putting PANs and CVVs in logs.
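
For illustration only (a hypothetical helper, not something mandated verbatim by the standard), a masking function that keeps at most the BIN and the last four digits could look like this:

#include <string>

// Masks a PAN for display: keeps the first six digits (BIN) and the last four,
// replacing everything in between with '*'. Illustrative sketch only; check the
// display requirements that apply to your own environment.
std::string maskPan(const std::string& pan) {
    if (pan.size() <= 10) {
        return std::string(pan.size(), '*');   // too short to keep BIN + last four
    }
    return pan.substr(0, 6)
         + std::string(pan.size() - 10, '*')
         + pan.substr(pan.size() - 4);
}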

Requirement 4: Protect Cardholder Data with Strong Cryptography During Transmission Over Open, Public Networks

This requirement consists of the following sections:

  • Processes and mechanisms for protecting cardholder data with strong cryptography during transmission over open, public networks are defined and documented.
  • PAN is protected with strong cryptography during transmission.

Requirement 5: Protect All Systems and Networks from Malicious Software

This requirement consists of the following sections:

  • Processes and mechanisms for protecting all systems and networks from malicious software are defined and understood.
  • Malicious software (malware) is prevented, or detected and addressed.
  • Anti-malware mechanisms and processes are active, maintained, and monitored.
  • Anti-phishing mechanisms protect users against phishing attacks.

Requirement 6: Develop and Maintain Secure Systems and Software

This requirement consists of the following sections:

  • Processes and mechanisms for developing and maintaining secure systems and software are defined and understood.
  • Bespoke and custom software are developed securely.
    • This involves ensuring that software development personnel working on bespoke and custom software are trained at least once every 12 months.
    • Software goes through a code review process prior to being released into production or to customers, to identify and correct potential coding vulnerabilities.
  • Security vulnerabilities are identified and addressed.
    • This involves keeping an inventory of bespoke and custom software, and third-party software components incorporated into bespoke and custom software in order to facilitate vulnerability and patch management.
  • Public-facing web applications are protected against attacks.
    • This protection may be in the form of reviewing public-facing web applications with manual or automated application vulnerability security assessment tools at least every 12 months, or in the form of installing an automated solution in front of the public-facing web application that continually detects and prevents web-based attacks. A Web Application Firewall (WAF) is an example of the latter approach.
  • Changes to all system components are managed securely.
    • This involves ensuring that all changes in production are made following a process that documents the reason for the change, who approved it, how to rollback in case of failure, etc.
    • Pre-production environments must be separated from production environments and the separation must be enforced with access controls.
    • Live PANs are not used in pre-production environments, except where those environments are included in the CDE and protected in accordance with all applicable PCI DSS requirements.

Example recommendations:

  • If you use an Oracle database, ensure that any Critical Patch Updates are applied within 30 days.

Requirement 7: Restrict Access to System Components and Cardholder Data by Business Need to Know

This requirement consists of the following sections:

  • Processes and mechanisms for restricting access to system components and cardholder data by business need to know are defined and understood.
  • Access to system components and data is appropriately defined and assigned.
    • This involves reviewing all user accounts and related access privileges at least once every six months to ensure accesses remain appropriate based on job function.
    • User access to repositories of cardholder data must be restricted to ensure only the responsible administrators have access.
  • Access to system components and data is managed via an access control system(s).

Requirement 8: Identify Users and Authenticate Access to System Components

This requirement consists of the following sections:

  • Processes and mechanisms for identifying users and authenticating access to system components are defined and understood.
  • User identification and related accounts for users and administrators are strictly managed throughout an account’s lifecycle.
    • This involves ensuring that group, shared, or generic accounts, or other shared authentication credentials are only used when necessary on an exception basis.
    • Access for terminated users is immediately revoked.
    • Inactive accounts are removed or disabled within 90 days of inactivity.
  • Strong authentication for users and administrators is established and managed.
    • All user access to system components for users and administrators is authenticated via at least one of the following authentication factors:
      • Something you know, such as a password or passphrase.
      • Something you have, such as a token device or smart card.
      • Something you are, such as a biometric element.
    • Strong cryptography is used to render all authentication factors unreadable during transmission and storage on all system components.
    • If passwords are used, they are at least 12 characters in length and contain both numeric and alphabetic characters.
  • Multi-factor authentication (MFA) is implemented to secure access into the CDE.
    • MFA must be implemented for all access into the CDE.
  • Multi-factor authentication (MFA) systems are configured to prevent misuse.
    • MFA systems are implemented as follows:
      • The MFA system is not susceptible to replay attacks.
      • MFA systems cannot be bypassed by any users, including administrative users unless specifically documented, and authorized by management on an exception basis, for a limited time period.
      • At least two different types of authentication factors are used.
      • Success of all authentication factors is required before access is granted.
  • Use of application and system accounts and associated authentication factors is strictly managed.
    • Passwords/passphrases for any application and system accounts that can be used for interactive login are not hard coded in scripts, configuration/property files, or bespoke and custom source code.

Example recommendations:

  • If you use a relational database, ensure that no users with read-all access exist.

Requirement 9: Restrict Physical Access to Cardholder Data

This requirement consists of the following sections:

  • Processes and mechanisms for restricting physical access to cardholder data are defined and understood.
  • Physical access controls manage entry into facilities and systems containing cardholder data.
    • Individual physical access to sensitive areas within the CDE is monitored with either video cameras or physical access control mechanisms.
    • Physical and/or logical controls are implemented to restrict use of publicly accessible network jacks within the facility.
    • Physical access to wireless access points, gateways, networking/communications hardware, and telecommunication lines within the facility is restricted.
  • Physical access for personnel and visitors is authorized and managed.
    • Visitors are escorted at all times.
    • Visitors are clearly identified and given a badge or other identification that expires.
    • Visitor badges or other identification visibly distinguishes visitors from personnel.
  • Media with cardholder data is securely stored, accessed, distributed, and destroyed.
    • The security of the offline media backup location(s) with cardholder data is reviewed at least once every 12 months.
    • Inventory logs of all electronic media with cardholder data are maintained.
    • Inventories of electronic media with cardholder data are conducted at least once every 12 months.
    • Hard-copy materials with cardholder data are destroyed when no longer needed for business or legal reasons.
    • Electronic media with cardholder data is destroyed when no longer needed for business or legal reasons.
  • Point of interaction (POI) devices are protected from tampering and unauthorized substitution.
    • A list of POI devices must be maintained.
    • POI devices must be inspected periodically to look for tampering or unauthorized substitution.

Requirement 10: Log and Monitor All Access to System Components and Cardholder Data

This requirement consists of the following sections:

  • Processes and mechanisms for logging and monitoring all access to system components and cardholder data are defined and documented.
  • Audit logs are implemented to support the detection of anomalies and suspicious activity, and the forensic analysis of events.
    • Audit logs capture all actions taken by any individual with administrative access, including any interactive use of application or system accounts.
    • Audit logs capture all invalid logical access attempts.
  • Audit logs are protected from destruction and unauthorized modifications.
    • Read access to audit logs files is limited to those with a job-related need.
    • Audit log files are protected to prevent modifications by individuals.
  • Audit logs are reviewed to identify anomalies or suspicious activity.
    • The following audit logs are reviewed at least once daily:
      • All security events.
      • Logs of all system components that store, process, or transmit CHD and/or SAD.
      • Logs of all critical system components.
      • Logs of all servers and system components that perform security functions (for example, network security controls, intrusion-detection systems/intrusion-prevention systems (IDS/IPS), authentication servers).
  • Audit log history is retained and available for analysis.
    • Retain audit log history for at least 12 months, with at least the most recent three months immediately available for analysis.
  • Time-synchronization mechanisms support consistent time settings across all systems.
    • System clocks and time are synchronized using time-synchronization technology (e.g. Network Time Protocol).
    • Time synchronization settings and data are protected as follows:
      • Access to time data is restricted to only personnel with a business need.
      • Any changes to time settings on critical systems are logged, monitored, and reviewed.
  • Failures of critical security control systems are detected, reported, and responded to promptly.
    • The following are examples of critical security control systems:
      • Network security controls.
      • Intrusion Detection Systems / Intrusion Prevention Systems.
      • Anti-malware solutions.
      • Physical access controls.
      • Audit logging mechanisms.
      • Audit log review mechanisms.
      • Automated security testing tools (if used).

Requirement 11: Test Security of Systems and Networks Regularly

This requirement consists of the following sections:

  • Processes and mechanisms for regularly testing security of systems and networks are defined and understood.
  • Wireless access points are identified and monitored, and unauthorized wireless access points are addressed.
    • An inventory of authorized wireless access points is maintained, including a documented business justification.
  • External and internal vulnerabilities are regularly identified, prioritized, and addressed.
    • Internal vulnerability scans are performed at least once every three months.
    • External vulnerability scans are performed at least once every three months by a PCI SSC-approved scanning vendor.
  • External and internal penetration testing is regularly performed, and exploitable vulnerabilities and security weaknesses are corrected.
    • Internal penetration testing is performed at least once every twelve months.
    • External penetration testing is performed at least once every twelve months.
  • Network intrusions and unexpected file changes are detected and responded to.
    • All traffic is monitored at the perimeter of the CDE.
    • All traffic is monitored at critical points in the CDE.
    • A change-detection mechanism (for example, file integrity monitoring tools) is deployed to alert personnel to unauthorized modification of critical files and to perform critical file comparisons at least once weekly.
  • Unauthorized changes on payment pages are detected and responded to.
    • A change- and tamper-detection mechanism is deployed to alert personnel to unauthorized modification (including indicators of compromise, changes, additions, and deletions) to the HTTP headers and the contents of payment pages as received by the consumer browser.

Requirement 12: Support Information Security with Organizational Policies and Programs

This requirement consists of the following sections:

  • A comprehensive information security policy that governs and provides direction for protection of the entity’s information assets is known and current.
  • Acceptable use policies for end-user technologies are defined and implemented.
    • These policies instruct personnel on what they can and cannot do with company equipment and instruct personnel on correct and incorrect uses of company Internet and email resources.
  • Risks to the cardholder data environment are formally identified, evaluated, and managed.
    • Cryptographic cipher suites and protocols in use are documented and reviewed at least once every 12 months.
  • PCI DSS compliance is managed.
  • PCI DSS scope is documented and validated.
    • An inventory of system components that are in scope for PCI DSS, including a description of function/use, is maintained and kept current.
    • PCI DSS scope is documented and confirmed by the entity at least once every 12 months and upon significant change to the in-scope environment.
  • Security awareness education is an ongoing activity.
    • A formal security awareness program is implemented to make all personnel aware of the entity’s information security policy and procedures, and their role in protecting the cardholder data.
    • Personnel receive security awareness training upon hire and at least once every 12 months.
  • Personnel are screened to reduce risks from insider threats.
  • Risk to information assets associated with third-party service provider (TPSP) relationships is managed.
    • A list of all third-party service providers (TPSPs) with which account data is shared or that could affect the security of account data is maintained, including a description for each of the services provided.
    • A program is implemented to monitor TPSPs’ PCI DSS compliance status at least once every 12 months.
  • Third-party service providers (TPSPs) support their customers’ PCI DSS compliance.
    • TPSPs support their customers’ requests for information by providing the following upon customer request:
      • PCI DSS compliance status information for any service the TPSP performs on behalf of customers.
      • Information about which PCI DSS requirements are the responsibility of the TPSP and which are the responsibility of the customer, including any shared responsibilities.
  • Suspected and confirmed security incidents that could impact the CDE are responded to immediately.
    • An incident response plan exists and is ready to be activated in the event of a suspected or confirmed security incident.
    • Specific personnel are designated to be available on a 24/7 basis to respond to suspected or confirmed security incidents.
    • Personnel responsible for responding to suspected and confirmed security incidents are appropriately and periodically trained on their incident response responsibilities.

Sources

Notes on using UUIDs as primary keys on databases

Photo by benjamin lehman on Unsplash

Problem

Given a table in a relational database, we want a way to generate the values we are going to use as primary keys on it.

Using a sequential integer is a common solution for this. In this post we want to consider an alternative approach: random UUIDs as primary keys.

Deciding between these two options (integers and UUIDs) is a non-trivial decision with several trade-offs involved.

What is a UUID?

The term UUID refers to a Universally Unique Identifier. It is also called GUID (Globally Unique Identifier) in some contexts. The standard, defined in RFC 4122, specifies a UUID as a 128-bit integer that is commonly displayed as a hexadecimal string with the following format:

aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee

A simple solution to the problem of identifying records in a table is to use an incremental integer. This is simple and allows for efficient ordering of the values in an index. However, this approach has a drawback: in order to insert a record, you either need to know a value that is not already present in the table (which creates the new problem of how to obtain that value), or let the database generate a new value automatically during the insertion, which means you don’t know the value until after the insertion has completed.
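
As a quick illustration, the following minimal sketch (using only the JDK’s java.util.UUID class) generates a random version 4 UUID, whose value is known to the application before any database insert:

import java.util.UUID;

public class UuidExample {
    public static void main(String[] args) {
        // Generate a random (version 4) UUID; no database round trip is needed to obtain it
        UUID id = UUID.randomUUID();
        // Prints something like: 3f2504e0-4f89-41d3-9a0c-0305e82c3301
        System.out.println(id);
    }
}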

Pros

  • You can know your PK before insertion, which avoids a round trip to the database and simplifies transactional logic in which you need to know the PK before inserting child records that use that key as their foreign key (FK).
  • At scale, when you have multiple databases, each containing a segment (shard) of your data (for example, a set of customers), using a UUID means that an ID is unique across all databases, not just the one it was created in. This makes moving data across databases safe.
  • Security: UUID values do not expose information about your data, so they are safer to use in a URL. For example, if a customer with id 10 accesses their account via the URL http://www.example.com/customers/10/, it is easy to guess that there is a customer 11, 12, etc., and this could be a target for an attack.

Cons

  • Performance: UUIDs are four times larger than a 4-byte integer. Using UUID values may cause performance issues due to their size and their lack of ordering. This is particularly troublesome in complex database schemas with several one-to-many and many-to-many relations that need joins and sorts to work with. The performance impact of having to store the larger keys in memory tends to add up. The significance of this aspect needs to be evaluated by the engineer on a case-by-case basis.
  • Storage size: storing UUID values (16 bytes) takes more storage than integers (4 bytes) or even big integers (8 bytes).

Best practices when using UUIDs

Regardless of the approach you take regarding your primary keys, it’s a good idea to adhere to the following principles:

  • Avoid exposing UUIDs in a browser-facing URL. Even if the UUID is not a primary key in the database but an indirect reference to a record, exposing it provides information that may be exploited by an attacker.
  • If you use UUIDs as primary keys, avoid storing them as strings (VARCHAR). Instead, use the mechanism provided by your DBMS to store the UUID in binary form. For example, in MySQL you can store the UUID as BINARY(16), using the function UUID_TO_BIN to store it and the function BIN_TO_UUID to retrieve it (see the sketch after this list).
  • If you use UUIDs as primary keys, make sure that when records are inserted, the key is generated as a random value. Sequential values (such as those generated by SQL Server’s newsequentialid) expose too much information about the underlying data.
  • Tom Harrison’s blog post on the matter suggests using UUIDs as primary keys, but without exposing them outside the service. This means your service’s API would use a different, external identifier for the entities, and the service needs to translate from the external identifiers to your internal UUID primary keys.
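
To illustrate the BINARY(16) approach, here is a minimal sketch assuming MySQL 8+ (where UUID_TO_BIN and BIN_TO_UUID are available) and a hypothetical customers table with a BINARY(16) id column; the table and column names are placeholders for this example:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.UUID;

public class UuidJdbcExample {

    // Assumes a table like: CREATE TABLE customers (id BINARY(16) PRIMARY KEY, name VARCHAR(100))
    public static void insertAndReadBack(Connection connection, String name) throws SQLException {
        // The key is generated (and known) in application code, before the insert
        UUID id = UUID.randomUUID();

        try (PreparedStatement insert = connection.prepareStatement(
                "INSERT INTO customers (id, name) VALUES (UUID_TO_BIN(?), ?)")) {
            insert.setString(1, id.toString());
            insert.setString(2, name);
            insert.executeUpdate();
        }

        try (PreparedStatement select = connection.prepareStatement(
                "SELECT BIN_TO_UUID(id) AS id, name FROM customers WHERE id = UUID_TO_BIN(?)")) {
            select.setString(1, id.toString());
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("id") + " -> " + rs.getString("name"));
                }
            }
        }
    }
}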

Sources

How to secure the graceful shutdown of a Spring Boot application running on AWS EC2 instance

Photo by Steve Johnson on Unsplash

Problem

We want a way to automatically instruct a Spring Boot application to do a graceful shutdown. This is useful, for example, when we use CodeDeploy to deploy the application to an EC2 instance, because it allows the application to shut down gracefully when we are deploying a new version.

Assumptions

For the remainder of this text we will work on the following assumptions:

  1. The application is a Spring Boot application.
  2. The application is deployed on an EC2 instance using AWS CodeDeploy (which involves custom start and stop scripts) but the stop script calls the shutdown endpoint in an insecure way over HTTP (not HTTPS). See previous post Setting up a simple CICD pipeline with AWS CodePipeline for instructions on how to set this up.
  3. The credentials for graceful shutdown will be stored in AWS Secrets Manager.

Steps

Follow these instructions to automate the graceful shutdown.

  • Configure HTTP basic authentication for all the endpoints under /actuator, following the instructions at How to secure an API endpoint with HTTP basic authentication in Spring Boot .
  • If you haven’t configured the AWS CLI locally and on the EC2 instance, do it following the instructions at How to configure the AWS CLI.
  • This is necessary so that the shutdown script can retrieve the password for the shutdown endpoint from AWS Secrets Manager.
  • Edit the start script so that:
    • The delay between the command that starts the application and the command that checks whether the application started correctly is long enough, taking into account the time the application needs to read the admin password from the secret management service.
  • Edit the shutdown script so that:
    • It retrieves the admin password from the secret management service and passes it in the curl call to the shutdown endpoint.
    • It doesn’t use the -v option in the curl call, so that the base64-encoded credentials don’t appear in the output.
    • It calls the actuator/shutdown endpoint over HTTPS and not HTTP.
    • If your application uses a self-signed certificate, make sure that the shutdown script calls curl with the option -k, so that curl accepts the self-signed certificate from our application.
  • The following is an example of a viable stop script updated to read the admin password from AWS Secrets Manager and sending it in the request to the /actuator/shutdown endpoint:
#! /bin/bash

SERVICE_HOST=localhost
SERVICE_PORT=8080
ADMIN_PASSWORD_SECRET_NAME=<the_admin_password_secret_name>

pid=`ps aux | grep -i <my_app_name> | grep -v grep | awk '{ print $2 }'`
if [ ! $pid ]; then
  echo "Process not runing, nothing to do"
  exit 0
fi

ADMIN_PASSWORD_JSON=`aws secretsmanager get-secret-value --secret-id $ADMIN_PASSWORD_SECRET_NAME`
if [ ! $? -eq 0 ]; then
  echo "Could not retrieve admin password from AWS Secrets Manager. Check that aws configure has been ran in the host. The service will remain running"
  exit 1
fi

ADMIN_PASSWORD=`echo "$ADMIN_PASSWORD_JSON" | jq -r '.SecretString'`
if [ ! $? -eq 0 ]; then
  echo "Could not parse the admin password from the AWS Secrets Manager json response. Check that jq is installed in the host. The service will remain running"
  exit 1
fi

echo "Sending shutdown request to service..."
curl -k -X POST --user "admin:$ADMIN_PASSWORD" https://$SERVICE_HOST:$SERVICE_PORT/actuator/shutdown
curl_result=$?
echo -ne "\n"

if [ ! $curl_result -eq 0 ]; then
  echo "Could not do the graceful shutdown of the service. This may mean that we couldn't send the shutdown request, or that the admin password we retrieved from AWS Secrets Manager doesn't match what the service is expecting. We will kill the process anyway"
  echo "Curl exit code: $curl_result"
fi

kill $pid
kill_result=$?
if [ ! $kill_result -eq 0 ]; then
  echo "Could not kill process. You should kill it manually"
  echo "Kill exit code: $kill_result"
  exit 1
fi

echo "Service stopped"
  • Edit the application.properties so that it has:
management.endpoints.web.exposure.include=*
management.endpoint.shutdown.enabled=true
  • This is all that is needed (the first line is optional for the purpose of enabling the shutdown endpoint).
  • (One time only per EC2 instance) Install the jq tool on the EC2 instance with the following command:
$ sudo yum install jq
  • This is necessary because the stop script above needs to use jq to parse the AWS Secrets Manager json response.

How to secure an API endpoint with HTTP basic authentication in Spring Boot

Photo by Ash from Modern Afflatus on Unsplash

Problem

Given a Spring Boot application that exposes a REST API, we want to control access to one of its API endpoints using HTTP basic authentication.

HTTP basic authentication

HTTP basic authentication is an extension to the HTTP protocol meant to protect access to a web resource. It works by defining a username and password for the resource, and having the client send a header Authorization: Basic <credentials>, where credentials is the string username:password encoded in base64. For more details see https://en.wikipedia.org/wiki/Basic_access_authentication.
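
As a concrete illustration, the header value can be computed with the JDK’s Base64 encoder. This is just a minimal sketch (clients and frameworks normally build this header for you), and the username and password are placeholder values:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeaderExample {
    public static void main(String[] args) {
        String username = "api-user";
        String password = "s3cret";
        // Encode "username:password" in base64 to obtain the credentials part of the header
        String credentials = Base64.getEncoder()
                .encodeToString((username + ":" + password).getBytes(StandardCharsets.UTF_8));
        // Prints: Authorization: Basic YXBpLXVzZXI6czNjcmV0
        System.out.println("Authorization: Basic " + credentials);
    }
}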

Assumptions

We will work based on the following assumptions:

  1. The application is a Spring Boot web application that uses the reactive web stack (Webflux). This means the servlet-based approach of configuring security (for example, via HttpSecurity in a WebSecurityConfigurerAdapter) is not an option, because it depends on the blocking servlet API and can’t be used in a reactive service; instead, security is configured with a SecurityWebFilterChain bean.
  2. The credentials will be stored on the cloud using AWS Secrets Manager.
  3. The application has a set of endpoints that are able to receive unsecured requests over HTTP, and these endpoints are under the URL /api/.

Steps

Follow this procedure to protect an endpoint with HTTP basic authentication. It assumes the application is a Spring Boot web application that uses the reactive web stack.

  • If you haven’t added the AWS Secrets Manager Java SDK to your project, do it following the instructions at How to integrate a Java application with the AWS Java SDK .
  • If you haven’t configured the application to use SSL to receive its requests, do it following the instructions at Setting up HTTPS / SSL in a Spring Boot application .
    • This is necessary in order to send the credentials securely in the HTTP request. Keep in mind the credentials are sent encoded in base64, but base64 is not encryption. Any third party that intercepts the request in cleartext could easily extract the credentials from the base64-encoded string.
  • Create a strong password to protect the endpoint.
  • Store this password in AWS Secrets Manager following the instructions at Working with AWS Secrets Manager .
  • Create a bean (let’s call it APIPasswordRetriever for this example) that retrieves the API password from the secret management service following the instructions at Working with AWS Secrets Manager .
  • Pick a username to use to access the API. In this example we will use “api-user”.
  • Add the following dependencies to your project:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-security</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.security</groupId>
    <artifactId>spring-security-test</artifactId>
    <scope>test</scope>
</dependency>
  • If your pom inherits from the spring-boot-starter-parent parent, you don’t need to specify the versions. Otherwise, search for the latest versions in https://search.maven.org/classic/.
  • Create a bean of type SecurityWebFilterChain, and construct it using the ServerHttpSecurity builder. You can use the following snippet as guidance:
import org.springframework.security.authentication.UserDetailsRepositoryReactiveAuthenticationManager;
import org.springframework.security.web.server.util.matcher.ServerWebExchangeMatchers;
import org.springframework.security.core.userdetails.MapReactiveUserDetailsService;
import org.springframework.security.config.web.server.ServerHttpSecurity;
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
import org.springframework.security.web.server.SecurityWebFilterChain;
import org.springframework.security.crypto.password.PasswordEncoder;
import org.springframework.security.core.userdetails.UserDetails;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.core.userdetails.User;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

@Configuration
public class ApiSecurityConfig {

    @Autowired
    private APIPasswordRetriever apiPasswordRetriever;

    @Bean
    public SecurityWebFilterChain apiSecurityWebFilterChain(ServerHttpSecurity http) {
        PasswordEncoder passwordEncoder = new BCryptPasswordEncoder();

        UserDetails user = User
            .withUsername("api-user")
            .password(passwordEncoder.encode(apiPasswordRetriever.getApiPassword()))
            .roles("API_USER")
            .build();

        MapReactiveUserDetailsService userDetailsService = new MapReactiveUserDetailsService(user);

        UserDetailsRepositoryReactiveAuthenticationManager authManager =
            new UserDetailsRepositoryReactiveAuthenticationManager(userDetailsService);
        authManager.setPasswordEncoder(passwordEncoder);

        return http
            .csrf().disable()
            .securityMatcher(ServerWebExchangeMatchers.pathMatchers("/api/**"))
            .httpBasic()
            .and()
            .authorizeExchange()
            .pathMatchers("/api/**")
            .hasRole("API_USER")
            .and()
            .authenticationManager(authManager)
            .build();
    }
}
  • Key takeaways:
    • The securityMatcher line specifies the scope of this configuration i.e. which requests it will act upon. Using this approach, you can set up different security settings for different APIs, by creating different SecurityWebFilterChain beans. As long as the securityMatcher lines define disjoint paths, the configurations will not step on each other. If we didn’t include the securityMatcher line, then this SecurityWebFilterChain config would apply to all requests.
    • The pathMatchers line is used to enter into authorization settings for a given URL within that scope. Several pathMatchers lines can be given to configure different URLs.
    • The above configuration enables HTTP basic authentication for all URLs under /api.
    • The role used is an arbitrary string, all that matters is that it matches between the creation of the user database (line .roles("API_USER")) and the hasRole("API_USER") line.
    • We disable CSRF protection. It’s not really applicable in a REST service, which is not browser-facing. Disabling CSRF protection is necessary for POST requests to work without having to send a CSRF token.
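
To verify the protected endpoint, a client can send the credentials with Spring’s WebClient. The following is a minimal sketch: the base URL, the /api/hello path and the password are placeholders for this example, and if the server uses a self-signed certificate the client must also be configured to trust it (see Setting up HTTPS / SSL in a Spring Boot application).

import org.springframework.web.reactive.function.client.WebClient;

public class ApiClientExample {
    public static void main(String[] args) {
        // Placeholder value for illustration only
        String apiPassword = "<the_api_password>";

        WebClient client = WebClient.builder()
                .baseUrl("https://localhost:8080")
                // Adds the Authorization: Basic ... header to every request sent by this client
                .defaultHeaders(headers -> headers.setBasicAuth("api-user", apiPassword))
                .build();

        String response = client.get()
                .uri("/api/hello")
                .retrieve()
                .bodyToMono(String.class)
                .block();

        System.out.println(response);
    }
}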

Sources

Setting up HTTPS / SSL in a Spring Boot Web application

Photo by Georg Bommeli on Unsplash

Problem

We want to be able to exchange HTTP requests and responses with our application over an encrypted connection.

HTTPS and SSL

SSL (Secure Sockets Layer) is a standard for secure communication over the transport layer. It defines a set of protocols and algorithms via which a client can establish an encrypted communication channel to a server by a process called SSL Handshake. The SSL standard has been superseded by a newer specification called TLS (Transport Layer Security). For the rest of this text we will use the terms SSL and TLS interchangeably, as is common in a large part of the literature.

HTTPS (Hypertext Transfer Protocol Secure) is an updated version of the HTTP protocol that works over an SSL connection.

The following sections describe how to enable a Spring Boot application to receive requests and provide responses over SSL.

Server setup

  • First, we will create a file called a keystore, which will contain the server’s public key, public key certificate, and private key.
  • Choose a strong password to use to encrypt the keystore file.
  • Store the keystore password in some secret management service, e.g. AWS Secrets Manager. In the following steps, we assume you are using AWS Secrets Manager to store the secrets.
  • If you haven’t added the AWS Secrets Manager Java SDK to your project, do it now following the steps at Working with AWS Secrets Manager .
  • Create a bean in your service that, in its initialization, connects to the secret management service and retrieves the password. You can find an example of how to do this in Working with AWS Secrets Manager .
  • Create the keystore file with this command:
keytool -genkeypair -alias <key_store_alias> -keyalg RSA -keysize 2048 -storetype PKCS12 -keystore <key_store_file_name>.p12 -validity 3650

When it asks for the name of the server, enter localhost:

What is your first and last name?
    [Unknown]: localhost

This is important in order to be able to connect to the server using the hostname localhost, otherwise the hostname verification (which is part of HTTPS) will fail. Note: supposedly there is a way in Spring Webflux to disable hostname verification programmatically (see this StackOverflow question). However when I tried it, it didn’t work because the matches() method was never called. If you find a way to disable hostname verification in Webflux that works, let me know.

If you run the keytool command more than once to regenerate the keys, keep in mind that you will need to rebuild the server’s jar every time in order for the change to take effect.

  • Export the server’s public key certificate from the keystore file, so that it can be added in the client’s CA collection:
$ keytool -exportcert -rfc -keystore <key_store_file_name>.p12 -storetype PKCS12  -alias <key_store_alias> -file <server_public_key_file_name>-pub.crt
  • Add the keystore file to the application’s repo, in a suitable subdirectory within the resources folder.
  • Use the following snippet to create the ServerProperties bean with a factory method, and inside this method, set the keystore password using the bean that reads it from the secret management service (see Working with AWS Secrets Manager for a tutorial on how to create that bean):
import org.springframework.boot.autoconfigure.web.ServerProperties;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Bean;
import org.springframework.boot.web.server.Ssl;

@Configuration
public class SSLConfig {

    @Autowired
    private KeyStorePasswordRetriever keyStorePasswordRetriever;

    @Bean
    public ServerProperties serverProperties() {
        Ssl ssl = new Ssl();
        ssl.setKeyStorePassword(keyStorePasswordRetriever.getKeyStorePassword());
        ServerProperties serverProperties = new ServerProperties();
        serverProperties.setSsl(ssl);
        return serverProperties;
    }
}
  • Add the following to your application.properties:
# The format used for the keystore. It could be set to JKS in case it is a JKS file
server.ssl.key-store-type=PKCS12
# The path to the keystore containing the certificate e.g. classpath:com/sgonzalez/myservice/myservice-ssl-keystore.p12
server.ssl.key-store=classpath:<path_to_your_keystore.p12_file_in_your_classpath> 
# The alias mapped to the certificate
server.ssl.key-alias=<key_store_alias>
# Enable SSL so that the embedded server accepts only HTTPS requests:
server.ssl.enabled=true

Client setup

For every client that needs to connect to the application, do the following (we assume here that the client is also a Spring-based Java application):

  • Add this in the application.properties:
# The server's hostname (needs to match the name entered when creating the server's keystore)
serverHost=localhost
# The server's port number
serverPort=9000
# Connection timeout
serverConnectionTimeoutMillis=200
# TCP read timeout
serverReadTimeoutMillis=2400
# TCP write timeout
serverWriteTimeoutMillis=400
  • Configure the WebClient bean in the following way (this assumes the client is a Java Spring Boot application using Spring Webflux and Reactor Netty as container):
@Configuration
class MyClientConfiguration {

    @Value("${serverHost}")
    private String host;

    @Value("${serverPort}")
    private int port;

    @Value("${serverConnectionTimeoutMillis}")
    private int connectionTimeoutMillis;

    @Value("${serverReadTimeoutMillis}")
    private int readTimeoutMillis;

    @Value("${serverWriteTimeoutMillis}")
    private int writeTimeoutMillis;

    @Bean
    WebClient myWebClient() throws IOException {
        SslContext context = SslContextBuilder
            .forClient()
            .trustManager(new ClassPathResource("<public_key_certificate_file_name>-pub.crt", this.getClass()).getInputStream())
            .build();

        TcpClient tcpClient = TcpClient
            .create()
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, connectionTimeoutMillis)
            .doOnConnected(connection -> {
                connection.addHandlerLast(new ReadTimeoutHandler(readTimeoutMillis, TimeUnit.MILLISECONDS));
                connection.addHandlerLast(new WriteTimeoutHandler(writeTimeoutMillis, TimeUnit.MILLISECONDS));
            });

        HttpClient httpClient = HttpClient
            .from(tcpClient)
            .secure(sslContextSpec -> sslContextSpec
                .sslContext(context));

        return WebClient
            .builder()
            .baseUrl(String.format("https://%s:%d", host, port))
            .clientConnector(new ReactorClientHttpConnector(httpClient))
            .build();
    }
}

Troubleshooting notes

  • To attempt an SSL handshake to the locally running server (shows all certificates involved):
$ openssl s_client -connect localhost:9000
  • To check a certificate file in PEM format
$ openssl x509 -in myservice-pub.crt -text -noout
  • To check a certificate file in P12 format (you’ll have to enter the keystore password)
$ openssl pkcs12 -info -in keyStore.p12

Sources

Logging configuration in Spring Boot

Photo by Clay Banks on Unsplash

Problem

Spring Boot creates a default logging infrastructure with SLF4J as an abstraction layer over Logback. This works, but the default configuration is overly verbose when using other libraries like JDBC and WebClient. We want to configure the log levels with a configuration file so that our application’s logs are shown while library logs are reduced to errors or warnings.

Option 1: using Logback (recommended)

Since Logback is the logging framework that Spring Boot configures by default, this option is the simplest and should be preferred unless you have a reason to use a different logging system.

It’s worth noting that Spring Boot uses a Logback configuration that formats the logs in columns, which is not Logback’s default. It is desirable to keep this feature when we set up our own configuration, so our custom configuration is based on Spring’s (specifically the formatting part), with some tweaks for the level.

To configure Logback for use by the application and its unit tests, follow these steps:

  • Create a file named logback.xml in src/main/resources with the following content:
<configuration>
    <!--
      The patterns used here are copied from Spring Boot's default Logback configuration which is available at
      https://github.com/spring-projects/spring-boot/blob/v2.4.0/spring-boot-project/spring-boot/src/main/resources/org/springframework/boot/logging/logback/base.xml
    -->
    <conversionRule conversionWord="clr" converterClass="org.springframework.boot.logging.logback.ColorConverter" />
    <conversionRule conversionWord="wex" converterClass="org.springframework.boot.logging.logback.WhitespaceThrowableProxyConverter" />
    <conversionRule conversionWord="wEx" converterClass="org.springframework.boot.logging.logback.ExtendedWhitespaceThrowableProxyConverter" />

    <property name="CONSOLE_LOG_PATTERN" value="${CONSOLE_LOG_PATTERN:-%clr(%d{${LOG_DATEFORMAT_PATTERN:-yyyy-MM-dd HH:mm:ss.SSS}}){faint} %clr(${LOG_LEVEL_PATTERN:-%5p}) %clr(${PID:- }){magenta} %clr(---){faint} %clr([%15.15t]){faint} %clr(%-40.40logger{39}){cyan} %clr(:){faint} %m%n${LOG_EXCEPTION_CONVERSION_WORD:-%wEx}}"/>
    <property name="CONSOLE_LOG_CHARSET" value="${CONSOLE_LOG_CHARSET:-default}"/>
    <property name="LOG_FILE" value="${LOG_FILE:-${LOG_PATH:-${LOG_TEMP:-${java.io.tmpdir:-/tmp}}}/spring.log}"/>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>${CONSOLE_LOG_PATTERN}</pattern>
            <charset>${CONSOLE_LOG_CHARSET}</charset>
        </encoder>
    </appender>
    <logger name="<base_package>" level="info"/>
    <root level="warn">
        <appender-ref ref="CONSOLE" />
    </root>
</configuration>
  • Replace <base_package> with the name of a package in your application that is high enough in the package hierarchy that all packages in your application are under it.
  • All loggers that aren’t created by our application code (such as those created by external libraries) will inherit the configuration of the root logger, which we have set to level warning in our configuration.
  • All the loggers created by our application code will inherit the configuration of the logger we have defined in the file above, tied to package <base_package>. This is why this package needed to be high in the package hierarchy.
  • That’s it. When you run the application and when you run tests, Logback will detect this file and use it. Note it’s not necessary to create a separate logback-test.xml file for unit tests. Maven automatically copies logback.xml to the target folder and uses it when running tests, and it also includes it in the jar that the application runs from.
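
To confirm the configuration is picked up, any class in the application can obtain an SLF4J logger and log at the configured levels. The following is a minimal sketch, assuming the class lives under <base_package>:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingExample {

    // SLF4J delegates to Logback, which applies the levels configured in logback.xml
    private static final Logger log = LoggerFactory.getLogger(LoggingExample.class);

    public static void main(String[] args) {
        log.info("Shown, because loggers under <base_package> are configured at level info");
        log.debug("Suppressed, because no logger is configured at level debug");
    }
}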

Option 2: using Log4J

  • To configure the log level of Log4J with a properties file, you need Log4J 2.4 or later. In a Spring Boot project, you will probably be using the spring-boot-starter as parent. This starter depends on spring-boot-starter-logging, which uses Logback. So in order to use Log4J 2.4 we will need to exclude the dependency on spring-boot-starter-logging.
  • Use the following command to get your project’s dependency tree and find out which of your dependencies is bringing logback into the project:
mvn dependency:tree -Dverbose
  • Edit your pom, putting the following text inside that dependency:
<exclusions>
    <exclusion>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-logging</artifactId>
    </exclusion>
</exclusions>
  • Add the following dependencies to your pom:
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.5</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.5</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.30</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>1.7.30</version>
</dependency>
  • To configure the log level of unit tests, create a file named log4j2-test.properties in src/test/resources/. This is the configuration file Log4J uses for tests.
    1. It is important to use this file name, otherwise Log4J will not use it.
  • Edit the configuration file with the following content. Note: this will lose Spring’s formatting of the log messages in columns.
status = error
name = PropertiesConfig
 
filters = threshold
 
filter.threshold.type = ThresholdFilter
filter.threshold.level = debug
 
appenders = console
 
appender.console.type = Console
appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} [%t] %-5p %c{1}:%L - %m%n
 
rootLogger.level = info
rootLogger.appenderRefs = stdout
rootLogger.appenderRef.stdout.ref = STDOUT

Sources

Working with AWS Secrets Manager

Photo by Jason Dent on Unsplash

Problem

We want to be able to use secret credentials in our application without leaving them hard-coded or in a configuration file.

AWS Secrets Manager is an AWS service that provides us with an encrypted database where we can store our secrets securely in the cloud. Our application will connect to Secrets Manager at startup and retrieve all the secrets it needs.

Storing generic secrets in AWS Secrets Manager

  1. Open the AWS console and go to Services / Security, Identity, & Compliance / Secrets Manager.
  2. Click on Store a new secret.
  3. In Select secret type, choose “Other type of secrets”.
  4. Click on “Plaintext”.
    1. Why is this important? Storing secrets as plaintext instead of key-value pairs means that you won’t need to parse the secret contents as JSON when you retrieve them. For this reason the plaintext option should be preferred by default.
  5. Delete the empty JSON object that AWS puts in the plaintext field, and replace it with the secret value you want to store.
  6. In Select the encryption key, select “DefaultEncryptionKey“.
    1. AWS Secrets Manager works based on a separate AWS service called Key Management Service (KMS), which stores the key that will be used to encrypt your secret. AWS KMS doesn’t charge a fee if you use the default AWS managed key Secrets Manager creates in your account. If you choose to use a custom KMS key, then you can be charged at the standard AWS KMS rate.
  7. Click Next.
  8. In Secret name, enter a suitable name for the secret. This is the name we will use to retrieve the secret from the application. Secret name must contain only alphanumeric characters and the characters /_+=.@-.
  9. In Description, enter some text that describes the purpose of the secret.
  10. Click Next.
  11. In Configure automatic rotation, select “Disable automatic rotation”.
  12. Click Next.
  13. Click Store.

Retrieving secrets from AWS Secrets Manager

Follow these steps to retrieve a secret value from Secrets Manager in your application:

  1. If you haven’t done so, follow the steps at my previous post How to integrate a Java application with the AWS Java SDK to add the AWS SDK to the application.
  2. An EC2 instance can only have one role associated at any given time. This role can be changed at runtime, but the new role replaces the old one, so it’s not possible to attach more than one role to an instance. What you can do, though, is attach any number of policies (i.e. rules) to the existing role. To enable access to AWS Secrets Manager, attach the SecretsManagerReadWrite policy to the instance’s role.
  3. In the code, use the following code snippet to create a bean that queries the secret from Secrets Manager in its constructor (replace <the_secret_name> with the secret name and <the_aws_region_where_you_are_operating> with the appropriate region):
import com.amazonaws.services.secretsmanager.AWSSecretsManagerClientBuilder;
import com.amazonaws.services.secretsmanager.model.GetSecretValueRequest;
import com.amazonaws.services.secretsmanager.AWSSecretsManager;

import org.springframework.stereotype.Service;

@Service
public class MyPasswordRetriever {
    private String password;

    public MyPasswordRetriever() {
        this.password = "";

        String secretName = "<the_secret_name>";
        String region = "<the_aws_region_where_you_are_operating>";

        // Create a Secrets Manager client
        AWSSecretsManager client  = AWSSecretsManagerClientBuilder.standard()
                .withRegion(region)
                .build();

        GetSecretValueRequest getSecretValueRequest = new GetSecretValueRequest().withSecretId(secretName);
        // Retrieve value from Secrets Manager and decrypt it using the associated KMS CMK.
        this.password = client.getSecretValue(getSecretValueRequest).getSecretString();
    }

    // Expose the retrieved secret to the rest of the application
    public String getPassword() {
        return password;
    }
}
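
For reference, here is a minimal sketch of how another bean could consume the retrieved secret through the getter shown above; the consuming class and its method are hypothetical:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class MyPasswordConsumer {

    @Autowired
    private MyPasswordRetriever myPasswordRetriever;

    public void useSecret() {
        // Use the secret retrieved at startup, e.g. to authenticate against another service
        String password = myPasswordRetriever.getPassword();
        // ...
    }
}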

(Optional) Retrieving secrets from AWS Secrets Manager with the AWS CLI

  1. If you haven’t done so, install and configure the AWS CLI following the instructions at my previous post How to configure the AWS CLI.
  2. Use this command to retrieve a secret: aws secretsmanager get-secret-value --secret-id <the_secret_name>

Sources

How to configure the AWS CLI

Problem

We want to be able to use AWS services with a command-line interface.

The AWS CLI

The AWS CLI is a tool that allows us to control our AWS resources from the command line.

To install the AWS CLI locally (this is only necessary for running the CLI locally, the EC2 instances already have the CLI installed by default), follow the instructions at https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html.

To configure the AWS CLI after it’s been installed (this needs to be done on the EC2 instances and also locally if you wish to run locally):

  • SSH into the EC2 instance
  • Run the following commands:
$ sudo su
$ aws configure
  • The aws configure command will ask for some values such as the access key id and secret access key.
  • In default output format, enter json.
  • Note we have run aws configure with root permissions. This is important for some use cases. For example, if you are going to use the AWS CLI to query AWS Secrets Manager from a CodeDeploy stop script, it’s important to run aws configure on the EC2 instance as the root user, because that is the user with which the CodeDeploy agent runs the start and stop scripts. If you ran aws configure as the normal ec2-user, the stop script would not be able to use the AWS CLI.