Matrix Calculus Notation
This article discusses notational practices in matrix calculus and recommends a mnemonic inspired by the outer product.
Background and Motivation
In the context of differentiation, Leibniz’s notation ${\big(\frac{\mathrm{d}f}{\mathrm{d}x}\big)}$ is typically used for functions ${f: \mathbb{R} \to \mathbb{R}}$, i.e., scalar-valued functions of a real variable, and is a shorthand representing ${\lim_{\Delta x \to 0} \frac{\Delta f}{\Delta x}}$. In higher-dimensional settings, a function can be multivariate (multidimensional input), vector-valued (multidimensional output), or both. In general, an input ${\mathbf{x} \in \mathbb{R}^n}$ is mapped by the function ${\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m}$ where ${n, m \in \mathbb{N}^*}$. We recall common examples of derivatives in higher dimensions:
- The gradient ${\nabla f(\mathbf{x}) = \left[\dfrac{\partial f}{\partial x_i} \right]_i}$ of a scalar-valued multivariate function ${f: \mathbb{R}^n \to \mathbb{R}}$ is the column vector whose components are the partial derivatives of $f$.
- The Jacobian $\mathrm{J}_\mathbf{f}(\mathbf{x}) = \left[\dfrac{\partial f_i}{\partial x_j} \right]_{ij} = [\nabla f_i(\mathbf{x})^\mathsf{T}]_i$ of a vector-valued multivariate function ${\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m}$ is the matrix whose rows are the gradient transposes of the components of $\mathbf{f}$.
Additional examples include the Hessian matrix $\mathrm{H}_f(\mathbf{x})$, among others.
At this stage, what we seek is a notational system for conveniently aggregating derivatives in higher dimensions in the style of Leibniz’s notation. This is the purpose of matrix calculus.
Notational Objective
The main question is then how to interpret the operator $\frac{\partial \Box_1}{\partial \Box_2}$ when at least one of its arguments $\Box_i$ is not a scalar. We are especially interested in the cases: scalar–vector ${(\Box_1, \Box_2) \in \mathbb{R} × \mathbb{R}^n}$, vector–scalar ${\mathbb{R}^m × \mathbb{R}}$, vector–vector ${\mathbb{R}^m × \mathbb{R}^n}$, scalar–matrix ${\mathbb{R} × \mathbb{R}^{m×n}}$, and matrix–scalar ${\mathbb{R}^{m×n} × \mathbb{R}}$.
Furthermore, this operator is non-commutative, meaning that in general ${\frac{\partial \Box_1}{\partial \Box_2} \neq \frac{\partial \Box_2}{\partial \Box_1}}$. Moreover, the resulting partial derivatives can be arranged in two ways, so there are two distinct interpretations of $\frac{\partial \Box_1}{\partial \Box_2}$ when reading it.
For example, the expression ${\frac{\partial \mathbf{f}}{\partial \mathbf{x}}}$ is a matrix containing partial derivatives. However, we can index these as $[\frac{\partial f_i}{\partial x_j}]_{ij}$ or as $[\frac{\partial f_j}{\partial x_i}]_{ij}$. That is, either the numerator or the denominator can use the row index $i$. Herein lie the two layout conventions:
- Numerator layout: The numerator uses the row index $i$, and so ${\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = [\frac{\partial f_i}{\partial x_j}]_{ij} = \mathrm{J}_\mathbf{f}(\mathbf{x})}$ is the Jacobian whose rows are gradient transposes.
- Denominator layout: The denominator uses the row index $i$, and so ${\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = [\frac{\partial f_j}{\partial x_i}]_{ij} = \mathrm{J}_\mathbf{f}(\mathbf{x})^\mathsf{T}}$ is the Jacobian transpose whose columns are gradients.
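The two layouts can also be compared numerically. The following is a minimal Python sketch, not a definitive implementation: it estimates the partial derivatives by central differences and arranges them both ways. The helper names (`jacobian_numerator`, `jacobian_denominator`), the step size `h`, and the example function `f` are assumptions introduced here for illustration.

```python
def jacobian_numerator(f, x, h=1e-6):
    """Central-difference estimate of [d f_i / d x_j]_{ij} (numerator layout)."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J


def jacobian_denominator(f, x, h=1e-6):
    """Same partial derivatives arranged as [d f_j / d x_i]_{ij} (denominator layout)."""
    return [list(col) for col in zip(*jacobian_numerator(f, x, h))]


# Hypothetical example function f: R^2 -> R^3 (an assumption for illustration).
def f(x):
    return [x[0] * x[1], x[0] + 2.0 * x[1], x[1] ** 2]


x0 = [1.0, 2.0]
J_num = jacobian_numerator(f, x0)    # 3 rows (components of f), 2 columns (components of x)
J_den = jacobian_denominator(f, x0)  # 2 rows (components of x), 3 columns (components of f)
```

Here `J_num` has one row per component of `f`, while `J_den` is its transpose with one row per component of `x`.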
Our goal in the remainder of this article is to examine all sources of nuance in interpreting and writing matrix calculus notation. Furthermore, we will design a mnemonic and an explicit notation to disambiguate these nuances directly for the purposes of this discussion.
Formal Discussion
We will formally analyze the operator $\frac{\partial \Box_1}{\partial \Box_2}$, which we will call the ‘matrix derivative’. The types of input objects $\Box_i$ we are interested in are scalars $\mathbb{R}$ (order $0$, no indices), vectors $\mathbb{R}^n$ (order $1$, one index), and matrices $\mathbb{R}^{m×n}$ (order $2$, two indices). Intuitively, the ‘matrix derivative’ handles at most two indices coming from its arguments. Therefore, the ‘matrix derivative’ is defined when the orders of its arguments $\Box _{1}$ and $\Box _{2}$ sum to at most $2$; that is: $$ \operatorname{order} \left[\frac{\partial \Box_1}{\partial \Box_2}\right] = \operatorname{order} [\Box_1] + \operatorname{order} [\Box_2] \leq 2 $$
The output $\frac{\partial \Box_1}{\partial \Box_2}$ would be a scalar, vector, or matrix. Before proceeding with the definition of $\frac{\partial \Box_1}{\partial \Box_2}$, we define its underlying operations. For a matrix $\mathrm{X} \in \mathbb{R}^{m×n}$:
- We define the element-wise differential to be ${\partial \mathrm{X} = [\partial x_{ij}]_{ij}}$. This also covers the special cases of vectors ${\partial \mathbf{x} = [\partial x_{i}]_{i}}$ and scalars $\partial x$.
- We use the Hadamard inverse (element-wise reciprocal) ${\mathrm{X}^{\circ - 1} = [\frac{1}{x_{ij}}]_{ij}}$, which also covers vectors ${[\frac{1}{x_{i}}]_{i}}$ and scalars $\frac{1}{x}$.
- We will use the shorthand ${\widetilde{\partial \mathrm{X}} = (\partial \mathrm{X})^{\circ - 1} = [\frac{1}{\partial x_{ij}}]_{ij}}$ to succinctly compose these operations.
With these, we can define $\frac{\partial \Box_1}{\partial \Box_2}$. However, as mentioned earlier, there are two layout conventions. We denote numerator layout as $\overline{[\frac{\partial \Box_1}{\partial \Box_2}]}$ and denominator layout as $\underline{[\frac{\partial \Box_1}{\partial \Box_2}]}$, for which we define the ‘matrix derivative’ as:
$$ \overline{\left[\frac{\partial \Box _{1}}{\partial \Box _{2}}\right]} = \partial \Box _{1} \; \widetilde{\partial {\Box} _{2}}^{\mathsf{T}}, \quad \underline{\left[\frac{\partial \Box _{1}}{\partial \Box _{2}}\right]} = \widetilde{\partial {\Box} _{2}} \; \partial \Box _{1}^{\mathsf{T}} $$
Observe that $\overline{[\frac{\partial \Box_1}{\partial \Box_2}]} = \underline{[\frac{\partial \Box_1}{\partial \Box_2}]}^\mathsf{T}$. It is also important to be aware that the product used in the above equation is a placeholder product, which can either be a scalar–matrix product (left or right, since it is commutative) or a matrix product (requiring compatible dimensions).
Finally, as a (notational) mnemonic we take inspiration from the pattern ${\mathbf{u} \otimes \mathbf{v} = \mathbf{u}\mathbf{v}^\mathsf{T}}$ in the outer product ($\otimes$) of vectors. We use the symbol $\boxtimes$ accordingly to write the shorthands: $$\tag{3} \boxed{ \begin{aligned} \overline{\left[\frac{\partial f}{\partial x}\right]} &= \partial f \boxtimes \widetilde{\partial x} = \partial f \; \widetilde{\partial x}^{\mathsf{T}}\\ \underline{\left[\frac{\partial f}{\partial x}\right]} &= \widetilde{\partial x} \boxtimes \partial f = \widetilde{\partial x} \;\partial f^{\mathsf{T}} \end{aligned} } $$
In accordance with the earlier warning, these are merely shorthands for notational convenience and do not represent rich formal meanings.
Shape determination
We can infer that in numerator layout the shape is $\dim fx^\mathsf{T}$:
- $f$ determines the shape if $x$ is scalar.
- $x^\mathsf{T}$ determines the shape if $f$ is scalar.
- $f$ determines the rows and $x^\mathsf{T}$ the columns if both are vectors.
By contrast, in denominator layout the shape is $\dim xf^\mathsf{T}$:
- $x$ determines the shape if $f$ is scalar.
- $f^\mathsf{T}$ determines the shape if $x$ is scalar.
- $x$ determines the rows and $f^\mathsf{T}$ the columns if both are vectors.
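The shape rules above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: operands are described by `(rows, cols)` tuples with scalars written as `(1, 1)`, and the function names `order`, `product_shape`, and `derivative_shape` are hypothetical helpers, not part of any library.

```python
def order(shape):
    """Number of non-trivial axes: 0 for a scalar, 1 for a vector, 2 for a matrix."""
    return sum(1 for d in shape if d > 1)


def product_shape(a, b):
    """Shape of the placeholder product: scalar-matrix product or matrix product."""
    if a == (1, 1):
        return b
    if b == (1, 1):
        return a
    if a[1] != b[0]:
        raise ValueError(f"incompatible shapes {a} and {b}")
    return (a[0], b[1])


def derivative_shape(f_shape, x_shape, layout="numerator"):
    """Shape of the 'matrix derivative' of f w.r.t. x in the given layout."""
    if order(f_shape) + order(x_shape) > 2:
        raise ValueError("orders of the arguments must sum to at most 2")
    f_t, x_t = f_shape[::-1], x_shape[::-1]  # shapes of the transposed operands
    if layout == "numerator":
        return product_shape(f_shape, x_t)   # dim of f x^T
    if layout == "denominator":
        return product_shape(x_shape, f_t)   # dim of x f^T
    raise ValueError(f"unknown layout {layout!r}")


vec_vec_num = derivative_shape((3, 1), (2, 1), "numerator")      # (3, 2)
vec_vec_den = derivative_shape((3, 1), (2, 1), "denominator")    # (2, 3)
scalar_vec_num = derivative_shape((1, 1), (4, 1), "numerator")   # (1, 4)
scalar_vec_den = derivative_shape((1, 1), (4, 1), "denominator") # (4, 1)
```

Treating every `(1, 1)` operand as a scalar is a simplification made for this sketch.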
Concrete Cases
We evaluate below all cases covered by this notation.
Scalar–scalar: For $f, x \in \mathbb{R}$, $$ \begin{aligned} \overline{\left[\frac{\partial f}{\partial x}\right]} &= \partial f \; \widetilde{\partial x}^{\mathsf{T}} = \partial f \frac{1}{\partial x} = \frac{\partial f}{\partial x} \\ \underline{\left[\frac{\partial f}{\partial x}\right]} &= \widetilde{\partial x} \; \partial f^{\mathsf{T}} = \frac{1}{\partial x} \partial f = \frac{\partial f}{\partial x} \end{aligned} $$ Unsurprisingly, the notations yield the same result since $[\frac{\partial f}{\partial x}]^\mathsf{T} = \frac{\partial f}{\partial x}$.
Scalar–vector: For $f \in \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^n$ (column), $$ \begin{aligned} \overline{\left[\frac{\partial f}{\partial \mathbf{x}}\right]} &= \partial f \; \widetilde{\partial \mathbf{x}}^{\mathsf{T}} = \left[\frac{\partial f}{\partial x_i}\right]_i^\mathsf{T} = \begin{bmatrix}\frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n}\end{bmatrix} \\ \underline{\left[\frac{\partial f}{\partial \mathbf{x}}\right]} &= \widetilde{\partial \mathbf{x}} \; \partial f^{\mathsf{T}} = \left[\frac{\partial f}{\partial x_i}\right]_i = \begin{bmatrix}\frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n}\end{bmatrix}^{\mathsf{T}} \end{aligned} $$ As expected, the notations yield results that are transposes of one another. The above results also reveal why numerator layout is often called ‘Jacobian convention’ whereas denominator layout is often called ‘gradient convention’.
Scalar–matrix: For $f \in \mathbb{R}$ and $\mathrm{X} \in \mathbb{R}^{m×n}$, $$ \begin{aligned} \overline{\left[\frac{\partial f}{\partial \mathrm{X}}\right]} &= \partial f \; \widetilde{\partial \mathrm{X}}^{\mathsf{T}} = \left[\frac{\partial f}{\partial x_{ji}}\right]_{ij} = \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{m1}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{1n}} & \cdots & \frac{\partial f}{\partial x_{mn}} \end{bmatrix} \\ \underline{\left[\frac{\partial f}{\partial \mathrm{X}}\right]} &= \widetilde{\partial \mathrm{X}} \; \partial f^{\mathsf{T}} = \left[\frac{\partial f}{\partial x_{ij}}\right]_{ij} = \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1n}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{m1}} & \cdots & \frac{\partial f}{\partial x_{mn}} \end{bmatrix} \end{aligned} $$ The previous two cases (scalar–scalar, scalar–vector) are special cases of this one. Results can be obtained for row vectors by setting $m=1$.
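As a numerical illustration of the scalar–matrix case, the sketch below uses the hypothetical function $f(\mathrm{X}) = \mathbf{a}^\mathsf{T}\mathrm{X}\mathbf{b}$, whose partial derivatives are $\frac{\partial f}{\partial x_{ij}} = a_i b_j$. The function, the vectors `a` and `b`, and the helper `grad_wrt_matrix` are assumptions chosen for the example; derivatives are estimated by central differences rather than computed symbolically.

```python
def grad_wrt_matrix(f, X, h=1e-6):
    """Central-difference estimate of [d f / d x_ij]_{ij} (denominator layout, shape of X)."""
    m, n = len(X), len(X[0])
    G = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            Xp = [row[:] for row in X]
            Xm = [row[:] for row in X]
            Xp[i][j] += h
            Xm[i][j] -= h
            G[i][j] = (f(Xp) - f(Xm)) / (2 * h)
    return G


a = [1.0, 2.0]        # length m = 2
b = [3.0, 4.0, 5.0]   # length n = 3


def f(X):
    # f(X) = a^T X b = sum over i, j of a_i * x_ij * b_j
    return sum(a[i] * X[i][j] * b[j] for i in range(2) for j in range(3))


X0 = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
G_den = grad_wrt_matrix(f, X0)              # m x n, entries close to a_i * b_j
G_num = [list(col) for col in zip(*G_den)]  # n x m, the transpose (numerator layout)
```

The denominator-layout result has the same shape as $\mathrm{X}$, whereas the numerator-layout result has the shape of $\mathrm{X}^\mathsf{T}$.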
Vector–scalar: For $\mathbf{f} \in \mathbb{R}^m$ (column) and $x \in \mathbb{R}$, $$ \begin{aligned} \overline{\left[\frac{\partial \mathbf{f}}{\partial x}\right]} &= \partial \mathbf{f} \; \widetilde{\partial x}^{\mathsf{T}} = \left[\frac{\partial f_i}{\partial x}\right]_i = \begin{bmatrix}\frac{\partial f_1}{\partial x} & \cdots & \frac{\partial f_m}{\partial x}\end{bmatrix}^{\mathsf{T}} \\ \underline{\left[\frac{\partial \mathbf{f}}{\partial x}\right]} &= \widetilde{\partial x} \; \partial \mathbf{f}^{\mathsf{T}} = \left[\frac{\partial f_i}{\partial x}\right]_i^\mathsf{T} = \begin{bmatrix}\frac{\partial f_1}{\partial x} & \cdots & \frac{\partial f_m}{\partial x}\end{bmatrix} \end{aligned} $$ These are the derivatives of a vector-valued function w.r.t. a scalar parameter. Practically, this is component-wise differentiation w.r.t. a real variable.
Matrix–scalar: For $\mathrm{F} \in \mathbb{R}^{m×n}$ and $x \in \mathbb{R}$, $$ \begin{aligned} \overline{\left[\frac{\partial \mathrm{F}}{\partial x}\right]} &= \partial \mathrm{F} \; \widetilde{\partial x}^{\mathsf{T}} = \left[\frac{\partial f_{ij}}{\partial x}\right]_{ij} = \begin{bmatrix} \frac{\partial f_{11}}{\partial x} & \cdots & \frac{\partial f_{1n}}{\partial x}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{m1}}{\partial x} & \cdots & \frac{\partial f_{mn}}{\partial x} \end{bmatrix} \\ \underline{\left[\frac{\partial \mathrm{F}}{\partial x}\right]} &= \widetilde{\partial x} \; \partial \mathrm{F}^{\mathsf{T}} = \left[\frac{\partial f_{ji}}{\partial x}\right]_{ij} = \begin{bmatrix} \frac{\partial f_{11}}{\partial x} & \cdots & \frac{\partial f_{m1}}{\partial x}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{1n}}{\partial x} & \cdots & \frac{\partial f_{mn}}{\partial x} \end{bmatrix} \end{aligned} $$ Similar to the preceding case, practically this is element-wise differentiation w.r.t. a real variable.
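As a small worked illustration, consider the hypothetical matrix-valued function below with $m = 2$ and $n = 3$ (this particular $\mathrm{F}$ is an assumption chosen for the example). Differentiating each entry w.r.t. $x$ and arranging the results according to the index rules $[\frac{\partial f_{ij}}{\partial x}]_{ij}$ and $[\frac{\partial f_{ji}}{\partial x}]_{ij}$ gives: $$ \mathrm{F}(x) = \begin{bmatrix} x & x^2 & x^3 \\ 1 & \sin x & e^x \end{bmatrix}, \quad \overline{\left[\frac{\partial \mathrm{F}}{\partial x}\right]} = \begin{bmatrix} 1 & 2x & 3x^2 \\ 0 & \cos x & e^x \end{bmatrix}, \quad \underline{\left[\frac{\partial \mathrm{F}}{\partial x}\right]} = \begin{bmatrix} 1 & 0 \\ 2x & \cos x \\ 3x^2 & e^x \end{bmatrix} $$ The numerator-layout result keeps the $2 × 3$ shape of $\mathrm{F}$, while the denominator-layout result is its $3 × 2$ transpose.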
Vector–vector: For $\mathbf{f} \in \mathbb{R}^m$ (column) and $\mathbf{x} \in \mathbb{R}^n$ (column), $$ \begin{aligned} \overline{\left[\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right]} &= \partial \mathbf{f} \; \widetilde{\partial \mathbf{x}}^{\mathsf{T}} = \left[\frac{\partial f_i}{\partial x_j}\right]_{ij} = \begin{bmatrix} \frac{\partial f_{1}}{\partial x_{1}} & \cdots & \frac{\partial f_{1}}{\partial x_{n}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{m}}{\partial x_{1}} & \cdots & \frac{\partial f_{m}}{\partial x_{n}} \end{bmatrix} \\ \underline{\left[\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right]} &= \widetilde{\partial \mathbf{x}} \; \partial \mathbf{f}^{\mathsf{T}} = \left[\frac{\partial f_j}{\partial x_i}\right]_{ij} = \begin{bmatrix} \frac{\partial f_{1}}{\partial x_{1}} & \cdots & \frac{\partial f_{m}}{\partial x_{1}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{1}}{\partial x_{n}} & \cdots & \frac{\partial f_{m}}{\partial x_{n}} \end{bmatrix} \end{aligned} $$
When both vectors are column vectors, the ‘matrix derivative’ is their outer product yielding a matrix.
When both vectors are row vectors, the ‘matrix derivative’ is only defined for $m = n$, becoming an inner product and yielding the scalar expression $\sum_{i=1}^n \frac{\partial f_i}{\partial x_i} = \frac{\partial f_1}{\partial x_1} + \cdots + \frac{\partial f_n}{\partial x_n}$, which is not particularly useful.
When the vectors differ in type (row, column), the ‘matrix derivative’ is undefined due to incompatible dimensions.
Sources of Ambiguity
When reading a multivariate mathematical treatment that uses matrix calculus notation, ambiguity can arise due to:
- Unclear layout convention: Is the notation $\frac{\partial f}{\partial x}$ implicitly assuming numerator layout $\overline{\left[\frac{\partial f}{\partial x}\right]}$ or denominator layout $\underline{\left[\frac{\partial f}{\partial x}\right]}$?
- Inconsistent layout convention: Is the notation $\frac{\partial f}{\partial x}$ throughout the treatment always either numerator layout or denominator layout? In some texts, there is an implicit mixed convention where some expressions use one layout convention and other expressions use another.
- Unclear operand representation: Are the operands $f$ and $x$ implicitly transposed? For a vector, is it row or column? Some notations apply a transposition $\frac{\partial f}{\partial x'}$ (where $x'$ means $x^\mathsf{T}$); our rules in Eq.$(3)$ still apply after all operand representation and shape ambiguities are clarified.
- Unclear derivative notation: As explained in the first warning in this article, do the symbols such as $\nabla$ retain their conventional meaning? Is the Jacobian definition transposed w.r.t. gradients? Similar verifications may be needed.
Effective disambiguation
Given the above sources of notational ambiguity, it is important to align our reading of the notation with the meaning intended by the authors of a text. To this end, one should:
- Identify whether numerator or denominator layout is used in a particular expression.
- Identify whether usage is consistent or mixed and verify individual expressions accordingly.
- Determine the intended representation of the operands (transpose, row, column, …).
- Verify the intended meaning or definition of differentiation-related symbols and notation.
Example Usage
We demonstrate below practical usage of such matrix calculus notation.
Example 1: Differentiate the affine function ${\mathbf{f}(\mathbf{x}) = \mathrm{A} \mathbf{x} + \mathbf{b}}$ w.r.t. ${\mathbf{x} \in \mathbb{R}^{n×1}}$ using numerator layout. Here ${\mathrm{A} \in \mathbb{R}^{m×n}}$ and ${\mathbf{b} \in \mathbb{R}^{m×1}}$ do not depend on ${\mathbf{x}}$. $$ \begin{aligned} \frac{\partial (\mathrm{A}\mathbf{x} + \mathbf{b})}{\partial \mathbf{x}} &= \left[\frac{\partial (\sum_{k=1}^{n} \mathrm{A}_{ik} x_k + b_i)}{\partial x_j}\right]_{ij} \\ &= \left[\frac{\mathrm{A}_{ij} \partial x_j}{\partial x_j}\right]_{ij} \\ &= \mathrm{A} \end{aligned} $$ In numerator layout, this is equivalent to the Jacobian ${\mathrm{J}_\mathbf{f}(\mathbf{x}) = \mathrm{A}}$ (assuming the definition of the Jacobian as used in this article). If we switch to denominator layout, the result would be $\frac{\partial}{\partial \mathbf{x}}(\mathrm{A}\mathbf{x} + \mathbf{b}) = \mathrm{A}^\mathsf{T} = \mathrm{J}_\mathbf{f}(\mathbf{x})^\mathsf{T}$.
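The result of Example 1 can be sanity-checked numerically. The sketch below is a hedged illustration rather than a proof: the entries of `A`, `b`, and the evaluation point `x0` are arbitrary values assumed for the example, and `jacobian_numerator` is a hypothetical helper based on central differences.

```python
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # m = 2, n = 3 (arbitrary example values)
b = [0.5, -0.5]


def f(x):
    # Affine map f(x) = A x + b, written out component-wise.
    return [sum(A[i][k] * x[k] for k in range(3)) + b[i] for i in range(2)]


def jacobian_numerator(f, x, h=1e-6):
    """Central-difference estimate of [d f_i / d x_j]_{ij} (numerator layout)."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J


x0 = [0.3, -1.2, 2.0]
J_num = jacobian_numerator(f, x0)           # should be close to A
J_den = [list(col) for col in zip(*J_num)]  # should be close to A^T

# Largest absolute deviation between the estimate and A.
max_err = max(abs(J_num[i][j] - A[i][j]) for i in range(2) for j in range(3))
```

Because the function is affine, the estimate does not depend on the choice of `x0` beyond floating-point rounding.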
Example 2: Differentiate the quadratic form ${f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^\mathsf{T}\mathrm{A}\mathbf{x}}$ using denominator layout. Here ${\mathrm{A} \in \mathbb{R}^{n×n}}$ does not depend on ${\mathbf{x} \in \mathbb{R}^{n×1}}$. $$ \begin{aligned} \frac{\partial (\tfrac{1}{2} \mathbf{x}^\mathsf{T}\mathrm{A}\mathbf{x})}{\partial \mathbf{x}} &= \frac{1}{2} \left[\frac{\partial (\sum_{kl} x_k \mathrm{A}_{kl} x_l)}{\partial x_i}\right]_{i} \\ &= \frac{1}{2} \left[\sum_{k} x_k \mathrm{A}_{ki} + \sum_{l} \mathrm{A}_{il} x_l\right]_{i} \\ &= \tfrac{1}{2} (\mathrm{A}^\mathsf{T} \mathbf{x} + \mathrm{A}\mathbf{x}) \\ &= \tfrac{1}{2} (\mathrm{A}^\mathsf{T} + \mathrm{A})\mathbf{x} \end{aligned} $$ In denominator layout, this is equivalent to the gradient $\nabla f(\mathbf{x}) = \tfrac{1}{2} (\mathrm{A}^\mathsf{T} + \mathrm{A})\mathbf{x}$. When $\mathrm{A}$ is symmetric, this simplifies to $\frac{\partial}{\partial \mathbf{x}}(\tfrac{1}{2} \mathbf{x}^\mathsf{T}\mathrm{A}\mathbf{x}) = \mathrm{A} \mathbf{x}$. If we switch to numerator layout, the result would be ${\frac{\partial f}{\partial \mathbf{x}} = \tfrac{1}{2} \mathbf{x}^\mathsf{T} (\mathrm{A} + \mathrm{A}^\mathsf{T})= \nabla f(\mathbf{x})^\mathsf{T}}$.
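The closed form of Example 2 can likewise be checked numerically. In this sketch the non-symmetric matrix `A`, the point `x0`, and the helper names `gradient` and `closed_form` are assumptions made for illustration; the gradient is estimated by central differences and compared against $\tfrac{1}{2}(\mathrm{A}^\mathsf{T} + \mathrm{A})\mathbf{x}$.

```python
A = [[2.0, 1.0, 0.0],
     [3.0, 4.0, -1.0],
     [0.0, 5.0, 6.0]]   # deliberately non-symmetric example values
n = 3


def f(x):
    # Quadratic form f(x) = 1/2 * x^T A x.
    return 0.5 * sum(x[k] * A[k][l] * x[l] for k in range(n) for l in range(n))


def gradient(f, x, h=1e-6):
    """Central-difference estimate of [d f / d x_i]_i (denominator layout, column)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g


def closed_form(x):
    # Component i of 1/2 * (A^T + A) x.
    return [0.5 * sum((A[k][i] + A[i][k]) * x[k] for k in range(n)) for i in range(n)]


x0 = [1.0, -2.0, 0.5]
g_num = gradient(f, x0)
g_ref = closed_form(x0)

# Largest absolute deviation between the estimate and the closed form.
max_err = max(abs(u - v) for u, v in zip(g_num, g_ref))
```

In numerator layout the same numbers would be read as a row vector, i.e. the transpose of `g_ref`.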