Minimum Norm Solutions for Sparse Problems

Alec Jacobson

October 8, 2024

weblog/

I reimplemented the paper "Sag-Free Deformable Simulations" by Hsu et al. There's a part in the paper where you end up with an optimization problem of the form: $$ \min_{x \in \mathbb{R}^n} \frac{1}{2} \|Ax - b\|_2^2 \\ \text{subject to} \quad C x = 0, $$ where $A \in \mathbb{R}^{m \times n}$ with $\text{rank}(A) = m \lt n$, meaning that $A x = b$ is underconstrained and the normal equations system matrix $A^T A$ would be singular. Similarly, $C \in \mathbb{R}^{p \times n}$ with $\text{rank}(C) = p \lt n$. Indeed, there are many equally optimal solutions $x$ to this problem.

There are a variety of ways to adjust this problem so that we can robustly find a specific solution. What the following approaches have in common is that they each pick out a solution with minimum norm (e.g., $\|x\|_2$ or something similar).

Null-space method + Moore-Penrose Pseudoinverse

The Moore-Penrose pseudoinverse of a matrix $A$ is the classic way to solve under-determined problems of the form: $$ \min_{x \in \mathbb{R}^n} \|x\|_2^2 \quad \text{subject to} \quad Ax = b. $$ The pseudoinverse in this case is defined as $$ A^+ = A^T (A A^T)^{-1}. $$ And the solution is given by $$ x = A^+ b. $$
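Here's a minimal NumPy sketch of this, dense and just to make the formula concrete; it assumes $A$ has full row rank and min_norm_solve is just a name I made up:

import numpy as np

def min_norm_solve(A, b):
    # Minimum-norm solution x = A^T (A A^T)^{-1} b, assuming A has full row rank.
    return A.T @ np.linalg.solve(A @ A.T, b)

# np.linalg.lstsq(A, b, rcond=None)[0] and np.linalg.pinv(A) @ b return the same
# minimum-norm x for underdetermined systems.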

Adding the constraints $Cx = 0$ doesn't change things too much if we can identify a matrix $N \in \mathbb{R}^{n \times (n-p)}$ whose columns span the null space of $C$, so that $$ C N y = 0 \quad \text{for all} \quad y \in \mathbb{R}^{n-p}. $$

We substitute $x = N y$ into the original problem and get a minimum-norm problem over $y$: $$ \min_{y \in \mathbb{R}^{n-p}} \frac{1}{2} \|A N y - b\|_2^2. $$ With a solution: $$ x = N ((A N)^+ b). $$

If $N$ is orthonormal (i.e., $N^T N = I$), then $\|x\|_2 = \|N y\|_2 = \|y\|_2$, so the minimum-norm $y$ gives the minimum-norm $x$.
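As a dense sketch of this null-space method (scipy.linalg.null_space returns an orthonormal basis, so the minimum-norm $y$ does give the minimum-norm $x$; constrained_min_norm is an illustrative name, not from the paper):

import numpy as np
from scipy.linalg import null_space

def constrained_min_norm(A, b, C):
    # Orthonormal basis for {x : C x = 0}; N is n x (n - p).
    N = null_space(C)
    # Minimum-norm y for the reduced problem min ||A N y - b||, then map back.
    y = np.linalg.lstsq(A @ N, b, rcond=None)[0]
    return N @ y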

If $A$ and $C$ are large and sparse, then there are a couple of aspects of this method that require attention.

Sparse Null-space Bases are probably not Orthonormal

There are a number of ways to construct a sparse null-space basis $N$ for a (sparse) matrix $C$. Most of these will not result in an orthonormal basis.

Intuitively, $N$ defines a linear subspace (think of a high dimensional analog of an arbitrarily rotated plane in 3D). Each column of $N$ is a basis vector lying in that subspace. A sparse column means that the basis vector lies in the intersection of the subspace with the span of a small subset of the coordinate axes. We would like all columns of $N$ to be linearly independent. If we think of picking the columns one-by-one, then this means that the $i$-th column of $N$ should also not lie in the span of the first $i-1$ columns. Fortunately, this will generally be possible: there's always some intersection of the canonical coordinate axes with the remaining, not-yet-spanned part of the subspace to choose from.

However, if we further require that the columns of $N$ be orthogonal, then with each new column we're severely limiting the remaining space to choose from. For a random linear subspace, it's quite unlikely that we can avoid picking some (or all) dense directions.

This has potentially ignorable effects on our problem above. Using a non-orthogonal $N$ could result in a solution $x$ with a larger norm than the minimum-norm solution. If the goal is to just get some solution, then this is probably fine.

Sparse Pseudo-Inverse

The Moore-Penrose pseudoinverse $A^+$ is defined as $A^T (A A^T)^{-1}$, but numerically we don't usually realize this matrix. Instead, we either compute a full singular value decomposition $A = U \Sigma V^T$ and invert only the (numerically) non-zero singular values to form $A^+ = V \Sigma^+ U^T$, or we construct a Cholesky factorization of $A A^T$. For the constrained problem, we simply replace $A$ with $A N$ throughout.

If $A$ is large and sparse then a full SVD is off the table. We could use sparse Cholesky directly on $A A^T$ (or $A N N^T A^T$ for the constrained problem). If $A$ doesn't have full rank or is poorly conditioned, we can also use sparse QR to build a sparse pseudoinverse factorization. For example, following the method implemented by Bruno Luong, first compute a column pivoted QR decomposition of $A$ such that: $$ A P = Q R $$ where $P \in \mathbb{R}^{n \times n}$ is a permutation matrix, $Q \in \mathbb{R}^{m \times m}$ is a unitary matrix and $R \in \mathbb{R}^{m \times n}$ is upper triangular (since $R$ is in general rectangular, "upper triangular" here means that all entries $R_{ij} = 0$ for $i \gt j$).

Now, further conduct another QR decomposition of $R^T$: $$ R^T = S T $$ where $S \in \mathbb{R}^{n \times n}$ is unitary and $T \in \mathbb{R}^{n \times m}$ is upper triangular.

Substituting this above for $R$ we have: $$ A P = Q T^T S^T. $$

So, then the right inverse of $A$ (defined so that $A A^+ = I$) is: $$ A^+ = P S T^{-T} Q^T. $$ All of these matrices are sparse. However, the two calls to sparse QR are fairly expensive (10-100×) compared to similarly sparse Cholesky (LLT) or LDLT factorizations.
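For intuition, here's a dense sketch of the same two-QR construction using SciPy (for genuinely sparse $A$ you'd want SuiteSparseQR, as in Bruno Luong's MATLAB implementation; qr_pinv_solve is just an illustrative name and this assumes $A$ has full row rank):

import numpy as np
import scipy.linalg as sla

def qr_pinv_solve(A, b):
    m, n = A.shape
    # Column-pivoted QR: A[:, p] = Q @ R with Q (m x m) and R (m x n) upper triangular.
    Q, R, p = sla.qr(A, pivoting=True)
    # Second (economy) QR of R^T: R^T = S @ T with S (n x m) and T (m x m) upper triangular.
    S, T = sla.qr(R.T, mode='economic')
    # Apply the right inverse: x = P S T^{-T} Q^T b, undoing the column permutation.
    x = np.empty(n)
    x[p] = S @ sla.solve_triangular(T, Q.T @ b, trans='T')
    return x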

Ridge regression / Tikhonov Regularization / L₂ Regularization

The Sag-Free paper recommends weakly enforcing the minimum-norm solution by adding a small amount of L₂ regularization to the problem, scaled by a factor $\rho \gt 0$: $$ \min_{x \in \mathbb{R}^n} \frac{1}{2} \|Ax - b\|_2^2 + \frac{\rho}{2} \|x\|_2^2, $$ where we ignore the $C x = 0$ constraint in this section because it can be effortlessly tacked on to the following.

In theory, any finite $\rho \gt 0$ will result in a unique solution, because the system matrix of the Euler-Lagrange equations is invertible. The solution is given by: $$ x_\rho = (A^T A + \rho I)^{-1} A^T b. $$
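A sketch with SciPy's sparse modules (I'm using splu as a stand-in for a sparse Cholesky factorization, since SciPy doesn't ship one; a CHOLMOD binding like scikit-sparse would match the text better):

import scipy.sparse as sp
import scipy.sparse.linalg as spla

def ridge_solve(A, b, rho):
    # Solve (A^T A + rho I) x = A^T b.
    n = A.shape[1]
    H = (A.T @ A + rho * sp.identity(n)).tocsc()
    return spla.splu(H).solve(A.T @ b)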

In practice, choosing $\rho$ can be annoying. If $\rho$ is too small to overcome the poor conditioning of $A^T A$, then the numerical solver (e.g., sparse Cholesky factorization) could fail. If $\rho$ is too large, then we're effectively pulling $x_\rho \rightarrow 0$ and $x_\rho$ may not be a good approximation to a true solution of $A x = b$ (even though we know many exist).

Another way to look at it (see¹), is that the parameter $\rho$ can be understood as implicitly constraining the (generally non-zero and non-minimum) norm of the solution $\|x\|^2 = c$. Or written as an inequality constrained optimization problem: $$ x_\rho = \mathop{\text{argmin}}_{\|x\|^2 \le c} \frac{1}{2} \|Ax - b\|_2^2, $$ for an appropriate choice of $c$. We can identify the relationship with $\rho$ by looking at the optimality conditions of the inequality constrained problem. The first condition is that the gradient of the Lagrangian: $$ \frac{1}{2} \|Ax - b\|_2^2 + \nu (\|x\|^2 - c) $$ should be zero, where we introduce a Lagrange multiplier $\nu \gt 0$. The second is that $\nu (\|x\|^2 - c) = 0$. Both of these are solved by choosing $\nu = \rho/2$ and $c = \|x_\rho\|^2$. Unless $x = 0$ happens to solve $A x = b$, this $c$ will be smaller than the squared norm of the minimum-norm solution for any finite $\rho$.

If we just want to get some solution, then we could try to pick $\rho$ to be as small as our solver will tolerate. We might be able to guess a good $\rho$ value in advance. A more robust solution would be to do something like a binary search on $\rho$ values (cf. ²), but this potentially requires many expensive factorization calls.

Iterative Quasi-Newton

Instead of relying on expensive sparse QR or multiple factorization calls, we can leverage a single successful factorization (with a perhaps overly generous $\rho$) to refine an initial guess iteratively. This works similarly to a quasi-Newton solver or as a form of preconditioned descent.

Again, let's ignore the $C x = 0$ constraints for now. At each iteration $i$, we have our current solution $x_i$ and we look for an improvement $\Delta x$ that minimizes $$ \frac{1}{2} \|A(x_i + \Delta x) - b\|_2^2 = \frac{1}{2} \|A \Delta x - r_i\|_2^2, $$ where $r_i = b - A x_i$ is the current residual. Since we haven't really changed anything, we still have an underdetermined problem. However, now we can add the L₂ regularization term over $\Delta x$, effectively limiting its norm (see above). Each iteration amounts to solving: $$ \Delta x = (A^T A + \rho I)^{-1} A^T r_i. $$

Even for a reasonably large $\rho$ (e.g., 1e-6), where we are highly confident that the factorization will succeed, this method can converge quickly. Since each iteration reuses the same factorization, we spend our computation on cheap back-/forward-substitutions rather than hunting over factorizations for the optimal $\rho$.

Nevertheless, we can repeatedly increase $\rho$ if that initial factorization fails.

A complete algorithm has the form:

ρ = 1e-6
repeat:
  try:
    factorize(A^T A + ρ I)
    break
  catch:
    ρ *= 10
x = 0
r = b
repeat:
  ∆x = solve(A^T A + ρ I, A^T r)
  x += ∆x
  r = b - A x
  if ‖r‖ < 1e-16:
    break
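
Here's a runnable sketch of that loop with SciPy (again with splu standing in for sparse Cholesky; splu won't reject a merely ill-conditioned matrix the way a Cholesky factorization would, so the retry loop matters less here, and the name iterative_min_norm and the iteration cap are mine):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def iterative_min_norm(A, b, rho=1e-6, tol=1e-16, max_iter=100):
    n = A.shape[1]
    # Increase rho until the factorization succeeds.
    while True:
        try:
            solver = spla.splu((A.T @ A + rho * sp.identity(n)).tocsc())
            break
        except RuntimeError:
            rho *= 10
    x = np.zeros(n)
    r = np.asarray(b, dtype=float).copy()
    for _ in range(max_iter):
        # Reuse the factorization: dx = (A^T A + rho I)^{-1} A^T r.
        dx = solver.solve(A.T @ r)
        x += dx
        r = b - A @ x
        if np.linalg.norm(r) < tol:
            break
    return x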

There are a couple of ways to handle the $C x = 0$ constraints. The original optimization problem has optimality conditions of the form: $$ \begin{bmatrix} A^T A & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} A^T b \\ 0 \end{bmatrix}, $$ which we could capture with new variables: $$ K z = d, $$ with $K \in \mathbb{R}^{(n+p) \times (n+p)}$ and $z,d \in \mathbb{R}^{n+p}$. Given the definitions of $A$ and $C$ above, this equation over $z$ is still underdetermined. We can treat $A,b,x \leftarrow K,d,z$ in the algorithm above directly. The downsides are: we end up squaring an $(n+p) \times (n+p)$ matrix (increasing the number of non-zeros) and we're effectively raising $A$ to the power of 4 (along with its condition number). The upside is that $K^T K + \rho I$ is positive definite and we can use Cholesky factorization.
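Assembling $K$ and $d$ and feeding them to the loop above might look like this sketch (kkt_min_norm is just a name for illustration; it reuses the iterative_min_norm sketch from above):

import numpy as np
import scipy.sparse as sp

def kkt_min_norm(A, b, C, rho=1e-6):
    n = A.shape[1]
    p = C.shape[0]
    # K z = d, the KKT system; None stands for the zero block.
    K = sp.bmat([[A.T @ A, C.T], [C, None]], format='csr')
    d = np.concatenate([A.T @ b, np.zeros(p)])
    # Treat A, b, x <- K, d, z and discard the multipliers.
    z = iterative_min_norm(K, d, rho=rho)
    return z[:n]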

Alternatively, we can add the L₂ regularization directly to the upper left block of the system matrix during the direct application of the algorithm to $A,b,x$: $$ \begin{bmatrix} A^T A + \rho I & C^T \\ C & 0 \end{bmatrix}. $$ This matrix is sparser, better conditioned, and also non-singular. However, it is indefinite, so it requires an LDLT (or LU) factorization, which is 2-4× more expensive than sparse Cholesky (LLT).
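The regularized-KKT variant of the iteration might look like the following sketch (SciPy has no sparse LDLT, so splu's LU factorization stands in; the stopping test on the update size is mine, since the constraints may prevent the residual itself from reaching zero):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def constrained_iterative_min_norm(A, b, C, rho=1e-6, tol=1e-16, max_iter=100):
    n = A.shape[1]
    p = C.shape[0]
    # Regularize only the upper-left block; the matrix is indefinite but non-singular.
    K = sp.bmat([[A.T @ A + rho * sp.identity(n), C.T], [C, None]], format='csc')
    solver = spla.splu(K)
    x = np.zeros(n)
    r = np.asarray(b, dtype=float).copy()
    for _ in range(max_iter):
        # Solve for [dx; lambda] with right-hand side [A^T r; 0]; keep only dx.
        dx = solver.solve(np.concatenate([A.T @ r, np.zeros(p)]))[:n]
        x += dx
        r = b - A @ x
        if np.linalg.norm(dx) < tol:
            break
    return x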

All of the alternatives above worked really well for the Sag-Free problem. The fastest was doing sparse Cholesky on $A N N^T A^T$. The most accurate was the iterative solver (still pretty fast). The slowest were the QR-based methods. Other options found online like preconditioned lsqr or pcg were abysmally slow. I'm quite satisfied that such a minimal loop around the L₂ regularization can be so effective and general purpose.