On recovering matrices


Most of my research is focused on predicting matrix-valued data ('pairwise learning'). This works by either having features of the rows and columns or by exploiting some kind of structure in the matrix. One particularly appealing type of structure is that a matrix $X$ is low rank, i.e.

$$X \approx UV^\intercal\,,$$

where both $U$ and $V$ are 'long': they have many rows but few columns. Many large matrices can be modelled this way, since it is assumed that only a couple of hidden variables explain the data. This principle is widely used in unsupervised learning (factorization methods such as PCA and the like) and can be exploited to, for example, find the most relevant items efficiently.
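As a tiny illustration in code (my own sketch, not from the original post), multiplying a tall-and-skinny $U$ with a tall-and-skinny $V$ gives a large matrix whose rank is at most the number of latent dimensions:

using LinearAlgebra

U = randn(100, 3)    # 100 rows, 3 latent dimensions
V = randn(80, 3)     # 80 rows, 3 latent dimensions
X = U * V'           # a 100×80 matrix...
rank(X)              # ...of rank (at most) 3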

One way to recover a low-rank matrix is by adding the nuclear norm as a matrix regularizer. Only very recently did it finally click in my mind how this works. This post is a small technical note on how to use empirical risk minimization to recover matrices, with an emphasis on low-rank and positive semi-definite ones. As an illustration, I will show that with some straightforward code, we can obtain a better estimate of a covariance matrix. The Jupyter notebook of this post can be found on my Github.

Recovering matrices

Suppose we have a matrix $Y$ of observations, for example user-item ratings, interaction values between drugs and protein targets, or the adjacency matrix of a food web. If we expect that the values of $Y$ are corrupted, or if $Y$ is incomplete, we need to re-estimate $Y$ by a hopefully better matrix $X$. In machine learning, such problems are approached by solving an empirical risk minimization problem (ERMP):

$$\min_X l(Y, X) + \lambda \cdot h(X)\,.$$

This optimization problem typically has three parts:

  • a loss function $l(\cdot, \cdot)$, which quantifies how well the matrix $X$ approximates our observed matrix $Y$;

  • a regularization function $h(\cdot)$, which we can use to impose some structure on the matrix $X$;

  • a regularization parameter $\lambda$, a dial to adjust the relative trade-off between the two terms above.

For the loss function, there are plenty of options, depending on the nature of the data (regression, classification, many outliers...). We will just focus on the vanilla squared loss:

$$l(Y, X) = \sum_{i,j}(Y_{ij}-X_{ij})^2\,,$$

which is most appropriate if the elements of $Y$ are real-valued and the measurement errors are close to normally distributed. Note that if not all values of $Y$ are observed, we can trivially deal with this by taking the sum only over the observed entries.
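In code, the masked loss is a one-liner (a sketch of mine, not from the original notebook):

# squared loss over all entries
sqloss(Y, X) = sum((Y .- X) .^ 2)
# squared loss over the observed entries only, given a Boolean mask
sqloss(Y, X, observed::AbstractMatrix{Bool}) = sum(((Y .- X) .^ 2)[observed])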

If we only minimize a loss, we will just obtain a matrix $X$ with exactly the same content as $Y$, which is not particularly exciting! The magic of the regularizer is that it imposes some structure on the learnt matrix $X$. The most basic assumption we can make is that the values of $X$ should not be 'too large', which can be enforced by $L_2$ regularization:

$$h_{L_2}(X) = \sum_{i,j} X_{ij}^2\,.$$

The $L_2$ norm ensures that no value in $X$ can grow too large, as the penalty is proportional to the squared value. The nice thing about this is that, in conjunction with the squared loss, the empirical risk minimization problem yields a closed-form solution.
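To see this (a short derivation of mine, not spelled out in the post), note that the objective decouples over the entries of $X$; setting the derivative with respect to $X_{ij}$ to zero gives a simple shrinkage of the observations:

$$\frac{\partial}{\partial X_{ij}}\left[(Y_{ij}-X_{ij})^2 + \lambda X_{ij}^2\right] = -2(Y_{ij}-X_{ij}) + 2\lambda X_{ij} = 0 \quad\Rightarrow\quad X_{ij} = \frac{Y_{ij}}{1+\lambda}\,.$$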

A popular alternative to $L_2$ regularization is to use the $L_1$ norm instead:

$$h_{L_1}(X) = \sum_{i,j} |X_{ij}|\,.$$

The $L_1$ norm enforces sparsity: many of the values in $X$ will be driven to zero. Additionally, the $L_1$ regularizer, in contrast to $L_2$, is more tolerant of some larger values in $X$ (linear vs. quadratic penalty).
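With the squared loss, this problem again decouples per entry and (a derivation of mine, not in the original post) the minimizer is the well-known soft-thresholding operator, which maps small observations to exactly zero:

$$X_{ij} = \operatorname{sign}(Y_{ij})\,\max\!\left(|Y_{ij}| - \tfrac{\lambda}{2},\, 0\right)\,.$$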

What if, rather than making demands on the individual values of $X$, we want this matrix to be 'simple' in a more global way? One way for $X$ to be simple is to have a low rank, i.e. to be representable with only a few dimensions. Is there a simple way to guide $X$ towards having a low rank? Yes! This can be done by considering the nuclear norm:

$$||X||_\star = \text{tr}\!\left(\sqrt{X^\intercal X}\right)\,,$$

which leads to a regularizer:

$$h_\text{nucl}(X) = \text{tr}\!\left(\sqrt{X^\intercal X}\right)\,.$$

So the trace (the sum of the diagonal values of a matrix) apparently holds the key to finding low-rank matrices. The reason is quite an interesting exercise in linear algebra. Note that the singular value decomposition of $X$ can be written as:

$$X = \sum_k\sigma_k\mathbf{u}_k\mathbf{v}^\intercal_k\,,$$

with $\mathbf{u}_1, \mathbf{u}_2,\ldots$ and $\mathbf{v}_1, \mathbf{v}_2,\ldots$ two sets of orthonormal singular vectors and $\sigma_1, \sigma_2,\ldots$ the singular values of $X$ (the square roots of the eigenvalues of $X^\intercal X$). If $X$ is low rank, this means that many of these singular values are zero. Note that singular values are always non-negative reals. The matrix $\sqrt{X^\intercal X}$ has exactly these singular values as its eigenvalues, and the trace of a matrix is equal to the sum of its eigenvalues, hence we can write

$$\text{tr}\!\left(\sqrt{X^\intercal X}\right) = \sum_k \sigma_k = \sum_k |\sigma_k|\,.$$

This means that the trace penalty acts as an $L_1$ regularization on the singular values. Large values of the regularization parameter $\lambda$ will drive many of the singular values to zero, ensuring that $X$ will be low rank ([1]).
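A quick numerical check of this identity (my own sketch, not part of the original post): the sum of the singular values coincides with the trace of the matrix square root of $X^\intercal X$.

using LinearAlgebra

A = randn(6, 4)
nuclear_from_svd = sum(svdvals(A))                 # sum of the singular values
nuclear_from_trace = tr(sqrt(Symmetric(A' * A)))   # trace of the matrix square root
nuclear_from_svd ≈ nuclear_from_trace              # true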

Illustration: approximating a covariance matrix

Let us illustrate this on a related problem: re-estimating a covariance matrix ([2]). Similar to the ERMP above, we have a covariance matrix $\Sigma$ which we want to approximate by a matrix $S$ (for example, because the estimate of $\Sigma$ is subject to high variance). The problem becomes:

$$\min_{S\succeq 0} l(\Sigma, S) + \lambda \cdot h(S)\,.$$

We restrict ourselves to searching in the positive semi-definite (PSD) cone ($S\succeq 0$), since covariance matrices are PSD: all their eigenvalues are non-negative.

We will try this out on the wines dataset, a small dataset containing 13 chemical and physical characteristics of several wines. We will first compute the covariance matrix and subsequently re-estimate it, penalizing the $L_2$, $L_1$, and nuclear norms.

using CSV, DataFrames, LinearAlgebra, Statistics

wines = CSV.read("wines.csv", DataFrame)
variables = String.(names(wines))
describe(wines)
[Output of describe(wines): a 13 × 8 summary table (variable, mean, min, median, max, nunique, nmissing, eltype) for Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline.]

We use $\Sigma$ to denote the unbiased estimator of the covariance.

# convert the DataFrame to a numeric matrix
X = Matrix(wines)
# unbiased estimate of the covariance matrix
Σ = cov(X)
13×13 Array{Float64,2}:
      0.600679   -0.206515    0.379039   …  -0.109368    -0.433737
     -0.206515    0.659062    0.0856113     -0.0133134    0.0416978
      0.379039    0.0856113   1.24802       -0.143326    -0.292447
     -0.0105554   0.0471152   0.050277      -0.00468215   0.000761836
      1.34036    -0.841093    1.07633       -0.209118    -0.656234
     -2.3155      3.13988    -0.87078    …   0.180851     0.669308
     -0.348835    0.146887   -0.234338       0.0620389    0.311021
     -0.656091    0.192033   -0.45863        0.124082     0.558262
      0.0471774  -0.0157543   0.0407334     -0.00747118  -0.0444692
     -0.221413    0.0635175  -0.141147       0.0386646    0.210933
      0.477339    1.02828     0.644838   …  -0.276506    -0.705813
     -0.109368   -0.0133134  -0.143326       0.052245     0.0917662
     -0.433737    0.0416978  -0.292447       0.0917662    0.504086

We will use simple gradient descent to solve the ERMP. During the gradient updates, it is possible that $S$ is no longer positive semi-definite. For this reason, we have to project back into the positive semi-definite cone at every step. This can be done by setting the negative eigenvalues to zero.

function projectSPDcone!(S)
    # eigendecomposition of the (symmetric) matrix
    vals, vectors = eigen(S)
    # clip negative eigenvalues to zero
    vals[vals .< 0.0] .= 0.0
    # reconstruct the projected matrix in place
    S .= vectors * Diagonal(vals) * vectors'
    return S
end
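A quick sanity check (my own, not from the original notebook): projecting an indefinite matrix removes its negative eigenvalue.

M = [1.0 2.0; 2.0 1.0]       # eigenvalues -1 and 3, so not PSD
projectSPDcone!(M)
eigvals(Symmetric(M))        # ≈ [0.0, 3.0]: back in the PSD cone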

Similarly, we can threshold very small values to zero (strictly speaking a hard threshold, despite the name of the function below). This is an easy way to make sparse solutions possible.

function softthreshold!(X, tol::Real=1e-4)
    # set all entries smaller than tol in absolute value to exactly zero
    X[abs.(X) .< tol] .= 0.0
    return X
end
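For example (my own illustration):

v = [0.5 1e-6; -1e-5 -2.0]
softthreshold!(v)
v                            # the two tiny entries are now exactly zero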

We also need the gradients of the loss and regularization functions. These could be computed using automatic differentiation, but since the functions are so simple, let's provide them explicitly:

∇ls(Y, X) = 2(X .- Y)            # gradient of the squared loss
∇L₂(X) = 2X                      # gradient of the L₂ penalty
∇L₁(X) = sign.(X)                # (sub)gradient of the L₁ penalty
∇nucl(S) = sign.(Diagonal(S))    # gradient of the nuclear norm of a PSD matrix (≈ identity)

Note that for a symmetric PSD matrix the nuclear norm is simply the trace of the matrix, so the regularization can act directly on (the diagonal of) the matrix.
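A one-line check of this fact (mine, not from the post):

B = randn(5, 5)
S₀ = B * B'                  # a random PSD matrix
sum(svdvals(S₀)) ≈ tr(S₀)    # true: the nuclear norm equals the trace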

Putting this all together, we obtain a general gradient descent function to approximate PSD matrices.

function PSDapprox(Σ::AbstractArray, ∇norm::Function, ∇loss=∇ls;
                λ=1.0, stepsize=1e-3, maxiter=20_000, tol=1e-4)
    S = copy(Σ)
    ΔS = similar(S)
    for i in 1:maxiter
        # compute gradient of loss + norm
        ΔS .= ∇loss(Σ, S) + λ * ∇norm(S)
        # stop if the gradient is small
        if norm(ΔS) < tol
            break
        end
        # gradient descent step
        S -= stepsize * ΔS
        # project back into the PSD cone
        S .= projectSPDcone!(S)
        # threshold small values to zero
        softthreshold!(S, tol)
    end
    S
end
PSDapprox (generic function with 2 methods)

Let us try the different regularizers:

SL₂ = PSDapprox(Σ, ∇L₂, λ=2e-1)
13×13 Array{Float64,2}:
      0.500566    -0.172096    0.315866   …  -0.0911396   -0.361448
     -0.172096     0.549219    0.0713428     -0.0110945    0.0347482
      0.315866     0.0713428   1.04001       -0.119438    -0.243706
     -0.00879621   0.0392626   0.0418975     -0.0039018    0.000634863
      1.11697     -0.700911    0.896943      -0.174265    -0.546862
     -1.92958     2.61657    -0.72565    …   0.150709     0.557757
     -0.290696     0.122406   -0.195281       0.0516991    0.259184
     -0.546742     0.160028   -0.382192       0.103402     0.465219
      0.0393145   -0.0131286   0.0339445     -0.00622598  -0.0370577
     -0.184511     0.0529313  -0.117623       0.0322205    0.175777
      0.397783     0.856902    0.537365   …  -0.230422    -0.588177
     -0.0911396   -0.0110945  -0.119438       0.0435375    0.0764719
     -0.361448     0.0347482  -0.243706       0.0764719    0.420072
SL₁ = PSDapprox(Σ, ∇L₁, λ=2e-1)
13×13 Array{Float64,2}:
      0.502745   -0.105621      0.279187     …  -0.0208615    -0.335675
     -0.105621    0.55945       0.000142351      0.0           0.0
      0.279187    0.000142351   1.14803         -0.0441496    -0.192586
      0.0         0.0           0.0             -0.000107562   0.0
      1.23978    -0.741346      0.97629         -0.10587      -0.555687
     -2.2155      3.03988    -0.77078    …   0.0808711    0.569311
     -0.249459    0.0466174    -0.134382         0.0           0.211606
     -0.554226    0.0928404    -0.358497         0.0137052     0.456512
      0.0         0.0           0.0             -0.000149592   0.0
     -0.121912    0.0          -0.0411828        0.0           0.111401
      0.376425    0.927887      0.544773     …  -0.171415     -0.604954
     -0.0208615   0.0          -0.0441496        0.0162355     0.00255751
     -0.335675    0.0          -0.192586         0.00255751    0.405906
Snucl = PSDapprox(Σ, ∇nucl, λ=2e-1)
13×13 Array{Float64,2}:
      0.523474   -0.2018      0.378216   …  -0.102672    -0.425452
     -0.2018      0.560289    0.0853282     -0.0146486    0.0434516
      0.378216    0.0853282   1.14846       -0.138243    -0.292814
      0.0         0.0474372   0.0479655     -0.00677665   0.00514546
      1.33877    -0.841296    1.07649       -0.209159    -0.656809
     -2.31547     3.1399     -0.870777   …   0.180832     0.669327
     -0.360782    0.143824   -0.234054       0.0621105    0.307935
     -0.642777    0.195738   -0.459109       0.121481     0.562478
      0.0496407  -0.0149893   0.0401001     -0.00982623  -0.0395298
     -0.223001    0.062717   -0.141305       0.0383765    0.21024
      0.476298    1.02801     0.645162   …  -0.273265    -0.706216
     -0.102672   -0.0146486  -0.138243       0.0302173    0.0936902
     -0.425452    0.0434516  -0.292814       0.0936902    0.407482

If these matrices are compared with the original matrix $\Sigma$, you notice that the values have been shrunk by the regularization: we sacrifice unbiasedness for a large drop in variance! As promised, the covariance matrix obtained using $L_1$ regularization is quite sparse.
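The sparsity values below (the fraction of entries that are exactly zero) can be computed with a small helper like this (the helper name is mine; the original code lives in the notebook):

sparsity(S) = mean(S .== 0.0)
println("sparsity SL₂: ", sparsity(SL₂))
println("sparsity SL₁: ", sparsity(SL₁))
println("sparsity Snucl: ", sparsity(Snucl))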

sparsity SL₂: 0.0
    sparsity SL₁: 0.2603550295857988
    sparsity Snucl: 0.011834319526627219

A look at the spectra of these matrices shows that the nuclear norm indeed results in a low-rank matrix.

Comparison of the spectra of the different covariance matrices.
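The figure itself is not reproduced here; a similar plot can be generated with, for instance, Plots.jl (a sketch of mine, assuming the package is available):

using Plots

specvals(S) = sort(eigvals(Symmetric(S)), rev=true)   # eigenvalues, largest first
plot(specvals(Σ), label="Σ", marker=:circle, xlabel="index", ylabel="eigenvalue")
plot!(specvals(SL₂), label="L₂", marker=:circle)
plot!(specvals(SL₁), label="L₁", marker=:circle)
plot!(specvals(Snucl), label="nuclear", marker=:circle)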

Checking the rank explicitly:
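The code producing these numbers is not shown in the extract; it boils down to calls to rank from LinearAlgebra:

println("rank Σ: ", rank(Σ))
println("rank SL₂: ", rank(SL₂))
println("rank SL₁: ", rank(SL₁))
println("rank Snucl: ", rank(Snucl))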

rank Σ: 13
    rank SL₂: 13
    rank SL₁: 12
    rank Snucl: 8

As claimed.

Conclusion

Many machine learning problems can be posed as trying to recover a matrix (e.g. edge prediction in a network, collaborative filtering, image inpainting...). Empirical risk minimization is a general framework for thinking about such problems, where the regularization penalty can be used to impose some structure on the learned matrix.

Footnotes

[1] You might wonder why not simply reformulate the problem by substituting $X$ with $UV^\intercal$ and minimizing the objective w.r.t. $U$ and $V$. In that case, the problem is no longer convex and no longer has a unique solution (swapping any columns of $U$ together with the corresponding columns of $V$ yields the same objective value). The optimization problem no longer has a single minimum and is thus harder to solve.
[2] A related problem would be kernel learning, where the goal is to find a good kernel matrix describing the data.