Leverage Calculation And Approximation for Large Scale 1-Norm Data Analysis

Salvador A. Flores

*S. Flores is with the Mathematics Engineering Department and the Center for Mathematical Modelling, Universidad de Chile, Chile. e-mail: sflores@dim.uchile.cl

Abstract Identifying influential observations is a fundamental task in data analysis. It has long served to detect outliers and other deviating behavior in data. At present, it plays an essential role in the development of techniques for obtaining coresets, which reduce huge amounts of data to smaller, manageable sets with provable guarantees. For least squares regression, the subject is well understood both theoretically and computationally. Other estimation techniques, such as 1-norm minimization, have several advantages over least squares in terms of robustness and identification of influential observations, but their diagnostics are not developed enough, which hinders extensions of these methods to large scale data. In this paper we study the proper definition and numerical estimation of leverage scores for 1-norm minimization. We show that the calculation of 1-norm leverage scores amounts to solving a linear programming problem. In view of applications to big data analysis, this raises the issue of the trade-off between computing exact or very high quality leverage scores at a high cost versus using fast but rough approximations. Based on this linear programming formulation, we compare the quality of approximate leverage scores proposed in the literature against the exact ones. We also propose a randomized approximation which is included in this evaluation. Additionally, we use linear programming duality to derive several approximations to the exact leverage scores, with different levels of quality and complexity.

I. Introduction

Consider the classical linear regression model, where each observation yi follows

\[ y_i = x_i^\top \beta + \delta_i, \qquad i = 1,\ldots,n \]
(1)

for some parameter $\beta \in \mathbb{R}^p$ to be estimated, where the $x_i \in \mathbb{R}^p$ are the corresponding vectors of independent variables or carriers and the $\delta_i$ are random deviations from the model. It is customary to assume i.i.d. errors, and in many cases it is implicitly assumed that the errors behave like Gaussian random variables.

Regression diagnostics are essential in helping data analysts identify failures of the assumptions that customarily accompany regression models. Most of them aim at evaluating the impact of each datum on the estimated coefficients, the predicted (fitted) values or the residuals. The concept of leverage holds a prominent place in regression diagnostics. For instance, in Least Squares (LS) regression, the leverage scores correspond to the diagonal elements $h_{ii}$ of the "hat" matrix $H = X(X^\top X)^{-1}X^\top$. These quantities can be computed efficiently and enjoy a number of properties, most of them stemming from the linearity of the estimation of fitted values.

The $\ell_1$ estimator, also called Least Absolute Deviations (LAD), is defined as a solution to

\[ \min_{g \in \mathbb{R}^p} \; \sum_{i=1}^{n} |y_i - x_i^\top g| \]
(2)

This estimator is distinguished for its robustness to outliers. Nonetheless, for many years the non-differentiability of the objective function complicated the use of LAD regression, delaying the development of regression diagnostics. The emergence of mature implementations of linear programming algorithms made the calculation of LAD estimates possible for data in the order of hundreds of thousands of observations [1]. Specific algorithms for computing the effect of adding or removing observations from the LAD fit have also appeared since then [2], [3], [4], which have contributed to closing the gap between LAD and LS concerning row deletion methods [5]. Surprisingly, even though leverage was defined for LAD regression by [12] in 1992, its computation has not been treated yet and, contrary to LS leverages, it is not included in any statistical package or library.

Recently, the increasing need for big data regression has renewed the interest in leverage calculations [6], [7] for LAD regression. As opposed to what is done in robust regression, these methods favour high-leverage observations, since a few of them mostly determine the LS fit. In this way, performing regression on a properly sampled subpopulation (a coreset or sketch) provides an accurate approximation to the "true" estimator, reducing considerably the size of the numerical problem when n is huge [8]. [9] devised an algorithm to compute the leverage scores without computing pseudo-inverse matrices. They also point out interesting applications of maximum leverage values to problems such as identifying important nodes in a network or very small clusters in noisy data, among others. Concerning robustness, there have been some attempts to extend these sampling techniques to LAD regression [10] and more generally to quantile regression [11].

Our goal is to compute leverage scores as a data exploration tool in a context broader than just approximating the LAD regression estimator. The task being harder, this at first implies restricting to smaller problems and/or being ready to increase the computational resources devoted to it. However, we are deeply convinced that the rapid development both in High Performance Computing (HPC) and in theory and algorithms will shortly make the use of regression diagnostics routine even for very large scale data. We aim at contributing in that direction.

Our contributions are:

In Section II we discuss the most appropriate notion of leverage for this study. Then, in Section III we derive the linear programming formulation of LAD leverages and obtain some bounds using duality theory. The randomized approximation is described in Section IV. Section V is devoted to a preliminary numerical evaluation of the accuracy of some approximations to LAD leverage scores. Finally, we close the paper with a few conclusions and several avenues for continuing this work, presented in Section VI.

II. Leverage, sensitivity and related concepts

The concept of leverage is central in regression diagnostics. In LS regression, it corresponds to the diagonal element $h_i \equiv h_{ii}$ of the projector onto the range of $X$, called the hat matrix:

\[ H = X (X^\top X)^{-1} X^\top. \]
(3)

Alternatively, if U is any orthonormal basis of the column space of X, then

\[ h_i = \sum_{j=1}^{p} U_{ij}^2 = \|u_i\|_2^2 \]
(4)

where $u_i$ is the $i$-th row of $U$. Equation (4) has been used in [10], [8] to extend the notion of leverage to the $\ell_1$ norm by using the leverage proxies

\[ l_i = \sum_{j=1}^{p} |U_{ij}| = \|u_i\|_1 \]
(5)

as an intermediate step to devise sampling algorithms. Despite the numerical ease of computing such quantities, there are no associated theoretical properties supporting such a definition. Nonetheless, this definition worked well for their objective, approximating the LAD estimator by sampling, which is different from ours.
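As an illustration, the following minimal numpy sketch (ours; it assumes $X$ has full column rank) computes both the LS leverages (4) and the $\ell_1$ proxies (5) from a thin QR factorization, whose $Q$ factor is an orthonormal basis of the column space of $X$:

```python
import numpy as np

def leverage_proxies(X):
    """LS leverages h_i = ||u_i||_2^2 (eq. 4) and l1 proxies l_i = ||u_i||_1 (eq. 5)."""
    # Thin QR: Q is n x p with orthonormal columns spanning range(X).
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)        # squared 2-norms of the rows of Q
    l = np.sum(np.abs(Q), axis=1)   # 1-norms of the rows of Q
    return h, l

# Example usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
h, l = leverage_proxies(X)
print(h.sum())  # equals p = 5 up to rounding, since trace(H) = p
```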

Yet another equivalent expression for the diagonal entries of (3) is:

\[ h_i = \max_{g \in \mathbb{R}^p} \frac{(x_i^\top g)^2}{\sum_{j \in N} (x_j^\top g)^2}, \]
(6)

derived in [12] and used as a starting point to extend the notion of leverage in at least two directions, by making it possible to define leverage both for groups of observations and for estimation techniques more general than LS. For $i = 1,\ldots,n$, we shall denote by $\gamma_i$ the individual leverage of observation $i$, given by

\[ \gamma_i = \max_{\substack{g \in \mathbb{R}^p \\ \|g\|_2 = 1}} \frac{|x_i^\top g|}{\sum_{j \in N} |x_j^\top g|} \]
(7)

There are results in two key application fields where this definition stands out as the most adequate one:

One of the main reasons for using the $\ell_1$ norm in regression is its resistance to outliers. Both [13] and [12] have established the links between the leverage indicator $\gamma_i$ and the robustness properties of LAD regression to outliers. Later, [14] proved, for data perturbed by noise and outliers and as long as $\gamma_i < 1/2$, that the regression fit differs from the fit without outliers proportionally to the noise, amplified by a factor $(1/2 - \gamma_i)^{-1}$, independently of the magnitude of the outliers. This result extends to groups of observations as well.

A strategy for processing large amounts of data consists in creating a tractable summary of the data, called a coreset or sketch, which, for a specific analysis, gives results that are provably close to those obtained on the whole dataset. This data reduction technique is known as sketching. In practice, sketching is performed by sampling data according to some particular distribution that favors points of high relevance for the analysis in mind. For LS regression, the sampling distribution is proportional to $h_i/p$. In the broader context of constructing coresets for Empirical Risk Minimization (ERM) problems,

\[ \min_{g \in \mathbb{R}^p} \; \sum_{i=1}^{n} f(x_i, g) \]
(8)

by random sampling,  [15] shows that while bias can be corrected by a proper re-weighting, in order to minimize the variance of the sampled cost function each point has to be drawn with probability proportional to:

\[ \sigma_i = \max_{g \in \mathbb{R}^p} \frac{f(x_i, g)}{\sum_{j \in N} f(x_j, g)}. \]
(9)

Note that the expression in (9) needs, in many cases, a normalization in order to be well-defined; when the function $f$ is homogeneous in $g$, for instance, the maximizers are only determined up to scale. This framework encompasses linear, quantile and logistic regression, support vector machines, clustering and many other machine learning problems. In particular, for a slightly modified version of LAD regression we recover the expression for leverage (7).
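As a sketch of this last point (our reading, not a statement taken from [15]): taking the homogeneous cost $f(x_i,g) = |x_i^\top g|$, which is the LAD objective (2) with the responses removed, and using the normalization $\|g\|_2 = 1$ to resolve the scale invariance, expression (9) becomes
\[ \sigma_i \;=\; \max_{\substack{g \in \mathbb{R}^p \\ \|g\|_2 = 1}} \frac{|x_i^\top g|}{\sum_{j \in N} |x_j^\top g|} \;=\; \gamma_i, \]
which is exactly (7).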

Despite their appealing theoretical properties, and even though we will show next that they can be formulated as efficiently solvable linear programs, these leverage scores are not amenable to computation at scale. For this reason, in what follows we aim at obtaining ways to approximate them and at evaluating the quality of different approximations on tractable instances of the problem.

III. Computing leverage indicators for LAD regression by Linear Programming

The leverage definition (7) has good theoretical properties, but in general it is a difficult optimization problem involving a non-convex objective function. As a first step towards deriving good approximations, we show that these optimization problems can be cast as linear programs. The advantages are twofold. First, it makes it possible to compute exact leverage scores for medium-sized problems, which can be used to understand and assess the quality of any proposed approximation. Secondly, we have at our disposal the battery of results on linear programming to exploit.

Theorem 1: Consider, for i = 1,...,n, the following linear programs:

\[
\begin{array}{rll}
\displaystyle \max_{g \in \mathbb{R}^p,\; t \in \mathbb{R}^{n-1}} & x_i^\top g & \\[2mm]
\text{s.t.} & -t_j \le x_j^\top g \le t_j, & j \neq i \\
& x_i^\top g + \sum_{j \neq i} t_j \le 1 & \\
& 0 \le t_j \le 1, & j \neq i
\end{array}
\]
(Pi)

The following hold:

\[ \mathrm{Value}(P_i) = \gamma_i \]

proof
The key observation to cast (7) as an LP is that $|x_i^\top g| / \|Xg\|_1$ is invariant with respect to scale changes of $g$; therefore the normalization $\|g\|_2 = 1$ can be replaced by any other, such as $\|Xg\|_1 = 1$, as long as it is not degenerate, which is the case here since $X$ has full rank. Therefore the constants $\gamma_i$ can be evaluated by solving the convex optimization problem

\[
\begin{array}{rl}
\displaystyle \max_{g \in \mathbb{R}^p} & x_i^\top g \\[1mm]
\text{s.t.} & \|Xg\|_1 \le 1.
\end{array}
\]
(10)

Note that the constraint $\|Xg\|_1 \le 1$ will always hold with equality at the optimum, since $\|Xg\|_1$ contains the term $|x_i^\top g|$ which is being maximized. Problem (10) is equivalent to (Pi). To see this, rewrite the constraints $-t_j \le x_j^\top g \le t_j$ equivalently as $|x_j^\top g| \le t_j$. Then, $x_i^\top g + \sum_{j \neq i} t_j \le 1$ amounts to $\|Xg\|_1 \le 1$, given that the term $x_i^\top g$ being maximized must be positive at the optimum, and that $|x_j^\top g| = t_j$ for $j \neq i$ at the optimum too. We have established that $\mathrm{Value}(P_i) = \gamma_i$.
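For medium-sized problems, the scores $\gamma_i$ can be computed with any LP solver. The sketch below (ours, not part of the original development) solves, for a given $i$, the LP $\max\{x_i^\top g : |x_j^\top g| \le t_j \ \forall j,\ \sum_j t_j \le 1\}$, which has the same optimal value as (10) and (Pi), using scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def lad_leverage(X, i):
    """gamma_i = max { x_i^T g : ||X g||_1 <= 1 }, cf. (10) and (P_i)."""
    n, p = X.shape
    # Variables z = (g, t) with g in R^p free and t in R^n, t_j >= |x_j^T g|.
    c = np.concatenate([-X[i], np.zeros(n)])                 # minimize -x_i^T g
    A_ub = np.block([[ X, -np.eye(n)],                       #  X g - t <= 0
                     [-X, -np.eye(n)],                       # -X g - t <= 0
                     [np.zeros((1, p)), np.ones((1, n))]])   # sum_j t_j <= 1
    b_ub = np.zeros(2 * n + 1)
    b_ub[-1] = 1.0
    bounds = [(None, None)] * p + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return -res.fun

# Example usage: gammas = [lad_leverage(X, i) for i in range(X.shape[0])]
```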

A strong consequence of the linear programming formulation is duality theory. To any linear program in primal form $\min\{c^\top x : Ax = b,\; x \ge 0\}$ we can associate a dual linear program of the form $\max\{b^\top y : A^\top y \le c\}$. The so-called weak duality theorem ensures that for any primal feasible point $x$ and dual feasible point $y$, it holds

\[ b^\top y \le c^\top x. \]
Since the problems (Pi) are linear programs, we can use this theorem to obtain bounds for the leverage constants by "simply" finding primal- or dual-feasible points and evaluating them. This can be seen as a technique to construct intervals containing the exact leverage value. The following result uses it to derive bounds for the LAD leverage scores, the upper bounds being expressed in terms of the LS leverage scores $h_{ij}$.

Theorem 2: Let $h_{ij}$ denote the entries of the hat matrix $H = X(X^\top X)^{-1}X^\top$, and define

\[ l_i = \frac{\|x_i\|_2^2}{\sum_{j=1}^{n} |x_j^\top x_i|}, \qquad u_i = \max_{j \neq i} \frac{|h_{ji}|}{1 - h_{ii} + |h_{ji}|}, \qquad \nu_j^i = \frac{h_{ji}}{1 - h_{ii}}. \]
(11)

Consider the following linear programs

\[
\begin{array}{rll}
\displaystyle \min_{\alpha \ge 0,\; \beta \in \mathbb{R}^{n-1}} & \alpha & \\[2mm]
\text{s.t.} & -(\nu_j^i + 1)\alpha + \beta_j \le -\nu_j^i, & j \neq i \\
& (\nu_j^i - 1)\alpha - \beta_j \le \nu_j^i, & j \neq i \\
& \sum_{j \neq i} \beta_j x_j = 0. &
\end{array}
\]
(Di)

The following hold:

\[ l_i \le \mathrm{Value}(P_i) = \gamma_i = \mathrm{Value}(D_i) \le u_i \]

proof
We use the fact that the optimization problem (7) can be cast as a linear program; the claimed relations then follow as a consequence of LP duality.

Let us show that (Di) is the dual problem to (Pi). By definition, the linear dual to (Pi) is

\[
\begin{array}{rl}
\displaystyle \min_{\alpha \ge 0,\; \mu^1,\mu^2 \in \mathbb{R}^{n-1}_{+}} & \alpha \\[2mm]
\text{s.t.} & (\alpha - 1)x_i + \sum_{j \neq i} (\mu_j^1 - \mu_j^2) x_j = 0 \\[1mm]
& (\mu_j^1 + \mu_j^2) - \alpha \le 0, \quad \forall j \neq i.
\end{array}
\]
(12)

Notice that $\mu_j^1$ and $\mu_j^2$ are the Lagrange multipliers of $x_j^\top g \le t_j$ and $-t_j \le x_j^\top g$, respectively; therefore they cannot take a non-zero value simultaneously. For this reason we can simplify problem (12) by defining $\mu := \mu^1 - \mu^2$. In this way, $\mu_j^1 - \mu_j^2 = \mu_j$ and $\mu_j^1 + \mu_j^2 = |\mu_j|$, yielding

\[
\begin{array}{rl}
\displaystyle \min_{\alpha \ge 0,\; \mu \in \mathbb{R}^{n-1}} & \alpha \\[2mm]
\text{s.t.} & (\alpha - 1)x_i + \sum_{j \neq i} \mu_j x_j = 0 \\[1mm]
& |\mu_j| - \alpha \le 0, \quad \forall j \neq i.
\end{array}
\]
(13)

In order to proceed to further simplifications of (13) it will be convenient to introduce the notation $X_{-i}$ for the $(n-1) \times p$ matrix obtained by suppressing the $i$-th row of $X$. Consider the orthogonal decomposition $\mu = X_{-i}\theta + \beta$, with $\beta \in \ker X_{-i}^\top$. From the first equation in (13),

\[ x_i = \frac{1}{1-\alpha}\, X_{-i}^\top \mu = \frac{1}{1-\alpha}\, X_{-i}^\top X_{-i}\, \theta, \]
thus
\[ \theta = (1-\alpha)\,(X_{-i}^\top X_{-i})^{-1} x_i. \]
Since $X_{-i}^\top X_{-i} = X^\top X - x_i x_i^\top$, we can relate $(X_{-i}^\top X_{-i})^{-1}$ to $(X^\top X)^{-1}$ using the Sherman-Morrison formula [16, Appendix 2A], obtaining
\[ (X_{-i}^\top X_{-i})^{-1} = (X^\top X)^{-1} + \frac{(X^\top X)^{-1} x_i x_i^\top (X^\top X)^{-1}}{1 - x_i^\top (X^\top X)^{-1} x_i}. \]
Let $\nu^i := X_{-i}(X_{-i}^\top X_{-i})^{-1} x_i$. Using the expression above,
\[
\begin{array}{rcll}
\nu_j^i & = & x_j^\top (X_{-i}^\top X_{-i})^{-1} x_i & (14) \\[1mm]
& = & x_j^\top (X^\top X)^{-1} x_i + \dfrac{x_j^\top (X^\top X)^{-1} x_i \; x_i^\top (X^\top X)^{-1} x_i}{1 - x_i^\top (X^\top X)^{-1} x_i} & (15) \\[2mm]
& = & h_{ji} + \dfrac{h_{ji} h_{ii}}{1 - h_{ii}} & (16) \\[2mm]
& = & \dfrac{h_{ji}}{1 - h_{ii}} & (17)
\end{array}
\]

As a consequence, those $\mu \in \mathbb{R}^{n-1}$ satisfying $(\alpha - 1)x_i + \sum_{j \neq i} \mu_j x_j = 0$ can be parameterized as $\mu \in \{(1-\alpha)\nu^i + \beta : \beta \in \ker X_{-i}^\top\}$, which substituted into (13) yields (Di). The equality $\gamma_i = \mathrm{Value}(D_i)$ is therefore a direct application of the strong duality theorem for linear programming, using the fact $\mathrm{Value}(P_i) = \gamma_i$ proved before. Note that $l_i$ is the value of the objective of (Pi) at the feasible point $g = x_i / \|Xx_i\|_1$; therefore $l_i \le \mathrm{Value}(P_i)$. Similarly, considering only feasible points with $\beta = 0$ in (Di) yields

\[
\begin{array}{rl}
\displaystyle \min_{\alpha \ge 0} & \alpha \\[1mm]
\text{s.t.} & \dfrac{|\nu_j^i|}{1 + |\nu_j^i|} \le \alpha, \quad \forall j \neq i,
\end{array}
\]
(18)

and since $\max_{j \neq i} |\nu_j^i| / (1 + |\nu_j^i|) = u_i$, the result follows from the fact that restricting the feasible domain of a minimization problem can only increase its optimal value.

Note that in the previous result we make only elementary use of LP duality, considering feasible points with a large portion of their entries set to zero. Further analysis may lead to tighter upper and lower bounds. The challenge in doing so is to find a good balance between the improvement of the approximation and the increase in the time needed to compute it. Therefore, this theorem should be considered a starting point rather than a definitive result.
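For completeness, the bounds in (11) can be evaluated directly from the Gram and hat matrices. A minimal numpy sketch (ours; it assumes $X$ has full column rank and $h_{ii} < 1$ for all $i$):

```python
import numpy as np

def lad_leverage_bounds(X):
    """Closed-form bounds (11): l_i <= gamma_i <= u_i."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)
    H = Q @ Q.T                                  # hat matrix H = X (X^T X)^{-1} X^T
    h = np.diag(H)
    G = X @ X.T                                  # Gram matrix, G[j, i] = x_j^T x_i
    l = np.diag(G) / np.sum(np.abs(G), axis=0)   # l_i = ||x_i||_2^2 / sum_j |x_j^T x_i|
    u = np.empty(n)
    for i in range(n):
        a = np.abs(np.delete(H[:, i], i))        # |h_ji| for j != i
        u[i] = np.max(a / (1.0 - h[i] + a))      # u_i = max_j |h_ji| / (1 - h_ii + |h_ji|)
    return l, u
```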

IV. Distributed randomized leverage scores estimation

A drawback of the approximations presented so far is that, in general, they need all the data to perform the calculations, such as solving an LP problem or computing an SVD. However, very large volumes of data are customarily stored in distributed systems, which makes it difficult to operate on the whole dataset at once. Another point to stress is that the definition of leverage changes very mildly from one observation to another: all of the scores carry the term $\|Xg\|_1$ in the denominator, for instance. If, instead of optimizing over the whole sphere, we computed the maximum over a finite set of directions $g_1, \ldots, g_k, \ldots, g_m$, we could compute only once, locally at the node $I$ where the datum $x_i$ is stored, the terms $|x_i^\top g_k|$, saving communication burden among nodes by broadcasting only the real numbers $\|X_I g_k\|_1$ instead of the whole block of data $X_I$. With this information, each node forms the common term $\|Xg_k\|_1 = \sum_I \|X_I g_k\|_1$ and then computes $\max_k |x_i^\top g_k| / \|Xg_k\|_1$ for the points stored at that node.


Algorithm 1.  Distributed randomized leverage scores estimation
1:  Input: number of approximation samples $K > 0$. Each machine $m$ has stored $x_i$ for $i \in m$
2:  Machine 1 broadcasts $g_1, \ldots, g_K$
3:  for all $m = 1, \ldots, M$ do in parallel
4:    compute and send $|x_i^\top g_k|$, for $i \in m$ and every $k = 1, \ldots, K$, to machine 1
5:  end for
6:  aggregate $\|Xg_k\|_1 = \sum_{i=1}^{n} |x_i^\top g_k|$ for each $k$ and broadcast them {executed on machine 1}
7:  for all $m = 1, \ldots, M$ do in parallel
8:    compute the quotients $q_i^k := |x_i^\top g_k| / \|Xg_k\|_1$, for $i \in m$ and every $k = 1, \ldots, K$; send $\hat{q}_i := \max_{k=1,\ldots,K} q_i^k$ to machine 1
9:  end for
10:  return: the approximated leverage scores $\hat{q}_i$ for every $i = 1, \ldots, n$.
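Algorithm 1 is straightforward to prototype. The following single-process numpy sketch (ours; the helper name is hypothetical and the list of blocks stands in for the $M$ machines) mirrors the steps above, so that only the $K$ direction vectors and scalar summaries would travel between machines:

```python
import numpy as np

def randomized_lad_leverage(X_blocks, K, seed=None):
    """Single-process simulation of Algorithm 1 on a list of row blocks."""
    rng = np.random.default_rng(seed)
    p = X_blocks[0].shape[1]
    # Step 2: machine 1 draws K directions uniformly on the unit sphere and broadcasts them.
    G = rng.standard_normal((K, p))
    G /= np.linalg.norm(G, axis=1, keepdims=True)
    # Steps 3-5: each machine computes |x_i^T g_k| for its own rows.
    local_abs = [np.abs(Xm @ G.T) for Xm in X_blocks]    # each block: n_m x K
    # Step 6: aggregate ||X g_k||_1 = sum_i |x_i^T g_k| and broadcast the K scalars.
    denom = sum(A.sum(axis=0) for A in local_abs)
    # Steps 7-9: each machine computes q_i^k and keeps the maximum over k.
    q_hat = [np.max(A / denom, axis=1) for A in local_abs]
    # Step 10: approximate leverage scores, in block order.
    return np.concatenate(q_hat)

# Example usage: scores = randomized_lad_leverage(np.array_split(X, 4), K=200, seed=0)
```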

Two crucial points in the previous algorithm are fixing the number $K$ of approximation samples and choosing the way of drawing them. These questions are interrelated. A deterministic alternative is to fix an error goal $\epsilon$ and then construct an $\epsilon$-net covering the unit sphere.

The communication costs of the algorithm are very low, since the only vectors that are communicated are the $g$'s at the beginning; the rest of the time only real numbers are transmitted.

It is well known that the difficulty of approximating leverage values, even for LS, is higher when the scores are highly non-homogeneous. From equation (7) we can identify the factors determining the leverage patterns:


Fig. 1.  Above: isotropically distributed points; below: anisotropic distribution, with isolated points achieving high leverage.


For all the above reasons, we use the Bingham distribution to draw the vectors $x_i$. This is an antipodally symmetric, anisotropic probability distribution on the sphere. In this way we can control the effect of anisotropy on the leverage distribution through the covariance structure characterizing the distribution. In order to simulate vectors of different magnitude, such as outliers, we multiply the vectors drawn from the Bingham distribution by a Cauchy random variable of location one and two scales, 0.1 and 0.15. For each combination we draw 1000 matrices and compute their exact leverage scores by solving the linear programs (Pi). Our intuition on the resulting leverage pattern is confirmed by the plots over the repetitions, shown in Figure 2.
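For reference, a sketch of the data generation (ours; the concentration matrix and problem sizes below are illustrative placeholders, since the exact simulation parameters are not listed here). The Bingham density is proportional to $\exp(x^\top A x)$ on the unit sphere, so a simple rejection sampler with uniform proposals suffices for small dimensions; the drawn directions are then rescaled by Cauchy magnitudes of location one and scale 0.1 (or 0.15):

```python
import numpy as np

def sample_bingham(A, n, rng):
    """Rejection sampler for the Bingham distribution, density prop. to exp(x^T A x)."""
    p = A.shape[0]
    lam_max = np.max(np.linalg.eigvalsh(A))
    samples = []
    while len(samples) < n:
        x = rng.standard_normal(p)
        x /= np.linalg.norm(x)                       # uniform proposal on the sphere
        if rng.random() < np.exp(x @ A @ x - lam_max):
            samples.append(x)
    return np.array(samples)

rng = np.random.default_rng(0)
A = np.diag([4.0, 1.0, 0.0])                         # anisotropic concentration (placeholder)
U = sample_bingham(A, n=100, rng=rng)                # directions on the sphere
r = 1.0 + 0.1 * rng.standard_cauchy(100)             # Cauchy magnitudes, location 1, scale 0.1
X = U * r[:, None]                                   # rows x_i = r_i * u_i
```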


Fig. 2.  Above: leverage pattern for isotropically distributed points; below: leverage pattern for anisotropic distribution containing outliers.


References

[1]    S. Portnoy and R. Koenker, “The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators,” Statistical Science, vol. 12, no. 4, pp. 279–300, 1997.

[2]    E. Castillo, A. S. Hadi, and B. Lacruz, “Regression diagnostics for the least absolute deviations and the minimax methods,” Comm. Statist. Theory Methods, vol. 30, no. 6, pp. 1197–1225, 2001.

[3]    S. C. Narula and J. F. Wellington, “Interior analysis for the minimum sum of absolute errors regression,” Technometrics, vol. 27, no. 2, pp. 181–188, 1985.

[4]    ——, “Sensitivity analysis for predictor variables in the MSAE regression,” Computational Statistics & Data Analysis, vol. 40, no. 2, pp. 355–373, 2002.

[5]    R.-B. Sun and B.-C. Wei, “On influence assessment for LAD regression,” Statistics & Probability Letters, vol. 67, no. 2, pp. 97–110, 2004.

[6]    P. Drineas, M. W. Mahoney, and S. Muthukrishnan, “Sampling algorithms for $\ell_2$ regression and applications,” in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA ’06. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2006, pp. 1127–1136.

[7]    P. Ma and X. Sun, “Leveraging for big data regression,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 7, no. 1, pp. 70–76, 2015.

[8]    A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney, “Sampling algorithms and coresets for $\ell_p$ regression,” SIAM Journal on Computing, vol. 38, no. 5, pp. 2060–2078, 2009.

[9]    P. Drineas, M. W. Mahoney, and D. P. Woodruff, “Fast approximation of matrix coherence and statistical leverage,” Journal of Machine Learning Research, vol. 13, pp. 3475–3506, 2012.

[10]    K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff, “The fast Cauchy transform and faster robust linear regression,” in Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2013, pp. 466–477.

[11]    J. Yang, X. Meng, and M. W. Mahoney, “Quantile regression for large-scale applications,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. 78–110, 2014.

[12]    S. P. Ellis and S. Morgenthaler, “Leverage and breakdown in L1 regression,” J. Amer. Statist. Assoc., vol. 87, no. 417, pp. 143–148, 1992.

[13]    X. He, J. Jurečková, R. Koenker, and S. Portnoy, “Tail behavior of regression estimators and their breakdown points,” Econometrica, vol. 58, no. 5, pp. 1195–1214, 1990.

[14]    S. Flores, “Sharp non-asymptotic performance bounds for $\ell_1$ and Huber robust regression estimators,” TEST, 2015.

[15]    O. Bachem, M. Lucic, and A. Krause, “Practical coreset constructions for machine learning,” arXiv preprint arXiv:1703.06476, 2017.

[16]    D. A. Belsley, E. Kuh, and R. E. Welsch, Regression diagnostics: identifying influential data and sources of collinearity, ser. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York-Chichester-Brisbane, 1980.