Distributed-memory hierarchical interpolative factorization
 Yingzhou Li^{1} and
 Lexing Ying^{1, 2}
https://doi.org/10.1186/s40687-017-0100-6
© The Author(s) 2017
Received: 1 July 2016
Accepted: 27 February 2017
Published: 5 June 2017
Abstract
The hierarchical interpolative factorization (HIF) offers an efficient way for solving or preconditioning elliptic partial differential equations. By exploiting locality and low-rank properties of the operators, the HIF achieves quasilinear complexity for factorizing the discrete positive definite elliptic operator and linear complexity for solving the associated linear system. In this paper, the distributed-memory HIF (DHIF) is introduced as a parallel and distributed-memory implementation of the HIF. The DHIF organizes the processes in a hierarchical structure and keeps the communication as local as possible. The computation complexity is \(O(\frac{N\log N}{P})\) and \(O(\frac{N}{P})\) for constructing and applying the DHIF, respectively, where N is the size of the problem and P is the number of processes. The communication complexity is \(O(\sqrt{P}\log ^3 P)\alpha + O(\frac{N^{2/3}}{\sqrt{P}})\beta \) where \(\alpha \) is the latency and \(\beta \) is the inverse bandwidth. Extensive numerical examples are performed on the NERSC Edison system with up to 8192 processes. The numerical results agree with the complexity analysis and demonstrate the efficiency and scalability of the DHIF.
Keywords
Sparse matrix, Multifrontal, Elliptic problem, Matrix factorization, Structured matrix
Mathematics Subject Classification
44A55, 65R10, 65T50
1 Background
1.1 Previous work
A great deal of effort in the field of scientific computing has been devoted to the efficient solution of (2). Beyond the \(O(N^3)\) complexity naïve matrix inversion approach, one can classify the existing fast algorithms into the following groups.
The first one consists of the sparse direct algorithms, which take advantage of the sparsity of the discrete problem. The most notable example in this group is the nested dissection multifrontal (MF) method [14, 16, 26]. By carefully exploiting the sparsity and the locality of the problem, the multifrontal method factorizes the matrix A (and thus \(A^{-1}\)) as a product of sparse lower and upper triangular matrices. For 3D problems, the factorization step costs \(O(N^2)\) operations, while the application step takes \(O(N^{4/3})\) operations. Many parallel implementations [3, 4, 30] of the multifrontal method have been proposed, and they typically work quite well for problems of moderate size. However, as the problem size goes beyond a couple of million, most implementations, including the distributed-memory ones, hit severe bottlenecks in memory consumption.
The second group consists of iterative solvers [9, 15, 33, 34], including famous algorithms such as the conjugate gradient (CG) method and the multigrid method. Each iteration of these algorithms typically takes O(N) steps and hence the overall cost for solving (2) is proportional to the number of iterations required for convergence. For problems with smooth coefficient functions a(x) and b(x), the number of iterations typically remains rather small and the optimal linear complexity is achieved. However, if the coefficient functions lack regularity or have high contrast, the iteration number typically grows quite rapidly as the problem size increases.
The third group contains the methods based on structured matrices [6–8, 11]. These methods, for example, the \({\mathcal {H}}\)-matrix [18, 20], the \({\mathcal {H}}^2\)-matrix [19], and the hierarchically semiseparable matrix (HSS) [10, 42], are shown to have efficient approximations of linear or quasilinear complexity for the matrices A and \(A^{-1}\). As a result, the algebraic operations of these matrices are of linear or quasilinear complexities as well. More specifically, the recursive inversion and the rocket-style inversion [1] are two popular methods for the inverse operation. For distributed-memory implementations, however, the former lacks parallel scalability [24, 25], while the latter demonstrates scalability only for 1D and 2D problems [1]. For 3D problems, these methods typically suffer from large prefactors that make them less efficient for practical large-scale problems.
A recent group of methods explores the idea of integrating the MF method with the hierarchical matrix [17, 21, 28, 32, 38–41] or block low-rank matrix [2, 35, 36] approach in order to leverage the efficiency of both methods. Instead of directly applying the hierarchical matrix structure to the 3D problems, these methods apply it to the representation of the frontal matrices (i.e., the interactions between the lower-dimensional fronts). These methods are of linear or quasilinear complexities in theory with much smaller prefactors. However, due to the combined complexity, the implementation is highly nontrivial and quite difficult to parallelize [27, 43].
More recently, the hierarchical interpolative factorization (HIF) [22, 23] was proposed as a new way for solving elliptic PDEs and integral equations. As compared to the multifrontal method, the HIF includes an extra step of skeletonizing the fronts in order to reduce the size of the dense frontal matrices. Based on the key observation that the number of skeleton points on each front scales linearly, as it would for one-dimensional fronts, the HIF factorizes the matrix A (and thus \(A^{-1}\)) as a product of sparse matrices that contain only \(O(N)\) nonzero entries in total. In addition, the factorization and application of the HIF are of complexities \(O(N\log N)\) and \(O(N)\), respectively, for N being the total number of degrees of freedom (DOFs) in (2). In practice, the HIF shows significant savings in terms of computational resources required for 3D problems.
1.2 Contribution
This paper proposes the first distributed-memory hierarchical interpolative factorization (DHIF) for solving very large-scale problems. In a nutshell, the DHIF organizes the processes in an octree structure in the same way that the HIF partitions the computation domain. In the simplest setting, each leaf node of the computation domain is assigned a single process. Thanks to the locality of the operator in (1), this process only communicates with its neighbors and all algebraic computations are local within the leaf node. At higher levels, each node of the computation domain is associated with a process group that contains all processes in the subtree starting from this node. The computations are all local within this process group via parallel dense linear algebra, and the communications are carried out between neighboring process groups. By following this octree structure, we make sure that both the communication and computations in the distributed-memory HIF are evenly distributed. As a result, the distributed-memory HIF implementation achieves \(O(\frac{N\log N}{P})\) and \(O(\frac{N}{P})\) parallel complexity for constructing and applying the factorization, respectively, where N is the number of DOFs and P is the number of processes.
We have performed extensive numerical tests. The numerical results support the complexity analysis of the distributed-memory HIF and suggest that the DHIF is a scalable method up to thousands of processes and can be applied to solve large-scale elliptic PDEs.
1.3 Organization
The rest of this paper is organized as follows. In Sect. 2, we review the basic tools needed for both HIF and DHIF, and review the sequential HIF. Section 3 presents the DHIF as a parallel extension of the sequential HIF for 3D problems. Complexity analyses for memory usage, computation time, and communication volume are given at the end of this section. The numerical results detailed in Sect. 4 show that the DHIF is applicable to large-scale problems and achieves parallel scalability up to thousands of processes. Finally, Sect. 5 concludes with some extra discussions on future work.
2 Preliminaries
This section reviews the basic tools and the sequential HIF. First, we start by listing the notations that are widely used throughout this paper.
2.1 Notations
In this paper, we adopt MATLAB notations for simple representation of submatrices. For example, given a matrix A and two index sets, \(s_1\) and \(s_2\), \(A(s_1,s_2)\) represents the submatrix of A with the row indices in \(s_1\) and column indices in \(s_2\). The next two examples explore the usage of MATLAB notation “ : .” With the same settings, \(A(s_1,:)\) represents the submatrix of A with row indices in \(s_1\) and all columns. Another usage of notation “ : ” is to create regularly spaced vectors for integer values i and j, for instance, i : j is the same as \([i,i+1,i+2,\dots ,j]\) for \(i\le j\).
In order to fully exploit the hierarchical structure of the problem, we recursively bisect each dimension of the grid into \(L+1\) levels. Let the leaf level be level 0 and the root level be level L. At level \(\ell \), a cell indexed with \({\mathbf j}\) is of size \(m2^\ell \times m2^\ell \times m2^\ell \) and each point in the cell is in the range, \(\left[ m2^\ell j_1+(0:m2^\ell -1) \right] \times \left[ m2^\ell j_2+(0:m2^\ell -1) \right] \times \left[ m2^\ell j_3+(0:m2^\ell -1) \right] ,\) for \({\mathbf j}= (j_1,j_2,j_3)\) and \(0\le j_1,j_2,j_3<2^{L-\ell }\). \(C^\ell _{\mathbf j}\) denotes the grid point set of the cell at level \(\ell \) indexed with \({\mathbf j}\).
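The index ranges above can be enumerated directly. The following is a minimal Python sketch, not part of the original implementation; the helper names `matlab_range` and `cell_points` and the sizes `m = 2`, `L = 2` are hypothetical and chosen only for illustration.

```python
from itertools import product

def matlab_range(i, j):
    """Inclusive integer range i:j, mirroring the MATLAB notation used in the text."""
    return list(range(i, j + 1))

def cell_points(m, level, j):
    """Grid points of the cell C^level_j, where m is the leaf cell width.

    Along each axis d the cell covers m*2^level*j_d + (0 : m*2^level - 1).
    """
    width = m * 2 ** level
    axes = [matlab_range(width * jd, width * jd + width - 1) for jd in j]
    return list(product(*axes))

# Hypothetical sizes: m = 2 points per leaf cell and L = 2, so n = m * 2^L = 8.
m, L = 2, 2
pts = cell_points(m, 1, (1, 0, 1))
print(len(pts), pts[0], pts[-1])
```

At level 1 the cell width is 4, so the cell with index (1, 0, 1) covers the axes 4:7, 0:3, and 4:7.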
Commonly used notations
Notation  Description 

n  Number of points in each dimension of the grid 
N  Number of points in the grid 
h  Grid gap size 
\(\ell \)  Level number in the hierarchical structure 
L  Level number of the root level in the hierarchical structure 
\({\mathbf e}_1\), \({\mathbf e}_2\), \({\mathbf e}_3\)  Unit vector along each dimension 
\({\mathbf 0}\)  Zero vector 
\({\mathbf j}\)  Triplet index \({\mathbf j}=(j_1,j_2,j_3)\) 
\({\mathbf x}_{\mathbf j}\)  Point on the grid indexed with \({\mathbf j}\) 
\(\varOmega \)  The set of all points on the grid 
\(C^\ell _{\mathbf j}\)  Cell at level \(\ell \) with index \({\mathbf j}\) 
\({{\mathcal {C}}}^\ell \)  \({{\mathcal {C}}}^\ell = \{C^\ell _{\mathbf j}\}_{\mathbf j}\) is the set of all cells at level \(\ell \) 
\({{\mathcal {F}}}^\ell _{\mathbf j}\)  Set of all surrounding faces of cell \(C^\ell _{\mathbf j}\) 
\({{\mathcal {F}}}^\ell \)  Set of all faces at level \(\ell \) 
\(I^\ell _{\mathbf j}\)  Interior of \(C^\ell _{\mathbf j}\) 
\({{\mathcal {I}}}^\ell \)  \({{\mathcal {I}}}^\ell = \{I^\ell _{\mathbf j}\}_{\mathbf j}\) is the set of all interiors at level \(\ell \) 
\(\Sigma ^\ell \)  The set of active DOFs at level \(\ell \) 
\(\Sigma ^\ell _{\mathbf j}\)  The set of active DOFs at level \(\ell \) with process group index \({\mathbf j}\) 
\(p^\ell _{\mathbf j}\), \(p^\ell \)  The process group at level \(\ell \) with/without index \({\mathbf j}\) 
2.2 Sparse elimination
2.3 Skeletonization
Skeletonization is a tool for eliminating redundant point sets from a symmetric matrix that has low-rank off-diagonal blocks. The key step in skeletonization uses the interpolative decomposition [12, 29] of low-rank matrices.
2.4 Sequential HIF

A set \(\Sigma \) of DOFs of A is called active if \(A_{\Sigma \Sigma }\) is not a diagonal matrix or \(A_{{\bar{\Sigma }}\Sigma }\) is a nonzero matrix;

A set \(\Sigma \) of DOFs of A is called inactive if \(A_{\Sigma \Sigma }\) is a diagonal matrix and \(A_{{\bar{\Sigma }}\Sigma }\) is a zero matrix.
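The two definitions can be tested mechanically on a small dense matrix. The sketch below is illustrative only: the function name `is_active` and the 4 by 4 matrix are hypothetical, and a real implementation would work on sparse storage.

```python
def is_active(A, sigma):
    """Return True if the DOF set sigma is active in the square matrix A (list of lists)."""
    sigma = sorted(set(sigma))
    inside = set(sigma)
    rest = [i for i in range(len(A)) if i not in inside]
    # A_{Sigma,Sigma} is not diagonal: some off-diagonal entry inside sigma is nonzero.
    not_diagonal = any(A[i][j] != 0 for i in sigma for j in sigma if i != j)
    # A_{Sigma-bar,Sigma} is nonzero: sigma still couples to the remaining DOFs.
    coupled = any(A[i][j] != 0 for i in rest for j in sigma)
    return not_diagonal or coupled

# Hypothetical matrix in which DOF 0 has already been decoupled from the others.
A = [
    [2, 0, 0, 0],
    [0, 4, 1, 0],
    [0, 1, 4, 1],
    [0, 0, 1, 4],
]
print(is_active(A, [0]))
print(is_active(A, [1, 2]))
```

Here DOF 0 is inactive, since its diagonal block is trivially diagonal and it has no coupling to the rest, while the set {1, 2} is active.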

Preliminary Let \(A^0=A\) be the sparse symmetric matrix in (17), and let \(\Sigma ^0\) be the initial active DOFs of A, which include all indices.

Level \(\ell \) for \(\ell =0,\ldots ,L-1\).

– Preliminary Let \(A^\ell \) denote the matrix before any elimination at level \(\ell \). \(\Sigma ^\ell \) is the corresponding active DOFs. Let us recall the notations in Sect. 2.1. \(C^\ell _{\mathbf j}\) denotes the active DOFs in the cell at level \(\ell \) indexed with \({\mathbf j}\). \({{\mathcal {F}}}^\ell _{\mathbf j}\) and \(I^\ell _{\mathbf j}\) denote the surrounding faces and interior active DOFs in the corresponding cell, respectively.

– Sparse elimination We first focus on a single cell at level \(\ell \) indexed with \({\mathbf j}\), i.e., \(C^\ell _{\mathbf j}\). To simplify the notation, we drop the superscript and subscript for now and introduce \(C=C^\ell _{\mathbf j}\), \(I=I^\ell _{\mathbf j}\), \(F={{\mathcal {F}}}^\ell _{\mathbf j}\), and \(R=R^\ell _{\mathbf j}\). Based on the discretization and previous level eliminations, the interior active DOFs interact only with themselves and their surrounding faces. The interactions of the interior active DOFs and the rest DOFs are empty and the corresponding matrix is zero, \(A^\ell (R,I) = 0\). Hence, by applying sparse elimination, we have,
$$\begin{aligned} S_I^T A^\ell S_I = \begin{bmatrix} D_I&\quad&\quad \\&\quad B^\ell _{F F}&\quad {\left( A^\ell _{R F} \right) ^T} \\&\quad A^\ell _{R F}&\quad A^\ell _{R R}\\ \end{bmatrix}, \end{aligned}$$(18)
where the explicit definitions of \(B^\ell _{F F}\) and \(S_{I}\) are given in the discussion of sparse elimination. This factorization eliminates I from the active DOFs of \(A^\ell \). Looping over all cells \(C^\ell _{\mathbf j}\) at level \(\ell \), we obtain
$$\begin{aligned} {\widetilde{A}}^\ell = \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) ^T A^\ell \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) , \end{aligned}$$(19)
$$\begin{aligned} {\widetilde{\Sigma }}^\ell = \Sigma ^\ell \setminus \bigcup _{I\in {{\mathcal {I}}}^\ell } I. \end{aligned}$$(20)
Now all the active interior DOFs at level \(\ell \) are eliminated from \(\Sigma ^\ell \).

– Skeletonization Each face at level \(\ell \) not only interacts within its own cell but also interacts with faces of neighbor cells. Since the interaction between any two different faces is low rank, this leads us to apply skeletonization. The skeletonization for any face \(F \in {{\mathcal {F}}}^\ell \) gives the factorization (21), where \({\widehat{F}}\) is the skeleton DOFs of F, \(\bar{\bar{F}}\) is the redundant DOFs of F, and R refers to the rest DOFs. Due to the elimination from previous levels, \(|F|\) scales as \(O(m2^\ell )\) and \({\widetilde{A}}^\ell _{R F}\) contains a nonzero submatrix of size \(O(m2^\ell )\times O(m2^\ell )\). Therefore, the interpolative decomposition can be formed efficiently. Readers are referred to Sect. 2.3 for the explicit forms of each matrix in (21). Looping over all faces at level \(\ell \), we obtain
$$\begin{aligned} A^{\ell +1}&\approx \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}} Q_{F}\right) ^T {\widetilde{A}}^\ell \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}} Q_{F}\right) \nonumber \\&= \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}} Q_{F}\right) ^T \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) ^T A^\ell \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}} Q_{F}\right) \nonumber \\&= {\left( W^\ell \right) ^T} A^\ell W^\ell , \end{aligned}$$(22)
where \(W^\ell = \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}} Q_{F}\right) \). The active DOFs for the next level are now defined as,
$$\begin{aligned} \Sigma ^{\ell +1} = {\widetilde{\Sigma }}^\ell \setminus \bigcup _{F\in {{\mathcal {F}}}^\ell } \bar{\bar{F}}= \Sigma ^{\ell } \setminus \left( \left( \bigcup _{I\in {{\mathcal {I}}}^\ell } I\right) \bigcup \left( \bigcup _{F\in {{\mathcal {F}}}^\ell } \bar{\bar{F}}\right) \right) . \end{aligned}$$(23)


Level L Finally, \(A^L\) and \(\Sigma ^L\) are the matrix and active DOFs at level L. Up to a permutation, \(A^L\) can be factorized as
$$\begin{aligned} A^L = \begin{bmatrix} A^L_{\Sigma ^L\Sigma ^L}&\\&D_R \end{bmatrix} = \begin{bmatrix} L_{\Sigma ^L}&\\&I \end{bmatrix} \begin{bmatrix} D_{\Sigma ^L}&\\&D_R \end{bmatrix} \begin{bmatrix} L_{\Sigma ^L}^T&\\&I \end{bmatrix} := {\left( W^L \right) ^{-T}}D{\left( W^L \right) ^{-1}}. \end{aligned}$$(24)
Combining all these factorization results gives
$$\begin{aligned} A \approx {\left( W^0 \right) ^{-T}}\cdots {\left( W^{L-1} \right) ^{-T}}{\left( W^L \right) ^{-T}} D{\left( W^L \right) ^{-1}}{\left( W^{L-1} \right) ^{-1}}\cdots {\left( W^0 \right) ^{-1}} \equiv F \end{aligned}$$(25)
and
$$\begin{aligned} A^{-1} \approx W^0\cdots W^{L-1} W^L D^{-1} {\left( W^L \right) ^T}{\left( W^{L-1} \right) ^T}\cdots {\left( W^0 \right) ^T} = F^{-1}. \end{aligned}$$(26)
\(A^{-1}\) is factorized into a multiplicative sequence of matrices \(W^\ell \) and each \(W^\ell \) corresponding to level \(\ell \) is again a multiplicative sequence of sparse matrices, \(S_I\), \(S_{\bar{\bar{F}}}\) and \(Q_F\). Due to the fact that any \(S_I\), \(S_{\bar{\bar{F}}}\) or \(Q_F\) contains a small nontrivial (i.e., neither identity nor empty) matrix of size \(O(\frac{N^{1/3}}{2^{L-\ell }})\times O(\frac{N^{1/3}}{2^{L-\ell }})\), the overall complexity for storing and applying \(W^\ell \) is \(O(N/2^\ell )\). Hence, the application of the inverse of A is of \(O(N)\) computation and memory complexity.
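The sparse elimination identity (18) used at each level can be checked on a toy matrix. The following Python sketch is only an illustration under stated assumptions: every block holds a single DOF, so the \(LDL^T\) factors of \(A_{II}\) reduce to \(L_I = 1\) and \(D_I = A_{II}\), and \(S_I\) is assumed to have the block form given for the distributed algorithm in (28). The helper names `matmul` and `transpose` and the 3 by 3 matrix are hypothetical.

```python
from fractions import Fraction as Fr

def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

# DOF ordering (I, F, R) with one DOF per block; A(R, I) = 0 as required in the text.
A = [[Fr(4), Fr(-1), Fr(0)],
     [Fr(-1), Fr(4), Fr(-1)],
     [Fr(0), Fr(-1), Fr(4)]]
a_II, a_FI = A[0][0], A[1][0]

# Assumed form of S_I for 1x1 blocks: identity except for the (I, F) entry,
# which holds -A_II^{-1} A_FI^T.
S_I = [[Fr(1), -a_FI / a_II, Fr(0)],
       [Fr(0), Fr(1), Fr(0)],
       [Fr(0), Fr(0), Fr(1)]]

reduced = matmul(transpose(S_I), matmul(A, S_I))
B_FF = A[1][1] - a_FI * a_FI / a_II  # Schur complement update of the face block

for row in reduced:
    print([str(x) for x in row])
```

After the transformation the interior DOF is decoupled, the face block is replaced by the Schur complement \(B_{FF} = 15/4\), and the blocks involving R are unchanged.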
3 Distributedmemory HIF
This section describes the algorithm for the distributed-memory HIF.
3.1 Process tree
3.2 Distributedmemory method
As in Sect. 2.4, we define the \(n \times n\times n\) grid on \(\varOmega =[0,1)^3\) for \(n=m2^L\), where \(m=O(1)\) is a small positive integer and \(L=O(\log N)\) is the level number of the root level. Discretizing (1) with a seven-point stencil on the grid provides the linear system \(Au=f\), where A is a sparse \(N\times N\) SPD matrix, \(u\in {\mathbb {R}}^N\) is the unknown function at grid points, and \(f\in {\mathbb {R}}^N\) is the given function at grid points.

Preliminary Construct the process tree with \(8^L\) processes. Each process group \(p^0_{\mathbf j}\) owns the data corresponding to cell \(C^0_{\mathbf j}\) and the set of active DOFs in \(C^0_{\mathbf j}\) is denoted as \(\Sigma ^0_{\mathbf j}\), where \({\mathbf j}=(j_1,j_2,j_3)\) and \(0\le j_1,j_2,j_3 < 2^L\). Set \(A^0=A\) and let the process group \(p^0_{\mathbf j}\) own \(A^0(:,\Sigma ^0_{\mathbf j})\), which is a sparse tall-skinny matrix with \(O(N/P)\) nonzero entries.

Level \(\ell \) for \(\ell =0,\ldots ,L-1\).

– Preliminary Let \(A^\ell \) denote the matrix before any elimination at level \(\ell \). \(\Sigma ^\ell _{\mathbf j}\) denotes the active DOFs owned by the process group \(p^\ell _{\mathbf j}\) for \({\mathbf j}=(j_1,j_2,j_3)\), \(0\le j_1,j_2,j_3 < 2^{L-\ell }\), and the nonzero submatrix of \(A^\ell (:,\Sigma ^\ell _{\mathbf j})\) is distributed among the process group \(p^\ell _{\mathbf j}\) using the two-dimensional block-cyclic distribution.

– Sparse elimination The process group \(p^\ell _{\mathbf j}\) owns \(A^\ell (:,\Sigma ^\ell _{\mathbf j})\), which is sufficient for performing sparse elimination for \(I^\ell _{\mathbf j}\). To simplify the notation, we define \(I=I^\ell _{\mathbf j}\) as the active interior DOFs of cell \(C^\ell _{\mathbf j}\), \(F={{\mathcal {F}}}^\ell _{\mathbf j}\) as the surrounding faces, and \(R=R^\ell _{\mathbf j}\) as the rest active DOFs. Sparse elimination at level \(\ell \) within the process group \(p^\ell _{\mathbf j}\) performs essentially
$$\begin{aligned} S_I^{T} A^\ell S_I = \begin{bmatrix} D_I&\quad&\quad \\&\quad B^\ell _{FF}&\quad {\left( A^\ell _{RF} \right) ^T}\\&\quad A^\ell _{RF}&\quad A^\ell _{RR} \end{bmatrix}, \end{aligned}$$(27)
where \(B^\ell _{FF} = A^\ell _{FF}-A^\ell _{FI}{\left( A^\ell _{II} \right) ^{-1}} {\left( A^\ell _{FI} \right) ^T}\),
$$\begin{aligned} S_I = \begin{bmatrix} {\left( L^\ell _I \right) ^{-T}}&\quad -{\left( A^\ell _{II} \right) ^{-1}}{\left( A^\ell _{FI} \right) ^T}&\quad \\&\quad I&\quad \\&\quad&\quad I \end{bmatrix} \end{aligned}$$(28)
with \(L^\ell _I D_I {\left( L^\ell _I \right) ^T} = A^\ell _{II}\). Since \(A^\ell (:,\Sigma ^\ell _{\mathbf j})\) is owned locally by \(p^\ell _{\mathbf j}\), both \(A^\ell _{FI}\) and \(A^\ell _{II}\) are local matrices. All nontrivial (i.e., neither identity nor empty) submatrices in \(S_I\) are formed locally and stored locally for application. On the other hand, updating \(A^\ell _{FF}\) requires some communication in the next step.

– Communication after sparse elimination After all sparse eliminations are performed, some communication is required to update \(A^\ell _{FF}\) for each cell \(C^\ell _{\mathbf j}\). For the problem (1) with the periodic boundary conditions, each face at level \(\ell \) is the surrounding face of exactly two cells. The owning process groups of these two cells need to communicate with each other to apply the additive updates, a submatrix of \(-A^\ell _{FI}{\left( A^\ell _{II} \right) ^{-1}}{\left( A^\ell _{FI} \right) ^T}\). Once all communications are finished, the parallel sparse elimination does the rest of the computation, which can be conceptually denoted as
$$\begin{aligned} {\widetilde{A}}^\ell&= {\left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I} \right) ^T} A^\ell \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) ,\\ {\widetilde{\Sigma }}^\ell _{\mathbf j}&= \Sigma ^\ell _{\mathbf j}\setminus \bigcup _{I\in {{\mathcal {I}}}^\ell }I, \end{aligned}$$(29)
for \({\mathbf j}=(j_1,j_2,j_3), 0\le j_1,j_2,j_3<2^{L-\ell }\).

– Skeletonization For each face F owned by \(p^\ell _{\mathbf j}\), the corresponding matrices \({\widetilde{A}}^\ell (:,F)\) are stored locally. Similar to the parallel sparse elimination part, most operations are local at the process group \(p^\ell _{\mathbf j}\) and can be carried out using the dense parallel linear algebra efficiently. By forming a parallel interpolative decomposition (ID) for \({\widetilde{A}}^\ell _{RF} = \begin{bmatrix} {\widetilde{A}}^\ell _{R{\widehat{F}}} T^\ell _F&{\widetilde{A}}^\ell _{R{\widehat{F}}} \end{bmatrix}\), the parallel skeletonization can be, conceptually, written as (30), where the definitions of \(Q_F\) and \(S_{\bar{\bar{F}}}\) are given in the discussion of skeletonization. Since \({\widetilde{A}}^\ell _{\bar{\bar{F}}\,\bar{\bar{F}}}\), \({\widetilde{A}}^\ell _{{\widehat{F}}\bar{\bar{F}}}\), \({\widetilde{A}}^\ell _{{\widehat{F}}{\widehat{F}}}\) and \(T^\ell _{F}\) are all owned by \(p^\ell _{\mathbf j}\), it requires only local operations to form
$$\begin{aligned} {\widetilde{B}}^\ell _{\bar{\bar{F}}\,\bar{\bar{F}}}&= {\widetilde{A}}^\ell _{\bar{\bar{F}}\,\bar{\bar{F}}} - {\left( T^\ell _F \right) ^T}{\widetilde{A}}^\ell _{{\widehat{F}}\bar{\bar{F}}} - {\left( {\widetilde{A}}^\ell _{{\widehat{F}}\bar{\bar{F}}} \right) ^T}T^\ell _F + {\left( T^\ell _F \right) ^T}{\widetilde{A}}^\ell _{{\widehat{F}}{\widehat{F}}}T^\ell _F,\\ {\widetilde{B}}^\ell _{{\widehat{F}}\bar{\bar{F}}}&= {\widetilde{A}}^\ell _{{\widehat{F}}\bar{\bar{F}}} - {\widetilde{A}}^\ell _{{\widehat{F}}{\widehat{F}}}T^\ell _F,\\ {\widetilde{B}}^\ell _{{\widehat{F}}{\widehat{F}}}&= {\widetilde{A}}^\ell _{{\widehat{F}}{\widehat{F}}} - {\widetilde{B}}^\ell _{{\widehat{F}}\bar{\bar{F}}}{\left( {\widetilde{B}}^\ell _{\bar{\bar{F}}\,\bar{\bar{F}}} \right) ^{-1}} {\left( {\widetilde{B}}^\ell _{{\widehat{F}}\bar{\bar{F}}} \right) ^T}. \end{aligned}$$(31)
Similarly, \(L_{\bar{\bar{F}}}\), which is the \(LDL^T\) factor of \({\widetilde{B}}_{\bar{\bar{F}}\,\bar{\bar{F}}}\), is also formed within the process group \(p^\ell _{\mathbf j}\). Moreover, since nontrivial blocks in \(Q_F\) and \(S_{\bar{\bar{F}}}\) are both local, this implies that the applications of \(Q_F\) and \(S_{\bar{\bar{F}}}\) are local operations.
As a result, the parallel skeletonization factorizes \(A^\ell \) conceptually as
$$\begin{aligned} A^{\ell +1}&\approx {\left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}}Q_{F} \right) ^T} {\widetilde{A}}^\ell \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}}Q_{F}\right) \nonumber \\&= {\left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}}Q_{F} \right) ^T} {\left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I} \right) ^T} A^\ell \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}}Q_{F}\right) \end{aligned}$$(32)
and we can define
$$\begin{aligned} W^{\ell }&= \left( \prod _{I\in {{\mathcal {I}}}^\ell }S_{I}\right) \left( \prod _{F\in {{\mathcal {F}}}^\ell }S_{\bar{\bar{F}}}Q_{F}\right) ,\\ \Sigma ^{\ell +1/2}_{\mathbf j}&= {\widetilde{\Sigma }}^\ell _{\mathbf j}\setminus \bigcup _{F\in {{\mathcal {F}}}^\ell }\bar{\bar{F}}\\&= \Sigma ^\ell _{\mathbf j}\setminus \left( \left( \bigcup _{F\in {{\mathcal {F}}}^\ell }\bar{\bar{F}}\right) \bigcup \left( \bigcup _{I\in {{\mathcal {I}}}^\ell }I\right) \right) . \end{aligned}$$(33)
We would like to emphasize that the factors \(W^{\ell }\) are evenly distributed among the process groups at level \(\ell \) and that all nontrivial blocks are stored locally.

– Merging and redistribution Toward the end of the factorization at level \(\ell \), we need to merge the process groups and redistribute the data associated with the active DOFs in order to prepare for the work at level \(\ell +1\). For each process group at level \(\ell +1\), \(p^{\ell +1}_{\mathbf j}\), for \({\mathbf j}=(j_1,j_2,j_3)\), \(0\le j_1,j_2,j_3 < 2^{L-\ell -1}\), we first form its active DOF set \(\Sigma ^{\ell +1}_{\mathbf j}\) by merging \(\Sigma ^{\ell +1/2}_{{\mathbf j}_c}\) from all its children \(p^\ell _{{\mathbf j}_c}\), where \({\lfloor {\mathbf j}_c/2 \rfloor } = {\mathbf j}\). In addition, \(A^{\ell +1}(:,\Sigma ^{\ell +1}_{\mathbf j})\) is separately owned by \(\{p^\ell _{{\mathbf j}_c}\}_{{\lfloor {\mathbf j}_c/2 \rfloor }={\mathbf j}}\). A redistribution among \(p^{\ell +1}_{\mathbf j}\) is needed in order to reduce the communication cost for future parallel dense linear algebra. Although this redistribution requires a global communication among \(p^{\ell +1}_{\mathbf j}\), the complexities for message and bandwidth are bounded by the cost for parallel dense linear algebra. Actually, as we shall see in the numerical results, its cost is far lower than that of the parallel dense linear algebra.
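The index bookkeeping in the merging step can be sketched in a few lines of Python. This is an illustration only, not the authors' MPI code: the function names `children`, `parent`, and `merge_active`, and the string labels standing in for DOF indices, are hypothetical.

```python
from itertools import product

def children(j):
    """Level-l child indices j_c with floor(j_c / 2) == j, taken componentwise."""
    return [tuple(2 * jd + b for jd, b in zip(j, bits))
            for bits in product((0, 1), repeat=3)]

def parent(jc):
    """Index of the level-(l+1) process group that absorbs the child group jc."""
    return tuple(c // 2 for c in jc)

def merge_active(sigma_half, j):
    """Union of the children's remaining active DOF sets.

    sigma_half maps a child index j_c to its set Sigma^{l+1/2}_{j_c}.
    """
    merged = set()
    for jc in children(j):
        merged |= sigma_half.get(jc, set())
    return merged

kids = children((1, 0, 1))
# Hypothetical DOF labels: each child keeps two skeleton DOFs after level l.
sigma_half = {jc: {(jc, 0), (jc, 1)} for jc in kids}
print(len(kids), len(merge_active(sigma_half, (1, 0, 1))))
```

Each parent group absorbs exactly eight children, which is why the process group size grows by a factor of eight per level.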


Level L factorization The parallel factorization at level L is quite similar to the sequential one. After factorizations from all previous levels, \(A^L(\Sigma ^L_{\mathbf 0},\Sigma ^L_{\mathbf 0})\) is distributed among \(p^L_{\mathbf 0}\). A parallel \(LDL^T\) factorization of \(A^L_{\Sigma ^L_{\mathbf 0}\Sigma ^L_{\mathbf 0}} = A^L(\Sigma ^L_{\mathbf 0},\Sigma ^L_{\mathbf 0})\) among the processes in \(p^L_{\mathbf 0}\) results in
$$\begin{aligned} A^L = \begin{bmatrix} A^L_{\Sigma ^L_{\mathbf 0}\Sigma ^L_{\mathbf 0}}&\\&D_R \end{bmatrix} = \begin{bmatrix} L^L_{\Sigma ^L_{\mathbf 0}}&\\&I \end{bmatrix} \begin{bmatrix} D^L_{\Sigma ^L_{\mathbf 0}}&\\&D_R \end{bmatrix} \begin{bmatrix} {\left( L^L_{\Sigma ^L_{\mathbf 0}} \right) ^T}&\\&I \end{bmatrix} = {\left( W^L \right) ^{-T}}D{\left( W^L \right) ^{-1}}. \end{aligned}$$(34)
Consequently, we form the DHIF for A and \(A^{-1}\) as
$$\begin{aligned} A \approx {\left( W^0 \right) ^{-T}} \cdots {\left( W^{L-1} \right) ^{-T}} {\left( W^L \right) ^{-T}} D {\left( W^L \right) ^{-1}} {\left( W^{L-1} \right) ^{-1}} \cdots {\left( W^0 \right) ^{-1}} \equiv F \end{aligned}$$(35)
and
$$\begin{aligned} A^{-1} \approx W^0 \cdots W^{L-1} W^L D^{-1} {\left( W^L \right) ^T} {\left( W^{L-1} \right) ^T} \cdots {\left( W^0 \right) ^T} = F^{-1}. \end{aligned}$$(36)
The factors \(W^\ell \) are evenly distributed among all processes and the application of \(F^{-1}\) is basically a sequence of parallel dense matrix–vector multiplications.
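The local updates in (31) can be verified on a toy example. The sketch below is a hedged illustration with one DOF per block and an exact interpolative decomposition; the lower-triangular form of \(Q_F\) used here is an assumption chosen so that it annihilates the coupling between the redundant DOFs and R, and the helper names and numeric values are hypothetical.

```python
from fractions import Fraction as Fr

def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

# DOF ordering (redundant, skeleton, rest), one DOF per block.
t = Fr(1, 2)                           # interpolation matrix T_F (1x1 here)
a_bb, a_hb, a_hh = Fr(3), Fr(1), Fr(4)
a_Rh, a_RR = Fr(2), Fr(5)
a_Rb = a_Rh * t                        # exact ID: A(R, redundant) = A(R, skeleton) T_F
A = [[a_bb, a_hb, a_Rb],
     [a_hb, a_hh, a_Rh],
     [a_Rb, a_Rh, a_RR]]

# Assumed form of Q_F: subtract T_F times the skeleton column from the redundant one.
Q = [[Fr(1), Fr(0), Fr(0)], [-t, Fr(1), Fr(0)], [Fr(0), Fr(0), Fr(1)]]
QtAQ = matmul(transpose(Q), matmul(A, Q))

# Block updates following (31).
b_bb = a_bb - t * a_hb - a_hb * t + t * a_hh * t
b_hb = a_hb - a_hh * t
b_hh = a_hh - b_hb * b_hb / b_bb

# Sparse elimination of the redundant DOF (1x1 block, so L = 1 and D = b_bb).
S = [[Fr(1), -b_hb / b_bb, Fr(0)], [Fr(0), Fr(1), Fr(0)], [Fr(0), Fr(0), Fr(1)]]
final = matmul(transpose(S), matmul(QtAQ, S))
print([[str(x) for x in row] for row in final])
```

After applying \(Q_F\), the redundant DOF no longer couples to R, so it can be removed by a sparse elimination that only touches the skeleton block.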
3.3 Complexity analysis
3.3.1 Memory complexity
There are two places in the distributed algorithm that require heavy memory usage. The first one is to store the original matrix A and its updated version \(A^\ell \) for each level \(\ell \). As we mentioned above in the parallel algorithm, \(A^\ell \) contains at most \(O(N)\) nonzeros and they are evenly distributed on P processes as follows. At level \(\ell \), there are \(8^{L-\ell }\) cells, each of which empirically contains \(O(\frac{N^{1/3}}{2^{L-\ell }})\) active DOFs. Meanwhile, each cell is evenly owned by a process group with \(8^\ell \) processes. Hence, \(O((\frac{N^{1/3}}{2^{L-\ell }})^2)\) nonzero entries of \(A^\ell (:,\Sigma ^\ell _{\mathbf j})\) are evenly distributed on process group \(p^\ell _{\mathbf j}\) with \(8^\ell \) processes. Overall, there are \(O(8^{L-\ell }\cdot \frac{N^{2/3}}{4^{L-\ell }}) =O(N\cdot 2^{-\ell })\) nonzero entries in \(A^\ell \) evenly distributed on \(8^{L-\ell }\cdot 8^\ell = P\) processes, and each process owns \(O(\frac{N}{P}\cdot 2^{-\ell })\) data for \(A^\ell \). Moreover, the factorization at level \(\ell \) does not rely on the matrix \(A^{\ell '}\) for \(\ell '<\ell -1\). Therefore, the memory cost for storing \(A^\ell \)s is \(O(\frac{N}{P})\) for each process.
The second place is to store the factors \(W^\ell \). It is not difficult to see that the memory cost for each \(W^\ell \) is the same as \(A^{\ell }\). Only nontrivial blocks in \(S_I\), \(Q_F\), and \(S_{\bar{\bar{F}}}\) require storage. Since each of these nontrivial blocks is of size \(O(\frac{N^{1/3}}{2^{L-\ell }})\times O(\frac{N^{1/3}}{2^{L-\ell }})\) and evenly distributed on \(8^\ell \) processes, the overall memory requirement for each \(W^\ell \) on a process is \(O(\frac{N}{P}\cdot 2^{-\ell })\). Therefore, \(O(\frac{N}{P})\) memory is required on each process to store all \(W^\ell \)s.
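As a quick check of the arithmetic behind this bound, the following sketch assumes the setting of Sect. 3.2 with \(N^{1/3}=m2^L\) and \(P=8^L\), and drops all constants. The per-process share of one nontrivial block at level \(\ell \), and the sum over levels, are
$$\begin{aligned} \frac{1}{8^\ell }\left( \frac{N^{1/3}}{2^{L-\ell }}\right) ^2 = \frac{(m2^\ell )^2}{8^\ell } = m^2\, 2^{-\ell } = \frac{1}{m}\cdot \frac{N}{P}\cdot 2^{-\ell }, \qquad \sum _{\ell =0}^{L} \frac{N}{P}\cdot 2^{-\ell } < \frac{2N}{P}, \end{aligned}$$
so the geometric decay in \(\ell \) keeps the storage summed over all levels at \(O(\frac{N}{P})\) per process.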
3.3.2 Computation complexity
The majority of the computation work goes to the construction of \(S_I\), \(Q_{F}\), and \(S_{\bar{\bar{F}}}\). As stated in the previous section, at level \(\ell \), each nontrivial dense matrix in these factors is of size \(O(\frac{N^{1/3}}{2^{L-\ell }})\times O(\frac{N^{1/3}}{2^{L-\ell }})\). The construction adopts the matrix–matrix multiplication, the interpolative decomposition (pivoting QR), the \(LDL^T\) factorization, and the triangular matrix inversion. Each of these operations has cubic computational complexity, and the corresponding parallel computation cost over \(8^\ell \) processes is \(O(\frac{N}{P})\). Since there are only a constant number of these operations per process at a single level, the total computational complexity across all \(O(\log N)\) levels is \(O(\frac{N\log N}{P})\).
The application computational complexity is simply the complexity of applying each nonzero entry in the \(W^\ell \)s once; hence, the overall computational complexity is the same as the memory complexity \(O(\frac{N}{P})\).
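The per-level cost claimed above can be reproduced with a short calculation. The Python sketch below assumes \(N^{1/3}=m2^L\) and \(P=8^L\) as in Sect. 3.2, counts one cubic-cost dense operation per level, and ignores all constants; the function name `per_process_flops` and the sizes `m = 4`, `L = 5` are hypothetical.

```python
def per_process_flops(m, L, level):
    """Cubic dense cost of one front at `level`, shared evenly by 8^level processes."""
    block = m * 2 ** level          # N^(1/3) / 2^(L - level), using N^(1/3) = m * 2^L
    return block ** 3 / 8 ** level  # constants dropped

m, L = 4, 5
N, P = (m * 2 ** L) ** 3, 8 ** L
per_level = [per_process_flops(m, L, level) for level in range(L + 1)]
total = sum(per_level)
print(per_level, N / P, total)
```

Every level contributes the same \(N/P\) work per process under these assumptions, so summing over the \(L+1 = O(\log N)\) levels gives the \(O(\frac{N\log N}{P})\) factorization cost.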
3.3.3 Communication complexity
The communication complexity is composed of three parts: the communication in the parallel dense linear algebra, the communication after sparse elimination, and the merging and redistribution step within DHIF. It is clear that the communication cost for the second part is bounded by that of either of the other two. Hence, we will simply derive the communication cost for the first and third parts. Here, we adopt the simplified communication model, \(T_\mathrm{comm} = \alpha + \beta \), where \(\alpha \) is the latency and \(\beta \) is the inverse bandwidth.
4 Numerical results
Here we present a couple of numerical examples to demonstrate the parallel efficiency of the distributed-memory HIF. The algorithm is implemented in C++11 and all interprocess communication is expressed via the message passing interface (MPI). The distributed-memory dense linear algebra computation is done through the Elemental library [31]. All numerical examples are performed on Edison at the National Energy Research Scientific Computing Center (NERSC). The numbers of processes used are always powers of two, ranging from 1 to 8192. The memory allocated for each process is limited to 2 GB.
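For readers unfamiliar with the two-dimensional block-cyclic distribution used for the dense blocks in Sect. 3.2, the following Python sketch shows the generic owner computation. It is not Elemental's API: the function names `owner` and `local_count`, the block size `nb`, and the process grid shape `pr` by `pc` are all hypothetical parameters introduced for illustration.

```python
def owner(i, j, nb, pr, pc):
    """Process-grid coordinates owning entry (i, j) in a 2D block-cyclic layout."""
    return ((i // nb) % pr, (j // nb) % pc)

def local_count(n_rows, n_cols, nb, pr, pc):
    """Number of matrix entries stored on each process of the pr x pc grid."""
    counts = {}
    for i in range(n_rows):
        for j in range(n_cols):
            key = owner(i, j, nb, pr, pc)
            counts[key] = counts.get(key, 0) + 1
    return counts

# An 8 x 8 matrix with 2 x 2 blocks dealt out cyclically on a 2 x 2 process grid.
counts = local_count(8, 8, nb=2, pr=2, pc=2)
print(owner(5, 2, 2, 2, 2), counts)
```

Dealing blocks out cyclically in both dimensions keeps the per-process load balanced, which is what makes the even-distribution assumptions in the complexity analysis plausible.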
Table 2 Notations for the numerical results
| Notation | Explanation |
| --- | --- |
| \(\epsilon \) | Relative precision of the ID |
| \(N\) | Total number of DOFs in the problem |
| \(e_s\) | Relative error for solving, \(\Vert (I-F^{-1}A)x\Vert / \Vert x\Vert \), where \(x\) is a Gaussian random vector |
| \(\Sigma _L\) | Number of remaining active DOFs at the root level (column \(s_L\) in the result tables) |
| \(m_f\) | Maximum memory required to perform the factorization in GB across all processes |
| \(t_f\) | Time for constructing the factorization in seconds |
| \(E^S_f\) | Strong scaling efficiency for factorization time |
| \(t_s\) | Time for applying \(F^{-1}\) to a vector in seconds |
| \(E^S_s\) | Strong scaling efficiency for application time |
| \(n_\mathrm{iter}\) | Number of GMRES iterations to solve \(Au=f\) to a tolerance of \(10^{-12}\), with \(F^{-1}\) as a preconditioner |
The notations used in the following tables and figures are listed in Table 2. For simplicity, all examples are defined over \(\varOmega = [0,1)^3\) with periodic boundary conditions, discretized on a uniform \(n\times n\times n\) grid, with \(n\) being the number of points in each dimension and \(N=n^3\). The PDEs defined in (1) are discretized using the second-order central difference method with a seven-point stencil, which is the same as (16). Octrees are adopted to partition the computational domain, with the block size at the leaf level bounded by 64.
Example 1
We first consider the problem in (1) with \(a(x) \equiv 1\) and \(b(x) \equiv 0.1\). The relative precision of the ID is set to be \(\epsilon = 10^{-3}\).
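The discretization used in this example can be applied without forming a matrix. The Python sketch below does so for constant coefficients, under the assumption that (1) has the form \(-\nabla \cdot (a\nabla u) + bu = f\) (equation (1) is not restated in this section). The function name `apply_operator` and the flat index layout are hypothetical choices made for illustration, not taken from the DHIF implementation.

```python
def apply_operator(u, n, a=1.0, b=0.1):
    """Apply the seven-point discretization of -a*Laplace(u) + b*u.

    The grid is n x n x n on [0,1)^3 with periodic wrap-around and spacing
    h = 1/n. u is a flat list of length n**3 indexed as i + n*(j + n*k).
    Constant coefficients a and b are assumed.
    """
    h = 1.0 / n

    def idx(i, j, k):
        # The modulo implements the periodic boundary condition.
        return (i % n) + n * ((j % n) + n * (k % n))

    out = [0.0] * (n ** 3)
    for k in range(n):
        for j in range(n):
            for i in range(n):
                center = u[idx(i, j, k)]
                neighbors = (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                             + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                             + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)])
                out[idx(i, j, k)] = a * (6.0 * center - neighbors) / h ** 2 + b * center
    return out
```

Applied to a constant vector, the difference part vanishes and only the \(b\,u\) term remains, which is a quick sanity check that the periodic wrap-around is handled consistently.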
Table 3 Example 1: numerical results
| \(N\) | \(P\) | \(e_s\) | \(s_L\) | \(m_f\) (GB) | \(t_f\) (s) | \(E^S_f\) (%) | \(t_s\) (s) | \(E^S_s\) (%) | \(n_\mathrm{iter}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(32^3\) | 1 | 4.84e−04 | 3440 | 1.92e−01 | 4.85e+00 | 100 | 1.36e−01 | 100 | 6 |
| \(32^3\) | 2 | 5.26e−04 | 3440 | 9.60e−02 | 2.60e+00 | 93 | 6.65e−02 | 103 | 6 |
| \(32^3\) | 4 | 3.78e−04 | 3440 | 4.80e−02 | 1.45e+00 | 84 | 3.47e−02 | 98 | 6 |
| \(32^3\) | 8 | 4.93e−04 | 3440 | 2.40e−02 | 8.38e−01 | 72 | 1.99e−02 | 85 | 6 |
| \(32^3\) | 16 | 3.97e−04 | 3440 | 1.20e−02 | 5.83e−01 | 52 | 1.31e−02 | 65 | 6 |
| \(32^3\) | 32 | 7.33e−04 | 3440 | 6.03e−03 | 4.35e−01 | 35 | 1.47e−02 | 29 | 6 |
| \(64^3\) | 2 | 5.92e−04 | 7760 | 9.07e−01 | 3.87e+01 | 100 | 6.08e−01 | 100 | 6 |
| \(64^3\) | 4 | 5.98e−04 | 7760 | 4.54e−01 | 2.36e+01 | 82 | 2.99e−01 | 102 | 6 |
| \(64^3\) | 8 | 5.59e−04 | 7760 | 2.27e−01 | 1.48e+01 | 65 | 1.61e−01 | 94 | 6 |
| \(64^3\) | 16 | 6.30e−04 | 7760 | 1.13e−01 | 1.03e+01 | 47 | 9.52e−02 | 80 | 6 |
| \(64^3\) | 32 | 5.89e−04 | 7760 | 5.68e−02 | 5.34e+00 | 45 | 6.88e−02 | 55 | 6 |
| \(64^3\) | 64 | 5.45e−04 | 7760 | 2.84e−02 | 2.67e+00 | 45 | 4.10e−02 | 46 | 6 |
| \(64^3\) | 128 | 5.43e−04 | 7760 | 1.42e−02 | 1.52e+00 | 40 | 3.43e−02 | 28 | 6 |
| \(64^3\) | 256 | 6.29e−04 | 7760 | 7.14e−03 | 1.27e+00 | 24 | 2.69e−02 | 18 | 6 |
| \(128^3\) | 16 | 6.19e−04 | 16,208 | 9.77e−01 | 1.43e+02 | 100 | 8.24e−01 | 100 | 6 |
| \(128^3\) | 32 | 5.98e−04 | 16,208 | 4.89e−01 | 7.40e+01 | 97 | 4.37e−01 | 94 | 6 |
| \(128^3\) | 64 | 5.85e−04 | 16,208 | 2.44e−01 | 3.87e+01 | 92 | 2.26e−01 | 91 | 6 |
| \(128^3\) | 128 | 6.23e−04 | 16,208 | 1.22e−01 | 2.11e+01 | 85 | 1.40e−01 | 74 | 6 |
| \(128^3\) | 256 | 6.14e−04 | 16,208 | 6.12e−02 | 1.00e+01 | 89 | 9.76e−02 | 53 | 6 |
| \(128^3\) | 512 | 5.96e−04 | 16,208 | 3.06e−02 | 5.80e+00 | 77 | 1.98e−01 | 13 | 6 |
| \(128^3\) | 1024 | 5.86e−04 | 16,208 | 1.54e−02 | 3.46e+00 | 65 | 6.13e−02 | 21 | 6 |
| \(256^3\) | 128 | 6.18e−04 | 33,104 | 1.01e+00 | 2.24e+02 | 100 | 9.18e−01 | 100 | 6 |
| \(256^3\) | 256 | 6.11e−04 | 33,104 | 5.07e−01 | 1.19e+02 | 94 | 4.88e−01 | 94 | 6 |
| \(256^3\) | 512 | 6.06e−04 | 33,104 | 2.53e−01 | 6.33e+01 | 88 | 2.85e−01 | 81 | 6 |
| \(256^3\) | 1024 | 6.25e−04 | 33,104 | 1.27e−01 | 3.19e+01 | 88 | 1.86e−01 | 62 | 6 |
| \(256^3\) | 2048 | 6.18e−04 | 33,104 | 6.35e−02 | 2.44e+01 | 57 | 1.58e−01 | 36 | 6 |
| \(256^3\) | 4096 | 6.16e−04 | 33,104 | 3.18e−02 | 1.27e+01 | 55 | 1.73e−01 | 17 | 6 |
| \(256^3\) | 8192 | 6.14e−04 | 33,104 | 1.60e−02 | 1.16e+01 | 30 | 4.14e−01 | 3 | 6 |
| \(512^3\) | 1024 | 6.16e−04 | 66,896 | 1.03e+00 | 3.32e+02 | 100 | 1.08e+00 | 100 | 6 |
| \(512^3\) | 2048 | 6.15e−04 | 66,896 | 5.16e−01 | 1.84e+02 | 90 | 6.53e−01 | 82 | 6 |
| \(512^3\) | 4096 | 6.14e−04 | 66,896 | 2.58e−01 | 9.55e+01 | 87 | 4.90e−01 | 55 | 6 |
| \(512^3\) | 8192 | 6.13e−04 | 66,896 | 1.29e−01 | 5.58e+01 | 74 | 4.58e−01 | 29 | 6 |
| \(1024^3\) | 8192 | 6.15e−04 | 134,480 | 1.04e+00 | 4.67e+02 | 100 | 1.48e+00 | 100 | 6 |
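The strong scaling efficiencies in the table above are consistent with the usual definition, in which each run is compared with the run of the same problem size on the smallest number of processes. The definition is not restated in this section, so the formula in the following Python sketch is an assumption; it does, however, reproduce the tabulated \(E^S_f\) values for \(N = 32^3\). The function name is hypothetical.

```python
def strong_scaling_efficiency(t_ref, p_ref, t, p):
    """Efficiency of a run (time t on p processes) relative to a reference
    run (time t_ref on p_ref processes) of the same problem size."""
    return (t_ref * p_ref) / (t * p)


# (P, t_f in seconds) for N = 32^3, copied from the Example 1 results.
runs = [(1, 4.85), (2, 2.60), (4, 1.45), (8, 0.838), (16, 0.583), (32, 0.435)]
p_ref, t_ref = runs[0]
efficiencies = [round(100 * strong_scaling_efficiency(t_ref, p_ref, t, p))
                for p, t in runs]
print(efficiencies)  # [100, 93, 84, 72, 52, 35]
```

Values above 100, such as the 103% reported for \(t_s\) at \(P = 2\), can occur under this definition whenever the larger run is more than proportionally faster than the reference run.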
Example 2
1. Generate a uniform random value \(a_{\mathbf j}\) between 0 and 1 for each discretization point;
2. Convolve the random values \(a_{\mathbf j}\) with an isotropic three-dimensional Gaussian with standard deviation 1;
3. Quantize the field via
$$a_{\mathbf j}= \left\{ \begin{array}{ll} 0.1, & a_{\mathbf j}\le 0.5,\\ 1000, & a_{\mathbf j}> 0.5. \end{array} \right. \tag{42}$$
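The three steps above can be sketched in plain Python as follows. Several details are assumptions made for illustration, since the text does not specify them: the standard deviation is measured in grid points, the Gaussian is truncated at three standard deviations, the convolution wraps around periodically (matching the periodic domain), and the isotropic Gaussian is applied as three successive one-dimensional convolutions. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma=1.0, radius=3):
    """Normalized 1D Gaussian weights on offsets -radius..radius (truncated)."""
    w = [math.exp(-0.5 * (d / sigma) ** 2) for d in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]


def smooth_axis(field, n, axis, kernel):
    """Periodic 1D convolution of a flat n^3 field along one axis (0, 1, or 2)."""
    radius = len(kernel) // 2
    stride = n ** axis
    out = [0.0] * len(field)
    for idx in range(len(field)):
        pos = (idx // stride) % n          # coordinate along the chosen axis
        base = idx - pos * stride          # index with that coordinate set to 0
        acc = 0.0
        for t, w in enumerate(kernel):
            q = (pos + t - radius) % n     # periodic wrap-around
            acc += w * field[base + q * stride]
        out[idx] = acc
    return out


def quantized_field(n, seed=0):
    """Steps 1-3: uniform random values, Gaussian smoothing, quantization (42)."""
    rng = random.Random(seed)
    field = [rng.random() for _ in range(n ** 3)]   # step 1
    kernel = gaussian_kernel()
    for axis in range(3):                           # step 2 (separable Gaussian)
        field = smooth_axis(field, n, axis, kernel)
    return [0.1 if v <= 0.5 else 1000.0 for v in field]  # step 3
```

Because the kernel is normalized, smoothing keeps the values centered around 0.5, so the threshold in (42) splits the domain into low-coefficient and high-coefficient regions of irregular shape.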
Table 4 Example 2: numerical results
| \(N\) | \(P\) | \(e_s\) | \(s_L\) | \(m_f\) (GB) | \(t_f\) (s) | \(E^S_f\) (%) | \(t_s\) (s) | \(E^S_s\) (%) | \(n_\mathrm{iter}\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(32^3\) | 1 | 3.02e−03 | 3865 | 2.00e−01 | 5.80e+00 | 100 | 1.34e−01 | 100 | 7 |
| \(32^3\) | 2 | 3.39e−03 | 3632 | 9.31e−02 | 2.48e+00 | 117 | 6.69e−02 | 100 | 7 |
| \(32^3\) | 4 | 2.69e−03 | 3934 | 5.13e−02 | 1.72e+00 | 84 | 3.75e−02 | 89 | 7 |
| \(32^3\) | 8 | 3.18e−03 | 3660 | 2.37e−02 | 9.50e−01 | 76 | 2.20e−02 | 76 | 7 |
| \(32^3\) | 16 | 3.13e−03 | 3693 | 1.24e−02 | 6.22e−01 | 58 | 1.32e−02 | 63 | 7 |
| \(32^3\) | 32 | 3.00e−03 | 3744 | 6.42e−03 | 4.83e−01 | 38 | 1.49e−02 | 28 | 7 |
| \(64^3\) | 2 | 3.29e−03 | 8580 | 9.45e−01 | 4.33e+01 | 100 | 6.15e−01 | 100 | 7 |
| \(64^3\) | 4 | 3.13e−03 | 8938 | 4.94e−01 | 2.91e+01 | 74 | 3.10e−01 | 99 | 7 |
| \(64^3\) | 8 | 3.09e−03 | 9600 | 2.51e−01 | 1.98e+01 | 55 | 1.68e−01 | 91 | 7 |
| \(64^3\) | 16 | 3.07e−03 | 8919 | 1.19e−01 | 1.27e+01 | 43 | 9.86e−02 | 78 | 7 |
| \(64^3\) | 32 | 3.09e−03 | 9478 | 6.59e−02 | 6.99e+00 | 39 | 7.89e−02 | 49 | 7 |
| \(64^3\) | 64 | 3.18e−03 | 9111 | 3.03e−02 | 3.17e+00 | 43 | 4.90e−02 | 39 | 7 |
| \(64^3\) | 128 | 3.02e−03 | 9419 | 1.58e−02 | 2.15e+00 | 31 | 3.31e−02 | 29 | 7 |
| \(64^3\) | 256 | 3.03e−03 | 9349 | 7.97e−03 | 1.60e+00 | 21 | 3.66e−02 | 13 | 7 |
| \(128^3\) | 16 | 3.16e−03 | 19,855 | 1.07e+00 | 2.11e+02 | 100 | 8.89e−01 | 100 | 7 |
| \(128^3\) | 32 | 3.09e−03 | 20,487 | 5.58e−01 | 1.18e+02 | 90 | 4.86e−01 | 91 | 7 |
| \(128^3\) | 64 | 3.06e−03 | 21,345 | 2.78e−01 | 6.43e+01 | 82 | 2.45e−01 | 91 | 7 |
| \(128^3\) | 128 | 3.10e−03 | 20,344 | 1.37e−01 | 3.39e+01 | 78 | 1.34e−01 | 83 | 7 |
| \(128^3\) | 256 | 3.07e−03 | 21,152 | 7.43e−02 | 1.76e+01 | 75 | 1.10e−01 | 51 | 7 |
| \(128^3\) | 512 | 3.07e−03 | 20,779 | 3.51e−02 | 8.46e+00 | 78 | 8.80e−02 | 32 | 7 |
| \(128^3\) | 1024 | 3.04e−03 | 21,361 | 1.76e−02 | 5.38e+00 | 61 | 6.31e−02 | 22 | 7 |
| \(256^3\) | 128 | 3.11e−03 | 42,420 | 1.14e+00 | 4.15e+02 | 100 | 1.04e+00 | 100 | 7 |
| \(256^3\) | 256 | 3.12e−03 | 43,828 | 5.91e−01 | 2.12e+02 | 98 | 5.77e−01 | 90 | 8 |
| \(256^3\) | 512 | 3.11e−03 | 44,126 | 2.90e−01 | 1.25e+02 | 83 | 3.86e−01 | 67 | 7 |
| \(256^3\) | 1024 | 3.08e−03 | 43,302 | 1.46e−01 | 6.31e+01 | 82 | 2.12e−01 | 61 | 7 |
| \(256^3\) | 2048 | 3.09e−03 | 44,131 | 7.78e−02 | 3.43e+01 | 76 | 1.86e−01 | 35 | 7 |
| \(256^3\) | 4096 | 3.10e−03 | 43,691 | 3.71e−02 | 1.96e+01 | 66 | 2.28e−01 | 14 | 7 |
| \(256^3\) | 8192 | 3.10e−03 | 43,952 | 1.85e−02 | 2.05e+01 | 32 | 4.03e−01 | 4 | 7 |
| \(512^3\) | 1024 | 3.11e−03 | 88,070 | 1.16e+00 | 6.37e+02 | 100 | 1.22e+00 | 100 | 7 |
| \(512^3\) | 2048 | 3.11e−03 | 88,577 | 6.11e−01 | 3.47e+02 | 92 | 6.84e−01 | 89 | 8 |
| \(512^3\) | 4096 | 3.11e−03 | 88,757 | 3.03e−01 | 1.89e+02 | 84 | 5.31e−01 | 58 | 7 |
| \(512^3\) | 8192 | 3.11e−03 | 85,877 | 1.50e−01 | 1.02e+02 | 78 | 6.20e−01 | 25 | 7 |
| \(1024^3\) | 8192 | 3.11e−03 | 177,323 | 1.18e+00 | 9.35e+02 | 100 | 1.95e+00 | 100 | 8 |
Example 3
We adopt the GMRES iterative method in both DHIF and hypre to solve the elliptic problem to a relative error of \(10^{-12}\). The tolerance in DHIF is set to \(10^{-4}\). The SMG interface in hypre is used as the preconditioner for the problem on regular grids. The numerical results for DHIF and hypre are given in Table 5.
Table 5 Numerical results for DHIF and hypre
| \(N\) | \(P\) | DHIF \(t_\mathrm{setup}\) (s) | DHIF \(t_\mathrm{solve}\) (s) | DHIF \(n_\mathrm{iter}\) | hypre \(t_\mathrm{setup}\) (s) | hypre \(t_\mathrm{solve}\) (s) | hypre \(n_\mathrm{iter}\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| \(64^3\) | 8 | 15.27 | 18.10 | 21 | 0.29 | 9.67 | 67 |
| \(64^3\) | 64 | 2.46 | 3.45 | 21 | 1.47 | 17.37 | 60 |
| \(128^3\) | 64 | 29.20 | 24.53 | 22 | 1.78 | 140.90 | 394 |
| \(128^3\) | 512 | 3.93 | 4.41 | 22 | 2.11 | 113.57 | 455 |
| \(256^3\) | 512 | 59.66 | 26.33 | 21 | 4.11 | 258.22 | 492 |
| \(256^3\) | 4096 | 11.58 | 6.78 | 21 | 8.97 | 191.15 | 375 |
As Table 5 shows, DHIF has several advantages over hypre in the given settings. First, DHIF is faster than hypre’s SMG except for small problems with small numbers of processes. The number of iterations grows with the problem size in hypre, whereas it remains almost constant in DHIF, so the advantages of DHIF are more pronounced for truly large problems. Second, the scalability of DHIF appears to be better than that of hypre’s SMG. Finally, DHIF only requires the number of processes to be a power of two, whereas hypre’s SMG requires a power of eight for 3D problems.
5 Conclusion
We have described the algorithm using the periodic boundary condition in order to simplify the presentation. However, the implementation can be extended in a straightforward way to problems with other types of boundary conditions. The discretization adopted here is the standard Cartesian grid. For more general discretizations, such as finite element methods on unstructured meshes, one can generalize the current implementation by combining it with the idea proposed in [36].
Here we have only considered the parallelization of the HIF for differential equations. As shown in [22], the HIF is also applicable to solving integral equations with nonoscillatory kernels. Parallelization of this algorithm is also of practical importance.
Acknowledgements
Y. Li and L. Ying are partially supported by the National Science Foundation under award DMS-1521830 and the U.S. Department of Energy’s Advanced Scientific Computing Research program under award DE-FC02-13ER26134/DE-SC0009409. The authors would like to thank K. Ho, V. Minden, A. Benson, and A. Damle for helpful discussions. We especially thank J. Poulson for the parallel dense linear algebra package Elemental. We acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin (URL: http://www.tacc.utexas.edu) for providing HPC resources that have contributed to the research results reported in the early versions of this paper. This research, in the current version, used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Ambikasaran, S., Darve, E.: An \(\cal{O}(N \log N)\) fast direct solver for partial hierarchically semiseparable matrices: with application to radial basis function interpolation. J. Sci. Comput. 57(3), 477–501 (2013)
2. Amestoy, P., Ashcraft, C., Boiteau, O., Buttari, A., L’Excellent, J.Y., Weisbecker, C.: Improving multifrontal methods by means of block low-rank representations. SIAM J. Sci. Comput. 37(3), A1451–A1474 (2015)
3. Amestoy, P.R., Duff, I.S., L’Excellent, J.Y.: Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Eng. 184(2–4), 501–520 (2000)
4. Amestoy, P.R., Duff, I.S., L’Excellent, J.Y., Koster, J.: A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matrix Anal. Appl. 23(1), 15–41 (2001)
5. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Minimizing communication in numerical linear algebra. SIAM J. Matrix Anal. Appl. 32(3), 866–901 (2011)
6. Bebendorf, M.: Efficient inversion of the Galerkin matrix of general second-order elliptic operators with nonsmooth coefficients. Math. Comput. 74(251), 1179–1199 (2005)
7. Bebendorf, M., Hackbusch, W.: Existence of \(\cal{H}\)-matrix approximants to the inverse FE-matrix of elliptic operators with \(L^\infty \)-coefficients. Numer. Math. 95(1), 1–28 (2003)
8. Börm, S.: Approximation of solution operators of elliptic partial differential equations by \(\cal{H}\)- and \(\cal{H}^2\)-matrices. Numer. Math. 115(2), 165–193 (2010)
9. Briggs, W.L., Henson, V.E., McCormick, S.F.: A Multigrid Tutorial, 2nd edn. Society for Industrial and Applied Mathematics (2000). doi:https://doi.org/10.1137/1.9780898719505
10. Chandrasekaran, S., Dewilde, P., Gu, M., Pals, T., Sun, X., van der Veen, A.J., White, D.: Some fast algorithms for sequentially semiseparable representations. SIAM J. Matrix Anal. Appl. 27(2), 341–364 (2005)
11. Chandrasekaran, S., Dewilde, P., Gu, M., Somasunderam, N.: On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs. SIAM J. Matrix Anal. Appl. 31(5), 2261–2290 (2010)
12. Cheng, H., Gimbutas, Z., Martinsson, P.G., Rokhlin, V.: On the compression of low rank matrices. SIAM J. Sci. Comput. 26(4), 1389–1404 (2005)
13. Chow, E., Falgout, R.D., Hu, J.J., Tuminaro, R.S., Yang, U.M.: A survey of parallelization techniques for multigrid solvers. Parallel Process. Sci. Comput., chapter 10, pp. 179–201. Society for Industrial and Applied Mathematics (2006)
14. Duff, I.S., Reid, J.K.: The multifrontal solution of indefinite sparse symmetric linear equations. ACM Trans. Math. Softw. 9(3), 302–325 (1983)
15. Falgout, R.D., Jones, J.E.: Multigrid on massively parallel architectures. In: Dick, E., Riemslagh, K., Vierendeels, J. (eds.) Multigrid Methods VI. Lecture Notes in Computational Science and Engineering, vol. 14, pp. 101–107. Springer, Berlin (2000). doi:https://doi.org/10.1007/978-3-642-58312-4_13
16. George, A.: Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10(2), 345–363 (1973)
17. Gillman, A., Martinsson, P.G.: A direct solver with \(O(N)\) complexity for variable coefficient elliptic PDEs discretized via a high-order composite spectral collocation method. SIAM J. Sci. Comput. 36(4), A2023–A2046 (2014)
18. Hackbusch, W.: A sparse matrix arithmetic based on \(\cal{H}\)-matrices. I. Introduction to \(\cal{H}\)-matrices. Computing 62(2), 89–108 (1999)
19. Hackbusch, W., Börm, S.: Data-sparse approximation by adaptive \(\cal{H}^2\)-matrices. Computing 69(1), 1–35 (2002)
20. Hackbusch, W., Khoromskij, B.N.: A sparse \(\cal{H}\)-matrix arithmetic. II. Application to multidimensional problems. Computing 64(1), 21–47 (2000)
21. Hao, S., Martinsson, P.G.: A direct solver for elliptic PDEs in three dimensions based on hierarchical merging of Poincaré-Steklov operators. J. Comput. Appl. Math. 308, 419–434 (2016). doi:https://doi.org/10.1016/j.cam.2016.05.013
22. Ho, K.L., Ying, L.: Hierarchical interpolative factorization for elliptic operators: differential equations. Commun. Pure Appl. Math. 69(8), 1415–1451 (2016). doi:https://doi.org/10.1002/cpa.21582
23. Ho, K.L., Ying, L.: Hierarchical interpolative factorization for elliptic operators: integral equations. Commun. Pure Appl. Math. 69(7), 1314–1353 (2016)
24. Izadi, M.: Parallel \(\cal{H}\)-matrix arithmetic on distributed-memory systems. Comput. Vis. Sci. 15(2), 87–97 (2012)
25. Kriemann, R.: \(\cal{H}\)-LU factorization on many-core systems. Comput. Vis. Sci. 16(3), 105 (2013)
26. Liu, J.W.H.: The multifrontal method for sparse matrix solution: theory and practice. SIAM Rev. 34(1), 82–109 (1992)
27. Liu, X., Xia, J., de Hoop, M.V.: Parallel randomized and matrix-free direct solvers for large structured dense linear systems. SIAM J. Sci. Comput. 38(5), 1–32 (2016)
28. Martinsson, P.G.: A fast direct solver for a class of elliptic partial differential equations. J. Sci. Comput. 38(3), 316–330 (2009)
29. Martinsson, P.G.: Blocked rank-revealing QR factorizations: how randomized sampling can be used to avoid single-vector pivoting. arXiv:1505.08115 (2015)
30. Poulson, J., Engquist, B., Li, S., Ying, L.: A parallel sweeping preconditioner for heterogeneous 3D Helmholtz equations. SIAM J. Sci. Comput. 35(3), C194–C212 (2013)
31. Poulson, J., Marker, B., van de Geijn, R.A., Hammond, J.R., Romero, N.A.: Elemental: a new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw. 39(2), 13:1–13:24 (2013)
32. Pouransari, H., Coulier, P., Darve, E.: Fast hierarchical solvers for sparse matrices using low-rank approximation. arXiv:1510.07363 (2016)
33. Saad, Y.: Parallel iterative methods for sparse linear systems. Stud. Comput. Math. 8, 423–440 (2001)
34. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics (2003). doi:https://doi.org/10.1137/1.9780898718003
35. Schmitz, P.G., Ying, L.: A fast direct solver for elliptic problems on general meshes in 2D. J. Comput. Phys. 231(4), 1314–1338 (2012)
36. Schmitz, P.G., Ying, L.: A fast nested dissection solver for Cartesian 3D elliptic problems using hierarchical matrices. J. Comput. Phys. 258, 227–245 (2014)
37. Scott, D.S.: Efficient all-to-all communication patterns in hypercube and mesh topologies. In: Distributed Memory Computing Conference, pp. 398–403. IEEE (1991)
38. Wang, S., Li, X.S., Rouet, F.H., Xia, J., De Hoop, M.V.: A parallel geometric multifrontal solver using hierarchically semiseparable structure. ACM Trans. Math. Softw. 42(3), 21:1–21:21 (2016)
39. Xia, J.: Efficient structured multifrontal factorization for general large sparse matrices. SIAM J. Sci. Comput. 35(2), A832–A860 (2013)
40. Xia, J.: Randomized sparse direct solvers. SIAM J. Matrix Anal. Appl. 34(1), 197–227 (2013)
41. Xia, J., Chandrasekaran, S., Gu, M., Li, X.S.: Superfast multifrontal method for large structured linear systems of equations. SIAM J. Matrix Anal. Appl. 31(3), 1382–1411 (2009)
42. Xia, J., Chandrasekaran, S., Gu, M., Li, X.S.: Fast algorithms for hierarchically semiseparable matrices. Numer. Linear Algebr. Appl. 17(6), 953–976 (2010)
43. Xin, Z., Xia, J., De Hoop, M.V., Cauley, S., Balakrishnan, V.: A distributed-memory randomized structured multifrontal method for sparse direct solutions. Purdue GMIG Rep. 14(17), 1–25 (2014)