Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations

Tobias Wicky  
Department of Computer Science  
ETH Zurich  
Zurich, Switzerland  
twicki@ethz.ch

Edgar Solomonik  
Department of Computer Science  
University of Illinois at Urbana-Champaign  
Urbana, IL, USA  
solomon2@illinois.edu

Torsten Hoefler  
Department of Computer Science  
ETH Zurich  
Zurich, Switzerland  
htor@inf.ethz.ch

Abstract—We present a new parallel algorithm for solving triangular systems with multiple right hand sides (TRSM). TRSM is used extensively in numerical linear algebra computations, both to solve triangular linear systems of equations as well as to compute factorizations with triangular matrices, such as Cholesky, LU, and QR. Our algorithm achieves better theoretical scalability than known alternatives, while maintaining numerical stability, via selective use of triangular matrix inversion. We leverage the fact that triangular inversion and matrix multiplication are more parallelizable than the standard TRSM algorithm. By only inverting triangular blocks along the diagonal of the initial matrix, we generalize the usual way of TRSM computation and the full matrix inversion approach. This flexibility leads to an efficient algorithm for any ratio of the number of right hand sides to the triangular matrix dimension. We provide a detailed communication cost analysis for our algorithm as well as for the recursive triangular matrix inversion. This cost analysis makes it possible to determine optimal block sizes and processor grids a priori. Relative to the best known algorithms for TRSM, our approach can require asymptotically fewer messages, while performing optimal amounts of computation and communication in terms of words sent.

Keywords—TRSM, communication cost, 3D algorithms

I. INTRODUCTION

Triangular solve for multiple right hand sides (TRSM) is a crucial subroutine in many numerical linear algebra algorithms, such as LU and Cholesky factorizations [1], [2]. Moreover, it is used to solve linear systems of equations once the equation matrix is decomposed using any factorization involving a triangular matrix factor. We consider TRSM for dense linear equations in the matrix form,

\[ L \cdot X = B, \]

where \( L \in \mathbb{R}^{n \times n} \) is a lower-triangular matrix while \( B \in \mathbb{R}^{n \times k} \), \( X \in \mathbb{R}^{n \times k} \) are dense matrices. We study the communication cost complexity of two variants of the TRSM algorithm for parallel execution on \( p \) processors with unbounded memory. First, we present an adaptation of a known recursive scheme for TRSM [3] along with a complete communication cost analysis. Then, we demonstrate a new algorithm that uses selective triangular inversion to reduce the synchronization cost over known schemes, while preserving optimal communication and computation costs. Careful choice of algorithmic parameters, allows us to achieve better asymptotic complexity for a large (and most important) range of input configurations.

Our TRSM algorithm leverages matrix multiplication and triangular matrix inversion as primitives. We provide a communication-efficient parallel matrix multiplication algorithm that starts from a 2D cyclic distribution, a modest enhancement to existing approaches. Triangular matrix inversion provides the key ingredient to the lower synchronization cost in our TRSM algorithm. Unlike general matrix inversion, triangular inversion is numerically stable [4] and can be done with relatively few synchronizations. We present a known parallel approach for triangular inversion and provide the first communication cost analysis thereof.

We invert triangular diagonal blocks of the \( L \) matrix at the start of our TRSM algorithm, increasing the computational granularity of the main part of the solver. Inverting blocks, rather than the whole matrix, also allows our algorithm to be work-efficient in cases when the number of right-hand sides is smaller than the matrix dimension. We formulate the algorithm in an iterative, rather than a recursive manner, avoiding overheads incurred by the known parallel recursive TRSM approach. This innovation reduces communication cost by a factor of \( \Theta(\log(p)) \) relative to the recursive algorithm when the number of right-hand sides is relatively small. At the same time, across a large range of input parameters, we achieve a synchronization cost improvement over the recursive approach by a factor of \( \Theta\left(\left(\frac{n}{k}\right)^{1/6} p^{2/3}\right) \).

II. PRELIMINARIES

A. Execution Time Model

The model we use to calculate the parallel execution time of an algorithm along its critical path is the \( \alpha - \beta - \gamma \) model. It describes the total execution time of the algorithm \( T \) in terms of the floating point operations (flop) count \( F \), the bandwidth \( W \) (number of words of data sent and received)

\[ T = \alpha F + \beta W + \gamma. \]
and the latency $S$ (number of messages sent and received) along the critical path [5] in the following fashion:

$$T = \alpha \cdot S + \beta \cdot W + \gamma \cdot F.$$  

For all the terms we only show the leading order cost in terms of $n, k$ and $p$. We assume that every processor can send and receive one message at a time in point to point communication. We do not place constraints on the local memory size.

### B. Notation

For brevity, we will sometimes omit specification of the logarithm base, using log to denote $\log_2$. We will make frequent use of the unit step function

$$\mathbb{I}_x = \begin{cases} 1 : x > 1 \\ 0 : x \leq 1. \end{cases}$$

We use $\Pi(x_1, \ldots, x_n)$ to denote single processors in an $n$-dimensional processor grids.

To refer to successive elements or blocks, we will use the colon notation where $i : j = [i, i + 1, \ldots, j - 1]$ $i < j$. To refer to strided sets of elements or blocks, we will write

$$i : k : j = [i, i + k, i + 2k, \ldots, i + ak] \max a \text{ s.t. } ak < j.$$  

The colon notation can be applied to a list and should there be considered element-wise. To use subsets of processors, we use the $\circ$ notation in the way that $\Pi(x, \circ, z) = \Pi(x, 1 : p_y, z)$ denotes a (1-dimensional) grid of processors in the $y$-dimension.

Global matrices are denoted by capital letters whereas locally owned parts have square brackets: If $L$ is distributed on $\Pi(o, o, 1)$, every processor $\Pi(x, y, 1)$ owns $L[x, y]$. Matrix elements are accessed with brackets.

### C. Previous Work

We now cover necessary existing building blocks (e.g. collective communication routines) and previous work. In particular, we overview related results on communication cost of matrix multiplication and triangular solves.

1) **Collective Communication**: In [6], Chan et al. present a way to perform reduction, allreduction and broadcast via allgather, scatter, gather and reduce-scatter. The latter set of collectives can be done using recursive doubling (butterfly algorithms) [6], [7] for a power of two number of processors. If we have a non-power of two number of processors, the algorithm described in [8] can be used. For simplicity, we do not consider reduction and broadcast algorithms that can achieve a factor of two less in cost in specific regimes of $\alpha$ and $\beta$ [9]. If we use butterfly methods, the cost of an allgather of $n$ words among $p$ processors is

$$T_{\text{allgather}}(n, p) = \alpha \cdot \log p + \beta \cdot n \mathbb{I}_p.$$  

Scatter and gather also have the same cost [7],

$$T_{\text{scatter}}(n, p) = \alpha \cdot \log p + \beta \cdot n \mathbb{I}_p,$$

$$T_{\text{gather}}(n, p) = \alpha \cdot \log p + \beta \cdot n \mathbb{I}_p.$$  

Reduce-scatter uses the same communication-path and has same communication cost but we have to add the additional overhead of local computation,

$$T_{\text{reduce-scatter}}(n, p) = \alpha \cdot \log p + \beta \cdot n \mathbb{I}_p + \gamma \cdot n \mathbb{I}_p.$$  

The cost of an all-to-all among $p$ processors is

$$T_{\text{alltoall}}(n, p) = \alpha \cdot \log(p) + \beta \cdot \frac{n \log p}{2}.$$  

The combination of the algorithms leads to the following costs for reduction, allreduction, and broadcast:

$$T_{\text{reduction}}(n, p) = \alpha \cdot 2 \log p + \beta \cdot 2n \mathbb{I}_p + \gamma \cdot n \mathbb{I}_p,$$

$$T_{\text{allreduction}}(n, p) = \alpha \cdot 2 \log p + \beta \cdot 2n \mathbb{I}_p + \gamma \cdot n \mathbb{I}_p,$$

$$T_{\text{broadcast}}(n, p) = \alpha \cdot 2 \log p + \beta \cdot 2n \mathbb{I}_p.$$  

2) **Matrix Multiplication**: Communication-efficient parallel algorithms for matrix multiplication have been analyzed extensively [10]–[15]. In [16], Demmel et al. present algorithms that are asymptotically optimal for matrix multiplication of arbitrary (potentially non square) matrices. If we neglect the memory terms, their work shows that matrix multiplication can be done with the following costs.

**Bandwidth**: When multiplying a matrix $A$ that is of dimension $n \times n$ with a matrix $B$ of dimensions $n \times k$ with $p$ processors, we obtain an asymptotic bandwidth cost of

$$W_{\text{MM}}(n, k, p) = \begin{cases} \mathcal{O}\left(\frac{nk}{\sqrt{p}}\right) & n > k \cdot \sqrt{p} \\ \mathcal{O}\left(\frac{n^2k}{p}\right)^{2/3} & k/p \leq n \leq k \cdot \sqrt{p} \\ \mathcal{O}\left(n^2\right) & n < k/p. \end{cases}$$

We refer to the first of the cases of $W_{\text{MM}}$, as the case of two large dimensions, here the matrix $A$ is much larger than the right hand side $B$, when the best way of performing a matrix multiplication is to use a two dimensional layout for the processor grid. The second case, three large dimensions, has matrices $A$ and $B$ of approximately the same size. A three dimensional grid layout is optimal here. And the third case, one large dimension, is the case where the right hand side $B$ is larger than the triangular matrix $A$, the best way to do a matrix multiplication is to use a one dimensional layout for the processor grid.

**Latency**: Assuming unlimited memory, matrix multiplication as presented in [16] can be done with a latency of

$$S_{\text{MM}}(p) = \mathcal{O}\left(\log(p)\right).$$

**Flop Cost**: Matrix multiplication takes $\mathcal{O}\left(n^2k\right)$ flops, which can be divided on $p$ processors and therefore we have

$$F_{\text{MM}}(n, k, p) = \mathcal{O}\left(\frac{n^2k}{p}\right).$$
Previous Analysis: For the case where $k = n$ the bandwidth analysis of a general matrix multiplication goes back to what is presented in [12]. Aggarwal et al. present a cost analysis in the LPRAM model. In that work, the authors show that the same cost can also be achieved for the transitive closure problem that can be extended to the problem of doing an LU decomposition. The fact that these bandwidth costs can be obtained for the LU decomposition was later demonstrated by Tiskin [17]. He used the bulk synchronous parallel (BSP) execution time model. Since the dependencies in LU are more complicated than they are for TRSM, we also expect TRSM to be able to have the same asymptotic bandwidth and flop costs as a general matrix multiplication.

3) Triangular Matrix Solve for Single Right Hand Sides: Algorithms for the problem of triangular solve for a single right hand side (when $X$ and $B$ are vectors) have been well-studied. A communication-efficient parallel algorithm was given by Heath and Romine [18]. This parallel algorithm was later shown to be an optimal schedule in latency and bandwidth costs via lower bounds [5]. However, when $X$ and $B$ are matrices ($k > 1$), it is possible to achieve significantly lower communication costs relative to the amount of computation required. The application of selective inversion has been used to accelerate repeated triangular solves that arise in preconditioned sparse iterative methods [19].

4) Recursive Triangular Matrix Solve for Multiple Right Hand Sides: A recursive approach of solving the TRSM-problem was presented in the work of Elmroth et al. [3]. The initial problem, $L \cdot X = B$ can be split into $L \cdot [X_1 \ X_2] = [B_1 \ B_2]$, which yields two independent subproblems:

$$L \cdot X_1 = B_1, \quad L \cdot X_2 = B_2,$$

hence the subproblems are independent and can be solved in parallel.

The other, dependent splitting proposed divides the triangular matrix, yielding two subtasks that have to be solved one at a time

$$\begin{bmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{bmatrix} \cdot \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix},$$

where we obtain the dependent subproblems:

$$L_{11} \cdot X_1 = B_1 \quad \text{and} \quad L_{22} \cdot X_2 = B_2 - L_{21} \cdot X_1.$$

These problems are dependent as we need the solution $X_1$ to solve the second problem.

Parallel TRSM algorithms with 3D processor grids can reduce the communication cost in an analogous fashion to matrix multiplication. Irony and Toledo [20] presented the first parallelization of the recursive TRSM algorithm with a 3D processor grid. They demonstrated that the communication volume of their parallelization is $O(nkp^{1/3} + n^2p^{2/3})$. Thus each processor communicates $O((nk + n^2)p^{2/3})$ elements, which is asymptotically equal to $W_{MM}(n,k,p)$ when $k = \Theta(n)$. However, they did not provide a bound on the latency cost nor on the communication bandwidth cost along the critical path, so it is unclear to what extent the communication volume is load-balanced. Lipshitz [21] provides an analysis of the recursive TRSM algorithm in the same communication cost model as used in this paper. For the case of $k = n$, his analysis demonstrates

$$T_{TRSM-L}(n,n,p) = O(p^{2/3} \cdot \alpha + n^2/p^{2/3} \cdot \beta + n^3/p \cdot \gamma).$$

For some choices of algorithmic parameters, the analysis in [21] should lead to the bandwidth cost,

$$W_{TRSM-L}(n,k,p) = O\left(\left(\frac{n^2k}{p}\right)^{2/3}\right),$$

which is as good as matrix multiplication, $W_{MM}(n,k,p)$. However, it is unclear how to choose the parameters of the formulation in [21] to minimize latency cost for general $n,k.$ Prior to presenting our main contribution (an inversion-based TRSM algorithm with a lower latency cost), we provide a simpler form of the recursive TRSM algorithm and its asymptotic cost complexity. Inversion has previously been used in TRSM implementations [22], but our study is the first to consider communication-optimality. We start by presenting a subroutine for 3D matrix multiplication which operates from a starting 2D distribution, simplifying the subsequent presentation of TRSM algorithms.

## III. MATRIX MULTIPLICATION

\[ B = MM(L, X, \pi_{2D}, n, k, p, p_1, p_2) \]

**Require:**

The processor grid $\pi_{2D}$ has dimensions $\sqrt{p} \times \sqrt{p}$

$L$ is an $n \times n$ matrix, distributed on $\pi_{2D}$ in a cyclic layout, so $\pi_{2D}(x,y)$ owns $L[x,y]$ of size $\frac{\sqrt{p}}{\sqrt{p}} \times \frac{\sqrt{p}}{\sqrt{p}}$ such that $L[x,y] = L((i\sqrt{p} + j)/\sqrt{p} + y)$.

$X$ is a dense $n \times k$ matrix distributed cyclically so that $\pi_{2D}(x,y)$ owns $X[x,y]$ of size $\frac{\sqrt{p}}{\sqrt{p}} \times \frac{\sqrt{p}}{\sqrt{p}}$.

1. Define a $p_1 \times \sqrt{p_2} \times \sqrt{p_1} \times \sqrt{p_2}$ processor grid $\pi_{4D}$, such that $\pi_{4D}(x_1,x_2,y_1,y_2) = \pi_{2D}(x_1 + p_1 x_2, y_1 + p_2 y_2)$ owns blocks $L[x_1, x_2, y_1, y_2]$ and $X[x_1, x_2, y_1, y_2]$.
2. $L'[x_1, y_1] = \text{Allgather}(L[x_1, 0, y_1, 0], \pi_{4D}(x_1, 0, y_1, 0))$
3. $X'[x_1, y_1, x_2, y_2] = \text{Transpose}(X[x_1, x_2, y_1, y_2], \pi_{4D}(x_1, x_2, y_1, y_2), x_2, y_1)$
4. $X''[y_1, x_1, x_2, y_2] = \text{Transpose}(X'[x_1, y_1, x_2, y_2], \pi_{4D}(x_1, x_2, y_1, y_2), x_1, y_1)$
5. $X'''[y_1, x_1, x_2, y_2] = \text{Allgather}(X''[y_1, 0, x_2, y_2], \pi_{4D}(0, x_2, y_1, y_2))$
6. $\pi_{4D}(x_1, x_2, y_1, y_2)$:
7. $B'[x_1, y_1, x_2, y_2] = L'[x_1, y_1] \cdot X'''[y_1, x_2, y_2]$
8. $B'[x_1, x_2, y_1] = \text{Scatter-reduce}(B'[x_1, 0, x_2, y_1], \pi_{4D}(x_1, 0, x_2, y_1))$

Ensure:

$B = LX$ is distributed the same way as $X$

We present an algorithm for 3D matrix multiplication [10]–[15] that works efficiently given input matrices distributed
cyclically on a 2D processor grid. The algorithm is well-suited for the purposes of analyzing TRSM algorithms. We define the algorithm using a $p_1 \times \sqrt{p_2^3 \times p_1} \times \sqrt{p_2^3}$ processor grid, where $\sqrt{p} = p_1 \sqrt{p_2^3}$ in order to provide well-defined transitions from a distribution on a $\sqrt{p} \times \sqrt{p}$ processor grid to faces of a 3D $p_1 \times p_1 \times p_2$ processor grid. The latter 3D processor grid is being used implicitly in our construction. The algorithm assumes divisibility among $p, p_1, p_2$ and $\sqrt{p_2}$.

A. Cost Analysis of the 3D Matrix Multiplication Algorithm

We analyze the algorithm with account for constant factors in the key leading order costs. The communication costs incurred at line 2, 5, and 7 correspond to the cost of respective collectives, given in Section II-C1.

The transpose on line 4 always occurs on a square processor grid, and so involves only one send and receive of a block. The transposes on line 3 and line 8 are transposes on 2D grids of $p_1 \times \sqrt{p_2^3}$ processors, with each processor owning $nk/p$ elements. The cost of these transposes is no greater than an all-to-all among $\sqrt{p}$ processors, which can be done with cost $O(\alpha \cdot \log(p) + \beta \cdot nk \log(p)/p)$. We consider only the asymptotic cost of this transpose, since it will be low order so long as $p_1 \gg 1$. Based on the above arguments, the cost for MM is given line-by-line in the following table:

<table>
<thead>
<tr>
<th>Line</th>
<th>Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>$\alpha \cdot \log(p_2) + \beta \cdot n^2/p_2$</td>
</tr>
<tr>
<td>3</td>
<td>$O(\alpha \cdot \log(p) + \beta \cdot nk \log(p)/p)$</td>
</tr>
<tr>
<td>4</td>
<td>$\gamma \cdot nk/p$</td>
</tr>
<tr>
<td>5</td>
<td>$\alpha \cdot \log(p_1) + (\beta + \gamma) \cdot nk/p_{1/2}$</td>
</tr>
<tr>
<td>6</td>
<td>$\alpha \cdot \log(p) + \beta \cdot nk \log(p)/p$</td>
</tr>
</tbody>
</table>

To leading order, the cost of MM is given by

$$T_{MM}(n, k, p, p_1, p_2) = \beta \cdot \frac{n^2}{p_2} \log(p_2) + \frac{2nk}{p_{1/2}} + \gamma \cdot \frac{n^2k}{p} + O\left(\alpha \cdot \log(p) + \beta \cdot nk \log(p)/p\right).$$

The last communication cost term (due to the rectangular grid transpose) is only of leading order when $p_1 \approx \log(p)$. A square processor grid is not a good initial/finial layout for $X$ and $B$ in this case, and the problem would be addressed by choosing an alternative one. We disregard this issue, because we will use the algorithm only for $n \geq k$.

IV. RECURSIVE TRSM

We provide a recursive TRSM algorithm for solving $LX = B$ using the techniques covered in Section II-C4. Our algorithm works recursively on a $p_r \times p_r$ processor grid. We will define the processor grid to be square ($p_r = p_c$) when $n \geq k$, but rectangular ($p_r < p_c$) when $n < k$. So long as $p < k/n$, we will choose $p_c = (k/n)p_r$. This strategy implies the largest of the matrices $L$ and $B$ will be partitioned initially so each cyclically-selected block is close to square. The algorithm starts by partitioning the processor grid into $p_c/p_r$ square grids, if $p_c > p_r$, replicating the matrix $L$ and computing a subset of $k/p_r/p_c$ columns of $X$ on each. Then the algorithm partitions $L$ into $n/2 \times n/2$ blocks recursively, executing subproblems with all processors. At a given threshold, $n_0$, the algorithm stops recursing, gathers $L$ onto all processors, and computes a subset of columns of $X$ with each processor.

$$X = \text{Rec-TRSM}(L, B, \Pi_{2D}, n, k, p_r, p_r, p_n, n_0).$$

**Require:**

$L$ is a lower triangular $n \times n$ matrix, distributed on $p_r \times p_r$ in a cyclic layout, so $\Pi_{2D}(x, y)$ of size $\frac{n}{p_r} \times \frac{n}{p_r}$ such that $L[x, y](i, j) = L(i/p_r + x, j/p_r + y)$. $B$ is a dense $n \times k$ matrix is distributed cyclically so that $\Pi_{2D}(x, y)$ owns $X[x, y]$ of size $\frac{n}{p_r} \times \frac{n}{p_r}$.

1: if $p_r = q_{p_r}$ and $q > 1$ then
2: define $p_r \times p_r \times q$ processor grid $\Pi_{3D}$, such that $\Pi_{3D}(x, y, z) = \Pi_{2D}(x + y/p_r, z)$ owns blocks $L[x, y, z]$ and $B[x, y, z]$.
3: $L[x, y, z] = \text{Allgather}(L[x, y, z], \Pi_{3D}(x, y, z))$.
4: $X[x, y, z] = \text{Rec-TRSM}(L[x, y, z], \Pi_{3D}(x, y, z))$.
5: else if $n \leq n_0$ or $p_r = p_c = 1$ then
6: $L = \text{Allgather}(L, \Pi_{3D}(x, y, z), \Pi_{2D}(x, y, z))$.
7: $B[x + yp_r, y] = \text{AllToAll}(B[x, y, z], \Pi_{2D}(x, y, z))$.
8: $X[x + yp_r, y] = \text{Rec-TRSM}(L[x + yp_r, y], \Pi_{2D}(x, y, z))$.
9: $X[x + yp_r, y] = \text{AllToAll}(B[x + yp_r, y], \Pi_{2D}(x, y, z))$.
10: else
11: Partition $L = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}$, so $L_{ij} \in \mathbb{R}^{2 \times 2}$.
12: Partition $B = \begin{bmatrix} B_1 & B_2 \\ B_3 & B_4 \end{bmatrix}$, so $B_i, x_i \in \mathbb{R}^{2 \times k}$.
13: $X_1 = \text{Rec-TRSM}(L_{11}, B_1, \Pi_{2D}, n/2, k, p_r, p_r, p_n)$.
14: $B_2 = B_2 - \text{MM}(L_{21}, X_1, \Pi_{2D}, n/2, k, p_r, \beta^{1/3}(n/k)p_r^{1/3}/p_r^{1/3}(n/k)p_r^{1/3})$.
15: $X_2 = \text{Rec-TRSM}(L_{22}, B_2, \Pi_{2D}, n/2, k, p_r, p_r, p_n)$.
16: end if

Ensure:

$X = L^{-1}B$ is distributed on $\Pi_{2D}$ in the same way as $B$.

A. Cost Analysis of the Recursive Algorithm

We select $p_r = \max(\sqrt{p}, \min(p, \sqrt{p}/k))$ and $p_r = \min(\sqrt{p}, \max(1, \sqrt{p}/k))$. The cost of the allgather on line 3 is

$$T_{\text{part-colr}}(n, p_r) = O\left(\beta \cdot \frac{n^2}{p_r^2} + \alpha \cdot \log(p)\right).$$

since each $L'[x, y] \in \mathbb{R}^{p_r \times k}$ is lower triangular. Once we have a square processor grid, we partition $L$, yielding the recurrence,

$$T_{\text{rec}}(n, k, p, n_0) = T_{\text{MM}}(n/2, k, p, p_r^{1/3}(n/k)^{1/3} \cdot p_r^{1/3}(n/k)^{2/3}) + 2T_{\text{rec}}(n/2, k, p, n_0).$$
We now derive the cost of the algorithm for different relations between \( n, k, \) and \( p, \) as in the expression for \( T_{MM}. \)

**One large dimension:** When \( n < k/p, \) we have \( p_r = 1 \) and \( p_c = p, \) and the first allgather will be the only communication, therefore,

\[
T_{RT1D}(n, k, p) = O\left( (\alpha \cdot \log(p) + \beta \cdot n^2 + \gamma \cdot n^2 k/p) \right).
\]

**Two large dimensions:** When \( n > k \sqrt{p}, \) we will select \( p_r = p_c = \sqrt{p} \) and the column partitioning of \( B \) is not performed. In this case, the MM algorithm will always have \( p_2 = 1 \) (it will be 2D). For sufficiently large \( k, \) this leads us to the recurrence,

\[
T_{RT2D}(n, k, p, n_0) = T_{MM2D}(n/2, k, p) + 2T_{RT2D}(n/2, k, p, n_0),
\]

where \( T_{MM2D}(n, k, p) = O\left( (\alpha \cdot \log(p) + \beta \cdot n^2 + \gamma \cdot n^2 k/p) \right). \)

The bandwidth cost stays the same at every recursive level, while the computation cost decreases by a factor of 2. At the base case, we incur the cost \( T_{RTBC}(n_0, k, p) = O\left( (\alpha \cdot \log(p) + \beta \cdot n^2 + n^2 k/p) \right). \)

We select \( n_0 = \max(\sqrt{p}, n \log(p)/\sqrt{p}) \), so \( n/n_0 \leq \sqrt{p}/\log(p) \), which results in the overall cost,

\[
T_{RT2D}(n, k, p) = O\left( (\alpha \cdot \log(p) + \beta \cdot \frac{n k \log(p)}{\sqrt{p}} + \gamma \cdot \frac{n^2 k}{p}) \right).
\]

The bandwidth cost above is suboptimal by a factor of \( O(\log(p)). \) The overhead is due to the recursive algorithm re-broadcasting some of the same elements of \( B \) at every recursive level. We use an iterative approach for our subsequent TRSM algorithm to avoid this redundant communication.

**Three large dimensions:**

When \( k/p < n < k/\sqrt{p}, \) the algorithm partitions the columns of \( B \) initially then recursively partitions \( L. \) In particular, we select \( p_c = \max(\sqrt{p}, \sqrt{nk/n}) \) and \( p_r = p_c = \min(\sqrt{p}, \sqrt{kn/k}). \) so the first step partitions a rectangular processor grid into \( \max(1, k/n) \) fewer processor grids. After the first step, which partitions \( B, \) we have independent subproblems with \( p_r \) processors. We now start recursively partitioning \( L, \) yielding the cost recurrence,

\[
T_{RT3D}(n, k, p_r^2, n_0) = T_{MM3D}(n/2, k, p_r^2) + \frac{T_{part-cols}(n, p) 1_{k/n}}{2} + 2T_{RT3D}(n/2, k, p_r^2, n_0),
\]

Above, we always employ the MM algorithm in the 3D regime by selecting \( p_1 = \frac{n^2 k}{p} \) and \( p_2 = \frac{n k \log(p)}{p}. \) This gives to the cost recurrence, \( T_{RT3D}(n, k, p_r^2, n_0) = O\left( (\alpha \cdot \log(p) + \beta \cdot \frac{n k \log(p)}{p} + \gamma \cdot \frac{n^2 k}{p}) \right). \)

where we can see that \( \frac{n k \log(p)}{p} = O((n^2 k/p^2)^{2/3}) \), since the initial partitioning will give \( n > k. \) It is also easy to see that \( \frac{n^2 k}{p^2} = O((n^2 k/p^2)^{2/3}) \). With these simplifications,

\[
T_{RT3D}(n, k, p_r^2, n_0) = O\left( (\alpha \cdot \log(p) + \beta \cdot \frac{n^2 k}{p^2} + \gamma \cdot \frac{n^2 k}{p^2}) + 2T_{RT3D}(n/2, k, p_r^2, n_0) \right).
\]

We observe that the bandwidth cost \( O((n^2 k/p^2)^{2/3}) \) decreases by a factor of \( 2^{1/3} \) at every recursive level, and the computation cost by a factor of 2. The base-case cost will be \( T_{RTBC}(n_0, k, p_r^2). \) We select \( n_0 = n^{1/3} \left( \frac{k}{p^2} \right)^{1/2} \), giving a total cost over all base cases of \( T_{RT3D}(n, k, p_r^2) = O\left( (\alpha \cdot \frac{n \log(p)}{k} + \beta \cdot (n^2 k/p^2) + \gamma \cdot n^2 k/p^2) \right) \)

\[
= O\left( (\alpha \cdot \frac{n p^2}{k})^{2/3} \log(p) + \beta \cdot (n^2 k/p^2)^{2/3} + \gamma \cdot n^2 k/p^2 \right).
\]

Therefore, the overall cost incurred on each square processor grid is \( T_{RT3D}(n, k, p_r^2) = \)

\[
O\left( (\alpha \cdot \frac{n p^2}{k})^{2/3} \log(p) + \beta \cdot (n^2 k/p^2)^{2/3} + \gamma \cdot n^2 k/p^2 \right),
\]

which is the same as for the case \( k \leq n. \) For \( n = k, \) the 3D costs obtained above are the same as the most efficient algorithms for \( n \times n \times n \) LU factorization. In subsequent sections, we show that a lower synchronization cost is achievable via selective use of triangular matrix inversion.

**V. TRIANGULAR INVERSION**

In this section, we derive the cost of inverting a lower triangular matrix \( L \) of size \( n \times n \times n \) with \( p \) processors. Since the input matrix is square, the dimensions of the processor grid \( II \) should be identical in two dimensions leaving us with \( \dim(II) = p_1 \times p_1 \times p_2, \) where \( p = p_1 p_2. \) We assume the initial matrix to be cyclically distributed on the subgrid \( II(0, 0, 1). \)

**A. Algorithmic Approach**

In [23], a recursive method for inverting triangular matrices was presented. A similar method for full inversion was presented in [24]. When applied to a triangular matrix, those methods coincide. The method uses the triangular structure of the initial matrix to calculate the inverse by subdividing the problem into two recursive matrix inversion calls, which can be executed concurrently and then uses two matrix multiplications to complete the inversion.
Since the subproblems are independent, we want to split the processor grid such that two distinct sets of processors work on either subproblem. We chose the base case condition to be that the grid is one-dimensional in the dimension of \( p_1 \) and we do redundant base case calculations in this subgrid. For this section, we consider \( p_2 \geq p_1 \), a constraint that we will fulfill anytime the method is called.

\[
L^{-1} = \text{RecTriInv}(L, \Pi, p, p_1, p_2)
\]

**Require:**

The processor grid \( \Pi \) has dimensions \( \sqrt{p} \times \sqrt{p} \). Let \( L \) be a lower triangular \( n \times n \) matrix, distributed on \( \Pi \) in a cyclic layout, so \( \Pi(x, y) \) owns \( L[x, y] \) of size \( \frac{n}{\sqrt{p}} \times \frac{n}{\sqrt{p}} \) such that

\[
L[x, y](i, j) = L(i\sqrt{p} + x, j\sqrt{p} + y).
\]

1. if \( p_1 = 1 \) then
2. AllToAll \((L[x, \Pi], \Pi(x, \circ))\)
3. \( L^{-1} = \text{sequential inversion}(L) \)
4. else
5. Subdivide \( L \) into \( n/2 \times n/2 \) blocks,
6. \( L = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \)
7. Subdivide the processor grid \( \Pi = [\Pi_1, \Pi_2] \) such that \( \dim(\Pi_1) = \dim(\Pi_2) = \left(\frac{\sqrt{p}}{2} \times \frac{\sqrt{p}}{2}\right) \)
8. Redistribute \((L_{11}, \Pi \rightarrow \Pi_1)\)
9. Redistribute \((L_{22}, \Pi \rightarrow \Pi_2)\)
10. \( L_{11}^{-1} = \text{Rec-Tri-Inv}(L_{11}, \Pi_1, p, p_1/2^{3/3}, p_2/2^{2/3}) \)
11. \( L_{22}^{-1} = \text{Rec-Tri-Inv}(L_{22}, \Pi_2, p, p_1/2^{3/3}, p_2/2^{2/3}) \)
12. \( L_{21}^{-1} = -\text{MM}(L_{22}^{-1}, L_{21}, \Pi, n, n, p, p_1, p_2) \)
13. \( L_{21}^{-1} = \text{MM}(L_{21}^{-1}, L_{11}^{-1}, \Pi, n, n, p, p_1, p_2) \)
14. Assemble \( L^{-1} \) from the \( n/2 \times n/2 \) blocks,
15. \( L^{-1} = \begin{bmatrix} L_{11}^{-1} & 0 \\ L_{21}^{-1} & L_{22}^{-1} \end{bmatrix} \)

16. end if

**Ensure:**

\( LL^{-1} = I \) where \( L^{-1} \) is distributed the same way as \( L \)

**B. Total Cost of Triangular Inversion**

This recursive approach of inverting a matrix has total cost,

\[
T_{\text{RecTriInv}}(n, p_1, p_2) = 2T_{\text{MM}}(n/2, n/2, p_1, p_2) + T_{\text{RecTriInv}}(n/2, p_1/2^{1/3}, p_2/2^{2/3}) + T_{\text{redistr}}(n/2, p_1, p_2),
\]

with a base case cost of

\[
T_{\text{RecTriInv}}(n_0, 1, p_2) = \alpha \cdot 2 \log \left( \frac{p_2}{p_1} \right) + \beta \cdot 2n_0^2 + \gamma \cdot n_0^3.
\]

The base case size will be \( n_0 = \frac{n}{p_1^{2/3}} \) and therefore neither of the terms is of leading order. We observe that the bandwidth cost of the matrix multiplication \( O\left((n^3/p_1^2)^{2/3}\right) \) decreases by a factor of \( 2^{1/3} \) at every recursive level, and the computation cost by a factor of 2. The redistribution process requires moving the matrices from a cyclic processor grid to a smaller cyclic processor grid, with the block each processor owns having a factor of \( 2^{1/3} \) more rows and columns. This redistribution is effectively an all-to-all between a larger and a smaller set of processors. We can get a concrete bound on the cost, by first performing an all-to-all to transition to a blocked layout (each processor owns contiguous blocks of the matrix). Then we can transition to a blocked layout on the smaller processor grid by scattering each block to at most 4 processors. Finally, we can perform an all-to-all on the smaller processor grid to transition from the blocked layout back to a cyclic one. The overall cost of these steps is \( O(\alpha \cdot \log(p) + \beta \cdot n_0^2 \log(p)/p) \) This redistribution bandwidth cost is dominated by the cost of the matrix multiplication.

The total cost for the recursive inversion is

\[
T_{\text{RecTriInv}}(n, p_1, p_2) = \beta \cdot \frac{n^2}{2^{2/3} - 1} \left( \frac{n^2}{8p_1^3} + \frac{n^2}{2^2p_1^2} \right)
\]

In contrast to LU factorization and our recursive TRSM algorithm, the synchronization cost is logarithmic rather than polynomial in \( p \).

**VI. ITERATIVE TRIANGULAR SOLVER**

In this section, we present our main contribution, a 3D TRSM algorithm that uses inversion of diagonal blocks to achieve a lower synchronization cost. By precomputing the inversions, we replace the latency-dominated small TRSMs with more parallel matrix multiplications.

**A. Block-Diagonal Triangular Inversion**

In order to lower the synchronization cost of TRSM, first we invert a set of triangular blocks along the diagonal of the matrix, each with a distinct subset of processors. We split \( \Pi \) into \( \frac{n}{n_0} \) subgrids of dimensions \( r_1 \times r_1 \times r_2 \), where \( r_1^2 r_2 = p \frac{\sqrt{n}}{n} \). To have the proper layout for the inversion, a transition from the original, cyclic layout on a subgrid to the grid as described in Section III has to happen. Afterwards, all the inversions can be done in parallel. To support our inversion, we must have \( r_2 > r_1 \) and \( n_0 \geq \sqrt{r_1 r_2} \). The precise choices of \( r_1 \) and \( r_2 \) are given in the algorithm and will be discussed in Section VII.

**B. Triangular Solve using Partial Inversion**

Initially, we want \( L \) to be distributed on the top level of the three dimensional grid \( \Pi \) in a cyclic layout such that each processor \( \Pi(x, y, 1) \) owns \( L(y : p_1, n, x : p_1 : n) \). Also, we set the right hand side to be distributed on one level of the grid with a blocked layout with a physical block size of \( b \times \frac{p_2}{p_1} \) such that each processor \( \Pi(x, 1, z) \) owns \( B(x : p_1 : \frac{n}{b}, zk/p_2 : (z + 1)k/p_2) \).
Additionally, each processor has memory of the same size $\frac{n}{p_1} \times \frac{n}{p_1}$ matrix such that $L[x, y](i, j) = L(ip_1 + x, jp_1 + y)$.

Define $\bar{L} = \text{Diagonal-Inverter}(L, \Pi, n, p_1, p_2, n_0)$

Require:
The processor grid $\Pi$ has dimensions $p_1 \times p_1 \times p_2$
$L$ is a lower triangular $n \times n$ matrix distributed cyclically on $\Pi$ such that processor $\Pi(x, y, 1)$ owns $L[x, y]$ a lower triangular $\frac{n}{p_1} \times \frac{n}{p_1}$ matrix such that $L[x, y](i, j) = L(ip_1 + x, jp_1 + y)$.

1: Define $q = \frac{p_0}{n_0}$ $r = \frac{n}{n_0}$
2: Define $r_1 = \left(\frac{p_0}{k_0}\right)^{1/3}$
3: Define $r_2 = \left(\frac{16p_0}{n_0}\right)^{1/3}$
4: Define a $p_1 \times p_1 \times \sqrt{p_2} \times \sqrt{p_2}$ processor grid $\Pi_{4D}$, such that $\Xi_{4D}(x_1, x_2, y_1, y_2) = \Xi(x_1, x_2, y_1 + \sqrt{p_2}y_2)$, and $\Pi_{4D}(x_1, x_2, y_1, y_2)$ owns blocks $L[x_1, x_2, y_1, y_2]$
5: Define a block diagonal matrix $L_D[x_1, x_2, y_1, y_2]$ such that
   - $L_D[x_1, x_2, y_1, y_2][b]$ denotes
     - the block $L[x_1, x_2, y_1, y_2](bn_0 : (b + 1)n_0, bn_0 : (b + 1)n_0)$
6: Scatter $L_D[0, o, y_1, y_2][o], \Pi_{4D}(x, y_1, y_2)$
7: Define a $\sqrt{p_2} \times \sqrt{p_2}$ processor grid $\Pi_{4D}$, such that $\Xi_{4D}(x_1 + 1, x_2 + 1, y_1, y_2) = \Xi(x_1, x_2, y_1, y_2)$.
8: Define a $\sqrt{q} \times \sqrt{q} \times \sqrt{p_2} \times \sqrt{p_2}$ processor grid $\Pi_{4D}$, such that $\Xi_{4D}(u_1, u_2, v_1, v_2) = \Xi_{4D}(u_1 + \sqrt{q}u_2, u_2 + \sqrt{q}v_2)$ owns blocks $L_D[u_1, u_2, v_1, v_2]$.
9: AllToAll $L_D[u_1, u_2, v_1, v_2][o], \Pi_{4D}(u_1, u_2, o, o)$
10: For $i = 0 : \sqrt{n_0} - 1$ do in parallel
11: For $j = 0 : \sqrt{n_0} - 1$ do in parallel
12: Define $b = (i + \sqrt{n_0}j)$
13: $\bar{L}_D[o, o, i, j][b] = \text{RecTrInv}(L_D[o, o, i, j][b], \Pi_{4D}(o, o, i, j), q, r_1, r_2)$
14: End for
15: End for
16: AllToAll $\bar{L}_D[u_1, u_2, v_1, v_2][o], \Pi_{4D}(u_1, u_2, o, o)$
17: Gather $\bar{L}_D[o, o, y_1, y_2][o], \Pi_{4D}(x_1, x_2, o, o)$

Ensure: $L_D L_D = 1$ \hspace{1cm} $\forall i$ where $L$ and $\bar{L}$ are partitioned the same way

Additionally, each processor has memory of the same size as its part of $B$ allocated for an update-matrix denoted as $B_j, j \in [1, p_1]$, where also each processor $\Pi(x, y, z)$ owns $\Xi_{4D}(x : p_1, \frac{2}{p_2}) \times \Xi_{4D}(z : (p_1 + 1)k/p_2)$. The algorithm itself consists of two parts: first 'inversion', we invert all the base-cases on the diagonal in parallel as described in the algorithm above and, second 'solve', we do the updates and calculate the solution to TRSM.

**VII. Cost Analysis of the Iterative TRSM**
In this section we will derive a performance model for the algorithm presented in Section VI. The total cost of the algorithm is put together from the cost of its three subroutines:

$T_{\text{Inv}}(n, k, n_0, p_1, p_2) = T_{\text{inv}}(n, p_1, p_2) + T_{\text{Upd}}(n, k, n_0, p_1, p_2) + T_{\text{Solve}}(n, k, n_0, p_1, p_2).

Above the cost denoted by inversion is the part of the algorithm that inverts the blocks (Algorithm Diagonal-Inverter). The solve part is in lines 4-5, and the update in lines 7-8.

$X = \text{II-Inv-TRSM}(L, B, \Pi, n, k, p_1, p_2, r_1, r_2)$

Require:
The processor grid $\Pi$ has dimensions $p_1 \times p_1 \times p_2$
$L$ is a lower triangular $n \times n$ matrix distributed on $\Pi$ such that $\Pi(x, y, 1)$ owns $L[x, y]$ of size $n/p_1 \times n/p_1$ such that $L[x, y](i, j) = L(ip_1 + x, jp_1 + y)$
$B$ is a dense $n \times k$ matrix is distributed such that $\Pi(x, 1, z)$ owns $B[x, z]$ of size $\frac{p_1}{k} \times \frac{p_2}{k}$. Such that $B[x, z](i, j) = L(ip_1 + x, \frac{k}{p_2}j + p_2)$

Define blocks $S_i = n_0 : (i + 1)n_0$ and $T_i = n_0 : n$

Line 6 \hspace{1cm} $\alpha \cdot \log(p_2) + \beta \cdot \frac{n_0}{2p_1^2}$
Line 9 \hspace{1cm} $O(\alpha \cdot \log(p) + \beta \cdot \frac{n_0 \log p}{2p})$
Line 16 \hspace{1cm} $O(\alpha \cdot \log(p) + \beta \cdot \frac{n_0 \log p}{2p})$
Line 17 \hspace{1cm} $O(\alpha \cdot \log(p) + \beta \cdot \frac{n_0}{2p_1^2})$

These cost are never of leading order compared to the costs that arise form the triangular inversion. With the derivations done in Section V, we get the following costs for inversion:

**Latency Cost:** The total latency cost of inversion is $S_{\text{inv}}(p) = O(\alpha \log^2 p)$.
Bandwidth Cost: In order to minimize the bandwidth cost of triangular inversion, we choose a grid splitting to achieve closest to ideal ratios for the subgrids processor layout \( r_1 \) and \( r_2 \). This ratio is achieved when \( r_2 = 4r_1 \). The choices for \( r_1 \) and \( r_2 \) are therefore,
\[
r_1 = \left( \frac{m n_0}{4n} \right)^{1/3} \quad \text{and} \quad r_2 = \left( \frac{16m n_0}{n} \right)^{1/3}.
\]
With this grid slicing, we get \( \frac{n}{n_0} \) different subgrids of dimensions \( r_1 \times r_1 \times r_2 \). This setup leads to a cost for inverting \( \frac{n}{n_0} \) submatrices of
\[
W_{\text{inv}}(n_0, r_1, r_2) = \frac{2^{1/3}}{2^{1/3} - 1} \left( \frac{n_0^2}{8r_1^3} + \frac{n_0^2}{2r_1r_2} \right).
\]

Flop Cost: The flop cost of the inversion part is
\[
F_{\text{inv}}(n_0, p_1, p_2) = \frac{1}{8} \frac{n n_0^2}{p_1^2 p_2^2}.
\]

B. Solve Cost

The complete solve cost can be derived by
\[
T_{\text{solve}}(n, n_0, k, p, p_1, p_2) = \frac{n}{n_0} T_{\text{MM}}(n_0, k, p, p_1, p_2)
\]

Latency Cost: The latency cost of the solve part is
\[
S_{\text{solve}}(n, n_0, p) = O \left( \frac{n}{n_0} \log p \right).
\]

Bandwidth Cost: The cost of the solve is one call to triangular matrix multiplication for each base case. The synchronization cost is again dominated by the \( \frac{n}{n_0} \) cases. The cost for these sums has been presented in Section III. Including these, we obtain a total cost of
\[
W_{\text{solve}}(n, k, n_0, p, p_1, p_2) = \frac{n}{n_0} W_{\text{MM}}(n_0, k, p, p_1, p_2)
\]
\[
= \frac{n}{n_0} \left[ \left( \frac{n_0^2}{p_1^2} \right) I_{p_2} + \left( \frac{n_0 k}{p_1 p_2} \right) I_{p_1} \right].
\]

Flop Cost: The flop cost of the solve part is
\[
F_{\text{solve}}(n, k, n_0, p_1, p_2) = \frac{n}{n_0} \left( \frac{n_0^2 k}{p_1^2 p_2^2} \right).
\]

C. Update Cost

The complete solve cost can be derived by
\[
T_{\text{Upd}}(n, k, n_0, p, p_1, p_2) = \sum_{i=1}^{n/n_0 - 1} T_{\text{MM}}(n - in_0, n_0, k, p, p_1, p_2).
\]

Latency Cost: The update latency cost is
\[
S_{\text{Upd}}(n, n_0, p) = O \left( \frac{n - n_0}{n_0} \log p \right).
\]

Bandwidth Cost: The cost of doing all the updates as described in the algorithm in Section VI (Lines 5-9) is the cost of both the allreductions and the broadcast,
\[
W_{\text{Upd}}(n, k, n_0, p_1, p_2) = \sum_{i=1}^{n/n_0 - 1} \left[ W_{\text{broadcast}} \left( \frac{n_0 i - in_0}{p_1}, p_2 \right) + W_{\text{allreduction}} \left( \frac{n_0 k}{p_1 p_2}, p_1 \right) + W_{\text{allreduction}} \left( \frac{n_0 k}{p_1 p_2}, p_1 \right) \right].
\]

This yields to a total cost of \( W_{\text{Upd}}(n, k, n_0, p_1, p_2) = \frac{n - n_0}{n_0} \left( \frac{kn n_0}{p_1^2 p_2^2} \right) \).

Flop Cost: The update flop cost is
\[
F_{\text{Upd}}(n, k, n_0, p_1, p_2) = \frac{n - n_0}{n_0} \left( \frac{k n n_0}{p_1^2 p_2^2} \right).
\]

D. Total Cost

The total cost of the algorithm is the sum of its three parts and leaves a lot of tuning room as with a choice of \( p_1 = 1 \), \( p_2 = 1 \) or \( n_0 = n \) one is able to eliminate certain terms.

Latency Cost: The total latency cost of the algorithm is a sum of the previous parts,
\[
S_{\text{It-Inv-TRSM}}(p_1, p_2, r_1, b) = S_{\text{Upd}}(n, n_0, p_1, p_2) + S_{\text{solve}}(n, n_0, p_1, p_2) + S_{\text{inv}}(p_1, p_2, r_1, b)
\]
\[
= O \left( \alpha \left( \frac{n}{n_0} \log p + \log^2 p \right) \right).
\]

Bandwidth Cost: The total bandwidth cost for the TRSM algorithm is, by abbreviating \( \nu = \frac{2^{1/3}}{2^{1/3} - 1} \),
\[
W_{\text{It-Inv-TRSM}}(n, k, n_0, p_1, p_2, u, v, b) = W_{\text{Upd}}(n, k, n_0, p_1, p_2) + W_{\text{solve}}(n, k, n_0, p_1, p_2)
\]
\[
+ W_{\text{inv}}(n, b, p_1, p_2, u, v)
\]
\[
= \frac{n}{n_0} \left[ \left( \frac{n_0^2}{p_1^2} \right) I_{p_2} + \frac{n_0 k}{p_1 p_2} I_{p_1} \right]
\]
\[
+ \frac{n - n_0}{n_0} \left[ \frac{4}{p_1^2 p_2^2} I_{p_2} + \frac{4 n_0 k}{p_1 p_2} I_{p_1} \right] + \nu \left( \frac{n_0^2}{2r_1 r_2} + \frac{n_0^2}{2p_1^2 r_2} \right).
\]

Flop Cost: Lastly, the combined total flop cost is
\[
F_{\text{It-Inv-TRSM}}(n, k, n_0, p_1, p_2) = F_{\text{Upd}}(n, n_0, p_1, p_2)
\]
\[
+ F_{\text{solve}}(n, n_0, p_1, p_2) + F_{\text{inv}}(n_0, u, v, p_1, p_2) = \frac{n k}{p_1^2 p_2^2} + \frac{n_0^2 n}{p_1^2 p_2^2}.
\]

VIII. PARAMETER TUNING

In this section, we give asymptotically optimal tuning parameters for different relative matrix sizes to optimize performance. We only focus on asymptotic parameters as there is a trade off between the constant factors on the bandwidth and latency costs. The exact choice is therefore machine
dependent and should be determined experimentally. The initial grid layout is dependent on the relative matrix sizes of $L$ and $B$ since the update part of the algorithm is one of the dominating terms in any case where there is an update to be made and determines the case where is is infeasible. The different layouts are shown in Figure 1.

In the case where $n < \frac{4k}{p}$, the processor grid layout is one-dimensional. The optimal parameters are given in the following table.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>$p_1$</td>
<td>1</td>
</tr>
<tr>
<td>$p_2$</td>
<td>$\sqrt{p}$</td>
</tr>
<tr>
<td>$r_1$</td>
<td>$O\left(\left(\frac{nk}{p}\right)^{3/4}\right)$</td>
</tr>
<tr>
<td>$r_2$</td>
<td>$O\left(\left(\frac{k}{n}\right)^{1/4}\right)$</td>
</tr>
</tbody>
</table>

Using this set of parameters will yield to a total cost of

$$T_{RT1D}(n,k,p) = O\left(\alpha \left(\log^2 p + \log p\right) + \beta \cdot n^2 + \gamma \cdot \frac{n^2 k}{p}\right).$$

Comparing these costs to the costs of $T_{RT1D}$ obtained in Section IV-A, we can see that we are asymptotically more efficient in terms of latency by a factor of at least $p^{3/4}$ as well as in bandwidth by a factor of $\log p$ while having the same flop cost asymptotically. This is a significant gain and especially important as the occurrence of fewer right hand sides $k < n$ is high.

In the case where $n > \frac{4k}{p}$, the processor grid layout is two-dimensional. The optimal parameters are given in the following table.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>$p_1$</td>
<td>$\sqrt{p}$</td>
</tr>
<tr>
<td>$p_2$</td>
<td>$\sqrt{p}$</td>
</tr>
<tr>
<td>$r_1$</td>
<td>$O\left(\left(\frac{nk}{p}\right)^{3/4}\right)$</td>
</tr>
<tr>
<td>$r_2$</td>
<td>$O\left(\left(\frac{k}{n}\right)^{1/4}\right)$</td>
</tr>
</tbody>
</table>

Using this set of parameters will yield to a total cost of

$$T_{RT2D}(n,k,p) = O\left(\alpha \left(\log^2 p + \log p\right) + \beta \left(\frac{n^2 k}{p}\right)^{2/3}\right) + \gamma \left(\frac{n^2 k}{\sqrt{p}}\right).$$

Comparing these costs to the cost of $T_{RT2D}$ obtained in Section IV-A, we can see that we are asymptotically more efficient in terms of latency by a factor of $(\sqrt{p})^{1/6}$ and $p^{2/3}$ while being able to keep bandwidth and flop costs asymptotically constant.

**IX. CONCLUSION**

<table>
<thead>
<tr>
<th>Approach</th>
<th>S</th>
<th>W</th>
<th>F</th>
</tr>
</thead>
<tbody>
<tr>
<td>Standard</td>
<td>$\log p$</td>
<td>$n^2$</td>
<td>$\frac{n^2 k}{p}$</td>
</tr>
<tr>
<td>New method</td>
<td>$\log^2 p$</td>
<td>$n^2$</td>
<td>$\frac{n^2 k}{p}$</td>
</tr>
<tr>
<td>Standard</td>
<td>$\sqrt{p}$</td>
<td>$\log p \cdot \frac{k}{\sqrt{p}}$</td>
<td>$\frac{n^2 k}{p}$</td>
</tr>
<tr>
<td>New method</td>
<td>$\log^2 p + \left(\frac{n}{k}\right)^{3/4} \frac{1}{p^{3/4}} \log p$</td>
<td>$\frac{n^2 k}{p}$</td>
<td></td>
</tr>
<tr>
<td>Standard</td>
<td>$\left(\frac{n}{k}\right)^{2/3} \log p$</td>
<td>$\left(\frac{n^2 k}{p}\right)^{2/3}$</td>
<td>$\frac{n^2 k}{p}$</td>
</tr>
<tr>
<td>New method</td>
<td>$\log p + \sqrt{\frac{n}{k}} \log p$</td>
<td>$\left(\frac{n^2 k}{p}\right)^{2/3}$</td>
<td>$\frac{n^2 k}{p}$</td>
</tr>
</tbody>
</table>

We present a new method for solving triangular systems for multiple right hand sides. In the above table, we compare to a baseline algorithm adapted from [3] that achieves costs that are as good or better than the state of the art [18], [21], [25]. Our algorithm achieves better theoretical scalability than these alternatives by up to a factor of $(\frac{n}{k})^{1/6} p^{2/3}$. For certain matrix dimensions, a decrease of bandwidth cost by a factor of $\log_2 p$ is obtained by use of selective triangular matrix inversion. By only inverting triangular blocks along the diagonal of the initial matrix, we generalize the usual way of TRSM computation and the full matrix inversion approach. Fine-tuning the algorithm based on the relative input sizes as well as the number of processors available leads to a significantly more robust algorithm. The cost
analysis of this new method allows us to give recommendations for asymptotically optimal tuning parameters for a wide variety of possible inputs. The detailed pseudo-code provides a direct path toward a more efficient parallel TRSM implementation.

REFERENCES


