Abstract—Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units that are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. This work shows that vectorization can secure high-performance in real, boolean, and sel-max semiring operations. The resulting semiring operations by the whole SIMD execution unit. We conduct a work/storage complexity and performance analysis and discover which semiring is the most advantageous for BFS-SpMV based on Sell-C-σ for a large selection of parameters. To the best of our knowledge, this is the first theoretical work complexity analysis as well as performance study on the benefits and downsides of each considered semiring.

I. INTRODUCTION

Upcoming CPUs armed with wide vector instruction units as well as GPUs with even wider warps of cores will transform the way graph algorithms are reasoned about, developed, and executed. Graph schemes are traditionally optimized for architectures based on 32- or 64-bit instructions. This is not only due to the prevalence of such systems but also because graph processing is most often based on irregular fine-grained memory accesses [19]. With the trend to develop machines deploying large vector instruction or SIMD units, we must rethink the design of graph algorithms to use the available hardware effectively.

One such important algorithm is Breadth-First Search (BFS) [7] that is used in many scientific domains such as machine learning, data mining, and computational sciences. Moreover, it has applications in various graph-related problems, including bipartiteness testing and the Ford-Fulkerson method. Thus, despite the wealth of existing optimizations for BFS, the growing amounts and sizes of graph datasets require even faster BFS schemes.

It is well-known that BFS can be viewed in either the traditional combinatorial or the algebraic abstraction. Traditional BFS is based on primitives such as queues. Algebraic BFS is a series of matrix-vector (MV) products over various semirings [13]; the products consist of a sparse matrix and a dense vector (SpMV) or a sparse matrix and a sparse vector (SpMSpV). BFS based on SpMV (BFS-SpMV) uses no explicit locking or atoms and has a succinct description as well as good locality [13]. Yet, it needs more work than traditional BFS and BFS based on SpMSpV [29].

There exist some BFS schemes for GPUs [25, 22, 6, 29, 9, 12]. However, they usually underutilize the available SIMD and vectorization parallelism as they focus on work-optimal traditional BFS or BFS based on SpMSpV that use fine-grained irregular memory accesses. Moreover, they are often tuned to a given architecture and thus not portable. In this work, we illustrate that even if BFS-SpMV is not work-optimal, one can use vectorization to outweigh potential performance penalties and secure high-performance. For this, we first show how to combine the Sell-C-σ [14] sparse matrix storage format with the algebraic BFS based on tropical, real, boolean, and sel-max semirings. Intuitively, Sell-C-σ decomposes a matrix into chunks stored in such a way that each chunk can be processed without additional overheads by the whole SIMD execution unit. We conduct a work/storage complexity and performance analysis and discover which semiring is the most advantageous for BFS-SpMV based on Sell-C-σ for a large selection of parameters. To the best of our knowledge, this is the first theoretical work complexity analysis as well as performance study on the benefits and downsides of each considered semiring.

The advantage of Sell-C-σ is that it enables portable performance for SpMV products on CPUs and GPUs. Still, Sell-C-σ was not originally built for graph processing where it does not perform well. To alleviate this, we propose SlimSell: a vectorizable graph representation that needs up to 50% less storage than Sell-C-σ, lowering the pressure on the memory subsystem when traversing undirected graphs. Intuitively, we do not explicitly store non-zero entries of the adjacency matrix but instead derive them implicitly from column indices. We then augment SlimSell with the SlimWork and the SlimChunk schemes to reduce the amount of work and load imbalance, and thus accelerate BFS even further.
Here, two key ideas are to: skip chunks of computations if they correspond to final output values (e.g., vertex distances) that would not change, and divide large chunks among multiple compute units; an example illustration of SlimSell advantages is in Figure 1. We then analyze the work and storage complexity of BFS based on SlimSell and derive bounds for general graphs as well as for graph models that follow uniform (Erdős-Rényi) and power-law vertex degree distributions. Our evaluation on multi- and manycore CPUs and GPUs shows that the resulting BFS-SpMV outperforms traditional BFS by up to 33%. Finally, this is one of the first works to report BFS results on the state-of-the-art Intel Knights Landing (KNL) Xeon Phi manycore.

We provide the following contributions:

- We combine the state-of-the-art Sell-C-σ matrix storage format with algebraic BFS based on tropical, boolean, real, and set-mem samplings. We conduct work/storage/complexity and performance analysis and show which semiring secures the highest performance.
- We introduce SlimSell, a vectorizable graph representation that improves the storage complexity of Sell-C-σ and accelerates BFS based on SpMV products.
- We propose SlimWork and SlimChunk, two schemes that reduce the amount of work and improve load balancing in BFS based on SlimSell.
- We evaluate SlimSell and show performance improvements of 33% over the optimized Graph500 BFS code and storage reductions of 50% over Sell-C-σ. Our work covers multi- and manycore CPUs as well as GPUs, with the total of seven different systems; we provide one of the first BFS results on the Intel KNL manycore.

II. BACKGROUND

A. Notation

For convenience, Table I lists the most important symbols used in the paper.

1) Graph Structure: We describe a graph $G$ as a tuple $(V, E)$; $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges between vertices; $|V| = n$ and $|E| = m$. $A$ is the adjacency matrix of $G$. $G$'s diameter is denoted as $D$.

2) Linear Algebra: Bold symbols refer to vectors and matrices while standard font indicates a vector/matrix element. The matrix-vector (MV) multiplications are denoted with $\otimes$. The element-wise vector-vector (Hadamard) product is referred to with $\odot$. The lower index is used to indicate the iteration number (e.g., $x_k$ is $x$ in iteration $k$) while the upper index refers to a vertex element (e.g., $x_k^i$ is the $i$th element of $x_k$); we also use square brackets when referring to vector elements in the source code when they are implemented with arrays ($x_k \equiv x_k[i]$). The logical negation of a vector $x$ is an element-wise operation $\bar{x}$ where $\bar{x}^i = 0$ if $x^i \in \mathbb{R} \setminus \{0\}$, and $\bar{x}^i = 1$ if $x^i = 0$ (e.g., $(0, 7, 0)^T = (1, 0, 1)^T$).

B. Vectorization: Concepts and Considered Architectures

Modern x86 CPUs feature single instruction, multiple data (SIMD) execution units (also called vector instruction units) [18]. One unit has several lanes that proceed synchronously in parallel and execute a specified scalar instruction on a data vector; each lane processes a vector element of a specified size. In modern CPUs with AVX (Advanced Vector Extensions), registers that hold the vectors are 256-bit wide, enabling 8 single-precision (or 4 double-precision) floating point operations per instruction. Consequently, the SIMD width is respectively 8 and 4. Next, manycores such as Xeon Phi KNL use wider SIMD units; KNL comes with 512-bit wide units. Finally, GPUs employ SIMT where CUDA cores roughly correspond to SIMD lanes. They are scheduled in warps; one warp usually counts 32 cores, which constitutes the GPU “SIMD width”. We present the used vectorization functions in Listing I. For a widespread analysis of SlimSell, we evaluate all three types of architectures: Intel CPUs (represent classic latency-oriented x86 multicores), Intel Xeon Phi KNL (stands for state-of-the-art manycores), and NVIDIA GPUs (represent modern throughput-oriented SIMT GPUs).

C. Algorithms: BFS in Various Abstractions

The goal of BFS is to visit each vertex in $G$. First, it visits the neighbors of a specified root vertex $r$, then all the unvisited neighbors of $r$'s neighbors, and so on. Vertices that are processed in a current step constitute the BFS frontier. BFS outputs either the distance (in hops) from each vertex to $r$, or the parent of each vertex in the BFS traversal tree. Distances are stored in a vector $d \in \mathbb{R}^n$ ($d^r$ is the distance from vertex $v$ to $r$) and parents in $p \in \mathbb{R}^n$ ($p^r$ is the parent of vertex $v$). $d$ can be transformed into $p$ using $O(m + n)$ work and $O(1)$ depth. To achieve this, for each vertex $v$, the neighbor $w$ of $v$ with the distance $d^w = d^v - 1$ must be

Table I: The most important symbols used in the paper.

Listing 1: Used vectorization functions: syntax and semantics.

1 // Notes. V is a vector of length $C$ of int. $V = \text{int} \{ C \}$. 2 // The names of the used Intel/NVIDIA functions can be found in 3 // the technical report$^1$ due to space constraints. 4 5 V LOAD(V addr); // Load and return C elements from addr. 6 void STORE(V addr, V data); // Store data at addr address. 7 V \{ t[1], t[2],..., t[q] \}; // Create a vector of C elements. 8 9 // Compare a and b elementwise using operand op; return a vector 10 // with binary outcome of each comparison (0/1 if 11 // different/same); op can be EQ (equality), NEQ (non-equality). 12 V CMP(V a, V b, b op); 13 14 // Interleave elements of a and b based on mask elements; 15 // Bits in [0, C] | out[i] = mask[i] ? b[i] : a[i] 16 V BLEND(V a, V b, V mask); 17 18 // Return t elementwise result of the function FUN applied to 19 // a and b. FUN can be: MIN (minimum), MAX (maximum), ADD (sum), 20 // MUL (product), AND (logical and), OR (logical or). 21 V FUN(V a, V b);
found; we refer to it as the DP transformation \( p = DP(d) \).

1) Traditional BFS: Here, a frontier is implemented with a queue \( Q \). At every iteration, vertices are removed in parallel from \( Q \) and all their unvisited neighbours are appended to \( Q \); this process is repeated until \( Q \) is empty.

2) Algebraic BFS: BFS can also be implemented with the MV product over a selected semiring. For example, for the real semiring, \( x_0 \in \mathbb{R}^n \) is a starting vector with \( x_0 = 1 \) and then, for iteration \( k \geq 1 \), \( x_k = A \otimes f_{k-1} \) (we assume that \( A \) is pre-transposed in accordance with common practice [5]); \( f_{k-1} \) is the frontier in iteration \( k - 1 \); note that \( f_0 = x_0 \). As we show in § III-A, for some semirings \( f_k = x_k \), while for others one must derive \( f_k \) from \( x_k \). x and f can be represented either as sparse or dense vectors that respectively result in sparse-sparse and sparse-dense \( A \otimes f \) products; the latter entail more work but offer more potential for vectorization that we use in SlimSell.

D. Data Structures: CSR, Sell-\( C-\sigma \), and Adjacency List

Finally, we discuss data structures relevant for our work.

1) Compressed Sparse Row (CSR): In the well-known CSR format (see Figure 2), a matrix is represented with three arrays: \( \text{val}, \text{col}, \) and \( \text{row} \). \( \text{val} \) contains all \( A \)'s non-zeros (that correspond to \( G \)'s edges) in the row major order. \( \text{col} \) contains the column index for each corresponding value in \( \text{val} \) and thus has the same size \( O(n) \). Finally, \( \text{row} \) contains starting indices in \( \text{val} \) and \( \text{col} \) of the beginning of each row in \( A \) (size \( O(n) \)). CSR is widely adopted for its simplicity and low memory footprint for sparse matrices. Yet, it hinders high-performance vectorization. Consider the CSR-based textbook MV product in Listing 2 where we unroll the inner loop. Such unrolling ensures consecutive data accesses necessary to effectively use vectorization (consecutive \( \text{val} \) entries are exposed). Yet, due to the reduction (line 8) and the cleanup of the loop remainder (line 10), the unrolling becomes inefficient for sparse matrices [14].

```c
1 /* The code computes \( x_{k+1} = A \otimes x_k \) using CSR. */
2 for(int i = 0; i < n; i++) {
3     \text{int} t_0 = t_1 = t_2 = t_3 = 0; 
4     \text{for(int j = row[i]; j < row[i+1]; j += 4) }{
5         t_0 += \text{val}[j]*x_k[\text{col}[j]] + \text{val}[j+1]*x_k[\text{col}[j+1]];
6         t_1 += \text{val}[j+2]*x_k[\text{col}[j+2]]; \text{t_3 += val[j+3]*x_k[\text{col}[j+3]]};
7     } 
8     x_k[i+1] = t_0 + t_1 + t_2 + t_3; // Reduction.
9     // Clean up the loop remainder:
10    for(j=3; j < row[i+1]; j++) \{x_k[i+1] += \text{val}[j]*x_k[\text{col}[j]];\}
```

Listing 2: (§ II-D2) The MV product kernel with Sell-\( C-\sigma \) \((C \text{ is } 4)\).
A. Combining Sell-C-σ with Semirings

We now provide the first systematic comparison of semirings for BFS-SpMV, and combine them with Sell-C-σ, see Listing 4 for a BFS iteration (frontier expansion) with the considered semirings. A reader who is not interested in the following theoretical details can proceed directly to § III-B

A semiring is defined as a tuple $S = (X, op_1, op_2, el_1, el_2)$ where $X$ is a set equipped with two binary operations $op_1, op_2$ such that $(X, op_1)$ is a commutative monoid with identity element $el_1$ and $(X, op_2)$ is a monoid with identity element $el_2$. We denote matrix-vector multiplication and Hadamard products on semiring $S$ as $\otimes_S$ and $\oplus_S$.

1) The Tropical Semiring $T = \langle \mathbb{R} \cup \{\infty, \min, +, \infty, 0\} \rangle$

To use BFS with this semiring, one must first transform A to $A'$ where each off-diagonal zero entry is $\infty$. Next, one sets $x_0' = 0$ and $x_0'' = \infty$, for $s \neq r$ (the initial distance to r and any other vertex is respectively 0 and $\infty$). Then, $x_k = f_k = A' \otimes_T f_{k-1}$. This BFS computes distances after the final iteration: $d = XD$ (D is the diameter) with predecessors $p = DP(d)$ (see § II-C for a description of $DP$).

Finally, $d = \sum_{i=1}^{D} t_i$ and $p = DP(d)$. This semiring results in code snippets similar to the following boolean semiring and we omit it due to space constraints.

3) The Boolean Semiring $B = \langle \{0,1\}, |, &, 0, 1 \rangle$

Similarly to the real semiring, $x_0'' = f_0'' = 1$ and $x_0' = f_0' = 0, s \neq r$. Next, $x_k = A \otimes_B f_{k-1}$, and $f_k = x_k \& g_k$; $g_k = \bigcup_{i=0}^{k-1} f_i, k \geq 2$. Then, $d = \bigcup_{i=1}^{D} t_i$ and $p = DP(d)$.

4) The Sel-Max ($S$) Semiring $(\mathbb{R}, \max, \cdot, \infty, -1)$: All the above BFS variants require the $DP$ transformation to derive $p$. We now present a semiring that facilitates computing parents. In this semiring $x_0'' = r$ and $x_0' = 0, s \neq r$. The frontier is given by $f_0' = 1$ and $f_0'' = 0, s \neq r$. Here, we also add an additional vector $p_0 = x_0''$. Then, in each iteration, we first calculate $x_k = A \otimes x_{k-1}$ and derive the current frontier $f_k = \max x_k - g_k$ where $g_k = \bigcup_{i=0}^{k-1} f_i$ (if $g$ is a double logical negation to transform each non-zero in $x$ into 1). We next obtain $p_k = p_{k-1} + \bigcup_{i=1}^{k-1} f_i$. Moreover, $x_k$ must be updated in such a way that each non-zero element becomes equal to its index: $x_k = \max (1, \ldots, n)^T.$ After the last iteration, we have $p = p_D$. Finally, $d = \bigcup_{i=1}^{D} t_i$.

Work Complexity We compare work complexity $W$ of BFS-SpMV based on Sell-C-σ with other schemes in Table II. Full sorting that takes $O(n \log n)$ is assumed ($\sigma = n$). Still, we exclude this term as it is a one-time investment and it does not change the asymptotic complexity. Given vertices sorted in increasing order by degree, we let $\rho_i$ denote the degree of the $i$th vertex and $\rho$ denote the maximum vertex degree. Our results focus on the tropical semiring, results for other semirings follow similarly. We derive a general bound for $W$ based only on $C$ and the maximum degree $\rho$ for any graph. We then apply this bound to obtain expected storage bounds for Erdős-Rényi and power-law graph models.

<table>
<thead>
<tr>
<th>BFS algorithm</th>
<th>$W$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Traditional BFS (textbook) [7]</td>
<td>$O(n \log n)$</td>
</tr>
<tr>
<td>Traditional BFS (bag-based) [15]</td>
<td>$O(n)$</td>
</tr>
<tr>
<td>Traditional BFS (direction-inversion) [12]</td>
<td>$O(Dn + Dm)$</td>
</tr>
<tr>
<td>BFS-SpMV (textbook) [7]</td>
<td>$O(Dn^2)$</td>
</tr>
<tr>
<td>BFS-SpMV [9]</td>
<td>$O(n + Dn \log m)$</td>
</tr>
<tr>
<td>BFS-SpMV with merge sort [29]</td>
<td>$O(n + \log m)$</td>
</tr>
<tr>
<td>BFS-SpMV with radix sort [29]</td>
<td>$O(n)$</td>
</tr>
<tr>
<td>BFS-SpMV with no sort [29]</td>
<td>$O(n \log m)$</td>
</tr>
</tbody>
</table>

This work (graphs with max degree $\rho$) | $(O(Dn + Dm))$ |
This work (Erdős-Rényi graphs) | See Equation (1) |
This work (power-law graphs) | See Equation (2) |

Table II: Comparison of $W$; $x$ is the length of the largest key in binary [29].

The size of all the blocks (except the largest) is in total

$$\sum_{i=2}^{n_c} C_{\rho C-1} \leq \sum_{i=1}^{n-1} C \rho_{(i-1)(C+j)} \leq m,$$

as the size of each block is smaller than the number of vertices in the previous (larger) block. Next, the size of the largest block is $\rho C$, so the total storage is bounded by

$$\sum_{i=1}^{n_c} C_{\rho C-1} \leq m + \rho C.$$

This bound is illustrated in Figure 3. Asymptotically, this bound is always tight for Sell-C-σ, since the minimum

Listing 4: One iteration (frontier expansion) of BFS-SpMV; $C$ is 4. We omit the real semiring as it is very similar to the boolean one.

2) The Real Semiring $R = \langle \mathbb{R}, +, \cdot, 0, 1 \rangle$ Here, $x_0'' = f_0'' = 1$ and any other element of $x_0'$ and $f_0'$ equals 0 (i.e., only $r$ is in the initial frontier). Then, $x_k = A \otimes_R f_{k-1}$ and $f_k = x_k \& g_k$, where $g_k$ is the filtering term: $g_k = \bigcup_{i=0}^{k-1} f_i$. 

$$d = \sum_{i=1}^{D} t_i$$ and $$p = DP(d).$$ This semiring results in code snippets similar to the following boolean semiring and we omit it due to space constraints.
B. Reducing Storage Complexity with SlimSell

We now build on Sell-C-σ and introduce SlimSell: a representation for unweighted graphs that reduces the original Sell-C-σ size by a factor of up to \(\frac{m+n}{2m+n}\), putting less pressure on the memory subsystem. Our key notion is that, for undirected graphs, entries in \(A\) only indicate presence or absence of edges; this information can be contained in \(col\) without any need for \(val\). For this, each entry in \(col\) in SlimSell contains either a usual column index of the corresponding non-zero entry in \(A\) (as in Sell-C-σ) or a special marker (e.g., \(-1\)) if the entry is not an edge but is only present due to padding. Thus, an entry in \(col\) different from the marker indicates an edge and implies a 1 in \(val\). We illustrate SlimSell in Figure 4 and in Listing 5. Instead of directly loading \(val\) (like in Listing 4, line 7), we first load the corresponding column indices from \(col\) (Listing 5, line 8). Then, a vectorized compare instruction provides us with a mask that determines which column indices are \(-1\) and which are non-negative entries corresponding to edges. Finally, we derive the contents of \(val\) (lines 10-12).

```
1 // Vectors used by SlimSell:
2 v m_ones = [-1,...,-1], oneS = [1,...,1], infs = [oo,oo,oo,oo];
3
4 for(int i = 0; i < n/C; i++) { // Iterate over each chunk.
5    int index = ca[i]; // get offset to chunk i.
6    V tmp = LOAD(fk-1[i] + i*4); // Load chunk i from frontier fk-1.
7    for(int j = 0; j < C; j++) { // Iterate over each column.
8        V cols = LOAD(kcol[index]); // Load entries from col.
9        // Compute a mask that indicates which entry in cols is -1.
10       v vals = CMP(cols, m_ones, EQ);
11        // Derive required entries from vals.
12        v vals = BLEND(ones, infs, vals);
13        // Create the rhs vector consisting of four fk-1 entries.
14        V rhs = [fk-1[col[index+1]], fk-1[col[index+2]],
15                   fk-1[col[index+3]]]; // Create a new frontier.
16        tmp = MIN(ADD(rhs, vals), tmp); // Compute a new frontier.
17        index++; // Set a new frontier.
18 }
19 STORE(&f[k][i*C], tmp); // Set new frontier.
```

Listing 5: (§ III-B) An example of using SlimSell; \(C = 4\).

SlimSell reduces data transfer by removing loads of \(val\). Yet, more computation is required (lines 10-12). We show in § IV-A3 that this does not entail significant overheads and SlimSell offers more performance than Sell-C-σ.

Storage Complexity SlimSell reduces the size of not only Sell-C-σ but also CSR and AL; see Table III for a summary. Sell-C-σ and SlimSell require respectively \(4n + \frac{2n}{m} + P\) and \(2n + \frac{2n}{m} + P\) words; \(P\) is the number of cells from padding and it depends on a specific graph and its degree distribution. SlimSell uses less storage than the corresponding AL if:
2m + \frac{2n}{C} + P < n + 2m \iff P < n \left(1 - \frac{2}{C}\right) \tag{3}

This translates to $P < \frac{3n}{C}$ for $C = 8$ (e.g., in Xeon CPUs), $P < \frac{7n}{8}$ for $C = 16$ (e.g., in KNs), and $P < \frac{7n}{4}$ for $C = 32$ (e.g., in Tesla GPUs). We show in an empirical analysis that covers various families of graphs (§ IV-E) that the impact of $P$ is negligible for larger sorting scopes ($\sigma > \sqrt{n}$) and SlimSell always uses less space than CSR and Sell-$C$-$\sigma$ and it is comparable or more space-efficient than AL. Finally, note that the size of val in SlimSell and Sell-$C$-$\sigma$ ($= 2m + P$) is equal to the amount of work $W$ of a single $SpMV$ product. Thus, bounds (1) and (2) equal storage complexities for Erdős-Rényi and power-law graph models.

<table>
<thead>
<tr>
<th>Representation</th>
<th>Sell-$C$-$\sigma$</th>
<th>CSR</th>
<th>AL</th>
<th>SlimSell</th>
</tr>
</thead>
<tbody>
<tr>
<td>Size [cells]</td>
<td>$4m + \frac{2n}{C} + P$, $4m + 2m + \frac{2n}{C} + P$</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table III: (§ III-B) Storage complexity of considered graph representations; $P$ is the number of padded memory cells.

Listing 6: (§ III-C1) SlimWork for various semirings; $C$ is 4.

C. Reducing Work Amount with SlimWork

SlimSell improves upon Sell-$C$-$\sigma$ by reducing its size. We now introduce SlimWork (see Listing 6), a scheme that reduces the work amount in $BFS-\text{SpMV}$. Intuitively, we skip chunks of computation if they are associated with the final BFS values (distances or parent IDs) that do not change in future iterations. As our performance study shows (§ IV), this scheme significantly accelerates $BFS-\text{SpMV}$. We now present how to use SlimWork.

1) Tropical Semiring: SlimWork for the tropical semiring is illustrated in Listing 6 (lines 5-7). Consider $f_k = A^* \otimes f_{k-1}$. In the label-setting BFS, $f_k$ can only differ from $f_{k-1}$ if $f_{k-1}$ is not the final distance. By verifying whether $f_{k-1}$ is the number of padded memory cells.

D. Improving Load Balance with SlimChunk

The final SlimChunk scheme aims at accelerating $BFS-\text{SpMV}$ when executing on units with large SIMD widths. Specifically, consider a case with a large sorting scope (e.g., $\sigma = \sqrt{n}$). The rows are sorted by the degree of the corresponding vertices. Thus, some chunks (related to high-degree vertices) may entail much more computation than the chunks related to low-degree vertices, resulting in load imbalance. To alleviate this, we split $A$ not only horizontally into chunks, but we also split the chunks vertically, in two dimensions. This still leads to the contiguous layout in memory but also accelerates $BFS-\text{SpMV}$ by dividing work more equally between threads, as we show in § IV.

To conclude, SlimSell improves the design of BFS-$\text{SpMV}$ along three dimensions: it reduces the size, lowers the amount of work, and improves the load balancing.

IV. PERFORMANCE ANALYSIS AND EVALUATION

We now illustrate that SlimSell secures high performance for BFS-$\text{SpMV}$. We also analyze the differences between BFS-$\text{SpMV}$ based on various semirings and representations.

Selection of Benchmarks and Parameters We consider all four BFS-$\text{SpMV}$ semiring variants. Next, performance and storage improvements from SlimSell, SlimWork, and SlimChunk are evaluated. Both strong- and weak-scaling is considered. We also analyze the $DP$ transformation overheads (OP, No-OP). To evaluate the impact from sorting, we vary $\sigma \in [1,n]$ (for simplicity $\sigma$ is a multiple of $C$). We verify the BFS performance in each iteration for a fine-grained analysis. Next, static and dynamic OpenMP scheduling is incorporated (omp-s, omp-d). Three classes of graphs are considered: synthetic power-law Kronecker [16] and Erdős-Rényi (ER) [10] graphs such that $n \in \{2^{20}, \ldots, 2^{28}\}$ and $\tau \in \{2^{1}, \ldots, 2^{10}\}$; we also use real-world graphs (RW); see Table IV. Due to a very large amount of data we present and discuss in detail a small subset; the remainder can be found in the technical report (see the link on page 2).

Table IV: The analyzed graphs with skewed degree distributions.

<table>
<thead>
<tr>
<th>Type</th>
<th>ID</th>
<th>$n$</th>
<th>$m$</th>
<th>$\bar{D}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kronecker graphs</td>
<td>K</td>
<td>1M-268M</td>
<td>2M-536M</td>
<td>2-1014</td>
</tr>
<tr>
<td>Social networks</td>
<td>orc</td>
<td>3.07M</td>
<td>117M</td>
<td>39</td>
</tr>
<tr>
<td>pok</td>
<td>1.63M</td>
<td>30.6M</td>
<td>18.75</td>
<td>11</td>
</tr>
<tr>
<td>epi</td>
<td>75k</td>
<td>508k</td>
<td>6.7</td>
<td>15</td>
</tr>
<tr>
<td>Ground-truth [30] community</td>
<td>lln</td>
<td>3.99M</td>
<td>34.6M</td>
<td>8.67</td>
</tr>
<tr>
<td>Web graphs</td>
<td>brk</td>
<td>685k</td>
<td>7.60M</td>
<td>11.09</td>
</tr>
<tr>
<td>gog</td>
<td>875k</td>
<td>5.1M</td>
<td>5.52</td>
<td>21</td>
</tr>
<tr>
<td>sta</td>
<td>281k</td>
<td>2.31M</td>
<td>8.2</td>
<td>46</td>
</tr>
<tr>
<td>ndm</td>
<td>325k</td>
<td>1.49M</td>
<td>4.59</td>
<td>674</td>
</tr>
<tr>
<td>Purchase network</td>
<td>amz</td>
<td>262k</td>
<td>1.23M</td>
<td>4.71</td>
</tr>
<tr>
<td>Road network</td>
<td>rca</td>
<td>1.96M</td>
<td>2.76M</td>
<td>1.4</td>
</tr>
</tbody>
</table>
Comparison Targets (Performance) We compare BFS-SpMV equipped with SlimSell to the work-efficient highly-optimized OpenMP BFS Graph500 code [24] ($\text{tradMbfs}$). This baseline applies several optimizations; among others it reduces the amount of fine-grained synchronization by checking if the vertex was visited before executing an atomic. As $\text{tradMbfs}$ is work-optimal, it also represents the work-optimal BFS based on SpMSpV. We execute $\text{tradMbfs}$ on both KNLs and CPUs to secure fair comparison and to illustrate one of the points of our work: vectorization- and SIMD-friendly SlimSell enables BFS-SpMV to be comparable to work-efficient BFS optimized for CPUs.

Comparison Targets (Storage) In the storage analysis, we compare SlimSell to CSR, AL, and $\text{Sell-C-\sigma}$.

Experimental Setup and Architectures We conduct the analysis on several CPUs, GPUs, and manycores, with the total of seven evaluated systems:

Dora (CPU) One node of the CSCS Piz Dora supercomputer contains two Intel Xeons E5-2695 v4 @ 2.10 GHz. Each CPU has 18 cores, 18x 256 KB L2 and 45 MB L3 cache. Each nodes has 64 GB of RAM. We use gcc-5.3.

Tesla (GPU) We use NVIDIA Tesla K80 GPUs from the CSCS Greina HPC cluster (covering high-performance GPUs). Any CPU code is compiled using gcc-5.3, the CUDA code uses cudatoolkit-6.5.

KNL (manycore) We use Intel Xeon Phi KNLs 7210 (also from Greina). Each KNL has 64 4-way 1.3GHz multi-threaded cores with the total of 32 MB L2 and 66 GB of RAM. We use icc 16.0.1.150.

Others Other systems are described in detail in the technical report. They include a Tesla K20X from the CSCS Piz Daint supercomputer, a Trivium server with an Intel Haswell CPU and an NVIDIA GTX 670 that represent commodity machines, and a Xeon E5-1620 @3.50GHz from CSCS Greina to cover very low-latency systems.

A. CPU Analysis (Xeon E5-2695 v4)

We start with the CPU analysis; see Figure 5. We fix $C$ to match architectural constraints. For example, choosing 32-bit integers to represent vertex identifiers on a CPU yields a SIMD width of 8 due to 256-bit wide AVX registers. We use OpenMP for intra-node parallelization.

1) Impact from Sorting ($\sigma$): We first investigate the impact of $\sigma$ for various BFS-SpMV scenarios (Figures 5a-5c). Performance strongly depends on $\sigma$ as significantly different vertex degrees in one $C$ entail wasted work due to padding. Thus, large $\sigma$ is preferable. Next, $\sigma < C$ does not reduce padding and work but only reorder rows within a chunk. Thus, the performance does not increase while $\log \sigma \leq 3$. Moreover, in Figure 5a, an overhead due to the $D$P transformation is visible as $\sigma$ goes towards $n$, except for the sel-max semiring. This is because in static OpenMP scheduling all threads process similar counts of chunks. Then, for a very large $\sigma$, the first chunk contains all of the longest rows and consequently the corresponding thread performs the majority of work, causing imbalance. Dynamic OpenMP scheduling adds $\approx 1$-2% of relative overhead but alleviates the problem; see Figure 5b.

2) Differences between Semirings: The inner-most loop that processes chunks is the same for each semiring except for the used vector instructions (cf. Listing 4, lines 6-21). As Figure 5 shows, performance differences are negligible ($\approx$1-2%), confirming the intuition that BFS is memory-bound. However, semirings also differ in the post-processing that derives $f_k$ from $x_k$ (lines 22-45). The tropical semiring contains no additional work. Contrarily, the real and boolean semirings update the filter $g_k$ (six instructions and two stores), while the sel-max semiring processes the $p$ vector (four instructions and two stores). Still, the only major performance difference comes with the DP in sel-max semiring.

3) Advantages of SlimSell: We next illustrate advantages of SlimSell over $\text{Sell-C-\sigma}$ (Table V). Here, $\sigma$ determines how much bandwidth is wasted on unused loads. Larger $\sigma$ results in lower impact from SlimSell as the bandwidth is used more efficiently due to sorting.

<table>
<thead>
<tr>
<th>$\sigma$</th>
<th>Boolean</th>
<th>Real</th>
<th>Tropical</th>
<th>Sel-max</th>
</tr>
</thead>
<tbody>
<tr>
<td>$2^4$</td>
<td>1.17</td>
<td>1.17</td>
<td>1.21</td>
<td>1.18</td>
</tr>
<tr>
<td>$2^{18}$</td>
<td>1.00</td>
<td>1.04</td>
<td>1.04</td>
<td>1.03</td>
</tr>
</tbody>
</table>

Table V: Speedup of SlimSell over $\text{Sell-C-\sigma}$ in Kronecker graphs ($n = 2^{24}, \bar{\rho} = 16$).

4) Advantages of SlimWork: We present SlimWork advantages in Figure 5d. We compare BFS-SpMV variants with and without SlimWork for several $\sigma$s (for clarity, we plot only the case with $\sigma = n$). The larger $\sigma$, the faster work amount decreases with iterations. This is because larger $\sigma$s make it more probable that the majority of the largest chunks (associated with high-degree vertices) are processed in early iterations. Next, SlimWork skips chunks as soon as all the associated vertices have been visited. Thus, as more vertices are visited, the number of chunks left to compute
decreases and the last few iterations entail only little work. Contrarily, early iterations gain no speedup; they instead pay a small overhead for checking the skipping criteria. Still, the overhead is rapidly outweighed by saving expensive chunk computations. Overall, SlimWork accelerates the baseline SlimSell by a large factor, for example \( \approx129\% \) for sel-max for a Kronecker graph with \( n = 2^{24}, \rho = 16 \). Finally, in “No SlimWork” (corresponding to Sell-C-\( \sigma \) augmented only with the SlimSell storage reductions), there is no performance improvement after the first iteration.

5) Various Graph Families: We now discuss Erdős-Rényi and real-world inputs. The degrees of the former follow the uniform instead of the power law distribution. This is therefore less evident; see Figure 5c. The fine-grained analysis of each BFS step illustrates that the decrease in time only starts during the final iterations. Next, we consider real-world graphs. Some, like amz or rca, have high \( D \) and low \( \bar{\rho} \) (3.4 for amz and 1.4 for rca). This leads to small or no improvement from SlimWork, regardless of \( \sigma \). Others behave similarly to Kronecker graphs as their \( \bar{\rho} \) is large enough for SlimWork to ensure speedups.

B. GPU Analysis (Tesla K80)

We present a representative subset of results in Figure 6. Our insights are similar to those of the CPU analysis. The main difference is that wider SIMD units in GPUs secure higher speedups. First, Figures 6a and 6b depict the total execution time for varying \( \sigma \) for Kronecker and ER graphs. Similarly to CPUs, the performance does not improve up to a certain \( \sigma \) equal to \( C \). Second, sel-max performs better than others as it does not need \( DP \). Moreover, the performance for higher \( \sigma \) is because the sorting of the rows in \( A \) leads to large load imbalance (e.g., see Figure 6b). Then, the per-iteration performance follows patterns similar to CPUs as illustrated in Figure 6c. Finally, we show that SlimChunk brings expected significant advantages, for example \( \approx 50\% \) in the first two iterations in Figure 6e).

C. Manycore Analysis (KNL)

Here, we evaluateSlimSell on Intel KNLs; see Figure 8. The latency of each iteration grows as expected with the increasing \( \bar{\rho} \) or \( n \). KNL secures a drop in compute time after the first iteration. In GPUs, this effect is less visible due to larger SIMD widths that hinder reducing computation. On CPUs, it is only negligibly visible in Figure 4d as Erdős-Rényi graphs are used there. Later (§ IV-F) we show that our variant of BFS-SpMV based onSlimSell outperforms the traditional work-efficient BFS on KNLs.

D. Preprocessing Analysis

We illustrate that the full sorting time required for good storage improvements as well as the actual build time can be amortized away. For a Kronecker graph with \( n = 2^{24} \), sorting takes \( \approx 0.95s \), which constitutes \( \approx 21\% \) of a single BFS run. Thus, 10 BFS runs are enough to reduce the sorting time to \( < 2\% \) of the total runtime. The build time is analogous to that of Sell-C-\( \sigma \) [14]; it lasts longer but can also be amortized with more BFS runs. For example, on a Kronecker graph with \( n = 2^{18}, 20 \) BFS runs are enough to reduce the full preprocessing time to \( < 5\% \) of the total runtime. The results for other graphs are similar. In our analyses, we amortize the preprocessing time and report averaged iteration or total execution times.

E. Storage Analysis

Figure 7 illustrates storage improvements of SlimSell for various \( \sigma, n, \bar{\rho}, \) and graph families. SlimSell reduces the size of Sell-C-\( \sigma \) by \( \approx 50\% \), a consistent advantage for various \( n, \bar{\rho} \), and \( \sigma \). Notably, for full sorting (\( \sigma = n \)) and for Kronecker graphs, SlimSell is also more compact than AL (by \( \approx 5-10\% \)) which is effectively the smallest graph representation if no compression is used. The same effect sets in for \( \sigma \geq \sqrt{n} \) for...
RW graphs. Note that $C = 8$ in the above analysis. Larger $C$ brings even more storage reductions.

$F$. Comparison to the State-of-the-Art

We compare BFS-SpMV equipped with SlimSell to Trad-BFS in a fine-grained analysis. Figure 10 shows the time to finish each BFS iteration for various $\rho \in \{1,\ldots,512\}$; we compare the optimized traditional BFS running on a system where it achieves highest performance (Xeon CPU) with BFS-SpMV based on SlimSell that executes on Tesla K80 GPU. The higher $\rho$ (denser $G$), the faster BFS-SpMV is. This is supported with the intuition: more edges in $G$ entail a larger SIMD potential. Finally, Figures 1 and 9 show that BFS-SpMV based on SlimSell outperforms the traditional BFS by up to 53%; denser graphs entail larger speedups. As expected, power-law graphs are more SIMD-friendly during the expansion of the large frontier part. Contrarily, Erdős-Rényi graphs do not have this property. We conclude that SlimSell enables BFS-SpMV (running on throughput-oriented architectures such as GPUs) to match the performance of highly-optimized work-efficient codes (executing on optimized and latency-oriented CPUs).

Figure 10: (§ IV-F) Fine-grained comparison between BFS-Trad (optimized for CPUs) and BFS-SpMV executing on GPUs based on the tropical semiring; $C$ is Kronecker, $n = 2^{\rho/2}$.

V. RELATED WORK

Extensive research has been done to address the challenges in parallel graph processing [19]. One category of works includes parallel graph processing engines (e.g., Pregel [20] or PBGL [11]) and various improvements for algorithms and implementations such as direction-inversion [12, 27]). The well-known direction-optimization [3] and other work-avoidance schemes are orthogonal to our work and can be implemented on top of SlimSell; see Figure 1. In general, we do not focus on improving the well-studied queue-based BFS algorithm, but instead apply changes to the graph representation to improve performance.

There are several works on vectorization and GPUs for BFS [25, 22, 6, 29, 9, 12]. Yet, they use queue-based approaches or sparse matrix-sparse vector products and thus suffer from the common downsides such as bad data locality, irregular data accesses, and SIMD or vectorization underutilisation [6]. Contrarily, we start from sparse-matrix dense-vector products as a basis for SlimSell that come with regular memory accesses and better SIMD utilization. The SlimSell optimization for unweighted graphs is applicable not only to Sell-$C$-$\sigma$ but also other sparse matrix formats such as ELLPACK/ELL, sliced ELLPACK [23], ESB [17], ACSR [2], and CSR. They all address general matrices and thus use the val array with matrix values that can be omitted to reduce data transfer. Next, SlimSell removes padding and thus storage overheads inherent in ELLPACK, differs from ELLPACK/ELL and ESB as it turns the graph storage layout by 90 degrees in memory along with Sell-$C$-$\sigma$ to utilize more of SIMD compute power, and targets not only GPUs (like sliced ELLPACK or ACSR) or only KNL (like ESB), but can be used for both and for CPUs. Finally, our work is the first to conduct a systematic analysis on using different semirings for BFS and on work complexity of BFS for miscellaneous graph classes and schemes.

SlimChunk is similar to ACSR [2] and to CSR-VectorL in CSR-Adaptive [8]. Unlike them, we do not adaptively split chunks and instead let the dynamic nature of the partial chunk allocation deal with work imbalance. The advantage of our approach lies in its simplicity: no complex adaptive schemes are required. In addition, SlimChunk is by default enabled only for GPUs as this is the only architecture that entailed load imbalance. Thus, SlimSell offers an intuitive
and simple design for CPUs and Xeon Phi. Another similar scheme is SELL-P [1]: it pads the rows with zeros to make the row length of each block a multiple of the number of threads assigned to a row. Yet, this is discussed only for GPU-based Krylov solvers. Furthermore, Yang et al. [31] facilitate matrix tiling to enhance temporal locality, a scheme different from SlimSell that uses Sell-$C$-$\sigma$ to turn the matrix layout by 90 degrees for better SIMD utilization, reduces work and storage overheads to limit the pressure on the memory subsystem. Next, Gunrock [28] is a CUDA library for graph-processing on GPUs. It does not prescribe a high-performance graph representation, but instead provides a high-level, bulk-synchronous, data-centric abstraction that allows programmers to quickly develop new graph primitives. SlimSell could substitute graph representations used by default in Gunrock for higher speedups; an idea that we leave for future work. Another thread of future work is to port SlimSell to massively threaded systems such as SPARC64 or Adapteva Parallel.

Some work has been done into the formulation and implementation of BFS based on the algebraic formulation. Examples include GraphBLAS [21], Combinatorial BLAS (CombBLAS) [4], and BFS with 2D adjacency data distribution [5]. SlimSell, SlimWork, and SlimChunk can be used to accelerate Combinatorial BLAS. Finally, our first systematic discussion and theoretical analysis of BFS based on various semirings may be incorporated into the theoretical GraphBLAS framework.

VI. CONCLUSION

Vectorization and throughput-oriented GPUs are reshaping graph processing: traditional BFS codes optimized for latency-oriented CPUs will be inefficient on systems with wide SIMD units. We address this trend and propose SlimSell: a vectorizable graph representation for BFS based on sparse matrix-dense vector products. SlimSell augments the state-of-the-art Sell-$C$-$\sigma$ matrix storage format along three dimensions: size, work, and load balance. It effectively applies the regular SIMD parallelism to irregular graph traversals to match or outperform in various scenarios optimized Graph500 codes tuned for high-performance CPUs.

Furthermore, our work provides the first systematic analysis of the formulation, performance, and work/storage complexity of algebraic BFS variants based on different semirings. We derive bounds for uniform and power-law graphs; our theoretical insights combined with performance studies for state-of-the-art architectures such as Intel Knights Landing can be used by architects and developers to improve heuristics and graph representations implemented in processing engines such as CombBLAS [4].

Finally, we strongly believe that SlimSell can be used to accelerate other graph algorithms. Potential speedups can be much higher because BFS is a data-driven scheme with different memory access patterns across iterations while many algorithms (e.g., Pagerank) have identical communication patterns in each superstep.

Acknowledgements: We thank Hussein Harake, Colin McMurratie, and the whole CSCS team granting access to the Greina, Piz Dora, and Daint machines, and for their excellent technical support. We thank Moritz Kreuzer and the anonymous reviewers for their insightful comments. Maciej Besta is supported by Google European Doctoral Fellowship. Edgar Solomonik conducted part of this work while supported by an ETH Postdoctoral Fellowship.

REFERENCES