## A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver

Nan Ding, Samuel Williams, Yang Liu, Xiaoye S. Li nanding@lbl.gov

May 03, 2021











### **Sparse Direct Solvers**

- Sparse direct solvers are used to solve challenging systems:
  - LU factorization (SpLU)
  - L- and U-solve (SpTRSV) •
- SpLU factorization can be extremely expensive:
  - Memory hungry
  - Computationally expensive  $\bullet$
- Solution:
  - Factor once and use as a preconditioner across multiple solves • *i.e., multiple GMRES solves call SpTRSV on every iteration all using the same factors*
  - Shifts the focus to SpTRSV performance ٠





### multi-GPU SpTRSV is imperative but challenging

- SpTRSV is an important kernel for a variety numerical simulations
- GPUs have become a first-class compute citizen
  - 110/147 system use NVIDIA Volta chips in 2020, Top500 list<sup>[1]</sup>
- Demand for ever finer-resolution problems calls for SpTRSV to exploit larger scales of parallelism
  - the slow pace in HBM memory capacity scaling
  - Can not always fit into a single GPU's memory





### **Sparse Triangular Solver (SpTRSV)**

Compute solution vector x from a sparse linear system, Lx=b 



### $b_7$ b (8x1)dense known

## Naïve BSP SpTRSV

- Naive approach:
  - solve the system one equation (row) at a time
  - can be optimized to (selectively) parallelize over column updates or row reductions



### **Recast SpTRSV as a DAG**

(0)

(13)

5

15

- Computation = Directed Acyclic Graph
- Each node in the DAG is a small dense matrix-vector
- Use level sets
- Blocks in the same level can be solved concurrently





### SpTRSV in SuperLU: Message Driven



- SuperLU imposes a 2D block cyclic process layout
- Two types of computation:
  - Solves (on-diagonal blocks)
  - MatVec (off-diagonal blocks)
- Two types of communication:
  - Block column broadcast
  - Block row reduction







### **NVSHMEM** has potential (but bad implementations can destroy it)

- NVSHMEM is a parallel programming interface based on OPENSHMEM
  - one-sided communication for NVIDIA GPU clusters
  - ✓ uses GPU-initiated data transfers
  - $\checkmark$  provides signaling operations and point-to-point synchronization operations to notify receivers

X point-to-point synchronization operations limits the number of thread blocks that can be concurrently scheduled on one V100 GPU to 80 (the number of SMs) to avoid potential deadlocks



7

### Multi-GPU SpTRSV vs. cusparse\_csrsv2()

### Experimented on Summit:

- Cuda 10
- Nvshmem 1.1.3
- Grdcopy 2.0
- bind one process to one GPU
- Px1 process layout (column broadcast)

nnz in L

Nsupers

use nvshmem\_double\_put\_nbi\_block() 







### Multi-GPU SpTRSV using two CUDA streams

### WAIT (for notifications), stream[0]

thread block 0 (column broadcast) thread block 1 (row reduction) for each thread t. additional work except waiting: While expecting more tasks: (recorded in the pre-processing pharse) local GEMV *idx*=nvshmem\_int\_wait\_until\_any(flag\_x, mask of the buffer) fmod(I)--M\_c[*idx*]=1 if **fmod(I)**==0 NVSHMEM send (by thread) *Isum* 0: active in wait, 1: masked off 1 1 0 0 1 1 0 0 0 0 0 t2 t3  $t\overline{1}$ t0 for each thread block K: (SOLVE) if I am the diagonal process in charge of K: spin wait until the diagonal block is unlocked **fmod(l)**==0 **NVSHMEM** send: local TRSV nvshmemx double put nbi block() else: (I am off-diagonal process) nvshmem\_fence() While flag\_x[idx] !=1; nvshmemx\_int\_signal() NVHSMEM send x local GEMV fmod(I)--. . . . . . . . . . . .

wait for WAIT successfully launched SOLVE (GEMV/TRSV, send data and notification), stream[1]





### Multi-GPU SpTRSV performance on Summit GPUs using different process layouts



### It's important to understand what constraint the performance

- Some numerical methods lend themselves to simple performance analysis
- DAG-based SpTRSV demands more sophistication
- Solution:
  - construct a critical path performance model
  - assess our observed performance relative to machine capabilities.



11

### **Critical path Performance model**

- **Architecture Characterization** 
  - Memory bandwidth •

 $\frac{828 \ GB/s}{80 \ SM \ * 64 \ warps} = \frac{1.3GB/s}{8 \ warps} \le bw = \frac{828 \ GB/s}{10 \ cal \ supernodes} \le 5.2 \ GB/s = \frac{828 \ GB/s}{80 \ SM \ * 2 \ warps}$ 

- NVSHMEM bandwidth (optimized) •
  - **NVSHMEM** Performance differs with GPU affinities
  - Intra-socket is 4x faster than inter-node •





### **Critical path Performance model**

**SpTRSV** Characterization

- Initial Critical path: based on level-set using BFS
- Refined Critical path: process decomposition
  - Memory bandwidth scales with the number of blocks • (GEMV/TRSV) in the same level until the aggregate bandwidth reach the peak:

 $T_{mat-vec\ per\ gpu} = \frac{accumulated\ Bytes}{aggrated\ bw}$ 

Communication: •







13

### SpTRSV performance differs with critical paths





| PU solution |                |
|-------------|----------------|
| es          | 3 Summit nodes |
|             | 18 GPU         |
|             | 0.8x           |
|             | 2.7x           |

### Summary

- Multi-GPU SpTRSV using CUDA streams
  - Bring experience for DAG-based computations on emerging accelerated architectures, e.g., GPU-GPU communication using nvshmem
  - Core specialization on GPUs for DAG-based computations: more complex producerconsumer relationship than stencils ---- the producer (sender) and the consumer (receiver) can swap roles in turn to dispatch new work (message).
- DAG performance analysis can be extremely challenging
  - Extend our SpTRSV performance model for GPUs
  - Accounts for matrix sparsity, process layout, network/memory bandwidth/latency
  - Enables performance analysis for DAG-based computations
  - Helps understand performance bottlenecks

# Acknowledgements

This research is supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) programs under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory.











# Questions











