YuenyeungSpTRSV

Citation Author(s):
Feng
Zhang
Renmin University of China
Submitted by:
Feng Zhang
Last updated:
Tue, 05/17/2022 - 22:18
DOI:
10.21227/w61h-h482
Data Format:
Research Article Link:
Links:
License:
78 Views
Categories:
Keywords:
0
0 ratings - Please login to submit your rating.

Abstract 

Sparse triangular solves (SpTRSVs) are widely used in linear algebra domains, and several GPU-based SpTRSV algorithms have been developed. Synchronization-free SpTRSVs, due to their short preprocessing time and high performance, are currently the most popular SpTRSV algorithms. However, we observe that the performance of those SpTRSV algorithms on different matrices can vary greatly by 845 times. Our further studies show that when the average number of components per level is high and the average number of nonzero elements per row is low, those SpTRSVs exhibit extremely low performance. The reason is that, they use a warp on the GPU to process a row in sparse matrices, and such warp-level designs have severe underutilization of the GPU. To solve this problem, we propose YuenyeungSpTRSV, a thread-level and wrap-level fusion synchronization-free SpTRSV algorithm, which handles the rows with a large number of nonzero elements at warp-level while the rows with a low number of nonzero elements at thread-level. Particularly, YuenyeungSpTRSV has three novel features. First, unlike the previous studies, YuenyeungSpTRSV does not need long preprocessing time to calculate levels. Second, YuenyeungSpTRSV exhibits high performance on matrices that previous SpTRSVs cannot handle efficiently. Third, YuenyeungSpTRSV’s optimization does not rely on the specific sparse matrix storage format. Instead, it can achieve very good performance on the most popular sparse matrix storage, compressed sparse row (CSR) format, and thus users do not need to conduct format conversion. We evaluate YuenyeungSpTRSV with 245 matrices from the Florida Sparse Matrix Collection on four GPU platforms, and experiments show that our YuenyeungSpTRSV exhibits 7.14 GFLOPS/s, which is 5.98x speedup over the state-of-the-art synchronization-free SpTRSV algorithm, and 4.83x speedup over the SpTRSV in cuSPARSE.

Instructions: 

https://github.com/JiyaSu/YuenyeungSpTRSV

  1. Choose the language you want to run, and enter the folder.
  2. Adjust the common.h file according to the hardware, the repeated times, and the accuracy of the calculation (single or double precision).
  3. Set library path in the Makefile.
  4. Run make.
  5. Run ./YYSpTRSV example.mtx. (kernel is in the YYSpTRSV.h or YYSpTRSV_kernel.cl)
  6. The result will be printed out.