

### ScaleHLS: A New Scalable HLS Framework on Multi-Level Intermediate Representation

Anthony Cabrera CSE565M Paper Presentation October 3, 2023

ORNL is managed by UT-Battelle LLC for the US Department of Energy































## Main Takeaway

- Current HLS tooling is built on a single, low level of abstraction, e.g., LLVM IR.
- However, the semantic gap between C/C++ and LLVM IR is very large, which makes it challenging to perform higher abstraction level optimizations.
- The goal of <u>ScaleHLS</u> is to fill this semantic gap with hierarchical intermediate representations that make it easier to optimize the HLS design, subsequently explore the design space.



## Contributions of ScaleHLS

- Hierarchical representations of HLS designs to make reasoning about optimizations easier
- Provide optimization passes and infrastructure to operate at the level of graph, loop, and directive levels
- An automated DSE engine to find the Pareto curve between latency and space
- Provides and HLS C front-end to MLIR and an HLS C/C++ emitter





## Background



### MLIR in a Nutshell

- Close the semantic gap between [input language] and LLVM IR through hierarchical intermediate representations, i.e., progressive lowering
- Take advantage of pre-existing SW, passes, dialects, and optimizations and plug them in to your own compiler flows





### MLIR in a Nutshell

- Close the semantic gap between [input language] and LLVM IR through hierarchical intermediate representations, i.e., progressive lowering
- Take advantage of pre-existing SW, passes, dialects, and optimizations and plug them in to your own compiler flows







## ScaleHLS Design



### ScaleHLS Representation: Graph Level

- Leverage pre-existing onnx dialect\* for Graph-level IR and representing computation graphs
  - E.g., %output = "onnx.Conv"(%input, %weight) {...} :
     (tensor<1x3x34x34xf32>, tensor<64x3x3x3xf32>) ->
     tensor<1x64x32x32xf32> ... HLS C/C++ PyTorch



National Laboratory<sub>\*</sub> The figure has been changed since its original publication date

**CAK RIDGE** 

### ScaleHLS Representation: Loop Level

• Leverage pre-existing affine and scf dialect for Loop-level IR for loop level transformations and analysis



National Laboratory\* The figure has been changed since its original publication date

### ScaleHLS Representation: HLS Level

 Develop HLSCpp to represent HLS-specific structures and program directives, which provides the capability of conducting directive optimizations and supports the emission of synthesizable C/C++ code



#### CAK RIDGE

National Laboratory, The figure has been changed since its original publication date

### Array Partitioning MLIR Representation

(d0<sub>orig</sub>, d1<sub>orig</sub>) -> (partition idx x, partition idx y, physical idx x, physical idx y)





### Array Partitioning MLIR Representation

(d0<sub>orig</sub>, d1<sub>orig</sub>) -> (partition idx x, partition idx y, physical idx x, physical idx y)





### Array Partitioning MLIR Representation

(d0<sub>orig</sub>, d1<sub>orig</sub>) -> (partition idx x, partition idx y, physical idx x, physical idx y)











(c) affine\_map<(d0, d1) ->
(d0 mod 2, d1 floordiv 2, d0 floordiv 2, d1 mod 2)>















### ScaleHLS Optimization

- Apply the appropriate flavor of pass at the appropriate representation level
  - e.g., apply graph passes on graph representation
- Note the passes not in bold in the Loop row; those are passes that already exist and are compatible with the dialects in that level of representation

#### Table II SCALEHLS PASSES.

|         | Passes                      | Target    | Parameters    |
|---------|-----------------------------|-----------|---------------|
| Graph   | -legalize-dataflow          | function  | insert-copy   |
|         | -split-function             | function  | min-gran      |
| Loop    | -affine-loop-perfectization | loop band | -             |
|         | -affine-loop-order-opt      | loop band | perm-map      |
|         | -remove-variable-bound      | loop band | -             |
|         | -affine-loop-tile           | loop band | tile-sizes    |
|         | -affine-loop-unroll         | loop      | unroll-factor |
| Direct. | -loop-pipelining            | loop      | target-ii     |
|         | -func-pipelining            | function  | target-ii     |
|         | -array-partition            | function  | part-factors  |
| Misc.   | -simplify-affine-if         | function  | -             |
|         | -affine-store-forward       | function  | -             |
|         | -simplify-memref-access     | function  | -             |
|         | -canonicalize -cse          | function  | -             |

Boldface ones are new passes provided by ScaleHLS, while others are MLIR built-in passes.



### Graph Transform Example

Explores tradeoff between latency and space

- Assume each Proc takes 1t time.
- The goal is to apply the HLS dataflow pragma to this design
- 4(a) violates the bypass path constraint of Vitis HLS
- 4(b) legalizes this dataflow by combining the offending sub-graph into it's own stage. Results in 3-stage pipeline with latency = 3t
- 4(c) aggressively legalizes by adding enough copy nodes to remove bypass path. Results in 5-stage pipeline with latency = 1t, but more HW needed to achieve this
- Latency/space tradeoff in 4(d) by introducing split-function pass, which groups every two stages. Latency = 2t with only one additional copy node required



Figure 4. Graph-level dataflow optimization. (a) original dataflow; (b) legalized dataflow without copy nodes; (c) legalized dataflow with inserting copy nodes; (d) dataflow with a minimum granularity of 2.



### Full Example





27

### Full Example

void syrk(float alpha, float beta, float C[16][8], float A[16][16]) { for (int i = 0; i < 16; i++) {</pre> for (int j = 0; j <= i; j++) {</pre> C[i][j] \*= beta; for (int k = 0; k < 8; k++) {</pre> C[i][j] += alpha \* A[i][k] \* A[j][k]; }}} (i) input C Pi→ii: parse C into MLIR func @syrk(%alpha, %beta, %A, %C) { affine.for %i = 0 to 16 { d affine.for % = 0 to (% + 1) { c simplifications %0 = affine.load %C[%i, %j] %1 = mulf %beta, %0 affine.store %1, %C[%i, %j] affine.for %k = 0 to 8 { b %2 = affine.load %A[%i, %k] %3 = affine.load %A[%i, %k] and IR ( %4 = affine.load %C[%i, %j] %5 = mulf %alpha, %2%6 = mulf %5, %3transforms %7 = addf %6, %4 affine.store %7, %C[%i, %j] }}} (ii) baseline MLIR tive P<sub>ii→iii</sub>: loop transfroms direct func @syrk(%alpha, %beta, <u>%</u>A, %C) { Pili→iv: affine.for %k = 0 to 8 { B affine.for %i = 0 to 16 step 2 { D affine.for %j = 0 to 16 { affine.for %ii = (%i) to (%i + 2) { [] affine.if (%ii - %j >= 0) { C %0 = affine.load %C[%ii, %j] %1 = mulf %beta, %0 affine.if (%k == 0) { affine.store %1, %C[%ii, %j] h %2 = affine.load %A[%ii, %k] %3 = affine.load %A[%j, %k] %4 = affine.load %C[%ii, %j] h %5 = mulf %alpha, %2 %6 = mulf %5, %3%7 = addf %6, %4affine.store %7, %C[%ii, %j] }}}}} (iii) loop-opted MLIR



Pi→jj: scalehls-clang | scalehls-opt -raise-scf-to-affine

- Pii→iii: scalehls-opt -affine-loop-perfection -remove-variable-bound -affine-loop-order-opt -partial-affine-loop-tile
- Piii→iv: scalehls-opt -legalize-to-hlscpp -loop-pipelining -canonicalize -simplify-affine-if -affine-store-forward -simplify-memref-access -array-partition -cse

Piv→v: scalehls-translate -emit-hlscpp



28

#### Full Example void syrk(float alpha, float beta, #map = affine\_map<(d0, d1)</pre> void syrk(float v0, float v1, float C[16][8], float A[16][16]) { -> (d0 mod 2, 0, d0 floordiv 2, d1)> float v2[16][8], float v3[16][16]) { for (int i = 0; i < 16; i++) {</pre> func @syrk(%alpha, %beta, #pragma HLS resource variable=v2 \ for (int j = 0; j <= i; j++) {</pre> %A: memref<16x8xf32, #map, 1>, core=ram\_s2p\_bram C[i][j] \*= beta; %C: memref<16x16xf32, #map, 1>) []m #pragma HLS array\_partition variable=v2 \ for (int k = 0; k < 8; k++) { cyclic factor=2 dim=1 attributes {top\_function = true} { C[i][j] += alpha \* A[i][k] \* A[j][k]; affine.for %k = 0 to 8 { }}} (i) input C affine.for %i = 0 to 16 step 2 { #pragma HLS resource variable=v3 \ affine.for %j = 0 to 16 { core=ram s2p bram Pi→ii: parse C into MLIR #pragma HLS array\_partition variable=v3 \ affine.if (%i - %j >= 0) { %0 = affine.load %C[%i, %j] G cvclic factor=2 dim=1 func @syrk(%alpha, %beta, %A, %C) { %1 = mulf %beta, %0 affine.for %i = 0 to 16 { d %2 = affine.load %A[%i, %k] for (int v4 = 0; v4 < 8; v4 += 1) { affine.for % = 0 to (% + 1) { c %3 = affine.load %A[%j, %k] for (int v5 = 0; v5 < 16; v5 += 2) { simplifications %0 = affine.load %C[%i, %j] %4 = affine.if (%k == 0) { for (int v6 = 0; v6 < 16; v6 += 1) { %1 = mulf %beta, %0 affine.yield %1 #pragma HLS pipeline N affine.store %1, %C[%i, %j] } else { affine.for %k = 0 to 8 { b affine.yield %0 if $((v5 - v6) \ge 0)$ { %2 = affine.load %A[%i, %k] float v7 = v3[v5][v6]; %3 = affine.load %A[%i, %k] and IR ( emission %5 = mulf %alpha, %2 float v8 = v1 \* v7; %4 = affine.load %C[%i, %j] %6 = mulf %5, %3float v9 = v2[v5][v4]; %5 = mulf %alpha, %2%7 = addf %6, %4 float v10 = v2[v6][v4]; %6 = mulf %5, %3transforms affine.store %7, %C[%i, %j] float v11 = (v4 == 0) ? v8 : v7; ÷ %7 = addf %6, %4 float v12 = v0 \* v9;affine.store %7, %C[%i, %j] affine.if (%i - %j + 1 >= 0) { float v13 = v12 \* v10; synthesizable }}} (ii) baseline MLIR %0 = affine.load %C[%i + 1, %j] G float v14 = v13 + v11: %1 = mulf %beta, %0 v3[v5][v6] = v14;tive P<sub>ii→iii</sub>: loop transfroms %2 = affine.load %A[%i + 1, %k] direct %3 = affine.load %A[%j, %k] if $((v5 - v6 + 1) \ge 0)$ { func @syrk(%alpha, %beta, <u>%</u>A, %C) { float v15 = v3[(v5 + 1)][v6]; Pili→iv: affine.for %k = 0 to 8 {B float v16 = v1 \* v15; ... ... affine.for %i = 0 to 16 step 2 { D float v17 = v2[(v5 + 1)][v4];affine.for %j = 0 to 16 { affine.store %7, %C[%i + 1, %j] float v18 = v2[v6][v4];affine.for %ii = (%i) to (%i + 2) { [] affine.if (%ii - %j >= 0) { ( ) } {flatten = false, pipeline = true} n ... ... %0 = affine.load %C[%ii, %j] } {flatten = true, pipeline = false} %1 = mulf %beta, %0 } {flatten = true, pipeline = false} v3[(v5 + 1)][v6] = v22;A affine.if (%k == 0) { }}} affine.store %1, %C[%ii, %j] h (iv) directive-opted MLIR (v) synthesizable C++ %2 = affine.load %A[%ii, %k] P<sub>i→ii</sub>: scalehls-clang | scalehls-opt -raise-scf-to-affine %3 = affine.load %A[%j, %k] %4 = affine.load %C[%ii, %j] h **P**<sub>ii</sub>→iii: scalehls-opt -affine-loop-perfection -remove-variable-bound -affine-loop-order-opt %5 = mulf %alpha, %2 -partial-affine-loop-tile %6 = mulf %5, %3**P**iii→iv: scalehls-opt -legalize-to-hlscpp -loop-pipelining -canonicalize -simplify-affine-if %7 = addf %6, %4affine.store %7, %C[%ii, %j] -affine-store-forward -simplify-memref-access -array-partition -cse }}}}} Piv→v: scalehls-translate -emit-hlscpp (iii) loop-opted MLIR



29

#### Full Example void syrk(float alpha, float beta, #map = affine\_map<(d0, d1)</pre> void syrk(float v0, float v1, float C[16][8], float A[16][16]) { -> (d0 mod 2, 0, d0 floordiv 2, d1)> float v2[16][8], float v3[16][16]) { for (int i = 0; i < 16; i++) {</pre> func @syrk(%alpha, %beta, #pragma HLS resource variable=v2 \ for (int j = 0; j <= i; j++) {</pre> %A: memref<16x8xf32, #map, 1>, core=ram\_s2p\_bram C[i][j] \*= beta; %C: memref<16x16xf32, #map, 1>) []m #pragma HLS array\_partition variable=v2 \ for (int k = 0; k < 8; k++) { attributes {top\_function = true} { cyclic factor=2 dim=1 C[i][j] += alpha \* A[i][k] \* A[j][k]; affine.for %k = 0 to 8 { }}} (i) input C affine.for %i = 0 to 16 step 2 { #pragma HLS resource variable=v3 \ affine.for %j = 0 to 16 { core=ram s2p bram Pi→ii: parse C into MLIR affine.if (%i - %j >= 0) { #pragma HLS array\_partition variable=v3 \ %0 = affine.load %C[%i, %j] G cyclic factor=2 dim=1 func @syrk(%alpha, %beta, %A, %C) { %1 = mulf %beta, %0 affine.for %i = 0 to 16 { d %2 = affine.load %A[%i, %k] for (int v4 = 0; v4 < 8; v4 += 1) { affine.for % = 0 to (% + 1) { c %3 = affine.load %A[%j, %k] for (int v5 = 0; v5 < 16; v5 += 2) { simplifications %0 = affine.load %C[%i, %j] %4 = affine.if (%k == 0) { for (int v6 = 0; v6 < 16; v6 += 1) { %1 = mulf %beta, %0 affine.yield %1 #pragma HLS pipeline N affine.store %1, %C[%i, %j] } else { affine.for %k = 0 to 8 { b affine.yield %0 if $((v5 - v6) \ge 0)$ { %2 = affine.load %A[%i, %k] float v7 = v3[v5][v6]; %3 = affine.load %A[%i, %k] and IR ( emission %5 = mulf %alpha, %2 float v8 = v1 \* v7; %4 = affine.load %C[%i, %j] %6 = mulf %5, %3float v9 = v2[v5][v4]; %5 = mulf %alpha, %2%7 = addf %6, %4 float v10 = v2[v6][v4]; %6 = mulf %5, %3transforms affine.store %7, %C[%i, %j] float v11 = (v4 == 0) ? v8 : v7; ÷ %7 = addf %6, %4 float v12 = v0 \* v9;affine.store %7, %C[%i, %j] affine.if (%i - %j + 1 >= 0) { float v13 = v12 \* v10; synthesizable }}} (ii) baseline MLIR %0 = affine.load %C[%i + 1, %j] G float v14 = v13 + v11: %1 = mulf %beta, %0 v3[v5][v6] = v14;tive Pii→iii: loop transfroms %2 = affine.load %A[%i + 1, %k] direct %3 = affine.load %A[%j, %k] if $((v5 - v6 + 1) \ge 0)$ { func @syrk(%alpha, %beta, <u>%</u>A, %C) { float v15 = v3[(v5 + 1)][v6];Pili→iv: affine.for %k = 0 to 8 {B float v16 = v1 \* v15; ... ... affine.for %i = 0 to 16 step 2 { D float v17 = v2[(v5 + 1)][v4];affine.for %j = 0 to 16 { affine.store %7, %C[%i + 1, %j] float v18 = v2[v6][v4];affine.for %ii = (%i) to (%i + 2) { [] affine.if (%ii - %j >= 0) { ( ) } {flatten = false, pipeline = true} n ... ... %0 = affine.load %C[%ii, %j] } {flatten = true, pipeline = false} %1 = mulf %beta, %0 } {flatten = true, pipeline = false} v3[(v5 + 1)][v6] = v22;A affine.if (%k == 0) { affine.store %1, %C[%ii, %j] h (v) synthesizable C++ (iv) directive-opted MLIR %2 = affine.load %A[%ii, %k] Pi→jj: scalehls-clang | scalehls-opt -raise-scf-to-affine %3 = affine.load %A[%j, %k] %4 = affine.load %C[%ii, %j] **P**<sub>ii</sub>→iii: scalehls-opt -affine-loop-perfection -remove-variable-bound -affine-loop-order-opt %5 = mulf %alpha, %2 -partial-affine-loop-tile %6 = mulf %5, %3**Piii→iv:** scalehls-opt -legalize-to-hlscpp -loop-pipelining -canonicalize -simplify-affine-if %7 = addf %6, %4-affine-store-forward -simplify-memref-access -array-partition -cse affine.store %7, %C[%ii, %j] }}}}} Pivey: scalehls-translate -emit-hlscpp (iii) loop-opted MLIR



## Automatic Design Space Exploration

- Before emitting final design, perform automated DSE to find the Pareto frontier for the latency/area tradeoff
  - DSE, in this case, is just experimenting with different combinations of the available passes for ScaleHLS
- Performing PCA reveals that Pareto optimal points cluster and informs their DSE algorithm



Figure 6. Design space profiling of a GEMM kernel. (a) the latency-area space; (b) PCA of the multi-dimensional design space.





## Results



# Best Configuration for Benchmarks & Scalability Study

| Kernel  | Prob. Size | Speedup | LP  | RVB | Perm. Map | Tiling Sizes | Pipeline II | <b>Array Partition Factors</b>                |
|---------|------------|---------|-----|-----|-----------|--------------|-------------|-----------------------------------------------|
| BICG    | 4096       | 41.7×   | No  | No  | [1, 0]    | [16, 8]      | 43          | A:[8, 16], s:[16], q:[8], p:[16], r:[8]       |
| GEMM    | 4096       | 768.1×  | Yes | No  | [1, 2, 0] | [8, 1, 16]   | 3           | C:[1, 16], A:[1, 8], B:[8, 16]                |
| GESUMMV | 4096       | 199.1×  | Yes | No  | [1, 0]    | [8, 16]      | 9           | A:[16, 8], B:[16, 8], tmp:[16], x:[8], y:[16] |
| SYR2K   | 4096       | 384.0×  | Yes | Yes | [1, 2, 0] | [8, 4, 4]    | 8           | C:[4, 4], A:[4, 8], B:[4, 8]                  |
| SYRK    | 4096       | 384.1×  | Yes | Yes | [1, 2, 0] | [64, 1, 1]   | 3           | C:[1, 1], A:[1, 64]                           |
| TRMM    | 4096       | 590.9×  | Yes | Yes | [1, 2, 0] | [4, 4, 32]   | 13          | A:[4, 4], B:[4, 32]                           |

Table III DSE RESULTS OF LARGE-SCALE COMPUTATION KERNELS.

The data types of all kernels are 32-bits floating-points. Speedup is with respect to the baseline designs from PolyBench-C without the optimization of DSE. LP and RVB denote Loop Perfectization and Remove Variable Bound, respectively. In the Loop Order Optimization, the *i*-th loop in the loop nest is permuted to location PermMap[i], where locations are from the outermost loop to inner.



Figure 7. Scalability study of computation kernels. The problem sizes of computation kernels are scaled from 32 to 4096 and the DSE engine is launched to search for the optimized solutions under each problem size.

### Result for DNN Models

Our DSP Effi. DSP Effi. of DSP LUT Runtime Memory Model Speedup (SLR Util. %) **TVM-VTA** [49] (seconds) (SLR Util. %) (SLR Util. %) (OP/Cycle/DSP) ResNet-18 3825.0× 60.8 91.7Mb (79.5%) 1326 (58.2%) 157902 (40.1%) 1.343 0.344 VGG-16  $1505.3 \times$ 37.3 46.7Mb (40.5%) 878 (38.5%) 88108 (22.4%) 0.744 0.296 MobileNet  $1509.0 \times$ 38.1 79.4Mb (68.9%) 1774 (77.8%) 138060 (35.0%) 0.791 0.468

Table V Optimization results of representative DNN models.

Speedup is with respect to the baseline designs compiled from PyTorch by ScaleHLS but without the multi-level optimization.



"Baseline" <u>s</u>. the baseline HLS design



Figure 8. Ablation study of DNN models. D,  $L\{n\}$ , and  $G\{n\}$  denote directive, loop, and graph optimizations, respectively. Larger n indicates larger loop unrolling factor and finer dataflow granularity for loop and graph optimizations, respectively.

## Conclusion

- Contribution
  - An end-to-end framework that closes the semantic gap between HLS and Verilog
- Strengths
  - Novel approach to closing the semantic gap between HLS and RTL
  - Code is open-source
- Weaknesses
  - Number of benchmarked applications is small and of a similar flavor (GEMM). More applications from different domains would be beneficial
  - Approach requires using AMD software ecosystem and hardware backends

