



# Domain Specific Computing in Tightly-Coupled Heterogeneous Systems

Anthony Cabrera
PhD Dissertation Defense
Department of Computer Science and Engineering
Washington University in St. Louis
July 22, 2020

#### It's the end of **Moore's Law** as we know it





#### Cue Heterogeneity!



Summit Supercomputer @ Oak Ridge National Lab



6x NVIDIA V100 GPUs



#### Hardware Specialization Spectrum





#### The Path Forward

Don't try to build a general purpose processor that does everything well; build a processor that does a few tasks incredibly well, and figure out how to build a heterogeneous architecture using those techniques.

David Patterson and John Hennessy



#### Our Vision

Enable the paradigm shift towards domain specific computing by identifying application domains and architecting performant domain-specific hardware.

#### Outline

## Domain Specific Computing













Domain Identification









#### How do you identify a domain?



Qualitatively: create the Data Integration Benchmark Suite (DIBS)

Quantitatively: characterize the workload

#### Data Integration



Parsing/Cleansing

Transformation

>Some Sequence of Interest

. . .

agcaagacttcatctcaaaaaaaaaaaaaaaGCTGCANATTTattattat tattattattatttatttatttttttgagacagagtctcgttctgtcg cccaggctggagtgcggtgatcttggctcattgcaacctccacct cccgggttcaagtgattctcctgcctcagcctcccgagtagctgggacta caggcgtatgccaccatgcctggctaattttttgtacttttagtagagac Agagtttcacggtgttagccaggctggtcttgatctcctgacctcgtgat

Aggregation

Account for unknown bases

Convert to 2bit format

Pack into bytes

Count total number of bases

Ready for downstream processing
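The steps above can be sketched in Python as follows. This is a simplified model, not the DIBS implementation: the 2-bit codes (T=0, C=1, A=2, G=3) are assumed to follow the UCSC .2bit convention, unknown bases are recorded as (start, length) runs, and the function name `fa_to_2bit` is hypothetical.

```python
# Assumed 2-bit codes (UCSC .2bit convention); the slides do not list them.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}


def fa_to_2bit(sequence):
    """Return (packed_bytes, n_blocks, base_count) for one sequence string."""
    bases = "".join(sequence.split()).upper()
    n_blocks = []        # (start, length) runs of unknown bases
    packed = bytearray()
    current = 0
    for i, base in enumerate(bases):
        if base in CODE:
            code = CODE[base]                 # convert to 2-bit code
        else:
            # Account for unknown bases: extend the current run or start one.
            if n_blocks and sum(n_blocks[-1]) == i:
                start, length = n_blocks[-1]
                n_blocks[-1] = (start, length + 1)
            else:
                n_blocks.append((i, 1))
            code = 0                          # placeholder bits for an unknown base
        current = (current << 2) | code
        if i % 4 == 3:                        # pack four bases into one byte
            packed.append(current)
            current = 0
    remainder = len(bases) % 4
    if remainder:                             # left-align the final partial byte
        packed.append(current << (2 * (4 - remainder)))
    return bytes(packed), n_blocks, len(bases)  # count total number of bases
```

For example, `fa_to_2bit("ACGTN")` packs the first four bases into one byte, records the trailing `N` as the run `(4, 1)`, and reports five bases.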

#### The Preprocessing Pain Point


BFS on Twitter Data



#### Characterization Conclusions



Quantitative Characterization

Consistency in Locality

Prevalence of Data Movement

Use this to inform domain specific hardware

#### Outline

## Domain Specific Computing

Domain Identification

Hardware Platform Evaluation







#### How about FPGAs as our platform?



Bloomberg: "Intel's \$16.7 Billion Altera Deal Is Fueled by Data Centers"

Project Catapult

#### Wait, what's an FPGA, anyway?





#### FPGA attached via PCIe card





### Intel HARPv2 (left) vs. PCIe card (right)







#### OpenCL to the Rescue!







# How portable are OpenCL FPGA kernels to Intel HARPv2?

SVP = Stratix V, PCIe; HARP = Arria 10, HARP

Zohouri et al., SC'18

| Version | FPGA | Speedup |
|---------|------|---------|
| 1       | SVP  | 1.00    |
| 1       | HARP | 0.74    |
| 2       | SVP  | 0.05    |
| 2       | HARP | 0.01    |
| 3       | SVP  | 2.48    |
| 3       | HARP | 3.90    |
| 4       | SVP  | 3.55    |
| 4       | HARP | 3.24    |
| 5       | SVP  | 38.22   |
| 5       | HARP | 34.27   |
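One way to read the table as a portability measure is to divide each HARP speedup by the SVP speedup of the same version. The sketch below does that arithmetic on the table values. It assumes every speedup is relative to the same baseline (version 1 on SVP, which is 1.00), and the function name is hypothetical.

```python
# Speedups copied from the table above.
SPEEDUPS = {
    1: {"SVP": 1.00, "HARP": 0.74},
    2: {"SVP": 0.05, "HARP": 0.01},
    3: {"SVP": 2.48, "HARP": 3.90},
    4: {"SVP": 3.55, "HARP": 3.24},
    5: {"SVP": 38.22, "HARP": 34.27},
}


def harp_relative_to_svp(speedups):
    """Ratio of HARP speedup to SVP speedup per kernel version (2 decimals)."""
    return {version: round(row["HARP"] / row["SVP"], 2)
            for version, row in speedups.items()}


ratios = harp_relative_to_svp(SPEEDUPS)
```

A ratio above 1 means the unmodified kernel runs faster on HARP than on the PCIe-attached Stratix V; in this table that holds only for version 3.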




#### Result of Exploiting Shared Virtual Memory







| Version | FPGA | Speedup |
|---------|------|---------|
| 5       | SVP  | 38.22   |
| 5       | HARP | 34.27   |



# How do you find the most performant knob configuration?

### Hardware Design Parameter Sweep





#### Outline

# Domain Specific Computing







Domain Identification









# How do you design performant hardware for a specific domain?

# Domain Specific Computing

Domain Identification: use insights from data integration domain characterization

Hardware Platform Evaluation: leverage results from HARPv2 portability and performance evaluation



#### Multi-spectral Reuse Distance



Develop a tool to characterize the relationship between spatial and temporal locality

### Method Overview



Code Regions of Interest







Instruction Trace



Reuse Distance @ 64B, 4KiB, 2MiB Granularities



Earth Mover's Distance
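The reuse-distance step of this method can be sketched as below. It is a minimal model, not the tool from the dissertation: it assumes the trace has already been reduced to a list of byte addresses, uses a simple list as the LRU stack, and all function names are hypothetical.

```python
from collections import Counter

# Block sizes for the three granularities named above.
GRANULARITIES = {"64B": 64, "4KiB": 4096, "2MiB": 2 * 1024 * 1024}


def reuse_distances(addresses, block_size):
    """Reuse distance of each access at one block granularity.

    The distance is the number of distinct blocks touched since the previous
    access to the same block. First-time accesses get None.
    """
    stack = []            # LRU stack, most recently used block last
    distances = []
    for addr in addresses:
        block = addr // block_size
        if block in stack:
            pos = stack.index(block)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def multispectral_histograms(addresses):
    """One reuse-distance histogram per granularity."""
    return {name: Counter(reuse_distances(addresses, size))
            for name, size in GRANULARITIES.items()}
```

Accesses that are far apart at 64B granularity can collapse to distance 0 at 4KiB or 2MiB, which is the spatial-locality signal the coarser histograms expose.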

## Quantifying Spatial Locality with EMD
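For two histograms over the same ordered bins, the Earth Mover's Distance reduces to the summed absolute difference of their cumulative distributions. The sketch below assumes that setting (for example, reuse-distance histograms taken at two granularities and binned identically) with a unit distance between adjacent bins; `emd_1d` is a hypothetical name.

```python
def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two histograms over the same ordered bins.

    Both histograms are normalized to total mass 1 before comparison.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must use the same bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    carried = 0.0     # mass that still has to move past the current bin
    distance = 0.0
    for a, b in zip(hist_a, hist_b):
        carried += a / total_a - b / total_b
        distance += abs(carried)
    return distance
```

Identical histograms give 0, and moving all mass from the first bin to the last of three bins gives 2, since the mass crosses two bin boundaries.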



# EMD for SPEC2006 Applications





### Sub-Domain Creation



Multi-spectral Reuse Distance



DIBS Applications



#### DIBS Multi-spectral Reuse Distance Results





*k*-means clustering

## Clustering Applications with k-means
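A minimal k-means sketch in pure Python shows the clustering step. The feature vectors in the usage note are made up for illustration and are not measured results; in this setting each application would be described by features derived from its multi-spectral reuse distance. Initialization from the first `k` points is an assumption chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm. Returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]   # deterministic init (assumption)
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:        # converged
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids
```

With `k = 2` and four hypothetical feature vectors such as `[(0.1, 0.2), (0.9, 0.8), (0.15, 0.25), (0.85, 0.9)]`, the first and third points land in one cluster and the second and fourth in the other.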



## Width vs. Depth

The two OpenCL FPGA Design Paradigms







## k-means Clustering

The two OpenCL FPGA Design Paradigms







# DIBS Subset



| Application  | Clustering Prediction |
|--------------|-----------------------|
| ebcdic_txt   | Wide                  |
| idx_tiff     | Deep                  |
| fix_float    | Wide                  |
| edgelist_csr | Deep                  |
| 2bit_fa      | Wide                  |
| fa_2bit      | Deep                  |

## Let's Talk OpenCL Hardware Design



Wide Kernel Design

Deep Kernel Design

## Case Study: ebcdic\_txt Wide Kernel

1) Kernel Body

2) Kernel Arguments

3) Width Knobs

#### What About the "Loose Ends"?



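```
/* Unbounded: every work-item translates one byte, with no bounds check. */
kernel void e2a(global const uchar* restrict src,
                global uchar* restrict dst)
{
  unsigned char e2a_lut[256] = { ... };
  unsigned int i = get_global_id(0);
  uchar orig_char = src[i];
  uchar xformd_char;
  xformd_char = e2a_lut[orig_char];
  dst[i] = xformd_char;
}

/* Bounded: takes total_work_items as an extra argument
   (the guard below is assumed from the argument name). */
kernel void e2a(global const uchar* restrict src,
                global uchar* restrict dst,
                ulong total_work_items)
{
  unsigned char e2a_lut[256] = { ... };
  unsigned int i = get_global_id(0);
  if (i < total_work_items) {
    uchar orig_char = src[i];
    uchar xformd_char;
    xformd_char = e2a_lut[orig_char];
    dst[i] = xformd_char;
  }
}
```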

# Unbounded (left) vs. Bounded (right)



4 replicates



Choose **unbounded** implementation and make the problem fit the hardware!
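One plausible reading of "make the problem fit the hardware" is host-side padding: round the input up to a multiple of the work-group size so the unbounded kernel never indexes past the buffer, then trim the output back to the original length. The sketch below assumes that reading. The helper name and the fill byte (0x40, the EBCDIC space) are hypothetical choices.

```python
def pad_to_multiple(data, multiple, fill=0x40):
    """Pad a byte string so its length is a multiple of `multiple`."""
    remainder = len(data) % multiple
    if remainder == 0:
        return data
    return data + bytes([fill]) * (multiple - remainder)


def trim_output(output, original_length):
    """Drop the bytes that correspond to padding."""
    return output[:original_length]
```

For example, a 5-byte input padded to a multiple of 4 becomes 8 bytes, and the last 3 output bytes are discarded after the kernel runs.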

## Case Study: ebcdic\_txt Wide Kernel



### 3) Width Knobs

```
kernel void e2a(global const uchar* restrict src,
                global uchar* restrict dst)
{
  unsigned char e2a_lut[256] = { ... };
  unsigned int i = get_global_id(0);  /* one work-item per byte */
  uchar orig_char = src[i];
  uchar xformd_char;
  xformd_char = e2a_lut[orig_char];
  dst[i] = xformd_char;
}
```

### ebcdic\_txt Coarse-Grain Width Knobs

```
__attribute__((num_compute_units(NUMCOMPUNITS)))
__attribute__((reqd_work_group_size(WGSIZE,1,1)))
__attribute__((num_simd_work_items(NUMSIMD)))
```

**NUMCOMPUNITS** = # of replicated compute units

**WGSIZE** = work-group size of compute unit

**NUMSIMD** = # of times data path is replicated within a compute unit

### ebcdic\_txt Width Design Space



NUMCOMPUNITS = {1, 2, 4, 8}

NUMSIMD = {1, 2, 4, 8, 16}
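The width design space above is the cross product of the two knob sets, 20 configurations in all. The sketch below enumerates it and renders each configuration as preprocessor-style defines. The work-group size of 256 is a hypothetical value, the function names are assumptions, and the slides do not say how the macros are passed to the compiler. The divisibility filter reflects the requirement that the work-group size be evenly divisible by the SIMD width.

```python
from itertools import product

NUMCOMPUNITS = [1, 2, 4, 8]
NUMSIMD = [1, 2, 4, 8, 16]
WGSIZE = 256  # hypothetical; the slides do not give the value used


def width_design_space(wgsize=WGSIZE):
    """All (NUMCOMPUNITS, NUMSIMD) pairs that are valid for this work-group size."""
    configs = []
    for ncu, nsimd in product(NUMCOMPUNITS, NUMSIMD):
        if wgsize % nsimd != 0:   # work-group size must divide evenly by SIMD width
            continue
        configs.append({"NUMCOMPUNITS": ncu, "NUMSIMD": nsimd, "WGSIZE": wgsize})
    return configs


def define_flags(config):
    """Render one configuration as -DNAME=value strings."""
    return [f"-D{name}={value}" for name, value in sorted(config.items())]
```

Each configuration corresponds to one hardware build, so the size of this list is the number of compilations a full sweep requires.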

## Case Study: ebcdic\_txt Deep Kernel



```
kernel void e2a(global const uchar* restrict src,
                global uchar* restrict dst,
                ulong num_elts)  /* loop termination condition */
{
  unsigned char e2a_lut[256] = { ... };
  unsigned int i;

  /* UNROLL = # of times to unroll the loop */
  #pragma unroll UNROLL
  for (i = 0; i < num_elts; ++i) {
    uchar orig_char = src[i];
    uchar xformd_char;
    xformd_char = e2a_lut[orig_char];
    dst[i] = xformd_char;
  }
}
```
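The wide and deep kernels compute the same translation and differ only in how the work is expressed. The Python model below mirrors both forms in software. It is a sketch for checking functional equivalence, not a performance model, and it assumes EBCDIC code page 037 for the lookup table because the slides elide the table contents.

```python
# EBCDIC -> ASCII/Latin-1 lookup table. Code page 037 is an assumption.
E2A_LUT = bytes(range(256)).decode("cp037").encode("latin-1")


def e2a_wide(src):
    """NDRange style: one 'work-item' per byte, indexed by its global id."""
    dst = bytearray(len(src))

    def work_item(global_id):
        dst[global_id] = E2A_LUT[src[global_id]]

    for gid in range(len(src)):   # the hardware runs work-items in parallel
        work_item(gid)
    return bytes(dst)


def e2a_deep(src, num_elts):
    """Single work-item style: one loop walks all num_elts elements."""
    dst = bytearray(num_elts)
    for i in range(num_elts):     # the hardware unrolls and pipelines this loop
        dst[i] = E2A_LUT[src[i]]
    return bytes(dst)
```

Both functions return the same bytes for the same input, which makes the model usable as a reference when validating either kernel's output.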

### ebcdic\_txt Deep Design Space



UNROLL = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024}

# ebcdic\_txt Width vs. Depth Results







Wide Result

Deep Result

### **Prediction Results:**



| Application  | Clustering Prediction | Correct? |
|--------------|-----------------------|----------|
| ebcdic_txt   | Wide                  | Yes!     |
| idx_tiff     | Deep                  | Yes!     |
| fix_float    | Wide                  | Yes!     |
| edgelist_csr | Deep                  | Yes!     |
| 2bit_fa      | Wide                  | Yes!     |
| fa_2bit      | Deep                  | Yes!     |

# CPU vs. Width (MWI) vs. Depth (SWI) Results





#### Conclusion

We have presented our work towards designing domain specific hardware for a Post-Moore world. We do that through qualitative and quantitative domain identification, evaluating future compute technologies for domain specific computing, and architecting hardware that exploits the target domain and hardware platform.

#### Future Work


Intelligent Design Space Search

HLS Hardware Compiler Development

What/Where to Accelerate

A New Domain

#### Thanks!

