Domain-specific and Resource-aware Computing

Domänenspezifisches und ressourcengewahres Rechnen

Der Technischen Fakultät der
Friedrich-Alexander-Universität Erlangen-Nürnberg

als

Habilitationsschrift

vorgelegt von

Dr.-Ing. Frank Hannig

Erlangen – 2017
Tag der Einreichung: 15. Dezember 2017

Fachmentorat: Professor Dr.-Ing. Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg
Professor Dr.-Ing. habil. Wolfgang Schröder-Preikschat, Friedrich-Alexander-Universität Erlangen-Nürnberg
Professor Dr.-Ing. Dr.-Ing. habil. Robert Weigel, Friedrich-Alexander-Universität Erlangen-Nürnberg

Gutachter: Professor Paul H. J. Kelly, Imperial College London
Professor Dr. rer. nat. Theo Ungerer, Universität Augsburg
Abstract

This cumulative habilitation treatise summarizes the research I have conducted with my group Architecture and Compiler Design (ACD) at the Chair of Hardware/Software Co-Design, focusing on selected results published within the last four years. My research can be divided mainly into two categories: Resource-aware computing and domain-specific computing. Both computing paradigms try to tackle the very complex programming and design challenge of parallel heterogeneous computer architectures, having different—to some extent common—goals in mind, e.g., performance, resource utilization, energy efficiency, predictability of even multiple execution qualities, or programming effort.

While resource-aware computing provides a full control loop from hardware status information to the program level and back, domain-specific computing drastically separates the concerns of algorithm development and target architecture implementation (including parallelization and low-level implementation details).

In the context of resource-aware computing, my research can be further subdivided into (1) modeling and system simulation and (2) architecture/compiler co-design of invasive tightly coupled processor arrays (TCPAs). In the area of domain-specific computing, three approaches are presented: (3) domain-specific high-level synthesis (HLS), (4) the heterogeneous image processing acceleration framework HIPAcc, and (5) the ExaStencils: Advanced stencil-code engineering approach.
# Contents

1 **Introduction**  
1.1 Contributions ....................................................... 3  
1.2 Papers of this Habilitation Treatise .............................. 5  
1.3 Structure of this Habilitation Treatise ........................... 8  

2 **Resource-aware Computing**  
2.1 Invasive Computing .................................................. 10  
2.2 Modeling and System Simulation ................................. 11  
2.2.1 Goals ............................................................. 11  
2.2.2 Approach ......................................................... 11  
2.2.3 Results ............................................................ 13  
2.2.4 Key Papers ....................................................... 13  
2.3 Architecture/Compiler Co-Design of Invasive TCPAs .......... 15  
2.3.1 Challenges ....................................................... 15  
2.3.2 Approach ......................................................... 15  
2.3.3 Results ............................................................ 17  
2.3.4 Key Papers ....................................................... 18  

3 **Domain-specific Computing**  
3.1 Domain-specific Languages ........................................ 24  
3.1.1 Definition ......................................................... 26  
3.1.2 Classification of DSLs ........................................... 27  
3.2 Domain-specific High-level Synthesis ............................ 29  
3.2.1 Goals for Domain-specific HLS ............................... 30  
3.2.2 Approach ......................................................... 31  
3.2.3 Results ............................................................ 32  
3.2.4 Domain-specific HLS Key Papers ............................. 33  
3.3 HIPAcc: The Heterogeneous Image Processing Acceleration Framework ....................................................... 34  
3.3.1 HIPAcc Goals .................................................... 35  
3.3.2 HIPAcc Approach ................................................ 35  
3.3.3 HIPAcc Results .................................................. 37  
3.3.4 HIPAcc Key Papers .............................................. 38  
3.4 ExaStencils: Advanced Stencil-Code Engineering ............. 41  
3.4.1 ExaStencils Goals ................................................ 42  
3.4.2 ExaStencils Approach .......................................... 42
3.4.3 ExaStencils Results .............................................. 44
3.4.4 ExaStencils Key Papers ......................................... 44

4 Conclusions .......................................................... 47

A Bibliography .......................................................... 49
A.1 General Bibliography ................................................. 49
A.2 Personal Bibliography ................................................. 60

B Image Credits .......................................................... 81

C Paper Reprints .......................................................... 83
C.1 Resource-aware Computing ........................................... 87
C.1.1 Modeling and System Simulation Papers ......................... 87
DAC ’15: Execution-driven Parallel Simulation of PGAS Applications on Heterogeneous Tiled Architectures ........ 87
X10 ’16: ActorX10: An Actor Library for X10 ......................... 93
ESTIMedia ’17: High Performance Network-on-Chip Simulation by Interval-based Timing Predictions .............. 99
C.1.2 Papers on Architecture/Compiler Co-Design of Invasive TCPAs 109
ACM TECS ’14: Invasive Tightly-Coupled Processor Arrays: A Domain-Specific Architecture/Compiler Co-Design Approach .............................................. 109
RSP ’17: Constructing Fast and Cycle-Accurate Simulators for Configurable Accelerators Using C++ Templates .... 139
Springer JSPS ’14: Symbolic Mapping of Loop Programs onto Processor Arrays .............................................. 147
MEMOCODE ’14: Symbolic Inner Loop Parallelisation for Massively Parallel Processor Arrays ......................... 177
ACM TECS ’17: Symbolic Multi-Level Loop Mapping of Loop Programs for Massively Parallel Processor Arrays .... 187
ASAP ’16: Modulo Scheduling of Symbolically Tiled Loops for Tightly Coupled Processor Arrays ......................... 215
Springer JSPS ’14: Compact Code Generation for Tightly-Coupled Processor Arrays .............................................. 225
C.2 Domain-specific Computing .......................................... 251
C.2.1 Domain-specific HLS Papers ...................................... 251
ASAP ’14: Domain-Specific Augmentations for High-Level Synthesis .............................................. 251
FPL ’14: An Image Processing Library for C-based High-Level Synthesis .............................................. 257
Springer JPS '17: A Novel Image Impulse Noise Removal Algorithm Optimized for Hardware Accelerators 261
ASAP '17: Hardware Design and Analysis of Efficient Loop Coarsening and Border Handling for Image Processing 279

C.2.2 HIPAcc Papers 289
IEEE TPDS '16: HIPAcc: A Domain-Specific Language and Compiler for Image Processing 289
DATE '14: Code Generation for Embedded Heterogeneous Architectures on Android 305
CODES+ISSS '14: Code Generation from a Domain-specific Language for C-based HLS of Hardware Accelerators 311
Elsevier JPDC '14: Towards a Performance-portable Description of Geometric Multigrid Algorithms using a Domain-specific Language 321
FPL '16: FPGA-based Accelerator Design from a Domain-Specific Language 333
Springer JPS '17: Loop Parallelization Techniques for FPGA Accelerator Synthesis 343

C.2.3 ExaStencils Papers 379
ICCSA '14: An Evaluation of Domain-Specific Language Technologies for Code Generation 379
Euro-Par '14: ExaStencils: Advanced Stencil-Code Engineering 389
WOLFHPC '14: ExaSlang: A Domain-Specific Language for Highly Scalable Multigrid Solvers 401
Springer LNCSE '16: Systems of Partial Differential Equations in ExaSlang 411
### List of Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ACD</td>
<td>Architecture and Compiler Design</td>
</tr>
<tr>
<td>ADAS</td>
<td>Advanced Driver Assistance System</td>
</tr>
<tr>
<td>ALU</td>
<td>Arithmetic Logic Unit</td>
</tr>
<tr>
<td>APGAS</td>
<td>Asynchronously Partitioned Global Address Space</td>
</tr>
<tr>
<td>ASIC</td>
<td>Application-Specific Integrated Circuit</td>
</tr>
<tr>
<td>ASIP</td>
<td>Application-Specific Instruction set Processor</td>
</tr>
<tr>
<td>AST</td>
<td>Abstract Syntax Tree</td>
</tr>
<tr>
<td>AVX</td>
<td>Advanced Vector Extensions</td>
</tr>
<tr>
<td>BRAM</td>
<td>Block Random Access Memory</td>
</tr>
<tr>
<td>CGRA</td>
<td>Coarse-Grained Reconfigurable Architecture</td>
</tr>
<tr>
<td>CMOS</td>
<td>Complementary Metal Oxide Semiconductor</td>
</tr>
<tr>
<td>CNC</td>
<td>Computer Numerical Control</td>
</tr>
<tr>
<td>CPU</td>
<td>Central Processing Unit</td>
</tr>
<tr>
<td>DAG</td>
<td>Directed Acyclic Graph</td>
</tr>
<tr>
<td>DLP</td>
<td>Data-Level Parallelism</td>
</tr>
<tr>
<td>DoP</td>
<td>Degree of Parallelism</td>
</tr>
<tr>
<td>DSL</td>
<td>Domain-Specific programming Language</td>
</tr>
<tr>
<td>DSP</td>
<td>Digital Signal Processor</td>
</tr>
<tr>
<td>FLOPS</td>
<td>Floating-point Operations Per Second</td>
</tr>
<tr>
<td>FPGA</td>
<td>Field Programmable Gate Array</td>
</tr>
<tr>
<td>FU</td>
<td>Functional Unit</td>
</tr>
<tr>
<td>GPU</td>
<td>Graphics Processing Unit</td>
</tr>
</tbody>
</table>
## List of Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Full Form</th>
</tr>
</thead>
<tbody>
<tr>
<td>HDL</td>
<td>Hardware Description Language</td>
</tr>
<tr>
<td>HLS</td>
<td>High-Level Synthesis</td>
</tr>
<tr>
<td>HPC</td>
<td>High-Performance Computing</td>
</tr>
<tr>
<td>HSA</td>
<td>Heterogeneous System Architecture</td>
</tr>
<tr>
<td>IC</td>
<td>Integrated Circuit</td>
</tr>
<tr>
<td>ILP</td>
<td>Instruction-Level Parallelism</td>
</tr>
<tr>
<td>IR</td>
<td>Intermediate Representation</td>
</tr>
<tr>
<td>LoC</td>
<td>Lines of Code</td>
</tr>
<tr>
<td>LPGS</td>
<td>Locally Parallel, Globally Sequential</td>
</tr>
<tr>
<td>LSGP</td>
<td>Locally Sequential, Globally Parallel</td>
</tr>
<tr>
<td>LUT</td>
<td>Look-Up Table</td>
</tr>
<tr>
<td>MIPS</td>
<td>Million Instructions Per Second</td>
</tr>
<tr>
<td>MPI</td>
<td>Message Passing Interface</td>
</tr>
<tr>
<td>MPSoC</td>
<td>Multi-Processor System-on-Chip</td>
</tr>
<tr>
<td>NoC</td>
<td>Network-on-Chip</td>
</tr>
<tr>
<td>NPP</td>
<td>Nvidia Performance Primitives</td>
</tr>
<tr>
<td>OpenCL</td>
<td>Open Computing Language</td>
</tr>
<tr>
<td>OpenCV</td>
<td>Open Source Computer Vision</td>
</tr>
<tr>
<td>PC</td>
<td>Personal Computer</td>
</tr>
<tr>
<td>PDE</td>
<td>Partial Differential Equation</td>
</tr>
<tr>
<td>PE</td>
<td>Processing Element</td>
</tr>
<tr>
<td>QoR</td>
<td>Quality of Results</td>
</tr>
<tr>
<td>RISC</td>
<td>Reduced Instruction Set Computer</td>
</tr>
<tr>
<td>SDK</td>
<td>Software Development Kit</td>
</tr>
<tr>
<td>SIMD</td>
<td>Single Instruction, Multiple Data</td>
</tr>
<tr>
<td>Abbreviation</td>
<td>Description</td>
</tr>
<tr>
<td>--------------</td>
<td>-------------</td>
</tr>
<tr>
<td>SNR</td>
<td>Signal-to-Noise Ratio</td>
</tr>
<tr>
<td>SoC</td>
<td>System-on-Chip</td>
</tr>
<tr>
<td>SQL</td>
<td>Structured Query Language</td>
</tr>
<tr>
<td>SSE</td>
<td>Streaming SIMD Extensions</td>
</tr>
<tr>
<td>TCPA</td>
<td>Tightly Coupled Processor Array</td>
</tr>
<tr>
<td>TI</td>
<td>Texas Instruments</td>
</tr>
<tr>
<td>TPDL</td>
<td>Target Platform Description Language</td>
</tr>
<tr>
<td>UML</td>
<td>Unified Modeling Language</td>
</tr>
<tr>
<td>VHDL</td>
<td>VHSIC Hardware Description Language</td>
</tr>
<tr>
<td>VHLL</td>
<td>Very High-Level programming Language</td>
</tr>
<tr>
<td>VHSIC</td>
<td>Very High Speed IC</td>
</tr>
<tr>
<td>VLIW</td>
<td>Very Long Instruction Word</td>
</tr>
</tbody>
</table>
1 Introduction

Diversity accompanies electronics ab initio. First electronic devices were assembled from discrete components to fulfill a particular purpose, i.e., they were designed and implemented in an application-specific way. Quickly, more and more transistors, the basis for all logic gates and registers, could be crammed onto integrated circuits and offered an almost unlimited plethora of possibilities for digital circuits. Thanks to the steady advances in Complementary Metal Oxide Semiconductor (CMOS) technology, the number of transistors on the same chip footprint has grown exponentially within the last half-century. This observation/projection is widely known as Moore’s law [107] and goes along with Dennard scaling [32], which roughly says that the power density remains constant as transistors are getting smaller, i.e., performance per watt is also growing exponentially. In the context of microprocessors, Central Processing Unit (CPU) designers harnessed this growth 30 years long by turning it into performance gains. First, because not only more transistors but also faster transistors have been built, and thus even CPUs could run increasingly at higher clock rates. Second, ever more complex CPUs could be designed that can do more work per clock cycle. That is, more powerful instructions and larger data types, deeper processor pipelines, sophisticated branch prediction, multiple parallel instructions, or instruction reordering in the case of out-of-order execution. A third performance boost has been accomplished thanks to larger and hierarchically organized caches. However, the seemingly never-ending performance boost of single-core processors was disrupted around 2005. Herb Sutter described this turning point in microprocessor history in his highly cited article “A Fundamental Turn Toward Concurrency in Software” [134] and the 2009 updated version “The Free Lunch is Over” [135], respectively. Herein, Sutter states that there is almost no progress in achieving higher clock frequencies. Although single transistors and small Integrated Circuits (ICs) have been experimentally proven to run at frequencies in the two-digit to three-digit gigahertz (GHz) range, processor clock rates have nearly saturated due to several physical effects. (a) Shrinking features sizes came along with severe current leakage issues, (b) the power consumption is limited, and the two items as mentioned above lead to (c) profound heat problems, in other words, heat is hard to dissipate. Collectively, these issues are referred to as the power wall. In addition, support for instruction level parallelism cannot be arbitrarily scaled within a processor core because increasing the number of functional units gets quickly exceedingly complex for both superscalar and Very Long Instruction Word (VLIW) processors [30]. All these threats

1 Leakage in semiconductor devices denotes the effect of capacity-connected components, such as transistors or diodes, to draw a small amount of current even though they are switched off.
Figure 1.1: Evolution of microprocessors with respect to (i) number of transistors per chip, (ii) benchmarked performance, (iii) clock rate, (iv) typical power consumption, and (v) number of processor cores. Figure reprinted from [23, p. 46]. © 2015 by IEEE.

induced the end of Dennard scaling in 2005 and turned processor architectures into multicore designs henceforward (see Figure 1.1).

While the speed of Moore’s projection has slightly slowed down, the steady miniaturization of feature sizes and the associated exponential growth of processor designs is still ongoing. However, as mentioned above, technology shrinking also continuously leads to higher energy densities, and thus the situation has become even more severe concerning power consumption because chips can handle only a limited power budget. As a consequence, the potentially available chip area might not be entirely utilized or at least not simultaneously. This phenomenon is also known as utilization wall [45] and accordingly as dark silicon [35], which denotes chip areas that must remain inactive most of the time. As a conclusion: Future systems will only scale if their energy efficiency will considerably improve—this reasoning holds for both embedded and portable devices, such as smartphones and tablets as well as large-scale systems used for High-Performance Computing (HPC). Customization and heterogeneity, e.g., in the form of custom-tailored memory hierarchies, sophisticated interconnection networks, and application-specific compute components, such as accelerators, are the key to success for future performance gains [127, 91].
Additionally, each CMOS process shrink is progressively ridden with imperfections (i.e., variability [14, 6]) and unreliability of the devices. These technological hurdles combined with the sheer complexity of heterogeneous Multi-Processor System-on-Chip (MPSoC) architectures raise numerous questions on how to design, test, and program such systems while having multiple—possibly contrary—goals in mind. It is very challenging to find solutions satisfying different objectives, such as performance, resource utilization, energy efficiency, resiliency [69], predictability of even multiple execution qualities [140], effective exploitation of concurrency, or programming effort. Thus, novel design methodologies and computing paradigms are required to fuel research and scientific discoveries with undiminished pace.

1.1 Contributions

In recent years, my research mission has been to master the design and programming complexity of parallel systems as well as their rising heterogeneity. The considered systems span a wide range, from architectures targeted for embedded applications to clusters for HPC. A particular focus is on accelerators (e.g., processor arrays, Graphics Processing Units (GPUs), or dedicated hardware implemented in Field Programmable Gate Arrays (FPGAs)) or their combination as part of a heterogeneous system (e.g., compute nodes equipped with accelerator technology, MPSoCs, or NoC-based tiled heterogeneous architectures). Investigated are programming languages, methods and techniques for compilation and application mapping, as well as the synthesis and simulation of parallel processor architectures. Here, two strategic approaches are researched—which may appear utterly different at first. The one is resource-aware computing, and the other is domain-specific computing. While the former makes numerous static architecture properties as well as runtime status information explicitly visible (e.g., at the program level as in the case of resource-aware programming [139, 142, P69]), the latter tries to wholly separate application development from an underlying architecture (i.e., a programmer is shielded from low-level implementation, mapping and parallelization details). Concerning these two computing paradigms, my principal research contributions can be very briefly summarized as follows:

Resource-aware Computing

(1) Modeling and System Simulation. Invasive computing [139, 142] is resource-aware computing at its best. To study resource-aware programming [139, P69] and invasive resource management strategies [68], means for application modeling and to mimic the execution behavior on not yet existing manycore architectures are required. The modeling of invasive applications includes programming extensions for resource-aware programming [P69] as well as concurrent models of computation, such as actor models [P42, P16]. Timed functional simulation
techniques [P65, P24] are used to assess execution qualities and to explore variants [P62] of tiled heterogeneous manycore architectures. Measures for fast full system simulation (parallel simulation [P24] and Network-on-Chip (NoC) simulation [P47, P4]) have been investigated and evaluated for streaming-based multimedia applications [P16] as well as X10 benchmarks [P24, P4].

(2) Architecture/Compiler Co-Design of Invasive TCPAs. Invasive Tightly Coupled Processor Arrays (TCPAs) [J18] combine architectural research in the field of parallel on-chip accelerators with the paradigm of invasive computing. The concepts to dynamically invade processors and to retreat from them are directly integrated into an invasive TCPA at the hardware level [P71, P67, J18]. This opens opportunities for ultrafast resource reservation and adaptivity (e.g., power management [J19, J18] or fault tolerance [P23, P21, 84]). These techniques have been mainly evaluated using cycle-accurate simulation [P101, J18, P5].

Regarding compilation for TCPAs, (a) efficient, yet compact code generation [P46, J17] as well as (b) symbolic tiling and symbolic scheduling [P48, J15, P32, P19, P15, J1] as program transformations for the symbolic parallelization of nested loop programs with uniform data dependences have been investigated.

Domain-specific Computing

(3) Domain-specific High-level Synthesis. Domain-specific High-Level Synthesis (HLS) provides programming abstractions to ease the problem specification and thus productivity. In the case of this habilitation treatise, the domain is image processing. The considered techniques are based on template metaprogramming and generative programming. The proposed domain-specific concepts have been investigated employing external Domain-Specific programming Language (DSL) constructs followed by code transformations in the case of the PARO HLS research tool [P37, J2] and the form of a template library in the case of C-based commercial tools, such as Xilinx Vivado HLS [P33, P9].

(4) The Heterogeneous Image Processing Acceleration Framework. HIPAcc is a DSL embedded in C++ and a compiler framework for the domain of image processing. It captures domain knowledge in a compact and intuitive language and employs source-to-source translation combined with various optimizations to achieve an excellent productivity paired with performance portability. The HIPAcc approach has been applied and evaluated for a broad variety of parallel accelerator architectures, including manycore processors [J9, J12], such as Nvidia and AMD GPUs and Intel Xeon Phi, embedded GPUs [P41], Xilinx and Intel/Altera FPGAs [P31, P13, J5], and vector units [P11].
1.2. Papers of this Habilitation Treatise

(5) **The ExaStencils Approach.** To reduce the foreseen performance/productivity gap of upcoming exascale platforms, a unique, tool-assisted co-design approach specific for the domain of multigrid methods based on stencil computations is developed within ExaStencils [P35, P38]. The fundamental concept of ExaStencils is a multi-layered external DSL in combination with a modular transformation and optimization framework [P29]. The approach has been evaluated concerning productivity and scalability, among other things, for a generated geometric multigrid solver on the JUQUEEN supercomputer [C1].

I have conducted the in this habilitation treatise cumulated research jointly together with nine doctoral researchers as well as bachelor and master students from my research group *Architecture and Compiler Design (ACD).*\(^2\) I have been the principal investigator for (1), (4), and (5) within the DFG Transregional Collaborative Research Centre 89 “Invasive Computing,” the DFG Research Training Group 1773 “Heterogeneous Image Systems,” and the DFG Priority Programme 1648 “Software for Exascale Computing,” respectively. In the case of (2), I have been a conceptual contributor, and in the case of (3), I have been the scientific leader and conceptual contributor.

### 1.2 Papers of this Habilitation Treatise

This document is a cumulative habilitation treatise. From my 165 peer-reviewed publications listed in Appendix A.2 (page 60ff.), I have opted for the following 25 key contributions of my research. These form the chief part of my cumulative habilitation treatise. The full texts (i.e., reprints) of these publications are provided on page 83ff. in Appendix C.

#### Resource-aware Computing

**Modeling and System Simulation Papers**

<table>
<thead>
<tr>
<th>Conference</th>
<th>Year</th>
<th>Authors</th>
<th>Title</th>
<th>Pages</th>
</tr>
</thead>
<tbody>
<tr>
<td>DAC ’15</td>
<td>2015</td>
<td>Roloff, Schafhauser, Hannig, and Teich.</td>
<td>“Execution-driven parallel simulation of PGAS applications on heterogeneous tiled architectures”</td>
<td>P24</td>
</tr>
<tr>
<td>X10 ’16</td>
<td>2016</td>
<td>Roloff, Pöppl, Schwarzer, Wildermann, Bader, Glaß, Hannig, and Teich.</td>
<td>ActorX10: An actor library for X10”</td>
<td>P16</td>
</tr>
<tr>
<td>ESTIMedia ’17</td>
<td>2017</td>
<td>Roloff, Hannig, and Teich.</td>
<td>“High performance network-on-chip simulation by interval-based timing predictions”</td>
<td>P4</td>
</tr>
</tbody>
</table>

---

\(^2\) In November 2017, my team, the ACD group, consists of eleven doctoral researchers and one post-doctoral researcher.
1. Introduction

Papers on Architecture/Compiler Co-Design of Invasive TCPAs

**ACM TECS ’14**
- Page 109ff.
- Hannig, Lari, Boppu, Tanase, and Reiche. "Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach"

**RSP ’17**
- Page 139ff.
- Witterauf, Hannig, and Teich. "Constructing fast and cycle-accurate simulators for configurable accelerators using C++ templates"

**Springer JSPS ’14**
- Page 147ff.
- Teich, Tanase, and Hannig. "Symbolic mapping of loop programs onto processor arrays"

**MEMOCODE ’14**
- Page 177ff.
- Tanase, Witterauf, Teich, and Hannig. "Symbolic inner loop parallelisation for massively parallel processor arrays"

**ACM TECS ’17**
- Page 187ff.
- Tanase, Witterauf, Teich, and Hannig. "Symbolic multi-level loop mapping of loop programs for massively parallel processor arrays"

**ASAP ’16**
- Page 215ff.
- Witterauf, Tanase, Hannig, and Teich. "Modulo scheduling of symbolically tiled loops for tightly coupled processor arrays"

**Springer JSPS ’14**
- Page 225ff.
- Boppu, Hannig, and Teich. "Compact code generation for tightly-coupled processor arrays"

**Domain-specific Computing**

Domain-specific HLS Papers

**ASAP ’14**
- Page 251ff.
- Schmid, Tanase, Hannig, Teich, Bhadouria, and Ghoshal. "Domain-specific augmentations for high-level synthesis"

**FPL ’14**
- Page 257ff.
- Schmid, Apelt, Hannig, and Teich. "An image processing library for C-based high-level synthesis"

**Springer JSPS ’17**
- Page 261ff.

**ASAP ’17**
- Page 279ff.
- Özkan, Reiche, Hannig, and Teich. "Hardware design and analysis of efficient loop coarsening and border handling for image processing"
### 1.2. Papers of this Habilitation Treatise

**HIPAcc Papers**

<table>
<thead>
<tr>
<th>Conference/Year</th>
<th>Authors</th>
<th>Title</th>
<th>Page Range</th>
<th>Journal/Conference ID</th>
</tr>
</thead>
<tbody>
<tr>
<td>IEEE TPDS '16</td>
<td>Membarth, Reiche, Hannig, Teich, Körner, and Eckert</td>
<td>“HIPAcc: A domain-specific language and compiler for image processing”</td>
<td>page 289ff.</td>
<td>[J9]</td>
</tr>
<tr>
<td>DATE '14</td>
<td>Membarth, Reiche, Hannig, and Teich</td>
<td>“Code generation for embedded heterogeneous architectures on Android”</td>
<td>page 305ff.</td>
<td>[P41]</td>
</tr>
<tr>
<td>CODES+ISSS '14</td>
<td>Reiche, Schmid, Hannig, Membarth, and Teich</td>
<td>“Code generation from a domain-specific language for C-based HLS of hardware accelerators”</td>
<td>page 311ff.</td>
<td>[P31]</td>
</tr>
<tr>
<td>FPL '16</td>
<td>Özkan, Reiche, Hannig, and Teich</td>
<td>“FPGA-based accelerator design from a domain-specific language”</td>
<td>page 333ff.</td>
<td>[P13]</td>
</tr>
<tr>
<td>Springer JSPS '17</td>
<td>Reiche, Özkan, Hannig, Teich, and Schmid</td>
<td>“Loop parallelization techniques for FPGA accelerator synthesis”</td>
<td>page 343ff.</td>
<td>[J5]</td>
</tr>
<tr>
<td>LCTES '17</td>
<td>Reiche, Kobylko, Hannig, and Teich</td>
<td>“Auto-vectorization for image processing DSLs”</td>
<td>page 369ff.</td>
<td>[P11]</td>
</tr>
</tbody>
</table>

**ExaStencils Papers**

<table>
<thead>
<tr>
<th>Conference/Year</th>
<th>Authors</th>
<th>Title</th>
<th>Page Range</th>
<th>Journal/Conference ID</th>
</tr>
</thead>
<tbody>
<tr>
<td>Springer LNCSE '16</td>
<td>Schmitt, Kuckuk, Hannig, Teich, Köstler, Rüde, and Lengauer</td>
<td>“Systems of partial differential equations in ExaSlang”</td>
<td>page 411ff.</td>
<td>[C1]</td>
</tr>
</tbody>
</table>
1. Introduction

1.3 Structure of this Habilitation Treatise

The rest of my habilitation treatise is organized as follows: In the next

Chapter 2 “Resource-aware Computing,” (pages 9 to 21)
I briefly introduce resource-aware computing and provide a short overview of the Transregional Collaborative Research Centre 89 “Invasive Computing”. Subsequently, I summarize my research results that emerged from the Collaborative Research Centre in the areas of modeling and system simulation (Section 2.2) and architecture/compiler co-design of invasive tightly coupled processor arrays (Section 2.3) as well as the corresponding publications; their reprints can be found in Appendix C in order not to impair reading fluency and to ease partial printing.

Chapter 3 “Domain-specific Computing” (pages 23 to 46)
introduces the key concepts of domain-specific computing and fundamentals of domain-specific programming languages briefly. Afterward, I provide an overview of the respective projects (domain-specific HLS in Section 3.2, HIPAcc in Section 3.3, and ExaStencils in Section 3.4) as well as their basis forming design and implementation techniques, followed by the corresponding research results. Reprints of the related publications can also be found in Appendix C in order not to impair reading fluency and to ease partial printing.

Chapter 4 “Conclusions” (pages 47 to 48)
summarizes and concludes my cumulative habilitation treatise. Herein, I identify commonalities of my two main research branches and provide ideas for future research directions.

Appendix A “Bibliography,” (pages 49 to 79)
Appendix B “Image Credits,” and (page 81)
Appendix C “Paper Reprints” (page 83ff.)
provide both general and personal bibliographies (including a complete list of my own publications), image credits, and reprints of the papers I opted for my cumulative habilitation treatise, respectively.
2 Resource-aware Computing

Since the term resource-aware computing as a whole is not very widespread, let us decompose it into its parts according to the Oxford English Dictionary:

“resource” noun; “A stock or supply of money, materials, staff, and other assets that can be drawn on by a person or organization in order to function effectively.”

“aware” adjective; “[with adverb or in combination] Concerned and well informed about a particular situation or development.”

“computing” noun; “The use or operation of computers.”

Resources in the sense of a computing system can be physical components, such as processing units (e.g., arithmetic units or processor cores), memory and input/output devices, or virtual components, e.g., files, network connections, and memory areas. Obviously, the resources of a computer are limited, thus, some means to manage them are required (allocation and releasing of resources, dealing with contention, etc.). Resource management necessitates some awareness about the static architecture properties (amount and type of resources) as well as about the status of resources at runtime (i.e., dynamically changing qualities, such as availability, utilization, reliability, or temperature). Static properties are sometimes considered during the design of a program or at compile time, e.g., that an application is parallelized for a certain number of compute resources. But often, an underlying architecture is completely abstracted away by chopping the workload into smaller portions that are handled by a runtime system, which takes care of assignment and scheduling. Similarly, if the workload is not known at compile time (e.g., unknown problem size or the computational effort depends on the composition of input data) or barely predictable combinations of (parallel) applications have to be executed, resource management, workload distribution, and scheduling typically happen in a runtime system. However, such an implicit resource management by the runtime system—in combination with programs written in high-level languages—thieves the control of basic resources from the programmer and thus might only lead to a suboptimal program execution.

The challenge of resource-aware program execution becomes even more intricate when having the computer architecture trends in mind (see Chapter 1). Consequently, many research questions arise: How to manage and program heterogeneous architectures with thousands of cores? How to deal with heat dissipation and power consumption? How

3https://en.oxforddictionaries.com
to optimize yield by coping with the increasing variability in semiconductor fabrication? How to treat faults and aging processes over the lifetime of a chip? How to provide predictability—not only regarding timing but also other non-functional qualities of parallel program execution, such as power and reliability?

One fundamental principle of parallel computing that examines above raised questions is invasive computing.

2.1 Invasive Computing

In [139], Jürgen Teich coined the terms invasive algorithms and invasive architectures. The associated novel concepts have been generalized and subsumed shortly afterward under invasive computing and by the DFG-funded Transregional Collaborative Research Centre 89 of the same name\(^4\) [142].

Resource awareness has top priority in invasive computing by introducing it at the level of application programming. Here, resource-aware programming facilitates a given program to explore and dynamically distribute its computations to (neighbor) processors. Then, to execute portions of code that have a high Degree of Parallelism (DoP) in parallel based on the available region on a given multi-processor architecture. Afterward, once the program terminates or if the DoP should be lower again, the program should deallocate resources and resume execution again, e.g., sequentially on a single processor. In the nomenclature [139, 142, P69] of invasive computing these different phases of operation are called invade, infect, and retreat, denoting (i) resource exploration and claiming, (ii) code and data distribution as well as parallel program execution, and (iii) release of resources, respectively. By means of invasion, an application will thus be able to spread its computations for parallel execution based on the availability and the actual state of the underlying resources, such as utilization, load, or temperature. Transitions from one phase (i.e., invade, infect, and retreat) to another one not necessarily have to follow always a straight order. As illustrated by the back edges of the state chart depicted in Figure 2.1, there can be also situations where a claimed set of resources should be extended (reinvasion) or partially freed (partial retreat), or used for execution again (reinfect). The introduced programming constructs for resource-aware programming are embedded into the parallel programming language X10 [28] as developed by IBM using a library-based approach [P69].

While the Transregional Collaborative Research Centre 89 looks at invasive computing in its entirety (four project areas covering (A) fundamentals, language, and algorithm research, (B) architectural research, (C) compiler, simulation, and run-time support, (D) applications), my research contributions are in the areas of modeling and simulation of

\(^4\)The CRC/Transregio 89 "Invasive Computing" was established by the German Research Foundation (DFG) in July 2010 and is currently in its second funding phase until end of June 2018.
2.2 Modeling and System Simulation

2.2.1 Goals

Invasive computing is a hotbed for a novel parallel computing paradigm with multifarious research directions, including theory, programming languages, and applications as well as architectural design. In order to explore and optimize different invasive architectures, invasion strategies, and invasive programming approaches, simulation techniques that cover all these aspects are indispensable. Without the need to have full hardware or software implementations available, the goals are primarily (a) modeling and behavioral simulation of invasive applications and (b) simulation of invasive architectures.

2.2.2 Approach

In order to fulfill the goals mentioned above, the simulation platform InvadeSIM [P65, P62, P24] has been developed that provides a fast simulation approach for hundreds of competing applications on large heterogeneous architectures, allows the modeling of customized heterogeneous invasive multi-tile architectures, and supports the simulation of the complete set of X10 programming language constructs as well as all novel invasive programming constructs, such as `invade`, `infect`, and `retreat`. An overview of the developed timed functional simulation platform InvadeSIM is shown in Figure 2.2. The simulator allows to quickly customize an invasive multi-tile architecture to be evaluated by changing a number of parameters, including topology network parameters, or number of tiles and processor types in each tile—denoted as architecture model in the figure. Application modeling in InvadeSIM is carried out in X10 [28] and may use the InvadeX10 language extensions [P69, P55] for exploiting the invasive command set [142, P69], an actor-based programming model [P42, P16], or combinations of both. An actor model [1, 58] exposes asynchronous buffered communication paths and decouples these from the control flow of the concurrently executed application parts. Our actor model is called ActorX10 [P16] and is implemented on top of the Asynchronously Partitioned Global Address Space (APGAS) principles of X10.
The core components to simulate parallel applications on a modeled heterogeneous multi-tile MPSoC architecture are sketched in the middle of Figure 2.2. These components contain a novel concept of approximately timed simulation and a discrete event synchronization mechanism for the simulation of multiple processors. Many existing simulation frameworks for heterogeneous MPSoCs are based either on cycle-accurate or trace-based approaches and are typically much too slow. Instead, our processor simulation approach is a hybrid method based on performance counters and analytic
models, which we call time warping [P65]. A synchronization mechanism preserves the causality of simulation events in case of multiple processor simulations. The discrete event simulation approach combines the functional execution of an application (consisting of so-called i-lets in the notion of invasive computing [P69, 142]) with the timing simulation according to the computational properties of the target processor architecture on which it was mapped at runtime. Eventually, the simulation results may be analyzed or visualized (e.g., by a graphical user interface or trace viewer), or the timing values serve as quality number for the evaluation of application mappings [P18, 140, 130] or architecture exploration [P62]. Since the latter two tasks are very complex regarding computing time, it is of utmost importance to have an ultrafast yet highly timing-accurate system simulator. Here, novel parallel simulation [P24] and hybrid network-on-chip simulation [P4] techniques have been developed, implemented, and evaluated.

2.2.3 Results

To evaluate the scalability and performance of our parallelized version of the InvadeSIM system simulator, we have modeled multi-tile processor architectures up to 64 cores (Reduced Instruction Set Computer (RISC) or Application-Specific Instruction set Processor (ASIP)) and run the simulator on a host machine with twelve cores [P24]. We could demonstrate a linear speedup in simulation time of the parallel simulation (running on the host with up to twelve cores) with respect to a sequential simulation—while the absolute simulation performance is in a five-figure Million Instructions Per Second (MIPS) range.

The developed hybrid NoC simulation technique [P4] achieves speedups of one to three orders of magnitude compared with cycle-accurate NoC simulation, while depending inversely related on the NoC size ranging from $4 \times 4$ to $16 \times 16$ tiles, and preserving a timing accuracy of above 95%. The NoC simulation has been integrated into InvadeSIM, to evaluate how much the full system simulation can benefit. Where we could showcase decent speedups for complex streaming-based multimedia applications [P16, P4] written in ActorX10 [P16] and X10 applications from the IMSuite [48] benchmark.

2.2.4 Key Papers

Roloff, Schafhauser, Hannig, and Teich. “Execution-driven parallel simulation of PGAS applications on heterogeneous tiled architectures” [P24]

The DAC ’15 paper presents the parallel execution-driven simulator for the efficient simulation of heterogeneous tile-based multicore architectures. Four novel parallel discrete-event simulation techniques were proposed, which map thread-level parallelism within the appli-
2. Resource-aware Computing

cations to core-level parallelism on the simulated target architecture and back to thread-level parallelism on the simulation host machine. In all these approaches, the correct synchronization and activation of the host threads are essential. Experiments with parallel real-world applications are used to compare the different techniques against each other and demonstrate that, on average, 8.2 times faster simulations than a sequential simulation can be achieved on a 12-core Intel Xeon processor while reaching a simulation speed of up to 86,000 MIPS.

Based on the foundation “Towards Actor-oriented Programming on PGAS-based Multicore Architectures” by Sascha Roloff et al. in [P42], the X10 ’16 paper consequently formalizes the actor model on the basis of the APGAS programming model as used in X10. The implementation, named ActorX10, seamlessly integrates into X10 in the form of a class library. It eases parallel program development by making the communication between different parts of the program explicit and by separating the control flow aspects from the computational aspects. This kind of abstraction is very versatile as we demonstrate for applications from the embedded system (object detection streaming pipeline) and HPC domain (proxy application for tsunami simulation).

The ESTIMedia ’17 paper focuses on NoC simulation and presents a hybrid approach, where the timing of packet delays in a NoC is modeled accurately, yet, for scalability reasons, predictable and contiguous communications are detected. These are exploited to forward the simulation time appropriately instead of simulating flit-by-flit. We evaluated our hybrid simulation technique for pure NoC simulation as well as full system simulation. In both cases decent speedups were achieved.

The work above was conducted with doctoral researcher Sascha Roloff. He significantly contributed to the research on the simulation of invasive computing systems, assisted by David Schafhauser with his bachelor’s thesis on parallel simulation techniques. The conceptual design of ActorX10 was carried out jointly with partners from the CRC/Transregio 89, while Sascha Roloff additionally performed the X10 library implementation.
2.3 Architecture/Compiler Co-Design of Invasive Tightly Coupled Processor Arrays

Looking at programmable accelerators in its entirety has a long tradition in my research group Architecture and Compiler Design (ACD). The co-design aspect of processor architecture design (incl. means for simulation and prototyping) and compiler research (e.g., mapping, scheduling, and code generation) has always been an overarching objective—ACD’s motto is to build both compiler-friendly architectures as well as architecture-friendly / retargetable design tools and compilers. In the course of research, we developed a highly parametrizable class of parallel on-chip processor arrays, which consists of lightweight and weakly programmable VLIW Processing Elements (PEs) that are tightly interconnected over a reconfigurable network [P108, P104, P103], as well as the corresponding tool flow [P99, C9, J24]. These Tightly Coupled Processor Arrays (TCPAs) are particularly suitable for the acceleration of nested affine loop programs [33], and have been proven for high-throughput processing [T1, C8] while being low power [P90, J23, J21] at the same time, i.e., such architectures are highly energy-efficient [J22, J18, J4].

2.3.1 Challenges

Many research questions emerged from invasive computing in the context of processor arrays: How can the principles of resource-aware programming be supported in hardware, i.e., ultrafast within a few clock cycles? Can invasive computing be an enabler for optimizing execution qualities, such as energy efficiency, and non-functional requirements, such as fault tolerance? Previous research on mapping loop programs onto processor arrays or Coarse-Grained Reconfigurable Architectures (CGRAs) considered both fixed problem sizes and a static number of PEs preferentially—but, how to proceed if the number of available resources (PEs) is only known at runtime? How can costly compilation at runtime or storing of all possible mappings be circumvented?

2.3.2 Approach

As mentioned before, TCPAs are primarily used as accelerators, reflected by dedicated TCPA tiles as part of an invasive heterogeneous tiled architecture (see Figure 2.3). To adopt the principles of invasive computing at hardware level, each PE of a TCPA is equipped with an invasion controller (or short iCtrl in Figure 2.3), giving it the capability to acquire, reserve, and then release the PEs in its neighborhood in a fast cycle-based manner [P71, P67, J18]. This decentralized approach ensures scalable and reliable resource management in future many-core architectures with hundreds to thousands of PEs, where centralized approaches do not scale anymore because of latency and fault-tolerance reasons.
The invasion controllers enable also hierarchical power management in TCPAs by dynamically turning regions of PEs on and off, depending on invade and retreat phases [P57, J19]. Most of the investigations on hardware-based management of TCPA resources [P71, P67, J18], power management [P57, J19, J18], and fault tolerance [P23, P21, 84] were assessed using cycle-accurate simulation [P101, J18, P5]. Since simulation is essential, our most recent approach [P5] profoundly exploits C++ templates, which allow for highly configurable architectures using parameters at synthesis time.

Regarding compilation techniques, the main idea is to provide (a) symbolic transformations for loop tiling, i.e., to use parametrized tile sizes to represent a (statically) unknown number of PEs, as well as (b) the determination of corresponding symbolic latency-optimal schedules. The approach can be summarized as follows [J15]: First, the iteration space of a nested loop program is tiled symbolically into parametric orthotopes. Next, the resulting tiled program is also scheduled symbolically, resulting in a set of latency-optimal parametrized schedule candidates. Finally, at runtime, once the allocated PE region (size) becomes known, simple comparisons of latency-determining expressions finally steer which of these schedules will be selected and the corresponding program configuration executed on the corresponding processor array.

---

5 An orthotope is a parallelotope whose edges are all mutually orthogonal to each other. The orthotope is a generalization of the rectangle and rectangular parallelepiped.
2.3. Architecture/Compiler Co-Design of Invasive TCPAs

After partitioning and modulo scheduling [T1], instruction and register binding are performed before code is generated. Our code generation approach [P46, J17] starts with subdividing the set of PEs of a TCPA into so-called processor classes, such that all processors that belong to the same class obtain the same program but may differ only by a delayed start of execution. For each processor class a corresponding program block control flow graph is traversed to emit the final assembly code. Unique to our approach is that local control is represented by the code generated for each processor class, whereas global control is implemented by a global controller [J18, J17], which produces global control signals that trigger the program executions and branching. As a result of this, the entire static control flow (i.e., zero-overhead loops and static branching) can be hidden.

2.3.3 Results

In [J18], we could demonstrate a greatly superior energy efficiency of TCPAs in comparison with an embedded GPU (ARM Mali-T604), mainly due to a much better resource utilization, data locality, and less expensive functional units (integer arithmetic). However, we envision TCPAs complementary to GPUs in an MPSoC since TCPAs are limited to integer arithmetic so far. Whereas, GPUs are high-performance devices, which are well suited for 3D imaging when floating-point calculations are needed.

Our highly configurable and flexible simulation approach based on C++ templates is about five times faster when simulating a VLIW PE of a TCPA compared to implementations not using templates.

We could push the research on loop parallelization by providing symbolic parallelization techniques according to the above outlined multi-step approach. Previous works [121, 54, 55, 9, 122] partition only the iteration space symbolically but not the data dependences of loop-body statements and also only consider either merely sequential or atomic execution of the tiles. In contrast, our approach allows for communicating results as soon as they are computed, which is especially beneficial for streaming architectures, such as TCPAs. We applied our novel formalism to schedule and assign atomic iterations to processors symbolically for a given loop nest with uniform data dependences to the most common loop partitioning techniques: (1) outer loop parallelization [P48, J15] (a.k.a. clustering, blocking [156], or Locally Sequential, Globally Parallel (LSGP) [60]), (2) inner loop parallelization [P32] (a.k.a. tiling or Locally Parallel, Globally Sequential (LPGS) [60, 104]), and (3) hierarchical loop parallelization [P19, 6].

---

6 Atomic execution of a tile denotes that all iterations belonging to that tile are computed first before other tiles depending on these computations might be executed. This may lead to cyclic dependences between adjacent tiles, and thus unschedulability. Whereas, a finer-grained execution and communication of results per iteration might be schedulable.

7 Atomic iteration denotes that a single iteration of a given loop nest is executed in one unit of time (e.g., clock cycle).
Moreover, we generalized our symbolic approach to modulo scheduling [P15], including multi-cycle instructions under resource constraints (i.e., a limited number of Functional Units (FUs)). Here, we formally and experimentally show that if the number of processor elements to map onto is known at compile time, the resulting schedules are latency optimal. Otherwise, they are negligibly suboptimal. Summing-up, symbolic loop parallelization techniques complement resource-ware computing on TCPAs perfectly.

Our TCPA code generator produces highly competitive code [J17]. Compared with the Trimaran compiler infrastructure [27, 144], the generated codes of our approach achieve 56–88% higher throughputs. The code generator even more drastically outperforms the VEX compiler framework [38, 59]. Here, the generated code runs 4–12 times faster, although the code size is 2–4 times smaller.

### 2.3.4 Key Papers

In the following, I briefly classify the role of the key papers related to resource-aware computing, which are part of this cumulative habilitation treatise. Reprints of these papers are available in Appendix C.

**ACM TECS ’14 page 109ff.** Hannig, Lari, Boppu, Tanase, and Reiche. "Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach" [J18]

The ACM TECS article summarizes our research on invasive TCPAs, a class of programmable accelerators with built-in hardware for very fast resource management. The article covers the entire technology stack, i.e., architectural design, programming, compilation, and simulation. By considering the design as a whole—that is, both architecture and compiler design—we could demonstrate a substantial performance and gains in energy efficiency in comparison with a state-of-the-art embedded GPU. Preserving application knowledge and combining it with architectural support, such as zero-overhead looping enabled us to generate code that is as fast as fully unrolled (across all loop levels) code, while the code size is kept minimal.


The RSP ’17 paper deals with constructing fast and cycle-accurate simulators for processors and processor arrays, such as TCPAs. The entire architecture model is specified using C++ templates. All parameters of a hardware component are considered as C++ type and construct a class templated on this type. Eventually, the simulator is built as a hierarchical composition of the component classes. The proposed ap-
2.3. Architecture/Compiler Co-Design of Invasive TCPAs

proach is evaluated by modeling and simulating a VLIW PE of a TCPA, and comparing it against a non-templated simulator.

Springer JSPS ’14 page 147ff.
Teich, Tanase, and Hannig. "Symbolic mapping of loop programs onto processor arrays" [J15]

This JSPS ’14 article is a rigorous extension of our award-winning paper [P48] "Symbolic Parallelization of Loop Programs for Massively Parallel Processor Arrays" presented at ASAP ’13. It presents a method for the symbolic parallelization of nested loop programs with uniform data dependences using LSGP partitioning. Parametrized linear functions are used for assigning and scheduling iterations of such loop programs on processor arrays of unknown size. Latency-optimal schedules may be derived statically using two program transformations: First, the iteration space is partitioned symbolically using parametrized tile sizes. From this symbolically tiled code, latency-optimal symbolic schedules are determined. Further, an upper bound for the number of different optimal schedules as well as a pruning algorithm to reduce the search space efficiently is presented.

MEMOCODE ’14 page 177ff.
Tanase, Witterauf, Teich, and Hannig. "Symbolic inner loop parallelisation for massively parallel processor arrays" [P32]

The MEMOCODE ’14 paper presents a first solution to the unsolved problem of symbolically scheduling a given symbolically tiled loop nest according to the LPGS partitioning method. A mixed compile/runtime approach is proposed. First, a unique intra-tile schedule is determined, which is responsible for scheduling the iteration points within one tile. Subsequently, all feasible inter-tile schedule candidates are determined based on sequential scanning orders given by stride matrices. The result is a parametrized latency formula that is used in a compiler-generated prolog to select the latency-minimal schedule candidate at runtime, once the size of the available processor array becomes known. The prolog also ensures feasibility of the schedules by evaluating simple guard expressions and repairing the schedule vector candidates accordingly.

ACM TECS ’17 page 187ff.
Tanase, Witterauf, Teich, and Hannig. "Symbolic multi-level loop mapping of loop programs for massively parallel processor arrays" [J1]

LSGP partitioning is well-suited for tuning the I/O demand of a tile to the given bandwidth. However, this concept cannot be used to obtain a mapping independent of the iteration space size of the algorithm.
2. Resource-aware Computing

Whereas, LPGS partitioning may handle constant as well as minimal local memory requirements on an unknown number of processors at compile time. But, the required I/O bandwidth is more significant than in the case of LSGP and might exceed the existent I/O capacities. To remedy this dichotomy, the ACM TECS article combines the best of both worlds by proposing a symbolic multi-level tiling/scheduling approach that balances the I/O bandwidth and local memory requirements.

**ASAP ’16**

Witterauf, Tanase, Hannig, and Teich. “Modulo scheduling of symbolically tiled loops for tightly coupled processor arrays” [P15]

The ASAP ’16 paper combines symbolic tiling techniques with resource-constrained modulo scheduling [119, 118, 146, T1] to parallelize nested loop programs not only at loop level but also at instruction level (i.e., Instruction-Level Parallelism (ILP)). For this, the dependence constraints are partitioned into a parametric and non-parametric subset. A solution to the modulo scheduling problem can be found using only the non-parametric constraints. To still satisfy the parametric dependence constraints, a minimum tile size is determined from the found solution. If the minimum tile size is not satisfied at runtime, a fallback schedule is selected alternately.

**Springer JSPS ’14**

Boppu, Hannig, and Teich. "Compact code generation for tightly-coupled processor arrays" [J17]

This JSPS ’14 article is a thoroughgoing extended version of the ASAP ’13 paper [P46]. It presents a design methodology for the mapping of nested loop programs onto TCPAs with an emphasis on code compaction and code generation. TCPAs and the corresponding code generator support zero-overhead looping not only for innermost loops but also for arbitrarily nested loops, i.e., the generated code is as fast as the equivalent entirely unrolled (across all loop levels) code while having a minimal code size. For selected benchmarks, our code generator is assessed and compared to the well-known Trimaran [27, 144] and VEX [38, 59] compiler frameworks.

The research on invasive TCPA architectures was mainly conducted with Vahid Lari during his Ph.D. [82, 83], helped by Jürgen Teich. Already, in my Ph.D. [T1], I investigated the problem of resource-constrained loop program scheduling for TCPAs. Consequently, this formed the basis for the research and development on TCPA code generation that was conducted with Srinivas Boppu during his Ph.D. [19]. Similarly, fundamentals such as the theoretic number of possible schedule candidates as well as
the concept of *path strides* arose out of my Ph.D. [T1] and were used as fundamentals in the research on symbolic parallelization of nested loop programs that was carried out together with Alexandru Tanase, Jürgen Teich, and Michael Witterauf. During his Ph.D. [137], Alexandru Tanase contributed principally to the symbolic LSGP, LPGS, and hierarchical partitioning methods, as well as symbolic scheduling techniques. The work on symbolic modulo scheduling and TCPA simulation was undertaken with doctoral researcher Michael Witterauf.
3 Domain-specific Computing

As introduced and motivated in Chapter 1, processor systems not only contain more and more cores but also specialized cores of different types due to performance and energy efficiency reasons. However, programming such parallel, heterogeneous systems is not an easy task since there exist too many different programming models and accordingly languages [95, 96]. Most of them require in-depth knowledge of both the application and target architecture, which is commonly referred to as the programmability gap. Naturally, the question arises how this gap can be closed and whether there is a one-fits-all solution, sort of a Jack or Jill of all trades of parallel programming languages?

On closer examination, programming languages might be rated with respect to three properties: (1) performance, (2) generality, and (3) productivity. At this point, we do not quantify these properties but rather rate them according to their general perception as follows:

**Performance** denotes the run time of a program executed on a certain architecture. That is, which performance can be achieved using a certain programming language and how it is processed, i.e., whether it is compiled or interpreted? A compiler analyses an entire program and typically generates machine code that is directly executable by a processor, whereas an interpreter takes only a single instruction as input at a time and execute that instruction. An approach based on compilation is thus typically faster than one based on interpretation.

**Generality**, sometimes also denoted as expressiveness, denotes how rich the vocabulary and how powerful the syntax of a language is with respect to its versatility. Put differently, is it a general-purpose programming language applicable for a broad range of applications or is the language specialized for a certain application area? Even very specialized and restricted programming languages might be Turing complete—however, it is questionable if it makes sense to spend great programming effort in order to describe an application of a different domain, and it goes along with the next property (productivity).

**Productivity** denotes the required programming effort to obtain a functionally correct, executable implementation. It corresponds to the variety of available high-level language constructs and programming abstractions. That is, productivity does not only include the development time or number of Lines of Code (LoC) but also the

---

8 Turing completeness or computational universality denotes in the context of a programming language that the language can be used to simulate any single-taped Turing machine.
time spent for finding errors in a program. Obviously, the probability for errors, and thus the debugging effort, might increase with the length of a program.

The three aforementioned properties span a triangle of the programming language landscape. It is visualized in Figure 3.1 and a few well-known programming languages are classified.

Next, I give a brief introduction to domain-specific languages, followed by three domain-specific approaches that I have investigated in the context of this habilitation treatise.

### 3.1 Domain-specific Languages

Already in the 1960s, researchers, such as Mark I. Halpern [50], James Martin [93], and Jean E. Sammet [126], discussed the desire for *machine independence*—in the context of writing computer programs in machine or assembly languages accompanied by portability issues (e.g., number formats and precision, integration of storage and I/O devices)—as well as *problem orientation* of programming languages.
In 1967, James Martin wrote in his textbook *Design of Real-Time Computer Systems* in the context of man-machine interfaces about the future of programming languages:

“We must develop languages that the scientist, the architect, the teacher, and the layman can use without being computer experts. The language for each user must be as *natural* as possible to him. The statistician must talk to his terminal in the language of statistics. The civil engineer must use the language of civil engineering. When a man learns his profession he must learn the *problem-oriented languages* to go with that profession.

If we give teachers throughout the world the right computer language for them, they will build *libraries* of teaching programs. If we give circuit designers the right language for them, they will make computers design circuits. If we give the medical profession the right language, they will give a doctor far more information at his fingertips than one individual could ever have today.” ([93, pp. 89f.])

Shortly afterward, Jean E. Sammet wrote in her textbook *Programming Languages: History and Fundamentals*:

“The term *problem-oriented* ... any language which is easier for writing solutions to a particular problem ...” ([126, p. 21])

From these postulations, the idea follows to take advantage of the *knowledge* being inherent in a particular problem area or field of application, i.e., a certain *domain*, in a well-directed manner and thus, to master the complexity of the before-mentioned heterogeneous systems. Such *domain knowledge* can be captured by reasonable abstractions, augmentations, and notations, e.g., libraries, *Domain-Specific programming Languages (DSLs)*, or combinations of both (e.g., embedded DSLs implemented via template metaprogramming [131, 148]). On this basis, patterns can be utilized to transform and optimize the input description in a goal-oriented way during compilation, and, finally, to generate code for a specific target architecture. Thus, DSLs have a high productivity plus typically also a high performance (see Figure 3.1).

While more useful than ever before, DSLs are not new in the programming language landscape—and probably already came across any of us. Widely-used examples include query languages for accessing and manipulating relational databases (e.g., Structured Query Language (SQL) [31]), application programming interfaces for drawing 2D and 3D graphics like OpenGL [132], languages such as MATLAB [105, 53] for doing math,\(^9\) hardware description languages like VHDL [7], or languages for the preparation of text documents, such as *LaTeX* [81] with which this treatise was typeset. But also away

---

\(^9\)The association of MATLAB with a *math* language seems naïve. More precisely, it started as an *array programming language* in the late 1970s and evolved since then to a multi-paradigm language (incorporating concepts of array, functional, imperative, procedural, and object-oriented programming).
3. Domain-specific Computing

from engineering and computer sciences, there have been applications in other areas, e.g., pig farming [99], finances [114, 40], or landscape ecology [42]. In addition to the mentioned increased program development productivity, one can infer another valuable DSLs property from the above languages, namely to ease the communication with domain experts because a well-designed DSL offers the context and thus serves as communication platform between the domain expert (i.e., user) and the programmer (i.e., technology expert) [94, 39].

3.1.1 Definition

In literature, several definitions have been proposed for the term DSL. For example, Arie van Deursen et al. propose the following definition:

“A domain-specific language (DSL) is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain.” ([149, p. 1])

Marjan Mernik et al. write in their survey:

“Domain-specific languages (DSLs) are languages tailored to a specific application domain. They offer substantial gains in expressiveness and ease of use compared with general-purpose programming languages in their domain of application.” ([100, p. 1])

Martin Fowler proposes in his textbook Domain-Specific Languages the following definition:

“Domain-specific language (noun): a computer programming language of limited expressiveness focused on particular domain.” ([39, p. 27])

All three definitions above have in common that they refer to programming languages tailored to a certain problem domain, and are of limited/restricted expressiveness. Typically, DSLs adhere to the following properties.

• A programming language for a particular field of application or domain,

• Offers appropriate abstractions and notations,

• Typically, small (i.e., restricted in number of notations and abstractions), does not necessarily have to be Turing complete,

• Limited expressiveness, i.e., less expressive than a general-purpose programming language,
3.1. Domain-specific Languages

- Rather *declarative* than imperative,

- *Nature of a programming language*, i.e., expressiveness should not be limited by the number of individual expressions but rather by how these expressions can be combined logically.

There are many synonyms for the term DSLs in the literature, they have also been called: *Specification languages* [29] in the context of application generators, *little languages* [12], *micro-languages, minilanguages* in the Unix world [120, pp. 183ff.], *task-specific programming languages* [109, p. 27, 75, 72], *Very High-Level programming Languages (VHLLs)* that were used often for rapid software prototyping and scripting\(^{10}\) [24, 154], *special purpose languages* [153, p. xix], or *languages for specialized application areas* [153, p. xix, 13, p. 17], for instance in the context of the APT programming language [123] for programming CNC (Computer Numerical Control) machines.

3.1.2 Classification of DSLs

One distinctive feature of a DSL is whether it is *textual* or *graphical*. A textual DSL contains typical constructs of textual programming languages (expressions, operators, etc.), while geometric forms (such as lines, arrows, and shapes) are used in a graphical DSL in order to express intent. Examples of graphical DSLs [47, C2] include the Unified Modeling Language (UML) [97, 18], LabVIEW [3], MATLAB Simulink [16], as well as visual programming languages, such as Scratch [90] or Google’s Blockly [41] for educational purposes of pupils.

In the following, I focus on textual DSLs even though the next attribute is also applicable to graphical DSLs. According to this property, DSLs can be mainly grouped into two categories, namely *internal* and *external* ones.

An *external DSL* denotes a programming language defined entirely new. That is, its syntax\(^{11}\) and semantics\(^{12}\) can be freely defined—however, often it is inspired by existing programming languages. The effort for designing an external DSL is relatively high but there exist tools that aid the development process (e.g., parser generators [86, 61, 101, 2, 112], metaprogramming [65, 67, 57, 117] and transformation frameworks [64, 36]). Well-known examples of external DSLs are SQL [31], awk, and regular expressions.

An *internal DSL* uses in some manner a general-purpose programming language, referred to as *host language*. The DSL utilizes the same syntax as its host language, i.e., it inherits the generic language elements of the host language (conditionals, functions, loops, etc.). Thus, the existing compiler and interpreter of the host language can be

\(^{10}\)Nowadays, scripting languages, such as Perl, Python, and Ruby, are simply denoted as *high-level programming languages*.

\(^{11}\)Syntax denotes the structure/grammar of a programming language.

\(^{12}\)The semantics of a programming language is about the *meaning*, i.e., there could be syntactically well formed programs but that are not semantically defined (e.g., using uninitialized variables).
employed for transforming and executing a DSL program, respectively. That is, the DSL is *embedded* into another language, therefore, the terms internal DSL and *embedded DSL* are interchangeable. An internal DSL may solely provide *extensions* to the employed host language, i.e., the entire host language with all its features is made available to the user. While such an extension is compelling (see Figure 3.2 left)—an acquired general-purpose language can be entirely used—it carries the inherent danger of expressing too many program parts by constructs that might be difficult to analyze by a compiler, or the generated target code might be inefficient. Internal DSLs, therefore, often hide constructs of the host language that are not relevant to the domain or difficult to analyze by a compiler. For instance, if C is considered as host language, often an embedded DSL forbids constructs such as pointers or recursions. This technique is called *reduction* and is illustrated in Figure 3.2 on the right. Finally, the DSL ends up being a restricted host language blended with new domain-specific augmentations. Common extension techniques to express domain knowledge in an internal DSL are libraries (data types, methods), annotations, or macros.

In comparison, external DSLs are typically more flexible and expressive than internal DSLs, however, at higher implementation cost. The richer expressiveness of external DSLs is not only reflected at the language level but also offers opportunities for domain-specific Intermediate Representations (IRs)—far beyond Abstract Syntax Trees (ASTs) as commonly used in compilers for general-purpose languages. Thus, powerful semantic models can be designed, e.g., a hierarchy of objects in the case of a principally declarative DSL, or finite state machines in the case of a DSL for automation technology.
3.2 Domain-specific High-level Synthesis

Research in the areas of loop parallelization (see also Section 2.3) and synthesis of nested loop programs to dedicated hardware accelerators is deeply rooted at the Chair of Hardware/Software Co-Design and also in my scientific career. In this course of research, we have developed PARO [P96], which is a High-Level Synthesis (HLS) tool for a particular class of loop algorithms, based on the mathematical foundation of the polyhedron model [37]. The development of PARO started during my Ph.D. thesis [T1]. The considered loop programs belong basically to the class of affine loop nests [33], i.e., (1) all loop bounds and iteration-dependent control conditions must be expressible as an affine expression in the containing loop iteration variables, constants, or parameters (that is, fixed latest at loop entry), and (2) all array references (respective memory accesses) can be represented as affine functions in the loop iteration variables and fixed parameters. Furthermore, our considered algorithm class, the class of so-called dynamic piecewise linear/regular algorithms (DPLA/DPRA) [P118, T1], can also account for a specific type of dynamic data dependences. DPLAs can be seen as a descendant of the notion of recurrence equations introduced by Richard M. Karp et al. in the form of a system of uniform recurrence equations (SURE) [62] and a multitude of extensions thereof [116, 145, 158, 147, 124, 34, 133], respectively. The class of DPLAs is implemented by the PAULA language [P97, T1] that serves as input to the PARO HLS tool [P96]. An abstract view of PARO’s design flow is shown in Figure 3.3. Based on a given algorithm in PAULA notation, various source-to-source compiler transformations [108, 2] and optimizations can be applied using the design tool. Amongst others, these include constant and variable propagation, common subexpression elimination, loop perfectization [157], dead code elimination, strength reduction of operators, (partial) unrolling of loops and reductions, affine transformations [156] and loop partitioning based on the polyhedron model [37]. The heart of PARO is built on allocation and scheduling methods using mixed integer linear programming, as described in [T1]. Here, latency optimal schedules under resource constraints are derived before ultimately a dedicated hardware accelerator is synthesized.

PAULA [P97, T1] is not a DSL but a functional programming language, which is well suited for modeling iterative, multi-dimensional, data-flow dominated algorithms. In this sense, it is suitable for many application areas that can be expressed by systems of recurrence equations (linear algebra, digital signal processing, image processing, ...)

---

13It should be mentioned that the terms PARO as a design system and PARO methodology are used considerably longer. A distinction from this older works follows: Marcus Bednara and Jürgen Teich [11, 10] present a design system, which is based on the CASPAR design system [138]. Around this system, a number of loosely-coupled tools for the parallelization of C code [15], scheduling and exploration [T2, P127], and hardware synthesis [10] were build. Scheduling is restricted to projection as global allocation. Also, the hardware generation is restricted to projection along a vector in the first direction (that means, the first iteration variable denotes the time axis).
3. Domain-specific Computing

Figure 3.3: PARO design flow. Figure adapted from [T1, p. 151]. For more details on PARO, see [P96, T1]. Noteworthy, PARO’s front end also forms the basis for TCPA code generation [J17, J18] (back end on the lower left in the figure) as described in Section 2.3.

Combinatorial problems, neural networks, etc.). Image processing was always a strong use case for PARO [T1, P74, P43]. Multi-dimensional reductions, such as summations, which are also handy when modeling two-dimensional convolutions in image processing, are one rationale for this. However, there were neither specific transformations in PARO nor language constructs in PAULA available for the domain of image processing. Similarly, albeit commercial HLS tools [C2] became recently very popular and versatile for FPGAs, they still require in-depth hardware design knowledge to obtain efficient implementations, and there exist hardly any domain-specific extensions. Exceptions are Xilinx Vivado HLS and the Intel/Altera FPGA SDK for Open Computing Language (OpenCL). The first offers a partial implementation of the Open Source Computer Vision (OpenCV) library for image processing and computer vision. The latter is perse—thanks to its OpenCL nature—suited for modeling image processing applications.

3.2.1 Goals for Domain-specific HLS

In the context of domain-specific HLS, the primary goal is to analyze and provide programming abstractions to ease the specification of image processing applications.
Consequently, readability, error-proneness, and thus productivity can be improved significantly. The concepts should apply to both commercial tools, such as Xilinx Vivado HLS [C5], and academic frameworks, such as PARO [P96].

3.2.2 Approach

Other than image processing implementations in software following the control-driven von Neumann execution model [43], hardware designs have to be developed very differently to be efficient. Hardware designs should follow the dataflow-driven computing paradigm, where the entire ILP is explicitly represented, e.g., as a dataflow graph14 [92, 102, 141]. Custom tailored hardware accelerators implemented as dataflow architecture [5, 113] can maximally benefit (within a space budget) from the width and height of a layered15 dataflow graph. Then, the width of a layer denotes the maximum number of operations that can be executed in parallel, i.e., the ILP. The height of the graph corresponds to the depth of a streaming pipeline of operations. In addition, the concept of streaming can be considered at the higher abstraction level of communicating tasks, represented again by a DAG, where nodes correspond to sub-algorithms (a.k.a. tasks, compute kernels), and edges denote data dependences (i.e., communication of data typically—but not necessarily—at higher granularity, e.g., entire frames in the case of image processing). GPU computing follows a buffer-wise execution concept [P31], where one compute kernel after the other is executed, and buffers serve as synchronization points (so-called host barriers) by reading from and writing to them sequentially. To achieve highly efficient FPGA implementations, the art here is to transform the buffer-wise execution model into a structural description suitable for streamed pipelining and to exploit the distributed on-chip storage capabilities (e.g., Block Random Access Memories (BRAMs), flip-flops) for data reuse in later iterations. In FPGAs, a widely used approach is line buffering [88, 87, P107], which is particularly beneficial for local operators in image processing, as each pixel is accessed more than once.

In PARO, thanks to the available loop transformations and parallelization techniques (e.g., partitioning) based on the polyhedron model as well as data flow analysis and modulo instruction scheduling [T1], the lifetime of input variables and corresponding line buffers are determined and generated automatically, respectively. In our previous work when using PARO in the image processing domain, a notable fraction of PAULA code was dedicated to image border treatment. For instance, in [P74], a 5-layered multiresolution filter for denoising X-ray images was designed. Here, in each filter

14 A dataflow graph is a Directed Acyclic Graph (DAG). That is, in the context of data flow analysis, such as used in compiler design [108], only flow dependences exist. The nodes and edges of the DAG represent transformational actors (e.g., operations) and data dependences (i.e., the flow of data from one node to another one), respectively.

15 A DAG is layered if its vertices can be arranged in horizontal rows, so-called layers, with the edges generally directed downwards [56].
3. Domain-specific Computing

kernel, about 300 lines of PAULA code, which correspond to half of the entire kernel code, were required to specify conditions at the image borders to realize mirroring [P74, P37] of pixels in the case of out-of-bound accesses.

As one contribution of [P37], we have introduced a new high-level transformation to facilitate border handling in high-level synthesis for FPGAs easily. Instead of writing dozens of code lines manually—which is potentially error-prone—just a single line specifying the input variable and type of border treatment can be used to produce the corresponding PAULA code automatically. Multi-dimensional reductions for median computations and sorting in general are another contribution. Both types of reductions are realized by generating throughput-optimized and highly parallel systolic arrays [79, 136].

In the case of Vivado HLS, a template-based approach is employed to specify memory components, such as line buffers and memory windows in dependence on the image width and size of local image operators [P33]. Moreover, architecture templates for loop coarsening and border handling are provided [P9].

3.2.3 Results

In the case of PARO, we compared the required lines of PAULA code with the LoC of a functional equivalent implementation for C-based HLS. Here, we could demonstrate that the proposed domain-specific augmentations can increase productivity by one to two orders of magnitude [P37]. Furthermore, the generated implementations are in an equal range for Quality of Results (QoR) (i.e., FPGA resource utilization and clock frequency). The high productivity of our proposed domain-specific HLS approach was proven by utilizing it for the design of a novel algorithm for image impulse noise removal, which has a superior image quality compared to other state-of-the-art denoising methods [J2].

In the context of commercial C-based HLS, we designed a library to support the productive development of hardware accelerators for stream-based image processing applications and have shown that these accelerators can achieve a higher QoR at an equal coding effort than Xilinx OpenCV implementations [P33]. For the architecture templates (for loop coarsening in combination with border handling), which we also designed for Xilinx Vivado HLS, we could demonstrate an excellent scalability while I/O bandwidth and FPGA resources are available [P9]. Also, the synthesis results show that the proposed coarsening architecture uses 32% fewer registers for a 5-by-5 convolution with a coarsening factor of 64 compared to previous works, whereas the proposed border handling architectures facilitate a decrease in the Look-Up Table (LUT) usage by 36%.
3.2. Domain-specific High-level Synthesis

3.2.4 Domain-specific HLS Key Papers

In the following, I briefly classify the role of the key papers related to domain-specific HLS, which are part of this cumulative habilitation treatise. Reprints of these papers are available in Appendix C.

**ASAP ’14**  
page 251ff.  
Schmid, Tanase, Hannig, Teich, Bhadouria, and Ghoshal. “Domain-specific augmentations for high-level synthesis”  
The paper introduces domain-specific augmentations to our PARO HLS tool [P96]. For the image processing domain, both a new high-level transformation to perform several types of border treatment (e.g., constant, clamp, mirror, mirror 101) is provided and domain-specific language extensions for sorting and median computations over polyhedral iteration domains are introduced to the PAULA language [P97]. The proposed approach is evaluated for several image processing algorithms and compared against C-based commercial HLS tool.

**FPL ’14**  
page 257ff.  
Schmid, Apelt, Hannig, and Teich. “An image processing library for C-based high-level synthesis”  
This paper presents a systematic approach for designing image processing accelerators in FPGA technology. For the domain of image processing filters, we developed concepts for (a) memory hierarchy, (b) causality and border handling, as well as (c) filter assembly (e.g., image pyramids). These concepts were implemented as a lightweight library to support productive design of FPGA accelerators for stream-based image processing using C-based HLS (e.g., Vivado HLS framework [C5] from Xilinx) and compared against state-of-the-art implementations in OpenCV [111] with respect to area cost, performance, and development effort (productivity).

**Springer JSPS ’17**  
page 261ff.  
The article in Springer’s Journal of Signal Processing Systems underlines the powerfulness and productivity of the augmentations specific for image processing as introduced in our ASAP ’14 paper [P37]. Here, we apply our approach to design a novel image impulse noise removal algorithm and synthesizing different FPGA implementations. The proposed noise removal algorithm relies on iteratively reducing...
3. Domain-specific Computing

the detection criteria, which refers to classifying a pixel as noisy pixel or noise-free pixel.

**Özkan, Reiche, Hannig, and Teich.** "Hardware design and analysis of efficient loop coarsening and border handling for image processing" [ASAP '17 page 279ff.]

The ASAP '17 paper presents two novel architectures for loop coarsening that use significantly less registers than previous work. Also, for the first time, we extended the problem of image border handling to loop coarsening. Depending on the input parameters (e.g., size of local operator, coarsening factor, border handling mode), we conducted a systematic analysis of latency and hardware costs. Based on these models, we provided an algorithm that selects the best coarsening and border handling architecture for the given parameters of a local operator. Finally, the paper is rounded off with implementation results obtained by Vivado HLS.

The work on domain-specific HLS was conducted with Moritz Schmid during his Ph.D. [128], helped by doctoral researcher Alexandru Tanase in the case of PARO and PAULA. This research was additionally spurred and complemented by Vivek Singh Bhadouria who contributed image processing knowledge during his four-month stay as visiting researcher in my group. Nicolas Apelt developed the library for stream-based image processing on top of Xilinx Vivado HLS during his master’s thesis. The analysis and design of efficient loop coarsening and border handling architectures for the domain of image processing was mainly carried out together with doctoral researcher Akif Özkan, assisted by doctoral researcher Oliver Reiche.

### 3.3 HIPAcc: The Heterogeneous Image Processing Acceleration Framework

Modern medical image processing has come a long way since the first hazy "Röntgen-ray" radiograms as produced in 1895, yet X-rays are still one of the most important imaging techniques for a broad variety of applications. One exacting scenario is interventional angiography, which visualizes the inside of blood vessels (arteries, veins, etc.) and heart chambers by injecting a contrast agent and taking X-ray images. During an intervention, motion images in real time are required to guide catheters and minimal-invasive vessel treatments (stenosis, embolization, aneurysm, etc.) interactively. The major problem still facing medical images is the poor Signal-to-Noise Ratio (SNR) due to limited dosage

---

16 **Stenosis** denotes an abnormal narrowing of a blood vessel. An **aneurysm** is a balloon-like swelling of a vessel wall filled with blood. **Embolization** is a passage where a free mass travels through the bloodstream and may block a vessel ultimately.
3.3. HIPAcc: The Heterogeneous Image Processing Acceleration Framework

and exposure for health reasons [80]. Therefore, the use of digital image processing algorithms to reduce the present noise along with preservation of visual structures is an essential field of research and just about to inhale the performance of available high-end systems. To meet the high arithmetic effort in combination with strict real-time requirements of such applications, sophisticated parallel implementations on multi-Digital Signal Processor (DSP) or FPGA systems have been used. The inhibition threshold to port an application to an entirely new class of target architectures, such as GPUs, is correspondingly high—and when is the appropriate time to make such a fundamental change of course? This tricky situation creates the desire to design algorithms only once in a domain-specific abstract way and to have compilers and generators that produce efficiently executable code for a broad variety of parallel architectures. And, in the case of new processors, only another compiler back end has to be added while the entire code base can remain untouched.

3.3.1 HIPAcc Goals

The major goals of HIPAcc are to provide a DSL for image processing and a corresponding compiler framework for parallel processor architectures, including accelerators, such as GPUs and FPGAs. The DSL should offer means to specify image processing applications in an intuitive, expressive, and very compact manner to increase productivity (i.e., to reduce both development and debugging time). The DSL should not only be for the medical domain but also suitable for image processing and computer vision tasks, such as used in robotics or Advanced Driver Assistance Systems (ADASs). Further, the language should abstract completely from parallelization, low-level, and target-specific implementation details, hereby, algorithm development and implementation can be completely decoupled and existing DSL programs can be easily re-targeted to new processor architectures by just developing an appropriate compiler back end. Domain knowledge captured in the DSL as well as hardware knowledge of a target processor architecture should be utilized in the best possible way to generate highly efficient implementations.

3.3.2 HIPAcc Approach

The design of HIPAcc’s DSL started with a domain analysis, which includes a careful examination of the required algorithmic type of operations and operators as well as building blocks characteristic to the domain of digital image processing. A vast body of work for image processing algorithms and classifications exists in literature, e.g., [66, 20, 8, 125] to name only a few. John C. Russ et al. [125] classify image processing algorithms based on their different purposes, such as image acquisition, image correction, image enhancement, or measuring and filtering images in the frequency domain. Other authors [66, 20, 8] group image processing methods based on which input data (one
3. Domain-specific Computing

or multiple pixels\textsuperscript{17} the entire image or even multiple images) are used to produce an output (image). HIPAcc adopts the notion of Reinhard Klette et al. \textsuperscript{66} to classify image processing in the spatial domain into point operators that compute one output pixel based on one input pixel (e.g., color-to-grayscale conversion, contrast enhancement), local operators that process also neighboring pixels to compute one output pixel (e.g., Gaussian blur for smoothing, Sobel filter for edge detection, median filter for noise removal, etc.), and global operators that consider the entire input (e.g., global reductions such determination of the minimum or maximum image intensity, image histogram calculation as representation of the tonal distribution). HIPAcc’s languages components were built upon this foundation and include objects for storing digital images and accessing them (by so-called accessors) or to operate on image pyramids (i.e., different resolutions of an image), declarative language constructs for boundary treatment (i.e., different modes for handling out-of-bounds accesses in the case of local operators), as well as interpolation modes in the case the computed and contributing images are of different size. Further, language elements, such as masks and convolutions, are offered to define point and local operators independent of the iteration space (i.e., size of an output image) as well as global reduction operators.

HIPAcc’s DSL is embedded into C++ and its components are implemented in form of C++ classes. Computations on images are as well encapsulated in C++ classes, which inherit from base classes provided by the framework. The decision for a C++-embedded DSL is twofold: (a) programs can be compiled with any C++ compiler, this allows to mix DSL programs with arbitrary C/C++ code as well as to port applications incrementally, and (b) no new compiler front end (parser, AST, etc.) has to be developed—HIPAcc utilizes Clang,\textsuperscript{18} a C-based language family front end for the LLVM compiler \textsuperscript{85}.

Instead of directly generating machine code for each different target processor architecture, HIPAcc translates the DSL source program into parallel and optimized source code written again in a C-like (e.g., CUDA,\textsuperscript{19} OpenCL, or OpenMP parallel C++, vector intrinsics) or other high-level programming language. This source-to-source translation is implemented within Clang, an overview of the HIPAcc framework is depicted in Figure 3.4. All analyses, optimizations, and target code generation are based on Clang’s IR \textsuperscript{2, 108}, the AST. The domain knowledge is captured in the nodes of the AST and further operations are triggered depending on their type, which correspond to (a) declarations and definitions of HIPAcc DSL classes, (b) statements that define objects in the DSL, and (c) expressions involving DSL objects \textsuperscript{J9}. In addition to the domain knowledge, hardware knowledge of the target architectures is available, and thus, platform-specific optimization strategies can be applied. For instance, in the case of a certain GPU, a tailor-made adaptation of the generated code to the memory hierarchy.

---

\textsuperscript{17} Pixel, a neologism for picture element.

\textsuperscript{18} https://clang.llvm.org

\textsuperscript{19} Initially, CUDA was an acronym for Compute Unified Device Architecture, however, later Nvidia dropped its usage.
3.3. HIPAcc: The Heterogeneous Image Processing Acceleration Framework

Or, in the case of FPGAs as target architecture, the IR is transformed into a streaming pipeline (see Section 3.2.2) before HLS code (C++ tailored for Xilinx Vivado HLS or OpenCL for Intel/Altera FPGAs) is emitted.

3.3.3 HIPAcc Results

We have applied domain-specific computing techniques to the area of image processing. The proposed techniques lead to significant advantages with respect to productivity, portability, and performance.

Productivity: HIPAcc’s DSL and source-to-source compilation framework can significantly increase productivity by one to two orders of magnitude [P61, P41, J12, J9] (e.g., quantified by LoC or other well-known software metrics [44, 51] when comparing DSL code with generated high-level code for a target architecture).

Portability: HIPAcc offers a wide range of code generators for different target architectures (see Figure 3.4), including manycore architectures, such as GPUs or Intel’s Xeon Phi [P61, P60, P56, J12, J9], Intel vector/SIMD instruction sets (up to SSE4.2 and AVX2) [P11], embedded GPUs, such as used in smartphones and tablet computers [P41, J9], and FPGAs [P31, P34, P26, P13, J5, P3]. Generating code for such a broad spectrum of architectures, starting from one and the same DSL program, is unrivaled. In addition, both the DSL and the source-to-source compiler are open-source and thus can be adapted by other researchers.

---

20HIPAcc’s OpenCL back end for DSPs from Texas Instruments (TI) was developed for an industrial cooperation partner to evaluate TI’s multicore DSP+ARM KeyStone II System-on-Chip (SoC) and the corresponding OpenCL compiler. This back end is not publicly available.

21http://hipacc-lang.org
Performance: HIPAcc’s source-to-source compiler generates highly efficient parallel implementations. In the case of manycore architectures (i.e., high-end GPUs), we could demonstrate to beat hand-written low-level implementations (e.g., CUDA codes in the OpenCV library and in the Nvidia Performance Primitives (NPP) library\footnote{https://developer.nvidia.com/npp}) \cite{J12, J9}. In the case of embedded GPUs, we demonstrated also that HIPAcc performs better than corresponding pre-implemented and hand-tuned versions \cite{P41}. In the case of HIPAcc’s vectorization back end, we are on a par with other state-of-the-art compilers or even better for many benchmark algorithms \cite{P11}. Regarding HIPAcc’s FPGA back ends, we could outperform Xilinx Vivado HLS OpenCV implementations \cite{P31}, and are close to those of hand-optimized Altera OpenCL examples \cite{P13}. Beside pure performance numbers, we could also ascertain an excellent energy efficiency when comparing FPGA implementations against other parallel target architectures \cite{J5}.

3.3.4 HIPAcc Key Papers

In the following, I briefly classify the role of the key papers related to HIPAcc, which are part of this habilitation treatise. Reprints of these papers are available in Appendix C.

IEEE TPDS ’16 page 289ff.

Membarth, Reiche, Hannig, Teich, Körner, and Eckert. “HIPAcc: A domain-specific language and compiler for image processing” \cite{J9}

This article can be considered as the HIPAcc reference paper. It summarizes and unifies the core research results by providing a thorough description of HIPAcc’s language design and components, as well as of the compiler framework with an emphasis on kernel code generation and optimization for GPUs. Productivity gains are quantified using Halstead’s complexity measures \cite{44, 51}. A variety of image processing applications is implemented in HIPAcc and for several manycore architectures, ranging from embedded to high-end GPUs from different vendors as well as Intel’s Xeon Phi, highly efficient code is generated. Finally, HIPAcc’s excellent performance results are underscored by comparing it against other state-of-the-art approaches (e.g., Halide \cite{115}, OpenCV \cite{111}).

DATE ’14 page 305ff.

Membarth, Reiche, Hannig, and Teich. "Code generation for embedded heterogeneous architectures on Android" \cite{P41}

The DATE ’14 paper deals with code generation for embedded Heterogeneous System Architecture (HSA) platforms, such as MPSoCs as used in smartphones and tablet Personal Computers (PCs). Starting from an abstract high-level representation in HIPAcc, a code generator for
3.3. HIPAcc: The Heterogeneous Image Processing Acceleration Framework

both Renderscript and Filterscript on Android platforms is presented for the first time. With HSA, CPU and GPU share the same physical memory. This allows to avoid extensive memory transfers and enables the employment of heterogeneous resources where the same data has to be accessed frequently from different compute resources.


In the CODES+ISSS ’14 paper, FPGAs are introduced as target to HIPAcc. To achieve this objective, we proposed a model transformation from the buffer-wise execution model as typically used in GPUs into a structural description in form of a streaming pipeline tailored for FPGAs. In addition, several FPGA-specific optimizations (e.g., conversion of floating to fixed point for mask coefficients, optimization of loop counter variables, mapping of vector data types) are proposed. Eventually, source code suited for C-based HLS is emitted instead of directly generating a structural description in a Hardware Description Language (HDL). The proposed approach is evaluated by assessing performance and power requirements of the generated FPGA designs in contrast to (embedded) GPUs.


This article extends HIPAcc’s language range by facilitating multiscale modeling, i.e., image pyramids [25] and multiresolution approaches [80, P74] as used in image processing, but also known as two-dimensional geometric multigrid methods based on stencil computations [22, 49] in the domain of numerical analysis. Only image data for the finest pyramid level has to be provided, all other levels are managed automatically by the HIPAcc framework while the programmer can still specify how the pyramid is traversed and which operations are performed at different levels. That means, rather than just offering template objects, the nature of a programming language is preserved, and typical traversals for construction of pyramids in image processing as well as the V-cycle and W-cycle for multigrid stencil computations can be specified. Our proposed approach provides portability across different architectures and allows to achieve competitive performance compared to Halide [115] as well as hand-tuned implementations.
3. Domain-specific Computing

FPL '16
page 333ff.
Özkan, Reiche, Hannig, and Teich. “FPGA-based accelerator design from a domain-specific language”

Focus of the FPL '16 paper is Altera's Software Development Kit (SDK) for OpenCL in the context of image processing. First, it is shown that best programming practices and the data-parallel programming paradigm as used for GPUs are not well suited, and thus immense code modifications have to be applied in order to obtain efficient FPGA implementations. This findings are employed in the second major contribution of the paper, namely a HIPAcc back end for generating optimized OpenCL code specific to the peculiarities of Altera FPGAs. Implementation results for several image filters as well as a comparison of Altera’s hand-optimized example designs with those generated by HIPAcc DSL programs of the same algorithms are provided and demonstrate that HIPAcc’s generic approach can lead to results close to those of Altera. Furthermore, it is shown that server-grade GPUs can be outperformed in terms of throughput for a wide variety of image filter algorithms.

Springer JSPS '17
page 343ff.
Reiche, Özkan, Hannig, Teich, and Schmid. “Loop parallelization techniques for FPGA accelerator synthesis”

This article summarizes, unifies, and extends FPGA-specific parallelization and optimization techniques as well as code generation for both Xilinx and Altera FPGAs. A generic method for loop tiling and loop coarsening [P20] as well as concrete back ends for generating C++ and OpenCL code are presented, which can be further synthesized with Xilinx Vivado HLS and Altera’s SDK for OpenCL, respectively. A comparison of loop tiling and coarsening, in terms of hardware utilization and achieved throughput, for both HLS tools with varying parallelization factors is presented. Further, an evaluation of the throughput of HIPAcc-generated accelerators for Vivado HLS and Altera’s SDK for OpenCL over an extensive application set is presented and compared against implementation results of embedded and discrete high-end GPUs.

LCTES '17
page 369ff.
Reiche, Kobylko, Hannig, and Teich. “Auto-vectorization for image processing DSLs”

Based on the concept of whole-function vectorization [63], the LCTES '17 paper proposes auto-vectorization techniques for image processing DSLs, for the first time, in the context of source-to-source compilation. Thanks to the regular memory access patterns (i.e., relative indexing) of
3.4 ExaStencils: Advanced Stencil-Code Engineering

the considered domain, the vectorization analysis can be significantly simplified. Another contribution is the handling of mixed bit-width data types. Here, an analysis that automatically selects the optimal Single Instruction, Multiple Data (SIMD) width for the specified target instruction set in order to pack native vectors into virtual vectors and to apply on-demand type promotion is proposed. Finally, the proposed techniques are implemented and integrated into HIPAcc, evaluated for several pre- and post-processing image filters, and compared against other state-of-the-art (semi-)automatically vectorizing compilers.

The work on the HIPAcc core framework, DSL definition, and code generation for (embedded) GPUs was carried out with Richard Membarth during his Ph.D. [98]. Doctoral researcher Oliver Reiche developed the Android integration, the model transformation and optimizations for efficient FPGA designs, while the work on HIPAcc’s OpenCL back end for Intel/Altera FPGAs was conducted with doctoral researcher Akif Özkan. The research on domain-specific auto-vectorization was performed as well with Oliver Reiche, complemented by contributions from Christof Kobylko during his master’s thesis. The research on image pyramids was a joined work that synergistically combines our findings on the automated solving of Partial Differential Equations (PDEs) as considered in project ExaStencils (together with doctoral researcher Christian Schmitt and project partner Harald Köstler) with HIPAcc’s development efforts.

3.4 ExaStencils: Advanced Stencil-Code Engineering

Coming exascale computing\textsuperscript{23} will require a close co-design of application, algorithm, and target-architecture-specific program development to unleash the tremendous performance of such supercomputers. However, before-mentioned HPC systems will only scale if the energy efficiency will considerably improve [69, 127]. One remedy for keeping up scalability are hardware components such as accelerators (e.g., GPUs, Intel Xeon Phi). Almost every fifth supercomputer in the TOP500\textsuperscript{24} is already (as at June 2017) equipped with accelerator/co-possessor technology. This trend will go on [69, 70, J3], and consequently, the node structure inside an exascale computer will become more heterogeneous, and thus challenging concerning programmability. Therefore, new software techniques and tools supporting both algorithm and architecture-aware program development will become vital, not only to ease application and program development, but also for performance analysis and tuning, to ensure short turn-around times, and for reasons of portability.

\begin{footnotesize}
\footnote{\textsuperscript{23}Exascale computing refers to supercomputers that can carry out at least one exaFLOPS, i.e., \(10^{18}\) FLoating-point Operations Per Second (FLOPS).}

\footnote{\textsuperscript{24}https://www.top500.org/}
\end{footnotesize}
3. Domain-specific Computing

Project “ExaStencils: Advanced Stencil-Code Engineering” [R2, P35] tackles the challenges mentioned above for the important class of stencil computations, which are at the heart of many high-performance simulation applications in scientific computing. Examples are simulations in astrophysics [21], geophysics [4], oceanography [71], quantum chemistry [74], or the simulation of viscoplastic materials [78], i.e., suspensions of particles or macromolecules (e.g., foams, gels, pastes, bio-organic fluids such as blood, and food products such as fruit juices), as well as particle simulation in general [17]. Stencils are used to solve large systems of linear equations that may stem from the discretization of PDEs. They are regular computations on—usually multidimensional—structured or block-structured data grids, also known as multigrid methods [49], which involve a hierarchy of very fine to successively coarser grids.

3.4.1 ExaStencils Goals

ExaStencils’ mission is to research and provide a unique, tool-assisted, co-design approach specific for the domain of multigrid methods based on stencil computations to reduce the foreseen performance/productivity gap of future exascale platforms. The project has three central objectives: (1) a substantial gain in productivity, (2) a high flexibility in the choices of the stencil algorithm and target platform, and (3) the provision and proof of scalability to reach exascale performance.

3.4.2 ExaStencils Approach

Solving above-mentioned scientific problems requires the expertise of different specialists, including domain experts (i.e., natural scientists), mathematicians, software as well as hardware engineers. To best match the distinct interests and technical terminologies of these different groups, ExaStencils’ approach [P35] captures domain knowledge for each of these groups individually in the form of a hierarchy of tailor-made DSLs and corresponding compiler technology following the concept of stepwise refinement [155]. The multi-layered DSL is depicted in Figure 3.5, it is named ExaSlang [P29] and consists of four layers. Layer 1 is the most abstract layer, targeting natural scientists and engineers that have little or no experience in programming but want to define continuous mathematical problems in the form of an energy functional to be minimized or a partial differential equation to be solved, with a corresponding computational domain and boundary definitions. The second layer is slightly less abstract and allows to specify the problem in a discretized form. It is suitable for natural scientists and engineers with the appropriate knowledge as well as mathematicians. Based on the discretized

\[\text{Project ExaStencils is funded by the German Research Foundation (DFG) as part of the Priority Programme 1648 “Software for Exascale Computing”. The first funding period from January 2013 to December 2015 has been completed. The project is in its second funding period from January 2016 to December 2018.}\]
3.4. ExaStencils: Advanced Stencil-Code Engineering

Figure 3.5: Multi-layered DSL approach of ExaSlang. Figure reprinted from [P29, p. 3].

problem of Layer 2, the third layer adds modeling of algorithmic components, settings, and parameter values. At this layer, the multigrid method becomes visible for the first time. Here, it is possible to define smoothers and to modify the multigrid cycle (e.g., different types of V-cycles or W-cycles [C1]). Computations are specified for the entire computational domain. Since this is already a very advanced layer regarding algorithm and discretization details, it targets mainly mathematicians and computer scientists. The fourth layer (ExaSlang 4) is most concrete, where user-relevant parts of the parallelization become accessible, data structures can be adapted for data exchange, and communication patterns can be specified. This layer can be considered as semi-explicitly parallel and is mainly intended for computer scientists.

While ExaSlang 4’s syntax is partly inspired by the programming language Scala [110, P29], it and all other layers are external DSLs to offer highest expressiveness. Scala in turn is used as framework for DSL entry, transformation, optimization, and code generation. As a modern object-functional programming language [110], it offers powerful concepts, such as parser combinators and pattern matching. While the former allows the use of context-sensitive grammars, and thus, ease the specification of external DSLs, the latter offers possibilities to specify transformations shortly and expressively.

The Target Platform Description Language (TPDL) is orthogonally available across all layers of the functional program description (i.e., ExaSlang 1 to ExaSlang 4). It specifies both hardware components of the target platform (e.g., CPUs, memory hierarchies, accelerators, cluster topology) and available software (e.g., compilers or Message Passing Interface (MPI) implementations).

On ExaSlang 4’s IR, most of the compiler transformations are performed, i.e., parallelization efforts, such as domain partitioning, as well as high-level and low-level optimizations [J14], such as polyhedral optimizations [37, 76] and vectorization [77].
3. Domain-specific Computing

Since the order and parametrization of optimizations and compiler transformations are not always straightforward, modularity is paramount in the compiler framework to enable traceability and variant generation. Modularity and flexibility are realized by organizing transformations in strategies and transactions, and keeping track of the IR’s state in a StateManger [P29], which allows to partly revert transformations in order to avoid re-generation and thus to speed up variant generation and exploration [J14, 46]. Afterward, the optimized IR is transformed to target source code, e.g., in C++. Finally, this generated code is transferred to the designated hardware platform to be compiled and executed.

3.4.3 ExaStencils Results

As an essential result of having studied different modern frameworks and DSL technologies [P38], we decided to bank ExaStencils on the programming language Scala [110]. Right after, we quickly implemented a very limited prototype to turn ExaStencils’ vision as a first proof-of-concept into reality [J16, P35, J6]. Another milestone of ExaStencils is the DSL design, especially the details on ExaSlang 4, presented in [P29]. Moreover, also in [P29], we presented our DSL-based optimization and transformation framework for generating code variants. Finally, we could prove a high productivity and scalability of our approach in [P29, C1]. Regarding productivity, we compared the LoC of ExaSlang 4 DSL code with the generated C++ code and obtained gains between a factor of 30 to 165. Concerning scalability, we could showcase in [P29] that our generated code scales up to the complete JUQUEEN cluster, consisting of 28,672 nodes that correspond to a total of 458,752 cores.

3.4.4 ExaStencils Key Papers

In the following, I briefly classify the role of the key papers related to ExaStencils, which are part of this cumulative habilitation treatise. Reprints of these papers are available in Appendix C.


The ICCSA ’14 paper conducts a thorough evaluation of DSL technologies for code generation. It proposes several criteria for the assessment of frameworks for the design and compilation of textual DSLs. Four technologies are considered in the evaluation, namely the language workbench Spoofax/IMP [64], the metaprogramming language Rascal

http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JUQUEEN/Configuration/Configuration_node.html
MPL [67], as well as two custom design approaches, one using C++ and the other Scala [110] as host programming language. This paper provides the rationale for selecting Scala as host language and technology for designing ExaStencils’ multi-layered DSL and transformation framework, respectively.

The Euro-Par ’14 presents thorough overview of project ExaStencils and can be considered as general reference. It introduces the considered application domain of stencil codes along with having a multi-layered domain-specific programming language and the corresponding concept of stepwise refinement [155] and model-driven software engineering [129]. In the ExaStencils workflow, at every refinement step, dedicated and highly automated optimizations are applied, which exploit the domain-specific knowledge available at this step. Since this paper summarizes the ideas at an early project stage, code generation is considered only as proof of concept with a preliminary prototypical implementation [J16, J6] in Scala [110].

The WOLFHPC ’14 contribution is the key paper for ExaStencils multi-layered DSL technology. It summarizes the purpose of the four sorts of language layers (called ExaSlang 1 to ExaSlang 4), which are tailored for different target audiences, i.e., engineers and natural scientists, mathematicians, and computer scientists. The most concrete layer ExaSlang 4 follows the paradigm of procedural programming and is described in detail by presenting its language elements, such as simple (e.g., real, integer), aggregate (e.g., complex numbers, vectors) and algorithmic (e.g., fields, stencils) data types, level specifications to ease multigrid programming, control flow (e.g., if-statements and loops), functions, as well as communication. Furthermore, the structure and mechanisms of the transformation framework (compiler) that refine user input from one DSL layer to another, and finally emits C++ code, is presented. To enable both traceability and variant generation, modular concepts, such as transformations, strategies, and transactions, are introduced.
The LNCSE ’16 book chapter introduces additional data types such as vectors and matrices to ExaSlang 4, which further ease the description of solvers for systems of PDEs. Beside these new extensions at language level, the corresponding required modifications during the code generation are described. Finally, the paper is rounded off by an optical flow detection case study based on the multigrid approach [73]. Here, it is shown that the novel data types can reduce again ExaSlang 4 program sizes up to 28%. The productivity gains are up to a factor of 30 when the LoC of the generated C++ programs are set in proportion to LoC of ExaSlang 4 programs.

The work on ExaStencils was conducted with doctoral researcher Christian Schmitt. He significantly contributed to the evaluation of DSL technologies, the definition of ExaSlang, and to the transformation framework, helped by Sebastian Kuckuk, Harald Köstler, and Christian Lengauer.
4 Conclusions

This cumulative habilitation treatise compiles my research activities on how to tackle the design and programming complexity challenge of heterogeneous parallel systems. The presented design, simulation, parallelization, and compilation techniques pursue two strategies, namely resource-aware computing and domain-specific computing, which seem to be fundamentally different at first sight. Resource-aware computing provides a full control loop from hardware status information to the program level and back, whereas domain-specific computing drastically separates the concerns of algorithm development from parallelization and low-level implementation details. However, the two approaches have also commonalities: Both approaches scale very well for heterogeneous manycore systems and achieve high performance; in the case of invasive computing, through the implementation of program variants and symbolic mappings that can adapt to a certain number of resources at runtime. This adaptability gives programs the possibility to react flexibly to changes in the execution environment, such as the number of available resources, failures, power, or temperature. In the case of domain-specific computing, both domain knowledge and hardware knowledge are exploited to generate highly optimized implementations. In addition, an excellent productivity can be achieved thanks to highly abstract and predominantly declarative program specifications in the form of domain-specific augmentations (e.g., libraries or DSLs). But also in invasive computing, there has always been the motto: “As concrete as necessary, as abstract as possible.” That is, on the one hand, invasive computing provides instruments, such as resource-aware programming, that allow controlling a system in a very fine manner. On the other hand, once resource-aware programming and resource management strategies are adequately researched (e.g., by dint of system simulation), they can be moved progressively into compilation and runtime management, and that is exactly what happened in the second funding phase of invasive computing. Now, instead of specifying how many resources of which type should be claimed, the user can specify what should be achieved. For this purpose, she or he can specify requirements, e.g., in the case of image/video processing that the throughput should be at least 25 frames per second. But how to achieve this throughput requirement is completely up to the compiler and runtime system. For this purpose, design-time analysis/exploration of application mappings is combined with run-time management [151, 152]. By this transition from an imperative to a declarative programming concept, resource-aware computing approaches domain-specific computing. A second point of contact is the increasing investigation and determined exploitation of parallel patterns within invasive computing.
Highly parallel accelerators in combination with domain-specific languages are more relevant than ever. For instance, Google’s Pixel 2 smartphone has besides its high-end Qualcomm Snapdragon 835 MPSoC, a domain-specific co-processor, the Pixel Visual Core that consists of 4096 tiny ALUs and will be programmable using the domain-specific library TensorFlow Lite for machine intelligence (i.e., machine learning and deep neural networks) and the DSL Halide for image processing applications. On the other extreme of computing, in data centers, accelerators are also increasingly investigated. Examples include interconnected FPGAs as computational accelerators in Microsoft’s Project Catapult [26], ASIC clouds [89], or even offered as commercial cloud services, e.g., Amazon EC2 Elastic GPUs or Amazon EC2 F1 (FPGA) Instances. Here, DSLs are appealing, especially, for non-hardware specialists to unleash the full processing power of such accelerator-based large-scale systems.

What’s next after Moore’s law ends? Research in computing will not stop. Quite the opposite, new technologies and computational approaches, such as approximate computing [52, 103], neuromorphic computing [106], reversible and quantum computing will drive exciting scientific adventures and will foster research of corresponding computing paradigms, programming languages, and compilation techniques.
A Bibliography

A.1 General Bibliography


A. Bibliography


A.1. General Bibliography


A.1. General Bibliography


[70] Peter Kogge and John Shalf. “Exascale computing trends: Adjusting to the "new normal" for computer architecture”. In: Computing in Science Engineering 15.6 (Nov. 2013), pp. 16–26. ISSN: 1521-9615. DOI: 10.1109/MCSE.2013.95.


A.1. General Bibliography


[84] Vahid Lari, Andreas Weichslgartner, Alex Tanase, Michael Witterauf, Faramarz Khosravi, Jürgen Teich, Jürgen Becker, Jan Heißwolf, and Stephanie Friederich. "Providing fault tolerance through invasive computing". In: *it – Information Technology* 58.6 (Oct. 19, 2016), pp. 309–328. ISSN: 1611-2776. DOI: 10.1515/itit-2016-0022.


A. Bibliography


[121] Lakshminarayanan Renganarayanan, Dae-Gon Kim, Sanjay Rajopadhye, and Michelle Mills Stout. "Parameterized tiled loops for free". In: *ACM SIGPLAN Notices* 42.6 (June 2007), pp. 405–414. ISSN: 0362-1340. DOI: 10.1145/1273442.1250780.

A. Bibliography


A. Bibliography


A.2 Personal Bibliography (186)

All journal, conference, and workshop papers listed in the following were selected for publication by peers or international program committees in a formal review process. They have been published in printed journals, proceedings, or widely recognized online archives (ACM Digital Library, IEEE Xplore, SpringerLink, etc.). Different publication categories are denoted by capital prefixes, e.g., "B" for books, "C" for chapters in books, "J" for journal articles, "P" for papers in proceedings, and so on. References to key papers of this habilitation treatise are additionally marked by a book symbol, such as [J9].

Books and Proceedings (10)


A.2. Personal Bibliography


Book Chapters (11)


A. Bibliography


Editorials (2)


Journal Articles (27)


A. Bibliography


A.2. Personal Bibliography


Papers in Conference, Symposia, and Workshop Proceedings (127)


A. Bibliography


A.2. Personal Bibliography


A. Bibliography


A.2. Personal Bibliography


A. Bibliography


A.2. Personal Bibliography


A. Bibliography


A.2. Personal Bibliography


A. Bibliography


A.2. Personal Bibliography


A. Bibliography


Theses (3)


Technical Reports and White Papers (7)

A.2. Personal Bibliography


B Image Credits

Advanced Micro Devices Inc.,

AMI GmbH,
http://www.ami-gmbh.com/ami/img/Xilinx_Kintex_7_32ST.jpg

Afrank99 (Wikipedia),

Digi-Key Electronics,
https://media.digikey.com/Photos/Texas%20Instr%20Photos/MFG_EVMK2H.JPG

Dave Gandy, \faBook from \LaTeX fontawesome package, SIL Open Font License,
https://www.ctan.org/tex-archive/fonts/fontawesome

Intel Corporation,

Intel Corporation,

Jarekt (Wikipedia), Public Domain,
https://commons.wikimedia.org/w/index.php?curid=3769111

Tam (Wikipedia),

Reproduced from work created and shared by the Android Open Source Project (Wikipedia), CC BY-SA 2.5,
https://upload.wikimedia.org/wikipedia/commons/f/fe/Nexus_10.png

NVIDIA Corporation,
http://www.nvidia.de/docs/IO/143902/tesla-k40.png

www.llvm.org, ©Apple Inc.,
http://llvm.org/img/DragonFull.png

www.python.org (https://www.python.org/community/logos/), GPL,
https://commons.wikimedia.org/w/index.php?curid=34991637

Yukihiro Matsumoto, Ruby Visual Identity Team (Wikipedia), CC BY-SA 2.5,
https://commons.wikimedia.org/w/index.php?curid=3239992
C  Paper Reprints

From my 165 peer-reviewed publications listed in Appendix A.2 (page 60ff.), I have opted for the following 25 key contributions of my research. These form the chief part of my cumulative habilitation treatise.

Resource-aware Computing

Modeling and System Simulation Papers

DAC ’15  
page 87ff.  
Roloff, Schafhauser, Hannig, and Teich. “Execution-driven parallel simulation of PGAS applications on heterogeneous tiled architectures” [P24]

X10 ’16  
page 93ff.  

ESTIMedia ’17  
page 99ff.  

Papers on Architecture/Compiler Co-Design of Invasive TCPAs

ACM TECS ’14  
page 109ff.  
Hannig, Lari, Boppu, Tanase, and Reiche. “Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach” [J18]

RSP ’17  
page 139ff.  

Springer JSPS ’14  
page 147ff.  
Teich, Tanase, and Hannig. “Symbolic mapping of loop programs onto processor arrays” [J15]

MEMOCODE ’14  
page 177ff.  
Tanase, Witterauf, Teich, and Hannig. “Symbolic inner loop parallelisation for massively parallel processor arrays” [P32]

ACM TECS ’17  
page 187ff.  
Tanase, Witterauf, Teich, and Hannig. “Symbolic multi-level loop mapping of loop programs for massively parallel processor arrays” [J1]

ASAP ’16  
page 215ff.  
Witterauf, Tanase, Hannig, and Teich. “Modulo scheduling of symbolically tiled loops for tightly coupled processor arrays” [P15]

Springer JSPS ’14  
page 225ff.  
Boppu, Hannig, and Teich. “Compact code generation for tightly-coupled processor arrays” [J17]
Domain-specific Computing

Domain-specific HLS Papers


ASAP ’17 page 279ff. Özkan, Reiche, Hannig, and Teich. “Hardware design and analysis of efficient loop coarsening and border handling for image processing” [P9]

HIPAcc Papers


ExaStencils Papers

ICCSA ‘14
page 379ff.

Euro-Par ‘14
page 389ff.

WOLFHPC ‘14
page 401ff.

Springer LNCSE ’16
page 411ff.
Schmitt, Kuckuk, Hannig, Teich, Köstler, Rüde, and Lengauer. “Systems of partial differential equations in ExaSlang” [C1]