Hardware Acceleration Design Tutorials

The Big Picture: Teaching Hardware Acceleration by Example

The Hardware_Acceleration_Design_Tutorials module is a curated collection of reference designs that bridge the gap between algorithmic intent and high-performance FPGA implementation. Think of it as a "cookbook" for hardware acceleration — each tutorial demonstrates how to transform a compute-intensive workload from a CPU-bound reference implementation into an efficient FPGA-accelerated solution using Vitis HLS (High-Level Synthesis) and the Vitis unified software platform.

Why This Module Exists

Modern data-center and edge workloads — from real-time video processing to combinatorial optimization — increasingly bump against the "CPU memory wall." While CPUs excel at sequential, branch-heavy code, they struggle with data-parallel "throughput kernels" where the same operation applies to massive data streams.

The traditional path to FPGA acceleration was daunting: write RTL (Verilog/VHDL), verify in simulation, handle timing closure manually — a months-long effort requiring specialized hardware design expertise. This module demonstrates the Vitis HLS approach: express algorithms in C/C++, add pragmas to guide hardware generation, and let the compiler build the pipeline. The goal is to enable software engineers to achieve RTL-level performance without writing RTL.

The Mental Model: A Pipeline Factory

To understand how these tutorials work, imagine a modern automated factory floor:

  1. The Raw Materials (Global Memory) — Large batches of unprocessed data (images, city coordinates, network packets) sit in off-chip DDR memory, waiting for processing.

  2. The Loading Dock (Memory-Mapped AXI4) — A specialized loader (ReadFromMem) pulls raw materials from the warehouse into the factory in optimal-sized batches (burst transfers), converting wide memory bus transactions into a steady stream of individual items.

  3. The Assembly Line (Dataflow Pipeline) — Inside the factory, materials move through specialized workstations connected by conveyor belts (hls::stream FIFOs):

    • Window Formation Station (Window2D): Assembles individual pixels into 2D neighborhoods (convolution windows) using line buffers — like organizing parts into assemblies.
    • Processing Station (Filter2D): Applies the actual computation (multiply-accumulate for convolution, distance calculation for TSP) to each assembled unit.
  4. The Shipping Dock (AXI4 Write) — Finished products flow to the unloading station (WriteToMem), which repackages the stream into burst writes back to global memory.

  5. The Dispatcher (Host Orchestration) — A central coordinator (the Filter2DDispatcher class) manages multiple "delivery requests" to the factory. It implements software pipelining: while one batch is being processed on the FPGA, the next batch's data is being transferred from host memory to FPGA memory, overlapping communication with computation.

The key insight: Hardware acceleration isn't just about making one operation faster; it's about keeping the pipeline full. The tutorials demonstrate how to structure both the hardware kernel (the factory floor) and the host software (the logistics coordinator) to maintain maximum throughput.

Architecture Overview

The module is organized into three distinct tutorial tracks, each targeting different problem domains and optimization techniques:

```mermaid
graph TB
    subgraph "Hardware Acceleration Design Tutorials"
        A[Module Entry Point] --> B[Convolution Tutorial]
        A --> C[Traveling Salesperson]
        A --> D[Alveo Aurora Kernel]
        subgraph "Convolution: Image Processing Pipeline"
            B --> B1[HLS Kernel: Filter2DKernel]
            B --> B2[Host Dispatcher]
            B1 --> B3[Window2D Line Buffers]
            B1 --> B4[Filter2D Convolution]
        end
        subgraph "TSP: Combinatorial Optimization"
            C --> C1[CPU Reference Gold]
            C --> C2[HLS tsp Kernel]
            C --> C3[HLS Optimized tsp]
            C2 --> C4[Permutation Engine]
        end
        subgraph "Aurora: High-Speed Networking"
            D --> D1[Stream Configuration]
            D --> D2[GT Transceiver Integration]
        end
    end
    style B fill:#e1f5fe
    style C fill:#fff3e0
    style D fill:#f3e5f5
```

1. Convolution Tutorial: The Stream Processing Pattern

The Convolution Tutorial is the "hello world" of image processing acceleration. It implements a 2D convolution filter (blur, sharpen, edge detection) using the classic stream processing pattern.

Key architectural insights:

  • Line Buffer Pattern: The Window2D function demonstrates the canonical FPGA image processing architecture — using on-chip BRAM to buffer image lines, enabling efficient 2D window extraction with minimal external memory bandwidth.
  • DATAFLOW Architecture: The kernel uses #pragma HLS DATAFLOW to create a 4-stage pipeline (Read → Window → Filter → Write) where all stages execute concurrently, connected by hls::stream FIFOs.
  • Software Pipelining on Host: The Filter2DDispatcher class implements triple-buffering at the system level — while the FPGA processes batch N, the CPU prepares data for batch N+1 and reads results from batch N-1.

2. Traveling Salesperson: Algorithmic Optimization

The Traveling Salesperson Tutorial tackles combinatorial optimization — a domain where the algorithmic approach matters more than raw data bandwidth. It implements a brute-force TSP solver that evaluates all city permutations to find the shortest route.

Key architectural insights:

  • Reference vs. Optimized Flow: The tutorial provides both a baseline HLS implementation (build/hls.tcl) and an optimized version (build/hls_opt.tcl), demonstrating the progression from "functional but slow" to "pipeline-optimized."
  • Memory-Bound vs. Compute-Bound: Unlike the convolution tutorial (memory bandwidth limited), TSP is compute-bound — the challenge is efficiently generating permutations and accumulating distances without pipeline stalls.
  • Fixed-Point Arithmetic: The kernel uses uint16_t distances (scaled integers) rather than floating-point, dramatically reducing DSP48 usage and enabling higher clock frequencies.

3. Alveo Aurora: High-Speed Serial Communication

The Alveo Aurora Tutorial demonstrates high-speed serial communication using the Aurora protocol over QSFP interfaces on Alveo cards. Unlike the previous tutorials (which focus on computation), this focuses on data movement at the edge of the FPGA — connecting the device to external networks or sensors.

Key architectural insights:

  • GT Transceiver Integration: The configuration file shows how to connect HLS kernels to hardened GT (Gigabit Transceiver) blocks — the physical layer for high-speed serial.
  • Stream-Based Datapaths: The strm_issue and strm_dump kernels generate and consume streaming data, demonstrating how to test high-bandwidth links without external equipment.
  • Clock Domain Crossing: The configuration shows connections between the Aurora core's clock domain and the user logic clock domain — a common source of subtle bugs in high-speed designs.

Cross-Module Dependencies

This module sits at the intersection of several larger ecosystems:

  • Vitis_HLS_Tutorials: Shares the HLS toolchain but focuses on language features and pragmas rather than system integration. The convolution tutorial here uses techniques demonstrated there, but adds the host-kernel integration layer.

  • Hardware_Acceleration_Feature_Tutorials: Explores specific Vitis features (debugging, RTL kernel integration, multi-CU dispatch). The convolution tutorial's Filter2DDispatcher demonstrates a production-ready version of the multi-CU dispatch pattern described there.

  • AI_Engine_Development/AIE: For workloads that don't fit the HLS model (especially DSP-heavy signal processing with complex dataflow), the AIE (AI Engine) offers a different paradigm. The convolution tutorial here is a "pure HLS" approach; for very large filters or multi-channel video streams, an AIE implementation might be more efficient.

Design Tradeoffs and Philosophy

Throughout these tutorials, several recurring design tensions appear:

  1. Abstraction vs. Control: HLS provides high-level C++ abstraction, but achieving optimal performance requires understanding the underlying hardware (pipeline stages, memory ports, DSP48s). The tutorials show the "pragma-augmented" middle path — C++ with hardware hints rather than raw RTL.

  2. Host-Visible vs. Kernel-Autonomous: The convolution tutorial keeps the host deeply involved (dispatching individual frames), while the Aurora tutorial is more autonomous (streams flow without per-packet host intervention). This reflects the fundamental difference between "accelerator" (host-driven) and "smart NIC" (autonomous) architectures.

  3. Throughput vs. Latency: The TSP tutorial sacrifices latency (it takes time to evaluate all permutations) for throughput (evaluating many permutations in parallel via pipelining). The convolution tutorial optimizes for sustained throughput of video frames. Understanding which metric matters for your use case is critical — these tutorials demonstrate both strategies.

  4. Portability vs. Optimization: The host_randomized.cpp variant of the convolution tutorial exists precisely for portability — it removes the OpenCV dependency at the cost of less realistic input data. This is a common pattern: provide a "full featured" path and a "minimal dependency" path.

What New Contributors Should Watch Out For

If you're joining the team to work on these tutorials or extend them, here are the non-obvious gotchas:

HLS Kernel Development:

  • DATAFLOW vs. PIPELINE: DATAFLOW enables task-level parallelism (concurrent functions), while PIPELINE enables loop-level parallelism (overlapping loop iterations). Mixing them incorrectly causes "stalled pipeline" warnings in the HLS console. The convolution tutorial uses DATAFLOW at the top level and PIPELINE inside the window processing loops.

  • Stream Depth Matters: The hls::stream template has a default depth that may be too shallow for the rate mismatch between producer and consumer. If a producer writes faster than a consumer reads and the FIFO fills, the producer stalls, killing throughput. The convolution tutorial sets explicit depths (hls::stream<char,2>, hls::stream<U8,64>) based on the producer-consumer rate ratios.

  • Alignment Assertions: Notice the assert(stride%64 == 0) in ReadFromMem. This isn't just defensive programming — it ensures the AXI4 interface can use 512-bit (64-byte) bursts, maximizing memory bandwidth. Violating this silently degrades performance by 8x or more.

Host Application Development:

  • Buffer Pinning vs. Migration: The Filter2DRequest constructor uses enqueueMigrateMemObjects with CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED after setArg calls. This is a subtle optimization: setArg binds buffers to specific memory banks (pinning), then migration makes them resident in those banks without a copy (since content is undefined/irrelevant). Getting this order wrong causes extra data copies.

  • Out-of-Order Queues: The host code creates the command queue with cl::QueueProperties::OutOfOrder. This allows the runtime to overlap data transfers and kernel execution — essential for the software pipelining pattern. Using an in-order queue would serialize these operations, destroying throughput.

  • Event Chaining: Notice how events vectors are passed to enqueueWriteBuffer, enqueueTask, and enqueueReadBuffer. This creates explicit dependencies: the kernel can't start until writes complete, and the read can't start until the kernel completes. The Filter2DDispatcher relies on this for correctness when issuing overlapping requests.

System Integration:

  • Emulation Mode Detection: The host code checks getenv("XCL_EMULATION_MODE") to conditionally print timing info. This matters because emulation flows don't report accurate hardware timing. Missing this check causes confusing "0 MB/s" throughput reports in emulation.

  • OpenCV Dependency Management: There are two host variants — one with OpenCV (image I/O) and one randomized (no dependencies). When building on headless servers without OpenCV, you must use the randomized version or the build fails with missing headers.

With this context established, the rest of this page walks through the end-to-end data flow, the key design decisions, and the detailed sub-module documentation for each tutorial track.

Data Flow Architecture

The following diagram illustrates the end-to-end data flow for the convolution tutorial (the most complex of the three), showing how data moves from host memory through the FPGA and back:

```mermaid
graph LR
    subgraph "Host Memory"
        A[Input Image Y/U/V Planes]
        B[Filter Coefficients]
        C[Output Image]
    end
    subgraph "PCIe/XRT"
        D[OpenCL Buffer cl::Buffer]
        E[Command Queue Out-of-Order]
        F[Event Dependencies]
    end
    subgraph "FPGA Global Memory"
        G[DDR Bank 0 src_buffer]
        H[DDR Bank 1 coef_buffer]
        I[DDR Bank 2 dst_buffer]
    end
    subgraph "Filter2DKernel HLS DATAFLOW"
        J[ReadFromMem AXI4-Full]
        K[Window2D Line Buffers]
        L[Filter2D Convolution MAC]
        M[WriteToMem AXI4-Full]
    end
    A -->|enqueueWriteBuffer| D
    B -->|enqueueWriteBuffer| D
    D -->|Migrate| G
    D -->|Migrate| H
    G -->|AXI4-Full burst=512b| J
    H -->|coeff_stream| J
    J -->|pixel_stream| K
    K -->|window_stream| L
    L -->|pixel_stream| M
    M -->|AXI4-Full burst=512b| I
    I -->|Migrate| D
    D -->|enqueueReadBuffer| C
    E -->|Event Chaining| F
    F -.->|Wait for| J
    F -.->|Wait for| K
    F -.->|Wait for| L
    F -.->|Wait for| M
```

This architecture embodies several key design principles demonstrated across all tutorials:

  1. Decoupled Producer-Consumer Stages: Each stage in the DATAFLOW region operates independently, pulling data from input streams and pushing to output streams. This decouples timing between stages — a slow memory read doesn't stall the compute unit if the stream FIFO has depth.

  2. Burst-Based Memory Access: The ReadFromMem and WriteToMem functions are designed to issue wide, aligned burst transfers (512-bit / 64-byte). This amortizes the ~100ns latency of DDR access over hundreds of bytes, achieving near-peak bandwidth. The assert(stride%64 == 0) enforces alignment requirements.

  3. Software Pipelining at System Level: The host Filter2DDispatcher maintains multiple Filter2DRequest objects (typically 3), each representing an in-flight transaction. While Request 0's kernel executes, Request 1's input data is being transferred to the FPGA, and Request 2's output data is being transferred back to the host — the classic "double buffering" or "triple buffering" pattern.

  4. Event-Driven Synchronization: OpenCL events (cl::Event) explicitly encode dependencies between operations. The kernel execution event depends on the write completion events; the read event depends on the kernel event. This allows the runtime to optimize scheduling without sequential host-side waiting.

Key Design Decisions

1. HLS DATAFLOW vs. Sequential Execution

Decision: Use #pragma HLS DATAFLOW to enable task-level parallelism across the four sub-functions (ReadFromMem, Window2D, Filter2D, WriteToMem).

Rationale:

  • In a sequential implementation, ReadFromMem would read the entire image before Window2D starts. With large images (1080p), this requires buffering the entire frame on-chip (impossible) or in external memory (bandwidth waste).
  • With DATAFLOW, Window2D starts processing as soon as the first few lines are read. The four functions execute concurrently in a pipeline, with hls::stream FIFOs decoupling their execution rates.

Tradeoff: DATAFLOW requires careful management of stream depths. If the producer writes faster than the consumer reads, and the stream depth is insufficient, the producer stalls. The tutorial sets explicit depths (hls::stream<char,2>, hls::stream<U8,64>) based on the rate mismatch between memory access (burst) and compute (sample-by-sample).

2. Line Buffer Architecture for 2D Convolution

Decision: Implement Window2D using line buffers (BRAM arrays storing FILTER_V_SIZE-1 complete lines) to assemble 2D convolution windows from a 1D input stream.

Rationale:

  • 2D convolution requires accessing pixels from a neighborhood (e.g., 3x3 or 5x5) around each pixel. In a row-major stream, these pixels arrive at different times (the row above arrived width cycles ago).
  • The line buffer acts as a delay line: as pixels stream through, the buffer holds the previous FILTER_V_SIZE-1 rows. When a new pixel arrives, it can be combined with buffered pixels from the lines above to form the complete window.

Tradeoff: Line buffers consume significant BRAM. For a 1080p image with 3 line buffers, each holding 1920 bytes, that's ~6KB per color plane — acceptable for small filters but scaling to larger kernels or 4K video requires careful BRAM budgeting. The tutorial uses #pragma HLS ARRAY_PARTITION on the line buffer dimension to ensure parallel access to all lines for window formation.

3. Software Pipelining via Filter2DDispatcher

Decision: Implement the host-side Filter2DDispatcher class to maintain multiple in-flight requests (Filter2DRequest objects), enabling software pipelining where data transfers overlap with kernel execution.

Rationale:

  • Without pipelining, the sequence is: Write input → Run kernel → Read output → (repeat). The FPGA sits idle during host→FPGA transfers and FPGA→host transfers.
  • With maxReqs=3, the dispatcher maintains three request slots. While Request 0 runs on the FPGA, Request 1's input data is being written, and Request 2's previous output is being read. This triple-buffering pattern keeps the FPGA continuously busy.

Tradeoff: Increased host memory usage (3× the buffering) and code complexity. The dispatcher must track which request slot is available using round-robin allocation (cnt%max). Additionally, the out-of-order OpenCL queue is required for this to work — an in-order queue would serialize the operations regardless of the dispatcher's logic.

4. HLS Stream Depth Sizing

Decision: Size hls::stream depths based on the producer-consumer rate mismatch: small depths (2-3) for balanced rates, large depths (64) for bursty producers.

Rationale:

  • Between Filter2D (the compute stage) and WriteToMem (the memory write stage), the rates differ. Filter2D produces one pixel per cycle (II=1), but WriteToMem issues 64-byte bursts to memory, consuming 64 pixels every N cycles (where N depends on memory latency).
  • A deep FIFO (64 entries) absorbs this burstiness, allowing Filter2D to run continuously even when WriteToMem is occasionally stalled waiting for DRAM arbitration.

Tradeoff: Each hls::stream maps to a FIFO implemented in BRAM or LUTRAM. Deep FIFOs consume significant resources. The tutorial carefully sizes streams: the coefficient stream (read once, used many times) is shallow (2), while the output stream (bursty) is deep (64).

5. Build System: TCL-Based HLS Workflows

Decision: Use TCL scripting (build.tcl, hls.tcl, hls_opt.tcl) to orchestrate the HLS synthesis flow, enabling reproducible builds and easy exploration of design space parameters (clock period, target part, optimization directives).

Rationale:

  • HLS design space exploration requires iterating on pragmas, clock constraints, and data types. Manual GUI-based iteration is error-prone and non-reproducible.
  • TCL scripts encode the "golden" build procedure: open_project, set_top, add_files, csynth_design, cosim_design. They can be version-controlled and run in CI/CD pipelines.
  • The tutorial provides both baseline (hls.tcl) and optimized (hls_opt.tcl) versions, showing the progression from functional prototype to production-optimized kernel.

Tradeoff: TCL is less readable than Python-based build systems (like Vitis's newer Python APIs). Error messages from TCL scripts can be cryptic, and debugging HLS synthesis failures often requires reading detailed logs. The tutorials include extensive comments in the TCL files to mitigate this.

Sub-Module Documentation

1. Convolution Tutorial: 2D Filter Pipeline

The convolution_tutorial_filter2d_pipeline sub-module is the flagship tutorial, demonstrating a complete image processing pipeline from host code to HLS kernel. It covers:

  • Hardware Architecture: The line-buffer-based Window2D function and the MAC-based Filter2D function, orchestrated by Filter2DKernel using DATAFLOW.
  • Host Software Architecture: The Filter2DRequest class (single transaction management) and Filter2DDispatcher class (multi-transaction pipelining), showing how to overlap data transfers with computation.
  • Build System: The TCL-based HLS build flow (build.tcl) and the Vitis system integration flow.

2. Traveling Salesperson: Algorithmic Optimization

The traveling_salesperson_hls_and_reference_flow sub-module focuses on combinatorial optimization and demonstrates the progression from reference C++ to optimized HLS:

  • CPU Reference: The main_gold.cpp provides a functional, unoptimized reference for correctness checking.
  • Baseline HLS: The hls.tcl flow synthesizes a naive implementation, establishing a performance baseline.
  • Optimized HLS: The hls_opt.tcl flow demonstrates optimization pragmas (PIPELINE, UNROLL, ARRAY_PARTITION) to achieve target throughput.

This sub-module is essential for understanding how to optimize control-heavy, irregular algorithms (unlike the regular dataflow of image processing).

3. Alveo Aurora: High-Speed Serial Communication

The alveo_aurora_kernel_stream_config sub-module demonstrates network-attached acceleration using the Aurora protocol:

  • GT Transceiver Integration: Shows how to connect HLS kernels to hardened serial transceivers (GTs) for 10/25/100G networking.
  • Stream Connectivity: The krnl_aurora_test.cfg file defines stream connections between strm_issue (traffic generator), krnl_aurora (the network core), and strm_dump (traffic checker).
  • Clock Domain Management: Demonstrates proper handling of Aurora reference clocks (gt_refclk) and init clocks for transceiver stability.

This sub-module is crucial for developers building network-attached accelerators or chip-to-chip communication systems.

Conclusion

The Hardware_Acceleration_Design_Tutorials module is more than a collection of example programs — it's a structured curriculum for learning FPGA acceleration. By progressing through the convolution (stream processing), TSP (algorithmic optimization), and Aurora (network integration) tutorials, developers gain a holistic understanding of the hardware acceleration design space.

The key takeaway is that efficient hardware acceleration requires co-design: the HLS kernel architecture (line buffers, DATAFLOW), the host dispatch strategy (pipelining, event management), and the system connectivity (memory banks, stream depths) must be designed together. These tutorials provide the template for that co-design process.
