DMA Stream Endpoints HLS Kernels

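Overview

The DMA Stream Endpoints are the data entry and exit points of the polyphase channelizer system. They move data between LPDDR memory and the AIE processing pipeline. The two kernels work like the departure and arrival gates of an airport: the Source kernel takes passengers (samples) from the terminal (DDR) and boards them in batches onto the plane (AXI Stream), and the Sink kernel brings them back into the terminal at the destination.

Because the channelizer uses a super-sample-rate architecture with SSR=8, data is stored in DDR in linear order but has to be reorganized into a parallel multi-stream format in transit. This format conversion is the core job of both kernels.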

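dma_stream_src: Data Source Endpoint

Role

dma_stream_src is a read-only data mover. It reads input samples from LPDDR, buffers them in internal BRAM, and sends them out on 7 parallel AXI4-Stream ports.

Why 7 streams rather than 8? The oversampling ratio P/Q = 8/7 sets this: for every 8 output samples the system needs only 7 input samples, because the output rate is expanded by 8/7 relative to the input.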

Core implementation

// ------------------------------------------------------------
// Load Buffer
// ------------------------------------------------------------

void load_buffer( TT_DATA mem[NSTREAM*DEPTH], TT_DATA (&buff)[NSTREAM][DEPTH] )
{
// Load samples in linear order from DDR4, store into separate stream buffers:
  ap_uint<3> ss = 0;
  ap_uint<9> dd = 0;
 LOAD_BUFF: for (int mm=0; mm < DEPTH*NSTREAM; mm++) {
#pragma HLS PIPELINE II=1
    buff[ss][dd] = mem[mm];
    if ( ss == ap_uint<3>(NSTREAM-1) ) {
      ss = 0;
      dd = ( dd == ap_uint<9>(DEPTH-1) ) ? ap_uint<9>(0) : ap_uint<9>(dd + 1);
    }
    else {
      ss = ss + 1;
    }
  }
}

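Key design points:

Implicit transpose: load_buffer performs an implicit transpose between the linear memory (mem) and the per-stream buffer (buff[NSTREAM][DEPTH])
- Data layout in DDR: [sample0, sample1, sample2, ...] (streams interleaved entry by entry)
- Data layout in BRAM: buff[stream][depth] (separated by stream)
Choice of ap_uint bit widths:
- ss is 3 bits wide (ap_uint<3>): range 0-7, which covers the stream indices 0-6 for NSTREAM=7
- dd is 9 bits wide (ap_uint<9>): range 0-511, which covers the depth indices for DEPTH=512
- Declaring the widths explicitly lets the HLS tool generate compact counter and comparison logic
Depth index wrap-around with the ternary operator: dd = ( dd == ap_uint<9>(DEPTH-1) ) ? ap_uint<9>(0) : ap_uint<9>(dd + 1)
- The loop itself is bounded by mm; this expression only returns dd to 0 after it reaches DEPTH-1
- Constructing ap_uint<9>(0) and ap_uint<9>(dd + 1) explicitly gives both branches the same type and avoids extra logic from implicit conversions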
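
The transpose described above can be checked in plain software. The following is a minimal sketch, not the kernel itself: it assumes NSTREAM=7 and DEPTH=512 as in the Source kernel, and it uses a hypothetical 64-bit Word type in place of the 128-bit TT_DATA so that it compiles without the HLS headers. The name load_buffer_model is also made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Parameters of the Source kernel as described in the text.
constexpr int NSTREAM = 7;
constexpr int DEPTH   = 512;

// Hypothetical stand-in for TT_DATA (the real type is 128 bits wide).
using Word   = std::uint64_t;
using Buffer = std::array<std::array<Word, DEPTH>, NSTREAM>;

// Software model of load_buffer: entry mm of the linear memory ends up in
// buff[mm % NSTREAM][mm / NSTREAM], which is what the ss/dd counters compute.
Buffer load_buffer_model(const std::vector<Word>& mem)
{
    assert(mem.size() == static_cast<std::size_t>(NSTREAM) * DEPTH);
    Buffer buff{};
    int ss = 0;  // stream index, wraps at NSTREAM-1
    int dd = 0;  // depth index, advances once per full round of streams
    for (int mm = 0; mm < NSTREAM * DEPTH; ++mm) {
        buff[ss][dd] = mem[mm];
        if (ss == NSTREAM - 1) {
            ss = 0;
            dd = (dd == DEPTH - 1) ? 0 : dd + 1;
        } else {
            ++ss;
        }
    }
    return buff;
}
```

Filling mem with its own indices makes the mapping easy to inspect: entry 7 should appear at buff[0][1], and entry 73 at buff[3][10].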

Wrapper function and interface definition

void dma_stream_src_wrapper( TT_DATA mem[NSTREAM*DEPTH], int loop_cnt, TT_STREAM sig_o[NSTREAM] )
{
#pragma HLS interface m_axi      port=mem         bundle=gmem    offset=slave   depth=DEPTH*NSTREAM
#pragma HLS interface axis       port=sig_o
#pragma HLS interface s_axilite  port=loop_cnt    bundle=control
#pragma HLS interface s_axilite  port=mem         bundle=control
#pragma HLS interface s_axilite  port=return      bundle=control
#pragma HLS DATAFLOW

  // Internal buffer:
  TT_DATA buff[NSTREAM][DEPTH];
#pragma HLS array_partition variable=buff dim=1

  // Front end load from DDR4 to PL BRAM:
  load_buffer( mem, buff );

  // Back-end transmit from PL BRAM contents:
  transmit( buff, sig_o, loop_cnt );
}

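Interface configuration:

Pragma	Meaning	Hardware mapping
`m_axi port=mem bundle=gmem`	AXI4 (full) master interface connected to DDR	High-performance memory access port
`axis port=sig_o`	AXI4-Stream output interface (the kernel drives the stream)	Streaming data output to downstream logic
`s_axilite bundle=control`	AXI4-Lite register interface	Host-controlled parameters (loop_cnt, mem base address)
`DATAFLOW`	Enables task-level parallelism	load_buffer and transmit run as concurrent processes connected through buff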

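Why array partitioning matters: #pragma HLS array_partition variable=buff dim=1 splits the 7 rows of buff[7][512] into independent BRAM banks, so that transmit can read one entry from each of the 7 streams in the same cycle without port contention.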

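Data transfer timing

Cycle:     0    1    2    ...
sig_o[0]:  S0   S7   S14  ...
sig_o[1]:  S1   S8   S15  ...
...
sig_o[6]:  S6   S13  S20  ...

Each cycle outputs 7 entries, one per stream, which amounts to a round-robin distribution of 7 consecutive entries from DDR. Each entry Sn is one 128-bit TT_DATA word.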
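
The transmit function is not listed in this section, so the following is only a sketch of the behaviour that the timing diagram implies. It assumes that transmit replays the whole buffer loop_cnt times and emits one entry per stream per cycle. The names transmit_model, Word, and Streams are hypothetical, and std::vector stands in for the AXI4-Stream ports.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int NSTREAM = 7;
constexpr int DEPTH   = 512;

using Word    = std::uint64_t;  // hypothetical stand-in for the 128-bit TT_DATA
using Buffer  = std::array<std::array<Word, DEPTH>, NSTREAM>;
using Streams = std::array<std::vector<Word>, NSTREAM>;  // models sig_o[NSTREAM]

// Assumed behaviour of transmit: for each of loop_cnt passes, walk the depth
// index from 0 to DEPTH-1 and push one entry onto every stream.
Streams transmit_model(const Buffer& buff, int loop_cnt)
{
    assert(loop_cnt >= 0);
    Streams sig_o;
    for (int ll = 0; ll < loop_cnt; ++ll) {
        for (int dd = 0; dd < DEPTH; ++dd) {        // one iteration models one cycle
            for (int ss = 0; ss < NSTREAM; ++ss) {  // parallel across streams in hardware
                sig_o[ss].push_back(buff[ss][dd]);
            }
        }
    }
    return sig_o;
}
```

If buff[ss][dd] holds the linear index dd * NSTREAM + ss, stream 0 carries 0, 7, 14, ... and stream 6 carries 6, 13, 20, ..., matching the diagram.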

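dma_stream_snk: Data Sink Endpoint

Role

dma_stream_snk mirrors the Source, with two additional features:

Loop selection: it can save one specific iteration out of several (for debugging and verification)
DFT permutation mode: it supports two readback orders to match the output ordering of the DFT

Core implementation

Stream capture stage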

void capture_streams( TT_DATA (&buff)[NSTREAM][DEPTH], TT_STREAM sig_i[NSTREAM],
                      const int& loop_sel, const int& loop_cnt )
{
 CAPTURE: for (int ll=0; ll < loop_cnt; ll++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=8
  SAMPLE_IN: for (int dd=0; dd < DEPTH; dd++) {
#pragma HLS pipeline II=1
    STREAM_IN: for (int ss=0; ss < NSTREAM; ss++) {
        TT_DATA val = sig_i[ss].read();
        if ( ll == loop_sel ) {
          buff[ss][dd] = val;
        }
      } // ss
    }  //dd
  } // ll
}

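Role of LOOP_TRIPCOUNT: it gives the HLS tool an estimated range for the loop count (min=1, max=8), which the tool uses for its latency and performance reports. It does not limit the loop count at run time; the actual loop_cnt is passed in through the AXI-Lite interface.

Hardware cost of the conditional write: if ( ll == loop_sel ) is evaluated on every cycle, but BRAM is written only on a match. This adds a comparison and some gating logic, but because the BRAM write-enable signal supports conditional writes natively, the actual overhead is small. Every stream is still read on every iteration, so iterations that are not selected are drained and discarded.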
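
The loop-selection behaviour can be modelled in software as follows. This is a sketch under stated assumptions: std::deque stands in for the AXI4-Stream inputs, Word is a hypothetical 64-bit substitute for TT_DATA, and an empty stream triggers an assertion where the hardware would simply block.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <deque>

constexpr int NSTREAM = 8;  // Sink kernel
constexpr int DEPTH   = 512;

using Word    = std::uint64_t;  // hypothetical stand-in for the 128-bit TT_DATA
using Buffer  = std::array<std::array<Word, DEPTH>, NSTREAM>;
using Streams = std::array<std::deque<Word>, NSTREAM>;  // models sig_i[NSTREAM]

// Model of capture_streams: every entry is consumed, but only the iteration
// whose index equals loop_sel is stored in the buffer.
void capture_streams_model(Buffer& buff, Streams& sig_i, int loop_sel, int loop_cnt)
{
    for (int ll = 0; ll < loop_cnt; ++ll) {
        for (int dd = 0; dd < DEPTH; ++dd) {
            for (int ss = 0; ss < NSTREAM; ++ss) {
                assert(!sig_i[ss].empty());  // hardware would stall here instead
                const Word val = sig_i[ss].front();
                sig_i[ss].pop_front();
                if (ll == loop_sel) {
                    buff[ss][dd] = val;  // conditional write, as in the kernel
                }
            }
        }
    }
}
```

With three frames queued and loop_sel = 1, the buffer ends up holding only the second frame, while all three frames are removed from the streams.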

Buffer readback and permutation

void read_buffer( TT_DATA mem[NSTREAM*DEPTH], TT_DATA (&buff)[NSTREAM][DEPTH], int dft_perm )
{
  ap_uint<3> ss = 0;
  ap_uint<9> dd = 0;
 READ_BUFF: for (int mm=0; mm < DEPTH*NSTREAM; mm++) {
#pragma HLS PIPELINE II=1
    mem[mm] = buff[ss][dd];
    if ( dft_perm == 0 ) {
      // Normal linear order
      if ( ss == ap_uint<3>(NSTREAM-1) ) {
        ss = 0;
        dd = ( dd == ap_uint<9>(DEPTH-1) ) ? ap_uint<9>(0) : ap_uint<9>(dd + 1);
      }
      else {
        ss = ss + 1;
      }
    }
    else  {
      // Permutation model (for DFT output):
      // Data from DFT comes out 4 samples at a time on even streams first, followed by odd streams second
      if ( ss == ap_uint<3>(NSTREAM-2) ) {    // Last even stream
        ss = 1;
      }
      else if ( ss == ap_uint<3>(NSTREAM-1) ) { // Last odd stream
        ss = 0;
        dd = ( dd == ap_uint<9>(DEPTH-1) ) ? ap_uint<9>(0) : ap_uint<9>(dd + 1);
      }
      else {
        ss = ss + 2;
      }
    }
  }
}

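Comparison of the two readback modes:

Mode	Readback order	Use case
`dft_perm = 0`	0→1→2→3→4→5→6→7→0→1...	Standard filter bank output
`dft_perm = 1` (any nonzero value)	0→2→4→6→1→3→5→7→0→2...	DFT output (even streams first, then odd streams)

This output pattern of the DFT stems from its internal 4×4 tile array structure: the bottom two rows of tiles produce the even-indexed outputs, and the top two rows produce the odd-indexed outputs.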
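
The two stream orders can be reproduced by isolating the ss update logic of read_buffer. The helper below is a sketch written for this purpose; the name stream_order is hypothetical, and plain int replaces ap_uint<3>.

```cpp
#include <cassert>
#include <vector>

constexpr int NSTREAM = 8;  // Sink kernel

// Returns the stream index visited at each of the first `count` steps,
// following the same update rules as the ss counter in read_buffer.
std::vector<int> stream_order(int dft_perm, int count)
{
    assert(count >= 0);
    std::vector<int> order;
    int ss = 0;
    for (int i = 0; i < count; ++i) {
        order.push_back(ss);
        if (dft_perm == 0) {
            ss = (ss == NSTREAM - 1) ? 0 : ss + 1;  // linear order
        } else if (ss == NSTREAM - 2) {
            ss = 1;                                 // last even stream, switch to odd
        } else if (ss == NSTREAM - 1) {
            ss = 0;                                 // last odd stream, next depth row
        } else {
            ss += 2;                                // stay within even or odd streams
        }
    }
    return order;
}
```

Calling stream_order(1, 10) yields 0, 2, 4, 6, 1, 3, 5, 7, 0, 2, and stream_order(0, 9) yields 0 through 7 followed by 0.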

Memory model and resource usage

BRAM requirement

Each DMA kernel maintains an internal buff[NSTREAM][DEPTH] buffer:

static constexpr int NSTREAM = 7;    // Source (dma_stream_src)
static constexpr int NSTREAM = 8;    // Sink (dma_stream_snk); each kernel has its own definition
static constexpr int DEPTH   = 512;
static constexpr int NBITS   = 128;  // ap_uint<128>

// Source BRAM bits: 7 × 512 × 128 = 458,752 bits = 57,344 bytes = 56 KiB
// Sink BRAM bits:   8 × 512 × 128 = 524,288 bits = 65,536 bytes = 64 KiB

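On Xilinx Versal devices, BRAM is organized in 36 Kb or 18 Kb blocks. By capacity alone, the Source buffer needs at least 13 BRAM36 blocks (458,752 / 36,864 ≈ 12.4) and the Sink buffer at least 15 (524,288 / 36,864 ≈ 14.2). The count that the tool reports can be higher, depending on how it maps the 128-bit word width onto the blocks and on any double buffering inserted under DATAFLOW.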
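
The buffer sizes can be worked out at compile time. The sketch below computes the raw size of each buffer and a capacity-only lower bound on the number of BRAM36 blocks. It deliberately ignores width mapping and double buffering, so real utilization may be higher. The function names are hypothetical.

```cpp
constexpr long DEPTH       = 512;
constexpr long NBITS       = 128;        // width of one buffer entry
constexpr long BRAM36_BITS = 36 * 1024;  // capacity of one 36 Kb block

// Total storage in bits for a buffer with nstream rows.
constexpr long buffer_bits(long nstream) { return nstream * DEPTH * NBITS; }

// Same quantity in KiB (1 KiB = 1024 bytes).
constexpr long buffer_kib(long nstream) { return buffer_bits(nstream) / 8 / 1024; }

// Capacity-only lower bound on BRAM36 blocks (ceiling division).
constexpr long min_bram36(long nstream)
{
    return (buffer_bits(nstream) + BRAM36_BITS - 1) / BRAM36_BITS;
}

static_assert(buffer_bits(7) == 458752, "Source buffer size in bits");
static_assert(buffer_bits(8) == 524288, "Sink buffer size in bits");
static_assert(buffer_kib(7) == 56, "Source buffer is 56 KiB");
static_assert(buffer_kib(8) == 64, "Sink buffer is 64 KiB");
```

Under these assumptions min_bram36(7) evaluates to 13 and min_bram36(8) to 15.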

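Bandwidth analysis

Source output bandwidth:

7 streams × 128 bits/stream × 312.5 MHz = 280 Gbps = 35 GB/s
Equivalent sample rate: 35 GB/s ÷ 4 bytes/sample (cint16) = 8.75 Gsamples/s

This is close to the theoretical requirement of 10.5 GSPS × 7/8 = 9.1875 Gsamples/s, though about 5% below it, so the match is only approximate.

Sink input bandwidth:

8 streams × 128 bits/stream × 312.5 MHz = 320 Gbps = 40 GB/s
Equivalent sample rate: 40 GB/s ÷ 4 bytes/sample = 10 Gsamples/s

This is higher than the Source rate because the IDFT output includes the expansion by the oversampling factor P/Q = 8/7 (8.75 × 8/7 = 10).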
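
The bandwidth figures above follow from three numbers taken from the text: a 312.5 MHz clock, 128 bits per stream per cycle, and 4 bytes per cint16 sample. The sketch below repeats the arithmetic; the constant and function names are hypothetical.

```cpp
constexpr double PL_CLOCK_HZ      = 312.5e6;  // stream clock from the text
constexpr int    BITS_PER_STREAM  = 128;      // one entry per stream per cycle
constexpr int    BYTES_PER_SAMPLE = 4;        // cint16: 16-bit real + 16-bit imaginary

// Aggregate bandwidth in Gbit/s for nstream parallel streams.
constexpr double bandwidth_gbps(int nstream)
{
    return nstream * BITS_PER_STREAM * PL_CLOCK_HZ / 1e9;
}

// Equivalent sample rate in Gsamples/s.
constexpr double sample_rate_gsps(int nstream)
{
    return bandwidth_gbps(nstream) / 8.0 / BYTES_PER_SAMPLE;
}

// Ratio of Sink to Source sample rate; expected to equal P/Q = 8/7.
constexpr double rate_ratio()
{
    return sample_rate_gsps(8) / sample_rate_gsps(7);
}
```

With these inputs the Source works out to 280 Gbps and 8.75 Gsamples/s, the Sink to 320 Gbps and 10 Gsamples/s, and the ratio between them to 8/7.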

Usage and configuration

Typical control flow

// Host code example (conceptual); device, uuid, bo_in, and bo_out are assumed to be set up beforehand
int main() {
    auto src_kernel = xrt::kernel(device, uuid, "dma_stream_src_wrapper");
    auto snk_kernel = xrt::kernel(device, uuid, "dma_stream_snk_wrapper");

    // Configure source: repeat input buffer 4 times
    auto src_run = xrt::run(src_kernel);
    src_run.set_arg(0, bo_in);   // mem (input buffer)
    src_run.set_arg(1, 4);       // loop_cnt = 4

    // Configure sink: capture the 2nd iteration, use DFT permutation
    auto snk_run = xrt::run(snk_kernel);
    snk_run.set_arg(0, bo_out);  // output buffer
    snk_run.set_arg(1, 1);       // loop_sel = 1 (capture 2nd iteration)
    snk_run.set_arg(2, 4);       // loop_cnt = 4
    snk_run.set_arg(3, 1);       // dft_perm = 1 (DFT mode)

    // Launch kernels
    src_run.start();
    snk_run.start();

    // Wait for both kernels to finish
    src_run.wait();
    snk_run.wait();
}

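Debugging tips

Verify data integrity: set loop_cnt=1 and compare the input and output files
Check multi-frame processing: set loop_cnt>1 and verify that every frame is consistent
Isolate AIE problems: use the External Traffic Generator to bypass the AIE and connect Source → Sink directly

Potential pitfalls

1. LOOP_TRIPCOUNT does not match actual usage

LOOP_TRIPCOUNT is only a hint for the HLS tool's analysis and does not affect the generated hardware. If the stated range differs greatly from the loop count used at run time, the latency and throughput estimates in the synthesis report become inaccurate. The pipeline structure itself is not affected.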

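2. AXI4 burst length limit

Bursts on the m_axi interface are limited by the AXI4 protocol to at most 256 beats. For a large buffer (7×512 = 3584 words of 128 bits), Vitis HLS automatically splits the transfer into multiple bursts, at least 3584 / 256 = 14 of them, and this adds command overhead.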
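
The number of bursts needed for one buffer can be estimated with a ceiling division. The sketch below assumes one 128-bit word per beat and uses the protocol maximum of 256 beats. The second cap of 16 beats is included purely as an example of a smaller limit, and the function name is hypothetical.

```cpp
constexpr int AXI4_MAX_BURST_BEATS = 256;  // protocol limit for AXI4 INCR bursts

// Minimum number of bursts needed to move `words` beats of data when each
// burst carries at most `max_beats` beats (ceiling division).
constexpr int min_bursts(int words, int max_beats)
{
    return (words + max_beats - 1) / max_beats;
}

static_assert(min_bursts(7 * 512, AXI4_MAX_BURST_BEATS) == 14, "Source buffer");
static_assert(min_bursts(8 * 512, AXI4_MAX_BURST_BEATS) == 16, "Sink buffer");
static_assert(min_bursts(7 * 512, 16) == 224, "Source buffer with a 16-beat cap");
```

A lower burst cap multiplies the number of address commands, which is where the extra overhead comes from.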

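3. Stream deadlock risk

If the downstream consumer (permute_fb_i) stops reading, sig_o[ss].write() blocks. Make sure that no stage along the data path applies sustained back-pressure.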