OpenCL channels dynamic indexing

I want to implement a systolic structure for matrix multiplication. My objective is to use a single kernel for every processing element, so the host will launch the same kernel multiple times.

To communicate between kernels I would like to use channels or pipes. The problem is that "channels extension does not support dynamic indexing into arrays of channel IDs". The number of kernels will depend on the size of the matrix so I will need some method to connect the channels to the corresponding kernels automatically.

Summarizing, I am looking for a method to create this functionality:

channel float c0[32];

__kernel void producer (__global float * data_in){
    for(int i=0; i<32; i++){
        write_channel_altera(c0[i],data_in[i]);
    }
}

__kernel void consumer (__global float * ret_buf){
    for(int i=0; i<32; i++){
        ret_buf[i]=read_channel_altera(c0[i]);
    }
}

Thanks in advance!



Solution 1:[1]

OpenCL channels (an Intel FPGA extension) do not support "true" dynamic indexing, but in most cases you can work around this limitation with either a switch statement or #pragma unroll.

The switch approach is described in the Intel FPGA SDK for OpenCL Programming Guide:

#pragma OPENCL EXTENSION cl_intel_channels : enable

channel int ch[WORKGROUP_SIZE];

__kernel void consumer() {
    int gid = get_global_id(0);
    int value;

    switch(gid)
    {
        case 0: value = read_channel_intel(ch[0]); break;
        case 1: value = read_channel_intel(ch[1]); break;
        case 2: value = read_channel_intel(ch[2]); break;
        case 3: value = read_channel_intel(ch[3]); break;
        //statements

        case WORKGROUP_SIZE-1: value = read_channel_intel(ch[WORKGROUP_SIZE-1]); break;
    }
}
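Writing the cases by hand gets tedious as the channel count grows. One way to shorten it, sketched here, is to generate each case with a preprocessor macro so that every channel index is still a literal. This is a plain C mock-up that compiles on the host, not device code: NUM_CH, fake_ch, READ_CH, CASE_READ and read_from are hypothetical names, and the array only stands in for the real channels. In a kernel, READ_CH(IDX) would expand to read_channel_intel(ch[IDX]) instead.

```c
#include <assert.h>

#define NUM_CH 4  /* assumed channel count, fixed at compile time */

/* Host-side stand-in for `channel float ch[NUM_CH];` (one value per channel). */
static float fake_ch[NUM_CH] = {10.0f, 11.0f, 12.0f, 13.0f};

/* In a kernel this would be: read_channel_intel(ch[IDX]). */
#define READ_CH(IDX) (fake_ch[IDX])

/* Emits one switch case; IDX is always a literal, never a variable. */
#define CASE_READ(IDX) case IDX: value = READ_CH(IDX); break;

/* Maps a run-time id (e.g. get_global_id(0)) to a constant channel index. */
float read_from(int id) {
    float value = 0.0f;
    assert(id >= 0 && id < NUM_CH);  /* ids outside the compiled range are a bug */
    switch (id) {
        CASE_READ(0)
        CASE_READ(1)
        CASE_READ(2)
        CASE_READ(3)
    }
    return value;
}
```

The list of CASE_READ lines still has to match NUM_CH, so the number of channels remains a compile-time decision; the macro only removes the repetition.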

You can also use #pragma unroll if you have a loop over channels. With a constant loop bound, full unrolling turns every ch[i] into an access with a constant index:

__kernel void consumer() {
    int values[WORKGROUP_SIZE];
    #pragma unroll
    for (int i = 0; i < WORKGROUP_SIZE; ++i) {
        values[i] = read_channel_intel(ch[i]);
    }
}

Solution 2:[2]

As far as I know, the maximum number of channels has to be known before the program is compiled for the board, because an FPGA cannot allocate resources on the fly the way other computing systems do. Once we know that maximum, we can use

#pragma unroll

before the loop that reads from or writes to the channels.
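A minimal sketch of that idea, again as host-compilable C rather than device code: the loop bound is a compile-time maximum, and the run-time size only masks off the unused channels. MAX_CH, mock_ch, READ_AT and read_active are hypothetical names, and the array emulates the channels. On the device the loop would carry #pragma unroll and READ_AT(i) would be read_channel_intel(ch[i]).

```c
#include <assert.h>

#define MAX_CH 8  /* assumed upper bound, chosen before compiling for the board */

/* Host-side stand-in for `channel float ch[MAX_CH];`. */
static float mock_ch[MAX_CH] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};

/* In a kernel this would be: read_channel_intel(ch[i]). */
#define READ_AT(i) (mock_ch[i])

/* Reads the first n_active channels into out[] and returns how many were read.
   The loop bound is the constant MAX_CH, so a device compiler could fully
   unroll it and see a constant index in every copy of the body. */
int read_active(int n_active, float out[MAX_CH]) {
    int count = 0;
    assert(n_active >= 0 && n_active <= MAX_CH);
    /* device version: place #pragma unroll directly above this loop */
    for (int i = 0; i < MAX_CH; ++i) {
        if (i < n_active) {  /* run-time guard, not a run-time index */
            out[i] = READ_AT(i);
            ++count;
        }
    }
    return count;
}
```

Calling read_active(3, out) fills out[0] through out[2] and leaves the remaining slots untouched, so a matrix smaller than the compiled maximum simply uses fewer of the channels.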

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrew Savonichev
Solution 2 Raghuttam Hombal