Optimizing GPU allocation/transfer of matrix tiles

I am working with very large matrices (>1 GB), but imagine that I have the following matrix:

A = [1 1 2 2;
     1 1 2 2;
     3 3 4 4;
     3 3 4 4]

I need to pin each tile of this matrix so that I can transfer them to the GPU asynchronously (using the CUDA.jl package).
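For context, "pinning" means page-locking host memory so that host-to-device copies can overlap with host work. A minimal sketch of what this looks like, assuming a recent CUDA.jl that provides `CUDA.pin` (older releases expose the lower-level `Mem.register` instead):

```julia
using CUDA

# Sketch: pin a host array, then copy it to the device.
# Copies from pinned memory can run asynchronously w.r.t. the host.
A = rand(Float64, 4, 4)
CUDA.pin(A)                         # page-lock A's memory
d_A = CuArray{Float64}(undef, 4, 4) # device-side destination
copyto!(d_A, A)                     # transfer from pinned host memory
synchronize()                       # wait for the transfer to finish
```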

The following code allocates space for each tile on the GPU, and it works:

function allocGPU!(gpu_buf, m, n)
    # Allocate room for an m×n Float64 tile (8 bytes per element) on the device
    dev_buf = CUDA.Mem.alloc(CUDA.Mem.DeviceBuffer, m*n*8)
    dev_ptr = convert(CuPtr{Float64}, dev_buf)
    push!(gpu_buf, dev_buf)  # keep a reference so the buffer isn't freed

    # Wrap the raw device pointer in a CuArray without copying
    tile_gpu = unsafe_wrap(CuArray{Float64}, dev_ptr, (m, n))

    return tile_gpu
end

A_coor = [(1:2, 1:2) (1:2, 3:4);
          (3:4, 1:2) (3:4, 3:4)]

m, n = 2, 2  # tile dimensions
A_tiles = [A[A_coor[i,j][1], A_coor[i,j][2]] for i in 1:size(A_coor, 1), j in 1:size(A_coor, 2)]
gpu_buf = []
A_tiles_gpu = [allocGPU!(gpu_buf, m, n) for i in 1:size(A_tiles, 1), j in 1:size(A_tiles, 2)]

But this copies each tile into a new object, which takes more time than I would like. Is there any way to wrap an Array around each 2×2 tile in order to reduce the number of allocations?
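One host-side option (a sketch; whether it helps depends on how the tiles are consumed downstream) is to build the tiles as views rather than copies, so no per-tile host storage is allocated:

```julia
# Views share memory with A: no per-tile host allocation or copy occurs.
A = Float64[1 1 2 2; 1 1 2 2; 3 3 4 4; 3 3 4 4]
A_coor = [(1:2, 1:2) (1:2, 3:4);
          (3:4, 1:2) (3:4, 3:4)]
A_tiles = [view(A, A_coor[i,j][1], A_coor[i,j][2])
           for i in 1:size(A_coor, 1), j in 1:size(A_coor, 2)]
A_tiles[2, 1]  # same data as A[3:4, 1:2], without copying
```

The caveat is that such a view is not contiguous in memory (a tile's columns are separated by the full column stride of A), which matters for the GPU transfer step.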

I also tried this line:

A_tiles = [unsafe_wrap(Array{Float64}, pointer(A[A_coor[i,j][1], A_coor[i,j][2]]), (m, n)) for i in 1:size(A_coor, 1), j in 1:size(A_coor, 2)]

but A[...] still allocates a temporary copy, so pointer points at that copy rather than into A (and the copy may be garbage-collected while wrapped).

I also thought of pinning matrix A and then transferring each tile to the GPU with:

copyto!(tile_gpu, A[1:2,1:2])

but I'm guessing Julia will copy A[1:2,1:2] into a new object and then transfer the tile, yielding the same result as the first method.

Edit:

As I suspected,

copyto!(tile_gpu, A[1:2,1:2])

creates a new object in a different memory location. I also tried the @view macro; although it works on the CPU, it doesn't seem to work with copyto! to GPU memory.
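For reference, one workaround is to exploit the fact that each column segment of a tile *is* contiguous in A: pin A once, then copy the tile column by column. A sketch, under the assumption that the `unsafe_copyto!(dst::CuArray, doffs, src::Array, soffs, n)` method of recent CUDA.jl versions is available (with pinned source memory, these copies go through the fast page-locked path):

```julia
using CUDA

A = Float64[1 1 2 2;
            1 1 2 2;
            3 3 4 4;
            3 3 4 4]
CUDA.pin(A)   # page-lock A once; copies from pinned memory can be async

m, n = 2, 2
tile_gpu = CuArray{Float64}(undef, m, n)

# Copy the tile A[3:4, 1:2] without materializing it on the host.
i0, j0 = 3, 1                                  # top-left corner of the tile
for jd in 1:n
    soffs = (j0 + jd - 2) * size(A, 1) + i0    # start of this column segment in A
    doffs = (jd - 1) * m + 1                   # start of this column in tile_gpu
    unsafe_copyto!(tile_gpu, doffs, A, soffs, m)
end
synchronize()
# tile_gpu should now hold the same values as A[3:4, 1:2]
```

This avoids both the host-side tile copy and the non-contiguous-view problem, at the cost of one copy call per tile column.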



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
