I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU). So, I have two questions:
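For reference, the usual pattern for this is a __constant__ symbol plus cudaMemcpyToSymbol; a minimal sketch, assuming the buffer fits in the 64 KB constant region (the names d_coeffs and N are placeholders):

#include <cuda_runtime.h>

#define N 256                    // hypothetical buffer size; constant memory totals 64 KB
__constant__ float d_coeffs[N];  // lives in the device's constant memory region

__global__ void scale(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) data[i] *= d_coeffs[i];  // reads go through the constant cache
}

int main() {
    float h_coeffs[N];
    for (int i = 0; i < N; ++i) h_coeffs[i] = 2.0f;

    // Copy the host buffer into the constant symbol (not a plain cudaMemcpy).
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, N * sizeof(float));

    float* d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));
    scale<<<1, N>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}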
I am relatively new to CUDA programming. In this blog (How to Access Global Memory Efficiently in CUDA C/C++ Kernels), we have the following: "The device can a
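The blog's point is easiest to see in a kernel where consecutive threads touch consecutive addresses; a minimal sketch contrasting a coalesced and a strided access pattern (kernel names are illustrative):

__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // a warp hits consecutive addresses: coalesced into few transactions
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];   // a warp touches scattered addresses: many separate transactions
}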
cudatoolkit has metadata and a description, and seems to be widely referenced in various installation guides for different libraries, e.g. pytorch. cuda-toolkit
I want to install pytorch3d from source with the following commands, as recommended at Link:
git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorc
Problem statement: There are three variables: a, b, and nk. interval is another variable which stores the value of (b-a)/nk. There is an array named interval_l
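From the description, the array presumably holds the boundaries of the nk subintervals of [a, b]; a sketch of that computation (the name interval_limits and the nk+1 boundary count are assumptions):

#include <cstdio>
#include <vector>

int main() {
    double a = 0.0, b = 1.0;         // example endpoints
    int nk = 4;                      // number of subintervals
    double interval = (b - a) / nk;  // width of each subinterval

    // nk+1 boundaries: a, a+interval, ..., b
    std::vector<double> interval_limits(nk + 1);
    for (int i = 0; i <= nk; ++i)
        interval_limits[i] = a + i * interval;

    for (double x : interval_limits) printf("%f\n", x);
    return 0;
}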
I've been working on a program that requires using an array of structs inside another array of structs, or a structure of arrays; I decided to use this approach give
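For context, the two layouts the question is weighing look roughly like this (the struct and field names are made up):

// Array of structs (AoS): each element carries its fields together.
struct Particle {
    float pos[3];
    float vel[3];
};
Particle particles_aos[1024];   // access: particles_aos[i].pos[0]

// Struct of arrays (SoA): each field is stored contiguously, which
// usually coalesces better on the GPU when a warp reads one field.
struct Particles {
    float pos_x[1024], pos_y[1024], pos_z[1024];
    float vel_x[1024], vel_y[1024], vel_z[1024];
};
Particles particles_soa;        // access: particles_soa.pos_x[i]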
I am accelerating a big application, part of which relies on basic indexing as shown below:

#include <iostream>

void kernel_cpu() {
    for (size_t i=0;
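Since the loop body is cut off here, a generic grid-stride port of such a loop looks like the following sketch (the element-wise body and parameter names are placeholders):

__global__ void kernel_gpu(const float* in, float* out, size_t n) {
    // Grid-stride loop: handles any n regardless of launch configuration.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        out[i] = in[i] * 2.0f;   // placeholder for the original loop body
    }
}

// Launch example: kernel_gpu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);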
I am trying to run the example samples/sgemm.cpp from the CLBlast repo on Windows 10 with an Nvidia graphics card. I have obtained the cl.hpp from the link. The
I have two arrays A and B which hold m and n int values respectively. I have a function f, and I want to generate all f(a, b) where a is in A and b is in B:
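One straightforward mapping is a 2D grid where thread (i, j) computes f(A[i], B[j]); a sketch, assuming f can be made a __device__ function and the m×n results fit in one output array:

__device__ int f(int a, int b) { return a + b; }    // placeholder for the real f

__global__ void all_pairs(const int* A, int m, const int* B, int n, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index into A
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // index into B
    if (i < m && j < n)
        out[i * n + j] = f(A[i], B[j]);             // row-major m x n result
}

// Launch example:
// dim3 block(16, 16);
// dim3 grid((m + 15) / 16, (n + 15) / 16);
// all_pairs<<<grid, block>>>(d_A, m, d_B, n, d_out);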
I have the following kernel:

__global__ void kernel() {
    float loc{269.0f};
    float s{(356.0f - loc) / 13.05f};
    float a{pow(1.0f - 0.15f * s, 1.0f)};
    floa
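If the question is about unexpected results, one thing worth checking is which pow overload is actually used: powf makes the single-precision version explicit, and __powf is the fast-math intrinsic that trades accuracy for speed. A sketch of the same computation with both spelled out:

#include <cstdio>

__global__ void kernel_powf() {
    float loc = 269.0f;
    float s   = (356.0f - loc) / 13.05f;
    float a   = powf(1.0f - 0.15f * s, 1.0f);    // explicit single-precision pow
    float b   = __powf(1.0f - 0.15f * s, 1.0f);  // fast-math intrinsic, lower accuracy
    printf("a=%f b=%f\n", a, b);
}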
I am trying to install the CUDA toolkit in order to be able to use ThunderSVM on my personal computer. However, I keep getting the following message in the GUI i
I've seen a similar post on Stack Overflow which tackles the problem in C++: Parallel implementation for multiple SVDs using CUDA. I want to do exactly the same i
I'm somewhat lost on the point of converting, for instance, cgbn_mem_t<256> into cgbn_mem_t<1024> in device code. Say, the kernel receives two point
I am trying to set up a cross-compile environment on an AWS EC2 Ubuntu box targeting Nvidia Xavier devices on CUDA 10.2. I tried following the "instructions" at
I am having trouble using cuda-gdb. My program starts from Python and loads a shared library containing TensorFlow and CUDA code. The command I used to start
I want to run CUDA code on Google Colab. For that I am following the steps below, but I am not able to install the CUDA packages.
Step 1: Removing previous CUDA vers
I am trying to implement some JPEG-encoding CUDA code based on the sample code below: https://docs.nvidia.com/cuda/nvjpeg/index.html#nvjpeg-encode-examples
I pos
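For orientation, the encode flow in the linked nvJPEG docs is roughly: create a handle, encoder state, and encoder params, call nvjpegEncodeImage on device-resident pixels, then fetch the bitstream with nvjpegEncodeRetrieveBitstream (called twice: once for the size, once for the data). A condensed sketch of that sequence, with error checking and the image upload omitted (the function name encode_rgb and its parameters are illustrative):

#include <cuda_runtime.h>
#include <nvjpeg.h>
#include <vector>

std::vector<unsigned char> encode_rgb(unsigned char* d_rgb, int width, int height,
                                      size_t pitch, cudaStream_t stream) {
    nvjpegHandle_t handle;
    nvjpegEncoderState_t state;
    nvjpegEncoderParams_t params;
    nvjpegCreateSimple(&handle);
    nvjpegEncoderStateCreate(handle, &state, stream);
    nvjpegEncoderParamsCreate(handle, &params, stream);
    nvjpegEncoderParamsSetSamplingFactors(params, NVJPEG_CSS_420, stream);

    nvjpegImage_t source = {};
    source.channel[0] = d_rgb;   // interleaved RGB already in device memory
    source.pitch[0]   = pitch;

    nvjpegEncodeImage(handle, state, params, &source,
                      NVJPEG_INPUT_RGBI, width, height, stream);

    size_t length = 0;
    nvjpegEncodeRetrieveBitstream(handle, state, nullptr, &length, stream);  // query size
    std::vector<unsigned char> jpeg(length);
    cudaStreamSynchronize(stream);
    nvjpegEncodeRetrieveBitstream(handle, state, jpeg.data(), &length, stream);

    nvjpegEncoderParamsDestroy(params);
    nvjpegEncoderStateDestroy(state);
    nvjpegDestroy(handle);
    return jpeg;
}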
While training this code with Ray Tune (1 GPU per trial), after a few hours of training (about 20 trials) a CUDA out-of-memory error occurred on GPUs 0 and 1. And ev
I tried VS2015, 2017, 2019, and 2022 without success; for CMake I also tried 3.14.1 and the latest version. CUDA is available, and VS2019 seems to have also compiled test.cu