Category "cuda"

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU). So, I have two questions: …
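
A minimal CUDA-side sketch, assuming a made-up 256-float table and kernel: the __constant__ symbol is declared at file scope and populated from the host with cudaMemcpyToSymbol (in OpenCL the usual counterpart is an ordinary buffer bound to a __constant-qualified kernel parameter).

    // Minimal CUDA sketch: a hypothetical 256-float table copied into __constant__ memory.
    #include <cuda_runtime.h>
    #include <cstdio>

    __constant__ float d_table[256];          // lives in the device's constant memory

    __global__ void read_table(float *out) {
        int i = threadIdx.x;
        out[i] = d_table[i];                  // all threads read through the constant cache
    }

    int main() {
        float h_table[256];
        for (int i = 0; i < 256; ++i) h_table[i] = static_cast<float>(i);

        // Host-side population: copy to the symbol, not to an ordinary device pointer.
        cudaMemcpyToSymbol(d_table, h_table, sizeof(h_table));

        float *d_out;
        cudaMalloc(&d_out, sizeof(h_table));
        read_table<<<1, 256>>>(d_out);

        float h_out[256];
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("h_out[10] = %f\n", h_out[10]);
        cudaFree(d_out);
        return 0;
    }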

What are CUDA Global Memory 32-, 64- and 128-byte transactions?

I am relatively new to CUDA programming. In this blog (How to Access Global Memory Efficiently in CUDA C/C++ Kernels), we have the following: "The device can …"
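
The 32-, 64- and 128-byte figures describe how a warp's loads are grouped into memory transactions; the sketch below (array names and launch parameters are illustrative) contrasts a coalesced access pattern with a strided one that forces extra transactions.

    // Sketch contrasting access patterns that map to few vs. many memory transactions.
    __global__ void coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];        // consecutive threads touch consecutive 4-byte words:
                                   // a warp's 32 loads fit in one 128-byte transaction
    }

    __global__ void strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];   // with a large stride each thread hits a different
                                       // 32-byte segment, so the warp needs many transactions
    }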

What is the difference between the Conda packages nvidia::cudatoolkit and nvidia::cuda-toolkit?

cudatoolkit has metadata and a description, and seems to be widely referenced in various installation guides for different libraries, e.g. pytorch. cuda-toolkit …

nvcc not found when installing from source

I want to install pytorch3d from source with the following command as recommended at Link: git clone https://github.com/facebookresearch/pytorch3d.git cd pytorc…

CUDA Unified Memory compilation error: type mismatch

Problem statement: There are three variables, a, b and nk. interval is another variable which stores the value of (b-a)/nk. There is an array named interval_l …
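
Without the full listing the exact mismatch can only be guessed at, but a minimal sketch of the usual cudaMallocManaged pattern, reusing the question's variable names and assuming double for the element type, looks like this:

    // Minimal Unified Memory sketch (variable names follow the question; sizes are assumed).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int nk = 10;
        const double a = 0.0, b = 1.0;
        const double interval = (b - a) / nk;

        double *interval_l = nullptr;
        // cudaMallocManaged takes a pointer-to-pointer; mismatching this type
        // (or the element type) is a common source of the compiler error.
        cudaMallocManaged(&interval_l, (nk + 1) * sizeof(double));

        for (int i = 0; i <= nk; ++i)
            interval_l[i] = a + i * interval;   // accessible from host and device alike

        printf("interval_l[nk] = %f\n", interval_l[nk]);
        cudaFree(interval_l);
        return 0;
    }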

CUDA Memory Allocation for AoS inside a SoA

I've been working on a program that requires using an array of structs inside another array of structs or a structure of arrays. I decided to use this approach given …
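
One common way to lay this out, sketched below with made-up names and sizes, is to allocate each inner device array from the host, record the device pointers in the outer structs, and only then copy the outer array across:

    // Sketch of one possible layout: an outer array of structs whose elements each
    // own a device array of inner structs. Names and sizes are illustrative.
    #include <cuda_runtime.h>

    struct Inner { float x, y; };

    struct Outer {
        Inner *items;   // device pointer to this element's array of Inner
        int    count;
    };

    int main() {
        const int n_outer = 4, n_inner = 16;

        // Build the outer array on the host, allocating each inner array on the device.
        Outer h_outer[n_outer];
        for (int i = 0; i < n_outer; ++i) {
            h_outer[i].count = n_inner;
            cudaMalloc(&h_outer[i].items, n_inner * sizeof(Inner));
        }

        // Copy the outer array (now holding device pointers) to the device.
        Outer *d_outer;
        cudaMalloc(&d_outer, n_outer * sizeof(Outer));
        cudaMemcpy(d_outer, h_outer, n_outer * sizeof(Outer), cudaMemcpyHostToDevice);

        // ... launch kernels that index d_outer[i].items[j] ...

        for (int i = 0; i < n_outer; ++i) cudaFree(h_outer[i].items);
        cudaFree(d_outer);
        return 0;
    }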

Incorrect CUDA kernel output

I am accelerating a big application, part of which relies on basic indexing as shown below: #include <iostream> void kernel_cpu() { for (size_t i=0; …
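
For reference, the usual way such a flat CPU loop is mapped onto a CUDA grid looks like the sketch below (the loop body is a placeholder, since the original computation is truncated here):

    // Hedged sketch: one global index per loop iteration, with a bounds guard.
    __global__ void kernel_gpu(const float *in, float *out, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n)                    // guard: the grid is usually rounded up past n
            out[i] = in[i] * 2.0f;    // stands in for the real computation
    }

    // Launch with enough blocks to cover all n elements:
    //   size_t threads = 256;
    //   size_t blocks  = (n + threads - 1) / threads;
    //   kernel_gpu<<<blocks, threads>>>(d_in, d_out, n);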

CLBlast library not working on Mingw-w64 with Nvidia GPUs

I am trying to run the example samples/sgemm.cpp from the CLBlast repo on Windows 10 with an Nvidia graphics card. I have obtained the cl.hpp from the link. The …

How can I implement a parallel Cartesian product efficiently using C++ CUDA?

I have two arrays A and B which hold m and n int values respectively. I have a function f, and I want to generate all f(a, b) where a is in A and b is in B: …
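
A common starting point, sketched below with a placeholder f and a flattened row-major output, is a 2D grid in which thread (i, j) computes f(A[i], B[j]):

    // Sketch of a 2D-grid Cartesian product: thread (i, j) computes f(A[i], B[j]).
    __device__ int f(int a, int b) { return a + b; }     // stand-in for the real function

    __global__ void cartesian(const int *A, const int *B, int *out, int m, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // index into A
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // index into B
        if (i < m && j < n)
            out[i * n + j] = f(A[i], B[j]);              // row-major m x n result
    }

    // Launch example:
    //   dim3 block(16, 16);
    //   dim3 grid((m + 15) / 16, (n + 15) / 16);
    //   cartesian<<<grid, block>>>(d_A, d_B, d_out, m, n);

Note that for large m and n the output itself is m x n elements, so the kernel is typically memory-bound and the main tuning knob is how the result is written back or consumed.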

Changing one part of the code affects another part when compiling with the use_fast_math flag

I have the following kernel: __global__ void kernel() { float loc{269.0f}; float s{(356.0f - loc) / 13.05f}; float a{pow(1.0f - 0.15f * s, 1.0f)}; …
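
A likely factor is that --use_fast_math maps powf onto the __powf intrinsic (and also changes contraction and flush-to-zero behaviour), so a small sketch reusing the question's constants and printing both variants can show whether that substitution accounts for the difference:

    // Sketch for isolating the fast-math effect: compare the accurate powf with the
    // __powf intrinsic that --use_fast_math substitutes for it.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void compare_pow() {
        float loc = 269.0f;
        float s   = (356.0f - loc) / 13.05f;
        float accurate = powf(1.0f - 0.15f * s, 1.0f);    // library version
        float fast     = __powf(1.0f - 0.15f * s, 1.0f);  // intrinsic used by fast math
        printf("powf = %.9f  __powf = %.9f\n", accurate, fast);
    }

    int main() {
        compare_pow<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }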

Installing CUDA on Windows 10

I am trying to install the CUDA toolkit in order to be able to use Thundersvm on my personal computer. However, I keep getting the following message in the GUI …

Python CUDA: parallelize multiple SVDs of small matrices

I've seen a similar post on Stack Overflow which tackles the problem in C++ (Parallel implementation for multiple SVDs using CUDA). I want to do exactly the same …

CGBN: How to send a short cgbn_mem_t to device memory but do the calculations over a long cgbn_mem_t?

I'm somewhat lost on the point of converting, for instance, cgbn_mem_t<256> into cgbn_mem_t<1024> in device code. Say the kernel receives two …

Where does cuda-repo-cross-<identifier>-all.deb come from?

I am trying to set up a cross-compile environment on an AWS EC2 Ubuntu box targeting Nvidia Xavier devices on CUDA 10.2. I tried following the "instructions" at …

Why is cuda-gdb much slower than gdb in executing the same program without breakpoints in CUDA kernels?

I am having trouble using cuda-gdb. My program starts from Python and it loads a shared library containing TensorFlow and CUDA code. The command I used to start …

How can I fix this "dpkg" error while installing CUDA on Google Colab?

I want to run CUDA code on Google Colab. For that I am following the steps below, but I am not able to install the CUDA packages. Step 1: Removing previous CUDA …

Looking for guidance on how to use nvjpegEncodeYUV()

I am trying to implement some JPEG encoding CUDA code based on the sample code below: https://docs.nvidia.com/cuda/nvjpeg/index.html#nvjpeg-encode-examples …

GPU memory is empty, but CUDA out of memory error occurs

While training this code with Ray Tune (1 GPU per trial), after a few hours of training (about 20 trials) a CUDA out of memory error occurred from GPU:0,1. …

How to solve 'CUDA was found but your compiler failed to compile a simple CUDA program'?

I tried VS2015, 2017, 2019 and 2022 without success; for CMake I also tried 3.14.1 and the latest version. CUDA is available, and VS2019 also seems to have compiled test.cu …