Category "cuda"

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU). So, I have two questions: …
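
A minimal CUDA-side sketch, assuming a made-up 256-float table and kernel: the __constant__ symbol is declared at file scope and populated from the host with cudaMemcpyToSymbol (in OpenCL the usual counterpart is an ordinary buffer bound to a __constant-qualified kernel parameter).

    // Minimal CUDA sketch: a hypothetical 256-float table copied into __constant__ memory.
    #include <cuda_runtime.h>
    #include <cstdio>

    __constant__ float d_table[256];          // lives in the device's constant memory

    __global__ void read_table(float *out) {
        int i = threadIdx.x;
        out[i] = d_table[i];                  // all threads read through the constant cache
    }

    int main() {
        float h_table[256];
        for (int i = 0; i < 256; ++i) h_table[i] = static_cast<float>(i);

        // Host-side population: copy to the symbol, not to an ordinary device pointer.
        cudaMemcpyToSymbol(d_table, h_table, sizeof(h_table));

        float *d_out;
        cudaMalloc(&d_out, sizeof(h_table));
        read_table<<<1, 256>>>(d_out);

        float h_out[256];
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("h_out[10] = %f\n", h_out[10]);
        cudaFree(d_out);
        return 0;
    }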

What are CUDA Global Memory 32-, 64- and 128-byte transactions?

I am relatively new to CUDA programming. In this blog (How to Access Global Memory Efficiently in CUDA C/C++ Kernels), we have the following: "The device can …"
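
The 32-, 64- and 128-byte figures describe how a warp's loads are grouped into memory transactions; the sketch below (array names and launch parameters are illustrative) contrasts a coalesced access pattern with a strided one that forces extra transactions.

    // Sketch contrasting access patterns that map to few vs. many memory transactions.
    __global__ void coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];        // consecutive threads touch consecutive 4-byte words:
                                   // a warp's 32 loads fit in one 128-byte transaction
    }

    __global__ void strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];   // with a large stride each thread hits a different
                                       // 32-byte segment, so the warp needs many transactions
    }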

What is the difference between the Conda packages nvidia::cudatoolkit and nvidia::cuda-toolkit?

cudatoolkit has metadata and a description, and seems to be widely referenced in various installation guides for different libraries, e.g. pytorch. cuda-toolkit …

nvcc not found when installing from source

I want to install pytorch3d from source with the following command as recommended at Link: git clone https://github.com/facebookresearch/pytorch3d.git cd pytorc…

CUDA Unified Memory compilation error: type mismatch

Problem statement: There are three variables, a, b and nk. interval is another variable which stores the value of (b-a)/nk. There is an array named interval_l …
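
Without the full listing the exact mismatch can only be guessed at, but a minimal sketch of the usual cudaMallocManaged pattern, reusing the question's variable names and assuming double for the element type, looks like this:

    // Minimal Unified Memory sketch (variable names follow the question; sizes are assumed).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int nk = 10;
        const double a = 0.0, b = 1.0;
        const double interval = (b - a) / nk;

        double *interval_l = nullptr;
        // cudaMallocManaged takes a pointer-to-pointer; mismatching this type
        // (or the element type) is a common source of the compiler error.
        cudaMallocManaged(&interval_l, (nk + 1) * sizeof(double));

        for (int i = 0; i <= nk; ++i)
            interval_l[i] = a + i * interval;   // accessible from host and device alike

        printf("interval_l[nk] = %f\n", interval_l[nk]);
        cudaFree(interval_l);
        return 0;
    }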

CUDA Memory Allocation for AoS inside a SoA

I've been working on a program that requires using an array of structs inside another array of structs or a structure of arrays. I decided to use this approach given …
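
One common way to lay this out, sketched below with made-up names and sizes, is to allocate each inner device array from the host, record the device pointers in the outer structs, and only then copy the outer array across:

    // Sketch of one possible layout: an outer array of structs whose elements each
    // own a device array of inner structs. Names and sizes are illustrative.
    #include <cuda_runtime.h>

    struct Inner { float x, y; };

    struct Outer {
        Inner *items;   // device pointer to this element's array of Inner
        int    count;
    };

    int main() {
        const int n_outer = 4, n_inner = 16;

        // Build the outer array on the host, allocating each inner array on the device.
        Outer h_outer[n_outer];
        for (int i = 0; i < n_outer; ++i) {
            h_outer[i].count = n_inner;
            cudaMalloc(&h_outer[i].items, n_inner * sizeof(Inner));
        }

        // Copy the outer array (now holding device pointers) to the device.
        Outer *d_outer;
        cudaMalloc(&d_outer, n_outer * sizeof(Outer));
        cudaMemcpy(d_outer, h_outer, n_outer * sizeof(Outer), cudaMemcpyHostToDevice);

        // ... launch kernels that index d_outer[i].items[j] ...

        for (int i = 0; i < n_outer; ++i) cudaFree(h_outer[i].items);
        cudaFree(d_outer);
        return 0;
    }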

Incorrect CUDA kernel output

I am accelerating a big application, part of which relies on basic indexing as shown below: #include <iostream> void kernel_cpu() { for (size_t i=0; …
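
For reference, the usual way such a flat CPU loop is mapped onto a CUDA grid looks like the sketch below (the loop body is a placeholder, since the original computation is truncated here):

    // Hedged sketch: one global index per loop iteration, with a bounds guard.
    __global__ void kernel_gpu(const float *in, float *out, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n)                    // guard: the grid is usually rounded up past n
            out[i] = in[i] * 2.0f;    // stands in for the real computation
    }

    // Launch with enough blocks to cover all n elements:
    //   size_t threads = 256;
    //   size_t blocks  = (n + threads - 1) / threads;
    //   kernel_gpu<<<blocks, threads>>>(d_in, d_out, n);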

CLBlast library not working on Mingw-w64 with Nvidia GPUs

I am trying to run the example samples/sgemm.cpp from the CLBlast repo on Windows 10 with an Nvidia graphics card. I have obtained the cl.hpp from the link. The …

How can I implement a parallel Cartesian product efficiently using C++ CUDA?

I have two arrays A and B which hold m and n int values respectively. I have a function f, and I want to generate all f(a, b) where a is in A and b is in B: …
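
A common starting point, sketched below with a placeholder f and a flattened row-major output, is a 2D grid in which thread (i, j) computes f(A[i], B[j]):

    // Sketch of a 2D-grid Cartesian product: thread (i, j) computes f(A[i], B[j]).
    __device__ int f(int a, int b) { return a + b; }     // stand-in for the real function

    __global__ void cartesian(const int *A, const int *B, int *out, int m, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // index into A
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // index into B
        if (i < m && j < n)
            out[i * n + j] = f(A[i], B[j]);              // row-major m x n result
    }

    // Launch example:
    //   dim3 block(16, 16);
    //   dim3 grid((m + 15) / 16, (n + 15) / 16);
    //   cartesian<<<grid, block>>>(d_A, d_B, d_out, m, n);

Note that for large m and n the output itself is m x n elements, so the kernel is typically memory-bound and the main tuning knob is how the result is written back or consumed.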

Changing one part of the code affects another part when compiling with the use_fast_math flag

I have the following kernel: __global__ void kernel() { float loc{269.0f}; float s{(356.0f - loc) / 13.05f}; float a{pow(1.0f - 0.15f * s, 1.0f)}; …
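
A likely factor is that --use_fast_math maps powf onto the __powf intrinsic (and also changes contraction and flush-to-zero behaviour), so a small sketch reusing the question's constants and printing both variants can show whether that substitution accounts for the difference:

    // Sketch for isolating the fast-math effect: compare the accurate powf with the
    // __powf intrinsic that --use_fast_math substitutes for it.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void compare_pow() {
        float loc = 269.0f;
        float s   = (356.0f - loc) / 13.05f;
        float accurate = powf(1.0f - 0.15f * s, 1.0f);    // library version
        float fast     = __powf(1.0f - 0.15f * s, 1.0f);  // intrinsic used by fast math
        printf("powf = %.9f  __powf = %.9f\n", accurate, fast);
    }

    int main() {
        compare_pow<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }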

Installing CUDA on Windows 10

I am trying to install the CUDA toolkit in order to be able to use Thundersvm on my personal computer. However, I keep getting the following message in the GUI …

Python CUDA: parallelize multiple SVDs of small matrices

I've seen a similar post on Stack Overflow which tackles the problem in C++ (Parallel implementation for multiple SVDs using CUDA). I want to do exactly the same …

CGBN: How to send a short cgbn_mem_t to device memory but do the calculations over a long cgbn_mem_t?

I'm somewhat lost on the point of converting, for instance, cgbn_mem_t<256> into cgbn_mem_t<1024> in device code. Say the kernel receives two …

Where does cuda-repo-cross-<identifier>-all.deb come from?

I am trying to set up a cross-compile environment on an AWS EC2 Ubuntu box targeting Nvidia Xavier devices on CUDA 10.2. I tried following the "instructions" at …

Why is cuda-gdb much slower than gdb in executing the same program without breakpoints in CUDA kernels?

I am having trouble using cuda-gdb. My program starts from Python and it loads a shared library containing TensorFlow and CUDA code. The command I used to start …

How can I fix this "dpkg" error while installing CUDA on Google Colab?

I want to run CUDA code on Google Colab. For that I am following the steps below, but I am not able to install the CUDA packages. Step 1: Removing previous CUDA …

Looking for guidance on how to use nvjpegEncodeYUV()

I am trying to implement some JPEG encoding CUDA code based on the sample code below: https://docs.nvidia.com/cuda/nvjpeg/index.html#nvjpeg-encode-examples …

GPU memory is empty, but CUDA out of memory error occurs

While training this code with Ray Tune (1 GPU per trial), after a few hours of training (about 20 trials) a CUDA out of memory error occurred from GPU:0,1. …

How to solve 'CUDA was found but your compiler failed to compile a simple CUDA program'?

I tried VS2015, 2017, 2019 and 2022 without success; for CMake I also tried 3.14.1 and the latest version. CUDA is available, and VS2019 also seems to have compiled test.cu …