Category "cuda"

pytorch CUDA version vs. Nvidia CUDA version

As of April 26th, 2022, CUDA has been updated to version 11.6, which can be installed by following Nvidia's instructions: wget https://developer.download.nvidia.com/compute/cuda/11.6

numba cuda does not produce correct result with += (gpu reduction needed?)

I am using numba cuda to calculate a function. The code simply adds up all the values into one result, but numba cuda gives me a different result from nu

CUDA_ARCHITECTURES is empty for target "cmTC_28d80"

I made a new CUDA executable project in CLion and when it opened I got CMake error: CUDA_ARCHITECTURES is empty for target "cmTC_908f4". CMakeLists.txt: cmake_

Do __shfl_xx_sync() intrinsics with mask need an additional __syncwarp()?

Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough? I cannot provide

Finding a prime factor using Cuda

I was not able to find other topics about finding the largest prime factor of a number using Cuda and I am having some issues. #include <cuda.h> #include

Are tensor cores / WMMA useful for matrix-vector multiplication?

Suppose that, in my CUDA grid block, I have a Matrix, which I want to multiply by a vector. And that my data type is either half, single, or double precision (i

Warp Matrix-Multiply functions - are single-precision multiplicands supported?

In the CUDA Programming guide, v11.7, section B.24.6. Element Types & Matrix Sizes, there's a table of supported type combinations, in which the multiplicat

Can two processes running simultaneously share a variable?

Newbie here, I reckon this may be a very foolish question. I am simultaneously running on cuda, in two distinct processes, a simple 3-layer MLP neural network ov

Can I launch a cooperative kernel without passing an array of pointers?

The CUDA runtime API allows us to launch kernels using the variable-number-of-arguments triple-chevron syntax: my_kernel<<<grid_dims, block_dims, shar

Pytorch with CUDA local installation fails

I am trying to install PyTorch with CUDA. I followed the instructions (installation using conda) mentioned in https://pytorch.org/get-started/locally/ conda in

Nvidia NVML Driver/library version mismatch [closed]

When I run nvidia-smi, I get the following message: Failed to initialize NVML: Driver/library version mismatch An hour ago I received the sa

Numba support for cuda cooperative block synchronization?? Python numba cuda grid sync

Numba Cuda has syncthreads() to sync all threads within a block. How can I sync all blocks in a grid without exiting the current kernel? In C-Cuda there's a coo

ncu-ui won't run: Could not load the Qt platform plugin "xcb" in "" even though it was found

I'm trying to run the ncu-ui profiler GUI on a CentOS 7 Linux system (using ncu-ui 2022.1), both as root and as a regular user. I'm getting the error: qt.qpa.pl

sprintf-like function for CUDA device-side code?

I could not find anything on the internet. Since it is possible to use printf in a __device__ function, I am wondering if there is a sprintf-like func

A cuda wrapper to execute openCL

I'm involved in a project where I have to do GPU programming; one of my constraints is to do it on an Nvidia device (thus in CUDA). But I don't have access to a dev

What is the canonical way to check for errors using the CUDA runtime API?

Looking through the answers and comments on CUDA questions, and in the CUDA tag wiki, I see it is often suggested that the return status of every API call shoul

cuda 10.2 in Qt 5.14 ubuntu 18.04

I am planning to start CUDA programming in the Qt framework. I would like to start with a simple example. System information: OS: Ubuntu 18.04 LTS, Qt version: 5

In a CUDA kernel, how do I store an array in "local thread memory"?

I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by defau

A top-like utility for monitoring CUDA activity on a GPU

I'm trying to monitor a process that uses CUDA and MPI. Is there any way I could do this, something like the command "top" but one that monitors the GPU too?

CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device an
