Category "cuda"

Looking for a guide on how to use nvjpegEncodeYUV()

I am trying to implement some JPEG-encoding CUDA code based on the sample code below: https://docs.nvidia.com/cuda/nvjpeg/index.html#nvjpeg-encode-examples I pos…
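
For orientation, here is a compressed sketch of the encode flow from the linked docs. The 640x480 size, quality setting, and 4:2:0 plane layout are assumptions, and all error checking is omitted:

```c++
#include <cuda_runtime.h>
#include <nvjpeg.h>
#include <vector>

int main() {
    const int w = 640, h = 480;          // assumed image size
    cudaStream_t stream = 0;

    nvjpegHandle_t handle;
    nvjpegEncoderState_t state;
    nvjpegEncoderParams_t params;
    nvjpegCreateSimple(&handle);
    nvjpegEncoderStateCreate(handle, &state, stream);
    nvjpegEncoderParamsCreate(handle, &params, stream);

    // The sampling factors set here must match the subsampling argument
    // passed to nvjpegEncodeYUV() below.
    nvjpegEncoderParamsSetSamplingFactors(params, NVJPEG_CSS_420, stream);
    nvjpegEncoderParamsSetQuality(params, 90, stream);

    // Device-side planar YUV 4:2:0 input: full-size Y, half-size Cb/Cr.
    nvjpegImage_t src = {};
    cudaMalloc((void**)&src.channel[0], w * h);              // Y
    cudaMalloc((void**)&src.channel[1], (w / 2) * (h / 2));  // Cb
    cudaMalloc((void**)&src.channel[2], (w / 2) * (h / 2));  // Cr
    src.pitch[0] = w;
    src.pitch[1] = w / 2;
    src.pitch[2] = w / 2;
    // ... fill the planes with real image data here ...

    nvjpegEncodeYUV(handle, state, params, &src, NVJPEG_CSS_420, w, h, stream);

    // Retrieve the bitstream: first query the size, then copy it out.
    size_t length = 0;
    nvjpegEncodeRetrieveBitstream(handle, state, nullptr, &length, stream);
    std::vector<unsigned char> jpeg(length);
    nvjpegEncodeRetrieveBitstream(handle, state, jpeg.data(), &length, stream);
    cudaStreamSynchronize(stream);
    // jpeg now holds a complete JPEG file; write it to disk as needed.
    return 0;
}
```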

GPU memory is empty, but CUDA out of memory error occurs

During training of this code with Ray Tune (1 GPU per trial), after a few hours of training (about 20 trials) a CUDA out of memory error occurred on GPU 0 and 1. And ev…

How to solve 'CUDA was found but your compiler failed to compile a simple CUDA program'

I tried VS 2015, 2017, 2019, and 2022 without success; for CMake I also tried 3.14.1 and the latest version. CUDA is available, and VS 2019 also seems to have compiled test.cu…
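
CMake's check does essentially this: compile and run a trivial .cu file. Building one by hand isolates whether the nvcc/host-compiler pairing or the CMake configuration is at fault. A minimal sketch:

```c++
// test.cu -- compile by hand with:  nvcc test.cu -o test
// If the host compiler is newer than the toolkit supports, trying
// -allow-unsupported-compiler tells you whether version pairing is the issue.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello() { printf("hello from the device\n"); }

int main() {
    hello<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```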

PyTorch CUDA version vs. Nvidia CUDA version

As of April 26th, 2022, CUDA has been updated to version 11.6, which can be installed per Nvidia's instructions: wget https://developer.download.nvidia.com/compute/cuda/11.6…
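
The distinction this question is after: the driver reports the highest CUDA version it supports, while a binary (PyTorch included, since it bundles its own CUDA runtime) links a specific runtime/toolkit version. A small sketch that prints both:

```c++
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA the driver supports
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime this binary links
    // Versions are encoded as major*1000 + minor*10, e.g. 11060 for 11.6.
    printf("driver supports up to CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10);
    printf("runtime (toolkit) is CUDA %d.%d\n",
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}
```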

Numba CUDA does not produce the correct result with += (GPU reduction needed?)

I am using Numba CUDA to calculate a function. The code simply adds up all the values into one result, but Numba CUDA gives me a different result from nu…
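
The question is about Numba, but the failure mode is the same in any CUDA dialect: += from many threads to one location is a data race. A sketch in CUDA C++ of the race and the atomic fix (Numba's equivalent, as far as I know, is cuda.atomic.add):

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Racy: every thread read-modify-writes *result at once, so updates are lost.
__global__ void sum_racy(const float* data, int n, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) *result += data[i];          // undefined: not atomic
}

// Correct (if slow): serialize the updates with an atomic.
__global__ void sum_atomic(const float* data, int n, float* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, data[i]);  // a tree reduction would be faster
}

int main() {
    const int n = 1 << 20;
    float *d_data, *d_result, h = 0.f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    cudaMemset(d_result, 0, sizeof(float));
    sum_atomic<<<(n + 255) / 256, 256>>>(d_data, n, d_result);
    cudaMemcpy(&h, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum: %f\n", h);
    return 0;
}
```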

CUDA_ARCHITECTURES is empty for target "cmTC_28d80"

I made a new CUDA executable project in CLion, and when it opened I got a CMake error: CUDA_ARCHITECTURES is empty for target "cmTC_908f4". CMakeLists.txt: cmake_…

Do __shfl_xx_sync() intrinsics with mask need an additional __syncwarp()?

Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough? I cannot provide…
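
As a concrete illustration (my reading of the programming guide, not an authoritative answer): the _sync shuffle variants already synchronize the lanes named in the mask, so no separate __syncwarp() is needed around the shuffles themselves. A sketch with only the lower half-warp participating:

```c++
#include <cstdio>
#include <cuda_runtime.h>

__global__ void partial_warp_reduce(const int* in, int* out) {
    unsigned lane = threadIdx.x % 32;
    int v = in[threadIdx.x];

    // Mask 0x0000ffff names lanes 0-15; every named lane must reach the
    // same __shfl_down_sync call, and the _sync variant itself waits for
    // all of them -- no extra __syncwarp(0x0000ffff) required here.
    if (lane < 16) {
        for (int offset = 8; offset > 0; offset /= 2)
            v += __shfl_down_sync(0x0000ffffu, v, offset);
        if (lane == 0) *out = v;   // lane 0 holds the sum of lanes 0-15
    }
}

int main() {
    int h_in[32], h_out = 0;
    for (int i = 0; i < 32; ++i) h_in[i] = 1;
    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    partial_warp_reduce<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("sum of lanes 0-15: %d\n", h_out);   // expect 16
    return 0;
}
```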

Finding a prime factor using CUDA

I was not able to find other topics about finding the largest prime factor of a number using CUDA, and I am having some issues. #include <cuda.h> #include …
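
For what it's worth, here is one hedged way to structure such a search (my own sketch, not the question's code): repeatedly find the smallest factor on the GPU and divide it out on the host. The smallest divisor greater than 1 of any integer is always prime, so whatever remains at the end is the largest prime factor. The input value is just an example; 64-bit atomicMin requires compute capability 3.5+.

```c++
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Each thread tests one candidate divisor; atomicMin records the smallest.
__global__ void smallest_factor(unsigned long long n,
                                unsigned long long limit,
                                unsigned long long* result) {
    unsigned long long d = 2ull
        + (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (d <= limit && n % d == 0)
        atomicMin(result, d);   // smallest divisor > 1 is always prime
}

int main() {
    unsigned long long n = 600851475143ull;   // example input (an assumption)
    unsigned long long *d_res, h_res;
    cudaMalloc(&d_res, sizeof(h_res));

    // Strip the smallest prime factor until none remains; the final n is
    // the largest prime factor of the original number.
    while (true) {
        h_res = n;   // sentinel meaning "no factor found"
        cudaMemcpy(d_res, &h_res, sizeof(h_res), cudaMemcpyHostToDevice);
        unsigned long long limit = (unsigned long long)sqrt((double)n);
        if (limit >= 2) {
            int block = 256;
            unsigned grid = (unsigned)((limit - 1 + block - 1) / block);
            smallest_factor<<<grid, block>>>(n, limit, d_res);
            cudaMemcpy(&h_res, d_res, sizeof(h_res), cudaMemcpyDeviceToHost);
        }
        if (h_res == n) break;   // no factor <= sqrt(n): n is prime
        n /= h_res;
    }
    printf("largest prime factor: %llu\n", n);
    cudaFree(d_res);
    return 0;
}
```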

Are tensor cores / WMMA useful for matrix-vector multiplication?

Suppose that, in my CUDA grid block, I have a matrix which I want to multiply by a vector, and that my data type is either half, single, or double precision (i…
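
For intuition about why tensor cores are an awkward fit here, a hedged sketch of my own (not from the question): a single warp can push a 16x16 half matrix times a 16-vector through WMMA only by padding the vector out to a full 16x16 tile, so 15 of the 16 multiplicand columns are wasted zeros. Requires sm_70 or newer.

```c++
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 half matrix by a 16-vector. x_padded is a
// 16x16 column-major tile whose first column is the vector and the rest
// zeros; only the first column of y_padded is meaningful afterwards.
__global__ void mv16(const half* A, const half* x_padded, float* y_padded) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);          // leading dimension 16
    wmma::load_matrix_sync(b, x_padded, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(y_padded, c, 16, wmma::mem_row_major);
}

int main() {
    half *A, *x; float *y;
    cudaMalloc(&A, 256 * sizeof(half));
    cudaMalloc(&x, 256 * sizeof(half));
    cudaMalloc(&y, 256 * sizeof(float));
    cudaMemset(A, 0, 256 * sizeof(half));
    cudaMemset(x, 0, 256 * sizeof(half));
    mv16<<<1, 32>>>(A, x, y);   // exactly one warp drives the WMMA ops
    cudaDeviceSynchronize();
    return 0;
}
```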

Warp Matrix-Multiply functions - are single-precision multiplicands supported?

In the CUDA Programming Guide, v11.7, section B.24.6, Element Types & Matrix Sizes, there's a table of supported type combinations, in which the multiplicat…
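
My reading of that table: plain fp32 is not a supported multiplicand type; the closest option is tf32 on Ampere-class GPUs, where float data is explicitly rounded to tf32 before the MMA. A sketch assuming an sm_80 build and the m16n16k8 tf32 shape:

```c++
#include <mma.h>
using namespace nvcuda;

// A is 16x8 row-major, B is 8x16 col-major, C is 16x16 row-major.
__global__ void mma_tf32(const float* A, const float* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32,
                   wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32,
                   wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 8);     // lda = k = 8 for row-major A
    wmma::load_matrix_sync(b, B, 8);     // ldb = k = 8 for col-major B
    // Explicitly round each element from fp32 to tf32 before the MMA.
    for (int i = 0; i < a.num_elements; ++i)
        a.x[i] = wmma::__float_to_tf32(a.x[i]);
    for (int i = 0; i < b.num_elements; ++i)
        b.x[i] = wmma::__float_to_tf32(b.x[i]);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}

int main() {
    float *A, *B, *C;
    cudaMalloc(&A, 16 * 8 * sizeof(float));
    cudaMalloc(&B, 8 * 16 * sizeof(float));
    cudaMalloc(&C, 16 * 16 * sizeof(float));
    cudaMemset(A, 0, 16 * 8 * sizeof(float));
    cudaMemset(B, 0, 8 * 16 * sizeof(float));
    mma_tf32<<<1, 32>>>(A, B, C);
    cudaDeviceSynchronize();
    return 0;
}
```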

Can two processes running simultaneously share a variable?

Newbie here; I reckon this may be a very foolish question. I am simultaneously running on CUDA, in two distinct processes, a simple 3-layer MLP neural network ov…
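
Separate processes get separate CUDA address spaces, but the runtime's IPC API can export a device allocation from one process into another. A minimal sketch of the two halves, assuming both processes use the same GPU (shipping the handle between processes, via a pipe or file, is left out):

```c++
#include <cuda_runtime.h>

// Process A: allocate device memory and export a handle for it.
void exporter(cudaIpcMemHandle_t* handle_out) {
    float* d_shared;
    cudaMalloc(&d_shared, 1024 * sizeof(float));
    cudaIpcGetMemHandle(handle_out, d_shared);
    // ship *handle_out to the other process via a pipe, socket, file, ...
}

// Process B: map the same allocation into its own address space.
void importer(const cudaIpcMemHandle_t& handle) {
    float* d_shared;
    cudaIpcOpenMemHandle((void**)&d_shared, handle,
                         cudaIpcMemLazyEnablePeerAccess);
    // d_shared now refers to process A's buffer; both processes see writes.
    cudaIpcCloseMemHandle(d_shared);   // unmap when done
}
```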

Can I launch a cooperative kernel without passing an array of pointers?

The CUDA runtime API allows us to launch kernels using the variable-number-of-arguments triple-chevron syntax: my_kernel<<<grid_dims, block_dims, shar…
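
For reference, the launch path I know of for cooperative kernels is cudaLaunchCooperativeKernel, which does take an array of pointers to the arguments. A sketch of the wrapping, assuming a grid small enough for the device to co-schedule and compilation with nvcc -rdc=true (required for grid-wide sync, as far as I know):

```c++
#include <cooperative_groups.h>
#include <cuda_runtime.h>

__global__ void coop_kernel(int* data, int n) {
    cooperative_groups::this_grid().sync();   // needs cooperative launch
    // ... grid-wide synchronized work ...
}

int main() {
    int* d_data = nullptr;
    int n = 1024;
    cudaMalloc(&d_data, n * sizeof(int));

    // Unlike <<<...>>>, the cooperative-launch API takes the kernel
    // arguments as an array of pointers to each argument.
    void* args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void*)coop_kernel, dim3(2), dim3(256),
                                args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```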

Pytorch with CUDA local installation fails

I am trying to install PyTorch with CUDA. I followed the instructions (installation using conda) mentioned at https://pytorch.org/get-started/locally/: conda in…

Nvidia NVML Driver/library version mismatch [closed]

When I run nvidia-smi, I get the following message: Failed to initialize NVML: Driver/library version mismatch. An hour ago I received the sa…

Numba support for CUDA cooperative block synchronization? Python Numba CUDA grid sync

Numba CUDA has syncthreads() to sync all threads within a block. How can I sync all blocks in a grid without exiting the current kernel? In CUDA C there's a coo…
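
In CUDA C++ the grid-wide barrier looks like the sketch below (cooperative groups; the kernel must be launched with cudaLaunchCooperativeKernel and built with nvcc -rdc=true). Newer Numba releases expose the same facility, if I recall correctly, as numba.cuda.cg.this_grid() followed by grid.sync() — check your version.

```c++
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// Grid-wide barrier: all blocks finish phase 1 before any starts phase 2.
__global__ void two_phase(const float* in, float* tmp, float* out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tmp[i] = in[i] * 2.0f;               // phase 1
    grid.sync();                                    // whole grid, not a block
    if (i < n) out[i] = tmp[i] + tmp[(i + 1) % n];  // phase 2 reads safely
}

int main() {
    int n = 1 << 12;   // small enough to co-schedule on any recent GPU
    float *in, *tmp, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&tmp, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    void* args[] = { &in, &tmp, &out, &n };
    cudaLaunchCooperativeKernel((void*)two_phase, dim3(n / 256), dim3(256),
                                args, 0, 0);
    cudaDeviceSynchronize();
    return 0;
}
```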

ncu-ui won't run: Could not load the Qt platform plugin "xcb" in "" even though it was found

I'm trying to run the ncu-ui profiler GUI on a CentOS 7 Linux system (using ncu-ui 2022.1), both as root and as a regular user. I'm getting the error: qt.qpa.pl…

sprintf-like function for CUDA device-side code?

I could not find anything on the internet. Since it is possible to use printf in a __device__ function, I am wondering whether there is a sprintf-like func…
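
As far as I know the CUDA runtime offers device-side printf but no sprintf, so a common workaround is a small hand-rolled formatter. A sketch (device_itoa is my own hypothetical helper, not a library function):

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Format a signed int into buf; returns the length written.
__device__ int device_itoa(int value, char* buf) {
    int len = 0;
    unsigned v = (unsigned)value;
    if (value < 0) { buf[len++] = '-'; v = 0u - v; }  // safe even for INT_MIN
    char tmp[12];
    int t = 0;
    do { tmp[t++] = (char)('0' + v % 10); v /= 10; } while (v);
    while (t) buf[len++] = tmp[--t];   // digits come out in reverse order
    buf[len] = '\0';
    return len;
}

__global__ void demo() {
    char buf[16];
    device_itoa(-1234, buf);
    printf("formatted on device: %s\n", buf);
}

int main() {
    demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```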

A CUDA wrapper to execute OpenCL

I'm involved in a project where I have to do GPU programming; one of my constraints is to do it on an Nvidia device (thus in CUDA). But I don't have access to a dev…

What is the canonical way to check for errors using the CUDA runtime API?

Looking through the answers and comments on CUDA questions, and in the CUDA tag wiki, I see it is often suggested that the return status of every API call shoul…
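
One common shape of that pattern, sketched with my own macro name (the tag wiki's version differs only cosmetically): wrap every runtime call so failures report the file and line.

```c++
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err__ = (call);                                       \
        if (err__ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err__), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

int main() {
    float* d;
    CUDA_CHECK(cudaMalloc(&d, 256 * sizeof(float)));
    // Kernel launches return no status directly: check cudaGetLastError()
    // after the launch, and cudaDeviceSynchronize() to surface failures
    // that only appear once the kernel actually runs.
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```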

CUDA 10.2 in Qt 5.14 on Ubuntu 18.04

I am planning to start CUDA programming in the Qt framework. I would like to start with a simple example. System information: OS: Ubuntu 18.04 LTS, Qt version: 5…