Thursday 15 September 2011

c++ - Transposing the rectangular part of a matrix in CUDA


I am writing a CUDA program to transpose a square matrix. The idea is to do it in two parts, depending on the size of the matrix: the part of the matrix that divides evenly by the tile size is transposed with tiles, and the remaining rectangular part is transposed separately. Example: for a 67x67 matrix with a tile size of 32, the first part transposed is 64x64, and the second part is 3x67.

My problem is in the rectangular part. The first code listing shows the main code with the defined values:

    const int TILE_DIM = 32;
    const int BLOCK_ROWS = 8;
    const int NUM_REPS = 100;
    const int Nx = 2024; // size of the matrix
    const int Ny = 2024;

    int main(int argc, char **argv)
    {
        const int nx = Nx;
        const int ny = Ny; // size of the arrays

        const int mem_size = nx * ny * sizeof(int); // size of the matrix in bytes

        int *h_idata = (int *)malloc(mem_size); // original host array
        int *d_idata;                           // device array
        checkCuda(cudaMalloc(&d_idata, mem_size));

        dim3 dimGridX(nx / TILE_DIM, 1, 1); // grid dimension used
        dim3 dimBlockX(TILE_DIM, 1, 1);     // number of threads used

        // kernel function for the rectangular part only
        EdgeTransposeX<<<dimGridX, dimBlockX>>>(d_idata);
        cudaEventRecord(startEvent, 0);
        cudaEventRecord(stopEvent, 0);
        cudaEventSynchronize(stopEvent);
        cudaEventElapsedTime(&ms, startEvent, stopEvent);
        cudaMemcpy(h_idata, d_idata, mem_size, cudaMemcpyDeviceToHost);

The kernel code, written following advice I received about shared memory, is below:

    __global__ void EdgeTransposeX(int *idata)
    {
        int tile_C[edge][nx];
        int tile_V[nx][edge];
        int x = blockIdx.x * TILE_DIM + threadIdx.x;

        for (int j = 0; j < nx; j++)
        {
            for (int i = 1; i <= edge; i++)
            {
                tile_V[j][i - 1] = idata[j * nx + (x + i)];
                tile_C[i - 1][j] = idata[(x + i) * nx + j];
            }
        }
        __syncthreads();
        for (int j = 0; j < nx; j++)
        {
            for (int i = 1; i <= edge; i++)
            {
                idata[j * nx + (x + i)] = tile_C[i - 1][j];
                idata[(x + i) * nx + j] = tile_V[j][i - 1];
            }
        }
    }

The code works fine until the size of the matrix reaches 1025; after that it stops working. Any ideas why? Am I missing something here?

Your two-dimensional arrays tile_C and tile_V are physically stored in the GPU's local memory. The amount of local memory per thread is 512 KB. Verify that you are not using more than 512 KB of local memory per thread.

An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However, in some cases the compiler might choose to place it in local memory. This excerpt is from the "CUDA C Programming Guide 2015", page 89.

My suggestion is to use the Visual Profiler to look at your occupancy, register usage and local memory usage.

This link can be helpful to you:

I have implemented a square matrix transpose using CUDA surfaces in 2D. It works fine for sizes from 2 to 16384, in increasing powers of two. If you do not need a tiled version, I recommend this approach.

