As I wrote in my last post, we started fiddling around with CUDA today. First thing to notice: the demos are just totally awesome (like this particle demo here). We’re talking teraflops of computing power here. After doing the basic CUDA excercises (which were really helpful) I started to implement a CA again. And guess which one: A2 of course.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 | // includes, system #include <stdio.h> #include <assert.h> // Simple utility function to check for CUDA runtime errors void checkCUDAError(const char *msg); __device__ void get_field(int* d_res, dim3 d_pos, int *d_p0, int d_size) { (*d_res) = 0; if(d_pos.x < d_size && d_pos.y < d_size) (*d_res) = d_p0[(d_pos.y * d_size) + d_pos.x]; } __global__ void caKernel( int *d_p0, int size, int *d_p1 ) { int x = (blockIdx.x * gridDim.x) + threadIdx.x; int y = (blockIdx.y * gridDim.y) + threadIdx.y; int sum = 0; int v = 0; get_field(&v, dim3(x - 1, y - 1), d_p0, size); sum += v; get_field(&v, dim3(x - 0, y - 1), d_p0, size); sum += v; get_field(&v, dim3(x + 1, y - 1), d_p0, size); sum += v; get_field(&v, dim3(x - 1, y - 0), d_p0, size); sum += v; get_field(&v, dim3(x + 1, y - 0), d_p0, size); sum += v; get_field(&v, dim3(x - 1, y + 1), d_p0, size); sum += v; get_field(&v, dim3(x - 0, y + 1), d_p0, size); sum += v; get_field(&v, dim3(x + 1, y + 1), d_p0, size); sum += v; d_p1[y * size + x] = sum == 2 ? 1 : 0; } //////////////////////////////////////////////////////////////////////////////// // Program main //////////////////////////////////////////////////////////////////////////////// int main( int argc, char** argv) { if(argc < 3) { fprintf(stderr, "usage: %s <size> <gens>\n", argv[0]); return -1; } int size = atoi(argv[1]); int gens = atoi(argv[2]); printf("Calculating %i generations for a %ix%i (=%i cells) poulation\n", gens, size*size, size*size, size*size*size*size); // pointer for host memory int *h_a; // pointer for device memory int *d_a; int *d_b; size_t memSize = size * size * size * size * sizeof(int); h_a = (int *) malloc(memSize); cudaMalloc( (void**)&d_a, memSize ); cudaMalloc( (void**)&d_b, memSize ); memset(h_a, 0, memSize); int i; for(i = 0; i < size*size; i++) h_a[i] = i%2; cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice ); dim3 dimGrid( size, size ); dim3 dimBlock( size, size ); for(i = 0; i < gens; i++) { caKernel<<< dimGrid , dimBlock >>>( i % 2== 0 ? d_a : d_b, size * size, i % 2==0 ? d_b : d_a ); } // block until the device has completed cudaThreadSynchronize(); // check if kernel execution generated an error checkCUDAError("kernel execution"); cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost ); // Check for any CUDA errors checkCUDAError("cudaMemcpy"); // free device memory cudaFree(d_a); cudaFree(d_b); // free host memory free(h_a); return 0; } void checkCUDAError(const char *msg) { cudaError_t err = cudaGetLastError(); if( cudaSuccess != err) { fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) ); exit(-1); } } |
The implementation is roughly based on those execercises. What makes CUDA that powerful is that it employs massive parallel computing. The time complexity of the algorithm itself is
(where
,
being the population size – the X-axis in the graph below) for computing one generation. But as we can see the factor by which the time grows is pretty low.
Note: This is all unproven. A formal prove may follow some time (especially for that algorithm being
)
It’s also pretty interesting to compare those results with the time required to do all this computation solely on the CPU. If one used the implementation of the A2 I presented two posts ago, he would notice that computing 4096 generations of a
population takes roughly 0.154s. The CUDA implementation requires roughly 0.044s for the same work. I used a
nVidia Corporation GeForce 8600 GT (rev a1)
for the CUDA stuff and the
2Ghz Intel Core 2 Duo
for the CPU version. This stuff really is fun, but there really is one question left: what to do with all this power?


