Interestingly, we found that performing the check sequentially was actually faster than using streams; this post attempts to determine why.
Map | Sequential Time | Device Functions | Streams
32x32 | 1 | 8.341714 | 21.75491
64x64 | 3.8 | 16.37582 | 47.9231
128x128 | 16.2 | 32.52248 | 133.1749
256x256 | 68.01 | 65.31222 | 409.7781
512x512 | 277.07 | 140.5209 | 19658.81
1024x1024 | 1146.38 | 399.7442 | 144157.2
2048x2048 | 4954.72 | 1745.794 | 462554.9
Device Functions vs Streams
[Code snippet: Evaluate Neighbours Device Function]
[Code snippet: Evaluate Neighbours Kernel]
Looking at the code snippets above, the two are almost identical; the only real difference is that one is tagged with the __device__ qualifier while the other is tagged with __global__. However, this seems to make a big difference.
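Since the original snippets did not survive above, here is a minimal sketch of what the two variants might look like; the Node structure, the relax() helper, and all names are assumptions rather than the post's actual code.

// Minimal sketch of the two variants being compared. Node, relax(), and all
// names are assumptions; the post's actual snippets did not survive.
struct Node { float g; int parent; };

// Shared logic: relax one neighbour from the current node.
__device__ void relax(Node* nodes, int current, int neighbour, float stepCost)
{
    float newG = nodes[current].g + stepCost;
    if (newG < nodes[neighbour].g)
    {
        nodes[neighbour].g = newG;
        nodes[neighbour].parent = current;
    }
}

// Variant 1: a __device__ function. It is called like any ordinary function
// from a thread that is already running; no new blocks or threads are created.
__device__ void evaluateNeighbourDevice(Node* nodes, int current, int neighbour)
{
    relax(nodes, current, neighbour, 1.0f);
}

// Variant 2: an almost identical __global__ kernel. Invoking it requires a
// full kernel launch, e.g. evaluateNeighbourKernel<<<1, 1>>>(...), which
// spins up a new block and thread just to run the same body.
__global__ void evaluateNeighbourKernel(Node* nodes, int current, int neighbour)
{
    relax(nodes, current, neighbour, 1.0f);
}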
Map | Device Function | Create Streams | Kernel with Streams | Kernel without Streams
32x32 | 0.014075 | 0.017313 | 0.284529 | 0.262266
64x64 | 0.014052 | 0.017732 | 0.315197 | 0.297716
128x128 | 0.014728 | 0.018648 | 0.308442 | 0.400251
Time for different methods of evaluating neighbours
When we call a device function from within a thread we are not actually creating a new parallel block and thread to handle the process; we are simply calling a function in the same manner we would from the host. However, when we use a kernel to handle the evaluation we need to launch a block and thread to handle the process, and when using streams to launch the kernels concurrently we also incur the extra overhead of creating the streams.
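As an illustration, the stream-based path the timings above are measuring would look roughly like the sketch below. It reuses the hypothetical evaluateNeighbourKernel from the earlier sketch; the fixed count of 8 neighbours comes from the post, while the function and parameter names are assumptions.

// Rough sketch of the stream-based launch pattern being timed; reuses the
// hypothetical evaluateNeighbourKernel from the earlier sketch.
void evaluateWithStreams(Node* d_nodes, int current, const int neighbourIds[8])
{
    cudaStream_t streams[8];

    // "Create Streams" column: on its own this already costs more than
    // simply calling the device function from the running thread.
    for (int i = 0; i < 8; ++i)
        cudaStreamCreate(&streams[i]);

    // "Kernel with Streams" column: eight tiny launches, one block and one
    // thread each, issued concurrently across the streams.
    for (int i = 0; i < 8; ++i)
        evaluateNeighbourKernel<<<1, 1, 0, streams[i]>>>(d_nodes, current,
                                                         neighbourIds[i]);
    cudaDeviceSynchronize();

    for (int i = 0; i < 8; ++i)
        cudaStreamDestroy(streams[i]);
}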
Looking at the table above, we can see that we can call the device function to evaluate all 8 of our node's neighbours faster than we can even create the 8 streams needed to handle them. On top of the stream creation we also have to launch 8 kernels, creating 8 blocks and threads to process the neighbours; this process is extremely slow in itself, taking almost 18 times longer than calling our device function.
Interestingly, however, it appears that using streams is potentially faster than not using streams, although this would need to be tested on more map sizes to get a fair result; that was not possible at the time due to errors in the timing code for larger map sizes.
The next change that was attempted involved each node storing a list of its neighbours in global memory on the device. This meant a single kernel could be called to evaluate all 8 of the neighbours at the same time, rather than launching an individual kernel for each neighbour, as is needed when storing the node's neighbours in local memory.
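A sketch of what that single launch could look like, assuming the neighbour indices have already been copied into a global-memory array (d_neighbourIds is hypothetical) and reusing the relax() helper from the first sketch:

// One kernel evaluates all 8 neighbours at once: one thread per neighbour.
// neighbourIds is a hypothetical array of 8 node indices in global memory.
__global__ void evaluateAllNeighbours(Node* nodes, const int* neighbourIds,
                                      int current)
{
    int i = threadIdx.x;          // threads 0..7 handle one neighbour each
    if (i < 8)
        relax(nodes, current, neighbourIds[i], 1.0f);
}

// Launched once instead of eight times:
// evaluateAllNeighbours<<<1, 8>>>(d_nodes, d_neighbourIds, current);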