Optimising a raytracer

Today and yesterday I have been working on an optimisation project given by the teacher. The optimisation is of a raytracer that is producing an image from point cloud data generated by the Kinect 2. This raytracer has been left deliberately slow for an easier optimisation attempt. In class we learned about optimisation techniques and why it is needed for today’s games, the reason being that CPUs cannot become faster without an insane amount of cooling, but we can still fit more cores into CPUs than the last time we tried. This project was given to everyone and each needs to implement optimisations that will allow the raytracer to render faster.

The first thing we all attempted to do (I assume, since it was the biggest thing spoken about in class) was multithreading. Multithreading is an optimisation where you give every core in the CPU a portion of the work, essentially reducing the amount of time to complete the overall task by a factor of however many cores you have. This is easy to do in code as it is only 2 lines of code, these lines are:

#include <omp.h>

#pragma omp parallel for

The second line goes just above any for loop that needs to be multithreaded. This uses the Open MP library that comes with Visual Studio, though it is required to allow Open MP in the project settings under(in Visual Studio) Project > Properties > Configuration Properties > C/C++ > Language > Open MP support.

The second optimisation I attempted was using the structure of arrays, which caused problems. Structure of Arrays (SOA) is a memory management optimisation, it allows for the CPU to grab out a larger amount of necessary data from RAM to be used. The way the memory would be constructed with an array of classes would be contiguous. All of the position x’s in memory would be next to the position y’s , and those would be next to the position z’s, the CPU would have to spend a few hundred cycles getting this data from RAM and into the cache, which would be slow.

The way the CPU gets memory from RAM is by lines, the CPU doesn’t just grab the 1 piece of data needed at that moment, it instead grabs a whole line of data (64 bytes) and puts it into the cache (cache being much faster but much smaller than RAM), and when this data is filled with maybe 3 instances of the class, it needs to go grab another line due to asking for one that isn’t in cache. An optimisation is to have individual arrays for the class’s variables.

SOA (structure of arrays) comes into play to help make things faster by having the cache grab out 16 variables at once (a float is 32 bits, which is 4 bytes [32 / 8], 64 bytes [size of the cache line] / 4 = 16) which the CPU will need to do for each variable that is stored for each sphere that is raytraced against. This should be faster as an entire class grabbed out for calculation would take up cycles.

I was able to implement SOA, but I had a problem. The image generated by the raytracer had some weird yellow tint on all of the spheres generated:

output.png

After a while of fiddling around with the code, looking a what I changed through implementing SOA, I found out that when I was initialising the values before they were read from data, I didn’t inisialise them properly. I found this out by looking at the constructor of the sphere class and saw that he was setting the specular power to 10 as a default. This was annoying as it took up a good several hours of staring at code and contemplating reverting all of my work of the past day and a bit. After fixing the specular power, I got:

output2.png

While I was getting the SOA specular problem done, my friend Liam who was siting next to me doing his optimisation, came across 2 things which the teacher hinted at. He found a better multi-threading command in OMP which gives each thread a certain amount of iterations in the for loop to do and once the thread has done its job, it grabs another set of iterations that haven’t been done or are being done. This is:

#pragma omp parallel schedule(dynamic, #)

The # is the number of iterations each thread will handle before grabbing another # iterations.

This boosted the speed of the rendering by not having the threads that are done, do nothing afterwards, it allows every thread to be virtually be doing something until everything is done.

Once I had this done, I went into all of the methods that are called frequently and made them inline, I also added a new, more efficient though less precise square root function which is an inverse square root multiplied by the original value:

       inline float InvSqrt(float x){
             float xhalf = 0.5f * x;
             int i = *(int*)&x;            // store floating-point bits in integer
             i = 0x5f3759df - (i >> 1);    // initial guess for Newton's method
             x = *(float*)&i;              // convert new bits into float
             x = x*(1.5f - xhalf*x*x);     // One round of Newton's method
             return x;
       }

This was also found by Liam, it is the function that was used in Quake 3.

Next week I intend to implement an octree for faster hit detection, and a better implementation of SOA, I might be able to get some performance gain from it (I didn’t get any benefit from implementing it).

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s