This page summarizes the key messages that I wanted to convey through the lab. It is intended for those who were unable to complete it during EJCP, or those who found this page later and are working through it on their own.
This step is to make sure you can apply tiling to a simple example. There are some subtle points where it is easy to make mistakes, so it is best to start with a simple case using a constant tile size. I selected the 8x8x8 tile size because it will probably give a significant performance improvement, but it will not be anywhere near optimal on most processors these days.
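To make the starting point concrete, here is a minimal sketch of what the 8x8x8 tiling might look like, assuming the kernel is a standard square matrix multiplication (C += A*B) and that the problem size is a multiple of the tile size; the names N, A, B, C are illustrative, not taken from the lab code. A classic pitfall is getting the bounds of the intra-tile loops wrong.

```c
/* Sketch of 8x8x8 tiling, assuming a square matrix multiplication
 * C += A*B with N a multiple of the tile size (illustrative names). */
#define TILE 8

void matmul_tiled(int N, double A[N][N], double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                /* compute one TILE x TILE x TILE block */
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        for (int k = kk; k < kk + TILE; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```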
Now that you are familiar with tiling this program, making the tile size parametric is an important precursor to the next step. Imagine performing all the explorations in the subsequent steps without parameterized code!
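A parametric version might look like the following sketch, again assuming the matrix multiplication kernel from above; the tile sizes ti, tj, tk are passed in as parameters (the names are my own), and the MIN guards handle problem sizes that are not multiples of the tile size.

```c
/* Sketch of the same kernel with the tile sizes as parameters. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void matmul_tiled_param(int N, int ti, int tj, int tk,
                        double A[N][N], double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += ti)
        for (int jj = 0; jj < N; jj += tj)
            for (int kk = 0; kk < N; kk += tk)
                /* MIN(...) handles the partial tiles at the boundaries */
                for (int i = ii; i < MIN(ii + ti, N); i++)
                    for (int j = jj; j < MIN(jj + tj, N); j++)
                        for (int k = kk; k < MIN(kk + tk, N); k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```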
In this step, you should find out how sensitive the performance of tiled code is to the tile sizes. The optimal tile size differs from machine to machine and across problem sizes. Although the performance as a function of tile size is usually continuous, there are often spikes that are difficult to explain. In addition, many people in the optimization world find this part very "fun", and I would be very glad if I was able to share some of that.
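One possible way to drive the exploration is a small harness that sweeps a range of tile sizes and times each run. The sketch below is just one way to set it up; the problem size, the set of tile sizes tried, and the use of omp_get_wtime are my choices, not the lab's reference harness.

```c
/* Hypothetical driver for the tile-size sweep; compile with -fopenmp
 * (omp_get_wtime is only used as a convenient timer). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

void matmul_tiled_param(int N, int ti, int tj, int tk,
                        double A[N][N], double B[N][N], double C[N][N]);

int main(void)
{
    const int N = 1024;
    double (*A)[N] = malloc(sizeof(double[N][N]));
    double (*B)[N] = malloc(sizeof(double[N][N]));
    double (*C)[N] = malloc(sizeof(double[N][N]));
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* try square tiles from 4x4x4 up to 256x256x256 */
    for (int t = 4; t <= 256; t *= 2) {
        memset(C, 0, sizeof(double[N][N]));
        double start = omp_get_wtime();
        matmul_tiled_param(N, t, t, t, A, B, C);
        printf("tile %3d x %3d x %3d : %.3f s\n",
               t, t, t, omp_get_wtime() - start);
    }
    free(A); free(B); free(C);
    return 0;
}
```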
As for the plotting, what I expect on many architectures is as follows.

Correctly predicting the right tile size is very difficult, as it depends heavily on both the program and the architecture. See the Wikipedia entry on CPU cache, especially the part about cache misses, or search for "cache conflict miss" and "cache interference miss" to find material about them.
This step is to go back to the permutation example that I showed at the beginning of the lecture. For most modern processors, it should give a significant performance improvement. Because the hardware prefetcher is able to hide almost all of the cache miss penalties for this code, the performance you get is very close to what you get by tiling with the optimal tile size.
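For reference, and assuming the lecture example is the usual i-k-j ordering of the same matrix multiplication, the permuted kernel would look like the sketch below; the innermost loop now walks B and C with stride 1, which is exactly the access pattern the hardware prefetcher handles well.

```c
/* Sketch of the loop permutation (i-k-j order), assuming the same
 * matrix multiplication kernel as above. */
void matmul_permuted(int N, double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)   /* stride-1 accesses to B and C */
                C[i][j] += A[i][k] * B[k][j];
}
```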
The expectation was that you would not be able to beat the performance of the permuted code, since finding the optimal tile size is genuinely difficult, and tiling also adds some overhead.
If I had stopped the lab at the previous step, some of you might have gotten the impression that tiling is useless. This misconception can easily be avoided by throwing in another complication: parallelism.
Adding parallelism into the mix further complicates the process of tile size selection. Some levels of cache are shared among the cores, adding more sources of interference. Moreover, the tile sizes now directly correspond to the granularity of parallelization, which influences load balancing. Finding a good set of tile sizes for varying numbers of threads is really challenging.
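One plausible way to parallelize the tiled kernel with OpenMP is to distribute the outermost tile loop across threads, as in the sketch below; the tile size along that dimension then directly sets the granularity of the parallel work. This is only one of several reasonable choices.

```c
/* Sketch of the parametrically tiled kernel parallelized over the
 * outermost tile loop; compile with -fopenmp. Each thread works on
 * distinct rows of C, so there is no race on the updates. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void matmul_tiled_omp(int N, int ti, int tj, int tk,
                      double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for
    for (int ii = 0; ii < N; ii += ti)
        for (int jj = 0; jj < N; jj += tj)
            for (int kk = 0; kk < N; kk += tk)
                for (int i = ii; i < MIN(ii + ti, N); i++)
                    for (int j = jj; j < MIN(jj + tj, N); j++)
                        for (int k = kk; k < MIN(kk + tk, N); k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```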
The code with loop permutation should not get much better (and may even get worse) after parallelization. This is because the permutation transformation does not really improve data locality; it only hides the cache miss penalty through prefetching. With more threads performing the computation, the prefetcher can no longer keep up with the volume of memory requests and becomes incapable of hiding the latencies.
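For the comparison, the permuted version can be parallelized in the same way, for example by simply distributing the outer i loop across threads (again assuming the i-k-j kernel sketched earlier):

```c
/* Permuted kernel with the outer loop distributed across threads. */
void matmul_permuted_omp(int N, double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```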
Tiling, on the other hand, actually improves temporal locality and reduces the bandwidth requirement. Therefore, it scales very well (with a large enough problem size) and will surpass the performance of the permuted code in the parallel setting.
Although I recommended setting OMP_NUM_THREADS to 2 during EJCP to keep things from becoming too complicated, you will see the difference between permutation and tiling much more clearly when using more threads and larger problem sizes.