Saturday, 21 November 2009

Accelerating numerical calculations with different compilers

Up until a few weeks ago I believed that it does not matter much whether numerical code is compiled as managed or unmanaged code. I have a friend who does a lot of iterative maths via a library written in Microsoft C++. I was quite sure this was sub-optimal because P/Invoke and the marshalling of data are expensive. So he went away and ran a comparison with a reference problem on one of our modelling servers that involved two huge for loops. The outcome was that, for this problem, the inline VB.NET code was around 1.5 times faster than the VB.NET code that called the Microsoft C++ library. So I was thinking: great, next step let's use F#, after all it's designed for scientists and parallelization is simple. Since my friend comes from the world of science, he did one more test with Fortran 90 running on a Linux operating system, and it was 100 times faster on a low-spec machine he had at home. This blew the idea of implementing this algorithm in managed code, be it F# or C#, completely out of the water. In fact, with MPICH (a Message Passing Interface implementation) running 4 nodes corresponding to the 4 cores of his CPU, we got a 400 times speed-up.
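To give a feel for the kind of code we were timing, here is a minimal, hypothetical stand-in for the reference problem (the real problem, sizes and data are my friend's and are not shown here): two big nested loops of plain array arithmetic, sketched in C++, which is exactly the sort of code where the compiler's auto-optimization dominates the run time.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical stand-in for the reference problem: two big nested loops
// of simple array maths. The actual benchmark was different; this only
// illustrates the loop structure being discussed.
void dense_mat_vec(const std::vector<double>& m,  // n*n matrix, row-major
                   const std::vector<double>& x,  // input vector, length n
                   std::vector<double>& y,        // output vector, length n
                   std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)        // first "huge" loop
    {
        double acc = 0.0;
        for (std::size_t j = 0; j < n; ++j)    // second "huge" loop
            acc += m[i * n + j] * x[j];        // plain array arithmetic
        y[i] = acc;
    }
}
```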

We investigated further and it turned out that the Fortran compiler had HPO (high-performance optimization) features that include intelligent analysis of the code to do vectorization and parallelization. We then downloaded the Intel C++ and Intel Fortran compilers, because these have similar compiler optimizations. It turned out that we got similar performance to the Linux version, which was a great relief because it eliminated the operating system as the deciding factor. The Intel Fortran 90 compiler was just slightly faster than the Intel C++ compiler. Apparently this is because C++ is based on pointers, which makes it difficult for the compiler to recognize patterns that can be safely optimized, whereas good old Fortran is designed for simple array maths. In fact, the name Fortran comes from FORmula TRANslation.
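The pointer issue is about aliasing: the compiler cannot prove that two pointers refer to different arrays, so it has to generate conservative code. Here is a small sketch of the idea; __restrict is a compiler extension (spelled slightly differently on some compilers, and not part of standard C++) that promises the arrays do not overlap, which is roughly the assumption Fortran gets for free on its array arguments.

```cpp
// The compiler must assume a, b and out might overlap, which blocks some
// vectorization and reordering.
void scale_add(const double* a, const double* b, double* out, int n, double x)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * x + b[i];
}

// Telling the compiler the arrays do not alias (non-standard __restrict)
// lets it vectorize the loop freely, much like a Fortran array operation.
void scale_add_norestrict_free(const double* __restrict a,
                               const double* __restrict b,
                               double* __restrict out, int n, double x)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * x + b[i];
}
```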

I visited the Intel stand at the PDC. They have a very good product that integrates seamlessly into Visual Studio, and a lot of tools designed for debugging concurrent programs. For example, there is a tool that highlights data races and deadlocks, and another, in a similar style to Visual Studio 2010, that enables stepping through code across multiple tasks. Parallel Composer, Parallel Inspector and Parallel Amplifier are very good tools for writing and debugging parallel C++ code. Just recently they launched a web site backed by a large-spec machine that can test your application on varying numbers of cores.
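For readers who have not run into one, this is the kind of data race such tools flag. The sketch below uses std::thread for brevity (so it assumes a C++11 compiler) rather than the Windows or TBB threading APIs; the race itself is the point.

```cpp
#include <thread>
#include <cstdio>

// Two threads do an unsynchronized read-modify-write on the same counter.
// Updates are lost non-deterministically: a classic data race that a
// race-detection tool will highlight.
static long counter = 0;

void worker()
{
    for (int i = 0; i < 1000000; ++i)
        ++counter;              // not atomic, no lock: the race is here
}

int main()
{
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    // Rarely prints 2000000; the result varies from run to run.
    std::printf("%ld\n", counter);
    return 0;
}
```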

Another nice thing about Intel C++ is that it is task based and, just like .NET 4.0, implements a work-stealing architecture that lessens the need for equally sized chunks of work. It also has a parallel_for construct that helps the compiler identify loops that can be parallelized. MPI is used for algorithms that have boundary conditions, such as finite element analysis, because the boundary conditions must be passed between the nodes via messages. MPI is also what is needed to go across machine boundaries.
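As a sketch of the task-based style, here is a parallel_for written with Intel Threading Building Blocks (the lambda form assumes a C++11-capable compiler; the Intel compiler's own parallel constructs differ in syntax but follow the same idea). The range is split into chunks, each chunk becomes a task, and idle worker threads steal tasks from busy ones, so the pieces of work do not have to be equally sized.

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

// Scale every element of a vector in parallel; TBB chooses the chunking
// and its scheduler steals work between threads as needed.
void scale(std::vector<double>& data, double factor)
{
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const tbb::blocked_range<std::size_t>& r)
        {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= factor;
        });
}
```

And here is a minimal sketch of what passing boundary conditions between MPI nodes looks like in a one-dimensional case (the function and variable names are mine, purely for illustration): each rank owns a slab of the domain plus one "ghost" cell at each end, and swaps edge values with its neighbours every iteration.

```cpp
#include <mpi.h>
#include <vector>

// Exchange ghost cells with left and right neighbours. Ranks at the ends
// use MPI_PROC_NULL, which turns the corresponding transfer into a no-op.
void exchange_ghost_cells(std::vector<double>& slab, int rank, int nranks)
{
    const int n     = static_cast<int>(slab.size());
    const int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    const int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send my leftmost interior cell left, receive my left ghost cell.
    MPI_Sendrecv(&slab[1],     1, MPI_DOUBLE, left,  0,
                 &slab[0],     1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // Send my rightmost interior cell right, receive my right ghost cell.
    MPI_Sendrecv(&slab[n - 2], 1, MPI_DOUBLE, right, 0,
                 &slab[n - 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```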

At the keynote of the PDC a comparison was made between CPUs and GPUs. The GPU is a lot bigger than the CPU and contains many more cores (32 or 64). With hardware threading you have around 1000 threads to play with. This can give 1 TFLOP of performance, which is similar to a Cray from 5 years ago. It's funny to think all this power was developed for gaming. With DirectX it is possible to use this functionality: there is a compiler and a special language that produces an intermediate language in a .o file, which is then compiled into hardware-specific machine code by a just-in-time compiler. The coding is somewhat cryptic but could be worth it.

With such high performance it is possible to do other types of analysis, such as Monte Carlo simulations, and Wall Street has taken a big interest in this. Therefore it will not be long until there are multiple GPUs on one server. Intel is creating a new CPU that has a GPU built in. This would be ideal because the Intel compilers will probably take advantage of it without the programmer having to think too much about it. The clock speed of a GPU is much slower than a CPU's, and it also generates less heat. One downside is that you must be very careful when designing data-centric algorithms, because for every GPU cycle that is missed more than a thousand operations are lost; in other words, a massive slow-down. One more thing to be careful about is that not all GPUs support double precision. The reason is that games producers don't need that level of precision, so to save costs they settle for single precision.
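The double-precision caveat can be checked programmatically. Here is a minimal sketch using the Direct3D 11 API (Windows only, needs the DirectX headers; error handling kept to a bare minimum) that asks the device whether its shaders support double-precision operations:

```cpp
#include <d3d11.h>
#include <cstdio>
#pragma comment(lib, "d3d11.lib")

// Create a hardware Direct3D 11 device and query the DOUBLES feature.
// Many consumer gaming cards report "no", in which case a numerical code
// has to fall back to single precision.
int main()
{
    ID3D11Device* device = nullptr;
    HRESULT hr = D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                   nullptr, 0, D3D11_SDK_VERSION,
                                   &device, nullptr, nullptr);
    if (FAILED(hr) || device == nullptr)
    {
        std::printf("No Direct3D 11 hardware device available.\n");
        return 1;
    }

    D3D11_FEATURE_DATA_DOUBLES doubles = {};
    device->CheckFeatureSupport(D3D11_FEATURE_DOUBLES, &doubles, sizeof(doubles));
    std::printf("Double precision shader ops supported: %s\n",
                doubles.DoublePrecisionFloatShaderOps ? "yes" : "no");

    device->Release();
    return 0;
}
```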

I also looked into the Microsoft HPC Server. This is a grid computing infrastructure comprising a scheduler and compute nodes. A broker can be used with WCF to give things an SOA flavour. It is a low-cost grid that seems to be very flexible. The example given was an Excel spreadsheet that calculated derivatives; with HPC the expensive calculations can be spread across machines and cores.

Cross-computer parallelization has different considerations than just multi-core. For example, in multi-core programming you need to pay attention to cache-invalidation problems as you iterate through integer arrays, as in the sketch below. Cross-computer computing has latency, therefore you need to be careful not to chop the work up into tasks too finely. On the other hand, you get isolation. There is also a performance consideration concerning passing data greater than 64 KB to the HPC Server: serializing this data is slow, which means you should instead send a reference to where the data is stored.
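Here is a minimal sketch of that cache-invalidation (false sharing) problem, assuming a 64-byte cache line and a C++11 compiler. When two threads hammer counters that sit next to each other in memory, the shared cache line ping-pongs between the cores; the usual fix, shown here, is to pad or align each counter onto its own line.

```cpp
#include <thread>
#include <cstdio>

// alignas(64) puts each counter on its own (assumed 64-byte) cache line.
// Without it the two longs would sit in the same line and the threads
// would invalidate each other's caches on every increment.
struct alignas(64) PaddedCounter
{
    long value;
};

int main()
{
    PaddedCounter counters[2] = {};

    auto work = [](long* c)
    {
        for (int i = 0; i < 50000000; ++i)
            ++(*c);             // each thread touches only its own counter
    };

    std::thread t1(work, &counters[0].value);
    std::thread t2(work, &counters[1].value);
    t1.join();
    t2.join();

    std::printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```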