Thursday 9 June 2011

Parallel.For seems to have a large overhead compared to a simpler implementation

I implemented the following alternative to the parallel for using threads:

        static void MyParallelFor(int fromInclusive, int toExclusive, Action<int> body)
        {
            int numProcs = Environment.ProcessorCount;
            using (CountdownEvent ce = new CountdownEvent(numProcs))
            {
              int rangeSize = (toExclusive - fromInclusive) / numProcs;
              for (int p = 0; p< numProcs; p++)
              {
                  int start = rangeSize * p;
                 //int end = start + rangeSize;
                 int end = p == numProcs-1 ? toExclusive : start + rangeSize;
                 ThreadPool.QueueUserWorkItem(delegate
                   {
                      for (int i = start; i < end; i++) body(i);
                      ce.Signal();
                    });
              }
              ce.Wait();
            }
        }

I then tested this on the benchmark earthquake model as described in my last blog entry. Since the signature of the parallel for and myParallelFor are identical the comparison can be made by commenting out the implementation that is not under test.

            for (i = 1; i <= ns; i++) // Outer loop going through lots of lines of exposure
            {
                loss[i] = 0;

                    //Parallel.For(1, nr,
                    MyParallelFor(1, nr, // Inner loop going through many events
                         (jj) =>
                         {
                            // Some maths to determine the mean damage ratio (see previous blog entry for details)
                            localsum[jj] = mdr * value[jj];
                         }

                         });
                 loss[i] = localsum.Sum();
            }

I then ran the above code on a physical machine with ProLiant DL580 G5 12 Cores 2.66 GHz Intel Xeon(R) X7460 with 8188 MB RAM

Parallel.For took about 114 seconds with all 12 CPUs running evenly at about 70%.
MyParallelFor took about 71 seconds with 11 CPUs at 30% and 1 CPU at 80%,

I found this difference quite surprising as it seems that the parallel.for implementation in System.Threading.Tasks seems to have a very large overhead. Not only does it take longer but the cpu’s are working harder. Using reflector I could see that the implementation of private static ParallelLoopResult ForWorker was quite extensive. When I have some time I will add some buffering in the localsum to check whether the problem is caused by cache invalidation.