MPI (the Message Passing Interface, of which MPICH2 is a widely used implementation) is a de facto industry standard for parallel numeric algorithms. Although it has been around for about 20 years and has a very strong following in the scientific community, the average software engineer has not heard of it. This is a shame because, although the programming model is a little low level, it is very powerful and provides a great way to make use of multi-core or even distributed grid computing. There are implementations of MPI for a wide variety of languages, ranging from R to Fortran.
I have looked into MPI.NET, an implementation of MPI that works with .NET, from the Open Systems Lab, Pervasive Technology Labs at Indiana University: http://www.osl.iu.edu/research/mpi.net/
The Software Development Kit for MPI.NET needs Windows Compute Cluster Server and can run on Windows HPC Server 2008, Windows XP, or Windows Vista. An MPI program is then started from the command line with mpiexec.exe, which runs it within the framework of the Compute Cluster Server. Here is an example from the SDK:
C:\Program Files\MPI.NET>"C:\Program Files\Microsoft Compute Cluster Pack\Bin\mpiexec.exe" -n 8 PingPong.exe
Rank 0 is alive and running on chzuprel227.global.partnerre.net    
Pinging process with rank 1... Pong!    
  Rank 1 is alive and running on chzuprel227.global.partnerre.net    
Pinging process with rank 2... Pong!    
  Rank 2 is alive and running on chzuprel227.global.partnerre.net    
Pinging process with rank 3... Pong!    
  Rank 3 is alive and running on chzuprel227.global.partnerre.net    
Pinging process with rank 4... Pong!    
  Rank 4 is alive and running on chzuprel227.global.partnerre.net    
Pinging process with rank 5... Pong!    
  Rank 5 is alive and running on chzuprel227.global.partnerre.net    
Pinging process with rank 6... Pong!    
  Rank 6 is alive and running on chzuprel227.global.partnerre.net    
Pinging process with rank 7... Pong!    
  Rank 7 is alive and running on chzuprel227.global.partnerre.net 
I carried out a test that proved that all cores of my laptop were being used. There may be a way to call this programmatically via C:\Program Files\Microsoft Compute Cluster Pack\Bin\ccpapi.dll. I have seen a .NET-based and a SOA-based API for this within the HPC Server.
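Short of using ccpapi.dll, one simple way to start an MPI run from code is to launch mpiexec.exe as a child process. The following is only a minimal sketch of that alternative, not the Compute Cluster API: it relies solely on System.Diagnostics.Process, the mpiexec path and PingPong.exe are taken from the command line above, and the class name MpiLauncher and its BuildArguments helper are hypothetical.

```csharp
using System;
using System.Diagnostics;

class MpiLauncher
{
    // Builds the mpiexec argument string: -n <count> "<program>"
    public static string BuildArguments(int processCount, string program)
    {
        return "-n " + processCount + " \"" + program + "\"";
    }

    static int Main(string[] args)
    {
        // Path taken from the command line shown above; adjust to the local installation.
        string mpiexec = @"C:\Program Files\Microsoft Compute Cluster Pack\Bin\mpiexec.exe";

        ProcessStartInfo info = new ProcessStartInfo(mpiexec, BuildArguments(8, "PingPong.exe"));
        info.UseShellExecute = false;        // required for output redirection
        info.RedirectStandardOutput = true;

        using (Process process = Process.Start(info))
        {
            // Read the output before WaitForExit so a full pipe cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            Console.Write(output);
            return process.ExitCode;
        }
    }
}
```

This only wraps the command line; it gives no access to job scheduling, which is presumably what the cluster API is for.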
There are one or two things you must not do. For example, the call
Console.SetWindowSize(60, 30);
produced this error message:
Unhandled Exception: System.ArgumentOutOfRangeException: The console buffer size must not be less than the current size and position of the console window, nor greater than or equal to Int16.MaxValue.
Parameter name: height    
Actual value was 30.    
   at System.Console.SetBufferSize(Int32 width, Int32 height)    
   at EQ.Japan.Program.Main(String[] args) in C:\Data\Work\EXposure+IT\RefactoringAndOprimization\CodePerformanceAnalysis\BenchmarkCode\VB_MPI\ConsoleApplication1\Program.cs:line 26 
I therefore commented the call out:
            //Console.SetWindowSize(60, 30);
I refactored some code to include
private static void MyCalculation(int fromInclusive, int toExclusive, ref double[] loss)
This is called in the following way:
                int fromInclusive = 1;
                int toExclusive = 194510;
                int MaxIntervals = 194510;
                // Split the range evenly across all MPI processes.
                int StepSize = MaxIntervals / Communicator.world.Size;
                int StepCount = MaxIntervals / StepSize;
                double[] loss = new double[0x2f7cf];    // 0x2f7cf = 194511 = MaxIntervals + 1
                Console.WriteLine("StepCount = {0} ", StepCount);
                // Ranks 0 .. StepCount-2 each handle one full step.
                for (int z = 0; z < StepCount - 1; z++)
                {
                    if (Communicator.world.Rank.Equals(z))
                    {
                        fromInclusive = 1 + z * StepSize;
                        toExclusive = 1 + (z + 1) * StepSize;
                        MyCalculation(fromInclusive, toExclusive, ref loss);
                    }
                }
                // Rank StepCount-1 takes the rest of the range up to MaxIntervals.
                if (Communicator.world.Rank.Equals(StepCount - 1))
                {
                    fromInclusive = 1 + (StepCount - 1) * StepSize;
                    toExclusive = MaxIntervals;
                    MyCalculation(fromInclusive, toExclusive, ref loss);
                }
                // Placeholder: rank 0 should gather the loss[] results here.
                if (Communicator.world.Rank.Equals(0))
                {
                }
This worked well, but I had difficulty gathering the loss[] results from the individual processes.
On the MPI.NET web site there are a number of examples of how such aggregations can be made. Here is one of them, which demonstrates the power MPI has for tightly coupled calculations:
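The splitting of the interval range across ranks can be factored into a small helper, so that each process computes its own bounds directly from its rank instead of looping over all ranks. This is a sketch under assumptions: the names BlockPartition and GetRange are hypothetical and not part of MPI.NET, and the last rank simply takes whatever remains, so no intervals are lost when the range does not divide evenly by the number of processes.

```csharp
using System;

static class BlockPartition
{
    // Returns the half-open range [fromInclusive, toExclusive) for one rank
    // when the values first .. endExclusive-1 are split across `size` ranks.
    public static void GetRange(int first, int endExclusive, int rank, int size,
                                out int fromInclusive, out int toExclusive)
    {
        int count = endExclusive - first;
        int step = count / size;                 // whole intervals per rank
        fromInclusive = first + rank * step;
        // The last rank also takes the remainder of the integer division.
        toExclusive = (rank == size - 1) ? endExclusive : fromInclusive + step;
    }
}

class PartitionDemo
{
    static void Main()
    {
        int size = 8;  // stands in for Communicator.world.Size
        for (int rank = 0; rank < size; rank++)
        {
            int from, to;
            BlockPartition.GetRange(1, 194510, rank, size, out from, out to);
            Console.WriteLine("Rank {0}: [{1}, {2})", rank, from, to);
        }
    }
}
```

Inside the MPI program each process would call GetRange once, passing Communicator.world.Rank and Communicator.world.Size, and hand the result to MyCalculation.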
using System;   
using MPI; 
class Pi   
{    
    static void Main(string[] args)    
    {    
        int dartsPerProcessor = 10000;    
        using (new MPI.Environment(ref args))    
        {    
            if (args.Length > 0)    
                dartsPerProcessor = Convert.ToInt32(args[0]);    
            Intracommunicator world = Communicator.world;                                        // <<<<<<<    
            Random random = new Random(5 * world.Rank);    
            int dartsInCircle = 0;    
            for (int i = 0; i < dartsPerProcessor; ++i)    
            {    
                double x = (random.NextDouble() - 0.5) * 2;    
                double y = (random.NextDouble() - 0.5) * 2;    
                if (x * x + y * y <= 1.0)    
                    ++dartsInCircle;    
            } 
            if (world.Rank == 0)   
            {    
                int totalDartsInCircle = world.Reduce<int>(dartsInCircle, Operation<int>.Add, 0);  // <<<<<<<<    
                System.Console.WriteLine("Pi is approximately {0:F15}.",     
                    4*(double)totalDartsInCircle/(world.Size*(double)dartsPerProcessor));    
            }    
            else    
            {    
                world.Reduce<int>(dartsInCircle, Operation<int>.Add, 0);                           // <<<<<<<< 
            }   
        }    
    }    
} 
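The same collective call might also be applied to the loss[] problem. Assuming each process fills only its own slice of loss and leaves every other element at zero, an element-wise sum reassembles the complete array at rank 0. The sketch below relies on the array overload of Reduce as I understand it from the MPI.NET documentation, so the signature should be checked against the installed version; FillSlice is a hypothetical stand-in for MyCalculation and its values are made up.

```csharp
using System;
using MPI;

class ReduceLoss
{
    // Hypothetical stand-in for MyCalculation: writes only indices [from, to).
    static void FillSlice(int from, int to, double[] loss)
    {
        for (int i = from; i < to; i++)
            loss[i] = i * 0.5;
    }

    static void Main(string[] args)
    {
        using (new MPI.Environment(ref args))
        {
            Intracommunicator world = Communicator.world;
            const int maxIntervals = 194510;
            double[] loss = new double[maxIntervals + 1];   // zero-initialised

            // Each rank works on its own block; the last rank takes the remainder.
            int step = maxIntervals / world.Size;
            int from = 1 + world.Rank * step;
            int to = (world.Rank == world.Size - 1) ? maxIntervals : from + step;
            FillSlice(from, to, loss);

            // Collective call: every rank must take part, including the root.
            // Assumed overload: T[] Reduce<T>(T[] values, ReductionOperation<T> op, int root).
            double[] total = world.Reduce<double>(loss, Operation<double>.Add, 0);

            // Only the root uses the combined array.
            if (world.Rank == 0)
                Console.WriteLine("loss[1] = {0}, loss[{1}] = {2}",
                    total[1], maxIntervals - 1, total[maxIntervals - 1]);
        }
    }
}
```

The trade-off is that every process sends the full array, zeros included, so this is convenient rather than bandwidth-efficient.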
My next attempt looked like this:
                for (int z = 0; z < StepCount -1 ; z++)   
                {    
                    if (Communicator.world.Rank.Equals(z))    
                    {    
                        fromInclusive = 1 + z * StepSize;    
                        toExclusive = 1 + (z+1) * StepSize;    
                        MyCalculation(fromInclusive, toExclusive,ref loss);    
                    }    
                }    
                if (Communicator.world.Rank.Equals(StepCount-1))    
                {    
                    fromInclusive = 1 + (StepCount-1) * StepSize;    
                    toExclusive = MaxIntervals;    
                    MyCalculation(fromInclusive, toExclusive, ref loss); 
                }
                Intracommunicator world = Communicator.world;   
                world.Send<Double[]>(loss, 0,0); 
                if (world.Rank == 0)   
                {    
                    System.Diagnostics.Debugger.Launch(); 
                    for (int z = 0; z < StepCount - 1; z++)   
                    {    
                        Double[] test = world.Receive<Double[]>(1, 0);    
                    }    
                } 
But this produced the following error:
System.Exception was unhandled   
  Message="Other MPI error, error stack:\nMPI_Send(172): MPI_Send(buf=0x0442EB1C, count=1, dtype=USER<struct>, dest=0, tag=0, MPI_COMM_WORLD) failed\nMPID_Send(51): DEADLOCK: attempting to send a message to the local process without a prior matching receive"    
  Source="MPI"    
  StackTrace:    
       at MPI.Communicator.Send[T](T value, Int32 dest, Int32 tag)    
       at MPI.Communicator.Send[T](T value, Int32 dest, Int32 tag)    
       at EQ.Japan.Program.Main(String[] args) in C:\Data\Work\EXposure+IT\RefactoringAndOprimization\CodePerformanceAnalysis\BenchmarkCode\VB_MPI\ConsoleApplication1\Program.cs:line 120    
       at System.AppDomain._nExecuteAssembly(Assembly assembly, String[] args)    
       at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)    
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()    
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)    
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)    
       at System.Threading.ThreadHelper.ThreadStart()    
  InnerException: 
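The mistake can be read from the error message: every process, including rank 0, calls Send with destination 0, so rank 0 attempts a blocking send to itself before it has posted a matching receive. One process must be dedicated to gathering the data, and that process must receive rather than send. The receive loop also always reads from rank 1 instead of from each sender in turn.
So my next attempt just tested how to send and receive data, without splitting the work by the number of processes. The example below works. It needs at least 2 processes: rank 1 does the calculation and rank 0 receives the result. With 2 processes the calculation is therefore still serial; only with 3 or more processes, one gathering and the others computing, could the work start to run in parallel. Note that, as written, any rank above 1 would also send its (empty) array to rank 0, which never receives it.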
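A corrected version of this pattern might look like the sketch below: only ranks other than 0 send, and rank 0 receives exactly once from each of them. It uses only the Send and Receive calls that already appear above. FillSlice is a hypothetical stand-in for MyCalculation, and having rank 0 do no calculation of its own is an assumption of this sketch, not a requirement of MPI.

```csharp
using System;
using MPI;

class GatherLoss
{
    // Hypothetical stand-in for MyCalculation: writes only indices [from, to).
    static void FillSlice(int from, int to, double[] loss)
    {
        for (int i = from; i < to; i++)
            loss[i] = 1.0;
    }

    static void Main(string[] args)
    {
        using (new MPI.Environment(ref args))
        {
            Intracommunicator world = Communicator.world;
            if (world.Size < 2)
            {
                Console.WriteLine("Run with at least 2 processes.");
                return;
            }

            const int maxIntervals = 194510;
            int workers = world.Size - 1;            // ranks 1 .. Size-1 compute
            int step = maxIntervals / workers;

            if (world.Rank == 0)
            {
                // Rank 0 only gathers: one Receive per worker, each from a different source.
                double[] loss = new double[maxIntervals + 1];
                for (int source = 1; source <= workers; source++)
                {
                    double[] part = world.Receive<double[]>(source, 0);
                    int from = 1 + (source - 1) * step;
                    Array.Copy(part, 0, loss, from, part.Length);
                }
                Console.WriteLine("Gathered results from {0} workers.", workers);
            }
            else
            {
                // Workers compute their slice and send only that slice to rank 0.
                int from = 1 + (world.Rank - 1) * step;
                int to = (world.Rank == workers) ? maxIntervals : from + step;
                double[] part = new double[to - from];
                FillSlice(0, part.Length, part);
                world.Send<double[]>(part, 0, 0);
            }
        }
    }
}
```

Sending only the slice keeps the messages small, at the cost of rank 0 having to know where each slice belongs.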
                int fromInclusive = 1;   
                int toExclusive = 194510;    
                int MaxIntervals = 194510;    
                int StepSize = MaxIntervals / Communicator.world.Size;    
                int StepCount = MaxIntervals / StepSize;    
                double[] loss = new double[0x2f7cf];    
                if (Communicator.world.Rank.Equals(1))
                {
                    fromInclusive = 1;
                    toExclusive = 194510;
                    MyCalculation(fromInclusive, toExclusive, ref loss);
                }
                Intracommunicator world = Communicator.world; 
                if (world.Rank == 0)   
                { 
                    Double[] test = world.Receive<Double[]>(1, 0);   
                    System.Diagnostics.Debugger.Launch(); 
                }   
                else    
                {    
                    world.Send<Double[]>(loss, 0, 0);    
                }
These are my first experiments with MPI. As I mentioned earlier, MPI has been around a long time, and there are communities out there that solved these kinds of problems long ago. So my next steps will be to look at the tutorials on the MPI.NET web site and on http://math.acadiau.ca/ACMMaC/Rmpi/.
In addition, I will be looking into the new parallel programming models that will be available with .NET 4.0. These focus on raising the level of abstraction so that the programmer does not need to worry about threads and locks; those details are abstracted away by the programming model.
