Center for High Performance Computing (CHPC)

A (very) Brief Introduction to MPI: C example

We'll start with the single processor code:

#include <stdio.h>
#include <stdlib.h>

int main() {

  long int npts = 1e10;

  long int i;

  double f,sum;
  double xmin,xmax,x;

  sum = 0.0;
  xmin = 0.0;
  xmax = 1.0;

  for (i=0; i<npts; i++) {
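    /* sample x uniformly in [xmin,xmax] and accumulate 4/(1+x*x); the average
       approximates the integral of 4/(1+x*x) from 0 to 1, which equals pi    */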
    x = (double) rand()/RAND_MAX*(xmax-xmin) + xmin;
    sum += 4.0/(1.0 + x*x);
  }
  f = sum/npts;

  printf("PI calculated with %ld points = %f \n",npts,f);

}

We can compile this code with the command:
[mtobias@login001 C]$ icc monte_carlo.c
Note: We need to compile on the login nodes (as opposed to the compute nodes). To get consistent timings with the MPI runs, I'm going to run everything on a compute node by requesting an interactive job:
[mtobias@login001 C]$ qsub -I -lnodes=1:ppn=8,walltime=8:00:00
[mtobias@node121 C]$ time ./a.out
PI calculated with 10000000000 points = 3.141594

real 2m57.426s
user 2m57.421s
sys 0m0.008s

The simplest way to parallelize this program is to divide up the sampled points among all the processors. After each processor has computed its partial sum, we'll add the results together. It is important to remember that each processor will run the same piece of code.

I'll now present an MPI version of the code in its entirety, and then go through some of the differences between it and the single-processor code:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {

  int myid,nprocs;

  long int npts = 1e10;

  long int i,mynpts;

  double f,sum,mysum;
  double xmin,xmax,x;

  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);

  if (myid == 0) {
    mynpts = npts - (nprocs-1)*(npts/nprocs);
  } else {
    mynpts = npts/nprocs;
  }

  mysum = 0.0;
  xmin = 0.0;
  xmax = 1.0;

  srand(myid);

  for (i=0; i<mynpts; i++) {
    x = (double) rand()/RAND_MAX*(xmax-xmin) + xmin;
    mysum += 4.0/(1.0 + x*x);
  }

  MPI_Reduce(&mysum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  
  if (myid == 0) {
    f = sum/npts;
    printf("PI calculated with %ld points = %f \n",npts,f);
  }

  MPI_Finalize();


}

The first thing we need to do is include the MPI C header file:

#include "mpi.h"

We define some integers:

 int myid,nprocs;

myid is a unique identifier (the rank) that tells each process which member of the set it is, and nprocs is the total number of processes.

 
long int i,mynpts;

We'll use mynpts to keep track of the number of points to be sampled by a particular process.

 
double f,sum,mysum;

mysum is each process's contribution to the integral.

  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);

These three MPI calls are found in almost all MPI programs. The first routine initializes MPI, the second allows each process to know the total number of processes being used in the run (determined at run time) and the third allows each process to determine its unique id.
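
In fact, together with MPI_Finalize these three calls already make a complete (if not very useful) MPI program. Purely as an illustration, here is a minimal "hello world" sketch in which every process reports its rank and the total process count:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {

  int myid,nprocs;

  MPI_Init(&argc,&argv);                   /* start up the MPI environment    */
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);   /* how many processes in this run? */
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);     /* which process am I?             */

  printf("Hello from process %d of %d\n",myid,nprocs);

  MPI_Finalize();                          /* shut MPI down cleanly           */

}

Back to the Monte Carlo program: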

  if (myid == 0) {
    mynpts = npts - (nprocs-1)*(npts/nprocs);
  } else {
    mynpts = npts/nprocs;
  }

Here is where we divvy up the work (also called data decomposition). Since the total number of points may not be evenly divisible by the number of processors, we'll give the extra work to processor 0.
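
To make the arithmetic concrete, here is a small standalone sketch with made-up numbers (npts = 10 and nprocs = 3, not the values used above): integer division gives npts/nprocs = 3, so processes 1 and 2 sample 3 points each while process 0 picks up the remainder, 10 - 2*3 = 4 points.

#include <stdio.h>

int main() {

  long int npts = 10;    /* made-up values, purely to illustrate the split */
  int nprocs = 3;
  int myid;
  long int mynpts;

  for (myid=0; myid<nprocs; myid++) {
    if (myid == 0) {
      mynpts = npts - (nprocs-1)*(npts/nprocs);   /* 10 - 2*3 = 4 */
    } else {
      mynpts = npts/nprocs;                       /* 10/3 = 3     */
    }
    printf("process %d samples %ld points\n",myid,mynpts);
  }

}

Back in the MPI code, the next line seeds the random number generator: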

srand(myid);

This is a crucial step: we need to seed the random number generator in each process with a unique seed. If every process used the same seed, they would all generate the identical sequence of points and simply duplicate each other's work.
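
The rank alone is enough to give every process its own stream. If you also wanted the sampled points to change from run to run, one common variant (my own suggestion here, not something the code above does) is to mix the wall-clock time into the per-rank seed; this is a drop-in replacement for the srand line, assuming <time.h> is included at the top of the file:

  srand((unsigned int) time(NULL) + 100000u*(unsigned int) myid);  /* hypothetical variant */

Either way, the important point is that no two processes end up with the same seed.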

  for (i=0; i<mynpts; i++) {
    x = (double) rand()/RAND_MAX*(xmax-xmin) + xmin;
    mysum += 4.0/(1.0 + x*x);
  }

Whereas before we did the sum over all points, now each process does a partial sum over its own points.

  MPI_Reduce(&mysum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);

This MPI routine takes every process's value of mysum, adds them together, and places the result in sum on processor 0.
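
For reference, here is the same call written out with each argument annotated:

  MPI_Reduce(&mysum,           /* send buffer: this process's partial sum          */
             &sum,             /* receive buffer: only meaningful on the root      */
             1,                /* number of elements to reduce                     */
             MPI_DOUBLE,       /* datatype of each element                         */
             MPI_SUM,          /* reduction operation (MPI_MAX, MPI_MIN, ... also exist) */
             0,                /* rank of the root process that receives the result */
             MPI_COMM_WORLD);  /* communicator over which to reduce                */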

  if (myid == 0) {
    f = sum/npts;
    printf("PI calculated with %ld points = %f \n",npts,f);
  }

The final calculation is done by process 0, since it is the only process that knows the value of sum, and we only want one copy of the result printed out.
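
As an aside, if every process (rather than just process 0) needed the final sum, MPI_Allreduce could be used in place of MPI_Reduce. It performs the same reduction but delivers the result to every rank, so it takes no root argument:

  MPI_Allreduce(&mysum,&sum,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);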

  MPI_Finalize();

Lastly, we allow MPI to clean up after itself.

 

That's it! We're now ready to compile the code:
[mtobias@login001 C]$ icc monte_carlo_mpi.c -I/export/openmpi-1.4.3/include/ -L/export/openmpi-1.4.3/lib -lmpi
/export/intel/Compiler/12.0/lib/intel64//libimf.so: warning: warning: feupdateenv is not implemented and will always fail

Notice that we need to link against the MPI C library (mpi) as well as tell the compiler where to find the MPI headers and libraries. Feel free to ignore the warning.

At this point I should mention that we have many different MPI implementations installed on our systems; I'm using the default implementation, OpenMPI.
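
As an aside, OpenMPI (like most MPI implementations) also ships a compiler wrapper, mpicc, which supplies the include and library flags for you. Assuming the wrapper for this installation is on your PATH, the compile line would look something like:
[mtobias@login001 C]$ mpicc monte_carlo_mpi.c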

All MPI executables must be launched with the mpirun command, where we specify the number of processes via the -np flag:
[mtobias@node121 C]$ time mpirun -np 1 ./a.out
PI calculated with 10000000000 points = 3.141594

real 3m8.316s
user 3m7.796s
sys 0m0.135s

Not surprisingly, the single-process MPI run takes about as long as the serial code.
[mtobias@node121 C]$ time mpirun -np 2 ./a.out
PI calculated with 10000000000 points = 3.141585

real 1m34.976s
user 3m7.784s
sys 0m0.049s

[mtobias@node121 C]$ time mpirun -np 4 ./a.out
PI calculated with 10000000000 points = 3.141588

real 0m48.033s
user 3m7.772s
sys 0m0.082s
[mtobias@node121 C]$ time mpirun -np 8 ./a.out
PI calculated with 10000000000 points = 3.141590

real 0m24.877s
user 3m7.976s
sys 0m0.485s

Every time we double the number of processes, the runtime is nearly cut in half (roughly 95, 48, and 25 seconds on 2, 4, and 8 processes, versus about 188 seconds for the single-process MPI run), giving near-perfect scaling. The reason is that the Monte Carlo algorithm is ideally suited to this kind of parallelization: the processes work independently and communicate only once, in the final reduction. To put it another way, the problem is "trivially parallelizable".