
Center for High Performance Computing (CHPC)

 

A (very) Brief Introduction to OpenMP: C example

OpenMP compiler directives in C take the form:

#pragma omp

It is important to note that the directive applies to the statement that follows it, which must be a structured block.

Let's just jump right in and consider a simple "hello world" program with a simple OpenMP directive.

#include <stdio.h>
#include <stdlib.h>

int main() {
#pragma omp parallel
  {
    printf("hello world\n");
  }
  exit(0);
}

 In this case, the directive

#pragma omp parallel

indicates that the following block is to be executed in parallel. At the beginning of the block, execution forks into a team of threads; at the end of the block, the threads join back up with the master thread.

When we compile this code without any special compiler options, the OpenMP compiler directives will be ignored as comments:

[mtobias@login001 C]$ icc -w hello_omp.c
[mtobias@login001 C]$ ./a.out
hello world

For the compiler to recognize the OpenMP compiler directives, we must use the "-openmp" compiler flag:

[mtobias@login001 C]$ icc -w -openmp hello_omp.c
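
If you want to check whether OpenMP was actually enabled when the program was built, the standard _OPENMP preprocessor macro is defined only when the compiler honors the directives. A minimal sketch (the messages are just for illustration):

#include <stdio.h>

int main() {
  /* _OPENMP is defined by the compiler only when OpenMP support is turned on */
#ifdef _OPENMP
  printf("compiled with OpenMP (version macro = %d)\n", _OPENMP);
#else
  printf("compiled without OpenMP; the directives were ignored\n");
#endif
  return 0;
}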

Note that we still haven't specified how many threads should be created. One way to do this is with the OMP_NUM_THREADS environment variable.

[mtobias@login001 C]$ export OMP_NUM_THREADS=1
[mtobias@login001 C]$ ./a.out
hello world
[mtobias@login001 C]$ export OMP_NUM_THREADS=2
[mtobias@login001 C]$ ./a.out
hello world
hello world
[mtobias@login001 C]$ export OMP_NUM_THREADS=4
[mtobias@login001 C]$ ./a.out
hello world
hello world
hello world
hello world


Note that one thread may print its "hello world" message before another thread has finished printing its newline, so the output may be interleaved unpredictably.  Also, there is no need to recompile the code to change the number of threads.
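
Besides the OMP_NUM_THREADS environment variable, the thread count can also be requested from inside the program, either with the omp_set_num_threads() library call or with a num_threads clause on the parallel directive. A small stand-alone sketch (not part of the hello world example above):

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_num_threads(2);               /* request 2 threads for later parallel regions */
#pragma omp parallel
  {
    printf("hello from the team of 2\n");
  }

#pragma omp parallel num_threads(4)     /* or request a team size for this region only */
  {
    printf("hello from the team of 4\n");
  }
  return 0;
}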

When debugging threaded code, it's often useful to identify what each thread is doing. The OpenMP library provides the functions omp_get_num_threads() and omp_get_thread_num() to return the total number of threads and the individual thread number, respectively. Consider the code:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  int my_id, nthreads;
#pragma omp parallel private(my_id)
  {
    my_id = omp_get_thread_num();
    printf("my_id = %d\n", my_id);

    nthreads = omp_get_num_threads();

    if (my_id == 0)
      printf("nthreads = %d\n", nthreads);
  }
  exit(0);
}

Notice that when the parallel region is declared, I list the variable my_id as private. That is because it will be unique to each thread. Variables can be either private or shared. When printing out the total number of threads, we'll just have thread 0 print the output. Thread ids begin with 0 and run to N-1 for N threads. (A version of this example with explicit data-sharing clauses is sketched after the output below.)

[mtobias@login001 C]$ export OMP_NUM_THREADS=1
[mtobias@login001 C]$ ./a.out
my_id = 0
nthreads = 1
[mtobias@login001 C]$ export OMP_NUM_THREADS=2
[mtobias@login001 C]$ ./a.out
my_id = 0
my_id = 1
nthreads = 2

There is no guarantee as to the order in which the threads will execute:

[mtobias@login001 C]$ ./a.out
my_id = 1
my_id = 0
nthreads = 2
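
To make the private/shared distinction explicit, you can also add a default(none) clause so the compiler forces you to say how every variable in the region is treated. Here is a minimal sketch of the same example with explicit data-sharing clauses (not part of the original code above):

#include <omp.h>
#include <stdio.h>

int main() {
  int my_id, nthreads = 0;

  /* default(none) requires every variable used in the region to be listed explicitly */
#pragma omp parallel default(none) private(my_id) shared(nthreads)
  {
    my_id = omp_get_thread_num();          /* each thread gets its own copy of my_id */

    if (my_id == 0)
      nthreads = omp_get_num_threads();    /* only thread 0 writes the shared variable */

    printf("my_id = %d\n", my_id);
  }
  printf("nthreads = %d\n", nthreads);
  return 0;
}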

Now let's consider the 3D scalar wave example. As the code is fairly long, I'll just link to it: 3Dwave.c. Without going into too much detail, the code evolves the wave equation using a 'finite-difference' method: each function is sampled at a finite number of points in space, and the derivatives are approximated by taking differences between neighboring values (a small sketch of the stencil follows the parameter list). The important parameters are:
 nx, ny, nz - the number of 'gridpoints' in the x, y, and z directions
 nsteps - the number of 'timesteps' to take forward in evolving the fields
 printevery - how often to output data
 dt - the difference in time between one 'timestep' and the next
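
To make "differences between neighboring values" concrete, the second derivative in x at gridpoint i is approximated with the standard three-point stencil, the same expression that appears in the parallelized loop further below. A minimal one-dimensional sketch (the array names and sizes here are just for illustration, not taken from 3Dwave.c):

#include <stdio.h>

#define NX 11

int main() {
  double f[NX], d2f[NX];
  double dx = 0.1;
  int i;

  /* sample f(x) = x*x, whose second derivative is exactly 2 everywhere */
  for (i = 0; i < NX; i++)
    f[i] = (i*dx)*(i*dx);

  /* three-point stencil: d2f/dx2 at point i ~ (f[i+1] - 2*f[i] + f[i-1]) / (dx*dx) */
  for (i = 1; i < NX-1; i++)
    d2f[i] = (f[i+1] - 2.0*f[i] + f[i-1]) / dx / dx;

  printf("d2f at the middle point = %f (exact value is 2)\n", d2f[NX/2]);
  return 0;
}
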
 We can compile the code and run it on one of the login nodes:

[mtobias@login001 C]$ icc 3Dwave.c

Before we run it, we need to request more memory for the stack:

[mtobias@login001 C]$ ulimit -s unlimited

[mtobias@login001 C]$ time ./a.out
time = 0.000
time = 0.100
time = 0.200
time = 0.300
time = 0.400
time = 0.500
time = 0.600
time = 0.700
time = 0.800
time = 0.900
time = 1.000

real 1m10.121s
user 1m9.906s
sys 0m0.204s

Before you start modifying the program too much, let's consider how we could go about parallelizing the code. As the bulk of the time is spent inside the triply nested for loops, we could just pick the main loop nest and add the appropriate OpenMP directives.

#pragma omp parallel private(i,j,k)
{
    /* use the predictor to update gp */
#pragma omp for schedule(static)
    for (i=1; i<nx-1; i++) {
      for (j=1; j<ny-1; j++) {
        for (k=1; k<nz-1; k++) {
          gp[i][j][k] = ... (   /* predictor expression elided; see 3Dwave_omp_step1.c */
                (fp[i+1][j][k]-2.0*fp[i][j][k]+fp[i-1][j][k])/dx/dx  +
                (fp[i][j+1][k]-2.0*fp[i][j][k]+fp[i][j-1][k])/dy/dy +
                (fp[i][j][k+1]-2.0*fp[i][j][k]+fp[i][j][k-1])/dz/dz );
        }
      }
    }

}

We declare the loop indices i, j, and k to be private, as each thread needs its own copies of them while it updates its own set of gridpoints.
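
As an aside, when a parallel region contains nothing but a single work-sharing loop, the two directives can be combined into a single #pragma omp parallel for. A small stand-alone sketch of the combined form (the array and loop here are just for illustration, not taken from the wave code):

#include <stdio.h>

#define N 1000

int main() {
  static double a[N], b[N];
  int i;

  for (i = 0; i < N; i++)
    b[i] = (double)i;

  /* combined directive: creates the thread team and splits the loop iterations in one step */
#pragma omp parallel for private(i) schedule(static)
  for (i = 0; i < N; i++)
    a[i] = 2.0*b[i];

  printf("a[N-1] = %f\n", a[N-1]);
  return 0;
}
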
We can now compile the code and see how the timing compares. For the remainder of the examples I'll use an "interactive job" to run on one of the compute nodes instead of competing for resources on the login nodes:

[mtobias@login001 C]$ qsub -I -l nodes=1:ppn=8,walltime=8:00:00
[mtobias@node106 ~]$

When running multi-threaded code, one often finds the code segfaults because it needs more stack.  To get around this, I'll just ask for as much stack as possible:

[mtobias@node106 ~]$ ulimit -s unlimited
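
Note that ulimit -s only raises the stack limit of the initial (master) thread. If the worker threads themselves run out of stack, the OMP_STACKSIZE environment variable (OpenMP 3.0 and later; the default size is implementation dependent) can be used to enlarge their stacks, for example:

[mtobias@node106 ~]$ export OMP_STACKSIZE=512M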

[mtobias@login001 C]$ icc -openmp 3Dwave_omp_step1.c

NOTE: I needed to compile the program on the login node.

 

Run with 1 thread:

[mtobias@node106 C]$ export OMP_NUM_THREADS=1
[mtobias@node106 C]$ time ./a.out
...
real 0m56.035s
user 0m55.782s
sys 0m0.147s

[mtobias@node106 C]$ export OMP_NUM_THREADS=2
[mtobias@node106 C]$ time ./a.out
...
real 0m40.271s
user 1m8.620s
sys 0m0.818s

[mtobias@node106 C]$ export OMP_NUM_THREADS=4
[mtobias@node106 C]$ time ./a.out
...
real 0m31.996s
user 1m42.676s
sys 0m6.095s

[mtobias@node106 C]$ export OMP_NUM_THREADS=8
[mtobias@node106 C]$ time ./a.out
...
real 0m27.752s
user 2m38.517s
sys 0m27.221s

With 4 threads we've cut the execution time almost in half.  From 4 to 8 threads, there's very little speed-up.

In this example I just picked one loop nest, but what if we parallelized all of the for loops: 3Dwave_omp.c?

[mtobias@login001 C]$ icc -openmp 3Dwave_omp.c
[mtobias@node106 C]$ export OMP_NUM_THREADS=8
[mtobias@node106 C]$ time ./a.out
...
real 0m12.969s
user 1m39.265s
sys 0m2.062s

While this is considerably better than the one-loop example, we're still only getting a speed-up of about 56/13 ≈ 4.3 on 8 threads. This is typical of OpenMP: it's easy to get some scaling, but often difficult to get perfect scaling.

I should also point out that there are many ways one could have gone about parallelizing this code (for instance parallelizing over inner loops rather than outer loops, etc.). Please view this as a starting point and try out any ideas you have.
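
For instance, one alternative worth experimenting with (not shown in the linked code) is the collapse clause, available in OpenMP 3.0 and later, which merges the iteration spaces of nested loops so the scheduler has more iterations to distribute. A small stand-alone sketch (the array and loop are just for illustration):

#include <stdio.h>

#define N 100

int main() {
  static double a[N][N];
  int i, j;

  /* collapse(2) merges the i and j loops into one iteration space of N*N iterations */
#pragma omp parallel for private(i,j) schedule(static) collapse(2)
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      a[i][j] = (double)(i + j);

  printf("a[N-1][N-1] = %f\n", a[N-1][N-1]);
  return 0;
}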
 

Hopefully these brief examples give you a few ideas about how to use OpenMP to parallelize your own code. For more detailed information and examples, see:
http://www.openmp.org or https://computing.llnl.gov/tutorials/openMP/