
Center for High Performance Computing (CHPC)

 

A (very) Brief Introduction to OpenMP: Fortran example

OpenMP compiler directives in Fortran can be written with several different sentinels, but in this tutorial I'll use:

 !$OMP

In fixed form Fortran the directive sentinel must begin in column 1. In free form Fortran the directive can appear in any column, but must be preceded only by white space.
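As a quick illustration (my own sketch, not one of this page's examples), here is a free-form version of a directive, where the sentinel only needs to be preceded by white space. In fixed form the same sentinel would have to start in column 1 (and C$OMP or *$OMP are also accepted there):

```fortran
! Free-form source: the !$OMP sentinel may be indented,
! as long as only white space precedes it on the line.
program sentinel_demo
   implicit none
   !$OMP PARALLEL
   write(*,*) 'hello world'
   !$OMP END PARALLEL
end program sentinel_demo
```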

Let's just jump right in and consider a simple "hello world" program with a simple OpenMP directive.

 

       program hello
       implicit none
 !$OMP PARALLEL
       write(*,*)'hello world'
 !$OMP END PARALLEL
       stop
       end


 

 In this case, the directive

 !$OMP PARALLEL


 indicates the beginning of a parallel section and at this point the master thread will fork off into multiple execution threads. The directive

 !$OMP END PARALLEL


marks the end of the parallel section and multiple execution threads will join back up with the master thread.

When we compile this code without any special compiler options, the OpenMP compiler directives will be ignored as comments:

[mtobias@login001 Fortran]$ ifort hello_world_omp.f

 

[mtobias@login001 Fortran]$ ./a.out
 hello world

For the compiler to recognize the OpenMP directives, we must use the "-openmp" compiler flag (newer Intel compilers call it "-qopenmp"):

[mtobias@login001 Fortran]$ ifort -openmp hello_world_omp.f

Note that we still haven't specified how many threads should be created. One way to do this is with the OMP_NUM_THREADS environment variable.

[mtobias@login001 Fortran]$ export OMP_NUM_THREADS=1
[mtobias@login001 Fortran]$ ./a.out
 hello world
[mtobias@login001 Fortran]$ export OMP_NUM_THREADS=2
[mtobias@login001 Fortran]$ ./a.out
 hello world
 hello world
[mtobias@login001 Fortran]$ export OMP_NUM_THREADS=4
[mtobias@login001 Fortran]$ ./a.out
 hello world
 hello world
 hello world
 hello world

Note that one thread may print its "hello world" message before another thread has finished printing its newline, so the lines of output may be interleaved unpredictably. Also, there is no need to recompile the code to change the number of threads.
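As an aside, the thread count can also be set from inside the program with the OpenMP runtime routine OMP_SET_NUM_THREADS, called before the parallel region; it takes precedence over the OMP_NUM_THREADS environment variable. A minimal sketch (my own, not from the page):

```fortran
      program hello_set
      implicit none
C     Request 4 threads at run time; this overrides the
C     OMP_NUM_THREADS environment variable.
      call OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL
      write(*,*)'hello world'
!$OMP END PARALLEL
      end
```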

When debugging threaded code, it's often useful to identify what each thread is doing. The OpenMP library provides the functions OMP_GET_NUM_THREADS and OMP_GET_THREAD_NUM, which return the total number of threads and the calling thread's id, respectively. Consider the code:

 

       program hello
 
       implicit none
       integer nthreads,my_id
       integer OMP_GET_NUM_THREADS,OMP_GET_THREAD_NUM
 
 !$OMP PARALLEL PRIVATE(my_id)
       my_id = OMP_GET_THREAD_NUM()
       write(*,*)'my_id = ',my_id
 
       nthreads = OMP_GET_NUM_THREADS()
 
       if (my_id.eq.0) then
          write(*,*)'nthreads = ',nthreads
       endif
 !$OMP END PARALLEL
 
       stop
       end



Notice that when the PARALLEL region is declared, I list the variable my_id as PRIVATE, because each thread needs its own copy. Variables can be either private or shared. When printing out the total number of threads, we'll have only thread 0 write the output. Thread ids begin at 0 and run to N-1 for N threads.

[mtobias@login001 Fortran]$ export OMP_NUM_THREADS=1
[mtobias@login001 Fortran]$ ./a.out
 my_id =            0
 nthreads =            1
[mtobias@login001 Fortran]$ export OMP_NUM_THREADS=2
[mtobias@login001 Fortran]$ ./a.out
 my_id =            0
 nthreads =            2
 my_id =            1

There is no guarantee as to the order in which the threads will execute:

[mtobias@login001 Fortran]$ ./a.out
 my_id =            1
 my_id =            0
 nthreads =            2
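A related pitfall: if several threads update the same shared variable, the result is a race. For accumulations, OpenMP's REDUCTION clause gives each thread a private partial result and combines the partials at the end of the loop. A small sketch (not part of the original example):

```fortran
      program reduce_demo
      implicit none
      integer i
      real*8 total
      total = 0.0d0
C     Each thread sums its own iterations into a private copy
C     of total; the copies are combined when the loop ends.
!$OMP PARALLEL DO REDUCTION(+:total)
      do i=1,1000
         total = total + dble(i)
      enddo
!$OMP END PARALLEL DO
C     Prints 500500 regardless of the number of threads.
      write(*,*)'total = ',total
      end
```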

Now let's consider the 3D scalar wave example. As the code is fairly long, I'll just link to it: 3Dwave.f. Without going into too much detail, the code evolves the wave equation using a finite-difference method: each field is sampled at a finite number of points in space, and derivatives are approximated by taking differences between neighboring values. The important parameters are:
 nx,ny,nz - the number of 'gridpoints' in the x,y, and z directions
 nsteps - the number of 'timesteps' to take forward in evolving the fields
 printevery - how often to output data
 dt - the difference in time between one 'timestep' and the next
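For context (my notation, not from the page), the finite-difference scheme approximates second spatial derivatives with standard central differences; for example, in the x direction:

```latex
\left.\frac{\partial^2 f}{\partial x^2}\right|_{i,j,k}
\approx \frac{f_{i+1,j,k} - 2\,f_{i,j,k} + f_{i-1,j,k}}{\Delta x^2}
```

This is the same three-point stencil that appears in the triply-nested loop shown further down.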
 We can compile the code and run it on one of the login nodes:

[mtobias@login001 Fortran]$ ifort -w 3Dwave.f

[mtobias@login001 Fortran]$ time ./a.out
 time = 0.000
 time = 0.100
 time = 0.200
 time = 0.300
 time = 0.400
 time = 0.500
 time = 0.600
 time = 0.700
 time = 0.800
 time = 0.900
 time = 1.000

real    1m12.977s
user    1m12.685s
sys     0m0.224s

Before we start modifying the program, let's consider how we could go about parallelizing the code. As the bulk of the time is spent inside the triply-nested do loops, we could pick the main do loop and add the appropriate OpenMP directives.

 

 !$OMP PARALLEL PRIVATE(k)
 
 !$OMP DO SCHEDULE(STATIC)
 
          do k=2,nz-1
             do j=2,ny-1
                do i=2,nx-1
                   gp(i,j,k) = g(i,j,k) + dt*(
      &                 (fp(i+1,j,k)-2.0d0*fp(i,j,k)+fp(i-1,j,k))
      &                 /dx/dx +
      &                 (fp(i,j+1,k)-2.0d0*fp(i,j,k)+fp(i,j-1,k))
      &                 /dy/dy +
      &                 (fp(i,j,k+1)-2.0d0*fp(i,j,k)+fp(i,j,k-1))
      &                 /dz/dz )
                enddo
             enddo
          enddo
 
 !$OMP END DO
 
 !$OMP END PARALLEL 
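For reference, SCHEDULE(STATIC) hands each thread one contiguous block of k iterations, fixed up front, which suits this loop because every iteration costs about the same. A hypothetical variant (not used in this example) trades that predictability for load balancing:

```fortran
C     Hand out chunks of 4 k-iterations on demand; useful when
C     iterations have uneven cost (not the case here).
!$OMP DO SCHEDULE(DYNAMIC,4)
```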

We declare the integer k to be PRIVATE because it is the parallelized loop index: each thread must keep its own value of k as it updates its local set of gridpoints.
 We can now compile the code and see how the timing compares.  For the remainder of the examples I'll use an "interactive job" to run on one of the compute nodes instead of competing for resources on the login nodes:

[mtobias@login001 Fortran]$ qsub -I -l nodes=1:ppn=8,walltime=1:00:00
[mtobias@node167 ~]$

When running multi-threaded code, one often finds that it segfaults because the threads need more stack space than the default limit allows. To get around this, I'll just ask for as much stack as possible:

[mtobias@node167 Fortran]$ ulimit -s unlimited

[mtobias@node167 Fortran]$ ifort -openmp -w 3Dwave_omp_step1.f

Run with 1 thread:

[mtobias@node167 Fortran]$ export OMP_NUM_THREADS=1
[mtobias@node167 Fortran]$ time ./a.out
...

real    0m54.997s
user    0m54.645s
sys     0m0.147s

[mtobias@node167 Fortran]$ export OMP_NUM_THREADS=2
[mtobias@node167 Fortran]$ time ./a.out
...

real    0m40.858s
user    1m13.026s
sys     0m1.723s

 

[mtobias@node167 Fortran]$ export OMP_NUM_THREADS=4
[mtobias@node167 Fortran]$ time ./a.out
...

real    0m31.378s
user    1m42.688s
sys     0m5.705s

[mtobias@node167 Fortran]$ export OMP_NUM_THREADS=8
[mtobias@node167 Fortran]$ time ./a.out
...

real    0m27.324s
user    2m34.070s
sys     0m22.770s

With 4 threads we've cut the execution time almost in half, but from 4 to 8 threads there's very little additional speed-up.
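This flattening is roughly what Amdahl's law predicts: if a fraction p of the work parallelizes, the best possible speed-up on n threads is

```latex
S(n) = \frac{1}{(1-p) + p/n}
```

The measured 8-thread speed-up here, S(8) ≈ 55.0/27.3 ≈ 2.0, corresponds to p ≈ 0.57 for this one-loop version; that value in turn predicts S(2) ≈ 1.4, close to the measured 55.0/40.9 ≈ 1.35. (Rough back-of-the-envelope numbers, not from the original text.)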

In this example I just picked on one do loop, but what if we parallelized all the do loops: 3Dwave_omp.f?

[mtobias@node167 Fortran]$ ifort -openmp -w 3Dwave_omp.f
[mtobias@node167 Fortran]$ export OMP_NUM_THREADS=8
[mtobias@node167 Fortran]$ time ./a.out
...

real    0m12.353s
user    1m34.537s
sys     0m1.091s

 

While this is considerably better than the one-loop example, we're still only getting a speed-up of 55/12 ≈ 4.6 on 8 threads. This is typical of OpenMP: it's easy to get some scaling, but often difficult to get perfect scaling.

I should also point out that there are many ways one could have gone about parallelizing this code (for instance parallelizing over inner loops rather than outer loops, etc.). Please view this as a starting point and try out any ideas you have.
 

Hopefully these brief examples will give you a few ideas on how you would go about using OpenMP to parallelize your own code. For more detailed information and examples, see:
http://www.openmp.org or https://computing.llnl.gov/tutorials/openMP/