Task solution - matrix multiplication in Fortran
This repository contains the solution to the first task of the Fortran programming language course at AGH in Kraków.
Directory structure
CMakeLists.txt
README.md
make_results.sh
template.gnuplot
src/
blockmath.F90
testing.F90
naivemath.F90
bettermath.F90
main.F90
dotmath.F90
bettermath2.F90
res/
dot_16
dot_8
mat_8
bett2_4
bett_4
bett2_8
block_4
naiv_8
mat_16
bett_8
naiv_4
block_8
naiv_16
mat_4
dot_4
bett_16
block_16
bett2_16
wykres4.pdf
wykres8.pdf
wykres16.pdf
wykres4.svg
wykres8.svg
wykres16.svg
CMakeLists.txt
This file contains the CMake definitions for building the Fortran sources. It does not, however, define the actions of running the program, generating results, or plotting.
make_results.sh
This script runs the built program under different options and writes the measured multiplication times to res/{naiv,bett,dot,mat,bett2,block}_{4,8,16}, while also running gnuplot to generate the corresponding plots in res/wykres{4,8,16}.pdf.
template.gnuplot
It's the gnuplot script for generating plots from the text files in res. In this script, [kind] has to be replaced with an actual number (for example with the help of sed; see make_results.sh for how it's done).
src/
This directory contains the Fortran sources of the project, namely the module sources *math.F90 and the main program source main.F90.
res/
This directory contains the text results of the matrix multiplication time measurements (files without an extension) and the corresponding plots (PDF files, as required by the task, plus SVG versions for embedding here).
Building
As prerequisites, CMake, make and a Fortran compiler, either gfortran or ifort, are required (didn't want to support a proprietary compiler :c but had to).
make_results.sh assumes an out-of-source build has been performed in build/.
$ mkdir build
$ cd build
$ cmake ../
$ make mull
The CMake configuration provided allows one to specify one of two sets of compiler flags. The default is RELEASE, and it can be changed to DEBUG using
$ cmake .. -DCMAKE_BUILD_TYPE=DEBUG
Running
The executable is mull.
$ ./mull
Usage: mull KIND IMPLEMENTATION
where KIND is one of: 4, 8, 16
IMPLEMENTATION is one of: naiv, bett, dot, mat, bett2, block
When run with proper arguments, it prints matrix sizes and multiplication times in columns.
Text files and plots in res/ can be regenerated by issuing
$ ./make_results.sh
from the main project directory (this may take up to several tens of minutes).
Implementation
Five versions of matrix multiplication are implemented in separate files under src/. They're all called, and their running times measured, in src/main.F90, together with Fortran's intrinsic matmul() function.
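As an illustration, timing a single multiplication might look roughly like this (a minimal sketch using cpu_time() and the default real kind; names and sizes here are illustrative, and the actual driver in src/main.F90 may do this differently):

program time_example
  implicit none
  integer, parameter :: n = 512
  real :: a(n, n), b(n, n), c(n, n)
  real :: t0, t1
  call random_number(a)
  call random_number(b)
  call cpu_time(t0)
  c = matmul(a, b)   ! any of the implementations could be timed the same way
  call cpu_time(t1)
  print *, 'n =', n, ' time =', t1 - t0, ' s'
end program time_example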
naive
Implemented in src/naivemath.F90; the loop order is i, j, k.
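A minimal sketch of this ordering (the routine name, interface and real kind are illustrative, not taken from the repository):

subroutine naive_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: i, j, k
  c = 0.0
  do i = 1, n
    do j = 1, n
      do k = 1, n
        ! a(i, k) is accessed with stride n in column-major storage
        c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
    end do
  end do
end subroutine naive_mul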
better
Implemented in src/bettermath.F90; the loop order is j, k, i. It's expected to be faster than naive because the innermost loop accesses array elements that are close to each other in memory (Fortran stores arrays column by column).
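A minimal sketch of the reordered loops (same illustrative interface as the naive sketch above):

subroutine better_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: i, j, k
  c = 0.0
  do j = 1, n
    do k = 1, n
      do i = 1, n
        ! the innermost loop now walks down contiguous columns of c and a
        c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
    end do
  end do
end subroutine better_mul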
dot
Implemented in src/dotmath.F90; the loop order is i, j, k. The innermost loop is replaced with a call to the dot_product() intrinsic function, which is supposed to aid its optimization.
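A minimal sketch (same illustrative interface as above); the k loop becomes a dot product of a row of a and a column of b:

subroutine dot_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: i, j
  do i = 1, n
    do j = 1, n
      ! the k loop is expressed as an intrinsic dot product
      c(i, j) = dot_product(a(i, :), b(:, j))
    end do
  end do
end subroutine dot_mul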
better2
Implemented in src/bettermath2.F90; the loop order is j, k, i. The innermost loop is replaced with an operation on an entire column.
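A minimal sketch (same illustrative interface as above); the i loop becomes a whole-column array operation:

subroutine better2_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: j, k
  c = 0.0
  do j = 1, n
    do k = 1, n
      ! the i loop is expressed as an operation on an entire column of c
      c(:, j) = c(:, j) + a(:, k) * b(k, j)
    end do
  end do
end subroutine better2_mul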
block
Implemented in src/blockmath.F90. The multiplication is carried out as many multiplications of submatrices (blocks). The loop order within one block is j, k, i, with the innermost loop replaced with an operation on entire block columns.
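A minimal sketch of the blocked scheme (the block size of 64 and the interface are illustrative, not taken from src/blockmath.F90):

subroutine block_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer, parameter :: bs = 64   ! illustrative block size
  integer :: ii, jj, kk, j, k, ihi
  c = 0.0
  do jj = 1, n, bs
    do kk = 1, n, bs
      do ii = 1, n, bs
        ihi = min(ii + bs - 1, n)
        do j = jj, min(jj + bs - 1, n)
          do k = kk, min(kk + bs - 1, n)
            ! the i loop is an operation on the block's part of a column
            c(ii:ihi, j) = c(ii:ihi, j) + a(ii:ihi, k) * b(k, j)
          end do
        end do
      end do
    end do
  end do
end subroutine block_mul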
Results
KIND=4
(plot: res/wykres4.svg)
KIND=8
(plot: res/wykres8.svg)
KIND=16
(plot: res/wykres16.svg)
Conclusions
As expected, modifying the algorithm to reference memory more locally improves the execution speed (the difference between naive and better). This is achieved through better exploitation of the processor cache.
Usage of the intrinsic dot_product(), as well as Fortran's array operations, didn't really help the compiler further optimize the code (there is almost no difference between better and better2, or between naive and dot). However, the same change on a larger scale, namely the application of matmul(), gave significant performance gains.
Block matrix multiplication is supposed to increase the temporal locality of operations, thus allowing efficient use of the L1 cache. Whether this method is really beneficial depends on factors like the processor model. The performance gain, if there is any, is greater for bigger matrices. In this experiment the algorithm was almost equally effective to the better and better2 implementations.
The differences between the algorithms' speeds are biggest for single precision numbers and almost unnoticeable for kind 16 precision. One possible explanation is that operations on high precision numbers consume more processor cycles, hence (1) memory accesses have a smaller impact on the overall running time of the code, and (2) they can be performed by the processor in parallel with arithmetic operations and have a chance of completing before the FPU can start another operation.