Task solution - matrix multiplication in Fortran
This repository contains the solution to the first task of the Fortran programming language course at AGH in Kraków.
Directory structure
CMakeLists.txt
README.md
make_results.sh
template.gnuplot
src/
blockmath.F90
testing.F90
naivemath.F90
bettermath.F90
main.F90
dotmath.F90
bettermath2.F90
res/
dot_16
dot_8
mat_8
bett2_4
bett_4
bett2_8
block_4
naiv_8
mat_16
bett_8
naiv_4
block_8
naiv_16
mat_4
dot_4
bett_16
block_16
bett2_16
wykres4.pdf
wykres8.pdf
wykres16.pdf
wykres4.svg
wykres8.svg
wykres16.svg
CMakeLists.txt
This file contains the CMake definitions for building the Fortran sources. It does not, however, define the actions of running the program, generating results, or plotting.
make_results.sh
This script runs the built program under different options and writes the measured multiplication times to res/{naiv,bett,dot,mat,bett2,block}_{4,8,16}, while also running gnuplot to generate the corresponding plots in res/wykres{4,8,16}.pdf.
template.gnuplot
It's the gnuplot script for generating plots from the text files in res. In this script, [kind] has to be replaced with an actual number (for example with the help of sed; see make_results.sh for how it's done).
src/
This directory contains the Fortran sources of the project, namely the module sources *math.F90 and the main program source main.F90.
res/
This directory contains the text results of the matrix multiplication time measurements (files without an extension) and the corresponding plots (PDF files, as required by the task, plus SVG versions for embedding here).
Building
As prerequisites, CMake, make and a Fortran compiler, either gfortran or ifort, are required (didn't want to support a proprietary compiler :c but had to).
make_results.sh assumes an out-of-source build has been performed in build/.
$ mkdir build
$ cd build
$ cmake ../
$ make mull
The CMake configuration provided allows one to specify one of two sets of compiler flags. The default is RELEASE, and it can be changed to DEBUG using
$ cmake .. -DCMAKE_BUILD_TYPE=DEBUG
Running
The executable is mull.
$ ./mull
Usage: mull KIND IMPLEMENTATION
where KIND is one of: 4, 8, 16
IMPLEMENTATION is one of: naiv, bett, dot, mat, bett2, block
When run with proper arguments, it prints matrix sizes and multiplication times in columns.
Text files and plots in res/ can be regenerated by issuing
$ ./make_results.sh
from the main project directory (this may take up to several tens of minutes).
Implementation
Five versions of matrix multiplication are implemented in separate files under src/. They're all called, and their running times measured, in src/main.F90, together with Fortran's intrinsic matmul() function.
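As an illustration, timing a single multiplication might look roughly like this (a minimal sketch using cpu_time() and the default real kind; names and sizes here are illustrative, and the actual driver in src/main.F90 may do this differently):

program time_example
  implicit none
  integer, parameter :: n = 512
  real :: a(n, n), b(n, n), c(n, n)
  real :: t0, t1
  call random_number(a)
  call random_number(b)
  call cpu_time(t0)
  c = matmul(a, b)   ! any of the implementations could be timed the same way
  call cpu_time(t1)
  print *, 'n =', n, ' time =', t1 - t0, ' s'
end program time_example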
naive
Implemented in src/naivemath.F90; the loop order is i, j, k.
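A minimal sketch of this ordering (the routine name, interface and real kind are illustrative, not taken from the repository):

subroutine naive_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: i, j, k
  c = 0.0
  do i = 1, n
    do j = 1, n
      do k = 1, n
        ! a(i, k) is accessed with stride n in column-major storage
        c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
    end do
  end do
end subroutine naive_mul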
better
Implemented in src/bettermath.F90; the loop order is j, k, i. It's expected to be faster than naive because the innermost loop accesses array elements that are close to each other in memory (Fortran stores arrays column by column).
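A minimal sketch of the reordered loops (same illustrative interface as the naive sketch above):

subroutine better_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: i, j, k
  c = 0.0
  do j = 1, n
    do k = 1, n
      do i = 1, n
        ! the innermost loop now walks down contiguous columns of c and a
        c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
    end do
  end do
end subroutine better_mul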
dot
Implemented in src/dotmath.F90; the loop order is i, j, k. The innermost loop is replaced with a call to the dot_product() intrinsic function, which is supposed to aid its optimization.
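A minimal sketch (same illustrative interface as above); the k loop becomes a dot product of a row of a and a column of b:

subroutine dot_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: i, j
  do i = 1, n
    do j = 1, n
      ! the k loop is expressed as an intrinsic dot product
      c(i, j) = dot_product(a(i, :), b(:, j))
    end do
  end do
end subroutine dot_mul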
better2
Implemented in src/bettermath2.F90; the loop order is j, k, i. The innermost loop is replaced with an operation on an entire column.
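A minimal sketch (same illustrative interface as above); the i loop becomes a whole-column array operation:

subroutine better2_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer :: j, k
  c = 0.0
  do j = 1, n
    do k = 1, n
      ! the i loop is expressed as an operation on an entire column of c
      c(:, j) = c(:, j) + a(:, k) * b(k, j)
    end do
  end do
end subroutine better2_mul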
block
Implemented in src/blockmath.F90. The multiplication is carried out as many multiplications of submatrices (blocks). The loop order within one block is j, k, i, with the innermost loop replaced with an operation on entire block columns.
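A minimal sketch of the blocked scheme (the block size of 64 and the interface are illustrative, not taken from src/blockmath.F90):

subroutine block_mul(a, b, c, n)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a(n, n), b(n, n)
  real, intent(out) :: c(n, n)
  integer, parameter :: bs = 64   ! illustrative block size
  integer :: ii, jj, kk, j, k, ihi
  c = 0.0
  do jj = 1, n, bs
    do kk = 1, n, bs
      do ii = 1, n, bs
        ihi = min(ii + bs - 1, n)
        do j = jj, min(jj + bs - 1, n)
          do k = kk, min(kk + bs - 1, n)
            ! the i loop is an operation on the block's part of a column
            c(ii:ihi, j) = c(ii:ihi, j) + a(ii:ihi, k) * b(k, j)
          end do
        end do
      end do
    end do
  end do
end subroutine block_mul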
Results
KIND=4
(plot: res/wykres4.svg)
KIND=8
(plot: res/wykres8.svg)
KIND=16
(plot: res/wykres16.svg)
Conclusions
As expected, modifying the algorithm to reference memory more locally improves the execution speed (the difference between naive and better). This is achieved through better exploitation of the processor cache.
Usage of the intrinsic dot_product(), as well as Fortran's array operations, didn't really help the compiler further optimize the code (there is almost no difference between better and better2, or between naive and dot). However, the same change on a larger scale, namely the application of matmul(), gave significant performance gains.
Block matrix multiplication is supposed to increase the temporal locality of operations, thus allowing efficient use of the L1 cache. Whether this method is really beneficial depends on factors like the processor model. The performance gain, if there is any, is greater for bigger matrices. In this experiment the algorithm was almost equally effective to the better and better2 implementations.
The differences between the algorithms' speeds are biggest for single precision numbers and almost unnoticeable for kind 16 precision. One possible explanation is that operations on high precision numbers consume more processor cycles, hence (1) memory accesses have a smaller impact on the overall running time of the code, and (2) they can be performed by the processor in parallel with arithmetic operations and have a chance of completing before the FPU can start another operation.