# Task solution - matrix multiplication in Fortran

This repository contains the solution of the [first task](http://home.agh.edu.pl/~macwozni/fort/zadanie1.pdf) of the Fortran programming language course at AGH in Cracow.

## Directory structure

```
CMakeLists.txt
README.md
make_results.sh
template.gnuplot
src/
    blockmath.F90
    testing.F90
    naivemath.F90
    bettermath.F90
    main.F90
    dotmath.F90
    bettermath2.F90
res/
    dot_16 dot_8 mat_8 bett2_4 bett_4 bett2_8 block_4 naiv_8 mat_16
    bett_8 naiv_4 block_8 naiv_16 mat_4 dot_4 bett_16 block_16 bett2_16
    wykres4.pdf wykres8.pdf wykres16.pdf
    wykres4.svg wykres8.svg wykres16.svg
```

### CMakeLists.txt

This file contains the CMake definitions for building the Fortran sources. It does not, however, define how to run the program, generate the results, or plot them.

### make_results.sh

This script runs the built program under different options and writes the measured multiplication times to `res/{naiv,bett,dot,mat,bett2,block}_{4,8,16}`, while also running gnuplot to generate the corresponding plots in `res/wykres{4,8,16}.pdf`.

### template.gnuplot

This is the gnuplot script for generating plots from the text files in `res`. In this script, `[kind]` has to be replaced with an actual number (for example with the help of `sed`; see `make_results.sh` for how it's done).

### src/

This directory contains the Fortran sources of the project, namely the module sources `*math.F90` and the main program source `main.F90`.

### res/

This directory contains the text results of the matrix multiplication time measurements (files without extension) and their corresponding plots (PDF files as required by the task, plus SVG versions for embedding here).

## Building

As prerequisites, CMake, make and a Fortran compiler, either gfortran or ifort, are required (didn't want to support a proprietary compiler :c but had to). `make_results.sh` assumes an out-of-source build in `build/` has been performed.
```
$ mkdir build
$ cd build
$ cmake ../
$ make mull
```

The provided CMake configuration allows one to specify one of two sets of compiler flags. The default is RELEASE, and it can be changed to DEBUG using

```
$ cmake .. -DCMAKE_BUILD_TYPE=DEBUG
```

## Running

The executable is `mull`.

```
$ ./mull
Usage: mull KIND IMPLEMENTATION
where
    KIND is one of: 4, 8, 16
    IMPLEMENTATION is one of: naiv, bett, dot, mat, bett2, block
```

When run with proper arguments, it prints matrix sizes and multiplication times in columns. The text files and plots in `res/` can be regenerated by issuing

```
$ ./make_results.sh
```

from the main project directory (this may take up to several tens of minutes).

## Implementation

Five versions of matrix multiplication are implemented in separate files under `src/`. They are all called, and their running times measured, in `src/main.F90`, together with Fortran's built-in `matmul()` intrinsic function.

#### naive

Implemented in `src/naivemath.F90`; the order of loops is i, j, k.

#### better

Implemented in `src/bettermath.F90`; the order of loops is j, k, i. It's expected to be faster than naive, because the innermost loop accesses array elements that are close to each other in memory (Fortran stores arrays in column-major order).

#### dot

Implemented in `src/dotmath.F90`; the order of loops is i, j, k. The innermost loop is replaced with a `dot_product()` intrinsic function call, which is supposed to aid its optimization.

#### better2

Implemented in `src/bettermath2.F90`; the order of loops is j, k, i. The innermost loop is replaced with an operation on an entire column.

#### block

Implemented in `src/blockmath.F90`. The multiplication of matrices is achieved by many multiplications of submatrices (blocks). The order of loops for one block is j, k, i, with the innermost loop replaced with an operation on entire blocks' columns.
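The loop reorderings described above can be sketched roughly as follows (the subroutine names are illustrative, not the actual routines from `src/`):

```fortran
! Sketch of the "naive" and "better" loop orders for C = A*B.

! naive: i, j, k - the innermost loop strides across a row of A,
! which is non-contiguous in Fortran's column-major storage.
subroutine naive_mul(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n,n), b(n,n)
    real, intent(out)   :: c(n,n)
    integer :: i, j, k
    c = 0.0
    do i = 1, n
        do j = 1, n
            do k = 1, n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
        end do
    end do
end subroutine

! better: j, k, i - the innermost loop walks down the columns
! c(:,j) and a(:,k), which are contiguous in memory.
subroutine better_mul(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n,n), b(n,n)
    real, intent(out)   :: c(n,n)
    integer :: i, j, k
    c = 0.0
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
        end do
    end do
end subroutine
```

In this sketch, "better2" would correspond to collapsing the innermost `i` loop into a whole-column array operation, `c(:,j) = c(:,j) + a(:,k) * b(k,j)`.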
## Results

#### KIND=4

![plot kind=4](res/wykres4.svg)

#### KIND=8

![plot kind=8](res/wykres8.svg)

#### KIND=16

![plot kind=16](res/wykres16.svg)

## Conclusions

As expected, modifying the algorithm to reference memory more locally improves execution speed (the difference between naive and better). Use of built-in intrinsic functions as well as Fortran's array operations may help the compiler optimize the code better and further improve speed (the differences between better and better2, between naive and dot, and between better and matmul). It is not, however, always effective - in this experiment, single-precision operations were indeed optimized noticeably better, while there was little or no improvement for the higher-precision floating-point types. Block matrix multiplication is supposed to increase the temporal locality of operations, thus allowing efficient use of the L1 cache. Whether this method is really beneficial depends on factors like the processor model. The performance gain, if there is any, is greater for bigger matrices. In this experiment, this algorithm was most successful for kind 8 (double precision on x86_64) numbers.
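The blocked scheme can be sketched as follows; this is only an outline of the technique, not the actual code from `src/blockmath.F90`, and the block size `bs` is an assumed illustrative value:

```fortran
! Rough sketch of blocked multiplication: iterate over bs-by-bs blocks,
! and inside each block use the j, k order with a column operation on
! the block's rows (as in "better2", restricted to one block).
subroutine block_mul(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n,n), b(n,n)
    real, intent(out)   :: c(n,n)
    integer, parameter :: bs = 64   ! assumed block size, tuned to fit blocks in L1 cache
    integer :: ii, jj, kk, j, k, ie
    c = 0.0
    do jj = 1, n, bs
        do kk = 1, n, bs
            do ii = 1, n, bs
                ie = min(ii + bs - 1, n)
                do j = jj, min(jj + bs - 1, n)
                    do k = kk, min(kk + bs - 1, n)
                        ! update one block-column segment of C at a time
                        c(ii:ie, j) = c(ii:ie, j) + a(ii:ie, k) * b(k, j)
                    end do
                end do
            end do
        end do
    end do
end subroutine
```

Because each block of `A`, `B` and `C` is revisited many times in quick succession, the working set of the inner loops can stay resident in cache, which is the temporal-locality effect mentioned above.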