# Task solution - matrix multiplication in Fortran

This repository contains the solution of the [first task](http://home.agh.edu.pl/~macwozni/fort/zadanie1.pdf) of the Fortran programming language course at AGH in Cracow.

## Directory structure

```
CMakeLists.txt
README.md
make_results.sh
template.gnuplot
src/
    blockmath.F90
    testing.F90
    naivemath.F90
    bettermath.F90
    main.F90
    dotmath.F90
    bettermath2.F90
res/
    dot_16 dot_8 mat_8 bett2_4 bett_4 bett2_8 block_4 naiv_8 mat_16
    bett_8 naiv_4 block_8 naiv_16 mat_4 dot_4 bett_16 block_16 bett2_16
    wykres4.pdf wykres8.pdf wykres16.pdf
    wykres4.svg wykres8.svg wykres16.svg
```

### CMakeLists.txt

This file contains the CMake definitions for building the Fortran sources. It does not, however, define how to run the program, generate the results, or plot them.

### make_results.sh

This script runs the built program under different options and writes the measured multiplication times to `res/{naiv,bett,dot,mat,bett2,block}_{4,8,16}`, while also running gnuplot to generate the corresponding plots in `res/wykres{4,8,16}.pdf`.

### template.gnuplot

This is the gnuplot script for generating plots from the text files in `res`. In this script, `[kind]` has to be replaced with an actual number (for example with the help of `sed`; see `make_results.sh` for how it's done).

### src/

This directory contains the Fortran sources of the project, namely the module sources `*math.F90` and the main program source `main.F90`.

### res/

This directory contains the text results of the matrix multiplication time measurements (files without extension) and their corresponding plots (PDF files as required by the task, plus SVG versions for embedding here).

## Building

As prerequisites, CMake, make and a Fortran compiler, either gfortran or ifort, are required (didn't want to support a proprietary compiler :c but had to). `make_results.sh` assumes an out-of-source build in `build/` has been performed.
```
$ mkdir build
$ cd build
$ cmake ../
$ make mull
```

The provided CMake configuration allows one to specify one of two sets of compiler flags. The default is RELEASE, and it can be changed to DEBUG using

```
$ cmake .. -DCMAKE_BUILD_TYPE=DEBUG
```

## Running

The executable is `mull`.

```
$ ./mull
Usage: mull KIND IMPLEMENTATION
where
    KIND is one of: 4, 8, 16
    IMPLEMENTATION is one of: naiv, bett, dot, mat, bett2, block
```

When run with proper arguments, it prints matrix sizes and multiplication times in columns. The text files and plots in `res/` can be regenerated by issuing

```
$ ./make_results.sh
```

from the main project directory (this may take up to several tens of minutes).

## Implementation

Five versions of matrix multiplication are implemented in separate files under `src/`. They are all called, and their running times measured, in `src/main.F90`, together with Fortran's built-in `matmul()` intrinsic function.

#### naive

Implemented in `src/naivemath.F90`; the order of loops is i, j, k.

#### better

Implemented in `src/bettermath.F90`; the order of loops is j, k, i. It's expected to be faster than naive, because the innermost loop accesses array elements that are close to each other in memory (Fortran stores arrays in column-major order).

#### dot

Implemented in `src/dotmath.F90`; the order of loops is i, j, k. The innermost loop is replaced with a `dot_product()` intrinsic function call, which is supposed to aid its optimization.

#### better2

Implemented in `src/bettermath2.F90`; the order of loops is j, k, i. The innermost loop is replaced with an operation on an entire column.

#### block

Implemented in `src/blockmath.F90`. The multiplication of matrices is achieved by many multiplications of submatrices (blocks). The order of loops for one block is j, k, i, with the innermost loop replaced with an operation on entire blocks' columns.
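The loop reorderings described above can be sketched roughly as follows (the subroutine names are illustrative, not the actual routines from `src/`):

```fortran
! Sketch of the "naive" and "better" loop orders for C = A*B.

! naive: i, j, k - the innermost loop strides across a row of A,
! which is non-contiguous in Fortran's column-major storage.
subroutine naive_mul(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n,n), b(n,n)
    real, intent(out)   :: c(n,n)
    integer :: i, j, k
    c = 0.0
    do i = 1, n
        do j = 1, n
            do k = 1, n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
        end do
    end do
end subroutine

! better: j, k, i - the innermost loop walks down the columns
! c(:,j) and a(:,k), which are contiguous in memory.
subroutine better_mul(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n,n), b(n,n)
    real, intent(out)   :: c(n,n)
    integer :: i, j, k
    c = 0.0
    do j = 1, n
        do k = 1, n
            do i = 1, n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
        end do
    end do
end subroutine
```

In this sketch, "better2" would correspond to collapsing the innermost `i` loop into a whole-column array operation, `c(:,j) = c(:,j) + a(:,k) * b(k,j)`.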
## Results

#### KIND=4

![plot kind=4](res/wykres4.svg)

#### KIND=8

![plot kind=8](res/wykres8.svg)

#### KIND=16

![plot kind=16](res/wykres16.svg)

## Conclusions

As expected, modifying the algorithm to reference memory more locally improves execution speed (the difference between naive and better). Use of built-in intrinsic functions as well as Fortran's array operations may help the compiler optimize the code better and further improve speed (the differences between better and better2, between naive and dot, and between better and matmul). It is not, however, always effective - in this experiment, single-precision operations were indeed optimized noticeably better, while there was little or no improvement for the higher-precision floating-point types. Block matrix multiplication is supposed to increase the temporal locality of operations, thus allowing efficient use of the L1 cache. Whether this method is really beneficial depends on factors like the processor model. The performance gain, if there is any, is greater for bigger matrices. In this experiment, this algorithm was most successful for kind 8 (double precision on x86_64) numbers.
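The blocked scheme can be sketched as follows; this is only an outline of the technique, not the actual code from `src/blockmath.F90`, and the block size `bs` is an assumed illustrative value:

```fortran
! Rough sketch of blocked multiplication: iterate over bs-by-bs blocks,
! and inside each block use the j, k order with a column operation on
! the block's rows (as in "better2", restricted to one block).
subroutine block_mul(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n,n), b(n,n)
    real, intent(out)   :: c(n,n)
    integer, parameter :: bs = 64   ! assumed block size, tuned to fit blocks in L1 cache
    integer :: ii, jj, kk, j, k, ie
    c = 0.0
    do jj = 1, n, bs
        do kk = 1, n, bs
            do ii = 1, n, bs
                ie = min(ii + bs - 1, n)
                do j = jj, min(jj + bs - 1, n)
                    do k = kk, min(kk + bs - 1, n)
                        ! update one block-column segment of C at a time
                        c(ii:ie, j) = c(ii:ie, j) + a(ii:ie, k) * b(k, j)
                    end do
                end do
            end do
        end do
    end do
end subroutine
```

Because each block of `A`, `B` and `C` is revisited many times in quick succession, the working set of the inner loops can stay resident in cache, which is the temporal-locality effect mentioned above.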