From 915803ffd60008fd5e25b7f6332bd960775f6f36 Mon Sep 17 00:00:00 2001
From: Wojtek Kosior
Date: Fri, 26 Apr 2019 16:20:41 +0200
Subject: update conclusions to new results

---
 README.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 616c8a9..61b4166 100644
--- a/README.md
+++ b/README.md
@@ -116,6 +116,12 @@ Implemented in `src/blockmath.F90`. The multiplication of matrices is achieved b
 ![plot kind=16](res/wykres16.svg)
 ##Conclusions##
-As expected, modification of algorithm to reference memory more locally improves the execution speed (difference between naive and better).
-Usage of built-in intrincis functions as well as fortran's array operations may help the compiler better optimize code and further improve speed (difference between better and better2, between naive and dot, between better and matmul). It is not, however, always effective - in this experiment single precision operations were indeed optimized noticably better while there was less or none improvement for bigger precisions floating point types.
-Block array multiplication is supposed to increase temporal locality of operations thus allowing efficient use use L1 cache. Whether this method is really beneficial depends on factors like processor model. The performance gain, if it there is any, is greater for bigger matices. In this experiment this algorithm was most successful for kind 8 (double precision on x86_64) numbers.
\ No newline at end of file
+As expected, modifying the algorithm to reference memory more locally improves execution speed (the difference between naive and better). This is achieved through better exploitation of the processor cache.
+
+Using the intrinsic `dot_product()` and Fortran's array operations did not really help the compiler optimize the code further (almost no difference between better and better2, or between naive and dot). However, the same kind of change on a larger scale, namely the application of `matmul()`, gave significant performance gains.
+
+Block array multiplication is supposed to increase the temporal locality of operations, thus allowing efficient use of the L1 cache. Whether this method is really beneficial depends on factors like the processor model. The performance gain, if there is any, is greater for bigger matrices. In this experiment the block algorithm was almost as effective as the better and better2 implementations.
+
+The differences between the algorithms' speeds are biggest for single precision numbers and almost unnoticeable for kind 16 precision. One possible explanation is that operations on high precision numbers consume more processor cycles, hence
+1. memory accesses have a smaller impact on the overall running time of the code
+2. they can be performed by the processor in parallel with arithmetic operations, and so have a chance of completing before the FPU can start another operation
--
cgit v1.2.3
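
To illustrate the naive/better distinction the conclusions refer to, below is a minimal sketch of the kind of loop reordering presumably involved; the actual routines live in `src/blockmath.F90`, and the subroutine names, the `kind=8` choice, and the exact loop structure here are assumptions, not the repository's code. Fortran stores arrays column-major, so moving the row index into the innermost loop makes the accesses to `a` and `c` unit-stride.

```fortran
! Hypothetical sketch, not the code from src/blockmath.F90.
subroutine mul_naive(a, b, c, n)
  implicit none
  integer, intent(in)       :: n
  real(kind=8), intent(in)  :: a(n,n), b(n,n)
  real(kind=8), intent(out) :: c(n,n)
  integer :: i, j, k
  c = 0
  do i = 1, n            ! row index outermost
    do j = 1, n
      do k = 1, n        ! a(i,k) is accessed with stride n here
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do
end subroutine mul_naive

subroutine mul_better(a, b, c, n)
  implicit none
  integer, intent(in)       :: n
  real(kind=8), intent(in)  :: a(n,n), b(n,n)
  real(kind=8), intent(out) :: c(n,n)
  integer :: i, j, k
  c = 0
  do j = 1, n
    do k = 1, n
      do i = 1, n        ! unit stride over a(:,k) and c(:,j)
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do
end subroutine mul_better
```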
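The dot and matmul variants mentioned in the second paragraph can be sketched the same way; again the names and interfaces are assumptions. `dot_product()` only replaces the innermost loop, while `matmul()` hands the entire product to the compiler, which helps explain why only the latter gave significant gains.

```fortran
! Hypothetical sketch, not the code from src/blockmath.F90.
subroutine mul_dot(a, b, c, n)
  implicit none
  integer, intent(in)       :: n
  real(kind=8), intent(in)  :: a(n,n), b(n,n)
  real(kind=8), intent(out) :: c(n,n)
  integer :: i, j
  do j = 1, n
    do i = 1, n
      ! the intrinsic replaces only the innermost k loop
      c(i,j) = dot_product(a(i,:), b(:,j))
    end do
  end do
end subroutine mul_dot

subroutine mul_matmul(a, b, c, n)
  implicit none
  integer, intent(in)       :: n
  real(kind=8), intent(in)  :: a(n,n), b(n,n)
  real(kind=8), intent(out) :: c(n,n)
  ! the whole product is delegated to the intrinsic, which the
  ! compiler is free to lower to an optimized library routine
  c = matmul(a, b)
end subroutine mul_matmul
```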
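Finally, a sketch of block (tiled) multiplication as described in the third paragraph; the tile size and loop nesting are illustrative guesses rather than the contents of `src/blockmath.F90`. The point is that each tile of `b` is reused across many updates while it is still resident in the L1 cache, which is the temporal locality the conclusions mention.

```fortran
! Hypothetical sketch, not the code from src/blockmath.F90.
subroutine mul_block(a, b, c, n)
  implicit none
  integer, intent(in)       :: n
  integer, parameter        :: bs = 64   ! tile size: an assumption, tune to L1
  real(kind=8), intent(in)  :: a(n,n), b(n,n)
  real(kind=8), intent(out) :: c(n,n)
  integer :: i, j, k, jj, kk
  c = 0
  do jj = 1, n, bs
    do kk = 1, n, bs
      ! the bs-by-bs tile of b is reused for every i below,
      ! so it stays cache-resident across many updates of c
      do j = jj, min(jj + bs - 1, n)
        do k = kk, min(kk + bs - 1, n)
          do i = 1, n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
        end do
      end do
    end do
  end do
end subroutine mul_block
```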