From aa4d426b4d3527d7e166df1a05058c9a4a0f6683 Mon Sep 17 00:00:00 2001 From: Wojtek Kosior Date: Fri, 30 Apr 2021 00:33:56 +0200 Subject: initial/final commit --- openssl-1.1.0h/crypto/bn/asm/ia64.S | 1562 +++++++++++++++++++++++++++++++++++ 1 file changed, 1562 insertions(+) create mode 100644 openssl-1.1.0h/crypto/bn/asm/ia64.S (limited to 'openssl-1.1.0h/crypto/bn/asm/ia64.S') diff --git a/openssl-1.1.0h/crypto/bn/asm/ia64.S b/openssl-1.1.0h/crypto/bn/asm/ia64.S new file mode 100644 index 0000000..f2404a3 --- /dev/null +++ b/openssl-1.1.0h/crypto/bn/asm/ia64.S @@ -0,0 +1,1562 @@ +.explicit +.text +.ident "ia64.S, Version 2.1" +.ident "IA-64 ISA artwork by Andy Polyakov " + +// Copyright 2001-2016 The OpenSSL Project Authors. All Rights Reserved. +// +// Licensed under the OpenSSL license (the "License"). You may not use +// this file except in compliance with the License. You can obtain a copy +// in the file LICENSE in the source distribution or at +// https://www.openssl.org/source/license.html + +// +// ==================================================================== +// Written by Andy Polyakov for the OpenSSL +// project. +// +// Rights for redistribution and usage in source and binary forms are +// granted according to the OpenSSL license. Warranty of any kind is +// disclaimed. +// ==================================================================== +// +// Version 2.x is Itanium2 re-tune. Few words about how Itanum2 is +// different from Itanium to this module viewpoint. Most notably, is it +// "wider" than Itanium? Can you experience loop scalability as +// discussed in commentary sections? Not really:-( Itanium2 has 6 +// integer ALU ports, i.e. it's 2 ports wider, but it's not enough to +// spin twice as fast, as I need 8 IALU ports. Amount of floating point +// ports is the same, i.e. 2, while I need 4. In other words, to this +// module Itanium2 remains effectively as "wide" as Itanium. Yet it's +// essentially different in respect to this module, and a re-tune was +// required. Well, because some instruction latencies has changed. Most +// noticeably those intensively used: +// +// Itanium Itanium2 +// ldf8 9 6 L2 hit +// ld8 2 1 L1 hit +// getf 2 5 +// xma[->getf] 7[+1] 4[+0] +// add[->st8] 1[+1] 1[+0] +// +// What does it mean? You might ratiocinate that the original code +// should run just faster... Because sum of latencies is smaller... +// Wrong! Note that getf latency increased. This means that if a loop is +// scheduled for lower latency (as they were), then it will suffer from +// stall condition and the code will therefore turn anti-scalable, e.g. +// original bn_mul_words spun at 5*n or 2.5 times slower than expected +// on Itanium2! What to do? Reschedule loops for Itanium2? But then +// Itanium would exhibit anti-scalability. So I've chosen to reschedule +// for worst latency for every instruction aiming for best *all-round* +// performance. + +// Q. How much faster does it get? +// A. Here is the output from 'openssl speed rsa dsa' for vanilla +// 0.9.6a compiled with gcc version 2.96 20000731 (Red Hat +// Linux 7.1 2.96-81): +// +// sign verify sign/s verify/s +// rsa 512 bits 0.0036s 0.0003s 275.3 2999.2 +// rsa 1024 bits 0.0203s 0.0011s 49.3 894.1 +// rsa 2048 bits 0.1331s 0.0040s 7.5 250.9 +// rsa 4096 bits 0.9270s 0.0147s 1.1 68.1 +// sign verify sign/s verify/s +// dsa 512 bits 0.0035s 0.0043s 288.3 234.8 +// dsa 1024 bits 0.0111s 0.0135s 90.0 74.2 +// +// And here is similar output but for this assembler +// implementation:-) +// +// sign verify sign/s verify/s +// rsa 512 bits 0.0021s 0.0001s 549.4 9638.5 +// rsa 1024 bits 0.0055s 0.0002s 183.8 4481.1 +// rsa 2048 bits 0.0244s 0.0006s 41.4 1726.3 +// rsa 4096 bits 0.1295s 0.0018s 7.7 561.5 +// sign verify sign/s verify/s +// dsa 512 bits 0.0012s 0.0013s 891.9 756.6 +// dsa 1024 bits 0.0023s 0.0028s 440.4 376.2 +// +// Yes, you may argue that it's not fair comparison as it's +// possible to craft the C implementation with BN_UMULT_HIGH +// inline assembler macro. But of course! Here is the output +// with the macro: +// +// sign verify sign/s verify/s +// rsa 512 bits 0.0020s 0.0002s 495.0 6561.0 +// rsa 1024 bits 0.0086s 0.0004s 116.2 2235.7 +// rsa 2048 bits 0.0519s 0.0015s 19.3 667.3 +// rsa 4096 bits 0.3464s 0.0053s 2.9 187.7 +// sign verify sign/s verify/s +// dsa 512 bits 0.0016s 0.0020s 613.1 510.5 +// dsa 1024 bits 0.0045s 0.0054s 221.0 183.9 +// +// My code is still way faster, huh:-) And I believe that even +// higher performance can be achieved. Note that as keys get +// longer, performance gain is larger. Why? According to the +// profiler there is another player in the field, namely +// BN_from_montgomery consuming larger and larger portion of CPU +// time as keysize decreases. I therefore consider putting effort +// to assembler implementation of the following routine: +// +// void bn_mul_add_mont (BN_ULONG *rp,BN_ULONG *np,int nl,BN_ULONG n0) +// { +// int i,j; +// BN_ULONG v; +// +// for (i=0; i