commit ff08521ffe124aa922f46fde544396af98393fa4
parent 6d44ea017f092a4dccef483d24827c8283e5b830
Author: Jared Tobin <jared@jtobin.io>
Date:   Thu, 12 Sep 2024 09:51:53 +0400
meta: update readme w/perf notes
Diffstat:
 M README.md | 146 ++++++++++++++++++++++++++++++++++++++++++++----------------------------------
1 file changed, 82 insertions(+), 64 deletions(-)
diff --git a/README.md b/README.md
@@ -8,40 +8,40 @@ lazy ByteStrings, as specified by RFC's [6234][r6234] and [2104][r2104].
 A sample GHCi session:
 
 ```
-> :set -XOverloadedStrings
->
-> -- import qualified
-> import qualified Crypto.Hash.SHA256 as SHA256
->
-> -- 'hash' and 'hmac' operate on strict bytestrings
->
-> let hash_s = SHA256.hash "strict bytestring input"
-> let hmac_s = SHA256.hmac "strict secret" "strict bytestring input"
->
-> -- 'hash_lazy' and 'hmac_lazy' operate on lazy bytestrings
-> -- but note that the key for HMAC is always strict
->
-> let hash_l = SHA256.hash_lazy "lazy bytestring input"
-> let hmac_l = SHA256.hmac_lazy "strict secret" "lazy bytestring input"
->
-> -- results are always unformatted 256-bit (32-byte) strict bytestrings
->
-> import qualified Data.ByteString as BS
->
-> BS.take 10 hash_s
-"1\223\152Ha\USB\171V\a"
-> BS.take 10 hmac_l
-"\DELSOk\180\242\182'v\187"
->
-> -- you can use third-party libraries for rendering if necessary
-> -- e.g., using base16-bytestring:
->
-> import qualified Data.ByteString.Base16 as B16
->
-> B16.encode hash_s
-"31df9848611f42ab5607ea9e6de84b05d5259085abb30a7917d85efcda42b0e3"
-> B16.encode hmac_l
-"7f534f6bb4f2b62776bba3d6466e384505f2ff89c91f39800d7a0d4623a4711e"
+  > :set -XOverloadedStrings
+  >
+  > -- import qualified
+  > import qualified Crypto.Hash.SHA256 as SHA256
+  >
+  > -- 'hash' and 'hmac' operate on strict bytestrings
+  >
+  > let hash_s = SHA256.hash "strict bytestring input"
+  > let hmac_s = SHA256.hmac "strict secret" "strict bytestring input"
+  >
+  > -- 'hash_lazy' and 'hmac_lazy' operate on lazy bytestrings
+  > -- but note that the key for HMAC is always strict
+  >
+  > let hash_l = SHA256.hash_lazy "lazy bytestring input"
+  > let hmac_l = SHA256.hmac_lazy "strict secret" "lazy bytestring input"
+  >
+  > -- results are always unformatted 256-bit (32-byte) strict bytestrings
+  >
+  > import qualified Data.ByteString as BS
+  >
+  > BS.take 10 hash_s
+  "1\223\152Ha\USB\171V\a"
+  > BS.take 10 hmac_l
+  "\DELSOk\180\242\182'v\187"
+  >
+  > -- you can use third-party libraries for rendering if necessary
+  > -- e.g., using base16-bytestring:
+  >
+  > import qualified Data.ByteString.Base16 as B16
+  >
+  > B16.encode hash_s
+  "31df9848611f42ab5607ea9e6de84b05d5259085abb30a7917d85efcda42b0e3"
+  > B16.encode hmac_l
+  "7f534f6bb4f2b62776bba3d6466e384505f2ff89c91f39800d7a0d4623a4711e"
 ```
 
 ## Documentation
@@ -52,41 +52,59 @@ Haddocks (API documentation, etc.) are hosted at
 ## Performance
 
 The eventual aim is best-in-class performance for pure, highly-auditable
-Haskell code.
+Haskell code. At present we're not quite there.
 
-Benchmark figures at present:
+Current benchmark figures look like (use `cabal bench` to run the
+benchmark suite):
 
 ```
-benchmarking ppad-sha256/SHA256 (32B input)/hash
-time                 2.684 μs   (2.658 μs .. 2.714 μs)
-                     0.999 R²   (0.999 R² .. 1.000 R²)
-mean                 2.689 μs   (2.674 μs .. 2.706 μs)
-std dev              55.18 ns   (44.66 ns .. 66.35 ns)
-variance introduced by outliers: 22% (moderately inflated)
-
-benchmarking ppad-sha256/SHA256 (32B input)/hash_lazy
-time                 2.746 μs   (2.712 μs .. 2.786 μs)
-                     0.999 R²   (0.998 R² .. 1.000 R²)
-mean                 2.747 μs   (2.720 μs .. 2.784 μs)
-std dev              101.1 ns   (73.17 ns .. 144.1 ns)
-variance introduced by outliers: 49% (moderately inflated)
-
-benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
-time                 10.30 μs   (10.18 μs .. 10.48 μs)
-                     0.997 R²   (0.996 R² .. 0.998 R²)
-mean                 10.68 μs   (10.48 μs .. 10.92 μs)
-std dev              720.5 ns   (603.8 ns .. 874.2 ns)
-variance introduced by outliers: 74% (severely inflated)
-
-benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac_lazy
-time                 10.58 μs   (10.36 μs .. 10.85 μs)
-                     0.996 R²   (0.991 R² .. 0.998 R²)
-mean                 10.72 μs   (10.56 μs .. 10.93 μs)
-std dev              634.4 ns   (523.1 ns .. 868.8 ns)
-variance introduced by outliers: 68% (severely inflated)
+  benchmarking ppad-sha256/SHA256 (32B input)/hash
+  time                 2.684 μs   (2.658 μs .. 2.714 μs)
+                       0.999 R²   (0.999 R² .. 1.000 R²)
+  mean                 2.689 μs   (2.674 μs .. 2.706 μs)
+  std dev              55.18 ns   (44.66 ns .. 66.35 ns)
+  variance introduced by outliers: 22% (moderately inflated)
+
+  benchmarking ppad-sha256/SHA256 (32B input)/hash_lazy
+  time                 2.746 μs   (2.712 μs .. 2.786 μs)
+                       0.999 R²   (0.998 R² .. 1.000 R²)
+  mean                 2.747 μs   (2.720 μs .. 2.784 μs)
+  std dev              101.1 ns   (73.17 ns .. 144.1 ns)
+  variance introduced by outliers: 49% (moderately inflated)
+
+  benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
+  time                 10.30 μs   (10.18 μs .. 10.48 μs)
+                       0.997 R²   (0.996 R² .. 0.998 R²)
+  mean                 10.68 μs   (10.48 μs .. 10.92 μs)
+  std dev              720.5 ns   (603.8 ns .. 874.2 ns)
+  variance introduced by outliers: 74% (severely inflated)
+
+  benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac_lazy
+  time                 10.58 μs   (10.36 μs .. 10.85 μs)
+                       0.996 R²   (0.991 R² .. 0.998 R²)
+  mean                 10.72 μs   (10.56 μs .. 10.93 μs)
+  std dev              634.4 ns   (523.1 ns .. 868.8 ns)
+  variance introduced by outliers: 68% (severely inflated)
 ```
 
-Use `cabal bench` to run the benchmark suite.
+When testing `hash_lazy` on a 1GB input, we get a profile like the
+following:
+
+```
+  COST CENTRE                         %time %alloc
+
+  Crypto.Hash.SHA256.block_hash        72.8    4.9
+  Crypto.Hash.SHA256.prepare_schedule  15.9   32.3
+  Crypto.Hash.SHA256.blocks_lazy        3.7   37.2
+  Crypto.Hash.SHA256.parse              3.6   14.7
+  Crypto.Hash.SHA256.hash_alg           2.1    2.9
+  hash                                  1.3    8.0
+```
+
+As low-hanging fruit, time and allocation can likely be reduced by
+unpacking the strict bytestrings used to represent 512-bit blocks, and
+also by replacing several internal data structures with unboxed tuples,
+extended literals, etc.
 
 ## Security