commit d27e503317a3685939a5bea559126b8c86264b7b
parent 83dfead9e9cba75d7b927a61741df8bc28e20be8
Author: Jared Tobin <jared@jtobin.io>
Date: Sun, 26 Jan 2025 20:36:48 +0400
meta: perf note
(cherry picked from commit 3627073e287e6917c9d60f96d9cb37da5bc418cf)
Diffstat:
| M | README.md | | | 31 | ++++++++++++++++++++++++++----- |
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/README.md b/README.md
@@ -41,11 +41,6 @@ A sample GHCi session:
65535
```
-## Documentation
-
-Haddocks (API documentation, etc.) are hosted at
-[docs.ppad.tech/fixed](https://docs.ppad.tech/fixed).
-
## Performance
The aim is best-in-class performance for pure, highly-auditable Haskell
@@ -53,6 +48,32 @@ code. Most operations beat out their variable-length Integer variants,
*but*, we have yet to reach parity with the all-important division and
modulo operations.
+There are a few ways I can imagine this being improved:
+
+* Extremely careful use of Data.Primitive.PrimArray can likely beat
+ the pure code that currently implements most of the library. A
+ single allocation of a maximum 21-long mutable PrimArray is not
+ terribly expensive. In particular finding a remainder, e.g. for
+ modulo, requires comparatively few expensive writes.
+
+ We require, for cases like modular multplication, at most 8 elements
+ for the quotient, 9 elements for the dividend/normalized dividend
+ (i.e. reusing the same area for both), and 8 elements for the
+ divisor/normalized divisor (ditto), assuming we stick with Knuth's division
+ algorithm. In other situations, e.g. division or modulo, we need less.
+
+* Avoid unnecessarily calculating the quotient or remainder if only one
+ of the two is desired.
+
+* It's possible that manually writing internals using primitives (think
+ MagicHash, UnboxedTuples, GHC.Exts) could yield an improvement; I've
+ noticed that GHC doesn't always unbox everything automatically. This
+ is possibly due to register contention, though, so it could also
+ potentially degrade performance.
+
+* An alternative division algorithm might be worth looking into, if it
+ proves to be impossible to squeeze more out of Knuth's.
+
Current benchmark figures on my mid-2020 MacBook Air look like (use
`cabal bench` to run the benchmark suite):