commit c2fb09755dd7213dc2ce5532c164aacb94982ce4
parent 39477dfcbd306f38a42a1bd1981009d912ea12a9
Author: Jared Tobin <jared@jtobin.io>
Date: Sat, 16 May 2026 13:06:36 -0230
lib: dispatch mac to ARM path when available
Wire 'Crypto.MAC.Poly1305.mac' to the ARM-accelerated 'Arm.mac'
when 'poly1305_arm_available' is True; otherwise fall through to
the existing pure Haskell scalar implementation. Mirrors the
dispatch pattern in 'Crypto.Hash.SHA256.hs',
'Crypto.Cipher.ChaCha20.hs', and 'Data.ByteString.Base16.hs':
    mac key@(BI.PS _ _ kl) msg
      | kl /= 32 = Nothing
      | Arm.poly1305_arm_available =
          pure $! MAC (Arm.mac key msg)
      | otherwise = ... scalar ...
Length validation stays in the dispatcher, so 'Arm.mac' is only ever
called with a 32-byte key, which is what the Arm wrapper assumes.
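
The dispatch shape described above can be sketched in isolation. This
is a minimal, self-contained illustration, not the library's code:
'Tag', 'armAvailable', 'accelMac', and 'scalarMac' are hypothetical
stand-ins (for 'MAC', 'Arm.poly1305_arm_available', 'Arm.mac', and the
scalar loop respectively), and both backends return a dummy all-zero
tag rather than computing Poly1305.

```haskell
import qualified Data.ByteString as BS

-- Hypothetical stand-in for the library's 'MAC' newtype.
newtype Tag = Tag BS.ByteString
  deriving (Eq, Show)

-- Hypothetical stand-in for 'Arm.poly1305_arm_available'. Hard-coded
-- here; how the real flag is computed is not shown in this commit.
armAvailable :: Bool
armAvailable = False

-- Placeholder for the accelerated backend. Assumes a 32-byte key and
-- does NOT compute a real Poly1305 tag (returns 16 zero bytes).
accelMac :: BS.ByteString -> BS.ByteString -> BS.ByteString
accelMac _key _msg = BS.replicate 16 0

-- Placeholder for the scalar fallback; also a dummy 16-byte tag.
scalarMac :: BS.ByteString -> BS.ByteString -> BS.ByteString
scalarMac _key _msg = BS.replicate 16 0

-- The dispatcher owns key-length validation, so neither backend has
-- to re-check it. Guards are tried top to bottom.
dispatchMac :: BS.ByteString -> BS.ByteString -> Maybe Tag
dispatchMac key msg
  | BS.length key /= 32 = Nothing
  | armAvailable        = pure $! Tag (accelMac key msg)
  | otherwise           = pure $! Tag (scalarMac key msg)
```

Because the length check is the first guard, a wrong-sized key yields
'Nothing' regardless of which backend is selected, and flipping
'armAvailable' changes only which backend produces the tag.
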
Performance on the 114-byte RFC 8439 test vector (M4 MacBook Air,
GHC 9.10.3 + LLVM 19, '-fllvm'):
    mac (small key):  124 ns -> 66 ns  (~1.9x, stage 1 alone)
    mac (mid key):    125 ns -> 66 ns
    mac (big key):    124 ns -> 66 ns
Allocation per call drops as well: the scalar Haskell implementation
allocates through 'Wider' / 'Limb' wrappers and assorted
intermediate values; the C path allocates only the 16-byte MAC
output (~256 B per call including bytestring overhead, vs ~640+ B
previously).
All 12 tasty cases (including RFC 8439 A.3 vectors 1-11) pass when
run through the dispatched path, both under '-fllvm' and under
'-fllvm -fsanitize' (ASan + UBSan over the C kernel, with no
diagnostics reported).
The next commit replaces the scalar inner block loop with a NEON
4-way parallel kernel.
Diffstat:
1 file changed, 3 insertions(+), 0 deletions(-)
diff --git a/lib/Crypto/MAC/Poly1305.hs b/lib/Crypto/MAC/Poly1305.hs
@@ -26,6 +26,7 @@ module Crypto.MAC.Poly1305 (
   , _roll16
   ) where
 
+import qualified Crypto.MAC.Poly1305.Arm as Arm
 import qualified Data.Bits as B
 import qualified Data.ByteString as BS
 import qualified Data.ByteString.Internal as BI
@@ -173,6 +174,8 @@ mac
   -> Maybe MAC -- ^ 128-bit message authentication code
 mac key@(BI.PS _ _ kl) msg
   | kl /= 32 = Nothing
+  | Arm.poly1305_arm_available =
+      pure $! MAC (Arm.mac key msg)
   | otherwise =
       let (clamp . _roll16 -> r, _roll16 -> s) = BS.splitAt 16 key
       in  pure $! (MAC (_poly1305_loop r s msg))