commit f35e58b1f912f0f37fa7f2a88635a0238e80ea7d
parent f52b5ee9a8273b95c461cc71fa278495fc48d029
Author: Jared Tobin <jared@jtobin.io>
Date: Sat, 16 May 2026 12:51:31 -0230
lib: dispatch cipher and block to ARM NEON when available
Wire 'Crypto.Cipher.ChaCha20.cipher' and 'block' to the NEON path
added in the previous commit, with the existing scalar
implementations as the fallback. Mirrors the dispatch pattern in
'Crypto.Hash.SHA256.hs' and 'Data.ByteString.Base16.hs':
    block key counter nonce
      | kl /= 32 = Left InvalidKey
      | nl /= 12 = Left InvalidNonce
      | Arm.chacha20_arm_available =
          Right (Arm.block key counter nonce)
      | otherwise = pure $ runST $ do ... -- scalar
Same shape for 'cipher'. Length validation stays in the dispatcher
so the Arm wrappers can assume valid inputs.
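For readers without the rest of the tree at hand, the guard ordering
can be sketched in isolation. This is only an illustration of the
shape, not the library's code: 'accelAvailable', 'accelKernel' and
'scalarKernel' are hypothetical stand-ins (the kernels just echo
their input), and the counter argument is omitted for brevity.

```haskell
module Main where

import qualified Data.ByteString as BS

data Error = InvalidKey | InvalidNonce
  deriving (Eq, Show)

-- Hypothetical stand-in for a capability flag such as
-- 'Arm.chacha20_arm_available'; hard-coded here.
accelAvailable :: Bool
accelAvailable = False

-- Hypothetical stand-ins for the two backends. Both simply return the
-- input unchanged so that the sketch stays self-contained.
accelKernel, scalarKernel
  :: BS.ByteString -> BS.ByteString -> BS.ByteString -> BS.ByteString
accelKernel _ _ = id
scalarKernel _ _ = id

-- Validate lengths once, up front, then pick a backend. Neither
-- backend is reached with a malformed key or nonce.
dispatch
  :: BS.ByteString -- ^ 32-byte key
  -> BS.ByteString -- ^ 12-byte nonce
  -> BS.ByteString -- ^ input
  -> Either Error BS.ByteString
dispatch key nonce input
  | BS.length key /= 32   = Left InvalidKey
  | BS.length nonce /= 12 = Left InvalidNonce
  | accelAvailable        = Right (accelKernel key nonce input)
  | otherwise             = Right (scalarKernel key nonce input)

main :: IO ()
main = do
  print (dispatch (BS.replicate 31 0) (BS.replicate 12 0) BS.empty)
  print (dispatch (BS.replicate 32 0) (BS.replicate 11 0) BS.empty)
```

Because the guards are tried top to bottom, both length checks run
before the availability test, which is what lets the accelerated
wrapper skip its own validation.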
Performance on the existing 114-byte RFC 8439 test vector (M4
MacBook Air, GHC 9.10.3 + LLVM 19, '-fllvm'):
cipher time: 478 ns -> 282 ns (~1.7x)
Allocation per call (via 'weigh') drops by roughly 16x to 137x across
the measured sizes, because the scalar path accumulates intermediate
per-block ByteStrings through a Builder while the NEON path writes
into a single buffer allocated with 'BI.unsafeCreate plen':
block: 4,968 B -> 312 B (~16x less)
cipher 64B input: 42,584 B -> 448 B (~95x less)
cipher 256B input: 61,568 B -> 448 B (~137x less)
cipher 1024B input: 121,376 B -> 4,072 B (~30x less)
cipher 4096B input: 406,168 B -> 4,568 B (~89x less)
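The two allocation strategies can be contrasted with a toy XOR
stream. This is a rough sketch under stated assumptions, not the code
in this repository: 'keystreamByte' is a hypothetical placeholder for
a real keystream, and 'viaBuilder' / 'viaSingleBuffer' are
illustrative names. Only 'BI.unsafeCreate' is taken from the commit.

```haskell
module Main where

import Data.Bits (xor)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Internal as BI
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Unsafe as BU
import Data.Word (Word8)
import Foreign.Storable (pokeByteOff)

-- Hypothetical keystream: NOT ChaCha20, just a deterministic byte per
-- offset so the two strategies can be compared.
keystreamByte :: Int -> Word8
keystreamByte i = fromIntegral (i * 7 + 1)

-- Strategy 1: build one intermediate ByteString per 64-byte block and
-- append each to a Builder, then flatten to a strict ByteString.
viaBuilder :: BS.ByteString -> BS.ByteString
viaBuilder = BL.toStrict . BSB.toLazyByteString . go 0
  where
    go off bs
      | BS.null bs = mempty
      | otherwise =
          let (chunk, rest) = BS.splitAt 64 bs
              out = BS.pack
                (zipWith xor (BS.unpack chunk) (map keystreamByte [off ..]))
          in  BSB.byteString out <> go (off + BS.length chunk) rest

-- Strategy 2: allocate the output once and write every byte in place.
viaSingleBuffer :: BS.ByteString -> BS.ByteString
viaSingleBuffer bs = BI.unsafeCreate len $ \ptr ->
    mapM_
      (\i -> pokeByteOff ptr i (BU.unsafeIndex bs i `xor` keystreamByte i))
      [0 .. len - 1]
  where
    len = BS.length bs

main :: IO ()
main = do
  let input = BS.pack [0 .. 199]
  -- Both strategies must agree byte for byte.
  print (viaBuilder input == viaSingleBuffer input)
```

Both functions produce the same bytes; the difference is that the
first allocates a fresh ByteString per block plus the Builder's own
buffers, while the second allocates only the final output.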
The 1.7x wall-time speedup on the 114B vector is a floor figure: that
input spans only 2 blocks (64 + 50 bytes), so FFI overhead and
per-call setup dominate. Larger inputs amortise the FFI call across
more SIMD work and should see a proportionally larger speedup.
All 8 tasty cases (including RFC 8439 A.2 vectors 1, 2, 3) pass
through the dispatched path, both under '-fllvm' and under
'-fllvm -fsanitize' (ASan + UBSan over the C kernel, with no
diagnostics reported).
Diffstat:
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/lib/Crypto/Cipher/ChaCha20.hs b/lib/Crypto/Cipher/ChaCha20.hs
@@ -34,6 +34,7 @@ module Crypto.Cipher.ChaCha20 (
) where
import Control.Monad.ST
+import qualified Crypto.Cipher.ChaCha20.Arm as Arm
import qualified Data.Bits as B
import Data.Bits ((.|.), (.<<.), (.^.))
import qualified Data.ByteString as BS
@@ -289,6 +290,8 @@ block
block key@(BI.PS _ _ kl) counter nonce@(BI.PS _ _ nl)
| kl /= 32 = Left InvalidKey
| nl /= 12 = Left InvalidNonce
+ | Arm.chacha20_arm_available =
+ Right (Arm.block key counter nonce)
| otherwise = pure $ runST $ do
let k = _parse_key key
n = _parse_nonce nonce
@@ -341,8 +344,10 @@ cipher
-> BS.ByteString -- ^ arbitrary-length plaintext
-> Either Error BS.ByteString -- ^ ciphertext
cipher raw_key@(BI.PS _ _ kl) counter raw_nonce@(BI.PS _ _ nl) plaintext
- | kl /= 32 = Left InvalidKey
- | nl /= 12 = Left InvalidNonce
+ | kl /= 32 = Left InvalidKey
+ | nl /= 12 = Left InvalidNonce
+ | Arm.chacha20_arm_available =
+ Right (Arm.cipher raw_key counter raw_nonce plaintext)
| otherwise = pure $ runST $ do
let key = _parse_key raw_key
non = _parse_nonce raw_nonce