Merge branch 'arm-neon' - chacha - The ChaCha20 stream cipher (docs.ppad.tech/chacha).

commit cca0ddb767d963fa84289d3adbe878821fcb7e06
parent 7ec51583b56223898f63591c2dab718c3cd62b16
Author: Jared Tobin <jared@jtobin.io>
Date:   Sat, 16 May 2026 13:25:34 -0230

Merge branch 'arm-neon'

Add aarch64 NEON acceleration for the ChaCha20 stream cipher and
block function, following the integration pattern used by
ppad-sha256 / ppad-base16: a small C kernel under 'cbits/' with
NEON intrinsics, a thin Haskell FFI module, runtime detection via
a CAF, dispatch in the top-level module, and a 'sanitize' cabal
flag that turns on ASan/UBSan instrumentation for testing.  Falls
back to the existing pure Haskell scalar implementation on
non-aarch64 platforms.

Single package, single Hackage release.

Changes:

* New 'cbits/chacha20_arm.c' containing the NEON kernel and a
  'chacha20_arm_available' probe.  Intra-block parallelism: the
  16-word ChaCha20 state matrix is held in four 128-bit NEON
  registers v0..v3, one per row.  The column quarter-rounds on
  (s00,s04,s08,s12), (s01,s05,s09,s13), ... become one set of
  element-wise vector operations on (v0,v1,v2,v3), i.e. four
  quarter-rounds in parallel per round.  Diagonal rounds are
  reached by left-rotating v1, v2, v3 by 1, 2, 3 lanes
  ('vextq_u32', the EXT instruction on aarch64), running another
  column round, then rotating back by 3, 2, 1 lanes.  Rotations:
  ROTL-by-16 uses 'vrev32q_u16' (REV32 on 16-bit lanes); the
  others are written as shift-left, shift-right, OR.  Body gated
  by '#if defined(__aarch64__)' with stubs in the '#else' branch.

* New 'lib/Crypto/Cipher/ChaCha20/Arm.hs' wraps the C kernel via
  'foreign import ccall unsafe'.  'chacha20_arm_available :: Bool'
  is a NOINLINE CAF.  'block' wraps the 64-byte keystream
  generator; 'cipher' wraps the streaming XOR via
  'BI.unsafeCreate plen'.  Module is exposed (matching sha256's
  Arm module exposure) but hidden from haddock.

* 'lib/Crypto/Cipher/ChaCha20.hs' now dispatches 'cipher' and
  'block' to the NEON path when 'chacha20_arm_available' is True;
  the existing scalar bodies stay in place as the fallback.

* 'ppad-chacha.cabal' adds 'c-sources', 'arch(aarch64)' cc-options
  ('-march=armv8-a'; NEON is part of the armv8-a baseline, so no
  extension flag is needed), and a new 'sanitize' flag wiring
  '-fsanitize=address,undefined -fno-omit-frame-pointer' into the
  C compile and '-fsanitize=address,undefined' into the library
  and test-suite link, mirroring 'ppad-sha256'.

* README performance section updated with the new criterion
  figures and the standard 'where we avail of hardware
  acceleration' note used in sha256.
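
The row-rotation trick described in the first bullet can be checked
with a scalar model.  The sketch below is not the NEON kernel: it is
plain C written for illustration, and every helper name in it
('rotate_row', 'diagonal_round_rotated', 'check_row_rotation', ...) is
hypothetical.  'rotate_row' stands in for 'vextq_u32(r, r, n)'.  It
checks that rotating rows 1, 2, 3 left by 1, 2, 3 lanes and running a
column round yields the same state as the diagonal quarter-rounds
indexed as in RFC 8439, and that the quarter-round reproduces the
RFC 8439 section 2.1.1 test vector.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* ChaCha quarter-round on four words, as in RFC 8439 section 2.1. */
static void quarter_round(uint32_t *a, uint32_t *b,
                          uint32_t *c, uint32_t *d) {
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/* Rotate a 4-word row left by n lanes: scalar stand-in (assumption)
 * for vextq_u32(row, row, n). */
static void rotate_row(uint32_t row[4], int n) {
    uint32_t t[4];
    for (int i = 0; i < 4; i++) t[i] = row[(i + n) & 3];
    memcpy(row, t, sizeof t);
}

/* Column round: one quarter-round per column of the 4x4 state. */
static void column_round(uint32_t s[16]) {
    for (int i = 0; i < 4; i++)
        quarter_round(&s[i], &s[4 + i], &s[8 + i], &s[12 + i]);
}

/* Diagonal round expressed as rotate, column round, rotate back. */
static void diagonal_round_rotated(uint32_t s[16]) {
    rotate_row(s + 4, 1); rotate_row(s + 8, 2); rotate_row(s + 12, 3);
    column_round(s);
    rotate_row(s + 4, 3); rotate_row(s + 8, 2); rotate_row(s + 12, 1);
}

/* Diagonal round with the explicit indices from RFC 8439. */
static void diagonal_round_indexed(uint32_t s[16]) {
    quarter_round(&s[0], &s[5], &s[10], &s[15]);
    quarter_round(&s[1], &s[6], &s[11], &s[12]);
    quarter_round(&s[2], &s[7], &s[8],  &s[13]);
    quarter_round(&s[3], &s[4], &s[9],  &s[14]);
}

/* Returns 1 if the quarter-round matches the RFC 8439 2.1.1 vector. */
int quarter_round_matches_rfc(void) {
    uint32_t a = 0x11111111u, b = 0x01020304u,
             c = 0x9b8d6f43u, d = 0x01234567u;
    quarter_round(&a, &b, &c, &d);
    return a == 0xea2a92f4u && b == 0xcb1cf8ceu
        && c == 0x4581472eu && d == 0x5881c4bbu;
}

/* Returns 1 if both diagonal-round forms agree on an arbitrary state. */
int diagonal_forms_agree(void) {
    uint32_t x[16], y[16];
    for (uint32_t i = 0; i < 16; i++) x[i] = 0x9e3779b9u * (i + 1);
    memcpy(y, x, sizeof x);
    diagonal_round_rotated(x);
    diagonal_round_indexed(y);
    return memcmp(x, y, sizeof x) == 0;
}

/* Run both checks; aborts via assert on any mismatch. */
void check_row_rotation(void) {
    assert(quarter_round_matches_rfc());
    assert(diagonal_forms_agree());
}
```

Calling 'check_row_rotation()' from a test driver exercises both
properties.  The four quarter-rounds of a round touch disjoint words,
which is why they can run lane-wise in a single set of vector
operations.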

Verification (M4 MacBook Air, GHC 9.10.3 + LLVM 19, '-fllvm'):

* 'cabal test -fllvm'           — 8/8 tests pass through the ARM
                                   path, including RFC 8439 A.2
                                   vectors 1, 2, 3.
* 'cabal test -fllvm -fsanitize' — same suite, instrumented with
                                   ASan + UBSan over the C kernel.
                                   No diagnostics.
* 'cabal bench -fllvm':
    cipher time (114B): 478 ns -> 267 ns   (~1.8x)
* 'cabal bench -fllvm chacha-weigh': per-call allocation drops by
  roughly 16x to 137x across the size range, because the scalar
  path was accumulating intermediate per-block ByteStrings through
  a Builder while the NEON path writes into a single
  'BI.unsafeCreate plen' buffer:
    block:                4,968 B ->   312 B  (~16x less)
    cipher 64B input:    42,584 B ->   448 B  (~95x less)
    cipher 256B input:   61,568 B ->   448 B  (~137x less)
    cipher 1024B input: 121,376 B -> 4,072 B  (~30x less)
    cipher 4096B input: 406,168 B -> 4,568 B  (~89x less)

The ~1.8x wall-time speedup on the 114B RFC test vector is a floor
figure: that input spans only 2 blocks (one full 64-byte block
plus a 50-byte tail), so FFI overhead and per-call setup dominate.
Longer inputs amortise the FFI call and setup across more blocks
and should recover proportionally more.
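
The block arithmetic behind that figure can be sketched as follows.
This is an illustrative helper set, not part of the commit; the names
'full_blocks', 'tail_bytes', 'core_calls', and 'check_rfc_input' are
hypothetical.  It mirrors the shape of the kernel's loop: one core
invocation per full 64-byte block, plus one more for a non-empty tail.

```c
#include <assert.h>
#include <stddef.h>

#define CHACHA_BLOCK 64u

/* Full 64-byte blocks handled by the main loop. */
size_t full_blocks(size_t len) { return len / CHACHA_BLOCK; }

/* Bytes left for the trailing partial block (0 if len % 64 == 0). */
size_t tail_bytes(size_t len) { return len % CHACHA_BLOCK; }

/* Invocations of the 20-round core: one per full block, plus one
 * if a partial block remains. */
size_t core_calls(size_t len) {
    return full_blocks(len) + (tail_bytes(len) != 0 ? 1 : 0);
}

/* Worked example for the 114-byte RFC 8439 input. */
void check_rfc_input(void) {
    assert(full_blocks(114) == 1);
    assert(tail_bytes(114) == 50);
    assert(core_calls(114) == 2);
}
```

For a 4096-byte input the same helpers give 64 core invocations per
FFI call, which is where the amortisation argument comes from.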

Non-aarch64 builds are unchanged: the '#else'-branch C stubs
return availability = 0 and the dispatcher falls through to the
scalar path.
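
The probe, stub, and fallback pattern can be sketched in C alone.  In
the commit the dispatcher is Haskell and the probe result is cached in
a CAF; the version below is only an analogy under that assumption,
with hypothetical names ('xor_accel', 'xor_scalar', 'xor_dispatch',
'check_dispatch') and a plain loop standing in for the accelerated
kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
/* Stand-in for an accelerated kernel; a real one would use NEON. */
static void xor_accel(const uint8_t *in, const uint8_t *ks,
                      uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] ^ ks[i];
}
static int accel_available(void) { return 1; }
#else
/* Stub so the symbol still exists; the probe keeps it unreachable. */
static void xor_accel(const uint8_t *in, const uint8_t *ks,
                      uint8_t *out, size_t n) {
    (void)in; (void)ks; (void)out; (void)n;
}
static int accel_available(void) { return 0; }
#endif

/* Portable fallback, compiled on every platform. */
static void xor_scalar(const uint8_t *in, const uint8_t *ks,
                       uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] ^ ks[i];
}

/* Probe once, cache the answer, then dispatch.  The cache is not
 * thread-safe as written; this is a sketch only. */
void xor_dispatch(const uint8_t *in, const uint8_t *ks,
                  uint8_t *out, size_t n) {
    static int probed = -1;
    if (probed < 0) probed = accel_available();
    if (probed) xor_accel(in, ks, out, n);
    else        xor_scalar(in, ks, out, n);
}

/* Both paths must agree, whichever one the platform selects. */
void check_dispatch(void) {
    const uint8_t in[4] = { 0x00, 0xff, 0x0f, 0xaa };
    const uint8_t ks[4] = { 0x5a, 0x5a, 0x5a, 0x5a };
    uint8_t got[4], want[4];
    xor_dispatch(in, ks, got, 4);
    xor_scalar(in, ks, want, 4);
    assert(memcmp(got, want, 4) == 0);
    assert(got[0] == 0x5a && got[1] == 0xa5);
}
```

Because the stub is only reachable when the probe returns nonzero, a
non-aarch64 build always takes the scalar branch and its output is
unaffected by the presence of the stub.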

Diffstat:
M README.md  | 20 +++++++++-----------
A cbits/chacha20_arm.c  | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M flake.nix  | 2 +-
M lib/Crypto/Cipher/ChaCha20.hs  | 11 ++++++++---
A lib/Crypto/Cipher/ChaCha20/Arm.hs  | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M ppad-chacha.cabal  | 19 +++++++++++++++++--

6 files changed, 316 insertions(+), 17 deletions(-)
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 ![](https://img.shields.io/badge/license-MIT-brightgreen)
 [![](https://img.shields.io/badge/haddock-chacha-lightblue)](https://docs.ppad.tech/chacha)
 
-A pure Haskell implementation of the ChaCha20 stream cipher as specified
+A fast Haskell implementation of the ChaCha20 stream cipher as specified
 by [RFC8439][8439].
 
 ## Usage
@@ -36,19 +36,17 @@ Haddocks (API documentation, etc.) are hosted at
 
 ## Performance
 
-The aim is best-in-class performance for pure, highly-auditable Haskell
-code.
-
-Current benchmark figures on the simple "sunscreen input" from RFC8439
-on an M4 Silicon MacBook Air look like (use `cabal bench` to run the
-benchmark suite):
+The aim is best-in-class performance. Current benchmark figures on the
+simple "sunscreen input" from RFC8439 on an M4 Silicon MacBook Air,
+where we avail of hardware acceleration via ARM NEON intrinsics, look
+like (use `cabal bench` to run the benchmark suite):
 
 ```
   benchmarking ppad-chacha/cipher
-  time                 468.3 ns   (467.9 ns .. 468.8 ns)
-                       1.000 R²   (1.000 R² .. 1.000 R²)
-  mean                 468.4 ns   (468.0 ns .. 469.2 ns)
-  std dev              2.041 ns   (1.317 ns .. 3.539 ns)
+  time                 267.1 ns   (266.0 ns .. 268.2 ns)
+                       1.000 R²   (0.999 R² .. 1.000 R²)
+  mean                 267.1 ns   (264.8 ns .. 270.3 ns)
+  std dev              8.576 ns   (6.191 ns .. 11.56 ns)
 ```
 
 You should compile with the 'llvm' flag for maximum performance.
diff --git a/cbits/chacha20_arm.c b/cbits/chacha20_arm.c
@@ -0,0 +1,199 @@
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#if defined(__aarch64__)
+
+#include <arm_neon.h>
+
+/*
+ * ChaCha20 NEON kernel using intra-block parallelism.  The 16-word
+ * state matrix
+ *
+ *     s00 s01 s02 s03
+ *     s04 s05 s06 s07
+ *     s08 s09 s10 s11
+ *     s12 s13 s14 s15
+ *
+ * is held in four 128-bit NEON registers v0..v3, one per row.  A
+ * column quarter-round on (s00, s04, s08, s12), (s01, s05, s09, s13),
+ * etc., becomes one set of element-wise vector operations on
+ * (v0, v1, v2, v3) — four quarter-rounds in parallel.  Diagonal
+ * rounds are reached by left-rotating v1, v2, v3 by 1, 2, 3 lanes
+ * respectively with VEXT before the second quarter-round, then
+ * rotating back.
+ */
+
+/* 32-bit left rotations.  Rotate-by-16 reduces to REV32.u16; the
+ * others compile to a shift-shift-or pair (the compiler folds rotate-
+ * by-8 to a TBL with a constant shuffle on some targets).         */
+#define ROTL32_16(x) \
+    vreinterpretq_u32_u16(vrev32q_u16(vreinterpretq_u16_u32(x)))
+#define ROTL32_12(x) \
+    vorrq_u32(vshlq_n_u32((x), 12), vshrq_n_u32((x), 20))
+#define ROTL32_8(x) \
+    vorrq_u32(vshlq_n_u32((x),  8), vshrq_n_u32((x), 24))
+#define ROTL32_7(x) \
+    vorrq_u32(vshlq_n_u32((x),  7), vshrq_n_u32((x), 25))
+
+#define QUARTER(v0, v1, v2, v3)                          \
+    do {                                                  \
+        v0 = vaddq_u32(v0, v1);                           \
+        v3 = veorq_u32(v3, v0); v3 = ROTL32_16(v3);       \
+        v2 = vaddq_u32(v2, v3);                           \
+        v1 = veorq_u32(v1, v2); v1 = ROTL32_12(v1);       \
+        v0 = vaddq_u32(v0, v1);                           \
+        v3 = veorq_u32(v3, v0); v3 = ROTL32_8(v3);        \
+        v2 = vaddq_u32(v2, v3);                           \
+        v1 = veorq_u32(v1, v2); v1 = ROTL32_7(v1);        \
+    } while (0)
+
+/* 20-round ChaCha20 core: 10 iterations of (column + diagonal). */
+static inline void chacha20_core(uint32x4_t *v0, uint32x4_t *v1,
+                                  uint32x4_t *v2, uint32x4_t *v3,
+                                  uint32x4_t s0, uint32x4_t s1,
+                                  uint32x4_t s2, uint32x4_t s3) {
+    uint32x4_t a = s0, b = s1, c = s2, d = s3;
+    for (int i = 0; i < 10; i++) {
+        QUARTER(a, b, c, d);
+        /* shift rows: row 1 left 1, row 2 left 2, row 3 left 3.    */
+        b = vextq_u32(b, b, 1);
+        c = vextq_u32(c, c, 2);
+        d = vextq_u32(d, d, 3);
+        QUARTER(a, b, c, d);
+        /* shift back.                                              */
+        b = vextq_u32(b, b, 3);
+        c = vextq_u32(c, c, 2);
+        d = vextq_u32(d, d, 1);
+    }
+    *v0 = vaddq_u32(a, s0);
+    *v1 = vaddq_u32(b, s1);
+    *v2 = vaddq_u32(c, s2);
+    *v3 = vaddq_u32(d, s3);
+}
+
+static const uint32_t chacha_constants[4] = {
+    0x61707865u, 0x3320646eu, 0x79622d32u, 0x6b206574u
+};
+
+/* Set up the constant rows of the state from key + nonce.  s3
+ * (counter + nonce) varies per block and is built inside the loop. */
+static inline void chacha20_setup(const uint8_t key[32],
+                                   const uint8_t nonce[12],
+                                   uint32x4_t *s0, uint32x4_t *s1,
+                                   uint32x4_t *s2,
+                                   uint32_t *n0, uint32_t *n1,
+                                   uint32_t *n2) {
+    *s0 = vld1q_u32(chacha_constants);
+    *s1 = vreinterpretq_u32_u8(vld1q_u8(key));
+    *s2 = vreinterpretq_u32_u8(vld1q_u8(key + 16));
+    memcpy(n0, nonce + 0, 4);
+    memcpy(n1, nonce + 4, 4);
+    memcpy(n2, nonce + 8, 4);
+}
+
+/*
+ * Generate one 64-byte ChaCha20 keystream block at 'out'.
+ */
+void chacha20_block_arm(const uint8_t key[32], uint32_t counter,
+                        const uint8_t nonce[12], uint8_t out[64]) {
+    uint32x4_t s0, s1, s2;
+    uint32_t n0, n1, n2;
+    chacha20_setup(key, nonce, &s0, &s1, &s2, &n0, &n1, &n2);
+
+    uint32_t s3_in[4] = { counter, n0, n1, n2 };
+    uint32x4_t s3 = vld1q_u32(s3_in);
+    uint32x4_t v0, v1, v2, v3;
+    chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+    vst1q_u8(out + 0,  vreinterpretq_u8_u32(v0));
+    vst1q_u8(out + 16, vreinterpretq_u8_u32(v1));
+    vst1q_u8(out + 32, vreinterpretq_u8_u32(v2));
+    vst1q_u8(out + 48, vreinterpretq_u8_u32(v3));
+}
+
+/*
+ * Encrypt/decrypt 'inlen' bytes at 'in' into 'out' using ChaCha20
+ * with the given key, starting counter, and nonce.  Stream cipher,
+ * so the same routine decrypts.
+ */
+void chacha20_cipher_arm(const uint8_t key[32], uint32_t counter,
+                         const uint8_t nonce[12],
+                         const uint8_t *in, uint8_t *out,
+                         size_t inlen) {
+    uint32x4_t s0, s1, s2;
+    uint32_t n0, n1, n2;
+    chacha20_setup(key, nonce, &s0, &s1, &s2, &n0, &n1, &n2);
+
+    size_t pos = 0;
+    while (pos + 64 <= inlen) {
+        uint32_t s3_in[4] = { counter, n0, n1, n2 };
+        uint32x4_t s3 = vld1q_u32(s3_in);
+        uint32x4_t v0, v1, v2, v3;
+        chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+        uint8x16_t i0 = vld1q_u8(in + pos +  0);
+        uint8x16_t i1 = vld1q_u8(in + pos + 16);
+        uint8x16_t i2 = vld1q_u8(in + pos + 32);
+        uint8x16_t i3 = vld1q_u8(in + pos + 48);
+
+        vst1q_u8(out + pos +  0,
+                 veorq_u8(i0, vreinterpretq_u8_u32(v0)));
+        vst1q_u8(out + pos + 16,
+                 veorq_u8(i1, vreinterpretq_u8_u32(v1)));
+        vst1q_u8(out + pos + 32,
+                 veorq_u8(i2, vreinterpretq_u8_u32(v2)));
+        vst1q_u8(out + pos + 48,
+                 veorq_u8(i3, vreinterpretq_u8_u32(v3)));
+
+        pos += 64;
+        counter++;
+    }
+
+    /* trailing partial block (< 64 bytes) */
+    if (pos < inlen) {
+        uint32_t s3_in[4] = { counter, n0, n1, n2 };
+        uint32x4_t s3 = vld1q_u32(s3_in);
+        uint32x4_t v0, v1, v2, v3;
+        chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+        uint8_t block[64];
+        vst1q_u8(block +  0, vreinterpretq_u8_u32(v0));
+        vst1q_u8(block + 16, vreinterpretq_u8_u32(v1));
+        vst1q_u8(block + 32, vreinterpretq_u8_u32(v2));
+        vst1q_u8(block + 48, vreinterpretq_u8_u32(v3));
+
+        size_t remaining = inlen - pos;
+        for (size_t i = 0; i < remaining; i++) {
+            out[pos + i] = in[pos + i] ^ block[i];
+        }
+    }
+}
+
+int chacha20_arm_available(void) {
+    return 1;
+}
+
+#else
+
+/* stubs for non-aarch64 builds; never reached because
+ * 'chacha20_arm_available' returns 0 here, so dispatch skips them */
+
+void chacha20_block_arm(const uint8_t *key, uint32_t counter,
+                        const uint8_t *nonce, uint8_t *out) {
+    (void)key; (void)counter; (void)nonce; (void)out;
+}
+
+void chacha20_cipher_arm(const uint8_t *key, uint32_t counter,
+                         const uint8_t *nonce,
+                         const uint8_t *in, uint8_t *out,
+                         size_t inlen) {
+    (void)key; (void)counter; (void)nonce;
+    (void)in; (void)out; (void)inlen;
+}
+
+int chacha20_arm_available(void) {
+    return 0;
+}
+
+#endif
diff --git a/flake.nix b/flake.nix
@@ -1,5 +1,5 @@
 {
-  description = "A pure Haskell ChaCha stream cipher.";
+  description = "A fast Haskell ChaCha stream cipher.";
 
   inputs = {
     ppad-base16 = {
diff --git a/lib/Crypto/Cipher/ChaCha20.hs b/lib/Crypto/Cipher/ChaCha20.hs
@@ -10,7 +10,7 @@
 -- License: MIT
 -- Maintainer: Jared Tobin <jared@ppad.tech>
 --
--- A pure ChaCha20 implementation, as specified by
+-- A fast ChaCha20 implementation, as specified by
 -- [RFC 8439](https://datatracker.ietf.org/doc/html/rfc8439).
 
 module Crypto.Cipher.ChaCha20 (
@@ -34,6 +34,7 @@ module Crypto.Cipher.ChaCha20 (
   ) where
 
 import Control.Monad.ST
+import qualified Crypto.Cipher.ChaCha20.Arm as Arm
 import qualified Data.Bits as B
 import Data.Bits ((.|.), (.<<.), (.^.))
 import qualified Data.ByteString as BS
@@ -289,6 +290,8 @@ block
 block key@(BI.PS _ _ kl) counter nonce@(BI.PS _ _ nl)
   | kl /= 32 = Left InvalidKey
   | nl /= 12 = Left InvalidNonce
+  | Arm.chacha20_arm_available =
+      Right (Arm.block key counter nonce)
   | otherwise = pure $ runST $ do
       let k = _parse_key key
           n = _parse_nonce nonce
@@ -341,8 +344,10 @@ cipher
   -> BS.ByteString    -- ^ arbitrary-length plaintext
   -> Either Error BS.ByteString    -- ^ ciphertext
 cipher raw_key@(BI.PS _ _ kl) counter raw_nonce@(BI.PS _ _ nl) plaintext
-  | kl /= 32  = Left InvalidKey
-  | nl /= 12  = Left InvalidNonce
+  | kl /= 32 = Left InvalidKey
+  | nl /= 12 = Left InvalidNonce
+  | Arm.chacha20_arm_available =
+      Right (Arm.cipher raw_key counter raw_nonce plaintext)
   | otherwise = pure $ runST $ do
       let key = _parse_key raw_key
           non = _parse_nonce raw_nonce
diff --git a/lib/Crypto/Cipher/ChaCha20/Arm.hs b/lib/Crypto/Cipher/ChaCha20/Arm.hs
@@ -0,0 +1,82 @@
+{-# OPTIONS_HADDOCK hide #-}
+{-# LANGUAGE BangPatterns #-}
+
+-- |
+-- Module: Crypto.Cipher.ChaCha20.Arm
+-- Copyright: (c) 2025 Jared Tobin
+-- License: MIT
+-- Maintainer: Jared Tobin <jared@ppad.tech>
+--
+-- ARM NEON support for the ChaCha20 stream cipher.
+
+module Crypto.Cipher.ChaCha20.Arm (
+    chacha20_arm_available
+  , block
+  , cipher
+  ) where
+
+import qualified Data.ByteString as BS
+import qualified Data.ByteString.Internal as BI
+import Data.Word (Word8, Word32)
+import Foreign.C.Types (CInt(..), CSize(..))
+import Foreign.ForeignPtr (withForeignPtr)
+import Foreign.Ptr (Ptr, plusPtr)
+import System.IO.Unsafe (unsafeDupablePerformIO)
+
+-- ffi ------------------------------------------------------------------------
+
+foreign import ccall unsafe "chacha20_block_arm"
+  c_chacha20_block
+    :: Ptr Word8 -> Word32 -> Ptr Word8 -> Ptr Word8 -> IO ()
+
+foreign import ccall unsafe "chacha20_cipher_arm"
+  c_chacha20_cipher
+    :: Ptr Word8 -> Word32 -> Ptr Word8
+    -> Ptr Word8 -> Ptr Word8 -> CSize -> IO ()
+
+foreign import ccall unsafe "chacha20_arm_available"
+  c_chacha20_arm_available :: IO CInt
+
+-- utilities ------------------------------------------------------------------
+
+fi :: (Integral a, Num b) => a -> b
+fi = fromIntegral
+{-# INLINE fi #-}
+
+-- api ------------------------------------------------------------------------
+
+-- | Are ARM NEON extensions available?
+chacha20_arm_available :: Bool
+chacha20_arm_available =
+  unsafeDupablePerformIO c_chacha20_arm_available /= 0
+{-# NOINLINE chacha20_arm_available #-}
+
+-- | One 64-byte ChaCha20 keystream block for the given (already-
+--   validated) key, counter, and nonce.
+block :: BS.ByteString -> Word32 -> BS.ByteString -> BS.ByteString
+block (BI.PS kfp koff _) counter (BI.PS nfp noff _) =
+  BI.unsafeCreate 64 $ \dst ->
+    withForeignPtr kfp $ \kp0 ->
+    withForeignPtr nfp $ \np0 ->
+      c_chacha20_block (kp0 `plusPtr` koff)
+                       counter
+                       (np0 `plusPtr` noff)
+                       dst
+
+-- | XOR the plaintext with the ChaCha20 keystream derived from the
+--   given (already-validated) key, counter, and nonce.
+cipher
+  :: BS.ByteString -> Word32 -> BS.ByteString -> BS.ByteString
+  -> BS.ByteString
+cipher (BI.PS kfp koff _) counter (BI.PS nfp noff _)
+       (BI.PS pfp poff plen) =
+  BI.unsafeCreate plen $ \dst ->
+    withForeignPtr kfp $ \kp0 ->
+    withForeignPtr nfp $ \np0 ->
+    withForeignPtr pfp $ \pp0 ->
+      c_chacha20_cipher (kp0 `plusPtr` koff)
+                        counter
+                        (np0 `plusPtr` noff)
+                        (pp0 `plusPtr` poff)
+                        dst
+                        (fi plen)
diff --git a/ppad-chacha.cabal b/ppad-chacha.cabal
@@ -1,7 +1,7 @@
 cabal-version:      3.0
 name:               ppad-chacha
 version:            0.2.1
-synopsis:           A pure ChaCha20 stream cipher
+synopsis:           A fast ChaCha20 stream cipher
 license:            MIT
 license-file:       LICENSE
 author:             Jared Tobin
@@ -11,13 +11,18 @@ build-type:         Simple
 tested-with:        GHC == 9.10.3
 extra-doc-files:    CHANGELOG
 description:
-  A pure ChaCha20 stream cipher and block function.
+  A fast ChaCha20 stream cipher and block function.
 
 flag llvm
   description: Use GHC's LLVM backend.
   default:     False
   manual:      True
 
+flag sanitize
+  description: Build with AddressSanitizer and UndefinedBehaviorSanitizer.
+  default:     False
+  manual:      True
+
 source-repository head
   type:     git
   location: git.ppad.tech/chacha.git
@@ -31,10 +36,18 @@ library
     ghc-options: -fllvm -O2
   exposed-modules:
       Crypto.Cipher.ChaCha20
+      Crypto.Cipher.ChaCha20.Arm
   build-depends:
       base >= 4.9 && < 5
     , bytestring >= 0.9 && < 0.13
     , primitive >= 0.8 && < 0.10
+  c-sources:
+      cbits/chacha20_arm.c
+  if arch(aarch64)
+    cc-options: -march=armv8-a
+  if flag(sanitize)
+    cc-options: -fsanitize=address,undefined -fno-omit-frame-pointer
+    ghc-options: -optl=-fsanitize=address,undefined
 
 test-suite chacha-tests
   type:                exitcode-stdio-1.0
@@ -44,6 +57,8 @@ test-suite chacha-tests
 
   ghc-options:
     -rtsopts -Wall -O2
+  if flag(sanitize)
+    ghc-options: -optl=-fsanitize=address,undefined
 
   build-depends:
       base
