commit cca0ddb767d963fa84289d3adbe878821fcb7e06
parent 7ec51583b56223898f63591c2dab718c3cd62b16
Author: Jared Tobin <jared@jtobin.io>
Date: Sat, 16 May 2026 13:25:34 -0230
Merge branch 'arm-neon'
Add aarch64 NEON acceleration for the ChaCha20 stream cipher and
block function, following the integration pattern used by
ppad-sha256 / ppad-base16: a small C kernel under 'cbits/' with
NEON intrinsics, a thin Haskell FFI module, runtime detection via
a CAF, dispatch in the top-level module, and a 'sanitize' cabal
flag that turns on ASan/UBSan instrumentation for testing. Falls
back to the existing pure Haskell scalar implementation on
non-aarch64 platforms.
Single package, single Hackage release.
Changes:
* New 'cbits/chacha20_arm.c' containing the NEON kernel and a
'chacha20_arm_available' probe. Intra-block parallelism: the
16-word ChaCha20 state matrix is held in four 128-bit NEON
registers v0..v3, one per row. A column quarter-round on
(s00,s04,s08,s12), (s01,s05,s09,s13), ... becomes one set of
element-wise vector operations on (v0,v1,v2,v3) — four
quarter-rounds in parallel per round. Diagonal rounds are
reached by left-rotating v1, v2, v3 by 1, 2, 3 lanes
('vextq_u32'), running another column round, then rotating
back. Rotations: ROTL-by-16 uses 'vrev32q_u16' (REV32 on 16-bit
elements); the others are written as shift-shift-or
pairs. Body gated by '#if defined(__aarch64__)' with stubs in
the '#else' branch.
* New 'lib/Crypto/Cipher/ChaCha20/Arm.hs' wraps the C kernel via
'foreign import ccall unsafe'. 'chacha20_arm_available :: Bool'
is a NOINLINE CAF. 'block' wraps the 64-byte keystream
generator; 'cipher' wraps the streaming XOR via
'BI.unsafeCreate plen'. Module is exposed (matching sha256's
Arm module exposure) but hidden from haddock.
* 'lib/Crypto/Cipher/ChaCha20.hs' now dispatches 'cipher' and
'block' to the NEON path when 'chacha20_arm_available' is True;
the existing scalar bodies stay in place as the fallback.
* 'ppad-chacha.cabal' adds 'c-sources', 'arch(aarch64)' cc-options
('-march=armv8-a'; NEON is baseline in armv8-a, so no extension
flag is needed), and a new 'sanitize' flag wiring
'-fsanitize=address,undefined -fno-omit-frame-pointer' into both
the C source and the test-suite link, mirroring 'ppad-sha256'.
* README performance section updated with the new criterion
figures and the standard 'where we avail of hardware
acceleration' note used in sha256.
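The row-wise formulation described above can be sketched in
portable scalar C. This is an illustration only, not the NEON
kernel: 'row_t', 'row_rotl', 'column_round', 'double_round_rows',
'double_round_ref' and 'double_rounds_agree' are hypothetical
names. Each register is modelled as a 4-lane array, and the
diagonal round is a column round applied after rotating rows 1,
2, 3 left by 1, 2, 3 lanes. The result is compared against the
textbook RFC 8439 quarter-round indexing:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for one 128-bit register: four
 * 32-bit lanes. */
typedef struct { uint32_t l[4]; } row_t;

static uint32_t rotl32(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* Models vextq_u32(r, r, n): lane i of the result is lane
 * (i + n) mod 4 of the input. */
static row_t row_rotl(row_t r, int n) {
    assert(n >= 0 && n < 4);
    row_t o;
    for (int i = 0; i < 4; i++) o.l[i] = r.l[(i + n) & 3];
    return o;
}

/* One quarter-round per lane, i.e. four column quarter-rounds. */
static void column_round(row_t *a, row_t *b, row_t *c, row_t *d) {
    for (int i = 0; i < 4; i++) {
        a->l[i] += b->l[i]; d->l[i] = rotl32(d->l[i] ^ a->l[i], 16);
        c->l[i] += d->l[i]; b->l[i] = rotl32(b->l[i] ^ c->l[i], 12);
        a->l[i] += b->l[i]; d->l[i] = rotl32(d->l[i] ^ a->l[i], 8);
        c->l[i] += d->l[i]; b->l[i] = rotl32(b->l[i] ^ c->l[i], 7);
    }
}

/* Column round, rotate rows, column round again (this is the
 * diagonal round), then rotate the rows back. */
void double_round_rows(uint32_t s[16]) {
    row_t a, b, c, d;
    memcpy(a.l, s + 0, 16);
    memcpy(b.l, s + 4, 16);
    memcpy(c.l, s + 8, 16);
    memcpy(d.l, s + 12, 16);
    column_round(&a, &b, &c, &d);
    b = row_rotl(b, 1); c = row_rotl(c, 2); d = row_rotl(d, 3);
    column_round(&a, &b, &c, &d);
    b = row_rotl(b, 3); c = row_rotl(c, 2); d = row_rotl(d, 1);
    memcpy(s + 0, a.l, 16);
    memcpy(s + 4, b.l, 16);
    memcpy(s + 8, c.l, 16);
    memcpy(s + 12, d.l, 16);
}

/* Textbook RFC 8439 quarter-round on explicit state indices. */
static void qr(uint32_t s[16], int a, int b, int c, int d) {
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 16);
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 12);
    s[a] += s[b]; s[d] = rotl32(s[d] ^ s[a], 8);
    s[c] += s[d]; s[b] = rotl32(s[b] ^ s[c], 7);
}

void double_round_ref(uint32_t s[16]) {
    qr(s, 0, 4, 8, 12); qr(s, 1, 5, 9, 13);
    qr(s, 2, 6, 10, 14); qr(s, 3, 7, 11, 15);
    qr(s, 0, 5, 10, 15); qr(s, 1, 6, 11, 12);
    qr(s, 2, 7, 8, 13); qr(s, 3, 4, 9, 14);
}

/* Returns 1 if both formulations agree on an arbitrary state. */
int double_rounds_agree(void) {
    uint32_t x[16], y[16];
    for (uint32_t i = 0; i < 16; i++) x[i] = 0x9e3779b9u * (i + 1);
    memcpy(y, x, sizeof x);
    double_round_rows(x);
    double_round_ref(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

The two agree because, after the row rotation, lane 0 holds
(s00, s05, s10, s15), lane 1 holds (s01, s06, s11, s12), and so
on, which are exactly the RFC 8439 diagonals.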
Verification (M4 MacBook Air, GHC 9.10.3 + LLVM 19, '-fllvm'):
* 'cabal test -fllvm': 8/8 tests pass through the ARM path,
including RFC 8439 A.2 vectors 1, 2, 3.
* 'cabal test -fllvm -fsanitize': same suite, instrumented with
ASan + UBSan over the C kernel. No diagnostics.
* 'cabal bench -fllvm':
cipher time (114B): 478 ns -> 267 ns (~1.8x)
* 'cabal bench -fllvm chacha-weigh': per-call allocation drops by
roughly 16x to 137x across the size range, because the scalar
path accumulates intermediate per-block ByteStrings through a
Builder while the NEON path writes into a single
'BI.unsafeCreate plen' buffer:
block: 4,968 B -> 312 B (~16x less)
cipher 64B input: 42,584 B -> 448 B (~95x less)
cipher 256B input: 61,568 B -> 448 B (~137x less)
cipher 1024B input: 121,376 B -> 4,072 B (~30x less)
cipher 4096B input: 406,168 B -> 4,568 B (~89x less)
The ~1.8x wall-time speedup on the 114B RFC test vector is a floor
figure: that input is only 2 blocks (one full 64-byte block plus
a 50-byte tail), so FFI overhead and per-call setup dominate.
Longer inputs amortise the FFI call across more SIMD work and
should recover proportionally more.
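The block arithmetic behind that estimate can be sketched as
below. 'chacha20_core_calls' is a hypothetical helper, not part
of this commit; it assumes the kernel's loop structure of one
core invocation per full 64-byte block plus one more for any
trailing partial block:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of 20-round core invocations needed
 * for 'len' input bytes, assuming one per full 64-byte block plus
 * one for a non-empty tail. */
size_t chacha20_core_calls(size_t len) {
    return len / 64 + (len % 64 != 0);
}

/* Usage: the 114-byte RFC 8439 input is 64 + 50 bytes. */
void example_114(void) {
    assert(chacha20_core_calls(114) == 2);
}
```

With only two core invocations per FFI call, the fixed per-call
cost is spread over very little SIMD work.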
Non-aarch64 builds are unchanged: the '#else'-branch C stubs
return availability = 0 and the dispatcher falls through to the
scalar path.
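A minimal C model of that probe-and-fallback pattern, for
illustration only. In this commit the dispatch itself lives in
Haskell; 'accel_available', 'xor_scalar', 'xor_accel' and
'xor_dispatch' are hypothetical names, and the "accelerated"
body here is just a placeholder that reuses the scalar loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time probe, mirroring the '#if defined(__aarch64__)'
 * gate: 1 on aarch64 builds, 0 everywhere else. */
int accel_available(void) {
#if defined(__aarch64__)
    return 1;
#else
    return 0;
#endif
}

/* Portable fallback, valid on every platform. */
void xor_scalar(const uint8_t *in, const uint8_t *ks,
                uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] ^ ks[i];
}

/* Stand-in for an accelerated routine. On non-aarch64 builds it
 * is an empty stub that the dispatcher never calls. */
void xor_accel(const uint8_t *in, const uint8_t *ks,
               uint8_t *out, size_t n) {
#if defined(__aarch64__)
    xor_scalar(in, ks, out, n); /* placeholder body for the sketch */
#else
    (void)in; (void)ks; (void)out; (void)n;
#endif
}

/* Dispatcher: take the accelerated path only when the probe says
 * it exists, otherwise fall through to the scalar path. */
void xor_dispatch(const uint8_t *in, const uint8_t *ks,
                  uint8_t *out, size_t n) {
    if (accel_available()) xor_accel(in, ks, out, n);
    else xor_scalar(in, ks, out, n);
}

/* Usage: the result is the same whichever path is taken. */
void example_dispatch(void) {
    const uint8_t in[3] = {1, 2, 3}, ks[3] = {3, 2, 1};
    uint8_t out[3] = {0};
    xor_dispatch(in, ks, out, 3);
    assert(out[0] == 2 && out[1] == 0 && out[2] == 2);
}
```

Keeping the stub symbols defined on every platform means the
foreign imports always link, while the probe keeps the stubs
from ever being called.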
Diffstat:
6 files changed, 316 insertions(+), 17 deletions(-)
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@

[](https://docs.ppad.tech/chacha)
-A pure Haskell implementation of the ChaCha20 stream cipher as specified
+A fast Haskell implementation of the ChaCha20 stream cipher as specified
by [RFC8439][8439].
## Usage
@@ -36,19 +36,17 @@ Haddocks (API documentation, etc.) are hosted at
## Performance
-The aim is best-in-class performance for pure, highly-auditable Haskell
-code.
-
-Current benchmark figures on the simple "sunscreen input" from RFC8439
-on an M4 Silicon MacBook Air look like (use `cabal bench` to run the
-benchmark suite):
+The aim is best-in-class performance. Current benchmark figures on the
+simple "sunscreen input" from RFC8439 on an M4 Silicon MacBook Air,
+where we avail of hardware acceleration via ARM NEON intrinsics, look
+like (use `cabal bench` to run the benchmark suite):
```
benchmarking ppad-chacha/cipher
- time 468.3 ns (467.9 ns .. 468.8 ns)
- 1.000 R² (1.000 R² .. 1.000 R²)
- mean 468.4 ns (468.0 ns .. 469.2 ns)
- std dev 2.041 ns (1.317 ns .. 3.539 ns)
+ time 267.1 ns (266.0 ns .. 268.2 ns)
+ 1.000 R² (0.999 R² .. 1.000 R²)
+ mean 267.1 ns (264.8 ns .. 270.3 ns)
+ std dev 8.576 ns (6.191 ns .. 11.56 ns)
```
You should compile with the 'llvm' flag for maximum performance.
diff --git a/cbits/chacha20_arm.c b/cbits/chacha20_arm.c
@@ -0,0 +1,199 @@
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#if defined(__aarch64__)
+
+#include <arm_neon.h>
+
+/*
+ * ChaCha20 NEON kernel using intra-block parallelism. The 16-word
+ * state matrix
+ *
+ * s00 s01 s02 s03
+ * s04 s05 s06 s07
+ * s08 s09 s10 s11
+ * s12 s13 s14 s15
+ *
+ * is held in four 128-bit NEON registers v0..v3, one per row. A
+ * column quarter-round on (s00, s04, s08, s12), (s01, s05, s09, s13),
+ * etc., becomes one set of element-wise vector operations on
+ * (v0, v1, v2, v3) — four quarter-rounds in parallel. Diagonal
+ * rounds are reached by left-rotating v1, v2, v3 by 1, 2, 3 lanes
+ * respectively with VEXT before the second quarter-round, then
+ * rotating back.
+ */
+
+/* 32-bit left rotations. Rotate-by-16 reduces to REV32.u16; the
+ * others compile to a shift-shift-or pair (the compiler folds rotate-
+ * by-8 to a TBL with a constant shuffle on some targets). */
+#define ROTL32_16(x) \
+ vreinterpretq_u32_u16(vrev32q_u16(vreinterpretq_u16_u32(x)))
+#define ROTL32_12(x) \
+ vorrq_u32(vshlq_n_u32((x), 12), vshrq_n_u32((x), 20))
+#define ROTL32_8(x) \
+ vorrq_u32(vshlq_n_u32((x), 8), vshrq_n_u32((x), 24))
+#define ROTL32_7(x) \
+ vorrq_u32(vshlq_n_u32((x), 7), vshrq_n_u32((x), 25))
+
+#define QUARTER(v0, v1, v2, v3) \
+ do { \
+ v0 = vaddq_u32(v0, v1); \
+ v3 = veorq_u32(v3, v0); v3 = ROTL32_16(v3); \
+ v2 = vaddq_u32(v2, v3); \
+ v1 = veorq_u32(v1, v2); v1 = ROTL32_12(v1); \
+ v0 = vaddq_u32(v0, v1); \
+ v3 = veorq_u32(v3, v0); v3 = ROTL32_8(v3); \
+ v2 = vaddq_u32(v2, v3); \
+ v1 = veorq_u32(v1, v2); v1 = ROTL32_7(v1); \
+ } while (0)
+
+/* 20-round ChaCha20 core: 10 iterations of (column + diagonal). */
+static inline void chacha20_core(uint32x4_t *v0, uint32x4_t *v1,
+ uint32x4_t *v2, uint32x4_t *v3,
+ uint32x4_t s0, uint32x4_t s1,
+ uint32x4_t s2, uint32x4_t s3) {
+ uint32x4_t a = s0, b = s1, c = s2, d = s3;
+ for (int i = 0; i < 10; i++) {
+ QUARTER(a, b, c, d);
+ /* shift rows: row 1 left 1, row 2 left 2, row 3 left 3. */
+ b = vextq_u32(b, b, 1);
+ c = vextq_u32(c, c, 2);
+ d = vextq_u32(d, d, 3);
+ QUARTER(a, b, c, d);
+ /* shift back. */
+ b = vextq_u32(b, b, 3);
+ c = vextq_u32(c, c, 2);
+ d = vextq_u32(d, d, 1);
+ }
+ *v0 = vaddq_u32(a, s0);
+ *v1 = vaddq_u32(b, s1);
+ *v2 = vaddq_u32(c, s2);
+ *v3 = vaddq_u32(d, s3);
+}
+
+static const uint32_t chacha_constants[4] = {
+ 0x61707865u, 0x3320646eu, 0x79622d32u, 0x6b206574u
+};
+
+/* Set up the constant rows of the state from key + nonce. s3
+ * (counter + nonce) varies per block and is built inside the loop. */
+static inline void chacha20_setup(const uint8_t key[32],
+ const uint8_t nonce[12],
+ uint32x4_t *s0, uint32x4_t *s1,
+ uint32x4_t *s2,
+ uint32_t *n0, uint32_t *n1,
+ uint32_t *n2) {
+ *s0 = vld1q_u32(chacha_constants);
+ *s1 = vreinterpretq_u32_u8(vld1q_u8(key));
+ *s2 = vreinterpretq_u32_u8(vld1q_u8(key + 16));
+ memcpy(n0, nonce + 0, 4);
+ memcpy(n1, nonce + 4, 4);
+ memcpy(n2, nonce + 8, 4);
+}
+
+/*
+ * Generate one 64-byte ChaCha20 keystream block at 'out'.
+ */
+void chacha20_block_arm(const uint8_t key[32], uint32_t counter,
+ const uint8_t nonce[12], uint8_t out[64]) {
+ uint32x4_t s0, s1, s2;
+ uint32_t n0, n1, n2;
+ chacha20_setup(key, nonce, &s0, &s1, &s2, &n0, &n1, &n2);
+
+ uint32_t s3_in[4] = { counter, n0, n1, n2 };
+ uint32x4_t s3 = vld1q_u32(s3_in);
+ uint32x4_t v0, v1, v2, v3;
+ chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+ vst1q_u8(out + 0, vreinterpretq_u8_u32(v0));
+ vst1q_u8(out + 16, vreinterpretq_u8_u32(v1));
+ vst1q_u8(out + 32, vreinterpretq_u8_u32(v2));
+ vst1q_u8(out + 48, vreinterpretq_u8_u32(v3));
+}
+
+/*
+ * Encrypt/decrypt 'inlen' bytes at 'in' into 'out' using ChaCha20
+ * with the given key, starting counter, and nonce. Stream cipher,
+ * so the same routine decrypts.
+ */
+void chacha20_cipher_arm(const uint8_t key[32], uint32_t counter,
+ const uint8_t nonce[12],
+ const uint8_t *in, uint8_t *out,
+ size_t inlen) {
+ uint32x4_t s0, s1, s2;
+ uint32_t n0, n1, n2;
+ chacha20_setup(key, nonce, &s0, &s1, &s2, &n0, &n1, &n2);
+
+ size_t pos = 0;
+ while (pos + 64 <= inlen) {
+ uint32_t s3_in[4] = { counter, n0, n1, n2 };
+ uint32x4_t s3 = vld1q_u32(s3_in);
+ uint32x4_t v0, v1, v2, v3;
+ chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+ uint8x16_t i0 = vld1q_u8(in + pos + 0);
+ uint8x16_t i1 = vld1q_u8(in + pos + 16);
+ uint8x16_t i2 = vld1q_u8(in + pos + 32);
+ uint8x16_t i3 = vld1q_u8(in + pos + 48);
+
+ vst1q_u8(out + pos + 0,
+ veorq_u8(i0, vreinterpretq_u8_u32(v0)));
+ vst1q_u8(out + pos + 16,
+ veorq_u8(i1, vreinterpretq_u8_u32(v1)));
+ vst1q_u8(out + pos + 32,
+ veorq_u8(i2, vreinterpretq_u8_u32(v2)));
+ vst1q_u8(out + pos + 48,
+ veorq_u8(i3, vreinterpretq_u8_u32(v3)));
+
+ pos += 64;
+ counter++;
+ }
+
+ /* trailing partial block (< 64 bytes) */
+ if (pos < inlen) {
+ uint32_t s3_in[4] = { counter, n0, n1, n2 };
+ uint32x4_t s3 = vld1q_u32(s3_in);
+ uint32x4_t v0, v1, v2, v3;
+ chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+ uint8_t block[64];
+ vst1q_u8(block + 0, vreinterpretq_u8_u32(v0));
+ vst1q_u8(block + 16, vreinterpretq_u8_u32(v1));
+ vst1q_u8(block + 32, vreinterpretq_u8_u32(v2));
+ vst1q_u8(block + 48, vreinterpretq_u8_u32(v3));
+
+ size_t remaining = inlen - pos;
+ for (size_t i = 0; i < remaining; i++) {
+ out[pos + i] = in[pos + i] ^ block[i];
+ }
+ }
+}
+
+int chacha20_arm_available(void) {
+ return 1;
+}
+
+#else
+
+/* stubs for non-aarch64 builds; never reached, because here
+ * 'chacha20_arm_available' returns 0 and dispatch stays scalar */
+
+void chacha20_block_arm(const uint8_t *key, uint32_t counter,
+ const uint8_t *nonce, uint8_t *out) {
+ (void)key; (void)counter; (void)nonce; (void)out;
+}
+
+void chacha20_cipher_arm(const uint8_t *key, uint32_t counter,
+ const uint8_t *nonce,
+ const uint8_t *in, uint8_t *out,
+ size_t inlen) {
+ (void)key; (void)counter; (void)nonce;
+ (void)in; (void)out; (void)inlen;
+}
+
+int chacha20_arm_available(void) {
+ return 0;
+}
+
+#endif
diff --git a/flake.nix b/flake.nix
@@ -1,5 +1,5 @@
{
- description = "A pure Haskell ChaCha stream cipher.";
+ description = "A fast Haskell ChaCha stream cipher.";
inputs = {
ppad-base16 = {
diff --git a/lib/Crypto/Cipher/ChaCha20.hs b/lib/Crypto/Cipher/ChaCha20.hs
@@ -10,7 +10,7 @@
-- License: MIT
-- Maintainer: Jared Tobin <jared@ppad.tech>
--
--- A pure ChaCha20 implementation, as specified by
+-- A fast ChaCha20 implementation, as specified by
-- [RFC 8439](https://datatracker.ietf.org/doc/html/rfc8439).
module Crypto.Cipher.ChaCha20 (
@@ -34,6 +34,7 @@ module Crypto.Cipher.ChaCha20 (
) where
import Control.Monad.ST
+import qualified Crypto.Cipher.ChaCha20.Arm as Arm
import qualified Data.Bits as B
import Data.Bits ((.|.), (.<<.), (.^.))
import qualified Data.ByteString as BS
@@ -289,6 +290,8 @@ block
block key@(BI.PS _ _ kl) counter nonce@(BI.PS _ _ nl)
| kl /= 32 = Left InvalidKey
| nl /= 12 = Left InvalidNonce
+ | Arm.chacha20_arm_available =
+ Right (Arm.block key counter nonce)
| otherwise = pure $ runST $ do
let k = _parse_key key
n = _parse_nonce nonce
@@ -341,8 +344,10 @@ cipher
-> BS.ByteString -- ^ arbitrary-length plaintext
-> Either Error BS.ByteString -- ^ ciphertext
cipher raw_key@(BI.PS _ _ kl) counter raw_nonce@(BI.PS _ _ nl) plaintext
- | kl /= 32 = Left InvalidKey
- | nl /= 12 = Left InvalidNonce
+ | kl /= 32 = Left InvalidKey
+ | nl /= 12 = Left InvalidNonce
+ | Arm.chacha20_arm_available =
+ Right (Arm.cipher raw_key counter raw_nonce plaintext)
| otherwise = pure $ runST $ do
let key = _parse_key raw_key
non = _parse_nonce raw_nonce
diff --git a/lib/Crypto/Cipher/ChaCha20/Arm.hs b/lib/Crypto/Cipher/ChaCha20/Arm.hs
@@ -0,0 +1,82 @@
+{-# OPTIONS_HADDOCK hide #-}
+{-# LANGUAGE BangPatterns #-}
+
+-- |
+-- Module: Crypto.Cipher.ChaCha20.Arm
+-- Copyright: (c) 2025 Jared Tobin
+-- License: MIT
+-- Maintainer: Jared Tobin <jared@ppad.tech>
+--
+-- ARM NEON support for the ChaCha20 stream cipher.
+
+module Crypto.Cipher.ChaCha20.Arm (
+ chacha20_arm_available
+ , block
+ , cipher
+ ) where
+
+import qualified Data.ByteString as BS
+import qualified Data.ByteString.Internal as BI
+import Data.Word (Word8, Word32)
+import Foreign.C.Types (CInt(..), CSize(..))
+import Foreign.ForeignPtr (withForeignPtr)
+import Foreign.Ptr (Ptr, plusPtr)
+import System.IO.Unsafe (unsafeDupablePerformIO)
+
+-- ffi ------------------------------------------------------------------------
+
+foreign import ccall unsafe "chacha20_block_arm"
+ c_chacha20_block
+ :: Ptr Word8 -> Word32 -> Ptr Word8 -> Ptr Word8 -> IO ()
+
+foreign import ccall unsafe "chacha20_cipher_arm"
+ c_chacha20_cipher
+ :: Ptr Word8 -> Word32 -> Ptr Word8
+ -> Ptr Word8 -> Ptr Word8 -> CSize -> IO ()
+
+foreign import ccall unsafe "chacha20_arm_available"
+ c_chacha20_arm_available :: IO CInt
+
+-- utilities ------------------------------------------------------------------
+
+fi :: (Integral a, Num b) => a -> b
+fi = fromIntegral
+{-# INLINE fi #-}
+
+-- api ------------------------------------------------------------------------
+
+-- | Are ARM NEON extensions available?
+chacha20_arm_available :: Bool
+chacha20_arm_available =
+ unsafeDupablePerformIO c_chacha20_arm_available /= 0
+{-# NOINLINE chacha20_arm_available #-}
+
+-- | One 64-byte ChaCha20 keystream block for the given (already-
+-- validated) key, counter, and nonce.
+block :: BS.ByteString -> Word32 -> BS.ByteString -> BS.ByteString
+block (BI.PS kfp koff _) counter (BI.PS nfp noff _) =
+ BI.unsafeCreate 64 $ \dst ->
+ withForeignPtr kfp $ \kp0 ->
+ withForeignPtr nfp $ \np0 ->
+ c_chacha20_block (kp0 `plusPtr` koff)
+ counter
+ (np0 `plusPtr` noff)
+ dst
+
+-- | XOR the plaintext with the ChaCha20 keystream derived from the
+-- given (already-validated) key, counter, and nonce.
+cipher
+ :: BS.ByteString -> Word32 -> BS.ByteString -> BS.ByteString
+ -> BS.ByteString
+cipher (BI.PS kfp koff _) counter (BI.PS nfp noff _)
+ (BI.PS pfp poff plen) =
+ BI.unsafeCreate plen $ \dst ->
+ withForeignPtr kfp $ \kp0 ->
+ withForeignPtr nfp $ \np0 ->
+ withForeignPtr pfp $ \pp0 ->
+ c_chacha20_cipher (kp0 `plusPtr` koff)
+ counter
+ (np0 `plusPtr` noff)
+ (pp0 `plusPtr` poff)
+ dst
+ (fi plen)
diff --git a/ppad-chacha.cabal b/ppad-chacha.cabal
@@ -1,7 +1,7 @@
cabal-version: 3.0
name: ppad-chacha
version: 0.2.1
-synopsis: A pure ChaCha20 stream cipher
+synopsis: A fast ChaCha20 stream cipher
license: MIT
license-file: LICENSE
author: Jared Tobin
@@ -11,13 +11,18 @@ build-type: Simple
tested-with: GHC == 9.10.3
extra-doc-files: CHANGELOG
description:
- A pure ChaCha20 stream cipher and block function.
+ A fast ChaCha20 stream cipher and block function.
flag llvm
description: Use GHC's LLVM backend.
default: False
manual: True
+flag sanitize
+ description: Build with AddressSanitizer and UndefinedBehaviorSanitizer.
+ default: False
+ manual: True
+
source-repository head
type: git
location: git.ppad.tech/chacha.git
@@ -31,10 +36,18 @@ library
ghc-options: -fllvm -O2
exposed-modules:
Crypto.Cipher.ChaCha20
+ Crypto.Cipher.ChaCha20.Arm
build-depends:
base >= 4.9 && < 5
, bytestring >= 0.9 && < 0.13
, primitive >= 0.8 && < 0.10
+ c-sources:
+ cbits/chacha20_arm.c
+ if arch(aarch64)
+ cc-options: -march=armv8-a
+ if flag(sanitize)
+ cc-options: -fsanitize=address,undefined -fno-omit-frame-pointer
+ ghc-options: -optl=-fsanitize=address,undefined
test-suite chacha-tests
type: exitcode-stdio-1.0
@@ -44,6 +57,8 @@ test-suite chacha-tests
ghc-options:
-rtsopts -Wall -O2
+ if flag(sanitize)
+ ghc-options: -optl=-fsanitize=address,undefined
build-depends:
base