chacha

The ChaCha20 stream cipher (docs.ppad.tech/chacha).
git clone git://git.ppad.tech/chacha.git

commit cca0ddb767d963fa84289d3adbe878821fcb7e06
parent 7ec51583b56223898f63591c2dab718c3cd62b16
Author: Jared Tobin <jared@jtobin.io>
Date:   Sat, 16 May 2026 13:25:34 -0230

Merge branch 'arm-neon'

Add aarch64 NEON acceleration for the ChaCha20 stream cipher and
block function, following the integration pattern used by
ppad-sha256 / ppad-base16: a small C kernel under 'cbits/' with
NEON intrinsics, a thin Haskell FFI module, runtime detection via
a CAF, dispatch in the top-level module, and a 'sanitize' cabal
flag that turns on ASan/UBSan instrumentation for testing.  Falls
back to the existing pure Haskell scalar implementation on
non-aarch64 platforms.

Single package, single Hackage release.

Changes:

* New 'cbits/chacha20_arm.c' containing the NEON kernel and a
  'chacha20_arm_available' probe.  Intra-block parallelism: the
  16-word ChaCha20 state matrix is held in four 128-bit NEON
  registers v0..v3, one per row.  A column quarter-round on
  (s00,s04,s08,s12), (s01,s05,s09,s13), ...  becomes one set of
  element-wise vector operations on (v0,v1,v2,v3) — four
  quarter-rounds in parallel per round.  Diagonal rounds are
  reached by left-rotating v1, v2, v3 by 1, 2, 3 lanes (EXT, via
  'vextq_u32'), running another column round, then rotating back.
  Rotations: ROTL-by-16 uses REV32 on 16-bit lanes ('vrev32q_u16');
  the others compile to shift-shift-or pairs.  Body gated by
  '#if defined(__aarch64__)' with stubs in
  the '#else' branch.

* New 'lib/Crypto/Cipher/ChaCha20/Arm.hs' wraps the C kernel via
  'foreign import ccall unsafe'.  'chacha20_arm_available :: Bool'
  is a NOINLINE CAF.  'block' wraps the 64-byte keystream
  generator; 'cipher' wraps the streaming XOR via
  'BI.unsafeCreate plen'.  Module is exposed (matching sha256's
  Arm module exposure) but hidden from haddock.

* 'lib/Crypto/Cipher/ChaCha20.hs' now dispatches 'cipher' and
  'block' to the NEON path when 'chacha20_arm_available' is True;
  the existing scalar bodies stay in place as the fallback.

* 'ppad-chacha.cabal' adds 'c-sources', 'arch(aarch64)' cc-options
  ('-march=armv8-a' — NEON is baseline of armv8 so no extension
  flag is needed), and a new 'sanitize' flag wiring '-fsanitize=
  address,undefined -fno-omit-frame-pointer' into both the C
  source and the test-suite link, mirroring 'ppad-sha256'.

* README performance section updated with the new criterion
  figures and the standard 'where we avail of hardware
  acceleration' note used in sha256.
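
The row-per-register layout described in the first bullet can be
modelled without NEON.  The sketch below is a portable scalar C model,
not the kernel from this commit: each row is a plain 4-element array,
'rotate_lanes' stands in for the lane rotation, and every function
name is hypothetical.  It checks that the rotate, column-round,
rotate-back sequence yields the same state as the indexed double round
of RFC 8439, and that rotate-by-16 equals a swap of 16-bit halves.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar 32-bit left rotation, valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

/* Rotate-by-16 written as a swap of the two 16-bit halves, which is
 * the effect of reversing halfwords within each 32-bit word. */
static uint32_t rotl16_by_swap(uint32_t x) {
    return ((x & 0xffffu) << 16) | (x >> 16);
}

/* Reference quarter-round on four state indices (RFC 8439, 2.1). */
static void quarter_ref(uint32_t s[16], int a, int b, int c, int d) {
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 16);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 12);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl32(s[d], 8);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl32(s[b], 7);
}

/* Reference double round: four column, then four diagonal rounds. */
static void double_round_ref(uint32_t s[16]) {
    quarter_ref(s, 0, 4,  8, 12); quarter_ref(s, 1, 5,  9, 13);
    quarter_ref(s, 2, 6, 10, 14); quarter_ref(s, 3, 7, 11, 15);
    quarter_ref(s, 0, 5, 10, 15); quarter_ref(s, 1, 6, 11, 12);
    quarter_ref(s, 2, 7,  8, 13); quarter_ref(s, 3, 4,  9, 14);
}

/* Rotate a 4-lane row left by n lanes: new[i] = old[(i + n) % 4]. */
static void rotate_lanes(uint32_t row[4], int n) {
    uint32_t tmp[4];
    for (int i = 0; i < 4; i++) tmp[i] = row[(i + n) % 4];
    memcpy(row, tmp, sizeof tmp);
}

/* Lane-wise quarter-round over four rows.  Lanes are independent, so
 * lane i performs one complete quarter-round. */
static void quarter_rows(uint32_t a[4], uint32_t b[4],
                         uint32_t c[4], uint32_t d[4]) {
    for (int i = 0; i < 4; i++) {
        a[i] += b[i]; d[i] ^= a[i]; d[i] = rotl16_by_swap(d[i]);
        c[i] += d[i]; b[i] ^= c[i]; b[i] = rotl32(b[i], 12);
        a[i] += b[i]; d[i] ^= a[i]; d[i] = rotl32(d[i], 8);
        c[i] += d[i]; b[i] ^= c[i]; b[i] = rotl32(b[i], 7);
    }
}

/* Double round in row form: column round, rotate rows 1..3 left by
 * 1, 2, 3 lanes, column round again, rotate back. */
static void double_round_rows(uint32_t s[16]) {
    uint32_t *a = s, *b = s + 4, *c = s + 8, *d = s + 12;
    quarter_rows(a, b, c, d);
    rotate_lanes(b, 1); rotate_lanes(c, 2); rotate_lanes(d, 3);
    quarter_rows(a, b, c, d);
    rotate_lanes(b, 3); rotate_lanes(c, 2); rotate_lanes(d, 1);
}

/* Returns 1 if both formulations agree after 10 double rounds on an
 * arbitrary (non-secret, made-up) starting state. */
int rows_match_reference(void) {
    uint32_t x[16], y[16];
    assert(rotl16_by_swap(0x12345678u) == rotl32(0x12345678u, 16));
    for (uint32_t i = 0; i < 16; i++) x[i] = 0x9e3779b9u * (i + 1);
    memcpy(y, x, sizeof x);
    for (int r = 0; r < 10; r++) {
        double_round_ref(x);
        double_round_rows(y);
    }
    return memcmp(x, y, sizeof x) == 0;
}
```

After the left rotations, lane 0 of the four rows holds (s00, s05,
s10, s15), which is the first diagonal, so the second lane-wise round
is exactly the diagonal round.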

Verification (M4 MacBook Air, GHC 9.10.3 + LLVM 19, '-fllvm'):

* 'cabal test -fllvm'           — 8/8 tests pass through the ARM
                                   path, including RFC 8439 A.2
                                   vectors 1, 2, 3.
* 'cabal test -fllvm -fsanitize' — same suite, instrumented with
                                   ASan + UBSan over the C kernel.
                                   No diagnostics.
* 'cabal bench -fllvm':
    cipher time (114B): 478 ns -> 267 ns   (~1.8x)
* 'cabal bench -fllvm chacha-weigh': per-call allocation drops by
  roughly 16x to 137x across the size range.  The scalar path was
  accumulating intermediate per-block ByteStrings through a Builder,
  while the NEON path writes into a single 'BI.unsafeCreate plen'
  buffer:
    block:                4,968 B ->   312 B  (~16x less)
    cipher 64B input:    42,584 B ->   448 B  (~95x less)
    cipher 256B input:   61,568 B ->   448 B  (~137x less)
    cipher 1024B input: 121,376 B -> 4,072 B  (~30x less)
    cipher 4096B input: 406,168 B -> 4,568 B  (~89x less)

The ~1.8x wall-time speedup on the 114B RFC test vector is a floor
figure: that input is one full 64-byte block plus a 50-byte tail, so
FFI overhead and per-call setup dominate.  Longer inputs amortise the
FFI call across more SIMD work and should recover proportionally more.
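
To make the amortisation argument concrete, here is a small sketch of
how many 20-round core invocations a single call performs, assuming
the loop shape in 'chacha20_cipher_arm' (one core per full 64-byte
block, plus one for a non-empty tail).  The helper names are
hypothetical and not part of the package.

```c
#include <stddef.h>

/* Number of core invocations for an input of inlen bytes: one per
 * full 64-byte block, plus one if a partial block remains. */
size_t core_invocations(size_t inlen) {
    return inlen / 64 + (inlen % 64 != 0 ? 1 : 0);
}

/* Size in bytes of the trailing partial block (0 if none). */
size_t tail_bytes(size_t inlen) {
    return inlen % 64;
}
```

For the 114-byte vector this gives 2 core invocations per FFI call
with a 50-byte tail; a 4096-byte input gives 64 invocations for the
same single call, so fixed per-call cost is spread over 32 times as
much SIMD work.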

Non-aarch64 builds are unchanged: the '#else'-branch C stubs
return availability = 0 and the dispatcher falls through to the
scalar path.
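
The gate-plus-probe pattern can be sketched on a toy operation.  This
is an illustration under assumptions, not the package's code: a byte
XOR stands in for the cipher, and 'xor_accel', 'xor_scalar',
'accel_available' and 'xor_dispatch' are hypothetical names.  The
accelerated body and a probe returning 1 exist only under
'__aarch64__'; elsewhere a stub keeps the symbol defined and the probe
returns 0, so the dispatcher takes the scalar path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar fallback, compiled on every platform. */
static void xor_scalar(const uint8_t *a, const uint8_t *b,
                       uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[i] ^ b[i];
}

#if defined(__aarch64__)

#include <arm_neon.h>

/* Accelerated path: 16 bytes per step, scalar loop for the tail. */
static void xor_accel(const uint8_t *a, const uint8_t *b,
                      uint8_t *out, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        vst1q_u8(out + i, veorq_u8(vld1q_u8(a + i), vld1q_u8(b + i)));
    for (; i < n; i++) out[i] = a[i] ^ b[i];
}

static int accel_available(void) { return 1; }

#else

/* Stub so the symbol exists everywhere; not called, because the
 * probe below returns 0 on this platform. */
static void xor_accel(const uint8_t *a, const uint8_t *b,
                      uint8_t *out, size_t n) {
    (void)a; (void)b; (void)out; (void)n;
}

static int accel_available(void) { return 0; }

#endif

/* Dispatcher: accelerated path when the probe says so, else scalar. */
void xor_dispatch(const uint8_t *a, const uint8_t *b,
                  uint8_t *out, size_t n) {
    if (accel_available()) xor_accel(a, b, out, n);
    else                   xor_scalar(a, b, out, n);
}

/* Returns 1 if the dispatched result equals the scalar result on a
 * 37-byte input (two full 16-byte steps plus a 5-byte tail). */
int dispatch_matches_scalar(void) {
    uint8_t a[37], b[37], got[37], want[37];
    for (size_t i = 0; i < sizeof a; i++) {
        a[i] = (uint8_t)(i * 7u + 1u);
        b[i] = (uint8_t)(i * 13u + 5u);
    }
    xor_dispatch(a, b, got, sizeof a);
    xor_scalar(a, b, want, sizeof a);
    return memcmp(got, want, sizeof got) == 0;
}
```

Keeping the stub in the '#else' branch means the foreign import on the
Haskell side always links, whichever architecture is being built.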

Diffstat:
M README.md                         |  20 +++++++++-----------
A cbits/chacha20_arm.c              | 199 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M flake.nix                         |   2 +-
M lib/Crypto/Cipher/ChaCha20.hs     |  11 ++++++++---
A lib/Crypto/Cipher/ChaCha20/Arm.hs |  82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M ppad-chacha.cabal                 |  19 +++++++++++++++++--
6 files changed, 316 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 ![](https://img.shields.io/badge/license-MIT-brightgreen)
 [![](https://img.shields.io/badge/haddock-chacha-lightblue)](https://docs.ppad.tech/chacha)
 
-A pure Haskell implementation of the ChaCha20 stream cipher as specified
+A fast Haskell implementation of the ChaCha20 stream cipher as specified
 by [RFC8439][8439].
 
 ## Usage
@@ -36,19 +36,17 @@ Haddocks (API documentation, etc.) are hosted at
 
 ## Performance
 
-The aim is best-in-class performance for pure, highly-auditable Haskell
-code.
-
-Current benchmark figures on the simple "sunscreen input" from RFC8439
-on an M4 Silicon MacBook Air look like (use `cabal bench` to run the
-benchmark suite):
+The aim is best-in-class performance. Current benchmark figures on the
+simple "sunscreen input" from RFC8439 on an M4 Silicon MacBook Air,
+where we avail of hardware acceleration via ARM NEON intrinsics, look
+like (use `cabal bench` to run the benchmark suite):
 
 ```
   benchmarking ppad-chacha/cipher
-  time                 468.3 ns   (467.9 ns .. 468.8 ns)
-                       1.000 R²   (1.000 R² .. 1.000 R²)
-  mean                 468.4 ns   (468.0 ns .. 469.2 ns)
-  std dev              2.041 ns   (1.317 ns .. 3.539 ns)
+  time                 267.1 ns   (266.0 ns .. 268.2 ns)
+                       1.000 R²   (0.999 R² .. 1.000 R²)
+  mean                 267.1 ns   (264.8 ns .. 270.3 ns)
+  std dev              8.576 ns   (6.191 ns .. 11.56 ns)
 ```
 
 You should compile with the 'llvm' flag for maximum performance.
diff --git a/cbits/chacha20_arm.c b/cbits/chacha20_arm.c
@@ -0,0 +1,199 @@
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#if defined(__aarch64__)
+
+#include <arm_neon.h>
+
+/*
+ * ChaCha20 NEON kernel using intra-block parallelism. The 16-word
+ * state matrix
+ *
+ *   s00 s01 s02 s03
+ *   s04 s05 s06 s07
+ *   s08 s09 s10 s11
+ *   s12 s13 s14 s15
+ *
+ * is held in four 128-bit NEON registers v0..v3, one per row. A
+ * column quarter-round on (s00, s04, s08, s12), (s01, s05, s09, s13),
+ * etc., becomes one set of element-wise vector operations on
+ * (v0, v1, v2, v3) — four quarter-rounds in parallel. Diagonal
+ * rounds are reached by left-rotating v1, v2, v3 by 1, 2, 3 lanes
+ * respectively with VEXT before the second quarter-round, then
+ * rotating back.
+ */
+
+/* 32-bit left rotations. Rotate-by-16 reduces to REV32.u16; the
+ * others compile to a shift-shift-or pair (the compiler folds rotate-
+ * by-8 to a TBL with a constant shuffle on some targets). */
+#define ROTL32_16(x) \
+    vreinterpretq_u32_u16(vrev32q_u16(vreinterpretq_u16_u32(x)))
+#define ROTL32_12(x) \
+    vorrq_u32(vshlq_n_u32((x), 12), vshrq_n_u32((x), 20))
+#define ROTL32_8(x) \
+    vorrq_u32(vshlq_n_u32((x), 8), vshrq_n_u32((x), 24))
+#define ROTL32_7(x) \
+    vorrq_u32(vshlq_n_u32((x), 7), vshrq_n_u32((x), 25))
+
+#define QUARTER(v0, v1, v2, v3) \
+    do { \
+        v0 = vaddq_u32(v0, v1); \
+        v3 = veorq_u32(v3, v0); v3 = ROTL32_16(v3); \
+        v2 = vaddq_u32(v2, v3); \
+        v1 = veorq_u32(v1, v2); v1 = ROTL32_12(v1); \
+        v0 = vaddq_u32(v0, v1); \
+        v3 = veorq_u32(v3, v0); v3 = ROTL32_8(v3); \
+        v2 = vaddq_u32(v2, v3); \
+        v1 = veorq_u32(v1, v2); v1 = ROTL32_7(v1); \
+    } while (0)
+
+/* 20-round ChaCha20 core: 10 iterations of (column + diagonal). */
+static inline void chacha20_core(uint32x4_t *v0, uint32x4_t *v1,
+                                 uint32x4_t *v2, uint32x4_t *v3,
+                                 uint32x4_t s0, uint32x4_t s1,
+                                 uint32x4_t s2, uint32x4_t s3) {
+    uint32x4_t a = s0, b = s1, c = s2, d = s3;
+    for (int i = 0; i < 10; i++) {
+        QUARTER(a, b, c, d);
+        /* shift rows: row 1 left 1, row 2 left 2, row 3 left 3. */
+        b = vextq_u32(b, b, 1);
+        c = vextq_u32(c, c, 2);
+        d = vextq_u32(d, d, 3);
+        QUARTER(a, b, c, d);
+        /* shift back. */
+        b = vextq_u32(b, b, 3);
+        c = vextq_u32(c, c, 2);
+        d = vextq_u32(d, d, 1);
+    }
+    *v0 = vaddq_u32(a, s0);
+    *v1 = vaddq_u32(b, s1);
+    *v2 = vaddq_u32(c, s2);
+    *v3 = vaddq_u32(d, s3);
+}
+
+static const uint32_t chacha_constants[4] = {
+    0x61707865u, 0x3320646eu, 0x79622d32u, 0x6b206574u
+};
+
+/* Set up the constant rows of the state from key + nonce. s3
+ * (counter + nonce) varies per block and is built inside the loop. */
+static inline void chacha20_setup(const uint8_t key[32],
+                                  const uint8_t nonce[12],
+                                  uint32x4_t *s0, uint32x4_t *s1,
+                                  uint32x4_t *s2,
+                                  uint32_t *n0, uint32_t *n1,
+                                  uint32_t *n2) {
+    *s0 = vld1q_u32(chacha_constants);
+    *s1 = vreinterpretq_u32_u8(vld1q_u8(key));
+    *s2 = vreinterpretq_u32_u8(vld1q_u8(key + 16));
+    memcpy(n0, nonce + 0, 4);
+    memcpy(n1, nonce + 4, 4);
+    memcpy(n2, nonce + 8, 4);
+}
+
+/*
+ * Generate one 64-byte ChaCha20 keystream block at 'out'.
+ */
+void chacha20_block_arm(const uint8_t key[32], uint32_t counter,
+                        const uint8_t nonce[12], uint8_t out[64]) {
+    uint32x4_t s0, s1, s2;
+    uint32_t n0, n1, n2;
+    chacha20_setup(key, nonce, &s0, &s1, &s2, &n0, &n1, &n2);
+
+    uint32_t s3_in[4] = { counter, n0, n1, n2 };
+    uint32x4_t s3 = vld1q_u32(s3_in);
+    uint32x4_t v0, v1, v2, v3;
+    chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+    vst1q_u8(out + 0, vreinterpretq_u8_u32(v0));
+    vst1q_u8(out + 16, vreinterpretq_u8_u32(v1));
+    vst1q_u8(out + 32, vreinterpretq_u8_u32(v2));
+    vst1q_u8(out + 48, vreinterpretq_u8_u32(v3));
+}
+
+/*
+ * Encrypt/decrypt 'inlen' bytes at 'in' into 'out' using ChaCha20
+ * with the given key, starting counter, and nonce. Stream cipher,
+ * so the same routine decrypts.
+ */
+void chacha20_cipher_arm(const uint8_t key[32], uint32_t counter,
+                         const uint8_t nonce[12],
+                         const uint8_t *in, uint8_t *out,
+                         size_t inlen) {
+    uint32x4_t s0, s1, s2;
+    uint32_t n0, n1, n2;
+    chacha20_setup(key, nonce, &s0, &s1, &s2, &n0, &n1, &n2);
+
+    size_t pos = 0;
+    while (pos + 64 <= inlen) {
+        uint32_t s3_in[4] = { counter, n0, n1, n2 };
+        uint32x4_t s3 = vld1q_u32(s3_in);
+        uint32x4_t v0, v1, v2, v3;
+        chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+        uint8x16_t i0 = vld1q_u8(in + pos + 0);
+        uint8x16_t i1 = vld1q_u8(in + pos + 16);
+        uint8x16_t i2 = vld1q_u8(in + pos + 32);
+        uint8x16_t i3 = vld1q_u8(in + pos + 48);
+
+        vst1q_u8(out + pos + 0,
+                 veorq_u8(i0, vreinterpretq_u8_u32(v0)));
+        vst1q_u8(out + pos + 16,
+                 veorq_u8(i1, vreinterpretq_u8_u32(v1)));
+        vst1q_u8(out + pos + 32,
+                 veorq_u8(i2, vreinterpretq_u8_u32(v2)));
+        vst1q_u8(out + pos + 48,
+                 veorq_u8(i3, vreinterpretq_u8_u32(v3)));
+
+        pos += 64;
+        counter++;
+    }
+
+    /* trailing partial block (< 64 bytes) */
+    if (pos < inlen) {
+        uint32_t s3_in[4] = { counter, n0, n1, n2 };
+        uint32x4_t s3 = vld1q_u32(s3_in);
+        uint32x4_t v0, v1, v2, v3;
+        chacha20_core(&v0, &v1, &v2, &v3, s0, s1, s2, s3);
+
+        uint8_t block[64];
+        vst1q_u8(block + 0, vreinterpretq_u8_u32(v0));
+        vst1q_u8(block + 16, vreinterpretq_u8_u32(v1));
+        vst1q_u8(block + 32, vreinterpretq_u8_u32(v2));
+        vst1q_u8(block + 48, vreinterpretq_u8_u32(v3));
+
+        size_t remaining = inlen - pos;
+        for (size_t i = 0; i < remaining; i++) {
+            out[pos + i] = in[pos + i] ^ block[i];
+        }
+    }
+}
+
+int chacha20_arm_available(void) {
+    return 1;
+}
+
+#else
+
+/* stubs for non-aarch64 builds; never reached because dispatch is
+ * gated on 'chacha20_arm_available' returning 0 */
+
+void chacha20_block_arm(const uint8_t *key, uint32_t counter,
+                        const uint8_t *nonce, uint8_t *out) {
+    (void)key; (void)counter; (void)nonce; (void)out;
+}
+
+void chacha20_cipher_arm(const uint8_t *key, uint32_t counter,
+                         const uint8_t *nonce,
+                         const uint8_t *in, uint8_t *out,
+                         size_t inlen) {
+    (void)key; (void)counter; (void)nonce;
+    (void)in; (void)out; (void)inlen;
+}
+
+int chacha20_arm_available(void) {
+    return 0;
+}
+
+#endif
diff --git a/flake.nix b/flake.nix
@@ -1,5 +1,5 @@
 {
-  description = "A pure Haskell ChaCha stream cipher.";
+  description = "A fast Haskell ChaCha stream cipher.";
 
   inputs = {
     ppad-base16 = {
diff --git a/lib/Crypto/Cipher/ChaCha20.hs b/lib/Crypto/Cipher/ChaCha20.hs
@@ -10,7 +10,7 @@
 -- License: MIT
 -- Maintainer: Jared Tobin <jared@ppad.tech>
 --
--- A pure ChaCha20 implementation, as specified by
+-- A fast ChaCha20 implementation, as specified by
 -- [RFC 8439](https://datatracker.ietf.org/doc/html/rfc8439).
 
 module Crypto.Cipher.ChaCha20 (
@@ -34,6 +34,7 @@ module Crypto.Cipher.ChaCha20 (
   ) where
 
 import Control.Monad.ST
+import qualified Crypto.Cipher.ChaCha20.Arm as Arm
 import qualified Data.Bits as B
 import Data.Bits ((.|.), (.<<.), (.^.))
 import qualified Data.ByteString as BS
@@ -289,6 +290,8 @@ block
 block key@(BI.PS _ _ kl) counter nonce@(BI.PS _ _ nl)
   | kl /= 32 = Left InvalidKey
   | nl /= 12 = Left InvalidNonce
+  | Arm.chacha20_arm_available =
+      Right (Arm.block key counter nonce)
   | otherwise = pure $ runST $ do
       let k = _parse_key key
           n = _parse_nonce nonce
@@ -341,8 +344,10 @@ cipher
   -> BS.ByteString -- ^ arbitrary-length plaintext
   -> Either Error BS.ByteString -- ^ ciphertext
 cipher raw_key@(BI.PS _ _ kl) counter raw_nonce@(BI.PS _ _ nl) plaintext
-  | kl /= 32 = Left InvalidKey
-  | nl /= 12 = Left InvalidNonce
+  | kl /= 32 = Left InvalidKey
+  | nl /= 12 = Left InvalidNonce
+  | Arm.chacha20_arm_available =
+      Right (Arm.cipher raw_key counter raw_nonce plaintext)
   | otherwise = pure $ runST $ do
       let key = _parse_key raw_key
           non = _parse_nonce raw_nonce
diff --git a/lib/Crypto/Cipher/ChaCha20/Arm.hs b/lib/Crypto/Cipher/ChaCha20/Arm.hs
@@ -0,0 +1,82 @@
+{-# OPTIONS_HADDOCK hide #-}
+{-# LANGUAGE BangPatterns #-}
+
+-- |
+-- Module: Crypto.Cipher.ChaCha20.Arm
+-- Copyright: (c) 2025 Jared Tobin
+-- License: MIT
+-- Maintainer: Jared Tobin <jared@ppad.tech>
+--
+-- ARM NEON support for the ChaCha20 stream cipher.
+
+module Crypto.Cipher.ChaCha20.Arm (
+    chacha20_arm_available
+  , block
+  , cipher
+  ) where
+
+import qualified Data.ByteString as BS
+import qualified Data.ByteString.Internal as BI
+import Data.Word (Word8, Word32)
+import Foreign.C.Types (CInt(..), CSize(..))
+import Foreign.ForeignPtr (withForeignPtr)
+import Foreign.Ptr (Ptr, plusPtr)
+import System.IO.Unsafe (unsafeDupablePerformIO)
+
+-- ffi ------------------------------------------------------------------------
+
+foreign import ccall unsafe "chacha20_block_arm"
+  c_chacha20_block
+    :: Ptr Word8 -> Word32 -> Ptr Word8 -> Ptr Word8 -> IO ()
+
+foreign import ccall unsafe "chacha20_cipher_arm"
+  c_chacha20_cipher
+    :: Ptr Word8 -> Word32 -> Ptr Word8
+    -> Ptr Word8 -> Ptr Word8 -> CSize -> IO ()
+
+foreign import ccall unsafe "chacha20_arm_available"
+  c_chacha20_arm_available :: IO CInt
+
+-- utilities ------------------------------------------------------------------
+
+fi :: (Integral a, Num b) => a -> b
+fi = fromIntegral
+{-# INLINE fi #-}
+
+-- api ------------------------------------------------------------------------
+
+-- | Are ARM NEON extensions available?
+chacha20_arm_available :: Bool
+chacha20_arm_available =
+  unsafeDupablePerformIO c_chacha20_arm_available /= 0
+{-# NOINLINE chacha20_arm_available #-}
+
+-- | One 64-byte ChaCha20 keystream block for the given (already-
+--   validated) key, counter, and nonce.
+block :: BS.ByteString -> Word32 -> BS.ByteString -> BS.ByteString
+block (BI.PS kfp koff _) counter (BI.PS nfp noff _) =
+  BI.unsafeCreate 64 $ \dst ->
+    withForeignPtr kfp $ \kp0 ->
+    withForeignPtr nfp $ \np0 ->
+      c_chacha20_block (kp0 `plusPtr` koff)
+                       counter
+                       (np0 `plusPtr` noff)
+                       dst
+
+-- | XOR the plaintext with the ChaCha20 keystream derived from the
+--   given (already-validated) key, counter, and nonce.
+cipher
+  :: BS.ByteString -> Word32 -> BS.ByteString -> BS.ByteString
+  -> BS.ByteString
+cipher (BI.PS kfp koff _) counter (BI.PS nfp noff _)
+       (BI.PS pfp poff plen) =
+  BI.unsafeCreate plen $ \dst ->
+    withForeignPtr kfp $ \kp0 ->
+    withForeignPtr nfp $ \np0 ->
+    withForeignPtr pfp $ \pp0 ->
+      c_chacha20_cipher (kp0 `plusPtr` koff)
+                        counter
+                        (np0 `plusPtr` noff)
+                        (pp0 `plusPtr` poff)
+                        dst
+                        (fi plen)
diff --git a/ppad-chacha.cabal b/ppad-chacha.cabal
@@ -1,7 +1,7 @@
 cabal-version: 3.0
 name: ppad-chacha
 version: 0.2.1
-synopsis: A pure ChaCha20 stream cipher
+synopsis: A fast ChaCha20 stream cipher
 license: MIT
 license-file: LICENSE
 author: Jared Tobin
@@ -11,13 +11,18 @@ build-type: Simple
 tested-with: GHC == 9.10.3
 extra-doc-files: CHANGELOG
 description:
-  A pure ChaCha20 stream cipher and block function.
+  A fast ChaCha20 stream cipher and block function.
 
 flag llvm
   description: Use GHC's LLVM backend.
   default: False
   manual: True
 
+flag sanitize
+  description: Build with AddressSanitizer and UndefinedBehaviorSanitizer.
+  default: False
+  manual: True
+
 source-repository head
   type: git
   location: git.ppad.tech/chacha.git
@@ -31,10 +36,18 @@ library
     ghc-options: -fllvm -O2
   exposed-modules:
       Crypto.Cipher.ChaCha20
+      Crypto.Cipher.ChaCha20.Arm
   build-depends:
       base >= 4.9 && < 5
     , bytestring >= 0.9 && < 0.13
     , primitive >= 0.8 && < 0.10
+  c-sources:
+      cbits/chacha20_arm.c
+  if arch(aarch64)
+    cc-options: -march=armv8-a
+  if flag(sanitize)
+    cc-options: -fsanitize=address,undefined -fno-omit-frame-pointer
+    ghc-options: -optl=-fsanitize=address,undefined
 
 test-suite chacha-tests
   type: exitcode-stdio-1.0
@@ -44,6 +57,8 @@ test-suite chacha-tests
   ghc-options:
     -rtsopts -Wall -O2
+  if flag(sanitize)
+    ghc-options: -optl=-fsanitize=address,undefined
   build-depends:
       base