commit 3d622446b5ee3af52511cc9770895cb1acf4d940
parent b4dd9ff6c285bfb9db834cdcca3d460688c3297d
Author: Jared Tobin <jared@jtobin.io>
Date: Sat, 16 May 2026 13:04:11 -0230
Merge branch 'perf-refactor'
Performance refactor + ARM NEON intrinsics, mirroring the analogous
work merged to ppad-base16 master.
Five commits, organized as two logical changes:
1. Drop the bytestring 'Builder' pipeline in favour of 'BI.unsafeCreate'
plus two static-rodata lookup tables (encode alphabet + decode
table), with the 0x40-offset trick keeping the decode table's
string literal NUL-free so it lives in rodata via the bytestring
IsString rewrite. Encode falls from ~2.3 μs to ~270 ns on 1 KiB
inputs.
2. Add an aarch64 NEON kernel in 'cbits/base64_arm.c' exposed via the
new 'Data.ByteString.Base64.Arm' module:
* Encode kernel processes 12 input bytes -> 16 output chars per
iteration via a vqtbl1q_u8 shuffle, four parallel u32 shifts +
masks, and a vqtbl4q_u8 alphabet lookup.
* Decode kernel processes 16 input chars -> 12 output bytes per
iteration. Range-compare validation with OR-accumulated 'bad'
masks, per-u32-lane 24-bit pack, vqtbl1q_u8 reorder to BE
triplets. The Haskell side hands the C kernel both inlen and
outlen; padding detection and the padded final quartet
(including RFC 4648 §3.5 non-data-bit validation) are handled
in C for symmetry with encode.
'Data.ByteString.Base64.encode' and 'decode' dispatch to the NEON
path when 'base64_arm_available' returns true, falling back to the
scalar path otherwise. Cabal adds the C sources, an aarch64
'-march=armv8-a' cc-option, and a 'sanitize' flag for ASan + UBSan
builds.
Performance on 1 KiB inputs, M4 MacBook Air, GHC 9.10.3 + LLVM 19,
'cabal bench -f+llvm':
encode time: 2.279 μs -> 102 ns (~22×)
decode time: 649.2 ns -> 160 ns (~4×)
The existing tasty suite (5000 QuickCheck cases × 3 properties + the
RFC 4648 §10 unit vectors) passes through the dispatched path under
'cabal test', 'cabal test -fllvm', and 'cabal test -fsanitize'.
Also rebrands the cabal/flake/README descriptions from "Pure" to
"Fast" to reflect that the hot path is no longer purely Haskell.
Diffstat:
6 files changed, 637 insertions(+), 319 deletions(-)
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@

[](https://docs.ppad.tech/base64)
-Pure base64 encoding & decoding on strict ByteStrings.
+Fast base64 encoding & decoding on strict ByteStrings.
## Usage
@@ -31,28 +31,27 @@ Haddocks (API documentation, etc.) are hosted at
## Performance
-The aim is best-in-class performance for pure, highly-auditable Haskell
-code. We could go slightly faster by using direct allocation and writes,
-but we get pretty close to the best impure versions with only builders.
-
-Current benchmark figures on a 1024-byte input on an Apple M4 MacBook Air,
-GHC 9.10.3 with the LLVM backend, look like (use `cabal bench` to run the
-benchmark suite):
+The aim is best-in-class performance. Current benchmark figures on 1 KiB
+inputs on an Apple M4 MacBook Air, where we avail of hardware
+acceleration via ARM NEON intrinsics, look like (use `cabal bench` to
+run the benchmark suite):
```
benchmarking ppad-base64/encode
- time 2.279 μs (2.253 μs .. 2.316 μs)
- 0.999 R² (0.998 R² .. 1.000 R²)
- mean 2.284 μs (2.270 μs .. 2.308 μs)
- std dev 74.77 ns (50.21 ns .. 124.4 ns)
+ time 102.0 ns (101.9 ns .. 102.2 ns)
+ 1.000 R² (1.000 R² .. 1.000 R²)
+ mean 102.0 ns (101.9 ns .. 102.1 ns)
+ std dev 386.6 ps (313.4 ps .. 521.5 ps)
benchmarking ppad-base64/decode
- time 649.2 ns (637.2 ns .. 659.0 ns)
- 0.998 R² (0.997 R² .. 0.999 R²)
- mean 618.5 ns (611.8 ns .. 625.5 ns)
- std dev 29.46 ns (25.76 ns .. 35.06 ns)
+ time 160.3 ns (160.3 ns .. 160.4 ns)
+ 1.000 R² (1.000 R² .. 1.000 R²)
+ mean 160.3 ns (160.2 ns .. 160.4 ns)
+ std dev 242.8 ps (201.8 ps .. 301.2 ps)
```
+You should compile with the `llvm` flag (`cabal build -fllvm`) for maximum performance.
+
## Security
This library aims at the maximum security achievable in a
diff --git a/cbits/base64_arm.c b/cbits/base64_arm.c
@@ -0,0 +1,307 @@
+#include <stddef.h>
+#include <stdint.h>
+
+#if defined(__aarch64__)
+
+#include <arm_neon.h>
+
+static const uint8_t b64_alphabet[64] =
+ "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
+
+/*
+ * Encode 'l' input bytes at 'src' into ((l+2)/3)*4 ASCII chars at 'dst'.
+ *
+ * NEON kernel processes 12 input bytes per iteration:
+ * - vld1q_u8 loads 16 bytes (we use the first 12; reading 4 ahead is
+ * safe as long as l - i >= 16)
+ * - vqtbl1q_u8 with a shuffle mask gathers each 4-byte output lane as
+ * [b1, b0, b2, b1], the order that lets a single shift+mask extract
+ * each 6-bit index
+ * - 4 vshrq_n_u32 + vandq_u32 pull out indices i0..i3 (one per lane
+ * byte); see comments below for the bit math
+ * - vqtbl4q_u8 looks up each index in the 64-byte alphabet
+ * - vst1q_u8 stores 16 output chars
+ *
+ * A scalar loop finishes any full triplet that didn't make the NEON
+ * cut-off, and a final branch emits the 0/1/2-byte padded tail.
+ */
+void base64_encode_arm(const uint8_t *src, uint8_t *dst, size_t l) {
+ uint8x16x4_t lut;
+ lut.val[0] = vld1q_u8(b64_alphabet);
+ lut.val[1] = vld1q_u8(b64_alphabet + 16);
+ lut.val[2] = vld1q_u8(b64_alphabet + 32);
+ lut.val[3] = vld1q_u8(b64_alphabet + 48);
+
+ /* For each 4-byte lane of output of vqtbl1q_u8, we want
+ * [b1, b0, b2, b1] in memory order — viewed as a little-endian u32
+ * lane that is (b1) | (b0 << 8) | (b2 << 16) | (b1 << 24). */
+ static const uint8_t shuf_enc[16] = {
+ 1, 0, 2, 1,
+ 4, 3, 5, 4,
+ 7, 6, 8, 7,
+ 10, 9,11,10,
+ };
+ uint8x16_t shuf = vld1q_u8(shuf_enc);
+
+ size_t i = 0, o = 0;
+ while (i + 16 <= l) {
+ uint8x16_t in = vld1q_u8(src + i);
+ uint8x16_t shuffled = vqtbl1q_u8(in, shuf);
+ uint32x4_t lane = vreinterpretq_u32_u8(shuffled);
+ uint32x4_t mask6 = vdupq_n_u32(0x3F);
+
+ /* lane (LE) = b1 | (b0 << 8) | (b2 << 16) | (b1 << 24)
+ * i0 (top 6 of b0) = (lane >> 10) & 0x3F
+ * i1 (lo 2 of b0|hi 4 of b1)= (lane >> 4) & 0x3F
+ * i2 (lo 4 of b1|hi 2 of b2)= (lane >> 22) & 0x3F [uses b1 copy at byte 3]
+ * i3 (lo 6 of b2) = (lane >> 16) & 0x3F */
+ uint32x4_t i0 = vandq_u32(vshrq_n_u32(lane, 10), mask6);
+ uint32x4_t i1 = vandq_u32(vshrq_n_u32(lane, 4), mask6);
+ uint32x4_t i2 = vandq_u32(vshrq_n_u32(lane, 22), mask6);
+ uint32x4_t i3 = vandq_u32(vshrq_n_u32(lane, 16), mask6);
+
+ /* assemble per-lane u32 = i0 | (i1 << 8) | (i2 << 16) | (i3 << 24) */
+ uint32x4_t idx_u32 = vorrq_u32(
+ vorrq_u32(i0, vshlq_n_u32(i1, 8)),
+ vorrq_u32(vshlq_n_u32(i2, 16), vshlq_n_u32(i3, 24)));
+
+ uint8x16_t indices = vreinterpretq_u8_u32(idx_u32);
+ uint8x16_t chars = vqtbl4q_u8(lut, indices);
+ vst1q_u8(dst + o, chars);
+
+ i += 12;
+ o += 16;
+ }
+
+ /* scalar tail: full triplets */
+ for (; i + 3 <= l; i += 3, o += 4) {
+ uint32_t v = ((uint32_t)src[i] << 16)
+ | ((uint32_t)src[i + 1] << 8)
+ | (uint32_t)src[i + 2];
+ dst[o] = b64_alphabet[(v >> 18) & 0x3F];
+ dst[o + 1] = b64_alphabet[(v >> 12) & 0x3F];
+ dst[o + 2] = b64_alphabet[(v >> 6) & 0x3F];
+ dst[o + 3] = b64_alphabet[ v & 0x3F];
+ }
+
+ /* 1- or 2-byte padded tail */
+ if (i + 1 == l) {
+ uint8_t b = src[i];
+ dst[o] = b64_alphabet[(b >> 2) & 0x3F];
+ dst[o + 1] = b64_alphabet[(b & 0x03) << 4];
+ dst[o + 2] = '=';
+ dst[o + 3] = '=';
+ } else if (i + 2 == l) {
+ uint8_t b0 = src[i];
+ uint8_t b1 = src[i + 1];
+ dst[o] = b64_alphabet[(b0 >> 2) & 0x3F];
+ dst[o + 1] = b64_alphabet[((b0 & 0x03) << 4) | (b1 >> 4)];
+ dst[o + 2] = b64_alphabet[(b1 & 0x0F) << 2];
+ dst[o + 3] = '=';
+ }
+}
+
+/*
+ * Convert 16 ASCII base64 chars to 6-bit values in 'val'.
+ * Each lane of 'bad' is 0xff if the corresponding input is not a
+ * valid base64 char ('A'..'Z', 'a'..'z', '0'..'9', '+', '/'), else 0.
+ * '=' is treated as invalid here; the caller handles padding.
+ */
+static inline void ascii_to_b64(uint8x16_t c,
+ uint8x16_t *val,
+ uint8x16_t *bad) {
+ uint8x16_t is_upper = vandq_u8(vcgeq_u8(c, vdupq_n_u8('A')),
+ vcleq_u8(c, vdupq_n_u8('Z')));
+ uint8x16_t is_lower = vandq_u8(vcgeq_u8(c, vdupq_n_u8('a')),
+ vcleq_u8(c, vdupq_n_u8('z')));
+ uint8x16_t is_digit = vandq_u8(vcgeq_u8(c, vdupq_n_u8('0')),
+ vcleq_u8(c, vdupq_n_u8('9')));
+ uint8x16_t is_plus = vceqq_u8(c, vdupq_n_u8('+'));
+ uint8x16_t is_slash = vceqq_u8(c, vdupq_n_u8('/'));
+
+ /* Per-lane additive offset that takes c to its 6-bit value:
+ * 'A'..'Z': +(-65) = 0xBF mod 256 ('A' + 0xBF = 0)
+ * 'a'..'z': +(-71) = 0xB9
+ * '0'..'9': +4
+ * '+': +19
+ * '/': +16
+ * Invalid lanes get +0; 'bad' flags them. */
+ uint8x16_t add = vorrq_u8(
+ vandq_u8(is_upper, vdupq_n_u8((uint8_t)(0u - 65))),
+ vorrq_u8(
+ vandq_u8(is_lower, vdupq_n_u8((uint8_t)(0u - 71))),
+ vorrq_u8(
+ vandq_u8(is_digit, vdupq_n_u8(4)),
+ vorrq_u8(
+ vandq_u8(is_plus, vdupq_n_u8(19)),
+ vandq_u8(is_slash, vdupq_n_u8(16))))));
+
+ *val = vaddq_u8(c, add);
+
+ uint8x16_t any_valid = vorrq_u8(is_upper,
+ vorrq_u8(is_lower,
+ vorrq_u8(is_digit,
+ vorrq_u8(is_plus, is_slash))));
+ *bad = vmvnq_u8(any_valid);
+}
+
+static inline uint8_t scalar_b64(uint8_t c) {
+ if (c >= 'A' && c <= 'Z') return (uint8_t)(c - 'A');
+ if (c >= 'a' && c <= 'z') return (uint8_t)(c - 'a' + 26);
+ if (c >= '0' && c <= '9') return (uint8_t)(c - '0' + 52);
+ if (c == '+') return 62;
+ if (c == '/') return 63;
+ return 0x80; /* invalid sentinel */
+}
+
+/*
+ * Decode 'inlen' ASCII base64 chars at 'src' into 'outlen' bytes at
+ * 'dst'. Returns 1 on success, 0 on any decoding error: malformed
+ * length, malformed padding, invalid char in body, or invalid char /
+ * non-zero non-data bits in the padded final quartet (RFC 4648 §3.5).
+ *
+ * Caller must allocate 'outlen' bytes at 'dst' and pass the correct
+ * outlen for the given inlen and padding; mismatch returns 0 with
+ * 'dst' unspecified.
+ *
+ * Body NEON kernel processes 16 input chars (= 4 quartets) per
+ * iteration:
+ * - vld1q_u8 loads 16 chars
+ * - ascii_to_b64 validates each lane and yields 6-bit values
+ * - per u32x4 lane: build the 24-bit packed value V = (v0 << 18) |
+ * (v1 << 12) | (v2 << 6) | v3, whose bytes in LE are [V_low,
+ * V_mid, V_high, 0]
+ * - vqtbl1q_u8 reshuffles those bytes into [V_high, V_mid, V_low]
+ * per triplet, yielding 12 output bytes at the bottom of the
+ * output vector
+ * - vst1q_u8 stores 16 bytes (writing 12 valid + 4 spurious; the
+ * loop bound 'o + 16 <= body_outlen' keeps the overrun within
+ * the allocated buffer, and the spurious bytes get clobbered by
+ * the next iteration or by the scalar tail / final quartet)
+ *
+ * A scalar tail finishes any body quartets that didn't make the
+ * NEON cut-off, then the padded final quartet is decoded explicitly.
+ */
+int base64_decode_arm(const uint8_t *src, uint8_t *dst,
+ size_t inlen, size_t outlen) {
+ if (inlen == 0) return outlen == 0;
+ if (inlen & 0x3) return 0;
+
+ uint8_t c_pre = src[inlen - 2];
+ uint8_t c_end = src[inlen - 1];
+ size_t pad = 0;
+ if (c_end == '=') {
+ if (c_pre == '=') pad = 2;
+ else pad = 1;
+ } else if (c_pre == '=') {
+ return 0; /* '=' at offset -2 only is malformed */
+ }
+
+ size_t nfull = inlen >> 2;
+ if (outlen != nfull * 3 - pad) return 0;
+
+ size_t body_chars = (pad > 0) ? (inlen - 4) : inlen;
+ size_t body_outlen = (body_chars >> 2) * 3;
+
+ uint8x16_t bad_acc = vdupq_n_u8(0);
+
+ static const uint8_t pack_shuf[16] = {
+ 2, 1, 0,
+ 6, 5, 4,
+ 10, 9, 8,
+ 14,13,12,
+ 0xFF, 0xFF, 0xFF, 0xFF
+ };
+ uint8x16_t pshuf = vld1q_u8(pack_shuf);
+
+ size_t i = 0, o = 0;
+ while (o + 16 <= body_outlen) {
+ uint8x16_t c = vld1q_u8(src + i);
+ uint8x16_t val, this_bad;
+ ascii_to_b64(c, &val, &this_bad);
+ bad_acc = vorrq_u8(bad_acc, this_bad);
+
+ uint32x4_t v32 = vreinterpretq_u32_u8(val);
+ uint32x4_t mask8 = vdupq_n_u32(0xFF);
+
+ uint32x4_t p0 = vshlq_n_u32(vandq_u32(v32, mask8), 18);
+ uint32x4_t p1 = vshlq_n_u32(
+ vandq_u32(vshrq_n_u32(v32, 8), mask8), 12);
+ uint32x4_t p2 = vshlq_n_u32(
+ vandq_u32(vshrq_n_u32(v32, 16), mask8), 6);
+ uint32x4_t p3 = vshrq_n_u32(v32, 24);
+
+ uint32x4_t V = vorrq_u32(vorrq_u32(p0, p1),
+ vorrq_u32(p2, p3));
+ uint8x16_t V_bytes = vreinterpretq_u8_u32(V);
+ uint8x16_t packed = vqtbl1q_u8(V_bytes, pshuf);
+
+ vst1q_u8(dst + o, packed); /* 12 valid bytes + 4 spurious */
+
+ i += 16;
+ o += 12;
+ }
+
+ uint8_t tail_bad = 0;
+
+ /* scalar body tail (full quartets, no '=') */
+ while (o + 3 <= body_outlen) {
+ uint8_t v0 = scalar_b64(src[i]);
+ uint8_t v1 = scalar_b64(src[i + 1]);
+ uint8_t v2 = scalar_b64(src[i + 2]);
+ uint8_t v3 = scalar_b64(src[i + 3]);
+ tail_bad |= (v0 | v1 | v2 | v3) & 0x80;
+ dst[o] = (uint8_t)((v0 << 2) | (v1 >> 4));
+ dst[o + 1] = (uint8_t)(((v1 & 0x0F) << 4) | (v2 >> 2));
+ dst[o + 2] = (uint8_t)(((v2 & 0x03) << 6) | (v3 & 0x3F));
+ i += 4;
+ o += 3;
+ }
+
+ /* padded final quartet */
+ if (pad > 0) {
+ uint8_t v0 = scalar_b64(src[i]);
+ uint8_t v1 = scalar_b64(src[i + 1]);
+ if ((v0 | v1) & 0x80) return 0;
+
+ if (pad == 2) {
+ /* "XX==" -> 1 output byte; bottom 4 bits of v1 must be 0 */
+ if (v1 & 0x0F) return 0;
+ dst[o] = (uint8_t)((v0 << 2) | (v1 >> 4));
+ } else {
+ /* "XXX=" -> 2 output bytes; bottom 2 bits of v2 must be 0 */
+ uint8_t v2 = scalar_b64(src[i + 2]);
+ if (v2 & 0x80) return 0;
+ if (v2 & 0x03) return 0;
+ dst[o] = (uint8_t)((v0 << 2) | (v1 >> 4));
+ dst[o + 1] = (uint8_t)(((v1 & 0x0F) << 4) | (v2 >> 2));
+ }
+ }
+
+ return (vmaxvq_u8(bad_acc) == 0) && (tail_bad == 0);
+}
+
+int base64_arm_available(void) {
+ return 1;
+}
+
+#else
+
+/* stubs for non-aarch64 builds; never reached because dispatch is
+ * gated on 'base64_arm_available' returning 0 */
+
+void base64_encode_arm(const uint8_t *src, uint8_t *dst, size_t l) {
+ (void)src; (void)dst; (void)l;
+}
+
+int base64_decode_arm(const uint8_t *src, uint8_t *dst,
+ size_t inlen, size_t outlen) {
+ (void)src; (void)dst; (void)inlen; (void)outlen;
+ return 0;
+}
+
+int base64_arm_available(void) {
+ return 0;
+}
+
+#endif
diff --git a/flake.nix b/flake.nix
@@ -1,5 +1,5 @@
{
- description = "Pure Haskell base64 encoding and decoding on bytestrings.";
+ description = "Fast Haskell base64 encoding and decoding on bytestrings.";
inputs = {
ppad-nixpkgs = {
diff --git a/lib/Data/ByteString/Base64.hs b/lib/Data/ByteString/Base64.hs
@@ -1,7 +1,5 @@
{-# OPTIONS_HADDOCK prune #-}
-{-# LANGUAGE ApplicativeDo #-}
{-# LANGUAGE BangPatterns #-}
-{-# LANGUAGE BinaryLiterals #-}
{-# LANGUAGE OverloadedStrings #-}
-- |
@@ -20,324 +18,237 @@ module Data.ByteString.Base64 (
import qualified Data.Bits as B
import Data.Bits ((.&.), (.|.))
import qualified Data.ByteString as BS
-import qualified Data.ByteString.Builder as BSB
-import qualified Data.ByteString.Builder.Extra as BE
+import qualified Data.ByteString.Base64.Arm as Arm
import qualified Data.ByteString.Internal as BI
-import qualified Data.ByteString.Unsafe as BU
-import Data.Word (Word8, Word16, Word32, Word64)
-
-to_strict :: BSB.Builder -> BS.ByteString
-to_strict = BS.toStrict . BSB.toLazyByteString
-{-# INLINE to_strict #-}
-
-to_strict_small :: BSB.Builder -> BS.ByteString
-to_strict_small = BS.toStrict
- . BE.toLazyByteStringWith (BE.safeStrategy 128 BE.smallChunkSize) mempty
-{-# INLINE to_strict_small #-}
+import Data.Word (Word8)
+import Foreign.ForeignPtr (withForeignPtr)
+import Foreign.Ptr (Ptr, plusPtr)
+import Foreign.Storable (peekElemOff, pokeElemOff)
+import System.IO.Unsafe (unsafeDupablePerformIO)
fi :: (Num a, Integral b) => b -> a
fi = fromIntegral
{-# INLINE fi #-}
-b64_charset :: BS.ByteString
-b64_charset =
+-- 64-byte table. Indexed by 6-bit value (0..63), yields the
+-- corresponding base64 alphabet character. All-ASCII content means
+-- the bytestring 'IsString' rule rewrites this to 'unsafePackAddress'
+-- and the bytes live in static rodata.
+enc_tab :: BS.ByteString
+enc_tab =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
+{-# NOINLINE enc_tab #-}
--- 3 input bytes -> 4 output chars packed in a Word32 (big-endian)
-expand_w24 :: Word8 -> Word8 -> Word8 -> Word32
-expand_w24 a b c =
- let !v = (fi a `B.shiftL` 16 :: Word32)
- .|. (fi b `B.shiftL` 8)
- .|. fi c
- !c0 = BU.unsafeIndex b64_charset (fi ((v `B.shiftR` 18) .&. 0x3F))
- !c1 = BU.unsafeIndex b64_charset (fi ((v `B.shiftR` 12) .&. 0x3F))
- !c2 = BU.unsafeIndex b64_charset (fi ((v `B.shiftR` 6) .&. 0x3F))
- !c3 = BU.unsafeIndex b64_charset (fi (v .&. 0x3F))
- in fi c0 `B.shiftL` 24
- .|. fi c1 `B.shiftL` 16
- .|. fi c2 `B.shiftL` 8
- .|. fi c3
-{-# INLINE expand_w24 #-}
-
--- 6 input bytes -> 8 output chars packed in a Word64 (big-endian)
-expand_w48 :: Word8 -> Word8 -> Word8 -> Word8 -> Word8 -> Word8 -> Word64
-expand_w48 a b c d e f =
- let !hi = expand_w24 a b c
- !lo = expand_w24 d e f
- in (fi hi `B.shiftL` 32) .|. fi lo
-{-# INLINE expand_w48 #-}
+-- 256-byte table. Index by an ASCII byte to obtain its 6-bit value;
+-- valid base64 chars ('A'..'Z', 'a'..'z', '0'..'9', '+', '/') map to
+-- 0x40..0x7f, every other byte (including '=') maps to 0x80.
+--
+-- The encoding is chosen so the literal is strictly ASCII and contains
+-- no embedded NUL, which is what the bytestring 'IsString' rule needs
+-- to rewrite it into 'unsafePackAddress' (cf. 'enc_tab') — the bytes
+-- end up in static rodata, with no CAF allocation.
+--
+-- The 0x80 sentinel is distinguished by bit 7; no value 0x40..0x7f
+-- carries that bit, so 'decode' OR-folds every lookup into an
+-- accumulator and tests 'acc .&. 0x80 == 0' once at the end. The
+-- low 6 bits of each entry are the 6-bit value, possibly contaminated
+-- by the 0x40 flag bit; the b0/b1/b2 formulas mask each subexpression
+-- before combining so the flag never bleeds into the output bytes.
+dec_tab :: BS.ByteString
+dec_tab =
+ "\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x7E\x80\x80\x80\x7F\
+ \\x74\x75\x76\x77\x78\x79\x7A\x7B\x7C\x7D\x80\x80\x80\x80\x80\x80\
+ \\x80\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4A\x4B\x4C\x4D\x4E\
+ \\x4F\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x80\x80\x80\x80\x80\
+ \\x80\x5A\x5B\x5C\x5D\x5E\x5F\x60\x61\x62\x63\x64\x65\x66\x67\x68\
+ \\x69\x6A\x6B\x6C\x6D\x6E\x6F\x70\x71\x72\x73\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\
+ \\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80"
+{-# NOINLINE dec_tab #-}
-- | Encode a base256 'ByteString' as base64.
--
+-- Uses ARM NEON extensions when available, otherwise a pure
+-- Haskell scalar loop.
+--
-- >>> encode "hello world"
-- "aGVsbG8gd29ybGQ="
encode :: BS.ByteString -> BS.ByteString
-encode bs@(BI.PS _ _ l)
- | l < 64 = to_strict_small loop
- | otherwise = to_strict loop
- where
- loop
- | l `rem` 6 == 0 =
- go64 bs
- | (l - 3) `rem` 6 == 0 = case BS.splitAt (l - 3) bs of
- (chunk, etc) ->
- go64 chunk
- <> go32 etc
- | (l - 1) `rem` 6 == 0 = case BS.splitAt (l - 1) bs of
- (chunk, etc) ->
- go64 chunk
- <> tail1 etc
- | (l - 2) `rem` 6 == 0 = case BS.splitAt (l - 2) bs of
- (chunk, etc) ->
- go64 chunk
- <> tail2 etc
- | (l - 4) `rem` 6 == 0 = case BS.splitAt (l - 4) bs of
- (chunk, etc) ->
- go64 chunk
- <> go32 (BU.unsafeTake 3 etc)
- <> tail1 (BU.unsafeDrop 3 etc)
- | (l - 5) `rem` 6 == 0 = case BS.splitAt (l - 5) bs of
- (chunk, etc) ->
- go64 chunk
- <> go32 (BU.unsafeTake 3 etc)
- <> tail2 (BU.unsafeDrop 3 etc)
- | otherwise =
- mempty -- unreachable: l `rem` 6 in [0..5]
-
- go64 b = case BS.splitAt 6 b of
- (chunk, etc)
- | BS.null chunk -> mempty
- | otherwise ->
- let !w64 = expand_w48
- (BU.unsafeIndex chunk 0)
- (BU.unsafeIndex chunk 1)
- (BU.unsafeIndex chunk 2)
- (BU.unsafeIndex chunk 3)
- (BU.unsafeIndex chunk 4)
- (BU.unsafeIndex chunk 5)
- in BSB.word64BE w64 <> go64 etc
-
- go32 b = case BS.splitAt 3 b of
- (chunk, etc)
- | BS.null chunk -> mempty
- | otherwise ->
- let !w32 = expand_w24
- (BU.unsafeIndex chunk 0)
- (BU.unsafeIndex chunk 1)
- (BU.unsafeIndex chunk 2)
- in BSB.word32BE w32 <> go32 etc
-
- -- final 1 byte -> "XX==" (one Word32 BE)
- tail1 b =
- let !a = BU.unsafeIndex b 0
- !c0 = BU.unsafeIndex b64_charset (fi (a `B.shiftR` 2))
- !c1 = BU.unsafeIndex b64_charset (fi ((a .&. 0x03) `B.shiftL` 4))
- !w32 = (fi c0 `B.shiftL` 24 :: Word32)
- .|. (fi c1 `B.shiftL` 16)
- .|. 0x00003D3D
- in BSB.word32BE w32
-
- -- final 2 bytes -> "XXX=" (one Word32 BE)
- tail2 b =
- let !a = BU.unsafeIndex b 0
- !c = BU.unsafeIndex b 1
- !c0 = BU.unsafeIndex b64_charset (fi (a `B.shiftR` 2))
- !c1 = BU.unsafeIndex b64_charset
- (fi (((a .&. 0x03) `B.shiftL` 4) .|. (c `B.shiftR` 4)))
- !c2 = BU.unsafeIndex b64_charset (fi ((c .&. 0x0F) `B.shiftL` 2))
- !w32 = (fi c0 `B.shiftL` 24 :: Word32)
- .|. (fi c1 `B.shiftL` 16)
- .|. (fi c2 `B.shiftL` 8)
- .|. 0x0000003D
- in BSB.word32BE w32
-
--- word8 base64 character -> 6-bit value
-word6 :: Word8 -> Maybe Word8
-word6 c
- | c >= 65 && c <= 90 = pure $! c - 65 -- A-Z
- | c >= 97 && c <= 122 = pure $! c - 71 -- a-z
- | c >= 48 && c <= 57 = pure $! c + 4 -- 0-9
- | c == 43 = pure 62 -- '+'
- | c == 47 = pure 63 -- '/'
- | otherwise = Nothing
-{-# INLINE word6 #-}
-
--- decode 4 chars at offset i to a 24-bit value (in low bits of Word32)
-dec_quartet :: BS.ByteString -> Int -> Maybe Word32
-dec_quartet b i = do
- !v0 <- word6 (BU.unsafeIndex b i)
- !v1 <- word6 (BU.unsafeIndex b (i + 1))
- !v2 <- word6 (BU.unsafeIndex b (i + 2))
- !v3 <- word6 (BU.unsafeIndex b (i + 3))
- pure $! (fi v0 `B.shiftL` 18 :: Word32)
- .|. (fi v1 `B.shiftL` 12)
- .|. (fi v2 `B.shiftL` 6)
- .|. fi v3
-{-# INLINE dec_quartet #-}
+encode bs
+ | Arm.base64_arm_available = Arm.encode bs
+ | otherwise = encode_scalar bs
+{-# INLINABLE encode #-}
-- | Decode a base64 'ByteString' to base256.
--
--- Invalid inputs (including incorrectly-padded or non-canonical
--- inputs) will produce 'Nothing'.
+-- Uses ARM NEON extensions when available, otherwise a pure
+-- Haskell scalar loop. Invalid inputs (including incorrectly-
+-- padded or non-canonical inputs) will produce 'Nothing'.
--
-- >>> decode "aGVsbG8gd29ybGQ="
-- Just "hello world"
-- >>> decode "aGVsbG8gd29ybGQ" -- missing padding
-- Nothing
decode :: BS.ByteString -> Maybe BS.ByteString
-decode bs@(BI.PS _ _ l)
- | l == 0 = pure BS.empty
- | l `rem` 4 /= 0 = Nothing
- | (l `quot` 4) * 3 < 128 = fmap to_strict_small loop
- | otherwise = fmap to_strict loop
- where
- !bl = l - 4
- !body = BU.unsafeTake bl bs
- !final = BU.unsafeDrop bl bs
-
- loop = do
- !b0 <- decode_body body
- !b1 <- decode_final final
- pure (b0 <> b1)
-
- decode_body b
- | bl `rem` 32 == 0 =
- go64 mempty b
- | (bl - 4) `rem` 32 == 0 = case BS.splitAt (bl - 4) b of
- (chunk, etc) -> do
- !acc <- go64 mempty chunk
- go16 acc etc
- | (bl - 8) `rem` 32 == 0 = case BS.splitAt (bl - 8) b of
- (chunk, etc) -> do
- !acc <- go64 mempty chunk
- go32 acc etc
- | (bl - 12) `rem` 32 == 0 = case BS.splitAt (bl - 12) b of
- (chunk, etc) -> do
- !acc0 <- go64 mempty chunk
- !acc1 <- go32 acc0 (BU.unsafeTake 8 etc)
- go16 acc1 (BU.unsafeDrop 8 etc)
- | (bl - 16) `rem` 32 == 0 = case BS.splitAt (bl - 16) b of
- (chunk, etc) -> do
- !acc <- go64 mempty chunk
- go48 acc etc
- | (bl - 20) `rem` 32 == 0 = case BS.splitAt (bl - 20) b of
- (chunk, etc) -> do
- !acc0 <- go64 mempty chunk
- !acc1 <- go48 acc0 (BU.unsafeTake 16 etc)
- go16 acc1 (BU.unsafeDrop 16 etc)
- | (bl - 24) `rem` 32 == 0 = case BS.splitAt (bl - 24) b of
- (chunk, etc) -> do
- !acc0 <- go64 mempty chunk
- !acc1 <- go48 acc0 (BU.unsafeTake 16 etc)
- go32 acc1 (BU.unsafeDrop 16 etc)
- | (bl - 28) `rem` 32 == 0 = case BS.splitAt (bl - 28) b of
- (chunk, etc) -> do
- !acc0 <- go64 mempty chunk
- !acc1 <- go48 acc0 (BU.unsafeTake 16 etc)
- !acc2 <- go32 acc1 (BU.unsafeTake 8 (BU.unsafeDrop 16 etc))
- go16 acc2 (BU.unsafeDrop 24 etc)
- | otherwise = Nothing -- unreachable
-
- decode_final b =
- let !c0 = BU.unsafeIndex b 0
- !c1 = BU.unsafeIndex b 1
- !c2 = BU.unsafeIndex b 2
- !c3 = BU.unsafeIndex b 3
- in case (c2 == 0x3D, c3 == 0x3D) of
- (True, True) -> do
- !v0 <- word6 c0
- !v1 <- word6 c1
- if v1 .&. 0x0F /= 0
- then Nothing
- else
- let !w8 = (v0 `B.shiftL` 2) .|. (v1 `B.shiftR` 4)
- in pure $! BSB.word8 w8
- (False, True) -> do
- !v0 <- word6 c0
- !v1 <- word6 c1
- !v2 <- word6 c2
- if v2 .&. 0x03 /= 0
- then Nothing
- else
- let !w16 = (fi v0 `B.shiftL` 10 :: Word16)
- .|. (fi v1 `B.shiftL` 4)
- .|. (fi v2 `B.shiftR` 2)
- in pure $! BSB.word16BE w16
- (True, False) -> Nothing
- (False, False) -> do
- !v0 <- word6 c0
- !v1 <- word6 c1
- !v2 <- word6 c2
- !v3 <- word6 c3
- let !w24 = (fi v0 `B.shiftL` 18 :: Word32)
- .|. (fi v1 `B.shiftL` 12)
- .|. (fi v2 `B.shiftL` 6)
- .|. fi v3
- !w16 = fi (w24 `B.shiftR` 8) :: Word16
- !w8 = fi w24 :: Word8
- pure $! BSB.word16BE w16 <> BSB.word8 w8
-
- -- 4 chars -> 3 bytes (1 word16BE + 1 word8)
- go16 acc b = case BS.splitAt 4 b of
- (chunk, etc)
- | BS.null chunk -> pure acc
- | otherwise -> do
- !q <- dec_quartet chunk 0
- let !w16 = fi (q `B.shiftR` 8) :: Word16
- !w8 = fi q :: Word8
- go16 (acc <> BSB.word16BE w16 <> BSB.word8 w8) etc
-
- -- 8 chars -> 6 bytes (1 word32BE + 1 word16BE)
- go32 acc b = case BS.splitAt 8 b of
- (chunk, etc)
- | BS.null chunk -> pure acc
- | otherwise -> do
- !q0 <- dec_quartet chunk 0
- !q1 <- dec_quartet chunk 4
- let !w48 = (fi q0 `B.shiftL` 24 :: Word64)
- .|. fi q1
- !w32 = fi (w48 `B.shiftR` 16) :: Word32
- !w16 = fi w48 :: Word16
- go32 (acc <> BSB.word32BE w32 <> BSB.word16BE w16) etc
-
- -- 16 chars -> 12 bytes (1 word64BE + 1 word32BE)
- go48 acc b = case BS.splitAt 16 b of
- (chunk, etc)
- | BS.null chunk -> pure acc
- | otherwise -> do
- !q0 <- dec_quartet chunk 0
- !q1 <- dec_quartet chunk 4
- !q2 <- dec_quartet chunk 8
- !q3 <- dec_quartet chunk 12
- let !w64 = (fi q0 `B.shiftL` 40 :: Word64)
- .|. (fi q1 `B.shiftL` 16)
- .|. fi (q2 `B.shiftR` 8)
- !w32 = ((q2 .&. 0xFF) `B.shiftL` 24) .|. q3
- go48 (acc <> BSB.word64BE w64 <> BSB.word32BE w32) etc
-
- -- 32 chars -> 24 bytes (3 × word64BE)
- go64 acc b = case BS.splitAt 32 b of
- (chunk, etc)
- | BS.null chunk -> pure acc
- | otherwise -> do
- !q0 <- dec_quartet chunk 0
- !q1 <- dec_quartet chunk 4
- !q2 <- dec_quartet chunk 8
- !q3 <- dec_quartet chunk 12
- !q4 <- dec_quartet chunk 16
- !q5 <- dec_quartet chunk 20
- !q6 <- dec_quartet chunk 24
- !q7 <- dec_quartet chunk 28
- let !w64a = (fi q0 `B.shiftL` 40 :: Word64)
- .|. (fi q1 `B.shiftL` 16)
- .|. fi (q2 `B.shiftR` 8)
- !w64b = (fi (q2 .&. 0xFF) `B.shiftL` 56 :: Word64)
- .|. (fi q3 `B.shiftL` 32)
- .|. (fi q4 `B.shiftL` 8)
- .|. fi (q5 `B.shiftR` 16)
- !w64c = (fi (q5 .&. 0xFFFF) `B.shiftL` 48 :: Word64)
- .|. (fi q6 `B.shiftL` 24)
- .|. fi q7
- go64 (acc <> BSB.word64BE w64a
- <> BSB.word64BE w64b
- <> BSB.word64BE w64c) etc
+decode bs
+ | Arm.base64_arm_available = Arm.decode bs
+ | otherwise = decode_scalar bs
+{-# INLINABLE decode #-}
+
+encode_scalar :: BS.ByteString -> BS.ByteString
+encode_scalar (BI.PS sfp soff l) =
+ case enc_tab of
+ BI.PS tfp toff _ ->
+ BI.unsafeCreate ((l + 2) `quot` 3 * 4) $ \dst ->
+ withForeignPtr sfp $ \sp0 ->
+ withForeignPtr tfp $ \tp0 -> do
+ let !sp = sp0 `plusPtr` soff :: Ptr Word8
+ !tp = tp0 `plusPtr` toff :: Ptr Word8
+ !nfull = l `quot` 3
+ !rmn = l - nfull * 3
+ loop !i
+ | i == nfull = pure ()
+ | otherwise = do
+ let !ii = i * 3
+ !oo = i * 4
+ b0 <- peekElemOff sp ii
+ b1 <- peekElemOff sp (ii + 1)
+ b2 <- peekElemOff sp (ii + 2)
+ c0 <- peekElemOff tp (fi (b0 `B.shiftR` 2))
+ c1 <- peekElemOff tp (fi
+ (((b0 .&. 0x03) `B.shiftL` 4)
+ .|. (b1 `B.shiftR` 4)))
+ c2 <- peekElemOff tp (fi
+ (((b1 .&. 0x0F) `B.shiftL` 2)
+ .|. (b2 `B.shiftR` 6)))
+ c3 <- peekElemOff tp (fi (b2 .&. 0x3F))
+ pokeElemOff dst oo (c0 :: Word8)
+ pokeElemOff dst (oo + 1) c1
+ pokeElemOff dst (oo + 2) c2
+ pokeElemOff dst (oo + 3) c3
+ loop (i + 1)
+ loop 0
+ case rmn of
+ 0 -> pure ()
+ 1 -> do
+ let !ii = nfull * 3
+ !oo = nfull * 4
+ b0 <- peekElemOff sp ii
+ c0 <- peekElemOff tp (fi (b0 `B.shiftR` 2))
+ c1 <- peekElemOff tp (fi ((b0 .&. 0x03) `B.shiftL` 4))
+ pokeElemOff dst oo (c0 :: Word8)
+ pokeElemOff dst (oo + 1) c1
+ pokeElemOff dst (oo + 2) 0x3D
+ pokeElemOff dst (oo + 3) 0x3D
+ _ -> do
+ let !ii = nfull * 3
+ !oo = nfull * 4
+ b0 <- peekElemOff sp ii
+ b1 <- peekElemOff sp (ii + 1)
+ c0 <- peekElemOff tp (fi (b0 `B.shiftR` 2))
+ c1 <- peekElemOff tp (fi
+ (((b0 .&. 0x03) `B.shiftL` 4)
+ .|. (b1 `B.shiftR` 4)))
+ c2 <- peekElemOff tp (fi ((b1 .&. 0x0F) `B.shiftL` 2))
+ pokeElemOff dst oo (c0 :: Word8)
+ pokeElemOff dst (oo + 1) c1
+ pokeElemOff dst (oo + 2) c2
+ pokeElemOff dst (oo + 3) 0x3D
+
+decode_scalar :: BS.ByteString -> Maybe BS.ByteString
+decode_scalar (BI.PS sfp soff l)
+ | l == 0 = Just BS.empty
+ | l .&. 0x03 /= 0 = Nothing
+ | otherwise = case dec_tab of
+ BI.PS tfp toff _ -> unsafeDupablePerformIO $
+ withForeignPtr sfp $ \sp0 ->
+ withForeignPtr tfp $ \tp0 -> do
+ let !sp = sp0 `plusPtr` soff :: Ptr Word8
+ !tp = tp0 `plusPtr` toff :: Ptr Word8
+ c_pre <- peekElemOff sp (l - 2)
+ c_end <- peekElemOff sp (l - 1)
+ let !pad_pre = c_pre == 0x3D
+ !pad_end = c_end == 0x3D
+ if pad_pre && not pad_end
+ then pure Nothing
+ else do
+ let !pad = (if pad_pre then 2 else if pad_end then 1 else 0)
+ :: Int
+ !nfull = l `B.shiftR` 2
+ !nbody = if pad > 0 then nfull - 1 else nfull
+ !outlen = nfull * 3 - pad
+ fp <- BI.mallocByteString outlen
+ ok <- withForeignPtr fp $ \dst -> do
+ let body_loop !acc !i
+ | i == nbody = pure acc
+ | otherwise = do
+ let !ii = i `B.shiftL` 2
+ !oo = i * 3
+ c0 <- peekElemOff sp ii
+ c1 <- peekElemOff sp (ii + 1)
+ c2 <- peekElemOff sp (ii + 2)
+ c3 <- peekElemOff sp (ii + 3)
+ v0 <- peekElemOff tp (fi c0)
+ v1 <- peekElemOff tp (fi c1)
+ v2 <- peekElemOff tp (fi c2)
+ v3 <- peekElemOff tp (fi c3)
+ let !b0 = (v0 `B.shiftL` 2)
+ .|. ((v1 `B.shiftR` 4) .&. 0x03)
+ !b1 = ((v1 .&. 0x0F) `B.shiftL` 4)
+ .|. ((v2 `B.shiftR` 2) .&. 0x0F)
+ !b2 = ((v2 .&. 0x03) `B.shiftL` 6)
+ .|. (v3 .&. 0x3F)
+ pokeElemOff dst oo b0
+ pokeElemOff dst (oo + 1) b1
+ pokeElemOff dst (oo + 2) b2
+ body_loop
+ (acc .|. v0 .|. v1 .|. v2 .|. v3) (i + 1)
+ acc <- body_loop 0 0
+ if acc .&. 0x80 /= 0
+ then pure False
+ else case pad of
+ 0 -> pure True
+ 1 -> do
+ let !ii = nbody `B.shiftL` 2
+ !oo = nbody * 3
+ c0 <- peekElemOff sp ii
+ c1 <- peekElemOff sp (ii + 1)
+ c2 <- peekElemOff sp (ii + 2)
+ v0 <- peekElemOff tp (fi c0)
+ v1 <- peekElemOff tp (fi c1)
+ v2 <- peekElemOff tp (fi c2)
+ let !tail_acc = v0 .|. v1 .|. v2
+ if tail_acc .&. 0x80 /= 0 || v2 .&. 0x03 /= 0
+ then pure False
+ else do
+ let !b0 = (v0 `B.shiftL` 2)
+ .|. ((v1 `B.shiftR` 4) .&. 0x03)
+ !b1 = ((v1 .&. 0x0F) `B.shiftL` 4)
+ .|. ((v2 `B.shiftR` 2) .&. 0x0F)
+ pokeElemOff dst oo b0
+ pokeElemOff dst (oo + 1) b1
+ pure True
+ _ -> do
+ let !ii = nbody `B.shiftL` 2
+ !oo = nbody * 3
+ c0 <- peekElemOff sp ii
+ c1 <- peekElemOff sp (ii + 1)
+ v0 <- peekElemOff tp (fi c0)
+ v1 <- peekElemOff tp (fi c1)
+ let !tail_acc = v0 .|. v1
+ if tail_acc .&. 0x80 /= 0 || v1 .&. 0x0F /= 0
+ then pure False
+ else do
+ let !b0 = (v0 `B.shiftL` 2)
+ .|. ((v1 `B.shiftR` 4) .&. 0x03)
+ pokeElemOff dst oo b0
+ pure True
+ pure $! if ok then Just (BI.PS fp 0 outlen) else Nothing
diff --git a/lib/Data/ByteString/Base64/Arm.hs b/lib/Data/ByteString/Base64/Arm.hs
@@ -0,0 +1,86 @@
+{-# OPTIONS_HADDOCK hide #-}
+{-# LANGUAGE BangPatterns #-}
+
+-- |
+-- Module: Data.ByteString.Base64.Arm
+-- Copyright: (c) 2026 Jared Tobin
+-- License: MIT
+-- Maintainer: Jared Tobin <jared@ppad.tech>
+--
+-- ARM NEON support for base64 encoding and decoding.
+
+module Data.ByteString.Base64.Arm (
+ base64_arm_available
+ , encode
+ , decode
+ ) where
+
+import qualified Data.Bits as B
+import Data.Bits ((.&.))
+import qualified Data.ByteString as BS
+import qualified Data.ByteString.Internal as BI
+import Data.Word (Word8)
+import Foreign.C.Types (CInt(..), CSize(..))
+import Foreign.ForeignPtr (withForeignPtr)
+import Foreign.Ptr (Ptr, plusPtr)
+import Foreign.Storable (peekElemOff)
+import System.IO.Unsafe (unsafeDupablePerformIO)
+
+-- ffi ------------------------------------------------------------------------
+
+foreign import ccall unsafe "base64_encode_arm"
+ c_base64_encode :: Ptr Word8 -> Ptr Word8 -> CSize -> IO ()
+
+foreign import ccall unsafe "base64_decode_arm"
+ c_base64_decode :: Ptr Word8 -> Ptr Word8 -> CSize -> CSize -> IO CInt
+
+foreign import ccall unsafe "base64_arm_available"
+ c_base64_arm_available :: IO CInt
+
+-- utilities ------------------------------------------------------------------
+
+fi :: (Integral a, Num b) => a -> b
+fi = fromIntegral
+{-# INLINE fi #-}
+
+-- api ------------------------------------------------------------------------
+
+-- | Are ARM NEON extensions available?
+base64_arm_available :: Bool
+base64_arm_available =
+ unsafeDupablePerformIO c_base64_arm_available /= 0
+{-# NOINLINE base64_arm_available #-}
+
+-- | Encode a base256 'ByteString' as base64 using NEON.
+encode :: BS.ByteString -> BS.ByteString
+encode (BI.PS sfp soff l) =
+ BI.unsafeCreate ((l + 2) `quot` 3 * 4) $ \dst ->
+ withForeignPtr sfp $ \sp0 ->
+ c_base64_encode (sp0 `plusPtr` soff) dst (fi l)
+
+-- | Decode a base64 'ByteString' to base256 using NEON. Returns
+-- 'Nothing' on malformed input.
+decode :: BS.ByteString -> Maybe BS.ByteString
+decode (BI.PS sfp soff l)
+ | l == 0 = Just BS.empty
+ | l .&. 0x03 /= 0 = Nothing
+ | otherwise = unsafeDupablePerformIO $
+ withForeignPtr sfp $ \sp0 -> do
+ let !sp = sp0 `plusPtr` soff :: Ptr Word8
+ c_pre <- peekElemOff sp (l - 2)
+ c_end <- peekElemOff sp (l - 1)
+ let !pad_pre = c_pre == 0x3D
+ !pad_end = c_end == 0x3D
+ if pad_pre && not pad_end
+ then pure Nothing
+ else do
+ let !pad = (if pad_pre then 2 else if pad_end then 1 else 0)
+ :: Int
+ !nfull = l `B.shiftR` 2
+ !outlen = nfull * 3 - pad
+ fp <- BI.mallocByteString outlen
+ ok <- withForeignPtr fp $ \dst ->
+ c_base64_decode sp dst (fi l) (fi outlen)
+ pure $! if ok /= 0
+ then Just (BI.PS fp 0 outlen)
+ else Nothing
diff --git a/ppad-base64.cabal b/ppad-base64.cabal
@@ -1,7 +1,7 @@
cabal-version: 3.0
name: ppad-base64
version: 0.1.0
-synopsis: Pure base64 encoding and decoding on bytestrings.
+synopsis: Fast base64 encoding and decoding on bytestrings.
license: MIT
license-file: LICENSE
author: Jared Tobin
@@ -11,13 +11,18 @@ build-type: Simple
tested-with: GHC == { 9.10.3 }
extra-doc-files: CHANGELOG
description:
- Pure base64 encoding and decoding on bytestrings.
+ Fast base64 encoding and decoding on bytestrings.
flag llvm
description: Use GHC's LLVM backend.
default: False
manual: True
+flag sanitize
+ description: Build with AddressSanitizer and UndefinedBehaviorSanitizer.
+ default: False
+ manual: True
+
source-repository head
type: git
location: git.ppad.tech/base64.git
@@ -31,9 +36,17 @@ library
ghc-options: -fllvm -O2
exposed-modules:
Data.ByteString.Base64
+ Data.ByteString.Base64.Arm
build-depends:
base >= 4.9 && < 5
, bytestring >= 0.9 && < 0.13
+ c-sources:
+ cbits/base64_arm.c
+ if arch(aarch64)
+ cc-options: -march=armv8-a
+ if flag(sanitize)
+ cc-options: -fsanitize=address,undefined -fno-omit-frame-pointer
+ ghc-options: -optl=-fsanitize=address,undefined
test-suite base64-tests
type: exitcode-stdio-1.0
@@ -43,6 +56,8 @@ test-suite base64-tests
ghc-options:
-rtsopts -Wall -O2
+ if flag(sanitize)
+ ghc-options: -optl=-fsanitize=address,undefined
build-depends:
base