base64, branch HEAD

release: v0.1.0

2026-05-16T15:41:15Z

commit 690785db542a958976fe289044f612a240a2cc9e
parent 3d622446b5ee3af52511cc9770895cb1acf4d940
Author: Jared Tobin
Date: Sat, 16 May 2026 13:11:15 -0230

release: v0.1.0

Merge branch 'perf-refactor'

2026-05-16T15:34:11Z

commit 3d622446b5ee3af52511cc9770895cb1acf4d940
parent b4dd9ff6c285bfb9db834cdcca3d460688c3297d
Author: Jared Tobin
Date: Sat, 16 May 2026 13:04:11 -0230

Merge branch 'perf-refactor'

Performance refactor + ARM NEON intrinsics, mirroring the analogous work merged to ppad-base16 master. Five commits, organized as two logical changes:

1. Drop the bytestring 'Builder' pipeline in favour of 'BI.unsafeCreate' plus two static-rodata lookup tables (encode alphabet + decode table), with the 0x40-offset trick keeping the decode table's string literal NUL-free so it lives in rodata via the bytestring IsString rewrite. Encode falls from ~2.3 μs to ~270 ns on 1 KiB inputs.

2. Add an aarch64 NEON kernel in 'cbits/base64_arm.c' exposed via the new 'Data.ByteString.Base64.Arm' module:

   * Encode kernel processes 12 input bytes -> 16 output chars per iteration via a vqtbl1q_u8 shuffle, four parallel u32 shifts + masks, and a vqtbl4q_u8 alphabet lookup.
   * Decode kernel processes 16 input chars -> 12 output bytes per iteration. Range-compare validation with OR-accumulated 'bad' masks, per-u32-lane 24-bit pack, vqtbl1q_u8 reorder to BE triplets.

   The Haskell side hands the C kernel both inlen and outlen; padding detection and the padded final quartet (including RFC 4648 §3.5 non-data-bit validation) are handled in C for symmetry with encode.

'Data.ByteString.Base64.encode' and 'decode' dispatch to the NEON path when 'base64_arm_available' returns true, falling back to the scalar path otherwise.

Cabal adds the C sources, an aarch64 '-march=armv8-a' cc-option, and a 'sanitize' flag for ASan + UBSan builds.

Performance on 1 KiB inputs, M4 MacBook Air, GHC 9.10.3 + LLVM 19, 'cabal bench -f+llvm':

    encode time: 2.279 μs -> 102 ns (~22×)
    decode time: 649.2 ns -> 160 ns (~4×)

The existing tasty suite (5000 QuickCheck cases × 3 properties + the RFC 4648 §10 unit vectors) passes through the dispatched path under 'cabal test', 'cabal test -fllvm', and 'cabal test -fsanitize'.

Also rebrands the cabal/flake/README descriptions from "Pure" to "Fast" to reflect that the hot path is no longer purely Haskell.
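
The decode steps named in the merge message (range-compare validation, an OR-accumulated 'bad' mask checked once at the end, and a 24-bit pack per quartet) can be modelled in scalar C. This is a sketch only, not the NEON kernel: the names sketch_ascii_to_b64 and sketch_decode_body are hypothetical, and padding is assumed to be handled elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Map one ASCII byte to its 6-bit value using range compares plus an
 * additive offset. Sets bit 7 of *bad for any byte outside the alphabet
 * (including '='). Hypothetical helper, not the kernel's function. */
static uint8_t sketch_ascii_to_b64(uint8_t c, uint8_t *bad) {
    if (c >= 'A' && c <= 'Z') return (uint8_t)(c - 'A');       /*  0..25 */
    if (c >= 'a' && c <= 'z') return (uint8_t)(c - 'a' + 26);  /* 26..51 */
    if (c >= '0' && c <= '9') return (uint8_t)(c - '0' + 52);  /* 52..61 */
    if (c == '+') return 62;
    if (c == '/') return 63;
    *bad |= 0x80;
    return 0;
}

/* Decode n unpadded quartets (4*n chars) into 3*n bytes.
 * Invalid input is not detected per byte: the 'bad' flag is OR-folded
 * across the whole loop and tested once. Returns 1 on success, 0 if any
 * byte was invalid. */
static int sketch_decode_body(const uint8_t *in, size_t n, uint8_t *out) {
    uint8_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t v = 0;  /* four 6-bit values packed into 24 bits */
        for (int k = 0; k < 4; k++)
            v = (v << 6) | sketch_ascii_to_b64(in[4 * i + k], &bad);
        out[3 * i + 0] = (uint8_t)(v >> 16);  /* big-endian triplet */
        out[3 * i + 1] = (uint8_t)(v >> 8);
        out[3 * i + 2] = (uint8_t)v;
    }
    return bad == 0;
}
```

Deferring the validity test to a single check after the loop keeps the loop body branch-light, which is the property the vectorised version relies on when it reduces its accumulated mask once.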

readme: ARM intrinsics note + bench figures

2026-05-16T15:31:38Z

commit f606ebee8dd5da7c25005f5c0e7c7bdefc20f52f
parent 0e83ab9538f4c79593e70ae5d338c349fdfe0e8b
Author: Jared Tobin
Date: Sat, 16 May 2026 13:01:38 -0230

readme: ARM intrinsics note + bench figures

Update README tagline from "Pure" to "Fast" and rewrite the Performance section to note hardware acceleration via ARM NEON intrinsics.

New 1 KiB criterion figures from an M4 MacBook Air, GHC 9.10.3 + LLVM 19, -fllvm:

    encode time: 2.279 μs -> 102 ns (~22×)
    decode time: 649.2 ns -> 160 ns (~4×)

meta: rebrand from Pure to Fast

2026-05-16T15:31:30Z

commit 0e83ab9538f4c79593e70ae5d338c349fdfe0e8b
parent e01f8d10d9bafcab783a6a3bce9ae1b31d6223b6
Author: Jared Tobin
Date: Sat, 16 May 2026 13:01:30 -0230

meta: rebrand from Pure to Fast

Now that the library uses ARM NEON intrinsics for the hot path (when available) it's no longer purely Haskell. Update the cabal synopsis/description and flake.nix description accordingly.

lib: dispatch encode/decode to ARM NEON when available

2026-05-16T15:30:26Z

commit e01f8d10d9bafcab783a6a3bce9ae1b31d6223b6
parent 72fa80fdb1438d0d10e0f536558afd2ddbd593c8
Author: Jared Tobin
Date: Sat, 16 May 2026 13:00:26 -0230

lib: dispatch encode/decode to ARM NEON when available

Wire 'Data.ByteString.Base64.encode' and 'decode' to the NEON implementation added in the previous commit, with the pure Haskell scalar loop kept as a fallback. Mirrors the dispatch pattern in ppad-base16 / ppad-sha256:

    encode bs
      | Arm.base64_arm_available = Arm.encode bs
      | otherwise                = encode_scalar bs

No behavioural change beyond dispatch: on aarch64 the NEON path is taken, on every other arch the C stubs return availability = 0 and the scalar bodies run.

Existing tasty suite (5000 QuickCheck cases × 3 properties + the RFC 4648 §10 unit vectors) passes through the dispatched path, including under 'cabal test -fllvm -fsanitize' which exercises the C kernel under AddressSanitizer + UndefinedBehaviorSanitizer.

Performance on 1 KiB inputs, M4 MacBook Air, GHC 9.10.3 + LLVM 19, -fllvm:

    encode time: 270 ns -> 102 ns (~2.6×)
    decode time: 273 ns -> 160 ns (~1.7×)
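
The availability-gated dispatch can be illustrated from the C side. The following is a minimal sketch assuming a compile-time probe; the names sketch_arm_available and sketch_select_path are hypothetical and the actual symbol names in 'cbits/base64_arm.c' are not given in the commit message. The cached static mimics the "query once" behaviour the Haskell CAF provides.

```c
#include <stddef.h>

typedef enum { SKETCH_PATH_SCALAR = 0, SKETCH_PATH_NEON = 1 } sketch_path;

/* Hypothetical probe: 1 when compiled for aarch64, 0 on every other
 * architecture, where only a stub would be linked in. */
static int sketch_arm_available(void) {
#if defined(__aarch64__)
    return 1;
#else
    return 0;
#endif
}

/* Select the implementation, querying the probe only on first use. */
static sketch_path sketch_select_path(void) {
    static int cached = -1;  /* -1 means "not yet queried" */
    if (cached < 0)
        cached = sketch_arm_available();
    return cached ? SKETCH_PATH_NEON : SKETCH_PATH_SCALAR;
}
```

Because the probe is resolved by the preprocessor, a non-aarch64 build carries no NEON code at all and always reports the scalar path.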

lib: add ARM NEON implementation

2026-05-16T15:28:58Z

commit 72fa80fdb1438d0d10e0f536558afd2ddbd593c8
parent d9c21f51a123552c70e582d98e14593860259889
Author: Jared Tobin
Date: Sat, 16 May 2026 12:58:58 -0230

lib: add ARM NEON implementation

Mirror ppad-base16's arm-neon branch. Add an aarch64 NEON kernel for base64 encode and decode in a small C file with intrinsics gated by '#if defined(__aarch64__)' + stubs in the '#else' branch, exposed to Haskell via 'foreign import ccall unsafe' in a new module 'Data.ByteString.Base64.Arm'.

The C kernel:

* Encode processes 12 input bytes per NEON iteration. 'vld1q_u8' loads 16 bytes (the 4-byte over-read is safe under the loop bound); 'vqtbl1q_u8' with a fixed shuffle gathers each 4-byte output lane as [b1, b0, b2, b1], the order that lets four 'vshrq_n_u32 + vandq_u32' pairs extract the six-bit indices i0..i3 directly into byte slots; 'vqtbl4q_u8' looks each index up in the 64-byte alphabet table; one 'vst1q_u8' stores all 16 output chars. A scalar tail finishes any full triplet that fell outside the NEON cut-off, then a final branch emits the 0/1/2-byte padded tail.

* Decode processes 16 input chars per NEON iteration. 'ascii_to_b64' validates each lane with byte-range compares and yields its 6-bit value via an additive offset; the per-iter 'bad' masks are OR-accumulated and reduced once at the end with 'vmaxvq_u8'. Each u32 lane packs four 6-bit values into a 24-bit V; 'vqtbl1q_u8' reorders V's LE bytes into BE triplets, giving 12 valid output bytes in the low 12 lanes; 'vst1q_u8' stores 16 with the loop bound keeping the 4-byte overrun inside the allocated buffer. A scalar tail handles the remaining body quartets, then the padded final quartet (1- or 2-byte output) is decoded explicitly with non-data-bit checks per RFC 4648 §3.5.

The Haskell wrapper:

* 'base64_arm_available :: Bool' NOINLINE CAF queries the C-side availability probe once; returns 'True' on aarch64, 'False' on every other arch (where the C stubs are linked in).
* 'encode' wraps 'BI.unsafeCreate'; 'decode' computes the padded outlen up front, allocates with 'BI.mallocByteString', and passes both inlen and outlen to the C kernel.
* 'OPTIONS_HADDOCK hide' keeps the module out of public docs.

Cabal:

* 'c-sources: cbits/base64_arm.c' compiles the kernel into the library on every platform; the '#if'-gated body means the contributed code is empty on non-aarch64.
* 'if arch(aarch64) cc-options: -march=armv8-a' pins the target to baseline armv8.
* New 'sanitize' flag adds '-fsanitize=address,undefined -fno-omit-frame-pointer' to both the C source and the test-suite link, mirroring ppad-base16 and ppad-sha256. Built with 'cabal test -fllvm -fsanitize'.
* 'Data.ByteString.Base64.Arm' added to 'exposed-modules' so consumers can call the NEON path directly if they want to bypass dispatch.

No call sites in 'Data.ByteString.Base64' wired yet; the existing tasty + criterion suites still go through the scalar path after this commit, and pass unchanged (verified under cabal test, cabal test -fllvm, and cabal test -fsanitize).
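
To see why the [b1, b0, b2, b1] lane order is convenient for encoding, here is a scalar model of a single u32 lane. It is a sketch under stated assumptions: the lane is treated as little-endian, the shift amounts are derived here rather than copied from the kernel, and the indices are extracted to plain integers instead of being placed into byte slots as the NEON code does. The names sketch_alphabet and sketch_encode_lane are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static const char sketch_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode one 3-byte group through a little-endian 32-bit lane whose
 * bytes are laid out as [b1, b0, b2, b1]. With that layout every 6-bit
 * index is a contiguous bit field of the lane, so one shift and one
 * mask per index is enough. */
static void sketch_encode_lane(const uint8_t b[3], char out[4]) {
    uint32_t w = (uint32_t)b[1]           /* byte 0: b1 */
               | ((uint32_t)b[0] << 8)    /* byte 1: b0 */
               | ((uint32_t)b[2] << 16)   /* byte 2: b2 */
               | ((uint32_t)b[1] << 24);  /* byte 3: b1 again */

    out[0] = sketch_alphabet[(w >> 10) & 0x3f];  /* b0[7:2]           */
    out[1] = sketch_alphabet[(w >> 4)  & 0x3f];  /* b0[1:0] : b1[7:4] */
    out[2] = sketch_alphabet[(w >> 22) & 0x3f];  /* b1[3:0] : b2[7:6] */
    out[3] = sketch_alphabet[(w >> 16) & 0x3f];  /* b2[5:0]           */
}
```

Duplicating b1 is what makes the two middle indices, which both straddle b1, reachable without combining separate shifted pieces.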

lib: drop bytestring builder, use unsafeCreate + lookup tables

2026-05-16T15:17:33Z

commit d9c21f51a123552c70e582d98e14593860259889
parent b4dd9ff6c285bfb9db834cdcca3d460688c3297d
Author: Jared Tobin
Date: Sat, 16 May 2026 12:47:33 -0230

lib: drop bytestring builder, use unsafeCreate + lookup tables

Mirror ppad-base16's perf-refactor.

* enc_tab is the 64-byte alphabet, indexed by 6-bit value.
* dec_tab is a 256-byte table mapping each ASCII byte to its 6-bit value (offset by 0x40, in the range 0x40..0x7F) or 0x80 for any invalid byte (including '='). The offset keeps the literal NUL-free so it lives in static rodata via the bytestring IsString rewrite.
* Decode OR-folds every lookup into an accumulator and tests 'acc .&. 0x80 == 0' once at the end, mirroring base16's bit-5 sentinel trick.
* encode_scalar walks 3 input bytes at a time via direct pointer ops in BI.unsafeCreate; final 1- or 2-byte tail emits padding.
* decode_scalar peels off the padded final quartet, runs a tight body loop, then validates non-data bits per RFC 4648 §3.5.

Encode falls from ~2.3 μs to ~270 ns on 1 KiB inputs under -fllvm.
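
The 0x40-offset decode table and the OR-folded validity check can be sketched in C as follows. This is an illustration of the technique, not a port of the Haskell code: the table is built at runtime here rather than embedded as a string literal, and the names sketch_build_dec_tab and sketch_decode_tab are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-entry table: valid characters map to 0x40 | value (so the
 * range 0x40..0x7F), everything else to the sentinel 0x80. No entry is
 * ever zero, which is the property that keeps the literal NUL-free. */
static void sketch_build_dec_tab(uint8_t tab[256]) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (int i = 0; i < 256; i++)
        tab[i] = 0x80;
    for (int i = 0; i < 64; i++)
        tab[(uint8_t)alphabet[i]] = (uint8_t)(0x40 | i);
}

/* Decode n unpadded quartets. Every lookup is OR-folded into acc and
 * bit 7 is tested once after the loop. Returns 1 if all bytes were
 * valid, 0 otherwise. */
static int sketch_decode_tab(const uint8_t tab[256], const uint8_t *in,
                             size_t n, uint8_t *out) {
    uint8_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t v = 0;
        for (int k = 0; k < 4; k++) {
            uint8_t t = tab[in[4 * i + k]];
            acc |= t;                  /* remember any 0x80 sentinel */
            v = (v << 6) | (t & 0x3f); /* strip the 0x40 offset */
        }
        out[3 * i + 0] = (uint8_t)(v >> 16);
        out[3 * i + 1] = (uint8_t)(v >> 8);
        out[3 * i + 2] = (uint8_t)v;
    }
    return (acc & 0x80) == 0;
}
```

Valid entries never set bit 7, so a single sentinel anywhere in the input is enough to fail the final test.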

meta: benchmark figures from m4 macbook air

2026-05-16T14:16:33Z

commit b4dd9ff6c285bfb9db834cdcca3d460688c3297d
parent 5a89ef39a87510cfb42fef8356e1efd26d2c1f2e
Author: Jared Tobin
Date: Sat, 16 May 2026 11:46:33 -0230

meta: benchmark figures from m4 macbook air

Captured with cabal bench -f+llvm on an Apple M4 MacBook Air, GHC 9.10.3 with the LLVM backend, on a 1024-byte input.

meta: align README title with package name

2026-05-16T14:16:20Z

commit 5a89ef39a87510cfb42fef8356e1efd26d2c1f2e
parent 011d1f446a94c0ac72eb372accdc6951b5d797ea
Author: Jared Tobin
Date: Sat, 16 May 2026 11:46:20 -0230

meta: align README title with package name

Use "ppad-base64" instead of "base64" to match the cabal package name.

bench: criterion and weigh suites

2026-05-16T14:08:25Z

commit 011d1f446a94c0ac72eb372accdc6951b5d797ea
parent d4c704d005ceedbac7cb11b3b7abec818a22bdb2
Author: Jared Tobin
Date: Sat, 16 May 2026 11:38:25 -0230

bench: criterion and weigh suites

Criterion bench for encode (1024B) and decode (1024-char input), plus opt-in groups comparing against base64-bytestring and base64. Weigh suite measures allocation on a ~1 KB string against the same two references.

test: property tests and RFC vectors

2026-05-16T14:08:19Z

commit d4c704d005ceedbac7cb11b3b7abec818a22bdb2
parent c84cc9b184e71f455d0cd8d6b829f20f34bf232b
Author: Jared Tobin
Date: Sat, 16 May 2026 11:38:19 -0230

test: property tests and RFC vectors

QuickCheck properties (5000 iters each) for decode-inverts-encode and agreement with base64-bytestring on both encode and decode. Unit test covers the seven RFC 4648 §10 vectors ("", "f", "fo", "foo", "foob", "fooba", "foobar"), checking both directions.
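
For reference, the seven RFC 4648 §10 vectors and their expected encodings can be checked against a small scalar encoder. The encoder below is a plain reference sketch written for this illustration; it is not the library's encoder or its test suite, and the names sketch_encode, sketch_rfc4648_vectors and sketch_check_vectors are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const char sketch_b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Reference encoder: 3 input bytes -> 4 chars, with '=' padding for a
 * 1- or 2-byte tail. out must hold 4 * ceil(n / 3) + 1 bytes. */
static void sketch_encode(const uint8_t *in, size_t n, char *out) {
    size_t i = 0, o = 0;
    for (; i + 3 <= n; i += 3) {
        uint32_t v = ((uint32_t)in[i] << 16) | ((uint32_t)in[i + 1] << 8)
                   | (uint32_t)in[i + 2];
        out[o++] = sketch_b64[(v >> 18) & 0x3f];
        out[o++] = sketch_b64[(v >> 12) & 0x3f];
        out[o++] = sketch_b64[(v >> 6) & 0x3f];
        out[o++] = sketch_b64[v & 0x3f];
    }
    if (n - i == 1) {            /* one byte left: two chars + "==" */
        uint32_t v = (uint32_t)in[i] << 16;
        out[o++] = sketch_b64[(v >> 18) & 0x3f];
        out[o++] = sketch_b64[(v >> 12) & 0x3f];
        out[o++] = '=';
        out[o++] = '=';
    } else if (n - i == 2) {     /* two bytes left: three chars + "=" */
        uint32_t v = ((uint32_t)in[i] << 16) | ((uint32_t)in[i + 1] << 8);
        out[o++] = sketch_b64[(v >> 18) & 0x3f];
        out[o++] = sketch_b64[(v >> 12) & 0x3f];
        out[o++] = sketch_b64[(v >> 6) & 0x3f];
        out[o++] = '=';
    }
    out[o] = '\0';
}

struct sketch_vector { const char *plain; const char *encoded; };

/* The seven test vectors from RFC 4648 section 10. */
static const struct sketch_vector sketch_rfc4648_vectors[7] = {
    {"", ""},             {"f", "Zg=="},         {"fo", "Zm8="},
    {"foo", "Zm9v"},      {"foob", "Zm9vYg=="},  {"fooba", "Zm9vYmE="},
    {"foobar", "Zm9vYmFy"},
};

/* Returns how many vectors the reference encoder reproduces. */
static int sketch_check_vectors(void) {
    int ok = 0;
    char buf[16];
    for (size_t i = 0; i < 7; i++) {
        const struct sketch_vector *v = &sketch_rfc4648_vectors[i];
        sketch_encode((const uint8_t *)v->plain, strlen(v->plain), buf);
        if (strcmp(buf, v->encoded) == 0)
            ok++;
    }
    return ok;
}
```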

lib: base64 encoding and decoding

2026-05-16T14:08:09Z

commit c84cc9b184e71f455d0cd8d6b829f20f34bf232b
parent 634f91042b13e9512fa8db4c2191bcf3e4a3f18c
Author: Jared Tobin
Date: Sat, 16 May 2026 11:38:09 -0230

lib: base64 encoding and decoding

Standard RFC 4648 §4 base64 (charset A-Za-z0-9+/, '=' padding).

Strict decode: rejects unpadded inputs, non-multiple-of-4 lengths, invalid characters, and non-canonical encodings (non-zero non-data bits in the final quartet, per RFC 4648 §3.5).

Encode dispatches over 'l rem 6' into six arms using go64 (6 bytes → word64BE), go32 (3 bytes → word32BE), and tail1/tail2 for the final padded quartet.

Decode peels off the final 4-char quartet, then processes the body in chunks of 32/16/8/4 chars writing 3·word64BE, word64BE+word32BE, word32BE+word16BE, or word16BE+word8.
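
The canonical-encoding rule for the final quartet can be shown with a short C sketch. It assumes the strictness described above (padding required, non-data bits must be zero) and is written for illustration only; sketch_sextet and sketch_decode_final are hypothetical names, not functions from the library.

```c
#include <stddef.h>
#include <stdint.h>

/* 6-bit value of a base64 character, or -1 if it is not in the alphabet. */
static int sketch_sextet(uint8_t c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode the final quartet. Returns the number of bytes written (1..3),
 * or -1 for an invalid or non-canonical quartet. */
static int sketch_decode_final(const uint8_t q[4], uint8_t out[3]) {
    int a = sketch_sextet(q[0]);
    int b = sketch_sextet(q[1]);
    if (a < 0 || b < 0) return -1;

    if (q[2] == '=') {                 /* "xx==": one output byte */
        if (q[3] != '=') return -1;
        if (b & 0x0f) return -1;       /* low 4 bits of b carry no data */
        out[0] = (uint8_t)((a << 2) | (b >> 4));
        return 1;
    }

    int c = sketch_sextet(q[2]);
    if (c < 0) return -1;

    if (q[3] == '=') {                 /* "xxx=": two output bytes */
        if (c & 0x03) return -1;       /* low 2 bits of c carry no data */
        out[0] = (uint8_t)((a << 2) | (b >> 4));
        out[1] = (uint8_t)(((b & 0x0f) << 4) | (c >> 2));
        return 2;
    }

    int d = sketch_sextet(q[3]);       /* "xxxx": three output bytes */
    if (d < 0) return -1;
    out[0] = (uint8_t)((a << 2) | (b >> 4));
    out[1] = (uint8_t)(((b & 0x0f) << 4) | (c >> 2));
    out[2] = (uint8_t)(((c & 0x03) << 6) | d);
    return 3;
}
```

For example, "Zg==" decodes to "f", while "Zh==" is rejected: 'h' has the value 33, whose low four bits are non-zero even though they encode no data.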

meta: initial scaffolding

2026-05-16T14:07:21Z

commit 634f91042b13e9512fa8db4c2191bcf3e4a3f18c
Author: Jared Tobin
Date: Sat, 16 May 2026 11:37:21 -0230

meta: initial scaffolding

Mirror ppad-base16 (master, v0.2.1) project layout: LICENSE, .ghci, .gitignore, CHANGELOG, README, flake.nix/lock, and cabal file. Library set up to expose Data.ByteString.Base64 with the same llvm flag and dep bounds as ppad-base16.