This document provides a detailed breakdown of the internal architectural mechanics of FuzzyTok, the design of its custom OpenAI Triton GPU kernels, and a formal mathematical proof of its autoregressive causality.
FuzzyTok bypasses discrete vocabulary tables (e.g., BPE, WordPiece) by mapping raw UTF-8 byte sequences directly into a continuous latent space.
Input: [B, T] (Raw UTF-8 Byte IDs)
│
▼ (Byte Embedding Matrix: 256 x C)
[B, C, T] (Initial Continuous Channels)
│
▼ (Residual Block 1: Dilations = [1, 2, 4], Kernel = 3)
[B, C, T] (Receptive Field = 15 characters)
│
▼ (Residual Block 2: Dilations = [1, 2, 4], Kernel = 3)
[B, C, T] (Receptive Field = 43 characters)
│
▼ (LayerNorm + Linear Projection)
[B, T, D] (Output Continuous Latent Space)
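The first stage of the diagram, turning raw text into byte IDs and looking them up in a 256 x C table, can be sketched in plain Python. This is only an illustration: the function names `utf8_byte_ids`, `make_embedding_table`, and `embed` are hypothetical, random values stand in for learned weights, and the batch dimension is omitted.

```python
import random


def utf8_byte_ids(text: str) -> list[int]:
    """Return the raw UTF-8 byte values (each in 0..255) of `text`."""
    return list(text.encode("utf-8"))


def make_embedding_table(channels: int, seed: int = 0) -> list[list[float]]:
    """Hypothetical 256 x C table; random values stand in for learned weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(256)]


def embed(byte_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Look up each byte ID, then transpose [T, C] -> [C, T] as in the diagram."""
    rows = [table[b] for b in byte_ids]        # [T, C]
    return [list(col) for col in zip(*rows)]   # [C, T]


ids = utf8_byte_ids("héllo")
table = make_embedding_table(channels=4)
channels_first = embed(ids, table)

print(ids)                                          # [104, 195, 169, 108, 108, 111]
print(len(channels_first), len(channels_first[0]))  # 4 6
```

Note that the five-character string produces six timesteps, because "é" occupies two bytes in UTF-8. Sequence length T therefore counts bytes, not characters.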
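For a single 1D convolutional layer with a kernel size $k$ and a dilation factor $d$, the temporal receptive field is given by:

$$R_{\text{layer}} = 1 + (k - 1)\,d$$

For a residual block composed of $L$ convolutional layers with dilation factors $d_1, \dots, d_L$, the cumulative receptive field of the block is:

$$R_{\text{block}} = 1 + \sum_{i=1}^{L} (k - 1)\,d_i$$

For a network stacking $N$ residual blocks, the global receptive field ($R_{\text{global}}$) over the input sequence is:

$$R_{\text{global}} = 1 + N \sum_{i=1}^{L} (k - 1)\,d_i$$

In our default benchmark configuration ($k = 3$, $L = 3$ layers with dilations $[1, 2, 4]$, and $N = 3$ blocks):

$$R_{\text{global}} = 1 + 3 \cdot 2 \cdot (1 + 2 + 4) = 43$$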
Note: For the 3-level dilated benchmark stack running across multiple blocks, the receptive field covers up to 43 characters, allowing the tokenizer to model full word stems and morphological prefixes/suffixes dynamically.
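The receptive-field arithmetic can be checked with a few lines of Python. The sketch below assumes the usual rule for stacked dilated convolutions, where a layer with kernel size k and dilation d widens the field by (k - 1) * d, and it assumes one convolution per dilation in each block. The function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size: int, dilations: list[int], num_blocks: int) -> int:
    """Receptive field of `num_blocks` identical blocks of dilated causal convolutions."""
    # Each layer adds (kernel_size - 1) * dilation positions of left context.
    per_block = sum((kernel_size - 1) * d for d in dilations)
    return 1 + num_blocks * per_block


for n in (1, 2, 3):
    print(n, receptive_field(3, [1, 2, 4], n))
# 1 15
# 2 29
# 3 43
```

Under these assumptions a single block covers 15 positions, and the 43-character figure corresponds to three stacked blocks.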
Standard PyTorch implementations of causal convolutions suffer from severe memory bandwidth limitations. Since the channel and temporal dimensions are large, reloading the activation tensors from GPU Global Memory (VRAM) to the shared memory (SRAM) for every channel iteration creates a memory bottleneck.
To mitigate this, FuzzyTok implements custom Triton kernels featuring 2D Channel Tiling and Fused Activation Memory Layouts.
The program grid is defined as:
- Each program computes an output tile of shape [BLOCK_C_OUT, BLOCK_T].
- Input channels are reduced in chunks of BLOCK_C_IN (set to 32 to prevent register spilling on Turing SMs).

To maximize GPU occupancy on hardware like the NVIDIA Tesla T4 (Turing architecture), block sizes must be chosen to avoid register spilling, which occurs when a kernel needs more registers than the SM's physical register file can supply and the compiler is forced to dump intermediate variables to slower local memory backed by VRAM.

- With BLOCK_C_OUT = 32 and BLOCK_T = 128, the register matrix requires 32 × 128 = 4,096 float32 elements. This fits comfortably within the register file limits.
- tl.static_range(KERNEL_SIZE) forces the compiler to fully unroll the inner loop over filter spatial coordinates at compile time, eliminating branch overhead and allowing memory fetches to be pipelined.

To prevent out-of-bounds pointer calculations and GPU thread divergence during the backward-pass temporal offset calculation, we implement clamping at the pointer calculation level:
causal_mask = src_t >= 0
clamped_t = tl.where(causal_mask, src_t, 0)
The clamped temporal indices keep every computed pointer in bounds, preventing illegal memory accesses, while valid_mask ensures that out-of-bounds positions are zeroed at the register level during loading.
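The clamp-then-mask pattern can be emulated on the CPU to see what each lane of a tile would read. This is a plain-Python sketch, not the Triton kernel itself: the helper names `causal_tap_indices` and `masked_causal_load` are hypothetical, the index convention `t - (K - 1 - tap) * dilation` is an assumption about how source positions are derived, and only the lower bound (`src_t >= 0`) is handled, matching the snippet above.

```python
def causal_tap_indices(block_start: int, block_t: int, tap: int,
                       kernel_size: int, dilation: int) -> list[int]:
    """Source timesteps read by one filter tap for a tile of `block_t` outputs."""
    shift = (kernel_size - 1 - tap) * dilation  # assumed causal offset convention
    return [block_start + i - shift for i in range(block_t)]


def masked_causal_load(row: list[float], src_t: list[int]) -> list[float]:
    """Emulate a masked load: clamp each index, then zero the masked-out lanes."""
    out = []
    for t in src_t:
        causal_mask = t >= 0
        clamped_t = t if causal_mask else 0         # mirrors tl.where(causal_mask, src_t, 0)
        value = row[clamped_t]                      # the address is always in bounds
        out.append(value if causal_mask else 0.0)   # mirrors a load with mask=..., other=0.0
    return out


row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
idx = causal_tap_indices(block_start=0, block_t=6, tap=0, kernel_size=3, dilation=2)

print(idx)                           # [-4, -3, -2, -1, 0, 1]
print(masked_causal_load(row, idx))  # [0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
```

The first four lanes fall before the start of the sequence, so they are read from the clamped index 0 and then zeroed. Every lane executes the same instructions, which is the property that avoids divergence on the GPU.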
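An autoregressive language model requires that the prediction of token $x_{t+1}$ depends only on the history $x_{\le t} = (x_0, \dots, x_t)$. Therefore, the representation $z_t$ output by the tokenizer must be strictly causal:

$$z_t = F(x_{\le t})$$

We prove this causality by induction over the convolutional layers.

Let $h^{(0)}_t$ be the initial character embedding at timestep $t$. Since the embedding function $E$ is applied pointwise:

$$h^{(0)}_t = E(x_t)$$

so $h^{(0)}_t$ depends only on $x_t$, and hence only on $x_{\le t}$.

Assume that for a layer $l$, the activation $h^{(l)}_t$ at any timestep $t$ depends only on the history $x_{\le t}$:

$$h^{(l)}_t = f^{(l)}(x_{\le t})$$

The output of the causal dilated convolutional layer $l+1$, with kernel size $k$ and dilation $d$, at channel $c$ and timestep $t$ is defined as:

$$h^{(l+1)}_{c,t} = b_c + \sum_{c'} \sum_{j=0}^{k-1} W_{c,c',j}\, h^{(l)}_{c',\,t-(k-1-j)d}$$

where terms with a negative temporal index are zero padding and depend on no input.

Since $k \ge 1$, $d \ge 1$, and $0 \le j \le k-1$, the temporal index of the terms in the sum is $t' = t-(k-1-j)d$. Because $(k-1-j)d \ge 0$, we have:

$$t' \le t$$

By our induction hypothesis, $h^{(l)}_{c',t'}$ depends only on $x_{\le t'}$. Since $t' \le t$, then:

$$\{x_0, \dots, x_{t'}\} \subseteq \{x_0, \dots, x_t\}$$

Therefore, every term in the convolution sum depends only on the inputs up to timestep $t$. Thus, the linear combination $h^{(l+1)}_{c,t}$ depends only on $x_{\le t}$:

$$h^{(l+1)}_{c,t} = f^{(l+1)}(x_{\le t})$$

This completes the proof. Since residual connections, LayerNorm, and linear projection are all pointwise operations in the temporal dimension, they preserve causality. The final output representation $z_t$ depends only on $x_{\le t}$.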
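The causality property can also be checked empirically: perturb only the inputs after some timestep and confirm that earlier outputs do not move. The sketch below is a plain-Python reference, not the FuzzyTok implementation. It assumes left zero padding and residual connections, omits LayerNorm and the final projection (both pointwise in time), and uses hypothetical helper names and random weights.

```python
import random


def causal_dilated_conv(x, weights, dilation):
    """x: [C_in][T], weights: [C_out][C_in][K]. Left zero padding keeps it causal."""
    k = len(weights[0][0])
    t_len = len(x[0])
    y = [[0.0] * t_len for _ in weights]
    for co, w_out in enumerate(weights):
        for t in range(t_len):
            acc = 0.0
            for ci, w_in in enumerate(w_out):
                for j in range(k):
                    src = t - (k - 1 - j) * dilation  # never exceeds t
                    if src >= 0:                      # src < 0 is zero padding
                        acc += w_in[j] * x[ci][src]
            y[co][t] = acc
    return y


def stack(x, layers):
    """Apply (weights, dilation) layers in order, each with a residual connection."""
    h = x
    for weights, dilation in layers:
        conv = causal_dilated_conv(h, weights, dilation)
        h = [[a + b for a, b in zip(h_row, c_row)] for h_row, c_row in zip(h, conv)]
    return h


def random_weights(rng, channels, kernel_size):
    return [[[rng.uniform(-1, 1) for _ in range(kernel_size)]
             for _ in range(channels)] for _ in range(channels)]


rng = random.Random(0)
C, T, K = 2, 12, 3
layers = [(random_weights(rng, C, K), d) for d in (1, 2, 4)]
x = [[rng.uniform(-1, 1) for _ in range(T)] for _ in range(C)]

cut = 7
x_perturbed = [row[:] for row in x]
for row in x_perturbed:
    for t in range(cut + 1, T):
        row[t] += 10.0  # change only the inputs after `cut`

y, y_perturbed = stack(x, layers), stack(x_perturbed, layers)
prefix_same = all(y[c][t] == y_perturbed[c][t] for c in range(C) for t in range(cut + 1))
suffix_changed = any(y[c][t] != y_perturbed[c][t] for c in range(C) for t in range(cut + 1, T))
print(prefix_same, suffix_changed)
```

Outputs at timesteps 0 through `cut` are bit-identical in both runs because they never read a perturbed input, which is the behavior the induction argument predicts.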
Measured in processed tokens per second:
====================================================================================
Config      │ FuzzyTok t/s │ BPE Conv t/s │ Transf t/s │ LSTM t/s │ Speedup (vs BPE)
====================================================================================
B=1 T=128   │       56,186 │      136,118 │    109,014 │   43,367 │ 0.41x
B=8 T=256   │      135,239 │      486,093 │    783,383 │  346,676 │ 0.28x
B=16 T=512  │      142,790 │      517,959 │    649,422 │  608,339 │ 0.28x
B=32 T=1024 │      195,930 │      715,643 │    459,770 │  840,924 │ 0.27x
B=64 T=1024 │      189,718 │      710,961 │    430,550 │  812,238 │ 0.27x
B=32 T=2048 │      184,247 │      649,549 │    213,294 │  810,042 │ 0.28x
B=64 T=2048 │          OOM │          --- │        --- │      --- │ ---
Config      │ FuzzyTok MB │ BPE Conv MB │ Transf MB │ LSTM MB │ VRAM Saving (vs BPE)
────────────────────────────────────────────────────────────────────────────────────
B=8 T=512   │        87.8 │        95.8 │     172.8 │   128.0 │ 8.3%
B=16 T=1024 │       144.0 │       232.5 │     763.8 │   220.3 │ 38.1%
B=32 T=1024 │       232.2 │       328.8 │    1452.0 │   348.8 │ 29.4%
B=32 T=2048 │       392.7 │       489.0 │    9240.2 │   573.1 │ 19.7%
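The derived columns in the two tables appear to be simple ratios against the BPE Conv baseline. The sketch below assumes that "Speedup" is FuzzyTok throughput divided by BPE Conv throughput, and that "VRAM Saving" is the relative reduction in memory. The function names are hypothetical, and the inputs are copied from the tables.

```python
def speedup(fuzzytok_tps: float, bpe_tps: float) -> float:
    """Throughput ratio; values below 1.0 mean FuzzyTok is slower than the baseline."""
    return fuzzytok_tps / bpe_tps


def vram_saving_pct(fuzzytok_mb: float, bpe_mb: float) -> float:
    """Relative memory reduction versus the baseline, in percent."""
    return 100.0 * (bpe_mb - fuzzytok_mb) / bpe_mb


print(f"{speedup(56_186, 136_118):.2f}x")      # 0.41x  (B=1 T=128)
print(f"{speedup(184_247, 649_549):.2f}x")     # 0.28x  (B=32 T=2048)
print(f"{vram_saving_pct(144.0, 232.5):.1f}%")  # 38.1%  (B=16 T=1024)
print(f"{vram_saving_pct(392.7, 489.0):.1f}%")  # 19.7%  (B=32 T=2048)
```

Read together, the two ratios describe the trade-off in these benchmarks: FuzzyTok runs at roughly 0.27x to 0.41x of the BPE Conv throughput while using 8% to 38% less memory.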
Values after 40 epochs on next-character autoregressive prediction:
0.4110 | Perplexity = 1.51
2.1257 | Perplexity = 8.38
2.4547 | Perplexity = 11.64
2.2320 | Perplexity = 9.32

Evaluates latent spaces' clustering of cognate word roots (intra-radical) vs. random roots (inter-radical):