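PyTorch usage#

PyTorch's normalization layers share a highly uniform interface. They all inherit from nn.Module, and each one does the same three things: compute statistics along some dimension, standardize, and optionally apply an affine transform.

How BN, LN, and RMS differ

They differ only in which axes are normalized over and which statistics are computed.

Once that is clear, the rest is a matter of looking up parameters and checking dimensions.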
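To make the "which axis, which statistics" distinction concrete, here is a minimal pure-Python sketch on a tiny (batch, features) matrix. The helper names (batch_norm_stats, layer_norm_rows, rms_norm_rows) are made up for illustration and are not PyTorch APIs; affine parameters are left out.

```python
import math

def batch_norm_stats(x):
    """BatchNorm view: one (mean, var) pair per feature column, taken across the batch."""
    n = len(x)
    cols = list(zip(*x))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return means, variances

def layer_norm_rows(x, eps=1e-5):
    """LayerNorm view: mean and variance per row (per sample), over the features."""
    out = []
    for row in x:
        k = len(row)
        mean = sum(row) / k
        var = sum((v - mean) ** 2 for v in row) / k
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def rms_norm_rows(x, eps=1e-5):
    """RMSNorm view: per row, only the mean of squares; no mean subtraction."""
    out = []
    for row in x:
        mean_sq = sum(v * v for v in row) / len(row)
        out.append([v / math.sqrt(mean_sq + eps) for v in row])
    return out

x = [[1.0, 2.0, 3.0],
     [4.0, 6.0, 8.0]]
bn_means, bn_vars = batch_norm_stats(x)   # statistics per column
ln_out = layer_norm_rows(x)               # each row ends up with zero mean
rms_out = rms_norm_rows(x)                # each row ends up with unit mean square
```

The same input yields column-wise statistics for BatchNorm and row-wise statistics for the other two, which is the whole difference between them.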

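BatchNorm#

torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, ...)
torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, ...)
torch.nn.BatchNorm3d(num_features, eps=1e-05, momentum=0.1, affine=True, ...)

BatchNorm comes in the three variants above. They differ in the input layout they expect: BatchNorm1d takes (N, C) or (N, C, L), BatchNorm2d takes (N, C, H, W), and BatchNorm3d takes (N, C, D, H, W). Basic usage looks like this: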

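import torch
import torch.nn as nn

def batchnorm_example():
    x = torch.rand(100, 16, 784)   # (N, C, L)
    layer = nn.BatchNorm1d(16)     # pass the channel count C
    out = layer(x)

    y = torch.rand(1, 16, 7, 7)    # (N, C, H, W)
    layer = nn.BatchNorm2d(16)     # pass the channel count C
    out = layer(y)

PyTorch requires the number of feature channels to be passed explicitly so that it can initialize the learnable parameters weight and bias internally; both have shape (C,). BatchNorm also chooses its behavior from the module's training flag: in training mode it normalizes with the current batch statistics and updates the running statistics, and in eval mode it normalizes with the running statistics. This is completely transparent to the caller.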

With the official API covered, here is how it can be implemented by hand:

import torch

def batchnorm(x, running_mean, running_var, weight, bias, training, momentum=0.1, eps=1e-5):
    # x.dim() returns the number of dimensions, e.g. 3 for a tensor of shape (3, 4, 5).
    # range(x.dim()) yields 0 .. x.dim() - 1, e.g. 0, 1, 2.
    # The list comprehension keeps every dimension index except 1 (the channel
    # dimension), so statistics are reduced over the batch and spatial dimensions.
    dims = [d for d in range(x.dim()) if d != 1]

    # [1, -1] is a list whose two elements are 1 and -1.
    # [1] * (x.dim() - 2) is a list of x.dim() - 2 ones.
    # shape is therefore [1, -1, 1, ..., 1]; it reshapes a (C,) vector so that
    # it broadcasts against x along the channel dimension.
    shape = [1, -1] + [1] * (x.dim() - 2)

    # Training mode
    if training:
        mean = x.mean(dim=dims, keepdim=True)                       # shape: (1, C, 1, 1)
        var_biased = x.var(dim=dims, keepdim=True, correction=0)    # biased estimate, normalizes the current batch
        var_unbiased = x.var(dim=dims, keepdim=True, correction=1)  # unbiased estimate, updates the running statistics

        with torch.no_grad():
            # running_mean and running_var have shape (C,), so flatten the
            # (1, C, 1, ...) statistics before the in-place update.
            running_mean.copy_((1 - momentum) * running_mean + momentum * mean.reshape(-1))
            running_var.copy_((1 - momentum) * running_var + momentum * var_unbiased.reshape(-1))
        var = var_biased
    else:
        mean = running_mean.view(shape)
        var = running_var.view(shape)

    # Normalize
    x_norm = (x - mean) / torch.sqrt(var + eps)
    # Affine transform: y = γx + β
    if weight is not None:
        w = weight.view(shape)
        b = bias.view(shape)
        return x_norm * w + b
    return x_norm

The key points are annotated in the comments above.

Performance is always worth keeping in mind, so here is an optimized version in which mean and var_unbiased are computed together in a single call:

import torch

def manual_batchnorm_v2(x, running_mean, running_var, weight, bias, training, momentum=0.1, eps=1e-5):
    dims = [d for d in range(x.dim()) if d != 1]
    shape = [1, -1] + [1] * (x.dim() - 2)

    if training:
        # Number of elements that take part in the reduction
        n = 1
        for d in dims:
            n *= x.shape[d]

        # Compute the mean and the (unbiased) variance in one call
        var_unbiased, mean = torch.var_mean(x, dim=dims, keepdim=True, correction=1)
        # Convert to the biased variance used for normalization
        var_biased = var_unbiased * ((n - 1) / n)

        with torch.no_grad():
            running_mean.copy_((1 - momentum) * running_mean + momentum * mean.reshape(-1))
            running_var.copy_((1 - momentum) * running_var + momentum * var_unbiased.reshape(-1))
        var = var_biased
    else:
        mean = running_mean.view(shape)
        var = running_var.view(shape)

    x_norm = (x - mean) / torch.sqrt(var + eps)
    if weight is not None:
        return x_norm * weight.view(shape) + bias.view(shape)
    return x_norm
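The (n - 1) / n factor used to turn the unbiased variance into the biased one can be checked with the standard library alone. This is a small sanity-check sketch; biased_from_unbiased is a hypothetical helper name and the data is arbitrary.

```python
import statistics

def biased_from_unbiased(var_unbiased, n):
    """Rescale a variance that divides by (n - 1) into one that divides by n."""
    return var_unbiased * (n - 1) / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

var_unbiased = statistics.variance(data)    # sample variance, divides by n - 1
var_biased = statistics.pvariance(data)     # population variance, divides by n
converted = biased_from_unbiased(var_unbiased, n)
```

Both routes give the same biased variance, so computing only the unbiased one and rescaling saves a second reduction.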

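LayerNorm#

LayerNorm does not depend on batch statistics and is the dominant normalization in NLP and Transformer models.

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, ...)
# normalized_shape: an int or a tuple giving the sizes of the trailing input dimensions; normalization runs over those dimensions
# eps: same as in BatchNorm
# elementwise_affine: whether to learn an element-wise weight and bias

A usage example:

import torch
import torch.nn as nn

ln = nn.LayerNorm(512)
x = torch.randn(32, 10, 512)
out = ln(x)  # shape unchanged; each token's 512-dim vector is normalized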

A hand-written version:

import torch
import torch.nn as nn


class layernorm(nn.Module):
    def __init__(self, embed_size, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(embed_size))
        self.beta = nn.Parameter(torch.zeros(embed_size))
        self.eps = eps

    def forward(self, x):
        # correction=0 selects the biased estimate
        var, mean = torch.var_mean(x, dim=-1, keepdim=True, correction=0)

        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return x_norm * self.gamma + self.beta


if __name__ == "__main__":
    x = torch.randn(2, 4, 3)
    embed_size = x.size(-1)
    official = nn.LayerNorm(embed_size, eps=1e-5)
    my = layernorm(embed_size, eps=1e-5)

    # Copy the same weights
    my.gamma.data = official.weight.data.clone()
    my.beta.data = official.bias.data.clone()

    diff = (official(x) - my(x)).abs().max().item()
    print(f"Max difference: {diff:.2e}")

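RMSNorm#

RMSNorm is a simplified variant of LayerNorm: it drops the mean-subtraction step and only divides by the root mean square. It is widely used in recent large models.

torch.nn.RMSNorm(normalized_shape, eps=None, elementwise_affine=True, ...)

Usage mirrors LayerNorm, except that there is no bias parameter and, when eps is None, PyTorch falls back to torch.finfo(x.dtype).eps.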
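As a rough illustration of "LayerNorm without the mean", the pure-Python sketch below normalizes a single row both ways. The function names are hypothetical, weight is an optional per-feature scale, and the LayerNorm variant omits the bias so the two are directly comparable.

```python
import math

def rms_norm(row, weight=None, eps=1e-6):
    """y_k = x_k / sqrt(mean(x^2) + eps) * weight_k"""
    k = len(row)
    weight = weight if weight is not None else [1.0] * k
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / k + eps)
    return [v * inv_rms * w for v, w in zip(row, weight)]

def layer_norm_no_bias(row, weight=None, eps=1e-6):
    """y_k = (x_k - mean) / sqrt(var + eps) * weight_k"""
    k = len(row)
    weight = weight if weight is not None else [1.0] * k
    mean = sum(row) / k
    var = sum((v - mean) ** 2 for v in row) / k
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w for v, w in zip(row, weight)]

zero_mean_row = [-1.0, 0.0, 1.0]
shifted_row = [9.0, 10.0, 11.0]

same = rms_norm(zero_mean_row)               # matches LayerNorm: the mean is already 0
also_same = layer_norm_no_bias(zero_mean_row)
rms_shifted = rms_norm(shifted_row)          # the offset is kept, unlike LayerNorm
ln_shifted = layer_norm_no_bias(shifted_row)
```

On a zero-mean row the two coincide; on a shifted row RMSNorm keeps the offset while LayerNorm removes it.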

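| Property | BatchNorm | LayerNorm | RMSNorm |
| --- | --- | --- | --- |
| Normalized axes | Batch and spatial dimensions (each channel independently) | Feature dimensions (each sample independently) | Feature dimensions (each sample independently) |
| Depends on batch | Yes | No | No |
| Learnable parameters | `weight` + `bias` (one per channel) | `weight` + `bias` (normalized shape) | `weight` only |
| Typical use | CNNs, computer vision | NLP, Transformers | Large models |
| PyTorch interface | `BatchNorm1d/2d/3d` | `LayerNorm` | `RMSNorm` |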

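Triton implementation#

With the PyTorch level covered, the next step is to implement LN and RMS by hand in Triton.

The focus below is on the Triton versions of LN and RMS; BN is left out and can be studied separately.

The normalization viewpoint#

For both LN and RMS, the input is a 2D matrix $X \in \mathbb{R}^{N \times K}$, where N is the total number of rows (tokens) and K is the feature dimension. (The Triton code below names these M and N respectively.)

Each row is computed completely independently. For row n, the operator aggregates all K elements of that row into the required statistics, then uses those statistics to transform the row element by element.

Tip

This is the classic Reduce + Broadcast pattern.

Concretely, LayerNorm needs two reductions, one for the mean and one for the variance; RMSNorm needs only a single sum-of-squares reduction.

On the GPU, the reduce is the most expensive part and the broadcast is an element-wise operation. With so little arithmetic per element loaded, Norm is clearly a memory-bandwidth-bound operation.

LayerNorm#

This implementation follows the official flash-attention one: flash-attention

About the backward pass

In practice, most operators need a backward pass as well as a forward pass. I plan to write a complete macroTorch later to study how forward and backward propagation work end to end.

Autotuning can be wrapped in a helper function that enumerates the common configurations. The kernel also saves mean and rstd so that a future backward pass can reuse them instead of redoing the reductions, which is why there are two explicit stores.

Another deliberate design choice is to parameterize the row stride rather than assume strictly contiguous memory, which lets the kernel handle more general tensor layouts.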

import torch
import torch.nn as nn
import triton
import triton.language as tl
from triton.testing import do_bench


def get_autotune_configs():
    # Enumerate the num_warps values that fit within the per-block thread limit.
    warp_size = 32
    max_threads_per_block = 1024
    configs = []
    for num_warps in [1, 2, 4, 8, 16, 32]:
        if num_warps * warp_size <= max_threads_per_block:
            configs.append(triton.Config({}, num_warps=num_warps))
    return configs


@triton.autotune(
    configs=get_autotune_configs(),
    key=["N"],
)
@triton.jit
def layer_norm_fwd_kernel(
    X_ptr, Y_ptr, W_ptr, B_ptr, Mean_ptr, Rstd_ptr,
    stride_x_row, stride_y_row,     # passing row strides allows flexible indexing, a common Triton pattern
    N, eps, BLOCK_N: tl.constexpr
):
    # Each program handles one row
    row_idx = tl.program_id(0)
    X_row_ptr = X_ptr + row_idx * stride_x_row
    Y_row_ptr = Y_ptr + row_idx * stride_y_row

    cols = tl.arange(0, BLOCK_N)
    mask = cols < N

    # Load the whole row and promote to FP32 for the computation
    x = tl.load(X_row_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    w = tl.load(W_ptr + cols, mask=mask).to(tl.float32)
    b = tl.load(B_ptr + cols, mask=mask).to(tl.float32)

    # Reduce: mean and variance
    mean = tl.sum(x, axis=0) / N
    tl.store(Mean_ptr + row_idx, mean)

    # Zero the padded lanes so they do not contribute (0 - mean)^2 to the variance
    x_bar = tl.where(mask, x - mean, 0.0)
    var = tl.sum(x_bar * x_bar, axis=0) / N
    rstd = 1.0 / tl.sqrt(var + eps)
    tl.store(Rstd_ptr + row_idx, rstd)

    # Broadcast: normalize + affine transform
    y = (x - mean) * rstd * w + b
    tl.store(Y_row_ptr + cols, y, mask=mask)


def layer_norm_fwd(x, weight, bias=None, eps=1e-5):
    M, N = x.shape
    # The kernel indexes columns with unit stride, so the last dimension must be contiguous.
    assert x.stride(-1) == 1
    y = torch.empty_like(x)
    mean = torch.empty(M, device=x.device, dtype=torch.float32)
    rstd = torch.empty(M, device=x.device, dtype=torch.float32)

    # Resolve weight first: the bias default below reads weight.dtype.
    if weight is None:
        weight = torch.ones(N, device=x.device, dtype=x.dtype)
    if bias is None:
        bias = torch.zeros(N, device=x.device, dtype=weight.dtype)

    # One block covers the whole row, padded up to a power of two
    BLOCK_N = triton.next_power_of_2(N)
    layer_norm_fwd_kernel[(M,)](
        x, y, weight, bias, mean, rstd,
        x.stride(0), y.stride(0),
        N, eps, BLOCK_N=BLOCK_N
    )
    return y, mean, rstd


def test_correctness(shapes=[(128, 256), (512, 1024)]):
    for M, N in shapes:
        x = torch.randn(M, N, device='cuda', dtype=torch.float32)
        weight = torch.randn(N, device='cuda', dtype=torch.float32)
        bias = torch.randn(N, device='cuda', dtype=torch.float32)
        eps = 1e-5

        # Reference: nn.LayerNorm with the same weight and bias
        ln = nn.LayerNorm(N, eps=eps).to('cuda')
        ln.weight.data = weight
        ln.bias.data = bias
        y_ref = ln(x)

        y_tri, _, _ = layer_norm_fwd(x, weight, bias, eps)

        max_diff = (y_tri - y_ref).abs().max().item()
        print(f"Shape ({M}, {N}): max diff = {max_diff:.6e}")


bench_perf_report = None
try:
    from triton.testing import Benchmark, perf_report

    @perf_report(
        Benchmark(
            x_names=["N"],
            x_vals=[256, 512, 1024, 2048, 4096, 8192],
            line_arg="provider",
            line_vals=["triton", "pytorch"],
            line_names=["Triton", "PyTorch"],
            styles=[("blue", "-"), ("red", "-")],
            ylabel="Latency (ms)",
            plot_name="LayerNorm Fwd Performance",
            args={"M": 1024, "eps": 1e-5, "dtype": torch.float32},
        )
    )
    def _bench_perf_report(M, N, eps, dtype, provider):
        device = 'cuda'
        x = torch.randn(M, N, device=device, dtype=dtype)
        weight = torch.randn(N, device=device, dtype=dtype)
        bias = torch.randn(N, device=device, dtype=dtype)

        if provider == "triton":
            def run():
                return layer_norm_fwd(x, weight, bias, eps)
        else:
            ln = nn.LayerNorm(N, eps=eps).to(device)
            ln.weight.data = weight
            ln.bias.data = bias
            def run():
                return ln(x)
        # Returns the median plus the 20th and 80th percentile latencies in ms
        return do_bench(run, quantiles=[0.5, 0.2, 0.8])

    bench_perf_report = _bench_perf_report
except ImportError:
    print("This Triton version does not support perf_report; skipping the plotting benchmark\n")


if __name__ == "__main__":
    test_correctness()
    if bench_perf_report is not None:
        bench_perf_report.run(show_plots=True, print_data=True, save_path="./layer_norm_fwd_perf.png")
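One detail of the kernel that is easy to miss is why the centered values are masked before the variance reduction. Because BLOCK_N is padded up to a power of two, the padding lanes hold 0.0; that is harmless for the sum, but (0 - mean)^2 would leak into the variance. The pure-Python sketch below imitates this on a 5-element row; next_power_of_2 and masked_row_stats are illustrative stand-ins, not Triton calls.

```python
def next_power_of_2(n):
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def masked_row_stats(row, block_n, use_mask):
    """Imitate the kernel: pad to block_n lanes, then reduce mean and variance."""
    n = len(row)
    padded = row + [0.0] * (block_n - n)          # masked loads return other=0.0
    mask = [i < n for i in range(block_n)]
    mean = sum(padded) / n                        # zeros do not disturb the sum
    if use_mask:
        centered = [(v - mean) if m else 0.0 for v, m in zip(padded, mask)]
    else:
        centered = [v - mean for v in padded]     # padding lanes become -mean
    var = sum(c * c for c in centered) / n
    return mean, var

row = [1.0, 2.0, 3.0, 4.0, 5.0]
block_n = next_power_of_2(len(row))               # 5 -> 8, so 3 padding lanes

mean_ok, var_ok = masked_row_stats(row, block_n, use_mask=True)
mean_bad, var_bad = masked_row_stats(row, block_n, use_mask=False)
```

With the mask the variance is the true value 2.0; without it, the three padding lanes each add (0 - 3)^2 = 9 and the result becomes 7.4.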

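The performance comparison is as follows (the plot is not reproduced here):

RMSNorm#

LayerNorm needs two statistics, mean and std, and the mean requires a reduction of its own. RMSNorm computes only one:

rms = sqrt(E[x^2] + eps)
y = x / rms * γ

The core code is as follows: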

# Reuses the imports and get_autotune_configs() from the LayerNorm listing above.
@triton.autotune(
    configs=get_autotune_configs(),
    key=["N"],
)
@triton.jit
def rms_norm_fwd_kernel(
    X_ptr, Y_ptr, W_ptr,
    stride_x_row, stride_y_row,
    N, eps,
    BLOCK_N: tl.constexpr,
):
    # Each program handles one row
    row_idx = tl.program_id(0)
    X_row_ptr = X_ptr + row_idx * stride_x_row
    Y_row_ptr = Y_ptr + row_idx * stride_y_row

    cols = tl.arange(0, BLOCK_N)
    mask = cols < N

    x = tl.load(X_row_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    w = tl.load(W_ptr + cols, mask=mask, other=0.0).to(tl.float32)

    # RMSNorm: rstd = rsqrt(mean(x^2) + eps)
    x2 = x * x
    mean_x2 = tl.sum(x2, axis=0) / N
    rstd = tl.rsqrt(mean_x2 + eps)
    # Broadcast: scale by rstd and by the weight
    y = x * rstd * w
    tl.store(Y_row_ptr + cols, y, mask=mask)


def rms_norm_fwd(x, weight, eps=1e-5):
    M, N = x.shape
    y = torch.empty_like(x)

    if weight is None:
        weight = torch.ones(N, device=x.device, dtype=x.dtype)

    BLOCK_N = triton.next_power_of_2(N)
    rms_norm_fwd_kernel[(M,)](
        x, y, weight,
        x.stride(0), y.stride(0),
        N, eps,
        BLOCK_N=BLOCK_N,
    )
    return y

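I benchmarked feature dimensions from 256 to 8192. In these runs the Triton version is consistently faster than PyTorch, and autotuning keeps performance balanced across sizes.

The advantage of RMS is larger in FP16. A plausible explanation is that the Triton kernel controls the mixed-precision path explicitly (load, promote to FP32, compute, store) and so avoids unnecessary type-conversion overhead.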

====================================================================================================
RMSNorm Performance: dtype=torch.float16
====================================================================================================
     M       N       Triton(ms)     PyTorch (ms)    my vs PT
----------------------------------------------------------------------------------------------------
   128     256         0.003816         0.021189       5.55x
   128     512         0.004489         0.021973       4.89x
   128    1024         0.004901         0.023449       4.78x
   128    2048         0.005038         0.026430       5.25x
   128    4096         0.006675         0.032510       4.87x
   128    8192         0.008356         0.038082       4.56x
   512     256         0.004816         0.022823       4.74x
   512     512         0.005085         0.025100       4.94x
   512    1024         0.006658         0.029662       4.45x
   512    2048         0.008573         0.038389       4.48x
   512    4096         0.011734         0.058861       5.02x
   512    8192         0.022541         0.093368       4.14x
  1024     256         0.005728         0.025211       4.40x
  1024     512         0.006578         0.029643       4.51x
  1024    1024         0.009162         0.037848       4.13x
  1024    2048         0.011643         0.057440       4.93x
  1024    4096         0.020953         0.093154       4.45x
  1024    8192         0.041657         0.170544       4.09x
  2048     256         0.006578         0.029480       4.48x
  2048     512         0.007969         0.037949       4.76x
  2048    1024         0.011659         0.057081       4.90x
  2048    2048         0.021078         0.092297       4.38x
  2048    4096         0.041632         0.169116       4.06x
  2048    8192         0.077019         0.681771       8.85x
  4096     256         0.007757         0.037739       4.87x
  4096     512         0.012584         0.057165       4.54x
  4096    1024         0.020957         0.091967       4.39x
  4096    2048         0.040349         0.168362       4.17x
  4096    4096         0.077127         0.669127       8.68x
  4096    8192         0.149073         1.469917       9.86x

====================================================================================================
RMSNorm Performance: dtype=torch.float32
====================================================================================================
     M       N       Triton(ms)     PyTorch (ms)    my vs PT
----------------------------------------------------------------------------------------------------
   128     256         0.004033         0.016027       3.97x
   128     512         0.004791         0.016718       3.49x
   128    1024         0.005044         0.018569       3.68x
   128    2048         0.006306         0.021332       3.38x
   128    4096         0.009204         0.026772       2.91x
   128    8192         0.013090         0.030454       2.33x
   512     256         0.005348         0.017874       3.34x
   512     512         0.006673         0.019975       2.99x
   512    1024         0.008565         0.023732       2.77x
   512    2048         0.012434         0.031180       2.51x
   512    4096         0.020615         0.048329       2.34x
   512    8192         0.043017         0.077455       1.80x
  1024     256         0.006282         0.019679       3.13x
  1024     512         0.008333         0.023323       2.80x
  1024    1024         0.012538         0.030153       2.40x
  1024    2048         0.020061         0.046295       2.31x
  1024    4096         0.043250         0.076708       1.77x
  1024    8192         0.079363         0.145954       1.84x
  2048     256         0.008264         0.023329       2.82x
  2048     512         0.011833         0.030187       2.55x
  2048    1024         0.023103         0.046089       1.99x
  2048    2048         0.042146         0.075980       1.80x
  2048    4096         0.078298         0.144876       1.85x
  2048    8192         0.150213         0.479396       3.19x
  4096     256         0.012040         0.030257       2.51x
  4096     512         0.020640         0.045097       2.18x
  4096    1024         0.040291         0.075542       1.87x
  4096    2048         0.079232         0.144678       1.83x
  4096    4096         0.150072         0.465174       3.10x
  4096    8192         0.296848         1.028277       3.46x

In these measurements the speedup is roughly 1.8x to 4x in FP32 and roughly 4x to 10x in FP16.

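CUDA implementation#

For each of the N rows, both operators traverse all K elements of the row, compute statistics (sum of squares, mean, variance, and so on), and use those statistics to normalize the row.

A Norm operator really consists of two stages:

Stage 1: Reduce
  the K elements of a row → aggregated into 1 statistic (a scalar)

  e.g. RMSNorm reduces a row of 8192 floats to one sum of squares
       LayerNorm reduces a row of 8192 floats first to a mean, then to a variance

Stage 2: Broadcast
  use that 1 scalar to transform every element of the row

  y[n][k] = f(x[n][k], stat[n])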
In C++ pseudocode:

for (int n = 0; n < N; n++) {         // N rows, independent of each other
    float stat = 0;
    for (int k = 0; k < K; k++) {     // Reduce: K elements → 1 value
        stat += compute(x[n][k]);
    }
    stat = finalize(stat);             // e.g. rsqrt(stat/K + eps)

    for (int k = 0; k < K; k++) {     // Broadcast: 1 value → K elements
        y[n][k] = transform(x[n][k], stat, gamma[k], beta[k]);
    }
}

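Why Norm is bandwidth-bound#

y = x * rsqrt(mean(x²) + ε) * gamma

Each element that is read takes part in only a couple of multiply-adds, plus a single rsqrt per row, so arithmetic intensity is low. The performance of this kind of kernel is determined by memory bandwidth, not by compute throughput.

On the 4090D, with a theoretical bandwidth of about 1000 GB/s (1008 GB/s is used below), the best-case time for an FP16 RMSNorm over a $4096 \times 8192$ input is:

bytes moved = (read x + read gamma + write y) = (2×4096×8192 + 8192) × 2 bytes ≈ 134 MB
ideal time  = 134 MB / 1008 GB/s ≈ 0.133 ms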
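The same arithmetic can be written as a small calculator, which is handy for converting measured times into achieved bandwidth. This is a sketch under the traffic model used above (read x, read gamma, write y); the function names are made up for illustration, and 1008 GB/s is the peak figure quoted in the text.

```python
def rmsnorm_bytes(rows, cols, bytes_per_elem):
    """Bytes moved by one RMSNorm pass: read x, write y, read gamma once."""
    return (2 * rows * cols + cols) * bytes_per_elem

def ideal_ms(num_bytes, bandwidth_gb_s):
    """Lower bound on runtime if memory bandwidth were fully used."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3

def achieved_gb_s(num_bytes, measured_ms):
    """Effective bandwidth implied by a measured runtime."""
    return num_bytes / (measured_ms * 1e-3) / 1e9

fp16_bytes = rmsnorm_bytes(4096, 8192, bytes_per_elem=2)
best_case = ideal_ms(fp16_bytes, bandwidth_gb_s=1008.0)

# Hypothetical measurement of 0.147 ms for the same shape
effective = achieved_gb_s(fp16_bytes, measured_ms=0.147)
fraction_of_peak = effective / 1008.0
```

Dividing the effective bandwidth by the theoretical one gives the "fraction of peak" figure that bandwidth-bound kernels are usually judged by.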

The test results are as follows:

❯ ./rms_norm
--- RMSNorm performance test ---
kernel               | shape     | avg_ms    | bandwidth  | correctness
--------------------------------------------------------------------
RMSNorm FP32         | 1024x256   |   0.007 ms |    310.1 GB/s | err: 6.0e-07 [OK]
RMSNorm FP16         | 1024x256   |   0.007 ms |    156.1 GB/s | err: 1.5e-03 [OK]

RMSNorm FP32         | 1024x1024  |   0.010 ms |    876.1 GB/s | err: 9.5e-07 [OK]
RMSNorm FP16         | 1024x1024  |   0.008 ms |    556.0 GB/s | err: 1.5e-03 [OK]

RMSNorm FP32         | 2048x4096  |   0.078 ms |    865.7 GB/s | err: 2.0e-06 [OK]
RMSNorm FP16         | 2048x4096  |   0.037 ms |    905.3 GB/s | err: 1.6e-03 [OK]

RMSNorm FP32         | 4096x8192  |   0.294 ms |    914.2 GB/s | err: 4.4e-06 [OK]
RMSNorm FP16         | 4096x8192  |   0.147 ms |    911.0 GB/s | err: 1.6e-03 [OK]

For the largest shapes this is about 90% of the theoretical peak (roughly 911 of 1008 GB/s), which is close to saturating memory bandwidth.

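OneFlow design analysis#

OneFlow's LayerNorm/RMSNorm implementation is one of the most architecturally mature in the open-source community, so I follow its design here.

Separation strategies#

1. Separating data movement from computation

// OneFlow abstracts loads and stores as functors
template<typename SRC, typename DST>
struct DirectLoad {
    template<int N>
    __device__ void load(DST* dst, int64_t row, int64_t col) const { ... }
};

The same kernel template can then serve different input and output requirements. The kernel only cares about "load → compute → store", not where the data comes from or where it goes.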

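2. Separating reduction from normalization

In a naive implementation, reduction and normalization are interleaved: one full pass over the data computes $\mu$, a second pass computes the variance, and only a third pass normalizes. This is expensive in I/O, and the variance step is prone to serious numerical precision problems.

The naive variance formula is:

$$\sigma^2 = E[x^2] - (E[x])^2$$

It requires accumulating both $\sum x$ and $\sum x^2$. When the mean is large and the variance itself is small, $\sum x^2$ and $(\sum x)^2/N$ are two large, nearly equal numbers, and subtracting them cancels most of the significant digits. This is known as catastrophic cancellation, and the resulting error can be very large.

Welford's algorithm offers a numerically stable online alternative. It does not store $\sum x^2$; instead it maintains m2, the running sum of squared deviations from the current mean.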

template<typename T>
__device__ void WelfordCombine(T val, T* mean, T* m2, T* count) {
    *count += 1;
    T delta1 = val - *mean;      // deviation from the old mean
    *mean += delta1 / *count;
    T delta2 = val - *mean;      // deviation from the updated mean
    *m2 += delta1 * delta2;
}

Welford's algorithm is naturally single-pass: it updates mean and m2 as the data is read.
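The cancellation problem is easy to reproduce even in double precision. The sketch below compares the naive formula with a Python transcription of the Welford update on four values with a large mean and a small spread; the data set is a standard textbook-style example chosen for illustration, and the function names are hypothetical.

```python
def naive_variance(xs):
    """E[x^2] - (E[x])^2, accumulated as two running sums."""
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return sq / n - (s / n) ** 2

def welford_variance(xs):
    """Single-pass Welford update; returns (mean, population variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta1 = x - mean          # deviation from the old mean
        mean += delta1 / count
        delta2 = x - mean          # deviation from the updated mean
        m2 += delta1 * delta2
    return mean, m2 / count

# Large mean (1e9), tiny spread: the true population variance is 22.5
xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

naive = naive_variance(xs)
mean, stable = welford_variance(xs)
```

Here the Welford result matches the true variance of 22.5, while the naive difference of two values near 1e18 cannot resolve it at all. In FP32 on a GPU the same effect shows up at much smaller magnitudes.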

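Computing a global mean/variance on the GPU requires a reduction across threads and thread blocks. Welford states are mergeable, so each warp / block can first run a local Welford pass independently, and afterwards only those few partial states need to be merged.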
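Mergeability means two partial states (count, mean, m2) can be combined without revisiting the data. A minimal sketch of the standard pairwise merge formula follows; the names welford and merge are illustrative, and on the GPU the same combination would run inside a warp or block reduction instead of a Python function.

```python
def welford(xs):
    """Local Welford pass over one chunk; returns the state (count, mean, m2)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta1 = x - mean
        mean += delta1 / count
        m2 += delta1 * (x - mean)
    return count, mean, m2

def merge(a, b):
    """Combine two partial Welford states into the state of their union."""
    na, mean_a, m2_a = a
    nb, mean_b, m2_b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2_a + m2_b + delta * delta * na * nb / n
    return n, mean, m2

data = [float(i) for i in range(1, 9)]      # 1.0 .. 8.0
left = welford(data[:3])                    # e.g. the share of one warp
right = welford(data[3:])                   # e.g. the share of another warp
merged = merge(left, right)
full = welford(data)                        # reference: a single pass over everything
```

The merged state equals the single-pass state, so the partial results can be reduced in any tree order.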

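Once the global $\mu$ and $\sigma$ are known, normalization becomes a dependency-free element-wise operation: (x - μ) * inv_std.
It can then be fused easily with other element-wise operations (such as $\gamma, \beta$, a residual add, or an activation function) into a single kernel, forming an end-to-end fused operator. With this separation, the reduction is one round of global communication plus a little computation, and the normalization is purely local parallel work. The clean boundary helps both compilers and hand-written kernels optimize aggressively.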

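3. Separating strategy from algorithm

| Strategy | When it applies | Thread organization | Core idea |
| --- | --- | --- | --- |
| WarpImpl | K ≤ 1024 | 2D block | One warp group per row, several rows in parallel within a block |
| BlockSMemImpl | K > 1024 | 1D block + dynamic shared memory | Cache the data in shared memory to skip the second global read |
| BlockUncachedImpl | Row does not fit for SMemImpl | 1D block, two global reads | Relies on the L2 cache |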

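Design strategies#

1. Occupancy-based dynamic grid sizing

cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_active_blocks, func, block_size, smem);
*num_blocks = max(1, min(max_blocks, sm_count * max_active_blocks * waves));

The grid size is derived from the number of SMs and the maximum number of blocks that can be resident on each SM, multiplied by a waves factor (default 32), so that there are enough blocks to hide memory latency. The grid size therefore adapts to the occupancy the kernel actually achieves.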
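The clamping logic in that formula can be tried out in isolation. The sketch below mirrors the expression in plain Python; the SM count and per-SM block limit are hypothetical inputs (on a real device they would come from the CUDA runtime and the occupancy query), and num_blocks_for is an illustrative name.

```python
def num_blocks_for(max_blocks, sm_count, max_active_blocks_per_sm, waves=32):
    """Clamp the requested block count to `waves` full waves of resident blocks."""
    resident_capacity = sm_count * max_active_blocks_per_sm
    return max(1, min(max_blocks, resident_capacity * waves))

# Hypothetical device: 100 SMs, 4 resident blocks per SM for this kernel
small_job = num_blocks_for(max_blocks=4096, sm_count=100, max_active_blocks_per_sm=4)
huge_job = num_blocks_for(max_blocks=1_000_000, sm_count=100, max_active_blocks_per_sm=4)
empty_job = num_blocks_for(max_blocks=0, sm_count=100, max_active_blocks_per_sm=4)
```

When the cap applies there are fewer blocks than rows, so the kernel presumably has to walk the remaining rows with a grid-stride loop.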

2. Pack and automatic pack_size selection

if (cols % 4 == 0 && CanPackAs<LOAD>(load, 4)) {
    // use pack_size = 4
} else if (cols % 2 == 0 && CanPackAs<LOAD>(load, 2)) {
    // use pack_size = 2
} else {
    // fallback to 1
}

The widest vectorization that the column alignment allows is chosen automatically, maximizing memory-access efficiency without giving up generality.
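A rough model of that selection, assuming the alignment check amounts to "the base address is a multiple of the pack width in bytes" (this is my reading of what CanPackAs verifies, not a quote of OneFlow's code), can be written as follows. choose_pack_size and its parameters are hypothetical.

```python
def choose_pack_size(cols, base_address, elem_bytes):
    """Pick the widest pack (4, 2, or 1 elements) allowed by row length and alignment."""
    for pack in (4, 2):
        row_divisible = cols % pack == 0
        address_aligned = base_address % (pack * elem_bytes) == 0
        if row_divisible and address_aligned:
            return pack
    return 1  # scalar fallback always works

aligned_fp32 = choose_pack_size(cols=4096, base_address=0, elem_bytes=4)
even_only = choose_pack_size(cols=4098, base_address=0, elem_bytes=4)
odd_cols = choose_pack_size(cols=4097, base_address=0, elem_bytes=4)
misaligned = choose_pack_size(cols=4096, base_address=8, elem_bytes=4)
```

The row length has to be divisible as well as the base address aligned, because every row start must land on a pack boundary.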

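3. Batched processing for small K and large N

When K is small (e.g. 64) and N is large (e.g. 10⁶), each warp group processes 2 rows at a time instead of 1. This doubles instruction-level parallelism and hides latency better.

My CUDA design#

Based on this analysis of OneFlow, I trimmed the design down while keeping its core ideas.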

Module architecture overview#

cuda_norm/
├── reduce.cuh          ← warp/block-level reduction primitives
├── io.cuh              ← vectorized memory-access abstractions (Pack, DirectLoad, AffineStore)
├── norm_kernel.cuh     ← generic kernel skeleton (WarpImpl + BlockSMemImpl)
├── rms_norm.cuh        ← RMSNorm: Stats specialization + dispatch + host API
├── layer_norm.cuh      ← LayerNorm: Stats specialization + dispatch + host API
└── bench/
    ├── reduce_bench.cu
    ├── io_bench.cu
    ├── rms_norm_bench.cu
    └── layer_norm_bench.cu

The dependencies are:

reduce.cuh  →  io.cuh  →  norm_kernel.cuh  →  rms_norm.cuh
                                            →  layer_norm.cuh

Reduce module#

Responsibility

Provide sum reductions at warp level and at block level.

The core interface is:

template <typename T>
__device__ T warp_reduce_sum(T val);

template <const int NUM_THREADS, typename T>
__device__ T block_reduce_sum(T val);

The design is as follows:

Within a warp, use __shfl_xor_sync()

initial:  lane0  lane1  lane2  lane3  ...  lane31
mask=16:  each lane exchanges with the lane 16 away and adds
mask=8:   each lane exchanges with the lane 8 away and adds
...
mask=1:   after this step every lane holds the sum of all 32 values

The block reduction has two levels:

Step 1: each warp reduces independently → lane 0 writes to shared memory[warp_id]
Step 2: warp 0 reads the per-warp results from shared memory → one more warp reduction
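The two diagrams above can be simulated on the CPU to check the data movement. In the sketch below a Python list plays the role of the 32 lanes, index i ^ mask plays the role of the partner lane selected by the XOR shuffle, and a second list stands in for shared memory. It is a model of the pattern only, not of the actual CUDA code in the repository.

```python
WARP_SIZE = 32

def warp_reduce_sum(lanes):
    """Butterfly reduction: after the last step every lane holds the full sum."""
    vals = list(lanes)
    mask = WARP_SIZE // 2
    while mask >= 1:
        # Every lane adds the value held by the lane whose index differs in bit `mask`
        vals = [vals[i] + vals[i ^ mask] for i in range(WARP_SIZE)]
        mask //= 2
    return vals

def block_reduce_sum(threads):
    """Two-level reduction for a block whose size is a multiple of 32 (at most 1024)."""
    assert len(threads) % WARP_SIZE == 0 and len(threads) <= WARP_SIZE * WARP_SIZE
    num_warps = len(threads) // WARP_SIZE
    # Step 1: each warp reduces on its own; lane 0 writes its result to "shared memory"
    shared = []
    for w in range(num_warps):
        warp_vals = warp_reduce_sum(threads[w * WARP_SIZE:(w + 1) * WARP_SIZE])
        shared.append(warp_vals[0])
    # Step 2: warp 0 loads the partial sums (unused lanes contribute 0) and reduces again
    padded = shared + [0] * (WARP_SIZE - num_warps)
    return warp_reduce_sum(padded)[0]

warp_result = warp_reduce_sum(list(range(WARP_SIZE)))
block_result = block_reduce_sum(list(range(256)))
```

Because the XOR pattern is symmetric, every lane ends with the total, which is why no extra broadcast is needed inside a warp.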

The full code is available in the code repository.

IO module#

Responsibility

Decouple data movement from compute logic. The module provides vectorized memory access (128-bit LDG/STG), transparent type conversion (fp16↔fp32), and zero-overhead affine fusion (the gamma/beta multiply-add), so that kernel code does not need to know about memory layout, alignment, or precision.

The core interface is:

// Layer 0: an aligned register array; at 16 bytes it enables LDG.128 / STG.128
template <typename T, int N>
struct alignas(sizeof(T) * N) Pack { T elem[N]; };

// Layer 1: load and type conversion in one step (e.g. half→float)
template <typename SRC, typename DST>
struct DirectLoad {
    template <int N>
    __device__ void load(DST* dst, int64_t row, int64_t col) const;
};

// Layer 2: type conversion + vectorized store
template <typename SRC, typename DST>
struct DirectStore {
    template <int N>
    __device__ void store(const SRC* src, int64_t row, int64_t col);
};

// Layer 3: fuse the affine parameters into the store (y = x * gamma + beta), with compile-time branch elimination
template <typename SRC, typename DST, bool do_scale, bool do_center>
struct AffineStore {
    template <int N>
    __device__ void store(const SRC* src, int64_t row, int64_t col);
};

The full code is available in the code repository.

What would it look like without the IO layer, with the data-movement code written directly in the kernel?

template <int N>
__global__ void rms_norm_naive(const __half* x, __half* y, const __half* gamma,
                                int rows, int cols, float eps) {
    int row = blockIdx.x;
    int K = cols;

    // Problem 1: element-wise scalar loads: 32 LDG.16 instructions instead of 4 LDG.128
    float sum_sq = 0.0f;
    for (int k = threadIdx.x; k < K; k += blockDim.x) {
        float v = __half2float(x[row * K + k]);  // Problem 2: type conversions scattered everywhere
        sum_sq += v * v;
    }

    // Problem 3: the reduce logic is coupled with the memory access
    // ...

    float inv_rms = rsqrtf(sum_sq / K + eps);

    // Problem 4: another manual conversion and manual gamma multiply on write-back, each a scalar STG.16
    for (int k = threadIdx.x; k < K; k += blockDim.x) {
        float v = __half2float(x[row * K + k]);
        float out = v * inv_rms;
        // Problem 5: loading and applying gamma also lives in the kernel, so LayerNorm needs a full rewrite
        float g = __half2float(gamma[k]);
        y[row * K + k] = __float2half(out * g);
    }
}

And with the IO layer in place?

template <typename LOAD, typename STORE, typename ComputeType, typename Stats, ...>
__global__ void NormWarpImpl(LOAD load, STORE store, int rows, int cols, float eps) {
    ComputeType buf[cols_per_thread];                    // buffer held in registers

    // Load: one line that wraps the 128-bit LDG and the half→float conversion
    load.template load<pack_size>(buf, row, col);

    // Compute: the kernel only sees float data and does not need to know whether the source was fp16 or fp32
    Stats::normalize(buf, stat, cols_per_thread);

    // Store: wraps the float→half conversion, the gamma multiply, and the 128-bit STG
    store.template store<pack_size>(buf, row, col);
}

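Compiler-assisted vectorization#

template <typename T, int N>
struct alignas(sizeof(T) * N) Pack {
    T elem[N];
};

When the CUDA compiler sees an aligned 16-byte read or write, it can emit LDG.128/STG.128 instructions automatically.

alignas is the key

Pack<float, 4>:  4 × 4 bytes = 16 bytes, alignas(16)
Pack<__half, 8>: 8 × 2 bytes = 16 bytes, alignas(16)

The other piece is T elem[N]. The Pack lives in registers because each thread handles only a few elements, and there are no access conflicts or synchronization between threads to worry about.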

Transparent type conversion#

template <typename SRC, typename DST>
struct DirectLoad {
    template <int N>
    __device__ void load(DST* dst, int64_t row, int64_t col) const {
        Pack<SRC, N> pack;
        // Index of the pack that contains (row, col); src and row_size are members
        const int64_t offset = (row * row_size + col) / N;
        pack = *reinterpret_cast<const Pack<SRC, N>*>(src + offset * N);  // 128-bit LDG
        #pragma unroll
        for (int i = 0; i < N; ++i)
            dst[i] = static_cast<DST>(pack.elem[i]);  // automatic type conversion
    }
};

The two template parameters SRC and DST are the core of this layer:

SRC = the storage type in device memory (__half for FP16 models)
DST = the compute type in registers (float for FP32-precision computation)

GPU memory: fp16 ──DirectLoad<__half, float>──→ registers: fp32
                    calls __half2float automatically

Why not let the kernel convert types by hand?

Simply because it is easy to forget!

DirectStore mirrors DirectLoad: the direction is reversed and the logic is fully symmetric:

pack = *reinterpret_cast<const Pack<SRC, N>*>(src + offset * N);  // Load:  Global → Pack
*reinterpret_cast<Pack<DST, N>*>(dst + offset * N) = pack;        // Store: Pack → Global

AffineStore compile-time branch elimination#

template <typename SRC, typename DST, bool do_scale, bool do_center>
struct AffineStore {
    template <int N>
    __device__ void store(const SRC* src, int64_t row, int64_t col) {
        // offset and w_offset are computed as in DirectLoad (omitted here);
        // dst, gamma, and beta are members
        Pack<DST, N> gamma_pack, beta_pack, dst_pack;
        if (do_scale)   // false at compile time → the whole statement is removed
            gamma_pack = *reinterpret_cast<const Pack<DST, N>*>(gamma + w_offset * N);
        if (do_center)  // false at compile time → the whole statement is removed
            beta_pack  = *reinterpret_cast<const Pack<DST, N>*>(beta  + w_offset * N);

        for (int i = 0; i < N; ++i) {
            DST v = static_cast<DST>(src[i]);
            if (do_scale)  v = v * gamma_pack.elem[i];   // RMSNorm: kept | plain Norm: removed
            if (do_center) v = v + beta_pack.elem[i];    // LayerNorm: kept | RMSNorm: removed
            dst_pack.elem[i] = v;
        }
        *reinterpret_cast<Pack<DST, N>*>(dst + offset * N) = dst_pack;  // 128-bit STG
    }
};

The two compile-time flags let the same store serve both RMS and LN simply by passing true or false.

The gamma/beta loads also happen only when they are needed:

if (do_scale)
    gamma_pack = *reinterpret_cast<const Pack<DST, N>*>(gamma + w_offset * N);

When do_scale=false, not even the load instruction for gamma is generated. In the pure-normalization case, gamma can therefore be passed as nullptr without crashing.

The complete data flow#

Chaining the abstraction layers together, the data path of one Norm kernel call consists of the three numbered stages below:

                      GPU Global Memory
                    ┌──────────────────┐
                    │ x (fp16)         │
                    │ gamma (fp16)     │
                    │ beta (fp16)      │
                    │ y (fp16)         │
                    └──┬────────────┬──┘
                       │            ▲
        ┌──────────────┘            └──────────────┐
        │ ① DirectLoad                             │ ③ AffineStore
        │   LDG.128 × N/pack_size                  │   LDG.128 (gamma/beta)
        │   half→float cast                        │   float→half cast
        │                                          │   val * gamma + beta
        ▼                                          │   STG.128
   ┌─────────┐                                     │
   │ buf[]   │ ② Stats::reduce + normalize         │
   │ (regs)  │──────────→ stat ──→ normalize ─────→┘
   └─────────┘

Usage examples#

RMSNorm (multiply by gamma, no beta)

// Prepare the data
const __half* x      = ...;   // input  [N, K]
__half*       y      = ...;   // output [N, K]
const __half* gamma  = ...;   // weight [K]

// Create the loader and the store
DirectLoad<__half, float> load(x, K);
AffineStore<float, __half, true, false> store(y, K, gamma, nullptr);
//                         ^^^^  ^^^^^
//                    do_scale=true  do_center=false

// Use them inside the kernel
float buf[8];
load.load<8>(buf, row, col);           // load 8 half values → float
// ... normalization (buf holds float) ...
store.store<8>(buf, row, col);         // store: float→half, multiply by gamma

LayerNorm (multiply by gamma, add beta)

const __half* beta = ...;   // bias [K]

DirectLoad<__half, float> load(x, K);
AffineStore<float, __half, true, true> store(y, K, gamma, beta);
//                         ^^^^  ^^^^
//                    do_scale=true  do_center=true

// The load/store calls inside the kernel are exactly the same!
load.load<8>(buf, row, col);
// ... normalization ...
store.store<8>(buf, row, col);         // store: float→half, multiply by gamma, add beta

Pure normalization (no affine parameters)

DirectLoad<float, float> load(x, K);
DirectStore<float, float> store(y, K);  // or AffineStore<float,float,false,false>
//                     ^^^^^  ^^^^^
//                SRC=float  DST=float (no type conversion)

// The calls inside the kernel are still the same
load.load<4>(buf, row, col);
// ... normalization ...
store.store<4>(buf, row, col);

In all three scenarios the kernel code does not change at all; only the way the Load/Store objects are constructed differs. That is the value of io.cuh separating "data movement" from "compute logic".

Norm Kernel module#

With IO and Reduce in place, the Norm kernel itself is straightforward to implement.

Start by comparing the pseudocode of the two kernels:

RMSNorm:                              LayerNorm:
  for each row:                         for each row:
    sq_sum = 0                            sum = 0; sq_sum = 0
    for each col:                         for each col:
      v = load(x[row][col])                 v = load(x[row][col])
      sq_sum += v * v                       sum += v
    inv_rms = rsqrt(sq_sum/K + eps)         sq_sum += v * v
    for each col:                         mean = sum / K
      out = v * inv_rms * gamma[col]      var = sq_sum/K - mean²
      store(y, out)                       inv_std = rsqrt(var + eps)
                                          for each col:
                                            out = (v - mean) * inv_std
                                                  * gamma[col] + beta[col]
                                            store(y, out)

Most of the code is the same in both.

Row traversal, column loads, accumulation, reduce, and normalization form a skeleton that is identical.

So the skeleton can be factored into a single kernel, with a Stats struct describing the statistics-specific parts.

Dispatch#

template <typename T, int pack_size>
static void rms_norm_dispatch(..., int N, int K, float eps) {
    if (K <= 1024)
        RMSNormWarpDispatch<T, pack_size>::launch(...);   // strategy 1
    else
        launch_rms_smem<T, pack_size, 256>(...);           // strategy 2
}

A completely different kernel strategy is chosen depending on the size of the normalized dimension K.

| K range | Strategy | Core idea |
| --- | --- | --- |
| K ≤ 1024 | WarpImpl | One warp group handles one row; one block handles 4 rows at once |
| K > 1024 | BlockSMemImpl | The whole block (256 threads) cooperates on one row; a shared-memory cache avoids re-reading |

                    ┌──────────────┐
                    │ reduce.cuh   │ ← warp/block reduction primitives
                    └──────┬───────┘
                           │ called by Stats
              ┌────────────┼────────────┐
              │            ▼            │
    ┌─────────┴─────┐ ┌──────────┐ ┌───┴──────────┐
    │ io.cuh        │ │  Stats   │ │ io.cuh       │
    │ DirectLoad    │ │ (RMS/LN) │ │ AffineStore  │
    └───────┬───────┘ └────┬─────┘ └─────┬─────────┘
            │              │             │
            └──────────────┼─────────────┘
                           ▼
              ┌────────────────────────┐
              │  norm_kernel.cuh       │
              │  NormWarpImpl /        │
              │  NormBlockSMemImpl     │
              └────────────────────────┘

WarpImpl#

This strategy applies when K is small (a small hidden_size); N can be anything.

Core idea

If K is small enough for a row to fit in one warp's registers, there is no need for the whole block to cooperate. The block is split by warp, and each warp handles its own independent row.

blockDim = (32, 4)        ← warp groups of 32 threads, 4 groups stacked together
gridDim  = (ceil(N/4), 1)

block (32, 4)
├── warp group 0 (lane 0..31, threadIdx.y=0) → row 0
├── warp group 1 (lane 0..31, threadIdx.y=1) → row 1
├── warp group 2 (lane 0..31, threadIdx.y=2) → row 2
└── warp group 3 (lane 0..31, threadIdx.y=3) → row 3

BlockSMemImpl#

This strategy applies when K is large (a large hidden_size, e.g. 4096 or 8192).

A naive implementation reads the input X twice: once to compute the statistics and once to normalize.

The optimization is simple: during the first pass the row is also cached in shared memory, and the second (normalization) pass reads it from shared memory instead of global memory.

Pass 1 (Global → SMem + accumulate):
  Global Mem ──read──→ [Shared Memory] ──→ [Thread Accumulator]
                         col 0..K-1            (sum / sq_sum)

Block reduce → stat:
  [per-thread acc] ──reduce──→ s_stat (held only by warp 0)

Broadcast (through shared memory):
  if (threadIdx.x == 0) s_stat = stat;
  __syncthreads();              ← first sync: wait for warp 0 to finish writing
  stat = s_stat;                ← every thread reads it

Pass 2 (SMem → normalize → Global):
  [Shared Memory] ──read──→ normalize ──store──→ Global Mem y

RMS and LN modules#

The job of rms_norm.cuh and layer_norm.cuh is to define Stats.

What norm_kernel needs                      What rms_norm.cuh provides
──────────────────────                      ──────────────────────────
Stats::accum_t      → accumulator type      float (sum of squares only)
Stats::stat_t       → statistic type        float (inv_rms)
Stats::init(acc)    → initialize to 0       set to zero
Stats::accumulate   → per element read      a += v²
Stats::warp_reduce  → merge within a warp   warp_reduce_sum
Stats::block_reduce → merge within a block  block_reduce_sum
Stats::compute      → final statistic       rsqrt(a/K + eps)
Stats::normalize    → normalize in place    v * inv_rms

Plus the dispatch layer:
  K ≤ 1024 → NormWarpImpl<..., RMSNormStats, ...>
  K > 1024 → NormBlockSMemImpl<..., RMSNormStats, ...>

Wrapped in a host API:
  rms_norm_forward(x, y, gamma, N, K, eps, is_fp16)

The Stats interface itself:

struct Stats {
    // Pattern 1: void, modifies in place and accumulates state
    static void init(accum_t& a);                          // reference parameter
    static void accumulate(accum_t& a, const T* vals, int n); // reference parameter
    static void normalize(T* vals, stat_t s, int n);        // pointer parameter

    // Pattern 2: return value, a pure function that produces a new value
    static accum_t warp_reduce(accum_t a);                   // value parameter + return value
    static accum_t block_reduce(accum_t a);                  // value parameter + return value
    static stat_t compute(accum_t a, int K, float eps);     // value parameter + return value
};
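To see how one skeleton can serve both norms, here is a small Python analogue of the Stats plug-in idea. The class and function names are invented for this sketch, the reductions collapse to plain loops, and the LayerNorm variant uses the same sum and sum-of-squares accumulation as the pseudocode earlier in this section.

```python
import math

class RMSNormStats:
    """Accumulator is the sum of squares; the statistic is inv_rms."""
    @staticmethod
    def init():
        return 0.0
    @staticmethod
    def accumulate(acc, v):
        return acc + v * v
    @staticmethod
    def compute(acc, k, eps):
        return 1.0 / math.sqrt(acc / k + eps)
    @staticmethod
    def normalize(v, stat):
        return v * stat

class LayerNormStats:
    """Accumulator is (sum, sum of squares); the statistic is (mean, inv_std)."""
    @staticmethod
    def init():
        return (0.0, 0.0)
    @staticmethod
    def accumulate(acc, v):
        return (acc[0] + v, acc[1] + v * v)
    @staticmethod
    def compute(acc, k, eps):
        mean = acc[0] / k
        var = acc[1] / k - mean * mean
        return (mean, 1.0 / math.sqrt(var + eps))
    @staticmethod
    def normalize(v, stat):
        return (v - stat[0]) * stat[1]

def norm_row(stats, row, eps=1e-5):
    """Shared skeleton: accumulate over the row, finalize, then normalize each element."""
    acc = stats.init()
    for v in row:
        acc = stats.accumulate(acc, v)
    stat = stats.compute(acc, len(row), eps)
    return [stats.normalize(v, stat) for v in row]

row = [1.0, 2.0, 3.0, 4.0]
rms_out = norm_row(RMSNormStats, row)
ln_out = norm_row(LayerNormStats, row)
```

norm_row never mentions RMS or LN; swapping the Stats class is the only change, which is the same property the CUDA templates get at compile time.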

The complete code is available in the repository. That concludes the walkthrough of the Norm family.


LeetGPU Exercise 07: Norm-family code implementations
