Work in progress…

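Prerequisites#

0. Demystification

1. Speedrunning the environment setup

2. Reviewing key C++ concepts

3. The essence of GPU parallelism | hardware architecture × programming model

4. Computing global coordinates in CUDA

5. Element-wise operators in detail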

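Introduction#

In the previous installment of this CUDA learning series, we wrote our first kernel operator. A common follow-up question is: "How do I cleanly plug my kernel into a Python workflow?"

PyTorch ships the torch.utils.cpp_extension toolbox, which compiles C++/CUDA code into a Python module at very little cost. This post walks through the three mainstream approaches and explains why custom operators are needed, how to avoid environment pitfalls, and how to make a custom kernel work seamlessly with autograd.

Learning goals for this post

Master the three ways to call CUDA from PyTorch and when each one fits
Understand how BuildExtension, CUDAExtension, and load work internally
Learn to write the backward pass of a custom operator with torch.autograd.Function
Be able to turn a standalone CUDA kernel into a Python package that installs with pip install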

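Core concepts at a glance#

Before diving into code, here are a few key objects that are easy to confuse:

Concept	Purpose	Typical location
`torch.utils.cpp_extension.load_inline`	Compiles C++/CUDA code held in strings	Jupyter / quick prototypes
`torch.utils.cpp_extension.load`	Compiles the given `.cpp`/`.cu` files	Ad hoc compilation inside a script
`CUDAExtension` / `CppExtension`	Declares an extension module in `setup.py`	Packaging for distribution
`BuildExtension`	Replaces the default `setuptools` build command and injects PyTorch's compile flags	`cmdclass` in `setup.py`
`PYBIND11_MODULE`	Macro that exposes C++ functions to Python	End of the `.cpp` file
`torch.autograd.Function`	Wrapper class for an operator's forward and backward passes	Making an operator differentiable

How they relate

load and load_inline are JIT conveniences built on the same compilation logic; setup.py + CUDAExtension is the ahead-of-time route and suits production better.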

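Method 1: `load_inline` just-in-time compilation#

Use case: quickly validating a small kernel in a Jupyter Notebook, or writing a one-off experiment script.

The JIT builds performed by load and load_inline require Ninja; install it beforehand with uv pip install ninja.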
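Before triggering a JIT build it can help to confirm that Ninja is actually reachable. The sketch below is one way to do that with the standard library only; the helper name `ninja_version` is made up for this illustration. PyTorch also offers `torch.utils.cpp_extension.is_ninja_available()` for the same purpose.

```python
import shutil
import subprocess


def ninja_version():
    """Return the Ninja version string, or None if ninja is not on PATH."""
    exe = shutil.which("ninja")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None


version = ninja_version()
if version is None:
    print("ninja not found; install it first, e.g. `uv pip install ninja`")
else:
    print(f"ninja {version} is available")
```

Running this in the same environment as the notebook or script matters, because Ninja installed into a different virtual environment will not be found.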

import torch
from torch.utils.cpp_extension import load_inline

# C++ declaration; load_inline generates the Python binding from it
cpp_src = """
torch::Tensor add_cuda(torch::Tensor a, torch::Tensor b);
"""

# CUDA implementation (note: no PYBIND11_MODULE macro here)
cuda_src = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void add_kernel(const float* a, const float* b, float* c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

torch::Tensor add_cuda(torch::Tensor a, torch::Tensor b) {
    // Assumes contiguous float32 CUDA tensors of equal size
    auto c = torch::empty_like(a);
    int N = a.numel();
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;  // ceiling division
    add_kernel<<<blocks, threads>>>(a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), N);
    return c;
}
"""

# Compile and load automatically; add_cuda is bound as the module's add_cuda method
module = load_inline(
    name="inline_cuda_add",
    cpp_sources=cpp_src,
    cuda_sources=cuda_src,
    functions=["add_cuda"],          # names of the functions to expose
    verbose=True
)

# Test
a = torch.randn(1000, device='cuda')
b = torch.randn(1000, device='cuda')
c = module.add_cuda(a, b)            # call the exposed C++ function
print(torch.allclose(c, a + b))      # True
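The host function in the listing sizes its grid with the ceiling division `(N + threads - 1) / threads`. A small host-side sketch in plain Python (the helper name `launch_config` is hypothetical) shows how many blocks that yields and how many threads in the last block end up idle, which is why the kernel needs its `if (i < N)` guard.

```python
def launch_config(n, threads=256):
    """Mirror the ceiling division used to size the grid in the listing."""
    blocks = (n + threads - 1) // threads
    idle = blocks * threads - n  # threads that fail the `i < N` bounds check
    return blocks, idle


for n in (1000, 1024, 1025):
    blocks, idle = launch_config(n)
    print(f"N={n}: blocks={blocks}, idle threads={idle}")
```

For the 1000-element test tensors this gives 4 blocks with 24 idle threads, while a single extra element past a multiple of 256 costs a whole additional block.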

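Pros and cons

Pros: no files to manage, code and results sit side by side, and debugging is direct.
Cons: the first run, and every run after the source string changes, pays the full compile time, and compiler errors point into generated files, which makes them harder to trace.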

Method 2: `load` and precompiled extensions#

load is the file-based counterpart of load_inline; it suits scripts that compile existing .cu / .cpp files.

import torch
from torch.utils.cpp_extension import load

vector_add = load(
    name="vector_add_ext",
    sources=["add_kernel.cu"],
    verbose=True,
    extra_cuda_cflags=["-O3"]
)

x = torch.ones(10, device='cuda')
y = torch.ones(10, device='cuda')
print(vector_add.add(x, y))  # add is the name bound in PYBIND11_MODULE

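The corresponding add_kernel.cu is similar to the vector_add.cu shown later. load writes its build artifacts under ~/.cache/torch_extensions/, so a second run skips compilation and loads the cached module directly. If disk space runs low, the contents of that directory can simply be deleted; the extension is then rebuilt on the next run.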
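To see what the cache currently holds before deleting anything, a standard-library sketch like the following can list each cached build with its size. It assumes the Linux default location mentioned above, or the `TORCH_EXTENSIONS_DIR` environment variable when that is set; the helper names are invented for this example.

```python
import os
from pathlib import Path


def extension_cache_root():
    """Cache root: TORCH_EXTENSIONS_DIR if set, else ~/.cache/torch_extensions."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"


def dir_size_mb(path):
    """Total size of all regular files under `path`, in MiB."""
    total = sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())
    return total / (1024 * 1024)


root = extension_cache_root()
if root.is_dir():
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            print(f"{entry.name}: {dir_size_mb(entry):.1f} MiB")
else:
    print(f"no cache at {root} yet")
```

Depending on the PyTorch version, the top-level entries may be per-environment subfolders rather than individual extensions, so inspect the listing before removing a directory.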

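Common pitfalls

CUDA version mismatch: the build relies on the system nvcc, and a version that differs from torch.version.cuda can make compilation fail. Fix: install a CUDA toolkit that matches the CUDA version PyTorch was built with.
Slow repeated builds: for larger codebases, switch to precompiling with setup.py.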
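The version-mismatch pitfall can be checked up front by comparing the two version strings. The sketch below works on sample strings so it runs anywhere; in practice the first argument would be `torch.version.cuda` and the second the output of `nvcc --version`. The function names and the three-way classification are choices made for this illustration, not PyTorch's own logic.

```python
import re


def cuda_major_minor(text):
    """Extract (major, minor) from '12.1' or 'release 12.1, V12.1.105'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no CUDA version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def describe_mismatch(torch_cuda, nvcc_output):
    """Compare the CUDA version PyTorch was built with against nvcc's."""
    built = cuda_major_minor(torch_cuda)
    local = cuda_major_minor(nvcc_output)
    if built == local:
        return "match"
    if built[0] == local[0]:
        return "minor mismatch"
    return "major mismatch"


# Sample strings stand in for torch.version.cuda and `nvcc --version`
print(describe_mismatch("12.1", "Cuda compilation tools, release 12.1, V12.1.105"))
print(describe_mismatch("12.1", "Cuda compilation tools, release 11.8, V11.8.89"))
```

A differing major version is the case most likely to break the build, so it is worth treating separately from a minor difference.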

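1. Parameters of load and load_inline#

load and load_inline are both JIT compilation tools provided by PyTorch, and their core parameters are largely the same.

verbose (bool): controls whether the detailed build log is printed (False builds silently, True prints the full compile commands).
name (str): the name of the generated Python module, also used as the directory name for intermediate build files. The call returns the compiled module object, so bind it to a variable, for example ops = load(name="ops", ...).
sources: the source files to compile.
functions (load_inline only): the list of C++ function names to expose to Python. Each Python method gets the same name as its C++ function.
extra_cuda_cflags / extra_cflags: extra options passed to nvcc or to the C++ compiler.

Parameter	Applies to	Description
`sources`	`load`	List of `.cpp` or `.cu` files; the file type is detected automatically
`cpp_sources`	`load_inline`	C++ code string (usually just function declarations)
`cuda_sources`	`load_inline`	CUDA code string (contains the kernel implementation)

extra_cuda_cflags=["-O3", "-arch=sm_80"]   # applies to .cu files
extra_cflags=["-O3"]                       # applies to .cpp files

Common options: -O3 (highest optimization level), -g (debug symbols), -arch=sm_xx (target GPU compute capability).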
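The `sm_xx` suffix is simply the major and minor compute capability written together. As a sketch, a helper like the hypothetical `arch_flag` below builds the flag from a `(major, minor)` tuple, which is the shape `torch.cuda.get_device_capability()` returns on a machine with a GPU.

```python
def arch_flag(capability):
    """Turn a (major, minor) compute capability into an nvcc -arch flag."""
    major, minor = capability
    return f"-arch=sm_{major}{minor}"


def cuda_flags(capability):
    """Optimized nvcc flags for one target GPU."""
    return ["-O3", arch_flag(capability)]


# On a GPU machine: cuda_flags(torch.cuda.get_device_capability())
print(cuda_flags((8, 0)))
print(cuda_flags((8, 6)))
```

Hard-coding one architecture ties the binary to that GPU generation, so deriving the flag from the device in use is a reasonable default for JIT builds.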

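Method 3: `setup.py` and `CUDAExtension`#

This is the preferred choice for real projects. Write a setup.py, run pip install ., and the extension is installed permanently into the current Python environment.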

1. Project layout#

custom_ops/
├── setup.py
├── cuda_add/
│   └── vector_add.cu

2. Notes on writing the operator#

#include <torch/extension.h>
#include <cuda_runtime.h>

// High-throughput vector addition kernel (float4 vectorized loads and stores)
__global__ void add_kernel_vec4(const float* __restrict__ a,
                                const float* __restrict__ b,
                                float* __restrict__ c,
                                int N) {
    // Reinterpreting as float4 requires 16-byte-aligned pointers
    const float4* a4 = reinterpret_cast<const float4*>(a);
    const float4* b4 = reinterpret_cast<const float4*>(b);
    float4* c4 = reinterpret_cast<float4*>(c);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    int N4 = N / 4;  // number of complete float4 groups

    // Grid-stride loop over the vectorized part
    for (int i = idx; i < N4; i += stride) {
        float4 av = a4[i];
        float4 bv = b4[i];
        float4 cv;
        cv.x = av.x + bv.x;
        cv.y = av.y + bv.y;
        cv.z = av.z + bv.z;
        cv.w = av.w + bv.w;
        c4[i] = cv;
    }

    // Scalar tail: the last N % 4 elements
    int remainder_start = N4 * 4;
    for (int i = remainder_start + idx; i < N; i += stride) {
        c[i] = a[i] + b[i];
    }
}

torch::Tensor add_forward(torch::Tensor a, torch::Tensor b) {
    TORCH_CHECK(a.device().is_cuda(), "a must be CUDA tensor");
    TORCH_CHECK(b.device().is_cuda(), "b must be CUDA tensor");
    TORCH_CHECK(a.scalar_type() == torch::kFloat32 && b.scalar_type() == torch::kFloat32,
                "a and b must be float32");
    TORCH_CHECK(a.sizes().equals(b.sizes()), "a and b must have the same shape");
    // Note: a view with a storage offset (e.g. x[1:]) may still break the float4 alignment
    TORCH_CHECK(a.is_contiguous() && b.is_contiguous(), "a and b must be contiguous");
    auto c = torch::empty_like(a);
    int N = a.numel();
    const int threads = 256;
    int blocks = ((N / 4) + threads - 1) / threads;  // one thread per float4 group
    if (blocks == 0) blocks = 1;                     // still launch for N < 4
    add_kernel_vec4<<<blocks, threads>>>(
        a.data_ptr<float>(),
        b.data_ptr<float>(),
        c.data_ptr<float>(),
        N
    );
    return c;
}

// Autograd wrapper
class AddFunction : public torch::autograd::Function<AddFunction> {
public:
    static torch::Tensor forward(
        torch::autograd::AutogradContext* ctx,
        torch::Tensor a,
        torch::Tensor b) {
        // Shown for the general pattern; addition's backward does not read the inputs
        ctx->save_for_backward({a, b});
        return add_forward(a, b);
    }

    static torch::autograd::variable_list backward(
        torch::autograd::AutogradContext* ctx,
        torch::autograd::variable_list grad_output) {
        // d(a + b)/da = d(a + b)/db = 1, so the upstream gradient passes through
        auto grad = grad_output[0];
        return {grad, grad};
    }
};

torch::Tensor add_autograd(torch::Tensor a, torch::Tensor b) {
    return AddFunction::apply(a, b);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("add", &add_autograd, "Vector addition with autograd");
    m.def("add_forward", &add_forward, "Raw forward (no autograd)");
}
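The kernel splits the work into a vectorized part (groups of four floats) and a scalar tail. A quick way to convince yourself that every element is written exactly once is to replay the index arithmetic on the CPU. The sketch below is a plain-Python simulation of that arithmetic, not GPU code; `simulate_vec4_add` is a name invented for this illustration, and it reuses the same block-count formula as the host function.

```python
from collections import Counter


def simulate_vec4_add(n, threads=256):
    """Count how often each output index is written by the simulated kernel."""
    blocks = max(1, ((n // 4) + threads - 1) // threads)
    stride = blocks * threads
    n4 = n // 4
    writes = Counter()
    for idx in range(stride):                   # one iteration per simulated thread
        for i in range(idx, n4, stride):        # vectorized grid-stride loop
            for lane in range(4):
                writes[4 * i + lane] += 1
        for i in range(n4 * 4 + idx, n, stride):  # scalar tail
            writes[i] += 1
    return writes


def covers_exactly_once(n):
    writes = simulate_vec4_add(n)
    return len(writes) == n and all(writes[i] == 1 for i in range(n))


for n in (0, 3, 1000, 1003):
    print(n, covers_exactly_once(n))
```

Sizes that are not a multiple of four, such as 3 and 1003, exercise the tail loop, while 0 checks that the forced single block does no out-of-range work.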

3. Writing `setup.py`#

from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="cuda_add",
    version="0.1.0",
    ext_modules=[
        CUDAExtension(
            name="cuda_add",  # module name used in `import cuda_add`
            sources=["cuda_add/vector_add.cu"],  # path relative to setup.py
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]}
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    install_requires=["torch"],
)

4. Installing and using#

pip install .             # if the build cannot import torch, add --no-build-isolation
# legacy alternative: python setup.py install

import torch
import cuda_add

a = torch.randn(10_000_000, device='cuda')
b = torch.randn(10_000_000, device='cuda')
c = cuda_add.add(a, b)          # with autograd
c_no_grad = cuda_add.add_forward(a, b)  # forward only, slightly faster

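Why precompile?

Compile once, import anywhere: no JIT wait at import time.
Clear dependency management: you can pin the PyTorch version and package the extension for PyPI.
Friendlier errors: a failed build produces a complete log.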

Autograd#

See the mycuda tutorial series.
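As a rough Python-side counterpart to the C++ `AddFunction`, the sketch below wraps an addition in `torch.autograd.Function`. To stay runnable without the compiled extension it uses `torch.add` as a stand-in for the kernel call, which is an assumption of this example; in a real project the forward would call something like `cuda_add.add_forward`.

```python
import torch


class AddFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        # Stand-in for the compiled kernel, e.g. cuda_add.add_forward(a, b)
        return torch.add(a, b)

    @staticmethod
    def backward(ctx, grad_output):
        # d(a + b)/da = d(a + b)/db = 1, so the upstream gradient passes through
        return grad_output, grad_output


# Double precision keeps the numerical gradient check well conditioned
a = torch.randn(8, dtype=torch.double, requires_grad=True)
b = torch.randn(8, dtype=torch.double, requires_grad=True)

out = AddFunction.apply(a, b)
out.sum().backward()
grads_are_ones = torch.equal(a.grad, torch.ones_like(a))

# gradcheck compares the analytic backward against finite differences
gradcheck_ok = torch.autograd.gradcheck(AddFunction.apply, (a, b))
print(grads_are_ones, gradcheck_ok)
```

Running `gradcheck` on small double-precision inputs is a cheap way to catch a wrong backward before wiring the operator into a model.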

Full example: RGB to grayscale and 3D volume normalization#

The following shows how to wrap two classic CUDA tasks as a PyTorch extension.

1. Operator definitions#

#include <torch/extension.h>
#include <cuda_runtime.h>

// uchar3 is already provided by the CUDA headers, so it must not be redefined here

__global__ void rgbToGrayKernel(const uchar3* img, unsigned char* gray,
                                int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        uchar3 pixel = img[idx];
        // x, y, z hold R, G, B; + 0.5f rounds to nearest instead of truncating
        gray[idx] = (unsigned char)(0.299f * pixel.x + 0.587f * pixel.y + 0.114f * pixel.z + 0.5f);
    }
}

torch::Tensor rgb_to_gray_cuda(torch::Tensor img) {
    // Expects a contiguous uint8 [H, W, 3] tensor in RGB order, already on the GPU
    TORCH_CHECK(img.is_cuda(), "Input must be a CUDA tensor");
    TORCH_CHECK(img.dim() == 3 && img.size(2) == 3, "Input must be HxWx3");
    TORCH_CHECK(img.dtype() == torch::kUInt8, "Input must be uint8");
    TORCH_CHECK(img.is_contiguous(), "Input must be contiguous");
    int height = img.size(0);
    int width = img.size(1);
    auto gray = torch::empty({height, width}, img.options().dtype(torch::kUInt8));

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    rgbToGrayKernel<<<grid, block>>>(
        reinterpret_cast<const uchar3*>(img.data_ptr<unsigned char>()),
        gray.data_ptr<unsigned char>(),
        width, height
    );
    return gray;
}

__global__ void normalizeVolumeKernel(const unsigned short* in, float* out,
                                      int dimX, int dimY, int dimZ, float maxVal) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < dimX && y < dimY && z < dimZ) {
        int idx = z * dimY * dimX + y * dimX + x;  // x is the fastest-varying axis
        out[idx] = (float)in[idx] / maxVal;
    }
}

torch::Tensor normalize_volume_cuda(torch::Tensor volume, float maxVal) {
    TORCH_CHECK(volume.is_cuda(), "Volume must be a CUDA tensor");
    TORCH_CHECK(volume.dim() == 3, "Volume must be 3D");
    // torch::kUInt16 requires a PyTorch release with uint16 support
    TORCH_CHECK(volume.dtype() == torch::kUInt16, "Input must be uint16");
    TORCH_CHECK(volume.is_contiguous(), "Volume must be contiguous");
    int dimX = volume.size(2);  // PyTorch orders dimensions as DxHxW, treated here as Z, Y, X
    int dimY = volume.size(1);
    int dimZ = volume.size(0);
    auto out = torch::empty_like(volume, volume.options().dtype(torch::kFloat32));

    dim3 block(8, 8, 4);
    dim3 grid((dimX + 7) / 8, (dimY + 7) / 8, (dimZ + 3) / 4);
    normalizeVolumeKernel<<<grid, block>>>(
        volume.data_ptr<unsigned short>(),
        out.data_ptr<float>(),
        dimX, dimY, dimZ, maxVal
    );
    return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("rgb_to_gray", &rgb_to_gray_cuda, "RGB to grayscale (CUDA)");
    m.def("normalize_volume", &normalize_volume_cuda, "3D volume normalization (CUDA)");
}

2. Python test code#

import torch
import cv2
import custom_image_ops  # name of the compiled module

# RGB to grayscale test
img = cv2.imread("input.jpg")               # OpenCV loads images in BGR order
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # the kernel expects RGB order
img_tensor = torch.from_numpy(img).cuda()   # HxWx3, uint8
gray = custom_image_ops.rgb_to_gray(img_tensor)
cv2.imwrite("gray.png", gray.cpu().numpy())

# 3D normalization test
# uint16 operator coverage is limited in PyTorch, so sampling as int32 and casting is the safer route
vol = torch.randint(0, 4096, (128, 256, 256), dtype=torch.int32, device='cuda').to(torch.uint16)
norm = custom_image_ops.normalize_volume(vol, 4095.0)
print(norm.min(), norm.max())  # values lie within [0.0, 1.0]
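When checking GPU results it helps to have a tiny CPU reference with known answers. The sketch below recomputes both operations in plain Python for a handful of hand-picked values; the function names are made up for this example. The grayscale reference rounds to nearest, so if the kernel truncates instead, results may differ by one gray level.

```python
def rgb_to_gray_ref(pixels):
    """Reference luminance for (R, G, B) tuples, rounded to nearest."""
    return [int(0.299 * r + 0.587 * g + 0.114 * b + 0.5) for r, g, b in pixels]


def normalize_ref(values, max_val):
    """Reference normalization: divide each raw value by max_val."""
    return [v / max_val for v in values]


samples = [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
print(rgb_to_gray_ref(samples))
print(normalize_ref([0, 2047, 4095], 4095.0))
```

Feeding the same few pixels through the extension and comparing against these values is a quick sanity check for channel order, since swapped red and blue channels change the results for the pure-color samples.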


CUDA Learning Path [6]: The Complete Guide to PyTorch CUDA Extensions
