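Implementing the Sigmoid Function

Background

The Sigmoid function is defined as $\sigma(x)=\frac{1}{1+e^{-x}}$ and squashes any real number into the interval $(0,1)$. Its derivative is $\sigma'(x)=\sigma(x)(1-\sigma(x))$, which reaches its maximum value of $0.25$ at $x=0$ and decays rapidly toward $0$ on both sides.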
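
As a quick numerical check of the two formulas above, here is a minimal Python sketch that uses only the standard library. The names `sigmoid`, `sigmoid_grad`, and `central_difference` are illustrative choices and are not part of the exercise template.

```python
import math

def sigmoid(x: float) -> float:
    """Evaluate 1 / (1 + e^(-x)) for a single real number."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Finite-difference estimate of f'(x), used only as a cross-check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

if __name__ == "__main__":
    # The analytic and finite-difference columns should agree closely,
    # and the gradient should peak at x = 0.
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:5.1f}  grad={sigmoid_grad(x):.6f}  "
              f"fd={central_difference(sigmoid, x):.6f}")
```

The finite-difference column is there only to confirm the closed-form derivative; in practice the identity is cheaper because it reuses the already computed $\sigma(x)$.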

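Mathematical properties:

Monotonically increasing, smooth, and differentiable everywhere
Saturation: for $|x|>5$ the gradient is close to $0$ (below about $0.007$), which leads to vanishing gradients
Point symmetry: the curve is symmetric about the point $(0, 0.5)$, since $\sigma(0)=0.5$ and $\sigma(-x)=1-\sigma(x)$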
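
The three properties listed above can be spot-checked numerically. The sketch below samples a grid of points in $[-10, 10]$; the grid spacing and the variable names are arbitrary choices made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)

# Sample points from -10.0 to 10.0 in steps of 0.5.
xs = [i * 0.5 for i in range(-20, 21)]

# Monotonicity: every sample is strictly smaller than the next one.
is_increasing = all(sigmoid(a) < sigmoid(b) for a, b in zip(xs, xs[1:]))

# Symmetry: sigma(-x) should equal 1 - sigma(x) up to rounding error.
max_symmetry_error = max(abs(sigmoid(-x) - (1.0 - sigmoid(x))) for x in xs)

# Saturation: the gradient at |x| = 5 is a small fraction of the peak 0.25.
grad_at_5 = sigmoid_grad(5.0)

if __name__ == "__main__":
    print("increasing:", is_increasing)
    print("max symmetry error:", max_symmetry_error)
    print("gradient at x=5 relative to peak:", grad_at_5 / 0.25)
```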

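Applications in deep learning:

Classification output layer: converts the network output $z$ into the positive-class probability $P(y=1|x)=\sigma(z)$; paired with a cross-entropy loss, this gives stable training.
Legacy hidden-layer activation: common in early fully connected networks, but largely replaced by the ReLU family because of vanishing gradients in the saturation regions.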
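
One way to see why Sigmoid pairs well with cross-entropy: the gradient of the binary cross-entropy loss with respect to the logit $z$ simplifies to $\sigma(z)-y$, so it does not vanish when the prediction is confidently wrong, even though $\sigma'(z)$ itself is tiny there. The sketch below checks this numerically; the helper names `bce_loss`, `bce_grad_wrt_logit`, and `numeric_grad` are hypothetical and chosen only for this illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(z: float, y: float) -> float:
    """Binary cross-entropy for a logit z and a label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_grad_wrt_logit(z: float, y: float) -> float:
    """Analytic gradient dL/dz, which simplifies to sigma(z) - y."""
    return sigmoid(z) - y

def numeric_grad(z: float, y: float, h: float = 1e-6) -> float:
    """Central-difference estimate of dL/dz, used as a cross-check."""
    return (bce_loss(z + h, y) - bce_loss(z - h, y)) / (2.0 * h)

if __name__ == "__main__":
    # z = -6 with label 1 is a confidently wrong prediction in the saturated region.
    z, y = -6.0, 1.0
    s = sigmoid(z)
    print("sigmoid'(z)      :", s * (1.0 - s))             # tiny
    print("dL/dz (analytic) :", bce_grad_wrt_logit(z, y))   # close to -1
    print("dL/dz (numeric)  :", numeric_grad(z, y))
```

This direct form of `bce_loss` is for moderate logits only; for very large $|z|$ the `log` arguments can round to zero, which is why frameworks typically compute the loss from logits directly.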

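Problem Description

Difficulty: Easy

Task: Write a GPU program that applies the Sigmoid activation function element-wise to a vector of 32-bit floating-point numbers. For each element $x$ of the input vector $X$, compute

$$\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}}$$

and store the result in the output vector $Y$. The Sigmoid function maps any real number into the interval $(0,1)$.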

Implementation requirements:

Only the C standard library and the CUDA runtime library may be used
The final result must be stored in Y

Example Input/Output

Test these yourself:

Example 1:

Input  X = [0.0, 1.0, -1.0, 2.0]
Output Y = [0.5, 0.7311, 0.2689, 0.8808]

Example 2:

Input  X = [0.5, -0.5, 3.0, -3.0]
Output Y = [0.6225, 0.3775, 0.9526, 0.0474]
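
To reproduce the example values on the CPU before writing any GPU code, a small reference implementation is enough. The sketch below is a plain-Python reference, not one of the template versions; `sigmoid_vector` is a hypothetical helper name.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_vector(xs):
    """Apply sigmoid element-wise to a list of floats."""
    return [sigmoid(x) for x in xs]

if __name__ == "__main__":
    # The two inputs are the example vectors from this section.
    for xs in ([0.0, 1.0, -1.0, 2.0], [0.5, -0.5, 3.0, -3.0]):
        ys = sigmoid_vector(xs)
        print("X =", xs, "-> Y =", [round(y, 4) for y in ys])
```

A reference like this is also a convenient baseline for comparing GPU output element by element within a small tolerance, since the GPU versions compute in 32-bit floats.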

Template Code

CUDA Version

You only need to fill in sigmoid_kernel.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <math.h>

// ========== Implement sigmoid_kernel here ==========
__global__ void sigmoid_kernel(float* x, float* y, int N) {
    // TODO: compute the global index, guard against out-of-bounds access, compute sigmoid
    // Hint: use expf for the exponential

}
// ====================================================

int main() {
    // Typical test values
    float typical_vals[] = {-10.0f, -5.0f, -2.0f, -1.0f, -0.5f,
                             0.0f, 0.5f, 1.0f, 2.0f, 5.0f, 10.0f};
    int num_typical = sizeof(typical_vals) / sizeof(float);

    int repeat = 100000;          // number of times each typical value is repeated
    int N = num_typical * repeat;
    size_t bytes = N * sizeof(float);

    // Allocate and initialize host memory
    float* hx = (float*)malloc(bytes);
    float* hy = (float*)malloc(bytes);
    for (int r = 0; r < repeat; r++) {
        for (int i = 0; i < num_typical; i++) {
            hx[r * num_typical + i] = typical_vals[i];
        }
    }

    // Allocate device memory and copy the input to the device
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);

    // Configure the kernel launch parameters
    int threads_per_block = 256;
    int blocks_per_grid = (N + threads_per_block - 1) / threads_per_block;

    // Launch the kernel
    sigmoid_kernel<<<blocks_per_grid, threads_per_block>>>(dx, dy, N);
    cudaDeviceSynchronize();  // wait for the kernel to finish

    // Copy the result back to the host
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("   x    ->   sigmoid(x)\n");
    printf("-----------------------\n");
    for (int i = 0; i < num_typical; i++) {
        printf("%6.2f    ->   %8.6f\n", hx[i], hy[i]);
    }

    // Free memory
    free(hx);
    free(hy);
    cudaFree(dx);
    cudaFree(dy);
    cudaDeviceReset();
    return 0;
}
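
The launch configuration in `main` rounds the number of blocks up, so the grid usually contains a few more threads than there are elements. That is why the kernel needs a bounds check. The Python sketch below only reproduces the arithmetic for the template's sizes; `ceil_div` and `launch_config` are hypothetical helper names, and no GPU is involved.

```python
def ceil_div(n: int, d: int) -> int:
    """Integer ceiling division, same formula as (N + threads - 1) / threads in the template."""
    return (n + d - 1) // d

def launch_config(n: int, threads_per_block: int = 256):
    """Return (blocks_per_grid, surplus_threads) for a 1-D launch over n elements."""
    blocks = ceil_div(n, threads_per_block)
    surplus = blocks * threads_per_block - n  # threads whose index is >= n
    return blocks, surplus

# Sizes taken from the template: 11 typical values repeated 100000 times.
N = 11 * 100000
blocks, surplus = launch_config(N)

if __name__ == "__main__":
    print(f"N={N}: {blocks} blocks of 256 threads, {surplus} threads must be masked out")
```

For these sizes the last block is only partly used, so any thread whose global index is not below `N` has to skip its load and store.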

Triton Version

import torch
import triton
import triton.language as tl

# ========== Implement sigmoid_kernel here ==========
@triton.jit
def sigmoid_kernel(x_ptr, y_ptr, N, BLOCK_SIZE: tl.constexpr):
    # TODO: get the starting index of the current program
    pass

def solve(X):
    N = X.numel()
    Y = torch.empty_like(X)

    # Configure the block size and grid size
    BLOCK_SIZE = 256
    grid = (triton.cdiv(N, BLOCK_SIZE),)

    sigmoid_kernel[grid](X, Y, N, BLOCK_SIZE=BLOCK_SIZE)
    return Y

def main():
    # Typical test values
    typical_vals = torch.tensor([-10.0, -5.0, -2.0, -1.0, -0.5,
                                  0.0, 0.5, 1.0, 2.0, 5.0, 10.0], dtype=torch.float32)
    num_typical = typical_vals.numel()
    repeat = 100000
    N = num_typical * repeat

    hx = typical_vals.repeat(repeat).cuda()

    # Call solve
    hy = solve(hx)

    # Print the results
    print("   x    ->   sigmoid(x)")
    print("-----------------------")
    for i in range(num_typical):
        print(f"{hx[i]:6.2f}    ->   {hy[i]:8.6f}")


if __name__ == "__main__":
    main()
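
In the Triton template, each program instance handles one block of `BLOCK_SIZE` consecutive elements, and the grid is rounded up with `triton.cdiv`. A common way to organize such a kernel is to compute per-block offsets and mask out those past `N`. The sketch below emulates that pattern in plain Python so it can be run without a GPU. It is not Triton code, and `run_program` and `solve_emulated` are hypothetical names assumed for this illustration.

```python
import math

def run_program(pid: int, x, y, n: int, block_size: int) -> None:
    """Emulate one program instance working on its own block of elements."""
    # Offsets covered by this program: pid * block_size + [0, block_size).
    offsets = [pid * block_size + i for i in range(block_size)]
    # Mask marks which offsets are inside the vector.
    mask = [off < n for off in offsets]
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:  # masked-out lanes neither load nor store
            y[off] = 1.0 / (1.0 + math.exp(-x[off]))

def solve_emulated(x, block_size: int = 4):
    """Run every program of the rounded-up grid sequentially."""
    n = len(x)
    y = [0.0] * n
    grid = (n + block_size - 1) // block_size  # same rounding as triton.cdiv
    for pid in range(grid):
        run_program(pid, x, y, n, block_size)
    return y

if __name__ == "__main__":
    # 10 elements with a block size of 4: the third program has two masked lanes.
    xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
    print(solve_emulated(xs))
```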

PyTorch Version

import torch

# ========== Implement the sigmoid function here ==========
def sigmoid(x):
    # Implement sigmoid manually; torch.sigmoid below is a placeholder to replace
    return torch.sigmoid(x)


def solve(X):
    return sigmoid(X)

def main():
    # Typical test values
    typical_vals = torch.tensor([-10.0, -5.0, -2.0, -1.0, -0.5,
                                  0.0, 0.5, 1.0, 2.0, 5.0, 10.0], dtype=torch.float32)
    num_typical = typical_vals.numel()
    repeat = 100000
    N = num_typical * repeat

    hx = typical_vals.repeat(repeat).cuda()

    hy = solve(hx)

    print("   x    ->   sigmoid(x)")
    print("-----------------------")
    for i in range(num_typical):
        print(f"{hx[i]:6.2f}    ->   {hy[i]:8.6f}")

if __name__ == "__main__":
    main()
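
When writing Sigmoid by hand, one detail worth knowing is that the direct formula evaluates $e^{-x}$, which overflows for large negative $x$. In plain Python, `math.exp` raises `OverflowError` in that case. A common workaround, shown here as a sketch and not required by the exercise, is to branch on the sign of $x$ so that the exponent is never positive. The name `stable_sigmoid` is a hypothetical choice.

```python
import math

def naive_sigmoid(x: float) -> float:
    """Direct formula; math.exp(-x) overflows when x is very negative."""
    return 1.0 / (1.0 + math.exp(-x))

def stable_sigmoid(x: float) -> float:
    """Sign-split form: the argument passed to exp is always <= 0."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # x < 0, so e lies in [0, 1)
    return e / (1.0 + e)     # algebraically equal to 1 / (1 + e^(-x))

if __name__ == "__main__":
    try:
        naive_sigmoid(-1000.0)
    except OverflowError as err:
        print("naive form failed:", err)
    print("stable form:", stable_sigmoid(-1000.0), stable_sigmoid(1000.0))
```

For the test values in the templates, which lie in $[-10, 10]$, both forms give the same results; the split only matters for inputs of much larger magnitude.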


LeetGPU Exercise 01: Manual Sigmoid Implementation
