LeetGPU Exercise 02: Matrix Copy - 杜子源的博士居


Operator Overview#

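Name	Description
Vector Addition	Adds two vectors element by element
Matrix Addition	Adds two matrices element by element
Matrix Copy	Copies a matrix element by element
Color Inversion	Inverts each pixel independently
Reverse Array	Reverses an array; each element moves independently
ReLU	Applies ReLU to each element
Leaky ReLU	Applies Leaky ReLU to each element
Sigmoid Activation	Applies the sigmoid function to each element
Value Clipping	Clips each element to a given range
Sigmoid Linear Unit (SiLU)	Element-wise SiLU activation
Swish-Gated Linear Unit (SWiGLU)	Element-wise SWiGLU (the gating part is also element-wise)
Gaussian Error Gated Linear Unit (GEGLU)	Element-wise GEGLU activation
RGB to Grayscale	Converts each pixel independently, with no dependence on neighbors
Interleave Arrays	Interleaves two arrays; each output element depends only on the input at the corresponding position
Rotary Positional Embedding	Applies a rotation matrix independently at each position
Weight Dequantization	Dequantizes each weight independently
INT8 Quantized MatMul (dequantization part only)	The dequantization part is element-wise; the operator as a whole is not
Simple Inference	The linear-layer forward pass includes a matrix multiply and is not element-wise, but its activation part may be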

1. Matrix Copy#

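1.1. Problem#

Implement a program that copies the 32-bit floating-point elements of an input matrix A directly into an output matrix B on the GPU.
That is, B[i][j] = A[i][j] for every valid index (i, j).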
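The solutions further down index both matrices as flat buffers rather than with (i, j) pairs. A minimal pure-Python sketch of why that works, assuming A and B are square N × N matrices stored contiguously in row-major order (the helper name `copy_flat` is hypothetical and not part of the LeetGPU interface):

```python
def copy_flat(a_flat, b_flat, n):
    """Copy an n x n row-major matrix stored as a flat list (hypothetical helper)."""
    for i in range(n):
        for j in range(n):
            idx = i * n + j  # row-major offset of element (i, j)
            b_flat[idx] = a_flat[idx]


n = 2
a = [1.0, 2.0, 3.0, 4.0]  # [[1.0, 2.0], [3.0, 4.0]] flattened row by row
b = [0.0] * (n * n)
copy_flat(a, b, n)
print(b)  # [1.0, 2.0, 3.0, 4.0]
```

Because source and destination use the same offset `i * n + j`, the two nested loops collapse into a single loop over `n * n` elements, which is the form a GPU kernel parallelizes.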

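1.2. Requirements#

External libraries are not allowed
The signature of the solve function must remain unchanged
The final result must be stored in matrix B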

1.3. Examples#

Example 1:

Input:

A = [[1.0, 2.0],
     [3.0, 4.0]]

Output:

B = [[1.0, 2.0],
     [3.0, 4.0]]

Example 2:

Input:

A = [[5.5, 6.6, 7.7],
     [8.8, 9.9, 10.1],
     [11.2, 12.3, 13.4]]

Output:

B = [[5.5, 6.6, 7.7],
     [8.8, 9.9, 10.1],
     [11.2, 12.3, 13.4]]

1.4. Constraints#

1 ≤ N ≤ 4096
All elements are 32-bit floating-point numbers
Performance is measured with N = 4096
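A copy does no arithmetic, so its cost is the memory traffic. The arithmetic below is a rough sketch of how much data moves at the benchmark size, assuming A and B are square N × N matrices and that each element is read once and written once (the function name `effective_bandwidth_gib_s` is hypothetical):

```python
N = 4096
BYTES_PER_FLOAT = 4  # 32-bit float

matrix_bytes = N * N * BYTES_PER_FLOAT  # size of one matrix
traffic_bytes = 2 * matrix_bytes        # one read of A plus one write of B


def effective_bandwidth_gib_s(elapsed_ms: float) -> float:
    """Effective bandwidth in GiB/s for one copy that took elapsed_ms milliseconds."""
    return traffic_bytes / (elapsed_ms * 1e-3) / 2**30


print(matrix_bytes // 2**20, "MiB per matrix")  # 64 MiB per matrix
print(traffic_bytes // 2**20, "MiB moved")      # 128 MiB moved
```

Dividing the 128 MiB of traffic by a measured kernel time gives a figure that can be compared against the GPU's peak memory bandwidth.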

2. PyTorch Solution#

import torch

def solve(A: torch.Tensor, B: torch.Tensor, N: int):
    """Copy A into B in place."""
    B.copy_(A)  # in-place copy: writes into B's existing storage

4. Triton Solution#

import torch
import triton
import triton.language as tl

# Let Triton pick the fastest block size / warp count for each problem size.
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 512}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 2048}, num_warps=8),
    ],
    key=['n_elements'],
)
@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance copies one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask off the lanes of the last block that fall past the end of the buffer.
    mask = offsets < n_elements
    x = tl.load(src_ptr + offsets, mask=mask)
    tl.store(dst_ptr + offsets, x, mask=mask)

def solve_triton(A, B, N):
    # Treat the matrix as a flat buffer of A.numel() elements.
    n_elements = A.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    copy_kernel[grid](A, B, n_elements)
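The block and mask arithmetic in the kernel can be traced on the CPU. The following is a pure-Python emulation, not Triton code: the names `cdiv` and `emulate_copy` are hypothetical, and the outer loop stands in for program instances that run in parallel on the GPU.

```python
def cdiv(a, b):
    """Ceiling division, the same quantity triton.cdiv computes."""
    return (a + b - 1) // b


def emulate_copy(src, block_size):
    """Emulate the blocked, masked copy sequentially (hypothetical helper)."""
    n_elements = len(src)
    dst = [None] * n_elements
    num_programs = cdiv(n_elements, block_size)
    for pid in range(num_programs):  # one iteration per program instance
        block_start = pid * block_size
        offsets = [block_start + k for k in range(block_size)]
        mask = [off < n_elements for off in offsets]
        for off, keep in zip(offsets, mask):
            if keep:  # masked-off lanes neither load nor store
                dst[off] = src[off]
    return dst, num_programs


src = [float(i) for i in range(10)]
dst, num_programs = emulate_copy(src, block_size=4)
print(num_programs)  # 3
print(dst == src)    # True
```

With 10 elements and a block size of 4, the third program instance covers offsets 8 to 11, and the mask disables offsets 10 and 11 so nothing is read or written out of bounds.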

Test results:

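3. CUDA Solution#

The usual three variants: a naive one-thread-per-element kernel, a grid-stride loop, and a float4 vectorized copy, followed by a small benchmark harness.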

#include <cuda_runtime.h>
#include <stdio.h>

// In all kernels below, N is the total number of elements, not the matrix side.

// Naive: one thread per element.
__global__ void copy_naive(const float* src, float* dst, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < N) {
        dst[tid] = src[tid];
    }
}

// Grid-stride loop: a fixed-size grid walks the whole array.
__global__ void copy_stride(const float* src, float* dst, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = tid; i < N; i += stride) {
        dst[i] = src[i];
    }
}

// Vectorized: each loop iteration moves one float4 (16 bytes).
// Requires src and dst to be 16-byte aligned, which holds for cudaMalloc pointers.
__global__ void copy_vec4(const float* src, float* dst, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    int vecN = N / 4;

    const float4* src4 = (const float4*)src;
    float4* dst4 = (float4*) dst;

    for (int i = tid; i < vecN; i += stride) {
        dst4[i] = src4[i];
    }

    // Tail: the first N % 4 threads each copy one leftover element.
    int tail_idx = vecN * 4 + tid;
    if (tail_idx < N) {
        dst[tail_idx] = src[tail_idx];
    }
}

// The template parameter T accepts any callable object.
template <typename T>
void run_bench(const char* name, T func, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    func(); // Warmup
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++) func();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // Each launch reads `bytes` and writes `bytes`; 1 GiB = 1.073741824e9 bytes.
    double gib_s = (bytes * 2.0 * 100) / (ms * 1e6 * 1.073741824);
    printf("%-12s : %.2f GiB/s | Avg: %.3f ms\n", name, gib_s, ms / 100.0f);

    cudaEventDestroy(start); cudaEventDestroy(stop);
}

int main() {
    // 4097 * 4097 is not a multiple of 4, so the tail path of copy_vec4 is exercised.
    const int N = 4097 * 4097, block = 256;
    const size_t bytes = N * sizeof(float);
    float *d_src, *d_dst;
    cudaMalloc(&d_src, bytes); cudaMalloc(&d_dst, bytes);

    int sms;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    int g_stride = sms * 32; // grid size for the grid-stride kernels: 32 blocks per SM

    run_bench("Naive", [=](){ copy_naive<<<(N+block-1)/block, block>>>(d_src, d_dst, N); }, bytes);
    run_bench("Stride", [=](){ copy_stride<<<g_stride, block>>>(d_src, d_dst, N); }, bytes);
    run_bench("Vec4", [=](){
        copy_vec4<<<g_stride, block>>>(d_src,d_dst, N);
    }, bytes);

    cudaFree(d_src); cudaFree(d_dst);
    return 0;
}
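The indexing in copy_vec4 is the easiest part to get wrong, so it can help to check on the CPU that every element is written exactly once. The sketch below is a sequential pure-Python emulation of the kernel's index arithmetic, not CUDA code; the function name `vec4_copy_indices` is hypothetical, and `total_threads` stands in for `gridDim.x * blockDim.x`.

```python
def vec4_copy_indices(n, total_threads):
    """Return the element indices written by an emulated copy_vec4-style launch."""
    vec_n = n // 4
    written = []
    for tid in range(total_threads):
        # Grid-stride loop over float4 groups: group i covers elements 4*i .. 4*i + 3.
        for i in range(tid, vec_n, total_threads):
            written.extend(range(4 * i, 4 * i + 4))
        # Tail: thread tid copies leftover element vec_n * 4 + tid, if it exists.
        tail_idx = vec_n * 4 + tid
        if tail_idx < n:
            written.append(tail_idx)
    return written


n, total_threads = 23, 4  # 5 float4 groups plus a 3-element tail
written = vec4_copy_indices(n, total_threads)
print(sorted(written) == list(range(n)))   # True: every index is covered
print(len(written) == len(set(written)))   # True: no index is written twice
print((4097 * 4097) % 4)                   # 1: the benchmark size has a 1-element tail
```

The tail handling covers all leftovers only when at least `n % 4` threads are launched, so at most 3 are needed. The benchmark launches 32 blocks of 256 threads per SM, which satisfies this on any device.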

Expected performance:
