>A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation.
Just as the learned weights that generate the K, Q, V matrices may produce zeros (or small values) for certain token pairs, convolution kernels have zeros enforced by construction: every weight outside the local window is fixed at zero.
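To make the local-vs-global contrast concrete, here is a toy sketch (sizes and kernel values are arbitrary, chosen just for illustration): a 1D convolution written out as a dense mixing matrix is banded, with structural zeros everywhere outside the kernel's window, while a softmax attention matrix is nonzero at every position, even where the values are tiny.

```python
import numpy as np

# A 1D convolution expressed as a dense weight matrix: each output
# position mixes only a local window of inputs, so the matrix is
# banded -- entries outside the kernel's reach are structurally zero.
seq_len = 6
kernel = np.array([0.25, 0.5, 0.25])  # arbitrary 3-tap kernel
half = len(kernel) // 2

conv_matrix = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j, w in enumerate(kernel):
        col = i + j - half
        if 0 <= col < seq_len:
            conv_matrix[i, col] = w

# Attention, by contrast, yields a fully dense mixing matrix: the
# softmax of query-key scores is nonzero everywhere (values can be
# small, but never exactly zero).
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, 4))
K = rng.normal(size=(seq_len, 4))
scores = Q @ K.T
attn_matrix = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print("structural zeros in conv matrix:", int((conv_matrix == 0).sum()))
print("exact zeros in attention matrix:", int((attn_matrix == 0).sum()))
```

The convolution matrix's zeros are fixed before any training happens; attention has to *learn* to down-weight distant tokens, and even then the weights only approach zero.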