>A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation.
Just as the learned weights that generate the K, Q, V matrices may produce zeros (or small values) for certain token pairs, convolution kernels have zeros enforced by construction: every weight outside the local window is fixed at zero.
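To make the local-vs-global contrast concrete, here is a toy sketch (sizes and kernel values are arbitrary, chosen just for illustration): a 1D convolution written out as a dense mixing matrix is banded, with structural zeros everywhere outside the kernel's window, while a softmax attention matrix is nonzero at every position, even where the values are tiny.

```python
import numpy as np

# A 1D convolution expressed as a dense weight matrix: each output
# position mixes only a local window of inputs, so the matrix is
# banded -- entries outside the kernel's reach are structurally zero.
seq_len = 6
kernel = np.array([0.25, 0.5, 0.25])  # arbitrary 3-tap kernel
half = len(kernel) // 2

conv_matrix = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j, w in enumerate(kernel):
        col = i + j - half
        if 0 <= col < seq_len:
            conv_matrix[i, col] = w

# Attention, by contrast, yields a fully dense mixing matrix: the
# softmax of query-key scores is nonzero everywhere (values can be
# small, but never exactly zero).
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, 4))
K = rng.normal(size=(seq_len, 4))
scores = Q @ K.T
attn_matrix = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print("structural zeros in conv matrix:", int((conv_matrix == 0).sum()))
print("exact zeros in attention matrix:", int((attn_matrix == 0).sum()))
```

The convolution matrix's zeros are fixed before any training happens; attention has to *learn* to down-weight distant tokens, and even then the weights only approach zero.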