'Questions about convolution (in CNN)

I suddenly came up with a question about convolution and just wanted to be clear if I'm missing something. The question is whether if the two operations below are identical.

Case1) Suppose we have a feature map C^2 x H x W. And, we have a K x K x C^2 Conv weight with stride S. (To be clear, C^2 is the channel dimension but just wanted to make it as a square number, K is the kernel size).

Case2) Suppose we have a feature map 1 x CH x CW. And, we have a CK x CK x 1 Conv weight with stride CS.


So, basically Case2 is a pixel-upshuffled version of case1 (both feature-map and Conv weight.) As convolutions are simply element-wise multiplication, both operations seem identical to me.

# given a feature map and a conv_weight, namely f_map, conv_weight

#case1)
convLayer = Conv(conv_weight)
result = convLayer(f_map, stride=1)

#case2)
f_map = pixelshuffle(f_map, scale=C)
conv_weight = pixelshuffle(f_map, scale=C)
result = convLayer(f_map, stride=C)

But this means that, (for example) given a 256xHxW feature-map with a 3x3 Conv (as in many deep learning models), performing a convolution was simply doing a HUUUGE 48x48 Conv to a 1 x 16*H x 16*W Feature map.

But this doesn't meet my basic intuition of CNNs, stacking multiple of layers with the smallest 3x3 Conv, resulting in somewhat large receptive field, and each channel having different (possibly redundant) information.



Solution 1:[1]

You can, in a sense, think of "folding" spatial information into the channel dimension. This is the rationale behind ResNet's trade-off between spatial resolution and feature dimension. In the ResNet case whenever they sample x2 in space they increase feature space x2. However, since you have two spatial dimensions and you sample x2 in both you effectively reduce the "volume" of the feature map by x0.5.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Shai