The first (input) layer holds the raw image, so its depth is 3, one slice per color channel → every layer after it produces a 3D output volume of neuron activations computed from the 3D volume it receives as input

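Each layer in a ConvNet transforms a 3D input volume into a 3D output volume with some differentiable function that may or may not have learnable parameters (for example, convolutional and fully connected layers have weights and biases, while ReLU and pooling layers have none)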
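
To make the "3D volume in, 3D volume out" idea concrete, here is a minimal NumPy sketch. It is an illustration only, not a full ConvNet: the 32x32 input size, the 8 output channels, and the function names `relu_layer` and `conv1x1_layer` are all assumptions chosen for the example. One layer has parameters (a 1x1 convolution) and one has none (ReLU).

```python
import numpy as np

def relu_layer(volume):
    """Parameter-free layer: elementwise max(0, x). Output shape equals input shape."""
    return np.maximum(volume, 0.0)

def conv1x1_layer(volume, weights, bias):
    """Layer with parameters: a 1x1 convolution that mixes channels at each pixel.

    volume:  (H, W, C_in)
    weights: (C_in, C_out)
    bias:    (C_out,)
    returns: (H, W, C_out)
    """
    return volume @ weights + bias

rng = np.random.default_rng(0)

# Hypothetical input: a 32x32 RGB image, so the input volume has depth 3.
image = rng.random((32, 32, 3))

# Hypothetical parameters mapping 3 input channels to 8 output channels.
weights = rng.standard_normal((3, 8)) * 0.1
bias = np.zeros(8)

activations = relu_layer(conv1x1_layer(image, weights, bias))
print(image.shape, "->", activations.shape)  # (32, 32, 3) -> (32, 32, 8)
```

Both functions are differentiable (almost everywhere, in the case of ReLU), which is what allows gradients to flow back through a stack of such layers during training.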

Layers in CNNs: