- very similar to ordinary neural networks from previous chapters
- made up of neurons, each with learnable weights and biases
- each neuron receives inputs, performs dot product and optionally follows with non-linearity
- the key difference between regular neural networks and CNNs is that CNN architectures explicitly assume the inputs are images → strictly speaking the inputs don't have to be images, but the architecture is designed around that assumption (more on this below)
- exploiting image structure lets the network extract key features and reduce the dimensionality of the data without losing key info → this reduces the total parameters required in the network
- additionally, CNNs take context into account and have a spatial understanding of what surrounds each pixel or region
- in regular nets, neurons receive input (a single vector) and transform it through a series of hidden layers → this architecture doesn't scale well to full images or ones with a high number of pixels
- CIFAR-10 images are 32x32x3 → super small, but with real images the input dimensionality quickly reaches tens of thousands and even millions of values, so a single fully-connected neuron would need that many weights
- in images, you have 3D data or a volume of neurons → CNNs take advantage of the input being an image and constrain the architecture in a sensible way to make it more efficient (fewer parameters, which also helps prevent overfitting)
- neurons in a ConvNet are arranged in 3D: width, height, depth → depth here means the third dimension of an activation volume, not the number of layers in the network
- CIFAR-10 images have a width of 32, height of 32, and depth of 3 (the R, G, B color channels)
- rather than fully connecting all the neurons, CNN neurons connect only to a small local region of the previous layer → loosely similar to dropout in that fewer connections are used, but here the sparsity is fixed and spatially structured rather than random
- final output layer would be 1x1x10 for CIFAR-10 → a single vector of class scores (10 is the number of possible classes; the model outputs a score/probability per class)
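The scaling argument above is just arithmetic; a minimal sketch (the helper function name is my own, not from the notes) makes the fully-connected parameter counts concrete:

```python
# Sketch: why fully-connected layers don't scale to images.
# One fully-connected neuron needs one weight per input value,
# i.e. width * height * depth weights for a 3D input volume.
def fc_weights_per_neuron(width, height, depth):
    """Number of weights a single fully-connected neuron needs."""
    return width * height * depth

print(fc_weights_per_neuron(32, 32, 3))    # CIFAR-10 image: 3072 weights
print(fc_weights_per_neuron(200, 200, 3))  # modest "real" image: 120000 weights
```

Even a modest 200x200 image already means 120K weights for every single neuron in the first hidden layer, which is why the fully-connected design breaks down.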

The input layer holds the raw image, and each subsequent layer produces a 3D output volume of neuron activations from the previous 3D input volume → the first layer holds raw image information, hence its depth is 3 (the three color channels)
Each layer in a ConvNet transforms a 3D input into a 3D output with some differentiable function that may or may not have parameters (e.g. CONV/FC have parameters; RELU/POOL do not)
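A minimal sketch of the "3D volume in, 3D volume out" idea, using ReLU as an example of a parameter-free differentiable layer (the shapes are illustrative):

```python
import numpy as np

# Each ConvNet layer maps a 3D activation volume to a 3D activation volume.
# ReLU is a differentiable function with NO parameters: it acts elementwise,
# so the output volume has exactly the same shape as the input volume.
x = np.random.randn(32, 32, 3)  # input volume: width x height x depth

def relu(volume):
    # elementwise threshold at zero; shape is preserved
    return np.maximum(0, volume)

out = relu(x)
print(out.shape)  # (32, 32, 3) -- same 3D volume, transformed elementwise
```

A CONV or FC layer does the same shape-to-shape mapping, except that it also carries learnable parameters (its filters/weights).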
Layers in CNNs:
- a simple ConvNet for CIFAR-10 classification could have the architecture: INPUT → CONV → RELU → POOL → FC
- INPUT: holds the raw pixel values of the image, here a volume of dimensions [32x32x3]
- CONV layer: computes the output of neurons that are connected to local regions in the input → dot product between their weights and the small region they are connected to (remember the nodes are not connected to all of the previous nodes, only a local patch) → the number of filters sets the output depth (12 filters means 12 stacked 2-D activation maps), giving an output volume of [32x32x12]
- RELU: elementwise activation function that thresholds at zero, max(0, x) → leaves the volume size unchanged ([32x32x12])
- POOL: performs a downsampling operation along the width and height (depth is unchanged) ⇒ reduces the volume to something manageable like [16x16x12]
- FC layer: computes the class scores, resulting in a volume of size [1x1x10], one score per class in the CIFAR-10 dataset
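The whole INPUT → CONV → RELU → POOL → FC pipeline can be sketched end-to-end for one CIFAR-10-sized input, tracking the volume shapes from the notes. The 3x3 filter size, zero padding, and random weights are illustrative assumptions, not anything the notes specify:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((32, 32, 3))          # INPUT: raw pixel volume

# CONV: 12 filters of size 3x3x3, zero-padded so width/height are preserved
filters = rng.standard_normal((12, 3, 3, 3)) * 0.1
padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
conv = np.zeros((32, 32, 12))
for i in range(32):
    for j in range(32):
        patch = padded[i:i + 3, j:j + 3, :]   # local region (receptive field)
        for f in range(12):
            conv[i, j, f] = np.sum(patch * filters[f])

relu = np.maximum(0, conv)                    # RELU: elementwise, shape unchanged

# POOL: 2x2 max pooling halves width and height; depth stays at 12
pool = relu.reshape(16, 2, 16, 2, 12).max(axis=(1, 3))

# FC: every output neuron sees the entire pooled volume -> 10 class scores
w = rng.standard_normal((10, pool.size)) * 0.01
scores = w @ pool.ravel()

print(conv.shape, pool.shape, scores.shape)  # (32, 32, 12) (16, 16, 12) (10,)
```

The naive loops make the local connectivity explicit; a real framework would compute the same thing with an optimized convolution routine.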