PCA

goal is to condense the data down to fewer dimensions while keeping the approximation that models it best
find the best fit by measuring each point's projected distance to a candidate line of best fit
- the projection of a point is where the perpendicular dropped from the point meets the line
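A minimal sketch of the projection step, assuming the candidate line passes through the origin and is given by a direction vector. The function name `project_onto_line` and the sample points are made up for illustration; they are not from the notes.

```python
import math

def project_onto_line(point, direction):
    """Project a 2d point onto the line through the origin along `direction`.

    Returns (t, foot, dist):
      t    - the point's 1d coordinate along the line
      foot - where the perpendicular from the point meets the line
      dist - the perpendicular distance from the point to the line
    """
    px, py = point
    dx, dy = direction
    norm = math.hypot(dx, dy)
    ux, uy = dx / norm, dy / norm            # unit vector along the line
    t = px * ux + py * uy                    # dot product = position along the line
    foot = (t * ux, t * uy)
    dist = math.hypot(px - foot[0], py - foot[1])
    return t, foot, dist

# Hypothetical, already-centered sample points.
points = [(-2.0, -1.0), (0.0, 0.0), (2.0, 1.0)]

# Compare two candidate lines by their total squared perpendicular distance.
for direction in [(1.0, 0.0), (2.0, 1.0)]:
    total = sum(project_onto_line(p, direction)[2] ** 2 for p in points)
    print(direction, round(total, 6))
```

The candidate with the smaller total squared distance is the better fit; the `t` values are the 1d coordinates the points keep after the reduction.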
once every point is replaced by its projection on the line, the data has been reduced from 2d to 1d; you can now apply statistics on top of this data: mean, variance
- Variance: the average squared distance of the points from the mean/middle → if mean m = (a+b+c)/3, then variance = ((a-m)^2 + (b-m)^2 + (c-m)^2)/3
- VARIANCE IS A MEASURE OF HOW SPREAD OUT A SET IS
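A small sketch of the variance calculation, assuming the divide-by-n (population) form used in the notes; the sample variance divides by n-1 instead. The helper names and the two sample lists are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def variance(values):
    """Population variance: average squared distance from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Two hypothetical 1d datasets with the same mean but different spread.
tight = [4.0, 5.0, 6.0]
spread = [0.0, 5.0, 10.0]

print(variance(tight), variance(spread))
```

Both lists have mean 5, but the second one sits much farther from it on average, so its variance is larger.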
variance for 1d is easy, what about 2d?
- you can measure the x variance and the y variance independently, but this is flawed: those 2 scalars come out the same for datasets that slope in different directions, so they fail to capture direction/orientation
- better idea is to use the product of coordinates → known as COVARIANCE
  - Covariance is the average of the products of the coordinates (each coordinate measured from its mean)
    - before it was the average of squared distances; now you average the products x*y instead, e.g. products of 2, 0, 2 give (2 + 0 + 2)/3
    - the sign of the covariance decides the type of correlation: positive means y grows as x grows, negative means y decreases as x grows
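A rough sketch of why covariance is needed on top of the two variances, again using the divide-by-n form. The two point sets are made up; the first is chosen so the coordinate products come out as 2, 0, 2 like in the note above.

```python
def variance(values):
    """Population variance: average squared distance from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of the centered coordinates."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical datasets: same x values, mirrored y values.
rising = [(-2.0, -1.0), (0.0, 0.0), (2.0, 1.0)]
falling = [(-2.0, 1.0), (0.0, 0.0), (2.0, -1.0)]

for name, pts in [("rising", rising), ("falling", falling)]:
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    print(name, variance(xs), variance(ys), covariance(xs, ys))
```

The x and y variances are identical for both datasets, so only the sign of the covariance separates the positively correlated set from the negatively correlated one.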