Principal Component Analysis
A dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much of the variance as possible
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a dataset with potentially correlated variables into a set of linearly uncorrelated variables called principal components. These principal components are ordered so that the first component explains the largest possible variance in the data, and each succeeding component explains the highest variance possible while being orthogonal to the preceding components.
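As a quick illustration, here is a minimal sketch using scikit-learn's PCA; the random toy data and the choice of two components are assumptions for demonstration, not part of any particular dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 5 correlated features (illustrative only)
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))

# Reduce to 2 linearly uncorrelated components, ordered by explained variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance explained by each component
```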
The PCA Algorithm
- Standardize the data: Center the data around the mean and scale it to have unit variance
- Compute the covariance matrix: Calculate how each variable relates to each other variable
- Calculate eigenvectors and eigenvalues: Find the principal directions (eigenvectors) and their importance (eigenvalues)
- Sort eigenvectors: Order them by decreasing eigenvalues to get principal components in order of importance
- Select top k eigenvectors: Choose how many dimensions to keep based on explained variance
- Project the data: Transform the original data onto the new subspace defined by the selected principal components (a NumPy sketch of these steps follows this list)
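The same steps can be written directly in NumPy. This is a minimal sketch assuming `X` is a `(n_samples, n_features)` array with no constant (zero-variance) columns; the function and variable names are illustrative.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components (minimal sketch)."""
    # 1. Standardize: zero mean and unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigenvectors (principal directions) and eigenvalues (variance along them);
    #    eigh is used because the covariance matrix is symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort by decreasing eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # 5. Keep the top k eigenvectors as the principal components
    components = eigenvectors[:, :k]

    # 6. Project the standardized data onto the new subspace
    return X_std @ components, eigenvalues
```

The explained variance of the retained components then follows as `eigenvalues[:k] / eigenvalues.sum()`.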
Key Concepts
Principal Components
The orthogonal axes that capture the directions of maximum variance in the data. Each principal component is a linear combination of the original features.
Eigenvalues & Eigenvectors
Eigenvectors of the covariance matrix define the principal components, while eigenvalues represent the amount of variance explained by each principal component.
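As a concrete (made-up) example, the eigendecomposition of a 2×2 covariance matrix might look like this sketch:

```python
import numpy as np

# Hypothetical 2x2 covariance matrix of two standardized features
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

# eigh is appropriate because a covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)         # [0.2, 1.8]: variance along each principal direction
print(eigenvectors[:, 1])  # eigenvector with the largest eigenvalue = first PC
```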
Explained Variance Ratio
The proportion of the dataset's variance explained by each principal component, which helps determine how many components to retain.
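In scikit-learn this quantity is exposed as `explained_variance_ratio_`. The sketch below uses a common rule of thumb, keeping enough components to cover 95% of the variance; both the toy data and the 95% threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data; in practice X is your standardized feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print(k, cumulative)
```

scikit-learn can also perform this selection directly by passing a float, e.g. `PCA(n_components=0.95)`.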
Dimensionality Reduction
By keeping only the top k principal components, we can represent high-dimensional data in a lower-dimensional space while preserving most of the information.
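One way to see how much information survives is to project the data down and map it back, then measure the reconstruction error; the toy data and the choice of k=2 here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rank-3 toy data embedded in 8 dimensions (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 8))

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)           # (100, 2) low-dimensional representation
X_back = pca.inverse_transform(X_low)  # mapped back to the original 8 dimensions

# Fraction of the total variance lost by keeping only 2 components
loss = np.sum((X - X_back) ** 2) / np.sum((X - X.mean(axis=0)) ** 2)
print(X_low.shape, f"variance lost: {loss:.2%}")
```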
Advantages and Limitations
Advantages
- Reduces dimensionality without losing much information
- Removes correlated features and reduces redundancy
- Helps visualize high-dimensional data
- Mitigates the curse of dimensionality
- Can improve performance of machine learning models
- Useful for noise reduction and data compression
Limitations
- Assumes linear relationships between variables
- Sensitive to the scale of the features, so standardization matters (see the sketch after this list)
- May not work well for non-linear data
- Principal components may be hard to interpret
- May lose information if too few components are retained
- Not suitable when high variance does not correspond to the information of interest (for example, class-discriminative structure in supervised tasks)
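To illustrate the scale-sensitivity point above, the following sketch compares PCA with and without standardization; the two feature scales are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two equally informative features, one measured on a much larger scale
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(scale=1.0, size=500),
    rng.normal(scale=1000.0, size=500),
])

raw = PCA().fit(X)
scaled = PCA().fit(StandardScaler().fit_transform(X))

print(raw.explained_variance_ratio_)     # ~[1.0, 0.0]: the large-scale feature dominates
print(scaled.explained_variance_ratio_)  # ~[0.5, 0.5]: both features contribute equally
```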