Performing Principal Component Analysis (PCA) effectively requires careful consideration at every stage, from data preparation to interpretation. Here is a comprehensive checklist of what to consider.
Phase 1: Pre-Analysis Considerations & Data Suitability
1. Goal Definition:
Why are you using PCA? Common goals include:
Dimensionality Reduction: To reduce the number of variables for downstream analysis (e.g., regression, clustering).
Noise Filtering: To create a "denoised" version of the data.
Visualization: To project high-dimensional data into 2D/3D for exploration.
Feature Engineering: To create new, uncorrelated features (the principal components).
Multicollinearity Detection/Handling: To identify correlated variables before regression.
2. Data Suitability:
Variables should be quantitative and continuous. PCA is designed for interval or ratio-scaled data. Applying it to binary or ordinal data can be misleading (consider alternatives like MCA for categorical data).
Check for correlations. PCA is most effective when variables are correlated. If the correlation matrix is close to an identity matrix (no correlation), PCA offers little benefit. Use Bartlett's test of sphericity or visually inspect the correlation matrix (see the sketch after this list).
Sample size. A general rule is to have many more observations (n) than variables (p). A rough guideline is n > 5p, but more is better for stability.
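As a rough illustration of the correlation check above, here is a minimal sketch using NumPy and SciPy; the array X and its synthetic values are placeholders, and Bartlett's statistic is computed directly from its standard formula rather than from any particular library helper.

```python
# Sketch of the data-suitability checks, assuming `X` is a NumPy array with
# rows = observations and columns = variables (all names/data are placeholders).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # placeholder data: n=200 observations, p=5 variables
X[:, 1] += 0.8 * X[:, 0]               # induce some correlation so PCA has something to find

n, p = X.shape
R = np.corrcoef(X, rowvar=False)       # correlation matrix of the variables

# Quick numeric check: average absolute off-diagonal correlation
off_diag = R[~np.eye(p, dtype=bool)]
print(f"mean |r| off-diagonal: {np.abs(off_diag).mean():.2f}")

# Bartlett's test of sphericity: H0 is that R is an identity matrix
chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) / 2
p_value = stats.chi2.sf(chi2, df)
print(f"Bartlett chi2={chi2:.1f}, df={df:.0f}, p={p_value:.3g}")
```

A small p-value rejects the "no correlation" hypothesis, suggesting PCA is worth running; a large one means the variables are already close to uncorrelated.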
3. Data Preprocessing (CRITICAL STEP):
Centering: Always required. Subtract the mean from each variable so that the data is centered around the origin.
Scaling (Standardization): This is the most crucial decision.
Use scaling (z-scores) when variables are measured on different units or scales (e.g., income in dollars vs. age in years). This prevents variables with larger ranges from dominating the PCs. This is the most common scenario.
Do not scale if all variables are on the same, comparable scale (e.g., gene expression from the same assay, pixels from the same image) and you want the variance to guide importance.
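A minimal sketch of the two preprocessing paths using scikit-learn's StandardScaler; the array X and its differing scales are invented purely for illustration.

```python
# Sketch of the centering/scaling decision, assuming `X` is a NumPy array of
# shape (n_observations, n_variables). Data here is a placeholder.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0]   # variables on very different scales

# Mixed units/scales: center AND scale (z-scores) -> PCA on the correlation matrix
X_scaled = StandardScaler().fit_transform(X)

# Same comparable scale: center only -> PCA on the covariance matrix
X_centered = StandardScaler(with_std=False).fit_transform(X)

print(X_scaled.std(axis=0))     # ~1 for every variable after scaling
print(X_centered.mean(axis=0))  # ~0 for every variable after centering
```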
Phase 2: Performing the PCA & Model Decisions
1. Covariance vs. Correlation Matrix:
This choice is directly tied to scaling.
If you scaled your data, you are decomposing the correlation matrix. All variables contribute equally initially.
If you only centered your data, you are decomposing the covariance matrix. Variables with higher variance will exert more influence.
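The sketch below makes this link concrete: decomposing the covariance matrix corresponds to merely centered data, while z-scoring first is equivalent to decomposing the correlation matrix (synthetic data, NumPy only).

```python
# Sketch: the scaling choice determines which matrix PCA decomposes.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
X[:, 3] *= 50                                   # one variable with much larger variance

Xc = X - X.mean(axis=0)                         # centered only
cov_eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

Xz = Xc / X.std(axis=0, ddof=1)                 # centered and scaled (z-scores)
corr_eigvals = np.linalg.eigvalsh(np.cov(Xz, rowvar=False))[::-1]  # = correlation matrix

# Covariance route: the high-variance variable dominates the first eigenvalue.
# Correlation route: eigenvalues sum to p and scale alone no longer dominates.
print("covariance eigenvalues :", np.round(cov_eigvals, 2))
print("correlation eigenvalues:", np.round(corr_eigvals, 2))
```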
2. Determining the Number of Components to Retain:
Do not blindly keep all components. Use a combination of rules:
Kaiser Criterion: Keep components with eigenvalues > 1 (if using the correlation matrix). Often too simplistic.
Scree Plot: Look for the "elbow" point where eigenvalues level off. Retain components before the elbow.
Variance Explained: Keep enough components to explain a pre-determined cumulative percentage of total variance (e.g., 70-90%). This is the most interpretable and widely used method in practice (a sketch follows this list).
Cross-Validation & Purpose-Driven: For downstream tasks, use cross-validation to choose the number that optimizes model performance.
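A small sketch of the scree-plot and cumulative-variance rules using scikit-learn; the synthetic data and the 80% threshold are placeholders chosen only for illustration.

```python
# Sketch: choosing how many components to keep via explained variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
X[:, :4] += rng.normal(size=(200, 1))          # shared signal so a few PCs dominate
X_scaled = StandardScaler().fit_transform(X)   # standardized data (correlation-matrix PCA)

pca = PCA().fit(X_scaled)
eigenvalues = pca.explained_variance_          # scree plot: plot these vs. component index
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching, e.g., 80% of total variance
k = int(np.searchsorted(cum_var, 0.80) + 1)
print("eigenvalues:", np.round(eigenvalues, 2))
print("cumulative variance:", np.round(cum_var, 2))
print("components to keep for 80%:", k)
```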
Phase 3: Post-Analysis Interpretation & Validation
1. Interpreting the Components:
Loadings (or Rotation Matrix): These are the coefficients linking the original variables to the PCs. Examine the loadings (a sketch follows this list).
A high absolute loading (positive or negative) means the variable strongly influences that PC.
Try to name the PC based on the variables that load highly on it (e.g., "Size Component" if height, weight, length all load highly).
Caution: Loadings change if you rotate the components (e.g., Varimax rotation). Rotation is often done to improve interpretability; Varimax keeps the components orthogonal, but the rotated components are no longer principal components in the strict sense, because they no longer maximize explained variance in order.
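A short sketch of pulling the loadings out of a fitted scikit-learn PCA; the variable names (height, weight, length, age) and the synthetic data are hypothetical, chosen to mimic the "Size Component" example above.

```python
# Sketch: inspecting loadings from a fitted scikit-learn PCA.
# Variable names and data below are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
size = rng.normal(size=100)                     # latent "size" factor driving three variables
df = pd.DataFrame({
    "height": size + rng.normal(scale=0.3, size=100),
    "weight": size + rng.normal(scale=0.3, size=100),
    "length": size + rng.normal(scale=0.3, size=100),
    "age":    rng.normal(size=100),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))

# components_ has shape (n_components, n_variables); each row is one PC's loadings
loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=["PC1", "PC2"])
print(loadings.round(2))   # height/weight/length have large |loadings| on PC1 -> a "size" component
```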
2. Assessing the Results:
Biplot: A powerful tool that visualizes both the component scores for observations and the loadings for variables in one plot, showing relationships between observations and which variables drive their positions (a sketch follows this list).
Outlier Detection: Plot the component scores (e.g., PC1 vs PC2). Points far from the cluster may be outliers influencing the PCA.
Check for Linearity: PCA finds linear combinations. Non-linear relationships between variables will not be captured well (consider Kernel PCA for such cases).
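A bare-bones biplot sketch with Matplotlib, reusing the same kind of hypothetical data as the loadings example; the arrow scaling factor is arbitrary and only for visibility.

```python
# Sketch: a minimal biplot -> observation scores as points, variable loadings as arrows.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
base = rng.normal(size=100)
df = pd.DataFrame({                          # hypothetical variables, as in the loadings sketch
    "height": base + rng.normal(scale=0.3, size=100),
    "weight": base + rng.normal(scale=0.3, size=100),
    "age":    rng.normal(size=100),
})

X_scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=2).fit(X_scaled)
scores = pca.transform(X_scaled)

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], alpha=0.5)              # observations: look for clusters/outliers
for name, (lx, ly) in zip(df.columns, pca.components_.T * 2):  # arrows scaled (x2) for visibility
    ax.arrow(0, 0, lx, ly, color="red", head_width=0.05)
    ax.annotate(name, (lx, ly), color="red")
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%} of variance)")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%} of variance)")
plt.show()
```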
3. Reporting & Next Steps:
Document the process: Clearly state whether data was scaled, how many components were retained, and the variance they explain.
Use the outputs:
Component Scores: Use these as new, uncorrelated features in your subsequent models, such as regression or clustering (a sketch follows this list).
Do not fit a model on the retained PCs and then interpret the original variables' importance directly; that information is spread across the loadings.
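To illustrate the "scores as features" point, here is a hedged sketch using a scikit-learn Pipeline, which also keeps scaling and PCA inside cross-validation as suggested in Phase 2; the data, the choice of 5 components, and the logistic regression model are all placeholders.

```python
# Sketch: feeding PCA scores into a downstream model via a Pipeline,
# so scaling and PCA are refit within each cross-validation fold.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 20))                                      # placeholder features
y = (X[:, :5].sum(axis=1) + rng.normal(size=300) > 0).astype(int)   # placeholder target

model = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```

The same pattern lets you cross-validate over n_components to pick the number that best serves the downstream task.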
Summary Checklist
Before the analysis:
1. Are my variables continuous and correlated?
2. Are they on comparable scales? If not, I must scale.
3. What is my primary goal (visualization, dimensionality reduction)?
During the analysis:
1. Am I using the correlation matrix (scaled data) or the covariance matrix (centered only)?
2. How many components will I keep? Use the scree plot and % variance explained.
After the analysis:
1. Can I interpret the components by examining the loadings?
2. Do the scores/biplot reveal clusters or outliers?
3. Have I documented my choices for reproducibility?
Final Reminder: PCA is an exploratory and unsupervised technique. It reveals patterns based on variance, which may or may not be related to your specific predictive goal. Always validate findings from PCA with domain knowledge and downstream analyses.