References for this assignment (in case you get stuck):
Instructions: Download all your files (by shift clicking or right clicking on the highlighted file names) before starting MATLAB. Write an m-file to reproduce your analysis and e-mail the m-file to the Instructor.
1). In this problem we are going to explore the uses of PCA and Factor Analysis on real oceanographic data. To get you started we suggest that you download the satl.dat data set into your working directory now. This is some basic bottle data from the South Atlantic Ventilation Experiment (SAVE), this has data necessary for this PCA/FA problem. While it is downloading let's consider what needs to be accomplished by first examining what you've got. The data file you are downloading contains the following information in the corresponding columns:
| Lat. | Long. | depth | temp. | salinity | O2 | PO4 | NO3 | NO2 | SiO2 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| deg. | deg. | m | deg. C | ND |
To help you get started we are providing a m-file (function) to calculate the PCAs and some attendant information, you can download the file pca2.m to get this function.
Now suppose that you suspect that there are at least two underlying factors that are responsible for the distribution of these variables: biogeochemical processes (photosynthesis, respiration, etc.) and physical processes (precipitation, upwelling, etc.). Let's see if you can design a principal component--factor analysis to elucidate these factors.
a). First things first, after you have loaded the data, "look at the data". Using MATLAB's subplot feature, have your m-file make a four panel plot of temperature, salinity, oxygen, and nitrate all vs. depth. From these plots it should be clear that a separation of the data by depth is in order.
Separate the data into two matrices, one with everything above 1000 m and one with everything below 1000 m. Hint: MATLAB does it like this:
Xupp=satl(satl(:,3)<=1000,4:10); Xlwr=satl(satl(:,3)>1000,4:10);
This command also removes the variables longitude, latitude and depth from the PCA/FA. Is this a good thing? PCA looks for patterns within the data set among and between variables. The location information will be of use in the next assignment (i.e. where the factors are). There is a certain general, monotonous regularity to depth data, have your m-file display a short note as to why it shouldn't be included in the analysis.
Note that pca2.m gives you the option to standardize the data into standard normal Z-scores using:
via standardiz.m, download it and study it and make sure you know how it works. Here the subscript i refers to the ith observation/sample and the subscript j refers to the jth variable measured.
Discuss whether or not you think the data should be standardized; this decision will effect the results of the rest of the problem, so think about it.
b). Calculate the eigenvectors, eigenvalues, factor matrix, and factor scores of both Xupp and Xlwr with pca2.m. Have your m-file display only the eigenvectors, the eigenvalues summary table and the factor matrix (loadings).
Remember that the principle components come from the Eckert-Young Theorem and can be stated as:
where U are the eigenvectors and the capital lambda the diagonal matrix of singular values which are the square roots of the eigenvalues.
But in MATLAB you can get them from them:
[U lam]=eig(cov(X))
notice how you can embed one MATLAB command inside another? Or you could use the m-file provided to you and X can be standardized or not depending on the value set in the standardization flag in pca2.m. The factor matrix is given by:
and the principle component scores are given by:
c). Calculate the communalities of the factors, both above and below 1000m, and report the values. You remember that the communalities are given by:
Here the j refers to the jth variable and the r refers to the rth factor. In PCA j=r, but this isn't always the case in factor analysis. From a computational point of view, the idea is to sum horizontally across the factor loadings. What do these values represent?
d). Here is where we "cross over the line" between PCA and factor analysis. Based on the loadings, magnitudes of the eigenvalues and the communalities, how many factors should be kept for Xupp and Xlwr? How much variance do they account for? What are their communalities?
e). Now use varimax.m to perform a varimax roatation on the kept factors. Varimax.m requires vfunct.m as well, so be sure to download it. How do the communalities (varimax.m reports them as hjsq) and rotated factor loadings (varimax reports these as lding) fit in with your expectations? Should more or fewer factors be used? Note the groupings of the variables (going down the rows: temperature, salinity, oxygen, phosphate, nitrate, nitrite and silica), what seems to be the notable differences between the surface and deep waters?
f). Using pca2.m, varimax.m, and our knowledge of factor analysis we found the acceptable
factors for the surface and deep waters. But sometimes in factor analysis, it isn't the loadings of
the variables on the individual factors (
) that is the goal, but rather the spatial
distribution of the rotated factor scores is the object of the analysis.
Now after rotating the factor loadings with varimax.m you will have
and you might think
calculating the rotated factor scores would be a simple matter (remember varimax.x puts the
answer in lding) of applying:
in a manner analogous to a way of calculating the unrotated factor scores, but it's not quite that
simple. When you selected a number of factors (a number smaller than the total number of
eigenvectors) you reduced the factors ability to account for all of the variance in the data set. If
this were all you had done, you could calculate
as in Eqn. (PS3-6), but you have also rotated
the factors and a new set of factor scores correspond to the projection of the individual data
points onto these new eigen-coordinates. Instead you need to use the score coefficients (B) as in:
Remember, X is standardized in Eqn (PS3-7). If you used the standardization feature of PCA2, keep in mind that it does not affect the values of your original data matrix. These score coefficients can be calculated from:
The variable S represents the pxp covariances derived from the p retained factors. If the data was standardized to begin with, then calculating S is a relatively simple matter of:
note that the first
has a prime mark following it indicating transpose. If you've been
keeping track of the sizes of these matrices, you know that
has as many rows
as the original data matrix, but now has fewer columns. Calculate a set
of rotated factor scores for Xupp and Xlwr as in part b. After
calculating the rotated factor scores, reattach the coordinates (lon, lat,
depth). Please show only the first 10 rows of the rotated factor scores.
g). Plot these factor scores against each other and compare them to similar property-property plots, are there any patterns that are similar?
NOTE:
1). Remember to use the help feature, if in doubt what a particular m-file does, use:
type filename.m
and MATLAB will list the entire contents of the m-file in your window. Good hunting.
2). The m-file varimax.m works in "keyboard" mode prompting you to change variable values from the keyboard while it is running. Check out:
help varimax
or
help keyboard
for more information.