Principal Component Analysis Tutorial - Convert R code to Matlab issues

I am trying to understand PCA by finding practical examples online. Sadly, most tutorials I have found don't show simple practical applications of PCA. After a lot of searching, I came across this one:
http://yatani.jp/HCIstats/PCA
It is a nice, simple tutorial, but it is written in R and I want to re-create the results in Matlab. So far I have been unsuccessful; I am new to Matlab. I have created the arrays as follows:
Price = [6,7,6,5,7,6,5,6,3,1,2,5,2,3,1,2];
Software = [5,3,4,7,7,4,7,5,5,3,6,7,4,5,6,3];
Aesthetics = [3,2,4,1,5,2,2,4,6,7,6,7,5,6,5,7];
Brand = [4,2,5,3,5,3,1,4,7,5,7,6,6,5,5,7];
Then in his example, he does this
data <- data.frame(Price, Software, Aesthetics, Brand)
A quick search online told me that this combines the vectors into a data frame in R. So in Matlab I did this:
dataTable(:,1) = Price;
dataTable(:,2) = Software;
dataTable(:,3) = Aesthetics;
dataTable(:,4) = Brand;
Now it is the next part I am unsure of.
pca <- princomp(data, cor=TRUE)
summary(pca, loadings=TRUE)
I have tried using Matlab's PCA function
[COEFF SCORE LATENT] = princomp(dataTable)
But my results do not match the ones shown in the tutorial at all. My results are
COEFF =
-0.5958 0.3786 0.7065 -0.0511
-0.1085 0.8343 -0.5402 -0.0210
0.6053 0.2675 0.3179 -0.6789
0.5166 0.2985 0.3287 0.7321
SCORE =
-2.3362 0.0276 0.6113 0.4237
-4.3534 -2.1268 1.4228 -0.3707
-1.1057 -0.2406 1.7981 0.4979
-3.6847 0.4840 -2.1400 1.0586
-1.4218 2.9083 1.2020 -0.2952
-3.3495 -1.3726 0.5049 0.3916
-4.1126 0.1546 -2.4795 -1.0846
-1.7309 0.2951 0.9293 -0.2552
2.8169 0.5898 0.4318 0.7366
3.7976 -2.1655 -0.2402 -1.2622
3.3041 1.0454 -0.8148 0.7667
1.4969 2.9845 0.7537 -0.8187
2.3993 -1.1891 -0.3811 0.7556
1.7836 -0.0072 -0.2255 -0.7276
2.2613 -0.1977 -2.4966 0.0326
4.2350 -1.1899 1.1236 0.1509
LATENT =
9.3241
2.2117
1.8727
0.5124
Yet the results in the tutorial are
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.5589391 0.9804092 0.6816673 0.37925777
Proportion of Variance 0.6075727 0.2403006 0.1161676 0.03595911
Cumulative Proportion 0.6075727 0.8478733 0.9640409 1.00000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Price -0.523 0.848
Software -0.177 0.977 -0.120
Aesthetics 0.597 0.134 0.295 -0.734
Brand 0.583 0.167 0.423 0.674
Could anyone please explain why my results differ so much from the tutorial's? Am I using the wrong Matlab function?
Also, if you can point me to any other nice, simple practical applications of PCA, that would be very helpful. I am still trying to get my head around all the concepts, and I like examples that I can code up and play around with myself; I find it easier to learn that way.
Any help would be much appreciated!!

Edit: The issue is purely the scaling.
R code:
summary(princomp(data, cor = FALSE), loadings=T, cutoff = 0.01)
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Price -0.596 -0.379 0.706 -0.051
Software -0.109 -0.834 -0.540 -0.021
Aesthetics 0.605 -0.268 0.318 -0.679
Brand 0.517 -0.298 0.329 0.732
According to the Matlab help you should use this if you want scaling:
Matlab code:
princomp(zscore(X))
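For completeness, here is a quick way to see the same equivalence entirely in R; this is a sketch added for illustration, not part of the original answer. With scaling, princomp(..., cor = TRUE) and prcomp(..., scale. = TRUE) give the same loadings up to sign, which is what princomp(zscore(X)) reproduces in Matlab:
# The tutorial's data, rebuilt in R
Price      <- c(6, 7, 6, 5, 7, 6, 5, 6, 3, 1, 2, 5, 2, 3, 1, 2)
Software   <- c(5, 3, 4, 7, 7, 4, 7, 5, 5, 3, 6, 7, 4, 5, 6, 3)
Aesthetics <- c(3, 2, 4, 1, 5, 2, 2, 4, 6, 7, 6, 7, 5, 6, 5, 7)
Brand      <- c(4, 2, 5, 3, 5, 3, 1, 4, 7, 5, 7, 6, 6, 5, 5, 7)
data <- data.frame(Price, Software, Aesthetics, Brand)
unclass(princomp(data, cor = TRUE)$loadings)  # the tutorial's loadings
prcomp(data, scale. = TRUE)$rotation          # same values up to sign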
Old answer (a red herring):
From help(princomp) (in R):
The calculation is done using eigen on the correlation or covariance
matrix, as determined by cor. This is done for compatibility with the
S-PLUS result. A preferred method of calculation is to use svd on x,
as is done in prcomp.
Note that the default calculation uses divisor N for the covariance
matrix.
In the documentation of the R function prcomp (help(prcomp)) you can read:
The calculation is done by a singular value decomposition of the
(centered and possibly scaled) data matrix, not by using eigen on the
covariance matrix. This is generally the preferred method for
numerical accuracy. [...] Unlike princomp, variances are computed with
the usual divisor N - 1.
The Matlab function apparently uses the svd algorithm. If I use prcomp (without scaling, i.e., not based on correlations) with the example data I get:
> prcomp(data)
Standard deviations:
[1] 3.0535362 1.4871803 1.3684570 0.7158006
Rotation:
PC1 PC2 PC3 PC4
Price -0.5957661 0.3786184 -0.7064672 0.05113761
Software -0.1085472 0.8342628 0.5401678 0.02101742
Aesthetics 0.6053008 0.2675111 -0.3179391 0.67894297
Brand 0.5166152 0.2984819 -0.3286908 -0.73210631
This is (apart from the irrelevant signs) identical to the Matlab output.
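A small numerical check of the divisor remark (again a sketch, not part of the original answer): because princomp() divides by N and prcomp() by N - 1, their standard deviations differ by a factor of sqrt((N - 1)/N), while the loadings agree up to sign. Using the same data frame as above:
n <- nrow(data)                        # 16 observations
princomp(data)$sdev                    # eigen on covariance, divisor N
prcomp(data)$sdev * sqrt((n - 1) / n)  # rescaled, matches princomp's sdev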

Related

Calculating scaled score using Item Response Theory (IRT, 3PL) in R with MIRT package

I have an English Listening Comprehension test consisting of 50 items, taken by about 400 students. I would like to score the test on the TOEFL scale (31-68), and it is claimed that TOEFL is scored using IRT (a 3PL model). I am using the mirt package in R to obtain the three item parameters, since I used the 3PL model.
library(readxl)
TOEFL006 <- read_excel("TOEFL Prediction Mei 2021 Form 006.xlsx")
TOEFL006_LIST <- TOEFL006[, 9:58]

library(mirt)
sv <- mirt(TOEFL006_LIST,    # data frame (ordinal only)
           1,                # 1 for unidimensional, 2 for exploratory
           itemtype = '3PL') # models, i.e. Rasch, 1PL, 2PL, and 3PL
sv_coeffs <- coef(sv,
                  simplify = T,
                  IRTpars = T)
sv_coeffs
The result is shown below:
         a      b      g  u
L1   2.198  0.165  0.198  1
L2   2.254  0.117  0.248  1
L3   2.103 -0.049  0.232  1
L4   4.663  0.293  0.248  1
L5   1.612 -0.374  0.001  1
...    ...    ...    ... ...
Then I calculated the factor scores using the following code:
factor_score <- fscores(sv, method = "EAP", full.scores = T)
The first six results are shown below:
head(factor_score)
[1,] 2.1839888
[2,] 1.8886260
[3,] 0.6791995
[4,] 1.2761837
[5,] 0.8195919
[6,] -1.5257231
The problem is that I do not know how to convert the factor score into the TOEFL score of 31 - 68 (as shown on the ETS website for listening score: https://www.ets.org/s/toefl_itp/pdf/38781-TOEFL-itp-flyer_Level1_HR.pdf).
Would anyone show me how I can do that in R, please? Or maybe there are other ways of obtaining students' scores. Your help is much appreciated.
The data can be downloaded here: https://drive.google.com/file/d/1WwwjzgxJRBByCXAjdlNkGNRCtXjMlddW/view?usp=sharing
Thank you very much for your help.
To calculate each respondent's ability score, use:
Theta <- fscores(mod, method = 'MAP', itemtype = '2-PL', stats.only = TRUE)
head(personfit(mod, Theta = Theta))
Working through a similar project and found it in the mirt help files :)
I think you misunderstood something. It doesn't seem to be scored using IRT - according to Wikipedia, it's just summed up, which is quite a normal way to score tests. Maybe I misunderstood it, but if it's done like you say, I think it uses a proprietary scoring model and you would likely have to contact and pay them to get your scores.
So to get the scores, you would simply do TOEFL006_LIST$total <- rowSums(TOEFL006_LIST) and get the head using head(TOEFL006_LIST$total).
Also, I'm glad not to be a student of yours - you're putting their names out in the open in your linked dataset... If you're in Europe it would be illegal, possibly in other countries too.
I was looking for a solution to your question and I found that you can use this function to get a homologous score from a given model and theta value.
expected.test(model, as.matrix(theta))
theta should be specifically a matrix.
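For example, a minimal sketch using the fitted model sv and the EAP scores from the question (note that expected.test() returns the model-expected total test score for each theta; it does not by itself map scores onto the 31-68 TOEFL scale):
Theta <- fscores(sv, method = "EAP", full.scores = TRUE)  # same EAP scores as in the question
expected_total <- expected.test(sv, as.matrix(Theta))     # model-expected total score per respondent
head(expected_total)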

Letters group Games-Howell post hoc in R

I use the sweetpotato dataset included in the agricolae library of R:
data(sweetpotato)
This dataset contains two variables: yield (a continuous variable) and virus (a factor).
Because the Levene test is significant, I cannot assume homogeneity of variances, so I apply a Welch test in R instead of a one-way ANOVA followed by a Tukey post hoc test.
The problem comes when I apply the post hoc test. For the Tukey post hoc test I use library(agricolae), which displays the superscript letters for the virus groups, so there is no problem there.
However, to perform the Games-Howell post hoc test I use library(userfriendlyscience), and although I obtain the Games-Howell output, I cannot obtain a letter comparison between the virus groups like the one produced by library(agricolae).
The code I used was the following:
library(userfriendlyscience)
data(sweetpotato)
oneway <- oneway(sweetpotato$virus, y = sweetpotato$yield,
                 posthoc = 'games-howell')
oneway
I tried cld() after loading library(multcompView), but it doesn't work.
Can somebody help me?
Thanks in advance.
This functionality does not exist in userfriendlyscience at the moment. You can see which means differ, and with which p-values, by looking at the row names of the dataframe with the post-hoc test results. I'm not sure which package contains the sweetpotato dataset, but using the ChickWeight dataset that comes with R (and is used on the oneway manual page):
oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
Yields:
### (First bit removed as it's not relevant.)
### Post hoc test: games-howell
diff ci.lo ci.hi t df p
2-1 19.97 0.36 39.58 2.64 201.38 .044
3-1 40.30 17.54 63.07 4.59 175.92 <.001
4-1 32.62 13.45 51.78 4.41 203.16 <.001
3-2 20.33 -6.20 46.87 1.98 229.94 .197
4-2 12.65 -10.91 36.20 1.39 235.88 .507
4-3 -7.69 -33.90 18.52 0.76 226.16 .873
The first three rows compare groups 2, 3 and 4 to 1: using alpha = .05, 1 and 2 have the same means, but 3 and 4 are higher. This allows you to compute the logical vector you need for multCompLetters in multcompView. Based on the example from the manual page at ?multcompView:
### Run oneway anova and store result in object 'res'
res <- oneway(y=ChickWeight$weight, x=ChickWeight$Diet, posthoc='games-howell');
### Extract dataframe with post hoc test results,
### and overwrite object 'res'
res <- res$intermediate$posthoc;
### Extract p-values and comparison 'names'
pValues <- res$p;
### Create logical vector, assuming alpha of .05
dif3 <- pValues > .05;
### Assign names (row names of post hoc test dataframe)
names(dif3) <- row.names(res);
### convert this vector to the letters to compare
### the group means (see `?multcompView` for the
### references for the algorithm):
multcompLetters(dif3);
This yields as final result:
2 3 4 1
"a" "b" "c" "abc"
This is what you need, right?
I added this functionality to userfriendlyscience, but it will be a while before this new version will be on CRAN. In the meantime, you can get the source code for this update at https://github.com/Matherion/userfriendlyscience/blob/master/R/oneway.R if you want (press the 'raw' button to get an easy-to-download version of the source code).
Note that if you need this updated version, you need to set parameter posthocLetters to TRUE, because it's FALSE by default. For example:
oneway(y=ChickWeight$weight,
x=ChickWeight$Diet,
posthoc='games-howell',
posthocLetters=TRUE);
Shouldn't it be dif3 <- pValues < .05 instead of dif3 <- pValues > .05?
That way the letters are the same if the distributions are 'the same' (that is, there is no evidence that they are different).
Please correct me if I'm interpreting this wrong.

Identify Principal component from Biplot in R

I'm doing a principal component analysis. Once I have the analysis result, how do I identify the first couple of principal predictors? The biplot is messy and it is hard to see the predictor names.
Which part of the PCA results should I look into? This is really a question of how to determine the most important predictors that explain, let's say, 80% of the variance of the data. We may know that, e.g., the first 5 components do this, but each principal component is just a combination of predictors. How do I identify those "important" predictors?
See this answer Principal Components Analysis - how to get the contribution (%) of each parameter to a Prin.Comp.?
The information is stored within your pca results.
If you used prcomp(), then $rotation is what you are after, or if you used princomp(), then $loadings holds the key.
E.g.:
require(graphics)
data("USArrests")
pca_1<-prcomp(USArrests, scale = TRUE)
load_1<-with(pca_1,unclass(rotation))
aload_1<-abs(load_1)
sweep(aload_1, 2, colSums(aload_1), "/")
# PC1 PC2 PC3 PC4
#Murder 0.2761363 0.2540139 0.1890303 0.40186493
#Assault 0.3005008 0.1141873 0.1485443 0.46016113
#UrbanPop 0.1433452 0.5301651 0.2094067 0.08286886
#Rape 0.2800177 0.1016337 0.4530187 0.05510509
pca_2<-princomp(USArrests,cor=T)
load_2<-with(pca_2,unclass(loadings))
aload_2<-abs(load_2)
sweep(aload_2, 2, colSums(aload_2), "/")
# Comp.1 Comp.2 Comp.3 Comp.4
#Murder 0.2761363 0.2540139 0.1890303 0.40186493
#Assault 0.3005008 0.1141873 0.1485443 0.46016113
#UrbanPop 0.1433452 0.5301651 0.2094067 0.08286886
#Rape 0.2800177 0.1016337 0.4530187 0.05510509
As you can see, Murder, Assault, and Rape each contribute ~30% to PC1, whereas UrbanPop only contributes ~14% to PC1, yet is the major contributor to PC2 (~53%).
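As a follow-up on the "explain 80% of the variance" part of the question, a minimal sketch using the same USArrests fit (pca_1) as above to see how many components you need:
var_explained <- pca_1$sdev^2 / sum(pca_1$sdev^2)  # proportion of variance per PC
cumsum(var_explained)                              # cumulative proportion
which(cumsum(var_explained) >= 0.80)[1]            # number of PCs needed to reach 80%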

In R, when fitting a regression with ordinal predictor variables, how do you suppress one of the polynomial contrast levels?

Below is some of the summary data from a mixed model I have run in R (produced by summary()):
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -3.295e-01 1.227e-01 3.740e+01 -2.683 0.0108 *
STANDING.L 8.447e-02 7.091e-02 7.346e+02 1.188 0.2354
STANDING.Q -4.624e-03 5.940e-02 7.323e+02 -0.078 0.9380
STANDING.C 2.899e-03 5.560e-02 7.327e+02 0.052 0.9585
FIRST.CLASS1 2.643e-02 7.017e-02 7.308e+02 0.376 0.7068
CAREER.L 1.300e-01 5.917e-02 7.345e+02 2.189 0.0289 *
CAREER.Q 8.914e-04 7.370e-02 7.295e+02 0.012 0.9904
GENDER1 9.411e-02 5.892e-02 7.296e+02 1.596 0.1109
HS.COURSES.L -3.996e-02 7.819e-02 7.347e+02 -0.510 0.6102
HS.COURSES.Q 4.977e-02 6.674e-02 7.322e+02 0.745 0.4567
HS.COURSES.C 2.087e-02 5.735e-02 7.298e+02 0.364 0.7163
PARENT.LIVE1 5.770e-03 8.434e-02 7.296e+02 0.068 0.9455
CHILD.SETTING.L 1.241e-01 6.027e-02 7.288e+02 2.057 0.0400 *
CHILD.SETTING.Q -4.911e-02 4.879e-02 7.268e+02 -1.006 0.3146
ES.EXTRA.L 2.702e-02 8.202e-02 7.287e+02 0.329 0.7421
ES.EXTRA.Q 1.267e-01 7.761e-02 7.274e+02 1.631 0.1032
ES.EXTRA.C 8.317e-02 7.533e-02 7.287e+02 1.104 0.2701
TEACH.TAUGHT1 2.475e-01 6.316e-02 7.268e+02 3.918 9.79e-05 ***
SOME1ELSE.TAUGHT1 -1.818e-03 6.116e-02 7.277e+02 -0.030 0.9763
Several of my predictor variables are ordinal, as indicated by the Linear (.L), Quadratic (.Q), and sometimes Cubic (.C) terms that are being automatically generated for them. My question is this: How could I re-run this same regression removing, say, the ES.EXTRA.C term? In other words, I want to suppress one or more of the automatically-generated polynomial contrasts but potentially keep others. I would have thought update() could do this, but I haven't been able to get it to work.
I can't share my actual data, but this code will create a few outcomes that are sort of similar and include an illustration of smci's answer below as well:
set.seed(151) #Lock in a fixed random structure to these data.
Y.data = sort(round(rnorm(100, 75, 10))) #Some random Y data that are basically the same form as mine.
X.data1 = as.ordered(rep(c(1,2,3,4), each=25)) #Some random X data that are similar in form to mine.
summary(lm(Y.data~X.data1)) #This is what I had been doing, albeit using lmer() instead of lm(). It looks to have been creating the polynomial terms automatically.
summary(lm(Y.data~poly(X.data1, 3))) #Returns an error because X.data1 is not numeric
summary(lm(Y.data~poly(as.numeric(X.data1), 3))) #Now returns a call very similar to the first one, but this time I am in control of which polynomial terms are included.
summary(lm(Y.data~poly(as.numeric(X.data1), 2))) #The cubic term is suppressed now, as desired.
As a follow-up, is there a way using poly() to get only a certain mixture of polynomial terms? Say, the cubic and fourth power ones only? I have no idea why one would want to do that, but it seems like something worth knowing...
UPDATE, after you posted your code:
As I guessed, you're building a model using polynomials of ordinal variables:
fit <- lm(y ~ poly(STANDING, 3) + FIRST.CLASS + poly(CAREER, 2) + GENDER +
              poly(HS.COURSES, 3) + poly(CHILD.SETTING, 2) + poly(ES.EXTRA, 3) ...)
If you want to prevent cubic terms, use poly(..., 2)
If you really want to have only cubic and quartic terms, with no quadratic or linear ones, a hack is to use I(STANDING^3) + I(STANDING^4), although those will be raw polynomials (not orthogonal, centered and scaled like poly() produces). I have never seen a need for this; it sounds like a very strange request.
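For the follow-up about keeping only selected terms: since poly() returns a matrix, another option is to index the columns you want directly in the formula. A minimal sketch, using a hypothetical numeric predictor x (an ordered factor would first need as.numeric() and enough distinct levels for the requested degree):
set.seed(151)
x <- rnorm(100)
y <- rnorm(100, 75, 10)
# keep only the cubic and quartic orthogonal-polynomial columns
summary(lm(y ~ poly(x, 4)[, 3:4]))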
See related:
How to model polynomial regression in R?
UCLA: "R library Contrast coding systems for categorical variables"
FOOTNOTE: lmer() is for mixed-effects models; if you don't know what that is, don't use it, use plain lm().

Statistical inefficiency (block-averages)

I have a series of data obtained through a molecular dynamics simulation, so the values are sequential in time and correlated to some extent. I can calculate the mean as the average of the data; I want to estimate the error associated with a mean calculated in this way.
According to this book I need to calculate the "statistical inefficiency", or roughly the correlation time for the data in the series. For this I have to divide the series into blocks of varying length and, for each block length (t_b), compute the variance of the block averages (v_b). Then, if the variance of the whole series is v_a (that is, v_b when t_b = 1), I have to obtain the limit, as t_b tends to infinity, of (t_b*v_b/v_a), and that is the inefficiency s.
Then the error in the mean is sqrt(v_a*s/N), where N is the total number of points. This means that only one in every s points is effectively uncorrelated.
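In symbols, writing $\bar{x}$ for the mean of the $N$ points, the procedure described above estimates
$$ s = \lim_{t_b \to \infty} \frac{t_b\, v_b}{v_a}, \qquad \mathrm{err}(\bar{x}) = \sqrt{\frac{v_a\, s}{N}} $$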
I assume this can be done with R, and maybe there's some package that does it already, but I'm new to R. Can anyone tell me how to do it? I have already found out how to read the data series and calculate the mean and variance.
A data sample, as requested:
# t(ps) dH/dl(kJ/mol)
0.0000 582.228
0.0100 564.735
0.0200 569.055
0.0300 549.917
0.0400 546.697
0.0500 548.909
0.0600 567.297
0.0700 638.917
0.0800 707.283
0.0900 703.356
0.1000 685.474
0.1100 678.07
0.1200 687.718
0.1300 656.729
0.1400 628.763
0.1500 660.771
0.1600 663.446
0.1700 637.967
0.1800 615.503
0.1900 605.887
0.2000 618.627
0.2100 587.309
0.2200 458.355
0.2300 459.002
0.2400 577.784
0.2500 545.657
0.2600 478.857
0.2700 533.303
0.2800 576.064
0.2900 558.402
0.3000 548.072
... and this goes on until 500 ps. Of course, the data I need to analyze is the second column.
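For reference, a minimal sketch (not in the original post; the file name "dhdl.dat" is hypothetical) of how the series can be read into R so that the second column becomes the vector x used in the answers below:
dat <- read.table("dhdl.dat", comment.char = "#")  # skips the "# t(ps) ..." header line
x <- dat[, 2]                                      # dH/dl values, the series to analyse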
Suppose x holds the sequence of data (e.g., the data from your second column).
v = var(x)
m = mean(x)
n = length(x)
si = c()
for (t in seq(2, 1000)) {
  nblocks = floor(n/t)
  xg = split(x[1:(nblocks*t)], factor(rep(1:nblocks, rep(t, nblocks))))
  v2 = sum((sapply(xg, mean) - m)^2)/nblocks   # variance of the block averages, v_b
  si = c(si, t*v2/v)                           # t_b * v_b / v_a, v is the whole-series variance
}
plot(si)
The image below (not preserved here) shows what I got from some of my own time-series data: you have your lower limit for t_b when the curve of si becomes approximately flat (slope = 0). See http://dx.doi.org/10.1063/1.1638996 as well.
There are a couple of different ways to calculate the statistical inefficiency, or integrated autocorrelation time. The easiest, in R, is with the coda package. It has a function, effectiveSize, which gives you the effective sample size, i.e. the total number of samples divided by the statistical inefficiency. The asymptotic estimator for the standard deviation of the mean is sd(x)/sqrt(effectiveSize(x)).
require('coda')
n_eff = effectiveSize(x)
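Putting this together with the estimator mentioned above, a minimal sketch of the corrected standard error of the mean (using x and n_eff from the two lines just above):
se_mean <- sd(x) / sqrt(n_eff)  # asymptotic standard error of the mean, corrected for autocorrelation
se_mean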
Well, it's never too late to contribute to a question, is it?
As I do some molecular simulation myself, I stumbled upon this problem but did not see this thread earlier. I found that the method proposed by Allen & Tildesley seems a bit outdated compared to modern error-analysis methods; the rest of the book is good enough to be worth the look, though.
While Sunhwan Jo's answer is correct concerning the block-averages method, for error analysis you can also find other methods, like the jackknife and bootstrap methods (closely related to one another), here: http://www.helsinki.fi/~rummukai/lectures/montecarlo_oulu/lectures/mc_notes5.pdf
In short, with the bootstrap method you make a series of random artificial samples from your data and calculate the value you want on each new sample. I wrote a short piece of Python code to work some data out:
import numpy

def Bootstrap(data):
    B = 100                     # arbitrary number of artificial samplings (bootstrap resamples)
    means = numpy.zeros(B)
    sizeB = data.shape[0] // 4  # arbitrary resample size, here a quarter of the
                                # original sample (assuming a 1-D numpy array)
    for n in range(B):
        for i in range(sizeB):
            # if data is a multi-column array, add the column you use
            # specifically in the indexing below; check the doc
            means[n] += data[numpy.random.randint(0, high=data.shape[0])]
        # assuming the desired value is the mean of the resampled values;
        # any other calculation is fine
        means[n] = means[n] / sizeB
    es = numpy.std(means, ddof=1)  # std. dev. of the B bootstrap means = statistical error
    return es
I know it can be upgraded but it's a first shot. With your data, I get the following:
Mean = 594.84368
Std = 66.48475
Statistical error = 9.99105
I hope this helps anyone stumbling across this problem in the statistical analysis of data. If I'm wrong about anything (first post, and I'm no mathematician), any correction is welcome.
