Difference between proc princomp in SAS and princomp command in R? - r

I am currently trying to obtain equivalent results with the proc princomp command in SAS and the princomp() command in R (in the stats package). The results I am getting are very similar, leading me to suspect that this isn't a problem with different options settings in the two commands. However, the outpus are also different enough that the component scores for each data row are notably different. They are also sign-reversed, but this doesn't matter, of course.
The end goal of this analysis is to produce a set of coefficients from the PCA to score data outside the PCA routine (i.e. a formula that can be applied to new datasets to easily produce scored data).
Without posting all my data, I'm hoping someone can provide some information on how these two commands may differ in their calculations. I don't know enough about the PCA math to determine if this is a conceptual difference in the processes or just something like an internal rounding difference. For simplicity, I'll post the eigenvectors for PC1 and PC2 only.
In SAS:
proc princomp data=climate out=pc_out outstat=pc_outstat;
var MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
returns
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
MAT 0.372 0.257 -.035 -.033 -.106 0.270 -.036 0.216 -.811
MWMT 0.381 0.077 0.160 -.261 0.627 0.137 -.054 0.497 0.302
MCMT 0.341 0.324 -.229 0.046 -.544 0.421 0.045 0.059 0.493
logMAP -.184 0.609 -.311 -.357 -.041 -.548 0.183 0.183 0.000
logMSP -.205 0.506 0.747 -.137 -.040 0.159 -.156 -.266 0.033
CMI -.336 0.287 -.451 0.096 0.486 0.499 0.050 -.318 -.031
cmiJJA -.365 0.179 0.112 0.688 -.019 0.012 0.015 0.588 0.018
DD_5 0.379 0.142 0.173 0.368 0.183 -.173 0.725 -.282 0.007
NFFD 0.363 0.242 -.136 0.402 0.158 -.351 -.637 -.264 0.052
In R:
PCA.model <- princomp(climate[,c("MAT","MWMT","MCMT","logMAP","logMSP","CMI","cmiJJA","DD.5","NFFD")], scores=T, cor=T)
PCA.model$loadings
returns
Eigenvectors
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
MAT -0.372 -0.269 0.126 -0.250 0.270 0.789
MWMT -0.387 -0.171 0.675 0.494 -0.325
MCMT -0.339 -0.332 0.250 0.164 -0.500 -0.414 -0.510
logMAP 0.174 -0.604 0.309 0.252 0.619 -0.213 0.125
logMSP 0.202 -0.501 -0.727 0.223 -0.162 0.175 -0.268
CMI 0.334 -0.293 0.459 -0.222 0.471 -0.495 -0.271
cmiJJA 0.365 -0.199 -0.174 -0.612 -0.247 0.590
DD.5 -0.382 -0.143 -0.186 -0.421 -0.695 -0.360
NFFD -0.368 -0.227 -0.487 0.309 0.655 -0.205
As you can see, the values are similar (sign reversed), but not identical. The differences matter in the scored data, the first row of which looks like this:
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
SAS -1.95 1.68 -0.54 0.72 -1.07 0.10 -0.66 -0.02 0.05
R 1.61 -1.99 0.52 -0.42 -1.13 -0.16 0.79 0.12 -0.09
If I use a GLM (in SAS) or lm() (in R) to calculate the coefficients from the scored data, I get very similar numbers (inverse sign), with the exception of the intercept. Like so:
in SAS:
proc glm order=data data=pc_out;
model Prin1 = MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
in R:
scored <- cbind(PCA.model$scores, climate)
pca.lm <- lm(Comp.1~MAT+MWMT+MCMT+logMAP+logMSP+CMI+cmiJJA+DD.5+NFFD, data=scored)
returns
Coefficients:
(Int) MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD.5 NFFD
SAS 0.42 0.04 0.06 0.03 -0.65 -0.69 -0.003 -0.01 0.0002 0.004
R -0.59 -0.04 -0.06 -0.03 0.62 0.68 0.004 0.02 -0.0002 -0.004
So it would seem that the model intercept is changing the value in the scored data. Any thoughts on why this happens (why the intercept is different) would be appreciated.

Thanks again to all those who commented. Embarrassingly, the differences I found between the SAS proc princomp and R princomp() procedures was actually a product of a data error that I made. Sorry to those who took time to help answer.
But rather than let this question go to waste, I will offer what I found to be statistically equivalent procedures for SAS and R when running a principal component analysis (PCA).
The following procedures are statistically equivalent, with data named 'mydata' and variables named 'Var1', 'Var2', and 'Var3'.
In SAS:
* Run the PCA on your data;
proc princomp data=mydata out=pc_out outstat=pc_outstat;
var Var1 Var2 Var3;
run;
* Use GLM on the individual components to obtain the coefficients to calculate the PCA scoring;
proc glm order=data data=pc_out;
model Prin1 = Var1 Var2 Var3;
run;
In R:
PCA.model <- princomp(mydata[,c("Var1","Var2","Var3")], scores=T, cor=T)
scored <- predict(PCA.model, mydata)
scored <- cbind(PCA.model$scores, mydata)
lm(Comp.1~Var1+Var2+Var3, data=scored)

Related

Changepoints detection in time series in R

I need some guidance regarding how changepoints work in time series. I am trying to detect some changepoints using R, and the package called "changepoint" (https://cran.r-project.org/web/packages/changepoint/changepoint.pdf).
There are options for how to detect when the variance (cpt.var) and the mean (cpt.mean) change, but what I'm trying to look for is when the time series changes trend.
Maybe I'm confused with what changepoints really are, but is there any way to get this information?
I am showing the result of using cpt.var() function, and I have added some arrows, showing what I would like to achieve.
Is there any way to achieve this? I guess should be somehow like inflection points...
I would appreciate any light on this.
Thanks beforehand,
Jon
EDIT
I have tried with the approach of using diff(), but is not detecting the change correctly:
The data I am using is the following:
[1] 10.695 10.715 10.700 10.665 10.830 10.830 10.800 11.070 11.145 11.270 11.015 11.060 10.945 10.965 10.780 10.735 10.705 10.680 10.600 10.335 10.220 10.125
[23] 10.370 10.595 10.680 11.000 10.980 11.065 11.060 11.355 11.445 11.415 11.350 11.310 11.330 11.360 11.445 11.335 11.275 11.300 11.295 11.470 11.445 11.325
[45] 11.300 11.260 11.200 11.210 11.230 11.240 11.300 11.250 11.285 11.215 11.260 11.395 11.410 11.235 11.320 11.475 11.470 11.685 11.740 11.740 11.700 11.905
[67] 11.720 12.230 12.285 12.505 12.410 11.995 12.110 12.005 11.915 11.890 11.820 11.730 11.700 11.660 11.685 11.615 11.360 11.425 11.185 11.275 11.265 11.375
[89] 11.310 11.250 11.050 10.880 10.775 10.775 10.805 10.755 10.595 10.700 10.585 10.510 10.290 10.255 10.395 10.290 10.425 10.405 10.365 10.010 10.305 10.185
[111] 10.400 10.700 10.725 10.875 10.750 10.760 10.905 10.680 10.670 10.895 10.790 10.990 10.925 10.980 10.975 11.035 10.895 10.985 11.035 11.295 11.245 11.535
[133] 11.510 11.430 11.450 11.390 11.520 11.585
And when I do diff() I get this data:
[1] 0.020 -0.015 -0.035 0.165 0.000 -0.030 0.270 0.075 0.125 -0.255 0.045 -0.115 0.020 -0.185 -0.045 -0.030 -0.025 -0.080 -0.265 -0.115 -0.095 0.245
[23] 0.225 0.085 0.320 -0.020 0.085 -0.005 0.295 0.090 -0.030 -0.065 -0.040 0.020 0.030 0.085 -0.110 -0.060 0.025 -0.005 0.175 -0.025 -0.120 -0.025
[45] -0.040 -0.060 0.010 0.020 0.010 0.060 -0.050 0.035 -0.070 0.045 0.135 0.015 -0.175 0.085 0.155 -0.005 0.215 0.055 0.000 -0.040 0.205 -0.185
[67] 0.510 0.055 0.220 -0.095 -0.415 0.115 -0.105 -0.090 -0.025 -0.070 -0.090 -0.030 -0.040 0.025 -0.070 -0.255 0.065 -0.240 0.090 -0.010 0.110 -0.065
[89] -0.060 -0.200 -0.170 -0.105 0.000 0.030 -0.050 -0.160 0.105 -0.115 -0.075 -0.220 -0.035 0.140 -0.105 0.135 -0.020 -0.040 -0.355 0.295 -0.120 0.215
[111] 0.300 0.025 0.150 -0.125 0.010 0.145 -0.225 -0.010 0.225 -0.105 0.200 -0.065 0.055 -0.005 0.060 -0.140 0.090 0.050 0.260 -0.050 0.290 -0.025
[133] -0.080 0.020 -0.060 0.130 0.065
What I get is the next results:
> cpt =cpt.mean(diff(vector), method="PELT")
> (cpt.pts <- attributes(cpt)$cpts)
[1] 137
Appearly does not make sense... Any clue?
In R, there are many packages available for time series changepoint detection. changepoint is definitely a very useful one. A partial list of the packages is summarized in CRAN Task View:
Change point detection is provided in strucchange (using linear regression models), and in trend (using nonparametric tests). The changepoint package provides many popular changepoint methods, and ecp does nonparametric changepoint detection for univariate and multivariate series. changepoint.np implements the nonparametric PELT algorithm, while changepoint.mv detects changepoints in multivariate time series. InspectChangepoint uses sparse projection to estimate changepoints in high-dimensional time series. robcp provides robust change-point detection using Huberized cusum tests, and Rbeast provides Bayesian change-point detection and time series decomposition.
Here is also a great blog comparing several alternative packages: https://www.marinedatascience.co/blog/2019/09/28/comparison-of-change-point-detection-methods/. Another impressive comparison is from Dr. Jonas Kristoffer LindelΓΈv who developed the mcp package: https://lindeloev.github.io/mcp/articles/packages.html.
Below I used your sample time series to generate some quick results using the Rbeast package developed by myself (chosen here apparently for ego of self-promoting as well as perceived relvance). Rbeast is a Baysian changepoint detection algorithm and it can estimate the probability of changepoint occurrence. It can also be used for decomposing time series into seasonality and trend, but apparently, your time series is trend-only, so in the beast function below, season='none' is specified.
y = c(10.695,10.715,10.700,10.665,10.830,10.830,10.800,11.070,11.145,11.270,11.015,11.060,10.945,10.965,10.780,10.735,10.705,
10.680,10.600,10.335,10.220,10.125,10.370,10.595,10.680,11.000,10.980,11.065,11.060,11.355,11.445,11.415,11.350,11.310,11.330,
11.360,11.445,11.335,11.275,11.300,11.295,11.470,11.445,11.325,11.300,11.260,11.200,11.210,11.230,11.240,11.300,11.250,11.285,
11.215,11.260,11.395,11.410,11.235,11.320,11.475,11.470,11.685,11.740,11.740,11.700,11.905,11.720,12.230,12.285,12.505,12.410,
11.995,12.110,12.005,11.915,11.890,11.820,11.730,11.700,11.660,11.685,11.615,11.360,11.425,11.185,11.275,11.265,11.375,11.310,
11.250,11.050,10.880,10.775,10.775,10.805,10.755,10.595,10.700,10.585,10.510,10.290,10.255,10.395,10.290,10.425,10.405,10.365,
10.010,10.305,10.185,10.400,10.700,10.725,10.875,10.750,10.760,10.905,10.680,10.670,10.895,10.790,10.990,10.925,10.980,10.975,
11.035,10.895,10.985,11.035,11.295,11.245,11.535 ,11.510,11.430,11.450,11.390,11.520,11.585)
library(Rbeast)
out=beast(y, season='none')
plot(out)
print(out)
In the figure above, dashed vertical lines mark the most likely locations of changepoints; the green curve of Pr(tcp) shows the point-wise probability of changepoint occurrence over time. The order_t curve gives the estimated mean order of the piecewise polynomials needed to adequately fit the trend (the 0-th order is constant and the 1st order is linear): An average order toward 0 means that the trend is more likely to be flat and an order close to 1 means that the trend is linear. The output can be also printed as some ascii outputs, as shown below. Again, it says that the time series is most likely to have 8 changepoints; their most probable locations are given in out$trend$cp.
Result for time series #1 (total number of time series in 'out': 1)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ SEASONAL CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
No seasonal/periodic component present (i.e., season='none')
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ TREND CHANGEPOINTS +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
An ascii plot of the probability dist for number of chgpts(ncp)
---------------------------------------------------------------
Pr(ncp=0 )=0.000|* |
Pr(ncp=1 )=0.000|* |
Pr(ncp=2 )=0.000|* |
Pr(ncp=3 )=0.000|* |
Pr(ncp=4 )=0.000|* |
Pr(ncp=5 )=0.000|* |
Pr(ncp=6 )=0.055|***** |
Pr(ncp=7 )=0.074|****** |
Pr(ncp=8 )=0.575|******************************************** |
Pr(ncp=9 )=0.240|******************* |
Pr(ncp=10)=0.056|***** |
---------------------------------------------------------------
Max ncp : 10 | A parameter you set (e.g., maxTrendKnotNum) |
Mode ncp: 8 | Pr(ncp= 8)=0.57; there is a 57.5% probability|
| that the trend componet has 8 chngept(s). |
Avg ncp : 8.17 | Sum[ncp*Pr(ncp)] |
---------------------------------------------------------------
List of most probable trend changepoints (avg number of changpts: 8.17)
--------------------------------.
tcp# |time (cp) |prob(cpPr)|
-----|---------------|----------|
1 |8.0000 | 0.92767|
2 |112.0000 | 0.91433|
3 |68.0000 | 0.84213|
4 |21.0000 | 0.80188|
5 |32.0000 | 0.78171|
6 |130.0000 | 0.76938|
7 |101.0000 | 0.66404|
8 |62.0000 | 0.61171|
--------------------------------'
If the signal isn't too noisy, you could use diff to detect changepoints in slope instead of mean:
library(changepoint)
set.seed(1)
slope <- rep(sample(10,10)-5,sample(100,10))
sig <- cumsum(slope)+runif(n=length(slope),min = -1, max = 1)
cpt =cpt.mean(diff(sig),method="PELT")
# Show change point
(cpt.pts <- attributes(cpt)$cpts)
#> [1] 58 109 206 312 367 440 447 520 599
plot(sig,type="l")
lines(x=cpt.pts,y=sig[cpt.pts],type="p",col="red",cex=2)
Another option which seems to work better with the data you provided is to use piecewise linear segmentation:
library(ifultools)
changepoints <- linearSegmentation(x=1:length(data),y=data,angle.tolerance = 90,n.fit=10,plot=T)
changepoints
#[1] 13 24 36 58 72 106

Create a Factor Analysis diagram/cluster visualization using R

I conducted a Factor Analysis in R and discovered there are 4 latent variables (factors).
I am trying to run a diagram/cluster visualization, to better represent my findings.
I would want something like this:
Here is my output:
print(factor2$loadings,cutoff = 0.3)
Loadings:
MR2 MR1 MR4 MR3
Inflatie 0.796
Dobanda 0.439
optq20 0.627
optq22 0.661
optq25 0.489
optq27 0.462
optq28 0.651
optq29 0.359
optq30 0.636
optq36 0.322
optq37 0.621
optq38 0.517
optq39 0.620
optq43 0.543
MR2 MR1 MR4 MR3
SS loadings 1.560 1.524 1.225 0.873
Proportion Var 0.111 0.109 0.087 0.062
Cumulative Var 0.111 0.220 0.308 0.370
I have searched all over the Internet but could not find some useful code for that.
Thank you!

Simultaneous factor solution - Factor analysis

In order to replicate the results of a previous study, I am trying to apply a method of factor analysis of a matrix that is described in Horst (1965) as "basic structure with simultaneous factor solution".
How would I approach this method in R?
Given a matrix m, and providing for instance that I extract two factors, I have tried applying the following:
fa(r = cor(m), rotate = 'none', factors = 2)
but I don't think this approach is right.
Just found out.
Library(psych)
Principal(r= cor(m), rotate = " none ", nfactor= 2)
Does the job. Horst refers to what is also called an eigen value decomposition. It can also be done using eigen() and attaining the same result.
.. not really.. loadings seem pretty close but looking at the maths I am not sure the method described below is akin to eigen value decomposition in fact, looking more closely, the method is applied directly on the raw data and no product momentum calculations are required..
.. I am trying (slowly) to work out the maths myself and to understand what the computation instruction describes.
For your information, here is the standardized matrix that is used for the calculation carried out in the example in the original textbook:
0.444 0.627 1.458 1.754 2.967 2.585 0.970 0.616 0.853
2.648 2.563 1.950 -1.341 -1.015 -0.700 0.904 0.976 0.150
-0.104 -0.159 0.049 0.510 -0.378 -0.468 2.217 2.378 2.291
-0.970 -1.216 -1.129 -0.079 -0.378 -0.645 0.287 0.312 -2.266
-1.164 -1.060 -1.485 -1.878 -0.021 -0.530 -1.483 0.190 -0.429
-0.956 -1.122 -0.938 -1.282 -0.779 0.121 0.447 -1.565 -0.429
0.198 -0.242 -0.055 0.021 0.526 -1.528 -0.575 -1.244 -0.114
-0.035 -0.485 1.129 -0.014 -0.894 -0.316 -1.421 -0.705 -0.349
-1.050 0.786 -0.048 0.101 -0.354 -0.433 -0.298 -0.377 -0.256
0.298 0.197 -0.010 0.558 0.253 0.464 -0.284 -0.240 -0.031
0.568 0.367 -0.429 0.811 -0.007 0.786 -0.250 0.081 0.541
0.125 -0.256 -0.492 0.839 0.079 0.665 -0.513 -0.422 0.039
here are the computation instructions and examples
... I was wondering if this is just a standard approach in factor analysis or in pricipal component analysis.. and if so, which one? The introduction says that this method is rank reduction type solution in the sense that the major product of the factor score and factor loading matrices yields a residual which is precisely of rank equal to that of the original matrix less the number of factors.
This particular type of analysis is "direct" in the sense that is carried out directly on the raw data (at best it is the normalized matrix).

Building a time series of r-squared values in R

I have multiple xts objects that look something like this:
t x0 x1 x2 x3 x4
2000-01-01 0.397 0.262 -0.039 0.440 0.186
2000-01-02 0.020 -0.219 0.197 0.301 -0.300
2000-01-03 0.010 0.064 -0.034 0.214 -0.451
2000-01-04 -0.002 -0.483 -0.251 0.023 0.153
2000-01-05 0.451 0.375 0.566 0.258 -0.092
2000-01-06 0.411 0.219 0.108 0.137 -0.087
2000-01-07 0.111 -0.039 0.187 -0.271 0.346
2000-01-08 0.131 -0.078 0.163 -0.057 -0.313
2000-01-09 -0.409 -0.022 0.211 -0.297 0.217
.
.
.
Now, I am interested in finding out how well the average of the other variables explains each single x variable during each period 𝜏 (e.g. monthly).
Or in other words, I'm interested in building a time series of the r-squared in the following regression:
xi,t = 𝛽0 + 𝛽1 avgi,t + πœ€i,t
where avgi,t is be the average of all other variables at time t. And running that for each i πœ– n variables and observations t πœ– 𝜏 (e.g. 𝜏 is a month). I would like to be able to run this with any period 𝜏, where I believe xts might be able to help, since there are functions such as endpoints or apply.monthly?
I have years and years of data, and multiple of those xts objects, so the question is, what would be a sensible way to run those regressions, collecting the time series of r-squared values? A for-loop should be able to take care of doing that for each separate xts object, but I really don't think a for-loop would be very wise for each single object?

How to set the level above which to display factor loadings from factanal() in R?

I was performing factor analysis with data state.x77, which is in R by default. After running the analysis, I inspected the factor loadings.
> output = factanal(state.x77, factors=3, rotation="promax")
> ld = output$loadings
> ld
Loadings:
Factor1 Factor2 Factor3
Population 0.161 0.239 -0.316
Income -0.149 0.681
Illiteracy 0.446 -0.284 -0.393
Life Exp -0.924 0.172 -0.221
Murder 0.917 0.103 -0.129
HS Grad -0.414 0.731
Frost 0.107 1.046
Area 0.387 0.585 0.101
Factor1 Factor2 Factor3
SS loadings 2.274 1.519 1.424
Proportion Var 0.284 0.190 0.178
Cumulative Var 0.284 0.474 0.652
It looks like that by default R is blocking all values less than 0.1. I was wondering if there is a way to set this blocking level by hand, say 0.3 instead of 0.1?
try this:
print(output$loadings, cutoff = 0.3)
see ?print.loadings for the details.

Resources