I am using the effects R package and its effect function on a Cox model. There is a default method for this function, so in principle it should work for any model.
When I try to use it, I get this error:
> eff_cf <- effect("TP53:MDM2", model)
Error in mod.matrix %*% mod$coefficients[!is.na(coef(mod))] :
non-conformable arguments
My model looks like this:
> model
Call:
coxph(formula = Surv(times, patient.vital_status) ~ TP53 + MDM2 +
TP53:MDM2, data = clinForPlot)
coef exp(coef) se(coef) z p
TP53Other -0.163 0.850 0.217 -0.752 4.5e-01
TP53WILD -1.086 0.337 0.277 -3.928 8.6e-05
MDM2(1183.7,1674.7] -0.669 0.512 0.235 -2.851 4.4e-03
MDM2(1674.7,2248.5] -0.744 0.475 0.305 -2.444 1.5e-02
MDM2(2248.5,50339] -0.867 0.420 0.375 -2.308 2.1e-02
TP53Other:MDM2(1183.7,1674.7] 0.394 1.483 0.412 0.958 3.4e-01
TP53WILD:MDM2(1183.7,1674.7] 0.133 1.142 0.413 0.323 7.5e-01
TP53Other:MDM2(1674.7,2248.5] -0.192 0.825 0.517 -0.372 7.1e-01
TP53WILD:MDM2(1674.7,2248.5] 0.546 1.726 0.433 1.260 2.1e-01
TP53Other:MDM2(2248.5,50339] -0.140 0.869 0.650 -0.215 8.3e-01
TP53WILD:MDM2(2248.5,50339] 0.786 2.195 0.484 1.623 1.0e-01
Likelihood ratio test=72.8 on 11 df, p=3.54e-11 n= 1321, number of events= 258
The model and the data.frame used to fit it can be reproduced with this code:
library(archivist)
model <- loadFromGithubRepo("68eeefba87be70364eb3801cec58eb3d",
user = "MarcinKosinski",
repo = "Museum",
value = TRUE)
clinForPlot <- loadFromGithubRepo("cfa5145e6b98964d5f8b760bf749e426",
user = "MarcinKosinski",
repo = "Museum",
value = TRUE)
Any idea how to fix this and what is wrong?
I am trying to present a table of average partial effects for multiple binary-response models. Back in December last year, I had no issue calling modelsummary on a list of objects of class "margins" "data.frame" to produce such a table.
However, when I now try to call modelsummary on a list of margins or marginaleffects objects, I get the following:
# Working example
df <- mtcars
df$cyl <- as.factor(df$cyl)
library(modelsummary)
library(tidyverse)
library(marginaleffects)
# Binary-response models
model1 <- glm(am ~ mpg + cyl,
              data = df,
              family = quasibinomial(link = 'logit'))
model2 <- glm(am ~ mpg + cyl,
              data = df,
              family = quasibinomial(link = 'probit'))
model3 <- glm(am ~ wt + cyl,
              data = df,
              family = quasibinomial(link = 'logit'))
model4 <- glm(am ~ wt + cyl,
              data = df,
              family = quasibinomial(link = 'probit'))
models <- list(model1, model2, model3, model4)
mfx <- lapply(models, marginaleffects)
modelsummary(mfx,
             output = 'markdown')
|          | Model 1 | Model 2 | Model 3 | Model 4 |
|----------|---------|---------|---------|---------|
| mpg      | 0.056   | 0.057   |         |         |
|          | (0.027) | (0.026) |         |         |
| cyl      | 0.093   | 0.091   | 0.120   | 0.121   |
|          | 0.093   | 0.091   | 0.120   | 0.260   |
|          | 0.093   | 0.091   | 0.253   | 0.121   |
|          | 0.093   | 0.091   | 0.253   | 0.260   |
|          | 0.093   | 0.102   | 0.120   | 0.121   |
|          | 0.093   | 0.102   | 0.120   | 0.260   |
|          | 0.093   | 0.102   | 0.253   | 0.121   |
|          | 0.093   | 0.102   | 0.253   | 0.260   |
|          | 0.097   | 0.091   | 0.120   | 0.121   |
|          | 0.097   | 0.091   | 0.120   | 0.260   |
|          | 0.097   | 0.091   | 0.253   | 0.121   |
|          | 0.097   | 0.091   | 0.253   | 0.260   |
|          | 0.097   | 0.102   | 0.120   | 0.121   |
|          | 0.097   | 0.102   | 0.120   | 0.260   |
|          | 0.097   | 0.102   | 0.253   | 0.121   |
|          | 0.097   | 0.102   | 0.253   | 0.260   |
|          | (0.169) | (0.174) | (0.058) | (0.053) |
|          | (0.169) | (0.174) | (0.058) | (0.063) |
|          | (0.169) | (0.174) | (0.068) | (0.053) |
|          | (0.169) | (0.174) | (0.068) | (0.063) |
|          | (0.169) | (0.236) | (0.058) | (0.053) |
|          | (0.169) | (0.236) | (0.058) | (0.063) |
|          | (0.169) | (0.236) | (0.068) | (0.053) |
|          | (0.169) | (0.236) | (0.068) | (0.063) |
|          | (0.238) | (0.174) | (0.058) | (0.053) |
|          | (0.238) | (0.174) | (0.058) | (0.063) |
|          | (0.238) | (0.174) | (0.068) | (0.053) |
|          | (0.238) | (0.174) | (0.068) | (0.063) |
|          | (0.238) | (0.236) | (0.058) | (0.053) |
|          | (0.238) | (0.236) | (0.058) | (0.063) |
|          | (0.238) | (0.236) | (0.068) | (0.053) |
|          | (0.238) | (0.236) | (0.068) | (0.063) |
| wt       |         |         | -0.533  | -0.556  |
|          |         |         | (0.080) | (0.072) |
| Num.Obs. | 32      | 32      | 32      | 32      |
| F        | 2.157   | 2.707   | 4.417   | 5.832   |
| RMSE     | 0.39    | 0.39    | 0.27    | 0.27    |
I get the following warning messages:
---
Objects of class 'NULL' are currently not supported.
Objects of class 'NULL' are currently not supported.
Objects of class 'NULL' are currently not supported.
Objects of class 'NULL' are currently not supported.
Warning message:
There are duplicate term names in the table. The `shape` argument of the `modelsummary` function can be used to print related terms together, and to label them appropriately. You can find the group identifier to use in the `shape` argument by calling `get_estimates()` on one of your models. Candidate group identifiers include: type, contrast. See `?modelsummary` for details.
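The group identifiers that the warning refers to can be inspected directly; a minimal sketch using the mfx list defined above:
get_estimates(mfx[[1]])
# returns a data frame with a 'term' column plus candidate group identifier
# columns such as 'contrast' and 'type', which can then be used in `shape`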
I tried following the warning message and added a shape argument, shape = term + contrast ~ model, which results in a better table. But I still get the "objects of class NULL" warning, and I also get an additional column in my table:
|          |       | Model 1 | Model 2 | Model 3 | Model 4 |
|----------|-------|---------|---------|---------|---------|
| mpg      | dY/dX | 0.056   | 0.057   |         |         |
|          |       | (0.027) | (0.026) |         |         |
| cyl      | 6 - 4 | 0.097   | 0.091   | 0.120   | 0.121   |
|          |       | (0.169) | (0.174) | (0.058) | (0.053) |
|          | 8 - 4 | 0.093   | 0.102   | 0.253   | 0.260   |
|          |       | (0.238) | (0.236) | (0.068) | (0.063) |
| wt       | dY/dX |         |         | -0.533  | -0.556  |
|          |       |         |         | (0.080) | (0.072) |
| Num.Obs. |       | 32      | 32      | 32      | 32      |
| F        |       | 2.157   | 2.707   | 4.417   | 5.832   |
| RMSE     |       | 0.39    | 0.39    | 0.27    | 0.27    |
I would like to know if there is any way to get rid of that warning message, and also how to omit/edit the new column that appears in the table with the shape argument.
This answer uses the development version of modelsummary, which can be installed as follows (make sure you restart R after install):
library(remotes)
install_github("vincentarelbundock/modelsummary")
There are two main approaches:
1. Display term names and contrast identifiers in separate columns, and rename them using the group_map argument.
2. Use the shape argument to combine the term and contrast identifiers.
First, note that contrasts and slopes are uniquely identified by two columns, term and contrast:
library(modelsummary)
library(marginaleffects)
mod <- list(
glm(am ~ factor(cyl), data = mtcars, family = binomial),
glm(am ~ factor(cyl) + mpg, data = mtcars, family = binomial))
mfx <- lapply(mod, marginaleffects)
tidy(mfx[[1]])
#> type term contrast estimate std.error statistic p.value
#> 1 response cyl 6 - 4 -0.2987013 0.2302534 -1.297272 0.1945376094
#> 2 response cyl 8 - 4 -0.5844156 0.1636389 -3.571373 0.0003551141
#> conf.low conf.high
#> 1 -0.7499897 0.1525871
#> 2 -0.9051419 -0.2636893
We can display terms and contrasts in separate columns using the shape argument:
modelsummary(
mfx,
shape = term + contrast ~ model
)
|          |       | Model 1 | Model 2 |
|----------|-------|---------|---------|
| cyl      | 6 - 4 | -0.299  | 0.097   |
|          |       | (0.230) | (0.166) |
|          | 8 - 4 | -0.584  | 0.093   |
|          |       | (0.164) | (0.233) |
| mpg      | dY/dX |         | 0.056   |
|          |       |         | (0.027) |
| Num.Obs. |       | 32      | 32      |
| AIC      |       | 39.9    | 37.4    |
| BIC      |       | 44.3    | 43.3    |
| Log.Lik. |       | -16.967 | -14.702 |
| F        |       | 3.691   | 2.236   |
| RMSE     |       | 0.42    | 0.39    |
We can rename and reorder the groups with the group_map argument:
modelsummary(
mfx,
group_map = c("8 - 4" = "High - Low", "6 - 4" = "Mid - Low", "dY/dX" = "Slope"),
shape = term + contrast ~ model
)
|          |            | Model 1 | Model 2 |
|----------|------------|---------|---------|
| cyl      | High - Low | -0.584  | 0.093   |
|          |            | (0.164) | (0.233) |
|          | Mid - Low  | -0.299  | 0.097   |
|          |            | (0.230) | (0.166) |
| mpg      | Slope      |         | 0.056   |
|          |            |         | (0.027) |
| Num.Obs. |            | 32      | 32      |
| AIC      |            | 39.9    | 37.4    |
| BIC      |            | 44.3    | 43.3    |
| Log.Lik. |            | -16.967 | -14.702 |
| F        |            | 3.691   | 2.236   |
| RMSE     |            | 0.42    | 0.39    |
We can display terms and group identifiers in the same column by including an interaction in the shape formula:
modelsummary(
mfx,
shape = term : contrast ~ model
)
|           | Model 1 | Model 2 |
|-----------|---------|---------|
| cyl 6 - 4 | -0.299  | 0.097   |
|           | (0.230) | (0.166) |
| cyl 8 - 4 | -0.584  | 0.093   |
|           | (0.164) | (0.233) |
| mpg dY/dX |         | 0.056   |
|           |         | (0.027) |
| Num.Obs.  | 32      | 32      |
| AIC       | 39.9    | 37.4    |
| BIC       | 44.3    | 43.3    |
| Log.Lik.  | -16.967 | -14.702 |
| F         | 3.691   | 2.236   |
| RMSE      | 0.42    | 0.39    |
We can then rename or omit using the coef_* arguments:
modelsummary(
mfx,
shape = term : contrast ~ model,
coef_rename = function(x) gsub("dY/dX", "(Slope)", x)
)
|             | Model 1 | Model 2 |
|-------------|---------|---------|
| cyl 6 - 4   | -0.299  | 0.097   |
|             | (0.230) | (0.166) |
| cyl 8 - 4   | -0.584  | 0.093   |
|             | (0.164) | (0.233) |
| mpg (Slope) |         | 0.056   |
|             |         | (0.027) |
| Num.Obs.    | 32      | 32      |
| AIC         | 39.9    | 37.4    |
| BIC         | 44.3    | 43.3    |
| Log.Lik.    | -16.967 | -14.702 |
| F           | 3.691   | 2.236   |
| RMSE        | 0.42    | 0.39    |
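Omission works along the same lines: coef_omit takes a regular expression that is matched against the displayed term labels. A small sketch, assuming the same mfx list:
modelsummary(
  mfx,
  shape = term : contrast ~ model,
  coef_omit = "mpg" # drops the mpg slope row, keeping only the cyl contrasts
)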
There's a lot to discuss here:
1. The new column that appeared is necessary because of the variable cyl. Since the glm function dummifies categorical variables internally, you end up with 2 variables out of cyl (it has 3 categories, so 3 minus 1 to avoid perfect multicollinearity). The table has to distinguish them in some way; see the short sketch after this list.
2. The default summary style (the modelsummary_get option) does not yet support the "marginaleffects" model class. It is a potential issue you could report to help the package's development.
3. Unfortunately, I couldn't figure out how to include the F-statistic and the RMSE metric in the final table, so my answer is probably incomplete in terms of the desired output. This change in the function's behavior was odd.
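To see point 1 concretely, the dummy coding that glm performs internally can be reproduced with model.matrix; a minimal sketch on mtcars (not part of the workaround below):
# factor(cyl) has levels 4, 6 and 8; the reference level (4) is dropped and
# indicator columns are kept for 6 and 8
head(model.matrix(~ mpg + factor(cyl), data = mtcars))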
# compatible style
options(modelsummary_get = "broom")
df <- mtcars
# manually dummifying
df$cyl4 <- ifelse(df$cyl == 4, 1, 0)
df$cyl6 <- ifelse(df$cyl == 6, 1, 0)
# (just avoiding excessive typing)
df <- subset(df, select = c("am", "cyl4", "cyl6", "mpg", "wt"))
glm2 <- function(x, y) glm(x, data = df, family = quasibinomial(y))
model1 <- glm2(am ~ . - wt, "logit")
model2 <- glm2(am ~ . - wt, "probit")
model3 <- glm2(am ~ . - mpg, "logit")
model4 <- glm2(am ~ . - mpg, "probit")
models <- list(model1, model2, model3, model4)
mfx <- lapply(models, marginaleffects::marginaleffects)
modelsummary::modelsummary(mfx, output = "markdown", gof_map = "nobs")
|          | Model 1 | Model 2 | Model 3 | Model 4 |
|----------|---------|---------|---------|---------|
| cyl4     | -0.106  | -0.116  | -0.365  | -0.389  |
|          | (0.300) | (0.299) | (0.137) | (0.132) |
| cyl6     | 0.005   | -0.011  | -0.154  | -0.164  |
|          | (0.205) | (0.202) | (0.100) | (0.093) |
| mpg      | 0.056   | 0.057   |         |         |
|          | (0.027) | (0.026) |         |         |
| wt       |         |         | -0.533  | -0.556  |
|          |         |         | (0.080) | (0.072) |
| Num.Obs. | 32      | 32      | 32      | 32      |
I am trying to predict and graph models with species presence as the response. However, I've run into the following problem: the ggpredict outputs are wildly different for the same data in glmer and glmmTMB, even though the estimates and AIC are very similar. These are simplified models including only date (which has been centered and scaled), which seems to be the variable that is most problematic to predict.
library(lme4)
yntest <- glmer(MYOSOD.P ~ jdate.z + I(jdate.z^2) + I(jdate.z^3) +
                  (1 | area/SiteID), family = binomial, data = sodpYN)
> summary(yntest)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: MYOSOD.P ~ jdate.z + I(jdate.z^2) + I(jdate.z^3) + (1 | area/SiteID)
Data: sodpYN
AIC BIC logLik deviance df.resid
1260.8 1295.1 -624.4 1248.8 2246
Scaled residuals:
Min 1Q Median 3Q Max
-2.0997 -0.3218 -0.2013 -0.1238 9.4445
Random effects:
Groups Name Variance Std.Dev.
SiteID:area (Intercept) 1.6452 1.2827
area (Intercept) 0.6242 0.7901
Number of obs: 2252, groups: SiteID:area, 27; area, 9
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.96778 0.39190 -7.573 3.65e-14 ***
jdate.z -0.72258 0.17915 -4.033 5.50e-05 ***
I(jdate.z^2) 0.10091 0.08068 1.251 0.21102
I(jdate.z^3) 0.25025 0.08506 2.942 0.00326 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) jdat.z I(.^2)
jdate.z 0.078
I(jdat.z^2) -0.222 -0.154
I(jdat.z^3) -0.071 -0.910 0.199
The glmmTMB model and its summary:
library(glmmTMB)
Tyntest <- glmmTMB(MYOSOD.P ~ jdate.z + I(jdate.z^2) + I(jdate.z^3) +
                     (1 | area/SiteID), family = binomial("logit"), data = sodpYN)
> summary(Tyntest)
Family: binomial ( logit )
Formula: MYOSOD.P ~ jdate.z + I(jdate.z^2) + I(jdate.z^3) + (1 | area/SiteID)
Data: sodpYN
AIC BIC logLik deviance df.resid
1260.8 1295.1 -624.4 1248.8 2246
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
SiteID:area (Intercept) 1.6490 1.2841
area (Intercept) 0.6253 0.7908
Number of obs: 2252, groups: SiteID:area, 27; area, 9
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.96965 0.39638 -7.492 6.78e-14 ***
jdate.z -0.72285 0.18250 -3.961 7.47e-05 ***
I(jdate.z^2) 0.10096 0.08221 1.228 0.21941
I(jdate.z^3) 0.25034 0.08662 2.890 0.00385 **
---
ggpredict outputs:
library(ggeffects)
testg <- ggpredict(yntest, terms = "jdate.z[all]")
> testg
# Predicted probabilities of MYOSOD.P
# x = jdate.z
x predicted std.error conf.low conf.high
-1.95 0.046 0.532 0.017 0.120
-1.51 0.075 0.405 0.036 0.153
-1.03 0.084 0.391 0.041 0.165
-0.58 0.072 0.391 0.035 0.142
-0.14 0.054 0.390 0.026 0.109
0.35 0.039 0.399 0.018 0.082
0.79 0.034 0.404 0.016 0.072
1.72 0.067 0.471 0.028 0.152
Adjusted for:
* SiteID = 0 (population-level)
* area = 0 (population-level)
Standard errors are on link-scale (untransformed).
testgTMB<- ggpredict(Tyntest, "jdate.z[all]")
> testgTMB
# Predicted probabilities of MYOSOD.P
# x = jdate.z
x predicted std.error conf.low conf.high
-1.95 0.444 0.826 0.137 0.801
-1.51 0.254 0.612 0.093 0.531
-1.03 0.136 0.464 0.059 0.280
-0.58 0.081 0.404 0.038 0.163
-0.14 0.054 0.395 0.026 0.110
0.35 0.040 0.402 0.019 0.084
0.79 0.035 0.406 0.016 0.074
1.72 0.040 0.444 0.017 0.091
Adjusted for:
* SiteID = NA (population-level)
* area = NA (population-level)
Standard errors are on link-scale (untransformed).
The predicted values are completely different and I have no idea why.
I tried both the CRAN release and the development version of the ggeffects package in case that changed anything; it did not. I am also using the most up-to-date version of glmmTMB.
This is my first time asking a question here so please let me know if I should provide more information to help explain the problem.
I checked and the issue is the same when using predict instead of ggpredict, which would imply that it is a glmmTMB issue?
GLMER:
dayplotg<-expand.grid(jdate.z=seq(min(sodp$jdate.z), max(sodp$jdate.z), length=92))
Dfitg<-predict(yntest, re.form=NA, newdata=dayplotg, type='response')
dayplotg<-data.frame(dayplotg, Dfitg)
head(dayplotg)
> head(dayplotg)
jdate.z Dfitg
1 -1.953206 0.04581691
2 -1.912873 0.04889584
3 -1.872540 0.05195598
4 -1.832207 0.05497553
5 -1.791875 0.05793307
6 -1.751542 0.06080781
glmmTMB:
dayplot<-expand.grid(jdate.z=seq(min(sodp$jdate.z), max(sodp$jdate.z), length=92),
SiteID=NA,
area=NA)
Dfit<-predict(Tyntest, newdata=dayplot, type='response')
head(Dfit)
dayplot<-data.frame(dayplot, Dfit)
head(dayplot)
> head(dayplot)
jdate.z SiteID area Dfit
1 -1.953206 NA NA 0.4458236
2 -1.912873 NA NA 0.4251926
3 -1.872540 NA NA 0.4050944
4 -1.832207 NA NA 0.3855801
5 -1.791875 NA NA 0.3666922
6 -1.751542 NA NA 0.3484646
I contacted the ggpredict developer and figured out that if I used poly(jdate.z,3) rather than jdate.z + I(jdate.z^2) + I(jdate.z^3) in the glmmTMB model, the glmer and glmmTMB predictions were the same.
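For anyone hitting the same thing, here is a minimal sketch of the refit described above (it assumes the same sodpYN data and variables as in the question; the object name Tyntest_poly is just for illustration):
library(glmmTMB)
library(ggeffects)
# a single poly() term instead of the separate I() powers
Tyntest_poly <- glmmTMB(MYOSOD.P ~ poly(jdate.z, 3) + (1 | area/SiteID),
                        family = binomial("logit"), data = sodpYN)
# predictions should now agree with the glmer fit
ggpredict(Tyntest_poly, "jdate.z[all]")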
I'll leave this post up, even though I was able to answer my own question, in case someone else runs into the same issue later.
I have run a Confirmatory Factor Analysis and I now would like to apply the Fornell/Larcker Criterion. For doing so, I need the correlation between the latent variables. How can I display/retrieve the correlation between the latent variables?
I have tried the following commands, which all generate output:
standardizedSolution(fit)
summary(fit, fit.measures=TRUE)
lavInspect(fit,"standardized")
But none of these commands generates a "phi" (the covariance matrix between the latent variables). Thus, I have two questions:
1) Does anyone know how to display the correlations between the latent variables of a confirmatory factor analysis in R?
2) Take a look at the output of lavInspect(fit, "standardized") (see the link at the bottom of the text). Instead of a "phi" it generates a "$psi". Could that "psi" be the "phi"? The matrix it generates looks like a correlation matrix.
Here is the code:
#packages
library(lavaan)
library(readr)
CNCS<- read_delim("Desktop/20190703 Full Launch/Regressionen/Factor analysis/CNCS -47 Reversed.csv",
";", escape_double = FALSE, trim_ws = TRUE)
View(CNCS)
library(carData)
library(car)
CNCS.model <-
  'AttitudeTowardsTheDeal =~ Q42_1 + Q42_2 + Q42_3
   SubjectiveNormsImportance =~ Q43_r1 + Q43_r2 + Q43_r3 + Q43_r4
   SubjectiveNormsFavour =~ Q44_r1 + Q44_r2 + Q44_r3 + Q44_r4
   EaseOfPurchasing =~ Q45_r1 + Q45_r2 + Q45_r3 + Q45_r4 + Q45_r5 + Q45_r6
   SE =~ Q3_r1 + Q3_r2 + Q3_r3 + Q4_r4
   ConsumerInnovativeness =~ Q4_r1 + Q4_r2 + Q4_r3 + Q4_r4 + Q4_r5
   PurchaseIntention =~ Q41moeglich_1 + Q41gewiss_1 + Q1wahrscheinlich_1 + Q41vorauss_1'
fit <- cfa(CNCS.model, data=CNCS)
summary(fit, fit.measures=TRUE)
lavInspect(fit,"standardized")
standardizedSolution(fit)
Partial output of lavInspect(fit, "standardized"): please follow the link to the screenshot.
Take the cfa example given in the manual as
library(lavaan)
## The famous Holzinger and Swineford (1939) example
HS.model <- ' visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data=HolzingerSwineford1939)
and include the standardized fit in the summary with
summary(fit, standardized = TRUE)
obtaining
...
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual =~
x1 1.000 0.900 0.772
x2 0.554 0.100 5.554 0.000 0.498 0.424
x3 0.729 0.109 6.685 0.000 0.656 0.581
textual =~
x4 1.000 0.990 0.852
x5 1.113 0.065 17.014 0.000 1.102 0.855
x6 0.926 0.055 16.703 0.000 0.917 0.838
speed =~
x7 1.000 0.619 0.570
x8 1.180 0.165 7.152 0.000 0.731 0.723
x9 1.082 0.151 7.155 0.000 0.670 0.665
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual ~~
textual 0.408 0.074 5.552 0.000 0.459 0.459
speed 0.262 0.056 4.660 0.000 0.471 0.471
textual ~~
speed 0.173 0.049 3.518 0.000 0.283 0.283
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.549 0.114 4.833 0.000 0.549 0.404
.x2 1.134 0.102 11.146 0.000 1.134 0.821
.x3 0.844 0.091 9.317 0.000 0.844 0.662
.x4 0.371 0.048 7.779 0.000 0.371 0.275
.x5 0.446 0.058 7.642 0.000 0.446 0.269
.x6 0.356 0.043 8.277 0.000 0.356 0.298
.x7 0.799 0.081 9.823 0.000 0.799 0.676
.x8 0.488 0.074 6.573 0.000 0.488 0.477
.x9 0.566 0.071 8.003 0.000 0.566 0.558
visual 0.809 0.145 5.564 0.000 1.000 1.000
textual 0.979 0.112 8.737 0.000 1.000 1.000
speed 0.384 0.086 4.451 0.000 1.000 1.000
You find the entries of the covariance matrix in the Covariances: and Variances: sections (column Estimate), and the entries of the correlation matrix in column Std.lv.
Note that inspect, or rather lavInspect, provides the argument what, which defaults to "free". Taken from the manual, the three other relevant options are
"est": A list of model matrices. The values represent the estimated model parameters. Aliases: "estimates", and "x".
"std": A list of model matrices. The values represent the (completely) standardized model parameters (the variances of both the observed and the latent variables are set to unity). Aliases: "std.all", "standardized".
"std.lv": A list of model matrices. The values represent the standardized model parameters (only the variances of the latent variables are set to unity.)
which refer to the summary columns Estimate, Std.lv, and Std.all. Further, try the following line:
cov2cor(lavInspect(fit, what = "est")$psi)
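As an aside not covered in the answer above, lavInspect can also return the model-implied latent covariance and correlation matrices directly, which should give the same correlations as the cov2cor() call:
lavInspect(fit, what = "cov.lv") # covariance matrix of the latent variables
lavInspect(fit, what = "cor.lv") # correlation matrix of the latent variables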
In case of any remaining doubt, I recommend you consult the tutorial, the package's support infrastructure, or the homepage.
I was performing factor analysis with data state.x77, which is in R by default. After running the analysis, I inspected the factor loadings.
> output = factanal(state.x77, factors=3, rotation="promax")
> ld = output$loadings
> ld
Loadings:
Factor1 Factor2 Factor3
Population 0.161 0.239 -0.316
Income -0.149 0.681
Illiteracy 0.446 -0.284 -0.393
Life Exp -0.924 0.172 -0.221
Murder 0.917 0.103 -0.129
HS Grad -0.414 0.731
Frost 0.107 1.046
Area 0.387 0.585 0.101
Factor1 Factor2 Factor3
SS loadings 2.274 1.519 1.424
Proportion Var 0.284 0.190 0.178
Cumulative Var 0.284 0.474 0.652
It looks like by default R is hiding (blanking out) all loadings smaller than 0.1 in absolute value. I was wondering if there is a way to set this cutoff by hand, say to 0.3 instead of 0.1?
Try this:
print(output$loadings, cutoff = 0.3)
See ?print.loadings for the details.
I am currently trying to obtain equivalent results with the proc princomp command in SAS and the princomp() command in R (in the stats package). The results I am getting are very similar, leading me to suspect that this isn't a problem with different option settings in the two commands. However, the outputs are also different enough that the component scores for each data row are notably different. They are also sign-reversed, but this doesn't matter, of course.
The end goal of this analysis is to produce a set of coefficients from the PCA to score data outside the PCA routine (i.e. a formula that can be applied to new datasets to easily produce scored data).
Without posting all my data, I'm hoping someone can provide some information on how these two commands may differ in their calculations. I don't know enough about the PCA math to determine if this is a conceptual difference in the processes or just something like an internal rounding difference. For simplicity, I'll post the eigenvectors for PC1 and PC2 only.
In SAS:
proc princomp data=climate out=pc_out outstat=pc_outstat;
var MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
returns
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
MAT 0.372 0.257 -.035 -.033 -.106 0.270 -.036 0.216 -.811
MWMT 0.381 0.077 0.160 -.261 0.627 0.137 -.054 0.497 0.302
MCMT 0.341 0.324 -.229 0.046 -.544 0.421 0.045 0.059 0.493
logMAP -.184 0.609 -.311 -.357 -.041 -.548 0.183 0.183 0.000
logMSP -.205 0.506 0.747 -.137 -.040 0.159 -.156 -.266 0.033
CMI -.336 0.287 -.451 0.096 0.486 0.499 0.050 -.318 -.031
cmiJJA -.365 0.179 0.112 0.688 -.019 0.012 0.015 0.588 0.018
DD_5 0.379 0.142 0.173 0.368 0.183 -.173 0.725 -.282 0.007
NFFD 0.363 0.242 -.136 0.402 0.158 -.351 -.637 -.264 0.052
In R:
PCA.model <- princomp(climate[,c("MAT","MWMT","MCMT","logMAP","logMSP","CMI","cmiJJA","DD.5","NFFD")], scores=T, cor=T)
PCA.model$loadings
returns
Eigenvectors
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
MAT -0.372 -0.269 0.126 -0.250 0.270 0.789
MWMT -0.387 -0.171 0.675 0.494 -0.325
MCMT -0.339 -0.332 0.250 0.164 -0.500 -0.414 -0.510
logMAP 0.174 -0.604 0.309 0.252 0.619 -0.213 0.125
logMSP 0.202 -0.501 -0.727 0.223 -0.162 0.175 -0.268
CMI 0.334 -0.293 0.459 -0.222 0.471 -0.495 -0.271
cmiJJA 0.365 -0.199 -0.174 -0.612 -0.247 0.590
DD.5 -0.382 -0.143 -0.186 -0.421 -0.695 -0.360
NFFD -0.368 -0.227 -0.487 0.309 0.655 -0.205
As you can see, the values are similar (sign reversed), but not identical. The differences matter in the scored data, the first row of which looks like this:
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
SAS -1.95 1.68 -0.54 0.72 -1.07 0.10 -0.66 -0.02 0.05
R 1.61 -1.99 0.52 -0.42 -1.13 -0.16 0.79 0.12 -0.09
If I use a GLM (in SAS) or lm() (in R) to calculate the coefficients from the scored data, I get very similar numbers (inverse sign), with the exception of the intercept. Like so:
in SAS:
proc glm order=data data=pc_out;
model Prin1 = MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
in R:
scored <- cbind(PCA.model$scores, climate)
pca.lm <- lm(Comp.1~MAT+MWMT+MCMT+logMAP+logMSP+CMI+cmiJJA+DD.5+NFFD, data=scored)
returns
Coefficients:
(Int) MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD.5 NFFD
SAS 0.42 0.04 0.06 0.03 -0.65 -0.69 -0.003 -0.01 0.0002 0.004
R -0.59 -0.04 -0.06 -0.03 0.62 0.68 0.004 0.02 -0.0002 -0.004
So it would seem that the model intercept is changing the value in the scored data. Any thoughts on why this happens (why the intercept is different) would be appreciated.
Thanks again to all those who commented. Embarrassingly, the differences I found between the SAS proc princomp and R princomp() procedures were actually the product of a data error that I made. Sorry to those who took time to help answer.
But rather than let this question go to waste, I will offer what I found to be statistically equivalent procedures for SAS and R when running a principal component analysis (PCA).
The following procedures are statistically equivalent, with data named 'mydata' and variables named 'Var1', 'Var2', and 'Var3'.
In SAS:
* Run the PCA on your data;
proc princomp data=mydata out=pc_out outstat=pc_outstat;
var Var1 Var2 Var3;
run;
* Use GLM on the individual components to obtain the coefficients to calculate the PCA scoring;
proc glm order=data data=pc_out;
model Prin1 = Var1 Var2 Var3;
run;
In R:
# Run the PCA on your data
PCA.model <- princomp(mydata[, c("Var1", "Var2", "Var3")], scores = TRUE, cor = TRUE)
# Component scores for the original data (predict() on the same data should
# reproduce PCA.model$scores)
scored <- predict(PCA.model, mydata)
scored <- cbind(PCA.model$scores, mydata)
# Use lm() on the first component to obtain the coefficients for the PCA scoring
lm(Comp.1 ~ Var1 + Var2 + Var3, data = scored)
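Since the stated goal was a formula that can be applied to new datasets, here is a minimal sketch of scoring fresh data in R (newdata is a placeholder for a data frame containing Var1, Var2, and Var3 measured like mydata):
# predict() applies the centering/scaling and loadings estimated on mydata
new.scores <- predict(PCA.model, newdata = newdata)
head(new.scores[, "Comp.1"])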