print() factor analysis output by sort in R

df.fa is the result of psych::fa(bfi[1:25], 5, rotate = 'oblimin', fm = 'minres', cor = 'cor').
When I run print(df.fa$loadings, sort = TRUE), I get:
Loadings:
MR2 MR1 MR3 MR5 MR4
N1 0.815 0.103 -0.111
N2 0.777
N3 0.706 -0.100
E1 -0.557 0.106 -0.103
E2 -0.676
E4 0.591 0.287
C1 0.546 0.148
C2 0.149 0.666
C3 0.567
C4 0.174 -0.614
C5 0.189 -0.142 -0.553
A2 0.640
A3 0.116 0.660
A5 -0.112 0.233 0.532
You can see that N2 only has a number under one factor (MR2), but why does N3 have numbers under two factors, and N1 even under three?
How can this be explained?

I would consider calculating absolute fit statistics to determine the goodness of fit of your current model. Then you could drop some of the items above that have low factor loadings and create a new model via Confirmatory Factor Analysis (CFA). The following three statistics are generally recommended:
Chi-square: recommended to be non-significant
Tucker-Lewis Index (TLI): recommended to be 0.9 or greater
Root Mean Square Error of Approximation (RMSEA): recommended to be less than 0.05
EFA_model <- fa(bfi[1:25], nfactors = 5)  # exploratory factor analysis (psych package)
EFA_model$TLI    # Tucker-Lewis Index
EFA_model$RMSEA  # root mean square error of approximation
EFA_model$chi    # chi-square statistic
You can then drop the items from your EFA_model$loadings that have low factor loadings and build a CFA model with the cfa() function.
Run the same assessment on the CFA model's absolute fit statistics as above (for example, CFA_model$TLI). You can also compare relative fit between your EFA and CFA models using the Bayesian Information Criterion (BIC) via EFA_model$BIC and CFA_model$BIC; the model with the lower BIC is preferred.
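As a minimal sketch of that second step, assuming the cfa() function meant here is the one from the lavaan package (the factor/item assignments below are purely illustrative, not taken from the loadings above):
library(lavaan)

# Hypothetical CFA keeping only items that loaded strongly in the EFA
cfa_syntax <- '
  Neuroticism  =~ N1 + N2 + N3
  Extraversion =~ E1 + E2 + E4
'
CFA_model <- cfa(cfa_syntax, data = bfi[1:25])

# Absolute and relative fit statistics for the CFA
fitMeasures(CFA_model, c("chisq", "pvalue", "tli", "rmsea", "bic"))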

Maybe this is a formatting question, not a statistical one. By default, low factor loadings are not printed.
The line below removes your "don't print below this" cutoff (it is 0.1 by default):
print(df.fa$loadings, sort = TRUE, cutoff = 0)

Related

Do results of survival analysis only pertain to the observations analyzed?

Hey guys, so I taught myself time-to-event analysis recently and I need some help understanding it. I made some Kaplan-Meier survival curves.
Sure, the number of observations within each node is small but let's pretend that I have plenty.
library(dplyr); library(survival)
K <- HF %>%
  filter(serum_creatinine <= 1.8, ejection_fraction <= 25)
summary(survfit(Surv(time, DEATH_EVENT) ~ 1, data = K))
## Call: survfit(formula = Surv(time, DEATH_EVENT) ~ 1, data = K)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 20 36 5 0.881 0.0500 0.788 0.985
## 45 33 3 0.808 0.0612 0.696 0.937
## 60 31 3 0.734 0.0688 0.611 0.882
## 80 23 6 0.587 0.0768 0.454 0.759
## 100 17 1 0.562 0.0776 0.429 0.736
## 110 17 0 0.562 0.0776 0.429 0.736
## 120 16 1 0.529 0.0798 0.393 0.711
## 130 14 0 0.529 0.0798 0.393 0.711
## 140 14 0 0.529 0.0798 0.393 0.711
## 150 13 1 0.488 0.0834 0.349 0.682
If someone were to ask me about the third node, would the following statements be valid?:
For any new patient that walks into this hospital with <= 1.8 in serum_creatinine and <= 25 in ejection_fraction, their probability of survival is 53% after 140 days.
What about:
The survival distributions for the samples analyzed, and no other future incoming samples, are visualized above.
I want to make sure these statements are correct.
I would also like to know whether logistic regression could be used to predict the binary variable DEATH_EVENT. Since the TIME variable determines how much weight one patient's death at 20 days carries relative to another patient's death at 175 days, I understand that this needs to be accounted for.
If logistic regression can be used, does that imply anything over keeping/removing variable TIME?
Here are some thoughts:
Logistic regression is not appropriate in your case, as it is not the correct method for time-to-event analysis.
If the clinical outcome observed is “either-or,” such as if a patient suffers an MI or not, logistic regression can be used.
However, if the information on the time to MI is the observed outcome, data are analyzed using statistical methods for survival analysis.
Text from here
If you want to use a regression model in survival analysis then you should use a Cox proportional hazards model. To understand the difference between a Kaplan-Meier analysis and a Cox proportional hazards model, you should understand both of them.
The next step would be to understand what a univariable Cox proportional hazards model is, in contrast to a multivariable one.
Once you understand all three methods (Kaplan-Meier, univariable Cox, and multivariable Cox), you can answer your question of whether this is a valid statement:
For any new patient that walks into this hospital with <= 1.8 in serum_creatinine and <= 25 in ejection_fraction, their probability of survival is 53% after 140 days.
There is nothing wrong with stating the results for a subgroup from a Kaplan-Meier analysis, but the statement carries different weight if it comes from a multivariable Cox regression analysis.
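As a rough sketch of what that multivariable model could look like on these data (my own illustration, not code from the original post), using coxph() from the survival package with the covariates treated as continuous rather than as subgroup cutoffs:
library(survival)

# Multivariable Cox proportional hazards model (illustrative)
cox_fit <- coxph(Surv(time, DEATH_EVENT) ~ serum_creatinine + ejection_fraction,
                 data = HF)
summary(cox_fit)  # hazard ratios, confidence intervals, and p-values per covariate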

GlmmTMB model and emmeans

I am new to glmmTMB models, so I have run into a problem.
I built a model, and based on AICtab and DHARMa this was the best:
Insecticide_2 <- glmmTMB(Insect_abundace ~ field_element + land_distance + sampling_time +
                           year + treatment_day + (1 | field_id),
                         data = Insect_002,
                         family = nbinom2)
After glmmTMB I ran Anova (from car), and then emmeans, but the p-values (and ratios) in the emmeans contrasts come out exactly the same for both years, even though the lower.CL and upper.CL of the means do not. What may be the problem? Is the model overfitted? Is the way I am doing the emmeans wrong?
Anova also showed that land_distance, sampling_time, and treatment_day were significant; year was almost significant (p = 0.07).
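The Anova step itself would look something like this (an assumption on my part, since the call is not shown in the post; car::Anova has a method for glmmTMB fits):
library(car)

# Wald chi-square tests for the glmmTMB fixed effects (assumed call, not from the post)
Anova(Insecticide_2)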
comp_emmeans1 <- emmeans(Insecticide_2, pairwise ~ land_distance | year, type = "response")
> comp_emmeans1
$emmeans
Year = 2018:
land_distance response SE df lower.CL upper.CL
30m 2.46 0.492 474 1.658 3.64
50m 1.84 0.369 474 1.241 2.73
80m 1.36 0.283 474 0.906 2.05
110m 1.25 0.259 474 0.836 1.88
Year = 2019:
land_distance response SE df lower.CL upper.CL
30m 3.42 0.593 474 2.434 4.81
50m 2.56 0.461 474 1.799 3.65
80m 1.90 0.335 474 1.343 2.68
110m 1.75 0.317 474 1.222 2.49
Results are averaged over the levels of: field_element, sampling_time, treatment_day
Confidence level used: 0.95
Intervals are back-transformed from the log scale
$contrasts
year = 2018:
contrast ratio SE df null t.ratio p.value
30m / 50m 1.34 0.203 474 1 1.906 0.2268
30m / 80m 1.80 0.279 474 1 3.798 0.0009
30m / 110m 1.96 0.311 474 1 4.239 0.0002
50m / 80m 1.35 0.213 474 1 1.896 0.2311
50m / 110m 1.47 0.234 474 1 2.405 0.0776
80m / 110m 1.09 0.176 474 1 0.516 0.9552
year = 2019:
contrast ratio SE df null t.ratio p.value
30m / 50m 1.34 0.203 474 1 1.906 0.2268
30m / 80m 1.80 0.279 474 1 3.798 0.0009
30m / 110m 1.96 0.311 474 1 4.239 0.0002
50m / 80m 1.35 0.213 474 1 1.896 0.2311
50m / 110m 1.47 0.234 474 1 2.405 0.0776
80m / 110m 1.09 0.176 474 1 0.516 0.9552
Results are averaged over the levels of: field_element, sampling_time, treatment_day
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log scale
Should I use a different way of comparing? I saw that some use poly ~; I tried that, and the picture of the results is the same. Also, am I comparing the right things?
A last and also important question: how should I report the glmmTMB, Anova, and emmeans results?
I don't recall seeing this question before, but it's been 8 months, and maybe I just forgot.
Anyway, I am not sure exactly what the question is, but there are three things going on that might possibly have caused some confusion:
The emmeans() call has the specification pairwise ~ land_distance|year, which causes it to compute both means and pairwise comparisons thereof. I think users are almost always better served by separating those steps, because estimating means and estimating contrasts are two different things.
The default way in which means are summarized (estimates, SEs, and confidence intervals) is different from the default for comparisons or other contrasts (estimates, SEs, t ratios, and adjusted P values). That's because, as I said before, these are two different things, and usually people want CIs for means and P values for contrasts. See below.
There is a log link in this model, and that has special properties when it comes to contrasts, because the difference on a log scale is the log of the ratio. So we display a ratio when we have type = "response". (With most other link functions, there is no way to back-transform the differences of transformed values.)
What I suggest, per (1), is to get the means (and not comparisons) first:
EMM <- emmeans(Insecticide_2, ~ land_distance | year, type = "response")
EMM # see the estimates
You can get pairwise comparisons next:
CON <- pairs(EMM) # or contrast(EMM, "pairwise")
CON # see the ratios as shown in the OP
confint(CON) # see confidence intervals instead of tests
confint(CON, type = "link") # See the pairwise differences on the log scale
If you actually want differences on the response scale rather than ratios, that's possible too:
pairs(regrid(EMM)) # tests
confint(pairs(regrid(EMM)))  # CIs
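On the reporting question, one convenient option (my suggestion, not part of the answer above) is to convert the emmeans objects to data frames and build tables from those:
# Plain data frames of the estimated marginal means and the contrasts,
# ready to be formatted into a results table
emm_table <- as.data.frame(EMM)
con_table <- as.data.frame(confint(CON))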

How to get a scientific p-value using the cmprsk package?

Hi stackoverflow community,
I am a recent R starter, and today I spent several hours trying to figure out how to get a p-value in scientific notation (e.g. 3e-1) from a competing risk analysis using the cmprsk package.
I used:
sumary_J1<-crr(ftime, fstatus, cov1, failcode=2)
summary(sumary_J1)
And got
Call:
crr(ftime = ftime, fstatus = fstatus, cov1 = cov1, failcode = 2)
coef exp(coef) se(coef) z p-value
group1 0.373 1.45 0.02684 13.90 0.00
age 0.122 1.13 0.00384 31.65 0.00
sex 0.604 1.83 0.04371 13.83 0.00
bmi 0.012 1.01 0.00611 1.96 0.05
exp(coef) exp(-coef) 2.5% 97.5%
group1 1.45 0.689 1.38 1.53
age 1.13 0.886 1.12 1.14
sex 1.83 0.546 1.68 1.99
bmi 1.01 0.988 1.00 1.02
Num. cases = 470690 (1900 cases omitted due to missing values)
Pseudo Log-likelihood = -28721
Pseudo likelihood ratio test = 2229 on 4 df,
I can see the p-value column, but I only get two decimal places. I would like to see as many decimal places as possible, or print those p-values in a format like 3.0e-3.
I tried all of those, but nothing worked so far:
summary(sumary_J1, digits=max(options()$digits - 5,10))
print.crr(sumary_J1, digits = 20)
print.crr(sumary_J1, digits = 3, scipen = -2)
print.crr(sumary_J1, format = "e", digits = 3)
Maybe someone is able to help me! Thanks!
Best,
Carolin
The digits argument limits the number of digits printed to the right of the decimal point when passed to the summary method, and it does affect how results are displayed by summary.crr:
summary(z, digits=3) # using first example in `?cmprsk::crr`
#----------------------
#Competing Risks Regression
Call:
crr(ftime = ftime, fstatus = fstatus, cov1 = cov)
coef exp(coef) se(coef) z p-value
x1 0.2668 1.306 0.421 0.633 0.526
x2 -0.0557 0.946 0.381 -0.146 0.884
x3 0.2805 1.324 0.381 0.736 0.462
exp(coef) exp(-coef) 2.5% 97.5%
x1 1.306 0.766 0.572 2.98
x2 0.946 1.057 0.448 2.00
x3 1.324 0.755 0.627 2.79
Num. cases = 200
Pseudo Log-likelihood = -320
Pseudo likelihood ratio test = 1.02 on 3 df,
You can use formatC to control format:
formatC( summary(z, digits=5)$coef , format="e")
#------------>
coef exp(coef) se(coef) z p-value
x1 "2.6676e-01" "1.3057e+00" "4.2115e-01" "6.3340e-01" "5.2647e-01"
x2 "-5.5684e-02" "9.4584e-01" "3.8124e-01" "-1.4606e-01" "8.8387e-01"
x3 "2.8049e-01" "1.3238e+00" "3.8098e-01" "7.3622e-01" "4.6159e-01"
You might also search on [r] very small p-value.
Here's the first of over 100 hits on that topic, which, despite not getting very much attention, still has very useful information and coding examples: Reading a very small p-value in R
By looking at the function that prints the output of crr() (cmprsk::print.crr) you can see what is done to create the p-values displayed in the summary. The code below is taken from that function.
x <- sumary_J1
v <- sqrt(diag(x$var))
signif(v, 4)  # gives you the standard errors of the coefficients
v <- 2 * (1 - pnorm(abs(x$coef)/v))
signif(v, 4)  # gives you the two-sided p-values

Linear model in R - Multiplication Expression

I have 3 numerical variables, A, B and C. I am trying to create a linear model capable of predicting A. The expression that I am using is the product B*C in order to predict A; however, when looking at the output I am not able to get my equation, because I get an extra variable that I don't know what it is.
Here is my code
MyData <- read.csv("...", header = TRUE)
head(MyData, 6)
str(MyData)
# Linear model
# Expression: A = B*C
Model1 <- lm(MyData$A ~ MyData$B * MyData$C)
summary(Model1)
Output of str(MyData)
> str(MyData)
'data.frame': 6 obs. of 3 variables:
$ A: num 2.5 3.4 2.7 3.6 2.5 2.1
$ B: num 0.01 0.02 0.015 0.017 0.018 0.01
$ C: num 0.1 0.2 0.27 0.19 0.17 0.16
Output of summary(Model1)
Call:
lm(formula = MyData$A ~ MyData$B * MyData$C)
Residuals:
1 2 3 4 5 6
-0.03945 -0.08386 -0.13925 0.67703 -0.40055 -0.01393
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.473 5.774 0.948 0.443
MyData$B -222.431 454.508 -0.489 0.673
MyData$C -26.482 36.222 -0.731 0.541
MyData$B:MyData$C 1938.961 2679.207 0.724 0.544
Residual standard error: 0.5688 on 2 degrees of freedom
Multiple R-squared: 0.6149, Adjusted R-squared: 0.03723
F-statistic: 1.064 on 3 and 2 DF, p-value: 0.5178
lm uses the Wilkinson-Rogers notation, so "*" is an interaction, based on the output, right? Is this true, and how do I create my model using the product of my two variables?
If you just want a single term that is the literal product of the two variables, not an interaction, you can use I():
Model1 <- lm(MyData$A ~ I(MyData$B * MyData$C))
I think in practice, with 2 numeric variables, this ends up the same as Dan's suggestion to use x1:x2 to get just the interaction without the terms for each individual predictor, but it might differ in other cases.
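A quick way to check this (my own illustration, not from the answers above) is to fit both forms on the same data and compare the summaries:
# Literal product as a single predictor vs. bare interaction term
Model_product     <- lm(A ~ I(B * C), data = MyData)
Model_interaction <- lm(A ~ B:C,      data = MyData)
summary(Model_product)
summary(Model_interaction)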

Print loadings and yloadings details in R

I am new to R and have been asked to prepare a script that will be used to capture some R output in a text file.
I have been given a set of commands that create a DB connection, load data, and then perform some mathematical calculations and churn out Summary, Loadings and YLoadings. I am to capture this output and save it in a database. I have got everything working already except one bit that keeps on giving issues time and again.
The loadings and yloadings functions sometimes give out a matrix that has white-spaces in it. For example:
Comp 1, Comp 2, Comp 3
Row1 0.495 0.748 -0.272
Row2 0.605 -0.562
Row3 0.666 -0.397 0.781
Row4
LongNameRow1 0.536 -1.483
LongNameRow2 -0.681 -0.408 -1.145
Because of such outputs I have to manually check the files and edit them so that they become:
Comp 1, Comp 2, Comp 3
Row1 0.495 0.748 -0.272
Row2 0.605 0.000 -0.562
Row3 0.666 -0.397 0.781
Row4 0.000 0.000 0.000
LongNameRow1 0.536 0.000 -1.483
LongNameRow2 -0.681 -0.408 -1.145
i.e. I have to manually replace all the spaces with 0.000 (I am not sure if 0.000 is the correct value, but this was the only thing I could think of) in the output. This is very time consuming and painful to do.
I did some search around the loadings function and found,
Small loadings are conventionally not printed (replaced by spaces), to draw the eye to the pattern of the larger loadings.
So my question is: are there any other methods or any configuration that I am missing that can give me the output the way I need, i.e. 0.000 (or any other reasonable value) instead of white space? At the very least, I am wondering if I can delimit the output with commas or the pipe character (i.e. "|") or something similar to make parsing the text possible.
Thanks in advance for help!!!
The answer is to use unclass to convert the loadings to a plain matrix. The following example illustrates this.
The loadings function extracts the loadings matrix, which carries the class loadings. When you print an object of class loadings, small values are not printed, as you observed.
Here is the example from ?princomp:
fit <- princomp(USArrests, cor = TRUE)
l <- loadings(fit)
l
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.536 0.418 -0.341 0.649
Assault -0.583 0.188 -0.268 -0.743
UrbanPop -0.278 -0.873 -0.378 0.134
Rape -0.543 -0.167 0.818
Comp.1 Comp.2 Comp.3 Comp.4
SS loadings 1.00 1.00 1.00 1.00
Proportion Var 0.25 0.25 0.25 0.25
Cumulative Var 0.25 0.50 0.75 1.00
It is quite straightforward to change the class of this object back to its default. If you then print it, the values are displayed as the true underlying values:
l <- unclass(l)
l
Comp.1 Comp.2 Comp.3 Comp.4
Murder -0.5358995 0.4181809 -0.3412327 0.64922780
Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
Rape -0.5434321 -0.1673186 0.8177779 0.08902432
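To get the delimited output asked about in the question, one option (my suggestion, not part of the original answer) is to write the unclassed matrix straight to a delimited text file:
# Pipe-delimited export of the full loadings matrix; zeros are written out
# explicitly, so no manual editing of blank cells is needed
write.table(unclass(l), file = "loadings.txt", sep = "|",
            quote = FALSE, col.names = NA)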
