Change factor labels in psych::fa or psych::fa.diagram - r

I'm using the psych package for factor analysis. I want to specify the labels of the latent factors, either in the fa() object, or when graphing with fa.diagram().
For example, with toy data:
require(psych)
n <- 100
choices <- 1:5
df <- data.frame(a=sample(choices, replace=TRUE, size=n),
b=sample(choices, replace=TRUE, size=n),
c=sample(choices, replace=TRUE, size=n),
d=sample(choices, replace=TRUE, size=n))
model <- fa(df, nfactors=2, fm="pa", rotate="promax")
model
Factor Analysis using method = pa
Call: fa(r = df, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
a 0.45 -0.49 0.47 0.53 2.0
b 0.22 0.36 0.17 0.83 1.6
c -0.02 0.20 0.04 0.96 1.0
d 0.66 0.07 0.43 0.57 1.0
I want to change PA1 and PA2 to FactorA and FactorB, either by changing the model object itself, or by adjusting the labels in the output of fa.diagram().
The docs for fa.diagram have a labels argument, but no examples, and the experimentation I've done so far hasn't been fruitful. Any help much appreciated!

With str(model) I found the $loadings element, which fa.diagram() uses to render the diagram. Changing the column names of model$loadings did the trick:
colnames(model$loadings) <- c("FactorA", "FactorB")
fa.diagram(model)

Related

ROC Curve Plot using R (Error code: Predictor must be numeric or ordered)

I am trying to make a ROC curve using pROC with the 2 columns below (the list goes on to over 300 entries):
Actual_Findings_%   Predicted_Finding_Prob
0.23                0.6
0.48                0.3
0.26                0.62
0.23                0.6
0.48                0.3
0.47                0.3
0.23                0.6
0.6868              0.25
0.77                0.15
0.31                0.55
The code I tried to use is:
roccurve<- plot(roc(response = data$Actual_Findings_% <0.4, predictor = data$Predicted_Finding_Prob >0.5),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve", col = colors)
Where the threshold for positive findings is
Actual_Findings_% <0.4
AND
Predicted_Finding_Prob >0.5
(i.e to be TRUE POSITIVE, actual_finding_% would be LESS than 0.4, AND predicted_finding_prob would be GREATER than 0.5)
but when I try to plot this roc curve, I get the error:
"Setting levels: control = FALSE, case = TRUE
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Predictor must be numeric or ordered."
Any help would be much appreciated!
This should work:
data <- read.table( text=
"Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
", header=TRUE, check.names=FALSE )
library(pROC)
roccurve <- plot(
roc(
response = data$"Actual_Findings_%" <0.4,
predictor = data$"Predicted_Finding_Prob"
),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve"
)
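Importantly, the ROC curve is there to show you what happens when you vary your classification threshold. So one thing you did wrong was to enforce a threshold yourself by passing Predicted_Finding_Prob > 0.5 as the predictor. That turns the predictor into a logical vector, which is why roc() complains that it must be numeric or ordered. Pass the raw probabilities instead.
With this sample the result is a perfect separation, which is nice I guess. (Though bad for educational purposes.)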
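To see what "varying the threshold" means, here is a minimal base-R sketch that computes the true and false positive rates by hand at a few cut-offs, using the ten sample rows from the question. The column names actual and predicted and the chosen thresholds are my own assumptions for illustration; pROC does this sweep over all possible thresholds for you.
```r
# Ten sample rows from the question, with simplified (hypothetical) column names
dat <- data.frame(
  actual    = c(0.23, 0.48, 0.26, 0.23, 0.48, 0.47, 0.23, 0.6868, 0.77, 0.31),
  predicted = c(0.60, 0.30, 0.62, 0.60, 0.30, 0.30, 0.60, 0.25, 0.15, 0.55)
)

# The response is thresholded once: a "case" is an actual finding below 0.4
is_case <- dat$actual < 0.4

# The predictor stays numeric; the threshold is what we vary
thresholds <- c(0.2, 0.4, 0.5, 0.7)
rates <- t(sapply(thresholds, function(thr) {
  called <- dat$predicted > thr
  c(threshold = thr,
    TPR = mean(called[is_case]),    # sensitivity
    FPR = mean(called[!is_case]))   # 1 - specificity
}))
rates
```
Each row of rates is one point on the ROC curve. Fixing the predictor at > 0.5 beforehand would leave only a single such point, which is why the predictor has to stay numeric.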

Build Logistic Regression Model for shares

The data I am working with contains the closing prices of 10 shares of the S&P 500 index.
Data :
> dput(head(StocksData))
structure(list(ACE = c(56.86, 56.82, 56.63, 56.39, 55.97, 55.23
), AMD = c(8.47, 8.77, 8.91, 8.69, 8.83, 9.19), AFL = c(51.83,
50.88, 50.78, 50.5, 50.3, 49.65), APD = c(81.59, 80.38, 80.03,
79.61, 79.76, 79.77), AA = c(15.12, 15.81, 15.85, 15.66, 15.71,
15.78), ATI = c(53.54, 52.37, 52.53, 51.91, 51.32, 51.45), AGN = c(69.77,
69.53, 69.69, 69.98, 68.99, 68.75), ALL = c(29.32, 29.03, 28.99,
28.66, 28.47, 28.2), MO = c(20.09, 20, 20.07, 20.16, 20, 19.88
), AMZN = c(184.22, 185.01, 187.42, 185.86, 185.49, 184.68)), row.names = c(NA,
6L), class = "data.frame")
In the following part, I calculate the daily percentage changes of the 10 shares:
perc_change <- (StocksData[-1, ] - StocksData[-nrow(StocksData), ])/StocksData[-nrow(StocksData), ] * 100
perc_change
Output :
# ACE AMD AFL APD AA ATI AGN ALL MO AMZN
#2 -0.07 3.5 -1.83 -1.483 4.56 -2.19 -0.34 -0.99 -0.45 0.43
#3 -0.33 1.6 -0.20 -0.435 0.25 0.31 0.23 -0.14 0.35 1.30
#4 -0.42 -2.5 -0.55 -0.525 -1.20 -1.18 0.42 -1.14 0.45 -0.83
#5 -0.74 1.6 -0.40 0.188 0.32 -1.14 -1.41 -0.66 -0.79 -0.20
#6 -1.32 4.1 -1.29 0.013 0.45 0.25 -0.35 -0.95 -0.60 -0.44
With the above code I find the latest N rates of change (N should be in [1,10]).
I want to build a logistic regression model to predict the change of the next day (N + 1), i.e., "increase" or "decrease".
First, I split the data into two chunks, a training set and a testing set:
(NOTE: as testset I must take the last 40 sessions, and as trainset the 85 sessions preceding the test set!)
trainset <- head(StocksData, 870)
testset <- tail(StocksData, 40)
Continued with the fitting of the model:
model <- glm(Here???,family=binomial(link='logit'),data=trainset)
The problem I am facing is that I do not understand what to put in the glm() call. I have studied many logistic regression models, and I think my data does not contain the object that I need to place there.
Any help with this part of my code?
Based on what you shared, you need to predict an increase or decrease when new data arrives for the portfolio you mentioned. For that, you first need to define the target variable. One way is to count the positive and negative changes in each row. With those counts, we can create a target variable that is 1 if there are more positive than negative changes (an increase) and 0 otherwise (no increase). The data you shared is pretty small, but I have sketched the code so that you can apply the training/test approach to the modeling.
We will start from perc_change and compute the positive and negative counts:
#Build variables
#Store number of positive and negative changes
i <- names(perc_change)
perc_change$Neg <- apply(perc_change[,i],1,function(x) length(which(x<0)))
perc_change$Pos <- apply(perc_change[,i],1,function(x) length(which(x>0)))
Now, we create the target variable with a conditional:
#Build target variable
perc_change$Target <- ifelse(perc_change$Pos>perc_change$Neg,1,0)
We create a copy of the data and remove the variables that are no longer needed:
#Replicate data
perc_change2 <- perc_change
perc_change2$Neg <- NULL
perc_change2$Pos <- NULL
With perc_change2 the input is ready, and you should split it into train/test data. I will not do that here because the data is too small. I will go directly to the model:
#Train the model; the example has too few rows for a train/test split, but you can adjust that
model <- glm(Target~.,family=binomial(link='logit'),data=perc_change2)
With that model, you can evaluate performance and so on. Please do not hesitate to tell me if more details are needed.
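For completeness, here is a rough sketch of the train/test step that was skipped above. It assumes simulated percentage changes, because the shared data has only six rows, and it assumes 125 sessions in total, with the first 85 for training and the last 40 for testing as in the question's note. The column names S1 to S3, the seed and the 0.5 cut-off are placeholders, not part of the original data.
```r
# Simulated daily percentage changes for three hypothetical shares
set.seed(123)
n_days <- 125
perc_change <- data.frame(S1 = rnorm(n_days), S2 = rnorm(n_days), S3 = rnorm(n_days))

# Target as defined above: 1 if more shares went up than down on that day
stocks <- names(perc_change)
perc_change$Target <- ifelse(
  rowSums(perc_change[stocks] > 0) > rowSums(perc_change[stocks] < 0), 1, 0
)

# Chronological split: last 40 sessions for testing, the 85 before them for training
testset  <- tail(perc_change, 40)
trainset <- head(perc_change, nrow(perc_change) - 40)

# Fit on the training rows only, then predict probabilities for the test rows
model <- glm(Target ~ ., family = binomial(link = "logit"), data = trainset)
prob  <- predict(model, newdata = testset, type = "response")

# Turn probabilities into classes and compare with the observed target
pred     <- ifelse(prob > 0.5, 1, 0)
accuracy <- mean(pred == testset$Target)
accuracy
```
With your real data you would replace the simulated perc_change by the one computed from StocksData and keep the rest as is.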

R: Run multiple post hoc tests at once, using emmeans package

I'm working on a dataset with several different types of proteins as columns. The toy data below is a simplified version; the original dataset contains over 100 types of proteins. I want to see whether the concentration of a protein differs between treatments when a random effect (id) is taken into account. I managed to run multiple repeated-measures ANOVAs at once, but I would also like to do pairwise comparisons between treatments for every protein. The first thing that came to mind was the emmeans package, but I had trouble coding this.
#load packages
library(tidyverse)
library(emmeans)
#Create a data set
set.seed(1)
id <- rep(c("1","2","3","4","5","6"),3)
Treatment <- c(rep(c("A"), 6), rep(c("B"), 6),rep(c("C"), 6))
Protein1 <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
Protein2 <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
Protein3 <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
DF <- data.frame(id, Treatment, Protein1, Protein2, Protein3) %>%
mutate(id = factor(id),
Treatment = factor(Treatment, levels = c("A","B","C")))
#First, I tried to run multiple anova, by using lapply
responseList <- names(DF)[c(3:5)]
modelList <- lapply(responseList, function(resp) {
mF <- formula(paste(resp, " ~ Treatment + Error(id/Treatment)"))
aov(mF, data = DF)
})
lapply(modelList, summary)
#Pairwise comparison using emmeans. This did not work
wt_emm <- emmeans(modelList, "Treatment")
> wt_emm <- emmeans(modelList, "Treatment")
Error in ref_grid(object, ...) : Can't handle an object of class “list”
Use help("models", package = "emmeans") for information on supported models.
So I tried a different approach
anova2 <- aov(cbind(Protein1,Protein2,Protein3)~ Treatment +Error(id/Treatment), data = DF)
summary(anova2)
#Pairwise comparison using emmeans.
#I got only result for the whole dataset, instead of by different types of protein.
wt_emm2 <- emmeans(anova2, "Treatment")
pairs(wt_emm2)
> pairs(wt_emm2)
contrast estimate SE df t.ratio p.value
A - B -1.704 1.05 10 -1.630 0.2782
A - C 0.865 1.05 10 0.827 0.6955
B - C 2.569 1.05 10 2.458 0.0793
I don't understand why, even though I used cbind(Protein1, Protein2, Protein3) in the ANOVA model, R still gives me only one result instead of something like the following.
This is what I was hoping to get:
> Protein1
contrast
A - B
A - C
B - C
> Protein2
contrast
A - B
A - C
B - C
> Protein3
contrast
A - B
A - C
B - C
How do I code this or should I try a different package/function?
I don't have trouble running one protein at a time. However, since I have over 100 proteins to run, it would be really time-consuming to code them one by one.
Any suggestion is appreciated. Thank you!
Here
#Pairwise comparison using emmeans. This did not work
wt_emm <- emmeans(modelList, "Treatment")
you need to lapply over the list like you did with lapply(modelList, summary)
modelList <- lapply(responseList, function(resp) {
mF <- formula(paste(resp, " ~ Treatment + Error(id/Treatment)"))
aov(mF, data = DF)
})
But when you do this, there is an error:
lapply(modelList, function(x) pairs(emmeans(x, "Treatment")))
Note: re-fitting model with sum-to-zero contrasts
Error in terms(formula, "Error", data = data) : object 'mF' not found
attr(modelList[[1]], 'call')$formula
# mF
Note that mF was the name of the formula object, so it seems emmeans needs the original formula for some reason. You can add the formula to the call:
modelList <- lapply(responseList, function(resp) {
mF <- formula(paste(resp, " ~ Treatment + Error(id/Treatment)"))
av <- aov(mF, data = DF)
attr(av, 'call')$formula <- mF
av
})
lapply(modelList, function(x) pairs(emmeans(x, "Treatment")))
# [[1]]
# contrast estimate SE df t.ratio p.value
# A - B -1.89 1.26 10 -1.501 0.3311
# A - C 1.08 1.26 10 0.854 0.6795
# B - C 2.97 1.26 10 2.356 0.0934
#
# P value adjustment: tukey method for comparing a family of 3 estimates
#
# [[2]]
# contrast estimate SE df t.ratio p.value
# A - B -1.44 1.12 10 -1.282 0.4361
# A - C 1.29 1.12 10 1.148 0.5082
# B - C 2.73 1.12 10 2.430 0.0829
#
# P value adjustment: tukey method for comparing a family of 3 estimates
#
# [[3]]
# contrast estimate SE df t.ratio p.value
# A - B -1.58 1.15 10 -1.374 0.3897
# A - C 1.27 1.15 10 1.106 0.5321
# B - C 2.85 1.15 10 2.480 0.0765
#
# P value adjustment: tukey method for comparing a family of 3 estimates
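Since the goal was output labelled per protein rather than [[1]], [[2]], [[3]], one option is to name the list of models. Here is a base-R sketch that only covers the fitting step, on simulated data with the same layout as the question. The helper name fitOne and the simulated values are my own placeholders.
```r
# Toy data with the same layout as in the question (values are simulated)
set.seed(1)
DF <- data.frame(
  id        = factor(rep(1:6, times = 3)),
  Treatment = factor(rep(c("A", "B", "C"), each = 6)),
  Protein1  = rnorm(18, mean = 2),
  Protein2  = rnorm(18, mean = 2),
  Protein3  = rnorm(18, mean = 2)
)
responseList <- names(DF)[3:5]

# Hypothetical helper: fit one response and store the real formula in the call
fitOne <- function(resp) {
  mF <- formula(paste(resp, "~ Treatment + Error(id/Treatment)"))
  av <- aov(mF, data = DF)
  attr(av, "call")$formula <- mF   # same fix as above
  av
}

# sapply(simplify = FALSE) returns a list named by the response strings
modelList <- sapply(responseList, fitOne, simplify = FALSE)
names(modelList)
```
Because lapply() keeps list names, running lapply(modelList, function(x) pairs(emmeans(x, "Treatment"))) on this list should then print the results under $Protein1, $Protein2 and $Protein3.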
Make a loop of the function by column names.
responseList <- names(DF)[c(3:5)]
for(n in responseList) {
anova2 <- aov(get(n) ~ Treatment +Error(id/Treatment), data = DF)
summary(anova2)
wt_emm2 <- emmeans(anova2, "Treatment")
print(pairs(wt_emm2))
}
This returns
Note: re-fitting model with sum-to-zero contrasts
Note: Use 'contrast(regrid(object), ...)' to obtain contrasts of back-transformed estimates
contrast estimate SE df t.ratio p.value
A - B -1.41 1.26 10 -1.122 0.5229
A - C 1.31 1.26 10 1.039 0.5705
B - C 2.72 1.26 10 2.161 0.1269
Note: contrasts are still on the get scale
P value adjustment: tukey method for comparing a family of 3 estimates
Note: re-fitting model with sum-to-zero contrasts
Note: Use 'contrast(regrid(object), ...)' to obtain contrasts of back-transformed estimates
contrast estimate SE df t.ratio p.value
A - B -2.16 1.37 10 -1.577 0.2991
A - C 1.19 1.37 10 0.867 0.6720
B - C 3.35 1.37 10 2.444 0.0810
Note: contrasts are still on the get scale
P value adjustment: tukey method for comparing a family of 3 estimates
Note: re-fitting model with sum-to-zero contrasts
Note: Use 'contrast(regrid(object), ...)' to obtain contrasts of back-transformed estimates
contrast estimate SE df t.ratio p.value
A - B -1.87 1.19 10 -1.578 0.2988
A - C 1.28 1.19 10 1.077 0.5485
B - C 3.15 1.19 10 2.655 0.0575
Note: contrasts are still on the get scale
P value adjustment: tukey method for comparing a family of 3 estimates
If you want to have the output as a list:
responseList <- names(DF)[c(3:5)]
output <- list()
for(n in responseList) {
anova2 <- aov(get(n) ~ Treatment +Error(id/Treatment), data = DF)
summary(anova2)
wt_emm2 <- emmeans(anova2, "Treatment")
output[[n]] <- pairs(wt_emm2)
}

Penalized Regression: "ridge" RMSE greater than that for plain "lm"

Working with the "prostate" dataset in "ElemStatLearn" package.
set.seed(3434)
fit.lm = train(data=trainset, lpsa~., method = "lm")
fit.ridge = train(data=trainset, lpsa~., method = "ridge")
fit.lasso = train(data=trainset, lpsa~., method = "lasso")
Comparing RMSE (for bestTune in case of ridge and lasso)
fit.lm$results[,"RMSE"]
[1] 0.7895572
fit.ridge$results[fit.ridge$results[,"lambda"]==fit.ridge$bestTune$lambda,"RMSE"]
[1] 0.8231873
fit.lasso$results[fit.lasso$results[,"fraction"]==fit.lasso$bestTune$fraction,"RMSE"]
[1] 0.7779534
Comparing absolute value of coefficients
abs(round(fit.lm$finalModel$coefficients,2))
(Intercept) lcavol lweight age lbph svi lcp gleason pgg45
0.43 0.58 0.61 0.02 0.14 0.74 0.21 0.03 0.01
abs(round(predict(fit.ridge$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
lcavol lweight age lbph svi lcp gleason pgg45
0.49 0.62 0.01 0.14 0.65 0.05 0.00 0.01
abs(round(predict(fit.lasso$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
lcavol lweight age lbph svi lcp gleason pgg45
0.56 0.61 0.02 0.14 0.72 0.18 0.00 0.01
My question is: how can the "ridge" RMSE be higher than that of plain "lm"? Doesn't that defeat the very purpose of penalized regression vs plain "lm"?
Also, how can the absolute value of the coefficient of "lweight" be actually higher in ridge (0.62) vs that in lm (0.61)? Both coefficients are positive originally without the abs().
I was expecting ridge to perform similar to lasso, which not only reduced RMSE but also shrank the size of coefficients vs plain "lm".
Thank you!

What if I want a single linear regression model rather than an "mlm"?

I have shared the top 9 rows of the data I am working on in the image below (y0 to y6 are outputs, rest are inputs):
My objective is to get fitted output data for y0 to y6.
I tried lm function in R using the commands:
lm1 <- lm(cbind(y0, y1, y2, y3, y4, y5, y6) ~ tt + tcb + s + l + b, data = table3)
summary(lm1)
And it has returned 7 sets of coefficients like "Response y0", "Response y1", etc.
What I really want is just 1 set of coefficients which can predict values for outputs y0 to y6.
Could you please help in this?
By cbind(y0, y1, y2, y3, y4, y5, y6) we fit 7 independent models (which is a better idea).
For what you are looking for, stack your y* variables, replicate other independent variables and do a single regression.
Y <- c(y0, y1, y2, y3, y4, y5, y6)
tt. <- rep(tt, times = 7)
tcb. <- rep(tcb, times = 7)
s. <- rep(s, times = 7)
l. <- rep(l, times = 7)
b. <- rep(b, times = 7)
fit <- lm(Y ~ tt. + tcb. + s. + l. + b.)
Predicted values for y* are
matrix(fitted(fit), ncol = 7)
For other readers than OP
I hereby prepare a tiny reproducible example (with only one covariate x and two replicates y1, y2) to help you digest the issue.
set.seed(0)
dat_wide <- data.frame(x = round(runif(4), 2),
y1 = round(runif(4), 2),
y2 = round(runif(4), 2))
# x y1 y2
#1 0.90 0.91 0.66
#2 0.27 0.20 0.63
#3 0.37 0.90 0.06
#4 0.57 0.94 0.21
## The original "mlm"
fit_mlm <- lm(cbind(y1, y2) ~ x, data = dat_wide)
Instead of doing c(y1, y2) and rep(x, times = 2), I would use the reshape function from the base R package stats, since this operation is essentially a "wide" to "long" dataset reshaping.
dat_long <- stats::reshape(dat_wide, ## wide dataset
varying = 2:3, ## columns 2:3 are replicates
v.names = "y", ## the stacked variable is called "y"
direction = "long" ## reshape to "long" format
)
# x time y id
#1.1 0.90 1 0.91 1
#2.1 0.27 1 0.20 2
#3.1 0.37 1 0.90 3
#4.1 0.57 1 0.94 4
#1.2 0.90 2 0.66 1
#2.2 0.27 2 0.63 2
#3.2 0.37 2 0.06 3
#4.2 0.57 2 0.21 4
Extra variables time and id are created. The former tells which replicate a case comes from; the latter tells which record that case is within a replicate.
To fit the same model for all replicates, we do
fit1 <- lm(y ~ x, data = dat_long)
#(Intercept) x
# 0.2578 0.5801
matrix(fitted(fit1), ncol = 2) ## there are two replicates
# [,1] [,2]
#[1,] 0.7798257 0.7798257
#[2,] 0.4143822 0.4143822
#[3,] 0.4723891 0.4723891
#[4,] 0.5884029 0.5884029
Don't be surprised that two columns are identical; there is only a single set of regression coefficients for both replicates after all.
If you think carefully, we can do the following instead:
dat_wide$ymean <- rowMeans(dat_wide[2:3]) ## average all replicates
fit2 <- lm(ymean ~ x, data = dat_wide)
#(Intercept) x
# 0.2578 0.5801
and we will get the same point estimates. Standard errors and other summary statistics differ because the two models have different sample sizes.
coef(summary(fit1))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2577636 0.2998382 0.8596755 0.4229808
#x 0.5800691 0.5171354 1.1216967 0.3048657
coef(summary(fit2))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2577636 0.01385864 18.59949 0.002878193
#x 0.5800691 0.02390220 24.26844 0.001693604
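If you want to convince yourself of the equivalence, here is a short self-contained check. It hard-codes the four rows of dat_wide printed above rather than relying on the seed, and it stacks the replicates by hand instead of calling reshape().
```r
# The four rows of dat_wide, copied from the printed output above
dat_wide <- data.frame(
  x  = c(0.90, 0.27, 0.37, 0.57),
  y1 = c(0.91, 0.20, 0.90, 0.94),
  y2 = c(0.66, 0.63, 0.06, 0.21)
)

# Stacked ("long") fit: one row per replicate
dat_long <- data.frame(x = rep(dat_wide$x, times = 2),
                       y = c(dat_wide$y1, dat_wide$y2))
fit1 <- lm(y ~ x, data = dat_long)

# Averaged fit: one row per original record
dat_wide$ymean <- rowMeans(dat_wide[c("y1", "y2")])
fit2 <- lm(ymean ~ x, data = dat_wide)

# Same point estimates, different residual degrees of freedom
all.equal(coef(fit1), coef(fit2))
c(stacked = df.residual(fit1), averaged = df.residual(fit2))
```
The coefficients agree, while the residual degrees of freedom are 6 for the stacked fit and 2 for the averaged one, which is where the different standard errors come from.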
