After getting my predictions from glmnet, I am trying to use the prediction function in the ROCR package to get TPR, FPR, etc., but I get this error:
pred <- prediction(pred_glmnet_s5_3class, y)
Error in prediction(pred_glmnet_s5_3class, y) :
Format of predictions is invalid.
I have printed both the glmnet predictions and the labels, and they look like they are in a similar format, so I don't understand what is invalid here.
The code is as follows, and the input can be found here (input). It is a small dataset and should not take long to run.
library("ROCR")
library("caret")
sensor6data_s5_3class <- read.csv("/home/sensei/clustering /sensor6data_f21_s5_with3Labels.csv")
sensor6data_s5_3class <- within(sensor6data_s5_3class, Class <- as.factor(Class))
sensor6data_s5_3class$Class2 <- relevel(sensor6data_s5_3class$Class,ref="1")
set.seed("4321")
inTrain_s5_3class <- createDataPartition(y = sensor6data_s5_3class$Class, p = .75, list = FALSE)
training_s5_3class <- sensor6data_s5_3class[inTrain_s5_3class,]
testing_s5_3class <- sensor6data_s5_3class[-inTrain_s5_3class,]
y <- testing_s5_3class[,22]
ctrl_s5_3class <- trainControl(method = "repeatedcv", number = 10, repeats = 10 , savePredictions = TRUE)
model_train_glmnet_s5_3class <- train(Class2 ~ ZCR + Energy + SpectralC + SpectralS + SpectralE + SpectralF + SpectralR + MFCC1 + MFCC2 + MFCC3 + MFCC4 + MFCC5 + MFCC6 + MFCC7 + MFCC8 + MFCC9 + MFCC10 + MFCC11 + MFCC12 + MFCC13, data = training_s5_3class, method="glmnet", trControl = ctrl_s5_3class)
pred_glmnet_s5_3class = predict(model_train_glmnet_s5_3class, newdata=testing_s5_3class, s = "model_train_glmnet_s5_3class$finalModel$lambdaOpt")
pred <- prediction(pred_glmnet_s5_3class, y)
Appreciate your help!
The main problem is that prediction takes "a vector, matrix, list, or data frame" for both the predictions and the labels arguments. Even though pred_glmnet_s5_3class and y look like vectors, they are not, e.g.
sapply(c(is.vector, is.matrix, is.list, is.data.frame), do.call, list(y))
# [1] FALSE FALSE FALSE FALSE
In fact, they are factors (as class(y) shows, for example), and ?is.vector tells us:
Note that factors are not vectors; ‘is.vector’ returns ‘FALSE’
and ‘as.vector’ converts a factor to a character vector for ‘mode
= "any"’.
We can convert both objects to numeric:
pred <- prediction(as.numeric(pred_glmnet_s5_3class), as.numeric(y))
# Number of classes is not equal to 2.
# ROCR currently supports only evaluation of binary classification tasks.
Unfortunately, this produces a different error, which is beyond the scope of this question.
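A small base-R sketch of the points above, using made-up labels rather than the asker's data: a factor fails is.vector(), and as.numeric() on a factor returns its integer level codes. The one-vs-rest recoding at the end is only one possible way (an assumption, not something the question asked for) to obtain two-class labels of the kind ROCR expects.

```r
# Toy three-class labels (hypothetical, not the asker's data)
y_toy <- factor(c("1", "2", "3", "2", "1", "3"))

is.vector(y_toy)   # factors are not vectors
class(y_toy)

# as.numeric() on a factor drops the class and returns the integer level codes
codes <- as.numeric(y_toy)
is.vector(codes)

# One-vs-rest recoding (an assumption): class "1" against everything else
y_binary <- as.numeric(y_toy == "1")
table(y_binary)
```

Note that the level codes equal the original labels here only because the labels happen to be "1", "2", "3"; for other label sets, as.numeric(as.character(y)) may be what is wanted.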
I am trying to run a double generalized linear model (DGLM) in R on my traits of interest. I have written a function that extracts the components of interest from dglm. Its arguments are a column (cT) of my phenotype object (Phenos), the SNP (i) from my genotype object (Geno), and the principal components (covar) used to control for population structure.
my.pdglm <- function(cT=NULL, i=NULL, Phenos=NULL, Geno=NULL, covar=NULL)
The body of my.pdglm is as follows:
my.pdglm <- function(cT=NULL, i=NULL, Phenos=NULL, Geno=NULL, covar=NULL) {
y <- Phenos[,cT]
model <- dglm(y ~ Geno[, i] + covar[, 2] + covar[, 3] + covar[, 4] + covar[, 5] + covar[,6] + covar[,7] + covar[, 8], ~ Geno[, i], family = gaussian(link = "identity"))
P.mean <- summary(model)$coef[2, 4] # Extract the p-value for the mean part
P.disp <- pchisq(q = anova(model)$Adj.Chisq[2], df = anova(model)$DF[2], lower.tail = FALSE)
s.model <- summary(model$dispersion.fit)
beta <- s.model$coef[2, 1] # Extract the coefficient
se <- s.model$coef[2, 2] # Extract the standard error
out <- data.frame(Beta = beta, SE = se, P.mean = P.mean, P.disp = P.disp,
stringsAsFactors = FALSE) # Collect all the extracted values
return(out)
}
When I try and run this function, I keep getting the following error using this as an example:
my.pdglm(cT=3, i=9173, Phenos=SP_Zm_NULL, Geno=t(Geno), covar=Zm_covar_FULL)
[1] "--------- Fitting DGLM model for SNP 9173 out of 41611 ----------"
Error in eval(predvars, data, env) : object 'y' not found
Called from: eval(predvars, data, env)
When I print(y) as a quality-control step, it usually prints, but dglm does not find it. The only way I can get the function to work is if the objects I pass in have exactly the same names as the function's arguments. Can anyone help me with this? It has been holding me up for a while.
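One pattern that often avoids this kind of "object not found" error is to gather the response and predictors into a data frame inside the function and pass it through the data argument, so the formula's variables are looked up there rather than in whichever environment the fitting function evaluates in. The sketch below uses lm() as a stand-in. Whether dglm() behaves the same way is an assumption, and fit_one_snp and the simulated inputs are hypothetical.

```r
# Hypothetical helper: build a data frame inside the function, then fit with data =
fit_one_snp <- function(cT, i, Phenos, Geno, covar) {
  dat <- data.frame(y = Phenos[, cT], snp = Geno[, i],
                    pc2 = covar[, 2], pc3 = covar[, 3])
  model <- lm(y ~ snp + pc2 + pc3, data = dat)  # lm() stands in for dglm() here
  coefs <- summary(model)$coefficients
  data.frame(Beta   = coefs["snp", "Estimate"],
             SE     = coefs["snp", "Std. Error"],
             P.mean = coefs["snp", "Pr(>|t|)"])
}

# Simulated inputs (made up for illustration)
set.seed(1)
n <- 50
Phenos_toy <- matrix(rnorm(n * 3), ncol = 3)
Geno_toy   <- matrix(sample(0:2, n * 4, replace = TRUE), ncol = 4)
covar_toy  <- matrix(rnorm(n * 3), ncol = 3)

res <- fit_one_snp(cT = 3, i = 2, Phenos = Phenos_toy,
                   Geno = Geno_toy, covar = covar_toy)
res
```

Naming the columns in the data frame also keeps terms such as Geno[, i] out of the formula, which makes the coefficient table easier to index by name.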
The problem occurs in line 20: x3 <- lm(Salary ~ ...
Error in as.data.frame.default(data) : cannot coerce class ‘c("train", "train.formula")’ to a data.frame
How can I solve this?
attach(Hitters)
Hitters
library(caret)
set.seed(123)
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
x2 <- train(Salary ~., data = x, method = "lm",
trControl = train.control)
# Summarize the results
print(x)
x3 <- lm(Salary ~ poly(AtBat,3) + poly(Hits,3) + poly(Walks,3) + poly(CRuns,3) + poly(CWalks,3) + poly(PutOuts,3), data = x2)
summary(x3)
MSE = mean(x3$residuals^2)
print("Mean Squared Error: ")
print(MSE)
First, as @dcarlson already mentioned, you should define x.
Second, x2 is not a data frame: train returns an object of class "train", which lm cannot coerce to a data frame.
If you run
str(x2)
you'll see that all the elements you're using in the lm function are part of a data frame called trainingData.
So if you intend to use the lm function, use that as your data source in the lm function, NOT x2.
I've rewritten your code below.
PS: I'm far from an R expert, so if someone wants to shoot at this answer, go ahead. I'm always willing to learn ;)
attach(Hitters)
Hitters
library(caret)
set.seed(123)
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
x2 <- train(Salary ~., data = x, method = "lm", trControl = train.control)
# Summarize the results
print(x2)
# str(x2) # $trainingData data.frame
x2$trainingData[["AtBat"]]
m <- x2$trainingData
x3 <- lm(Salary ~ poly(AtBat,3) + poly(Hits,3) + poly(Walks,3) + poly(CRuns,3) + poly(CWalks,3) + poly(PutOuts,3), data = m)
summary(x3)
MSE = mean(x3$residuals^2)
cat("Mean Squared Error: ", MSE) # use cat to concatenate text and variable value in one line
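As a self-contained sketch of the last two steps (fitting lm() on a plain data frame and computing the in-sample MSE), the example below uses the built-in mtcars data instead of Hitters. That substitution is an assumption made only so the example runs without extra packages.

```r
# lm() needs a data frame in `data`, not a caret "train" object
fit <- lm(mpg ~ poly(hp, 3) + poly(wt, 3), data = mtcars)

# In-sample mean squared error computed from the residuals
mse <- mean(residuals(fit)^2)
cat("Mean Squared Error:", mse, "\n")
```

This is the error on the data used for fitting, so it will generally be more optimistic than a cross-validated estimate.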
I am running glmer logit models using the lme4 package. I am interested in various two and three way interaction effects and their interpretations. To simplify, I am only concerned with the fixed effects coefficients.
I managed to come up with code to calculate and plot these effects on the logit scale, but I am having trouble transforming them to the predicted-probability scale. Eventually I would like to replicate the output of the effects package.
The example relies on UCLA's data on cancer patients.
library(lme4)
library(ggplot2)
library(plyr)
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
facmin <- function(n) {
min(as.numeric(levels(n)))
}
facmax <- function(x) {
max(as.numeric(levels(x)))
}
hdp <- read.csv("http://www.ats.ucla.edu/stat/data/hdp.csv")
head(hdp)
hdp <- hdp[complete.cases(hdp),]
hdp <- within(hdp, {
Married <- factor(Married, levels = 0:1, labels = c("no", "yes"))
DID <- factor(DID)
HID <- factor(HID)
CancerStage <- revalue(hdp$CancerStage, c("I"="1", "II"="2", "III"="3", "IV"="4"))
})
Up to here it is all data management, helper functions, and the packages I need.
m <- glmer(remission ~ CancerStage*LengthofStay + Experience +
(1 | DID), data = hdp, family = binomial(link="logit"))
summary(m)
This is the model. It takes a minute and it converges with the following warning:
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0417259 (tol = 0.001, component 1)
Even though I am not quite sure whether I should worry about the warning, I use the estimates to plot the average marginal effects for the interaction of interest. First I prepare the dataset to be fed into the predict function, and then I calculate the marginal effects as well as the confidence intervals using the fixed-effects parameters.
newdat <- expand.grid(
remission = getmode(hdp$remission),
CancerStage = as.factor(seq(facmin(hdp$CancerStage), facmax(hdp$CancerStage),1)),
LengthofStay = seq(min(hdp$LengthofStay, na.rm=T),max(hdp$LengthofStay, na.rm=T),1),
Experience = mean(hdp$Experience, na.rm=T))
mm <- model.matrix(terms(m), newdat)
newdat$remission <- predict(m, newdat, re.form = NA)
pvar1 <- diag(mm %*% tcrossprod(vcov(m), mm))
cmult <- 1.96
## lower and upper CI
newdat <- data.frame(
newdat, plo = newdat$remission - cmult*sqrt(pvar1),
phi = newdat$remission + cmult*sqrt(pvar1))
I am fairly confident these are correct estimates on the logit scale, but maybe I am wrong. Anyhow, this is the plot:
plot_remission <- ggplot(newdat, aes(LengthofStay,
fill=factor(CancerStage), color=factor(CancerStage))) +
geom_ribbon(aes(ymin = plo, ymax = phi), colour=NA, alpha=0.2) +
geom_line(aes(y = remission), size=1.2) +
xlab("Length of Stay") + xlim(c(2, 10)) +
ylab("Probability of Remission") + ylim(c(0.0, 0.5)) +
labs(colour="Cancer Stage", fill="Cancer Stage") +
theme_minimal()
plot_remission
I think the y-axis is now on the logit scale, but to make sense of it I would like to transform it to predicted probabilities. Based on Wikipedia, something like exp(value)/(exp(value)+1) should do the trick. While I could do newdat$remission <- exp(newdat$remission)/(exp(newdat$remission)+1), I am not sure how I should do this for the confidence intervals.
Eventually I would like to get the same plot that the effects package generates. That is:
library(effects)
eff.m <- effect("CancerStage*LengthofStay", m, KR=T)
eff.m <- as.data.frame(eff.m)
plot_remission2 <- ggplot(eff.m, aes(LengthofStay,
fill=factor(CancerStage), color=factor(CancerStage))) +
geom_ribbon(aes(ymin = lower, ymax = upper), colour=NA, alpha=0.2) +
geom_line(aes(y = fit), size=1.2) +
xlab("Length of Stay") + xlim(c(2, 10)) +
ylab("Probability of Remission") + ylim(c(0.0, 0.5)) +
labs(colour="Cancer Stage", fill="Cancer Stage") +
theme_minimal()
plot_remission2
Even though I could just use the effects package, it unfortunately fails on many of the models I have to run for my own work:
Error in model.matrix(mod2) %*% mod2$coefficients :
non-conformable arguments
In addition: Warning message:
In vcov.merMod(mod) :
variance-covariance matrix computed from finite-difference Hessian is
not positive definite or contains NA values: falling back to var-cov estimated from RX
Fixing that would require adjusting the estimation procedure, which I would like to avoid at the moment. Plus, I am also curious what effects actually does here.
I would be grateful for any advice on how to tweak my initial syntax to get to predicted probabilities!
To obtain a result similar to that of the effect function in your question, you just have to back-transform both the predicted values and the bounds of your confidence interval from the logit scale to the original scale, using the transformation you give: exp(x)/(1+exp(x)).
This transformation can be done in base R with the plogis function:
> a <- 1:5
> plogis(a)
[1] 0.7310586 0.8807971 0.9525741 0.9820138 0.9933071
> exp(a)/(1+exp(a))
[1] 0.7310586 0.8807971 0.9525741 0.9820138 0.9933071
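The same idea applied to an interval, as a sketch with made-up numbers (eta and se are hypothetical, not taken from the model above): build the bounds on the logit scale first, then pass each one through plogis(). Because plogis() is monotone increasing, the transformed bounds keep their order and stay inside (0, 1).

```r
eta <- c(-1.2, -0.5, 0.3)    # hypothetical predictions on the logit scale
se  <- c(0.30, 0.20, 0.25)   # hypothetical standard errors on the logit scale
cmult <- qnorm(0.975)

lo <- eta - cmult * se       # lower bound, logit scale
hi <- eta + cmult * se       # upper bound, logit scale

# Back-transform the fit and both bounds to the probability scale
prob <- data.frame(fit = plogis(eta), lower = plogis(lo), upper = plogis(hi))
prob
```

The resulting intervals are generally not symmetric around the fitted probability, which is expected when the interval is built on the logit scale.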
So, following the suggestion from @eipi10 to use ribbons for the confidence bands instead of dotted lines (I also find this presentation more readable):
ggplot(newdat, aes(LengthofStay, fill=factor(CancerStage), color=factor(CancerStage))) +
geom_ribbon(aes(ymin = plogis(plo), ymax = plogis(phi)), colour=NA, alpha=0.2) +
geom_line(aes(y = plogis(remission)), size=1.2) +
xlab("Length of Stay") + xlim(c(2, 10)) +
ylab("Probability of Remission") + ylim(c(0.0, 0.5)) +
labs(colour="Cancer Stage", fill="Cancer Stage") +
theme_minimal()
The results are the same (with effects_3.1-2 and lme4_1.1-13):
> compare <- merge(newdat, eff.m)
> compare[, c("remission", "plo", "phi")] <-
+ sapply(compare[, c("remission", "plo", "phi")], plogis)
> head(compare)
  CancerStage LengthofStay  remission Experience        plo       phi        fit        se      lower     upper
1           1           10 0.20657613   17.64129 0.12473504 0.3223392 0.20657613 0.3074726 0.12473625 0.3223368
2           1            2 0.35920425   17.64129 0.27570456 0.4522040 0.35920425 0.1974744 0.27570598 0.4522022
3           1            4 0.31636299   17.64129 0.26572506 0.3717650 0.31636299 0.1254513 0.26572595 0.3717639
4           1            6 0.27642711   17.64129 0.22800277 0.3307300 0.27642711 0.1313108 0.22800360 0.3307290
5           1            8 0.23976445   17.64129 0.17324422 0.3218821 0.23976445 0.2085896 0.17324530 0.3218805
6           2           10 0.09957493   17.64129 0.06218598 0.1557113 0.09957493 0.2609519 0.06218653 0.1557101
> compare$remission-compare$fit
[1] 8.604228e-16 1.221245e-15 1.165734e-15 1.054712e-15 9.714451e-16 4.718448e-16 1.221245e-15 1.054712e-15 8.326673e-16
[10] 6.383782e-16 4.163336e-16 7.494005e-16 6.383782e-16 5.689893e-16 4.857226e-16 2.567391e-16 1.075529e-16 1.318390e-16
[19] 1.665335e-16 2.081668e-16
The differences between the confidence bounds are larger but still very small:
> compare$plo-compare$lower
[1] -1.208997e-06 -1.420235e-06 -8.815678e-07 -8.324261e-07 -1.076016e-06 -5.481007e-07 -1.429258e-06 -8.133438e-07 -5.648821e-07
[10] -5.806940e-07 -5.364281e-07 -1.004792e-06 -6.314904e-07 -4.007381e-07 -4.847205e-07 -3.474783e-07 -1.398476e-07 -1.679746e-07
[19] -1.476577e-07 -2.332091e-07
But if I use the exact normal quantile, cmult <- qnorm(0.975), instead of cmult <- 1.96, the differences for these bounds also become negligible:
> compare$plo-compare$lower
[1] 5.828671e-16 9.992007e-16 9.992007e-16 9.436896e-16 7.771561e-16 3.053113e-16 9.992007e-16 8.604228e-16 6.938894e-16
[10] 5.134781e-16 2.289835e-16 4.718448e-16 4.857226e-16 4.440892e-16 3.469447e-16 1.006140e-16 3.382711e-17 6.765422e-17
[19] 1.214306e-16 1.283695e-16
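A quick base-R check of why the rounded multiplier matters, using hypothetical values for the linear predictor and its standard error (eta_toy and se_toy are not taken from the model above): 1.96 is slightly larger than qnorm(0.975), so the lower bound computed with 1.96 sits slightly below the one computed with the exact quantile.

```r
z_exact <- qnorm(0.975)
z_round <- 1.96
z_round - z_exact            # about 3.6e-05

eta_toy <- -1.35             # hypothetical logit-scale prediction
se_toy  <- 0.3               # hypothetical logit-scale standard error

# Lower bound with the rounded multiplier minus lower bound with the exact one
diff_lower <- plogis(eta_toy - z_round * se_toy) -
              plogis(eta_toy - z_exact * se_toy)
diff_lower                   # small and negative
```

The sign and rough size of diff_lower are in line with the compare$plo - compare$lower values obtained with cmult <- 1.96.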
I have written this R code as a reproducible example. Here, I have created a unique column "ID", and I am not sure how to add the predicted column back to the test dataset so that each prediction maps to its respective ID. Please guide me on the right way to do this.
#Code
library(C50)
data(churn)
data=rbind(churnTest,churnTrain)
data$ID<-seq.int(nrow(data)) #adding unique id column
rm(churnTrain)
rm(churnTest)
set.seed(1223)
ind <- sample(2,nrow(data),replace = TRUE, prob = c(0.7,0.3))
train <- data[ind==1,1:21]
test <- data[ind==2, 1:21]
xtrain <- train[,-20]
ytrain <- train$churn
xtest <- test[,-20]
ytest<- test$churn
x <- cbind(xtrain,ytrain)
## C50 Model
c50Model <- C5.0(churn ~
state +
account_length +
area_code +
international_plan +
voice_mail_plan +
number_vmail_messages +
total_day_minutes +
total_day_calls +
total_day_charge +
total_eve_minutes +
total_eve_calls +
total_eve_charge +
total_night_minutes +
total_night_calls +
total_night_charge +
total_intl_minutes +
total_intl_calls +
total_intl_charge +
number_customer_service_calls,data=train, trials=10)
# Evaluate Model
c50Result <- predict(c50Model, xtest)
table(c50Result, ytest)
#adding prediction to test data
testnew = cbind(xtest,c50Result)
#OR predict directly
xtest$churn = predict(c50Model, xtest)
I’d use match(dataID, predictedID) to match ID columns in data sets.
In reply to your comment:
If you want to add predicted values to the original data frame, both ways of merging data and predictions are correct and produce identical results. The only thing is, I would use
xtest$churn_hut <- predict(c50Model, xtest)
instead of
xtest$churn <- predict(c50Model, xtest)
because here you are replacing the original churn (as in data$churn) with whatever the model predicted, so you can’t compare the two.
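A minimal sketch of the match() suggestion on toy data (the data frames full and preds and the column churn_pred are hypothetical names, not from the question): match() returns, for each ID in the full data, the row position of that ID in the predictions, which can then be used to index the predicted column.

```r
# Toy data (hypothetical): full data set with IDs, and predictions for a subset
full  <- data.frame(ID = 1:6,
                    churn = c("yes", "no", "no", "yes", "no", "no"),
                    stringsAsFactors = FALSE)
preds <- data.frame(ID = c(5, 2, 4),
                    churn_pred = c("no", "no", "yes"),
                    stringsAsFactors = FALSE)

# For each full$ID, the row position in preds (NA where there is no prediction)
idx <- match(full$ID, preds$ID)
full$churn_pred <- preds$churn_pred[idx]
full
```

Rows without a prediction get NA, so the original churn column stays available for comparison. merge(full, preds, by = "ID", all.x = TRUE) is a base-R alternative with the same effect.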
I am working through an example from Aguinis, Gottfredson, & Culpepper (2013). They have provided some R code to perform a bootstrapping procedure in R to estimate confidence intervals for slope variances. This is their original R code:
library(RLRsim)
#STEP 3: Random Intercept and Random Slope model
lmm.fit3=lmer(Y ~ (Xc|l2id) + Xc + I(Wj-mean(Wj)), data=exdata, REML=F)
# Nonparametric Bootstrap Function
REMLVC=VarCorr(lmer(Y ~Xc+(Xc|l2id)+I(Wj-mean(Wj) ),data=exdata,REML=T))$l2id[1:2,1:2]
U.R=chol(REMLVC)
REbootstrap=function(Us,es,X,gs){
nj=nrow(Us)
idk=sample(1:nj,size=nj,replace=T)
Usk=as.matrix(Us[idk,])
esk=sample(es,size=length(es),replace=T)
S=t(Usk)%*%Usk/nj
U.S = chol(S)
A=solve(U.S)%*%U.R
Usk = Usk%*%A
datk=expand.grid(l1id = 1:6,l2id = 1:nj)
colnames(X)=c('one','Xc','Wjc')
datk=cbind(datk,X)
datk$yk = X%*%gs + Usk[datk$l2id,1]+Usk[datk$l2id,2]*X[,2]+esk
lmm.fitk=lmer(yk ~Xc+(Xc|l2id)+Wjc,data=datk,REML=F)
tau11k = VarCorr(lmm.fitk)$l2id[2,2]
tau11k
}
# Implementing Bootstrap
bootks=replicate(1500,REbootstrap(Us=ranef(lmm.fit3)$l2id,es=resid(lmm.fit3),X=model.matrix(lmm.fit3),gs=fixef(lmm.fit3)))
quantile(bootks,probs=c(.025,.975))
I have been trying to adapt the code to suit my own data and model. So far this has been unfruitful because (a) I do not fully understand all the lines of code and (b) I have missing data points in one of my predictors. Here is what I have so far:
#reproducible code
set.seed(855)
exdf <- data.frame(
ID= c(rep(1:105, 28)),
content= sort(c(rep(1:28, 105))),
PrePost= sample(0:1, 105*28, replace=TRUE),
eyeFRF= sort(rep(rnorm(28), 105)),
APMs= sample(0:1, 105*28, replace=TRUE),
Gf= rep(rnorm(105), 28)
)
exdf[which(exdf$ID==62), "eyeFRF"] <- NA
RandomMissing <- sample(rownames(exdf[-which(exdf$ID==62), ]), 17)
exdf[RandomMissing, "eyeFRF"] <- NA
View(exdf)
#model
M03b <- glmer(APMs ~ PrePost + Gf + eyeFRF + (1|content) + (eyeFRF|ID), data=exdf, family=binomial("logit"))
#own adaptation
REMLVC=VarCorr(M03b)$ID[1:2,1:2]
U.R=chol(REMLVC)
REbootstrap=function(Us, es, X, gs){
#Us = random effects
#es = residuals
#X = design matrix
#gs = fixed effects
nj = nrow(Us) #104 in this case, one is excluded (#62) b/c no eye-data
idk = sample(1:nj, size=nj, replace=TRUE) #104 IDs
Usk = as.matrix(Us[idk,]) #104 intercepts and slopes
esk = sample(es, size=length(es), replace=TRUE) #2895 datapoints called 'x' (errors?)
S = t(Usk)%*%Usk/nj #?
U.S = chol(S) #?
A = solve(U.S)%*%U.R #?
Usk = Usk%*%A #?
datk = expand.grid(content=1:28, ID=1:nj)
colnames(X) = c('one', 'PrePost', 'Gf', 'eyeFRF')
datk = cbind(datk, X)
datk$APMsk = X%*%gs + Usk[datk$ID,1] + Usk[datk$ID,2]*X[ ,2] + esk
lmm.fitk = glmer(APMsk ~ PrePost + Gf + eyeFRF + (1|content) + (zb|ID), data=datk, family=binomial("logit"))
tau11k = VarCorr(lmm.fitk)$l2id[2,2]
tau11k
}
# Implementing Bootstrap
bootks <- replicate(1500, REbootstrap(Us=ranef(M03b)$ID, es=resid(M03b), X=model.matrix(M03b), gs=fixef(M03b)))
quantile(bootks, probs=c(.025,.975))
(upgrading comment to an answer)
If you're trying to get confidence intervals via parametric bootstrapping, would confint(M03b,method="boot") work for you? (I think these methods may be new or better developed since that paper was written ...)