Following the model-based recursive partitioning in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6015941/ I want to replicate the following code:
sim_data <- function(n=2000){
x1 <- rnorm(n)
x2 <- rbinom(n,1,0.3)
x3 <- runif(n)
x4 <- rnorm(n)
t <- rbinom(n,1,0.5)
z <- 1-x2+x1+2*(x1>=0)*x2*t-2*(x1<0)*x2*t
pr <- 1/(1+exp(-z))
y <- as.factor(rbinom(n,1,pr))
data.frame(x1,x3,x2=as.factor(x2),x4, t=factor(t,labels=c("C","A")),y,z)
}
dt <- sim_data()
dt.num = as.data.frame(sapply(dt, as.numeric))
dt.num$y <- dt.num$y-1 #only to convert outcome 1,2 into 0,1
mbase <- glm(y~t, data=dt.num,
family = binomial())
round(summary(mbase)$coefficients,3)
library("model4you")
pmtr <- pmtree(mbase, zformula = ~. ,
data = dt.num,
control = ctree_control(minbucket = 250))
plot(pmtr, terminal_panel = node_pmterminal(pmtr,
plotfun = binomial_glm_plot,
confint = TRUE))
However, the following inexplicable error occurs:
Error in .Call.graphics(C_palette2, .Call(C_palette2, NULL)) :
invalid graphics state
I was looking for a solution to this problem in the post Persistent invalid graphics state error when using ggplot2. But the problem persists.
Any clue?
Thank you in advance
When I tried to replicate this, I got a different error:
plot(pmtr, terminal_panel = node_pmterminal(pmtr, plotfun = binomial_glm_plot, confint = TRUE))
## Waiting for profiling to be done...
## Error in plotfun(mod = list(coefficients = c(`(Intercept)` = -0.16839363929017, :
## Plotting currently only works for models with a single factor covariate.
## We recommend using partykit or ggparty plotting functionalities!
The reason for this is that the panel function expects both the response and the treatment to be binary factors (as in dt). When you use binary numeric variables instead (as in dt.num) the model estimation in glm() leads to equivalent output but the plot() functionality is confused.
When I refit both the glm() and the pmtree() with dt rather than dt.num everything works as intended for me, yielding the following graphic:
Related
I want to run logistic regression for multiple parameters and store the different metrics i.e AUC.
I wrote the function below but I get an error when I call it: Error in eval(predvars, data, env) : object 'X0' not found even if the variable exists in both my training and testing dataset. Any idea?
new.function <- function(a) {
model = glm(extry~a,family=binomial("logit"),data = train_df)
pred.prob <- predict(model,test_df, type='response')
predictFull <- prediction(pred.prob, test_df$extry)
auc_ROCR <- performance(predictFull, measure = "auc")
my_list <- list("AUC" = auc_ROCR)
return(my_list)
}
# Call the function new.function supplying 6 as an argument.
les <- new.function(X0)
The main reason why your function didn't work is that you are trying to call an object into a formula. You can fix it with paste formula function, but that is ultimately quite limiting.
I suggest instead that you consider using update. This allow you more flexibility to change with multiple variable combination, or change a training dataset, without breaking the function.
model = glm(extry~a,family=binomial("logit"),data = train_df)
new.model = update(model, .~X0)
new.function <- function(model){
pred.prob <- predict(model, test_df, type='response')
predictFull <- prediction(pred.prob, test_df$extry)
auc_ROCR <- performance(predictFull, measure = "auc")
my_list <- list("AUC" = auc_ROCR)
return(my_list)
}
les <- new.function(new.model)
The function can be further improved by calling the test_df as a separate argument, so that you can fit it with an alternative testing data.
To run the function in the way you intended, you would need to use non-standard evaluation to capture the symbol and insert it in a formula. This can be done using match.call and as.formula. Here's a fully reproducible example using dummy data:
new.function <- function(a) {
# Convert symbol to character
a <- as.character(match.call()$a)
# Build formula from character strings
form <- as.formula(paste("extry", a, sep = "~"))
model <- glm(form, family = binomial("logit"), data = train_df)
pred.prob <- predict(model, test_df, type = 'response')
predictFull <- ROCR::prediction(pred.prob, test_df$extry)
auc_ROCR <- ROCR::performance(predictFull, "auc")
list("AUC" = auc_ROCR)
}
Now we can call the function in the way you intended:
new.function(X0)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
new.function(X1)
#> $AUC
#> A performance instance
#> 'Area under the ROC curve'
If you want to see the actual area under the curve you would need to do:
new.function(X0)$AUC#y.values[[1]]
#> [1] 0.6599759
So you may wish to modify your function so that the list contains auc_ROCR#y.values[[1]] rather than auc_ROCR
Data used
set.seed(1)
train_df <- data.frame(X0 = sample(100), X1 = sample(100))
train_df$extry <- rbinom(100, 1, (train_df$X0 + train_df$X1)/200)
test_df <- data.frame(X0 = sample(100), X1 = sample(100))
test_df$extry <- rbinom(100, 1, (test_df$X0 + test_df$X1)/200)
Created on 2022-06-29 by the reprex package (v2.0.1)
I am using this code to fit a model using LASSO regression.
library(glmnet)
IV1 <- data.frame(IV1 = rnorm(100))
IV2 <- data.frame(IV2 = rnorm(100))
IV3 <- data.frame(IV3 = rnorm(100))
IV4 <- data.frame(IV4 = rnorm(100))
IV5 <- data.frame(IV5 = rnorm(100))
DV <- data.frame(DV = rnorm(100))
data<-data.frame(IV1,IV2,IV3,IV4,IV5,DV)
x <-model.matrix(DV~.-IV5 , data)[,-1]
y <- data$DV
AB<-glmnet(x=x, y=y, alpha=1)
plot(AB,xvar="lambda")
lambdas = NULL
for (i in 1:100)
{
fit <- cv.glmnet(x,y)
errors = data.frame(fit$lambda,fit$cvm)
lambdas <- rbind(lambdas,errors)
}
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean)
bestindex = which(lambdas[2]==min(lambdas[2]))
bestlambda = lambdas[bestindex,1]
fit <- glmnet(x,y,lambda=bestlambda)
I would like to calculate some sort of R2 using the training data. I assume that one way to do this is using the cross-validation that I performed in choosing lambda. Based off of this post it seems like this can be done using
r2<-max(1-fit$cvm/var(y))
However, when I run this, I get this error:
Warning message:
In max(1 - fit$cvm/var(y)) :
no non-missing arguments to max; returning -Inf
Can anyone point me in the right direction? Is this the best way to compute R2 based off of the training data?
The function glmnet does not return cvm as a result on fit
?glmnet
What you want to do is use cv.glmnet
?cv.glmnet
The following works (note you must specify more than 1 lambda or let it figure it out)
fit <- cv.glmnet(x,y,lambda=lambdas[,1])
r2<-max(1-fit$cvm/var(y))
I'm not sure I understand what you are trying to do. Maybe do this?
for (i in 1:100)
{
fit <- cv.glmnet(x,y)
errors = data.frame(fit$lambda,fit$cvm)
lambdas <- rbind(lambdas,errors)
r2[i]<-max(1-fit$cvm/var(y))
}
lambdas <- aggregate(lambdas[, 2], list(lambdas$fit.lambda), mean)
bestindex = which(lambdas[2]==min(lambdas[2]))
bestlambda = lambdas[bestindex,1]
r2[bestindex]
I'm having some problems with the predict function when using bayesglm. I've read some posts that say this problem may arise when the out of sample data has more levels than the in sample data, but I'm using the same data for the fit and predict functions. Predict works fine with regular glm, but not with bayesglm. Example:
control <- y ~ x1 + x2
# this works fine:
glmObject <- glm(control, myData, family = binomial())
predicted1 <- predict.glm(glmObject , myData, type = "response")
# this gives an error:
bayesglmObject <- bayesglm(control, myData, family = binomial())
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
Error in X[, piv, drop = FALSE] : subscript out of bounds
# Edit... I just discovered this works.
# Should I be concerned about using these results?
# Not sure why is fails when I specify the dataset
predicted3 <- predict(bayesglmObject, type = "response")
Can't figure out how to predict with a bayesglm object. Any ideas? Thanks!
One of the reasons could be to do with the default setting for the parameter "drop.unused.levels" in the bayesglm command. By default, this parameter is set to TRUE. So if there are unused levels, it gets dropped during model building. However, the predict function still uses the original data with the unused levels present in the factor variable. This causes differences in level between the data used for model building and the one used for prediction (even it is the same data fame -in your case, myData). I have given an example below:
n <- 100
x1 <- rnorm (n)
x2 <- as.factor(sample(c(1,2,3),n,replace = TRUE))
# Replacing 3 with 2 makes the level = 3 as unused
x2[x2==3] <- 2
y <- as.factor(sample(c(1,2),n,replace = TRUE))
myData <- data.frame(x1 = x1, x2 = x2, y = y)
control <- y ~ x1 + x2
# this works fine:
glmObject <- glm(control, myData, family = binomial())
predicted1 <- predict.glm(glmObject , myData, type = "response")
# this gives an error - this uses default drop.unused.levels = TRUE
bayesglmObject <- bayesglm(control, myData, family = binomial())
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
Error in X[, piv, drop = FALSE] : subscript out of bounds
# this works fine - value of drop.unused.levels is set to FALSE
bayesglmObject <- bayesglm(control, myData, family = binomial(),drop.unused.levels = FALSE)
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
I think a better way would be to use droplevels to drop the unused levels from the data frame beforehand and use it for both model building and prediction.
I'm testing the kernlab package in a regression problem. It seems it's a common issue to get 'Error in .local(object, ...) : test vector does not match model ! when passing the ksvm object to the predict function. However I just found answers to classification problems or custom kernels that are not applicable to my problem (I'm using a built-in one for regression). I'm running out of ideas here, my sample code is:
data <- matrix(rnorm(200*10),200,10)
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(x = tr[,-1],
y = tr[,1],
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod,
ts
)
You forgot to remove the y variable in the test set, and so it fails because the number of predictors don't match. This will work:
predict(mod,ts[,-1])
You can use pred <- predict(mod, ts) if ts is a dataframe.
It would be
data <- setNames(data.frame(matrix(rnorm(200*10),200,10)),
c("Y",paste("X", 1:9, sep = "")))
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(as.formula("Y ~ ."), data = tr,
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod, ts)
I want use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem. Thanks for your help. The following code leads to the error:
library(survival)
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
set.seed(13)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
Z2=sample(0:1,n,replace=TRUE),
Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
mean(event)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following function works:
fit <- coxph(fmla, phmmd)
basehaz(fit)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
meanwhile:
environment(BaseFun)
<environment: R_GlobalEnv>
Therefore that is why the function basehaz cannot find the local variable inside the function.
A possible solution is to send x to the top using assign:
BaseFun <- function(x){
assign('x',x,pos=.GlobalEnv)
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
rm(x)
}
BaseFun(fmla)
Other solutions may involved dealing with the environments more directly.
I'm following up on #moli's comment to #aatrujillob's answer. They were helpful so I thought I would explain how it solved things for me and a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
Trt <- rbinom(N, 1, trtprob)
outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
Outcome <- rbinom(N,1,outprob)
rm(outprob, trtprob)
})
I want to split the data to training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the issue with the following example. When I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function. Again,
badFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
Trying to run this function throws an error similar to OP's.
library(rpart)
library(partykit)
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in is.data.frame(data) : object 'train_data' not found
# 10. is.data.frame(data)
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3. as.party.rpart(ct_train)
# 2. partykit::as.party(ct_train)
# 1. badFunc(data = data, my_formula = my_formula)
print(bad_out)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
## This solved it for me
model=TRUE,
##
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
good_out <- goodFunc(data=data, my_formula = my_formula)
print(good_out)
# Model formula:
# Trt ~ W + X
#
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
documentation for model argument in rpart():
model:
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!