I'm having some problems with the predict function when using bayesglm. I've read some posts that say this problem may arise when the out of sample data has more levels than the in sample data, but I'm using the same data for the fit and predict functions. Predict works fine with regular glm, but not with bayesglm. Example:
control <- y ~ x1 + x2
# this works fine:
glmObject <- glm(control, myData, family = binomial())
predicted1 <- predict.glm(glmObject , myData, type = "response")
# this gives an error:
bayesglmObject <- bayesglm(control, myData, family = binomial())
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
Error in X[, piv, drop = FALSE] : subscript out of bounds
# Edit... I just discovered this works.
# Should I be concerned about using these results?
# Not sure why it fails when I specify the dataset
predicted3 <- predict(bayesglmObject, type = "response")
Can't figure out how to predict with a bayesglm object. Any ideas? Thanks!
One possible reason is the default setting of the "drop.unused.levels" argument in the bayesglm command. By default, this argument is set to TRUE, so any unused factor levels are dropped during model building. However, the predict function still uses the original data, in which the unused levels are still present in the factor variable. This creates a mismatch in levels between the data used for model building and the data used for prediction (even if it is the same data frame, myData in your case). I have given an example below:
n <- 100
x1 <- rnorm (n)
x2 <- as.factor(sample(c(1,2,3),n,replace = TRUE))
# Replacing 3 with 2 leaves level 3 present but unused
x2[x2==3] <- 2
y <- as.factor(sample(c(1,2),n,replace = TRUE))
myData <- data.frame(x1 = x1, x2 = x2, y = y)
control <- y ~ x1 + x2
# this works fine:
glmObject <- glm(control, myData, family = binomial())
predicted1 <- predict.glm(glmObject , myData, type = "response")
# this gives an error - this uses default drop.unused.levels = TRUE
bayesglmObject <- bayesglm(control, myData, family = binomial())
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
Error in X[, piv, drop = FALSE] : subscript out of bounds
# this works fine - value of drop.unused.levels is set to FALSE
bayesglmObject <- bayesglm(control, myData, family = binomial(),drop.unused.levels = FALSE)
predicted2 <- predict.bayesglm(bayesglmObject , myData, type = "response")
I think a better way would be to use droplevels to drop the unused levels from the data frame beforehand and use it for both model building and prediction.
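As a minimal sketch of that suggestion, using only base R (the small factor below is made up for illustration and is not the asker's data), droplevels() removes levels that no longer occur, so the same cleaned data frame can be passed to both the fitting call and predict:

```r
# Toy factor, assumed for illustration only
x2 <- factor(c(1, 2, 3, 2, 1))
x2[x2 == 3] <- 2                  # level "3" still exists but is now unused
myData <- data.frame(x2 = x2)
levels(myData$x2)                 # still lists the unused level "3"

cleanData <- droplevels(myData)   # drop unused levels from every factor column
levels(cleanData$x2)              # only the levels that actually occur
```

Using cleanData for both model building and prediction keeps the factor levels consistent between the two steps.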
I have been stymied by an error that traces back to predict.lme, running inside a function, failing to interpret a formula based on a variable that has been passed from outside the function. I know the issue has to do with variable scope and different environments, but I've been unable to fully understand it or find a workaround. Your help would be much appreciated.
Here's a reproducible example:
# This will be the nested function.
train_test_perf <- function(train_data, test_data, model, termLabels) {
fixForm <- reformulate(termlabels=termLabels, response="Y")
fit <- nlme::lme(fixForm, data=train_data, random=~ 1|ID)
train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
rtrain <- cor.test(train_data$Y, train_pred)
test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
rtest <- cor.test(test_data$Y, test_pred)
tmp <- data.frame(Model=model,
R_train=rtrain$estimate,
R_test=rtest$estimate)
return(tmp)
}
# And here is the function that calls it.
myfunc <- function(df, newdf, varList) {
for (v in varList) {
perf <- train_test_perf(train_data=df, test_data=newdf, model=v, termLabels=v)
print(perf)
}
}
# The outer function call.
myfunc(df=dat, newdf=newdat, varList=list("W", "X"))
Running this gives the following error and traceback:
Error in eval(mCall$fixed) : object 'fixForm' not found
7.
eval(mCall$fixed)
6.
eval(mCall$fixed)
5.
eval(eval(mCall$fixed)[-2])
4.
predict.lme(fit, newdata = train_data, level = 0, na.action = na.exclude)
3.
predict(fit, newdata = train_data, level = 0, na.action = na.exclude)
2.
train_test_perf(train_data = df, test_data = newdf, model = v,
termLabels = v)
1.
myfunc(df = dat, newdf = newdat, varList = list("W", "X"))
It seems clear that predict.lme does not have access to the fixForm variable, but I haven't been able to work out a way to both define a formula based on a variable and have the value accessible to predict.lme. I'm not sure whether the nested function structure is part of the problem here--if it is, I would prefer to find a workaround that would maintain this structure, as my real-life code includes some other things inside myfunc that occur before and after the call to train_test_perf.
Thanks,
Jeff Phillips
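Passing the formula as a variable stores the variable name, not the formula itself, in the model call, which is likely the issue: predict.lme later tries to evaluate that name and cannot find it. We can use do.call so that the evaluated formula is stored instead.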
train_test_perf <- function(train_data, test_data, model, termLabels) {
fixForm <- reformulate(termlabels=termLabels, response="Y")
fit <- do.call(nlme::lme, list(fixForm, data=quote(train_data), random=~ 1|ID))
train_pred <- predict(fit, newdata=train_data, level=0, na.action=na.exclude)
rtrain <- cor.test(train_data$Y, train_pred)
test_pred <- predict(fit, newdata=test_data, level=0, na.action=na.exclude)
rtest <- cor.test(test_data$Y, test_pred)
tmp <- data.frame(Model=model, R_train=rtrain$estimate,
R_test=rtest$estimate)
return(tmp)
}
Finally, put it in an sapply to avoid tedious for loops.
t(sapply(c("W", "X"), \(x) train_test_perf(train_data=dat, test_data=newdat, model=x, termLabels=x)))
# Model R_train R_test
# [1,] "W" 0.1686495 -0.001738604
# [2,] "X" 0.4138526 0.2992374
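To see why the do.call approach above helps, here is a small sketch using base R's lm and the built-in mtcars data rather than lme (an assumption made so the example runs without extra packages; the helper names fit_direct and fit_docall are hypothetical). It compares what ends up stored in the fitted object's call in each case:

```r
# Direct call: the stored call keeps the *name* of the local variable
fit_direct <- function(term) {
  f <- reformulate(termlabels = term, response = "mpg")
  lm(f, data = mtcars)
}

# do.call: arguments are evaluated first, so the formula itself is stored
fit_docall <- function(term) {
  f <- reformulate(termlabels = term, response = "mpg")
  do.call("lm", list(f, data = quote(mtcars)))
}

fit1 <- fit_direct("wt")
fit2 <- fit_docall("wt")

fit1$call$formula   # the symbol f, which only exists inside fit_direct()
fit2$call$formula   # the actual formula built from the term label
```

Any method that re-evaluates the stored call outside the function, as predict.lme does with the fixed argument, can only succeed in the second case.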
Following the model-based recursive partitioning in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6015941/ I want to replicate the following code:
sim_data <- function(n=2000){
x1 <- rnorm(n)
x2 <- rbinom(n,1,0.3)
x3 <- runif(n)
x4 <- rnorm(n)
t <- rbinom(n,1,0.5)
z <- 1-x2+x1+2*(x1>=0)*x2*t-2*(x1<0)*x2*t
pr <- 1/(1+exp(-z))
y <- as.factor(rbinom(n,1,pr))
data.frame(x1,x3,x2=as.factor(x2),x4, t=factor(t,labels=c("C","A")),y,z)
}
dt <- sim_data()
dt.num = as.data.frame(sapply(dt, as.numeric))
dt.num$y <- dt.num$y-1 #only to convert outcome 1,2 into 0,1
mbase <- glm(y~t, data=dt.num,
family = binomial())
round(summary(mbase)$coefficients,3)
library("model4you")
pmtr <- pmtree(mbase, zformula = ~. ,
data = dt.num,
control = ctree_control(minbucket = 250))
plot(pmtr, terminal_panel = node_pmterminal(pmtr,
plotfun = binomial_glm_plot,
confint = TRUE))
However, the following inexplicable error occurs:
Error in .Call.graphics(C_palette2, .Call(C_palette2, NULL)) :
invalid graphics state
I looked for a solution to this problem in the post "Persistent invalid graphics state error when using ggplot2", but the problem persists.
Any clue?
Thank you in advance
When I tried to replicate this, I got a different error:
plot(pmtr, terminal_panel = node_pmterminal(pmtr, plotfun = binomial_glm_plot, confint = TRUE))
## Waiting for profiling to be done...
## Error in plotfun(mod = list(coefficients = c(`(Intercept)` = -0.16839363929017, :
## Plotting currently only works for models with a single factor covariate.
## We recommend using partykit or ggparty plotting functionalities!
The reason for this is that the panel function expects both the response and the treatment to be binary factors (as in dt). When you use binary numeric variables instead (as in dt.num) the model estimation in glm() leads to equivalent output but the plot() functionality is confused.
When I refit both the glm() and the pmtree() with dt rather than dt.num, everything works as intended for me and the tree is plotted with the expected terminal panels.
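The point about equivalent glm() output can be checked with a small base R sketch. The simulated data below is a simplified stand-in for sim_data() (an assumption for illustration, with only a treatment and an outcome); the numeric copy is built the same way as dt.num in the question:

```r
set.seed(42)
n <- 500
t <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-0.3 + 0.8 * t))

# Factor version (like dt) and numeric version (like dt.num)
dt <- data.frame(y = factor(y), t = factor(t, labels = c("C", "A")))
dt.num <- as.data.frame(sapply(dt, as.numeric))
dt.num$y <- dt.num$y - 1   # convert outcome 1,2 into 0,1

m_fac <- glm(y ~ t, data = dt, family = binomial())
m_num <- glm(y ~ t, data = dt.num, family = binomial())

# Same fitted probabilities, but different column types
all.equal(unname(fitted(m_fac)), unname(fitted(m_num)))
sapply(dt, is.factor)
sapply(dt.num, is.factor)
```

The two fits describe the same model, so the difference that matters for the panel function is only whether the response and treatment columns are factors.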
I'm trying to specify a cluster variable after plm using vcovCR() in clubSandwich package for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"
I'm not sure if this is specific to vcovCR() or something general about R, but could anyone tell me what's wrong with my code? (I saw a related post here How to cluster standard errors of plm at different level rather than id or time?, but it didn't solve my problem).
My code:
N <- 100;id <- 1:N;id <- c(id,id);gid <- 1:(N/2);
gid <- c(gid,gid,gid,gid);T <- rep(0,N);T = c(T,T+1)
a <- qnorm(runif(N),mean=0,sd=0.005)
gp <- qnorm(runif(N/2),mean=0,sd=0.0005)
u <- qnorm(runif(N*2),mean=0,sd=0.05)
a <- c(a,a);gp = c(gp,gp,gp,gp)
Ylatent <- -0.05*T + a + u
Data <- data.frame(
Y = ifelse(Ylatent > 0, 1, 0),
id = id,gid = gid,T = T
)
library(clubSandwich)
library(plm)
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this
Make your data a pdata.frame first. This is safer, especially if you want to have the time index created automatically (seems to be the case looking at your code).
Continuing what you have:
pData <- pdata.frame(Data, index = "id") # time index is created automatically
fe.fit2 <- plm(formula = Y ~ T, data = pData, model = "within", effect = "individual")
vcovCR(fe.fit2, cluster=Data$id,type = "CR2")
vcovCR(fe.fit2, type = "CR2")
vcovCR(fe.fit2,cluster=Data$gid,type = "CR2")
Your example does not work due to a bug in clubSandwich's data extraction function get_index_order (from version 0.3.3) for plm objects. It assumes both index variables are in the original data but this is not the case in your example where the time index is created automatically by only specifying the individual dimension by the index argument.
I'm testing the kernlab package on a regression problem. It seems to be a common issue to get 'Error in .local(object, ...) : test vector does not match model !' when passing the ksvm object to the predict function. However, I have only found answers for classification problems or custom kernels, which do not apply to my problem (I'm using a built-in kernel for regression). I'm running out of ideas here; my sample code is:
library(kernlab)
data <- matrix(rnorm(200*10),200,10)
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(x = tr[,-1],
y = tr[,1],
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod,
ts
)
You forgot to remove the y variable from the test set, so it fails because the number of predictors doesn't match. This will work:
predict(mod,ts[,-1])
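A quick way to confirm the mismatch, sketched here with base R only and the same matrix layout as in the question, is to compare the number of columns the model was trained on with the number of columns passed to predict:

```r
data <- matrix(rnorm(200 * 10), 200, 10)
tr <- data[1:150, ]
ts <- data[151:200, ]

ncol(tr[, -1])   # predictors used for training (response column removed)
ncol(ts)         # test matrix as originally passed: one column too many
ncol(ts[, -1])   # test matrix with the response column removed
```

Only the last form has the same number of columns as the training predictors.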
You can use pred <- predict(mod, ts) without dropping the column if ts is a data frame and the model is fitted through the formula interface. It would be:
data <- setNames(data.frame(matrix(rnorm(200*10),200,10)),
c("Y",paste("X", 1:9, sep = "")))
tr <- data[1:150,]
ts <- data[151:200,]
mod <- ksvm(as.formula("Y ~ ."), data = tr,
kernel = "rbfdot", type = 'nu-svr',
kpar = "automatic", C = 60, cross = 3)
pred <- predict(mod, ts)
I have written a function to run phylogenetic generalized least squares, and everything looks like it should work fine, but for some reason, a specific variable which is defined in the script (W) keeps coming up as undefined. I have stared at this code for hours and cannot figure out where the problem is.
Any ideas?
myou <- function(alpha, datax, datay, tree){
data.frame(datax[tree$tip.label,],datay[tree$tip.label,],row.names=tree$tip.label)->dat
colnames(dat)<-c("Trait1","Trait2")
W<-diag(vcv.phylo(tree)) # Weights
fm <- gls(Trait1 ~ Trait2, data=dat, correlation = corMartins(alpha, tree, fixed = TRUE),weights = ~ W,method = "REML")
return(as.numeric(fm$logLik))
}
corMartins2<-function(datax, datay, tree){
data.frame(datax[tree$tip.label,],datay[tree$tip.label,],row.names=tree$tip.label)->dat
colnames(dat)<-c("Trait1","Trait2")
result <- optimize(f = myou, interval = c(0, 4), datax=datax,datay=datay, tree = tree, maximum = TRUE)
W<-diag(vcv.phylo(tree)) # Weights
fm <- gls(Trait1 ~ Trait2, data = dat, correlation = corMartins(result$maximum, tree, fixed =T),weights = ~ W,method = "REML")
list(fm, result$maximum)}
#test
require(nlme)
require(phytools)
simtree<-rcoal(50)
as.data.frame(fastBM(simtree))->dat1
as.data.frame(fastBM(simtree))->dat2
corMartins2(dat1,dat2,tree=simtree)
returns "Error in eval(expr, envir, enclos) : object 'W' not found"
even though W is specifically defined!
Thanks!
The error occurs in the gls calls in myou and corMartins2: you have to pass W in as a column of dat because gls looks for it there (when you give weights = ~W as a formula like that, it looks for dat$W and can't find it).
Just change data=dat to data=cbind(dat,W=W) in both functions.
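A stripped-down sketch of that fix, without the phylogenetic correlation structure and with simulated data (both simplifications are assumptions for illustration; fit_weighted is a hypothetical helper name), shows the weights variable being supplied as a column of the data passed to gls:

```r
library(nlme)

set.seed(1)
dat <- data.frame(Trait2 = rnorm(20))
dat$Trait1 <- 1 + 0.5 * dat$Trait2 + rnorm(20)

fit_weighted <- function(dat) {
  W <- runif(nrow(dat), 0.5, 2)   # stand-in for diag(vcv.phylo(tree))
  dat <- cbind(dat, W = W)        # make W a column so gls can find it
  gls(Trait1 ~ Trait2, data = dat, weights = ~ W, method = "REML")
}

fm <- fit_weighted(dat)
as.numeric(logLik(fm))
```

The variance covariate named in the weights formula is looked up in the data argument, so binding it to the data frame inside the function avoids relying on where the local variable lives.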
The example is not reproducible for me, as lowerB and upperB are not defined. However, perhaps the following will work for you, cbind-ing W to dat:
myou <- function(alpha, datax, datay, tree){
data.frame(datax[tree$tip.label,],datay[tree$tip.label,],row.names=tree$tip.label)->dat
colnames(dat)<-c("Trait1","Trait2")
W<-diag(vcv.phylo(tree)) # Weights
### cbind W to dat
dat <- cbind(dat, W = W)
fm <- gls(Trait1 ~ Trait2, data=dat, correlation = corMartins(alpha, tree, fixed = TRUE),weights = ~ W,method = "REML")
return(as.numeric(fm$logLik))
}
corMartins2<-function(datax, datay, tree){
data.frame(datax[tree$tip.label,],datay[tree$tip.label,],row.names=tree$tip.label)->dat
colnames(dat)<-c("Trait1","Trait2")
result <- optimize(f = myou, interval = c(lowerB, upperB), datax=datax,datay=datay, tree = tree, maximum = TRUE)
W<-diag(vcv.phylo(tree)) # Weights
### cbind W to dat
dat <- cbind(dat, W = W)
fm <- gls(Trait1 ~ Trait2, data = dat, correlation = corMartins(result$maximum, tree, fixed =T),weights = ~ W,method = "REML")
list(fm, result$maximum)}
#test
require(nlme)
require(phytools)
simtree<-rcoal(50)
as.data.frame(fastBM(simtree))->dat1
as.data.frame(fastBM(simtree))->dat2
corMartins2(dat1,dat2,tree=simtree)