R error message in my bootstrapping & regression code - r

I wanted to use bootstrapping & run a regression, but for some reason, it doesn't work. I always get the error message "Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge" (in English: apparently I have two sets of elements & I want to change one to the other but they do not match).
Does anyone see what could have gone wrong here:
bootReg <- function(formula,data,indices)
{
d <- data[indices,]
fit <- lm(formula, data=d)
return(coef(fit))
}
results <- boot(statistic = bootReg,
formula = RT ~ Code + Situation + Block, data = reg_df, R = 2000)
#RT = reaction times (--> numeric)
#Situation = "lab" or "online"
#Block = either 0,1,2 or 3 (--> as characters)
#Code = each subject's individual code
The data in the groups are dependent (= there are RTs from each subject in each situation X block combination)
Thanks in advance!
P.S.: I googled the error message & compared my code with other people's (working) approaches, but don't know what happened here anyway.

This is what worked for me, you have to add the lm() in the formula in boot().
bootReg <- function(formula,
data,
indices)
{
d <- data[indices,]
fit <- lm(formula, data=d)
return(coef(fit))
}
results <- boot(statistic = bootReg, formula = lm(RT ~ Code + Situation + Block, data = df), data = df, R = 2000)

( I recognize that this is a comment at this point, but I'd like to keep it up as an answer for How-To solve the problem. I'll add specific diagnosis if the OP provides more source info)
A general tutorial for debugging:
First, try traceback() to see if an internal call threw the error or boot itself did.
Next, take a look at the class and the size of the object returned from your bootReg function. Is it exactly what boot will accept for the statistic input? Does your formula return what you expect it to return (again, class and length)? Are you certain that data and indices inputs are getting fed in the right order to your formula?

Related

What does "invalid type (closure) for variable 'variable1'" mean and how do I fix it?

I am trying to write a function in R, which contains a function from another package. The code works perfectly outside a function.
I am guessing, it might have got to do something with the package I am using (survey).
A self-contained code example:
#activating the package
library(survey)
#getting the dataset into R
tm <- read.spss("tm.sav", to.data.frame = T, max.value.labels = 5)
# creating svydesign object (it basically contains the weights to adjust the variables (~persgew: also a column variable contained in the tm-dataset))
tm_w <- svydesign(ids=~0, weights = ~persgew, data = tm)
#getting overview of the welle-variable
#this variable is part of the tm-dataset. it is needed to execute the following steps
table(tm$welle)
# data manipulation as in: taking the v12d_gr-variable as well as the welle-variable and the svydesign-object to create a longitudinal variable which is transformed into a data frame that can be passed to ggplot
t <- svytable(~v12d_gr+welle, tm_w)
tt <- round(prop.table(t,2)*100, digits=0)
v12d <- tt[2,]
v12d <- as.data.frame(v12d)
this is the code outside the function, working perfectly. since I have to transform quite a few variables in the exact same way, I aim to create a function to save up some time.
The following function is supposed to take a variable that will be transformed as an argument (v12sd2_gr).
#making sure the survey-object is loaded
tm_w <- svydesign(ids=~0, weights = ~persgew, data = data)
#trying to write a function containing the code from above
ltd_zsw <- function(variable1){
t <- svytable(~variable1+welle, tm_w)
tt <- round(prop.table(t,2)*100, digits=0)
var_ltd_zsw <- tt[2,]
var_ltd_zsw <- as.data.frame(var_ltd_zsw)
return(var_ltd_zsw)
}
Calling the function:
#as v12d has been altered already, I am trying to transform another variable v12sd2_gr
v12sd2 <- ltd_zsw(v12sd2_gr)
Console output:
Error in model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design)) :
invalid type (closure) for variable 'variable1'
Called from: model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design))
How do I fix it? And what does it mean to dynamically build a formula and reformulating?
PS: I hope it is the appropriate way to answer to the feedback in the comments.
Update: I think I was able to trace the problem back to the argument I am passing (variable1) and I am guessing it has got something to do with the fact, that I try to call a formula within the function. But when I try to call the svytable with as.formula(svytable(~variable1+welle, tm_w))it still doesn't work.
What to do?
I have found a solution to the problem.
Here is the tested and working function:
ltd_test <- function (var, x, string1="con", string2="pro") {
print (table (var))
x$w12d_gr <- ifelse(as.numeric(var)>2,1,0)
x$w12d_gr <- factor(x$w12d_gr, levels = c(0,1), labels = c(string1,string2))
print (table (x$w12d_gr))
x_w <- svydesign(ids=~0, weights = ~persgew, data = x)
t <- svytable(~w12d_gr+welle, x_w)
tt <- round(prop.table(t,2)*100, digits=0)
w12d <- tt[2,]
w12d <- as.data.frame(w12d)
}
The problem appeared to be caused by the svydesgin()-fun. In its output it produces an object which is then used by the formula for svytable()-fun. Thats why it is imperative to first create the x_w-object with svydesgin() and then use the svytable()-fun to create the t-object.
Within the code snippet I posted originally in the question the tm_w-object has been created and stored globally.
Thanks for the help to everyone. I hope this is gonna be of use to someone one day!

Scoping with formulae in coxph objects

I'm trying to write a set of functions where the first function fits a cox model (via coxph in the survival package in R), and the second function gets estimated survival for a new dataset, given the fitted model object from the first function. I'm running into some sort of scoping issue that I don't quite know how to solve without substantially re-factoring my code (the only way I could think to do it would be much less general and much harder to read).
I have a very similar set of functions that are based on the glm function that do not run into the same issue and give me the answers I would expect. I've included a short worked example below that demonstrates the issue. The glue.cox and glue.glm are functions that have the basic functionality I am trying to get. glue.glm works as expected (yielding the same values from a calculation in the global environment), but the glue.cox complains that it can't find the data that was used to fit the cox model and ends with an error. I don't understand how to do this with substitute but I suspect that is the way forward. I've hit a wall with experimenting.
library(survival)
data.global = data.frame(time=runif(20), x=runif(20))
newdata.global = data.frame(x=c(0,1))
f1 = Surv(time) ~ x # this is the part that messes it up!!!!! Surv gets eval
f2 = time ~ x # this is the part that messes it up!!!!! Surv gets eval
myfit.cox.global = coxph(f1, data=data.global)
myfit.glm.global = glm(f2, data=data.global)
myfit.glm.global2 = glm(time ~ x, data=data.global)
myfit.cox <- function(f, dat.local){
coxph(f, data=dat.local)
}
myfit.glm <- function(f, dat.local){
glm(f, data=dat.local)
}
mypredict.cox <- function(ft, dat.local){
newdata = data.frame(x=c(0,1))
tail(survfit(ft, newdata)$surv, 1)
}
mypredict.glm <- function(ft, dat.local){
newdata = data.frame(x=c(0,1))
predict(ft, newdata)
}
glue.cox <- function(f, dat.local){
fit = myfit.cox(f, dat.local)
mypredict.cox(fit, dat.local)
}
glue.glm <- function(f, dat.local){
fit = myfit.glm(f, dat.local)
mypredict.glm(fit, dat.local)
}
# these numbers are the goal for non-survival data
predict(myfit.glm.global, newdata = newdata.global)
0.5950440 0.4542248
glue.glm(f2, data.global)
0.5950440 0.4542248 # this works
# these numbers are the goal for survival data
tail(survfit(myfit.cox.global, newdata = newdata.global)$surv, 1)
[20,] 0.02300798 0.03106081
glue.cox(f1, data.global)
Error in eval(predvars, data, env) : object 'dat.local' not found
This appears to work, at least in the narrow sense of making glue.cox() work as desired:
myfit.cox <- function(f, dat.local){
environment(f) <- list2env(list(dat.local=dat.local))
coxph(f, data=dat.local)
}
The trick here is that most R modeling/model-processing functions look for data in the environment associated with the formula.
I don't know why glue.glm works without doing more digging, except for the general statement that [g]lm objects store more of the information needed for downstream processing internally (e.g. in the $qr element) than other model types.

Why is gam::step.gam returning NULL with forward selection?

I have a large Generalized Additive Model (GAM) made from 10K observations with ~ 100 variables. Building the model with forward stepwise selection results in an object of class "NULL". Why might this be and how do I resolve it?
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
load(url("https://github.com/cornejom/DataSets/raw/master/mygam.Rdata"))
myscope <- gam.scope(mydata, response = 3, arg = "df=4") #Target var in 3rd col.
mygam.step <- step.gam(mygam, myscope, direction = "forward")
mygam.step
NULL
The code that was used to fit mygam from mydata is:
library(gam)
#Identify numerical variables, but exclude the integer response.
numbers = sapply(mydata, class) %in% c("integer", "numeric")
numbers[match("Response", names(mydata))] = FALSE
#Identify factor variables.
factors = sapply(mydata, class) == "factor"
#Create a formula to feed into gam function.
myformula = paste0(paste0("Response ~ ",
paste0("s(", names(mydata)[numbers], ", df=4)", collapse = " + ")
),
" + ",
paste0(paste0(names(mydata)[factors], collapse = " + ")))
mygam = gam(as.formula(myformula), family = "binomial", mydata)
I suspect the issue is with the mygam object.
Explanation
If you read the help(step.gam) it has this paragraph in the explanation of scope argument:
The supplied model ‘object’ is used as the starting model,
and hence there is the requirement that one term from each of
the term formulas be present in ‘formula(object)’. This also
implies that any terms in ‘formula(object)’ not contained
in any of the term formulas will be forced to be present in
every model considered. The function ‘gam.scope’ is helpful
for generating the scope argument for a large model.
In essence this says that the first argument passed to step.gam function (mygam in this case) will have a formula and that formula will be used as a starting model for the stepwise procedure.
Since here we have forward stepwise - it cannot start from the full model, because in that case there is nothing left to add.
Exploring The Code
This idea is reinforced if we look at the code. The code of step.gam function has this loop that runs in case of forward selection.
if (forward) {
trial <- items
trial[i] <- trial[i] + 1
if (trial[i] <= term.lengths[i] && !get.visit(trial,
visited)) {
visited <- cbind(visited, trial)
tform.vector <- form.vector
tform.vector[i] <- scope[[i]][trial[i]]
form.list = c(form.list, list(list(trial = trial,
form.vector = tform.vector, which = i)))
}
}
Notice that the loop executes only when the inner if statement is TRUE. And that if statement seems to check if you have potential variables in your scope (term.length) that are not yet in your model (items, trial). If you don't - the loop skips.
Since in your case the loop never executes it doesn't form the return object and the procedure returns NULL.
The Solution
Given all the above - the solution is to not start with the complete formula when using forward selection method. Here for the demonstration I will be using the intercept-only model as a starting model:
library(gam)
load(url("https://github.com/cornejom/DataSets/raw/master/mydata.Rdata"))
mygam <- gam(Response ~ 1, family = "binomial", mydata)
The last line is the only change that needs to be made. Everything else is the same as in the original post:
myscope <- gam.scope(mydata, response = 3, arg = "df=4")
mygam.step <- step.gam(mygam, myscope, direction = "forward")
And now the procedure works.

For loops regression in R

I'm fitting GARCH model to the residuals of and ARIMA, and trying to apply ARCH(p) for p from 1 to 10 to compare the fitness. Here is my code. Errors are returned in the for loop part but I cannot figure out the reason why. Could anyone give some tips?
So for the single value p=1 the codes are as below and it's no problem.
fitone<- garchFit(~garch(1,0),data=logprice)
coef(fitone)
summary(fitone)
And for the for loop my codes go like
for (n in 1:10) {
fit [[n]]<- garchFit(~garch(n,0),data=logprice)
coef(fit[[n]])
summary(fit[[n]])
}
Error in .garchArgsParser(formula = formula, data = data, trace = FALSE) :
Formula and data units do not match.
I never wrote a loop code before. Can someone help me with the codes?
The problem is that generally one tries to evaluate all the variables in a formula in the context of the data= parameter, but your n variable isn't coming from logprice, it's coming from the global environment. You will need to dynamically create the formula. Here's one way to run all the models with lapply rather than a for look would be
library(fGarch)
#sample data
x.vec = as.vector(garchSim(garchSpec(rseed = 1985), n = 200)[,1])
fits <- lapply(1:10, function(n) {
garchFit(bquote(~garch(.(n),0)), data = x.vec, trace = FALSE)
})
and then we can get the coefs with
lapply(fits, coef)

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(factor)),nrow(factor)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
Data[,66] is the is the target factor I'm trying to predict, but it seems that by using "~ ." to solve for it is causing the formula to use the factor in the prediction model itself.
How do I solve for the dimension I want on high-ish dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with some sort of cforest(data[,66] ~ data[,1] + data[,2] + data[,3}... etc.?
EDIT:
On a high level, I believe one basically
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitting formula so one can predict values of target (in my case data[,66]) given data[1:65].
so my PROBLEM is now if I give it a new set of test data, let’s say test = data{1:65], it now says “Error in eval(expr, envir, enclos) :” where it is expecting data[,66]. I want to basically predict data[,66] given the rest of the data!
I think that if the response is in train3 then it will be used as a feature.
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)

Resources