How to use get() to construct a complicated model of variables - r

I know how to use get() to construct a model on the fly from a variable, for example:
dvar="myResponse"
ivar="someIndependentVariable"
myFamily="binomial"
myGLM <- glm(data=ds, get(dvar) ~ get(ivar), family=myFamily)
This is handy for looping through a list of variables, of course -- you could feed it a list of independent variables in a for() loop, and look at a number of different models. My question is, how would I use get(), eval(), or some similar commands to create more complex calls? For example, suppose I have two independent variables in a list:
dvar="myResponse"
ivar=c("independentVar1","independentVar2")
and what I want, in the end, is this:
myGLM<-glm(data=ds, myResponse ~ independentVar1 + independentVar2)
I know I could do this with three get() statements, given that I only have 1 dependent and 2 independent variables, but is there a general way to do it for an n-item list of independent variables? Basically, what I'm up to is something like stepwise regression, but I'm not happy with any of the existing options in caret, MASS, and so forth.

You want ?reformulate ...
dvar="myResponse"
ivar <- c("independentVar1","independentVar2")
form <- reformulate(ivar, response=dvar)
glm(form, family = myFamily, data = ...)
As a general rule,
solutions that use reformulate() or that manipulate the formula directly (with quote(), substitute(), as.symbol(), etc.) are more idiomatic/safer/more robust than ...
string-based solutions (deparse()/as.formula()), which are in turn more idiomatic/safer/more robust than ...
solutions with [m]get(), eval(), etc.
(I'm cheating on this hierarchy a little here, since reformulate() is itself string-based, but since it's a built-in function ...)
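To make the n-variable case concrete, here is a minimal sketch of looping reformulate() over several sets of predictors. It uses the built-in mtcars data and a Gaussian family purely as stand-ins (they are assumptions, not the asker's ds or myFamily), so swap in your own data frame and family.

```r
# Assumed stand-ins: mtcars replaces the asker's `ds`, gaussian replaces `myFamily`
dvar <- "mpg"
candidates <- list("wt", c("wt", "hp"), c("wt", "hp", "qsec"))

fits <- lapply(candidates, function(ivar) {
  form <- reformulate(ivar, response = dvar)  # e.g. mpg ~ wt + hp
  glm(form, family = gaussian, data = mtcars)
})

# Compare the candidate models, e.g. by AIC
data.frame(
  model = sapply(fits, function(m) deparse(formula(m))),
  AIC   = sapply(fits, AIC)
)
```

Because the formula is built once per predictor set, the same loop works for one predictor or twenty, which is the piece a hand-rolled stepwise procedure needs.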

Related

Using Surv function repeatedly with different data frames

I'm running some Surv() functions, and one thing I do not like, or understand, is why this function does not take a "data=" argument. This is annoying because I want to perform the same Surv() function on the same data frame but filtered by different criteria each time.
So for example, my data frame is called "ikt" and I want to filter by "donor_type2=='LD'" and also use a strata variable "plan2". I tried the following but it didn't work:
library(survival)
library(dplyr)
ikt<-data.frame(organ_yrs=(seq(1,20)),
organ_status=rep(c(0,0,1,1),each=5),
plan2=rep(c('A','B','A','B'),each=5),
donor_type2=rep(c('LD','DD'),each=10) )
organ_surv_func<-function(data,criteria,strata) {
data2<-filter(data,criteria)
Surv(data2$organ_yrs,data2$organ_status)~data2$strata
}
organ_surv_func(ikt,donor_type2=='LD',plan2)
Error in filter_impl(.data, quo) : object 'donor_type2' not found
I'm coming from a SAS background so that's probably why I'm thinking this should work and it doesn't...
I looked up something about sapply(), but I don't think that works when the function doesn't have the data= option.
Also, the reason I need the Surv() object and not just survfit(Surv()) (which would let me use data=) is that I'm also using survdiff() for log-rank tests, which takes the Surv() object as its main argument:
lr<-function (surv) {
round(1-pchisq(survdiff(surv)$chisq,length(survfit(surv)$strata)-1),3)
}
Thanks for any help you can provide.
I'm writing this "answer" to caution you against proceeding down the path you seem to be following. The Surv function is really intended to be used as the LHS of a formula defined within one of the survival package functions. You should avoid using constructions like:
Surv(data2$organ_yrs,data2$organ_status)~data2$strata
For one thing it's needlessly verbose, but more importantly, it will prevent the use of predict when it comes time to match up names to formals. survdiff and the other survival functions all have a "data" argument as well as a "subset" argument. Subsetting the data that way (or with the subset() function, as below) should allow you to avoid using filter.
organ_surv_func<-function(data, covar) {
form = as.formula(substitute( Surv(organ_yrs, organ_status) ~ covar, list(covar=covar) ) )
survdiff(form, data=data)
}
# although I think running survdiff in a for-loop might be easier,
# as it would involve fewer tricky language constructs
organ_surv_func( subset(ikt, (donor_type2=='LD')), covar=quote(plan2))
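If you assign the output of survdiff to a named variable, you will be able to more economically access chisq and the group counts in n (a survdiff result only has a strata component when the formula contains strata() terms, so length(myfit$strata) is not the number of groups):
myfit <- organ_surv_func( subset(ikt, (donor_type2=='LD')), covar=quote(plan2))
my.lr.test<-function (myfit) {
# degrees of freedom = number of groups - 1
round(1-pchisq(myfit$chisq, length(myfit$n)-1), 3)
}
my.lr.test(myfit) # not going to be useful with that dataset.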
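As a small illustration of the "subset" argument mentioned above, the sketch below filters inside the survdiff call itself, with no dplyr and no Surv(data$...) construction. It reuses the asker's toy data frame; choosing the 'DD' rows is my own assumption, made only because the 'LD' rows of this toy data contain no events.

```r
library(survival)

ikt <- data.frame(
  organ_yrs    = seq(1, 20),
  organ_status = rep(c(0, 0, 1, 1), each = 5),
  plan2        = rep(c("A", "B", "A", "B"), each = 5),
  donor_type2  = rep(c("LD", "DD"), each = 10)
)

# Filter with subset= instead of dplyr::filter(); the condition is
# evaluated inside `ikt`, so bare column names work
fit <- survdiff(Surv(organ_yrs, organ_status) ~ plan2,
                data = ikt, subset = donor_type2 == "DD")

# Log-rank p-value: degrees of freedom = number of groups - 1
p_value <- 1 - pchisq(fit$chisq, df = length(fit$n) - 1)
round(p_value, 3)
```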

When to use 'with' function and why is it good?

What are the benefits of using with()? In the help file it mentions it evaluates the expression in an environment it creates from the data. What are the benefits of this? Is it faster to create an environment and evaluate it in there as opposed to just evaluating it in the global environment? Or is there something else I'm missing?
with is a wrapper for functions with no data argument
There are many functions that work on data frames and take a data argument so that you don't need to retype the name of the data frame for every time you reference a column. lm, plot.formula, subset, transform are just a few examples.
with is a general purpose wrapper to let you use any function as if it had a data argument.
Using the mtcars data set, we could fit a model with or without using the data argument:
# this is obviously annoying
mod = lm(mtcars$mpg ~ mtcars$cyl + mtcars$disp + mtcars$wt)
# this is nicer
mod = lm(mpg ~ cyl + disp + wt, data = mtcars)
However, if (for some strange reason) we wanted to find the mean of cyl + disp + wt, there is a problem because mean doesn't have a data argument like lm does. This is the issue that with addresses:
# without with(), we would be stuck here:
z = mean(mtcars$cyl + mtcars$disp + mtcars$wt)
# using with(), we can clean this up:
z = with(mtcars, mean(cyl + disp + wt))
Wrapping foo() in with(data, foo(...)) lets us use any function foo as if it had a data argument - which is to say we can use unquoted column names, preventing repetitive data_name$column_name or data_name[, "column_name"].
When to use with
Use with whenever you like interactively (R console) and in R scripts to save typing and make your code clearer. The more frequently you would need to re-type your data frame name for a single command (and the longer your data frame name is!), the greater the benefit of using with.
Also note that with isn't limited to data frames. From ?with:
For the default with method this may be an environment, a list, a data frame, or an integer as in sys.call.
I don't often work with environments, but when I do I find with very handy.
When you need pieces of a result for one line only
As @Rich Scriven suggests in comments, with can be very useful when you need to use the results of something like rle. If you only need the results once, then his example with(rle(data), lengths[values > 1]) lets you use the rle(data) results anonymously.
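A runnable version of that idea, with a made-up input vector x standing in for data:

```r
x <- c(1, 1, 2, 5, 5, 5)   # hypothetical input vector

# rle(x) returns a list with components `lengths` and `values`;
# with() lets us refer to them by name without storing the result
run_lengths <- with(rle(x), lengths[values > 1])
run_lengths
```

Here the runs are 1 (twice), 2 (once) and 5 (three times), so the expression keeps the run lengths for the values 2 and 5.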
When to avoid with
When there is a data argument
Many functions that have a data argument use it for more than just easier syntax when you call it. Most modeling functions (like lm), and many others too (ggplot!) do a lot with the provided data. If you use with instead of a data argument, you'll limit the features available to you. If there is a data argument, use the data argument, not with.
Adding to the environment
In my example above, the result was assigned to the global environment (z = with(...)). To make an assignment inside the list/environment/data, you can use within. (In the case of data.frames, transform is also good.)
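A short sketch of the difference, using mtcars; the column name wt_kg and the unit conversion are purely illustrative assumptions:

```r
# with(): the result lives outside the data frame
wt_kg_vec <- with(mtcars, wt * 1000 * 0.4536)

# within(): the assignment happens inside a modified copy of the data frame
cars_within <- within(mtcars, wt_kg <- wt * 1000 * 0.4536)

# transform(): same outcome for data frames, with name = value syntax
cars_transform <- transform(mtcars, wt_kg = wt * 1000 * 0.4536)

head(cars_within[, c("wt", "wt_kg")])
```

Neither within nor transform modifies mtcars itself; both return a new data frame that you have to assign.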
In packages
Don't use with in R packages. There is a warning in help(subset) that could apply just about as well to with:
Warning This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
If you build an R package using with, when you check it you will probably get warnings or notes about using variables without a visible binding. This will make the package unacceptable to CRAN.
Alternatives to with
Don't use attach
Many (mostly dated) R tutorials use attach to avoid re-typing data frame names by making columns accessible to the global environment. attach is widely considered to be bad practice and should be avoided. One of the main dangers of attach is that data columns can become out of sync if they are modified individually. with avoids this pitfall because it is invoked one expression at a time. There are many, many questions on Stack Overflow where new users are following an old tutorial and run into problems because of attach. The easy solution is always don't use attach.
Using with all the time seems too repetitive
If you are doing many steps of data manipulation, you may find yourself beginning every line of code with with(my_data, .... You might think this repetition is almost as bad as not using with. Both the data.table and dplyr packages offer efficient data manipulation with non-repetitive syntax. I'd encourage you to learn to use one of them. Both have excellent documentation.
I use it when I don't want to keep typing dataframe$. For example
with(mtcars, plot(wt, qsec))
rather than
plot(mtcars$wt, mtcars$qsec)
The former looks up wt and qsec in the mtcars data.frame. Of course
plot(qsec~wt, data = mtcars)
is more appropriate for plot or other functions that take a data= argument.

iterating a coxph() model using various sets of covariates

I'm still a little new to R, so this may be a basic question.
I am looking for risk estimates for a joint-cox model using coxph(). I have to run the model about 60 times using various combinations of variables. Since each iteration of the model will have different covariates (and main exposures), I want to write one function to do it. In the age-adjusted model I just had the main exposure, and everything runs fine. I can add the covariates, and it runs... I just need a way to write a single function where the "covars" can be whatever I put into the function call.
Note: this is a simplified version, it runs just fine, I just want to make it work without writing out 60 unique iterations of it.
subtype <- function(expo, covars){
temp <- coxph(Surv(FAIL, OUTCOME) ~ joint[[expo]]*strata(EVENT2)+
covars+
cluster(ID)+strata(AGE_INT),
na.action=na.exclude,
data=joint)
return(summary(temp))
}
results <- subtype("RACE", covars=...)
results2 <- subtype("GENDER", covars=...)
When I did this macro programming in SAS, it was easy.
Thank you for your help.
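No answer is recorded for this question here. One possible approach, in the spirit of the reformulate() advice earlier on this page, is to pass the exposure and covariates as character vectors and build the formula inside the function. This is only a sketch under assumptions: it uses the lung data set shipped with the survival package instead of the asker's joint data, and the helper names cox_formula and fit_cox are hypothetical.

```r
library(survival)

# Build Surv(time, status) ~ <exposure> + <covariates> from character vectors.
# Terms such as "strata(sex)" or "cluster(id)" can be passed as strings too.
cox_formula <- function(expo, covars = character(0)) {
  reformulate(c(expo, covars), response = "Surv(time, status)")
}

fit_cox <- function(expo, covars = character(0), data) {
  form <- cox_formula(expo, covars)
  coxph(form, data = data)
}

# Each list entry describes one model to fit
specs <- list(
  list(expo = "age", covars = "ph.ecog"),
  list(expo = "sex", covars = c("age", "ph.ecog"))
)

results <- lapply(specs, function(s) {
  summary(fit_cox(s$expo, s$covars, data = lung))
})
```

With the asker's data, the response string and any fixed terms (the strata and cluster terms) would be adjusted accordingly, and specs would hold the 60 or so combinations.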

R calling string for lookup in a function

I'm trying to call a column name for the e1071 svm function.
The working code looks like:
model = svm(Air_Flow~., data = trainset)
But in an effort to make it more automated I changed it to:
coi=44
model = svm(colnames(data)[coi]~., data = trainset)
This didn't work, due (I think) to the quote marks, so I tried:
get(colnames(data)[coi])
cat(...)
print(...,quote = F)
as.name(...)
parse(...)
Only get() sort of worked, but then when I tried to predict other values using model it didn't. Any suggestions on what may get this working?
Thanks
Formulas are not strings that you can just "paste" variables into. Nor are variable names the same as strings. You need to be careful about how you build expressions to make sure you are using the correct type. Formulas are really un-evaluated calls that hold names/symbols as parameters.
You might consider using bquote() to build your formula expression, and be sure to convert the character version of the variable name to a proper variable name with as.name():
coi=44
# eval() turns the call built by bquote() into an actual formula object
model = svm(eval(bquote(.(as.name(colnames(data)[coi])) ~ .)), data = trainset)
Yes, this is a bit ugly. That's why functions that allow formulas often also have an alternative interface that's easier to program against. svm() also allows you to pass in x and y parameters for the predictors and the response. You might do
model = svm(x = trainset[,-coi], y = trainset[,coi])
which is nicer because you can subset columns from your dataset with both string and numeric indexes.
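For comparison, here is a sketch of two ways to build a "response ~ ." formula from a column index. To keep it runnable without e1071, lm() and the built-in mtcars data stand in for svm() and trainset; that substitution and the value of coi are assumptions for illustration only.

```r
coi  <- 1                        # column index of the response (assumed)
resp <- colnames(mtcars)[coi]    # the column name as a string

form1 <- reformulate(".", response = resp)     # built from strings
form2 <- eval(bquote(.(as.name(resp)) ~ .))    # built from a symbol via bquote()

# lm() stands in for svm() here; both accept a formula plus data=
model <- lm(form1, data = mtcars)

# Because the formula holds a real column name, predict() works on new data
preds <- predict(model, newdata = mtcars[1:3, ])
```

Both formulas deparse to the same text; the practical point is that the response is stored as a column name, which is what get() failed to provide at prediction time.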

Use the multiple variables in function in r

I have this function
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
ANN<-neuralnet(x~y,hidden=10,algorithm='rprop+')
return(ANN)
}
I need the function run like
formula=X1+X2
ANN(DV,formula)
and get result of the function. So the problem is to say the function USE the object which was created during the run of function. I need to run trough lapply more combinations of x,y, so I need it this way. Any advices how to achieve it? Thanks
I've edited my answer, this still works for me. Does it work for you? Can you be specific about what sort of errors you are getting?
New response:
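library(neuralnet)
ANN<-function (y){
X1<-c(1:10)
DV<-rep(c(0:1),5)
X2<-c(2:11)
dat <- data.frame(DV,X1,X2)
# build DV ~ X1 + X2 from the right-hand side passed in as a string
form <- reformulate(y, response="DV")
ANN<-neuralnet(form,hidden=10,algorithm='rprop+',data=dat)
return(ANN)
}
# X1 and X2 only exist inside the function, so pass the terms as a string
formula<-"X1+X2"
ANN(formula)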
If you want to specify the two parts of the formula separately, you should still pass them as formulas.
library(neuralnet)
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
formula<-update(x,y)
ANN<-neuralnet(formula,data=data.frame(DV,X1,X2),
hidden=10,algorithm='rprop+')
return(ANN)
}
ANN(DV~., ~X1+X2)
And assuming you're using neuralnet() from the neuralnet library, it seems the data= is required so you'll need to pass in a data.frame with those columns.
Formulas are special because they are not evaluated unless explicitly requested. This is different from just using a symbol, which is evaluated to something in the proper frame as soon as you use it. This means there's a big difference between DV (a "name") and DV~. (a formula). The latter is safer for passing around to functions and evaluating in a different context. Things get much trickier with symbols/names.
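A few lines of base R illustrate the point; the variable names are deliberately undefined and purely hypothetical:

```r
# Neither name exists, yet building the formula raises no error
f <- undefined_response ~ undefined_predictor
class(f)        # "formula"
all.vars(f)     # the names it holds, still unevaluated

# A bare name, by contrast, is looked up immediately and fails here
res <- tryCatch(undefined_response, error = function(e) "lookup failed")

# update() splices formula pieces together without evaluating them
update(DV ~ ., ~ X1 + X2)
```

This is why ANN(DV~., ~X1+X2) can be called before DV, X1 or X2 exist: the names are only resolved later, against the data frame built inside the function.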
