Working with dataframe variable names passed as arguments to the R function - r

I want to fit a statistical model (e.g. a linear one) for arbitrary variables from the dataframe passed as arguments to a function.
E.g.
myLm = function(x,y) {
fit = lm(y~x, data=mtcars)}
myLm("cyl","mpg")
But this does not work.
How do I do that correctly?

lm() expects a formula object, but myLm receives character strings.
You can change your function like this:
myLm = function(x, y) {
  # build the formula y ~ x from the two strings, matching the original lm(y ~ x)
  lm(as.formula(paste(y, "~", x)), data = mtcars)
}
myLm("cyl", "mpg")
The inputs are still character strings, but they are converted to a formula inside myLm.
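As an alternative to pasting the formula together by hand, base R's reformulate() builds a formula from character vectors. The following is only a sketch: the name my_lm and the extra data argument are assumptions added for illustration, not part of the original answer.

```r
# Hypothetical helper: fits y ~ x, where x and y are column names given as strings.
my_lm <- function(x, y, data = mtcars) {
  f <- reformulate(x, response = y)  # e.g. mpg ~ cyl
  lm(f, data = data)
}

fit <- my_lm("cyl", "mpg")
coef(fit)
```

Because reformulate() accepts a character vector of term labels, the same helper should also work for several predictors, e.g. my_lm(c("cyl", "wt"), "mpg").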

Why don't you pass a formula object directly:
myLm <- function(f)
lm(f, data=mtcars)
myLm(mpg ~ cyl)

Related

using lm function in R with a variable name in a loop

I am trying to create a simple linear model in R inside a for loop, where one of the variables is specified as a parameter and looped over, creating a different model on each pass of the loop. The following does NOT work:
model <- lm(test_par[i] ~ weeks, data=all_data_plant)
If I tried the same model with the "test_par[i]" replaced with the variable's explicit name, it works just as expected:
model <- lm(weight_dry ~ weeks, data=all_data_plant)
I tried reformulate and paste ineffectively. Any thoughts?
Maybe try something like this:
n <- #add the column position of first variable
m <- #add the column position of last variable
lm_models <- lapply(n:m, function(x) lm(all_data_plant[,x] ~ weeks, data=all_data_plant))
You can pass the "formula" argument of lm() as a character string built with paste(). Here is a working example:
data("trees")
test_par <- names(trees)
model <- lm(Girth ~ Height, data = trees)
model <- lm("Girth ~ Height", data = trees) # character formula works
model <- lm(paste(test_par[1], "~ Height"), data=trees)
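To connect this back to the loop in the question, here is a minimal sketch that fits one model per response variable using reformulate(). The choice of Girth and Volume as responses and Height as the predictor is an assumption made purely for illustration with the built-in trees data.

```r
data("trees")

# Hypothetical set of response variables to loop over.
responses <- c("Girth", "Volume")

# Fit one model per response; each formula is built as <response> ~ Height.
models <- lapply(responses, function(resp) {
  lm(reformulate("Height", response = resp), data = trees)
})
names(models) <- responses

coef(models$Volume)
```

Keeping the fits in a named list makes it easy to pull out a single model later by the name of its response.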

Save an object in R using a function parameter as name

I looked all over the website and could not get the correct answer for this dilemma:
I have a user-defined function for evaluating some classification models on different datasets, and I want a single function that evaluates all of them. Given the model and the data, it should compute some metrics (a confusion matrix, for example) and save them to an object outside the function.
The problem here is that I want to create this object using the name of the model I am evaluating.
I ended up with something like this:
foo <- function(x) {return(as.character(substitute(x)))}
model1 <- lm(Sepal.Width ~ Sepal.Length, iris)
Validation.func <- function(model_name, dataset){
Pred_Train = predict(model_name, dataset)
assign(paste("Pred_Train_",foo(model_name), sep=''), Pred_Train, envir=globalenv())
Pred_Train_prob = predict(model_name, dataset, type = "prob")
MC_Train = confusionMatrix(Pred_Train, dataset$target_salto)
}
Running Validation.func(model1, iris), we want the result stored in a variable named "Pred_Train_model1".
Since model_name is not a string, we tried to convert it with the foo function (an answer I found here: foo = function(x) deparse(substitute(x))). This does not give what I want: the object is saved as "Pred_Train_model_name" instead of "Pred_Train_model1".
Does anyone know how to solve it?
model_name in your function is a model object, so it cannot be passed to paste, which expects character strings.
I think you want your function to know that the model object is called "model1" in the environment it comes from. That is tricky, since the same model object may be bound to several names.
The easiest implementation is to pass the model object and its name separately, then use the former for prediction and the latter for naming the result.
func1 <- function(model, model_str, dataset)
{
p <- predict(model, dataset)
assign(paste("predict_", model_str, sep=""), p, envir=globalenv())
}
model1 <- lm(mpg ~ cyl, data=mtcars)
func1(model1, "model1", mtcars)
predict_model1
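Another implementation, tricky but workable if used with care, is to pass only the name of the model as a string and fetch the model object with get. Note that parent.env(environment()) is the environment in which func2 was defined (here the global environment), not necessarily the caller's.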
func2 <- function(model_str, dataset)
{
p <- predict(get(model_str, envir=parent.env(environment())), dataset)
assign(paste("predict_", model_str, sep=""), p, envir=globalenv())
}
model2 <- lm(mpg ~ cyl, data=mtcars)
func2("model2", mtcars)
predict_model2
Finally, to pass the model object and let the function find the variable name itself, you can use match.call to recover how the function was called.
func3 <- function(model, dataset)
{
s <- match.call()
model_str <- as.character(s)[2]
p <- predict(model, dataset)
assign(paste("predict_", model_str, sep=""), p, envir=globalenv())
}
model3 <- lm(mpg ~ cyl, data=mtcars)
func3(model3, mtcars)
predict_model3
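A related sketch uses deparse(substitute()) directly inside the function. The foo helper in the question returned "model_name" because substitute(x) inside foo sees the expression foo was called with, which is the symbol model_name; calling substitute in the outer function avoids that. The name func4 is hypothetical.

```r
func4 <- function(model, dataset) {
  # Capture the expression the caller supplied for `model`, as a string.
  # This must happen here, not inside a helper function.
  model_str <- deparse(substitute(model))
  p <- predict(model, dataset)
  assign(paste0("predict_", model_str), p, envir = globalenv())
  invisible(p)
}

model4 <- lm(mpg ~ cyl, data = mtcars)
func4(model4, mtcars)
head(predict_model4)
```

As with match.call, this only gives a usable name when the argument is a plain variable name; an inline expression such as func4(lm(mpg ~ cyl, mtcars), mtcars) would not produce a sensible object name.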
So here's a suggestion, that does not exactly solve the problem, but does make the function work.
Validation.func <- function(model_name, dataset){
model_name_obj<- eval(parse(text = model_name))
Pred_Train = predict(model_name_obj, dataset)
assign(paste("Pred_Train_",model_name, sep=''), Pred_Train, envir=globalenv())
Pred_Train_prob = predict(model_name_obj, dataset, type = "prob")
MC_Train = confusionMatrix(Pred_Train, dataset$target_salto)
}
Validation.func("model1", data)
What I did is pretty much the opposite of what you were trying. I passed model_name as a string and then evaluated it using eval(parse(text = model_name)). Note that the evaluated object is now called model_name_obj, and it is what gets passed to predict.
I got some errors later on in the function, but they are irrelevant to the issue at hand. They had to do with the type argument in predict and about not recognizing the confusionMatrix, because I assume I didn't load the corresponding package.
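If the goal is simply to keep one result per model, a sketch that avoids assign and name lookup altogether is to store the models in a named list and let lapply carry the names through. The model formulas and the use of mtcars below are assumptions for illustration, not the asker's data.

```r
# Hypothetical models kept in a named list instead of separate variables.
models <- list(
  model1 = lm(mpg ~ cyl, data = mtcars),
  model2 = lm(mpg ~ wt, data = mtcars)
)

# One vector of predictions per model; the list names are preserved.
preds <- lapply(models, predict, newdata = mtcars)

head(preds$model1)
```

Results are then reached as preds$model1 or preds[["model1"]], so no objects need to be created in the global environment from inside a function.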

How is formula specified in R's lm function

I have a function in R, let's say
myfunction <- function(formula,data)
Among other things, the function contains a call to lm(). Formula should include the covariates, and should be specified as
formula = x1 + x2 + ... + x_n
Data contains columns Z and W, where the response
y=data$Z/data$W
I only want to have formula including the covariates, since the function modifies the response variable for each iteration.
The call for lm() should then work with
lm(y~formula,data=data)
Why would you do that? It is cleaner to pass the whole formula to myfunction, for instance:
myfunction <- function(formula,data) {
data = data*2 # this is an example of data manipulation
lm(formula=formula, data=data)
}
Then use myfunction as you would use lm.
If you REALLY want to add complexity (for nothing?), you can also rely on the fact that lm will coerce a string passed as the formula argument into a proper formula object:
myfunction2 <- function(formula2,data) {
data = data*2 # this is an example of data manipulation
lm(formula=paste0("y~",formula2), data=data)
}
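For the case described in the question, where only the covariates are supplied and the response is computed from columns Z and W, one possible sketch passes the covariate names as a character vector and builds the formula with reformulate(). The function name myfunction3 and the renamed mtcars columns are assumptions for illustration only.

```r
# Hypothetical variant: `covariates` is a character vector of column names.
myfunction3 <- function(covariates, data) {
  data$y <- data$Z / data$W                       # response built inside the function
  f <- reformulate(covariates, response = "y")    # y ~ x1 + x2 + ...
  lm(f, data = data)
}

# Illustration only: mpg and wt stand in for Z and W.
d <- mtcars
names(d)[names(d) == "mpg"] <- "Z"
names(d)[names(d) == "wt"] <- "W"

fit <- myfunction3(c("cyl", "hp"), d)
coef(fit)
```

Because y is added as a column of the local copy of data, lm can find it through its data argument and nothing outside the function is modified.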

Using apply to loop over different datasets in a regression

I found this way of looping over variables in an lm() when the variable names are stored as characters (http://www.ats.ucla.edu/stat/r/pages/looping_strings.htm):
models <- lapply(varlist, function(x) {
lm(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
})
My first question is: Is there a more efficient/faster way?
What if I want to loop over different data instead of looping over variables?
Example:
reg1 <- lm(a~b, data=dataset1)
reg2 <- lm(a~b, data=dataset2)
Can I apply something similar to the code shown above? Using the substitute function for the data did not work.
Thank You!
The substitute in your example is used to construct the formula. If you want to apply lm to a number of data.frames, use:
lapply(list(dataset1, dataset2), lm, formula = a ~ b)
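A runnable sketch of the same idea, assuming the datasets are obtained by splitting the built-in mtcars data by the am column (a stand-in for dataset1 and dataset2):

```r
# Two data frames, one per value of am ("0" and "1").
datasets <- split(mtcars, mtcars$am)

# Same formula, different data on each pass.
fits <- lapply(datasets, function(d) lm(mpg ~ wt, data = d))

# Collect the coefficients into a matrix: one column per dataset.
coefs <- sapply(fits, coef)
coefs
```

Using a named list for the datasets means the fitted models and the coefficient columns keep the dataset names.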

User defined functions as formula input

Built-in functions in R can be used in formula objects, for example
reg1 = lm(y ~ log(x), data = data1)
How can I write my functions such that they can be used in formula objects?
fnMyFun = function(x) {
return(x^2)
}
reg2 = lm(y ~ fnMyFun(x), data = data1)
What you've got certainly works. One problem is that different modelling functions handle formulas in different ways. I think that as long as you return something that model.matrix can make sense of, you'll be fine. That would mean:
The function is vectorised, i.e. given a vector of length N, it returns a result also of length N.
It has to return an atomic vector or matrix (but not a list, or of type raw).
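A small sketch of both conditions in practice, using mtcars as stand-in data and a hypothetical function name fn_square:

```r
# Vectorised: a numeric vector of length N in, a numeric vector of length N out.
fn_square <- function(x) x^2

fit_custom <- lm(mpg ~ fn_square(wt), data = mtcars)

# The same model written with I() for comparison.
fit_inline <- lm(mpg ~ I(wt^2), data = mtcars)

coef(fit_custom)
```

The two fits should agree on the estimates; only the coefficient label differs, since lm names the term after the expression that appears in the formula.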
