using non-standard evaluation with formula Bis - r

Non-Standard evaluation in R. I want to send a formula to a function that uses lm.
I have a data frame with one response: y and many predictors. I will fit a model inside a function. The function will receive a filtering criteria as a string and the name of the predictor variable as a string. The response will enter as a name. The function will filter on the filter criteria, then fit the a model using the predictor variable that was sent to it as a string. I can't get the predictor string to work correctly.
This is very close to using non-standard evaluation with formula.
In fact I illustrate that solution, which gets me part of the way there.
Difference: I want to send a string with the value of my predictor instead of sending the predictor to the function.
Use Case: Eventually I will put this in a shiny ap and let the user select the predictor and response as well as the filter.
Here is what works:
# create a data frame.
n <- 100
levels_1 <- sample(c("a","b","c"),n,replace=TRUE)
levels_2 <- sample(c("a","b","c"),n,replace=TRUE)
d <-tibble(l_1 = levels_1 ,l_2 = levels_2, y = rnorm(n))
# A function that works
my_lm <- function(d,predictor,response,filter_criteria){
d1 <- d %>% filter(l_2 == 'a')
lm(y ~ l_1,data=d1)
}
my_lm(d,l_1,y,'a')
my_lm2 <- function(d,predictor,response,filter_criteria){
enquo_predictor <- enquo(predictor)
enquo_response <- enquo(response)
enquo_filter_criteria <- enquo(filter_criteria)
d1 <- d %>% filter(l_2 == !!filter_criteria)
form <- as.formula(paste(enquo_response, " ~ ", predictor)[2])
# form <- as.formula(paste(enquo_response, " ~ ", enquo_predictor)[2]) wrong way to do it.
lm(form,data=d1)
#lm(!!enqu_preditor ~ !!enquo_response,data=d1)
}
selected_var <- names(d)[1]
selected_var
filter_value <- 'a'
my_lm2(d,l_1,y,filter_value) # This works but is not what I want.
my_lm2(d,selected_var,y,filter_value) # This does not work but is what I want to work.

Related

Use string of independent variables within the lm function

I have a dataframe with many variables. I want to apply a linear regression to explain the last one with the others. So as I had to much to write I thought about creating a string with the independent variables e.g. Var1 + Var2 +...+ VarK. I achieved it pasting "+" to all column names except for the last one with this code:
ExVar <- toString(paste(names(datos)[1:11], "+ ", collapse = ''))
I also had to remove the last "+":
ExVar <- substr(VarEx, 1, nchar(ExVar)-2)
So I copied and pasted the ExVar string within the lm() function and the result looked like this:
m1 <- lm(calidad ~ Var1 + Var 2 +...+ Var K)
The question is: Is there any way to use "ExVar" within the lm() function as a string, not as a variable, to have a cleaner code?
For better understanding:
If I use this code:
m1 <- lm(calidad ~ ExVar)
It is interpreting ExVar as a independent variable.
The following will all produce the same results. I am providing multiple methods because there is are simpler ways of doing what you are asking (see examples 2 and 3) instead of writing the expression as a string.
First, I will generate some example data:
n <- 100
p <- 11
dat <- array(rnorm(n*p),c(n,p))
dat <- as.data.frame(dat)
colnames(dat) <- paste0("X",1:p)
If you really want to specify the model as a string, this example code will help:
ExVar <- toString(paste(names(dat[2:11]), "+ ", collapse = ''))
ExVar <- substr(ExVar, 1, nchar(ExVar)-3)
model1 <- paste("X1 ~ ",ExVar)
fit1 <- lm(eval(parse(text = model1)),data = dat)
Otherwise, note that the 'dot' notation will specify all other variables in the model as predictors.
fit2 <- lm(X1 ~ ., data = dat)
Or, you can select the predictors and outcome variables by column, if your data is structured as a matrix.
dat <- as.matrix(dat)
fit3 <- lm(dat[,1] ~ dat[,-1])
All three of these fit objects have the same estimates:
fit1
fit2
fit3
if you have a dataframe, and you want to explain the last one using all the rest then you can use the code below:
lm(calidad~.,dat)
or you can use
lm(rev(dat))#Only if the last column is your response variable
Any of the two above will give you the results needed.
To do it your way:
EXV=as.formula(paste0("calidad~",paste0(names(datos)[-12],collapse = '+')))
lm(EXV,dat)
There is no need to do it this way since the lm function itself will do this by using the first code above.

Pass df column names to nested equation in Graph Printing Function

I need some clarification on the primary post on Passing a data.frame column name to a function
I need to create a function that will take a testSet, trainSet, and colName(aka predictor) as inputs to a function that prints a plot of the dataset with a GAM model trend line.
The issue I run into is:
plot.model = function(predictor, train, test) {
mod = gam(Response ~ s(train[[predictor]], spar = 1), data = train)
...
}
#Function Call
plot.model("Predictor1", 1.0, crime.train, crime.test)
I can't simply pass the predictor as a string into the gam function, but I also can't use a string to index the data frame values as shown in the link above. Somehow, I need to pass the colName key to the game function. This issue occurs in other similar scenarios regarding plotting.
plot <- ggplot(data = test, mapping = aes(x=predictor, y=ViolentCrimesPerPop))
Again, I can't pass a string value for the column name and I can't pass the column values either.
Does anyone have a generic solution for these situations. I apologize if the answer is buried in the above link, but it's not clear to me if it is.
Note: A working gam function call looks like this:
mod = gam(Response ~ s(Predictor1, spar = 1.0), data = train)
Where the train set is a data frame with column names "Response" & "Predictor".
Use aes_string instead of aes when you pass a column name as string.
plot <- ggplot(data = test, mapping = aes_string(x=predictor, y=ViolentCrimesPerPop))
For gam function:: Example which is copied from gam function's documentation. I have used vector, scalar is even easier. Its just using paste with a collapse parameter.
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
# String manipulate for formula
formula <- as.formula(paste("y~s(", paste(colnames(dat)[2:5], collapse = ")+s("), ")", sep =""))
b <- gam(formula, data=dat)
is same as
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)

Inner functions pulls call from outer function and causes error

I'm using a function from the library leaps within another function. The last two rows of the leaps function in question goes:
rval$call <- sys.call(sys.parent())
rval
This apparently causes the call to the outer function to be passed to rval$call. And the actual call to the regsubsets function is needed as an argument later on.
Below an example to illustrate:
library(leaps)
#Create some sample data to perform a regression on
inda <- rnorm(100)
indb <- rnorm(100)
dep <- 2 + 0.1*inda + 0.2*indb + rnorm(100, sd = 0.3)
dfk <- data.frame(dep=dep, inda = inda, indb = indb)
#Create some arbitrary outer function
test <- function(dependent, data){
best.fit <- regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)
return(best.fit)
}
#Call outer function
best <- test("dep", dfk)
best$call #Returns "test("dep", dfk)"
So best$call will contain the call to the outer function (test), and not the call to the inner (regsubsets) function. As it's not really an option to change the inner function, is there any way of avoiding this problem?
EDIT:
One way around the problem could be something like this:
test <- function(dependent, data){
thecall <- 'regsubsets(as.formula(paste0(dependent, " ~ .")), data = data, nvmax = 2)'
best.fit <- eval(parse(text = thecall))
#best.fit$call <- [some transformation of thecall
return(best.fit)
}
EDIT2:
The reason I need to access what's inside $call is that it's needed in a predict function that I copied from Introduction to statitical learning:
predict.regsubsets <- function(regsubset_model, newdata, id, ...){
form <- as.formula(regsubset_model$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(regsubset_model, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
In the second line it uses $call
I’m still not entirely clear on how this is going to be used but in the case of your test function, you could write the following code:
test = function (dependent, data) {
regsubsets_call = bquote(regsubsets(.(as.formula(paste0(dependent, " ~ ."))),
data = .(substitute(data)), nvmax = 2))
best_fit = eval(regsubsets_call)
best_fit$call = regsubsets_call
best_fit
}
However, the result may not work with downstream functions the package provides (though, realistically, it probably will; I’m guessing summary.regsubsets only uses it to print the call).
What’s going on here?
bquote constructs an unevaluated R expression; it’s similar to quote but it allows you to interpolate values (similar to substitute). substitute(data) means that, rather than putting the actual data.frame into the call (which would lead to a very unwieldy output, it puts the variable name (or expression) the user passed to test. So if the user called it as test('mpg', mtcars), then the resulting expression would be
regsubsets(mpg ~ ., data = mtcars, nvmax = 2)
The resulting call object is then (a) evaluated via eval, and (b) stored in the resulting $call.
Incidentally, the formula can (and, as far as I’m concerned, should) be constructed in the same way; no need to parse a string:
as.formula(bquote(.(as.name(dependent)) ~ .))
Taken together, the whole expression would then become:
formula = as.formula(bquote(.(as.name(dependent)) ~ .))
regsubsets_call = bquote(regsubsets(.(formula), data = .(substitute(data)), nvmax = 2))

entering variable name as argument of a function

I'm trying to make a function that will run and compare a set of models given a dataset and a variable name (essentially to be able to change just one model set and have them apply to all relevant dependent variables--selecting an a priori modelset to compare rather than using a data-dredging existing function like glmulti). A simple example:
RunModelset<- function(dataset, response)
{
m1 <- lm(formula=response ~ 1, data=dataset)
m2 <- lm(formula=response ~ 1 + temperature, data=dataset)
comp <- AICctab(m1,m2, base = T, weights = T, nobs=length(data))
return(comp)
}
If I manually enter a specific variable name within the function, it runs the models correctly. However, using the code above and entering a text value for the response argument doesn't work:
RunModel(dataset=MyData,response="responsevariablename")
yields an error: invalid type (NULL) for variable 'dataset$response', which I interpret to mean it isn't finding the column I'm telling it to use. My problem must be in how R inserts a text value as an argument in the function.
How do I enter the response variable name so R knows that "formula=response ~" becomes "formula=dataset$responsevariablename ~"?
ETA Working answer based on this solution:
RunModel<- function(dataset, response)
{
resvar <- eval(substitute(response),dataset)
m1 <- lm(formula=resvar ~ 1, data=dataset)
m2 <- lm(formula=resvar ~ 1 + R.biomass, data=dataset)
comp <- AICctab(m1,m2, base = T, weights = T, nobs=length(data))
return(comp)
}
RunModel(dataset=MyData,response=responsevariablename)
NB - this didn't work when I had quotes on the response argument.
You should be able to use match.call() to achieve this.
See this post

How to use one variable in regression with many independent variables in lm()

I need to reproduce this code using all of these variables.
composite <- read.csv("file.csv", header = T, stringsAsFactors = FALSE)
composite <- subset(composite, select = -Date)
model1 <- lm(indepvariable ~., data = composite, na.action = na.exclude)
composite is a data frame with 82 variables.
UPDATE:
What I have done is found a way to create an object that contains only the significantly correlated variables, to narrow the number of independent variables down.
I have a variable now: sigvars, which is the names of an object that sorted a correlation matrix and picked out only the variables with correlation coefficients >0.5 and <-0.5. Here is the code:
sortedcor <- sort(cor(composite)[,1])
regvar = NULL
k = 1
for(i in 1:length(sortedcor)){
if(sortedcor[i] > .5 | sortedcor[i] < -.5){
regvar[k] = i
k = k+1
}
}
regvar
sigvars <- names(sortedcor[regvar])
However, it is not working in my lm() function:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
Error: Error in model.frame.default(formula = data.matrix(composite[1]) ~ sigvars, : variable lengths differ (found for 'sigvars')
Think about what sigvars is for a minute...?
After sigvars <- names(sortedcor[regvar]), sigvars is a character vector of column names. Say your data have 100 rows and 5 variables come out as significant using the method you've chosen (which doesn't sound overly defensible to be). The model formula you are using will result in composite[, 1] being a vector of length 100 (100 rows) and sigvars being a character vector of length 5.
Assuming you have the variables you want to include in the model, then you could do:
form <- reformulate(sigvars, response = names(composite)[1])
model1 <- lm(form, data = composite)
or
model1 <- lm(composite[,1] ~ ., data = composite[, sigvars])
In the latter case, do yourself a favour and write the name of the dependent variable into the formula instead of composite[,1].
Also, you don't seem to have appreciated the difference between [i] and [i,j] for data frames, hence you are doing data.matrix(composite[1]) which is taking the first component of composite, leaving it as a data frame, then converting that to a matrix via the data.matrix() function. All you really need is just the name of the dependent variable on the LHS of the formula.
The error is here:
model1 <- lm(data.matrix(composite[1]) ~ sigvars, data = composite)
The sigvars is names(data). The equation is usually of the form lm(var1 ~ var2+var3+var4), you however have it as lm(var1 ~ var2 var3 var4).
Hopefully that helps.

Resources