I have a function that takes a data.frame and returns its residualized version: every column is regressed on a chosen predictor variable and replaced by the residuals.
residuals.DF = function(data, resid.var, suffix="") {
lm_f = function(x) {
x = residuals(lm(data=data, formula= x ~ eval(parse(text=resid.var))))
}
resid = data.frame(apply(data,2,lm_f))
colnames(resid) = paste0(colnames(data),suffix)
return(resid)
}
set.seed(31233)
df = data.frame(Age = c(1,3,6,7,3,8,4,3,2,6),
Var1 = c(19,45,76,34,83,34,85,34,27,32),
Var2 = round(rnorm(10)*100))
df.res = residuals.DF(df, "Age", ".test")
df.res
Age.test Var1.test Var2.test
1 -1.696753e-17 -25.1351351 -90.20582
2 -1.318443e-19 -0.8108108 31.91892
3 -5.397735e-18 27.6756757 84.10603
4 -5.927747e-18 -15.1621622 -105.83160
5 -3.807699e-18 37.1891892 -57.08108
6 -6.457759e-18 -16.0000000 -25.76923
7 5.117344e-17 38.3513514 -65.01871
8 -3.807699e-18 -11.8108108 35.91892
9 -3.277687e-18 -17.9729730 97.85655
10 -5.397735e-18 -16.3243243 94.10603
This works fine. However, I often need to use the eval(parse()) combo when working with variable inputs to lm(), so I decided to write a wrapper function:
#Wrapper function for convenience for evaluating strings
evalparse = function(string) {
eval(parse(text=string))
}
This works fine when used alone, e.g.:
> evalparse("5+5")
[1] 10
However, if one uses it in the above function, one gets:
> df.res = residuals.DF(df, "Age", ".test")
Error in eval(expr, envir, enclos) : object 'Age' not found
I figure this is because the wrapper function means that the string gets evaluated in its own environment where the chosen variable is missing. This does not happen when using eval parse combo because it then happens in the lm() environment where the chosen variable is not missing.
Is there some clever solution to this problem? A better way of using dynamic formulas in lm()? Otherwise I will have to keep typing eval(parse(text=object)).
Anytime you're trying to perform operations that modify the contents of a formula, you should use update because it is designed for this purpose.
In your case, you want to modify your function as follows:
residuals.DF = function(data, resid.var, suffix="") {
lm_f = function(x) {
x = residuals(lm(data=data, formula= update(x ~ 0, paste0("~",resid.var))))
}
resid = data.frame(apply(data,2,lm_f))
colnames(resid) = paste0(colnames(data),suffix)
return(resid)
}
Basically, update (or the update.formula method specifically) takes a formula as its first argument, and then allows for modifications based on its second argument. To get a handle on it, check out the following examples:
f <- y ~ x
f
# y ~ x
update(f, ~ z)
# y ~ z
update(f, x ~ y)
# x ~ y
update(f, "~ x + y")
# y ~ x + y
update(f, ~ . + z + w)
# y ~ x + z + w
x <- "x"
update(f, paste0("~",x))
# y ~ x
As you can see, the second argument can be a formula or character string containing one or more variables. This greatly simplifies the creation of a dynamically modified formula where you are only trying to change one part of the formula.
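When the whole formula is built from strings, base R's reformulate() may be a convenient alternative to update(). What follows is only a minimal sketch and not part of the original answer: the name residuals_df is hypothetical, and the data reuse the Age and Var1 columns from the question.

```r
# Sketch: residualize every column of `data` on the column named in `resid_var`.
residuals_df <- function(data, resid_var, suffix = "") {
  lm_f <- function(x) {
    # reformulate() builds x ~ <resid_var>. Its default environment is this
    # function's frame, so lm() finds `x` here and the predictor in `data`.
    # (Assumes `data` has no column that is itself named "x".)
    f <- reformulate(resid_var, response = "x")
    residuals(lm(f, data = data))
  }
  resid <- as.data.frame(lapply(data, lm_f))
  colnames(resid) <- paste0(colnames(data), suffix)
  resid
}

df <- data.frame(Age  = c(1, 3, 6, 7, 3, 8, 4, 3, 2, 6),
                 Var1 = c(19, 45, 76, 34, 83, 34, 85, 34, 27, 32))
df_res <- residuals_df(df, "Age", ".test")
```

Using lapply() instead of apply() avoids coercing the data.frame to a matrix first, which matters once the columns are not all numeric.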
I use speedglm to fit a GLM to data. When I call the function directly, the code works as expected, but when I create a function to fit the model, I get an error that an argument is not found.
The variable (w in the example below) clearly exists in the scope of the function but it seems that the variable is evaluated only later within the speedglm function where w is no longer available or so I think. This is where I start questioning my current understanding of R.
Did I make an error while creating the function, does speedglm use some weird trick to scope the variable (source code here) that breaks the normal (?) logic or do I have a wrong understanding of how R functions work?
I am trying to understand this behavior and also fix my train_glm function to make it work with speedglm and weights.
MWE
library(speedglm)
# works as expected
m1 <- speedglm(wt ~ cyl, data = mtcars, weights = mtcars$wt)
# define a small helper function that just forwards its arguments
train_glm <- function(f, d, w) {
speedglm(formula = f, data = d, weights = w)
}
# does not work
m <- train_glm(wt ~ cyl, d = mtcars, w = mtcars$wt)
#> Error in eval(extras, data, env) : object 'w' not found
Even weirder, when I vary the code I find the following:
# removing the weights as a base case -> WORKS
train_glm3 <- function(f, d) {
speedglm(formula = f, data = d)
}
m3 <- train_glm3(wt ~ cyl, d = mtcars)
# works
# hardcoding the weights inside the function -> BREAKS
train_glm4 <- function(f, d) {
speedglm(formula = f, data = d, weights = d$wt)
}
m4 <- train_glm4(wt ~ cyl, d = mtcars)
# Error in eval(extras, data, env) : object 'd' not found
# creating a new dataset and hardcoding the weights inside the function
# but using the name of the dataset at the highest environment -> WORKS
train_glm5 <- function(f, d) {
speedglm(formula = f, data = d, weights = mtcars2$wt)
}
mtcars2 <- mtcars
m5 <- train_glm5(wt ~ cyl, d = mtcars2)
# works
The solution (thanks to @Mike for the hint) is to evaluate the code either by using the solution given by this answer or by using do.call like so:
library(speedglm)
train_glm_docall <- function(f, d, w) {
do.call(
speedglm,
list(
formula = f,
data = d,
weights = w
)
)
}
m2 <- train_glm_docall(f = wt ~ cyl, d = mtcars, w = mtcars$wt)
class(m2)
#> [1] "speedglm" "speedlm"
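Another option that is sometimes used for this kind of error is to change the environment of the formula, so that the non-standard evaluation of weights can see the function's local variables. The sketch below uses base lm() rather than speedglm, so that it runs without extra packages. It assumes, without having verified it here, that speedglm resolves weights the same way (the "eval(extras, data, env)" in the error message suggests so). The name fit_weighted is hypothetical.

```r
# Sketch with base lm(); speedglm is assumed (not verified here) to look up
# `weights` in the data first and then in the formula's environment.
fit_weighted <- function(f, d, w) {
  # The formula was created by the caller, so its environment does not
  # contain `w`. Re-pointing it at this function's frame lets the model
  # frame construction find `w`.
  environment(f) <- environment()
  lm(formula = f, data = d, weights = w)
}

m <- fit_weighted(mpg ~ cyl, d = mtcars, w = mtcars$wt)
coef(m)
```

The trade-off is that any variable the formula relied on from its original environment (and that is not in the data) is no longer reachable unless it is also visible from the function, whereas do.call passes the evaluated weights directly.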
I'm working with some old code for the first time in a few years, and have realized it's broken where it uses lm(). The problem is replicated in this simple example:
df <- data.frame(x=c(1,2,3), y=c(1,2,3))
lm(df$y ~ 1/(0.00005 * df$x))
Running the above lm() throws an error:
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
I can't figure out why this is happening - can anyone help me? The code worked last time I used it. Thanks!
EDIT: below, akrun suggested I use transform() within lm(). This works in the simple case above, but I'm actually trying to do this within a function like
test <- function(bmp, M, Q, c=5e-5)
{
bmp.w <- which(bmp <= Q)
if (length(bmp.w) > 1 & length(unique(bmp[bmp.w]))>1)
{
m <- lm(M[bmp.w] ~ I(1 / (c * (bmp[bmp.w]))))
return(abs(summary(m)$coefficients[2, c('Estimate', 'Std. Error')]))
} else
{
return(c(0.5,3))
}
}
How would I use transform() in this function? I've tried m <- lm(M[bmp.w] ~ x, data=transform(x=1 / (c * (bmp[bmp.w])))), but that does not work.
EDIT2: The easy solution is to just evaluate the term before calling lm(), like
test <- function(bmp, M, Q, c=5e-5)
{
bmp.w <- which(bmp <= Q)
if (length(bmp.w) > 1 & length(unique(bmp[bmp.w]))>1)
{
#m <- lm(M[bmp.w] ~ I(1 / (c * (bmp[bmp.w]))))
gah <- 1 / (c * (bmp[bmp.w]))
m <- lm(M[bmp.w] ~ I(gah))
return(abs(summary(m)$coefficients[2, c('Estimate', 'Std. Error')]))
} else
{
return(c(0.5,3))
}
}
but this doesn't solve how to include transform().
We can do this outside the formula, i.e. specify the formula as y ~ x, but change the value of 'x' with transform and pass the result to the data argument:
lm(y ~ x, data = transform(df, x = 1/(0.00005 * x)))
#Call:
#lm(formula = y ~ x, data = transform(df, x = 1/(5e-05 * x)))
#Coefficients:
#(Intercept) x
# 3.6923077 -0.0001385
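The same idea can be carried into a function that receives plain vectors, by first collecting them in a data.frame so that transform() has something to work on. This is only a sketch of how the function from the question might look: the name fit_inverse and the columns of the temporary data.frame are hypothetical, while the arguments follow the question.

```r
# Sketch: the question's function, rewritten around transform().
fit_inverse <- function(bmp, M, Q, c = 5e-5) {
  keep <- which(bmp <= Q)
  if (length(keep) > 1 && length(unique(bmp[keep])) > 1) {
    # Put the selected observations in a data.frame for transform()
    d <- data.frame(M = M[keep], bmp = bmp[keep])
    # `c` is looked up in this function's frame because `d` has no such column
    m <- lm(M ~ x, data = transform(d, x = 1 / (c * bmp)))
    abs(summary(m)$coefficients[2, c("Estimate", "Std. Error")])
  } else {
    # R still finds the function c() here, because only functions are
    # considered when a name is used in a call
    c(0.5, 3)
  }
}

bmp <- c(1, 2, 3, 4, 5)
M   <- c(2.1, 1.2, 0.9, 0.8, 0.6)
fit_inverse(bmp, M, Q = 10)
```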
You can always use I:
df <- data.frame(x=c(1,2,3), y=c(1,2,3), idx = 3:1)
lm(y ~ I(1/(0.00005 * x)), df)
#R>
#R> Call:
#R> lm(formula = y ~ I(1/(5e-05 * x)), data = df)
#R>
#R> Coefficients:
#R> (Intercept) I(1/(5e-05 * x))
#R> 3.6923077 -0.0001385
lm(y ~ I(1/(0.00005 * x[idx])), df)
#R>
#R> Call:
#R> lm(formula = y ~ I(1/(5e-05 * x[idx])), data = df)
#R>
#R> Coefficients:
#R> (Intercept) I(1/(5e-05 * x[idx]))
#R> 0.3076923 0.0001385
From ?I:
In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators. This is interpreted as a symbol by terms.formula.
Notes
In the above, I use the data argument of lm, which ?lm describes as:
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
This way you can use transform on a data.frame and pass the result to the data argument of lm, as akrun does.
I'm trying to understand how to use Formula objects. Let's say I wanted to make my own 2SLS function and want to divide the objects I'm working with into 4 main groups: y = response; X = exogenous variables; E = endogenous variables; Z = instruments.
I want to be able to construct these objects without making extra copies of the data unnecessarily (say, large N and large number of instruments would make this prohibitively costly in memory usage/time). I also want to take into account NAs from across the data.
Let's use a formula syntax similar to felm (I tried looking at the parsing code there, but couldn't follow it).
frml = y ~ x1 + x2 + x3*x4 | (e1 | e2 ~ z1 + z2)
library(Formula)
N = 12 # be divisible by 6
data = data.frame(y=rnorm(N), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N),
x4=factor(rep(1:2, N/2)), e1=rnorm(N), e2=rnorm(N),
z1=rnorm(N), z2=factor(rep(1:3, N/3)))
data[2,'y'] = data[3,'x1'] = data[4,'e1'] = data[5,'z2'] = NA
parse_frml = function(frml, data, subset=NULL) {
frml = as.Formula(frml)
# does not take into account NAs at all
y = model.part(frml, data=data, subset=subset, lhs=1)
# does not take into account NAs in other variables (y, Z, E)
X = model.matrix(frml, data=data, subset=subset, lhs=0, rhs=1)
Z = model.matrix(frml, data=data, subset=subset, lhs=0, rhs=2)
#E = # I can't figure this out at all
return(list(y=y, X=X, E=E, Z=Z))
}
Now, I can do something like
mf = model.frame(frml, data=data, subset=subset, lhs=1, rhs=1)
which will take into account NAs in y and X, but ignores E and Z. Further, this copies the data into the mf, and then copies again into y and X.
So, I have 2 questions and 1 constraint:
1. How do I get E? (a matrix for the LHS of the 2nd equation)
2. How do I take into account NAs from across the data used by frml in all matrices?
Constraint: while minimizing the number of copies of the data (ideally just copied into the matrices).
More generally, what's a good resource for understanding Formula, formula, terms, and the like? I've not found, e.g., the Formula package documentation to be super helpful.
This isn't perfect, but it works. It's a shame how there is almost no information on how to actually handle and manipulate formulas in R code. My solution depends on formula.tools
library(Formula)       # as.Formula(), model.part()
library(formula.tools) # rhs.vars()
parse_frml = function(frml, data, subset=NULL) {
frml = as.Formula(frml)
vars = all.vars(frml)
other_vars = c(all.vars(formula(frml, lhs=1, rhs=1)),
rhs.vars(formula(frml, lhs=0, rhs=2)))
e_vars = setdiff(vars, other_vars)
valid = which(complete.cases(data[, vars]))
if (!is.null(subset)) {
if (is.logical(subset)) {
subset = which(subset)
}
valid = intersect(valid, subset)
}
y = model.part(frml, data=data[valid,], lhs=1)
X = model.matrix(frml, data=data[valid,], lhs=0, rhs=1)
Z = model.matrix(frml, data=data[valid,], lhs=0, rhs=2)
E = data.matrix(data[valid, e_vars])
return(list(y=y, X=X, E=E, Z=Z))
}
I suspect that subsetting data with valid each time is rather expensive. But in the above test case, it seems to work.
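For comparison, the four pieces can also be pulled apart with base R only, by indexing into the formula as a call object and filtering complete cases once. This is a rough sketch and not the Formula-based approach: the name parse_iv_formula is hypothetical, it assumes the formula has exactly the shape y ~ X-part | (E-part ~ Z-part) with no dot, it does not support subset, and it still makes intermediate copies of the data.

```r
# Sketch: split y ~ X-part | (E-part ~ Z-part) using base R only.
parse_iv_formula <- function(frml, data) {
  rhs   <- frml[[3]]        # the call `|`(X-part, (E-part ~ Z-part))
  inner <- rhs[[3]][[2]]    # strip the parentheses: E-part ~ Z-part
  y_var  <- all.vars(frml[[2]])
  e_vars <- all.vars(inner[[2]])
  # One complete-case filter across every variable the formula mentions
  cc <- na.omit(data[all.vars(frml)])
  list(
    y = cc[[y_var]],
    X = model.matrix(eval(call("~", rhs[[2]])), data = cc),
    E = data.matrix(cc[e_vars]),
    Z = model.matrix(eval(call("~", inner[[3]])), data = cc)
  )
}

set.seed(1)
N <- 12
data <- data.frame(y = rnorm(N), x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N),
                   x4 = factor(rep(1:2, N/2)), e1 = rnorm(N), e2 = rnorm(N),
                   z1 = rnorm(N), z2 = factor(rep(1:3, N/3)))
data[2, 'y'] <- data[3, 'x1'] <- data[4, 'e1'] <- data[5, 'z2'] <- NA
parts <- parse_iv_formula(y ~ x1 + x2 + x3*x4 | (e1 | e2 ~ z1 + z2), data)
```

Because all four objects are built from the same filtered data.frame, they share the same rows, which is the NA handling the question asks for.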
I have a data frame and a formula stored in variables:
> d <- data.frame(cls=1, foo=2, bar=3)
> f <- formula(cls ~ .)
I'd like to remove one variable from the RHS of this formula programmatically (in my code, the name of this variable would be passed somewhere as a string). I tried using update.formula:
> update(f, .~.-foo)
Error in terms.formula(tmp, simplify = TRUE) :
'.' in formula and no 'data' argument
Then I tried providing the data argument:
> update(f, .~.-foo, data=d)
Error in terms.formula(tmp, simplify = TRUE) :
'.' in formula and no 'data' argument
I know the above would work if the initial formula didn't have a dot on the right side:
> f <- formula(cls ~ foo + bar)
> update(f, .~.-foo)
cls ~ bar
How do I remove a variable from RHS of a formula if I can't ensure that RHS doesn't contain a dot?
update(terms(f, data = d), . ~ . - foo)
# cls ~ bar
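Since the question mentions that the variable name arrives as a string, the one-liner above can be wrapped as follows. This is a small sketch: the helper name drop_rhs_var is hypothetical, and it assumes the name is syntactically valid (otherwise it would need backticks).

```r
# Sketch: remove the variable named in `var` from the RHS of `f`.
# terms() expands the dot using `data`, so update() no longer sees a '.'.
drop_rhs_var <- function(f, data, var) {
  update(terms(f, data = data), paste(". ~ . -", var))
}

d <- data.frame(cls = 1, foo = 2, bar = 3)
f <- formula(cls ~ .)
drop_rhs_var(f, d, "foo")
```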
I have a model-building function where the formula can contain some special functions, and I would like it to work so that if the user includes such a function several times, only the first occurrence is used, with a warning. For example, in lm, if we use the same variable twice, the second one is dropped:
y<-1:3
x<-1:3
lm(y~x+x)
Call:
lm(formula = y ~ x + x)
Coefficients:
(Intercept) x
0 1
This works because the function terms used in model.frame removes variables with identical names. But in my case I'm working with functions inside the formula which don't necessarily have identical argument lists, and I would like this behaviour to extend so that the arguments of these functions don't matter:
model(y~x+fn("x"))
(Intercept) x temp
1 1 1 1
2 1 2 1
3 1 3 1
model(y~x+fn("x")+fn("x")) #identical function calls
(Intercept) x temp
1 1 1 1
2 1 2 1
3 1 3 1
model(y~x+fn("x")+fn("z")) #function with different argument value
Error in attr(all_terms, "variables")[[1 + ind_fn]] :
subscript out of bounds
Here is an example function (highly simplified) I used above:
model <- function(formula, data) {
#the beginning is pretty much copied from lm function
mf <- match.call(expand.dots = FALSE)
mf <- mf[c(1L, match(c("formula", "data"), names(mf), 0L))]
mf[[1L]] <- as.name("model.frame")
mf$na.action <- as.name("na.pass")
all_terms <- if (missing(data)){
terms(formula, "fn")
} else terms(formula, "fn", data = data)
#find the position of the function call in the formula
ind_fn <- attr(all_terms, "specials")$fn
#update the formula by removing the "fn" part
if(!is.null(ind_fn)){
fn_term <- attr(all_terms, "variables")[[1 + ind_fn]]
formula <- update( formula, paste(". ~ .-", deparse(fn_term,
width.cutoff = 500L, backtick = TRUE)))
mf$formula<-formula
}
# build y and X
mf <- eval(mf, parent.frame())
y <- model.response(mf, "numeric")
mt <- attr(mf, "terms")
X <- model.matrix(mt, mf)
#if fn was in formula do something with it
if (!is.null(ind_fn)){
foobar<-function(type=c("x","z")){
if(type=="x"){
rep(1,nrow(X))
} else rep(0,nrow(X))
}
fn_term[[1]]<-as.name("foobar")
temp<-eval(fn_term)
X<-cbind(X,temp)
}
X
}
I could check the names of the specials (the function calls) and rename them to be identical to the first occurrence, but I was wondering if there is a cleverer way of dealing with this?
I wasn't able to get your code to work correctly, but assuming I've understood your task, perhaps something like this accomplishes what you're after.
f <- y ~ x + fn("x") + fn("z") + z + fn('a')
# get list of terms
vars <- as.list(attr(terms(f), 'variables'))
# get those terms that are duplicate calls
redundant <- vars[sapply(vars, is.call) & duplicated(sapply(vars, function(x) as.list(x)[[1]]))]
# remove the duplicate calls from the formula
update(f, paste(". ~ .", paste(sapply(redundant, deparse), collapse='-'), sep='-'))
# y ~ x + fn("x") + z
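The snippet above can be packaged into a helper that also emits the warning the question asks for and copes with formulas that have nothing to drop. This is a sketch under my own assumptions: the name drop_repeated_calls and its special argument are hypothetical, and only calls to the one named function are considered, so other calls such as I() or log() are left alone.

```r
# Sketch: keep only the first call to `special` in a formula, with a warning.
drop_repeated_calls <- function(f, special = "fn") {
  # All variables of the formula, without the leading `list` symbol
  vars <- as.list(attr(terms(f), "variables"))[-1]
  is_special <- vapply(
    vars,
    function(v) is.call(v) && identical(v[[1]], as.name(special)),
    logical(1)
  )
  # Every call to `special` after the first one is redundant
  redundant <- vars[is_special][-1]
  if (length(redundant) == 0L) return(f)
  dropped <- vapply(redundant,
                    function(v) paste(deparse(v), collapse = ""),
                    character(1))
  warning("dropping repeated calls: ", paste(dropped, collapse = ", "))
  update(f, paste(". ~ . -", paste(dropped, collapse = " - ")))
}

f <- y ~ x + fn("x") + fn("z") + z + fn('a')
suppressWarnings(drop_repeated_calls(f))
```

Note that terms() does not evaluate fn, so the helper works even if fn is not defined at that point.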