Remove Inf values from formula before lm in R - r

Let's say I have use mtcars dataset to set arbitrary formula:
data(mtcars)
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
I would like to use that formula inside lm function, but before that, I would like to remove potential rows that contain Inf, NaN and NA. From example if disp / hp result in any Inf values I would like to remove rows that contain it. I know I can do that by generate new variable first , remove Inf and then run lm with formula, but I would like to do that using formula terms, since it is part of shiny application and formula is input.
My try:
formulaTerms <- terms(myFormula)
formulaTerms <- gsub("I", "", labels(formulaTerms))
formulaTermsRatio <- formulaTerms[grep("/", formulaTerms)]
mtcarsDT <- setDT(mtcars)
mtcarsDT <- mtcarsDT[, formulaTermsRatio[1] := sym(formulaTermsRatio[1])]

Use drop.terms. Assuming that each term is represented by a single column in the model matrix (i.e. no factors with > 2 levels) we compute the model matrix mm and find the column numbers, wx, of the bad columns. Then use drop.terms to drop those columns from the terms object and extract the formula from the revised terms object.
mtcars[1, 3] <- Inf
# is.na is TRUE for NA or NaN; is.infinite is TRUE for Inf or -Inf
is.bad <- function(x) any(is.na(x) | is.infinite(x))
fo_terms <- terms(myFormula) # myFormula is taken from question
mm <- model.matrix(myFormula, mtcars)
wx <- which(apply(mm[, -1], 2, is.bad))
fo_terms2 <- drop.terms(fo_terms, wx, keep.response = TRUE)
fo2 <- formula(fo_terms2)
myFormula
## mpg ~ cyl + I(disp/hp) + I(wt^2) + I((qsec + vs)/gear)
fo2
## mpg ~ cyl + I(wt^2) + I((qsec + vs)/gear)
Update
If you want to remove bad rows rather than terms from the formula then:
lm(myFormula, mtcars, subset = !apply(mm, 1, is.bad))
Note that lm will automatically remove rows with NAs and NaNs (dependintg on the na.action argument) so in this case you could simplify is.bad to only check for Inf and -Inf.
Another approach would be to replace Inf and -Inf with NA.
mtcars[is.infinite(mtcars)] <- NA
and then perform lm normally.

You can remove these values from the data you're regressing on. Inf will occur where hp==0 or gear==0.
data(mtcars)
df <- mtcars
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
df <- df[!(df$hp==0 | df$gear==0),]
lm(myFormula,df)
> lm(myFormula,df)
Call:
lm(formula = myFormula, data = df)
Coefficients:
(Intercept) cyl I(disp/hp) I(wt^2) I((qsec + vs)/gear)
35.5847 -1.9639 1.0707 -0.3671 -0.1699

Related

How to reference objects in a different environment inside a formula

I'm using fixest::feols() and I have a function I want to pass an argument to in order to subset the data using the subset = argument. However when keep getting the error: The argument 'subset' is a formula whose variables must be in the data set given in argument 'data'.
I have tried the following code:
library(fixest)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = ~ hp > substitute(hp.c.off))
}
my_fun(data = cars, 150)
My expected outcome would be the same as if one typed:
feols(mpg ~ disp + drat,
data = cars,
subset = ~ hp > 150)
I know I have to replace the value of hp.c.off before passing it onto a formula. And one could do this by creating a string expression first and then using as.formula() however, I was wondering if there is a better way to do programmatically build the expression that didn't require creating a string expression first and then converting it into a formula.
Thanks!
1) Create the formula as a character string and then convert it to a formula.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = as.formula(paste("~ hp >", hp.c.off)))
}
2) or just don't use the subset= argument and instead use the data argument with subset.
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = subset(data, hp > hp.c.off))
}
3) or use the fact that subset= can be a logical vector
my_fun <- function(data, hp.c.off) {
feols(mpg ~ disp + drat,
data = data,
subset = data$hp > hp.c.off)
}
You can use rlang::new_formula(), with rlang::expr() to quote the rhs and !!rlang::enexpr() to capture and inject the hp.c.off argument.
I don’t have fixest installed, but this demonstrates building the formula inside a function:
library(rlang)
cars <- mtcars
my_fun <- function(data, hp.c.off) {
new_formula(lhs = NULL, rhs = expr(hp > !!enexpr(hp.c.off)))
}
my_fun(data = cars, 150)
# ~hp > 150
# <environment: 0x1405e38>
Simple option is to pass an expression as argument to the function
my_fun <- function(data,expr = ~ hp > 150){
feols(mpg ~ disp + drat,
data = data,
subset = expr)
}
-testing
> my_fun(data = cars)
OLS estimation, Dep. Var.: mpg
Observations: 13
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.414923 8.019808 2.919636 0.015310 *
disp -0.021349 0.008284 -2.577276 0.027545 *
drat -0.201284 2.014207 -0.099932 0.922373
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 2.16851 Adj. R2: 0.300667

Problem extracting model covariates for model summary table

I'm a graduate student using a linear regression (count) model to understand drivers of fish movement into and out of tidal wetlands. I am currently trying to generate a publication-worthy model summary table in r. I've been using the sel.table function which has been working well for this purpose.
However, I've been unable to generate a column that contains the individual model formulas. Below is my code which is based off of some nice instructions for using the MuMIn package. https://sites.google.com/site/rforfishandwildlifegrads/home/mumin_usage_examples
So to recap, my question pertains to the last line of code below,
How can I insert model formulas into a model selection table.**
install.packages("MuMIn")
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
#add a column for model names
sel.table$Model <- rownames(sel.table)
#replace model name with formulas
for(i in 1:nrow(sel.table)) sel.table$Model[i]<- as.character(formula(paste(sel.table$Model[i])))[3]
#Any help on this topic would be greatly appreciated!
UPDATED CODE
My method of pulling out model names is pretty clunky but otherwise this code seems to generate what I intended (a complete model selection table). Also, I'm not sure if the model coefficients are displayed properly but I hope to follow up on this for my final answer.
data = mtcars
#write linear models
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp + disp, data = data),
model4 <- lm(mpg ~ cyl * hp + disp + wt + drat, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)
#slightly rename intercept column
names(sel.table)[1]="Intercept"
#select variables to display in model summary table
sel.table <- sel.table %>%
select(Intercept,cyl,hp,disp,wt,drat,df,logLik,AICc,delta)
#round numerical coumns
sel.table[,1:6]<- round(sel.table[,1:6],2)
sel.table[,8:10]<-round(sel.table[,8:10],2)
#add a column for model (row) names
sel.table$Model <- rownames(sel.table)
#extract model formulas
form <- data.frame(name = as.character(lapply(models, `[[`, c(10,2))))
#generate a column with model (row) numbers (beside associated model formulas)
form <- form %>%
mutate(Model=(1:4))
#merge model table and model formulas
sum_table <- merge (form,sel.table,by="Model")
#rename model equation column
names(sum_table)[2]="Formula"
print <- flextable(head(sum_table))
print <- autofit(print)
print
6/1/20 UPDATE:
Below is an image that describes two issues that I'm having with the code. I've found a workaround to the first question but I'm still investigating the second.
see details here
Models end up being misnumbered
Model formula columns are being generated for each model
I believe there is a part of the code missing in the examples you followed, that is why your code does not work.
The easiest way to generate formula-like strings is simply to deparse the right hand side of the model formulas (i.e. 3-rd element):
sapply(get.models(out.put, TRUE), function(mo) deparse(formula(mo)[[3]], width.cutoff = 500))
or, if you want A*B's expanded into A + B + A:B:
sapply(get.models(out.put, TRUE), function(mo) deparse(terms(formula(mo), simplify = TRUE)[[3]], width.cutoff = 500))
Update: the original example code improved and simplified:
library(MuMIn)
data <- mtcars
#! Feed the models directly to `model.sel`. No need to create a separate list of
#! models.
gm <- lm(mpg ~ cyl, data = data)
out.put <- model.sel(
model1 = gm,
model2 = update(gm, . ~. + hp),
model3 = update(gm, . ~ . * hp + disp),
model4 = update(gm, . ~ . * hp + disp + wt + drat)
)
sel.table <- out.put
sel.table$family <- NULL
sel.table <- round(sel.table, 2)
#! Use `get.models` to get the list of models in the same order as in the
#! selection table
sel.table <- cbind(
Model =
#! Update (2): model number according to their original order, use:
attr(out.put, "order"),
#! otherwise: seq(nrow(sel.table)),
#!
#! Update (2): add a large `width.cutoff` to `deparse` so that the result is
#! always a single string and `sapply` returns a character vector
#! rather than a list.
#! For oversize formulas, use `paste0(deparse(...), collapse = "")`
formula = sapply(get.models(out.put, TRUE),
function(mo) deparse(formula(mo)[[3]], width.cutoff = 500)),
#!
sel.table
)
library(MuMIn)
data <- mtcars
#! Feed the models directly to `model.sel`. No need to create a separate list of
#! models.
gm <- lm(mpg ~ cyl, data = data)
out.put <- model.sel(
model1 = gm,
model2 = update(gm, . ~. + hp),
model3 = update(gm, . ~ . * hp + disp),
model4 = update(gm, . ~ . * hp + disp + wt + drat)
)
sel.table <- out.put
sel.table$family <- NULL
sel.table <- round(sel.table, 2)
#! Use `get.models` to get the list of models in the same order as in the
sel.table <- cbind(
Model =
#! Update (2): model number according to their original order, use:
attr(out.put, "order"),
#! otherwise: seq(nrow(sel.table)),
#!
#! Update (2): add a large `width.cutoff` to `deparse` so that the result is
#! always a single string and `sapply` returns a character vector
#! rather than a list.
#! For oversize formulas, use `paste0(deparse(...), collapse = "")`
formula = sapply(get.models(out.put, TRUE),
function(mo) deparse(formula(mo)[[3]], width.cutoff = 500)),
#!
sel.table
)
#slightly rename intercept column
colnames(sel.table)[3] <- 'Intercept'
# #select summary columns for model selection table
# sel.table <- sel.table %>%
# select(Model,formula,Intercept,df,logLik,AICc,delta,weight)
print <- flextable(head(sel.table))
print <- autofit(print)
print
Since your question isn't reproducible, i'll try with something else and maybe that's what you're looking for:
data = mtcars
models = list(
model1 = lm(mpg ~ cyl, data = data),
model2 = lm(mpg ~ cyl + hp, data = data)
)
data.frame(name = as.character(lapply(models, `[[`, c(10,2))),
other.column = NA)
#> name other.column
#> 1 mpg ~ cyl NA
#> 2 mpg ~ cyl + hp NA
Created on 2020-05-28 by the reprex package (v0.3.0)
The formula (call) of a lm object is on position 10 of the list. You can actually count when you type model1$. You can use rownames() instead of a column, but that's not recommended.
EDIT AFTER REPRODUCIBLE EXAMPLE
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
# create an object that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
# formulas as names
sel.table$name = as.character(lapply(models, `[[`, c(10,2)))
# reordering
sel.table = sel.table[, c(6,1,2,3,4,5)]
sel.table
#> name df logLik AICc delta weight
#> 3 mpg ~ cyl 5 -78.14329 168.5943 0.000000 0.5713716
#> 1 mpg ~ cyl + hp 3 -81.65321 170.1636 1.569298 0.2607054
#> 2 mpg ~ cyl * hp 4 -80.78092 171.0433 2.449068 0.1679230
Created on 2020-05-31 by the reprex package (v0.3.0)

Performing multiple regression with purrr and R

I'm trying to figure out how can I set up purrr to run several multiple regressions like the image below. As you will notice, this dataset describes an intervention program and we are analyzing this data using ANCOVA procedures (TIME 2 ~ TIME 1 + CONDITION).
om4g**TIME2**01 ~ om4g**TIME1**01 + CONDITION
example:
om4g201 ~ om4g01 + CONDITION
Just in case someone want a reproducible code:
dataset <- data.frame(rest201=c(10,20,30,40),
rest101=c(5,10,20,24),
omgt201=c(40,10,20,10),
omgt101=c(10,20,10,05),
CONDITION=c(0,1))
lm(rest201~rest101+CONDITION, data=dataset)
lm(omgt201~omgt101+CONDITION, data=dataset)
I found just one similar question than mine here (Making linear models in a for loop using R programming) but the answer was not working.
Thanks!
Similar to #Roman's answer, here is how to do it using map2 from purrr:
library(purrr)
y_var = c("rest201", "omgt201")
x_var = list(c("rest101", "CONDITION"), c("omgt101", "CONDITION"))
map2(x_var, y_var, ~ lm(as.formula(paste(.y, "~", paste(.x, collapse = " + "))), data = dataset))
To get the summary table for each model, you can wrap each lm with summary and extract the coefficients table:
map2(x_var, y_var, ~ {
lm(as.formula(paste(.y, "~", paste(.x, collapse = " + "))), data = dataset) %>%
summary() %>%
`$`("coefficients")
})
Result:
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.779097 0.76821670 3.617596 0.17169133
rest101 1.377672 0.04750594 29.000000 0.02194371
CONDITION 3.800475 0.72163694 5.266464 0.11945968
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.000000e+01 16.666667 1.800000e+00 0.3228289
omgt101 -2.445145e-16 1.333333 -1.833859e-16 1.0000000
CONDITION -2.000000e+01 14.529663 -1.376494e+00 0.3999753
You could construct a list of formulas for each model and use that to construct a model.
x <- c(101, 102, 103)
mdls <- sprintf("omg4g%s ~ om4g%s + CONDITION",
as.character(x + 100),
as.character(x)
)
out <- sapply(mdls, FUN = function(x) {
formula(x, data = latino_dataset)
})
$`omg4g201 ~ om4g101 + CONDITION`
omg4g201 ~ om4g101 + CONDITION
<environment: 0x0000000009aff7b8>
$`omg4g202 ~ om4g102 + CONDITION`
omg4g202 ~ om4g102 + CONDITION
<environment: 0x0000000009afda98>
$`omg4g203 ~ om4g103 + CONDITION`
omg4g203 ~ om4g103 + CONDITION
<environment: 0x00000000099b0828>
e.g.
sapply(out, FUN = lm)

Estimate equation with varying numbers of parameters in R (lm) with demeaned variables

I want to estimate an equation such as:
(where the bar denotes the mean of a variable.... Meaning, I want to automatically have interactions between Z and a demeaned version of X. So far I just demean the variables manually beforehand and estimate:
lm(Y ~ .*Z, data= sdata)
This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand because I would also like to include the means of more complex terms, such as:
Edit:
As requested, a working code-sample, note that in the actual thing I have large (and varying) numbers of X- variables, so that I dont want to use a hard-coded variant:
x1 <- runif(100)
x2 <- runif(100)
Z <- runif(100)
Y <- exp(x1) + exp(x2) + exp(z)
##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)
##basically what I want is that the following terms, and their interactions with Z are also used:
# X1^2 - mean(X1^2)
# X2^2 - mean(X2^2)
# X1*X2 - mean(X1*X2)
Edit 2:
Now, what I want to achieve is basically what
lm(Y ~ .^2*Z, data= sdata)
would do. However, given prior demeaing expressions in there, such as: Z:X1:X2 would correspond to: (x1-mean(x1))*(x2-mean(x2)), while what I want to have is x1*x2-mean(x1*x2)
To show that scale works inside a formula:
lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)
Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)
Coefficients:
(Intercept) cyl scale(disp * hp, scale = F)
3.312e+01 -2.105e+00 -4.642e-05
Now for comparison let's scale the interaction outside the formula:
mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))
lm(mpg ~ cyl + scaled_interaction, data=mtcars)
Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)
Coefficients:
(Intercept) cyl scaled_interaction
3.312e+01 -2.105e+00 -4.642e-05
At least in these examples, it seems as if scale inside formulae is working.
To provide a solution to your specific issue:
Alternative 1: Use formulae
# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This fives the following formula:
f2
Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) +
Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z
Alternative 2: Use Model Matrices
# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))
Here, all x's and the interactions are demeaned:
> head(modmat)
(Intercept) x1 x2 x1:x2
[1,] 0 0.1042908 -0.08989091 -0.01095459
[2,] 0 0.1611867 -0.32677059 -0.05425087
[3,] 0 0.2206845 0.29820499 0.06422944
[4,] 0 0.3462069 -0.15636463 -0.05571430
[5,] 0 0.3194451 -0.38668844 -0.12510551
[6,] 0 -0.4708222 -0.32502269 0.15144812
> round(colMeans(modmat), 2)
(Intercept) x1 x2 x1:x2
0 0 0 0
You can use the model matrix as follows:
modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)
It is not beautiful, but should do the work with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier if this is a concern. Note that you can also create the model matrix directly without fitting the model. I took it from the fitted model directly since it have already fitted it for the first approach.
As a sidenote, it may be that this is not implemented in a more straight forward fashion because it is difficult to imagine situations in which demeaning the interaction is more desirable compared to the interaction of demeaned variables.
Comparing both approaches:
Here the output of both approaches for comparison. As you can see, apart from the coefficient names everything is identical.
> lm(sdata$Y ~ modmat*sdata$Z)
Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)
Coefficients:
(Intercept) modmatx1 modmatx2 modmatx1:x2 sdata$Z
4.33105 1.56455 1.43979 -0.09206 1.72901
modmatx1:sdata$Z modmatx2:sdata$Z modmatx1:x2:sdata$Z
0.25332 0.38155 -0.66292
> lm(f2, data=sdata)
Call:
lm(formula = f2, data = sdata)
Coefficients:
(Intercept) scale(x1, scale = F) scale(x2, scale = F)
4.33105 1.56455 1.43979
scale(x1 * x2, scale = F) Z scale(x1, scale = F):Z
-0.09206 1.72901 0.25332
scale(x2, scale = F):Z scale(x1 * x2, scale = F):Z
0.38155 -0.66292

Easily performing the same regression on different datasets

I'm performing the same regression on several different datasets (same dependent and independe variables). However, there are many independent variables, and I often want to test adding/removing different variables. I'd like to avoid making all these changes to different lines of code, just because they use different datasets. Can I instead just copy the formula that was used to create some object, and then create a new object using a different dataset? For example, something like:
fit1 <- lm(y ~ x1 + x2 + x3 + ..., data = dataset1)
fit2 <- lm(fit1$call, data = dataset2) # this doesn't work
fit3 <- lm(fit1$call, data = dataset3) # this doesn't work
This way, if I want to update numerous regressions, I just update the first one and then rerun them all.
Can this be done? Preferably without using a loop or paste().
Thanks!
Or use update
(fit <- lm(mpg ~ wt, data = mtcars))
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 37.285 -5.344
update(fit, data = mtcars[mtcars$hp < 100, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[mtcars$hp < 100, ])
#
# Coefficients:
# (Intercept) wt
# 39.295 -5.379
update(fit, data = mtcars[1:10, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[1:10, ])
#
# Coefficients:
# (Intercept) wt
# 33.774 -4.285
Collect your datasets into a list and then use lapply. E.g.:
dsets <- list(dataset1,dataset2,dataset3)
lapply(dsets, function(x) lm(y ~ x1 + x2, data=x) )
Not sure entirely that this what you want but you can do this as follows:
formula <- y ~ x1 + x2 + x3 + ...
fit1 <- lm(formula, data = dataset1)
fit2 <- lm(formula, data = dataset2)
fit3 <- lm(formula, data = dataset3)

Resources