Predicting with log transformation in formula - r

Imagine a simple model
fit <- lm(log(mpg) ~ cyl + hp, data = mtcars)
To get predictions on the original scale, we have to exponentiate:
exp(predict(fit, newdata = mtcars))
Is there a better way to do this than applying it manually? The documentation in ?predict does not give any helpful hints on this.
I guess the easiest way would be to extract the transforming function from the formula:
> formula(fit)
log(mpg) ~ cyl + hp
How can I check whether any transformation was applied to the left-hand side of the formula and, if so, extract the name of the function?

I'm not sure if it helps, but you could test it in this manner:
Convert the left-hand side of the formula into a character string and check whether it starts with log, sqrt, and the like:
startsWith(as.character((formula(fit))[2]), "log")
The answer is TRUE or FALSE:
[1] TRUE
Maybe this will help you automate your solution?
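A slightly more structured variant of the same idea, as a sketch only: instead of matching on text, inspect the left-hand side of the formula as a language object. The names lhs, lhs_fun and the small lookup table inverses are illustrative assumptions, and the table covers just two common transformations.

```r
fit <- lm(log(mpg) ~ cyl + hp, data = mtcars)

# The second element of a two-sided formula is its left-hand side
lhs <- formula(fit)[[2]]

# If the LHS is a call such as log(mpg), its first element is the function name;
# a bare variable such as mpg is a symbol, not a call
lhs_fun <- if (is.call(lhs)) as.character(lhs[[1]]) else NA_character_

# Hypothetical lookup table of inverse functions (extend as needed)
inverses <- list(log = exp, sqrt = function(x) x^2)

# Predictions on the model scale, back-transformed if an inverse is known
pred <- predict(fit, newdata = mtcars)
if (!is.na(lhs_fun) && lhs_fun %in% names(inverses)) {
  pred <- inverses[[lhs_fun]](pred)
}
```

Note that this only looks at the outermost function name, so a left-hand side such as log(mpg + 1) would still be reported as "log" even though exp() alone is not its inverse.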

Related

How to extract robust standard errors in r?

I calculated robust standard errors after running a regression with lm() function.
# robust standard errors
cov2I <- vcovHC(ols2I, type = "HC1")
robust_se2I <- sqrt(diag(cov2I))
print(robust_se2I)
I would like to extract the second value of the resulting vector and save it in a new variable. I've tried the following code, but it didn't work.
stderrorols2I <- (summary(robust_se2I))[2]
Thanks for your help!
This is how you could add robust standard errors manually to your summary output. Alternatively, you could have a look at coeftest() from the lmtest package.
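library(sandwich)
library(dplyr)  # provides %>% and mutate()
mod1 <- lm(mpg ~ cyl + disp, data = mtcars)
# robust standard errors
cov2I <- vcovHC(mod1, type = "HC1")
robust_se2I <- sqrt(diag(cov2I))
# append the robust standard errors as a column of the tidy coefficient table
mod1 %>%
  broom::tidy() %>%
  mutate(rse = robust_se2I)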
I found the answer to my question.
To save the specific robust standard error I should have coded the following:
stderrorols2I <- (robust_se2I)[2]
Now it's working perfectly.
Thanks anyway for your quick feedback!
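For anyone landing here later, a small sketch of why plain indexing works: sqrt(diag(...)) returns a named numeric vector, so an element can be picked by position or by coefficient name. The model below reuses the mtcars example from the other answer, and the names robust_se, se_by_position and se_by_name are made up for illustration.

```r
library(sandwich)  # provides vcovHC()

mod1 <- lm(mpg ~ cyl + disp, data = mtcars)
robust_se <- sqrt(diag(vcovHC(mod1, type = "HC1")))

# Positional indexing keeps the coefficient name attached
se_by_position <- robust_se[2]

# Name-based indexing with [[ ]] returns the bare number and
# does not depend on the order of the coefficients
se_by_name <- robust_se[["cyl"]]
```

Indexing by name is usually the safer choice if the model formula may change later, since the position of a coefficient can shift.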

How to run a loop regression in r

I want to run a regression in a loop for fund1 through fund10 on the LIQ factor. I want to run this regression:
lm(fund1S~ LIQ, data = Merge_Liq_Size)
but for all of the funds simultaneously.
I have attached some pictures of the dataset to show you the setup. The dataset has 479 observations (rows). Can anyone help me structure the code? Sorry if this question is phrased badly.
Perhaps this is what you are looking for:
my_models <- lapply(paste0("fund", 1:10, "S ~ LIQ"), function(x) lm(as.formula(x), data = Merge_Liq_Size))
You can access each model with my_models[[1]] through my_models[[10]].
Since this is lm, we can also do it without lapply: create a matrix as the dependent variable and construct the formula:
lm(as.matrix(Merge_Liq_Size[paste0("fund", 1:10, "S")]) ~ LIQ, data = Merge_Liq_Size)
Using a small reproducible example
> data(mtcars)
> lm(as.matrix(mtcars[1:3]) ~ vs, data = mtcars)
Call:
lm(formula = as.matrix(mtcars[1:3]) ~ vs, data = mtcars)

Coefficients:
             mpg       cyl       disp
(Intercept)    16.617     7.444   307.150
vs              7.940    -2.873  -174.693
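To compare the two approaches side by side, here is a minimal sketch on mtcars, since Merge_Liq_Size is not available here. The response columns (mpg, disp, hp) and the predictor wt are stand-ins chosen for illustration, and the object names are made up.

```r
responses <- c("mpg", "disp", "hp")

# One lm per response, built from a formula constructed with reformulate()
models <- lapply(responses, function(y) {
  lm(reformulate("wt", response = y), data = mtcars)
})
names(models) <- responses

# Collect the coefficients into a matrix with one column per response
coefs <- sapply(models, coef)

# Same fits in a single call with a matrix response
mlm_fit <- lm(as.matrix(mtcars[responses]) ~ wt, data = mtcars)
```

Both coefs and coef(mlm_fit) hold one column of coefficients per response, so either form can be used to pull out the slope for each series.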

Margins Package error using quadratic and interaction terms

I have code that uses the margins command in Stata, and I am trying to replicate it in R using the "margins" package from CRAN.
I keep getting the error:
marg1<-margins(reg2)
Error in names(classes) <- clean_terms(names(classes)) : 'names' attribute [18] must be the same length as the vector [16]
A minimal reproducible example is shown below:
install.packages("margins")
library(margins)
mod1 <- lm(log(mpg) ~ vs + cyl + hp + vs*hp + I(vs*hp*hp) + wt + I(hp*hp), data = mtcars)
(marg1 <- margins(mod1))
summary(marg1)
I need vs to be a dummy variable interacted with both a quadratic term and a normal interaction.
Does anyone know what I am doing wrong or if there is a way around this?
Your model specification is a bit confusing. For example, vs*hp expands to three terms: (i) vs, (ii) hp, and (iii) the interaction of vs and hp. As a result, hp appears twice in the formula you provided. You can simplify this considerably. Try this, for example (I think it is what you want):
mtcars$hp2 = mtcars$hp^2
mod1 <- lm(log(mpg) ~ cyl + wt + vs*hp + vs*hp2, data = mtcars)
summary(mod1) # With this you can check that the model you specified is what you want
(marg1 <- margins(mod1)) # The error disappeared.
summary(marg1)
In general, I would recommend avoiding I() in formula specifications, as it often gives rise to such errors when not treated with enough care (though sometimes one cannot avoid it). Good luck!
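To see the expansion of vs*hp described above, one can ask R for the term labels of a formula. This is just a sketch using base R's terms(); the variable names labels_cross and labels_full are made up for illustration.

```r
# Term labels that R derives from the crossing operator
labels_cross <- attr(terms(~ vs * hp), "term.labels")
labels_cross
# "vs" "hp" "vs:hp"

# Writing the main effects again does not add duplicate terms
labels_full <- attr(terms(~ vs + hp + vs * hp), "term.labels")
labels_full
# "vs" "hp" "vs:hp"
```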

How to understand the arguments of "data" and "subset" in randomForest R package?

Arguments
data: an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from
subset: an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
My questions:
Why is the data argument "optional"? If data is optional, where does the training data come from? And what exactly is the meaning of "By default the variables are taken from the environment which randomForest is called from"?
Why do we need the subset parameter? Let's say, we have the iris data set. If I want to use the first 100 rows as the training data set, I just select training_data <- iris[1:100,]. Why bother? What's the benefit of using subset?
This is not an uncommon methodology, and certainly not unique to randomForest.
mpg <- mtcars$mpg
disp <- mtcars$disp
lm(mpg~disp)
# Call:
# lm(formula = mpg ~ disp)
# Coefficients:
# (Intercept)         disp
#    29.59985     -0.04122
So when lm (in this case) is attempting to resolve the variables referenced in the formula mpg~disp, it looks in data if provided, then in the environment of the formula (typically the calling environment). Further example:
rm(mpg,disp)
mpg2 <- mtcars$mpg
lm(mpg2~disp)
# Error in eval(predvars, data, env) : object 'disp' not found
lm(mpg2~disp, data=mtcars)
# Call:
# lm(formula = mpg2 ~ disp, data = mtcars)
# Coefficients:
# (Intercept)         disp
#    29.59985     -0.04122
(Notice that mpg2 is not in mtcars, so this used both methods for finding the data.) I don't use this functionality, preferring the more robust step of providing all data in the call; it is not difficult to think of examples where reproducibility suffers if this is not the case.
Similarly, many modeling functions (including lm) allow this subset= argument, so the fact that randomForest includes it is consistent. I believe it is merely a convenience argument, as the following are roughly equivalent:
lm(mpg~disp, data=mtcars, subset= cyl==4)
lm(mpg~disp, data=mtcars[mtcars$cyl == 4,])
mt <- mtcars[ mtcars$cyl == 4, ]
lm(mpg~disp, data=mt)
The use of subset allows slightly simpler referencing (cyl versus mtcars$cyl), and its utility is compounded when the number of referenced variables increases (i.e., for "code golf" purposes). But this could also be done with other mechanisms such as with, so ... mostly personal preference.
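As a quick check of the "roughly equivalent" claim, the sketch below fits the same model with subset= and with pre-filtered data and compares the coefficients. The object names fit_subset and fit_rows are made up for illustration.

```r
# Rows selected through the subset argument (cyl is looked up in mtcars)
fit_subset <- lm(mpg ~ disp, data = mtcars, subset = cyl == 4)

# Rows selected before the call by indexing the data frame
fit_rows <- lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ])

# Both fits use the same observations, so the estimates agree
same_coefs <- all.equal(coef(fit_subset), coef(fit_rows))
same_n <- nobs(fit_subset) == nobs(fit_rows)
```

The fitted objects are not identical (the stored calls differ), which is why "roughly" is the right word.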
Edit: as joran pointed out, randomForest (and others but notably not lm) can be called with either a formula, which is where you'd typically use the data argument, or by specifying the predictor/response arguments separately with the arguments x and y, as in the following examples taken from ?randomForest (ignore the other arguments being inconsistent):
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE, proximity=TRUE)
iris.rrf <- randomForest(iris[-1], iris[[1]], ntree=101, proximity=TRUE, oob.prox=FALSE)

Updating data in lm() calls

Is there an equivalent to update for the data part of an lm call object?
For example, say I have the following model:
dd = data.frame(y=rnorm(100),x1=rnorm(100))
Model_all <- lm(formula = y ~ x1, data = dd)
Is there a way of operating on the lm object to have the equivalent effect of:
Model_1t50 <- lm(formula = y ~ x1, data = dd[1:50,])
I am trying to construct some pseudo out-of-sample forecast tests, and it would be very convenient to have a single lm object and to simply roll the data.
I'm fairly certain that update actually does what you want!
example(lm)
dat1 <- data.frame(group,weight)
lm1 <- lm(weight ~ group, data=dat1)
dat2 <- data.frame(group,weight=2*weight)
lm2 <- update(lm1,data=dat2)
coef(lm1)
## (Intercept)    groupTrt
##       5.032      -0.371
coef(lm2)
## (Intercept)    groupTrt
##      10.064      -0.742
If you're hoping for an efficiency gain from this, you'll be disappointed -- R just substitutes the new arguments and re-evaluates the call (see the code of update.default). But it does make the code a lot cleaner ...
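Applied to the example from the question, a sketch might look like the following. The set.seed call and the lowercase object names are additions for illustration; passing subset= through update is an alternative that relies on lm's own subset argument.

```r
set.seed(1)  # only so the sketch is reproducible
dd <- data.frame(y = rnorm(100), x1 = rnorm(100))
model_all <- lm(y ~ x1, data = dd)

# Refit on the first 50 rows by swapping the data argument
model_1t50 <- update(model_all, data = dd[1:50, ])

# Alternative: keep the data and add a subset argument to the call
model_sub <- update(model_all, subset = 1:50)

nobs(model_1t50)
```

Because update re-evaluates the stored call, dd has to be visible in the environment from which update is called.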
biglm objects can be updated to include more data, but not less. So you could do this in the opposite order, starting with less data and adding more. See http://cran.r-project.org/web/packages/biglm/biglm.pdf
However, I suspect you're interested in parameters estimated for subpopulations (i.e., rows 1:50 correspond to level "a" of a factor variable factrvar). In this case, you should use an interaction in your formula (~factrvar*x1) rather than subsetting to data[1:50,]. An interaction of this type will give different effect estimates for each level of factrvar. This is more efficient than estimating each parameter separately and will constrain any additional parameters (i.e., x2 in ~factrvar*x1 + x2) to be the same across values of factrvar. If you instead fit the same model separately to different subsets, x2 would receive a separate parameter estimate each time.
