When using the add1 function to consider new variables, I would like to reference all variables (either in some data frame or in the global environment), but I cannot figure out how to use the scope argument to do this.
I am aware I can use it like this
X = data.frame(replicate(4,rnorm(20))) ; y = rnorm(20)
lm1 = lm(y ~ 1)
out = add1(lm1, scope= ~X$X1 + X$X2 + X$X3)
but I want to avoid manually writing in every variable.
As I have seen in other questions, I know the . symbol will not work but I am not sure why. It stands for what is already there, so if I do
x1 = rnorm(20) ; x2 = rnorm(20) ; x3 = rnorm(20) ; x4 = rnorm(20) ; y = rnorm(20)
out = add1(lm1, scope= ~ . )
it does not use what is already in the global environment.
I know the documentation says that scope must be "a formula giving the terms to be considered", but that is usually where . can be used to reference all variables.
Thanks in advance.
Also note I have read Chp 7 of MASS, and these related threads
scope from add1()-command in R
http://tolstoy.newcastle.edu.au/R/help/02b/3588.html
This is an even simpler answer, which I found after browsing this question
http://r.789695.n4.nabble.com/glm-formula-vs-character-td2543061.html
x1 = rnorm(100)
x2 = rnorm(100)
x3 = rnorm(100)
y = rnorm(100)
BaseReg = lm(y ~ 1)
newdf = data.frame(x1,x2,x3)
out = add1(BaseReg, names(newdf))
It is baffling that such a simple way to get this was not stated in the documentation for add1.
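The same idea can be sketched with everything in one data frame, so the candidate names are derived rather than typed out. This is only an illustration: `dat`, `base_fit` and `candidates` are made-up names, and it assumes the response column is called `y`.

```r
set.seed(1)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
base_fit <- lm(y ~ 1, data = dat)

# every column except the response becomes a candidate term
candidates <- setdiff(names(dat), "y")

# add1() accepts a character vector of term labels as its scope
out <- add1(base_fit, scope = candidates)
print(out)
```

Keeping the response and the candidates in the same data frame means `add1` can find every variable through the model's `data` argument instead of relying on the global environment.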
As the help page for add1 says, the formula ~ . means "what's already there". It is not any simpler to use as.formula for small numbers of names, but this approach can be used in a function or script. (Generally one would expect to put the X's and y in the same data frame.)
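YX <- cbind(y, X)   # response and predictors in one data frame
as.formula(paste("~", paste(names(YX)[-c(1,5)], collapse="+")))
#~X1 + X2 + X3
form <- as.formula(paste("~", paste(names(YX)[-c(1,5)], collapse="+")))
lm1 <- lm(y ~ 1, data = YX)   # refit with data = YX so add1 can find X1, X2, X3
add1(lm1, form)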
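As an alternative to pasting strings by hand, base R's `reformulate()` builds a one-sided formula from a character vector of term labels. A minimal sketch, assuming the response column is named `y` and every other column is a candidate:

```r
set.seed(1)
X <- data.frame(replicate(4, rnorm(20)))   # columns X1..X4
YX <- cbind(y = rnorm(20), X)
lm1 <- lm(y ~ 1, data = YX)

# term labels -> one-sided formula, no paste() needed
form <- reformulate(setdiff(names(YX), "y"))
print(form)

out <- add1(lm1, scope = form)
print(out)
```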
You appear to have stumbled across a more efficient strategy. If using a data object with column names "y" "X1" "X2" "X3" "X4":
> formula(YX)
y ~ X1 + X2 + X3 + X4
> formula(YX)[-2]
~X1 + X2 + X3 + X4
> as.list(formula(YX))
[[1]]
`~`
[[2]]
y
[[3]]
X1 + X2 + X3 + X4
> names(YX)
[1] "y" "X1" "X2" "X3" "X4"
You can see that a formula object has as its first element the formula-defining tilde, which is really an R function. The second element is the LHS expression and the third element is the RHS expression.
Here is something I found that works:
X = data.frame(replicate(4,rnorm(20)))
lm1 = lm(X1 ~ 1 ,data=X)
add1(lm1, scope=formula(X)[-2])
Granted, I have no idea why this is the case
formula(X)[-2]
# ~X2 + X3 + X4
I just found it by accident. Other things like formula(X)[-1] and formula(X)[-3] also return other things which are equally bizarre to me.
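For what it's worth, the behaviour seems to follow from two facts: `formula()` applied to a data frame treats the first column as the response and the remaining columns as predictors, and a two-sided formula has three elements (the tilde, the left-hand side, the right-hand side), so `[-2]` removes the left-hand side. A small sketch to inspect this:

```r
X <- data.frame(replicate(4, rnorm(20)))   # columns X1..X4

f <- formula(X)
print(f)          # first column on the left, the rest on the right

rhs_only <- f[-2] # drop element 2, the left-hand side
print(rhs_only)

length(f)         # tilde, LHS, RHS
length(rhs_only)  # tilde, RHS
```

This also explains why `formula(X)[-1]` and `formula(X)[-3]` look odd: they remove the tilde or the right-hand side instead.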
Related
I'm trying to run lm() on only a subset of my data, and running into an issue.
dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men',50), rep('women',50)))) # sample data
lm( y ~ ., dt) # Use all x: Works
lm( y ~ ., dt[x3 == 'men']) # Use all x, limit to men: doesn't work (as expected)
The above doesn't work because the dataset now has only men, and we therefore can't
include x3, the gender variable, into the model. BUT...
lm( y ~ . -x3, dt[x3 == 'men']) # Exclude x3, limit to men: STILL doesn't work
lm( y ~ x1 + x2, dt[x3 == 'men']) # Exclude x3, with different notation: works great
Is this an issue with the "minus sign" notation in the formula? Please advise. Note: Of course I can do it a different way; for example, I could exclude the variables prior to putting them into lm(). But I'm teaching a class on this stuff, and I don't want to confuse the students, having already told them they can exclude variables using a minus sign in the formula.
The error you are getting is because x3 still ends up in the model frame, where it has only one value, "men" (see the comment below from @Artem Sokolov).
One way to solve it is to subset ahead of time:
dt = data.table(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100), x3 = as.factor(c(rep('men',50), rep('women',50)))) # sample data
dmen<-dt[x3 == 'men'] # create a new subsetted dataset with just men
lm( y ~ ., dmen[,-"x3"]) # now drop the x3 column from the dataset (just for the model)
Or you can do both in the same step:
lm( y ~ ., dt[x3 == 'men',-"x3"])
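To see why the minus sign is not enough, it may help to inspect the terms object. The sketch below uses a plain data frame rather than a data.table (the names `df`, `men`, `tt` and `res` are illustrative): the term `x3` is removed, but `x3` is still listed among the variables, so `lm()` still puts the one-level factor into the model frame.

```r
set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                 x3 = factor(rep(c("men", "women"), each = 50)))
men <- df[df$x3 == "men", ]

tt <- terms(y ~ . - x3, data = men)
attr(tt, "term.labels")   # the model terms after removing x3
attr(tt, "variables")     # the variables that go into the model frame

# the fit fails because the model frame still holds the one-level factor
res <- tryCatch(lm(y ~ . - x3, data = men), error = function(e) e)
inherits(res, "error")

# dropping the column before fitting avoids the problem
fit <- lm(y ~ ., data = men[, setdiff(names(men), "x3")])
coef(fit)
```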
I'm trying to regress returns against FF 3-factors with a rolling window.
To do so, I have found the function roll_lm in R, but the function is only producing regression output for one of the 3 variables.
The code is described here:
Y <- as.matrix(Portfolio_returns[,2])
X1 <- as.matrix(Mydata[,2])
X2 <- as.matrix(Mydata[,3])
X3 <- as.matrix(Mydata[,4])
Five_years_Rolling_reg <- roll_lm(X1 + X2 + X3,Y,60)
When I apply the coef function, I only get output for X1 and not X2 nor X3.
What am I doing wrong?
Your problem seems to be a basic misunderstanding of how the function works. Looking at ?roll_lm
Arguments
x
matrix or xts object. Rows are observations and columns are the independent variables.
Currently it seems like you are trying to use a formula = X1 + X2 + X3 style of input, which is not what the help page is saying. As such it is adding the columns together as if it was: x1 = 2; x2 = 3; x1 + x2 = 5
Instead you should bind the columns together.
Y <- as.matrix(Portfolio_returns[,2])
X <- as.matrix(Mydata[,2:4])
roll_lm(X, Y, 60)
Or alternatively use the model.frame, model.response, model.matrix functions from base-R, which gives you the familiarity of the formula settings.
names(Mydata)[1:4] <- c("Y", "X1", "X2", "X3")
frame <- model.frame(Y ~ X1 + X2 + X3, data = Mydata)
X <- model.matrix(Y ~ X1 + X2 + X3, data = Mydata)[, -1] # drop the intercept column; roll_lm adds its own by default
roll_lm(X, model.response(frame), 60)
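The difference between adding the matrices and binding them can be checked in base R without the roll package. A minimal sketch with made-up data:

```r
set.seed(1)
X1 <- matrix(rnorm(5))
X2 <- matrix(rnorm(5))
X3 <- matrix(rnorm(5))

summed <- X1 + X2 + X3        # element-wise sum: still a single column
bound  <- cbind(X1, X2, X3)   # one column per regressor

dim(summed)
dim(bound)
```

Passing `summed` to a regression function gives one coefficient for the sum, which matches the symptom of only seeing output for "X1".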
I am trying to add a term to a model formula in R. This is straightforward to do using update() if I enter the variable name directly into the update function. However, it does not work if the variable name is stored in a variable.
myFormula <- as.formula(y ~ x1 + x2 + x3)
addTerm <- 'x4'
#Works: x4 is added
update(myFormula, ~ . + x4)
Output: y ~ x1 + x2 + x3 + x4
#Does not work: "+ addTerm" is added instead of x4 being added
update(myFormula, ~ . + addTerm)
Output: y ~ x1 + x2 + x3 + addTerm
Adding x4 via the variable can be done in a slightly more complex way.
formulaString <- deparse(myFormula)
newFormula <- as.formula(paste(formulaString, "+", addTerm))
update(newFormula, ~.)
Output: y ~ x1 + x2 + x3 + x4
Is there a way to get update() to do this directly without needing these extra steps? I've tried paste, parse, and the other usual functions and they don't work.
For example, if paste0 is used the output is
update(myFormula, ~ . + paste0(addTerm))
Output: y ~ x1 + x2 + x3 + paste0(addTerm)
Does anybody have any recommendations on how to use a variable in update()?
Thanks
You can probably just do:
update(myFormula, paste("~ . +",addTerm))
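This works because update() on a formula converts its second argument with as.formula(), so a character string is accepted. A short sketch, including the case of several terms and a `reformulate()` variant; `addTerms`, `f1`, `f2` and `f3` are illustrative names:

```r
myFormula <- y ~ x1 + x2 + x3
addTerm <- "x4"

# single term held in a variable
f1 <- update(myFormula, paste("~ . +", addTerm))
print(f1)

# several terms held in a character vector
addTerms <- c("x4", "x5")
f2 <- update(myFormula, paste("~ . +", paste(addTerms, collapse = " + ")))
print(f2)

# alternative: rebuild the formula from its term labels
f3 <- reformulate(c(attr(terms(myFormula), "term.labels"), addTerm), response = "y")
print(f3)
```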
Because of a bug in the neuralnet command in R, I am building a formula manually instead of using the '.' notation for all variables. Inside of a loop, the paste function is transposing the "~" and "y" as shown below.
for(i in 1:3)
{
f <- as.formula(paste(c("y",i,"~", paste(c("x1","x2"), collapse = " + ")), collapse=""))
message(f)
}
produces:
~y1x1 + x2
~y2x1 + x2
~y3x1 + x2
I tried reversing the order of the "~" and "y", but that gives an error "unexpected symbol". So the question is, how do I get:
y1~x1 + x2
y2~x1 + x2
y3~x1 + x2
Thanks!
This would be a method of producing 5 formula-objects with an sapply-loop. Note: Your current for-loop will over-write the f-values because you did not index the assignment:
sapply( paste("y",1:5,"~", paste(c("x1","x2"), collapse = " + "),
sep="") , as.formula)
$`y1~x1 + x2`
y1 ~ x1 + x2
<environment: 0x121e1b668>
$`y2~x1 + x2`
y2 ~ x1 + x2
<environment: 0x121e1b668>
$`y3~x1 + x2`
y3 ~ x1 + x2
<environment: 0x121e1b668>
$`y4~x1 + x2`
y4 ~ x1 + x2
<environment: 0x121e1b668>
$`y5~x1 + x2`
y5 ~ x1 + x2
<environment: 0x121e1b668>
There is really no way to have any other structure than a list-object, since formulas are language constructs and typically need to be inside list or list like structures and use "[[" to gain access to their values.
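On the original symptom: the formulas in the question appear to be built correctly, and it is message() that makes them look transposed. Converting a formula to character gives three pieces (the tilde, the left-hand side, the right-hand side), and message() pastes them together in that order. A sketch, using deparse() to get the readable form; `forms` is an illustrative name:

```r
f <- as.formula(paste0("y", 1, "~", paste(c("x1", "x2"), collapse = " + ")))

pieces <- as.character(f)
print(pieces)                  # tilde first, then LHS, then RHS
paste(pieces, collapse = "")   # the string message(f) ends up showing

# deparse() gives the formula as it would be typed
message(deparse(f))

# a list of formulas, one per response
forms <- lapply(paste0("y", 1:3),
                function(lhs) reformulate(c("x1", "x2"), response = lhs))
print(forms[[2]])
```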
Using a dataset I built a model as below:
fit <- lm(y ~ as.numeric(X1) + as.factor(x2) + log(1 + x3) + as.numeric(X4) , dataset)
Then I build new data:
X1 <- 1
X2 <- 10
X3 <- 15
X4 <- 0.5
new <- data.frame(X1, X2, X3, X4)
predict(fit, new , se.fit=TRUE)
Then I get the Error below:
Error in data.frame(state_today, daily_creat, last1yr_min_hosp_icu_MDRD, :
object 'X2' is not found
What am I doing wrong? Is this because of logarithm in the model?
A great way of looking at your problem another way is by constructing a self contained reproducible example. With no copy/pasting. This often gives you a fresh perspective and often teases out the weirdest bugs imaginable.
As flodel and Ben have pointed out, your problem is probably due to a bad choice of variable names. I'm guessing you're using RStudio, which in my opinion uses a terrible default font for exactly this reason. I can't (easily) tell x and X apart.
Here is something similar to what you're trying to do, with all variable names correctly (un)capitalized.
xy <- data.frame(y = runif(20), x1 = runif(20), x2 = sample(1:5, 20, replace = TRUE), x3 = runif(20))
fit <- lm(y ~ as.numeric(x1) + as.factor(x2) + log(1+x3), data = xy)
predict(fit, newdata = data.frame(x1 = 1, x2 = as.factor(3), x3 = 15))
1
0.05015187
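One way to catch this kind of mismatch before calling predict() is to compare the variable names the model needs with the column names of the new data. A minimal sketch, assuming the fix is simply a matter of letter case; `needed`, `new` and `missing_vars` are illustrative names:

```r
set.seed(1)
xy <- data.frame(y = runif(20), x1 = runif(20), x2 = rep(1:5, 4), x3 = runif(20))
fit <- lm(y ~ as.numeric(x1) + as.factor(x2) + log(1 + x3), data = xy)

# predictor variables the fitted model expects in newdata
needed <- all.vars(delete.response(terms(fit)))

new <- data.frame(X1 = 1, X2 = 3, X3 = 15)   # wrong case on purpose
missing_vars <- setdiff(needed, names(new))
print(missing_vars)                          # names predict() would fail to find

# after matching the names, the prediction goes through
names(new) <- tolower(names(new))
pred <- predict(fit, newdata = new)
print(pred)
```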