How to paste formula into model.matrix function in R?

By way of simplified example, say you have the following data:
n <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))
And you wish to create a model matrix of the following form:
model.matrix(~ df$x1 + df$x2)
or more preferably:
model.matrix(~ x1 + x2, data = df)
but instead by pasting the formula into model.matrix. I have experimented with the following but encounter errors with all of them:
form1 <- "df$x1 + df$x2"
model.matrix(~ as.formula(form1))
model.matrix(~ eval(parse(text = form1)))
model.matrix(~ paste(form1))
model.matrix(~ form1)
I've also tried the same with the more preferable structure:
form2 <- "x1 + x2, data = df"
Is there a direct solution to this problem? Or is the model.matrix function not conducive to this approach?

Do you mean something like this?
expr <- "~ x1 + x2"
model.matrix(as.formula(expr), df)
You need to give df as the data argument outside of as.formula, as the data argument defines the environment within which to evaluate the formula.
If you don't want to specify the data argument you can do
model.matrix(as.formula("~ df$x1 + df$x2"))
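One way to check that a pasted formula gives the same design matrix as the hand-written one is sketched below. The seed and the object names (mm_direct, mm_pasted, mm_reform) are illustrative assumptions; reformulate() is shown as a base R alternative to pasting the string yourself.

```r
set.seed(1)  # assumed seed, only for reproducibility
n <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))

# Formula written by hand
mm_direct <- model.matrix(~ x1 + x2, data = df)

# Formula built from a character string, then converted with as.formula()
expr <- paste("~", paste(c("x1", "x2"), collapse = " + "))
mm_pasted <- model.matrix(as.formula(expr), data = df)

# reformulate() builds the same one-sided formula from the term labels
mm_reform <- model.matrix(reformulate(c("x1", "x2")), data = df)

# Both comparisons should report that the matrices match
all.equal(mm_direct, mm_pasted)
all.equal(mm_direct, mm_reform)
```

Note that only the formula goes through as.formula(); the data argument is still passed to model.matrix() separately.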

Related

Adding extra variables to a formula

I want to add extra variables to a formula, with the use of a separate object part_B. As an example:
part_A <- as.formula("y ~ x1")
part_B <- c("x2", "x3")
I tried a couple of things, but one issue is that you cannot call as.formula on the object part_B (because in that case I could have created the formula by combining character vectors).
Desired Result
as.formula("y ~ x1 + x2 + x3")
Is there any way to do this? I guess one solution would be to create a function that writes the character vector as "y ~ x1 + x2 + x3" so it can be fed to as.formula.
Use reformulate and update like this:
update(part_A, reformulate(c(".", part_B)))
## y ~ x1 + x2 + x3
This also works:
v <- all.vars(part_A)
reformulate(c(v[-1], part_B), v[1])
## y ~ x1 + x2 + x3
If you write part_A as a vector instead:
part_A <- c("y", "x1")
part_B <- c("x2", "x3")
new_formula <- as.formula(paste(part_A[1], paste(c(part_A[2], part_B), collapse = " + "), sep = " ~ "))

How to use a dataframe in a function in r

I need to insert the variables of a dataframe into a function in r. The function in question is "y=[1- (x1-x2) / x3]". When I write, and enter the variables manually it works, however, I need to use the random numbers from the dataframe.
#Original function
f<-function(x1, x2, x3)
{return(1-(x1-x2)/x3)}
f(0.9, 0.5, 0.5)
#Dataframe function
f<-function(x1, x2, x3)
{return(1-(x1-x2)/x3)}
f(x1 = x1, x2 = x2, x3 = x3, DATA = DF)
The first call works; however, the second produces the error message: Error in f(VMB = VMB, VMR = VMR, DATA = DATA1) : unused argument (DATA = DATA1). I know I'm not passing the dataframe to the function properly, but I'm going in circles. Can anyone help me?
As the comments suggest, your problem is that the function doesn't have a data argument. R doesn't know where x1, x2 and x3 come from, so it only searches the calling (here, global) environment for them. If they are columns of a data frame, R doesn't know that it should take them from there, and the call fails.
For example
f <- function(x,y,z)
1 - (x-y)/z
f(0.9, 0.5, 0.5)
will work, because it knows where to retrieve the values. So will
x1 <- 0.9
x2 <- 0.5
x3 <- 0.5
f(x1, x2, x3)
because R finds x1, x2 and x3 in the global environment, but
df <- data.frame(x = 0.9, y = 0.5, z = 0.5)
f(x, y, z) #fails
fails, because it doesn't look for them in df. Instead you can use
f(df$x, df$y, df$z)
with(df, f(x, y, z)) #same
which lets R know where to get the variables. (Here I used x, y and z to avoid name conflicts.)
If this function should always take a data.frame and use columns x1, x2, x3, you could rewrite it to incorporate this, as below.
f <- function(df){
with(df, 1 - (x1-x2)/x3)
}
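The options above can be put side by side in a minimal sketch. It assumes the formula from the question, y = 1 - (x1 - x2) / x3, and a hypothetical data frame DF with made-up values; do.call() is an extra base R option not mentioned in the answer.

```r
# Formula from the question: y = 1 - (x1 - x2) / x3
f <- function(x1, x2, x3) 1 - (x1 - x2) / x3

# Hypothetical data frame holding the three input columns
DF <- data.frame(x1 = c(0.9, 0.8), x2 = c(0.5, 0.2), x3 = c(0.5, 2))

# Three equivalent ways to tell R that the inputs live in DF
y_dollar <- f(DF$x1, DF$x2, DF$x3)               # explicit column access
y_with   <- with(DF, f(x1, x2, x3))              # evaluate the call inside DF
y_docall <- do.call(f, DF[c("x1", "x2", "x3")])  # columns matched to arguments by name

# Store the result as a new column
DF$y <- y_with
```

Because f is vectorised, each call returns one value per row of DF.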

Regression in R using a function

I am trying to smooth out my data for each variable in the data frame. Let's say it looks like this:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
data$x <- 1:nrow(data)
I then specify my x and y variables as:
x <- data$x
y <- data$v1
I can fit the predicted line I want (and I am happy with the process):
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_v1 <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
I therefore end up with an appropriately titled new variable (the predicted values for v1). I now want to do this for each of the other 100 variables in my data frame (v2 and v3 in this example).
I tried using a function to do this but can't get it to work as intended. Here is my attempt:
myfunction <- function(xaxis,yaxis){
# Specify my "y" and "x"
x <- data$xaxis
y <- data$yaxis
f <- function (x,a,b,d) {(a*x^2) + (b*x) + d}
order_two <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co2 <- coef(order_two)
data$order_two_predicted_yaxis <- (co2[1] * (data$x)^2) + (co2[2] * data$x) + co2[3]
}
myfunction(x,v1)
myfunction(x,v2)
myfunction(x,v3)
Not only does the function not work as intended, I would like to avoid calling the function 100 times for each variable and instead somehow loop through it.
This is really simple to do in SAS using macros but I am struggling to get this to work in R.
You can model your data directly with the lm() function:
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
x <- 1:nrow(data)
# initialize a list to store the models
models = vector("list", length = (ncol(data)))
# create a loop running over the columns of data
for (i in 1:(ncol(data))){
models[[i]] = lm(data[,i] ~ poly(x,2, raw = TRUE))}
You can also use lapply instead of the for-loop, as stated in the comments.
Use predict() to get the values of the models:
smoothed_v1 = predict(models[[1]], newdata=data.frame(x = x))
Edit:
Regarding your comment - you can store the new values in data with:
for (i in (length(models):1)){
data <- cbind(predict(models[[i]], newdata=data.frame(x = x)), data)
# set the name for the new column
names(data)[1] = paste("pred_v",i, sep ="")}
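The lapply variant mentioned above could look roughly like this. It is a sketch, not the answerer's code: the names preds and result and the pred_ prefix are assumptions, and fitted() is used in place of predict() because the predictions are wanted at the observed x values only.

```r
data <- data.frame(v1 = c(0.5,1.1,2.9,3.4,4.1,5.7,6.3,7.4,6.9,8.5,9.1),
                   v2 = c(0.1,0.8,0.5,1.1,1.9,2.4,0.8,3.4,2.9,3.1,4.2),
                   v3 = c(1.3,2.1,0.8,4.1,5.9,8.1,4.3,9.1,9.2,8.4,7.4))
x <- seq_len(nrow(data))

# lapply over a data frame iterates over its columns:
# one quadratic fit per variable, returned as a named list
models <- lapply(data, function(y) lm(y ~ poly(x, 2, raw = TRUE)))

# Fitted values at the observed x, one column per model
preds <- as.data.frame(lapply(models, fitted))
names(preds) <- paste0("pred_", names(data))

# Original columns followed by the smoothed ones
result <- cbind(data, preds)
```

The list keeps the column names, so models$v2 or models[["v2"]] retrieves the fit for v2 without tracking positions.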

Using * in lm for variables with common names

If I have 100 variables with a common name, such as year_1951, year_1952, year_1953 etc, is there a way to do a linear regression that includes all variables that start with year_ ? In Stata this is easy by using the *, but in R, I'm not sure how to go about this.
Thanks.
Stata Example :
regress y year_*
Is there an equivalence in R, such as
ols.lm <- lm(y ~ year_*, data = d)
I don't think R supports that kind of expansion inside a formula. It does support the y ~ . kind of expansion.
Here is how you can do it
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
myformulae <- as.formula(paste(depVar,paste(indepVars,collapse=' + '),sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
Edit: Solving the problem mentioned in the comment (adding constants in the formula)
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
indepVarsCollapse <- paste(paste(indepVars,collapse=' + '), '-1')
myformulae <- as.formula(paste(depVar,indepVarsCollapse,sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
Rather than selecting the columns in the formula, select them in the data argument:
nms <- c("y", grep("year_", names(d), value = TRUE))
lm(y ~., d[nms])
Alternately, select all the desired columns in the grep
ix <- grep("^(y$|year_)", names(d))
lm(y ~., d[ix])
or if we knew that the unwanted columns do not start with y:
ix <- grep("^y", names(d))
lm(y ~., d[ix])
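To see that the formula-building and column-selection routes fit the same model, here is a small sketch. The data frame d is simulated and purely hypothetical (including the seed and the unrelated column named other); reformulate() stands in for the paste/as.formula construction.

```r
set.seed(42)  # assumed seed, only for reproducibility

# Hypothetical data: a response, three year_ columns and one unrelated column
d <- data.frame(y = rnorm(20),
                year_1951 = rnorm(20),
                year_1952 = rnorm(20),
                year_1953 = rnorm(20),
                other = rnorm(20))

# Anchored pattern so only names starting with year_ are picked up
year_vars <- grep("^year_", names(d), value = TRUE)

# Route 1: build y ~ year_1951 + year_1952 + year_1953 from the names
fit_formula <- lm(reformulate(year_vars, response = "y"), data = d)

# Route 2: subset the data and use the y ~ . shorthand
fit_subset <- lm(y ~ ., data = d[c("y", year_vars)])

# The two fits should have the same coefficients
all.equal(coef(fit_formula), coef(fit_subset))
```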

How to construct a big regular formula for a model in R?

I am trying to create a model to predict "y" from data "D" that contains predictors x1 to x100 and 200 other variables. Since the Xs are not stored consecutively, I can't select them by column position.
I can't use ctree(y ~ ., data = D) because of the other variables. Is there a way to refer to them as x1:100 in the model,
instead of writing very long code like
ctree( y = x1 + x2 + x..... x100)
Some recommendation would be appreciated.
Two more. The simplest in my mind is to subset the data:
ctree(y ~ ., data = D[, c("y", paste0("x", 1:100))])
Or a more functional approach to building dynamic formulas:
ctree(reformulate(paste0("x", 1:100), "y"), data = D)
Construct your formula as a text string, and convert it with as.formula.
vars <- names(D)[1:100] # or wherever your desired predictors are
fm <- paste("y ~", paste(vars, collapse="+"))
fm <- as.formula(fm)
ctree(fm, data=D, ...)
You can use this:
fmla = as.formula(paste("y", paste0("x", 1:100, collapse=" + "), sep=" ~ "))
ctree(fmla, data = D)
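The reformulate and paste constructions above should yield the same formula. This can be checked without fitting anything, as in the sketch below; the object names f_reform and f_paste are illustrative, and ctree itself is left out because it comes from an add-on package (such as partykit) that is not loaded here.

```r
vars <- paste0("x", 1:100)

# Two ways to build y ~ x1 + x2 + ... + x100
f_reform <- reformulate(vars, response = "y")
f_paste  <- as.formula(paste("y ~", paste(vars, collapse = " + ")))

# Compare the response and the term labels of both formulas
labels_reform <- attr(terms(f_reform), "term.labels")
labels_paste  <- attr(terms(f_paste), "term.labels")
identical(labels_reform, labels_paste)
all.vars(f_reform)[1]  # the response variable

# Either formula can then be passed on, e.g. ctree(f_reform, data = D)
```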
