If I have 100 variables with a common name, such as year_1951, year_1952, year_1953 etc, is there a way to do a linear regression that includes all variables that start with year_ ? In Stata this is easy by using the *, but in R, I'm not sure how to go about this.
Thanks.
Stata Example :
regress y year_*
Is there an equivalent in R, such as
ols.lm <- lm(y ~ year_*, data = d)
I don't think R supports that kind of wildcard expansion inside a formula. It does support the y ~ . expansion, which uses every other column in the data as a predictor.
Here is how you can do it
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
myformulae <- as.formula(paste(depVar,paste(indepVars,collapse=' + '),sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
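As a possible shortcut for the paste() construction above, base R's reformulate() assembles the formula from a character vector of term names. The sketch below runs on a small made-up data frame (the column other and the random values are assumptions for illustration only):

```r
# Hypothetical data: a response y, three year_ columns, and one unrelated column.
set.seed(1)
d <- data.frame(
  y         = rnorm(20),
  year_1951 = rnorm(20),
  year_1952 = rnorm(20),
  year_1953 = rnorm(20),
  other     = rnorm(20)
)

# Select predictors by prefix, then let reformulate() build y ~ year_1951 + ...
indepVars <- grep("^year_", names(d), value = TRUE)
myformula <- reformulate(indepVars, response = "y")
modelfit  <- lm(myformula, data = d)

# Only the intercept and the year_ columns should appear among the coefficients.
names(coef(modelfit))
```

The column other is left out because it does not match the ^year_ pattern.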
Edit: Solving the problem mentioned in the comment (adding constants to the formula; here - 1, which drops the intercept)
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
indepVarsCollapse <- paste(paste(indepVars,collapse=' + '), '-1')
myformulae <- as.formula(paste(depVar,indepVarsCollapse,sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
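To see what the appended - 1 does, the following sketch fits the same model with and without an intercept. It uses reformulate() and its intercept argument instead of pasting '-1' by hand; the data frame is made up for illustration.

```r
# Hypothetical data with two year_ columns.
set.seed(2)
d <- data.frame(y = rnorm(15), year_1951 = rnorm(15), year_1952 = rnorm(15))
indepVars <- grep("^year_", names(d), value = TRUE)

# Default: the formula keeps the intercept.
with_int <- lm(reformulate(indepVars, response = "y"), data = d)

# intercept = FALSE appends "- 1" to the right-hand side.
without_int <- lm(reformulate(indepVars, response = "y", intercept = FALSE),
                  data = d)

has_int_default <- "(Intercept)" %in% names(coef(with_int))
has_int_dropped <- "(Intercept)" %in% names(coef(without_int))
c(default = has_int_default, dropped = has_int_dropped)
```

Whether the intercept should be dropped depends on the model; with a full set of year dummies it is a common way to avoid collinearity with the constant.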
Rather than selecting the columns in the formula, select them in the data argument:
nms <- c("y", grep("^year_", names(d), value = TRUE))
lm(y ~., d[nms])
Alternately, select all the desired columns in the grep
ix <- grep("^(y$|year_)", names(d))
lm(y ~., d[ix])
or if we knew that the unwanted columns do not start with y:
ix <- grep("^y", names(d))
lm(y ~., d[ix])
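A quick check that subsetting the data and building the formula give the same fit, sketched on made-up data. The column yield is a hypothetical unwanted column that also starts with y, which is the case where the ^y pattern would select too much:

```r
# Hypothetical data; "yield" is an unwanted column whose name starts with "y".
set.seed(3)
d <- data.frame(
  y         = rnorm(20),
  year_1951 = rnorm(20),
  year_1952 = rnorm(20),
  yield     = rnorm(20)
)

# Keep exactly "y" plus the columns whose names start with "year_".
is_year <- startsWith(names(d), "year_")
keep    <- names(d) == "y" | is_year

fit_subset  <- lm(y ~ ., data = d[keep])
fit_formula <- lm(reformulate(names(d)[is_year], response = "y"), data = d)

# Both routes should give the same coefficients.
same_fit <- isTRUE(all.equal(coef(fit_subset), coef(fit_formula)))
same_fit
```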
Related
I expect this question has been asked before, but I couldn't find it. I have the variables a, b, c, d and I want to write a loop that runs a regression, adds the next variable, and runs the regression again with the additional variable:
lm(Y ~ a, data = data), then
lm(Y ~ a + b, data = data), then
lm(Y ~ a + b + c, data = data) etc.
How would you do this?
Using paste and as.formula, example using mtcars dataset:
myFits <- lapply(2:ncol(mtcars), function(i){
x <- as.formula(paste("mpg",
paste(colnames(mtcars)[2:i], collapse = "+"),
sep = "~"))
lm(formula = x, data = mtcars)
})
Note: this looks like a duplicate post. I have seen a better solution for this type of question but cannot find it at the moment.
You could do this with a lapply / reformulate approach.
formulae <- lapply(ivars, function(x) reformulate(x, response="Y"))
lapply(formulae, function(x) summary(do.call("lm", list(x, quote(dat)))))
Data
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4, dimnames=list(NULL, c("Y", letters[1:3]))))
ivars <- sapply(1:3, function(x) letters[1:x]) # create an example list of indep. variable sets
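One possible way to build the cumulative predictor sets without indexing is Reduce() with accumulate = TRUE. This is only a sketch of the same idea on the example data; the name fits is introduced here for illustration.

```r
# Same example data as above.
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4,
                         dimnames = list(NULL, c("Y", letters[1:3]))))

# Accumulate the predictors: "a", then c("a", "b"), then c("a", "b", "c").
ivars <- Reduce(c, letters[1:3], accumulate = TRUE)

# One formula and one fitted model per cumulative set.
formulae <- lapply(ivars, reformulate, response = "Y")
fits     <- lapply(formulae, lm, data = dat)

# Number of coefficients per model (intercept plus predictors).
n_coef <- sapply(fits, function(m) length(coef(m)))
n_coef
```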
vars = c('a', 'b', 'c', 'd')
# might want to use a subset of names(data) instead of
# manually typing the names
reg_list = list()
for (i in seq_along(vars)) {
my_formula = as.formula(sprintf('Y ~ %s', paste(vars[1:i], collapse = " + ")))
reg_list[[i]] = lm(my_formula, data = data)
}
You can then inspect an individual result with, e.g., summary(reg_list[[2]]) (for the 2nd one).
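To compare all the fits at once rather than one by one, something like the following could work. The data frame is simulated here purely so the sketch runs on its own; the names r2 and labels are introduced for illustration.

```r
# Hypothetical data with a response Y and predictors a, b, c, d.
set.seed(7)
data <- data.frame(Y = rnorm(30), a = rnorm(30), b = rnorm(30),
                   c = rnorm(30), d = rnorm(30))
vars <- c("a", "b", "c", "d")

# Fit the nested models Y ~ a, Y ~ a + b, ...
reg_list <- lapply(seq_along(vars), function(i) {
  lm(reformulate(vars[1:i], response = "Y"), data = data)
})

# Collect R-squared for each model, labelled by its right-hand side.
labels <- sapply(seq_along(vars), function(i) paste(vars[1:i], collapse = " + "))
r2 <- setNames(sapply(reg_list, function(m) summary(m)$r.squared), labels)
r2
```

Because the models are nested, R-squared cannot decrease as predictors are added, which makes this a simple sanity check on the loop.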
By way of simplified example, say you have the following data:
n <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))
And you wish to create a model matrix of the following form:
model.matrix(~ df$x1 + df$x2)
or more preferably:
model.matrix(~ x1 + x2, data = df)
but instead by pasting the formula into model.matrix. I have experimented with the following but encounter errors with all of them:
form1 <- "df$x1 + df$x2"
model.matrix(~ as.formula(form1))
model.matrix(~ eval(parse(text = form1)))
model.matrix(~ paste(form1))
model.matrix(~ form1)
I've also tried the same with the more preferable structure:
form2 <- "x1 + x2, data = df"
Is there a direct solution to this problem? Or is the model.matrix function not conducive to this approach?
Do you mean something like this?
expr <- "~ x1 + x2"
model.matrix(as.formula(expr), df)
You need to give df as the data argument outside of as.formula, as the data argument defines the environment within which to evaluate the formula.
If you don't want to specify the data argument you can do
model.matrix(as.formula("~ df$x1 + df$x2"))
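A minimal sketch of the same idea, starting from only the right-hand side as a string. reformulate() is shown as an alternative to paste() plus as.formula(); the names rhs, mm1 and mm2 are introduced here for illustration.

```r
set.seed(1)
n  <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))

# Right-hand side as a string; prepend "~" to make a one-sided formula.
rhs <- "x1 + x2"
mm1 <- model.matrix(as.formula(paste("~", rhs)), data = df)

# Equivalent construction from a character vector of term names.
mm2 <- model.matrix(reformulate(c("x1", "x2")), data = df)

colnames(mm1)
```

In both calls data = df is passed to model.matrix itself, not embedded in the string.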
I'm looking to loop a number of independent variables through a mixed effect model. There are a couple of similar questions but nothing that quite works for me. An example using mtcars:
data(mtcars)
mtcars <- mtcars
t <- as.data.frame(seq(from = 10, to = 1000, by = 100))
names(t)[1] <- "return"
t <- as.data.frame(t(t))
#create some new variables to loop through
new <- cbind(mtcars$drat, t)
new2 <- 1-exp(-mtcars$drat/new[, 2:10])
new3 <- cbind(mtcars, new2)
xnam <- paste(colnames(mtcars)[c(3:4)], sep="")
xnam2 <- paste(colnames(new3)[c(12:20)], sep="")
#basic model (works fine)
fmla <- paste(xnam, collapse= "+")
fla <- paste("cyl ~", paste(fmla))
f <- paste0(paste(fla), " +(carb|gear)")
mtcarsmodel <- lmer(f, data= mtcars)
mtcarsmodel
So with my 'basic' model, I now want to iteratively run each of the variables in xnam2 through the model as a fixed effect, but I can't get it working with the lapply and paste method:
f2 <- " +(carb|gear)"
newmodels <- lapply(xnam2, function(x) {
lmer(substitute(paste(fla), i + (paste(f2)), list(i = as.name(x))), data = mtcars)
})
So cyl ~ disp+hp + looping variable + (carb|gear) is what I'm after.
Hopefully it's clear what I'm trying to accomplish. I'm getting a bit confused by the multiple pastes, but this seems like the best way to deal with many variables. Any suggestions?
If I've understood your question, I think you can just create the model formula with paste and use lapply to iterate through each new variable.
library(lme4)
vars = names(mtcars)[c(1,5,6,7)]
models = lapply(setNames(vars, vars), function(var) {
form = paste("cyl ~ disp + hp + ", var, "+ (carb|gear)")
lmer(form, data=mtcars)
})
A slight variant on @eipi10's solution:
library(lme4)
vars = names(mtcars)[c(1,5,6,7)]
otherVars <- c("disp","hp","(carb|gear)")
formList <- lapply(vars,function(x) {
reformulate(c(otherVars,x),response="cyl")
})
modList <- lapply(formList,lmer,data=mtcars)
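Before fitting, it can help to inspect the generated formulas. The sketch below only builds them, using base R and the built-in mtcars data, so it does not need lme4; it places the looping variable before the random-effects term, as in cyl ~ disp + hp + looping variable + (carb | gear).

```r
# Candidate fixed effects: columns 1, 5, 6, 7 of mtcars (mpg, drat, wt, qsec).
vars <- names(mtcars)[c(1, 5, 6, 7)]

# One formula per candidate, named after the variable being added.
formList <- lapply(setNames(vars, vars), function(v) {
  reformulate(c("disp", "hp", v, "(carb | gear)"), response = "cyl")
})

# Variables used by the first formula, in order of appearance.
vars_in_first <- all.vars(formList$mpg)
vars_in_first
```

Each element of formList can then be passed to lmer(..., data = mtcars) as in the answers above.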
I found a function that provides frequencies with condition and I thought of creating a function
do.call(data.frame, aggregate(X1 ~ X2, data=dat, FUN=table))
I also managed to get the column names by their index number from this thread using name <- names(dataset)[index].
I want to get the frequency of Xn ~ Xstatic, where Xn are the n-1 variables and Xstatic is the variable of interest.
So far I made a for loop and here is my code:
library(prodlim)
NUM <- 100
dat1 <- SimSurv(NUM)
dat1$time <- sample(24:160,NUM,rep=TRUE)
dat1$X3 <- sample(0:1,NUM,rep=TRUE)
dat1$X4 <- sample(0:9,NUM,rep=TRUE)
dat1$X5 <- sample(c("a","b","c"),NUM,rep=TRUE)
dat1$X6 <- sample(c("was","que","koa","sim","sol"),NUM,rep=TRUE)
dat1$X7 <- sample(1:99,NUM,rep=TRUE)
dat1$X8 <- sample(1:200,NUM,rep=TRUE)
attach(dat1)
# EXAMPLE
# do.call(data.frame, aggregate(status ~ X6, data=dat1, FUN=table))
for( i in 1:ncol(dat1) ) {
name <- names(dat1)[i]
do.call(data.frame, aggregate(name ~ X6, data=dat1, FUN=table))
}
I get the error below and I am at a loss on how to solve this. All help is appreciated.
Error in model.frame.default(formula = name ~ X6, data = dat1) :
variable lengths differ (found for 'X6')
1) I would suggest not using attach;
2) a frequency table of your variable of interest against some of these other variables is not meaningful, for instance the continuous ones, or the ones sampled from 99 or 200 possible values;
3) why would you want to combine your results into a data frame? just print them or save to a list:
mylist <- list()
for ( i in c('status','X2','X3','X4','X5','X7','X8') ) {
mylist[i] <- list(table(dat1[ ,i], dat1$X6))
}
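The original error arises because name ~ X6 uses the object name itself, a length-one character string, rather than the column it names. If a formula interface is preferred, one option is to build the formula from the string with reformulate() and pass it to xtabs(). This is a sketch on made-up data, not the SimSurv data from the question:

```r
# Hypothetical data standing in for dat1.
set.seed(1)
dat <- data.frame(
  status = sample(0:1, 50, replace = TRUE),
  X5     = sample(c("a", "b", "c"), 50, replace = TRUE),
  X6     = sample(c("was", "que", "koa"), 50, replace = TRUE)
)

# Cross-tabulate every other column against X6.
targets <- setdiff(names(dat), "X6")
tabs <- lapply(setNames(targets, targets), function(v) {
  # reformulate(c(v, "X6")) builds the one-sided formula ~ <v> + X6
  xtabs(reformulate(c(v, "X6")), data = dat)
})

tabs$status
```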
I am trying to create a model to predict "y" from data "D", which contains predictors x1 to x100 and 200 other variables. Since the x columns are not stored consecutively, I can't select them by column position.
I can't use ctree(y ~ ., data = D) because of the other variables. Is there a way to refer to x1 through x100 in the model
instead of writing a very long call like
ctree(y ~ x1 + x2 + ... + x100)
Some recommendation would be appreciated.
Two more. The simplest in my mind is to subset the data:
ctree(y ~ ., data = D[, c("y", paste0("x", 1:100))])
Or a more functional approach to building dynamic formulas:
ctree(reformulate(paste0("x", 1:100), "y"), data = D)
Construct your formula as a text string, and convert it with as.formula.
vars <- names(D)[1:100] # or wherever your desired predictors are
fm <- paste("y ~", paste(vars, collapse="+"))
fm <- as.formula(fm)
ctree(fm, data=D, ...)
You can use this:
fml = as.formula(paste("y", paste0("x", 1:100, collapse=" + "), sep=" ~ "))
ctree(fml, data = D)
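If the x columns are scattered among other columns, selecting them by name pattern may be more robust than by position. The sketch below uses a small made-up D (three x columns instead of one hundred, plus hypothetical extra columns) and only builds and checks the formula; ctree itself is not called, so no tree package is needed.

```r
# Hypothetical data: x1..x3 interleaved with unrelated columns.
set.seed(5)
D <- data.frame(
  y      = rnorm(10),
  x1     = rnorm(10),
  extra1 = rnorm(10),
  x2     = rnorm(10),
  xtra   = rnorm(10),   # starts with "x" but is not one of the predictors
  x3     = rnorm(10)
)

# Match "x" followed only by digits, so "xtra" and "extra1" are excluded.
xs  <- grep("^x[0-9]+$", names(D), value = TRUE)
fml <- reformulate(xs, response = "y")

# Terms that ended up on the right-hand side.
rhs_terms <- attr(terms(fml), "term.labels")
rhs_terms
```

The resulting fml could then be passed to the model function, for example as ctree(fml, data = D).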