For loops: Running through column names - r

I am looking for a shorter way to write this, perhaps using a for loop,
i.e. i runs from 1 to 22 and columns 1 through 22 of my data are added to the multiple regression:
reg <- lm(log(y)~x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+z1+z2+z3+z4+z5+z6+z7+z8+z9+z10+z11+z12, data)
To clarify, x1, x2 and x3 are all column names: x2 means "x two", not x squared. I am trying to do a multiple regression with the last 22 columns in my data set.
Someone suggested to do this:
reg1 <- lm(log(data$y)~terms( as.formula(
paste(" ~ (", paste0("X", 29:ncol(data) , collapse="+"), ")")
)
))
But it doesn't work.
I don't think it is doing a multiple regression (xone + xtwo + xthree); rather, it seems to have assigned the binary value 1 to each variable x1, x2, x3... and added them, which is not what I want.

I know that a for-loop was requested but it would have been a clumsy strategy, so here's a possible correct strategy:
formchr <- paste(
paste( "log(y)" , paste0( "x", 1:10, collapse="+"), sep="~"),
# the LHS and first 10 terms
paste0( "z", 1:12, collapse="+"), #next 12 terms
sep="+") # put both parts together
reg1 <- lm( as.formula(formchr), data=data)
The full character version of the formula should be passed to the as.formula function. The paste and paste0 functions are fully vectorized, so no loop is needed.
If the first 22 columns were the desired target for the RHS terms, you could have pasted together names(data)[1:22], or ...[29:50] if those were the locations. This would be substituted for the RHS terms in the second paste above, dropping the third paste.
The only reason I used data as the name of an object is that it was implied by the question. It is a very confusing practice to use that name. data is an R function, and objects should have specific names that do not overlap with function names. The other very commonly abused name in this regard is df, which is the density function for the F distribution.
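A minimal sketch of the column-position variant described above. The data frame `dat` and its column names are made up for illustration; only the pattern of pasting `names(...)[positions]` into the formula string comes from the answer.

```r
set.seed(1)
# Toy data: a positive response plus six other columns (all names are illustrative).
dat <- data.frame(y = exp(rnorm(50)), a = rnorm(50), b = rnorm(50),
                  x1 = rnorm(50), x2 = rnorm(50), z1 = rnorm(50), z2 = rnorm(50))
# Pick the RHS terms by column position instead of typing their names.
rhs <- names(dat)[4:7]
formchr <- paste("log(y)", paste(rhs, collapse = "+"), sep = "~")
reg1 <- lm(as.formula(formchr), data = dat)
names(coef(reg1))
```

Selecting by position is fragile if the column order changes, so selecting by name pattern may be safer in real data.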

You could first subset your data into a data.frame which contains only the columns of interest. Then, you can run a linear model using the . formula syntax to select all columns other than the y variable.
Example using 1000 rows and 50 cols of data
N <- 1000
P <- 50
data <- as.data.frame(replicate(P, rnorm(N))) # P independent columns; rep() would copy one column P times
Assign your y data to y.
y <- data.frame(y = exp(rnorm(N))) # strictly positive, so log(y) is defined
Create a new data.frame containing y and the last 22 columns.
model_data <- cbind(y, data[ ,29:50])
colnames(model_data) <- c("y", paste0("x", 1:10), paste0("z",1:12))
The following should do the trick. The . formula syntax will select all columns other than the y column.
reg <-lm(log(y) ~ ., data = model_data)

Related

use more than 100 features lm function in R

Assume I have a dataframe consisting of 101 columns, where the first 100 are named data1 to data100 and the 101st column is named y.
I want to use the lm function in R with data1 to data100 as the features.
I know this can be written as:
lin_reg <- lm(y ~ data1+...+data100, dataframe)
Is there a better way of doing this?
lin_reg <- lm(y ~ ., data = dataframe)
This assumes your data frame contains only your outcome and the feature variables, with no extra columns. The "." means "take everything else from that data frame".
Since, as per the comment, the OP wants to exclude certain columns:
dataframe_subset <- dataframe[, !names(dataframe) %in% c("data5", "data10")]
lin_reg <- lm(y ~ ., data = dataframe_subset)
In this example, I would exclude the columns data5 and data10.
You can create the formula dynamically with reformulate :
lin_reg <- lm(reformulate(paste0('data', 1:100), 'y'), dataframe)
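As a hedged sketch, reformulate can also be combined with the column exclusion mentioned earlier, without building a subset data frame. The toy data below (six feature columns, 40 rows) is invented for illustration.

```r
set.seed(1)
# Toy data frame: data1..data6 plus y (sizes are illustrative).
dataframe <- as.data.frame(replicate(6, rnorm(40)))
names(dataframe) <- paste0("data", 1:6)
dataframe$y <- rnorm(40)
# Keep every column except the response and the ones to exclude.
features <- setdiff(names(dataframe), c("y", "data5"))
f <- reformulate(features, response = "y")
lin_reg <- lm(f, data = dataframe)
f
```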

Unique list of variable strings for model estimation

I want to create a vector of unique variable combinations to estimate various regression models for different sets of variables, while fixing one variable to be always included.
For example, I always want to include variable X1, plus a distinct combination of up to, say, three (this threshold could be varying depending on the specific data and research question at hand) other variables from the full list of available variables X2, X3, ..., XN.
The bi-variate case is rather simple, I guess.
However, already for tri-variate models, the variable combination "X1 X2 X3" will yield the same coefficients as "X1 X3 X2". Further, I also want to exclude combinations which contain same variables twice, e.g "X1 X2 X2".
How to exclude these "double-counting"/redundant combinations best? Or how to create such a vector of all possible distinct combinations?
Test code I tried so far (separating variables with underscore):
library(dplyr)
'%!in%' <- function(x,y)!('%in%'(x,y))
A <- c("X1", "X2", "X3", "X4", "X5") # all variables in dataset
a <- "X1" # keep X1 in all models
A_minus_a <- A[A %!in% a]
# first combination:
C1 <- outer(a, A_minus_a, paste, sep = "_")
# second set of combinations:
C2 <- outer(C1, A_minus_a, paste, sep = "_") %>% as.vector
# third set of combinations:
C3 <- outer(C2, A_minus_a, paste, sep = "_") %>% as.vector
# full list of model combinations, but including many "double-counted"/redundant models:
C <- c(C1, C2, C3)
Any help you can provide is very much appreciated!
P.S. for the second step I could prevent the problem by formatting the result of outer() into a matrix and then extracting the lower triangular elements without the diagonal of the matrix. However, when turning to the third set of combinations this does not work anymore. So, there might be a better solution from start.
How about using combn()? e.g. for sets of three variables:
cc <- combn(A_minus_a, m=3)
apply(cc,2,paste,collapse="_")
## [1] "X2_X3_X4" "X2_X3_X5" "X2_X4_X5" "X3_X4_X5"
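To cover the full request (X1 always included, plus up to three other variables), the combn() idea could be extended roughly as follows. This is a sketch; `others`, `max_extra` and `combos` are names introduced here, not from the question.

```r
A <- c("X1", "X2", "X3", "X4", "X5") # all variables in dataset
a <- "X1"                            # variable kept in every model
others <- setdiff(A, a)
max_extra <- 3                       # upper bound on additional variables
# For each size m, combn() yields every unordered set without repeats;
# prepend the fixed variable and collapse to one string per model.
combos <- unlist(lapply(seq_len(max_extra), function(m) {
  apply(combn(others, m), 2, function(v) paste(c(a, v), collapse = "_"))
}))
combos
```

Because combn() never repeats an element and ignores order, neither "X1_X2_X2" nor both "X1_X2_X3" and "X1_X3_X2" can appear.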

Passing arguments to subset within a function

I am attempting to fit a bunch of different models to a single dataset. Each of the models uses a different combination of outcome variable and data subset. To fit all of these models, I created a dataframe with one column for the outcome variable and one column specifying the data subset (as a string). (Note that the subsets are overlapping so there doesn't appear to be an obvious way to do this using nest().) I then created a new function which takes one row of this dataframe and calls "lm" using these options. Lastly, I use pmap to map this function to the dataframe.
After a bunch of experimentation, I found an approach that works but that is rather inelegant (see below for a simplified version of what I did). It seems like there should be a way to pass the subset condition to the subset argument in lm rather than using eval(parse(text = condition)) to first create a logical vector. I read the Advanced R section on metaprogramming in the hope that it would provide some insight, but I was unable to find anything that works.
Any suggestions would be helpful.
library(tidyverse)
outcomes <- c("mpg", "disp")
sub_conditions <- c("mtcars$cyl >=6", "mtcars$wt > 2")
models <- expand.grid(y = outcomes, condition = sub_conditions) %>% mutate_all(as.character)
fit <- function(y, condition) {
# Create the formula to use in all models
rx <- paste(y, "~ hp + am")
log_vec <- eval(parse(text = condition))
lm(rx, data = mtcars[log_vec,])
}
t <- pmap(models, fit)
Are you sure you want to pass the conditions as strings?
If that is the case, there are not many options. You can use rlang::parse_expr as an alternative.
fit <- function(y, condition) {
rx <- paste(y, "~ hp + am")
lm(rx, data = mtcars[eval(rlang::parse_expr(condition)),])
}
and call it via
purrr::pmap(models, fit)
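If strings are not a hard requirement, one possible alternative is to store the conditions as unevaluated expressions and evaluate them inside the data frame. This is only a sketch in base R; the list `conditions`, its element names, and the `grid` object are assumptions made for illustration.

```r
outcomes <- c("mpg", "disp")
# Conditions stored as quoted expressions; column names need no mtcars$ prefix.
conditions <- list(six_plus = quote(cyl >= 6), heavy = quote(wt > 2))

fit <- function(y, cond) {
  rows <- eval(cond, mtcars) # logical vector, evaluated with mtcars columns in scope
  lm(reformulate(c("hp", "am"), response = y), data = mtcars[rows, ])
}

# One row per outcome/condition pair.
grid <- expand.grid(y = outcomes, cond = names(conditions), stringsAsFactors = FALSE)
fits <- Map(function(y, cond) fit(y, conditions[[cond]]), grid$y, grid$cond)
length(fits)
```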

how to cbind many data-frames?

I have 247 data frames which are sequentially named (y1, y2, y3, ...., y247). They result from the following code:
for (i in (1:247)) {
nam <- paste("y", i, sep = "")
assign(nam, dairy[dairy$FARM==i,"YIT"])
}
I wish to cbind all of them to have:
df <- cbind(y1,y2,...,y247)
Can I do this with a loop without typing all 247 data frames?
Thanks
If you really want to do this, it is possible:
df <- y1
for (i in 2:247) {
df <- cbind(df, eval(parse(text=paste("y", i, sep = ''))))
}
Creating many variables in a loop as you do is not a good idea. You should use a list instead:
ys <- split(dairy$YIT, dairy$FARM)
names(ys) <- paste0("y", names(ys))
The first line creates list ys that contains your y1 as its first element (ys[[1]]), your y2 as its second element (ys[[2]]) and so on. The second line names the list elements the same way as you named your variables (y1, y2, etc.), since those will in the end be used to name the columns in the data frame.
There is a function in the dplyr package that takes a list of data frames and binds them all together as columns:
library(dplyr)
df <- bind_cols(ys)
Note, by the way, that this will only work, if each value appears exactly the same number of times in the column FARM, since the columns in a data frame must all have the same length.
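For reference, a base R sketch of the same idea without dplyr, splitting the YIT values by FARM as in the question. The small `dairy` data frame below is a made-up stand-in with 3 farms and 4 observations each.

```r
# Toy stand-in for the dairy data (values are invented).
dairy <- data.frame(FARM = rep(1:3, each = 4), YIT = seq_len(12))
ys <- split(dairy$YIT, dairy$FARM)
names(ys) <- paste0("y", names(ys))
# Guard: columns can only be bound if every farm has the same number of rows.
stopifnot(length(unique(lengths(ys))) == 1)
df <- as.data.frame(ys)
df
```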

With R, loop over data frames, and assign appropriate names to objects created in the loop

This is something which data analysts do all the time (especially when working with survey data which features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and looking for a more elegant solution.
Imagine there are 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ... dat5.csv. I want to estimate the same linear model using each data set.
Given this answer, a first step is to gather a list of the files, which I do with the following
csvdat <- list.files(pattern="dat.*csv")
Now I want to do something like
for(x in csvdat) {
lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}
The "which" statement is my silly way of trying to number each model in turn, using the position in the csvdat list that the loop has currently reached. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.
Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?
Thanks for your help!
Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it:
require(plyr)
# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)
# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat
Use a list to store the results of your regression models as well, e.g.
foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
y = V1+V2+rnorm(n)))
write.csv(foo(10), file="dat1.csv", row.names=FALSE) # no row-name column, so y ~ . uses only V1 and V2
write.csv(foo(10), file="dat2.csv", row.names=FALSE)
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq(along=csvdat))
lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat
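The same list-based approach can be written with lapply, as in the sketch below. It writes three simulated files to a temporary directory so that it is self-contained; the file contents and the `tmp` directory are assumptions for illustration.

```r
set.seed(1)
# Create a scratch directory with three simulated dat*.csv files.
tmp <- tempfile("lmdemo")
dir.create(tmp)
for (i in 1:3) {
  d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
  d$y <- d$x1 + d$x2 + rnorm(20)
  write.csv(d, file.path(tmp, paste0("dat", i, ".csv")), row.names = FALSE)
}
csvdat <- list.files(tmp, pattern = "^dat.*\\.csv$", full.names = TRUE)
# Fit the same model to every file and name each fit after its file.
lm.res <- lapply(csvdat, function(f) lm(y ~ x1 + x2, data = read.csv(f)))
names(lm.res) <- basename(csvdat)
# Collect the coefficients: one column per file.
coefs <- sapply(lm.res, coef)
coefs
```

Keeping the fits in a named list makes it easy to apply a further step (for example pooling estimates) to all models at once.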
What you want is a combination of the functions seq_along() and assign().
seq_along creates a vector from 1 to 5 if there are five objects in csvdat (to get the appropriate numbers and not only the variable names). Then assign (using paste to create the appropriate strings from the numbers) lets you create the variable.
Note that you will also need to load the data file first (was missing in your example):
for (x in seq_along(csvdat)) {
data.in <- read.csv(csvdat[x]) #be sure to change this to read.table if necessary
assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}
seq_along is not totally necessary, there could be other ways to solve the numeration problem.
The critical function is assign. With assign you can create variables with a name based on a string. See ?assign for further info.
Following chl's comments (see his post) everything in one line:
for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))

Resources