New objects for table in regression loop - r

I am new to R, and am trying to loop regressions by group. For my data I have 13 groups, and would like to create 13 objects--a regression result for each group, so I can put all the regression results in a table.
Here is what I have tried:
for (i in 1:13) {groupi = lm(Yvariable ~ Xvariables,
data = dataset,
subset = dataset$group== i )}
So that I would have 13 group'i' objects that are each a regression result to put into a table.
THANKS!

If I get your problem right there is a specialised command for this: lmList from the nlme package.
Try this:
library(nlme)
your.result.list <- lmList(Yvariable ~ Xvariables | group, data = dataset)
your.result.list
The object your.result.list is of class lmList, so it is a list with the 13 elements that you wanted to have as single objects. It has a print method that prints a table of the coefficients to the console. So maybe this is already what you want?
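As a rough sketch of how such a list could be used, the following fits one model per group on the built-in mtcars data. The data set and the variables mpg, wt and cyl are stand-ins chosen for illustration, not the asker's data. coef() collects the per-group estimates into one table, and a single fit can be pulled out by its group name.

```r
library(nlme)

# Stand-in data: treat the number of cylinders as the grouping factor.
cars <- transform(mtcars, cyl = factor(cyl))

# One regression of mpg on wt per level of cyl.
fits <- lmList(mpg ~ wt | cyl, data = cars)

# One row per group, one column per coefficient.
coef_table <- coef(fits)
print(coef_table)

# A single group's fit is an ordinary lm object.
fit_4cyl <- fits[["4"]]
summary(fit_4cyl)
```

With the asker's data the call would keep the form shown in the answer, with group as the grouping variable.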

Consider by, an object-oriented wrapper to tapply, designed to subset data frames by factor(s) and run operations on the subsets. Often it can replace split + lapply for a more streamlined call:
reg_list <- by(dataset, dataset$group, function(sub)
summary(lm(Yvariable ~ Xvariables,
data = sub)
)
)
Please note the above only produces a named list of regression result summaries. Extracting each model's estimates takes further work, which can be done by extending the function.
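One possible way to do that extraction, sketched here with the split + lapply form mentioned above and the built-in mtcars data as a stand-in (mpg, wt and cyl are illustrative choices, not the asker's variables): keep the lm objects in a named list, then stack their coefficient vectors into one table.

```r
# Stand-in data: mtcars, grouped by number of cylinders.
groups <- split(mtcars, mtcars$cyl)

# Fit the same model within each group; the result is a named list of lm objects.
fits <- lapply(groups, function(sub) lm(mpg ~ wt, data = sub))

# Stack the coefficient vectors into one matrix, one row per group.
coef_table <- t(sapply(fits, coef))
print(coef_table)

# Full coefficient table (estimate, std. error, t value, p value) for one group.
coef(summary(fits[["6"]]))
```

Keeping the lm objects rather than their summaries leaves both options open: summary() can still be called on any element later.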

Related

rstatix::anova_test() over multiple columns of data frame

I would like to run multiple one-way ANOVAs over multiple columns of a data frame. My approach was a for loop. The first column of my data frame contains the groups. To provide a reproducible example, I use the iris data set here. I want to use rstatix::anova_test() instead of, e.g., aov(), because rstatix::anova_test() is pipe-friendly, seems to be a better option for unbalanced data (like mine) and also lets me define the type of sums of squares for the ANOVA.
When I write a for loop with aov() it works. Unfortunately, I have so far failed to do the same with rstatix::anova_test(). Can anybody help me?
library(dplyr)
library(rstatix)
data <- iris %>% relocate(Species, .before = Sepal.Length)
# Define object which will receive the results
results <- NULL
results <- as.data.frame(results)
with aov() it works
for(i in 2:ncol(data)){
# Put the name of the variable in first column of results object
results[i-1,1] <- names(data)[i]
# ANOVA test iterating through each column of the data frame and save output in a temporary object.
temp_anova_results <- broom::tidy(aov(data[,i] ~ Species, data = data))
# write ANOVA p value in second column of results object
results[i-1,2] <- temp_anova_results$p.value[1]
rm(temp_anova_results)
}
For several reasons I would like to work with rstatix::anova_test(), but I failed to get a correct for loop. An example I tried:
for(i in 2:ncol(data)){
# Put the name of the variable in first column of results object
results[i-1,1] <- names(data)[i]
# ANOVA test iterating through each column of the data frame and save output in a temporary object.
temp_anova_results <- data %>% anova_test(data[,i] ~ Species, type = 3)
# write ANOVA p value in second column of results object
results[i-1,2] <- temp_anova_results$p[1]
rm(temp_anova_results)
}
data %>% anova_test(data[,i] ~ Species) seems to be the problem, but it works outside the for loop when inserting a number for i, e.g. data %>% anova_test(data[,2] ~ Species)
Maybe somebody else has a better answer, but the only way I can get this to work is to build the formula from the column names, that is, replace your anova_test line with:
temp_anova_results <- data %>% anova_test(formula(paste0(names(data)[i],"~","Species")))
I don't know why your method didn't work. Even outside of the loop using i instead of a numeric constant broke the anova_test function call.
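The idea of building the formula from the column name can be sketched with base R alone. In this sketch aov() is swapped in for anova_test() so that it runs without extra packages, and reformulate() replaces the paste0() call; the formula object f is the kind of object the answer passes to anova_test().

```r
# Species first, then the four measurement columns (base R reordering).
data <- iris[, c("Species", setdiff(names(iris), "Species"))]

results <- data.frame(variable = names(data)[-1], p_value = NA_real_)

for (i in 2:ncol(data)) {
  # Build e.g. Sepal.Length ~ Species from the column name.
  f <- reformulate("Species", response = names(data)[i])
  fit <- aov(f, data = data)
  # The first row of the ANOVA table belongs to the Species term.
  results$p_value[i - 1] <- summary(fit)[[1]][["Pr(>F)"]][1]
}
print(results)
```

Because the formula names the column instead of indexing data[, i], it no longer depends on the loop variable being visible wherever the formula is evaluated.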

Passing arguments to subset within a function

I am attempting to fit a bunch of different models to a single dataset. Each of the models uses a different combination of outcome variable and data subset. To fit all of these models, I created a dataframe with one column for the outcome variable and one column specifying the data subset (as a string). (Note that the subsets are overlapping so there doesn't appear to be an obvious way to do this using nest().) I then created a new function which takes one row of this dataframe and calls "lm" using these options. Lastly, I use pmap to map this function to the dataframe.
After a bunch of experimentation, I found an approach that works but that is rather inelegant (see below for a simplified version of what I did). It seems like there should be a way to pass the subset condition to the subset argument in lm rather than using eval(parse(text = condition)) to first create a logical vector. I read the Advanced R section on metaprogramming in the hopes that it would provide some insight, but I was unable to find anything that works.
Any suggestions would be helpful.
library(tidyverse)
outcomes <- c("mpg", "disp")
sub_conditions <- c("mtcars$cyl >=6", "mtcars$wt > 2")
models <- expand.grid(y = outcomes, condition = sub_conditions) %>% mutate_all(as.character)
fit <- function(y, condition) {
# Create the formula to use in all models
rx <- paste(y, "~ hp + am")
log_vec <- eval(parse(text = condition))
lm(rx, data = mtcars[log_vec,])
}
t <- pmap(models, fit)
Are you sure you want to pass the conditions as strings?
If that is the case, there are not many options. You can use rlang::parse_expr as an alternative.
fit <- function(y, condition) {
rx <- paste(y, "~ hp + am")
lm(rx, data = mtcars[eval(rlang::parse_expr(condition)),])
}
and call it via
purrr::pmap(models, fit)
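If strings are not a hard requirement, one alternative is to store the conditions as unevaluated expressions with quote() and evaluate them inside the data frame. This is only a sketch in base R: the names conditions, grid and fit_one are made up for illustration, and Map() stands in for pmap().

```r
outcomes <- c("mpg", "disp")

# Conditions stored as expressions, not strings; columns are referenced by name.
conditions <- list(six_plus = quote(cyl >= 6), heavy = quote(wt > 2))

# All combinations of outcome and condition name.
grid <- expand.grid(y = outcomes, cond = names(conditions),
                    stringsAsFactors = FALSE)

fit_one <- function(y, cond) {
  # Evaluate the stored expression with mtcars columns in scope.
  rows <- eval(conditions[[cond]], envir = mtcars)
  lm(reformulate(c("hp", "am"), response = y), data = mtcars[rows, ])
}

fits <- Map(fit_one, grid$y, grid$cond)
names(fits) <- paste(grid$y, grid$cond, sep = "_")
```

Each element of fits is an lm object named after its outcome and subset, for example fits[["disp_heavy"]].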

How do I print a list-column of regression models in a tibble using huxtable()

I am trying to produce an Rmarkdown document that prints a series of regression models I currently have stored in a list-column in a tibble. The regression models look like this.
#generate data
covar1<-rnorm(100)
covar2<-rnorm(100)
depvar1<-rnorm(100)
depvar2<-rnorm(100)
#generate models
model1<-lm(depvar1~covar1)
model2<-lm(depvar1~covar1+covar2)
model3<-lm(depvar2~covar1)
model4<-lm(depvar2~covar1+covar2)
#list models
library(huxtable)
library(dplyr)
model.list<-list(model1, model2,model3, model4)
#make tibble
model.list<-tibble(model.list)
#name models by dependent variable
model.list$model_name<-c('Depvar1', 'Depvar1', 'Depvar2', 'Depvar2' )
#check
model.list
I know that you can just do
huxreg(model1, model2, model3, model4)
But I have many more models, and many more list columns that I want to ignore. I was trying:
library(purrr)
map(model.list[,1], huxreg)
And that works, to a point, but it does not render properly in Rmarkdown.
Just do
huxreg(model.list)
with the list object. From ?huxreg:
... Models, or a single list of models. Names will be used as column headings.
You don't need to make a tibble.
You could also cut out one step:
models <- list()
models[[1]] <- lm(depvar1~covar1)
models[[2]] <- lm(depvar1~covar1+covar2)
models[[3]] <- lm(depvar2~covar1)
models[[4]] <- lm(depvar2~covar1+covar2)
huxreg(models)
In general, naming your variables like name1, name2 etc. is a sign you should be using a list in the first place.
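A sketch of that list-first approach, assuming the four models from the question and made-up data collected in a data frame called dat (the list names such as "Depvar1, full" are illustrative): the formulas live in one named list and the models are fitted in a single lapply() call.

```r
set.seed(42)  # made-up data, mirroring the question

dat <- data.frame(covar1 = rnorm(100), covar2 = rnorm(100),
                  depvar1 = rnorm(100), depvar2 = rnorm(100))

# One formula per model; the list names carry over to the fitted models.
formulas <- list(
  "Depvar1, short" = depvar1 ~ covar1,
  "Depvar1, full"  = depvar1 ~ covar1 + covar2,
  "Depvar2, short" = depvar2 ~ covar1,
  "Depvar2, full"  = depvar2 ~ covar1 + covar2
)

# Fit every model in one pass; the result is a named list of lm objects.
models <- lapply(formulas, function(f) lm(f, data = dat))
names(models)
```

The resulting named list can then be handed to huxreg(models) as described above, where, per the quoted help text, the names are used as column headings.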

Storing lm objects within a data table (In order to use predict)

Following some great questions like this one:
Why is using update on a lm inside a grouped data.table losing its model data?, I'm running regression within a data.table and storing it, as the following:
library(data.table)
DT = data.table(iris)
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))), by = Species]
However, I would like to store the j output as lm objects, and not as a data.table:
class(fit[Species=="setosa"])
#i would like fit to contain 3 lm objects, not data.tables!
# [1] "data.table" "data.frame"
My question is: how can I store 3 lm objects within fit, and not 3 data tables? The reason I need that is that I want to use fit further for out-of-sample prediction (using predict.lm).
For example, I would like to store within the data table an element of the following type:
model<-lm(Sepal.Length ~ Sepal.Width + Petal.Length,data=DT[Species=="setosa"])
class(model)
# [1] "lm"
#i would like the first element of fit to include model -> the model output object
new_data<-DT #just a toy example :) this isnt really the new data
predict(model,new_data)
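A possible way to get at the stored models, sketched under the assumption that the list-column is given a name (model is a name chosen here for illustration): subsetting rows of fit always returns a data.table, so the lm object has to be taken out of the column itself. Only predict() with explicit new data is shown; other uses of models fitted inside a grouped data.table are not covered by this sketch.

```r
library(data.table)

DT <- data.table(iris)

# One row per species; the named list-column `model` holds the lm objects.
fit <- DT[, .(model = list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
          by = Species]

# Row-subsetting returns a data.table, so pull the element out of the column.
setosa_model <- fit$model[fit$Species == "setosa"][[1]]
class(setosa_model)

# The extracted object can be passed to predict() with new data.
new_data <- DT[Species == "setosa"]  # toy stand-in for real new data
preds <- predict(setosa_model, newdata = new_data)
head(preds)
```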

With R, loop over data frames, and assign appropriate names to objects created in the loop

This is something which data analysts do all the time (especially when working with survey data which features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and looking for a more elegant solution.
Imagine there are 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ... dat5.csv. I want to estimate the same linear model using each data set.
Given this answer, a first step is to gather a list of the files, which I do with the following
csvdat <- list.files(pattern="dat.*csv")
Now I want to do something like
for(x in csvdat) {
lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}
The "which" statement is my silly way of trying to number each model in turn, using the position in the csvdat list that the loop is currently up to. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.
Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?
Thanks for your help!
Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it
require(plyr)
# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)
# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat
Use a list to store the results of your regression models as well, e.g.
foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
y = V1+V2+rnorm(n)))
write.csv(foo(10), file="dat1.csv")
write.csv(foo(10), file="dat2.csv")
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq(along=csvdat))
lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat
What you want is a combination of the functions seq_along() and assign().
seq_along creates a vector from 1 to 5 if there are five objects in csvdat (to get the appropriate numbers and not only the variable names). Then assign (using paste to create the appropriate strings from the numbers) lets you create the variable.
Note that you will also need to load the data file first (was missing in your example):
for (x in seq_along(csvdat)) {
data.in <- read.csv(csvdat[x]) #be sure to change this to read.table if necessary
assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}
seq_along is not strictly necessary; there are other ways to solve the numbering problem.
The critical function is assign. With assign you can create variables with a name based on a string. See ?assign for further info.
Following chl's comments (see his post) everything in one line:
for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))
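To make the assign() approach concrete, here is a self-contained sketch that writes two small files with made-up data to a temporary directory, creates lm.1 and lm.2 as in the loop above, and then gathers them back into one named list with mget(). The file contents and the names lm_list and coef_table are assumptions for illustration.

```r
# Write two small example files to a temporary directory (made-up data).
set.seed(1)
tmp <- tempfile("csvdemo")
dir.create(tmp)
for (k in 1:2) {
  d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
  d$y <- d$x1 + d$x2 + rnorm(20)
  write.csv(d, file.path(tmp, paste0("dat", k, ".csv")), row.names = FALSE)
}
csvdat <- list.files(tmp, pattern = "^dat.*\\.csv$", full.names = TRUE)

# Create lm.1, lm.2, ... with assign(), one object per file.
for (x in seq_along(csvdat)) {
  assign(paste0("lm.", x), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))
}

# Collect the objects back into one named list for combined processing.
lm_list <- mget(paste0("lm.", seq_along(csvdat)))
coef_table <- t(sapply(lm_list, coef))
print(coef_table)
```

Having to call mget() to process the models together suggests that storing them in a list from the start, as in the other answers, is often the simpler design.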
