R - Perform the same operations on many data sets

Apologies if this is a repeat question; if the answer exists somewhere, I would appreciate being pointed to it.
I have a large data frame with many factors, mix of categorical and continuous. Here is a shortened example:
x1 = sample(x = c("A", "B", "C"), size = 50, replace = TRUE)
x2 = sample(x = c(5, 10, 27), size = 50, replace = TRUE)
y = rnorm(50, mean=0)
# cbind() would coerce every column to character, so build the data frame directly
dat = data.frame(y, x1 = factor(x1), x2)
> head(dat)
y x1 x2
1 9 C 2
2 7 C 2
3 8 B 1
4 21 A 2
5 48 A 1
6 19 A 3
I want to subset this dataset by each level of x1, so that I end up with 3 new datasets, one per level of the factor x1. I can do this the following way:
#A
dat.A = dat[which(dat$x1== "A"),,drop=T]
dat.A$x1 = factor(dat.A$x1)
#B
dat.B = dat[which(dat$x1== "B"),,drop=T]
dat.B$x1 = factor(dat.B$x1)
#C
dat.C = dat[which(dat$x1== "C"),,drop=T]
dat.C$x1 = factor(dat.C$x1)
This is somewhat tedious, as my real data have 7 levels of the factor of interest, so I have to repeat the code 7 times. Once I have each new data frame in my global environment, I want to apply several functions to each one (graphing, creating tables, fitting linear models). Here is a simple example:
#same plot for each dataset
A.plot = plot(dat.A$y, dat.A$x2)
B.plot = plot(dat.B$y, dat.B$x2)
C.plot = plot(dat.C$y, dat.C$x2)
#same models for each dataset
mod.A = lm(y ~ x2, data = dat.A)
summary(mod.A)
mod.B = lm(y ~ x2, data = dat.B)
summary(mod.B)
mod.C = lm(y ~ x2, data = dat.C)
summary(mod.C)
This is a lot of copying and pasting. Is there a way I can write out one line of code for each thing I want to do and loop over each dataset? Something like below, which I know is wrong but it's what I am trying to do:
for (i in datasets) {
[i].plot = plot(dat.[i]$y, dat.[i]$x2)
mod.[i] = lm(y ~ x2, data = dat[i])
}

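We can split the data into a list of data.frames and then loop over the list with lapply. Note that base plot() draws as a side effect and returns NULL, so only the model is worth returning:
lst1 <- split(dat, dat$x1)
lst2 <- lapply(lst1, function(d) {
  plot(d$y, d$x2)       # drawn as a side effect, nothing to store
  lm(y ~ x2, data = d)  # the fitted model is the return value
})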
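As a rough sketch of what you can do with the split list afterwards, here is a self-contained base R example. The data and the names `by_level`, `models` and `coef_table` are made up for illustration; only `split()`, `lapply()`, `sapply()` and `lm()` are assumed.

```r
set.seed(1)

# Made-up stand-in for the question's data
dat <- data.frame(
  y  = rnorm(60),
  x1 = factor(rep(c("A", "B", "C"), each = 20)),
  x2 = sample(c(5, 10, 27), size = 60, replace = TRUE)
)

# One data frame per level of x1; the list is named after the levels
by_level <- split(dat, dat$x1)

# Fit the same model to every piece
models <- lapply(by_level, function(d) lm(y ~ x2, data = d))

# Collect the coefficients into one matrix, one row per level
coef_table <- t(sapply(models, coef))

# Any other per-model function can be applied the same way
summaries <- lapply(models, summary)
```

Because the list keeps the level names, a single model is available as `models$A` or `models[["B"]]`, so nothing has to live in the global environment under a separate name.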

For completeness' sake, here's how I would do this in the tidyverse, producing two lists: one with the plots and one with the models.
library(dplyr)
library(ggplot2)
model_list <- dat %>%
  group_by(x1) %>%
  group_map(~ lm(y ~ x2, data = .x))
plot_list <- dat %>%
  group_by(x1) %>%
  group_map(~ ggplot(.x, aes(x2, y)) + geom_point())

Related

Writing for loop in R over imputations

I basically have the same sequence of code that I want to repeat for a list of numbers from 1 through 10. In Stata, I would do foreach num in numlist 1 2 3 4 5 6 7 8 9 10 { and this would be straightforward. But in R, I'm not quite sure how to execute it.
So this code...
d1 <- read_dta("C:/Users/Folder/imputation_1.dta")
d1$race <- factor(d1$race)
d1$educ <- factor(d1$educ)
psm_1 <- weightit(trtmnt ~ race + education + gender,
                  data = d1,
                  method = "psm",
                  estimand = "ATT")
d1$psm_weights <- psm_1$weights
write_dta(d1, "C:/Users/Folder/weighted_1.dta")
...I just want to repeat that while replacing the "1" with a "2", and then a "3", and so on. I could just repeat the same code and do that manually (like below) but there must be a way to loop through efficiently.
d2 <- read_dta("C:/Users/Folder/imputation_2.dta")
d2$race <- factor(d2$race)
d2$educ <- factor(d2$educ)
psm_2 <- weightit(trtmnt ~ race + education + gender,
                  data = d2,
                  method = "psm",
                  estimand = "ATT")
d2$psm_weights <- psm_2$weights
write_dta(d2, "C:/Users/Folder/weighted_2.dta")
I tried following this: https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html but it doesn't seem to be exactly what I need (or I just don't fully understand it).
Here is a suggestion, with i running over 1, 2, 3 (use 1:10 to cover all ten files):
library(haven)     # read_dta(), write_dta()
library(WeightIt)  # weightit()
d <- list()
psm <- list()
for (i in 1:3) {
  d[[i]] <- read_dta(paste0("C:/Users/Folder/imputation_", i, ".dta"))
  d[[i]]$race <- factor(d[[i]]$race)
  d[[i]]$educ <- factor(d[[i]]$educ)
  psm[[i]] <- weightit(trtmnt ~ race + education + gender,
                       data = d[[i]],
                       method = "psm",
                       estimand = "ATT")
  d[[i]]$psm_weights <- psm[[i]]$weights
  write_dta(d[[i]], paste0("C:/Users/Folder/weighted_", i, ".dta"))
}
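The same read, transform, write pattern can also be wrapped in a function and mapped over the indices. The sketch below is only an illustration: it uses throwaway CSV files in a temporary folder instead of the .dta files, and `process_one`, the file names and the `scaled` column are hypothetical placeholders for the real weighting step.

```r
# Hypothetical stand-in files so the sketch runs anywhere
folder <- file.path(tempdir(), "imputations")
dir.create(folder, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = 1:5, value = i * (1:5)),
            file.path(folder, sprintf("imputation_%d.csv", i)),
            row.names = FALSE)
}

# Everything that depends on the index lives inside one function
process_one <- function(i) {
  d <- read.csv(file.path(folder, sprintf("imputation_%d.csv", i)))
  d$scaled <- d$value / max(d$value)  # placeholder for the weighting step
  out <- file.path(folder, sprintf("weighted_%d.csv", i))
  write.csv(d, out, row.names = FALSE)
  out                                 # return the path that was written
}

# Run it for every index and keep the output paths
written <- vapply(1:3, process_one, character(1))
```

Keeping the body in a function makes it easy to test on a single index, e.g. `process_one(1)`, before running the whole sequence.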

How to test all subsets of predictor variables in R

I would like to programmatically build GLMs in R, similarly to what's described here (How to build and test multiple models in R), except testing all possible subsets of predictor variables instead.
So, for a dataset like this, with outcome variable z:
data <- data.frame("z" = rnorm(20, 15, 3),
                   "a" = rnorm(20, 20, 3),
                   "b" = rnorm(20, 25, 3),
                   "c" = rnorm(20, 5, 1))
is there a way to automate building the models:
m1 <- glm(z ~ a, data = data)
m2 <- glm(z ~ b, data = data)
m3 <- glm(z ~ c, data = data)
m4 <- glm(z ~ a + b, data = data)
m5 <- glm(z ~ a + c, data = data)
m6 <- glm(z ~ b + c, data = data)
m7 <- glm(z ~ a + b + c, data = data)
I know the dredge function of the MuMIn package can do this, but I got an error saying that I was including too many variables, so I'm looking for ways to do this independently of dredge. I've tried expand.grid() and combn(), map() and lapply() variants of answers I've found on StackOverflow and can't seem to piece this together. Ideally, model output, including BIC, would be stored in a sortable data frame.
Any help would be greatly appreciated!!
Assuming you have taken note of @Maurits Evers' comment, you can achieve what you want with a combination of lapply and combn:
cols <- names(data)[-1]
lapply(seq_along(cols), function(x) {
  combn(cols, x, function(y) glm(reformulate(y, "z"), data = data), simplify = FALSE)
})
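To get the sortable table with BIC that the question asks for, one possible approach is to flatten the subsets into a single list first and then build a data frame from the fitted models. This is a sketch on simulated data; `subsets`, `models` and `bic_table` are names chosen here for illustration.

```r
set.seed(1)
data <- data.frame(z = rnorm(20, 15, 3),
                   a = rnorm(20, 20, 3),
                   b = rnorm(20, 25, 3),
                   c = rnorm(20, 5, 1))
cols <- names(data)[-1]

# All non-empty subsets of the predictors, as one flat list of character vectors
subsets <- unlist(
  lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
  recursive = FALSE
)

# One glm per subset
models <- lapply(subsets, function(v) glm(reformulate(v, response = "z"), data = data))

# One row per model, sorted by BIC (smallest first)
bic_table <- data.frame(
  formula = vapply(subsets, function(v) paste("z ~", paste(v, collapse = " + ")), character(1)),
  n_terms = lengths(subsets),
  BIC     = vapply(models, BIC, numeric(1))
)
bic_table <- bic_table[order(bic_table$BIC), ]
```

With p predictors this fits 2^p - 1 models, so the approach is only practical for a moderate number of variables.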

Iterate over list and append in order to do a regression in R

I know this kind of question probably exists somewhere, but I couldn't find it. I have the variables a, b, c, d and I want to write a loop that runs a regression, appends the next variable, and runs the regression again with the additional variable:
lm(Y ~ a, data = data), then
lm(Y ~ a + b, data = data), then
lm(Y ~ a + b + c, data = data) etc.
How would you do this?
Using paste and as.formula, example using mtcars dataset:
myFits <- lapply(2:ncol(mtcars), function(i) {
  x <- as.formula(paste("mpg",
                        paste(colnames(mtcars)[2:i], collapse = "+"),
                        sep = "~"))
  lm(formula = x, data = mtcars)
})
Note: this looks like a duplicate post. I have seen a better solution for this type of question but cannot find it at the moment.
You could do this with a lapply / reformulate approach.
formulae <- lapply(ivars, function(x) reformulate(x, response="Y"))
lapply(formulae, function(x) summary(do.call("lm", list(x, quote(dat)))))
Data
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4, dimnames=list(NULL, c("Y", letters[1:3]))))
ivars <- sapply(1:3, function(x) letters[1:x]) # create an example list of independent-variable sets
A plain for loop also works:
vars = c('a', 'b', 'c', 'd')
# might want to use a subset of names(data) instead of
# manually typing the names
reg_list = list()
for (i in seq_along(vars)) {
  my_formula = as.formula(sprintf('Y ~ %s', paste(vars[1:i], collapse = " + ")))
  reg_list[[i]] = lm(my_formula, data = data)
}
You can then inspect an individual result with, e.g., summary(reg_list[[2]]) (for the 2nd one).
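For a version that runs without any extra data, here is a small sketch of the same cumulative idea on the built-in mtcars data, using reformulate() rather than pasting strings. The choice of predictors and the names `sets`, `fits` and `r2` are only illustrative.

```r
vars <- c("cyl", "disp", "hp")

# Cumulative predictor sets: "cyl", then "cyl" + "disp", then all three
sets <- lapply(seq_along(vars), function(i) vars[seq_len(i)])

# One lm per cumulative set, with mpg as the response
fits <- lapply(sets, function(v) lm(reformulate(v, response = "mpg"), data = mtcars))

# Compare the nested models, e.g. by R-squared
r2 <- vapply(fits, function(m) summary(m)$r.squared, numeric(1))
```

Since each model nests the previous one, `r2` cannot decrease along the sequence, which is a quick sanity check that the formulas were built in the intended order.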

Extracting residual values from lavaan list matrices in R

I am using lavaan package and my intention is to get my model residuals as dataframes for further use. I run several models that have grouping variables. Here's the basic workflow:
require(lavaan)
df <- data.frame(
  y1 = sample(1:100),
  y2 = sample(1:100),
  x1 = sample(1:100),
  x2 = sample(1:100),
  x3 = sample(1:100),
  grpvar = sample(c("grp1","grp2"), 100, replace = T))
semModel <- list(length = 2)
semModel[1] <- 'y1 ~ c(a,b)*x1 + c(a,b)*x2'
semModel[2] <- 'y1 ~ c(a,b)*x1
y2 ~ c(a,b)*x2 + c(a,b)*x3'
funEstim <- function(model) {
  sem(model, data = df, group = "grpvar", estimator = "MLM")
}
fits <- lapply(semModel, funEstim)
residuals <- lapply(fits, function(x) resid(x, "obs"))
Now the resulting residuals object bugs me. It is a list of matrices nested a few levels deep. How do I get each of the matrices as a separate data frame without any hardcoding? I don't want to unlist them, as that would lose some information.
You can use list2env along with unlist to make the grp1, grp2, length.grp1, and length.grp2 directly available in the global environment.
list2env(unlist(residuals, recursive=FALSE), envir=.GlobalEnv)
ls()
#[1] "df" "fits" "funEstim" "grp1" "grp2"
#[6] "length.grp1" "length.grp2" "residuals" "semModel"
But they won't be data frames. For that you could convert them to data frames before calling list2env:
df.list <- lapply(unlist(residuals, recursive=FALSE), data.frame)
list2env(df.list, envir=.GlobalEnv)
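If you would rather not fill the global environment, the flattened list can simply be kept as a named list of data frames. The sketch below uses a made-up nested list of matrices (`nested`) as a stand-in for the lavaan output, so the element names and shapes are assumptions, not what resid() actually returns.

```r
# Made-up stand-in: two models, each with one matrix per group
nested <- list(
  model1 = list(grp1 = matrix(1:4, 2), grp2 = matrix(5:8, 2)),
  model2 = list(grp1 = matrix(0, 3, 3), grp2 = matrix(1, 3, 3))
)

# Remove one level of nesting; names are joined as "model.group"
flat <- unlist(nested, recursive = FALSE)

# Convert every matrix to a data frame, keeping the names
df_list <- lapply(flat, as.data.frame)

names(df_list)
```

Each data frame is then reachable by name, e.g. `df_list[["model2.grp1"]]`, and the whole set can still be passed to lapply() for further processing.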

List Indexing in R over a loop

I'm new to using lists in R and am trying to run a loop over various data frames that stores multiple models for each frame. I would like the models that correspond to a given data frame within the first index of the list; e.g. [[i]][1], [[i]][2]. The following example overwrites the list:
f1 <- data.frame(x = seq(1:6), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = seq(6,11), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1,f2)
fit <- list()
for (i in 1:length(data.frames)) {
  fit[[i]] <- lm(y ~ x, data = data.frames[[i]])
  fit[[i]] <- lm(y ~ x + z, data = data.frames[[i]])
}
Any idea how to set up the list or the indexing in the loop such that it generates an output that has the two models for the first frame referenced as [[1]][1] and [[1]][2] and the second frame as [[2]][1] and [[2]][2]? Thanks for any and all help.
Calculate both models in a single lapply call applied to each part of the data.frames list:
lapply(data.frames, function(i) {
  list(lm(y ~ x, data = i),
       lm(y ~ x + z, data = i))
})
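A small variation, sketched here on made-up data, is to name both levels of the list so that models can be looked up by name as well as by position. The names `data_frames`, `formulas`, `simple` and `full` are invented for this example.

```r
set.seed(1)
f1 <- data.frame(x = 1:6,  y = rnorm(6), z = rnorm(6))
f2 <- data.frame(x = 6:11, y = rnorm(6), z = rnorm(6))

data_frames <- list(first = f1, second = f2)
formulas    <- list(simple = y ~ x, full = y ~ x + z)

# Outer lapply: one entry per data frame; inner lapply: one model per formula
fit <- lapply(data_frames, function(d) {
  lapply(formulas, function(f) lm(f, data = d))
})

# Second model for the first data frame, by position
coef(fit[[1]][[2]])

# First model for the second data frame, by name
coef(fit$second$simple)
```

Note the double brackets on the inner index: `fit[[1]][2]` returns a length-one list that contains the model, while `fit[[1]][[2]]` returns the model itself.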
