Writing for loop in R over imputations

I basically have the same sequence of code that I want to repeat for a list of numbers from 1 through 10. In Stata, I would do foreach num of numlist 1 2 3 4 5 6 7 8 9 10 { and this would be straightforward. But in R, I'm not quite sure how to execute it.
So this code...
d1 <- read_dta("C:/Users/Folder/imputation_1.dta")
d1$race <- factor(d1$race)
d1$educ <- factor(d1$educ)
psm_1 <- weightit(trtmnt ~ race + education + gender,
data = d1,
method = "psm",
estimand = "ATT")
d1$psm_weights <- psm_1$weights
write_dta(d1, "C:/Users/Folder/weighted_1.dta")
...I just want to repeat that while replacing the "1" with a "2", and then a "3", and so on. I could just repeat the same code and do that manually (like below) but there must be a way to loop through efficiently.
d2 <- read_dta("C:/Users/Folder/imputation_2.dta")
d2$race <- factor(d2$race)
d2$educ <- factor(d2$educ)
psm_2 <- weightit(trtmnt ~ race + education + gender,
data = d2,
method = "psm",
estimand = "ATT")
d2$psm_weights <- psm_2$weights
write_dta(d2, "C:/Users/Folder/weighted_2.dta")
I tried following this: https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html but it doesn't seem to be exactly what I need (or I just don't fully understand it).

Here is a suggestion. It loops over i = 1, 2, 3 (use 1:10 to cover all ten imputations):
library(haven)     # read_dta(), write_dta()
library(WeightIt)  # weightit()
d <- list()
psm <- list()
for (i in 1:3) {
  # build the file name from the loop index
  d[[i]] <- read_dta(paste0("C:/Users/Folder/imputation_", i, ".dta"))
  d[[i]]$race <- factor(d[[i]]$race)
  d[[i]]$educ <- factor(d[[i]]$educ)
  psm[[i]] <- weightit(trtmnt ~ race + education + gender,
                       data = d[[i]],
                       method = "psm",
                       estimand = "ATT")
  d[[i]]$psm_weights <- psm[[i]]$weights
  write_dta(d[[i]], paste0("C:/Users/Folder/weighted_", i, ".dta"))
}
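The same loop can also be written as one function applied to each index. Below is a minimal sketch of that pattern that runs without the .dta files: it assumes small made-up CSV files in a temporary folder, and it uses a plain logistic propensity model as a stand-in for weightit(). The function name process_one and the fake data are hypothetical, not part of the question.

```r
# Stand-in data: ten small CSV files in a temporary folder
# (hypothetical; the question uses Stata .dta files and read_dta())
dir <- tempdir()
set.seed(1)
for (i in 1:10) {
  fake <- data.frame(
    trtmnt = rbinom(200, 1, 0.4),
    race   = sample(c("a", "b", "c"), 200, replace = TRUE),
    gender = sample(c("f", "m"), 200, replace = TRUE)
  )
  write.csv(fake, file.path(dir, sprintf("imputation_%d.csv", i)),
            row.names = FALSE)
}

# One function holds everything that was copy-pasted per file
process_one <- function(i, dir) {
  d <- read.csv(file.path(dir, sprintf("imputation_%d.csv", i)))
  d$race <- factor(d$race)
  # Stand-in for weightit(): ATT weights from a logistic propensity model
  # (treated units get weight 1, controls get ps / (1 - ps))
  ps <- fitted(glm(trtmnt ~ race + gender, data = d, family = binomial))
  d$psm_weights <- ifelse(d$trtmnt == 1, 1, ps / (1 - ps))
  out <- file.path(dir, sprintf("weighted_%d.csv", i))
  write.csv(d, out, row.names = FALSE)
  out  # return the path that was written
}

# Apply the function to 1 through 10
out_files <- vapply(1:10, process_one, character(1), dir = dir)
```

With the real data, the body of process_one would contain the read_dta(), weightit() and write_dta() calls from the question instead of the stand-ins.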

Related

How can I loop a list of models to get slope estimate

I have a list of models as specified by the following code:
varlist <- list("PRS_Kunkle", "PRS_Kunkle_e07",
"PRS_Kunkle_e06","PRS_Kunkle_e05", "PRS_Kunkle_e04",
"PRS_Kunkle_e03", "PRS_Kunkle_e02", "PRS_Kunkle_e01",
"PRS_Kunkle_e00", "PRS_Jansen", "PRS_deroja_KANSL")
PRS_age_pacc3 <- lapply(varlist, function(x) {
lmer(substitute(z_pacc3_ds ~ i*AgeAtVisit + i*I(AgeAtVisit^2) +
APOE_score + gender + EdYears_Coded_Max20 +
VisNo + famhist + X1 + X2 + X3 + X4 + X5 +
(1 |family/DBID),
list(i=as.name(x))), data = WRAP_all, REML = FALSE)
})
I want to obtain the slope of PRS at different age points in each of the models. How can I write code to achieve this goal? Without a loop, the code would be:
test_stat1 <- simple_slopes(PRS_age_pacc3[[1]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat2 <- simple_slopes(PRS_age_pacc3[[2]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat3 <- simple_slopes(PRS_age_pacc3[[3]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat4 <- simple_slopes(PRS_age_pacc3[[4]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat5 <- simple_slopes(PRS_age_pacc3[[5]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat6 <- simple_slopes(PRS_age_pacc3[[6]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat7 <- simple_slopes(PRS_age_pacc3[[7]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat8 <- simple_slopes(PRS_age_pacc3[[8]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat9 <- simple_slopes(PRS_age_pacc3[[9]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat10 <- simple_slopes(PRS_age_pacc3[[10]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat11 <- simple_slopes(PRS_age_pacc3[[11]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))

library(lme4)
library(reghelper)
set.seed(101)
## add an additional factor variable so we can use it for an interaction
sleepstudy$foo <- factor(sample(LETTERS[1:3], size = nrow(sleepstudy),
replace = TRUE))
m1 <- lmer(Reaction ~ Days*foo + I(Days^2)*foo + (1|Subject), data = sleepstudy)
s1 <- simple_slopes(m1, levels=list(Days = c(5, 10, 15)))
Looking at these results, s1 is a data frame with 6 rows (number of levels of foo × number of Days values specified) and 5 columns (Days, foo, estimate, std error, t value).
The simplest way to do this:
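res <- list()
for (i in seq_along(varlist)) {
  res[[i]] <- simple_slopes(model_list[[i]], ...) ## add appropriate args here
}
## rows per model; take this before collapsing (assumes the same number for every model)
n_per_model <- nrow(res[[1]])
res <- do.call("rbind", res) ## collapse elements to a single data frame
## add an identifier column (varlist is a list, so unlist it first)
res_final <- data.frame(model = rep(unlist(varlist), each = n_per_model), res)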
If you want to be fancier, you could replace the for loop with an appropriate lapply. If you want to be even fancier than that:
library(tidyverse)
(model_list
%>% setNames(varlist)
## map_dfr runs the function on each element, collapses results to
## a single data frame. `.id="model"` adds the names of the list elements
## (set in the previous step) as a `model` column
%>% purrr::map_dfr(simple_slopes, ... <extra args here>, .id = "model")
)
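For the lapply route, here is a minimal base-R sketch. It assumes stand-in lm fits on the built-in mtcars data and a hypothetical helper get_table() in place of the lmer models and the simple_slopes() call, so that it runs without lme4 or reghelper.

```r
# Stand-in models: one lm fit per predictor on the built-in mtcars data
# (hypothetical substitutes for the lmer models in the question)
varlist <- c("wt", "hp", "disp")
model_list <- lapply(varlist, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(model_list) <- varlist

# Per-model result as a data frame; with the real models this would be
# something like simple_slopes(m, levels = ...)
get_table <- function(m) {
  tab <- as.data.frame(coef(summary(m)))
  tab$term <- rownames(tab)
  rownames(tab) <- NULL
  tab
}

res <- lapply(model_list, get_table)

# Tag each table with its model name, then stack them into one data frame
res <- Map(function(tab, nm) cbind(model = nm, tab), res, names(res))
res_final <- do.call(rbind, res)
rownames(res_final) <- NULL
```

Tagging each table before stacking means the tables do not all need the same number of rows.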
By the way, I would be very careful with simple_slopes when you have a quadratic term in the model as well. The slopes calculated will (presumably) apply only in the case where any other continuous variables in the model are zero. You might want to center your variables as in Schielzeth 2010 Methods in Ecology and Evolution ("Simple means to improve ...")
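To illustrate the centering point, here is a small sketch on simulated data (the variable names and numbers are made up). In a model with an interaction, the coefficient on prs is the slope of prs where the other variable is zero; centering age moves that reference point to the mean age without changing the fit.

```r
set.seed(42)
n   <- 200
age <- runif(n, 55, 80)
prs <- rnorm(n)
y   <- 1 + 0.5 * prs - 0.02 * prs * (age - 65) + rnorm(n, sd = 0.1)

raw <- lm(y ~ prs * age)

age_c    <- age - mean(age)        # centered age
centered <- lm(y ~ prs * age_c)

coef(raw)["prs"]       # slope of prs at age 0, far outside the data
coef(centered)["prs"]  # slope of prs at the mean age
```

The two fits give identical fitted values; only the meaning of the prs coefficient changes.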

for inside foreach parallel not populating a dataframe in R

I am having an issue populating a data frame inside a foreach. Suppose I have the following dataframe; its structure is exactly what my real one looks like:
Elec2 <- rep(rep(rep(27:1, each = 81), each = 18), times = 100)
Ind <- rep(1:18, times = 218700)
Cond <- rep(1:3, times = 1312200)
Trial <- rep(rep(1:100, each = 2187), each = 18)
DVAR <- rbeta (3936600, 0.7, 1,5)
data <- cbind(DVAR, Ind, Cond, Trial, Elec1, Elec2)
I am trying the following code of parallelisation:
distinct_pairs <-
data %>%
select(Elec1, Elec2) %>%
distinct()
cl <- makeCluster(2) #values here are adjusted to cores, used 2 for the example
registerDoParallel(cl)
output <- foreach (i = 1:nrow(distinct_pairs), .packages='glmmTMB',
.combine = rbind,
.errorhandling="pass",
.verbose = T) %dopar% {
dep <- distinct_pairs[i,]
dat1 <- subset(data, dep$Elec1 == data$Elec1 & dep$Elec2 == data$Elec2)
df[i,]$Elec1 <- dep[i,]$Elec1
df[i,]$Elec2 <- dep[i,]$Elec2
for (j in 1:18) { #By individual
dat2 <- subset(dat1, dat1$Ind==j)
model <- glmmTMB(DVAR ~ Cond, family=beta_family('logit'), data=dat2)
results <- summary(model)
est <- results$coefficients$cond[2,1]
ste <- results$coefficients$cond[2,2]
df[j,] <- c(est,ste)
}
return(df)
}
output <- as.data.frame(output, row.names = FALSE)
As you can see, I am expecting a dataframe with the results of the iterations (est and ste) plus the identification of the electrodes (Elec1 and Elec2). If I run the lines independently one by one it seems to work fine, but I cannot make it work the way I expect.
The first loop takes a pair of electrodes; every row in distinct_pairs is a pair of electrodes, numbered from 1 to 27 for Elec1 and for Elec2.
The problem is that I am unable to get the data from the for loop written into the final output dataframe.
I am sure the problem is pretty basic, but I appreciate any insight as I seem to be missing something.
Thanks!
[[UPDATE WITH SOLUTION]]
In case anyone is interested, here is the solution.
output <- foreach (i = 1:10, .packages='glmmTMB',
.combine = rbind,
.errorhandling="pass",
.inorder = TRUE,
.verbose = T) %dopar% {
dat1 <- subset(data, distinct_pairs[i,]$Elec1 == data$Elec1 & distinct_pairs[i,]$Elec2 == data$Elec2)
df <- data.frame('Elec1'=rep(distinct_pairs[i,]$Elec1,18),'Elec2'=rep(distinct_pairs[i,]$Elec2,18),'est'=rep(NA,18),'ste'=rep(NA,18))
for (j in 1:18) {
dat2 <- subset(dat1, dat1$Ind==j)
model <- glmmTMB(DVAR ~ Cond, family=beta_family('logit'), data=dat2)
results <- summary(model)
est <- results$coefficients$cond[2,1]
ste <- results$coefficients$cond[2,2]
df[j,c('est','ste')] <- c(est,ste)
}
return(df)
}
Which returns exactly what I was looking for:
> head(output)
  Elec1 Elec2          est        ste
1     1     1  0.034798615 0.03530296
2     1     1 -0.005363760 0.03392442
3     1     1 -0.017349123 0.03404430
4     1     1 -0.034819068 0.03196078
5     1     1  0.002301062 0.03163825
6     1     1  0.003575131 0.03452420

I am not entirely sure I understand the problem. Could you also include Elec1 in your example data?
One idea:
foreach might not find df. You could create the data frame at the beginning of the loop body with something like
df <- data.frame('Elec1'=rep(NA,18),'Elec2'=rep(NA,18),'est'=rep(NA,18),'ste'=rep(NA,18))
and then, inside the for loop, fill it with: df[j,c('est','ste')] <- c(est,ste)
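A minimal sketch of that idea follows, using a sequential lapply as a stand-in for foreach ... %dopar% and lm on simulated data as a stand-in for glmmTMB. The names pairs, n_ind and one_pair are hypothetical. The point is only that each iteration creates its own df, fills it, and returns it, and the pieces are combined with rbind afterwards.

```r
set.seed(7)
pairs <- expand.grid(Elec1 = 1:2, Elec2 = 1:2)  # stand-in for distinct_pairs
n_ind <- 3                                      # stand-in for the 18 individuals

one_pair <- function(i) {
  # Created inside the iteration, so it exists wherever the body runs
  df <- data.frame(Elec1 = rep(pairs$Elec1[i], n_ind),
                   Elec2 = rep(pairs$Elec2[i], n_ind),
                   est   = NA_real_,
                   ste   = NA_real_)
  for (j in seq_len(n_ind)) {
    # Simulated data and lm() stand in for the real subset and glmmTMB()
    x  <- rep(1:3, 10)
    y  <- 0.1 * x + rnorm(30)
    cf <- coef(summary(lm(y ~ x)))
    df[j, c("est", "ste")] <- cf[2, 1:2]  # estimate and standard error of x
  }
  df  # the value of the iteration, which is what gets combined
}

output <- do.call(rbind, lapply(seq_len(nrow(pairs)), one_pair))
```

In the foreach version, the body of one_pair would sit inside the %dopar% braces and .combine = rbind would play the role of do.call(rbind, ...).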

R - Perform the same operations to many data sets

Apologies if this is a repeat question, if the answer exists somewhere I would appreciate being pointed to it.
I have a large data frame with many factors, mix of categorical and continuous. Here is a shortened example:
x1 = sample(x = c("A", "B", "C"), size = 50, replace = TRUE)
x2 = sample(x = c(5, 10, 27), size = 50, replace = TRUE)
y = rnorm(50, mean=0)
dat = as.data.frame(cbind(y, x1, x2))
dat$x2 = as.numeric(dat$x2)
dat$y = as.numeric(dat$y)
> head(dat)
   y x1 x2
1  9  C  2
2  7  C  2
3  8  B  1
4 21  A  2
5 48  A  1
6 19  A  3
I want to subset this dataset by each level of x1, so I end up with 3 new datasets, one for each level of factor x1. I can do this the following way:
#A
dat.A = dat[which(dat$x1== "A"),,drop=T]
dat.A$x1 = factor(dat.A$x1)
#B
dat.B = dat[which(dat$x1== "B"),,drop=T]
dat.B$x1 = factor(dat.B$x1)
#C
dat.C = dat[which(dat$x1== "C"),,drop=T]
dat.C$x1 = factor(dat.C$x1)
This is somewhat tedious as my real data have 7 levels of the factor of interest so I have to repeat the code 7 times. Once I have each new data frame in my global environment, I want to perform several functions to each one (graphing, creating tables, fitting linear models). Here is a simple example:
#same plot for each dataset
A.plot = plot(dat.A$y, dat.A$x2)
B.plot = plot(dat.B$y, dat.B$x2)
C.plot = plot(dat.C$y, dat.C$x2)
#same models for each dataset
mod.A = lm(y ~ x2, data = dat.A)
summary(mod.A)
mod.B = lm(y ~ x2, data = dat.B)
summary(mod.B)
mod.C = lm(y ~ x2, data = dat.C)
summary(mod.C)
This is a lot of copying and pasting. Is there a way I can write out one line of code for each thing I want to do and loop over each dataset? Something like below, which I know is wrong but it's what I am trying to do:
for (i in datasets) {
[i].plot = plot(dat.[i]$y, dat.[i]$x2)
mod.[i] = lm(y ~ x2, data = dat[i])
}

We can split the data into a list of data.frames and then loop over the list with lapply:
lst1 <- split(dat, dat$x1)
lst2 <- lapply(lst1, function(dat) {
  plt <- plot(dat$y, dat$x2)  # base plot() draws as a side effect and returns NULL
  model <- lm(y ~ x2, data = dat)
  list(plt, model)
})
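Once the models live in a named list, pulling the same quantity out of each is another one-liner. A small sketch on made-up data shaped like the example in the question (the object names by_level, models, coefs and r2 are just for illustration):

```r
set.seed(1)
dat <- data.frame(
  y  = rnorm(60),
  x1 = sample(c("A", "B", "C"), 60, replace = TRUE),
  x2 = sample(c(5, 10, 27), 60, replace = TRUE)
)

by_level <- split(dat, dat$x1)                         # one data frame per level
models   <- lapply(by_level, function(d) lm(y ~ x2, data = d))

coefs <- t(sapply(models, coef))                       # one row per level
r2    <- sapply(models, function(m) summary(m)$r.squared)

summary(models[["B"]])                                 # full summary for one level
```

The list names come from the levels of x1, so models[["A"]] replaces mod.A and nothing has to be created by hand in the global environment.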

For completeness' sake, here's how I would do this in the tidyverse, producing two lists: one with the plots and one with the models.
library(dplyr)
library(ggplot2)
model_list <- dat %>%
group_by(x1) %>%
group_map( ~ lm(y ~ x2, data = .x))
plot_list <- dat %>%
group_by(x1) %>%
group_map( ~ ggplot(.x, aes(x2, y)) + geom_point())

Do I need Indicators for Regression with Categorical Variables?

It is always said that we need to create indicator (predictor) variables for categorical values in order to run the regression. I ran a test: first I created a predictor column coded 1, 2, 3, ... for a five-level categorical variable. Then I ran the same model without the predictor column, using the categorical column itself.
In conclusion, the coefficients are different; however, their relative importance and effect on the y-value are the same. Moreover R-squared and p-value numbers are exactly the same in these two cases. So, do I have to create the predictor variable, or is R smart enough to do it automatically?
for(i in 1:74)
{
if(travel$accommodation[i] == "Hotel")
{
travel$pred_hotel[i] <- 1
}
if(travel$accommodation[i] == "Airbnb")
{
travel$pred_hotel[i] <- 2
}
if(travel$accommodation[i] == "Hostel")
{
travel$pred_hotel[i] <- 3
}
if(travel$accommodation[i] == "With friend/family")
{
travel$pred_hotel[i] <- 4
}
if(travel$accommodation[i] == "Other")
{
travel$pred_hotel[i] <- 5
}
}
travel$pred_hotel <- as.factor(travel$pred_hotel)
Then:
msf <- lm(ticket_events_money ~ museum_fee + nationality +
ticket_events_frequency + accommodation + line + activity_1 +
locals + vacation_days, data = travel[-1, ])
mm <- lm(ticket_events_money ~ museum_fee + nationality +
ticket_events_frequency + pred_hotel + line + activity_1 +
locals + vacation_days, data = travel[-1, ])
summary(msf)
summary(mm)

The point is that accommodation is originally a character column, while your new variable pred_hotel is a factor. lm automatically converts a character covariate into a factor, so in your test the only difference is in the factor levels; everything else is the same. If you want to see a difference, remove the as.factor line.
Another common mistake is shown in the following minimal, reproducible example.
dat <- data.frame(y = rnorm(20), x = rep(letters[1:2], 10), stringsAsFactors = FALSE)
m1 <- lm(y ~ x, dat)
dat$x[dat$x == 'a'] <- 1
dat$x[dat$x == 'b'] <- 2
class(dat$x) # still a character column!!
m2 <- lm(y ~ x, dat)
But you will see a difference if you use a real numeric column:
dat$x <- as.numeric(dat$x)
m3 <- lm(y ~ x, dat)
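With only two levels, the numeric and factor codings give the same fitted values and only the intercept shifts, so the contrast is easier to see with three levels. A minimal sketch on simulated data (column names x_fac and x_num are made up for the illustration):

```r
set.seed(3)
dat <- data.frame(y = rnorm(30), x = rep(c("a", "b", "c"), 10),
                  stringsAsFactors = FALSE)

m_chr <- lm(y ~ x, dat)              # character column: lm builds the dummies

dat$x_fac <- factor(dat$x)
m_fac <- lm(y ~ x_fac, dat)          # explicit factor: same model

dat$x_num <- as.numeric(dat$x_fac)   # 1, 2, 3 treated as a number
m_num <- lm(y ~ x_num, dat)          # a single slope, a different model

colnames(model.matrix(m_chr))        # indicator columns created automatically
length(coef(m_chr))                  # intercept plus two indicators
length(coef(m_num))                  # intercept plus one slope
```

So there is no need to build indicator columns by hand; what matters is that the variable is a character or factor column rather than a number.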

R: Replacing a for-loop with an apply function

I managed to apply a linear regression for each subject of my data frame and paste the values into a new dataframe using a for-loop. However, I think there should be a more readable way of achieving my result using an apply function, but all my attempts fail. This is how I do it:
numberOfFiles <- length(resultsHick$subject)
intslop <- data.frame(matrix(0,numberOfFiles,4))
intslop <- rename(intslop,
subject = X1,
intercept = X2,
slope = X3,
Rsquare = X4)
cond <- c(0:3)
allSubjects <- resultsHick$subject
for (i in allSubjects)
{intslop[i,1] <- i
yvalues <- t(subset(resultsHick,
subject == i,
select = c(H0meanRT, H1meanRT, H2meanRT, H258meanRT)))
fit <- lm(yvalues ~ cond)
intercept <- fit$coefficients[1]
slope <- fit$coefficients[2]
rsquared <- summary(fit)$r.squared
intslop[i,2] <- intercept
intslop[i,3] <- slope
intslop[i,4] <- rsquared
}
The result should look the same as
> head(intslop)
subject intercept slope Rsquare
1 1 221.3555 54.98290 0.9871209
2 2 259.4947 66.33344 0.9781499
3 3 227.8693 47.28699 0.9537868
4 4 257.7355 80.71935 0.9729132
5 5 197.4659 49.57882 0.9730409
6 6 339.1649 61.63161 0.8213179
...
Does anybody know a more readable way of writing this code using an apply function?

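One common pattern I use to replace for loops that aggregate data.frames is the following (do_some_work is a placeholder for your own per-index function):
do.call(
  rbind,
  lapply(seq_len(numberOfDataFrames),
    FUN = function(i) {
      message("Processing index: ", i)  # helpful to see how slow/fast
      temp_df <- do_some_work(i)        # placeholder: build a data.frame for index i
      temp_df$intercept <- 1            # placeholder: fill in your computed columns, etc.
      return(temp_df)  # key is to return a data.frame for each index.
    }
  )
)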
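Applied to the question, the pattern might look like the sketch below. It assumes a small simulated resultsHick with the same column names as in the question, and fit_one is a hypothetical helper name.

```r
set.seed(11)
cond <- 0:3

# Simulated stand-in for resultsHick: 5 subjects, 4 mean RTs each
rt <- t(sapply(1:5, function(s) 220 + 55 * cond + rnorm(4, sd = 10)))
colnames(rt) <- c("H0meanRT", "H1meanRT", "H2meanRT", "H258meanRT")
resultsHick <- data.frame(subject = 1:5, rt)

# Fit one subject and return a one-row data frame
fit_one <- function(i) {
  yvalues <- unlist(resultsHick[resultsHick$subject == i, colnames(rt)])
  fit <- lm(yvalues ~ cond)
  data.frame(subject   = i,
             intercept = unname(coef(fit)[1]),
             slope     = unname(coef(fit)[2]),
             Rsquare   = summary(fit)$r.squared)
}

intslop <- do.call(rbind, lapply(resultsHick$subject, fit_one))
```

Because every call returns a complete row, there is no need to preallocate intslop or to index into it by subject number.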
