R: Replacing a for-loop with an apply function

I managed to apply a linear regression for each subject of my data frame and paste the values into a new dataframe using a for-loop. However, I think there should be a more readable way of achieving my result using an apply function, but all my attempts fail. This is how I do it:
library(dplyr)  # provides rename()
numberOfFiles <- length(resultsHick$subject)
intslop <- data.frame(matrix(0, numberOfFiles, 4))
intslop <- rename(intslop,
                  subject = X1,
                  intercept = X2,
                  slope = X3,
                  Rsquare = X4)
cond <- c(0:3)
allSubjects <- resultsHick$subject
for (i in allSubjects) {
  intslop[i, 1] <- i
  # the four mean RTs of subject i, as a column
  yvalues <- t(subset(resultsHick,
                      subject == i,
                      select = c(H0meanRT, H1meanRT, H2meanRT, H258meanRT)))
  fit <- lm(yvalues ~ cond)
  intercept <- fit$coefficients[1]
  slope <- fit$coefficients[2]
  rsquared <- summary(fit)$r.squared
  intslop[i, 2] <- intercept
  intslop[i, 3] <- slope
  intslop[i, 4] <- rsquared
}
The result should look the same as
> head(intslop)
subject intercept slope Rsquare
1 1 221.3555 54.98290 0.9871209
2 2 259.4947 66.33344 0.9781499
3 3 227.8693 47.28699 0.9537868
4 4 257.7355 80.71935 0.9729132
5 5 197.4659 49.57882 0.9730409
6 6 339.1649 61.63161 0.8213179
...
Does anybody know a more readable way of writing this code using an apply function?

One common pattern I use to replace for loops that aggregate data.frames is:
do.call(
  rbind,
  lapply(seq_len(numberOfDataFrames),
    FUN = function(i) {
      print(paste("Processing index:", i)) # helpful to see how slow/fast
      temp_df <- do_some_work(i)  # placeholder for your own per-index computation
      temp_df$intercept <- 1      # placeholder; add the other columns the same way
      return(temp_df) # key is to return a data.frame for each index.
    }
  )
)
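Applied to the question, the pattern above might look like the following sketch. The data here are simulated stand-ins for resultsHick (only the column names come from the question), and fit_one and rt_cols are names made up for this illustration.

```r
# Simulated stand-in for resultsHick (hypothetical values; column names from the question)
set.seed(1)
resultsHick <- data.frame(
  subject    = 1:6,
  H0meanRT   = rnorm(6, 250, 20),
  H1meanRT   = rnorm(6, 300, 20),
  H2meanRT   = rnorm(6, 360, 20),
  H258meanRT = rnorm(6, 420, 20)
)
cond <- 0:3
rt_cols <- c("H0meanRT", "H1meanRT", "H2meanRT", "H258meanRT")

# Fit one regression for subject i and return a one-row data.frame
fit_one <- function(i) {
  yvalues <- unlist(resultsHick[resultsHick$subject == i, rt_cols])
  fit <- lm(yvalues ~ cond)
  data.frame(
    subject   = i,
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    Rsquare   = summary(fit)$r.squared
  )
}

# One row per subject, stacked into a single data.frame
intslop <- do.call(rbind, lapply(resultsHick$subject, fit_one))
head(intslop)
```

Because each call returns a complete row, nothing has to be preallocated or indexed by subject number.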

Related

How can I loop a list of models to get slope estimate

I have a list of models as specified by the following code:
varlist <- list("PRS_Kunkle", "PRS_Kunkle_e07",
"PRS_Kunkle_e06","PRS_Kunkle_e05", "PRS_Kunkle_e04",
"PRS_Kunkle_e03", "PRS_Kunkle_e02", "PRS_Kunkle_e01",
"PRS_Kunkle_e00", "PRS_Jansen", "PRS_deroja_KANSL")
PRS_age_pacc3 <- lapply(varlist, function(x) {
lmer(substitute(z_pacc3_ds ~ i*AgeAtVisit + i*I(AgeAtVisit^2) +
APOE_score + gender + EdYears_Coded_Max20 +
VisNo + famhist + X1 + X2 + X3 + X4 + X5 +
(1 |family/DBID),
list(i=as.name(x))), data = WRAP_all, REML = FALSE)
})
I want to obtain the slope of PRS at different age points in each of the models. How can I write code to achieve this? Without a loop, the raw code would be:
test_stat1 <- simple_slopes(PRS_age_pacc3[[1]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat2 <- simple_slopes(PRS_age_pacc3[[2]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat3 <- simple_slopes(PRS_age_pacc3[[3]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat4 <- simple_slopes(PRS_age_pacc3[[4]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat5 <- simple_slopes(PRS_age_pacc3[[5]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat6 <- simple_slopes(PRS_age_pacc3[[6]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat7 <- simple_slopes(PRS_age_pacc3[[7]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat8 <- simple_slopes(PRS_age_pacc3[[8]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat9 <- simple_slopes(PRS_age_pacc3[[9]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat10 <- simple_slopes(PRS_age_pacc3[[10]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
test_stat11 <- simple_slopes(PRS_age_pacc3[[11]], levels=list(AgeAtVisit=c(55,60,65,70,75,80)))
library(lme4)
library(reghelper)
set.seed(101)
## add an additional factor variable so we can use it for an interaction
sleepstudy$foo <- factor(sample(LETTERS[1:3], size = nrow(sleepstudy),
replace = TRUE))
m1 <- lmer(Reaction ~ Days*foo + I(Days^2)*foo + (1|Subject), data = sleepstudy)
s1 <- simple_slopes(m1, levels=list(Days = c(5, 10, 15)))
Looking at these results, s1 is a data frame with 6 rows (number of levels of foo × number of Days values specified) and 5 columns (Days, foo, estimate, std error, t value).
The simplest way to do this:
res <- list()
for (i in seq_along(varlist)) {
  res[[i]] <- simple_slopes(model_list[[i]], ...) ## add appropriate args here
}
## rows contributed by each model; must be computed before collapsing
n_rows <- vapply(res, nrow, integer(1))
res <- do.call("rbind", res) ## collapse elements to a single data frame
## add an identifier column
res_final <- data.frame(model = rep(unlist(varlist), times = n_rows), res)
If you want to be fancier, you could replace the for loop with an appropriate lapply. If you want to be even fancier than that:
library(tidyverse)
(model_list
%>% setNames(varlist)
## map_dfr runs the function on each element, collapses results to
## a single data frame. `.id="model"` adds the names of the list elements
## (set in the previous step) as a `model` column
%>% purrr::map_dfr(simple_slopes, ... <extra args here>, .id = "model")
)
By the way, I would be very careful with simple_slopes when you have a quadratic term in the model as well. The slopes calculated will (presumably) apply only in the case where any other continuous variables in the model are zero. You might want to center your variables as in Schielzeth 2010 Methods in Ecology and Evolution ("Simple means to improve ...")
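The lapply variant mentioned above can be sketched with base R only. This is an illustration under assumptions: the models are plain lm() fits on the built-in mtcars data, and coef_table is a hypothetical extractor standing in for simple_slopes(), so that the example runs without lme4 or reghelper.

```r
# Stand-in models: one lm() per predictor in the built-in mtcars data
varlist <- c("wt", "hp", "disp")
model_list <- lapply(varlist, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(model_list) <- varlist

# Hypothetical extractor standing in for simple_slopes():
# returns one row per coefficient of the fitted model
coef_table <- function(fit) {
  cf <- coef(summary(fit))
  data.frame(term = rownames(cf),
             estimate = cf[, "Estimate"],
             std_error = cf[, "Std. Error"],
             row.names = NULL)
}

# Run the extractor on every model, tag each result with its model name,
# then stack everything into one data frame
res <- lapply(model_list, coef_table)
res_final <- do.call(rbind, Map(function(d, nm) cbind(model = nm, d),
                                res, names(res)))
rownames(res_final) <- NULL
res_final
```

Tagging each piece before binding avoids having to know in advance how many rows each model contributes.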

How to speed up going through all independent variable combinations?

I want to write function combination_rsquare(y, data, factor_number) where
y - A vector - dependent variable
data - A data frame containing independent variables
factor_number - vector or numeric which tells how many elements in combination should be included.
Let's consider my function :
combination_rsquare <- function(y, data, factor_number = c(2, 3)) {
name_vec <- c()
r_sq <- c()
for (j in seq_along(factor_number)) {
# Defining combinations
comb_names <- combn(colnames(data), factor_number[j])
for (i in 1:ncol(comb_names)) {
#Append model r-squared for each combination
r_sq<- append(
r_sq,
summary(lm(y ~ ., data = data[comb_names[1:factor_number[j], i]]))$r.squared
)
# Create vector containing model names separated by "+"
name_vec <- append(
name_vec,
paste(comb_names[1:factor_number[j], i], collapse = "+")
)
}
}
data.frame(name_vec, r_sq)
}
Let's have a look how my function works on data :
Norm <- rnorm(100)
Unif <- runif(100)
Exp <- rexp(100)
Pois <- rpois(100,1)
Weib <- rweibull(100,1)
df <- data.frame(Unif, Exp, Pois, Weib)
combination_rsquare(Norm, df)
name_vec r_sq
1 Unif+Exp 0.02727265
2 Unif+Pois 0.02912956
3 Unif+Weib 0.01613404
4 Exp+Pois 0.04853872
5 Exp+Weib 0.03252025
6 Pois+Weib 0.03573252
7 Unif+Exp+Pois 0.05138219
8 Unif+Exp+Weib 0.03571401
9 Unif+Pois+Weib 0.04112936
10 Exp+Pois+Weib 0.06209911
Okay, so we have it and everything works! However, if I pass a very large data frame to my function and add more statistics to compute (adjusted R-squared, AIC, BIC and so on), it takes ages. Is there any way to make this function faster? For example, can the double loop be avoided, or is there a built-in R function for creating such combinations?
To summarize: how can I make combination_rsquare() run faster?
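One possible direction, offered as a sketch rather than a benchmarked answer: combn() accepts a FUN argument, so the inner loop and the repeated append() calls can be dropped. The name combination_rsquare2 is made up for this illustration. Most of the run time is probably still spent inside lm() itself, so this mainly removes the growing vectors and tidies the code.

```r
combination_rsquare2 <- function(y, data, factor_number = c(2, 3)) {
  per_size <- lapply(factor_number, function(k) {
    # combn() applies FUN to every combination of k column names
    combn(colnames(data), k, simplify = FALSE, FUN = function(vars) {
      fit <- lm(y ~ ., data = data[vars])
      data.frame(name_vec = paste(vars, collapse = "+"),
                 r_sq = summary(fit)$r.squared)
    })
  })
  # flatten the list of lists of one-row data frames and stack them
  do.call(rbind, unlist(per_size, recursive = FALSE))
}

set.seed(1)
Norm <- rnorm(100)
df <- data.frame(Unif = runif(100), Exp = rexp(100),
                 Pois = rpois(100, 1), Weib = rweibull(100, 1))
res <- combination_rsquare2(Norm, df)
res
```

Further statistics (adjusted R-squared, AIC, BIC) could be added as extra columns of the one-row data frame, so each model is still fitted only once.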

R creating 5 variables with column name suffix 1 through 5 using loop

I am trying to create an iterative function in R using a loop or array, which will create three variables and three data frames with the same 1-3 suffix. My current code is:
function1 <- function(b1,lvl1,lvl2,lvl3,b2,x) {
lo1 <- exp(b1*lvl1 + b2*x)
lo2 <- exp(b1*lvl2 + b2*x)
lo3 <- exp(b1*lvl3 + b2*x)
out1 <- t(c(lvl1,lo1))
out2 <- t(c(lvl2,lo2))
out3 <- t(c(lvl3,lo3))
out <- rbind(out1, out2, out3)
colnames(out) <- c("level","risk")
return(out)
}
function1(.18, 1, 2, 3, .007, 24)
However, I would like to iterate the same line of code three times to create lo1, lo2, lo3, and out1, out2 and out3. The syntax below is completely wrong because I don't know how to use two arguments in a for-loop, or nest a for loop within a function, but as a rough idea:
function1 <- function(b1,b2,x) {
for (i in 1:3) {
loi <- exp(b1*i + b2*x)
return(lo[i])
outi <- t(c(i, loi)
return(out[i])
}
out <- rbind(out1, out2, out3)
colnames(out) <- c("level","risk")
return(out)
}
function1(.18,.007,24)
The output should look like:
level risk
1 1.42
2 1.70
3 2.03
In R, explicit for loops are often slower and more verbose than vectorized code. A good practice is to vectorize wherever possible, and otherwise to reach for the functions of the apply family.
For your task, you can simply work with a data frame. Here is an example:
# The function
function1 <- function(b1,b2,level,x) {
# Create the dataframe with the level column
df = data.frame("level" = level)
# Add the risk column
df$risk = exp(b1*df$level + b2*x)
return(df)
}
# Your variables
b1 = .18
b2 = .007
level = c(1,2,3)
# Your process
function1(b1, b2, level, 24)
# level risk
# 1 1 1.416232
# 2 2 1.695538
# 3 3 2.029927
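If a loop is still preferred, the attempt from the question can be made to work by filling a preallocated matrix instead of creating numbered variables. This is a minimal sketch; function1_loop and its levels argument are names introduced here for illustration.

```r
function1_loop <- function(b1, b2, x, levels = 1:3) {
  # one row per level, filled in inside the loop
  out <- matrix(NA_real_, nrow = length(levels), ncol = 2,
                dimnames = list(NULL, c("level", "risk")))
  for (i in seq_along(levels)) {
    out[i, ] <- c(levels[i], exp(b1 * levels[i] + b2 * x))
  }
  out  # returned once, after the loop has finished
}

risk_table <- function1_loop(.18, .007, 24)
risk_table
```

The key differences from the attempt are that results are stored by index (out[i, ]) rather than in names like lo1 or out1, and that return is reached only after the loop ends.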

R - Cleanest way to run statistical test on every permutation of multiple populations

I have three populations stored as individual vectors. I need to run a statistical test (wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (one p-value is the result of one test and is a double).
I have a method that works but I am new to R and from what I've been reading I feel like there should be a better way, possibly involving storing the vectors as a data frame and using vectorization, to write this code.
Here is the code I have:
library(arrangements)
runAllTests <- function(pop1,pop2,pop3) {
populations <- list(pop1=pop1,pop2=pop2,pop3=pop3)
colLabels <- c("pop1", "pop2", "pop3")
#This line makes a data frame where each column is a pair of labels
perms <- data.frame(t(permutations(colLabels,2)))
pvals <- vector()
#This for loop gets each column of that data frame
for (pair in perms[,]) {
pair <- as.vector(pair)
p1 <- as.numeric(unlist(populations[pair[1]]))
p2 <- as.numeric(unlist(populations[pair[2]]))
pvals <- append(pvals, wilcox.test(p1, p2,alternative=c("less"))$p.value)
}
return(pvals)
}
What is a more R appropriate way to write this code?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is common; so common that R has a built-in function for exactly this scenario: pairwise.table. Note that its p.adjust.method argument has no default, so it has to be supplied.
p <- list(pop1, pop2, pop3)
pairwise.table(function(i, j) {
  wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3, p.adjust.method = "holm")
There are also specific versions for t tests, proportion tests, and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
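As a self-contained illustration of the long-format approach above with populations of different sizes, here is a sketch on simulated data. The population values and the names pops and long are made up for this example.

```r
set.seed(42)
# Three simulated populations of unequal size (hypothetical data)
pops <- list(pop1 = rnorm(20, mean = 0),
             pop2 = rnorm(35, mean = 0.5),
             pop3 = rnorm(28, mean = 1.5))

# Long format: one value column, one grouping factor
long <- data.frame(
  x = unlist(pops, use.names = FALSE),
  g = factor(rep(names(pops), lengths(pops)), levels = names(pops))
)

res <- pairwise.wilcox.test(long$x, long$g, p.adjust.method = "holm")
res$p.value  # lower-triangular matrix of adjusted p-values
```

The result holds one p-value per unordered pair, so three populations give three tests; the upper triangle of the matrix is NA.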
Per your comments, you're interested in tests where the order matters; that's really hard to imagine (and isn't true for the Wilcoxon test you mentioned) but still...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function (compare.levels, level.names, p.adjust.method) {
ix <- setNames(seq_along(level.names), level.names)
pp <- outer(ix, ix, function(ivec, jvec)
sapply(seq_along(ivec), function(k) {
i <- ivec[k]; j <- jvec[k]
if (i != j) compare.levels(i, j) else NA }))
pp[] <- p.adjust(pp[], p.adjust.method)
pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method=p.adjust.methods) {
p.adjust.method <- match.arg(p.adjust.method)
level.names <- if(!is.null(names(dat))) names(dat) else seq_along(dat)
PVAL <- pairwise.table.all(function(i, j) {
wilcox.test(dat[[i]], dat[[j]])$p.value
}, level.names, p.adjust.method = p.adjust.method)
ans <- list(method = "Lazerbeam's special method",
data.name = paste(level.names, collapse=", "),
p.value = PVAL, p.adjust.method = p.adjust.method)
class(ans) <- "pairwise.htest"
ans
}
Output, both before and after tidying, looks like this:
> p <- list(a=1:5, b=2:8, c=10:16)
> out <- pairwise.lazerbeam.test(p)
> out
Pairwise comparisons using Lazerbeams special method
data: a, b, c
a b c
a - 0.2821 0.0101
b 0.2821 - 0.0035
c 0.0101 0.0035 -
P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
group1 group2 p.value
<chr> <chr> <dbl>
1 b a 0.282
2 c a 0.0101
3 a b 0.282
4 c b 0.00350
5 a c 0.0101
6 b c 0.00350
Here is an example of one approach that uses combn() which has a function argument that can be used to easily apply wilcox.test() to all variable combinations.
set.seed(234)
# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))
# Apply wilcox.test to all combinations of variables in data frame.
res <- combn(names(df), 2, function(x) {
  list(data = c(paste(x[1], x[2])),
       p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value)
}, simplify = FALSE)
# Bind results
do.call(rbind, res)
data p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251

Saving and accessing results from regression in a loop

I am trying to run several panel data regressions with the plm package in a for loop and then save the results, so that I can use summary on each of the regression results. However, I can't figure out how to use summary on the list of saved results. This is what I have tried:
library(plm)
########### Some toy data #################
Id <- c(rep(1:4,3),rep(5,2))
Id <- Id[order(Id)]
Year <- c(rep(2000:2002,4),c(2000,2002))
z1 <- rnorm(14)
z2 <- rnorm(14)
z3 <- rnorm(14)
z4 <- rnorm(14)
CORR <- rbind(c(1,0.6,0.5,0.2),c(0.6,1,0.7,0.3),c(0.5,0.7,1,0.4),c(0.2,0.3,0.4,1))
CholCORR <- chol(CORR)
DataTest <- as.data.frame(cbind(z1,z2,z3,z4)%*%CholCORR)
names(DataTest)<-c("y","x1","x2","x3")
DataTest <- cbind(Id, Year, DataTest)
############################################
for(i in 2001:2002){
Data <- DataTest[(DataTest$Year <= i), ]
TarLV <- plm(diff(y) ~ lag(x1) + x2 + x3, data = Data, model="pooling", index = c("Id","Year"))
if(i==2001){
Res1St <- TarLV
} else {
Res1St <- c(Res1St,TarLV)
}
}
sapply(Res1St, function(x) summary(x))
However, I get error:
Error in tapply(x, effect, func, ...): Arguments must have same length
I probably don't save the regressions results in a very sensible way, and the for loop can probably be avoided, I just don't see how. Any help appreciated!
Store each plm object in a list. To do so, create an empty list (out) before the loop and then fill it inside the loop.
out <- list()
yr <- 2001:2002
for(i in seq_along(yr)){
take <- DataTest[(DataTest$Year <= yr[i]), ]
out[[i]] <- plm(diff(y) ~ lag(x1) + x2 + x3, data = take, model="pooling", index = c("Id","Year"))
}
lapply(out, summary)
I also made two other changes:
Loop over 1, 2, 3, ... instead of 2001, 2002, so that i can be used directly as the list index.
Renamed the per-iteration subset to take, so that DataTest is not overwritten.
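The same idea can be written without an explicit loop. The sketch below uses lm() on simulated data so that it runs without plm; the data frame toy and the names fits and summaries are made up for this illustration. It also shows why c() did not work in the question: c() flattens the components of the model objects into one long list instead of keeping two models.

```r
set.seed(1)
# Simulated panel-like data (hypothetical), 10 observations per year
toy <- data.frame(Year = rep(2000:2002, each = 10), x = rnorm(30))
toy$y <- 1 + 2 * toy$x + rnorm(30)

yr <- 2001:2002
# One model per cutoff year, each stored as one list element
fits <- lapply(yr, function(cutoff) {
  lm(y ~ x, data = toy[toy$Year <= cutoff, ])
})
names(fits) <- yr

summaries <- lapply(fits, summary)
sapply(summaries, function(s) s$r.squared)

# c() concatenates the components of both models,
# so the result is not a list of two models
length(c(fits[[1]], fits[[2]]))
length(fits)
```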
