Running loop on groups of data frame in R - r

I need to code a loop that will run t-tests on small groups of a data frame. I think they recommended using a for loop.
There are 271 rows in the data frame. The first 260 rows need to be split into 13 groups of 20, and a t-test must be run on each of the 13 groups.
This is the code I used to run a t-test on the entire data frame:
t.test(a, c, alternative =c("two.sided"), mu=0, paired=TRUE, var.equal=TRUE, conf.level=0.95)
I'm a coding noob, please help! D:

First at all, I don't see a data.frame here. a and c seem to be vectors. I assume that these both vectors are of length 271 and you want to ignore the last 11 items. So you can throw away these items first:
a2 <- a[1:260]
c2 <- c[1:260]
Now you can create a vector of length 260 determining the indices of the subsets. (There are many ways to do this, but I think this way is easy to understand.)
indices <- as.numeric(cut(1:260, 20))
indices #just to show the output
You probably have to store the output in a list. The following code is again not the most efficient, but easy to understand.
result <- list()
for (i in 1:20){
result[[i]] <- t.test(a2[which(indices == i)], c2[which(indices == i)],
alternative = c("two.sided"),
mu = 0, paired = TRUE, var.equal = TRUE,
conf.level = 0.95)
}
result[[1]] #gives the results of the first t-test (items 1 to 20)
result[[2]] # ...
As alternative to the for-loop you could also use lapply which usually is more effective and a bit shorter (but that doesn't matter for 260 data points):
result2 <- lapply(1:20, function(i) t.test(a2[which(indices == i)],
c2[which(indices == i)],
alternative = c("two.sided"),
mu = 0, paired = TRUE, var.equal = TRUE,
conf.level = 0.95))
result[[1]] # ...
I hope that answers you're question.

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix because I only care about the correlation of LPS0 vs all other columns, not the correlations of the other columns with themselves. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p.values are returned as NA, so I switched to an if_else loop. However, I am now getting an error. My code is as follows:
for(i in 3:ncol(microbiota_lps_0_morm)) {
morm_0 <- cor.test(microbiota_lps_0_morm$LPS0, microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]], method = "spearman")
if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! true must be length 1 (length of condition), not 8.
Backtrace: 1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm only when p.value <- 0.05?
Here's a long piece of code which aytomates the whole thing. it might be overkill but you can just take the matrix and use whatever you need. it makes use of the tidyverse.
df <- select_if(mtcars,is.numeric)
glimpse(df)
# keeping real names
dict <- cbind(original=names(df),new=paste0("v",1:ncol(df)))
# but changing names for better data viz
colnames(df) <- paste0("v",1:ncol(df))
# correlating between variables + p values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
pvals[[coln]] <- map(df, ~ cor.test(df[,coln], .)$p.value)
corss[[coln]] <- map(df, ~ cor(df[,coln], .))
}
# Keeping both matrices in a list
matrices <- list(
pvalues = matrix(data=unlist(pvals),
ncol=length(names(pvals)),
nrow=length(names(pvals))),
correlations = matrix(data=unlist(corss),
ncol=length(names(corss)),
nrow=length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)
# Creating a combined data frame
long_cors <- expand.grid(Var1=names(df),Var2=names(df)) %>%
mutate(cor=unlist(matrices["correlations"]),
pval=unlist(matrices["pvalues"]),
same=Var1==Var2,
significant=pval<0.05,
dpcate=duplicated(cor)) %>%
# Leaving no duplicants, non-significant or self-correlation results
filter(same ==F,significant==T,dpcate==F) %>%
select(-c(same,dpcate,significant))
# Plotting correlations
long_cors %>%mutate(negative=cor<0) %>%
ggplot(aes(x=Var1,y=Var2,
color=negative,size=abs(cor),fill=Var2,
label=round(cor,2)))+
geom_label(show.legend = F,alpha=0.2)+
scale_color_manual(values = c("black","darkred"))+
# Sizing each correlation by it's magnitude
scale_size_area(seq(1,100,length=length(unique(long_cors$Var1))))+ theme_light()+
theme(axis.text = element_text(face = "bold",size=12))+
labs(title="Correlation between variables",
caption = "p < 0.05")+xlab("")+ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
~ cor.test(mtx[,1], mtx[,.], use="p", method="s")$p.value)
Whenever you feel like you need a for loop in R, chances are, you should be using another approach. for is a very un-R construct, and R gives many better ways of handling the same issues. map_* family of functions from tidyverse is but one of them. Another approach, in base R, would be to use apply:
significant <- apply(mtx[,-1], 2,
\(x) cor.test(x, mtx[,1], method="s", use="p")$p.value)

How to speed up row-wise computations on a data.frame using alternative functions to apply family?

I have a data frame with 10,000 rows and 40 columns. I am trying to apply a function to each of these rows. For each row, I am expecting to return a scalar which is the value of the statistic I am calculating in this function. Below is what I have done so far;
library(dfadjust)
library(MASS)
# Creating example data #
nrows=10000
ncols=40
n1=20
n2=20
df=data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
# Function to evaluate on each row of df #
get_est= function(x){
mod = rlm(x~cov$group)
fit = dfadjustSE(mod)
coef = fit$coefficients[2,1]
se = fit$coefficients[2,4]
stats = coef/se
return(stats)
}
# Applying above function to full data #
t1=Sys.time()
estimates=apply(df, 1, function(x) get_est(x))
t2=Sys.time()-t1
# Time taken by apply function
Time difference of 37.10623 secs
Is there a way to significantly decrease the time taken to implement get_est() on the full data? The main reason I need to speed up the computation on a single df is because I have 1000 more data frames with the same dimension and I have to apply this function to each row to each of these data frames simultaneously. To illustrate, below is the broader situation I am dealing with;
# Creating example data
set.seed(1234)
nrows = 10000
ncols = 40
n1 = 20
n2 = 20
df.list = list()
for(i in 1:1000){
df.list[[i]] = data.frame(t(replicate(nrows, rnorm(ncols, 100, 3))))
}
# Applying get_est() to each row and to each of data frame in df.list #
pcks = c('MASS','dfadjust')
all.est = foreach(j = 1:length(df.list), .combine = cbind, .packages = pcks) %dopar% {
cov=data.frame(group=as.factor(rep(c(1,2),c(n1,n2))))
est = apply(df.list[[j]], 1, function(x) get_est(x))
return(est)
}
Even after parallelizing it is taking hours to finish. My ultimate objective is to significantly cut down the time to obtain "all.est" which will contain 10000 rows and 1000 columns where each column has the stats estimates for the respective data set. Any help is much appreciated!! Thanks in advance!
Doing some preprocessing of data we can adjust the rlm function so there's less overhead:
x3 <- as.matrix(cbind(1L, y))
colnames(x3) <- c('(Intercept)', 'x')
w = rep(1, nrow(x3))
get_est= function(x){
mod = rlm(x3, x, weights = w, w = w)
fit = dfadjustSE(mod)
coef = fit$coefficients[2,1]
se = fit$coefficients[2,4]
stats = coef/se
return(stats)
}
I got 12 seconds instead of 18 seconds for initial approach. ~33% improvement
For larger speedups I would suggest looking into rlm and dfadjustSE functions and try to optimize those for your specific needs (removing unnecessary checks etc., as you are calling those functions millions of times). But that probably will be quite time consuming and better performance is not guaranteed. Maybe there are other packages with similar but faster functions?

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data<-select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace=TRUE)
Random_sample_cats<-cbind(Random_sample_data,cat)
correlation <- function(c) {
c <- cor.test(x,y, method="kendall")
return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
output <- cor.test(newdata$wt, newdata$qsec, method="kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
newdata <- sample_n(mtcars, 10)
cor.test(newdata$wt, newdata$qsec, method="kendall")$p.value
})

Perform two-sample t-test and output the t, p values for two groups of matrices in R

I am trying to perform a two-sample t-test like this
The wt1, wt2, wt3, mut1, mut2, mut3 are 3x3 matrices. After running the t-test, I would like to get t.stat and p.value matrices, in which
t.stat[i,j] <- the t value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
p.value[i,j] <- the p-value from t.test(c(wt1[i,j],wt2[i,j],wt3[i,j]),c(mut1[i,j],mut2[i,j],mut3[i,j]))
with i and j indicating the row and column indices.
Is there an efficient way to achieve this without a loop?
Thank you very much for the help, it works!
Now I found that my data in the diagonal directions are all 1, which would result in Error in t.test.default(c(wt1[x], wt2[x], wt3[x]), c(mut1[x], mut2[x], :
data are essentially constant.
In order to pass those errors, I would like to output N/A in the t.stat and p.value. If the matrices have to contain the same type of values, 0 and 1 can be used for t.stat and p.value, respectively. It seems that tryCatch can do the job, but I am not sure how to handle it with sapply?
You can do something like this:
test<- sapply(1:9, function(x) t.test(c(wt1[x], wt2[x], wt3[x]),
c(mut1[x], mut2[x], mut3[x])))
t.stat<- matrix(test["statistic", ], nrow = 3)
p.value<- matrix(test["p.value", ], nrow = 3)
For the second part of your question, I think using tryCatch inside sapply will help. Unfortunately, I couldn't think of a way of pre-allocating test and then creating the 2 matrices while using tryCatch. In order to do that, I am adapting Aaron's answer.
t.stat<- matrix(sapply(1:9, function(x)
tryCatch({t.test(c(wt1[x], wt2[x], wt3[x]),
c(mut1[x], mut2[x], mut3[x]))$statistic},
error = function(err) {return(NA)})), nrow = 3)
p.value<- matrix(sapply(1:9, function(x)
tryCatch({t.test(c(wt1[x], wt2[x], wt3[x]),
c(mut1[x], mut2[x], mut3[x]))$p.value},
error = function(err) {return(NA)})), nrow = 3)

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks, over a rolling 36 month window.
My problem is that I need to perform this regression for a fairly large number of assets and doing it one by one would take a huge amount of time. To be more precise: Is there a way to tell R to regress columns 1-100 one by one on colums 101-116. Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for( i in window.len:ndates ) {
window.index = (i - window.len + 1) : i
fit = lm.constraint( hist.returns[window.index, -1], hist.returns[window.index, 1], constraints )
style.weights[i,] = fit$coefficients
style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but you there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random / toy dataset and using lm(), but you can sub in whatever regression function you want
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
models[[i]] <-
lm(formula=x~., data=data.frame(x=dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
lapply(1:100,
function(i) lm(formula=x~.,
data=data.frame(x=dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]

Resources