How to speed up going through all combinations of independent variables? - r

I want to write a function combination_rsquare(y, data, factor_number) where
y - a vector, the dependent variable
data - a data frame containing the independent variables
factor_number - a number or numeric vector giving how many variables each combination should contain
Here is my function:
combination_rsquare <- function(y, data, factor_number = c(2, 3)) {
name_vec <- c()
r_sq <- c()
for (j in seq_along(factor_number)) {
# Defining combinations
comb_names <- combn(colnames(data), factor_number[j])
for (i in 1:ncol(comb_names)) {
#Append model r-squared for each combination
r_sq<- append(
r_sq,
summary(lm(y ~ ., data = data[comb_names[1:factor_number[j], i]]))$r.squared
)
# Create vector containing model names seperated by "+"
name_vec <- append(
name_vec,
paste(comb_names[1:factor_number[j], i], collapse = "+")
)
}
}
data.frame(name_vec, r_sq)
}
Let's see how my function works on some data:
Norm <- rnorm(100)
Unif <- runif(100)
Exp <- rexp(100)
Pois <- rpois(100,1)
Weib <- rweibull(100,1)
df <- data.frame(Unif, Exp, Pois, Weib)
combination_rsquare(Norm, df)
         name_vec       r_sq
1        Unif+Exp 0.02727265
2       Unif+Pois 0.02912956
3       Unif+Weib 0.01613404
4        Exp+Pois 0.04853872
5        Exp+Weib 0.03252025
6       Pois+Weib 0.03573252
7   Unif+Exp+Pois 0.05138219
8   Unif+Exp+Weib 0.03571401
9  Unif+Pois+Weib 0.04112936
10  Exp+Pois+Weib 0.06209911
Okay, so we have it and everything works! However, when I pass a very large data frame to my function and add more statistics to compute (adjusted R-squared, AIC, BIC and so on), it takes ages. Is there any way to make this function faster? For example, can the double loop be avoided, or is there a built-in R function for creating such combinations?
To summarize: how can I make combination_rsquare() run faster?
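One possible direction, sketched under assumptions rather than offered as a definitive answer: skip the formula interface and call lm.fit() on a numeric design matrix, and avoid growing vectors with append(). The name combination_rsquare_fast is hypothetical, and the sketch assumes that all predictors are numeric and that every model includes an intercept.

```r
# Sketch: R-squared for every predictor combination via lm.fit().
# Assumes numeric predictors and an intercept in every model.
combination_rsquare_fast <- function(y, data, factor_number = c(2, 3)) {
  X <- as.matrix(data)
  tss <- sum((y - mean(y))^2)          # total sum of squares, shared by all models
  res <- lapply(factor_number, function(k) {
    combs <- combn(colnames(X), k)     # one combination per column
    r_sq <- apply(combs, 2, function(vars) {
      fit <- lm.fit(cbind(1, X[, vars, drop = FALSE]), y)
      1 - sum(fit$residuals^2) / tss   # R-squared = 1 - RSS / TSS
    })
    data.frame(name_vec = apply(combs, 2, paste, collapse = "+"), r_sq = r_sq)
  })
  do.call(rbind, res)
}

# Usage on simulated data shaped like the example in the question
set.seed(1)
Norm <- rnorm(100)
df <- data.frame(Unif = runif(100), Exp = rexp(100),
                 Pois = rpois(100, 1), Weib = rweibull(100, 1))
out <- combination_rsquare_fast(Norm, df)
```

lm.fit() skips the model-frame and formula processing that lm() repeats on every call, which is usually where much of the time goes when fitting many small models. Statistics such as AIC or BIC would have to be computed by hand from the residuals in this setup.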

Related

Storing data from nested loop in r

I need to repeat the sampling procedure of the loop below 1000 times using a second loop.
This is the simplified, reproducible code I produced for the inner loop.
##Number of iterations
N = 8
##Store data from inner loop in vectors
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
for (I in 1:N){
PolynomialDegree [I] <- I
PMSE [I] <- I*rnorm(1)
}
Now, using a second, outer loop, I want to repeat this "sampling procedure" 1000 times and store the data of all those vectors in a single data frame. I'm struggling to write the outer loop and was hoping for some assistance.
This is my attempt with non-reproducible code; I hope it is clear what I am attempting to do.
##Set number of iterations
N <- 8
M <- 1000
##Store data
OUTPUT <- rep(1,M)
##Outer loop starts
for (J in 1:M){
PMSE <- rep(1 , N)
PolynomialDegree <- rep(1, N)
sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
training <- tempraindata[sample,]
testing <- tempraindata[-sample,]
##Inner loop starts
for (I in 1:N){
##Set up linear model with x polynomial of degree I x = year, y = temp
mymodel <- lm(tem ~ poly(Year, degree = I), data = training)
##fit model on testing set and save predictions
predictions <- predict(mymodel, newdata = testing, raw = FALSE)
##define and store PMSE
PMSE[I] <- (1/(nrow(tempraindata)- nrow(training)))*sum((testing$tem-predictions)^2)
PolynomialDegree [I] <- I
} ## End of inner loop
OUTPUT[J] <- ##THIS IS WHERE I WANT TO SAVE THE DATA
} ##End outer loop
I want to store all the data inside OUTPUT and make it a dataframe, if done correctly it should contain 8000 values of PMSE and 8000 values of PolynomialDegree.
Avoid the bookkeeping of initializing vectors and then assigning elements by index. Instead, consider sapply (or vapply) calls that cover both iterations and build a 1000 x 8 matrix holding all 8,000 PMSE values. Each column then corresponds to a polynomial degree and each row to one training/testing split.
## Set number of iterations
N <- 8
M <- 1000
## Defined method to generalize process
calc_PSME <- function(M, N) {
## Randomly build training/testing sets
set.seed(M) # REPRODUCIBLE SAMPLES; SAME SPLIT FOR EVERY DEGREE WITHIN ITERATION M
sample <- sample(nrow(tempraindata), floor(nrow(tempraindata)*0.7))
training <- tempraindata[sample,]
testing <- tempraindata[-sample,]
## Set up linear model with a polynomial of degree N (x = Year, y = tem)
mymodel <- lm(tem ~ poly(Year, degree = N), data = training)
## Fit model on testing set and save predictions
predictions <- predict(mymodel, newdata = testing, raw = FALSE)
## Return single PSME value
(
(1/(nrow(tempraindata)- nrow(training))) *
sum((testing$tem-predictions) ^ 2)
)
}
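# RETURN (1000 X 8) MATRIX: ROWS = SPLITS, COLUMNS = POLYNOMIAL DEGREES
PSME_matrix <- sapply(1:N, function(n) sapply(1:M, calc_PSME, N = n))
# SAME RESULT WITH vapply, WHICH ALSO CHECKS THE RETURN TYPE
PSME_matrix <- vapply(1:N, function(n) vapply(1:M, calc_PSME, numeric(1), N = n), numeric(M))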
Should you need an 8,000-row data frame of two columns, consider reshaping to long format:
long_df <- reshape(
data.frame(PSME_matrix),
varying = 1:8,
timevar = "PolynomialDegree",
v.names = "PSME",
ids = NULL,
new.row.names = 1:(M * N),
direction = "long"
)
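Since tempraindata is not available here, the same idea can be tried on simulated data. The following is only a sketch: the data frame toy, the helper pmse_one and the smaller values of M and N are made up for illustration, and the error measure is the mean of the squared prediction errors on the test rows.

```r
set.seed(1)
# Hypothetical stand-in for tempraindata: yearly temperatures with a mild trend
toy <- data.frame(Year = 1:60)
toy$tem <- 10 + 0.03 * toy$Year + rnorm(60)

# Hypothetical helper: test-set mean squared error for one split and one degree
pmse_one <- function(train_idx, degree, data) {
  fit  <- lm(tem ~ poly(Year, degree = degree), data = data[train_idx, ])
  test <- data[-train_idx, ]
  mean((test$tem - predict(fit, newdata = test))^2)
}

M <- 20   # number of random splits (1000 in the question)
N <- 4    # highest polynomial degree (8 in the question)

# Draw each 70% training split once, then reuse it for every degree
splits <- replicate(M, sample(nrow(toy), floor(0.7 * nrow(toy))), simplify = FALSE)

# vapply returns an N x M matrix; transpose so rows = splits, columns = degrees
pmse <- t(vapply(splits, function(idx) {
  vapply(seq_len(N), pmse_one, numeric(1), train_idx = idx, data = toy)
}, numeric(N)))
colnames(pmse) <- paste0("degree_", seq_len(N))

# Long format: one row per (split, degree) pair
long <- data.frame(PolynomialDegree = rep(seq_len(N), each = M),
                   PMSE = as.vector(pmse))
```

Drawing the splits up front keeps the comparison between degrees fair, because every degree is evaluated on exactly the same training and testing rows.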

nested loop in r to correlate columns of df1 to columns of df2

I have two datasets with abundance data from groups of different species. Columns are species and rows are sites. The sites (rows) are identical between the two datasets, and I am trying to correlate the columns of the first dataset with the columns of the second to see whether each correlation is positive or negative.
library(Hmisc)
rcorr(otu.table.filter$sp1, new6$spA, type="spearman")$P
rcorr(otu.table.filter$sp1, new6$spA, type="spearman")$r
The first gives me the p-value of the relation between sp1 and spA, and the second the r value.
I initially created a loop that allowed me to check all species of the first dataframe against a single column of the second dataframe. Needless to say, to make this work for every column I would have to repeat the process a few hundred times.
My simple loop for one column of df1 (new6) against all columns of df2 (otu.table.filter):
pvalues = list()
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$P
}
rvalues = list()
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$r
}
p<-NULL
for(i in 1:length(pvalues)){
tmp <-print(pvalues[[i]][2])
p <- rbind(p, tmp)
}
r<-NULL
for(i in 1:length(rvalues)){
tmp <-print(rvalues[[i]][2])
r <- rbind(r, tmp)
}
fdr<-as.matrix(p.adjust(p, method = "fdr", n = length(p)))
sprman<-cbind(r,p,fdr)
Using the above as a starting point, I tried to create a nested loop that examines the first column of df1 against all columns of df2, then the second column of df1 against all columns of df2, and so on, but I could not find a solution in R.
I would assume that the pvalues output should be a list of
pvalues[[i]][[j]]
and similarly the rvalues output
rvalues[[i]][[j]]
This is what I tried:
pvalues = list()
rvalues = list()
for (j in 1:7){
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$P
}
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$r
}
}
but I cannot make it work because I am not sure how to direct the output into the lists. I would also appreciate help with the next part, which is to extract the p and r value for each comparison and apply the FDR correction (similar to what I did with my simple loop).
here is a subset of my two dataframes
Here is a small demo. Let's assume two matrices x and y with a sample size n. Then the correlations and approximate p-values can be estimated as:
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)
y <- matrix(rnorm(5 * n), nrow = n)
## correlation matrix
r <- cor(x, y, method = "spearman")
## p-values
pval <- function(r, n) 2 * (1 - pt(abs(r)/sqrt((1 - r^2)/(n - 2)), n - 2))
pval(r, n)
## for comparison
cor.test(x[,1], y[,1], method = "spearman", exact = FALSE)
More details can be found here: https://stats.stackexchange.com/questions/312216/spearman-correlation-significancy-test
Edit
And finally a loop with cor.test:
## for comparison
p <- matrix(NA, nrow = ncol(x), ncol=ncol(y))
for (i in 1:ncol(x)) {
for (j in 1:ncol(y)) {
p[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$p.value
}
}
p
The values differ somewhat, because the first approach uses the t-approximation while the second uses the "exact AS 89 algorithm" of cor.test.
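To also cover the FDR step from the question, the correlation and p-value matrices can be flattened into one long table and adjusted in a single call. This is a sketch on simulated data: df1 and df2 are made-up stand-ins for the two abundance tables, and the p-values rely on the same t-approximation as above.

```r
set.seed(42)
n <- 50
# Hypothetical stand-ins for the two abundance tables (rows = sites)
df1 <- as.data.frame(matrix(rnorm(4 * n), nrow = n,
                            dimnames = list(NULL, paste0("sp", 1:4))))
df2 <- as.data.frame(matrix(rnorm(3 * n), nrow = n,
                            dimnames = list(NULL, paste0("sp", LETTERS[1:3]))))

# Spearman correlation of every column of df1 with every column of df2
r <- cor(df1, df2, method = "spearman")

# Two-sided p-values from the t-approximation
t_stat <- abs(r) * sqrt((n - 2) / (1 - r^2))
p <- 2 * pt(t_stat, df = n - 2, lower.tail = FALSE)

# One row per pair of columns, with FDR-adjusted p-values
res <- data.frame(
  var1 = rownames(r)[row(r)],
  var2 = colnames(r)[col(r)],
  r    = as.vector(r),
  p    = as.vector(p)
)
res$fdr <- p.adjust(res$p, method = "fdr")
```

Adjusting all pairs at once treats the whole table as one family of tests; whether that is the right family depends on the analysis.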

R - Cleanest way to run statistical test on every permutation of multiple populations

I have three populations stored as individual vectors. I need to run a statistical test (Wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (one p-value is the result of one test and is a double).
I have a method that works but I am new to R and from what I've been reading I feel like there should be a better way, possibly involving storing the vectors as a data frame and using vectorization, to write this code.
Here is the code I have:
library(arrangements)
runAllTests <- function(pop1,pop2,pop3) {
populations <- list(pop1=pop1,pop2=pop2,pop3=pop3)
colLabels <- c("pop1", "pop2", "pop3")
#This line makes a data frame where each column is a pair of labels
perms <- data.frame(t(permutations(colLabels,2)))
pvals <- vector()
#This for loop gets each column of that data frame
for (pair in perms[,]) {
pair <- as.vector(pair)
p1 <- as.numeric(unlist(populations[pair[1]]))
p2 <- as.numeric(unlist(populations[pair[2]]))
pvals <- append(pvals, wilcox.test(p1, p2,alternative=c("less"))$p.value)
}
return(pvals)
}
What is a more idiomatic R way to write this code?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is common; so common, in fact, that R has a built-in function for exactly this scenario: pairwise.table.
p <- list(pop1, pop2, pop3)
# p.adjust.method has no default in pairwise.table, so it must be supplied
pairwise.table(function(i, j) {
wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3, p.adjust.method = "holm")
There are also specific versions for t tests, proportion tests, and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
Per your comments, you're interested in tests where the order matters; that's really hard to imagine (and isn't true for the Wilcoxon test you mentioned) but still...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function (compare.levels, level.names, p.adjust.method) {
ix <- setNames(seq_along(level.names), level.names)
pp <- outer(ix, ix, function(ivec, jvec)
sapply(seq_along(ivec), function(k) {
i <- ivec[k]; j <- jvec[k]
if (i != j) compare.levels(i, j) else NA }))
pp[] <- p.adjust(pp[], p.adjust.method)
pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method=p.adjust.methods) {
p.adjust.method <- match.arg(p.adjust.method)
level.names <- if(!is.null(names(dat))) names(dat) else seq_along(dat)
PVAL <- pairwise.table.all(function(i, j) {
wilcox.test(dat[[i]], dat[[j]])$p.value
}, level.names, p.adjust.method = p.adjust.method)
ans <- list(method = "Lazerbeam's special method",
data.name = paste(level.names, collapse=", "),
p.value = PVAL, p.adjust.method = p.adjust.method)
class(ans) <- "pairwise.htest"
ans
}
Output, both before and after tidying, looks like this:
> p <- list(a=1:5, b=2:8, c=10:16)
> out <- pairwise.lazerbeam.test(p)
> out
Pairwise comparisons using Lazerbeam's special method
data:  a, b, c
  a      b      c
a -      0.2821 0.0101
b 0.2821 -      0.0035
c 0.0101 0.0035 -
P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
  group1 group2 p.value
  <chr>  <chr>    <dbl>
1 b      a      0.282
2 c      a      0.0101
3 a      b      0.282
4 c      b      0.00350
5 a      c      0.0101
6 b      c      0.00350
Here is an example of one approach that uses combn() which has a function argument that can be used to easily apply wilcox.test() to all variable combinations.
set.seed(234)
# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))
# Apply wilcox.test to all combinations of variables in data frame.
res <- combn(names(df), 2, function(x) {
  list(
    data = paste(x[1], x[2]),
    p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value
  )
}, simplify = FALSE)
# Bind results
do.call(rbind, res)
data p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251
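For the original goal of six one-sided p-values from three populations of different sizes, the ordered pairs can also be enumerated in base R with expand.grid(). This is a sketch only: the population values are made up, and alternative = "less" is taken from the question's code.

```r
# Hypothetical populations of unequal size, stored as a named list
pops <- list(
  pop1 = c(1.2, 3.4, 2.2, 5.1),
  pop2 = c(2.5, 4.1, 6.3, 7.7, 8.0),
  pop3 = c(9.1, 10.4, 8.6)
)

# All ordered pairs of distinct populations (3 * 2 = 6 rows)
pairs <- expand.grid(x = names(pops), y = names(pops), stringsAsFactors = FALSE)
pairs <- pairs[pairs$x != pairs$y, ]

# One-sided Wilcoxon rank-sum test for each ordered pair
pairs$p.value <- mapply(function(a, b) {
  wilcox.test(pops[[a]], pops[[b]], alternative = "less")$p.value
}, pairs$x, pairs$y, USE.NAMES = FALSE)
```

Keeping the populations in a named list sidesteps the unequal-length problem, and the resulting data frame records which p-value belongs to which ordered pair. The p-values here are unadjusted; p.adjust() can be applied to the p.value column if a multiple-comparison correction is wanted.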

R: Replacing a for-loop with an apply function

I managed to apply a linear regression for each subject of my data frame and paste the values into a new dataframe using a for-loop. However, I think there should be a more readable way of achieving my result using an apply function, but all my attempts fail. This is how I do it:
numberOfFiles <- length(resultsHick$subject)
intslop <- data.frame(matrix(0,numberOfFiles,4))
intslop <- rename(intslop,
subject = X1,
intercept = X2,
slope = X3,
Rsquare = X4)
cond <- c(0:3)
allSubjects <- resultsHick$subject
for (i in allSubjects) {
intslop[i,1] <- i
yvalues <- t(subset(resultsHick,
subject == i,
select = c(H0meanRT, H1meanRT, H2meanRT, H258meanRT)))
fit <- lm(yvalues ~ cond)
intercept <- fit$coefficients[1]
slope <- fit$coefficients[2]
rsquared <- summary(fit)$r.squared
intslop[i,2] <- intercept
intslop[i,3] <- slope
intslop[i,4] <- rsquared
}
The result should look the same as
> head(intslop)
subject intercept slope Rsquare
1 1 221.3555 54.98290 0.9871209
2 2 259.4947 66.33344 0.9781499
3 3 227.8693 47.28699 0.9537868
4 4 257.7355 80.71935 0.9729132
5 5 197.4659 49.57882 0.9730409
6 6 339.1649 61.63161 0.8213179
...
Does anybody know a more readable way of writing this code using an apply function?
One common pattern I use to replace for loops that aggregate data.frames is:
do.call(
  rbind,
  lapply(seq_len(numberOfDataFrames), function(i) {
    message("Processing index: ", i)  # helpful to see how slow/fast
    temp_df <- do_some_work(i)        # placeholder: build the data.frame for index i
    temp_df$intercept <- 1            # placeholder: add further columns as needed
    temp_df                           # key is to return a data.frame for each index
  })
)
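Applied to the per-subject regression from the question, the pattern might look like the sketch below. The resultsHick values are made up for illustration, and fit_subject is a hypothetical helper name; only the column names are taken from the question.

```r
# Hypothetical stand-in for resultsHick: one row per subject, four mean RTs
resultsHick <- data.frame(
  subject    = 1:3,
  H0meanRT   = c(220, 260, 230),
  H1meanRT   = c(280, 320, 270),
  H2meanRT   = c(330, 395, 330),
  H258meanRT = c(385, 455, 365)
)
cond <- 0:3
rt_cols <- c("H0meanRT", "H1meanRT", "H2meanRT", "H258meanRT")

# Fit one regression per row and return a one-row data.frame
fit_subject <- function(i) {
  y <- unlist(resultsHick[i, rt_cols])   # the four mean RTs of subject i
  fit <- lm(y ~ cond)
  data.frame(
    subject   = resultsHick$subject[i],
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    Rsquare   = summary(fit)$r.squared
  )
}

# Stack the one-row results into a single data.frame
intslop <- do.call(rbind, lapply(seq_len(nrow(resultsHick)), fit_subject))
```

Iterating over row positions rather than subject IDs keeps the result correct even if the IDs are not consecutive integers starting at 1.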

Compute p-values across all columns of (possibly large) matrices in R

Is there a more efficient/faster way to compare two matrices (column by column) and compute p-values using a t-test for no difference in means (switching to chisq.test when necessary)?
Here is my solution:
## generate fake data (e.g., from treatment and control data)
z0 <- matrix(rnorm(100),10,10)
z1 <- matrix(rnorm(100, mean=1.1, sd=2),10,10)
## function to compare columns (bloody for loop)
compare.matrix <- function(z0, z1){
pval <- numeric(ncol(z0)) ## initialize
for(i in 1:ncol(z0)){ ## compare columns
pval[i] <- t.test(z1[, i], z0[, i])$p.value
## if var is categorical, switch test type
if ( length(unique(z1[,i]))==2){
index <- c(rep(0, nrow(z0)), rep(1, nrow(z1)))
xx <- c(z0[,i], z1[,i])
pval[i] <- chisq.test(table(xx, index), simulate.p.value=TRUE)$p.value
}
}
return(pval)
}
compare.matrix(z0, z1)
Here's one way using dplyr (with melt() from reshape2). It would probably be better to combine the melt and merge steps into a single step if you've got large matrices, but I separated them for clarity. I think the chi-squared case would be a fairly simple extension.
library(reshape2)  # for melt()
library(dplyr)
z0_melt = melt(z0, value.name='z0')  # columns: Var1 (row), Var2 (column), z0
z1_melt = melt(z1, value.name='z1')
# merge on both indices; merging on Var2 alone would duplicate observations
all_df = merge(z0_melt, z1_melt, by=c('Var1', 'Var2'))
all_df %>%
  group_by(Var2) %>%
  summarize(p = t.test(z0, z1)$p.value)
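If speed is the main concern, the Welch t-test can also be computed for all columns at once from column means and variances, without calling t.test() per column. This is a sketch under assumptions: welch_p_by_column is a hypothetical helper, both matrices are numeric with no missing values, and the categorical (chi-squared) case is not handled.

```r
# Two-sided Welch t-test p-values for matching columns of two matrices.
# Assumes numeric matrices with the same number of columns and no NAs.
welch_p_by_column <- function(a, b) {
  na <- nrow(a)
  nb <- nrow(b)
  va <- apply(a, 2, var) / na   # squared standard error of each column mean
  vb <- apply(b, 2, var) / nb
  t_stat <- (colMeans(a) - colMeans(b)) / sqrt(va + vb)
  # Welch-Satterthwaite degrees of freedom
  df <- (va + vb)^2 / (va^2 / (na - 1) + vb^2 / (nb - 1))
  2 * pt(abs(t_stat), df, lower.tail = FALSE)
}

# Usage on data generated as in the question
set.seed(7)
z0 <- matrix(rnorm(100), 10, 10)
z1 <- matrix(rnorm(100, mean = 1.1, sd = 2), 10, 10)
pvals <- welch_p_by_column(z1, z0)
```

The result should agree with t.test(z1[, i], z0[, i])$p.value for each column i, since t.test() uses the Welch variant by default.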

Resources