Regression across matrices in list - r

I want to perform regressions across matrices in a list, and in each case use the last 48 months of data.
Setup
Mat1 <- matrix(nrow=310, ncol =48, data= rnorm(310*48, 0, 0.1))
Mat2 <- matrix(nrow =310, ncol= 51, data = rnorm(310*51,0,0.1))
Mat1[1:300, 48] <- NA
Mat2[1:40, 51] <- NA
ind1 <- matrix(nrow =48, ncol =3, data = rnorm(48*3, 0,0.1))
ind2 <- matrix(nrow =51, ncol =3, data = rnorm(51*3, 0,0.1))
list1 <- list(Mat1, Mat2)
ind.list <- list(ind1, ind2)
I want to regress the rows of list1[[i]] against the columns of ind.list[[i]], using the most recent 48 months of data. For instance, the first regression regresses row 1 of Mat1 against columns 1, 2 and 3 of ind1. The second regresses row 2 of Mat1 against the same three columns. This repeats for every row of Mat1, and the coefficients are stored.
However, I only want a regression to run if there is data in the last column of that row of Mat1. To be clear: Step 1. Check whether row 1 of Mat1 has data in column 48. If yes, run the regression over the 48 months of data. If no, store NA in the coefficient matrix. Step 2. Check whether row 2 of Mat1 is NA in column 48. If it is not, run the regression. If it is, store NA in the coefficient matrix.
Then move to Mat2. Regress Mat2[1, 3:51] against ind2[3:51, 1], ind2[3:51, 2] and ind2[3:51, 3].
Repeat the process for all rows in Mat2 and store the beta coefficients in a matrix.
Overall, regress the rows of the matrices in list1 against the corresponding columns of the matrices in ind.list.
What I've tried:
# Allocate the result matrices once, outside the loops
coefficients1 <- matrix(nrow = 310, ncol = 2)
coefficients2 <- matrix(nrow = 310, ncol = 2)
coefficients3 <- matrix(nrow = 310, ncol = 2)
for (i in 1:2) {
  for (j in 1:310) {
    # Skip rows with no observation in the last column
    if (is.na(list1[[i]][j, ncol(list1[[i]])])) next
    fit <- lm(list1[[i]][j, ] ~ ind.list[[i]][, 1] + ind.list[[i]][, 2] + ind.list[[i]][, 3])
    coefficients1[j, i] <- coefficients(fit)[1]
    coefficients2[j, i] <- coefficients(fit)[2]
    coefficients3[j, i] <- coefficients(fit)[3]
  }
}
The output should store all the beta coefficients for each corresponding regression. The regression of Mat1 against ind1 should yield 310 values of beta1 stored in coefficients1, 310 values of beta2 stored in coefficients2, and 310 values of beta3 stored in coefficients3.

You can try using apply which works well on matrices.
total_mat1 <- t(apply(Mat1, 1, function(x) coefficients(lm(x~ind1))))
#Change the values to NA when the last value of Mat1 is NA
total_mat1[is.na(Mat1[, ncol(Mat1)]), ] <- NA
total_mat1 is a matrix of 310 rows x 4 columns (the intercept plus three slopes).
Now do the same for Mat2 after subsetting the selected columns.
total_mat2 <- t(apply(Mat2[, 3:51], 1, function(x) coefficients(lm(x~ind2[3:51,]))))
To check for NA values first and only then run the regressions, we can do:
total_mat1 <- matrix(NA, nrow = nrow(Mat1), ncol = 4)
inds <- !is.na(Mat1[, ncol(Mat1)])
total_mat1[inds, ] <- t(apply(Mat1[inds, ],1, function(x) coefficients(lm(x~ind1))))
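The same idea can be wrapped in a function and run over both lists at once. This is only a sketch: row_betas and its n_months argument are hypothetical names, not part of the question or the answer above. Note that 3:51 spans 49 columns, so the sketch takes the last n_months columns instead (4:51 for Mat2 when n_months = 48).

```r
# Hypothetical helper: regress each row of `mat` on the columns of `ind`,
# using only the last `n_months` observations. Rows whose most recent
# value is NA get a row of NA coefficients.
row_betas <- function(mat, ind, n_months = 48) {
  cols <- seq(ncol(mat) - n_months + 1, ncol(mat))
  y <- mat[, cols, drop = FALSE]
  x <- ind[cols, , drop = FALSE]
  out <- matrix(NA_real_, nrow = nrow(mat), ncol = ncol(ind) + 1)
  keep <- !is.na(y[, ncol(y)])
  if (any(keep)) {
    out[keep, ] <- t(apply(y[keep, , drop = FALSE], 1,
                           function(r) coefficients(lm(r ~ x))))
  }
  out
}

# Same setup as in the question, stored in real lists
set.seed(1)
Mat1 <- matrix(rnorm(310 * 48, 0, 0.1), nrow = 310, ncol = 48)
Mat2 <- matrix(rnorm(310 * 51, 0, 0.1), nrow = 310, ncol = 51)
Mat1[1:300, 48] <- NA
Mat2[1:40, 51] <- NA
ind1 <- matrix(rnorm(48 * 3, 0, 0.1), nrow = 48, ncol = 3)
ind2 <- matrix(rnorm(51 * 3, 0, 0.1), nrow = 51, ncol = 3)
list1 <- list(Mat1, Mat2)
ind.list <- list(ind1, ind2)

# One 310 x 4 coefficient matrix per element of list1
betas <- Map(row_betas, list1, ind.list)
```

Column 1 of each result is the intercept and columns 2 to 4 are the three slopes, so betas[[1]][, 2] would correspond to beta1 for Mat1.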

Related

Create subset matrix according to criteria/ Extract key rows according to criteria

I want to subset the rows of my original matrix into two separate matrices.
I setup the problem as follows:
set.seed(2)
Mat1 <- data.frame(matrix(nrow = 4, ncol =10, data = rnorm(40,0,1)))
keep.rows = matrix(nrow =2, ncol =4)
keep.rows[,1] = c(1,2)
keep.rows[,2] = c(2,3)
keep.rows[,3] = c(2,3)
keep.rows[,4] = c(1,2)
Mat1
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 0.9959846 -2.2079198 -0.3869496 -1.183606 1.959357077 1.0744594 -0.8621983 -0.4213736 0.4718595 1.2309537
2 -1.6957649 1.8221225 0.3866950 -1.358457 0.007645872 0.2605978 2.0480403 -0.3508344 1.3589398 1.1471368
3 -0.5333721 -0.6533934 1.6003909 -1.512671 -0.842615198 -0.3142720 0.9399201 -1.0273806 0.5641686 0.1065980
4 -1.3722695 -0.2846812 1.6811550 -1.253105 -0.601160105 -0.7496301 2.0086871 -0.2505191 0.4559801 -0.7833167
Mat1 is my original matrix. From the keep.rows matrix, I want to create two output matrices. The first (Output1) should store the rows whose numbers are given in keep.rows. The second (Output2) should store all remaining rows. In my actual application the matrices are very large, so they cannot be split manually as I do here.
I need:
1) A function that does this simply over large matrices.
2) Ideally, one where I can change the number of entries to "keep" each time. In this case I store 3 entries. However, if my keep.rows matrix were 2x2, I might want to store five entries each time.
Results should be of the form:
Output1 <- data.frame(matrix(nrow = 2, ncol =10))
Output1[1:2,1:3] <- Mat1[c(1,2), 1:3]
Output1[1:2,4:6] <- Mat1[c(2,3), 4:6]
Output1[1:2,7:9] <- Mat1[c(2,3), 7:9]
Output1[1:2,10] <- Mat1[c(1,2), 10]
Output2 <- data.frame(matrix(nrow = 2, ncol =10))
Output2[1:2,1:3] <- Mat1[c(3,4), 1:3]
Output2[1:2,4:6] <- Mat1[c(1,4), 4:6]
Output2[1:2,7:9] <- Mat1[c(1,4), 7:9]
Output2[1:2,10] <- Mat1[c(3,4), 10]
IMPORTANT: In the answer I need Output2 to be specified in a way that keeps all remaining rows. In my application the keep.rows matrix is the same size, but Mat1 contains 1000+ rows.
You can use sapply to iterate over the columns of Mat1 with seq_along(Mat1) and subset Mat1 using keep.rows. With cbind you get a data.frame from the list returned by sapply. To get the remaining rows, simply place a - before keep.rows.
Output1 <- do.call(cbind, sapply(seq_along(Mat1), function(i) Mat1[keep.rows[,(i+2) %/% 3], i, drop = FALSE], simplify = FALSE))
Output2 <- do.call(cbind, sapply(seq_along(Mat1), function(i) Mat1[-keep.rows[,(i+2) %/% 3], i, drop = FALSE], simplify = FALSE))
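To make the number of columns covered by each keep.rows column adjustable (point 2 of the question), the hard-coded 3 can become an argument. The following is a sketch only: split_rows and width are hypothetical names, and it assumes that each column of keep.rows applies to width consecutive columns of the data frame.

```r
# Hypothetical helper: column g of `keep` applies to the g-th block of
# `width` consecutive columns of `df`.
split_rows <- function(df, keep, width = 3) {
  grp <- (seq_along(df) - 1) %/% width + 1   # block index for each column
  pick <- function(i, negate) {
    rows <- keep[, grp[i]]
    if (negate) rows <- -rows                # negative index = all other rows
    out <- df[rows, i, drop = FALSE]
    rownames(out) <- NULL                    # avoid mismatched row names in cbind
    out
  }
  list(kept = do.call(cbind, lapply(seq_along(df), pick, negate = FALSE)),
       rest = do.call(cbind, lapply(seq_along(df), pick, negate = TRUE)))
}

# Data from the question
set.seed(2)
Mat1 <- data.frame(matrix(nrow = 4, ncol = 10, data = rnorm(40, 0, 1)))
keep.rows <- cbind(c(1, 2), c(2, 3), c(2, 3), c(1, 2))

res <- split_rows(Mat1, keep.rows, width = 3)
Output1 <- res$kept
Output2 <- res$rest
```

With a 2x2 keep.rows and ten columns, width = 5 would give the "five entries each time" case described in the question.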

Efficient way to find all combinations in a data frame in R

I am looking for an efficient way in R to derive all possible combinations.
I have a data frame with 3 columns, and I compute all possible combinations based on the contents of the first column.
df <- data.frame("H" = c("H1","H2","H3","H4"), "W1" = c(95, 0, 85 ,0) , "W2" = c(50, 85, 0,0))
df$H <- as.character(df$H)  # works whether H is stored as factor or character
nH <- nrow(df)
nW <- 2
library(plyr)
library(gtools)
if(nW<=5){
# Find all possible combinations
mat1 <- matrix(nrow = 0, ncol = nH)
for(i in 1:nH){
# mat1 <- rbind.fill.matrix(mat1, combinations(nH,nH-(i-1),df$H))
mat1 <- rbind.fill.matrix(mat1, t(combn(df$H,nH-(i-1))))
}
df_comb <- data.frame(mat1)
}
View(df_comb)
df_comb gives the correct output. The code above works well for small data sets, but when the H column has more than 15 values, R runs out of memory.
I am looking for ways to compute these combinations efficiently in R for up to 50 values (H1, H2, ..., H49, H50).
EDIT:
I tried a different approach: once the number of possible combinations exceeds a threshold (32767 in the case below), I use random sampling to generate the combinations, allocating them across subset sizes by ratio.
nH <- 26
nW <- 2
if(nW<=5){
# Find all possible combinations ~~~~~ Random Sampling
ncomb <- 0
for(i in 1:nH){
ncomb <- ncomb + choose(nH, nH-(i-1))
}
nmax <- 10000 # Total number of combinations cannot exceed 10000
mat1 <- matrix( nrow = 0, ncol = nH)
for(i in 1:nH){ # For each group 26C26 26C25 26C24 ..... 26C2 26C1
ncombi <- choose(nH, nH-(i-1)) # For i = 1, 26C26
ncombComputed <- ceiling(nmax/ncomb*choose(nH, nH-(i-1)))
if(ncomb <= 32767 ){ # This condition is independent of NMAX - For 15
#Combinations
print("sefirst")
final <- mat1
print(paste(nH," ",i))
abc <- combinations(nH,nH-(i-1),df$H)
mat1 <- rbind.fill.matrix(mat1, combinations(nH,nH-(i-1),df$H))
}
else {
print(i)
print("second")
combi <- matrix( nrow = 0, ncol = nH-(i-1))
#random sampling
while(nrow(combi) < ncombComputed){
combi<- rbind(combi,sort(sample(df$H,nH-(i-1))))
combi <- unique(combi)
}
mat1 <- rbind.fill.matrix(mat1, combi)
}
}
df_comb_New <- data.frame(mat1)
}
The code above gives the result, but for 26 entries it takes 36 seconds to produce 10000 combinations. Is there a way to optimize the while loop so that it runs faster, or any other way to achieve the same result efficiently?
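One likely cost in the loop above is that the matrix is grown with rbind and deduplicated with unique one row at a time. A possible alternative, sketched below, draws subsets in batches and deduplicates them as strings. This is not the ratio method from the question: it picks each subset's size at random with probability proportional to the number of subsets of that size, so the split across sizes matches the ratio only on average. sample_subsets is a hypothetical name.

```r
# Hypothetical helper: draw up to `n_max` distinct non-empty subsets of `items`.
sample_subsets <- function(items, n_max = 10000) {
  n <- length(items)
  n_target <- min(n_max, 2^n - 1)           # 2^n - 1 non-empty subsets exist
  size_weights <- choose(n, seq_len(n))     # number of subsets of each size
  keys <- character(0)
  while (length(keys) < n_target) {
    # Pick a size for every draw in the batch, then a sorted index set of that size
    sizes <- sample(seq_len(n), n_target, replace = TRUE, prob = size_weights)
    batch <- vapply(sizes,
                    function(k) paste(sort(sample(n, k)), collapse = ","),
                    character(1))
    keys <- unique(c(keys, batch))          # deduplicate once per batch
  }
  keys <- keys[seq_len(n_target)]
  # Turn the index strings back into vectors of item names
  lapply(strsplit(keys, ","), function(idx) items[as.integer(idx)])
}

set.seed(1)
H <- paste0("H", 1:26)
subsets <- sample_subsets(H, n_max = 1000)
```

The result is a list of character vectors of varying length rather than a padded matrix. For small inputs, where n_max is close to the total number of subsets, full enumeration with combn as in the first snippet is the better choice, since random sampling needs many draws to hit the last few subsets.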

Extract information from one matrix through another matrix

I have 2 matrices: one is species x traits and the other is site x species (presence/absence). I need a third matrix, sites x traits, where each cell combines the values for all species present at that site. How can I extract information from one matrix through another matrix? I am just a beginner in R...
I transposed the site x species matrix and used cbind on the 2 matrices, but the result just had all the columns in one matrix...
trait <- read.table("trait_matrix_final.txt", head=T, sep="\t", dec=',', row.names=1)
com <- read.table("community_matrix2.txt", head=T, sep="\t", dec=',', row.names=1)
comt <- t(com)
new <- cbind(trait, comt)
I also tried to multiply the two matrices, but that is not possible because I have both continuous and categorical data.
EDIT:
Additional comments: I have continuous variables (e.g. body size) and categorical variables (e.g. daily activity, with the values nocturnal, diurnal or both). So, if I have 3 species in site 1, I want to obtain the mean body size of these 3 species for site 1. For the categorical variable, if the 3 species have the values species 1 = nocturnal, species 2 = nocturnal and species 3 = diurnal, the cell should be something like nocturnal+diurnal or nocturnal.diurnal. My third matrix will have the same number of columns as the 1st matrix (species x traits), but the traits are averaged across all species for the particular site.
It would be very useful to provide a reproducible example so the SO community can help you solve the problem.
AFTER EDIT:
You should store the data in an object of class matrix only if all entries of that matrix are of the same class (e.g. all numeric or all character). Because your first matrix has both numeric and character values it is better to format it as a data.frame. See this post for more info.
I will generate some data assuming you have 5 traits per species, 20 species per site, and 10 sites:
n.traits <- 5
n.species <- 20
n.sites <- 10
traits.names <- paste ("trait", 1:n.traits, sep = "_")
species.names <- paste ("spec", 1:n.species, sep = "_")
sites.names <- paste ("site", 1:n.sites, sep = "_")
# species*traits matrix
set.seed (4)
mat1 <- as.data.frame (matrix (replicate (n = n.traits, rnorm (n = n.species)), nrow = n.species, ncol = n.traits, dimnames = list (species.names, traits.names)))
mat1
set.seed (89)
mat1[, 2] <- sample (x = c ("diurnal", "nocturnal"), size = nrow (mat1), replace = T)
mat1
# site*species matrix
set.seed (6)
mat2 <- matrix (replicate (n = n.species, rbinom (n = n.sites, size = 1, prob = 0.8)), nrow = n.sites, ncol = n.species, dimnames = list (sites.names, species.names))
mat2
The following for loop averages the traits across species for each site:
# sites*traits matrix
mat3 <- as.data.frame (matrix (NA, nrow = n.sites, ncol = n.traits, dimnames = list (sites.names, traits.names)))
for (i in 1:n.sites){
spec_per_site_boolean <- mat2[i, ] == 1
mat1_subset <- mat1[spec_per_site_boolean, ]
for (j in 1:n.traits){
if (is.numeric (mat1_subset[,j]))
mat3[i,j] <- mean (mat1_subset[,j])
else
mat3[i,j] <- paste (sort (unique(mat1_subset[,j])), collapse = ".")
}
}
mat3
Note that the third matrix has the same number of columns as the first one (e.g. ncol (mat1) == ncol (mat3)), but it doesn't have the same number of rows (e.g. nrow (mat1) != nrow (mat3)).
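The same per-site summary can be written without explicit index bookkeeping by applying one summary function to every trait column. The sketch below uses a small made-up data set so the result can be checked by hand; traits, presence and summarise_trait are hypothetical names.

```r
# Made-up example: 4 species, 1 numeric and 1 categorical trait
traits <- data.frame(
  size = c(10, 20, 30, 40),
  activity = c("diurnal", "nocturnal", "nocturnal", "diurnal"),
  row.names = paste0("spec_", 1:4),
  stringsAsFactors = FALSE
)
# Presence/absence: site_1 has species 1 and 2, site_2 has species 2 and 3
presence <- matrix(c(1, 1, 0, 0,
                     0, 1, 1, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("site_1", "site_2"), rownames(traits)))

# Mean for numeric traits, sorted unique values joined by "." otherwise
summarise_trait <- function(x) {
  if (is.numeric(x)) mean(x) else paste(sort(unique(x)), collapse = ".")
}

# One single-row data frame per site, stacked with rbind
site_traits <- do.call(rbind, lapply(rownames(presence), function(s) {
  present <- presence[s, ] == 1
  as.data.frame(lapply(traits[present, , drop = FALSE], summarise_trait),
                row.names = s, stringsAsFactors = FALSE)
}))
site_traits
```

Here site_1 contains a 10 and a 20 species with different activity patterns, and site_2 contains two nocturnal species of size 20 and 30.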

Speeding up count of pairwise observations in R

I have a dataset where a subset of measurements for each entry are randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating correlations between each row in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of non-NA pairwise observations using a matrix approach.
Let's use this data generation code. I made the data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
Then your filter code takes 85 seconds:
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix with values 1 for non-NAs in the original data. Then using matrix multiplication I calculate number of pairwise non-NAs. It ran in a fraction of a second.
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
A simple check shows the results are the same:
all(filter == filter2)
# TRUE
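Put together on data shaped like the original example, the whole pipeline might look like the sketch below. The reason it works: if observed[i, k] is 1 when entry i has a value at time k and 0 otherwise, then entry (i, j) of tcrossprod(observed) is the sum over k of observed[i, k] * observed[j, k], which is exactly the number of time points where both rows are non-NA. The names observed, n_pairs and cor_mat are illustrative.

```r
set.seed(42)
dat <- matrix(runif(100), nrow = 10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA

# 1 where a value is present, 0 where it is missing
observed <- (!is.na(dat)) * 1

# n_pairs[i, j] = number of columns where rows i and j are both non-NA
n_pairs <- tcrossprod(observed)

# Pairwise correlations, blanked out where fewer than 5 shared observations
cor_mat <- cor(t(dat), use = "pairwise.complete.obs")
cor_mat[n_pairs < 5] <- NA
```

The diagonal of n_pairs holds the number of non-NA values in each row, which can serve as a quick sanity check.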

R: How to write a for loop that reads every two lines in a matrix?

I want to calculate correlation statistics using cor.test(). I have a data matrix where the two members of each pair to be tested are on consecutive lines (I have more than a thousand pairs, so I will also need to correct for multiple testing later). I was thinking that I could loop through the matrix two lines at a time and perform the test (i.e. first test the correlation between row1 and row2, then row3 and row4, row5 and row6, etc.), but I don't know how to write this kind of loop.
This is how I do the test on a single pair:
d = read.table(file="cor-test-sample-data.txt", header=T, sep="\t", row.names = 1)
d = as.matrix(d)
cor.test(d[1,], d[2,], method = "spearman")
You could try
res <- lapply(split(seq_len(nrow(mat1)),(seq_len(nrow(mat1))-1)%/%2 +1),
function(i){m1 <- mat1[i,]
if(NROW(m1)==2){
cor.test(m1[1,], m1[2,], method="spearman")
}
else NA
})
To get the p-values
resP <- sapply(res, function(x) x$p.value)
indx <- t(`dim<-`(seq_len(nrow(mat1)), c(2, nrow(mat1)/2)))
names(resP) <- paste(indx[,1], indx[,2], sep="_")
resP
# 1_2 3_4 5_6 7_8 9_10 11_12 13_14
#0.89726818 0.45191660 0.14106085 0.82532260 0.54262680 0.25384239 0.89726815
# 15_16 17_18 19_20 21_22 23_24 25_26 27_28
#0.02270217 0.16840791 0.45563229 0.28533447 0.53088721 0.23453161 0.79235990
# 29_30 31_32
#0.01345768 0.01611903
Or using mapply (assuming that the number of rows is even)
ind <- seq(1, nrow(mat1), by=2) #similar to the one used by #CathG in for loop
mapply(function(i,j) cor.test(mat1[i,], mat1[j,],
method='spearman')$p.value , ind, ind+1)
data
set.seed(25)
mat1 <- matrix(sample(0:100, 20*32, replace=TRUE), ncol=20)
Try
d = matrix(rep(1:9, 3), ncol=3, byrow = T)
sapply(2*(1:(nrow(d)/2)), function(pair) unname(cor.test(d[pair-1,], d[pair,], method="spearman")$estimate))
pvalues<-c()
for (i in seq(1,nrow(d),by=2)) {
pvalues<-c(pvalues,cor.test(d[i,],d[i+1,],method="spearman")$p.value)
}
names(pvalues)<-paste(row.names(d)[seq(1,nrow(d),by=2)],row.names(d)[seq(2,nrow(d),by=2)],sep="_")
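The approaches above can be combined into one function that also tolerates an odd number of rows (the unpaired last row is ignored) and is followed by the multiple-testing correction mentioned in the question. This is a sketch: paired_cor_tests is a hypothetical name, and Benjamini-Hochberg via p.adjust is just one possible choice of correction.

```r
# Hypothetical helper: test rows 1 vs 2, 3 vs 4, ... and return the p-values
paired_cor_tests <- function(m, method = "spearman") {
  stopifnot(nrow(m) >= 2)
  first <- seq(1, nrow(m) - 1, by = 2)   # first row of each pair
  p <- vapply(first,
              function(i) cor.test(m[i, ], m[i + 1, ], method = method)$p.value,
              numeric(1))
  names(p) <- paste(first, first + 1, sep = "_")
  p
}

set.seed(25)
mat1 <- matrix(sample(0:100, 20 * 32, replace = TRUE), ncol = 20)

# Spearman tests warn about ties in integer data; silenced here for brevity
p_raw <- suppressWarnings(paired_cor_tests(mat1))

# Adjust for the number of pairs tested
p_adj <- p.adjust(p_raw, method = "BH")
```

vapply preallocates the result, which avoids growing the pvalues vector inside the loop when there are thousands of pairs.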

Resources