I have a dataset where a subset of measurements for each entry are randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating the correlation between each pair of rows in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations based on fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there a way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of non-NA pairwise observations using a matrix approach.
Let's use this data-generation code; I made the data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
Your filter code then takes about 85 seconds:
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix that holds 1 wherever the original data is non-NA. Matrix multiplication then counts the pairwise non-NAs. It ran in a fraction of a second.
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
A simple check shows the results are the same:
all(filter == filter2)
# TRUE
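To finish the original task, the count matrix can be used directly to blank out the correlations that lack enough support (a short sketch; I renamed the correlation matrix cor_mat so it does not mask the cor() function):
cor_mat <- cor(t(dat), use = "pairwise.complete.obs")  # row-vs-row correlations
cor_mat[filter2] <- NA  # drop correlations with fewer than 5 pairwise non-NA observations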
I'm new to R. I am trying to multiply a fixed matrix by the 2-way frequency table of every pair of columns in a data frame or matrix, and divide the result by the sequence length (the number of rows, which is 15). The problem is that the running time increases dramatically with 1K sequences (1K columns), taking more than 10 minutes; for 10K sequences it could take more than an hour. The goal is to use this with as many sequences as possible.
mat1 <- matrix(sample(LETTERS),ncol = 100,nrow = 15)
mat2 <- matrix(sample(abs(rnorm(26,0,3))),ncol=26,nrow=26)
rownames(mat2) <- LETTERS
colnames(mat2) <- LETTERS
diag(mat2) <- 0
test_vec <- c()
for (i in seq(ncol(mat1) - 1)) {
  for (j in seq(i + 1, ncol(mat1))) {
    s2 <- table(mat1[, i], mat1[, j])  # create 2-way frequency table
    mat2_1 <- mat2[rownames(mat2) %in% rownames(s2),
                   colnames(mat2) %in% colnames(s2)]
    calc <- (1 / nrow(mat1)) * sum(mat2_1 * s2)
    test_vec <- append(test_vec, calc)
  }
}
Thanks for the help.
Here is an approach that converts mat1 to a data.table, converts all the columns to factors, and uses table(..., exclude = NULL):
library(data.table)
m=as.data.table(mat1)[,lapply(.SD, factor, levels=LETTERS)]
g = combn(colnames(m),2, simplify = F)
result = sapply(g, function(x) sum(table(m[[x[1]]], m[[x[2]]], exclude=NULL)*mat2)/nrow(m))
Check equality:
sum(result-test_vec>1e-10)
[1] 0
Here there are 4950 combinations (100*99/2), but the number of combinations increases quickly as ncol(mat1) (the number of sequences) grows, as you point out. You might find in that case that a parallelized version works well.
library(doParallel)
library(data.table)
registerDoParallel()
m=as.data.table(mat1)[,lapply(.SD, factor, levels=LETTERS)]
g = combn(colnames(m),2, simplify = F)
result = foreach(i=1:length(g), .combine=c) %dopar%
sum(table(m[[g[[i]][1]]], m[[g[[i]][2]]], exclude=NULL)*mat2)
result = result/nrow(m)
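The same equality check as above should hold for the parallel result, since it performs the identical computation (this assumes test_vec from the question is still in the workspace):
sum(result - test_vec > 1e-10)
# expected: [1] 0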
Let's assume I have a binary matrix with 24 columns and 5000 rows.
The columns are Parameters (P1 - P24) of 5000 subjects. The parameters are binary (0 or 1).
(Note: my real data can contain as much as 40,000 subjects)
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
Now I would like to determine what are all possible combinations of the 24 measured parameters:
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
The final question is: How often does each of the possible row combinations from comb appear in matrix m?
I managed to write code for this, creating a new column in comb to hold the counts. But my code is really slow and would take an estimated 328 days to run, so the code below only considers the first 20 combinations:
comb$count <- 0
for (k in 1:20) {  # considers only the first 20 combinations of comb
  for (i in 1:nrow(m)) {
    if (all(m[i, ] == comb[k, 1:24])) {
      comb$count[k] <- comb$count[k] + 1
    }
  }
}
Is there a computationally more efficient way to do this, so that I can count all combinations in a reasonable amount of time?
Thank you very much for your help in advance.
data.table is fast at this type of operation:
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
library(data.table)
data_t = data.table(m)
ans = data_t[, .N, by = P1:P24]
dim(ans)
head(ans)
The core of this is by = P1:P24, which means group by all the columns, and .N, which gives the number of records in each group.
I used this as inspiration - How does one aggregate and summarize data quickly?
and the data.table introductory vignette: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
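If you also need the zero counts for combinations in comb that never occur in m, the grouped counts can be joined back onto comb. This is only a sketch building on the objects above: comb has 2^24 (about 16.8 million) rows, so the join is itself a big operation, and it assumes data.table coerces comb's integer columns to match the numeric columns of data_t during the join.
comb_dt <- as.data.table(comb)               # all 2^24 possible combinations
res <- ans[comb_dt, on = paste0("P", 1:24)]  # one row per row of comb; N is NA if unseen
res[is.na(N), N := 0]                        # combinations never seen in m get count 0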
If all you need is the combinations that occur in the data and how many times, this will do it:
m2 <- apply(m, 1, paste0, collapse="")
m2.tbl <- xtabs(~m2)
head(m2.tbl)
m2
# 000000000001000101010010 000000000010001000100100 000000000010001110001100 000000000100001000010111 000000000100010110101010 000000000100101000101100
# 1 1 1 1 1 1
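If a plain two-column summary is easier to work with than a named table, the same result can be reshaped (a small follow-up using m2.tbl from above):
freq <- as.data.frame(m2.tbl)    # columns: m2 (the pasted row pattern) and Freq
head(freq[order(-freq$Freq), ])  # most frequent observed combinations first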
You can use apply to paste the values of each row together and table to count the frequencies.
table(apply(m, 1, paste0, collapse = '-'))
I am stuck.
We are asked to pick 30 random values from our dataset, then replace the picked values with NAs.
I'm stuck at the beginning: the code below selects 30 random rows (and therefore 30 values from every column), while I want 30 random values picked from the whole dataset.
data2[sample(nrow(data2),30), ]
I hope you can help me out. Thank you for your help.
Do you mean to replace 30 random rows?
data2 <- iris # as an example
throwouts <- sample(nrow(data2),30)
data2[throwouts, ] <- NA
print(data2)
Do you mean to replace 30 values in random rows and random columns?
data2 <- iris # as an example
coords <- expand.grid(1:nrow(data2), 1:ncol(data2))  # all possible (row, column) positions
coords <- coords[sample(nrow(coords), 30), ]         # take 30 unique positions
for (i in 1:30)                                      # erase each of them individually
  data2[coords$Var1[i], coords$Var2[i]] <- NA
print(data2)
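As a side note, if data2 were a plain matrix rather than a data frame, the loop could likely be collapsed into a single matrix-indexed assignment (just a sketch; for a data frame the loop above, or the is.na()<- approach below, is the safer route):
data2[as.matrix(coords)] <- NA  # row indices from coords$Var1, column indices from coords$Var2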
The following seems to be memory efficient: it uses a logical matrix of FALSE values, with 30 TRUE values in random positions, to assign the NAs.
set.seed(2020)
v <- rep(FALSE, prod(dim(df1)))
v[sample(length(v), 30)] <- TRUE
is.na(df1) <- matrix(v, nrow = nrow(df1))
rm(v)
This can easily be written as a function.
assignNA <- function(x, n){
  v <- rep(FALSE, prod(dim(x)))
  v[sample(length(v), n)] <- TRUE   # use n here, not a hard-coded 30
  is.na(x) <- matrix(v, nrow = nrow(x))
  x
}
set.seed(2020)
assignNA(df1, n = 30)
Tested with the data
df1 <- iris
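A quick sanity check (iris has no pre-existing NAs, so the result should contain exactly 30 missing values):
set.seed(2020)
out <- assignNA(df1, n = 30)
sum(is.na(out))
# [1] 30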
I have a dataset with 20 columns and 1000 rows generated using:
sim_data <- do.call(cbind, replicate(20, rexp(1000, 1/120), simplify = FALSE))
How can I pick a random number of columns per row to add up their values, and have a column indicating how many columns were picked?
I have:
picked <- sim_data[sample(nrow(sim_data), 5)]
sim_data$Sum <- sum(picked)
sim_data$Number <- length(picked)
but how do I pick a random size from 1 to 20, instead of "5", and repeat over all rows?
We can use apply
cbind(sim_data, t(apply(sim_data, 1, function(x) {
  i1 <- sample(seq_along(x), 1)   # random number of columns to pick
  out <- sum(sample(x, i1))       # sum of the picked values
  c(Length = i1, Sum = out)
})))
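If you want the columns named Sum and Number as in the question, a small variation of the same idea (still assuming sim_data is the matrix generated above):
set.seed(1)  # optional, only for reproducible sampling
extra <- t(apply(sim_data, 1, function(x) {
  n <- sample(seq_along(x), 1)            # how many columns to pick (1 to 20)
  c(Sum = sum(sample(x, n)), Number = n)  # their sum and the number picked
}))
sim_data2 <- cbind(sim_data, extra)       # Sum and Number appended as extra columns
head(sim_data2[, c("Sum", "Number")])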
I am looking for an efficient way in R to derive all possible combinations.
I have a data frame with 3 columns, and based on the contents of the first column I am calculating all possible combinations.
df <- data.frame("H" = c("H1","H2","H3","H4"), "W1" = c(95, 0, 85 ,0) , "W2" = c(50, 85, 0,0))
df$H <- as.character(df$H)
nH <- nrow(df)
nW <- 2
library(plyr)
library(gtools)
if (nW <= 5) {
  # Find all possible combinations
  mat1 <- matrix(nrow = 0, ncol = nH)
  for (i in 1:nH) {
    # mat1 <- rbind.fill.matrix(mat1, combinations(nH, nH - (i - 1), df$H))
    mat1 <- rbind.fill.matrix(mat1, t(combn(df$H, nH - (i - 1))))
  }
  df_comb <- data.frame(mat1)
}
View(df_comb)
df_comb gives the correct output. The code above works well for small datasets, but when the H column has more than 15 values, R runs out of memory.
I am looking for a way to calculate the combinations in the above scenario efficiently in R, up to H1, H2, ..., H49, H50.
EDIT:
I tried a different approach: once the number of possible combinations exceeds a threshold (32767 in the code below), random sampling is applied to generate combinations, with the sample size for each group set proportionally (a ratio method).
nH <- 26
nW <- 2
if (nW <= 5) {
  # Find all possible combinations ~~~~~ random sampling
  ncomb <- 0
  for (i in 1:nH) {
    ncomb <- ncomb + choose(nH, nH - (i - 1))
  }
  nmax <- 10000  # total number of combinations cannot exceed 10000
  mat1 <- matrix(nrow = 0, ncol = nH)
  for (i in 1:nH) {  # for each group: 26C26, 26C25, ..., 26C2, 26C1
    ncombi <- choose(nH, nH - (i - 1))  # for i = 1 this is 26C26
    ncombComputed <- ceiling(nmax / ncomb * choose(nH, nH - (i - 1)))
    if (ncomb <= 32767) {  # this condition is independent of nmax - for 15 combinations
      print("sefirst")
      final <- mat1
      print(paste(nH, " ", i))
      abc <- combinations(nH, nH - (i - 1), df$H)  # (abc is not used afterwards)
      mat1 <- rbind.fill.matrix(mat1, combinations(nH, nH - (i - 1), df$H))
    }
    else {
      print(i)
      print("second")
      combi <- matrix(nrow = 0, ncol = nH - (i - 1))
      # random sampling
      while (nrow(combi) < ncombComputed) {
        combi <- rbind(combi, sort(sample(df$H, nH - (i - 1))))
        combi <- unique(combi)
      }
      mat1 <- rbind.fill.matrix(mat1, combi)
    }
  }
  df_comb_New <- data.frame(mat1)
}
The above code gives the result, but for 26 entries it takes 36 seconds to produce 10,000 combinations. Is there a way to optimize the while loop so that execution becomes faster, or another way to achieve the same result more efficiently?
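One idea, offered as an untested sketch rather than a benchmarked answer: instead of adding one combination at a time and running unique() on the growing matrix in every iteration, draw a whole batch of candidate combinations, deduplicate once per batch, and only top up if duplicates were dropped. The helper name sample_combos is made up for illustration:
# sketch: draw n_needed distinct size-k combinations of 'items' (e.g. df$H)
sample_combos <- function(items, k, n_needed, oversample = 2) {
  combi <- matrix(character(0), nrow = 0, ncol = k)
  while (nrow(combi) < n_needed) {
    batch <- do.call(rbind, lapply(seq_len(oversample * n_needed),
                                   function(...) sort(sample(items, k))))
    combi <- unique(rbind(combi, batch))  # deduplicate once per batch, not once per row
  }
  combi[seq_len(n_needed), , drop = FALSE]
}
# the inner while loop could then be replaced by something like:
# combi <- sample_combos(df$H, nH - (i - 1), ncombComputed)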