R - randomly picking columns to sum up row values

I have a dataset with 20 columns and 1000 rows generated using:
sim_data <- do.call(cbind, replicate(20, rexp(1000, 1/120), simplify = FALSE))
How can I pick a random number of columns per row to add up their values, and have a column indicating how many columns were picked?
I have:
picked <- sim_data[sample(nrow(sim_data), 5)]
sim_data$Sum <- sum(picked)
sim_data$Number <- length(picked)
but how do I pick a random size from 1 to 20, instead of "5", and repeat over all rows?

We can use apply
cbind(sim_data, t(apply(sim_data, 1, function(x) {
  i1 <- sample(seq_along(x), 1)
  out <- sum(sample(x, i1))
  c(Length = i1, Sum = out)
})))
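For completeness, a minimal runnable sketch of the same idea (the set.seed call and the extra/res names are only for illustration):
set.seed(42)  # only so this example is reproducible
sim_data <- do.call(cbind, replicate(20, rexp(1000, 1/120), simplify = FALSE))

extra <- t(apply(sim_data, 1, function(x) {
  i1 <- sample(seq_along(x), 1)               # how many columns to pick for this row (1 to 20)
  c(Length = i1, Sum = sum(sample(x, i1)))    # sum of that many randomly chosen values
}))
res <- cbind(sim_data, extra)
head(extra)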

Related

multiply every cell of a data.table by a different factor from a prespecified range

I have a data.table that I want to change:
library(data.table)
testDF <- data.table(col1 = c(1, 1, 1),
                     col2 = c(2, 2, 2),
                     col3 = c(3, 3, 3))
For specific columns, I want to multiply every row (and column) by a different factor. This factor has to satisfy some conditions:
lie in a certain range, e.g. 0.5 - 1.5
not contain a specific number, e.g. 1
For this I have written a function:
generateRandomFactor <- function(n){
  factor <- runif(n, 0.5, 1.5)
  if(any(factor == 1)){
    stop("factor of 1 chosen.")
  }
  return(factor)
}
Then I use this function to multiply the selected columns by a random factor:
SC <- c("col1", "col3")
factors <- generateRandomFactor(length(SC))
testMultiplication <- mapply(`*`, testDF[, ..SC], factors)
How can I extend this so that the multiplication happens for the selected columns, but also per row, with a different factor every time? The number of rows is not known in advance. So I essentially want to create a new data.table with random contents of the same dimensions as the subset testDF[, ..SC] of the original data.table, and then multiply them together.
Or is there a better alternative?
Maybe you can try the code below
testDF[, c(SC) := lapply(.SD[, SC, with = FALSE], function(x) x * runif(length(x), 0.5, 1.5))]
and you will see
col1 col2 col3
1: 0.8953371 2 1.501498
2: 1.0371162 2 4.234734
3: 0.8999022 2 2.519295
I ended up modifying the function:
generateRandomMatrix <- function(nrow, ncol){
  # keep the factors in the 0.5 - 1.5 range, as in the original function
  randomMatrix <- matrix(runif(nrow*ncol, 0.5, 1.5), nrow = nrow, ncol = ncol)
  if(any(randomMatrix == 1)){
    stop("factor of 1 chosen.")
  }
  return(randomMatrix)
}
and then multiply the data.tables:
randomFactors <- generateRandomMatrix(nrow(testDF), length(SC))
testDF_mock <- testDF[, ..SC] * randomFactors
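As a variant, here is a hedged sketch that redraws any factor landing exactly on 1 instead of stopping with an error (the min/max arguments are illustrative defaults, not part of the original code):
library(data.table)

# Variant sketch: draw an nrow x ncol matrix of factors in [min, max] and redraw
# any entries that land exactly on 1 (with runif this is essentially impossible,
# but it makes the stated condition explicit instead of raising an error)
generateRandomMatrix <- function(nrow, ncol, min = 0.5, max = 1.5) {
  randomMatrix <- matrix(runif(nrow * ncol, min, max), nrow = nrow, ncol = ncol)
  while (any(randomMatrix == 1)) {
    hits <- randomMatrix == 1
    randomMatrix[hits] <- runif(sum(hits), min, max)
  }
  randomMatrix
}

SC <- c("col1", "col3")
randomFactors <- generateRandomMatrix(nrow(testDF), length(SC))
testDF_mock <- testDF[, ..SC] * randomFactors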

How to quantify the frequency of all possible row combinations of a binary matrix in R in a more efficient way?

Let's assume I have a binary matrix with 24 columns and 5000 rows.
The columns are Parameters (P1 - P24) of 5000 subjects. The parameters are binary (0 or 1).
(Note: my real data can contain as many as 40,000 subjects.)
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
Now I would like to determine what are all possible combinations of the 24 measured parameters:
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
The final question is: How often does each of the possible row combinations from comb appear in matrix m?
I managed to write code for this and create a new column in comb to hold the counts. But my code is really slow and would take an estimated 328 days to run, so the code below only considers the first 20 combinations:
comb$count <- 0
for (k in 1:20){  # considers only the first 20 combinations of comb
  for (i in 1:nrow(m)){
    if (all(m[i, ] == comb[k, 1:24])){
      comb$count[k] <- comb$count[k] + 1
    }
  }
}
Is there a computationally more efficient way to do this so I can count all combinations in a reasonable time?
Thank you very much for your help in advance.
data.table is fast at this type of operation:
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
library(data.table)
data_t = data.table(m)
ans = data_t[, .N, by = P1:P24]
dim(ans)
head(ans)
The core of the call is by = P1:P24, which means group by all the columns, and .N, which gives the number of records in each group.
I used this as inspiration - How does one aggregate and summarize data quickly?
and the data.table introduction vignette: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
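If you also need counts for the combinations that never occur in m (i.e. one count per row of comb), a join sketch building on ans could look like this:
# Sketch: attach counts to all 2^24 combinations in comb, with 0 where a
# combination is never observed (column types are coerced so the join keys match)
comb_dt <- as.data.table(comb)
comb_dt[, names(comb_dt) := lapply(.SD, as.numeric)]
res <- ans[comb_dt, on = paste0("P", 1:24)]
res[is.na(N), N := 0L]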
If all you need is the combinations that occur in the data and how many times, this will do it:
m2 <- apply(m, 1, paste0, collapse="")
m2.tbl <- xtabs(~m2)
head(m2.tbl)
# m2
# 000000000001000101010010 000000000010001000100100 000000000010001110001100 000000000100001000010111 000000000100010110101010 000000000100101000101100
# 1 1 1 1 1 1
You can use apply to paste the values of each row together and table to count the frequencies.
table(apply(m, 1, paste0, collapse = '-'))
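A small follow-up sketch for either paste-based approach: turn the counts into a data frame and look up one specific pattern (the all-zeros pattern here is just an example):
freq <- as.data.frame(m2.tbl, stringsAsFactors = FALSE)   # columns: m2, Freq
freq[freq$m2 == paste0(rep(0, 24), collapse = ""), ]      # empty if that pattern never occurs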

r loop for filtering through each column

I have a data frame like this:
[image: gene expression data frame]
The column names are different samples and the row names are different genes.
Now I want to know the number of genes left after filtering each column by a cutoff value.
For example,
sample1_more_than_5 <- df[(df[,1]>5),]
sample1_more_than_10 <- df[(df[,1]>10),]
sample1_more_than_20 <- df[(df[,1]>20),]
sample1_more_than_30 <- df[(df[,1]>30),]
Then,
sample2_more_than_5 <- df[(df[,2]>5),]
sample2_more_than_10 <- df[(df[,2]>10),]
sample2_more_than_20 <- df[(df[,2]>20),]
sample2_more_than_30 <- df[(df[,2]>30),]
But I don't want to repeat this 100 times as I have 100 samples.
Can anyone write a loop for me for this situation? Thank you
Here is a solution using two loops that calculates, for each sample (column), the number of genes (rows) with a value greater than each entry of the nums vector.
# Create the vector with the numbers used to filter each column
nums <- c(5, 10, 20, 30)
# Loop over each column
resul <- apply(df, 2, function(x){
  # Count the rows that have a higher value than each nums entry
  sapply(nums, function(y){
    length(x[x > y])
  })
})
# Transform the result into a data.frame and add the nums vector as the first column
resul <- data.frame(greaterthan = nums,
                    as.data.frame(resul))
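An equivalent, slightly more direct sketch using colSums (assuming df contains only numeric expression values):
counts <- t(sapply(nums, function(y) colSums(df > y, na.rm = TRUE)))  # one row per cutoff, one column per sample
rownames(counts) <- paste0("more_than_", nums)
counts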
We can loop over the columns and create the grouping with cut
lst1 <- lapply(df, function(x) split(x, cut(x, breaks = c(5, 10, 20, 30))))
or use findInterval and then split
lst1 <- lapply(df, function(x) split(x, findInterval(x, c(5, 10, 20, 30))))
If we go by the way the objects are created in the OP's post, there would be 100 * 4, i.e. 400 objects (100 columns), in the global environment. Instead, it can be a single list object.
The individual objects can still be created with assign, but it is not recommended:
v1 <- c(5, 10, 20, 30)
v2 <- seq_along(df)
for(i in v2) {
  for(j in v1) {
    assign(sprintf('sample%d_more_than_%d', i, j),
           value = df[df[, i] > j, , drop = FALSE])
  }
}
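A sketch of the single-list alternative mentioned above, keeping everything in one nested object instead of 400 variables (the more_than_ names are illustrative):
lst2 <- lapply(seq_along(df), function(i) {
  setNames(lapply(v1, function(j) df[df[, i] > j, , drop = FALSE]),
           paste0("more_than_", v1))
})
names(lst2) <- names(df)
# e.g. lst2[[1]][["more_than_5"]] corresponds to sample1_more_than_5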

Speeding up count of pairwise observations in R

I have a dataset where a subset of measurements for each entry are randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating correlations between each row in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there a way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of with nested loops?
You can count the number of non-NA pairwise observations using a matrix approach.
Let's use this data generation code. I made the data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
Then your filter code takes about 85 seconds:
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix with a 1 for every non-NA entry of the original data. Then, using matrix multiplication, I calculate the number of pairwise non-NAs. It runs in a fraction of a second:
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
A simple check shows the results are the same:
all(filter == filter2)
# TRUE
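To finish the task from the question, apply the fast filter to the correlation matrix exactly as before (using cc rather than cor as the object name, to avoid masking the cor() function):
cc <- cor(t(dat), use = "pairwise.complete.obs")
cc[filter2] <- NA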

select first nth percent of rows from random sampled dataframes of list in r

I wrote a function that selects the first nth percent of rows (i.e., up to a threshold) from a data frame, and this works on data frames in a list as well. The functions are given below:
set.threshold.rand <- function(value, vector){
  print(length(vector))
  n <- as.integer(length(vector)/100*value)
  threshold <- vector[n]
  return(threshold)
}

sensitivity.rand <- function(vector, threshold){
  thresh <- set.threshold.rand(threshold, vector)
  print(thresh)
  # after taking the threshold value, assign "H" or "L" according to the condition
  score <- ifelse(vector <= thresh, "H", "L")
  return(score)
}
This function selects the first nth percent of rows from the data frames in a list. For example, the code below labels the first 143 rows as "H", which is what I expected.
vec.1 <- c(1:574)
vec.2 <- c(3001:3574)
df.1 <- data.frame(vec.1, vec.2)
df.2 <- data.frame(vec.2, vec.1)
my_list1 <- list(df.1, df.2)
my_list1 <- lapply(my_list1, function(x) {
  x[1] <- lapply(x[1], sensitivity.rand, threshold = 25)
  x
})
But this doesn't work on sampled and replicated data frames in a list (given below). For example:
my_list <- replicate(10, df.1[sample(nrow(df.1)), ], simplify = FALSE)
my_list <- lapply(my_list, function(x) {
  x[1] <- lapply(x[1], sensitivity.rand, threshold = 25)
  x
})
This labels more than 300 rows as "H". How do I solve this?
Your function set.threshold.rand relies on the fact that the input vector is sorted.
That's why it works with my_list1 and not with my_list, where you've shuffled the rows with sample().
Replace threshold <- vector[n] with threshold <- sort(vector)[n] in set.threshold.rand
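With that one-line change, the corrected function looks like this:
set.threshold.rand <- function(value, vector){
  n <- as.integer(length(vector)/100*value)
  threshold <- sort(vector)[n]   # sort first, so the cut-off no longer depends on row order
  return(threshold)
}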
Adapted from the answer given by @SirSaleh here:
sensitivity.rand <- function(vector, threshold){
  num_to_thres <- floor(threshold * 0.01 * length(vector))
  l <- length(vector)
  score <- c(rep("H", num_to_thres), rep("L", l - num_to_thres))
  return(score)
}
Now it can take any threshold and works regardless of how the rows are ordered.
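A quick check on the shuffled list from the question (each data frame has 574 rows, so floor(0.25 * 574) = 143 rows should be labelled "H" in every element):
my_list <- replicate(10, df.1[sample(nrow(df.1)), ], simplify = FALSE)
my_list <- lapply(my_list, function(x) {
  x[1] <- lapply(x[1], sensitivity.rand, threshold = 25)
  x
})
sapply(my_list, function(x) sum(x[[1]] == "H"))   # 143 for each list element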
