How to iteratively perform combinations on larger datasets in R?

Background - I want to try and exhaustively search a set of all possible combinations of 250 rows taken 10 at a time. In order to iteratively get this, I use the following code
## Function definition
gen.next.cbn <- function(cbn, n){
  ## Generates the combination that follows the one provided as input
  cbn.bin      <- rep(0, n)
  cbn.bin[cbn] <- 1
  if (tail(cbn.bin, 1) == 0){
    ind <- tail(which(cbn.bin == 1), 1)
    cbn.bin[c(ind, ind+1)] <- c(0, 1)
  }else{
    ind <- 1 + tail(which(diff(cbn.bin) == -1), 1)
    nb  <- sum(cbn.bin[-c(1:ind)] == 1)
    cbn.bin[c(ind-1, (n-nb+1):n)] <- 0
    cbn.bin[ind:(ind+nb)]         <- 1
  }
  cbn <- which(cbn.bin == 1)
}

## Example parameters
n <- 40
k <- 10

## Iteration example
for (i in 1:choose(n, k)){
  if (i == 1){
    cbn <- 1:k
  }else{
    cbn <- gen.next.cbn(cbn, n)
  }
  print(cbn)
}
I get the error "cannot allocate vector of size n GB" when I go beyond 40 rows.
Ideal Solution:
a) If the combinations can be dumped and memory can be flushed iteratively after every run in the loop (where I can check the further conditions)
b) If the combinations can be dumped to a csv file such that it does not cause a memory hog.
Thanks for your support.

As I said in the comments, iterpc is the way to go for such a task. You first need to initialize an iterator via the iterpc function. We can then generate the next chunk of combinations via getnext. After this, we simply append our results to a csv (or any file type you like).
library(iterpc)  # provides iterpc() and getnext()

getComboChunks <- function(n, k, chunkSize, totalCombos, myFile) {
  myIter <- iterpc(n, k)

  ## initialize myFile with the first chunk
  myCombs <- getnext(myIter, chunkSize)
  write.table(myCombs, file = myFile, sep = ",", col.names = FALSE)

  maxIteration <- (totalCombos - chunkSize) %/% chunkSize
  for (i in 1:maxIteration) {
    ## get the next "chunkSize" of combinations
    myCombs <- getnext(myIter, chunkSize)

    ## append the above combinations to your file
    write.table(myCombs, file = myFile, sep = ",",
                col.names = FALSE, append = TRUE)
  }
}
For example, getComboChunks(250, 10, 100, 1000, "myCombos.csv") will write out 1000 combinations of 250 choose 10 to the file myCombos.csv, 100 combinations at a time. Doing this in chunks is more efficient than writing one combination at a time.
This library is written in C/C++, so it should be fairly efficient, but as @Florian points out in the comments, it won't produce all gmp::chooseZ(250, 10) = 219,005,316,087,032,475 combinations any time soon. I haven't tested it, but if you settle for 200 choose 5, I think you will be able to produce it in under a day (it is just over 2.5 billion results).
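One detail worth noting: if totalCombos is not a multiple of chunkSize, the loop above leaves a final partial chunk unwritten. A minimal sketch of handling that remainder inside getComboChunks, right after the for loop (reusing the same myIter, myFile, and maxIteration; this is an assumption about how one might extend the function, not part of the original answer):

leftOver <- totalCombos - chunkSize * (maxIteration + 1)
if (leftOver > 0) {
  ## write whatever combinations remain after the last full chunk
  write.table(getnext(myIter, leftOver), file = myFile, sep = ",",
              col.names = FALSE, append = TRUE)
}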


How do I fill in a matrix (by chunks) using a while loop?

I am trying to read in a large data set in chunks:
find the mean of each chunk (each chunk representing a larger column)
add the mean into a matrix column
then find the mean of the means to give me the overall mean of the column.
I have the set-up, but my while-loop is not repeating its cycle. I think the problem may be with how I am referring to "chunks" and "chunk".
This is a practice exercise using "iris.csv" in R:
fl <- file("iris.csv", "r")
clname <- readLines(fl, n=1) # read the header
r <- unlist(strsplit(clname,split = ","))
length(r) # get the number of columns in the matrix
cm <- matrix(NA, nrow=1000, ncol=length(r)) # need a matrix that can be filled on each iteration
numchunk = 0 # set my chunks of code to build up

while(numchunk <= 0){ # stop when no more chunks left to run
  numchunk <- numchunk + 1 # keep on moving through chunks of code
  x <- readLines(fl, n=100) # read 100 lines at a time
  chunk <- as.numeric(unlist(strsplit(x, split = ","))) # readable chunk of code
  m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
  cm[numchunk, ] <- colMeans(m) # get the column means of the matrix and fill in larger matrix
  print(numchunk) # print the number of chunks used
}
cm
close(fl)
final_mean <- colSums(cm)/nrow(cm)
return(final_mean)
--
This works when I set my n = 1000, but I want it to work for larger data sets, where the while-loop will need to keep running.
Can anyone help me correct this please?
Perhaps, this helps
clname <- readLines(fl, n=1) # read the header
r <- unlist(strsplit(clname, split = ","))
length(r) # get the number of columns in the matrix
cm <- matrix(NA, nrow=1000, ncol=length(r))
numchunk = 0
flag <- TRUE

while(flag){
  numchunk <- numchunk + 1 # keep on moving through chunks of code
  x <- readLines(fl, n=5)
  print(length(x))
  if(length(x) == 0) {
    flag <- FALSE
  } else {
    chunk <- as.numeric(unlist(strsplit(x, split = ","))) # readable chunk of code
    m <- matrix(chunk, ncol=length(r), byrow = TRUE) # put chunk in a matrix
    cm[numchunk, ] <- colMeans(m) # get the column means of the matrix and fill in larger matrix
    print(numchunk) # print the number of chunks used
  }
}
cm
close(fl)
final_mean <- colSums(cm)/nrow(cm)
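Note that cm was preallocated with 1000 rows of NA, so the rows that were never filled will turn the final colSums() into NA. A minimal sketch of dropping them first, assuming numchunk is still in scope after the loop:

cm <- cm[seq_len(numchunk - 1), , drop = FALSE] # keep only the rows the loop actually filled
final_mean <- colSums(cm) / nrow(cm)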
First, it might be helpful to define a helper function r2v() to split raw lines into useful vectors.
r2v <- Vectorize(\(x) {
  ## splits raw lines to vectors
  strsplit(gsub('\\"', '', x), split=",")[[1]][-1]
})
After opening the file, check its size without reading it in, using system() and shell commands (on Windows a different approach is needed).
## open file
f <- 'iris.csv'
fl <- file(f, "r")
## rows
(nr <- as.integer(gsub(paste0('\\s', f), '', system(paste('wc -l', f), intern=TRUE))) - 1)
# [1] 150
# nr <- 150  ## alternatively define nrows manually
## columns
nm <- readLines(fl, n=1) |> r2v()
(nc <- length(nm))
# [1] 5
Next, define a chunk size by which the rows can be divided.
## define chunk size
ch_sz <- 50
stopifnot(nr %% ch_sz == 0) ## all chunks should be filled
Then, using replicate(), we calculate chunk-wise rowMeans() (because we get the chunks transposed), and finally rowMeans() again on everything to get the column means of the entire matrix.
## calculate means chunk-wise
final_mean <-
  replicate(nr / ch_sz,
            rowMeans(type.convert(r2v(readLines(fl, n=ch_sz)), as.is=TRUE))) |>
  rowMeans()
close(fl)
Let's validate the result.
## test
all.equal(final_mean, as.numeric(colMeans(iris[-5])))
# [1] TRUE
Data:
iris[-5] |>
write.csv('iris.csv')

In R distance between two sentences: Word-level comparison by minimum edit distance

While trying to learn R, I want to implement the algorithm below. Consider the two lists below:
List 1: "crashed", "red", "car"
List 2: "crashed", "blue", "bus"
I want to find out how many actions it would take to transform 'list1' into 'list2'.
As you can see I need only two actions:
1. Replace "red" with "blue".
2. Replace "car" with "bus".
But how can we find the number of actions like this automatically?
We can have several actions to transform the sentences: ADD, REMOVE, or REPLACE the words in the list.
Now, I will try my best to explain how the algorithm should work:
At the first step: I will create a table like this:
rows: i= 0,1,2,3,
columns: j = 0,1,2,3
(example: value[0,0] = 0 , value[0, 1] = 1 ...)
           crashed  red  car
        0        1    2    3
crashed 1
blue    2
bus     3
Now, I will try to fill the table. Please note that each cell in the table shows the number of actions we need to transform the sentence (ADD, REMOVE, or REPLACE).
Consider the comparison between "crashed" and "crashed" (value[1,1]): since they are the same word, we obviously don't need to change anything, so the value is 0. Basically, we take the diagonal value, value[0,0].
           crashed  red  car
        0        1    2    3
crashed 1        0
blue    2
bus     3
Now, consider "crashed" and the second part of the sentence which is "red". Since they are not the same word we can use calculate the number of changes like this :
min{value[0,1] , value[0,2] and value[1,1]} + 1
min{ 1, 2, 0} + 1 = 1
Therefore, we need to just remove "red".
So, the table will look like this:
           crashed  red  car
        0        1    2    3
crashed 1        0    1
blue    2
bus     3
And we will continue like this:
"crashed" and "car" will be:
min{value[0,3], value[0,2], value[1,2]} + 1
min{3, 2, 1} + 1 = 2
and the table will be:
           crashed  red  car
        0        1    2    3
crashed 1        0    1    2
blue    2
bus     3
And we will continue to do so. The final result will be:
           crashed  red  car
        0        1    2    3
crashed 1        0    1    2
blue    2        1    1    2
bus     3        2    2    2
As you can see the last number in the table shows the distance between two sentences: value[3,3] = 2
Basically, the algorithm should look like this:
if (characters_in_header_of_matrix[i] == characters_in_column_of_matrix[j] &
    value[i, j] == value[i+1][j-1]) then {
  get the 'DIAGONAL VALUE'   # diagonal value = value[i, j-1]
} else {
  value[i, j] = min(value[i-1, j], value[i-1, j-1], value[i, j-1]) + 1
}
endif
For finding the difference between the elements of the two lists that you can see in the header and the column of the matrix, I have used the strcmp() function, which gives a boolean value (TRUE or FALSE) when comparing the words. But I fail at implementing the rest.
I'd appreciate your help on this one, thanks.
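For reference, the procedure described above is the classic word-level edit-distance (Wagner-Fischer) recursion. A minimal base-R sketch of it (word_dist is an illustrative name, not taken from any of the answers below):

word_dist <- function(a, b) {
  # a, b: character vectors of words
  d <- matrix(0, length(a) + 1, length(b) + 1)
  d[, 1] <- 0:length(a)   # cost of removing all words of a
  d[1, ] <- 0:length(b)   # cost of adding all words of b
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,  # remove a[i]
                             d[i + 1, j] + 1,  # add b[j]
                             d[i, j] + cost)   # replace (or keep if equal)
    }
  }
  d[length(a) + 1, length(b) + 1]
}

word_dist(c("crashed", "red", "car"), c("crashed", "blue", "bus"))
# [1] 2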
The question
After some clarification in a previous post, and after the update of the post, my understanding is that Zero is asking: 'how one can iteratively count the number of word differences in two strings'.
I am unaware of any implementation in R, although I would be surprised if one doesn't already exist. I took a bit of time out to create a simple implementation, altering the algorithm slightly for simplicity (for anyone not interested, scroll down for two implementations: one in pure R, one using the smallest amount of Rcpp). The general idea of the implementation:
Initialize with string_1 and string_2 of length n_1 and n_2
Calculate the cumulative difference between the first min(n_1, n_2) elements,
Use this cumulative difference as the diagonal in the matrix
Set the first off-diagonal element to the very first element + 1
Calculate the remaining off diagonal elements as: diag(i) - diag(i-1) + full_matrix(i-1,j)
In the previous step, i iterates over diagonals and j iterates over rows/columns (either one works), and we start at the third diagonal, as the first 2x2 matrix is filled in steps 1 to 4
Calculate the remaining abs(n_1 - n_2) elements as full_matrix[,min(n_1 - n_2)] + 1:abs(n_1 - n_2), applying the latter over each value in the prior, and bind them appropriately to the full_matrix.
The output is a matrix with dimensions row and column names of the corresponding strings, which has been formatted for some easier reading.
Implementation in R
library(stringr)  # str_split() comes from the stringr package

Dist_between_strings <- function(x, y,
                                 split = " ",
                                 split_x = split, split_y = split,
                                 case_sensitive = TRUE){
  ## Safety checks
  if(!is.character(x) || !is.character(y) ||
     nchar(x) == 0 || nchar(y) == 0)
    stop("x, y need to be non-empty character strings.")
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.logical(case_sensitive))
    stop("case_sensitive needs to be logical")
  ## Extract variable names of our variables,
  ##  used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  ## Expression which, when evaluated, will name our output
  dimname_expression <-
    parse(text = paste0("dimnames(output) <- list(", make.names(x_name, unique = TRUE), " = x_names,",
                        make.names(y_name, unique = TRUE), " = y_names)"))
  ## Split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  ## Are we case sensitive?
  if(isTRUE(case_sensitive)){
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }else{
    x_split <- x_names
    y_split <- y_names
  }
  ## Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  ## If we have one string that has length 1, the output is simplified
  if(n_min == 1){
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  ## If not we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  ## The loop will fill in the off-diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1
  if(n_max > 2)
    for(i in 3:n_min){
      for(j in 1:(i - 1)){
        output[i, j] <- output[j, i] <- output[i, i] - output[i - 1, i - 1] + # are the words different?
          output[i - 1, j]                                                    # how many words were different before?
      }
    }
  ## Comparison if the lists are not of the same size
  if(nx != ny){
    ## Add the remaining words to the side that does not contain them
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[, n_min])
    ## Merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  ## Set the dimension names;
  ##  I would like the original variable names to be displayed, so I create an expression and evaluate it
  eval(dimname_expression)
  output
}
Note that the implementation is not vectorized, and as such can only take single string inputs!
Testing the implementation
To test the implementation, one could use the strings given. As they were said to be contained in lists, we will have to convert them to strings. Note that the function lets one split each string differently; however, by default it assumes space-separated strings. So first I'll show how one could achieve a conversion to the correct format:
list_1 <- list("crashed","red","car")
list_2 <- list("crashed","blue","bus")
string_1 <- paste(list_1,collapse = " ")
string_2 <- paste(list_2,collapse = " ")
Dist_between_strings(string_1, string_2)
# Output for the strings in the given example
#          string_2
# string_1  crashed blue bus
#   crashed       0    1   2
#   red           1    1   2
#   car           2    2   2
This is not exactly the table from the question, but it yields the same information, as the words are ordered as they were given in the strings.
More examples
Now, I stated it worked for other strings as well, and this is indeed the case, so let's try some user-made strings:
#More complicated strings
string_3 <- "I am not a blue whale"
string_4 <- "I am a cat"
string_5 <- "I am a beautiful flower power girl with monster wings"
string_6 <- "Hello"
Dist_between_strings(string_3, string_4, case_sensitive = TRUE)
Dist_between_strings(string_3, string_5, case_sensitive = TRUE)
Dist_between_strings(string_4, string_5, case_sensitive = TRUE)
Dist_between_strings(string_6, string_5)
Running these shows that they do yield the correct answers. Note that if either string is of size 1, the comparison is a lot faster.
Benchmarking the implementation
Now that the implementation is accepted as correct, we would like to know how well it performs (the uninterested reader can scroll past this section, to where a faster implementation is given). For this purpose, I will use much larger strings. For a complete benchmark I should test various string sizes, but here I will only use two rather large strings of size 1000 and 2500. I use the microbenchmark package in R, which contains a microbenchmark function that claims to be accurate down to nanoseconds. The function executes the code 100 (or a user-defined number of) times, returning the mean and quartiles of the run times. Due to other parts of R such as the garbage collector, the median is usually considered a good estimate of the actual average run time of the function.
The execution and results are shown below:
#Benchmarks for larger strings
set.seed(1)
string_7 <- paste(sample(LETTERS,1000,replace = TRUE), collapse = " ")
string_8 <- paste(sample(LETTERS,2500,replace = TRUE), collapse = " ")
microbenchmark::microbenchmark(String_Comparison = Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
# String_Comparison 716.5703 729.4458 816.1161 763.5452 888.1231 1106.959   100
Profiling
Now, I find these run times very slow. One use case for the implementation could be an initial check of student hand-ins for plagiarism, in which case a low difference count very likely indicates plagiarism. Hand-ins can be very long and there may be hundreds of them, so I would like the run to be very fast.
To figure out how to improve my implementation, I used the profvis package with the corresponding profvis function. To profile the function, I exported it to another R script that I sourced, running the code once prior to profiling to compile it and avoid profiling noise (important). The code to run the profiling can be seen below; the key part of the output is the profvis flame graph (not reproduced here), which shows where the time is spent.
library(profvis)
profvis(Dist_between_strings(string_7, string_8, case_sensitive = FALSE))
From the profiling output I can see a clear problem: the loop filling the off-diagonal is by far responsible for most of the runtime. Loops in R (like in Python and other interpreted languages) are notoriously slow.
Using Rcpp to improve performance
To improve the implementation, we could implement the loop in C++ using the Rcpp package. This is rather simple. The code is not unlike the one we would use in R, if we avoid iterators. A C++ script can be made in RStudio via File -> New File -> C++ File. The following C++ code would be pasted into the corresponding file and sourced using the source button.
// Rcpp Code
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix Cpp_String_difference_outer_diag(NumericMatrix output){
  long nrow = output.nrow();
  for(long i = 2; i < nrow; i++){   // note the zero-based indexing in C++
    for(long j = 0; j < i; j++){
      output(i, j) = output(i, i) - output(i - 1, i - 1) + // are the words different?
        output(i - 1, j);
      output(j, i) = output(i, j);
    }
  }
  return output;
}
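Alternatively, the file can be compiled and loaded from the console with Rcpp::sourceCpp(); the file name below is hypothetical, an assumption about how the script was saved:

Rcpp::sourceCpp("Cpp_String_difference_outer_diag.cpp") # hypothetical file name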
The corresponding R function needs to be altered to use this function instead of looping. The code is similar to the first function, only switching the loop for a call to the C++ function.
Dist_between_strings_cpp <- function(x, y,
                                     split = " ",
                                     split_x = split, split_y = split,
                                     case_sensitive = TRUE){
  ## Safety checks
  if(!is.character(x) || !is.character(y) ||
     nchar(x) == 0 || nchar(y) == 0)
    stop("x, y need to be non-empty character strings.")
  if(length(x) != 1 || length(y) != 1)
    stop("Currently the function is not vectorized, please provide the strings individually or use lapply.")
  if(!is.logical(case_sensitive))
    stop("case_sensitive needs to be logical")
  ## Extract variable names of our variables,
  ##  used for the dimension names later on
  x_name <- deparse(substitute(x))
  y_name <- deparse(substitute(y))
  ## Expression which, when evaluated, will name our output
  dimname_expression <-
    parse(text = paste0("dimnames(output) <- list(", make.names(x_name, unique = TRUE), " = x_names,",
                        make.names(y_name, unique = TRUE), " = y_names)"))
  ## Split the strings into words
  x_names <- str_split(x, split_x, simplify = TRUE)
  y_names <- str_split(y, split_y, simplify = TRUE)
  ## Are we case sensitive?
  if(isTRUE(case_sensitive)){
    x_split <- str_split(tolower(x), split_x, simplify = TRUE)
    y_split <- str_split(tolower(y), split_y, simplify = TRUE)
  }else{
    x_split <- x_names
    y_split <- y_names
  }
  ## Create an index in case the two are of different length
  idx <- seq(1, (n_min <- min((nx <- length(x_split)),
                              (ny <- length(y_split)))))
  n_max <- max(nx, ny)
  ## If we have one string that has length 1, the output is simplified
  if(n_min == 1){
    distances <- seq(1, n_max) - (x_split[idx] == y_split[idx])
    output <- matrix(distances, nrow = nx)
    eval(dimname_expression)
    return(output)
  }
  ## If not we will have to do a bit of work
  output <- diag(cumsum(ifelse(x_split[idx] == y_split[idx], 0, 1)))
  ## Fill in the off-diagonal
  output[2, 1] <- output[1, 2] <- output[1, 1] + 1
  if(n_max > 2)
    output <- Cpp_String_difference_outer_diag(output) # execute the C++ code
  ## Comparison if the lists are not of the same size
  if(nx != ny){
    ## Add the remaining words to the side that does not contain them
    additional_words <- seq(1, n_max - n_min)
    additional_words <- sapply(additional_words, function(x) x + output[, n_min])
    ## Merge the additional words
    if(nx > ny)
      output <- rbind(output, t(additional_words))
    else
      output <- cbind(output, additional_words)
  }
  ## Set the dimension names;
  ##  I would like the original variable names to be displayed, so I create an expression and evaluate it
  eval(dimname_expression)
  output
}
Testing the C++ implementation
To be sure the implementation is correct, we check whether the same output is obtained with the C++ implementation.
# Test the cpp implementation
identical(Dist_between_strings(string_3, string_4, case_sensitive = TRUE),
          Dist_between_strings_cpp(string_3, string_4, case_sensitive = TRUE))
# TRUE
Final benchmarks
Now is this actually faster? To see this we could run another benchmark using the microbenchmark package. The code and results are shown below:
#Final microbenchmarking
microbenchmark::microbenchmark(
  R    = Dist_between_strings(string_7, string_8, case_sensitive = FALSE),
  Rcpp = Dist_between_strings_cpp(string_7, string_8, case_sensitive = FALSE))
# Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
# R    721.71899  753.6992 850.21045 787.26555 907.06919 1756.7574   100
# Rcpp  23.90164   32.9145  54.37215  37.28216  47.88256  243.6572   100
From the microbenchmark we get a median improvement factor of roughly 21 (= 787 / 37), which is a massive improvement from moving just a single loop to C++!
There is already an edit-distance function in R we can take advantage of: adist().
As it works on the character level, we'll have to assign a character to each unique word in our sentences, and stitch them together to form pseudo-words we can calculate the distance between.
s1 <- c("crashed", "red", "car")
s2 <- c("crashed", "blue", "bus")
ll <- list(s1, s2)
alnum <- c(letters, LETTERS, 0:9)
ll2 <- relist(alnum[factor(unlist(ll))], ll)
ll2 <- sapply(ll2, paste, collapse="")
adist(ll2)
#      [,1] [,2]
# [1,]    0    2
# [2,]    2    0
The main limitation here, as far as I can tell, is the number of unique characters available, which in this case is 62, but this can be extended quite easily, depending on your locale. E.g.: intToUtf8(c(32:126, 161:300), TRUE).
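A minimal sketch of swapping in that larger character pool (reusing ll from above; chars is just an illustrative name, not from the original answer):

chars <- intToUtf8(c(32:126, 161:300), multiple = TRUE) # 235 single characters
ll2 <- relist(chars[factor(unlist(ll))], ll)
ll2 <- sapply(ll2, paste, collapse = "")
adist(ll2)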

Save loop results as csv table

I have a simple loop that generates a value at each step, and I want to save all the results as a single table. The problem is that each step overwrites the previous one.
for(i in 1:5){
  x = 3*i
  print(c(i, x))
}
This gives
[1] 1 3
[1] 2 6
[1] 3 9
[1] 4 12
[1] 5 15
Now I create a matrix that I will then save as a csv file, but it only shows the final step of the loop.
results = matrix(c(i, x), ncol = 2)
#      [,1] [,2]
# [1,]    5   15
write.table(results, file = "Results.csv", col.names = NA, append = T)
How to show the entire list of results? Thanks in advance!
(ps.- I know that a similar question has been posted previously, e.g. Write output of R loop to file, but the problem was quite specific and I did not manage to adapt the answers to my case).
Your loop only prints the results to the console. The matrix you're creating only uses the single (and last) value of i and x. There are many ways to do it, but if you really want to write a matrix, then you need to store the intermediate results of all iterations somewhere before exporting. You can try something like:
results <- matrix(NA, nrow = 5, ncol = 2)
for(i in 1:5){
  results[i, ] <- c(i, 3*i)
}
write.table(results, file = "Results.csv", col.names = NA, append = T)
And by the way you don't really need a loop here:
i <- 1:5
m <- matrix(c(i, 3*i), nrow=5)
would do the job.
You can usually use sapply instead of for-loops:
results <- t(sapply(1:5, function(x) c(x, 3*x)))
write.table(results, file="Results.csv", col.names=NA, append=T)
Assuming you really want/need a for-loop
1) You store all the results in a matrix and then write the whole matrix to file:
n = 5
results = matrix(NA, ncol = 2, nrow = n)
for(i in 1:n) {
  x = 3*i               # compute the value for this iteration
  results[i, ] = c(i, x)
}
write.table(results, file = "Results.csv", col.names = NA, append = T)
This is a "good" solution if you don't have many results and you want to access the file just once.
2) You store only the current result in a matrix and write it to file at each iteration:
n = 5
for(i in 1:n) {
  x = 3*i               # compute the value for this iteration
  results = matrix(c(i, x), ncol = 2)
  write.table(results, file = "Results.csv", col.names = NA, append = T)
}
This is a "good" solution if you have a lot of data and memory limits. It may be slower than the previous one because you open the file many times...
To append using a matrix you could use:
exampleMatrix <- matrix(ncol = 2)
for(i in 1:5){
  x = 3*i
  if(i == 1){
    exampleMatrix <- rbind(c(x, i))
  }else{
    exampleMatrix <- rbind(exampleMatrix, c(x, i))
  }
}
To append to a dataframe using a loop you could use the following:
exampleDF <- data.frame()
for(i in 1:5){
  x = 3*i
  exampleDF <- rbind(exampleDF, c(x, i))
}
write.csv(exampleDF, "C:\\path")
So when you want to store your values while using a loop, it's important to index. Below, I created some code where a (the iteration) and x (the value 3 * i) are each stored inside a vector.
After the loop has finished, I combine the two vectors into one data frame with the cbind() function.
a <- vector()
x <- vector()
for(i in 1:5){
  a[i] = i
  x[i] = 3*i
}
df <- as.data.frame(cbind(a, x))
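To then save it as a csv (the original goal), the data frame can be written out; a minimal sketch using the same file name as above:

write.csv(df, "Results.csv", row.names = FALSE) # write the collected results once, after the loop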
There are other ways to do this without loops. Once you start raising the number of iterations, or doing nested loops, the processing time starts to get really high. Other options are in the apply family of functions.
Hope this helped!

Optimising a calculation on every cumulative subset of a vector in R

I have a collection of DNA sequencing reads of various lengths, sorted from longest to shortest. I would like to know the largest number of reads I can include in a set such that the N50 of that set is above some threshold t
For any given set of reads, the total amount of data is just the cumulative sum of the lengths of the reads. The N50 is defined as the length of the read such that half of the data are contained in reads at least that long.
I have a solution below, but it is slow for very large read sets. I tried vectorising it, but this was slower (probably because my threshold is usually relatively large, such that my solution below stops calculating fairly early on).
Here's a worked example:
df = data.frame(l = 100:1) # read lengths
df$cs = cumsum(df$l) # getting the cumulative sum is easy and quick
t = 95 # let's imagine that this is my threshold N50
for(i in 1:nrow(df)){
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if(N50 < t){ break }
}
# the loop will have gone one too far, so I subtract one
number.of.reads = as.integer(i-1)
This works fine on small datasets, but my actual data are more like 5m reads varying in length from ~200,000 down to 1 (longer reads are rarer), and I'm interested in an N50 of 100,000, so it gets pretty slow.
This example is closer to something that's realistic. It takes ~15s on my desktop.
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i-1)
So, I'm interested in any ideas, tips, or tricks to noticeably optimise this. It seems like this should be possible, but I'm out of ideas.
As n is decreasing with i, you should use a binary search algorithm.
binSearch <- function(min, max) {
  print(mid <- floor(mean(c(min, max))))
  if (mid == min) {
    if (df$l[min(which(df$cs > df$cs[min]/2))] < t) {
      return(min - 1)
    } else {
      return(max - 1)
    }
  }
  n = df$l[min(which(df$cs > df$cs[mid]/2))]
  if (n >= t) {
    return(binSearch(mid, max))
  } else {
    return(binSearch(min, mid))
  }
}
Then, just call
binSearch(1, nrow(df))
Since your data are ordered by read length, maybe you could avoid testing every single row. Instead, you can iterate and test a limited number of rows (reasonably spaced) at each iteration (using while(), for example), getting progressively closer to your solution. This should make things much faster. Just make sure that once you get close to the solution, you stop iterating.
This is your solution
set.seed(111)
l = ceiling(runif(100000, min = 0, max = 19999))
l = sort(l, decreasing = T)
df = data.frame(l = l)
df$cs = cumsum(df$l)
t = 18000
for(i in 1:nrow(df)){
  n = df$l[min(which(df$cs > df$cs[i]/2))]
  if(n < t){ break }
}
result = as.integer(i-1)
result
# 21216, in ~29 seconds
Instead of testing every row, let's set a range
i1 <- 1
i2 <- nrow(df)
i.range <- as.integer(seq(i1, i2, length.out = 10))
Now, test only these 10 rows. Get the closest one and "focus in" by re-defining the range. Stop when you cannot increase granularity.
while(sum(duplicated(i.range)) == 0){
  for(i in 1:length(i.range)){
    N50 = df$l[min(which(df$cs > df$cs[i.range[i]]/2))]
    if(N50 < t){ break }
  }
  # update i1 and i2
  i1 <- i.range[(i - 1)]
  i2 <- i.range[i]
  i.range <- as.integer(seq(i1, i2, length.out = 10))
}

i.range <- seq(i1, i2, by = 1)
for(i in i.range){
  N50 = df$l[min(which(df$cs > df$cs[i]/2))]
  if(N50 < t){ break }
}
result <- as.integer(i - 1)
result
# 21216, in ~0.06 seconds
Same result in a fraction of the time.
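For completeness, here is a fully vectorised sketch, not taken from either answer above, that exploits the fact that df$cs is strictly increasing. Assuming the same df and t, it should match the loop result without any explicit search:

# For each prefix i, index of the first read covering half of that prefix's data
idx <- findInterval(df$cs / 2, df$cs) + 1
N50 <- df$l[idx]
# Number of reads before the N50 first drops below the threshold
# (assumes the threshold is eventually crossed; otherwise which() is empty)
result <- min(which(N50 < t)) - 1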

Indexing takes long time with for loop?

I am running this for loop without any problems, but it takes a long time. I guess it could be faster with the apply family, but I am not sure how. Any hints?
set.seed(1)
nrows <- 1200
ncols <- 1000
outmat <- matrix(NA, nrows, ncols)
dat <- matrix(5, nrows, ncols)
for (nc in 1:ncols){
  for(nr in 1:nrows){
    val <- dat[nr, nc]
    if(!is.na(val)){
      file <- readBin(dir2[val], numeric(), size = 4, n = 1200*1000)
      # my real data where dir2 is a list of files
      # "dir2 <- list.files("/data/dir2", "*.dat", full.names = TRUE)"
      file <- matrix((data = file), ncol = 1000, nrow = 1200) # my real data
      outmat[nr, nc] <- file[nr, nc]
    }
  }
}
Two solutions.
The first uses more memory, but is more efficient and I guess feasible if you have 24 files, as you stated. You read all the files at once, then subset according to dat. Something like:
allContents <- do.call(cbind, lapply(dir2, readBin, n = nrows*ncols, size = 4, "numeric"))
res <- matrix(allContents[cbind(1:length(dat), c(dat + 1))], nrows, ncols)
The second one can handle a slightly bigger number of files (say 50-100). It reads chunks of each file and subsets accordingly. You have to open as many connections as the number of files you have. For instance:
outmat <- matrix(NA, nrows, ncols)
connections <- lapply(dir2, file, open = "rb")
for (i in 1:ncols) {
  values <- vapply(connections, readBin, what = "numeric", n = nrows, size = 4, numeric(nrows))
  outmat[, i] <- values[cbind(seq_len(nrows), dat[, i] + 1)]
}
The +1 after dat is due to the fact that, as you stated in the comments, the values in dat range from 0 to 23 and R indexing is 1-based.
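One small addition, not in the original answer: the connections opened above should be closed once the loop is done, for example:

invisible(lapply(connections, close)) # close every file connection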
