Perform Arithmetic on Two Adjacent Rows WITHOUT a For Loop in R

Let's assume I have a 200 x 10 matrix of log files (in reality it will be something like 16 million x 10) and one of the columns holds a timestamp. I need to calculate the time difference between row[n] and row[n+1]. Using a for loop would make my program run incredibly slowly; is there any other way to solve this problem?
Many Thanks,
Alex
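One vectorized approach is to subtract the timestamp column from a shifted copy of itself, or to use diff(). A minimal sketch, assuming the data sits in an object called logs with a column named timestamp (both names are stand-ins) and that the values parse with as.POSIXct:
ts <- as.POSIXct(logs[, "timestamp"])   # "logs" and "timestamp" are placeholder names
gaps <- diff(ts)                        # vector of length nrow(logs) - 1
# equivalently, without diff():
gaps <- ts[-1] - ts[-length(ts)]
Both forms are fully vectorized, so they avoid the per-row loop entirely, even at 16 million rows.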

Related

Am I using the most efficient (or right) R instructions?

First question; I'll try to go straight to the point.
I'm currently working with tables and I've chosen R because it imposes no hard limit on dataframe size and can perform many operations over the data within the tables. I am happy with that, as I can manipulate the data at will: merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop, at roughly 0.00001 sec per instruction, over a 6-million-row table, and it took over an hour.
Maybe the R approach was wrong to begin with. I've tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)), but as far as I can tell this is not something you can optimize with an algorithmic structure like graphs or heaps (it's just a table; you have to iterate through it all). So I was wondering whether there are other instructions or basic ways of working with tables that I don't know about (assign, extract, ...) that take less time, or some configuration of RStudio that would improve performance.
This is the loop, just so if it helps to understand the question:
library(dplyr)  # for %>% and pull()
my_list <- vector("list", nrow(table[, "Date_of_count"]))
for (i in 1:nrow(table[, "Date_of_count"])) {
  my_list[[i]] <- format(as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1),
                                             "%Y-%m-%d")),
                         format = "%Y-%m-%d")
}
The table, as mentioned, has over 6 million rows and 25 variables. Once the list is filled, I want to append it to the table as a column.
Please let me know if this lacks specificity or detail, or if it just does not belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a dataframe and followed the keys above.
df <- as.data.frame(table)
In this case, doing this converted the dates directly to character, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
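For reference, a minimal sketch of what a fully vectorized version of the original loop could look like, assuming df is the dataframe from the conversion above and Date_of_count holds "%Y-%m-%d" strings (the new column name is made up):
# one vectorized call over the whole column instead of 6 million iterations
df$Date_formatted <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"),
                            format = "%Y-%m-%d")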

R Parallel Processing - Node Choice

I am attempting to process a large amount of data in R on Windows using the parallel package on a computer with 8 cores. I have a large data.frame that I need to process row-by-row. For each row, I can estimate how long it will take for that row to be processed and this can vary wildly from 10 seconds to 4 hours per row.
I don't want to run the entire program at once under the clusterApplyLB function (I know this is probably the optimal method) because if it hits an error, my entire set of results might be lost. My first attempt involved breaking the program up into blocks, running each block in parallel, saving the output from that parallel run, and then moving on to the next block.
The problem is that as it ran through the rows, rather than running at 7x "real" time (I have 8 cores, but I wanted to keep one spare), it only seemed to be running at about 2x. My guess is that the allocation of rows to each core was inefficient.
For example, take four rows of data and 2 cores: two of the rows could take 4 hours each and the other two 10 seconds each. Allocated optimally (one long and one short row per core), this takes 4 hours and 10 seconds to run, but allocated inefficiently (both long rows on the same core), it could take 8 hours. (Obviously this is an exaggeration, but a similar situation can happen when estimates are incorrect with more cores and more rows.)
If I estimate these times and submit the jobs to clusterApplyLB in what I estimate to be the correct order (spreading the estimated times across cores to minimize the total time taken), they might not be sent to the cores I intend, because they might not finish in the times I estimate. For example, if I estimate two processes at 10 minutes and 12 minutes and they actually take 11.6 minutes and 11.4 minutes, then the order in which rows are submitted to clusterApplyLB won't be what I anticipated. This kind of error might seem small, but if I have optimized the placement of multiple long rows, the mix-up of order could send two 4-hour rows to the same node rather than to different nodes (which could almost double my total time).
TL;DR. My question: Is there a way to tell an R parallel processing function (e.g. clusterApplyLB, clusterApply, parApply, or any sapply, lapply or foreach variant) which rows should be sent to which core/node? Even apart from the situation I find myself in, I think this would be a very useful thing to be able to control.
I would say there are two possible solution approaches to your problem.
The first one is a static optimization of the job-to-node mapping according to the expected per-job computation time. You would assign each job (i.e., row of your dataframe) a node before starting the calculation. Code for a possible implementation of this is given below.
The second solution is dynamic and you would have to make your own load balancer based on the code given in clusterApplyLB. You would start out the same as in the first approach, but as soon as a job is done, you would have to recalculate the optimal job-to-node mapping. Depending on your problem, this may add significant overhead due to the constant re-optimization that takes place. I think that as long as you do not have a bias in your expected computation times, it's not necessary to go this way.
Here is the code for the first solution approach:
library(parallel)
# set seed for a reproducible example
set.seed(1234)
# let's say you have 100 calculations (i.e., rows),
# each taking between 0 and 1 second of computation time
expected_job_length <- runif(100)
# this is your data;
# real_job_length is unknown, but we use it in the mock-up function below
df <- data.frame(job_id = seq_along(expected_job_length),
                 expected_job_length = expected_job_length,
                 # real_job_length = expected_job_length + some noise
                 real_job_length = expected_job_length +
                   runif(length(expected_job_length), -0.05, 0.05))
# we might have a negative real_job_length; fix that
df <- within(df, real_job_length[real_job_length < 0] <-
               real_job_length[real_job_length < 0] + 0.05)
# detectCores() gives 4 in my case
cluster_size <- 4
Prepare the job-to-node mapping optimization:
# x will give the node_id (between 1 and cluster_size) for each job
total_time <- function(x, expected_job_length) {
  # in the calculation below, x is a vector of reals;
  # we have to translate it into integers in order to use it as an index vector
  x <- as.integer(round(x))
  # return the max over nodes of the node-binned expected job lengths
  max(sapply(split(expected_job_length, x), sum))
}
# now optimize the distribution of jobs amongst the nodes;
# a genetic algorithm might be better for the optimization,
# but Differential Evolution is good for now
library(DEoptim)
# pick a large differential weighting factor (F) ...
# ... to get out of local minima caused by the rounding
res <- DEoptim(fn = total_time,
               lower = rep(1, nrow(df)),
               upper = rep(cluster_size, nrow(df)),
               expected_job_length = expected_job_length,
               control = DEoptim.control(CR = 0.85, F = 1.5, trace = FALSE))
# wait for a minute or two ...
# inspect the optimal solution
time_per_node <- sapply(split(expected_job_length,
                              unname(round(res$optim$bestmem))), sum)
time_per_node
#        1        2        3        4
# 10.91765 10.94893 10.94069 10.94246
plot(time_per_node, ylim = c(0, 15))
abline(h = max(time_per_node), lty = 2)
# add the node mapping to df
df$node_id <- unname(round(res$optim$bestmem))
Now it's time for the calculation on the cluster:
# start the cluster
workers <- parallel::makeCluster(cluster_size)
start_time <- Sys.time()
# distribute the jobs according to the optimal node mapping
clusterApply(workers, split(df, df$node_id), function(x) {
  for (i in seq_along(x$job_id)) {
    # use tryCatch to do the error handling for jobs that fail
    tryCatch({ Sys.sleep(x[i, "real_job_length"]) },
             error = function(err) { print("Do your error handling") })
  }
})
end_time <- Sys.time()
# how long did it take?
end_time - start_time
# Time difference of 11.12532 secs
# add to the plot
abline(h = as.numeric(end_time - start_time), col = "red", lty = 2)
stopCluster(workers)
Based on your input, it seems you are already saving the output of a task within that task.
Assuming each parallel task saves its output as a file, you probably need an initial function that predicts the processing time for a particular row.
In order to do that:
generate a structure with the estimated time and row number
sort by the estimated time, reorder the rows accordingly, and run the parallel process over the reordered rows (see the sketch below)
This would automatically balance the workload.
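A minimal sketch of that idea, submitting the longest jobs first to clusterApplyLB; estimate_time() and process_row() are hypothetical stand-ins for your own functions:
library(parallel)
est <- sapply(seq_len(nrow(df)), function(i) estimate_time(df[i, ]))
ord <- order(est, decreasing = TRUE)        # longest expected jobs first
cl <- makeCluster(7)                        # keep one of the 8 cores free
clusterExport(cl, c("df", "process_row"))
res <- clusterApplyLB(cl, ord, function(i) {
  out <- process_row(df[i, ])
  saveRDS(out, sprintf("row_%06d.rds", i))  # save per row so a crash cannot lose everything
  out
})
stopCluster(cl)
Submitting in decreasing order of expected length means the long jobs are handed out while all workers are still idle, so a long job can never end up queued behind another long job at the end of the run.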
We had a similar problem where the processing had to be done column-wise and each column took 10-200 seconds. So we wrote a function to estimate the time, reordered the columns based on it, and ran the parallel process over the reordered columns.

Reduce computation time

Most of the data sets that I have worked with have generally been of moderate size (mostly fewer than 100k rows), so my code's execution time has usually not been a big problem for me.
But I was recently trying to write a function that takes two dataframes as arguments (with, say, m and n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So even with small values of m and n (say around 1000 each), the resulting dataframe has more than a million rows.
When I try even simple operations on this dataset, the code takes intolerably long to run. Specifically, my resulting dataframe has 2 numeric columns, and I need to add a new column that compares their values and categorizes each row as "Greater than", "less than", or "Tied".
I am using the following code:
library(dplyr)
df %>% mutate(compare = ifelse(var1 == var2, "Tied",
                               ifelse(var1 > var2, "Greater than", "less than")))
And, as I mentioned before, this takes forever to run. I did some research and found that operations on a data.table are apparently significantly faster than on a dataframe, so maybe that's one option I can try.
But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
What other options do you think I can try?
Thanks!
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.
library(doParallel)
library(foreach)
registerDoParallel()  # you could specify the number of cores to use here; see the documentation
df$compare <- foreach(m = df$var1, n = df$var2, .combine = 'c') %dopar% {
  # borrowing from @nicola in the comments because it's a good solution
  c('Less Than', 'Tied', 'Greater Than')[sign(m - n) + 2]
}
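For comparison, a minimal non-parallel sketch using the same sign() trick, in base R and in data.table (which the question mentions); for a comparison this simple, vectorization alone is usually much faster than per-row parallel dispatch:
# base R, fully vectorized
df$compare <- c("less than", "Tied", "Greater than")[sign(df$var1 - df$var2) + 2]
# data.table equivalent, modifying the table by reference
library(data.table)
dt <- as.data.table(df)
dt[, compare := c("less than", "Tied", "Greater than")[sign(var1 - var2) + 2L]]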

Correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to be done: I just need to calculate the correlation of the columns (or of the rows, if transposed). Simple enough? Yet I have been unable to get the results for a whole week, and I have looked through most of the solutions here.
My laptop has 4GB of RAM. I do have access to a server with 32 nodes. My data cannot be posted here as it is huge (411k columns and 100 rows). If you need any other information, or maybe part of the data, I can try to put it up, but the problem can be explained without seeing the data. I simply need a correlation matrix of size 411k x 411k, which means computing the correlations among the 411k columns of my data (the rows, once transposed).
Approaches I have tried to code (all of them either give me memory issues or run forever):
The simplest way: one row against all the others, writing each result out with append=T. (Runs forever.)
bigcorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using an ff matrix. (I am unable to allocate the memory to create the corMAT matrix using ff() on my server.)
Splitting the data into sets (every 10000 consecutive rows form a set) and correlating each set against the others (the same logic as bigcorPar), but I cannot find a way to store all the pieces together to build the final 411k x 411k matrix.
I am attempting this now: bigcorPar.r on 10000 rows against 411k (so the 10000 are divided into blocks), saving the results in separate csv files.
I am also attempting to run every 1000 vs 411k on one node of my server; today is my 3rd day and I am still on row 71.
I am not an R pro, so I could only attempt this much. Either my code runs forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in binary format (economical compared to text) they would take 411,000 x 411,000 x 8 bytes ≈ 1.35 TB. If this is what you want and you are OK with the storage required, I can provide my code for such calculations and storage.
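For illustration, a minimal sketch of the block-wise idea from point 3 of the question: compute cor() block against block and write each block to its own file, so no single object ever approaches the full 411k x 411k size. The data here is demo-sized, and the block size and file names are arbitrary choices:
set.seed(1)
x <- matrix(rnorm(100 * 3000), nrow = 100)   # stand-in for the 100-row x 411k-column data
block_size <- 1000
starts <- seq(1, ncol(x), by = block_size)
for (i in starts) {
  for (j in starts[starts >= i]) {           # the correlation matrix is symmetric; skip j < i
    bi <- i:min(i + block_size - 1, ncol(x))
    bj <- j:min(j + block_size - 1, ncol(x))
    cm <- cor(x[, bi], x[, bj])              # one block of the big matrix
    saveRDS(cm, sprintf("cor_block_%d_%d.rds", i, j))
  }
}
Memory use stays bounded by the block size regardless of the number of columns, and the saved blocks can later be read back individually instead of materializing the whole matrix.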

Ordered Permutations

I am looking to generate ordered permutations for large numbers, e.g. 37P10 (permutations of 37 elements taken 10 at a time). I am using the permn() function from the combinat package for this, but it does not work for more than 10 numbers, and it cannot generate permutations of a different size than the input, as in the example above.
Further, I am combining these permutations into a matrix using do.call(rbind, ...). Is there any other package in R that may be used for this purpose, please?
What you've asked for simply cannot be done. You're asking to generate and store about 1.26e15 (or 4.81e15 with replacement) permutations of 10 numbers. Even if each number were only one byte, you would need over 12 million GB of RAM.
In my LSPM package, I use the function LSPM:::.nPri to generate a specific permutation based on its lexically ordered index. There's no way you will be able to iterate over every permutation in a reasonable amount of time, so I would suggest that you take a sample of all possible permutations.
Note that .nPri will not work for nPr(37,10) due to precision issues with such a large number, but it should serve as a good starting point.
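A minimal sketch of the sampling suggestion in base R (the sample size is arbitrary):
set.seed(42)
n_samples <- 1e5
# each call to sample(37, 10) draws one ordered 10-permutation of 1:37
perms <- t(replicate(n_samples, sample(37, 10)))
dim(perms)   # 100000 x 10, one sampled permutation per row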
It is nearly impossible to generate that many permutations on a normal computer.
A quick calculation shows that 37P10 is 1,264,020,397,516,800. Just to store that many 64-bit integers, you would need 1,264,020,397,516,800 x 64 bits, which is 8.09×10^7 Gb (gigabits), or roughly 10^7 gigabytes. Storing the actual permutation information would need even more memory, whether in RAM or on disk.
I think the best strategy would be to write a function that creates the ordered permutations sequentially, and to do your analysis iteratively, without ever generating all possible permutations.
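A minimal sketch of such a sequential strategy: an unranking function (nth_perm is a made-up name) that returns the index-th r-permutation of 1:n in lexicographic order, so permutations can be visited one at a time, or sampled by index, without ever storing them all. The arithmetic stays exact here because 37P10 ≈ 1.26e15 is below the 2^53 integer limit of doubles:
nth_perm <- function(n, r, index) {
  pool <- seq_len(n)                  # elements not yet used
  res <- integer(r)
  k <- index - 1                      # 0-based rank
  for (i in seq_len(r)) {
    m <- length(pool)                 # elements still available
    s <- r - i + 1                    # slots still to fill
    # permutations sharing a fixed element in this slot:
    # (m-1)*(m-2)*...*(m-s+1), i.e. (m-1)!/(m-s)!
    block <- if (s > 1) prod((m - s + 1):(m - 1)) else 1
    j <- k %/% block                  # 0-based choice of pool element
    res[i] <- pool[j + 1]
    pool <- pool[-(j + 1)]
    k <- k %% block
  }
  res
}
nth_perm(37, 10, 1)                   # 1 2 3 ... 10, the first permutation
nth_perm(37, 10, 1264020397516800)    # 37 36 35 ... 28, the last one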
