I have some moderate-sized raster files (max size ~190 MB) for which I would like to calculate grid distances using raster::gridDistance().
I'm finding that the operation is slow and/or R crashes for the largest of my files. Please note: I'm not seeking memory management advice (e.g. maxing out memory.limit(), breaking into smaller rasters or pursuing parallel processing methods), as these would sidestep my issue. If grid distances should not be attempted for files of 190+ MB, then I will just break the job into more manageable pieces.
The raster::gridDistance() documentation mentions that I can try to solve "errors in the case of complex objects spread over different chunks... by varying the chunk size, see function setOptions()" and that "Additional distance measures and options (directions, cost-distance) are available in the {gdistance} package", but I have been hesitant to pursue these without better understanding the limitations/considerations.
Thanks to the question "R - terra::distance() equivalent of raster::gridDistance(..., origin = x, omit = y)" I understand that there is an alternative method using terra::gridDistance(), but I am not able to discern whether that operation is any more efficient or suitable for my needs than raster::gridDistance().
I haven't posted a reprex or session info as my question is really as follows:
Is terra::gridDistance() (or some other alternative, like those offered by {gdistance}) really a more efficient (faster) or more customizable way of calculating grid distances for moderate-to-large raster files?
If not, what are the considerations for changing how the grid distance is calculated (by varying chunk size or other means) using raster::gridDistance() and setOptions()?
If there is interest, I can reformat my question so that it better fits guidelines with a reprex etc. Also, I am posting the question here rather than Geographic Information Systems because the original linked question was posted here.
I understand that there is an alternative method using terra::gridDistance(), but I am not able to discern if the operation is any more efficient or suitable for my needs
Well, did you try it? That could have been more efficient than writing a long question.
The help file does not mention the limitations that raster::gridDistance() has, so you should be good to go. But note that the method was renamed to terra::gridDist().
The "terra" package is the replacement for the "raster" package, so "terra" is the best starting point more generally, I think.
I've been developing an R package for single cell RNA-seq analysis, and one of the functions I used repeatedly calculates the cosine dissimilarity matrix for a given matrix of m cells by n genes. The function I wrote is as follows:
CosineDist <- function(input = NULL) {
  if (is.null(input)) { stop("You forgot to provide an input matrix") }
  # cosine similarity is (input %*% t(input)) scaled by the outer product of
  # the row norms; the dissimilarity is 1 minus that, coerced to a "dist" object
  dist_mat <- as.dist(1 - input %*% t(input) / (sqrt(rowSums(input^2) %*% t(rowSums(input^2)))))
  return(dist_mat)
}
This code works fine for smaller datasets, but when I run it on anything over 20,000 rows it takes forever and then crashes my R session due to memory issues. I believe that porting this to Rcpp would make it both faster and more memory efficient (I know this is a bit of a naive belief, but my knowledge of C++ in general is limited). Finally, the output of the function, though it does not have to be a distance matrix object when returned, does need to be able to be converted to that format after its generation.
How should I go about converting this function to Rcpp and then calling it as I would any of the other functions in my package? Alternatively, is this the best way to go about solving the speed/memory problem?
Hard to help you here, since, as the comments pointed out, you are basically asking for an Rcpp introduction.
I'll try to give you some hints, which I already mentioned partly in the comments.
In general, using C/C++ can provide a great speedup (dependent on the task, of course). For loop-intensive, unoptimized code I have reached speedups of 100x and more.
Since adding C++ can be complicated and sometimes cause problems, before you go this way check the following:
1. Is your R code optimized?
You can make a lot of bad choices here (e.g. loops are slow in R). Just by optimizing your R code, speedups of 10x or much more can often be reached easily.
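For example, a hedged sketch of how the function above could be rewritten without materializing the matrix of row-norm products (CosineDistFast is a name I am introducing here, not anything from your package): normalize each row to unit length once, then a single tcrossprod() gives the cosine similarity matrix directly.
CosineDistFast <- function(input) {
  # scale every row to unit norm, so the cross product is the cosine similarity
  normalized <- input / sqrt(rowSums(input^2))
  # tcrossprod(X) computes X %*% t(X) in one optimized call
  as.dist(1 - tcrossprod(normalized))
}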
2. Are there better implementations in other packages?
Especially for helper functions or common functionality, other packages often have these already implemented. Benchmark different existing solutions with the 'microbenchmark' package. It is easier to just use an optimized function from another R package than to do everything on your own (maybe the other package's implementation is already in C++). I mostly try to look for mainstream and popular packages, since these are better tested and unlikely to suddenly drop from CRAN.
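A sketch of such a benchmark (assuming the CosineDist() from the question and the CosineDistFast() sketch above; the matrix dimensions are arbitrary toy values):
library(microbenchmark)
m <- matrix(rnorm(2000 * 100), nrow = 2000)   # toy 2000 cells x 100 genes
microbenchmark(
  naive      = CosineDist(m),
  vectorized = CosineDistFast(m),
  times = 10
)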
3. Profile your code
Take a look at which parts exactly cause the speed/memory problems. It might be that you can keep most of the code in R and only write a C++ function for the critical parts. Or you may find another package whose R function is implemented in C for exactly this critical part.
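A minimal base-R profiling sketch (the file name is arbitrary; profvis is an interactive alternative):
Rprof("cosine_profile.out", memory.profiling = TRUE)   # start profiling
d <- CosineDist(m)
Rprof(NULL)                                            # stop profiling
summaryRprof("cosine_profile.out")                     # where is the time spent?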
In the end I'd say I prefer Rcpp/C++ over plain C code; I think it is the easier way to go. For the Rcpp learning part you will have to go through a dedicated tutorial (and not an SO question).
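Just to give a flavour, here is a rough, untested sketch of the same computation via RcppArmadillo (cosine_dissim is a name I am making up; note it still materializes the full n x n matrix, so it does not remove the memory bottleneck):
library(Rcpp)
cppFunction('
arma::mat cosine_dissim(const arma::mat& X) {
  // row norms, then 1 - (X * X^T) / (norms * norms^T), element-wise division
  arma::vec norms = arma::sqrt(arma::sum(arma::square(X), 1));
  return 1 - (X * X.t()) / (norms * norms.t());
}
', depends = "RcppArmadillo")
d <- as.dist(cosine_dissim(m))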
I am dealing with an optimisation issue, which I classified as a combinatorial problem. Now, I know this is a 2D variant of the knapsack problem, but please bear with me:
If I have an area modeled as a grid of equal-sized cells, how do I place a certain number of square objects of different sizes on this grid, given that every object has a cost and a benefit and the goal is an arrangement of objects with the maximum benefit/cost ratio:
Object 1: 1x1 square, cost = 800, value = 2478336
Object 2: 2x2 square, cost = 2000, value = 7565257
Object 3: 3x3 square, cost = 3150, value = 14363679
Object 3 has the best value/cost ratio, so I guess the approach would be a greedy one: first place as many of the bigger squares as possible. Still, there are many optimal solutions depending on the size of the area.
Also, the square objects cannot overlap.
I am using R for this, and the package adagio has algorithms for the single and multiple knapsack, but not for a 2D knapsack problem. Because I am very new to optimization and programming, I am not sure whether there is a way of solving this problem in R. Can someone please help?
Thanks!
Firstly, I'm not an expert in R or adagio. Secondly, I think your problem is not exactly a 2D knapsack; it looks like a variant of a packing problem, so it requires a different approach.
So, first, check this awesome list of R optimization packages, especially the following sections:
Specific Applications in Optimization (for example, tabu search could be useful for you)
Mathematical Programming Solvers/Interfaces to Open Source Optimizers (lpsolve could definitely solve your task; see the sketch after this list)
Global and Stochastic Optimization (some of these packages could be used to solve your task)
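To illustrate the lpSolve interface only, here is a heavily simplified, hedged sketch: a binary variable x[k, r, c] means "an object of type k has its top-left corner on cell (r, c)". Because the benefit/cost ratio is not a linear objective, this sketch maximizes total value under a cost budget instead, with no-overlap constraints; the grid size and budget are made-up toy numbers.
library(lpSolve)
grid_n <- 5                               # toy 5 x 5 grid
sizes  <- c(1, 2, 3)                      # object side lengths
cost   <- c(800, 2000, 3150)
value  <- c(2478336, 7565257, 14363679)
budget <- 10000                           # made-up cost budget
# enumerate every feasible placement (type, row, col of the top-left corner)
placements <- do.call(rbind, lapply(seq_along(sizes), function(k) {
  s <- sizes[k]
  expand.grid(type = k, row = 1:(grid_n - s + 1), col = 1:(grid_n - s + 1))
}))
np <- nrow(placements)
# each grid cell may be covered by at most one chosen placement
cover <- matrix(0, nrow = grid_n * grid_n, ncol = np)
for (p in seq_len(np)) {
  s    <- sizes[placements$type[p]]
  rows <- placements$row[p] + 0:(s - 1)
  cols <- placements$col[p] + 0:(s - 1)
  cover[as.vector(outer(rows, (cols - 1) * grid_n, "+")), p] <- 1
}
const.mat <- rbind(cover, cost[placements$type])   # overlap rows + budget row
const.dir <- rep("<=", nrow(const.mat))
const.rhs <- c(rep(1, grid_n * grid_n), budget)
sol <- lp("max", value[placements$type], const.mat, const.dir, const.rhs,
          all.bin = TRUE)
placements[sol$solution > 0.5, ]                   # the chosen placements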
In case you're not tied to R, consider MiniZinc as a solver. It's very easy to install and use, and it's pretty efficient in terms of memory/time consumption. Moreover, there are a bunch of great examples of how to use it.
I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands seqtree and seqdist, which work fine when I, for example, use a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.
I would like to use all the observations, and I do have access to a supercomputer that should be able to do just that. However, this doesn't help much, as the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the above-mentioned commands? Or are there other ways to speed up the process? Any help would be appreciated!
The internal seqdist function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize seqdist, you need to do it in C++. The loop is located in the source file "distancefunctions.cpp" and you need to look at the two loops located around line 300 in function "cstringdistance" (Sorry but all comments are in French). Unfortunately, the second important optimization is that the memory is shared between all computations. For this reason, I think that parallelization would be very complicated.
Apart from selecting a sample, you should consider the following optimizations:
aggregation of identical sequences (see here: Problem with big data (?) during computation of sequence distances using TraMineR)
If relevant, you can try to reduce the time granularity. Distance computation time grows quadratically with sequence length. See https://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence
Reducing time granularity may also increase the number of identical sequences, and hence the impact of the first optimization.
There is a hidden option in seqdist to use an optimized version of the optimal matching algorithm. It is still in the testing phase (that's why it is hidden), but it should replace the current algorithm in a future version. To use it, set method="OMopt" instead of method="OM". Depending on your sequences, it may reduce computation time. A sketch combining this with the aggregation idea follows this list.
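A hedged sketch of that combination (mydata and the t1...t20 column names are placeholders; check the WeightedCluster and TraMineR documentation for the exact arguments in your versions):
library(TraMineR)
library(WeightedCluster)
seq_cols <- paste0("t", 1:20)                 # hypothetical state columns
agg <- wcAggregateCases(mydata[, seq_cols])   # group identical sequences
seq_unique <- seqdef(mydata[agg$aggIndex, seq_cols],
                     weights = agg$aggWeights)
d <- seqdist(seq_unique, method = "OMopt", sm = "TRATE")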
I need to solve (many times, for lots of data, alongside a bunch of other things) what I think boils down to a second order cone program. It can be succinctly expressed in CVX something like this:
cvx_begin
    variable X(2000);
    expression MX(2000);
    MX = M * X;
    minimize( norm(A * X - b) + gamma * norm(MX, 1) )
    subject to
        X >= 0
        MX((1:500) * 4 - 3) == MX((1:500) * 4 - 2)
        MX((1:500) * 4 - 1) == MX((1:500) * 4)
cvx_end
The data lengths and equality constraint patterns shown are just arbitrary values from some test data, but the general form will be much the same, with two objective terms -- one minimizing error, the other encouraging sparsity -- and a large number of equality constraints on the elements of a transformed version of the optimization variable (itself constrained to be non-negative).
This seems to work pretty nicely, much better than my previous approach, which fudges the constraints something rotten. The trouble is that everything else around this is happening in R, and it would be quite a nuisance to have to port it over to Matlab. So is doing this in R viable, and if so how?
This really boils down to two separate questions:
1) Are there any good R resources for this? As far as I can tell from the CRAN task page, the SOCP package options are CLSOCP and DWD, which includes an SOCP solver as an adjunct to its classifier. Both have similar but fairly opaque interfaces and are a bit thin on documentation and examples, which brings us to:
2) What's the best way of representing the above problem in the constraint block format used by these packages? The CVX syntax above hides a lot of tedious mucking about with extra variables and such, and I can just see myself spending weeks trying to get this right, so any tips or pointers to nudge me in the right direction would be very welcome...
You might find the R package CVXfromR useful. This lets you pass an optimization problem to CVX from R and returns the solution to R.
OK, so the short answer to this question is: there's really no very satisfactory way to handle this in R. I have ended up doing the relevant parts in Matlab with some awkward fudging between the two systems, and will probably migrate everything to Matlab eventually. (My current approach predates the answer posted by user2439686. In practice my problem would be equally awkward using CVXfromR, but it does look like a useful package in general, so I'm going to accept that answer.)
R resources for this are pretty thin on the ground, but the blog post by Vincent Zoonekynd that he mentioned in the comments is definitely worth reading.
The SOCP solver contained within the R package DWD is ported from the Matlab solver SDPT3 (minus the SDP parts), so the programmatic interface is basically the same. However, at least in my tests, it runs a lot slower and pretty much falls over on problems with a few thousand vars+constraints, whereas SDPT3 solves them in a few seconds. (I haven't done a completely fair comparison on this, because CVX does some nifty transformations on the problem to make it more efficient, while in R I'm using a pretty naive definition, but still.)
Another possible alternative, especially if you're eligible for an academic license, is to use the commercial Mosek solver, which has an R interface package Rmosek. I have yet to try this, but may give it a go at some point.
(As an aside, the other solver bundled with CVX, SeDuMi, fails completely on the same problem; the CVX authors aren't kidding when they suggest trying multiple solvers. Also, in a significant subset of cases, SDPT3 has to switch from Cholesky to LU decomposition, which makes the processing orders of magnitude slower, with only a very marginal improvement in the objective compared to the pre-LU steps. I've found it worth reducing the requested precision to avoid this, but YMMV.)
There is a new alternative: CVXR, which comes from the same people.
There is a website, a paper and a github project.
Disciplined convex programming seems to be growing in popularity, judging by cvxpy (Python) and Convex.jl (Julia), again backed by the same people.
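For reference, a hedged, untested sketch of what the CVX model from the question might look like in CVXR (assuming A, b, M and gamma are already defined in R):
library(CVXR)
X  <- Variable(2000)
MX <- M %*% X
i  <- (1:500) * 4
objective   <- Minimize(p_norm(A %*% X - b, 2) + gamma * p_norm(MX, 1))
constraints <- list(X >= 0,
                    MX[i - 3] == MX[i - 2],
                    MX[i - 1] == MX[i])
result <- solve(Problem(objective, constraints))
x_hat <- result$getValue(X)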
I have been working on large datasets lately (more than 400 thousand lines). So far, I have been using the xts format, which worked fine for "small" datasets of a few tens of thousands of elements.
Now that the project has grown, R simply crashes when retrieving the data from the database and putting it into the xts object.
It is my understanding that R should be able to have vectors with up to 2^32-1 elements (or 2^64-1 depending on the version). Hence, I came to the conclusion that xts might have some limitations, but I could not find the answer in the docs. (Maybe I was a bit overconfident about my understanding of the theoretically possible vector size.)
To sum up, I would like to know if:
xts indeed has a size limitation
What do you think is the smartest way to handle large time series? (I was thinking about splitting the analysis into several smaller datasets).
I don't get an error message; R simply shuts down automatically. Is this a known behavior?
SOLUTION
The size limit is the same as R's, and it depends on the architecture being used (64-bit or 32-bit). It is extremely large anyway.
Chunking the data is indeed a good idea, but it is not needed here.
The problem came from a bug in R 2.11.0 that has been fixed in R 2.11.1. There was a problem with long date vectors (here, the indexes of the xts object).
Regarding your two questions, my $0.02:
Yes, there is a limit of 2^32-1 elements for R vectors. This comes from the indexing logic, and that reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction.
Yes, splitting things into manageable chunks is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or I keep my data sets sane).
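A minimal sketch of the chunking idea with xts (toy data; split.xts accepts the usual period strings):
library(xts)
idx <- as.POSIXct("2010-01-01") + 60 * seq_len(4e5)   # toy 400k-point index
x   <- xts(rnorm(4e5), order.by = idx)
chunks <- split(x, f = "months")          # list of smaller xts objects
res <- lapply(chunks, mean)               # process each chunk separately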
There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and affirm that you really need all 400k rows at once.