Why is linear regression taking a very long time to run in R?

I'm running a linear regression on a tiff image. The image dimensions are:
ncol=6350, nrow=2077, nlayers=26
Before running the calculation I simply read the tiff image into R using
ndvi2000<-raster("img2000.tif")
Then I ran the following script in the R console. The calculation has been running for more than 20 minutes and is still going. Is it normal for a big image to take this long? The regression script is:
time <- sort(sample(97:297, nlayers(ndvi2000)))
# per-pixel regression: return NA for missing pixels, otherwise the fitted values
t.lm.pred <- function(x) { if (is.na(x[1])) NA else predict(lm(x ~ time)) }
f.pred <- calc(ndvi2000, t.lm.pred)

The number of values you have is very large, so I'm not in the least surprised that it takes so long. Simply creating a vector of random numbers the size of your tiff file:
x = runif(6350 * 2077 * 26)
object.size(x) / (1024 * 1024)
2616.216
That is over 2.5 GB, and that is just to store one variable. A rule of thumb is that you need roughly three times as much RAM as the size of your dataset. So, assuming you load some more images, you'll need more than 10-20 GB of RAM. If you don't have enough RAM, your operating system will start swapping memory to disk, which makes your analysis extremely slow.
I think it would be a good idea to rethink your analysis, or else rent a 64 GB RAM EC2 instance. You could look only at the temporal average, or the spatial average, or only at specific locations, etc. Simply brute-forcing all the values in your data might not be the best approach here; see the sketch below for the "specific locations" idea.
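As a rough illustration of sampling specific locations, here is a minimal sketch that fits the regression on a random subset of pixels only. It assumes ndvi2000 is a 26-layer object (read with brick() or stack(); raster() reads only a single layer) and defines time as in the question; sampleRandom and the sample size of 10,000 are just illustrative choices.
library(raster)

set.seed(1)
vals <- sampleRandom(ndvi2000, size = 10000)     # matrix: one row per sampled cell, one column per layer
time <- sort(sample(97:297, nlayers(ndvi2000)))

# fitted values per sampled pixel; rows starting with NA stay NA
pred <- t(apply(vals, 1, function(x) {
  if (is.na(x[1])) rep(NA_real_, length(time)) else predict(lm(x ~ time))
}))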

Related

R running out of memory during time series distance computation

Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase available/addressable memory somehow? However, this 7.6 GB is probably beyond some hard limit that cannot be extended? In any case, the system has 16 GB of memory and the same amount of swap. By "Gb", R seems to mean gigabyte, not gigabit, so 7.6 GB already puts us dangerously close to a hard limit.
Perhaps a different distance computation method instead of euclidean, say DTW, might be more memory efficient? However, as explained below, the memory limit seems to be the resulting matrix, not the memory required at computation time.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts relevant for the lower triangle) that can later be reassembled? (This might look similar to the solution to a similar problem proposed here.) It seems to be a rather messy solution, though. Further, I will need the 45K x 45K matrix in the end anyway. However, this seems to hit the limit. The system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems, R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
Related questions
Answers to other questions related to R running out of memory, like this one, suggest using more memory-efficient methods of computation.
This very helpful answer suggests deleting other large objects to make room for the memory-intensive operation.
Here, the idea of splitting the data set and computing distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/, which by default (depending on configuration) is limited to half the size of the RAM. You can increase this at runtime with, for instance, mount -o remount,size=12Gb /dev/shm, but this may still not allow 12 GB to be used (I do not know why; perhaps the memory management configuration becomes inconsistent). Also, you may end up crashing your system if you are not careful.
R apparently actually allows access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are a bit based on speculation, so I do not know to what degree this is right.
R apparently actually allows access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before, as described in the original question, but with a matrix of NAs instead of rnorm random variates. Reassigning one of the values in the matrix to a numeric value (memorytestmatrix[1,1] <- 0.5) still works and coerces the matrix to a numeric (double) matrix.
Consequently, I suppose, you can have a matrix of that size, but you cannot do it the way the dist function attempts to do it. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.
library(mefa)  # for the vec2dist function

euclidian <- function(series1, series2) {
  return((sum((series1 - series2)^2))^0.5)
}

mx <- nrow(ctab2)
distMatrixE <- vec2dist(0.0, size = mx)

for (coli in 1:(mx - 1)) {
  for (rowi in (coli + 1):mx) {
    # Element indices in dist objects count down the rows, column by column, from left to right,
    # in the lower triangular matrix without the main diagonal.
    # From the row and column indices, the element index for the dist object is computed like so:
    element <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rowi - coli
    # ... and now, we replace the distances in place
    distMatrixE[element] <- euclidian(ctab2[rowi, ], ctab2[coli, ])
  }
}
(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
Also note that it may be possible to speed this up by computing more than one value at a time, for example with apply or sapply; a sketch of filling one column of the lower triangle at a time follows below.
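A rough sketch of that idea, assuming ctab2 is a numeric matrix and reusing euclidian, mx, and distMatrixE from the code above (only a sketch; the index arithmetic mirrors the formula in the loop):
for (coli in 1:(mx - 1)) {
  rows <- (coli + 1):mx
  # index of the first element of column coli in the dist vector (the formula above with rowi = coli + 1)
  start <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + 1
  # all distances from point coli to the points below it, in one apply call
  distMatrixE[start:(start + length(rows) - 1)] <-
    apply(ctab2[rows, , drop = FALSE], 1, euclidian, series2 = ctab2[coli, ])
}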
There are good clustering algorithms that do not need the full distance matrix in memory, for example SLINK, DBSCAN, and OPTICS.
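For instance, a minimal DBSCAN sketch with the dbscan package, assuming ctab2 is the 45000 x 9 numeric matrix from the question; the eps and minPts values are placeholders that would need tuning:
library(dbscan)

cl <- dbscan(ctab2, eps = 0.5, minPts = 5)  # works on the data matrix directly, no 45K x 45K distance matrix
table(cl$cluster)                           # cluster 0 contains the noise points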

fast performance when reading multiple *.gif images

My computer (i5-6500 3.2 GHz, 8 GB RAM) takes a long time: something like 10 minutes (haven't yet measured exactly).
I currently have to
read 400 images (*.gif format, should all be b&w, resolution of approx. 200*400 px; 3520 images in total).
I want to "add" all images "cell-wise".
Here is how I'm doing it at the moment: read each image with raster, turn it into a matrix, then sum.
library(rgdal)
library(raster)
library(magrittr)
oldPic <- raster("initalImage.gif") %>% as.matrix
for (pat_IND in currSide) {
  newPic <- raster(pat_IND) %>% as.matrix
  oldPic <- oldPic + newPic
}
This takes forever. I used caTools::read.gif(), which was even slower. Do I have a bottleneck in my code? Is there a faster implementation?
Edit: Image Properties
i use "no dither", mono palette (b&w).
Edit2
I want to add the images pixel-wise. Let's take pic A and pic B:
A + B = C. If A(1,1) = 1 and B(1,1) = 1, then C(1,1) should be 2. It's simple matrix addition.
Test image:
reading with raster takes 0.03699994 secs
reading with raster + as.matrix takes 0.201 secs
You need to measure... without any sample image it is hard to say, and we can only guess. Take into account that loading/decoding a JPG takes milliseconds, while encoding a GIF can be time consuming, even 200 ms, depending on the kind of encoding. To speed up GIF encoding you can:
use single global palette + dithering
GIF is 8 bpp and JPG is 24 bpp, so your encoder needs to do the conversion. That is called color quantization, and it is the most expensive operation in encoding; it can take ~200 ms per frame on an average PC even in well-optimized C++ code. For more info see:
Effective gif/image color quantization?
To remedy this you can use a single palette dedicated to dithering (like the default VGA palette, or some WEB palette; they serve the same purpose) and use dithering, which is much faster. See:
simple and fast Dithering
By the way, if you need to preserve colors, take a look at this:
Images lose quality after saving as GIF
So try to find out how to configure your encoder to force dithering instead of color quantization based on k-means or similar.
limit the encoding dictionary to less than 4096
Encoding/decoding is based on building a dictionary, and encoding needs to search it more than once per pixel. So lowering its size to 1024 gives a significant speed boost. Of course, you need access to the encoder's code to change this, unless it can be configured somehow... The compression ratio will decrease as a result, however, and more clear codes will be present in the stream.
use multi-threading
You can fully parallelize this and encode with every core present in your system.
I strongly recommend measuring how long it takes to encode a single GIF frame. If you take advantage of both bullets #1 and #2, I estimate you can get to around ~5 ms per frame with dithering and ~60 ms per frame with fast quantization. With 3520 frames that would be around 17.6 or 211.2 seconds just to encode the GIFs, so add the file, memory and JPG manipulation on top, and keep in mind that all of this is heavily guessed/estimated since you did not provide sample data. Divide by the number of cores if you use #3, plus or minus waits for shared disk access.
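On the R side of the original question, the reading and summing can also be spread across cores. A minimal sketch, assuming the images sit in a single directory of *.gif files; the "gifs" path is hypothetical, and mclapply needs a Unix-like OS (on Windows, parLapply with a cluster would be the analogue):
library(rgdal)
library(raster)
library(parallel)

files <- list.files("gifs", pattern = "\\.gif$", full.names = TRUE)  # hypothetical directory

# read each image into a matrix on a separate core, then sum them cell-wise
mats <- mclapply(files, function(f) as.matrix(raster(f)), mc.cores = detectCores())
total <- Reduce(`+`, mats)
Holding all 3520 matrices at once costs a couple of GB at this resolution, so with tight memory you would sum within chunks of files instead.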

How to speed up the generation of a latin hypercube (LHS) design

I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB RAM, Windows 7 64-bit, and I'm running Microsoft R Open, which is a multicore version of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I've heard that parallel computing may help with R, but I don't know how to use it, and I have no idea whether it speeds up only code that I write myself, or whether it could speed up an existing package function such as optimumLHS. I don't have to use the lhs package necessarily - my only requirement is to generate an LHS design that is optimized in terms of the S-optimality criterion, the maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worse comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or an open source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)

performance <- c()
for (i in 1:100) {
  ptm <- proc.time()
  invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
  time <- print(proc.time() - ptm)[[3]]
  performance <- rbind(performance, data.frame(time = time, n = i))
}

ggplot(performance, aes(x = n, y = time)) +
  geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points you need to know the locations of all the sample points. I think your only options for speeding this up are to take a smaller sample or to get access to a faster computer. It strikes me that since this is something that only really has to be done once, perhaps there is a resource where you could just get a properly sampled and optimized design that has already been computed?
So it looks like roughly 650 hours on my machine, which is very comparable to yours, to compute the n = 400 case; a cheaper maximin-based sketch follows below.
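If the maximin metric mentioned in the question is acceptable, lhs::maximinLHS should be far cheaper than optimumLHS for this size, at the cost of a less thoroughly optimized design; this is only a sketch:
library(lhs)

# a maximin (rather than fully optimized) 400 x 7 Latin hypercube design
set.seed(42)
lhs_design <- maximinLHS(n = 400, k = 7)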

Increase loop speed for FFT

I have heard that writing for loops in R is particularly slow. I have the following code, which needs to run through 122,000 rows, each with 513 columns, and transform them using the fft() function:
for (i in 2:100000) {
  Data1[i, 2:513] <- fft(as.numeric(Data1[i, 2:513]), inverse = TRUE)/512
}
I have tried doing this for 1000 cycles and it took a few minutes... Is there a way to make this loop faster? Maybe by not using a loop, or by doing it in C?
mvfft (documented on the fft help page) was designed to do this all at once. It's hard to imagine how you could do it any faster: less than three seconds (on an older Xeon workstation) for a dataset exactly your size.
n.row <- 122e3
X <- matrix(rnorm(n.row * 512), n.row)
system.time(
  Y <- mvfft(t(X), inverse = TRUE)/512
)
   user  system elapsed
   2.34    0.39    2.75
Note that the discrete FFT in this case has complex values.
FFTs are fast. Typically they can be computed in less time than it takes to read data from an ASCII file (because the character-to-numeric conversions involved in the read take more time than the calculations in the FFT). Your limiting resources therefore are I/O throughput speed and RAM. But 122,000 vectors of 512 complex values occupy "only" about a gigabyte, so you should be ok.
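Applied to the data frame in the question, a minimal sketch might look like this. It assumes Data1 has an identifier in column 1 followed by 512 numeric columns, as the loop suggests; note that the result is complex-valued, so it has to go into a new object rather than back into the numeric columns of Data1:
# transform every row at once: mvfft works column-wise, so transpose, transform, transpose back
M <- as.matrix(Data1[, 2:513])
Y <- t(mvfft(t(M), inverse = TRUE) / 512)   # complex matrix, one row per row of Data1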

ffdf object consumes extra RAM (in GB)

I have decided to test the key advantage of the ff package, minimal RAM allocation (PC specs: i5, 8 GB RAM, Win7 64-bit, RStudio).
According to the package description, we can manipulate physical objects (files) like virtual ones, as if they were allocated in RAM. Thus actual RAM usage is reduced greatly (from GB to kB). The code I used is as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x <- cbind(rnorm(1:100000000), rnorm(1:100000000), 1:100000000)
system.time(write.csv2(x, "test.csv", row.names = FALSE))
system.time(x <- read.csv2.ffdf(file = "test.csv", header = TRUE, first.rows = 100000, next.rows = 100000000, levels = NULL))
print(object.size(x)/1024/1024)
print(class(x))
The actual file size is 4.5 GB; the actual RAM usage (per Task Manager) varies like this: 2.92 GB -> upper limit (~8 GB) -> 5.25 GB.
The object size (per object.size()) is about 12 kB.
My concern is the extra RAM allocation (~2.3 GB). According to the package description it should have increased by only about 12 kB. I don't use any character columns.
Maybe I have missed something about the ff package.
Well, I have found a solution to eliminate the use of extra RAM.
First of all, it is necessary to pay attention to the 'first.rows' and 'next.rows' arguments of 'read.table.ffdf' in the ff package.
The first argument ('first.rows') sets the size of the initial chunk in rows, and thereby the initial memory allocation. I used the default value (1000 rows).
The extra memory allocation is governed by the second argument ('next.rows'). If you want an ffdf object without extra RAM allocations (in my case, gigabytes), you need to choose a number of rows for the next chunk such that the size of the chunk does not exceed the value of 'getOption("ffbatchbytes")'; see the sketch below.
In my case, with 'first.rows=1000' and 'next.rows=1000', the total RAM allocation varied by up to 1 MB in Task Manager.
Increasing 'next.rows' to 10000 caused RAM growth of 8-9 MB.
So these arguments are something to experiment with to find the best trade-off.
Also keep in mind that increasing 'next.rows' affects the processing time to build the ffdf object (over several runs):
'first.rows=1000' and 'next.rows=1000': around 1500 sec (RAM ~ 1 MB)
'first.rows=1000' and 'next.rows=10000': around 230 sec (RAM ~ 9 MB)
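A minimal sketch of that rule of thumb, assuming the same three-numeric-column test.csv as above. The ~24 bytes per row only counts the raw storage of three doubles; the chunks held in RAM while read.table parses them cost noticeably more, so treat this as an upper bound and experiment downward from it (as the timings above do):
library(ff)
library(ffbase)

# choose next.rows so that one chunk stays below getOption("ffbatchbytes")
batch_bytes <- getOption("ffbatchbytes")
rows_per_chunk <- floor(batch_bytes / 24)   # assumed ~24 bytes per row for three numeric columns

x <- read.csv2.ffdf(file = "test.csv", header = TRUE,
                    first.rows = 1000, next.rows = rows_per_chunk)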

Resources