Avoid memory increase in foreach loop in R

I am trying to create summary statistics combining two different spatial data sets: a big raster file and a polygon file. The idea is to get summary statistics of the raster values within each polygon.
Since the raster is too big to process at once, I try to create subtasks and process them in parallel, i.e. process each polygon from the SpatialPolygonsDataFrame separately.
The code works fine, but after around 100 iterations I run into memory problems. Here is my code and what I intend to do:
# session setup
library("raster")
library("rgdal")
# multicore processing.
library("foreach")
library("doSNOW")
# assign three clusters to be used for current R session
cluster = makeCluster(3, type = "SOCK",outfile="")
registerDoSNOW(cluster)
getDoParWorkers()# check if it worked
# load base data
r.terra.2008<-raster("~/terra.tif")
spodf.malha.2007 <- readOGR("~/", "composed")
# bring both data-sets to a common CRS
proj4string(r.terra.2008)
proj4string(spodf.malha.2007)
spodf.malha.2007<-spTransform(spodf.malha.2007,CRSobj = CRS(projargs = proj4string(r.terra.2008)))
proj4string(r.terra.2008)==proj4string(spodf.malha.2007) # should be TRUE
# create a function to extract areas
function.landcover.sum <- function(r.landuse, spodf.pol) {
  return(table(extract(r.landuse, spodf.pol)))
}
# apply it on one subset to see if it is working
function.landcover.sum(r.terra.2008,spodf.malha.2007[1,])
## parallel loop
# define package(s) to be use in the parallel loop
l.packages<-c("raster","sp")
# try a parallel loop for the first 6 polygons
l.results <- foreach(i = 1:6,
                     .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ", i, ".", sep = ""))
  return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i,]))
}
Here the output is a list that looks like this:
l.results
[[1]]
9 10
193159 2567
[[2]]
7 9 10 12 14 16
17 256 1084 494 67 15
[[3]]
3 5 6 7 9 10 11 12
2199 1327 8840 8579 194437 1061 1073 1834
14 16
222 1395
[[4]]
3 6 7 9 10 12 16
287 102 728 329057 1004 1057 31
[[5]]
3 5 6 7 9 12 16
21 6 20 495 184261 4765 28
[[6]]
6 7 9 10 12 14
161 161 386 943 205 1515
So the results are rather small and should not be the source of the memory allocation problem. But the following loop over the whole polygon dataset, which has >32,000 rows, causes the memory allocation to exceed 8 GB after around 100 iterations.
# apply the parallel loop on the whole dataset
l.results <- foreach(i = 1:nrow(spodf.malha.2007),
                     .packages = l.packages) %dopar% {
  print(paste("Processing Polygon ", i, ".", sep = ""))
  return(function.landcover.sum(r.terra.2008, spodf.malha.2007[i,]))
  # gc(reset=TRUE)        # does not resolve the problem
  # closeAllConnections() # does not resolve the problem
}
What am I doing wrong?
Edit:
I tried (as suggested in the comments) to remove the object after each iteration in the internal loop, but it did not resolve the problem. I furthermore tried to avoid potential problems from repeated data imports by exporting the objects to the workers' environments up front:
clusterExport(cl = cluster,
              varlist = c("r.terra.2008","function.landcover.sum","spodf.malha.2007"))
without major changes. My R version is 3.4 on a Linux platform, so supposedly the patch from the link in the first comment should already be included in this version. I also tried the parallel package as suggested in the first comment, but no difference appeared.

You can try exact_extract from the exactextractr package. It is one of the fastest and most memory-safe functions for extracting values from a raster. The main function is implemented in C++ and usually does not need parallelization. Since you did not provide any example data, I post an example with real data:
library(raster)
library(sf)
library(exactextractr)
# Pull municipal boundaries for Brazil
brazil <- st_as_sf(getData('GADM', country='BRA', level=2))
# Pull gridded precipitation data
prec <- getData('worldclim', var='prec', res=10)
#transform precipitation data into a dummy land-use map
lu <- prec[[1]]
values(lu) <- sample(1:10, ncell(lu), replace = TRUE)
plot(lu)
#extract the land-use class for each pixel inside each polygon
ex <- exact_extract(lu, brazil)
#apply table to the resulting list; here I use just the first 5 elements to avoid long output
lapply(ex[1:5], function(x) {
  # use x[,1] because by default exact_extract returns the coverage
  # fraction of each pixel within each polygon in the second column
  table(x[,1])
})
Here is the example output:
[[1]]
1 2 4 6 7 9 10
1 1 1 2 3 1 1
[[2]]
2 3 4 5 6 7 8 10
2 4 3 2 1 2 2 2
[[3]]
1 2 4 6 7 8 9 10
4 5 1 1 4 2 5 5
[[4]]
1 2 3 4 5 6 7 8 9 10
2 2 4 2 2 4 1 4 1 2
[[5]]
3 4 5 6 8 10
2 3 1 1 2 3
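If you only need the class tables, you can also pass a summary function straight to exact_extract, which avoids holding the full list of per-pixel data frames in memory. A sketch based on the package's callback interface (the function receives each polygon's cell values and their coverage fractions):
# tabulate land-use classes per polygon in one step
ex.tab <- exact_extract(lu, brazil[1:5,], function(values, coverage) table(values))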


Acoustic complexity index time series output

I have a wav file and I would like to calculate the Acoustic Complexity Index at each second and receive a time series output.
I understand how to modify other settings within a function like seewave::ACI() but I am unable to find out how to output a time series data frame where each row is one second of time with the corresponding ACI value.
For a reproducible example, this audio file is 20 seconds, so I'd like the output to have 20 rows, with each row printing the ACI for that 1-second of time.
library(soundecology)
data(tropicalsound)
acoustic_complexity(tropicalsound)
In fact, I'd like to achieve this for a few other indices as well, for example:
soundecology::ndsi(tropicalsound)
soundecology::acoustic_evenness(tropicalsound)
You can subset your wav file according to the samples it contains. Since the sampling frequency can be obtained from the wav object, we can take one-second subsets of the file and perform our calculations on each. Note that you have to set the cluster size to 1 second (the j argument of acoustic_complexity()), since the default is 5 seconds.
library(soundecology)
data(tropicalsound)
f <- tropicalsound@samp.rate
starts <- head(seq(0, length(tropicalsound), f), -1)
aci <- sapply(starts, function(i) {
  aci <- acoustic_complexity(tropicalsound[i + seq(f)], j = 1)
  aci$AciTotAll_left
})
nds <- sapply(starts, function(i) {
  nds <- ndsi(tropicalsound[i + seq(f)])
  nds$ndsi_left
})
aei <- sapply(starts, function(i) {
  aei <- acoustic_evenness(tropicalsound[i + seq(f)])
  aei$aei_left
})
This allows us to create a second-by-second data frame representing a time series of each measure:
data.frame(time = 0:19, aci, nds, aei)
#> time aci nds aei
#> 1 0 152.0586 0.7752307 0.438022
#> 2 1 168.2281 0.4171902 0.459380
#> 3 2 149.2796 0.9366220 0.516602
#> 4 3 176.8324 0.8856127 0.485036
#> 5 4 162.4237 0.8848515 0.483414
#> 6 5 161.1535 0.8327568 0.511922
#> 7 6 163.8071 0.7532586 0.549262
#> 8 7 156.4818 0.7706808 0.436910
#> 9 8 156.1037 0.7520663 0.489253
#> 10 9 160.5316 0.7077717 0.491418
#> 11 10 157.4274 0.8320380 0.457856
#> 12 11 169.8831 0.8396483 0.456514
#> 13 12 165.4426 0.6871337 0.456985
#> 14 13 165.1630 0.7655454 0.497621
#> 15 14 154.9258 0.8083035 0.489896
#> 16 15 162.8614 0.7745876 0.458035
#> 17 16 148.6004 0.1393345 0.443370
#> 18 17 144.6733 0.8189469 0.458309
#> 19 18 156.3466 0.6067827 0.455578
#> 20 19 158.3413 0.7175293 0.477261
Note that this is simply a demonstration of how to achieve the desired output; you would need to check the literature to determine whether it is appropriate to use these measures over such short time periods.
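If you need this for several recordings, the pattern above can be wrapped into a small helper. A minimal sketch reusing only the calls already shown (per_second_aci is my own name, and it assumes wave is a tuneR Wave object like tropicalsound):
per_second_aci <- function(wave) {
  f <- wave@samp.rate
  starts <- head(seq(0, length(wave), f), -1)
  sapply(starts, function(i) acoustic_complexity(wave[i + seq(f)], j = 1)$AciTotAll_left)
}
per_second_aci(tropicalsound)  # same values as the aci column above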

Is there a way to create a permutation of a vector without using the sample() function in R?

I hope you are having a nice day. I would like to know if there is a way to create a permutation (rearrangement) of the values in a vector in R.
My professor gave us an assignment in which we are supposed to create functions for a randomization test: one using sample() to create a permutation, and one not using the sample() function. So far all of my efforts have been fruitless, as every answer I can find resorts to the use of the sample() function. I have tried several other methods, such as indexing with runif() and writing my own functions, but to no avail. Alas, I have accepted defeat and come here for salvation.
While using the sample() function, the code looks like:
#create the groups
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
#create a permutation of the combined vector without replacement using the sample function()
permsample <-sample(c(a,b),replace=FALSE)
permsample
[1] 2 5 6 1 7 7 3 8 6 3 5 9 2 7 4 8 1 5
And, for reference, the entire code of my function looks like:
PermutationTtest <- function(a, b, P) {
  sample.t.value <- t.test(a, b)$statistic
  perm.t.values <- matrix(rep(0, P), P, 1)
  N <- length(a)
  M <- length(b)
  for (i in 1:P) {
    permsample <- sample(c(a, b), replace = FALSE)
    pgroup1 <- permsample[1:N]
    pgroup2 <- permsample[(N+1):(N+M)]
    perm.t.values[i] <- t.test(pgroup1, pgroup2)$statistic
  }
  return(mean(perm.t.values))
}
How would I achieve the same thing, but without using the sample() function and within the confines of base R? The only hint my professor gave was "use indices." Thank you very much for your help and have a nice day.
You can use runif() to generate a value between 1 and one more than the number of elements remaining; floor() then turns it into a uniformly distributed integer position. (Note the max = index + 1 below: with max = index, floor() could never pick the last remaining position, which would bias the permutation.) At each iteration I shrink the range of the random draw, append the element at the rn-th position of the remaining vector to the new one, and remove it.
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
c <- c(a, b)
index <- length(c)
perm <- c()
for (i in 1:length(c)) {
  # draw a position in 1..index; max = index + 1 so floor() can reach index
  rn <- floor(runif(1, min = 1, max = index + 1))
  perm <- append(perm, c[rn])
  c <- c[-rn]
  index <- index - 1
}
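Because each iteration moves exactly one element from c into perm, the result is guaranteed to be a rearrangement of the input, which you can verify:
identical(sort(perm), sort(c(a, b)))
# [1] TRUE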
It is easier to see what is going on if we use consecutive numbers:
a <- 1:8
b <- 9:17
ab <- c(a, b)
ab
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Now draw 17 (length(ab)) random numbers and use them to order ab:
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 5 13 11 12 6 1 17 3 10 2 8 16 7 4 9 15 14
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 14 11 5 15 10 7 13 9 17 8 2 6 1 4 16 12 3
For each permutation just draw another 17 random numbers.
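To plug this into your PermutationTtest() function, replace the sample() call with the same trick. A minimal sketch (perm_via_runif is my own helper name):
perm_via_runif <- function(x) x[order(runif(length(x)))]
permsample <- perm_via_runif(c(a, b))  # drop-in for sample(c(a, b), replace = FALSE)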

creating a dataframe of means of 5 randomly sampled observations

I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean),
type='mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.
You have 50,000 income values in a single vector. Let's break your code down:
tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean)
I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also set set.seed(1) so the result can be reproduced.
sample(loans_income$income,1000*5)
We draw 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
FUN = mean
Just apply the mean-function on the data.
tapply
So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached an index to the random numbers (your income). Now we calculate the mean per group (this needs the dplyr package loaded):
library(dplyr)
df %>%
  group_by(index) %>%
  summarise(mean = mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000),10*5),
rep(1:10,rep(5,10)),
mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
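As an aside (my own sketch, not from the book): a similar table of means can be built more directly with replicate(), though the sampling scheme differs slightly, since here each mean comes from an independent draw of 5 values, whereas the tapply() version partitions a single draw of 5000 made without replacement:
samp_mean_5 <- data.frame(
  income = replicate(1000, mean(sample(loans_income$income, 5))),
  type = 'mean_of_5'
)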

How can I tell a for loop in R to regenerate a sample if the sample contains a certain pair of species?

I am creating 1000 random communities (vectors) from a species pool of 128, with certain operations applied to each community and the results stored in a new vector. For simplicity, I have been practicing by writing code using 10 random communities from a species pool of 20. The problem is that there are a couple of pairs of species such that if one of the pairs is generated in a random community, I need that community to be thrown out and a new one regenerated. I have been able to code it so that if a pair is found in a community, that community (vector) is labeled NA. I also know how to tell the loop to skip that vector using the "next" command. But with both of these options, I do not get all of the communities that I need.
Here is my code using the NA option, which again leaves me short of communities.
C <- c(1:20)
D <- numeric(10)
X <- numeric(5)
for (i in 1:10) {
  X <- sample(C, size = 5, replace = FALSE)
  if ("10" %in% X & "11" %in% X) X = NA else X = X
  if ("1" %in% X & "2" %in% X) X = NA else X = X
  print(X)
  D[i] <- sum(X)
}
print(D)
This is what my result looks like.
[1] 5 1 7 3 14
[1] 20 8 3 18 17
[1] NA
[1] NA
[1] 4 7 1 5 3
[1] 16 1 11 3 12
[1] 14 3 8 10 15
[1] 7 6 18 3 17
[1] 6 5 7 3 20
[1] 16 14 17 7 9
> print(D)
[1] 30 66 NA NA 20 43 50 51 41 63
Thanks so much!
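One standard way to get the regenerate-and-retry behavior you describe (a sketch of my own, not taken from the thread) is to wrap the draw in a repeat loop and only accept a community once it contains neither forbidden pair:
C <- 1:20
D <- numeric(10)
for (i in 1:10) {
  repeat {
    X <- sample(C, size = 5, replace = FALSE)
    # accept only if neither pair co-occurs; otherwise redraw
    if (!(10 %in% X && 11 %in% X) && !(1 %in% X && 2 %in% X)) break
  }
  D[i] <- sum(X)
}
print(D)  # ten sums, no NAs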

Is there a way to refer to the end of a vector?

I don't want to save the huge intermediate results of some of my calculations, and hence want to run some tests without saving these memory-expensive vectors.
Say, during the computation I have a vector of arbitrary length l.
But I don't know what l is, and I can't save the vector in the memory.
Is there a way I can refer the length of the vector, something like
vec[100:END] or vec[100:-1] or vec[100:last]
Please note that vec here is not a variable, and it only refers to an intermediate expression which will output a vector.
I know the length(), head(), and tail() functions, and that vec[-(1:99)] is an equivalent expression.
But, I actually want to know if there is some reference that will run an iteration from a specified number to the 'end' of the vector.
Thanks!!
I'm probably not understanding your question. If this isn't useful let me know and I'll delete it.
I gather you want to extract the elements from a vector of arbitrary length, from element N to the end, without explicitly storing the vector (which is required if you want to use, e.g. length(vec)). Here are two ways:
N <- 5 # grab element 5 to the end.
set.seed(12)
(1:sample(N:100,1))[-(1:(N-1))]
# [1] 5 6 7 8 9 10 11
set.seed(12)
tail(1:sample(N:100,1),-(N-1))
# [1] 5 6 7 8 9 10 11
Both of these create (temporarily) a sequence of integers of random length (>=5), and extract the elements from 5 to the end without self-referencing.
You mentioned memory a couple of times. If you're concerned about memory and assigning large objects, you should take a look at the Memory-limits documentation and the related links. First, there are ways to operate on the language in R. Here I only assign one object, the function f, and use it without making any other assignments.
> f <- function(x, y) x:y ## actually, g <- ":" is only 96 bytes
> object.size(f)
# 1560 bytes
> f(5, 20)[3:7]
# [1] 7 8 9 10 11
> object.size(f)
# 1560 bytes
> f(5, 20)[3:length(f(5, 20))]
# [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> object.size(f)
# 1560 bytes
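If you allow yourself one small helper function, you can get a direct "from n to the end" reference without evaluating the expression twice (from_to_end is a hypothetical name of my own):
from_to_end <- function(v, n) v[n:length(v)]
from_to_end(f(5, 20), 3)
# [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20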
You can also use an expression to hold an unevaluated function call.
> e <- expression(f(5, 20)) ## but again, g <- ":" is better
> eval(e)
# [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> eval(e)[6:9]
# [1] 10 11 12 13
> eval(e)[6:length(eval(e))]
# [1] 10 11 12 13 14 15 16 17 18 19 20
> rev(eval(e))
# [1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5
Depending on the type of data you're working with, there are ways to avoid using large amounts of memory during a session. Here are a few related to your question.
memory.profile()
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 274711 14.7 531268 28.4 531268 28.4
# Vcells 502886 3.9 1031040 7.9 881084 6.8
?gc() is good knowledge to have, and I can't really explain it; best to read about it. Also, I just learned about memCompress() and memDecompress() for in-memory compression/storage. And if you're working with integer values, notifying R about it can help save memory; that's what the L is for on the end of the rep.int() call.
x <- rep.int(5L, 1e4L)
y <- as.raw(x)
z1 <- memCompress(y)
z2 <- memCompress(y, "b")
z3 <- memCompress(y, "x")
mapply(function(a) object.size(get(a)), c('x','y','z1','z2','z3'))
# x y z1 z2 z3
# 40040 10040 88 88 168
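You can round-trip the compressed data with memDecompress() (my own addition; "gzip" matches memCompress()'s default type):
y2 <- memDecompress(z1, type = "gzip")
identical(y, y2)
# [1] TRUE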
And there is also
delayedAssign("p", rep.int(5L, 1e5L))
which is a promise object that takes up 0 bytes of memory until it is first evaluated.
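For example (my own illustration), the promise is forced the first time p is used:
sum(p)  # first use forces evaluation; p is only now materialized in memory
# [1] 500000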
