I am trying to develop a function to "synchronise" NAs among layers of a raster stack, i.e. to make sure that for any given pixel of the stack, if one layer has a NA, then all layers should be set to NA for that pixel.
This is particularly useful when combining rasters coming from varying sources for species distribution modelling, because some models do not handle properly NAs.
I have found two ways to do this, but I find neither of them satisfactory. One of them requires to use the function getValues and thus is not usable for very large stacks or computers with low RAM. The other one is more memory-safe but is much slower. I am therefore here to ask if anyone has an idea to improve my attempts?
Here are the two possibilities:
Using getValues()
syncNA1 <- function (x)
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
}
Using calc()
syncNA2 <- function(y)
{
calc(y, na.rm = T, fun = function(x, na.rm = na.rm)
{
if(any(is.na(x)))
{
rep(NA, length(x))
} else
{
x
}
})
}
Now a demonstration of their respective computing times for the same stack:
> system.time(
+ b1 <- syncNA1(a1)
+ )
user system elapsed
3.04 0.15 3.20
> system.time(
+ b2 <- syncNA2(a1)
+ )
user system elapsed
5.89 0.19 6.08
Many thanks for your help,
Boris
With a stack named "s", I would first use calc(s, fun = sum) to compute a mask layer that records the location of all cells with an NA value in at least one of the stack's layers. mask() will then allow you to apply that mask to every layer in the stack.
Here's an example:
library(raster)
## Construct reproducible data! (Here a raster stack with NA values in each layer)
m <- raster(ncol=10, nrow=10)
n <- raster(ncol=10, nrow=10)
m[] <- runif(ncell(m))
n[] <- runif(ncell(n)) * 10
m[m < 0.5] <- NA
n[n < 5] <- NA
s <- stack(m,n)
## Synchronize the NA values
s2 <- mask(s, calc(s,fun = sum))
## Check that it worked
plot(s2)
I don't know about speed, but you might try converting to an array, loading up the NA, and converting back. pseudocode:
xarray<-as.array(xstack)
ind.na<-which(is.na(xarray),array.ind=TRUE)
for(j in nrow(ind.na) ) {
xarray[ind.na[j,1],ind.na[j,2],]<-NA
}
nastack<-raster(xarray)
I haven't verified the correct choice of indices there, nor have I verified I converted back to raster stack correctly, but I hope you get the idea.
EDIT: I ran a time test, with rasters 1000x1000 but otherwise as Josh created.
microbenchmark(josh(s),syncNA1(s),syncNA2(s),times=5)
Unit: milliseconds
expr min lq median uq max
josh(s) 774.2363 789.1653 800.2511 806.5364 809.9087
syncNA1(s) 652.3928 659.8327 692.3578 695.8057 743.9123
syncNA2(s) 7951.3918 8291.7917 8604.2226 8606.3432 10254.4739
neval
5
5
5
I ended up building an hybrid function between syncNA1 and Josh's solution.
This function is memory-safe if the computer does not have enough RAM, but can process faster if the computer has enough RAM:
synchroniseNA <- function(x)
{
if(canProcessInMemory(x, n = 2))
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
} else
{
x <- mask(x, calc(x, fun = sum))
return(x)
}
}
However, I empirically determined that the amount of ram used by a data.frame is twice the size of a raster file (for the n argument of canProcessInMemory()), but I am not exactly sure I am right here.
Related
I have a RasterStack s1 consisting of 400 layers with data from an island. The extent of the raster is cropped to the extent of the island, but due to its irregular shape, only around 20% of the pixels are actually land area and have data values; the other 80% are water and NA.
I also have a land-water-mask lwm (RasterLayer), where land is coded as 1 and water as NA.
I would like to do different kinds of cell-based calculations on s1, but noticed that these take a long time to finish. To speed things up, the calculations should only be carried out for cells that are land area, whereas water areas should always be NA. In pseudo-code:
for each cell:
if cell is land
do calculation
if cell is water
return(NA)
An requirement is memory-safety.
Here is some sample data to illustrate the problem:
library(raster)
# generate data
lwm <- raster(nrow = 5, ncol = 5)
lwm[] <- c(rep(NA, 10), rep(1, 5), rep(NA, 10))
r1 <- raster(nrow = 5, ncol = 5)
r1[] <- runif(ncell(r1)) * 10
r2 <- raster(nrow = 5, ncol = 5)
r2[] <- runif(ncell(r2)) * 10
s1 <- stack(r1, r2)
s1 <- mask(s1, lwm)
# this works, but all NA-values on water are also unnecessarily evaluated
calc(s1, function(x) {sum(!is.na(x))})
After some playing around, I finally found a solution which works very well in my case, and which takes off quite a nice amount from the overall processing time:
library(raster)
# generate data
lwm <- raster(nrow = 50, ncol = 50)
lwm[] <- 1
# replace 80% with NA values
lwm[sample(1:ncell(lwm), round(0.8 * ncell(lwm)))] <- NA
r1 <- raster(lwm)
r1[] <- runif(ncell(r1))
r1_list <- replicate(400 , r1)
s1 <- stack(r1_list)
s1 <- mask(s1, lwm)
# this works, but all NA-values on water are also unnecessarily evaluated
system.time(r_sum1 <- calc(s1, function(x) {sum(x)}))
#user system elapsed
#0.14 0.00 0.14
## new approach:
# stack land-water-mask with RasterStack
s1_lwm <- stack(lwm, s1)
# function to check if first element of vector is NA; if yes, return NA; if no, do calculation
fun1 <- function(y) {
if (!is.na(y[1])) {
y = y[-1]
return(sum(y))
} else {
return(NA)
}
}
system.time(
r_sum2 <- calc(s1_lwm, fun = fun1)
)
# user system elapsed
# 0.4 0.0 0.4
# results are identical
identical(r_sum1[], r_sum2[])
That's a tricky one and unfortunetely I don't have a straight forward solution for you.
You can either do multiple crops of the island (i.e. 2-3) to minimise NA values and do the calculcations separately on each cropped raster and mosaic the results.
Or another option is to do a parallel calculation, which will speed up the process significantly:
#initialize cluster
#number of cores to use for clusterR function (max recommended: ncores - 1)
beginCluster(3)
#calculation
result <- clusterR(s1, calc, args=list(fun=function(x) {sum(!is.na(x))}))
#end cluster
endCluster()
Since you asked for a memory-safe solution, you should look how much RAM is being allocated when you only a singe core and then estimate how many cores you can run your calculation on, so you won't run out of memory.
Good luck! Hope it helps.
I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtract an item in position [i] from every item after [i], but not the items before [i] (If I was putting the output into a matrix it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this? I assume the first part of the answer is going to entail "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have 'outlist' I will reduce the number of edges based on some criteria(<,>,==,) so the final list (and graph) won't be quite so ponderous.
#Fake data of same approximate dimensions as real data
dlist<-sample(1:20,20, replace=FALSE)
#purge the output list before running the loop
rm(outlist)
outlist<-data.frame()
for(i in 1:(length(dlist)-1)){
for(j in (i+1):length(dlist)){
outlist<-rbind(outlist, c(dlist[i],dlist[j], dlist[j]-dlist[i]))
}
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite a big data, where it's essential to avoid copies wherever possible.
Here's a method that uses data.table package's unexported (internal) vecseq function (written in C and is fast + memory efficient) and makes use of it's assignment by reference operator :=, to avoid copies.
fn1 <- function(x) {
require(data.table) ## 1.9.2
lx = length(x)
vx = as.integer(lx * (lx-1)/2)
# R v3.1.0 doesn't copy on doing list(.) - so should be even more faster there
ans = setDT(list(v1 = rep.int(head(x,-1L), (lx-1L):1L),
v2=x[data.table:::vecseq(2:lx, (lx-1L):1, vx)]))
ans[, v3 := v2-v1]
}
Benchmarking:
I'll benchmark with functions from other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0 because list(.) doesn't result in copy anymore.
fn2 <- function(x) {
diffmat <- outer(x, x, "-")
ss <- which(upper.tri(diffmat), arr.ind = TRUE)
data.frame(v1 = x[ss[,1]], v2 = x[ss[,2]], v3 = diffmat[ss])
}
fn3 <- function(x) {
idx <- combn(seq_along(x), 2)
out2 <- data.frame(v1=x[idx[1, ]], v2=x[idx[2, ]])
out2$v3 <- out2$v2-out2$v1
out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2() due to use of outer requires quite a lot of memory (peak memory usage was >=19GB) and is slower than fn1(). fn3() is just very very slow (due to combn, and unnecessary copy).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all paris of indices in the data.set rather than doing loops. This allows us to build the data.frame all at once rather than adding a row at a time.
We compare that to
out1 <- data.frame()
for(i in 1:(length(dlist)-1)){
for(j in (i+1):length(dlist)){
out1<-rbind(out1, c(dlist[i],dlist[j], dlist[j]-dlist[i]))
}
}
we see that
all(out1==out2)
# [1] TRUE
Plus, if we compare with microbenchmark we see that
microbenchmark(loops(), combdata())
# Unit: microseconds
# expr min lq median uq max neval
# loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166 100
# combdata() 684.316 800.384 873.5015 940.9215 4285.627 100
The method that doesn't use loops is much faster.
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec,vec,"-")
ss <- which(upper.tri(diffmat),arr.ind = TRUE)
data.frame(one = vec[ss[,1]],
two = vec[ss[,2]],
diff = diffmat[ss])
You need to preallocate out list, this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but filled with for example NA's.
I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append a value to the end of the list. I started with the following code.
for (i in 1:10) {
df$x = rbind(df$x, i)
df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to ignore the additional line of code that would be required to update the indices on every iteration and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately take. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
x <- numeric(n)
y <- character(n)
for (i in 1:n) {
x[i] <- i
y[i] <- i
}
data.frame(x, y, stringsAsFactors=FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176 5
# f3(1000) 149.417636 150.529011 150.827393 151.02230 160.637845 5
# f4(1000) 7.872647 7.892395 7.901151 7.95077 8.049581 5
f1() (the approach below) is incredibly inefficient because of how often it calls data.frame and because growing objects that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that in your code, there is one other problem:
You should use stringsAsFactors if you want the characters to not get converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
df <- data.frame(x = numeric(), y = character())
for(i in 1:n){
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
df
}
# use list
f2 <- function(n){
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for(i in 1:n){
df[i,] <- list(i, toString(i))
}
df
}
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(1000), y = character(1000), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
system.time(f1(1000))
# user system elapsed
# 1.33 0.00 1.32
system.time(f2(1000))
# user system elapsed
# 0.19 0.00 0.19
system.time(f3(1000))
# user system elapsed
# 0.14 0.00 0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use list, and the worst solution (at least based on these timing results) appears to be rbind.
Suppose you simply don't know the size of the data.frame in advance. It can well be a few rows, or a few millions. You need to have some sort of container, that grows dynamically. Taking in consideration my experience and all related answers in SO I come with 4 distinct solutions:
rbindlist to the data.frame
Use data.table's fast set operation and couple it with manually doubling the table when needed.
Use RSQLite and append to the table held in memory.
data.frame's own ability to grow and use custom environment (which has reference semantics) to store the data.frame so it will not be copied on return.
Here is a test of all the methods for both small and large number of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That is quite easy and straight-forward:
create.1<-function(elems)
{
return(as.data.table(elems))
}
append.1<-function(dt, elems)
{
return(rbindlist(list(dt, elems),use.names = TRUE))
}
access.1<-function(dt)
{
return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2<-function(elems)
{
return(as.data.table(elems))
}
append.2<-function(dt, elems)
{
n<-attr(dt, 'rowcount')
if (is.null(n))
n<-nrow(dt)
if (n==nrow(dt))
{
tmp<-elems[1]
tmp[[1]]<-rep(NA,n)
dt<-rbindlist(list(dt, tmp), fill=TRUE, use.names=TRUE)
setattr(dt,'rowcount', n)
}
pos<-as.integer(match(names(elems), colnames(dt)))
for (j in seq_along(pos))
{
set(dt, i=as.integer(n+1), pos[[j]], elems[[j]])
}
setattr(dt,'rowcount',n+1)
return(dt)
}
access.2<-function(elems)
{
n<-attr(elems, 'rowcount')
return(as.data.table(elems[1:n,]))
}
SQL should be optimized for fast record insertion, so I initially had high hopes for RSQLite solution
This is basically copy&paste of Karsten W. answer on similar thread.
create.3<-function(elems)
{
con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
return(con)
}
append.3<-function(con, elems)
{
RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append=TRUE)
return(con)
}
access.3<-function(con)
{
return(RSQLite::dbReadTable(con, "t", row.names=NULL))
}
data.frame's own row-appending + custom environment.
create.4<-function(elems)
{
env<-new.env()
env$dt<-as.data.frame(elems)
return(env)
}
append.4<-function(env, elems)
{
env$dt[nrow(env$dt)+1,]<-elems
return(env)
}
access.4<-function(env)
{
return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all with indirect calling. (I checked: using do.call instead of calling the functions directly doesn't makes the code run measurable longer).
test<-function(id, n=1000)
{
n<-n-1
el<-list(a=1,b=2,c=3,d=4)
o<-do.call(paste0('create.',id),list(el))
s<-paste0('append.',id)
for (i in 1:n)
{
o<-do.call(s,list(o,el))
}
return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n=10 insertions.
I also added a 'placebo' functions (with suffix 0) that don't perform anything - just to measure the overhead of the test setup.
r<-microbenchmark(test(0,n=10), test(1,n=10),test(2,n=10),test(3,n=10), test(4,n=10))
autoplot(r)
For 1E5 rows (measurements done on Intel(R) Core(TM) i7-4710HQ CPU # 2.50GHz):
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
It looks like the SQLite-based sulution, although regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append rather small number of rows (n<=100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation and ignore the fact that the data.frame is not pre-populated.
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
Update with purrr, tidyr & dplyr
As the question is already dated (6 years), the answers are missing a solution with newer packages tidyr and purrr. So for people working with these packages, I want to add a solution to the previous answers - all quite interesting, especially .
The biggest advantage of purrr and tidyr are better readability IMHO.
purrr replaces lapply with the more flexible map() family,
tidyr offers the super-intuitive method add_row - just does what it says :)
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would make it rank second right after data.table (if your ignore the placebo) in the benchmark by #Adam Ryczkowski:
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
A more generic solution for might be the following.
extendDf <- function (df, n) {
withFactors <- sum(sapply (df, function(X) (is.factor(X)) )) > 0
nr <- nrow (df)
colNames <- names(df)
for (c in 1:length(colNames)) {
if (is.factor(df[,c])) {
col <- vector (mode='character', length = nr+n)
col[1:nr] <- as.character(df[,c])
col[(nr+1):(n+nr)]<- rep(col[1], n) # to avoid extra levels
col <- as.factor(col)
} else {
col <- vector (mode=mode(df[1,c]), length = nr+n)
class(col) <- class (df[1,c])
col[1:nr] <- df[,c]
}
if (c==1) {
newDf <- data.frame (col ,stringsAsFactors=withFactors)
} else {
newDf[,c] <- col
}
}
names(newDf) <- colNames
newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
# l i n c t
# 1 TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Lets take a vector 'point' which has numbers from 1 to 5
point = c(1,2,3,4,5)
if we want to append a number 6 anywhere inside the vector then below command may come handy
i) Vectors
new_var = append(point, 6 ,after = length(point))
ii) columns of a table
new_var = append(point, 6 ,after = length(mtcars$mpg))
The command append takes three arguments:
the vector/column to be modified.
value to be included in the modified vector.
a subscript, after which the values are to be appended.
simple...!!
Apologies in case of any...!
My solution is almost the same as the original answer but it doesn't worked for me.
So, I gave names for the columns and it works:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
"col2" = xtweets$text))
Short of working on a machine with more RAM, how can I work with large lists in R, for example put them on disk and then work on sections of it?
Here's some code to generate the type of lists I'm using
n = 50; i = 100
WORD <- vector(mode = "integer", length = n)
for (i in 1:n){
WORD[i] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}
dat <- data.frame(WORD = WORD,
COUNTS = sample(1:50, n, replace = TRUE))
dat_list <- lapply(1:i, function(i) dat)
In my actual use case each data frame in the list is unique, unlike the quick example here. I'm aiming for n = 4000 and i = 100,000
This is one example of what I want to do with this list of dataframes:
FUNC <- function(x) {rep(x$WORD, times = x$COUNTS)}
la <- lapply(dat_list, FUNC)
With my actual use case this runs for a few hours, fills up the RAM and most of the swap and then RStudio freezes and shows a message with a bomb on it (RStudio was forced to terminate due to an error in the R session).
I see that bigmemory is limited to matrices and ff doesn't seem to handle lists. What are the other options? If sqldf or a related out-of-memory method possible here, how might I get started? I can't get enough out of the documentation to make any progress and would be grateful for any pointers. Note that instructions to "buy more RAM" will be ignored! This is for a package that I'm hoping will be suitable for average desktop computers (ie. undergrad computer labs).
UPDATE Followining up on the helpful comments from SimonO101 and Ari, here's some benchmarking comparing dataframes and data.tables, loops and lapply, and with and without gc
# self-contained speed test of untable
n = 50; i = 100
WORD <- vector(mode = "integer", length = n)
for (i in 1:n){
WORD[i] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}
# as data table
library(data.table)
dat_dt <- data.table(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_dt <- lapply(1:i, function(i) dat_dt)
# as data frame
dat_df <- data.frame(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_df <- lapply(1:i, function(i) dat_df)
# increase object size
y <- 10
dt <- c(rep(dat_list_dt, y))
df <- c(rep(dat_list_df, y))
# untable
untable <- function(x) rep(x$WORD, times = x$COUNTS)
# preallocate objects for loop to fill
df1 <- vector("list", length = length(df))
dt1 <- vector("list", length = length(dt))
df3 <- vector("list", length = length(df))
dt3 <- vector("list", length = length(dt))
# functions for lapply
df_untable_gc <- function(x) { untable(df[[x]]); if (x%%10) invisible(gc()) }
dt_untable_gc <- function(x) { untable(dt[[x]]); if (x%%10) invisible(gc()) }
# speedtests
library(microbenchmark)
microbenchmark(
for(i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i%%10) invisible(gc()) },
for(i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i%%10) invisible(gc()) },
df2 <- lapply(1:length(df), function(i) df_untable_gc(i)),
dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)),
for(i in 1:length(df)) { df3[[i]] <- untable(df[[i]])},
for(i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]])},
df4 <- lapply(1:length(df), function(i) untable(df[[i]]) ),
dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]) ),
times = 10)
And here are the results, without explicit garbage collection, data.table is much faster and lapply slightly faster than a loop. With explicit garbage collection (as I think SimonO101 might be suggesting) they are all much the same speed - a lot slower! I know that using gc is a bit controversial and probably not helpful in this case, but I'll give it a shot with my actual use-case and see if it makes any difference. Of course I don't have any data on memory use for any of these functions, which is really my main concern. Seems that there is no function for memory benchmarking equivalent to the timing functions (for windows, anyway).
Unit: milliseconds
expr
for (i in 1:length(df)) { df1[[i]] <- untable(df[[i]]) if (i%%10) invisible(gc()) }
for (i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]) if (i%%10) invisible(gc()) }
df2 <- lapply(1:length(df), function(i) df_untable_gc(i))
dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i))
for (i in 1:length(df)) { df3[[i]] <- untable(df[[i]]) }
for (i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]]) }
df4 <- lapply(1:length(df), function(i) untable(df[[i]]))
dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]))
min lq median uq max neval
37436.433962 37955.714144 38663.120340 39142.350799 39651.88118 10
37354.456809 38493.268121 38636.424561 38914.726388 39111.20439 10
36959.630896 37924.878498 38314.428435 38636.894810 39537.31465 10
36917.765453 37735.186358 38106.134494 38563.217919 38751.71627 10
28.200943 29.221901 30.205502 31.616041 34.32218 10
10.230519 10.418947 10.665668 12.194847 14.58611 10
26.058039 27.103217 27.560739 28.189448 30.62751 10
8.835168 8.904956 9.214692 9.485018 12.93788 10
If you really are going to be using very large data you can use the h5r package to write hdf5 files. You would be writing to and reading from your hard drive on the fly instead of using RAM. I have not used this so I can be of little help on it's general usage, I mention this because I think there's is no tutorial for it. I got this idea by thinking about pytables. Not sure if this solution is appropriate for you.
I am trying to use cor.test over the rows in 2 matrices, namely cer and par.
cerParCorTest <-mapply(function(x,y)cor.test(x,y),cer,par)
mapply,however, works on columns.
This issue has been discussed in Efficient apply or mapply for multiple matrix arguments by row . I tried that split solution (as below)
cer <- split(cer, row(cer))
par <- split(par, row(par))
and it results in the error (plus it is slow)
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
I also tried t(par) and t(cer) to get it running over the rows, but it results in the error
Error in cor.test.default(x, y) : not enough finite observations
The martices are shown below (for cer and same in par):
V1698 V1699 V1700 V1701
YAL002W(cer) 0.01860500 0.01947700 0.02043300 0.0214740
YAL003W(cer) 0.07001600 0.06943900 0.06891200 0.0684330
YAL005C(cer) 0.02298100 0.02391900 0.02485800 0.0257970
YAL007C(cer) -0.00026047 -0.00026009 -0.00026023 -0.0002607
YAL008W(cer) 0.00196200 0.00177360 0.00159490 0.0014258
My question is why transposing the matrix does not work and what is a short solution that will allow running over rows with mapply for cor.test().
I apologise for the long post and thanks in advance for any help.
I don't know what are the dimensions of your matrix , but this works fine for me
N <- 3751 * 1900
cer.m <- matrix(1:N,ncol=1900)
par.m <- matrix(1:N+rnorm(N),ncol=1900)
ll <- mapply(cor.test,
split(par.m,row(par.m)),
split(cer.m,row(cer.m)),
SIMPLIFY=FALSE)
this will give you a list of 3751 elements(the correlation for each row)
EDIT without split, you give the index of the row , this should be fast
ll <- mapply(function(x,y)cor.test(cer.m[x,],par.m[y,]),
1:nrow(cer.m),
1:nrow(cer.m),
SIMPLIFY=FALSE)
EDIT2 how to get the estimate value:
To get the estimate value for example :
sapply(ll,'[[','estimate')
You could always just program things in a for loop, seems reasonably fast on these dimensions:
x1 <- matrix(rnorm(10000000), nrow = 2000)
x2 <- matrix(rnorm(10000000), nrow = 2000)
out <- vector("list", nrow(x1))
system.time(
for (j in seq_along(out)) {
out[[j]] <- cor.test(x1[j, ], x2[j, ])
}
)
user system elapsed
1.35 0.00 1.36
EDIT: If you only want the estimate, I wouldn't store the results in a list, but a simple vector:
out2 <- vector("numeric", nrow(x1))
for (j in seq_along(out)) {
out2[j] <- cor.test(x1[j, ], x2[j, ])$estimate
}
head(out2)
If you want to store all the results and simply extract the estimate from each, then this should do the trick:
> out3 <- as.numeric(sapply(out, "[", "estimate"))
#Confirm they are the same
> all.equal(out2, out3)
[1] TRUE
The tradeoff is that the first method stores all the data in a list which may be useful for further processing vs a mroe simple method that only grabs what you initially want.