I have a data table with nrow being around a million or two and ncol of about 200.
Each entry in a row has a coordinate associated with it.
Tiny portion of the data:
[1,] -2.80331471 -0.8874522 -2.34401863 -3.811584 -2.1292443
[2,] 0.03177716 0.2588624 0.82877467 1.955099 0.6321881
[3,] -1.32954665 -0.5433407 -2.19211837 -2.342554 -2.2142461
[4,] -0.60771429 -0.9758734 0.01558774 1.651459 -0.8137684
Coordinates for the first 4 rows:
9928202 9928251 9928288 9928319
What I would like is a function that given the data and window-size would return a data table of the same size with a mean sliding window applied on each column. Or in other words - for each row entry i it would find entries with coordinates between coords[i]-windsize and coords[i]+windsize and replace the initial value with the mean of the values inside that interval (separately for each column).
Speed is the main issue here.
Here is my first take of such function.
doSlidingWindow <- function(intensities, coords, windsize) {
    windHalfSize <- ceiling(windsize/2)
    ### whole range inds
    RANGE <- integer(max(coords)+windsize)
    RANGE[coords] <- c(1:length(coords)[1])
    ### get indices of rows falling in each window
    COORDS <- as.list(coords)
    WINDOWINDS <- sapply(COORDS, function(crds){
        unique(RANGE[(crds-windHalfSize):(crds+windHalfSize)])
    })
    ### do windowing
    wind_ints <- intensities
    wind_ints[] <- 0
    for(i in 1:length(coords)) {
        wind_ints[i,] <- apply(as.matrix(intensities[WINDOWINDS[[i]],]), 2, mean)
    }
    return(wind_ints)
}
The code before the last for loop is quite fast and gives me a list of the indexes I need for each entry. But then everything falls apart: I have to grind through the for loop a million times, take subsets of my data table, and make sure I have more than one row so that apply can work on all the columns at once.
My second approach is to stick the actual values into RANGE, fill the gaps with zeroes and run rollmean from the zoo package on it, repeated for each column. But this is redundant, since rollmean will go through all the gaps and in the end I only use the values at the original coordinates.
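For reference, a rough sketch of that zoo-based idea (my own illustration, not code from the post; it uses windsize as the rolling width, and note that the zero-filled gaps also enter each window mean):
library(zoo)
offset <- min(coords) - 1
full <- matrix(0, nrow = max(coords) - offset, ncol = ncol(intensities))
full[coords - offset, ] <- intensities                      # place rows at their coordinates
smoothed <- apply(full, 2, function(col) rollmean(col, k = windsize, fill = 0))
wind_ints <- smoothed[coords - offset, ]                    # keep only the original coordinates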
Any help to make it faster without going to C would be very appreciated.
Data generation:
N <- 1e5 # rows
M <- 200 # columns
W <- 10 # window size
set.seed(1)
intensities <- matrix(rnorm(N*M), nrow=N, ncol=M)
coords <- 8000000 + sort(sample(1:(5*N), N))
Original function with minor modifications I used for benchmarks:
doSlidingWindow <- function(intensities, coords, windsize) {
    windHalfSize <- ceiling(windsize/2)
    ### whole range inds
    RANGE <- integer(max(coords)+windsize)
    RANGE[coords] <- c(1:length(coords)[1])
    ### get indices of rows falling in each window
    ### NOTE: Each element of WINDOWINDS holds a zero. Not a big problem though.
    WINDOWINDS <- sapply(coords, function(crds) ret <- unique(RANGE[(crds-windHalfSize):(crds+windHalfSize)]))
    ### do windowing
    wind_ints <- intensities
    wind_ints[] <- 0
    for(i in 1:length(coords)) {
        # CORRECTION: when there is only one row in the window there was trouble
        wind_ints[i,] <- apply(matrix(intensities[WINDOWINDS[[i]],], ncol=ncol(intensities)), 2, mean)
    }
    return(wind_ints)
}
POSSIBLE SOLUTIONS:
1) data.table
data.table is known to be fast at subsetting, but this page (and others related to sliding windows) suggests that this is not the case here. Indeed, the data.table code is elegant, but unfortunately very slow:
require(data.table)
require(plyr)
dt <- data.table(coords, intensities)
setkey(dt, coords)
aaply(1:N, 1, function(i) dt[WINDOWINDS[[i]], sapply(.SD,mean), .SDcols=2:(M+1)])
2) foreach+doSNOW
Basic routine is easy to run in parallel, so, we can benefit from it:
require(doSNOW)
doSlidingWindow2 <- function(intensities, coords, windsize) {
    NC <- 2 # number of nodes in the cluster
    cl <- makeCluster(rep("localhost", NC), type="SOCK")
    registerDoSNOW(cl)
    N <- ncol(intensities)  # total number of columns
    chunk <- ceiling(N/NC)  # number of columns sent to a single node
    result <- foreach(i=1:NC, .combine=cbind, .export=c("doSlidingWindow")) %dopar% {
        start <- (i-1)*chunk+1
        end <- ifelse(i!=NC, i*chunk, N)
        doSlidingWindow(intensities[,start:end], coords, windsize)
    }
    stopCluster(cl)
    return(result)
}
The benchmark shows a notable speed-up on my dual-core processor:
system.time(res <- doSlidingWindow(intensities, coords, W))
# user system elapsed
# 306.259 0.204 307.770
system.time(res2 <- doSlidingWindow2(intensities, coords, W))
# user system elapsed
# 1.377 1.364 177.223
all.equal(res, res2, check.attributes=FALSE)
# [1] TRUE
3) Rcpp
Yes, I know you asked "without going to C". But please take a look: the code is inline and rather straightforward:
require(Rcpp)
require(inline)
doSlidingWindow3 <- cxxfunction(signature(intens="matrix", crds="numeric", wsize="numeric"), plugin="Rcpp", body='
#include <vector>
    Rcpp::NumericMatrix intensities(intens);
    const int N = intensities.nrow();
    const int M = intensities.ncol();
    Rcpp::NumericMatrix wind_ints(N, M);
    std::vector<int> coords = as< std::vector<int> >(crds);
    int windsize = ceil(as<double>(wsize)/2);
    for(int i=0; i<N; i++){
        // Simple search for window range (begin:end in coords)
        // Assumed that coords are non-decreasing
        int begin = (i-windsize)<0 ? 0 : (i-windsize);
        while(coords[begin]<(coords[i]-windsize)) ++begin;
        int end = (i+windsize)>(N-1) ? (N-1) : (i+windsize);
        while(coords[end]>(coords[i]+windsize)) --end;
        for(int j=0; j<M; j++){
            double result = 0.0;
            for(int k=begin; k<=end; k++){
                result += intensities(k,j);
            }
            wind_ints(i,j) = result/(end-begin+1);
        }
    }
    return wind_ints;
')
Benchmark:
system.time(res <- doSlidingWindow(intensities, coords, W))
# user system elapsed
# 306.259 0.204 307.770
system.time(res3 <- doSlidingWindow3(intensities, coords, W))
# user system elapsed
# 0.328 0.020 0.351
all.equal(res, res3, check.attributes=FALSE)
# [1] TRUE
I hope the results are motivating. While the data fits in memory, the Rcpp version is pretty fast. Say, with N <- 1e6 and M <- 100 I got:
user system elapsed
2.873 0.076 2.951
Naturally, once R starts using swap everything slows down. With really large data that doesn't fit in memory you should consider sqldf, ff or bigmemory.
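For illustration only, a minimal bigmemory sketch (file names are made up) that backs the matrix with a file on disk instead of RAM:
library(bigmemory)
# sketch: a file-backed matrix; blocks of it can be read and written like a regular matrix
big_ints <- filebacked.big.matrix(nrow = 1e6, ncol = 200, type = "double",
                                  backingfile = "intensities.bin",
                                  descriptorfile = "intensities.desc")
big_ints[1, ] <- rnorm(200)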
Rollapply works great with a small dataset. However, if you are working with several million rows (genomics) it is quite slow.
The following function is super fast:
data <- c(runif(100000, min=0, max=.1),runif(100000, min=.05, max=.1),runif(10000, min=.05, max=1), runif(100000, min=0, max=.2))
slideFunct <- function(data, window, step){
    total <- length(data)
    spots <- seq(from=1, to=(total-window), by=step)
    result <- vector(length = length(spots))
    for(i in 1:length(spots)){
        result[i] <- mean(data[spots[i]:(spots[i]+window)])
    }
    return(result)
}
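For example, a call on the simulated data above (a window of 1000 and a step of 100 are arbitrary values for illustration) returns one mean per step position:
sw <- slideFunct(data, 1000, 100)
length(sw)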
Related
I have a large three-dimensional array corresponding to a hyperspectral cube in SpatRaster format. I also have two smaller ones (smaller area) that correspond to reference images (dark and white) used to correct values in the pixels of the cube. The smaller ones only have one line of pixels and were originally SpatRasters but were transformed to array format to do some calculations. Lastly, I have a couple of constants.
For each line in the sample image, I need to subtract the line of pixels in dark image from the line of pixels in the sample image and then divide it by the line of pixels in the white image. Then, multiply by the constants and store the corrected line in a new array that eventually becomes a corrected spatRaster.
I wrote a script that does this as follows:
#hyperspectral cube
nlines_s <- 1412 #number of lines of pixels in the image
npixels_s <- 1024 #number of pixels within each line
nbands_s <- 448 #number of spectral bands within each pixel
sample_image <- terra::rast(array(runif(1), dim = c(nlines_s, npixels_s, nbands_s)))
# spatRasters already transformed to array type
dark_current <- array(runif(1),dim = c(1024,448))
white_reference <- array(runif(1),dim = c(1024,448))
#integration time constants
t_s <- 13
t_w <- 13
int_time <- t_w/t_s
#reflectance of reference material
R_ref <- 0.99
#Process that needs to be speed up
reflectance_image <- array(0, c(nlines_s, npixels_s, nbands_s))
for (fr in 1:nlines_s) {
    reflectance_image[fr, , ] <- as.matrix(R_ref * int_time * ((sample_image[fr, ,] - dark_current)/white_reference))
    print(paste0("Calculating reflectance: ", round((fr/nlines_s)*100, 2), "%"))
}
ri_r <- terra::rast(reflectance_image)
names(ri_r) <- names(sample_image)
The code works as is, but it takes a long time because of the sequential nature of the for loop. I suspect this can be optimized by applying the calculation to all the lines (rows) of the image at the same time. I've been trying to do so using apply functions but I have not succeeded (might have done it wrong, though).
What would be the solution that would speed up the processing time the most?
This may not be the absolute most optimal, but a simple RcppArmadillo function that takes base R arrays instead of SpatRaster objects speeds it up by almost an order of magnitude. Using #RobertHijmans's larger dataset (with 448 bands) and f3:
Rcpp::cppFunction(
    "arma::cube f4(arma::cube& img, const arma::mat& drk, const arma::mat& wht) {
        for(unsigned int i = 0; i < img.n_rows; i++) {
            img.row(i) -= drk;
            img.row(i) %= wht;
        }
        return(img);
    }",
    depends = "RcppArmadillo",
    plugins = "cpp11"
)
ref_image1 <- as.array(image)
system.time(ref_image1 <- f4(ref_image1, dark[1,,], R_ref*int_time/white[1,,]))
#> user system elapsed
#> 14.78 0.65 15.42
system.time(ref_image2 <- f3())
#> user system elapsed
#> 139.89 8.63 149.09
Note that the two solutions will be very slightly different due to the different order of the multiplications.
all.equal(ref_image1, as.array(ref_image2))
#> [1] "Mean relative difference: 3.520606e-08"
Here is your (adjusted) example data.
library(terra)
#terra 1.6.52
nr <- 1400
nc <- 1000
nl <- 400
set.seed(1)
image <- terra::rast(nrow=nr, ncol=nc, nlyr=nl, vals=runif(nr*nc*nl))
dark <- array(runif(nc*nl),dim = c(1, nc, nl))
white <- array(runif(nc*nl),dim = c(1, nc, nl))
int_time <- 1
R_ref <- 0.99
Your approach, slightly improved
f1 <- function() {
    ref <- array(0, c(nr, nc, nl))
    for (r in 1:nr) {
        ref[r, , ] <- as.matrix(((image[r, ] - dark)/white))
    }
    ref <- terra::rast(ref * (R_ref * int_time), ext=ext(image), crs=crs(image))
    names(ref) <- names(image)
    ref
}
A "terra" solution
f2 <- function() {
    d <- rast(dark, ext=ext(image), crs=crs(image))
    d <- disagg(d, c(nrow(image), 1))
    w <- rast(white, ext=ext(image), crs=crs(image))
    w <- disagg(w, c(nrow(image), 1))
    (R_ref * int_time) * (image - d) / w
}
The above is better but still clumsy. In terra version "terra 1.6.52" I have added support for arithmetic computations with a SpatRaster and a matrix. The columns in the matrix represent layers, the rows represent cells. I use that in f3 below.
f3 <- function() {
    dcur <- t(apply(dark, 2, c))
    wref <- t(apply(white, 2, c))
    (R_ref * int_time) * (image - dcur) / wref
}
And now with #jblood94's solution
Rcpp::cppFunction(
    "arma::cube f4(arma::cube& img, const arma::mat& drk, const arma::mat& wht) {
        for(unsigned int i = 0; i < img.n_rows; i++) {
            img.row(i) -= drk;
            img.row(i) %= wht;
        }
        return(img);
    }",
    depends = "RcppArmadillo",
    plugins = "cpp11"
)
f5 <- function() {
    ref_image1 <- as.array(image)
    ref_image1 <- f4(ref_image1, dark[1,,], R_ref*int_time/white[1,,])
    rast(ref_image1)
}
Comparison
system.time(f1())
# user system elapsed
# 132.72 15.28 148.19
system.time(f2())
# user system elapsed
# 97.58 14.28 112.20
system.time(f3())
# user system elapsed
# 20.75 7.02 27.78
system.time(f5())
# user system elapsed
# 31.74 8.67 40.45
ref_image1 <- as.array(image)
system.time(f4(ref_image1, dark[1,,], R_ref*int_time/white[1,,]))
# user system elapsed
# 16.41 0.81 17.24
Note that f3 depends on "terra 1.6.52". That is currently the development version; you can install it with install.packages('terra', repos='https://rspatial.r-universe.dev').
You would expect the RcppArmadillo function f4 to be faster than f3, because f3 is more general (the arrays can have any length, they do not need to consist of a single row). However, with f4 you first need to create an array, and the output needs to be put into a SpatRaster. I account for that in f5. This suggests that f3 is the fastest with these example data.
Results may also vary based on the amount of RAM available. If that becomes limiting, f3 will start writing the results to disk, and that can slow things down considerably (you have some control over this via terraOptions()).
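For example (a sketch; the value is arbitrary), you could let terra use a larger fraction of the available RAM before it spills to disk:
terraOptions(memfrac = 0.8)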
I have a large raster data (X) with a dimension of 32251*51333. The values of X are repetitions of another array (Y), which has a size of 3*10^6.
Now I want to change the values of X by matching it against each value of Y, for example I can program like this,
for (i in 1:length(Y)){
    X[X==Y[i]] = Z[i]  # Z is just another array with the same size as Y
}
The problem is that, first the index matching X[X==Y[i]] = Z[i] does not work because X is too large. After a few minutes the program just stops by giving an error "Error: cannot allocate vector of size 6.2 Gb".
Second, going over the loops from 1 to length(Y), even though Y is of size 10^6, may take "forever" to complete.
One approach came to my mind is to split X into small chunks and then do the index match for each chunk. But I feel this would still take a lot of time.
Is there a better way to achieve the above goal?
1st Update:
Thanks to the example provided by #Lyngbakr, I will elaborate this question further. Because the raster I'm working with is very large (32251*51333), it seems not possible to upload it. The example given by #Lyngbakr is very similar to what I want, except that the raster created is too small. Now following the example, I ran two tests by generating a much larger raster with dimension of 3000*2700. See code below.
#Method 1: Use subs
start_time <- Sys.time()
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow=3000,ncol = 2700))
df <- data.frame(Y, Z)
X <- subs(X, df)
end_time <- Sys.time()
end_time - start_time
#Time difference of 2.248908 mins
#Method 2: Use for loop
start_time <- Sys.time()
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow=3000,ncol = 2700))
for (i in 1:length(Y)){
    X[X==Y[i]] = Z[i]  # this indexing of R seems not efficient if X becomes large
}
end_time <- Sys.time()
end_time - start_time
#Time difference of 10.22717 secs
As you can see, a simple for loop is even more efficient than the subs function. Remember that the raster shown in the example is still smaller than what I work with (roughly 100 times smaller), and the array Y in the example is very small. So the question becomes: how to speed up Method 2, which is just a simple for loop?
You're looking for the subs function. I don't know if it works with large rasters, but here's how you'd try.
I load the raster package and create some dummy data. (It would be really helpful if you provide data in your question.) Then, I plot the results.
# Load library
library(raster)
#> Loading required package: sp
# Z holds values that will replace Y
Y <- 1:9
Z <- 91:99
# Create dummy raster
X <- raster(matrix(rep(Y, 3), ncol = 9))
# Examine raster
plot(X)
As you can see, X is just a bunch of Y vectors patched together. Next, I bind Y and Z together into a data frame df.
# Combine y & z into a data frame
df <- data.frame(Y, Z)
Finally, I use subs to replace Y values with Z values.
# Substitute Z for Y in X
X <- subs(X, df)
A quick look at the raster shows that the values have been replaced correctly.
# Examine raster
plot(X)
Created on 2019-06-25 by the reprex package (v0.2.1.9000)
Update
Rcpp is really helpful when performance is an issue. Below, I compare three methods:
Looping in R (from the question)
Using subs from the raster package
Looping in C++ using Rcpp
By the way, Sys.time() isn't a great way to examine performance, so I'd recommend microbenchmark instead.
# Load library
library(raster)
# Define vectors and raster
Y <- 1:9
Z <- 91:99
X <- raster(matrix(rep(Y, 3), nrow = 3000, ncol = 2700))
method_1 is the subs function.
# Using subs function
method_1 <- function(){
    df <- data.frame(Y, Z)
    X <- subs(X, df)
}
method_2 is your original looping approach.
# Using R loop
method_2 <- function(){
    for (i in 1:length(Y)){
        X[X==Y[i]] = Z[i]
    }
    X
}
method_3 is the looping approach implemented in C++.
# Using Rcpp loops
src <-
    "Rcpp::NumericMatrix subs_cpp(Rcpp::NumericMatrix X, Rcpp::NumericVector Y, Rcpp::NumericVector Z){
        for(int i = 0; i < Y.length(); ++i){
            for(int j = 0; j < X.ncol(); ++j){
                for(int k = 0; k < X.nrow(); ++k){
                    if(X(k, j) == Y(i)){
                        X(k, j) = Z(i);
                    }
                }
            }
        }
        return X;
    }"
Rcpp::cppFunction(src)
method_3 <- function(){
    subs_cpp(as.matrix(X), Y, Z)
}
And here I benchmark the approaches.
# Run benchmarking
microbenchmark::microbenchmark(method_1(), method_2(), method_3(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# method_1() 16861.5447 17737.2124 19321.5674 18628.8573 20117.0159 25506.208 10
# method_2() 671.2223 677.6029 1111.3935 738.6216 1657.0542 2163.137 10
# method_3() 316.9810 319.1484 481.3548 320.2337 326.7133 1477.454 10
As you can see, the Rcpp approach is by far the fastest.
You can also compare the output to ensure they produce the same result using a smaller raster.
# Examine all three outputs with smaller raster
X <- raster(matrix(rep(Y, 3), ncol = 9))
plot(method_1(), main = "Method 1")
plot(method_2(), main = "Method 2")
plot(raster(method_3()), main = "Method 3") # Needs to converted into a raster
And they all look alike. Note that for the third method, the result needs to be converted back to a raster from a matrix.
My question is extremely closely related to this one:
Split a vector into chunks in R
I'm trying to split a large vector into known chunk sizes and it's slow. A solution for vectors with even remainders is here:
A quick solution when a factor exists is here:
Split dataframe into equal parts based on length of the dataframe
I would like to handle the case of no (large) factor existing, as I would like fairly large chunks.
My example for a vector much smaller than the one in my real life application:
d <- 1:6510321
# Sloooow
chunks <- split(d, ceiling(seq_along(d)/2000))
Using llply from the plyr package I was able to reduce the time.
chunks <- function(d, n){
    chunks <- split(d, ceiling(seq_along(d)/n))
    names(chunks) <- NULL
    return(chunks)
}
require(plyr)
plyrChunks <- function(d, n){
    is <- seq(from = 1, to = length(d), by = ceiling(n))
    if(tail(is, 1) != length(d)) {
        is <- c(is, length(d))
    }
    chunks <- llply(head(seq_along(is), -1),
                    function(i){
                        start <- is[i]
                        end <- is[i+1] - 1
                        d[start:end]
                    })
    lc <- length(chunks)
    td <- tail(d, 1)
    chunks[[lc]] <- c(chunks[[lc]], td)
    return(chunks)
}
# testing
d <- 1:6510321
n <- 2000
system.time(chks <- chunks(d,n))
# user system elapsed
# 5.472 0.000 5.472
system.time(plyrChks <- plyrChunks(d, n))
# user system elapsed
# 0.068 0.000 0.065
identical(chks, plyrChks)
# TRUE
You can speed things up even more using the .parallel parameter of the llply function, or add a progress bar using the .progress parameter; see the sketch below.
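A rough sketch of what that could look like inside plyrChunks() (the doParallel backend here is my assumption; any registered foreach backend works, and progress bars only apply to serial runs):
library(doParallel)
registerDoParallel(cores = 2)
chunks <- llply(head(seq_along(is), -1),
                function(i) d[is[i]:(is[i+1] - 1)],
                .parallel = TRUE)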
A speed improvement from the parallel package:
chunks <- parallel::splitIndices(6510321, ncl = ceiling(6510321/2000))
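Note that splitIndices returns a list of index vectors rather than the chunk values themselves; if you need the values, a small extra step (sketch) does it:
chunkVals <- lapply(chunks, function(i) d[i])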
I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtracting the item in position [i] from every item after [i], but not from the items before [i] (if I were putting the output into a matrix it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this. I assume the first part of the answer is going to entail "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have 'outlist' I will reduce the number of edges based on some criteria (<, >, ==) so the final list (and graph) won't be quite so ponderous.
#Fake data of same approximate dimensions as real data
dlist<-sample(1:20,20, replace=FALSE)
#purge the output list before running the loop
rm(outlist)
outlist<-data.frame()
for(i in 1:(length(dlist)-1)){
    for(j in (i+1):length(dlist)){
        outlist <- rbind(outlist, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
    }
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite a big data, where it's essential to avoid copies wherever possible.
Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C; fast and memory efficient) and its assignment-by-reference operator :=, to avoid copies.
fn1 <- function(x) {
    require(data.table) ## 1.9.2
    lx = length(x)
    vx = as.integer(lx * (lx-1)/2)
    # R v3.1.0 doesn't copy on doing list(.) - so it should be even faster there
    ans = setDT(list(v1 = rep.int(head(x, -1L), (lx-1L):1L),
                     v2 = x[data.table:::vecseq(2:lx, (lx-1L):1, vx)]))
    ans[, v3 := v2-v1]
}
Benchmarking:
I'll benchmark with functions from other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0 because list(.) doesn't result in copy anymore.
fn2 <- function(x) {
    diffmat <- outer(x, x, "-")
    ss <- which(upper.tri(diffmat), arr.ind = TRUE)
    data.frame(v1 = x[ss[,1]], v2 = x[ss[,2]], v3 = diffmat[ss])
}
fn3 <- function(x) {
    idx <- combn(seq_along(x), 2)
    out2 <- data.frame(v1=x[idx[1, ]], v2=x[idx[2, ]])
    out2$v3 <- out2$v2-out2$v1
    out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2() due to use of outer requires quite a lot of memory (peak memory usage was >=19GB) and is slower than fn1(). fn3() is just very very slow (due to combn, and unnecessary copy).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all pairs of indices in the dataset rather than doing loops. This allows us to build the data.frame all at once rather than adding a row at a time.
We compare that to
out1 <- data.frame()
for(i in 1:(length(dlist)-1)){
    for(j in (i+1):length(dlist)){
        out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
    }
}
we see that
all(out1==out2)
# [1] TRUE
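(The two functions timed below are presumably just wrappers around the snippets above; the names are taken from the benchmark call.)
loops <- function() {
    out1 <- data.frame()
    for(i in 1:(length(dlist)-1)){
        for(j in (i+1):length(dlist)){
            out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
        }
    }
    out1
}
combdata <- function() {
    idx <- combn(1:N, 2)
    out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
    out2$dist <- out2$j - out2$i
    out2
}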
Plus, if we compare with microbenchmark we see that
microbenchmark(loops(), combdata())
# Unit: microseconds
# expr min lq median uq max neval
# loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166 100
# combdata() 684.316 800.384 873.5015 940.9215 4285.627 100
The method that doesn't use loops is much faster.
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec,vec,"-")
ss <- which(upper.tri(diffmat),arr.ind = TRUE)
data.frame(one = vec[ss[,1]],
           two = vec[ss[,2]],
           diff = diffmat[ss])
You need to preallocate your output structure; this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but is filled with, for example, NA's. See the sketch below.
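A minimal sketch of that idea for the nested loop above (variable names follow the question):
n <- length(dlist)
outmat <- matrix(NA_real_, nrow = n*(n-1)/2, ncol = 3)  # preallocated output
k <- 1
for (i in 1:(n-1)) {
    for (j in (i+1):n) {
        outmat[k, ] <- c(dlist[i], dlist[j], dlist[j] - dlist[i])
        k <- k + 1
    }
}
outlist <- as.data.frame(outmat)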
I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append a row to the end of the data frame. I started with the following code.
for (i in 1:10) {
    df$x = rbind(df$x, i)
    df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to avoid the additional line of code that would be required to update the indices on every iteration, and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately take. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
    df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
    for(i in 1:n){
        df$x[i] <- i
        df$y[i] <- toString(i)
    }
    df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
    x <- numeric(n)
    y <- character(n)
    for (i in 1:n) {
        x[i] <- i
        y[i] <- i
    }
    data.frame(x, y, stringsAsFactors=FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176 5
# f3(1000) 149.417636 150.529011 150.827393 151.02230 160.637845 5
# f4(1000) 7.872647 7.892395 7.901151 7.95077 8.049581 5
f1() (the approach below) is incredibly inefficient because of how often it calls data.frame and because growing objects that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
    df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that in your code, there is one other problem:
You should use stringsAsFactors if you want the characters to not get converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
    df <- data.frame(x = numeric(), y = character())
    for(i in 1:n){
        df <- rbind(df, data.frame(x = i, y = toString(i)))
    }
    df
}
# use list
f2 <- function(n){
    df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
    for(i in 1:n){
        df[i,] <- list(i, toString(i))
    }
    df
}
# pre-allocate space
f3 <- function(n){
    df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
    for(i in 1:n){
        df$x[i] <- i
        df$y[i] <- toString(i)
    }
    df
}
system.time(f1(1000))
# user system elapsed
# 1.33 0.00 1.32
system.time(f2(1000))
# user system elapsed
# 0.19 0.00 0.19
system.time(f3(1000))
# user system elapsed
# 0.14 0.00 0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use list, and the worst solution (at least based on these timing results) appears to be rbind.
Suppose you simply don't know the size of the data.frame in advance. It can well be a few rows or a few million. You need some sort of container that grows dynamically. Taking into consideration my experience and all related answers on SO, I come up with 4 distinct solutions:
rbindlist to the data.frame
Use data.table's fast set operation and couple it with manually doubling the table when needed.
Use RSQLite and append to the table held in memory.
data.frame's own ability to grow, using a custom environment (which has reference semantics) to store the data.frame so it will not be copied on return.
Here is a test of all the methods for both small and large number of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That is quite easy and straight-forward:
create.1 <- function(elems) {
    return(as.data.table(elems))
}
append.1 <- function(dt, elems) {
    return(rbindlist(list(dt, elems), use.names = TRUE))
}
access.1 <- function(dt) {
    return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2 <- function(elems) {
    return(as.data.table(elems))
}
append.2 <- function(dt, elems) {
    n <- attr(dt, 'rowcount')
    if (is.null(n))
        n <- nrow(dt)
    if (n == nrow(dt)) {
        tmp <- elems[1]
        tmp[[1]] <- rep(NA, n)
        dt <- rbindlist(list(dt, tmp), fill=TRUE, use.names=TRUE)
        setattr(dt, 'rowcount', n)
    }
    pos <- as.integer(match(names(elems), colnames(dt)))
    for (j in seq_along(pos)) {
        set(dt, i=as.integer(n+1), pos[[j]], elems[[j]])
    }
    setattr(dt, 'rowcount', n+1)
    return(dt)
}
access.2 <- function(elems) {
    n <- attr(elems, 'rowcount')
    return(as.data.table(elems[1:n,]))
}
SQL should be optimized for fast record insertion, so I initially had high hopes for the RSQLite solution.
This is basically a copy & paste of Karsten W.'s answer on a similar thread.
create.3 <- function(elems) {
    con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
    RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
    return(con)
}
append.3 <- function(con, elems) {
    RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append=TRUE)
    return(con)
}
access.3 <- function(con) {
    return(RSQLite::dbReadTable(con, "t", row.names=NULL))
}
data.frame's own row-appending + custom environment.
create.4 <- function(elems) {
    env <- new.env()
    env$dt <- as.data.frame(elems)
    return(env)
}
append.4 <- function(env, elems) {
    env$dt[nrow(env$dt)+1,] <- elems
    return(env)
}
access.4 <- function(env) {
    return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all via indirect calling. (I checked: using do.call instead of calling the functions directly doesn't make the code run measurably longer.)
test <- function(id, n=1000) {
    n <- n-1
    el <- list(a=1, b=2, c=3, d=4)
    o <- do.call(paste0('create.', id), list(el))
    s <- paste0('append.', id)
    for (i in 1:n) {
        o <- do.call(s, list(o, el))
    }
    return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n=10 insertions.
I also added 'placebo' functions (with suffix 0) that don't do anything, just to measure the overhead of the test setup.
r<-microbenchmark(test(0,n=10), test(1,n=10),test(2,n=10),test(3,n=10), test(4,n=10))
autoplot(r)
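(The 1E5-row figures below were obtained the same way; presumably something along the lines of the call below, where times = 1 is my assumption given the run times involved.)
r <- microbenchmark(test(0, n=1e5), test(1, n=1e5), test(2, n=1e5),
                    test(3, n=1e5), test(4, n=1e5), times = 1)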
For 1E5 rows (measurements done on Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz):
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
It looks like the SQLite-based solution, although it regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append a rather small number of rows (n<=100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation and ignore the fact that the data.frame is not pre-populated.
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
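For instance, a small sketch of that bracket-notation assignment:
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for (i in 1:10) {
    df[nrow(df) + 1, ] <- list(i, toString(i))
}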
Update with purrr, tidyr & dplyr
As the question is already dated (6 years), the answers are missing a solution with the newer packages tidyr and purrr. So for people working with these packages, I want to add a solution to the previous answers, which are all quite interesting.
The biggest advantage of purrr and tidyr is better readability, IMHO.
purrr replaces lapply with the more flexible map() family,
tidyr offers the super-intuitive method add_row - it just does what it says :)
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would make it rank second right after data.table (if you ignore the placebo) in the benchmark by #Adam Ryczkowski:
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
A more generic solution might be the following.
extendDf <- function(df, n) {
    withFactors <- sum(sapply(df, function(X) (is.factor(X)))) > 0
    nr <- nrow(df)
    colNames <- names(df)
    for (c in 1:length(colNames)) {
        if (is.factor(df[,c])) {
            col <- vector(mode='character', length = nr+n)
            col[1:nr] <- as.character(df[,c])
            col[(nr+1):(n+nr)] <- rep(col[1], n) # to avoid extra levels
            col <- as.factor(col)
        } else {
            col <- vector(mode=mode(df[1,c]), length = nr+n)
            class(col) <- class(df[1,c])
            col[1:nr] <- df[,c]
        }
        if (c==1) {
            newDf <- data.frame(col, stringsAsFactors=withFactors)
        } else {
            newDf[,c] <- col
        }
    }
    names(newDf) <- colNames
    newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
# l i n c t
# 1 TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Let's take a vector 'point' which has the numbers 1 to 5:
point = c(1,2,3,4,5)
If we want to append the number 6 anywhere inside the vector, the command below may come in handy:
i) Vectors
new_var = append(point, 6 ,after = length(point))
ii) columns of a table
new_var = append(point, 6 ,after = length(mtcars$mpg))
The append command takes three arguments:
the vector/column to be modified;
the value to be included in the modified vector;
a subscript after which the values are to be appended.
Simple! Apologies in case of any mistakes.
My solution is almost the same as the original answer, but it didn't work for me. So I gave names to the columns and it works:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
                                   "col2" = xtweets$text))