I am new to StackOverflow, even though I have been playing with R for a while. I am struggling with a problem for which I was not able to find any answer on the site. Please correct me if my search was not accurate enough.
I have two 3D arrays, in this simplified case 256x256x200. The first one is a field; the second one is made of indices spanning from 1 to 8. I want to compute the average on each vertical level according to the values and the counts of the indices, i.e. the average of the field at each of the 200 levels for each index (from 1 to 8). This should be done only if there are enough counts of a given index (i.e. an if condition within the loop). My output must be a matrix of 8x200.
For the example I create two random arrays. Below is the basic code I am using:
nz=200
lev=1:nz
indices=8
var0=array(rnorm(256*256*nz),dim=c(256,256,nz))
#octo=array(sample(1:indices),dim=c(256,256,nz))
octo=array(sample(1:indices,size=256*256*nz,replace=T),dim=c(256,256,nz))
counts=apply(octo,3,function(x) table(factor(x,levels=1:indices)))
#thr=0.1
thr=0.125
np=length(var0[,1,1])*length(var0[1,,1])
profile=array(NA,dim=c(nz,indices))
t0=proc.time()
for (i in 1:indices)
{
for (z in 1:length(lev))
{
if (counts[i,z]/np>thr)
{v0=var0[,,z]; profile[z,i]=counts[i,z]/np*mean(v0[octo[,,z]==i],na.rm=T)}
}
}
print(proc.time()-t0)
user system elapsed
5.169 0.001 5.170
I tried the apply family of functions, but I was not able to write it in a reasonable and efficient way, considering that each computation needs to take into account a "dynamic" variable that changes with the level (i.e. the octo and counts variables). My real case involves much bigger matrices and this has to be done on dozens of fields, so time is quite relevant.
Are you aware of any faster alternatives?
Many thanks for any help!
EDIT: I corrected the original definition of octo and I adjusted the threshold thr. In this way the if condition makes sense, since it is not always satisfied.
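For reference, a quick sanity check on how often the if condition is now met: with replacement sampling each index covers roughly 1/8 of a level, so the fraction should come out around 0.5.
# rough check of how often counts/np exceeds the threshold (roughly 0.5 here)
mean(counts / np > thr)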
Here's a data.table reshape solution that avoids loops and apply statements:
nz=200
lev=1:nz
indices=8
var0=array(rnorm(256*256*nz),dim=c(256,256,nz))
octo=array(sample(1:indices),dim=c(256,256,nz))
counts=apply(octo,3,function(x) table(factor(x,levels=1:indices)))
thr=0.1
np=length(var0[,1,1])*length(var0[1,,1])
profile=array(NA,dim=c(nz,indices))
# From here load data.table to do the manipulation
# reshape2 to convert back into a matrix at the end
library(data.table)
library(reshape2)
# Take the data long and convert to data.table
var01 <- setDT(melt(var0))
octo1 <- setDT(melt(octo))
# Join the data to get corresponding data
# EDIT, it currently works, but I think that's because all data is defined
# adding nomatch in case of missing data
octo1 <- octo1[var01, on = c('Var1','Var2','Var3'), nomatch = NA]
# Make our calculation grouping by the vertical dimension and the value
profile <- octo1[,if(.N/np > thr) .N / np * mean(i.value, na.rm = TRUE) else NA, by = .(value,Var3)]
# Recast to matrix
profile <- acast(profile, value ~ Var3, mean, value.var = 'V1')
I think I found a good solution with sapply, including the thr condition:
f1<-function()
{
for (i in 1:indices)
{
for (z in 1:length(lev))
{
if (counts[i,z]/np>thr)
{v0=var0[,,z]; profile[z,i]=counts[i,z]/np*mean(v0[octo[,,z]==i],na.rm=T)}
}
}
return(profile)
}
f2<-function()
{
profile=sapply(lev, function(i) {
v0=var0[,,i];
mV=sapply(1:indices, function(j) {mean(v0[octo[,,i] == j], na.rm = TRUE)})
counts[,i]/np*mV
})
profile[counts/np <= thr]=NA
profile<-matrix(profile, nz, indices, byrow = TRUE)
return(profile)
}
f3<-function()
{
profile=sapply(lev, function(i) {
v0=var0[,,i];
mV=sapply(1:indices, function(j) {if (counts[j,i]/np>thr) {mean(v0[octo[,,i] == j], na.rm = TRUE)} else {NA}})
counts[,i]/np*mV
})
profile<-matrix(profile, nz, indices, byrow = TRUE)
return(profile)
}
Here f1() is the original, f2() is @parksw3's version and f3() is my slightly improved version.
benchmark(f1(),f2(),f3(),replications=10)
test replications elapsed relative user.self sys.self user.child sys.child
1 f1() 10 27.382 1.411 27.375 0 0 0
2 f2() 10 35.195 1.814 35.186 0 0 0
3 f3() 10 19.403 1.000 19.392 0 0 0
This way it is always faster than the standard loop. The data.table approach is likely faster still, but it requires a full change of data structure that I cannot make for the moment. Hope this helps!
This seems to be faster on my machine:
profile2 <- sapply(lev, function(i){
v0 <- var0[,,i]
mV <- sapply(1:indices, function(j){
mean(v0[octo[,,i] == j], na.rm = TRUE)
})
counts[,i]/np*mV
})
profile2[counts/np > thr] <- NA
profile2<- t(profile2)
all.equal(profile, profile2)
## TRUE
I tried comparing them with the microbenchmark package but it takes fairly long... Here's a quick comparison I did with the rbenchmark package:
f1 <- function(){
for (i in 1:indices){
for (z in 1:length(lev)) {
if (counts[i,z]/np>thr){
v0=var0[,,z]; profile[z,i]=counts[i,z]/np*mean(v0[octo[,,z]==i],na.rm=T)
}
}
}
}
f2 <- function(){
    prof <- sapply(lev, function(i){
        v0 <- var0[,,i]
        mV <- sapply(1:indices, function(j){
            mean(v0[octo[,,i] == j], na.rm = TRUE)
        })
        counts[,i]/np*mV
    })
    prof[counts/np > thr] <- NA
    prof <- t(prof)
}
library(rbenchmark)
benchmark(f1(), f2(), replications = 10)
I put both pieces of code into functions and tested them. Here's the result:
## test replications elapsed relative user.self sys.self
## 1 f1() 10 89.03 1.342 85.15 1.72
## 2 f2() 10 66.34 1.000 61.50 0.75
I have 2 data frames in R, one of which is a subset of the other. I had to do some manipulations on it and calculate, for 6 x-values (DayTreat in the code), the percentage that the subsetted data represents of the main data frame. So I created a function to do the calculation and create a new column. My issue is that it's painfully slow. Any suggestions?
percDay <- function(fullDat, subDat)
{
subDat$DaySum <- NULL
for (i in fullDat$DayTreat) # for each DayTreat value in fullDat. Must be `psmelt()` made phyloseq object
{
r <- sum(fullDat$Abundance[fullDat$DayTreat == i]) # Take the sum of all the taxa for that day
subDat$DaySum[subDat$DayTreat == i] <- r # Add the value to the subset of data
}
subDat$DayPerc <- (subDat$Abundance/subDat$DaySum) # Make the percentage of the subset
subDat
}
Examining your code, it looks like you are doing redundant calculations.
the line:
for (i in fullDat$DayTreat)
should be:
for (i in unique(fullDat$DayTreat))
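For illustration, here is the percDay() function from the question with only that change applied (a sketch; the rest of the body is unchanged):
percDay <- function(fullDat, subDat) {
  subDat$DaySum <- NULL
  for (i in unique(fullDat$DayTreat)) {   # visit each distinct day only once
    r <- sum(fullDat$Abundance[fullDat$DayTreat == i])
    subDat$DaySum[subDat$DayTreat == i] <- r
  }
  subDat$DayPerc <- subDat$Abundance / subDat$DaySum
  subDat
}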
After that you could use data.table and avoid keeping separate data frames,
since you say that one is a subset of another:
require(data.table)
setDT(fullDat)
fullDat[, subsetI := Abundance > 30] # for example, should be your Condition
fullDat[, DaySum:= sum(Abundance), by = DayTreat]
fullDat[, DayPerc := Abundance/DaySum]
# get subset:
fullDat[subsetI == T]
If you would provide example data and desired output, it could be possible to supply more concrete code.
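In the meantime, here is a tiny made-up example just to show how the snippet above behaves (the Abundance > 30 condition is only a placeholder for your real subsetting rule):
require(data.table)
# hypothetical example data, since none was provided in the question
fullDat <- data.frame(Abundance = c(10, 20, 30, 40, 50, 60),
                      DayTreat  = c(1, 1, 2, 2, 3, 3))
setDT(fullDat)
fullDat[, subsetI := Abundance > 30]             # placeholder subset condition
fullDat[, DaySum  := sum(Abundance), by = DayTreat]
fullDat[, DayPerc := Abundance / DaySum]
fullDat[subsetI == TRUE]                         # the rows belonging to the "subset"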
So, at a high level, I think the solutions are:
Use faster data classes if you aren't already
Avoid for loops
vectorize manually, or
rely on faster functions/libraries that use more C code and/or have more vectorization "under the hood"
Try data.table and/or tidyverse for greater speed and cleaner code
Benchmark and profile your code
Example:
require(tidyverse)
require(data.table)
percDay <- function(fullDat, subDat)
{
subDat$DaySum <- NULL
for (i in fullDat$DayTreat) # for each DayTreat value in fullDat. Must be `psmelt()` made phyloseq object
{
r <- sum(fullDat$Abundance[fullDat$DayTreat == i]) # Take the sum of all the taxa for that day
subDat$DaySum[subDat$DayTreat == i] <- r # Add the value to the subset of data
}
subDat$DayPerc <- (subDat$Abundance/subDat$DaySum) # Make the percentage of the subset
subDat
}
# My simulation of your data.frame:
fullDat <- data.frame(Abundance=rnorm(200),
DayTreat=c(1:100,1:100))
subDat <- dplyr::sample_frac(fullDat, .25)
# Your function modifies the data, so I'll make a copy. For a potential
# speed improvement I'll try data.table class
fullDat0 <- as.data.table(fullDat)
subDat0 <- as.data.table(subDat)
require(rbenchmark)
benchmark("original" = {
percDay(fullDat, subDat)
},
"example_improvement" = {
# Tidy approach
tmp <- fullDat0 %>%
group_by(DayTreat) %>%
summarize(DaySum = sum(Abundance))
subDat0 <- merge(subDat, tmp, by="DayTreat") # could use semi_join
subDat0$DayPerc <- (subDat0$Abundance/subDat0$DaySum) # could use mutate
},
replications = 100,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
test replications elapsed relative user.self sys.self
example_improvement 100 0.22 1.000 0.22 0.00
original 100 1.42 6.455 1.23 0.01
Typically a data.table approach is going to have the greatest speed. The tibble-based "tidy" approach has clearer syntax whilst typically being faster than data.frame but slower than data.table. An experienced data.table expert like @akrun could offer a maximal performance solution, probably using just a single data.table statement.
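As a rough, untested sketch of what such a single-statement data.table solution might look like (an update join, assuming the same Abundance and DayTreat columns as above):
library(data.table)
fullDT <- as.data.table(fullDat)
subDT  <- as.data.table(subDat)
# join the per-day totals onto the subset and add both columns by reference
subDT[fullDT[, .(DaySum = sum(Abundance)), by = DayTreat],
      on = "DayTreat",
      `:=`(DaySum = i.DaySum, DayPerc = Abundance / i.DaySum)]
subDT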
I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtract the item in position [i] from every item after [i], but not from the items before [i] (if I were putting the output into a matrix, it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this? I assume the first part of the answer is going to entail "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have 'outlist' I will reduce the number of edges based on some criteria (<, >, ==) so the final list (and graph) won't be quite so ponderous.
#Fake data of same approximate dimensions as real data
dlist<-sample(1:20,20, replace=FALSE)
#purge the output list before running the loop
rm(outlist)
outlist<-data.frame()
for(i in 1:(length(dlist)-1)){
for(j in (i+1):length(dlist)){
outlist<-rbind(outlist, c(dlist[i],dlist[j], dlist[j]-dlist[i]))
}
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite a lot of data, so it's essential to avoid copies wherever possible.
Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C, fast and memory efficient) and makes use of its assignment-by-reference operator :=, to avoid copies.
fn1 <- function(x) {
require(data.table) ## 1.9.2
lx = length(x)
vx = as.integer(lx * (lx-1)/2)
# R v3.1.0 doesn't copy on doing list(.) - so should be even more faster there
ans = setDT(list(v1 = rep.int(head(x,-1L), (lx-1L):1L),
v2=x[data.table:::vecseq(2:lx, (lx-1L):1, vx)]))
ans[, v3 := v2-v1]
}
Benchmarking:
I'll benchmark with functions from other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0 because list(.) doesn't result in copy anymore.
fn2 <- function(x) {
diffmat <- outer(x, x, "-")
ss <- which(upper.tri(diffmat), arr.ind = TRUE)
data.frame(v1 = x[ss[,1]], v2 = x[ss[,2]], v3 = diffmat[ss])
}
fn3 <- function(x) {
idx <- combn(seq_along(x), 2)
out2 <- data.frame(v1=x[idx[1, ]], v2=x[idx[2, ]])
out2$v3 <- out2$v2-out2$v1
out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2() due to use of outer requires quite a lot of memory (peak memory usage was >=19GB) and is slower than fn1(). fn3() is just very very slow (due to combn, and unnecessary copy).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all pairs of indices in the data set rather than using loops. This lets us build the data.frame all at once rather than adding a row at a time.
We compare that to
out1 <- data.frame()
for(i in 1:(length(dlist)-1)){
for(j in (i+1):length(dlist)){
out1<-rbind(out1, c(dlist[i],dlist[j], dlist[j]-dlist[i]))
}
}
we see that
all(out1==out2)
# [1] TRUE
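(For reference, loops() and combdata() used in the benchmark below simply wrap the two code chunks above; the wrappers were not shown, so this is a sketch of what they would look like.)
library(microbenchmark)

loops <- function() {
  out1 <- data.frame()
  for (i in 1:(length(dlist) - 1)) {
    for (j in (i + 1):length(dlist)) {
      out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
    }
  }
  out1
}

combdata <- function() {
  idx <- combn(1:N, 2)
  out2 <- data.frame(i = dlist[idx[1, ]], j = dlist[idx[2, ]])
  out2$dist <- out2$j - out2$i
  out2
}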
Plus, if we compare with microbenchmark we see that
microbenchmark(loops(), combdata())
# Unit: microseconds
# expr min lq median uq max neval
# loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166 100
# combdata() 684.316 800.384 873.5015 940.9215 4285.627 100
The method that doesn't use loops is much faster.
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec,vec,"-")
ss <- which(upper.tri(diffmat),arr.ind = TRUE)
data.frame(one = vec[ss[,1]],
two = vec[ss[,2]],
diff = diffmat[ss])
You need to preallocate your output structure; this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but filled with, for example, NAs.
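A minimal sketch of that idea for the example above (assuming the same dlist): preallocate a matrix with the known number of pairs, fill it by index, and convert once at the end.
n <- length(dlist)
out <- matrix(NA_real_, nrow = n * (n - 1) / 2, ncol = 3)  # one row per pair, allocated up front
k <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    k <- k + 1
    out[k, ] <- c(dlist[i], dlist[j], dlist[j] - dlist[i])
  }
}
outlist <- as.data.frame(out)   # convert to a data.frame once, at the end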
I am trying to develop a function to "synchronise" NAs among layers of a raster stack, i.e. to make sure that for any given pixel of the stack, if one layer has a NA, then all layers should be set to NA for that pixel.
This is particularly useful when combining rasters coming from varying sources for species distribution modelling, because some models do not handle NAs properly.
I have found two ways to do this, but I find neither of them satisfactory. One of them requires the function getValues and thus is not usable for very large stacks or computers with low RAM. The other one is more memory-safe but is much slower. I am therefore here to ask if anyone has an idea to improve on my attempts.
Here are the two possibilities:
Using getValues()
syncNA1 <- function (x)
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
}
Using calc()
syncNA2 <- function(y)
{
calc(y, na.rm = T, fun = function(x, na.rm = na.rm)
{
if(any(is.na(x)))
{
rep(NA, length(x))
} else
{
x
}
})
}
Now a demonstration of their respective computing times for the same stack:
> system.time(
+ b1 <- syncNA1(a1)
+ )
user system elapsed
3.04 0.15 3.20
> system.time(
+ b2 <- syncNA2(a1)
+ )
user system elapsed
5.89 0.19 6.08
Many thanks for your help,
Boris
With a stack named "s", I would first use calc(s, fun = sum) to compute a mask layer that records the location of all cells with an NA value in at least one of the stack's layers. mask() will then allow you to apply that mask to every layer in the stack.
Here's an example:
library(raster)
## Construct reproducible data! (Here a raster stack with NA values in each layer)
m <- raster(ncol=10, nrow=10)
n <- raster(ncol=10, nrow=10)
m[] <- runif(ncell(m))
n[] <- runif(ncell(n)) * 10
m[m < 0.5] <- NA
n[n < 5] <- NA
s <- stack(m,n)
## Synchronize the NA values
s2 <- mask(s, calc(s,fun = sum))
## Check that it worked
plot(s2)
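A quick programmatic check that the NA cells now agree across layers (each cell should be NA in either none or all of the layers):
v <- getValues(s2)
all(rowSums(is.na(v)) %in% c(0, nlayers(s2)))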
I don't know about speed, but you might try converting to an array, filling in the NAs, and converting back. Pseudocode:
xarray <- as.array(xstack)
ind.na <- which(is.na(xarray), arr.ind = TRUE)   # arr.ind, not array.ind
for (j in seq_len(nrow(ind.na))) {
    xarray[ind.na[j, 1], ind.na[j, 2], ] <- NA
}
nastack <- stack(brick(xarray))   # brick() accepts a 3D array; raster() does not
I haven't verified the correct choice of indices there, nor have I verified I converted back to raster stack correctly, but I hope you get the idea.
EDIT: I ran a timing test with 1000x1000 rasters, but otherwise as Josh created them.
microbenchmark(josh(s),syncNA1(s),syncNA2(s),times=5)
Unit: milliseconds
       expr       min        lq    median        uq        max neval
    josh(s)  774.2363  789.1653  800.2511  806.5364   809.9087     5
 syncNA1(s)  652.3928  659.8327  692.3578  695.8057   743.9123     5
 syncNA2(s) 7951.3918 8291.7917 8604.2226 8606.3432 10254.4739     5
I ended up building a hybrid of syncNA1() and Josh's solution.
This function is memory-safe if the computer does not have enough RAM, but processes faster if it does:
synchroniseNA <- function(x)
{
if(canProcessInMemory(x, n = 2))
{
val <- getValues(x)
NA.pos <- unique(which(is.na(val), arr.ind = T)[, 1])
val[NA.pos, ] <- NA
x <- setValues(x, val)
return(x)
} else
{
x <- mask(x, calc(x, fun = sum))
return(x)
}
}
However, I empirically determined that the amount of RAM used by the data.frame is about twice the size of the raster file (hence n = 2 in canProcessInMemory()), but I am not entirely sure I am right here.
I am trying to identify connected transactions. From the first TRUE to the last TRUE is considered one transaction, and within each transaction I also want to find out whether the tpt_mode is mixed or pure. Then I insert new columns with the results. Currently the for loop works with a small volume of data, but with a huge volume of data it runs very slowly. How can I optimize the for loop to speed up the performance?
firstid<-1
currTpt <- 'NA'
count<-0
n <- nrow(tnx)
for (i in 1:n) {
if(tnx$FIRST[i]){
firstid<-i
currTpt <-tnx$mode[i]
count <-1
}
else{
count <- count + 1
}
if(as.character(tnx$mode[i])!= as.character(currTpt)){
currTpt <- 'both'
}
if(tnx$LAST[i])
{
tnx$final_end_loc[firstid]<-tnx$end_loc[i]
tnx$final_end_date[firstid]<-as.character(tnx$end_date[i])
tnx$final_end_time[firstid]<-as.character(tnx$end_time[i])
tnx$final_mode[firstid]<-as.character(currTpt)
tnx$final_count[firstid] <- count
}
}
final_tnx<-subset(tnx,FIRST==TRUE,c("id","start_date","start_time","final_end_date","final_end_time","start_loc","final_end_loc","final_mode","final_count"))
Sample data (EDIT):
tnx<- data.frame(
id=c("A","A","A","A","C","C","D","D","E"),
mode=c("on","on","off","on","on","off","off","off","on"),
start_time=c("8:20:22","17:20:22","17:45:22","18:20:22","16:35:22","17:20:22","15:20:22","16:00:22","12:20:22"),
end_time=c("8:45:22","17:30:22","18:00:22","18:30:22","17:00:22","17:50:22","15:45:22","16:14:22","27:50:22"),
start_loc=c("12","12","207","12","11","65","222","32","12"),
end_loc=c(31,31,29,11,22,12,45,31,11),
start_date=c("6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012"),
end_date=c("6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012"),
FIRST=c(T,T,F,F,T,F,T,F,T),
LAST=c(T,F,F,T,F,T,F,T,T)
)
Sample dataset and expected results were shown as images in the original post.
Thanks in advance.
To get your results, you don't need a loop. If you check where your transactions start and end and index accordingly, your code simplifies to:
nLAST <- which(tnx$LAST)
nFIRST <- which(tnx$FIRST)
count <- sapply(1:length(nFIRST),FUN = function(i){nFIRST[i]:nLAST[i]})
mode <- unlist(lapply(count,FUN=function(x){ifelse(length(unique(tnx$mode[x]))==1,
as.character(unique(tnx$mode[x])),'both')}))
final_tnx <- data.frame(id = tnx$id[nFIRST],start_date = tnx$start_date[nFIRST],
start_time = tnx$start_time[nFIRST],final_end_date = tnx$end_date[nLAST],
final_end_time = tnx$end_time[nLAST], start_loc=tnx$start_loc[nFIRST],
final_end_loc = tnx$end_loc[nLAST],final_mode = mode,
final_count = nLAST - nFIRST +1)
This should definitely speed things up and also perform well on larger data sets.
EDIT: When the mode is allowed to change more than once, you have to check for uniqueness on all subsets. In count I build a list of index sequences, one per transaction. Then I apply to the index list a function that checks whether there is one mode or more in the subset.
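For illustration, with the sample tnx above the index list and modes come out as follows (a quick sketch, not needed for the solution itself):
nFIRST <- which(tnx$FIRST)   # 1 2 5 7 9
nLAST  <- which(tnx$LAST)    # 1 4 6 8 9
count  <- sapply(seq_along(nFIRST), function(i) nFIRST[i]:nLAST[i])
# count is list(1, 2:4, 5:6, 7:8, 9), and the corresponding modes are
# "on", "both", "both", "off", "on"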
I'm sure there are far more improvements to be made, but if you index as little as possible inside the loop and store the data as vectors, you can see some improvement.
require("rbenchmark")
###Specify data as vectors
FIRST <- tnx$FIRST
mode <- tnx$mode
LAST <- tnx$LAST
final_end_loc <- tnx$final_end_loc
final_end_date <- tnx$final_end_date
final_end_time <- tnx$final_end_time
final_mode <- tnx$final_mode
final_count <- tnx$final_count
end_date <- tnx$end_date
end_time <- tnx$end_time
end_loc <- tnx$end_loc
benchmark(for (i in 1:n) {
if(FIRST[i]){
firstid<-i
currTpt <-mode[i]
count <-1
}
else{
count <- count + 1
}
if(as.character(mode[i])!= as.character(currTpt)){
currTpt <- 'both'
}
if(LAST[i])
{
final_end_loc[firstid]<-end_loc[i]
final_end_date[firstid]<-as.character(end_date[i])
final_end_time[firstid]<-as.character(end_time[i])
final_mode[firstid]<-as.character(currTpt)
final_count[firstid] <- count
}
})
replications elapsed relative user.self sys.self user.child sys.child
1 100 0.11 1 0.11 0 NA NA
Now your loop
replications elapsed relative user.self sys.self user.child sys.child
1 100 0.18 1 0.19 0 NA NA
I cannot be certain this will perform well with a large dataset, but keeping indexing to a minimum has worked for me in the past. A good post can be found here: Speed up the loop operation in R, if this isn't fast enough for you or doesn't work well with large data.
I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append a value to the end of the data frame. I started with the following code.
for (i in 1:10) {
df$x = rbind(df$x, i)
df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to ignore the additional line of code that would be required to update the indices on every iteration and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately take. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
x <- numeric(n)
y <- character(n)
for (i in 1:n) {
x[i] <- i
y[i] <- i
}
data.frame(x, y, stringsAsFactors=FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176 5
# f3(1000) 149.417636 150.529011 150.827393 151.02230 160.637845 5
# f4(1000) 7.872647 7.892395 7.901151 7.95077 8.049581 5
f1() (the approach below) is incredibly inefficient because of how often it calls data.frame and because growing objects that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
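As a sanity check, the two preallocated versions should produce the same data.frame:
# both build the same columns the same way, so this is expected to be TRUE
identical(f3(1000), f4(1000))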
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that in your code, there is one other problem:
You should use stringsAsFactors = FALSE if you don't want the character values converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
df <- data.frame(x = numeric(), y = character())
for(i in 1:n){
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
df
}
# use list
f2 <- function(n){
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for(i in 1:n){
df[i,] <- list(i, toString(i))
}
df
}
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(1000), y = character(1000), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
system.time(f1(1000))
# user system elapsed
# 1.33 0.00 1.32
system.time(f2(1000))
# user system elapsed
# 0.19 0.00 0.19
system.time(f3(1000))
# user system elapsed
# 0.14 0.00 0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use list, and the worst solution (at least based on these timing results) appears to be rbind.
Suppose you simply don't know the size of the data.frame in advance. It may well be a few rows, or a few million. You need some sort of container that grows dynamically. Taking into consideration my experience and all the related answers on SO, I came up with 4 distinct solutions:
rbindlist to the data.frame
Use data.table's fast set operation and couple it with manually doubling the table when needed.
Use RSQLite and append to the table held in memory.
data.frame's own ability to grow and use custom environment (which has reference semantics) to store the data.frame so it will not be copied on return.
Here is a test of all the methods for both small and large number of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That is quite easy and straight-forward:
create.1<-function(elems)
{
return(as.data.table(elems))
}
append.1<-function(dt, elems)
{
return(rbindlist(list(dt, elems),use.names = TRUE))
}
access.1<-function(dt)
{
return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2<-function(elems)
{
return(as.data.table(elems))
}
append.2<-function(dt, elems)
{
n<-attr(dt, 'rowcount')
if (is.null(n))
n<-nrow(dt)
if (n==nrow(dt))
{
tmp<-elems[1]
tmp[[1]]<-rep(NA,n)
dt<-rbindlist(list(dt, tmp), fill=TRUE, use.names=TRUE)
setattr(dt,'rowcount', n)
}
pos<-as.integer(match(names(elems), colnames(dt)))
for (j in seq_along(pos))
{
set(dt, i=as.integer(n+1), pos[[j]], elems[[j]])
}
setattr(dt,'rowcount',n+1)
return(dt)
}
access.2<-function(elems)
{
n<-attr(elems, 'rowcount')
return(as.data.table(elems[1:n,]))
}
SQL should be optimized for fast record insertion, so I initially had high hopes for the RSQLite solution.
This is basically a copy & paste of Karsten W.'s answer on a similar thread.
create.3<-function(elems)
{
con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
return(con)
}
append.3<-function(con, elems)
{
RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append=TRUE)
return(con)
}
access.3<-function(con)
{
return(RSQLite::dbReadTable(con, "t", row.names=NULL))
}
data.frame's own row-appending + custom environment.
create.4<-function(elems)
{
env<-new.env()
env$dt<-as.data.frame(elems)
return(env)
}
append.4<-function(env, elems)
{
env$dt[nrow(env$dt)+1,]<-elems
return(env)
}
access.4<-function(env)
{
return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all via indirect calling. (I checked: using do.call instead of calling the functions directly doesn't make the code run measurably longer.)
test<-function(id, n=1000)
{
n<-n-1
el<-list(a=1,b=2,c=3,d=4)
o<-do.call(paste0('create.',id),list(el))
s<-paste0('append.',id)
for (i in 1:n)
{
o<-do.call(s,list(o,el))
}
return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n=10 insertions.
I also added 'placebo' functions (with suffix 0) that don't do anything, just to measure the overhead of the test setup.
r<-microbenchmark(test(0,n=10), test(1,n=10),test(2,n=10),test(3,n=10), test(4,n=10))
autoplot(r)
For 1E5 rows (measurements done on an Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz):
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
It looks like the SQLite-based solution, although it regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append a rather small number of rows (n <= 100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation and ignore the fact that the data.frame is not pre-populated.
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
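A short usage sketch of that recommended approach, using the create.2/append.2/access.2 functions defined above:
library(data.table)
dt <- create.2(list(a = 1, b = 2))
for (i in 2:5) {
  dt <- append.2(dt, list(a = i, b = i * 2))   # the table doubles itself when full
}
access.2(dt)   # a 5-row data.table with columns a and b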
Update with purrr, tidyr & dplyr
As the question is already dated (6 years old), the answers are missing a solution with the newer packages tidyr and purrr. So for people working with these packages, I want to add a solution to the previous answers - all quite interesting.
The biggest advantage of purrr and tidyr is better readability, IMHO.
purrr replaces lapply with the more flexible map() family,
tidyr offers the super-intuitive add_row() method - it just does what it says :)
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would make it rank second, right after data.table (if you ignore the placebo), in the benchmark by @Adam Ryczkowski:
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
A more generic solution might be the following.
extendDf <- function (df, n) {
withFactors <- sum(sapply (df, function(X) (is.factor(X)) )) > 0
nr <- nrow (df)
colNames <- names(df)
for (c in 1:length(colNames)) {
if (is.factor(df[,c])) {
col <- vector (mode='character', length = nr+n)
col[1:nr] <- as.character(df[,c])
col[(nr+1):(n+nr)]<- rep(col[1], n) # to avoid extra levels
col <- as.factor(col)
} else {
col <- vector (mode=mode(df[1,c]), length = nr+n)
class(col) <- class (df[1,c])
col[1:nr] <- df[,c]
}
if (c==1) {
newDf <- data.frame (col ,stringsAsFactors=withFactors)
} else {
newDf[,c] <- col
}
}
names(newDf) <- colNames
newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
# l i n c t
# 1 TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Let's take a vector 'point' which has the numbers from 1 to 5:
point = c(1,2,3,4,5)
If we want to append the number 6 anywhere inside the vector, then the commands below may come in handy:
i) Vectors
new_var = append(point, 6 ,after = length(point))
ii) columns of a table
new_var = append(point, 6 ,after = length(mtcars$mpg))
The command append takes three arguments:
the vector/column to be modified.
value to be included in the modified vector.
a subscript, after which the values are to be appended.
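For example, to insert the 6 in the middle of the vector instead of at the end:
append(point, 6, after = 2)
# [1] 1 2 6 3 4 5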
Simple! Apologies in case of any mistakes.
My solution is almost the same as in the original answer, but it didn't work for me.
So I gave names to the columns and it works:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
"col2" = xtweets$text))