How to speed up a missing-value search in an R data.table

I am writing a general function for missing-value treatment. The data can have character, numeric, factor, and integer columns. An example of the data is as follows:
dt<-data.table(
num1=c(1,2,3,4,NA,5,NA,6),
num3=c(1,2,3,4,5,6,7,8),
int1=as.integer(c(NA,NA,102,105,NA,300,400,700)),
int3=as.integer(c(1,10,102,105,200,300,400,700)),
cha1=c('a','b','c',NA,NA,'c','d','e'),
cha3=c('xcda','b','c','miss','no','c','dfg','e'),
fact1=c('a','b','c',NA,NA,'c','d','e'),
fact3=c('ad','bd','cc','zz','yy','cc','dd','ed'),
allm=as.integer(c(NA,NA,NA,NA,NA,NA,NA,NA)),
miss=as.character(c("","",'c','miss','no','c','dfg','e')),
miss2=as.integer(c('','',3,4,5,6,7,8)),
miss3=as.factor(c(".",".",".","c","d","e","f","g")),
miss4=as.factor(c(NA,NA,'.','.','','','t1','t2')),
miss5=as.character(c(NA,NA,'.','.','','','t1','t2'))
)
I was using this code to flag out missing values:
dt[,flag:=ifelse(is.na(miss5)|!nzchar(miss5),1,0)]
But it turns out to be very slow; additionally, I have to add logic that also treats "." as missing.
So I am planning to use this for missing-value identification:
dt[miss5 %in% c(NA,'','.'),flag:=1]
but on a data set of 6 million records this takes close to 1 second to run, whereas
dt[!nzchar(miss5),flag:=1] takes close to 0.14 seconds.
My question is: can we write code that takes as little time as possible while treating NA, blank, and dot (NA, "", ".") as missing?
Any help is highly appreciated.

== and %in% are optimised to use binary search automatically (NEW FEATURE: Auto indexing). To use it, we have to ensure that:
a) we use dt[...] instead of set() as it's not yet implemented in set(), #1196.
b) when the RHS of %in% is of a higher SEXPTYPE than the LHS, auto indexing re-routes to base R to ensure correct results (as binary search always coerces the RHS). So for integer columns we need to make sure we pass just NA, and not "." or "".
Using #akrun's data, here's the code and run time:
in_col = grep("^miss", names(dt), value=TRUE)
out_col = gsub("^miss", "flag", in_col)
system.time({
dt[, (out_col) := 0L]
for (j in seq_along(in_col)) {
if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
lookup = c("", ".", NA)
} else lookup = NA
expr = call("%in%", as.name(in_col[j]), lookup)
tt = dt[eval(expr), (out_col[j]) := 1L]
}
})
# user system elapsed
# 1.174 0.295 1.476
How it works:
a) we first initialise all output columns to 0.
b) Then, for each column, we check its type and create the lookup accordingly.
c) We then create the corresponding expression for i: miss(.) %in% lookup.
d) Then we evaluate the expression in i, which will use auto indexing to create an index very quickly and use that index to find the matching rows quickly with binary search.
Note: if necessary, you can add a set2key(dt, NULL) at the end of the for loop so that the created indices are removed immediately after use (to save space).
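For illustration, here is a hedged sketch of that variant (not part of the original answer): the same loop, but dropping the index after each assignment. set2key() was the name in the 1.9.5 devel version this answer targets; in current data.table the equivalent call is setindex(dt, NULL), which removes secondary indices.
for (j in seq_along(in_col)) {
  lookup = if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) c("", ".", NA) else NA
  expr = call("%in%", as.name(in_col[j]), lookup)
  dt[eval(expr), (out_col[j]) := 1L]  # auto indexing builds the index here
  setindex(dt, NULL)                  # drop the index right away to save space
}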
Compared to this run, #akrun's fastest answer takes 6.33 seconds, i.e. this is a ~4.2x speedup.
Update: On 4 million rows and 100 columns, it takes ~ 9.2 seconds. That's ~0.092 seconds per column.
Calling [.data.table 100 times could be expensive. When auto indexing is implemented in set(), it would be nice to compare the performance.

You can loop through the 'miss' columns and create corresponding 'flag' columns with set.
library(data.table)#v1.9.5+
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag',names(dt)[ind])
dt[,(nm1) := 0]
for(j in seq_along(ind)){
set(dt, i=which(dt[[ind[j]]] %in% c('.', '', NA)),j= nm1[j], value=1L)
}
Benchmarks
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA,0:9), 6e6*5, replace=TRUE), ncol=5))
set.seed(23)
df2 <- as.data.frame(matrix(sample(c('.','', letters[1:5]), 6e6*5,
replace=TRUE), ncol=5))
set.seed(234)
i1 <- sample(10)
dfN <- setNames(cbind(df1, df2)[i1], paste0('miss',1:10))
dt <- as.data.table(dfN)
system.time({
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag',names(dt)[ind])
dt[,(nm1) := 0L]
for(j in seq_along(ind)){
set(dt, i=which(dt[[ind[j]]] %in% c('.', '', NA)), j= nm1[j], value=1L)
}
}
)
#user system elapsed
# 8.352 0.150 8.496
system.time({
m1 <- matrix(0, nrow=6e6, ncol=10)
m2 <- sapply(seq_along(dt), function(i) {
ind <- which(dt[[i]] %in% c('.', '', NA))
replace(m1[,i], ind, 1L)})
cbind(dt, m2)})
#user system elapsed
# 14.227 0.362 14.582

Related

Update data.table with mapply speed issue

I have a self-defined function, the results of which I want in a data.table. I need to apply this function to some variables from each row of another data.table. I have a method that works how I want it to, but it is quite slow and I am looking to see if there is an approach which will speed it up.
In my sample below, the important results are Column, which is generated in a while loop and varies in length given the input data, and Column2.
My approach has been to have the function append the results to an existing data.table using the update by reference, :=. To achieve this properly, I set the length of Column and Column2 to a known maximum, replace NAs with 0, and simply add to an existing data.table addTable like so: addTable[, First:=First + Column]
This method works with how I have applied the function over each row of the source data.table, using mapply. This way, I needn't worry about the actual product of the mapply call (some kind of matrix); it just updates addTable for each row it applies sample_fun to.
Here is a reproducible sample:
dt<-data.table(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))
addTable <- data.table(First=0, Second=0, Term=c(1:50))
sample_fun <- function(x, y, z) {
Column <- NULL
while(x>=1) {
x <- x*y
Column <- c(Column, x)
}
length(Column) <- nrow(addTable)
Column[is.na(Column)] <- 0
Column2 <- NULL
Column2 <- rep(z, length(Column))
addTable[, First := First + Column]
addTable[, Second := Second + Column2]
}
If I run this with dt at 50k rows, it takes ~ 30 seconds:
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
Seems like a long time (longer with my real data/function). I originally thought this was due to the while loop, because of the ever-present warnings against explicit loops in R around these parts. However, upon testing sample_fun without the last two lines (where the data.table is updated), it clocked in under 1 second over 50k rows.
Long story short, why is this the slowest part if updating by reference is fast? And is there a better way to do this? Making sample_fun output a full data.table each time is considerably slower than what I have now.
A few notes here:
As it stands, using data.table for your need could be overkill (though not necessarily) and you could probably avoid it.
You are growing an object in a loop (Column <- c(Column, x)) - don't do that. In your case there is no need: just create a vector of zeroes of the right length and you can get rid of most of your function.
There is no need to create Column2 - it is just z - as R will automatically recycle it to the correct size.
There is also no need to recalculate nrow(addTable) for every row; it could simply be an additional parameter.
Your biggest overhead is calling data.table:::`[.data.table` per row - it is a very expensive function. The := function itself has very little overhead here. If you replace addTable[, First := First + Column] ; addTable[, Second := Second + Column2] with just addTable$First + Column ; addTable$Second + Column2, the run time is reduced from ~35 secs to ~2 secs. Another way to illustrate this is to replace the two lines with set - e.g. set(addTable, j = "First", value = addTable[["First"]] + Column) ; set(addTable, j = "Second", value = addTable[["Second"]] + Column2) - which basically shares its source code with :=. This also runs in ~2 secs.
Finally, it is better to reduce the number of operations per row. You could try accumulating the results with Reduce instead of updating the actual data set per row.
Let's see some examples
Your original function timings
library(data.table)
dt <- data.table(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))
addTable <- data.table(First=0, Second=0, Term=c(1:50))
sample_fun <- function(x, y, z) {
Column <- NULL
while(x>=1) {
x <- x*y
Column <- c(Column, x)
}
length(Column) <- nrow(addTable)
Column[is.na(Column)] <- 0
Column2 <- NULL
Column2 <- rep(z, length(Column))
addTable[, First := First + Column]
addTable[, Second := Second + Column2]
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 30.71 0.00 30.78
30 secs is pretty slow...
1- Let's try removing the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z) {
Column <- NULL
while(x>=1) {
x <- x*y
Column <- c(Column, x)
}
length(Column) <- nrow(addTable)
Column[is.na(Column)] <- 0
Column2 <- NULL
Column2 <- rep(z, length(Column))
addTable$First + Column
addTable$Second + Column2
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.25 0.00 2.26
^ That was much faster but didn't update the actual data set.
2- Now let's try replacing it with set, which will have the same effect as := but without the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z, n) {
Column <- NULL
while(x>=1) {
x <- x*y
Column <- c(Column, x)
}
length(Column) <- nrow(addTable)
Column[is.na(Column)] <- 0
Column2 <- NULL
Column2 <- rep(z, length(Column))
set(addTable, j = "First", value = addTable[["First"]] + Column)
set(addTable, j = "Second", value = addTable[["Second"]] + Column2)
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.96 0.00 2.96
^ Well, that was also much faster than 30 secs and had the exact same effect as :=
3- Let's try it without using data.table at all
dt <- data.frame(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))
addTable <- data.frame(First=0, Second=0, Term=c(1:50))
sample_fun <- function(x, y, z) {
Column <- NULL
while(x>=1) {
x <- x*y
Column <- c(Column, x)
}
length(Column) <- nrow(addTable)
Column[is.na(Column)] <- 0
Column2 <- NULL
Column2 <- rep(z, length(Column))
return(list(Column, Column2))
}
system.time(res <- mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 1.34 0.02 1.36
^ That's even faster
Now we can use Reduce combined with accumulate = TRUE in order to create those vectors
system.time(addTable$First <- Reduce(`+`, res[1, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
system.time(addTable$Second <- Reduce(`+`, res[2, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
Well, everything combined is now under 2 seconds (instead of 30 with your original function).
4- Further improvements could be made by fixing the other elements in your function (as pointed out above); in other words, your function could be just
sample_fun <- function(x, y, n) {
Column <- numeric(n)
i <- 1L
while(x >= 1) {
x <- x * y
Column[i] <- x
i <- i + 1L
}
return(Column)
}
system.time(res <- Map(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.72 0.00 0.72
^ A twofold improvement in speed
Now, we didn't even bother creating Column2 as we already have it in dt$Z. We also used Map instead of mapply as it will be easier for Reduce to work with a list than a matrix.
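As a quick aside (a toy illustration, not part of the answer's benchmark), the difference is that Map always returns a list, while mapply simplifies equal-length results into a matrix:
str(Map(rep, 1:2, 3))     # list of two length-3 vectors
str(mapply(rep, 1:2, 3))  # 3 x 2 matrix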
The next step is similar to before:
system.time(addTable$First <- Reduce(`+`, res, accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.07
But we could improve this even further. Instead of using Map/Reduce, we could create a matrix using mapply and then run matrixStats::rowCumsums over it (which is written in C++ internally) in order to calculate addTable$First.
system.time(res <- mapply(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.76 0.00 0.76
system.time(addTable$First2 <- matrixStats::rowCumsums(res)[, nrow(dt)])
# user system elapsed
# 0 0 0
While the final step is simply summing dt$Z
system.time(addTable$Second <- sum(dt$Z))
# user system elapsed
# 0 0 0
So eventually we went from ~30 secs to less than a second.
Some final notes
As the main overhead now seems to remain in the function itself, you could also try rewriting it using Rcpp, since loops seem inevitable in this case (though that overhead does not look too big).
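For what it's worth, here is a minimal sketch of what such an Rcpp port might look like (assuming the Rcpp package is installed; the C++ function name and signature are illustrative, not from the original answer):
library(Rcpp)
cppFunction('
NumericVector sample_fun_cpp(double x, double y, int n) {
  NumericVector col(n);          // pre-allocated, zero-filled
  int i = 0;
  while (x >= 1 && i < n) {      // same stopping rule as the R while loop
    x *= y;
    col[i] = x;
    ++i;
  }
  return col;
}')
res <- mapply(sample_fun_cpp, dt$X, dt$Y, nrow(addTable))  # same matrix shape as before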

Identify and Convert to Numeric/Integer

I have a situation where I need to look at character data and convert it to numeric or integer. I need to perform this operation on a data.table, and it needs to be so fast that I don't notice it when working with a data.table that has ~1000 columns and 1e6 rows. There is a lot of missing or sparse data, which is a confounding element.
fread from the data.table package does this incredibly quickly, and is well tested, when reading from a csv file (among other sources).
Is there a way to apply the column identification used in fread to an existing data.table?
Otherwise, here's the approach I was considering (which is still too slow):
Dummy Data:
library(data.table)
size = 1e6
resample <- function(x,size = 1e6) sample(x,size,replace = TRUE)
text <- c("Canada","Peru","Australia",
"Angola","France","", NA_character_)
text2 <- c("Oh Canada.","Arriba Peru.",
"Australia?","Vive la France.")
numerics <- rnorm(1e6)
dt <- data.table(
id = as.character(1:1e6),
i1 = resample(c(as.character(c(0:5,NA)),"")), # sometimes just blank
i2 = resample(c(as.character(c(100:500,NA)))),
n1 = as.character(round(rnorm(1e6),3)),
t1 = resample(text),
t2 = resample(text2)
)
str(dt)
My approach so far is to use grep to test the columns for alphabetic characters and a literal ., and then write a short function to apply the as.* conversion identified.
decide <- data.frame(
vars = names(dt),
character = unlist(lapply(dt, function(x) length(grep("[a-z]",x)))),
numeric = unlist(lapply(dt, function(x) length(grep("[.]",x))))
)
what_is_it <- function(character, numeric) {
if(character == 0 & numeric == 0) {
return("as.integer")
}
if(character > 0) {
return("as.character")
}
if(numeric > 0 & character == 0) {
return("as.numeric")
}
}
decide$fun <- apply(decide[-1], 1, function(x) what_is_it(x[1],x[2]))
for(var in decide$vars) {
fun <- get(decide$fun[decide$vars == var])
dt[, (var) := fun(get(var))]
dt[]
}
system.time(source("https://gist.githubusercontent.com/1beb/183511b51d615751860204344a02c799/raw/91fcee73f24596ac6bdec00edaad944b5b1b7713/quick_convert.R"))
Running at about 3.5 seconds on my machine, but for only 7 columns.
As pointed out by user20650, the answer is type.convert.
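A minimal sketch of how that might be applied column by column to an existing data.table (assuming blanks should count as missing; adjust na.strings to your data):
for (col in names(dt)) {
  # type.convert guesses integer/numeric/character for each column;
  # as.is = TRUE keeps character columns as character rather than factor
  set(dt, j = col, value = type.convert(dt[[col]], na.strings = c("NA", ""), as.is = TRUE))
}
str(dt)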

Return factor associated with a numeric range defined in two columns

Using a database with a numeric range defined by two columns start and end, I am trying to look up the factor, code, associated with a numeric value in a separate vector identityCodes.
database <- data.frame(start = seq(1, 150000000, 1000),
end = seq(1000, 150000000, 1000),
code = paste0(sample(LETTERS, 15000, replace = TRUE),
sample(LETTERS, 15000, replace = TRUE)))
identityCodes <- sample(1:15000000, 1000)
I've come up with a method for finding the corresponding codes using a for loop and subsetting:
fun <- function (x, y) {
z <- rep(NA, length(x))
for (i in 1:length(x)){
z[i] <- as.character(y[y["start"] <= x[i] & y["end"] >= x[i], "code"])
}
return(z)
}
a <- fun(identityCodes, database)
But the method is slow, especially if I am to scale it:
system.time(fun(identityCodes, database))
user system elapsed
15.36 0.00 15.50
How can I identify the factors associated with each identityCodes faster? Is there a better way to go about this than using a for loop and subsetting?
Here's my attempt using data.table. Very fast - even though I am sure I am not leveraging it efficiently.
Given function:
# method 1
system.time(result1 <- fun(identityCodes, database))
user system elapsed
8.99 0.00 8.98
Using data.table
# method 2
require(data.table)
# x: a data.frame with columns start, end, code
# y: a vector with lookup codes
dt_comb <- function(x, y) {
# convert x to a data.table and set 'start' and 'end' as keys
DT <- setDT(x)
setkey(DT, start, end)
# create a lookup data.table where start and end are the identityCodes
DT2 <- data.table(start=y, end=y)
# overlap join where DT2 start & end are within DT start and end
res <- foverlaps(DT2, DT[, .(start, end)], type="within")
# store i as row number and key (for sorting later)
res[, i:=seq_len(nrow(res))]
setkey(res, i)
# merge the joined table to the original to get codes
final <- merge(res, DT, by=c("start", "end"))[order(i), .(code)]
# export as character the codes
as.character(final[[1]])
}
system.time(result2 <- dt_comb(x=database, y=identityCodes))
user system elapsed
0.08 0.00 0.08
identical(result1, result2)
[1] TRUE
Edit: trimmed a couple of lines from the function.
This is about 45% faster on my machine:
result = lapply(identityCodes, function(x) {
data.frame(identityCode=x,
code=database[database$start <= x & database$end >= x, "code"])
})
result = do.call(rbind, result)
Here's a sample of the output:
identityCode code
1 6836845 OK
2 14100352 RB
3 2313115 NK
4 8440671 XN
5 11349271 TI
6 14467193 VL

Constructing an R data.table by selecting each row from an array of tables

Assume I have a list of length D containing data.table objects. Each data.table has the same columns (X, Y) and same number of rows N. I'd like to construct another table with N rows, with the individual rows taken from the tables specified by an index vector also of length N. Restated, each row in the final table is taken from one and only one of the tables in the array, with the index of the source table specified by an existing vector.
N = 100 # rows in each table (actual ~1000000 rows)
D = 4 # number of tables in array (actual ~100 tables)
tableArray = vector("list", D)
for (d in 1:D) {
tableArray[[d]] = data.table(X=rnorm(N), Y=d) # actual ~100 columns
}
tableIndexVector = sample.int(D, N, replace=TRUE) # length N of random 1:D
finalTable = copy(tableArray[[1]]) # just for length and column names
for (n in 1:N) {
finalTable[n] = tableArray[[tableIndexVector[n]]][n]
}
This seems to work the way I want, but the array within array notation is hard to understand, and I presume the performance of the for loop isn't going to be very good. It seems like there should be some elegant way of doing this, but I haven't stumbled across it yet. Is there another way of doing this that is efficient and less arcane?
(In case you are wondering, each table in the array represents simulated counterfactual observations for a subject under a particular regime of treatment, and I want to sample from these with different probabilities to test the behavior of different regression approaches with different ratios of regimes observed.)
for loops work just fine with data.table but we can improve the performance of your specific loop significantly (I believe) using the following approaches.
Approach # 1
Use set instead, as it avoids the [.data.table overhead
Don't loop over 1:N because you can simplify your loop to run only on unique values of tableIndexVector and assign all the corresponding values at once. This should decrease the run time by at least x10K (as N is of size 1MM and D is only of size 100, while unique(tableIndexVector) <= D)
So you basically could convert your loop to the following
for (i in unique(tableIndexVector)) {
indx <- which(tableIndexVector == i)
set(finalTable, i = indx, j = 1:2, value = tableArray[[i]][indx])
}
Approach # 2
Another approach is to use rbindlist and combine all the tables into one big data.table while adding the new idcol parameter in order to identify the different tables within the big table. You will need the devel version for that. This avoids the loop as requested, but the result will be ordered by the order in which the tables appear.
temp <- rbindlist(tableArray, idcol = "indx")
indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1
finalTable <- temp[indx]
Here's a benchmark on bigger data set
N = 100000
D = 10
tableArray = vector("list", D)
set.seed(123)
for (d in 1:D) {
tableArray[[d]] = data.table(X=rnorm(N), Y=d)
}
set.seed(123)
tableIndexVector = sample.int(D, N, replace=TRUE)
finalTable = copy(tableArray[[1]])
finalTable2 = copy(tableArray[[1]])
## Your approach
system.time(for (n in 1:N) {
finalTable[n] = tableArray[[tableIndexVector[n]]][n]
})
# user system elapsed
# 154.79 33.14 191.57
## My approach # 1
system.time(for (i in unique(tableIndexVector)) {
indx <- which(tableIndexVector == i)
set(finalTable2, i = indx, j = 1:2, value = tableArray[[i]][indx])
})
# user system elapsed
# 0.01 0.00 0.02
## My approach # 2
system.time({
temp <- rbindlist(tableArray, idcol = "indx")
indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1
finalTable3 <- temp[indx]
})
# user system elapsed
# 0.11 0.00 0.11
identical(finalTable, finalTable2)
## [1] TRUE
identical(setorder(finalTable, X), setorder(finalTable3[, indx := NULL], X))
## [1] TRUE
In conclusion:
My first approach is by far the fastest - it runs about 15K times faster than your original one - and it also returns an identical result.
My second approach is still about 1.5K times faster than your original approach and avoids the loop (which you don't like for some reason). However, the result is ordered by the order in which the tables appear, so it isn't identical to your result.

How to append rows to an R data frame

I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append a value to the end of the data frame. I started with the following code.
for (i in 1:10) {
df$x = rbind(df$x, i)
df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to avoid the additional line of code that would be required to update the indices on every iteration, and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately take. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
x <- numeric(n)
y <- character(n)
for (i in 1:n) {
x[i] <- i
y[i] <- i
}
data.frame(x, y, stringsAsFactors=FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176 5
# f3(1000) 149.417636 150.529011 150.827393 151.02230 160.637845 5
# f4(1000) 7.872647 7.892395 7.901151 7.95077 8.049581 5
f1() (the approach below) is incredibly inefficient because of how often it calls data.frame and because growing objects that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that in your code, there is one other problem:
You should use stringsAsFactors if you want the characters to not get converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
df <- data.frame(x = numeric(), y = character())
for(i in 1:n){
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
df
}
# use list
f2 <- function(n){
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for(i in 1:n){
df[i,] <- list(i, toString(i))
}
df
}
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
system.time(f1(1000))
# user system elapsed
# 1.33 0.00 1.32
system.time(f2(1000))
# user system elapsed
# 0.19 0.00 0.19
system.time(f3(1000))
# user system elapsed
# 0.14 0.00 0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use list, and the worst solution (at least based on these timing results) appears to be rbind.
Suppose you simply don't know the size of the data.frame in advance. It could well be a few rows, or a few million. You need some sort of container that grows dynamically. Taking into consideration my experience and all the related answers on SO, I came up with 4 distinct solutions:
rbindlist to the data.frame
Use data.table's fast set operation and couple it with manually doubling the table when needed.
Use RSQLite and append to the table held in memory.
data.frame's own ability to grow, plus a custom environment (which has reference semantics) to store the data.frame so it will not be copied on return.
Here is a test of all the methods for both small and large numbers of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That is quite easy and straight-forward:
create.1<-function(elems)
{
return(as.data.table(elems))
}
append.1<-function(dt, elems)
{
return(rbindlist(list(dt, elems),use.names = TRUE))
}
access.1<-function(dt)
{
return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2<-function(elems)
{
return(as.data.table(elems))
}
append.2<-function(dt, elems)
{
n<-attr(dt, 'rowcount')
if (is.null(n))
n<-nrow(dt)
if (n==nrow(dt))
{
tmp<-elems[1]
tmp[[1]]<-rep(NA,n)
dt<-rbindlist(list(dt, tmp), fill=TRUE, use.names=TRUE)
setattr(dt,'rowcount', n)
}
pos<-as.integer(match(names(elems), colnames(dt)))
for (j in seq_along(pos))
{
set(dt, i=as.integer(n+1), pos[[j]], elems[[j]])
}
setattr(dt,'rowcount',n+1)
return(dt)
}
access.2<-function(elems)
{
n<-attr(elems, 'rowcount')
return(as.data.table(elems[1:n,]))
}
SQL should be optimized for fast record insertion, so I initially had high hopes for the RSQLite solution.
This is basically a copy and paste of Karsten W.'s answer on a similar thread.
create.3<-function(elems)
{
con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
return(con)
}
append.3<-function(con, elems)
{
RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append=TRUE)
return(con)
}
access.3<-function(con)
{
return(RSQLite::dbReadTable(con, "t", row.names=NULL))
}
data.frame's own row-appending + custom environment.
create.4<-function(elems)
{
env<-new.env()
env$dt<-as.data.frame(elems)
return(env)
}
append.4<-function(env, elems)
{
env$dt[nrow(env$dt)+1,]<-elems
return(env)
}
access.4<-function(env)
{
return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all with indirect calling. (I checked: using do.call instead of calling the functions directly doesn't make the code run measurably longer.)
test<-function(id, n=1000)
{
n<-n-1
el<-list(a=1,b=2,c=3,d=4)
o<-do.call(paste0('create.',id),list(el))
s<-paste0('append.',id)
for (i in 1:n)
{
o<-do.call(s,list(o,el))
}
return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n=10 insertions.
I also added 'placebo' functions (with suffix 0) that don't do anything, just to measure the overhead of the test setup.
r<-microbenchmark(test(0,n=10), test(1,n=10),test(2,n=10),test(3,n=10), test(4,n=10))
autoplot(r)
For 1E5 rows (measurements done on an Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz):
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
It looks like the SQLite-based solution, although it regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append a rather small number of rows (n <= 100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation (see the sketch after this list) and ignore the fact that the data.frame is not pre-populated.
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
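A minimal sketch of the bracket-notation approach for small n (the column names here are just illustrative, matching the earlier toy example):
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for (i in 1:10) {
  df[nrow(df) + 1, ] <- list(i, toString(i))  # grow by one row per iteration
}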
Update with purrr, tidyr & dplyr
As the question is already dated (6 years), the answers are missing a solution using the newer packages tidyr and purrr. So for people working with these packages, I want to add a solution to the previous answers - all quite interesting ones.
The biggest advantage of purrr and tidyr is better readability, IMHO.
purrr replaces lapply with the more flexible map() family,
tidyr offers the super-intuitive method add_row - just does what it says :)
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would make it rank second, right after data.table (if you ignore the placebo), in the benchmark by #Adam Ryczkowski:
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
A more generic solution might be the following.
extendDf <- function (df, n) {
withFactors <- sum(sapply (df, function(X) (is.factor(X)) )) > 0
nr <- nrow (df)
colNames <- names(df)
for (c in 1:length(colNames)) {
if (is.factor(df[,c])) {
col <- vector (mode='character', length = nr+n)
col[1:nr] <- as.character(df[,c])
col[(nr+1):(n+nr)]<- rep(col[1], n) # to avoid extra levels
col <- as.factor(col)
} else {
col <- vector (mode=mode(df[1,c]), length = nr+n)
class(col) <- class (df[1,c])
col[1:nr] <- df[,c]
}
if (c==1) {
newDf <- data.frame (col ,stringsAsFactors=withFactors)
} else {
newDf[,c] <- col
}
}
names(newDf) <- colNames
newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
# l i n c t
# 1 TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Let's take a vector 'point' which has the numbers 1 to 5:
point = c(1,2,3,4,5)
If we want to append the number 6 anywhere inside the vector, the commands below may come in handy:
i) Vectors
new_var = append(point, 6, after = length(point))
ii) Columns of a table
new_var = append(point, 6, after = length(mtcars$mpg))
The command append takes three arguments:
the vector/column to be modified.
value to be included in the modified vector.
a subscript, after which the values are to be appended.
Simple!
My solution is almost the same as the original answer's, but that one didn't work for me.
So I gave names to the columns, and it works:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
"col2" = xtweets$text))
