logical indexing in data.table in R - r

I am a beginner in data.table and I am trying to do a really simple operation which in the base dataframes would look like this:
percentages[percentages<0] = abs(percentages[percentages<0])
The data looks like this:
percentages
p1 p2 p3
1: 0.689 0.206 0.106
The solution for data.table that I have found so far to just get the data is:
percentages[,which(percentages<0),with=FALSE]
but it's more complicated than dataframe...there should be something better but I can't get anything.. any suggestion?

A general option may be using set. It includes a for loop but it would be more efficient as we are looping through the columns and not creating a matrix by doing (df1 < 0 - for huge datasets, this would consume some memory). Using set will be efficient as the documentation says overhead of [.data.table is avoided
for(j in seq_along(df1)){
set(df1, i = which(df1[[j]]<0), j=j, value = abs(df1[[j]]))
}
As the OP wants a single line code, for the single row example showed,
df1[, lapply(.SD, function(x) replace(x, x < 0, abs(x)))]
Benchmarks
Based on the system.time on a slightly bigger dataset
set.seed(42)
dfN <- data.frame(p1 = rnorm(1e7), p2 = rnorm(1e7), p3 = rnorm(1e7), p4 = rnorm(1e7))
dfN1 <- copy(dfN)
setDT(dfN1)
system.time({
i1 <- dfN < 0
dfN[i1] <- abs(dfN[i1])
})
# user system elapsed
# 1.63 0.50 2.12
system.time({
for(j in seq_along(dfN1)){
set(dfN1, i = which(dfN1[[j]]<0), j=j, value = abs(dfN1[[j]][dfN1[[j]]<0]))
}
})
# user system elapsed
# 0.91 0.08 0.98

as akrun posted above, the one-liner reply is
df1[, lapply(.SD, function(x) replace(x, x < 0, abs(x)))]
however, this is not exactly what I was looking for since it seems that data.table is much more syntactically complicated compared to data.frame (at least in this example)
we are basically doing the vectorization ourselves in data.table (using the lapply) while in data.frame it happens automatically

Related

Running a quicker calculation

I have 2 data frames in R, one of which is a subset of the other. I had to do some manipulations in it, and calculate the % of the subsetted data from the main data frame for 6 x-values (DayTreat in the code). So I created a function to do the calculation and create a new column. My issue is that it's painfully slow. Any suggestions?
percDay <- function(fullDat, subDat)
{
subDat$DaySum <- NULL
for (i in fullDat$DayTreat) # for each DayTreat value in fullDat. Must be `psmelt()` made phyloseq object
{
r <- sum(fullDat$Abundance[fullDat$DayTreat == i]) # Take the sum of all the taxa for that day
subDat$DaySum[subDat$DayTreat == i] <- r # Add the value to the subset of data
}
subDat$DayPerc <- (subDat$Abundance/subDat$DaySum) # Make the percentage of the subset
subDat
}
Examing your code, it looks like that you are doing redundant calculasions
the line:
for (i in fullDat$DayTreat)
should be:
for (i in unique(fullDat$DayTreat))
After that you could use data.table and do not use separate data frames,
if you say that one is subset of onother
require(data.table)
setDT(fullDat)
fullDat[, subsetI := Abundance > 30] # for example, should be your Condition
fullDat[, DaySum:= sum(Abundance), by = DayTreat]
fullDat[, DayPerc := Abundance/DaySum]
# get subset:
fullDat[subsetI == T]
If you would provide example data and desired output, it could be possible to supply more concrete code.
So, at a high level, I think the solutions are:
Use faster data classes if you aren't already
Avoid for loops
vectorize manually or
real on faster functions/libraries that use more C code and/or have more vectorization "under the hood"
Try data.table and/or tidyverse for greater speed and cleaner code
Benchmark and profile your code
Example:
require(tidyverse)
require(data.table)
percDay <- function(fullDat, subDat)
{
subDat$DaySum <- NULL
for (i in fullDat$DayTreat) # for each DayTreat value in fullDat. Must be `psmelt()` made phyloseq object
{
r <- sum(fullDat$Abundance[fullDat$DayTreat == i]) # Take the sum of all the taxa for that day
subDat$DaySum[subDat$DayTreat == i] <- r # Add the value to the subset of data
}
subDat$DayPerc <- (subDat$Abundance/subDat$DaySum) # Make the percentage of the subset
subDat
}
# My simulation of your data.frame:
fullDat <- data.frame(Abundance=rnorm(200),
DayTreat=c(1:100,1:100))
subDat <- dplyr::sample_frac(fullDat, .25)
# Your function modifies the data, so I'll make a copy. For a potential
# speed improvement I'll try data.table class
fullDat0 <- as.data.table(fullDat)
subDat0 <- as.data.table(subDat)
require(rbenchmark)
benchmark("original" = {
percDay(fullDat, subDat)
},
"example_improvement" = {
# Tidy approach
tmp <- fullDat0 %>%
group_by(DayTreat) %>%
summarize(DaySum = sum(Abundance))
subDat0 <- merge(subDat, tmp, by="DayTreat") # could use semi_join
subDat0$DayPerc <- (subDat0$Abundance/subDat0$DaySum) # could use mutate
},
replications = 100,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
test replications elapsed relative user.self sys.self
example_improvement 100 0.22 1.000 0.22 0.00
original 100 1.42 6.455 1.23 0.01
Typically a data.table approach is going to have the greatest speed. The tibble-based "tidy" approach has clearer syntax whilst typically being faster than data.frame but slower than data.table. An experienced data.table expert like #akrun could offer a maximal performance solution using probably just 1 single data.table statement.

How to speed up missing search process in a R data.table

I am writing a general function for missing value treatment. Data can have Char,numeric,factor and integer type columns. An example of data is as follows
dt<-data.table(
num1=c(1,2,3,4,NA,5,NA,6),
num3=c(1,2,3,4,5,6,7,8),
int1=as.integer(c(NA,NA,102,105,NA,300,400,700)),
int3=as.integer(c(1,10,102,105,200,300,400,700)),
cha1=c('a','b','c',NA,NA,'c','d','e'),
cha3=c('xcda','b','c','miss','no','c','dfg','e'),
fact1=c('a','b','c',NA,NA,'c','d','e'),
fact3=c('ad','bd','cc','zz','yy','cc','dd','ed'),
allm=as.integer(c(NA,NA,NA,NA,NA,NA,NA,NA)),
miss=as.character(c("","",'c','miss','no','c','dfg','e')),
miss2=as.integer(c('','',3,4,5,6,7,8)),
miss3=as.factor(c(".",".",".","c","d","e","f","g")),
miss4=as.factor(c(NA,NA,'.','.','','','t1','t2')),
miss5=as.character(c(NA,NA,'.','.','','','t1','t2'))
)
I was using this code to flag out missing values:
dt[,flag:=ifelse(is.na(miss5)|!nzchar(miss5),1,0)]
But it turns out to be very slow, additionally I have to add logic which could also consider "." as missing.
So I am planning to write this for missing value identification
dt[miss5 %in% c(NA,'','.'),flag:=1]
but on a 6 million record set it takes close to 1 second to run this whereas
dt[!nzchar(miss5),flag:=1] takes close 0.14 secod to run.
My question is, can we have a code where the time taken is as least as possible while we can look for values NA,blank and Dot(NA,".","") as missing?
Any help is highly appreciated.
== and %in% are optimised to use binary search automatically (NEW FEATURE: Auto indexing). To use it, we have to ensure that:
a) we use dt[...] instead of set() as it's not yet implemented in set(), #1196.
b) When RHS to %in% is of higher SEXPTYPE than LHS, auto indexing re-routes to base R to ensure correct results (as binary search always coerces RHS). So for integer columns we need to make sure we pass in just NA and not the "." or "".
Using #akrun's data, here's the code and run time:
in_col = grep("^miss", names(dt), value=TRUE)
out_col = gsub("^miss", "flag", in_col)
system.time({
dt[, (out_col) := 0L]
for (j in seq_along(in_col)) {
if (class(.subset2(dt, in_col[j])) %in% c("character", "factor")) {
lookup = c("", ".", NA)
} else lookup = NA
expr = call("%in%", as.name(in_col[j]), lookup)
tt = dt[eval(expr), (out_col[j]) := 1L]
}
})
# user system elapsed
# 1.174 0.295 1.476
How it works:
a) we first initiate all output columns to 0.
b) Then, for each column, we check it's type and create lookup accordingly.
c) We then create the corresponding expression for i - miss(.) %in% lookup
d) Then we evaluate the expression in i, which'll use auto indexing to create an index very quickly and use that index to quickly find matching indices using binary search.
Note: If necessary, you can add a set2key(dt, NULL) at the end of for-loop so that the created indices are removed immediately after use (to save space).
Compared to this run, #akrun's fastest answer takes 6.33 seconds, which is ~4.2x speedup.
Update: On 4 million rows and 100 columns, it takes ~ 9.2 seconds. That's ~0.092 seconds per column.
Calling [.data.table a 100 times could be expensive. When auto indexing is implemented in set(), it'd be nice to compare the performance.
You can loop through the 'miss' columns and create corresponding 'flag' columns with set.
library(data.table)#v1.9.5+
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag',names(dt)[ind])
dt[,(nm1) := 0]
for(j in seq_along(ind)){
set(dt, i=which(dt[[ind[j]]] %in% c('.', '', NA)),j= nm1[j], value=1L)
}
Benchmarks
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA,0:9), 6e6*5, replace=TRUE), ncol=5))
set.seed(23)
df2 <- as.data.frame(matrix(sample(c('.','', letters[1:5]), 6e6*5,
replace=TRUE), ncol=5))
set.seed(234)
i1 <- sample(10)
dfN <- setNames(cbind(df1, df2)[i1], paste0('miss',1:10))
dt <- as.data.table(dfN)
system.time({
ind <- grep('^miss', names(dt))
nm1 <- sub('miss', 'flag',names(dt)[ind])
dt[,(nm1) := 0L]
for(j in seq_along(ind)){
set(dt, i=which(dt[[ind[j]]] %in% c('.', '', NA)), j= nm1[j], value=1L)
}
}
)
#user system elapsed
# 8.352 0.150 8.496
system.time({
m1 <- matrix(0, nrow=6e6, ncol=10)
m2 <- sapply(seq_along(dt), function(i) {
ind <- which(dt[[i]] %in% c('.', '', NA))
replace(m1[,i], ind, 1L)})
cbind(dt, m2)})
#user system elapsed
# 14.227 0.362 14.582

Can a SQL non-equi join task (example below) be done (faster and /or neater) using data.table?

I have a data table with 3 columns: pid ( a job) , starttime (time when it starts ) and fintime (time when it ends) as made up below:
require(data.table)
dt <- data.table(pid=sample(1:100,100), starttime = sample(1:100,100)/100)[,fintime:=starttime + round(runif(100)/4,2)]
I need to identify all possible TWO jobs that can be done sequentially but confirming to an acceptable “gap” between the Jobs. I can do this using SQL for a gap between 0.05 and 0.4 units(of time) as below:
require(sqldf)
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as startime, b.fintime as fintime
from dt a, dt b
where a.fintime < b.starttime - 0.05
and a.fintime > b.starttime - 0.4
")
How can I do this using data.table ?
(I am hoping for a performance improvement over sqldf when the data is large and with more constraints)
So here is a data.table approach that is about 20X faster, but there are some caveats (described at the end).
require(data.table)
set.seed(1) # for reproducible example
n <- 100 # simple example
dt <- data.table(pid=sample(1:n,n),
starttime = sample(1:n,n)/n,2)[,fintime:=starttime + round(runif(n)/4,2)]
# sqldf approach
require(sqldf)
f.sql <- function(dt) {
sqldf("create index idx on dt(starttime,fintime)")
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as starttime, b.fintime as fintime
from dt a, dt b
where b.starttime >= a.fintime + 0.05
and b.starttime <= a.fintime + 0.4
")
}
res.sql <- f.sql(dt)
# data.table approach with foverlaps(...): need >= 1.9.4 for this!!
packageVersion("data.table")
# [1] ‘1.9.4’
f.DT <- function(dt) {
lookup <- dt[,list(second=pid, fintime, a=starttime,b=starttime)]
setkey(lookup,a,b)
DT <- dt[,list(first=pid, starttime, a=fintime+0.05,b=fintime+0.4)]
J.olaps <- foverlaps(DT,lookup,type="any",nomatch=0)
J.olaps[,list(first,second,starttime,fintime)]
}
res.DT <- f.DT(dt)
So this uses the foverlaps(...) function in the newest version of data.table (1.9.4). Suppose you have two data.tables, x and y. Each has a pair of columns that form a range. foverlaps(...) finds all combinations of records in x and y where there is overlap between the range in x and the range in y. Here we set this up so that x has range defined by fintime+0.04 and fintime+0.5, and y has range defined by starttime at both ends. So now foverlaps(...) looks for any combination of records where the starttime is between 0.04 and 0.5 more than the fintime.
Now for the caveats:
First, this only works (AFAIK) if you are willing to relax your constraints to a closed interval (e.g., b.starttime >= a.fintime + 0.05, vs. strictly >).
Second, the data.table approach finds all the records found in the sql approach plus some additional records. You can see this with the following code:
indx <- data.table(first=res.sql$first,second=res.sql$second,key=c("first","second"))
setkey(res.DT,first,second)
extra <- res.DT[!indx,]
The extra records seem like they are legitimate, so the question is: why are they not found by sqldf(...)? I can't answer that.
Third, this works for your example, but might not be easy to extend with "more constraints".
Finally, here is a "benchmark" with a dataset more similar to your actual data:
set.seed(1)
n <- 1e4 # more realistic example
dt <- data.table(pid=sample(1:n,n),
starttime = sample(1:n,n)/n)[,fintime:=starttime + round(runif(n)/4,2)]
system.time(res.sql <- f.sql(dt))
# user system elapsed
# 45.25 0.53 45.80
system.time(res.DT <- f.DT(dt))
# user system elapsed
# 2.09 0.86 2.94

How to append rows to an R data frame

I have looked around StackOverflow, but I cannot find a solution specific to my problem, which involves appending rows to an R data frame.
I am initializing an empty 2-column data frame, as follows.
df = data.frame(x = numeric(), y = character())
Then, my goal is to iterate through a list of values and, in each iteration, append a value to the end of the list. I started with the following code.
for (i in 1:10) {
df$x = rbind(df$x, i)
df$y = rbind(df$y, toString(i))
}
I also attempted the functions c, append, and merge without success. Please let me know if you have any suggestions.
Update from comment:
I don't presume to know how R was meant to be used, but I wanted to ignore the additional line of code that would be required to update the indices on every iteration and I cannot easily preallocate the size of the data frame because I don't know how many rows it will ultimately take. Remember that the above is merely a toy example meant to be reproducible. Either way, thanks for your suggestion!
Update
Not knowing what you are trying to do, I'll share one more suggestion: Preallocate vectors of the type you want for each column, insert values into those vectors, and then, at the end, create your data.frame.
Continuing with Julian's f3 (a preallocated data.frame) as the fastest option so far, defined as:
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(n), y = character(n), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
Here's a similar approach, but one where the data.frame is created as the last step.
# Use preallocated vectors
f4 <- function(n) {
x <- numeric(n)
y <- character(n)
for (i in 1:n) {
x[i] <- i
y[i] <- i
}
data.frame(x, y, stringsAsFactors=FALSE)
}
microbenchmark from the "microbenchmark" package will give us more comprehensive insight than system.time:
library(microbenchmark)
microbenchmark(f1(1000), f3(1000), f4(1000), times = 5)
# Unit: milliseconds
# expr min lq median uq max neval
# f1(1000) 1024.539618 1029.693877 1045.972666 1055.25931 1112.769176 5
# f3(1000) 149.417636 150.529011 150.827393 151.02230 160.637845 5
# f4(1000) 7.872647 7.892395 7.901151 7.95077 8.049581 5
f1() (the approach below) is incredibly inefficient because of how often it calls data.frame and because growing objects that way is generally slow in R. f3() is much improved due to preallocation, but the data.frame structure itself might be part of the bottleneck here. f4() tries to bypass that bottleneck without compromising the approach you want to take.
Original answer
This is really not a good idea, but if you wanted to do it this way, I guess you can try:
for (i in 1:10) {
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
Note that in your code, there is one other problem:
You should use stringsAsFactors if you want the characters to not get converted to factors. Use: df = data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
Let's benchmark the three solutions proposed:
# use rbind
f1 <- function(n){
df <- data.frame(x = numeric(), y = character())
for(i in 1:n){
df <- rbind(df, data.frame(x = i, y = toString(i)))
}
df
}
# use list
f2 <- function(n){
df <- data.frame(x = numeric(), y = character(), stringsAsFactors = FALSE)
for(i in 1:n){
df[i,] <- list(i, toString(i))
}
df
}
# pre-allocate space
f3 <- function(n){
df <- data.frame(x = numeric(1000), y = character(1000), stringsAsFactors = FALSE)
for(i in 1:n){
df$x[i] <- i
df$y[i] <- toString(i)
}
df
}
system.time(f1(1000))
# user system elapsed
# 1.33 0.00 1.32
system.time(f2(1000))
# user system elapsed
# 0.19 0.00 0.19
system.time(f3(1000))
# user system elapsed
# 0.14 0.00 0.14
The best solution is to pre-allocate space (as intended in R). The next-best solution is to use list, and the worst solution (at least based on these timing results) appears to be rbind.
Suppose you simply don't know the size of the data.frame in advance. It can well be a few rows, or a few millions. You need to have some sort of container, that grows dynamically. Taking in consideration my experience and all related answers in SO I come with 4 distinct solutions:
rbindlist to the data.frame
Use data.table's fast set operation and couple it with manually doubling the table when needed.
Use RSQLite and append to the table held in memory.
data.frame's own ability to grow and use custom environment (which has reference semantics) to store the data.frame so it will not be copied on return.
Here is a test of all the methods for both small and large number of appended rows. Each method has 3 functions associated with it:
create(first_element) that returns the appropriate backing object with first_element put in.
append(object, element) that appends the element to the end of the table (represented by object).
access(object) gets the data.frame with all the inserted elements.
rbindlist to the data.frame
That is quite easy and straight-forward:
create.1<-function(elems)
{
return(as.data.table(elems))
}
append.1<-function(dt, elems)
{
return(rbindlist(list(dt, elems),use.names = TRUE))
}
access.1<-function(dt)
{
return(dt)
}
data.table::set + manually doubling the table when needed.
I will store the true length of the table in a rowcount attribute.
create.2<-function(elems)
{
return(as.data.table(elems))
}
append.2<-function(dt, elems)
{
n<-attr(dt, 'rowcount')
if (is.null(n))
n<-nrow(dt)
if (n==nrow(dt))
{
tmp<-elems[1]
tmp[[1]]<-rep(NA,n)
dt<-rbindlist(list(dt, tmp), fill=TRUE, use.names=TRUE)
setattr(dt,'rowcount', n)
}
pos<-as.integer(match(names(elems), colnames(dt)))
for (j in seq_along(pos))
{
set(dt, i=as.integer(n+1), pos[[j]], elems[[j]])
}
setattr(dt,'rowcount',n+1)
return(dt)
}
access.2<-function(elems)
{
n<-attr(elems, 'rowcount')
return(as.data.table(elems[1:n,]))
}
SQL should be optimized for fast record insertion, so I initially had high hopes for RSQLite solution
This is basically copy&paste of Karsten W. answer on similar thread.
create.3<-function(elems)
{
con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::dbWriteTable(con, 't', as.data.frame(elems))
return(con)
}
append.3<-function(con, elems)
{
RSQLite::dbWriteTable(con, 't', as.data.frame(elems), append=TRUE)
return(con)
}
access.3<-function(con)
{
return(RSQLite::dbReadTable(con, "t", row.names=NULL))
}
data.frame's own row-appending + custom environment.
create.4<-function(elems)
{
env<-new.env()
env$dt<-as.data.frame(elems)
return(env)
}
append.4<-function(env, elems)
{
env$dt[nrow(env$dt)+1,]<-elems
return(env)
}
access.4<-function(env)
{
return(env$dt)
}
The test suite:
For convenience I will use one test function to cover them all with indirect calling. (I checked: using do.call instead of calling the functions directly doesn't makes the code run measurable longer).
test<-function(id, n=1000)
{
n<-n-1
el<-list(a=1,b=2,c=3,d=4)
o<-do.call(paste0('create.',id),list(el))
s<-paste0('append.',id)
for (i in 1:n)
{
o<-do.call(s,list(o,el))
}
return(do.call(paste0('access.', id), list(o)))
}
Let's see the performance for n=10 insertions.
I also added a 'placebo' functions (with suffix 0) that don't perform anything - just to measure the overhead of the test setup.
r<-microbenchmark(test(0,n=10), test(1,n=10),test(2,n=10),test(3,n=10), test(4,n=10))
autoplot(r)
For 1E5 rows (measurements done on Intel(R) Core(TM) i7-4710HQ CPU # 2.50GHz):
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
It looks like the SQLite-based sulution, although regains some speed on large data, is nowhere near data.table + manual exponential growth. The difference is almost two orders of magnitude!
Summary
If you know that you will append rather small number of rows (n<=100), go ahead and use the simplest possible solution: just assign the rows to the data.frame using bracket notation and ignore the fact that the data.frame is not pre-populated.
For everything else use data.table::set and grow the data.table exponentially (e.g. using my code).
Update with purrr, tidyr & dplyr
As the question is already dated (6 years), the answers are missing a solution with newer packages tidyr and purrr. So for people working with these packages, I want to add a solution to the previous answers - all quite interesting, especially .
The biggest advantage of purrr and tidyr are better readability IMHO.
purrr replaces lapply with the more flexible map() family,
tidyr offers the super-intuitive method add_row - just does what it says :)
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
This solution is short and intuitive to read, and it's relatively fast:
system.time(
map_df(1:1000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
0.756 0.006 0.766
It scales almost linearly, so for 1e5 rows, the performance is:
system.time(
map_df(1:100000, function(x) { df %>% add_row(x = x, y = toString(x)) })
)
user system elapsed
76.035 0.259 76.489
which would make it rank second right after data.table (if your ignore the placebo) in the benchmark by #Adam Ryczkowski:
nr function time
4 data.frame 228.251
3 sqlite 133.716
2 data.table 3.059
1 rbindlist 169.998
0 placebo 0.202
A more generic solution for might be the following.
extendDf <- function (df, n) {
withFactors <- sum(sapply (df, function(X) (is.factor(X)) )) > 0
nr <- nrow (df)
colNames <- names(df)
for (c in 1:length(colNames)) {
if (is.factor(df[,c])) {
col <- vector (mode='character', length = nr+n)
col[1:nr] <- as.character(df[,c])
col[(nr+1):(n+nr)]<- rep(col[1], n) # to avoid extra levels
col <- as.factor(col)
} else {
col <- vector (mode=mode(df[1,c]), length = nr+n)
class(col) <- class (df[1,c])
col[1:nr] <- df[,c]
}
if (c==1) {
newDf <- data.frame (col ,stringsAsFactors=withFactors)
} else {
newDf[,c] <- col
}
}
names(newDf) <- colNames
newDf
}
The function extendDf() extends a data frame with n rows.
As an example:
aDf <- data.frame (l=TRUE, i=1L, n=1, c='a', t=Sys.time(), stringsAsFactors = TRUE)
extendDf (aDf, 2)
# l i n c t
# 1 TRUE 1 1 a 2016-07-06 17:12:30
# 2 FALSE 0 0 a 1970-01-01 01:00:00
# 3 FALSE 0 0 a 1970-01-01 01:00:00
system.time (eDf <- extendDf (aDf, 100000))
# user system elapsed
# 0.009 0.002 0.010
system.time (eDf <- extendDf (eDf, 100000))
# user system elapsed
# 0.068 0.002 0.070
Lets take a vector 'point' which has numbers from 1 to 5
point = c(1,2,3,4,5)
if we want to append a number 6 anywhere inside the vector then below command may come handy
i) Vectors
new_var = append(point, 6 ,after = length(point))
ii) columns of a table
new_var = append(point, 6 ,after = length(mtcars$mpg))
The command append takes three arguments:
the vector/column to be modified.
value to be included in the modified vector.
a subscript, after which the values are to be appended.
simple...!!
Apologies in case of any...!
My solution is almost the same as the original answer but it doesn't worked for me.
So, I gave names for the columns and it works:
painel <- rbind(painel, data.frame("col1" = xtweets$created_at,
"col2" = xtweets$text))

R: Tabulations and insertions with data.table

I am trying to take a very large set of records with multiple indices, calculate an aggregate statistic on groups determined by a subset of the indices, and then insert that into every row in the table. The issue here is that these are very large tables - over 10M rows each.
Code for reproducing the data is below.
The basic idea is that there are a set of indices, say ix1, ix2, ix3, ..., ixK. Generally, I am choosing only a couple of them, say ix1 and ix2. Then, I calculate an aggregation of all the rows with matching ix1 and ix2 values (over all combinations that appear), for a column called val. To keep it simple, I'll focus on a sum.
I have tried the following methods
Via sparse matrices: convert the values to a coordinate list, i.e. (ix1, ix2, val), then create a sparseMatrix - this nicely sums up everything, and then I need only convert back from the sparse matrix representation to the coordinate list. Speed: good, but it is doing more than is necessary and it doesn't generalize to higher dimensions (e.g. ix1, ix2, ix3) or more general functions than a sum.
Use of lapply and split: by creating a new index that is unique for all (ix1, ix2, ...) n-tuples, I can then use split and apply. The bad thing here is that the unique index is converted by split into a factor, and this conversion is terribly time consuming. Try system({zz <- as.factor(1:10^7)}).
I'm now trying data.table, via a command like sumDT <- DT[,sum(val),by = c("ix1","ix2")]. However, I don't yet see how I can merge sumDT with DT, other than via something like DT2 <- merge(DT, sumDT, by = c("ix1","ix2"))
Is there a faster method for this data.table join than via the merge operation I've described?
[I've also tried bigsplit from the bigtabulate package, and some other methods. Anything that converts to a factor is pretty much out - as far as I can tell, that conversion process is very slow.]
Code to generate data. Naturally, it's better to try a smaller N to see that something works, but not all methods scale very well for N >> 1000.
N <- 10^7
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DF <- data.frame(ix1 = ix1, ix2 = ix2, ix3 = ix3, val = val)
DF <- DF[order(DF[,1],DF[,2],DF[,3]),]
DT <- as.data.table(DF)
Well, it's possible you'll find that doing the merge isn't so bad as long as your keys are properly set.
Let's setup the problem again:
N <- 10^6 ## not 10^7 because RAM is tight right now
set.seed(2011)
ix1 <- 1 + floor(rexp(N, 0.01))
ix2 <- 1 + floor(rexp(N, 0.01))
ix3 <- 1 + floor(rexp(N, 0.01))
val <- runif(N)
DT <- data.table(ix1=ix1, ix2=ix2, ix3=ix3, val=val, key=c("ix1", "ix2"))
Now you can calculate your summary stats
info <- DT[, list(summary=sum(val)), by=key(DT)]
And merge the columns "the data.table way", or just with merge
m1 <- DT[info] ## the data.table way
m2 <- merge(DT, info) ## if you're just used to merge
identical(m1, m2)
[1] TRUE
If either of those ways of merging is too slow, you can try a tricky way to build info at the cost of memory:
info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)]
m3 <- transform(DT, summary=info2$summary)
identical(m1, m3)
[1] TRUE
Now let's see the timing:
#######################################################################
## Using data.table[ ... ] or merge
system.time(info <- DT[, list(summary=sum(val)), by=key(DT)])
user system elapsed
0.203 0.024 0.232
system.time(DT[info])
user system elapsed
0.217 0.078 0.296
system.time(merge(DT, info))
user system elapsed
0.981 0.202 1.185
########################################################################
## Now the two parts of the last version done separately:
system.time(info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)])
user system elapsed
0.574 0.040 0.616
system.time(transform(DT, summary=info2$summary))
user system elapsed
0.173 0.093 0.267
Or you can skip the intermediate info table building if the following doesn't seem too inscrutable for your tastes:
system.time(m5 <- DT[ DT[, list(summary=sum(val)), by=key(DT)] ])
user system elapsed
0.424 0.101 0.525
identical(m5, m1)
# [1] TRUE

Resources