optimizing solution to find common third on large data set

optimizing solution to find common third on large data set - r

This is a follow up question to my previous question. I run into a problem to find a memory efficient solution to find a common third for my large data set (3.5 million groups and 6.2 million persons)
The proposed solution using the igraph package works fast for a normal sized data sets unfortunately runs into memory issues by creating a large matrix for bigger data sets. Similar issue comes up with my own solution using concatenated inner joins where the third inner join inflates the dataframe so my pc runs out of memory (16gb).
df.output <- inner_join(df,df, by='group' ) %>%
inner_join(.,df, by=c('person.y'='person')) %>%
inner_join(.,df, by=c('group.y'='group')) %>%
rename(person_in_common='person.y', pers1='person.x',pers2='person') %>%
select(pers1,pers2,person_in_common) %>%
filter(pers1!=pers2) %>%
distinct() %>%
filter(person_in_common!=pers1 & person_in_common!=pers2)
df.output[-3] <- t(apply(df.output[-3], 1,
FUN=function(x) sort(x, decreasing=FALSE)))
df.output <- unique(df.output)
Small data set example and expected output
df <- data.frame(group= c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"), stringsAsFactors = FALSE)
df
group person
1 a Tom
2 a Jerry
3 b Tom
4 b Anna
5 b Sam
6 c Nic
and expected result
df.output
pers1 pers2 person_in_common
1 Anna Jerry Tom
2 Jerry Sam Tom
3 Sam Tom Anna
4 Anna Tom Sam
6 Anna Sam Tom
I unfortunately don't have access to a machine with more ram and are also not really experienced with cloud computing, so I hope to make it work on my local pc. I would appreciate input how to optimize any of the solutions or an advise how to tackle the problem otherwise.
Edit 1
A dataframe which reflects my actual data size.
set.seed(33)
Data <- data.frame(group = sample(1:3700000, 14000000, replace=TRUE),
person = sample(1:6800000, 14000000,replace = TRUE))
Edit 2
My real data is a bit more complex in terms of larger groups and more persons per group as the example data. Consequently it gets more memory intense. I could not figure out how to simulate this kind of structure so following the real data for download:
Full person-group data

So, I managed to run this on your test data (I have 16 GB of RAM), but if you run this on your small example then you would see that it does not give the same results. I did not get why, but maybe you could hep me with that. So I will try to explain every step:
myFun <- function(dt) {
require(data.table)
# change the data do data.table:
setDT(dt)
# set key/order the data by group and person:
setkey(dt, group, person)
# I copy the initial data and change the name of soon to be merged column name to "p2"
# which represents person2
dta <- copy(dt)
setnames(dta, "person", "p2")
# the first merge using data.table:
dt1 <- dt[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
# now we remove rows where persons are the same:
dt1 <- dt1[person != p2] # remove equal persons
# and also we need to remove rows where person1 and person2 are the same,
# just in different order , example:
# 2: a Tom Jerry
# 3: a Jerry Tom
# is the same, if I get it right then you did this using apply in the end of code,
# but it would be much better if we could reduce data now
# also my approach will be much faster (we take pairwise min word to 2 column
# and max to the last):
l1 <- pmin(dt1[[2]], dt1[[3]])
l2 <- pmax(dt1[[2]], dt1[[3]])
set(dt1, j = 2L, value = l1)
set(dt1, j = 3L, value = l2)
# now lets clear memory and take unique rows of dt1:
rm(l1, l2, dt)
dt1 <- unique(dt1)
gc()
# change name for group column:
setnames(dta, "group", "g2")
# second merge:
dt2 <- dt1[dta, on = "p2", allow.cartesian = TRUE, nomatch = 0]
rm(dt1)
gc()
setnames(dta, "p2", "p3")
dt3 <- dt2[dta, on = "g2", allow.cartesian = TRUE, nomatch = 0] # third merge
rm(dt2)
gc()
dt3 <- dt3[p3 != p2 & p3 != person] # removing equal persons
gc()
dt3 <- dt3[, .(person, p2, p3)]
gc()
return(dt3[])
}
On Small data set example:
df <- data.frame(group = c("a","a","b","b","b","c"),
person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
stringsAsFactors = FALSE)
df
myFun(df)
# person p2 p3
# 1: Anna Tom Jerry
# 2: Sam Tom Jerry
# 3: Jerry Tom Anna
# 4: Sam Tom Anna
# 5: Jerry Tom Sam
# 6: Anna Tom Sam
# 7: Anna Sam Tom
Something similar to your result but not quite the same
Now with larger data:
set.seed(33)
N <- 10e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # 13.22 sek
rm(results)
gc()
And:
set.seed(33)
N <- 14e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # around 40 sek, but RAM does get used to max
Update:
Maybe you can try this splitting aproch, lets say with nparts 6-10?:
myFunNew3 <- function(dt, nparts = 2) {
require(data.table)
setDT(dt)
setkey(dt, group, person)
dta <- copy(dt)
# split into N parts
splits <- rep(1:nparts, each = ceiling(dt[, .N]/nparts))
set(dt, j = "splits", value = splits)
dtl <- split(dt, by = "splits", keep.by = F)
set(dt, j = "splits", value = NULL)
rm(splits)
gc()
i = 1
for (i in seq_along(dtl)) {
X <- copy(dtl[[i]])
setnames(dta, c("group", "person"))
X <- X[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
X <- X[person != i.person]
gc()
X <- X[dta, on = "person", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(dta, "group", "i.group")
X <- X[dta, on = "i.group", allow.cartesian = TRUE, nomatch = 0]
gc()
setnames(X, "i.person.1", "pers2")
setnames(X, "i.person", "pers1" )
setnames(X, "person", "person_in_common" )
X <- X[, .(pers1, pers2, person_in_common)]
gc()
X <- X[pers1 != pers2 & person_in_common != pers1 & person_in_common != pers2]
gc()
name1 <- "pers1"
name2 <- "pers2"
l1 <- pmin(X[[name1]], X[[name2]])
l2 <- pmax(X[[name1]], X[[name2]])
set(X, j = name1, value = l1)
set(X, j = name2, value = l2)
rm(l1, l2)
gc()
X <- unique(X)
gc()
if (i > 1) {
X1 <- rbindlist(list(X1, X), use.names = T, fill = T)
X1 <- unique(X1)
rm(X)
gc()
} else {
X1 <- copy(X)
}
dtl[[i]] <- 0L
gc()
}
rm(dta, dtl)
gc()
setkey(X1, pers1, pers2, person_in_common)
X1[]
}

Related

Data management: flatten data with R

I have the following dataframe gathering the evolution of policies:
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
date_new = c("20200101","20200115","20200304","20200110","20200215"),
date_end = c("20200503","20200608","20210101","20200403","20200503"),
expend = c("","A_001","A_002","",""))
which looks like that:
Id_policy date_new date_end expend
A_001 20200101 20200503
A_002 20200115 20200608 A_001
A_003 20200304 20210101 A_002
B_001 20200110 20200403
B_002 20200215 20200503
"Id_policy" refers to a specific policy, "date_new" the date of policy issuance, "date_end" the date of policy end. However, sometimes a policy is extended. When it is the case, a new policy is set and the variable "expend" provides the name of the previous policy it changes.
The idea here is to flatten the dataset so we only keep rows corresponding to different policies. So, the output would be something like this:
Id_policy date_new date_end expend
A_001 20200101 20210101
B_001 20200110 20200403
B_002 20200215 20200503
Has-someone faced a similar problem ?

One way is to treat this as a network problem and use igraph functions (related posts e.g. Make a group_indices based on several columns
; Fast way to group variables based on direct and indirect similarities in multiple columns).
Set the missing 'expend' to 'Id_policy'
Use graph_from_data_frame to create a graph, where 'expend' and 'Id_policy' columns are treated as an edge list.
Use components to get connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join membership to original data.
Grab relevant data grouped by membership.
I use data.table for the data wrangling steps, but this can of course also be done in base or dplyr.
library(data.table)
library(igraph)
setDT(Df)
Df[expend == "", expend := Id_policy]
g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
mem = components(g)$membership
Df[.(names(mem)), on = .(Id_policy), mem := mem]
Df[ , .(Id_policy = Id_policy[1],
date_new = first(date_new),
date_end = last(date_end), by = mem]
# mem Id_policy date_new date_end
# 1: 1 A_001 20200101 20210101
# 2: 2 B_001 20200110 20200403
# 3: 3 B_002 20200215 20200503

Here is a solution using igraph for creating a directed network of id's, and data.table to do some binding and joining.
I kept in between results to show what each step does.
library( data.table )
library( igraph )
setDT(Df)
#create nodes and links
nodes <- Df[,1:3]
links <- Df[ !expend == "", .(from = expend, to = Id_policy) ]
g = graph_from_data_frame( links, vertices = nodes, directed = TRUE )
plot(g)
#find nodes without incoming (these are startpoints of paths)
in.nodes <- V(g)[degree(g, mode = 'in') == 0]
#define sumcomponents of the graph by looping the in.nodes
L <- lapply( in.nodes, function(x) names( subcomponent(g, x) ) )
# $A_001
# [1] "A_001" "A_002" "A_003"
# $B_001
# [1] "B_001"
# $B_002
# [1] "B_002"
L2 <- lapply( L, function(x) {
#get first and last element
dt <- data.table( start = x[1], end = x[ length(x) ] )
})
#bind list together to a single data.table
ans <- rbindlist( L2, use.names = TRUE, fill = TRUE, idcol = "Id_policy" )
# Id_policy start end
# 1: A_001 A_001 A_003
# 2: B_001 B_001 B_001
# 3: B_002 B_002 B_002
#update ans with values from original Df for start and end
ans[ Df, `:=`( start = i.date_new ), on = .(start = Id_policy) ][]
ans[ Df, `:=`( end = i.date_end ), on = .(end = Id_policy) ][]
# Id_policy start end
# 1: A_001 20200101 20210101
# 2: B_001 20200110 20200403
# 3: B_002 20200215 20200503

An outer for loop to go through each policy id in Df with an inner while loop to find the last extension for an original policy should work
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
date_new = c("20200101","20200115","20200304","20200110","20200215"),
date_end = c("20200503","20200608","20210101","20200403","20200503"),
expend = c("","A_001","A_002","",""),
stringsAsFactors = F)
final_df <- data.frame(matrix(nrow = 0, ncol = 0), stringsAsFactors = F)
for (i in seq_len(nrow(Df))) {
# Check to see if the current policy ID is in the column expend
if (Df$Id_policy[i] %in% Df$expend || !Df$expend[i] == "") {
# Loop through expend policy until last one is found
found_last <- F
j <- i
end_date <- ""
c_policy_id <- Df$Id_policy[j]
expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]
if (length(expended_id) > 0) {
if (expended_id %in% Df$expend) {
while(!found_last) {
c_policy_id <- Df$Id_policy[j]
expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]
if (length(expended_id) > 0) {
if (expended_id %in% Df$expend) {
j <- which(Df$expend == expended_id)
}
}else{
end_date <- Df$date_end[j]
found_last <- T
}
}
if (!end_date == "") {
# Add to final df when found the last one
final_df <- bind_rows(final_df, data.frame(Id_policy = Df$Id_policy[i],
date_new = Df$date_new[i],
date_end = end_date,
expend = ""))
}
}
}
}else{
final_df <- bind_rows(final_df, Df[i, ])
}
}
final_df
Id_policy date_new date_end expend
1 A_001 20200101 20210101
2 B_001 20200110 20200403
3 B_002 20200215 20200503

Multiple functions on multiple columns by group, and create informative column names

How to adjust a data table manipulation so that, besides sum per category of several colums,
it would also calculate other functions at the same time such as mean and counts (.N) and automatically create column names: "sum c1" , "sum c2", "sum c4" , "mean c1", " mean c2", "mean c4" and preferably also 1 column "counts"?
My old solution was to write out
mean col1 = ....
mean col2 = ....
Etc, Inside the data.table command
Which worked, but horribly inefficient I think, and it won't work anymore to precode it if in the new app version, the calculations depend on user choices in an R Shiny app what to calculate for which columns.
I've read my way through a bunch of posts and blog articles but haven't quite figured out how to best do this. I read that in some cases the manipulation can become quite slow on large data tables depending on what approach you use (.sdcols, get, lapply, and or by =). Therefore I added a 'sizeable' dummy data set
My real data is around 100k rows by 100 columns and 1-100 groups roughly.
library(data.table)
n = 100000
dt = data.table(index=1:100000,
category = sample(letters[1:25], n, replace = T),
c1=rnorm(n,10000),
c2=rnorm(n,1000),
c3=rnorm(n,100),
c4 = rnorm(n,10)
)
# add more columns to test for big data tables
lapply(c(paste('c', 5:100, sep ='')),
function(addcol) dt[[addcol]] <<- rnorm(n,1000) )
# Simulate columns selected by shiny app user
Colchoice <- c("c1", "c4")
FunChoice <- c(".N", "mean", "sum")
# attempt which now does just one function and doesn't add names
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=Colchoice ]
Expected output is a row per group and a column for each function per each selected column.
Category Mean c1 Sum c1 Mean c4 ...
A
B
C
D
E
......
Possibly a duplicate but I haven't found the exact answer that I need

If I understand correctly, this question consists of two parts:
How to group and aggregate with multiple functions over a list of columns and generate new column names automatically.
How to pass the names of the functions as a character vector.
For part 1, this is nearly a duplicate of Apply multiple functions to multiple columns in data.table but with the additional requirement that the results should be grouped using by =.
Therefore, eddi's answer has to be modified by adding the parameter recursive = FALSE in the call to unlist():
my.summary = function(x) list(N = length(x), mean = mean(x), median = median(x))
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
category c1.N c1.mean c1.median c4.N c4.mean c4.median
1: f 3974 9999.987 9999.989 3974 9.994220 9.974125
2: w 4033 10000.008 9999.991 4033 10.004261 9.986771
3: n 4025 9999.981 10000.000 4025 10.003686 9.998259
4: x 3975 10000.035 10000.019 3975 10.010448 9.995268
5: k 3957 10000.019 10000.017 3957 9.991886 10.007873
6: j 4027 10000.026 10000.023 4027 10.015663 9.998103
...
For part 2, we need to create my.summary() from a character vector of function names. This can be achieved by "programming on the language", i.e, by assembling an expression as character string and finally parsing and evaluating it:
my.summary <-
sapply(FunChoice, function(f) paste0(f, "(x)")) %>%
paste(collapse = ", ") %>%
sprintf("function(x) setNames(list(%s), FunChoice)", .) %>%
parse(text = .) %>%
eval()
my.summary
function(x) setNames(list(length(x), mean(x), sum(x)), FunChoice)
<environment: 0xe376640>
Alternatively, we can loop over the categories and rbind() the results afterwards:
library(magrittr) # used only to improve readability
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
Benchmark
So far, 4 data.table and one dplyr solutions have been posted. At least one of the answers claims to be "superfast". So, I wanted to verify by a benchmark with varying number of rows:
library(data.table)
library(magrittr)
bm <- bench::press(
n = 10L^(2:6),
{
set.seed(12212018)
dt <- data.table(
index = 1:n,
category = sample(letters[1:25], n, replace = T),
c1 = rnorm(n, 10000),
c2 = rnorm(n, 1000),
c3 = rnorm(n, 100),
c4 = rnorm(n, 10)
)
# use set() instead of <<- for appending additional columns
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n, 1000))
tables()
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.summary <- function(x) list(length = length(x), mean = mean(x), sum = sum(x))
bench::mark(
unlist = {
dt[, unlist(lapply(.SD, my.summary), recursive = FALSE),
.SDcols = ColChoice, by = category]
},
loop_category = {
lapply(dt[, unique(category)],
function(x) dt[category == x,
c(.(category = x), unlist(lapply(.SD, my.summary))),
.SDcols = ColChoice]) %>%
rbindlist()
},
dcast = {
dcast(dt, category ~ 1, fun = list(length, mean, sum), value.var = ColChoice)
},
loop_col = {
lapply(ColChoice, function(col)
dt[, setNames(lapply(FunChoice, function(f) get(f)(get(col))),
paste0(col, "_", FunChoice)),
by=category]
) %>%
Reduce(function(x, y) merge(x, y, by="category"), .)
},
dplyr = {
dt %>%
dplyr::group_by(category) %>%
dplyr::summarise_at(dplyr::vars(ColChoice), .funs = setNames(FunChoice, FunChoice))
},
check = function(x, y)
all.equal(setDT(x)[order(category)],
setDT(y)[order(category)] %>%
setnames(stringr::str_replace(names(.), "_", ".")),
ignore.col.order = TRUE,
check.attributes = FALSE
)
)
}
)
The results are easier to compare when plotted:
library(ggplot2)
autoplot(bm)
Please, note the logarithmic time scale.
For this test case, the unlist approach is always the fastest method, followed by dcast. dplyr is catching up for larger problem sizes n. Both lapply/loop approaches are less performant. In particular, Parfait's approach to loop over the columns and merge subresults afterwards seems to be rather sensitive to problem sizes n.
Edit: 2nd benchmark
As suggested by jangorecki, I have repeated the benchmark with much more rows and also with a varying number of groups.
Due to memory limitations, the largest problem size is 10 M rows times 102 columns which takes 7.7 GBytes of memory.
So, the first part of the benchmark code is modified to
bm <- bench::press(
n_grp = 10^(1:3),
n_row = 10L^seq(3, 7, by = 2),
{
set.seed(12212018)
dt <- data.table(
index = 1:n_row,
category = sample(n_grp, n_row, replace = TRUE),
c1 = rnorm(n_row),
c2 = rnorm(n_row),
c3 = rnorm(n_row),
c4 = rnorm(n_row, 10)
)
for (i in 5:100) set(dt, , paste0("c", i), rnorm(n_row, 1000))
tables()
...
As expected by jangorecki, some solutions are more sensitive to the number of groups than others. In particular, performance of loop_category is degrading much stronger with the number of groups while dcast seems to be less affected. For fewer groups, the unlist approach is always faster than dcast while for many groups dcast is faster. However, for larger problem sizes unlist seems to be ahead of dcast.
Edit 2019-03-12: Computing on the language, 3rd benchmark
Inspired by this follow-up question, I have have added a computing on the language approach where the whole expression is created as character string, parsed and evaluated.
The expression is created by
library(magrittr)
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
my.expression <- CJ(ColChoice, FunChoice, sorted = FALSE)[
, sprintf("%s.%s = %s(%s)", V1, V2, V2, V1)] %>%
paste(collapse = ", ") %>%
sprintf("dt[, .(%s), by = category]", .) %>%
parse(text = .)
my.expression
expression(dt[, .(c1.length = length(c1), c1.mean = mean(c1), c1.sum = sum(c1),
c4.length = length(c4), c4.mean = mean(c4), c4.sum = sum(c4)), by = category])
This is then evaluated by
eval(my.expression)
which yields
category c1.length c1.mean c1.sum c4.length c4.mean c4.sum
1: f 3974 9999.987 39739947 3974 9.994220 39717.03
2: w 4033 10000.008 40330032 4033 10.004261 40347.19
3: n 4025 9999.981 40249924 4025 10.003686 40264.84
4: x 3975 10000.035 39750141 3975 10.010448 39791.53
5: k 3957 10000.019 39570074 3957 9.991886 39537.89
6: j 4027 10000.026 40270106 4027 10.015663 40333.07
...
I have modified the code of the 2nd benchmark to include this approach but had to reduce the additional columns from 100 to 25 in order to cope with the memory limitations of a much smaller PC. The chart shows that the "eval" approach is almost always the fastest or second:

Here's a data.table answer:
funs_list <- lapply(FunChoice, as.symbol)
dcast(dt, category~1, fun=eval(funs_list), value.var = Colchoice)
It's super fast and does what you want.

Consider building a list of data tables where you iterate through every ColChoice and apply each function of FuncChoice (setting names accordingly). Then, to merge all data tables together, run merge in a Reduce call. Also, use get to retrieve environment objects (functions/columns).
Note: ColChoice was renamed for camel case and length function replaces .N for functional form for count:
set.seed(12212018) # RUN BEFORE data.table() BUILD TO REPRODUCE OUTPUT
...
ColChoice <- c("c1", "c4")
FunChoice <- c("length", "mean", "sum")
output <- lapply(ColChoice, function(col)
dt[, setNames(lapply(FunChoice, function(f) get(f)(get(col))),
paste0(col, "_", FunChoice)),
by=category]
)
final_dt <- Reduce(function(x, y) merge(x, y, by="category"), output)
head(final_dt)
# category c1_length c1_mean c1_sum c4_length c4_mean c4_sum
# 1: a 3893 10000.001 38930003 3893 9.990517 38893.08
# 2: b 4021 10000.028 40210113 4021 9.977178 40118.23
# 3: c 3931 10000.008 39310030 3931 9.996538 39296.39
# 4: d 3954 10000.010 39540038 3954 10.004578 39558.10
# 5: e 4016 9999.998 40159992 4016 10.002131 40168.56
# 6: f 3974 9999.987 39739947 3974 9.994220 39717.03

It seems that there's not a straightforward answer using data.table since noone has answered this yet. So I'll propose a dplyr-based answer that should do what you want. I use the built-in iris data set for the example:
library(dplyr)
iris %>%
group_by(Species) %>%
summarise_at(vars(Sepal.Length, Sepal.Width), .funs = c(sum=sum,mean= mean), na.rm=TRUE)
## A tibble: 3 x 5
# Species Sepal.Length_sum Sepal.Width_sum Sepal.Length_mean Sepal.Width_mean
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 245. 171. 5.00 3.43
#2 versicolor 297. 138. 5.94 2.77
#3 virginica 323. 149. 6.60 2.97
or using character vector input for the columns and functions:
Colchoice <- c("Sepal.Length", "Sepal.Width")
FunChoice <- c("mean", "sum")
iris %>%
group_by(Species) %>%
summarise_at(vars(Colchoice), .funs = setNames(FunChoice, FunChoice), na.rm=TRUE)
## A tibble: 3 x 5
# Species Sepal.Length_mean Sepal.Width_mean Sepal.Length_sum Sepal.Width_sum
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 setosa 5.00 3.43 245. 171.
#2 versicolor 5.94 2.77 297. 138.
#3 virginica 6.60 2.97 323. 149.

If the summary statistics you need to compute are things like mean, .N, and (perhaps) median, which data.table optimizes into c code across the by, you may have faster performance if you convert the table into long form so that you can do the computations in a way that data table can optimize them:
> library(data.table)
> n = 100000
> dt = data.table(index=1:100000,
category = sample(letters[1:25], n, replace = T),
c1=rnorm(n,10000),
c2=rnorm(n,1000),
c3=rnorm(n,100),
c4 = rnorm(n,10)
)
> {lapply(c(paste('c', 5:100, sep ='')), function(addcol) dt[[addcol]] <<- rnorm(n,1000) ); dt}
> Colchoice <- c("c1", "c4")
> dt[, .SD
][, c('index', 'category', Colchoice), with=F
][, melt(.SD, id.vars=c('index', 'category'))
][, mean := mean(value), .(category, variable)
][, median := median(value), .(category, variable)
][, N := .N, .(category, variable)
][, value := NULL
][, index := NULL
][, unique(.SD)
][, dcast(.SD, category ~ variable, value.var=c('mean', 'median', 'N')
]
category mean_c1 mean_c4 median_c1 median_c4 N_c1 N_c4
1: a 10000 10.021 10000 10.041 4128 4128
2: b 10000 10.012 10000 10.003 3942 3942
3: c 10000 10.005 10000 9.999 3926 3926
4: d 10000 10.002 10000 10.007 4046 4046
5: e 10000 9.974 10000 9.993 4037 4037
6: f 10000 10.025 10000 10.015 4009 4009
7: g 10000 9.994 10000 9.998 4012 4012
8: h 10000 10.007 10000 9.986 3950 3950
...

Fastest way to filter a data.frame list column contents in R / Rcpp

I have a data.frame:
df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b",
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")
with a list column (each with a character vector):
> str(df)
'data.frame': 3 obs. of 2 variables:
$ id : int 1 2 3
$ vars:List of 3
..$ : chr "a"
..$ : chr "a" "b" "c"
..$ : chr "b" "c"
I want to filter the data.frame according to setdiff(vars,remove_this)
library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))
which gets me this:
> res
id vars
1 1
2 2 b, c
3 3 b, c
But to get drop the character(0) vars I have to do something like:
res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...
Actual datasets:
560K rows and 3800K rows that also have 10 more columns (to carry along).
(this is quite slow, which leads to question...)
What is the Fastest way to do this in R?
Is there a dplyr/ data.table/ other faster method?
How to do this with Rcpp?
UPDATE/EXTENSION:
can the column modification be done in place rather then by copying the lapply(vars,setdiff(... result?
what's the most efficient way to filter out for vars == character(0) if it must be a seperate step.

Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:
library(data.table)
dt = as.data.table(df) # or use setDT to convert in place
dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
# id vars newcol
#1: 2 a,b,c b,c
#2: 3 b,c b,c
You can also delete the original column (with basically 0 cost), by adding [, vars := NULL] at the end). Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
# id vars
#1: 2 b,c
#2: 3 b,c
To carry along the other columns, I think it's easiest to simply merge back:
dt[, othercol := 5:7]
# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
# id vars i.vars othercol
#1: 2 b,c a,b,c 6
#2: 3 b,c b,c 7

Here's another way:
# prep
DT <- data.table(df)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
get_badkeys <- function(x)
unlist(sapply(1:length(x),function(n) combn(sort(x),n,paste0,collapse="_")))
# choose values to exclude
baduns <- c("a","b")
# subset
DT[!J(get_badkeys(baduns))]
This is fairly fast, but it takes up your key.
Benchmarks. Here's a made-up example:
Candidates:
hannahh <- function(df,baduns){
df %>%
mutate(vars = lapply(.$vars, setdiff, baduns)) %>%
filter(!!sapply(vars,length))
}
eddi <- function(df,baduns){
dt = as.data.table(df)
dt[,
unlist(vars)
, by = id][!V1 %in% baduns,
.(vars = list(V1))
, keyby = id][dt, nomatch = 0]
}
stevenb <- function(df,baduns){
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, baduns)) %>%
mutate(length = length(newcol)) %>%
ungroup() %>%
filter(length > 0)
}
frank <- function(df,baduns){
DT <- data.table(df)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
DT[!J(get_badkeys(baduns))]
}
Simulation:
nvals <- 4
nbads <- 2
maxlen <- 4
nobs <- 1e4
exdf <- data.table(
id=1:nobs,
vars=replicate(nobs,list(sample(valset,sample(maxlen,1))))
)
setDF(exdf)
baduns <- valset[1:nbads]
Results:
system.time(frank_res <- frank(exdf,baduns))
# user system elapsed
# 0.24 0.00 0.28
system.time(hannahh_res <- hannahh(exdf,baduns))
# 0.42 0.00 0.42
system.time(eddi_res <- eddi(exdf,baduns))
# 0.05 0.00 0.04
system.time(stevenb_res <- stevenb(exdf,baduns))
# 36.27 55.36 93.98
Checks:
identical(sort(frank_res$id),eddi_res$id) # TRUE
identical(unlist(stevenb_res$id),eddi_res$id) # TRUE
identical(unlist(hannahh_res$id),eddi_res$id) # TRUE
Discussion:
For eddi() and hannahh(), the results scarcely change with nvals, nbads and maxlen. In contrast, when baduns goes over 20, frank() becomes incredibly slow (like 20+ sec); it also scales up with nbads and maxlen a little worse than the other two.
Scaling up nobs, eddi()'s lead over hannahh() stays the same, at about 10x. Against frank(), it sometimes shrinks and sometimes stays the same. In the best nobs = 1e5 case for frank(), eddi() is still 3x faster.
If we switch from a valset of characters to something that frank() must coerce to a character for its by-row paste0 operation, both eddi() and hannahh() beat it as nobs grows.
Benchmarks for doing this repeatedly. This is probably obvious, but if you have to do this "many" times (...how many is hard to say), it's better to create the key column than to go through the subsetting for each set of baduns. In the simulation above, eddi() is about 5x as fast as frank(), so I'd go for the latter if I was doing this subsetting 10+ times.
maxbadlen <- 2
set_o_baduns <- replicate(10,sample(valset,size=sample(maxbadlen,1)))
system.time({
DT <- data.table(exdf)
DT[,vstr:=paste0(sort(unlist(vars)),collapse="_"),by=1:nrow(DT)]
setkey(DT,vstr)
for (i in 1:10) DT[!J(get_badkeys(set_o_baduns[[i]]))]
})
# user system elapsed
# 0.29 0.00 0.29
system.time({
dt = as.data.table(exdf)
for (i in 1:10) dt[,
unlist(vars), by = id][!V1 %in% set_o_baduns[[i]],
.(vars = list(V1)), keyby = id][dt, nomatch = 0]
})
# user system elapsed
# 0.39 0.00 0.39
system.time({
for (i in 1:10) hannahh(exdf,set_o_baduns[[i]])
})
# user system elapsed
# 4.10 0.00 4.13
So, as expected, frank() takes very little time for additional evaluations, while eddi() and hannahh() grow linearly.

Here's another idea:
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
mutate(length = length(newcol)) %>%
ungroup()
Which gives:
# id vars newcol length
#1 1 a 0
#2 2 a, b, c b, c 2
#3 3 b, c b, c 2
You could then filter on length > 0 to keep only non-empty newcol
df %>%
rowwise() %>%
do(id = .$id, vars = .$vars, newcol = setdiff(.$vars, "a")) %>%
mutate(length = length(newcol)) %>%
ungroup() %>%
filter(length > 0)
Which gives:
# id vars newcol length
#1 2 a, b, c b, c 2
#2 3 b, c b, c 2
Note: As mentioned by #Arun in the comments, this approach is quite slow. You are better off with the data.table solutions.

Perform a semi-join with data.table

How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:
x <- data.table(x = 1:2, y = c("a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
x[y]
# x y z
# 1: 1 a 10
# 2: 1 a 11
A semi-join would return just x[1]

More possibilities :
w = unique(x[y,which=TRUE]) # the row numbers in x which have a match from y
x[w]
If there are duplicate key values in x, then that needs :
w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]
Or, the other way around :
setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]
If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.
But the anti anti join appeals too :-)

One solution I can think of is:
tmp <- x[!y]
x[!tmp]
In data.table, you can have another data table as an i expression (i.e., the first expression in the data.table.[ call), and that will perform a join, e.g.:
x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)
> x[y]
x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4
The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of data.table documentation. So the first assignments evaluates to a subset of x that doesn't have any rows where the key (column x) is present in y:
> x[!y]
x y
1: 2 b
2: 4 d
3: 6 f
4: 7 g
5: 8 h
6: 9 i
7: 10 j
It is similar to setdiff in this regard. And therefore the second statement returns all the rows in x where the key is present in y.
The ! feature was added in data.table 1.8.4 with the following note in NEWS:
o A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
DT[-DT["a", which=TRUE, nomatch=0]] # old not-join idiom, still works
DT[!"a"] # same result, now preferred.
DT[!J(6),...] # !J == not-join
DT[!2:3,...] # ! on all types of i
DT[colA!=6L | colB!=23L,...] # multiple vector scanning approach (slow)
DT[!J(6L,23L)] # same result, faster binary search
'!' has been used rather than '-' :
* to match the 'not-join'/'not-where' nomenclature
* with '-', DT[-0] would return DT rather than DT[0] and not be backwards
compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
base R) and after this new feature.
* to leave DT[+J...] and DT[-J...] available for future use
For some reason, the following doesn't work x[!(x[!y])] - probably data.table is too smart about parsing the argument.
P.S. As Josh O'Brien pointed in another answer, a one-line would be x[!eval(x[!y])].

I'm confused with all the not-joins above, isn't what you want simply:
unique(x[y, .SD])
# x y
#1: 1 a
If x can have duplicate keys, then you can unique y instead:
## Creating an example data.table 'a' three-times-repeated first row
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
setkey(y, x)
x[eval(unique(y, by = key(y))), .SD] # data.table >= 1.9.8 requires by=key(y)
# x y
# 1: 1 a
# 2: 1 a
# 3: 1 a

Update. Based on all the discussion here, I would do something like this, which should be fast and work in the most general case:
x[eval(unique(y[, key(x), with = FALSE]))]
Here is another, more direct solution:
unique(x[eval(y$x)])
It's more direct and runs faster - here is the comparison in run times with my previous solution:
# Generate some large data
N <- 1000000 * 26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
system.time(r1 <- x[!eval(x[!y])])
user system elapsed
7.772 1.217 11.998
system.time(r2 <- unique(x[eval(y$x)]))
user system elapsed
0.540 0.142 0.723
In a more general case, you can do something like
x[eval(y[, key(x), with = FALSE])]

I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.
sJ <- function(x,y){
ycols <- 1:min(ncol(y),length(key(x)))
yjoin <- unique(y[, ..ycols])
yjoin
}
x[eval(sJ(x,y))]
For Victor's simpler example, this gives the desired output:
x y
1: 1 a
2: 3 c
3: 5 e
This is a ~30% slower than Victor's way.
EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:
N <- 1e5*26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE), z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
require(microbenchmark)
microbenchmark(
sJ=x[eval(sJ(x,y))],
dolla=unique(x[eval(y$x)]),
brack=x[eval(unique(y[['x']]))]
)
Unit: milliseconds
expr min lq median uq max neval
# sJ 120.22700 125.04900 126.50704 132.35326 217.6566 100
# dolla 105.05373 108.33804 109.16249 118.17613 285.9814 100
# brack 53.95656 61.32669 61.88227 65.21571 235.8048 100
I'm guessing the [[ vs $ doesn't help the speed, but didn't check.

This thread is so old. But I noticed that the solution can be easily derived from the definition of semi-join given in the original post:
"A semi-join is like an inner join except that it only returns the
columns of X (not also those of Y), and does not repeat the rows of X
to match the rows of Y"
library(data.table)
dt1 <- data.table(ProdId = 1:4,
Product = c("Bread", "Cheese", "Pizza", "Butter"))
dt2 <- data.table(ProdId = c(1, 1, 3, 4, 5),
Company = c("A", "B", "C", "D", "E"))
# semi-join
unique(merge(dt1, dt2, on="ProdId")[, names(dt1), with=F])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
I've simply applied the syntax of inner-join, followed by filtering columns from first table only, with unique() to remove rows of first table which were repeated to match rows of second table.
Edit: The above approach will match dplyr::semi_join() output only if we have unique rows in the first table. If we need to output all the rows including duplicates from first table, then we may use fsetdiff() method shown below.
Another one line data.table solution:
fsetdiff(dt1, dt1[!dt2, on="ProdId"])
ProdId Product
1: 1 Bread
2: 3 Pizza
3: 4 Butter
I've just removed from first table the anti-join of first and second. Seems simpler to me. If the first table has duplicate rows, we will need:
fsetdiff(dt1, dt1[!dt2, on="ProdId"], all=T)
The fsetdiff() result with ,all=T matches the output from dplyr:
dplyr::semi_join(dt1, dt2, by="ProdId")
ProdId Product
1 1 Bread
2 3 Pizza
3 4 Butter
Using another set of data taken from one of previous posts:
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
y <- data.table(x = c(1, 1), z = 10:11)
With dplyr:
dplyr::semi_join(x, y, by="x")
x y
1 1 a
2 1 a
3 1 a
With data.table:
fsetdiff(x, x[!y, on="x"], all=T)
x y
1: 1 a
2: 1 a
3: 1 a
Without ,all=T, the duplicate rows are removed:
fsetdiff(x, x[!y, on="x"])
x y
1: 1 a

The package dplyr supports the following four join types:
inner_join, left_join, semi_join, anti_join
So for the semi-join try the following code
library("dplyr")
table1 <- data.table(x = 1:2, y = c("a", "b"))
table2 <- data.table(x = c(1, 1), z = 10:11)
semi_join(table1, table2)
The output is as expected:
# Joining by: "x"
# Source: local data table [1 x 2]
#
# x y
# (int) (chr)
# 1 1 a

Try the following:
w <- y[,unique(x)]
x[x %in% w]
Output will be:
x y
1: 1 a

reshape wide to long with character suffixes instead of numeric suffixes

Inspired by a comment from #gsk3 on a question about reshaping data, I started doing a little bit of experimentation with reshaping data where the variable names have character suffixes instead of numeric suffixes.
As an example, I'll load the dadmomw dataset from one of the UCLA ATS Stata learning webpages (see "Example 4" on the webpage).
Here's what the dataset looks like:
library(foreign)
dadmom <- read.dta("https://stats.idre.ucla.edu/stat/stata/modules/dadmomw.dat")
dadmom
# famid named incd namem incm
# 1 1 Bill 30000 Bess 15000
# 2 2 Art 22000 Amy 18000
# 3 3 Paul 25000 Pat 50000
When trying to reshape from this wide format to long, I run into a problem. Here's what I do to reshape the data.
reshape(dadmom, direction="long", idvar=1, varying=2:5,
sep="", v.names=c("name", "inc"), timevar="dadmom",
times=c("d", "m"))
# famid dadmom name inc
# 1.d 1 d 30000 Bill
# 2.d 2 d 22000 Art
# 3.d 3 d 25000 Paul
# 1.m 1 m 15000 Bess
# 2.m 2 m 18000 Amy
# 3.m 3 m 50000 Pat
Note the swapped column names for "name" and "inc"; changing v.names to c("inc", "name") doesn't solve the problem.
reshape seems very picky about wanting the columns to be named in a fairly standard way. For example, I can reshape the data correctly (and easily) if I first rename the columns:
dadmom2 <- dadmom # Just so we can continue experimenting with the original data
# Change the names of the last four variables to include a "."
names(dadmom2)[2:5] <- gsub("(d$|m$)", "\\.\\1", names(dadmom2)[2:5])
reshape(dadmom2, direction="long", idvar=1, varying=2:5,
timevar="dadmom")
# famid dadmom name inc
# 1.d 1 d Bill 30000
# 2.d 2 d Art 22000
# 3.d 3 d Paul 25000
# 1.m 1 m Bess 15000
# 2.m 2 m Amy 18000
# 3.m 3 m Pat 50000
My questions are:
Why is R swapping the columns in the example I've provided?
Can I get to this result with base R reshape without changing the variable names before reshaping?
Are there other approaches that could be considered instead of reshape?

This works (to specify to varying what columns go with who):
reshape(dadmom, direction="long", varying=list(c(2, 4), c(3, 5)),
sep="", v.names=c("name", "inc"), timevar="dadmom",
times=c("d", "m"))
So you actually have nested repeated measures here; both name and inc for mom and dad. Because you have more than one series of repeated measures you have to supply a list to varying that tells reshape which group gets stacked on the other group.
So the two approaches to this problem are to provide a list as I did or to rename the columns the way the R beast likes them as you did.
See my recent blogs on base reshape for more on this (particularly the second link deals with this):
reshape (part I)
reshape (part II)

Though this question was specifically about base R, it is useful to know other approaches that help you to achieve the same type of outcome.
One alternative to reshape or merged.stack would be to use a combination of "dplyr" and "tidry", like this:
dadmom %>%
gather(variable, value, -famid) %>% ## Make the entire dataset long
separate(variable, into = c("var", "time"), ## Split "variable" column into two...
sep = "(?<=name|inc)", perl = TRUE) %>% ## ... using regex to split the values
spread(var, value, convert = TRUE) ## Make result wide, converting type
# famid time inc name
# 1 1 d 30000 Bill
# 2 1 m 15000 Bess
# 3 2 d 22000 Art
# 4 2 m 18000 Amy
# 5 3 d 25000 Paul
# 6 3 m 50000 Pat
Another alternative would be to use melt from "data.table", like this:
library(data.table)
melt(as.data.table(dadmom), ## melt here requres a data.table
measure = patterns("name", "inc"), ## identify columns by patterns
value.name = c("name", "inc"))[ ## specify the resulting variable names
## melt creates a numeric "variable" value. Replace with factored labels
, variable := factor(variable, labels = c("d", "m"))][]
# famid variable name inc
# 1: 1 d Bill 30000
# 2: 2 d Art 22000
# 3: 3 d Paul 25000
# 4: 1 m Bess 15000
# 5: 2 m Amy 18000
# 6: 3 m Pat 50000
How do these approaches compare with merged.stack?
Both packages are much better supported. They update and test their code more extensively than I do.
melt is blazing fast.
The Hadleyverse approach is actually slower (in many of my tests, even slower than base R's reshape) probably because of having to make the data long, then wide, then performing type conversion. However, some users like its step-by-step approach.
The Hadleyverse approach might have some unintended consequences because of the requirement of making the data long before making it wide. That forces all of the measure columns to be coerced to the same type (usually "character") if they are of different types to begin with.
Neither have the same convenience of merged.stack. Just look at the code required to get the result ;-)
merged.stack, however, can probably benefit from a simplified update, something along the lines of this function
ReshapeLong_ <- function(indt, stubs, sep = NULL) {
if (!is.data.table(indt)) indt <- as.data.table(indt)
mv <- lapply(stubs, function(y) grep(sprintf("^%s", y), names(indt)))
levs <- unique(gsub(paste(stubs, collapse="|"), "", names(indt)[unlist(mv)]))
if (!is.null(sep)) levs <- gsub(sprintf("^%s", sep), "", levs, fixed = TRUE)
melt(indt, measure = mv, value.name = stubs)[
, variable := factor(variable, labels = levs)][]
}
Which can then be used as:
ReshapeLong_(dadmom, stubs = c("name", "inc"))
How do these approaches compare with base R's reshape?
The main difference is that reshape is not able to handle unbalanced panel datasets. See, for example, "mydf2" as opposed to "mydf" in the tests below.
Test cases
Here's some sample data. "mydf" is balanced. "mydf2" is not balanced.
set.seed(1)
x <- 10000
mydf <- mydf2 <- data.frame(
id_1 = 1:x, id_2 = c("A", "B"), varAa = sample(letters, x, TRUE),
varAb = sample(letters, x, TRUE), varAc = sample(letters, x, TRUE),
varBa = sample(10, x, TRUE), varBb = sample(10, x, TRUE),
varBc = sample(10, x, TRUE), varCa = rnorm(x), varCb = rnorm(x),
varCc = rnorm(x), varDa = rnorm(x), varDb = rnorm(x), varDc = rnorm(x))
mydf2 <- mydf2[-c(9, 14)] ## Make data unbalanced
Here are some functions to test:
f1 <- function(mydf) {
mydf %>%
gather(variable, value, starts_with("var")) %>%
separate(variable, into = c("var", "time"),
sep = "(?<=varA|varB|varC|varD)", perl = TRUE) %>%
spread(var, value, convert = TRUE)
}
f2 <- function(mydf) {
melt(as.data.table(mydf),
measure = patterns(paste0("var", c("A", "B", "C", "D"))),
value.name = paste0("var", c("A", "B", "C", "D")))[
, variable := factor(variable, labels = c("a", "b", "c"))][]
}
f3 <- function(mydf) {
merged.stack(mydf, var.stubs = paste0("var", c("A", "B", "C", "D")), sep = "var.stubs")
}
## Won't run with "mydf2". Should run with "mydf"
f4 <- function(mydf) {
reshape(mydf, direction = "long",
varying = lapply(c("varA", "varB", "varC", "varD"),
function(x) grep(x, names(mydf))),
sep = "", v.names = paste0("var", c("A", "B", "C", "D")),
timevar="time", times = c("a", "b", "c"))
}
Test performance:
library(microbenchmark)
microbenchmark(f1(mydf), f2(mydf), f3(mydf), f4(mydf))
# Unit: milliseconds
# expr min lq mean median uq max neval
# f1(mydf) 463.006547 492.073086 528.533319 514.189548 538.910756 867.93356 100
# f2(mydf) 3.737321 4.108376 6.674066 4.332391 4.761681 47.71142 100
# f3(mydf) 60.211254 64.766770 86.812077 87.040087 92.841747 262.89409 100
# f4(mydf) 40.596455 43.753431 61.006337 48.963145 69.983623 230.48449 100
Observations:
Base R's reshape would not be able to handle reshaping "mydf2".
The "dplyr" + "tidyr" approach would mangle the results in the resulting "varB", "varC", and "varD" because values would be coerced to character.
As the benchmarks show, reshape gives reasonable performance.
Note: Because of the difference in time between posting my last answer and the differences in approach, I thought I would share this as a new answer.

merged.stack from my "splitstackshape" handles this by utilizing the sep = "var.stubs" construct:
library(splitstackshape)
merged.stack(dadmom, var.stubs = c("inc", "name"), sep = "var.stubs")
# famid .time_1 inc name
# 1: 1 d 30000 Bill
# 2: 1 m 15000 Bess
# 3: 2 d 22000 Art
# 4: 2 m 18000 Amy
# 5: 3 d 25000 Paul
# 6: 3 m 50000 Pat
Notice that since there is no real separator in the variables that are being stacked, we can just strip out the var.stubs from the names to create the "time" variables. Using sep = "var.stubs" is equivalent to doing sep = "inc|name".
This works because ".time_1" is created by stripping out what is left after removing the "var.stubs" from the column names.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

optimizing solution to find common third on large data set - r

Related

Data management: flatten data with R

Multiple functions on multiple columns by group, and create informative column names

Fastest way to filter a data.frame list column contents in R / Rcpp

Perform a semi-join with data.table

reshape wide to long with character suffixes instead of numeric suffixes

Categories

Resources