I have a data table with 10 columns: town, tc, one, two, three, four, five, six, seven, total.
I need to generate the mean for columns "one" through "total", for which I am using:
DTmean <- DT[, (lapply(.SD, mean)), by = .(town, tc), .SDcols = 3:10]
This generates the means, but I then want the column names to be suffixed with "_mean". How can I do this? I want the first two columns to remain "town" and "tc". I tried the loop below, but it renames every column from "one" to "total" to just "_mean":
for (i in 3:10) {
  # names(i) on a bare integer is NULL, so paste0() yields just "_mean"
  setnames(DTmean, i, paste0(names(i), "_mean"))
}
If you want to do it the data.table way, you should use setnames as follows:
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))
or:
cols <- names(DT)[3:10]
setnames(DTmean, cols, paste0(cols, '_mean'))
Furthermore, you don't need the .SDcols statement as you are aggregating all the other columns. Using DT[, lapply(.SD,mean), by = .(town,tc)] should thus give you the same result as using DT[, (lapply(.SD,mean)), by = .(town,tc), .SDcols=3:10].
On the following example dataset:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20, 1, 1),
                 two = rnorm(20, 2, 1),
                 three = rnorm(20, 3, 1),
                 four = rnorm(20, 4, 1),
                 five = rnorm(20, 5, 2),
                 six = rnorm(20, 6, 2),
                 seven = rnorm(20, 7, 2),
                 total = rnorm(20, 28, 3))
using:
DTmean <- DT[, lapply(.SD,mean), by = .(town,tc)]
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))
gives:
> DTmean
town tc one_mean two_mean three_mean four_mean five_mean six_mean seven_mean total_mean
1: A C 1.7368898 1.883586 3.358440 4.849896 4.742609 5.089877 6.792513 29.20286
2: A D 0.8906842 1.826135 3.267684 3.760931 6.210145 7.320693 5.571687 26.56142
3: B C 1.4037955 2.474836 2.587920 3.719658 3.446612 6.510183 8.309784 27.80012
4: B D 0.8103511 1.153000 3.360940 3.945082 5.555999 6.198380 8.652779 28.95180
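As a quick check of the .SDcols point above, a one-line sketch on this example data:
identical(DT[, lapply(.SD, mean), by = .(town, tc)],
          DT[, lapply(.SD, mean), by = .(town, tc), .SDcols = 3:10])
# should return TRUE: without .SDcols, .SD already holds columns 3 to 10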
In reply to your comment: if you want to calculate both the mean and the sd simultaneously, you could do the following (adapted from my answer here):
DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))))), by = .(town,tc)]
which gives:
town tc one.mean one.sd two.mean two.sd three.mean three.sd four.mean four.sd five.mean five.sd six.mean six.sd seven.mean seven.sd total.mean total.sd
1: A C 0.2981842 0.3556520 1.578174 0.7788545 2.232366 0.9047046 4.896201 1.238877 4.625866 0.7436584 7.607439 1.7262628 7.949366 1.772771 28.94287 3.902602
2: A D 1.2099018 1.0205252 1.686068 1.5497989 2.671027 0.8323733 4.811279 1.404794 7.235969 0.7883873 6.765797 2.7719942 6.657298 1.107843 27.42563 3.380785
3: B C 0.9238309 0.6679821 2.525485 0.8054734 3.138298 1.0111270 3.876207 0.573342 3.843140 2.1991052 4.942155 0.7784024 6.783383 2.595116 28.95243 1.078307
4: B D 0.8843948 0.9384975 1.988908 1.0543981 3.673393 1.3505701 3.957534 1.097837 2.788119 1.9089660 6.463784 0.7642144 6.416487 2.041441 27.88205 3.807119
However, it is most likely better to store this in long format. To get that, you could use data.table's melt function as follows:
cols <- names(DT)[3:10]
DT2 <- melt(DT[, as.list(unlist(lapply(.SD, function(x) list(mn = mean(x), sdev = sd(x))))), by = .(town,tc)],
            id.vars = c('town','tc'),
            measure.vars = patterns('.mn', '.sdev'),
            value.name = c('mn', 'sdev'))[, variable := cols[variable]]
or in a much simpler operation:
DT2 <- melt(DT, id.vars = c('town','tc'))[, .(mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)]
which results in:
> DT2
town tc variable mn sdev
1: A C one 0.2981842 0.3556520
2: A D one 1.2099018 1.0205252
3: B C one 0.9238309 0.6679821
4: B D one 0.8843948 0.9384975
5: A C two 1.5781743 0.7788545
6: A D two 1.6860675 1.5497989
7: B C two 2.5254855 0.8054734
8: B D two 1.9889082 1.0543981
9: A C three 2.2323655 0.9047046
10: A D three 2.6710267 0.8323733
11: B C three 3.1382982 1.0111270
12: B D three 3.6733929 1.3505701
.....
In response to your last comments, you can detect outliers as follows:
DT3 <- melt(DT, id.vars = c('town','tc'))
DT3[, `:=` (mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)
][, outlier := +(value < mn - sdev | value > mn + sdev)]
which gives:
town tc variable value mn sdev outlier
1: A C one 0.5681578 0.2981842 0.355652 0
2: A D one 0.5528128 1.2099018 1.020525 0
3: A C one 0.5214274 0.2981842 0.355652 0
4: A D one 1.4171454 1.2099018 1.020525 0
5: A C one 0.5820994 0.2981842 0.355652 0
---
156: B D total 23.4462542 27.8820524 3.807119 1
157: B C total 30.5934956 28.9524305 1.078307 1
158: B D total 30.5618759 27.8820524 3.807119 0
159: B C total 27.5940307 28.9524305 1.078307 1
160: B D total 24.8378437 27.8820524 3.807119 0
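The 0/1 flag makes follow-up selection trivial; for instance, on the DT3 built above:
DT3[outlier == 1]  # keep only the flagged rows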
My original data.table consists of three columns.
site, observation_number and id.
E.g. the following which is all the observations for id = z
site | observation_number | id
a    | 1                  | z
b    | 2                  | z
c    | 3                  | z
Which means that ID z has traveled from a to b to c.
There is no fixed number of sites per id.
I wish to transform the data to an edge list like this:
from | to | id
a    | b  | z
b    | c  | z
Mock data:
sox <- data.table(site = c('a','b','c','a','c','c','a','d','e'),
                  obsnum = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
                  id = c('z','z','z','y','y','k','k','k','k'))
The way I am currently doing this feels convoluted and is very slow (sox has 1.5 million rows and dt_out has about 7.5 million rows). I basically use a for loop over the observation number to split the data into chunks in which each id has only one journey (one from-to pair). Then I dcast the data and rbind all the chunks into a new data.table.
dt_out <- data.table()
maksimum <- sox[, max(obsnum)]
for (i in 1:(maksimum - 1)) {
  mini <- i
  maxi <- i + 1
  sox_t <- sox[obsnum == mini | obsnum == maxi, ]
  temp_dt <- dcast(sox_t[id %in% sox_t[, .N, by = id][N >= 2]$id,
                         .SD[, list(site, obsnum, a = rep(c('from', 'to')))],
                         by = id],
                   id ~ a, value.var = 'site')
  dt_out <- rbind(dt_out, temp_dt)
}
I hope someone can help me optimize this, and preferably wrap it in a function where I can pass in the data.table and the column names for site, observation number, and id. For some reason I can't manage to write a working function for this.
UPDATE
Using system.time (and running it a few times):

                             user  system  elapsed
make_edgelist (data.table):  5.38    0.00     5.38
data.table with shift:      13.96    0.06    14.08
dplyr with arrange:          6.06    0.36     6.44
P.S. make_edgelist was updated to order the data.table:
make_edgelist <- function(DT, site_var = "site", id_var = "id", obsnum_var = "rn1") {
  DT[order(get(obsnum_var)),
     list(from = get(site_var)[-.N], to = get(site_var)[-1]),
     by = id_var]
}
I was surprised that dplyr (with lead) was almost as fast as make_edgelist and much faster than data.table with shift. I guess this means dplyr may actually be faster for more complex lead/lag/shift operations.
I also find it puzzling, though I don't know enough to tell whether it matters, that dplyr used more 'system' time than either of the data.table solutions.
Input data: 1.5 million rows.
Result: 0.6 million rows.
With dplyr, you can try:
sox %>%
group_by(id) %>%
transmute(from = site,
to = lead(from)) %>%
na.omit()
id from to
<chr> <chr> <chr>
1 z a b
2 z b c
3 y a c
4 k c a
5 k a d
6 k d e
As @Sotos noted, it could be useful to arrange the data first:
sox %>%
arrange(id, obsnum) %>%
group_by(id) %>%
transmute(from = site,
to = lead(from)) %>%
na.omit()
Is this what you are looking for?
sox[, .(from = site[-.N], to = site[-1]), by = id]
# id from to
# 1: z a b
# 2: z b c
# 3: y a c
# 4: k c a
# 5: k a d
# 6: k d e
Wrapped in a function:
make_edgelist <- function(DT, site_var = "site", id_var = "id") {
DT[, .(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}
Note: This solution assumes the data is already ordered by observation number. To avoid this assumption, add order(obsnum) before the first comma.
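Concretely, on the mock data above, the order-safe version would be:
sox[order(obsnum), .(from = site[-.N], to = site[-1]), by = id]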
Using data.table with shift, in case it's faster than the dplyr solution above, you can do:
sox <- sox[order(id, obsnum)]
sox[, from := shift(site), by = "id"]
sox <- sox[!is.na(from)]
setnames(sox, "site", "to")
sox[, obsnum := NULL]
setcolorder(sox, c("id", "from", "to"))
sox
#> id from to
#> 1: k c a
#> 2: k a d
#> 3: k d e
#> 4: y a c
#> 5: z a b
#> 6: z b c
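For reference, the same shift logic can also be written as a single chain that leaves sox untouched; a minimal sketch, assuming the same columns:
sox[order(id, obsnum)][, .(from = shift(site), to = site), by = id][!is.na(from)]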
I have a data.table in R that I want to group by the values of one column, finding the outliers in each group. BUT I need to keep the id column, which is not included in the aggregation. For example, for the data table a below, I want to find the outliers of Hours by class and output them with the corresponding id.
Hours id class
1: 100.000 30298340 M
2: 4.776 30310183 M
3: 1.560 30312576 M
4: 11.520 30336159 M
5: 3.288 30331383 M
6: 6.552 30364533 M
7: 5.064 30365224 M
8: 27.768 30365394 C
9: 4.992 30365211 M
10: 25.536 30365603 M
11: 8.568 30337051 M
12: 5.112 30337052 C
13: 2.352 30284703 M
14: 23.784 30325405 M
15: 16.464 30327152 M
16: 24.336 30351237 M
17: 3.192 30352117 M
18: 24.312 30324926 M
19: 23.160 30325670 M
20: 4.176 30324906 M
Then I used the following code to find outliers.
temp <- a[, .(Hours = boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out,
              M = boxplot.stats(Hours, do.conf = FALSE)$stats[3]),
          by = class]
temp[Hours > M] then gives me the outliers above the median, with the median as a reference:
class Hours M
1: M 100 7.56
But how can I change the aggregation line to include id for the outlier row?
Since my original table is very large, I don't want to merge temp with the original data. Also, because this is only a small piece of a large existing program, I am trying to keep the main structure of the temp result, just with an id column added, so temp can be passed into the next block of computation. Ideally, is there a simple way to tweak the data.table aggregation line to fulfill this request? Thank you!!!
Here is an additional question in the same scenario. Now suppose I have 3 more columns in the original data, e.g. age, gender, etc. How can I keep them all in the outlier output? I could simply repeat Eric's code with id replaced by each of the other variables and add to the data.table step:
age = age[which(Hours %in% boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out)],
gender = gender[which(Hours %in% boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out)],
But that would be tedious if there are many more columns to add. I am thinking of doing the following:
keyname <- c("age", "gender", "id")
temp <- a[, .(Hours = boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out,
              M = boxplot.stats(Hours, do.conf = FALSE)$stats[3],
              lapply(seq_along(keyname), function(x)
                get(keyname[x])[which(Hours %in% boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out)])),
          by = class]
However, it doesn't work. Any further suggestion? Thank you!
Use which and subsetting to make the column.
temp <- a[, .(Hours = boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out,
              M = boxplot.stats(Hours, do.conf = FALSE)$stats[3],
              id = id[which(Hours %in% boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out)]),
          by = class]
> temp
class Hours M id
1: M 100 7.56 30298340
2: C NA 16.44 NA
> temp[Hours > M]
class Hours M id
1: M 100 7.56 30298340
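For the multi-column follow-up, a hedged sketch (assuming the extra columns really are named age and gender): compute the outlier index once per group, then subset every extra column with it through .SD, instead of repeating the boxplot.stats call per column.
keyname <- c("age", "gender", "id")
temp <- a[, {
  # row index of the outliers within this class
  idx <- which(Hours %in% boxplot.stats(Hours, coef = 3, do.conf = FALSE)$out)
  c(list(Hours = Hours[idx],
         M = boxplot.stats(Hours, do.conf = FALSE)$stats[3]),
    lapply(.SD, `[`, idx))  # subset age, gender, id with the same index
}, by = class, .SDcols = keyname]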
I know that there are many answers on this forum about how to get summary statistics (e.g. mean, se, N) for multiple groups using options like aggregate, ddply, or data.table. I'm not sure, however, how to apply these functions over multiple columns at once.
More specifically, I would like to know how to extend the following ddply command over multiple columns (dv1, dv2, dv3) without re-typing the code with a different variable name each time.
library(reshape2)
library(plyr)
group1 <- c(rep(LETTERS[1:4], c(4,6,6,8)))
group2 <- c(rep(LETTERS[5:8], c(6,4,8,6)))
group3 <- c(rep(LETTERS[9:10], c(12,12)))
my.dat <- data.frame(group1, group2, group3, dv1=rnorm(24),dv2=rnorm(24),dv3=rnorm(24))
my.dat
data1 <- ddply(my.dat, c("group1", "group2", "group3"), summarise,
               N = length(dv1),
               mean = mean(dv1, na.rm = TRUE),
               sd = sd(dv1, na.rm = TRUE),
               se = sd / sqrt(N)
)
data1
How can I apply this ddply function over multiple columns such that the outcome will be data1, data2, data3... for each outcome variable? I thought this could be the solution:
dfm <- melt(my.dat, id.vars = c("group1", "group2","group3"))
lapply(list(.(group1, variable), .(group2, variable), .(group3, variable)),
       ddply, .data = dfm, .fun = summarize,
       mean = mean(value),
       sd = sd(value),
       N = length(value),
       se = sd / sqrt(N))
This looks like it's in the right direction, but it is not exactly what I need: it provides the statistics for each grouping variable separately rather than for the three-way combination. What I need is an outcome like data1 for each dv (e.g. the first aggregated group is people who are at A, E, and I; the second is those at B, E, and I; etc.).
Here's an illustration of reshaping your data first. I've written a custom function to improve readability:
mysummary <- function(x, na.rm = FALSE){
  res <- list(mean = mean(x, na.rm = na.rm),
              sd = sd(x, na.rm = na.rm),
              N = length(x))
  res$se <- res$sd / sqrt(res$N)
  res
}
library(data.table)
res <- melt(setDT(my.dat),id.vars=c("group1","group2","group3"))[,mysummary(value),
by=.(group1,group2,group3,variable)]
> head(res)
group1 group2 group3 variable mean sd N se
1: A E I dv1 9.75 6.994045 4 3.497023
2: B E I dv1 9.50 7.778175 2 5.500000
3: B F I dv1 16.00 4.082483 4 2.041241
4: C G I dv1 14.50 10.606602 2 7.500000
5: C G J dv1 10.75 10.372239 4 5.186119
6: D G J dv1 13.00 4.242641 2 3.000000
Or without the custom function, thanks to @Jaap:
melt(setDT(my.dat),
id=c("group1","group2","group3"))[, .(mean = mean(value),
sd = sd(value),
n = .N,
se = sd(value)/sqrt(.N)),
.(group1, group2, group3, variable)]
If you don't want to melt into long format, you can also do:
library(data.table)
setDT(my.dat)[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x),
sd = sd(x),
n = .N,
se = sd(x)/sqrt(.N))))),
by = .(group1, group2, group3), .SDcols=c("dv1","dv2","dv3")]
which gives:
group1 group2 group3 dv1.mean dv1.sd dv1.n dv1.se dv2.mean dv2.sd dv2.n dv2.se dv3.mean dv3.sd dv3.n dv3.se
1: A E I 0.09959774 0.4704498 4 0.23522491 0.05020096 0.8098882 4 0.40494412 -0.134137210 0.7832841 4 0.3916420
2: B E I 0.72726477 0.3651544 2 0.25820315 0.73743314 1.4260172 2 1.00834641 -0.120188202 0.5532434 2 0.3912022
3: B F I -0.68661572 0.7212631 4 0.36063157 0.06670216 0.7678781 4 0.38393905 0.096275469 0.8993015 4 0.4496508
4: C G I -0.54577363 0.0798962 2 0.05649515 0.18293371 0.1022325 2 0.07228926 -0.947603264 2.3118016 2 1.6346906
5: C G J 0.17434075 0.8503874 4 0.42519369 -0.11485558 1.4184031 4 0.70920154 -0.005140781 0.6871591 4 0.3435796
6: D G J 0.17943465 0.4943486 2 0.34955725 -0.22223273 0.3679613 2 0.26018796 -0.373289114 1.0737512 2 0.7592568
7: D H J 0.38090937 0.7904832 6 0.32271340 0.02107597 1.0094695 6 0.41211422 0.118277330 0.9024006 6 0.3684035
Here is a solution using dplyr. This gives the result in a "wide" format (i.e. the stats for dv1, dv2, dv3 are on the same line).
se <- function(x) { sd(x)/sqrt(length(x)) }
my.dat %>%
group_by(group1, group2, group3) %>%
summarise_each(funs(mean, sd, length, se), dv1, dv2, dv3) %>%
ungroup
If having the stats for dv1, dv2, and dv3 on separate lines is desired, this can be modified using melt or gather (from tidyr).
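For instance, a sketch with gather (the older tidyr verb; newer tidyr offers pivot_longer), reproducing the per-dv long layout:
library(dplyr)
library(tidyr)
my.dat %>%
  gather(variable, value, dv1:dv3) %>%
  group_by(group1, group2, group3, variable) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            N = n(),
            se = sd / sqrt(N)) %>%
  ungroup()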
I have a large data.table in R with several columns holding dollar values, and a separate column holding an inflation adjustment factor. I am trying to figure out how to update each of the monetary columns by multiplying it by the inflation adjustment column. Suppose I have the data:
DT <- data.table(id=1:1000,year=round(runif(1000)*10),
inc1 = runif(1000), inc2 = runif(1000), inc3 = runif(1000),
deflator = rnorm(1000))
which gives output:
id year inc1 inc2 inc3 deflator
1: 1 8 0.4754808 0.6678110 0.41533976 -0.64126988
2: 2 2 0.6568746 0.7765634 0.70616373 0.39687915
3: 3 6 0.8192947 0.9236281 0.90002534 -0.69545700
4: 4 4 0.7781929 0.1624902 0.17565790 0.05263055
5: 5 7 0.6232520 0.8024975 0.86449836 0.70781887
---
996: 996 2 0.9676383 0.2238746 0.19822000 0.78564836
997: 997 9 0.9877410 0.5783748 0.57497438 -1.63365223
998: 998 8 0.2220570 0.6500632 0.19814932 1.00260174
999: 999 3 0.4793767 0.2830457 0.54835581 1.04168818
1000: 1000 8 0.2003476 0.6121637 0.02921505 0.34933690
In reality I have inc1 through inc100, rather than just three variables, and I want to perform this action:
DT[, inc1 := inc1 * deflator]
for each of my 100 income columns (inc1, inc2, inc3 in the fake data above). I will have more than 100 columns in the future, so I would like to figure out a way to loop the action over the columns. Is there a way to do this for all the income columns at once?
I would like to do something like:
inc_cols = c(inc1, inc2, inc3)
DT[, inc_cols := lapply(inc_cols,function(x)= x * deflator),]
or
DT[, inc_cols := lapply(.SD,function(x)= x * deflator),.SDcols = inc_cols]
but neither of these seems to work. I also tried using the get() function to make it clear that deflator references a column, like:
DT[, inc_cols := lapply(.SD,function(x)= x * get(deflator)),.SDcols = inc_cols]
but had no luck. I also tried to loop through the variables with something like:
for (var in inc_cols) {
  print(var)
  DT[, get(var) := get(var) * infAdj2010_mult]
}
which returns
[1] "inc1"
Error in get(var) : object 'inc1' not found
I realize this is probably a straightforward question, and I have searched the other questions here as well as various online guides and tutorials, but I cannot find an example matching my specific problem. It is similar to this question, but not exactly.
Thanks for your help!
You could try
DT[, (inc_cols) := lapply(.SD, function(x)
       x * DT[['deflator']]), .SDcols = inc_cols]
head(DT, 2)
# id year inc1 inc2 inc3 deflator
#1: 1 3 0.614838304 0.009796974 0.3236051 0.7735552
#2: 2 2 -0.001583579 -0.082289606 -0.1365115 -0.6644330
Or if you need a loop
for (inc in inc_cols) {
  nm1 <- as.symbol(inc)
  DT[, (inc) := eval(nm1) * deflator]
}
head(DT,2)
# id year inc1 inc2 inc3 deflator
#1: 1 3 0.614838304 0.009796974 0.3236051 0.7735552
#2: 2 2 -0.001583579 -0.082289606 -0.1365115 -0.6644330
Or a possible option is using set, which should be very fast as the overhead of [.data.table is avoided (suggested by @Arun):
indx <- grep('inc', colnames(DT))
for (j in indx) {
  set(DT, i = NULL, j = j, value = DT[[j]] * DT[['deflator']])
}
head(DT,2)
# id year inc1 inc2 inc3 deflator
#1: 1 3 0.614838304 0.009796974 0.3236051 0.7735552
#2: 2 2 -0.001583579 -0.082289606 -0.1365115 -0.6644330
where
inc_cols <- grep('^inc', colnames(DT), value=TRUE)
data
set.seed(24)
DT <- data.table(id=1:1000,year=round(runif(1000)*10),
inc1 = runif(1000), inc2 = runif(1000), inc3 = runif(1000),
deflator = rnorm(1000))
Since you can use dplyr on data.tables, you could also do:
library(dplyr)
DT %>% mutate_each(funs(.*deflator), starts_with("inc"))
Which will multiply each column of DT that starts with "inc" by the "deflator" column.
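As a side note, mutate_each() has since been deprecated; in dplyr >= 1.0 the same operation would be written with across(). A sketch:
DT %>% mutate(across(starts_with("inc"), ~ .x * deflator))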
This approach is also quite convenient, but likely slower than using set():
library(data.table); library(magrittr)
set.seed(42)
DT <- data.table(id=1:1000,year=round(runif(1000)*10),
inc1 = runif(1000), inc2 = runif(1000), inc3 = runif(1000),
deflator = rnorm(1000))
vars <- names(DT) %>% .[grepl("inc", .)]
DT[, (vars) := .SD * deflator, .SDcols = vars]
DT[]
id year inc1 inc2 inc3 deflator
1: 1 9 0.212563676 0.24806366 0.06860638 0.2505781
2: 2 9 -0.017438715 -0.12186792 -0.26241497 -0.2779240
3: 3 3 -1.414016119 -1.20714809 -0.76920337 -1.7247357
4: 4 8 -1.082336969 -1.78411512 -1.08720698 -2.0067049
5: 5 6 -0.644638321 -1.07757416 -0.20895576 -1.2918083
---
996: 996 1 -0.573551720 -1.93996157 -0.50171303 -2.1569621
997: 997 5 -0.007899417 -0.01561619 -0.05708009 -0.0920275
998: 998 1 -0.090975121 -0.30475714 -0.27291825 -0.3974001
999: 999 5 -0.045984079 -0.01563942 -0.07868934 -0.1383273
1000: 1000 0 -0.785962308 -0.63266975 -0.29247974 -0.8257650
@IceCreamToucan noted that the following should work in newer versions of data.table:
DT[, inc_cols := lapply(.SD,function(x)= x * deflator),.SDcols = inc_cols]
For me, I had to change the LHS as follows to make it work: wrapping inc_cols in c() or parentheses tells data.table to use the column names stored in the variable instead of creating a single column literally named "inc_cols". Note the RHS is also edited, since function(x)= x * deflator is a syntax error (drop the stray =).
DT[, c(inc_cols) := lapply(.SD, function(x) x * deflator), .SDcols = inc_cols]
-or-
DT[, (inc_cols) := lapply(.SD, function(x) x * deflator), .SDcols = inc_cols]
I have read some references to similar problems here on SO, but I haven't been able to find a solution yet, and I am wondering if there is any way to do the following using just data.table.
I'll use a simplified example, but in practice, my data table has > 1000 columns similar to var1, var2, ... var1000, etc.
dt <- data.table(uid=c("a","b"), var1=c(1,2), var2=c(100,200))
I am looking for a solution that will allow me to get an output similar to reshape's melt function --
> melt(dt, id=c("uid"))
uid variable value
1 a var1 1
2 b var1 2
3 a var2 100
4 b var2 200
That is, all the columns except for uid are listed under a single column with the corresponding values in an adjoining column. I have tried this with a combination of list, etc, but might be missing something that is obvious.
All uids in dt are unique.
Thanks in advance.
For a data.table reshape, try the following:
dt[, list(variable = names(.SD), value = unlist(.SD, use.names = F)), by = uid]
The cost of the syntax is worth it; the function runs very quickly!
stack generally outperforms melt.
A straightforward approach to this problem with stack would be:
dt[, stack(.SD), by = "uid"]
Of course, you can specify your .SDcols if necessary. And then, use setnames() to change the names to whatever you want.
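For example (a small sketch; base R's stack() names its output columns values and ind by default):
out <- dt[, stack(.SD), by = "uid"]
setnames(out, c("values", "ind"), c("value", "variable"))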
(Self-promotion alert)
I wrote some functions and put them in a package called "splitstackshape". One of those functions is called Stacked(), and as of version 1.2.0 of "splitstackshape", it should be very fast.
It's a little bit different from just stacking all the remaining columns in a data.table. It is more analogous to base R's reshape() than melt() from "reshape2". Here's an example of Stacked() in action.
I've created a decently large data.table for this test: 50 numeric columns we want to stack and 50 factor columns we want to stack. I've also further optimized @Andreas's answer.
The data
set.seed(1)
m1 <- matrix(rnorm(10000*50), ncol = 50)
m2 <- matrix(sample(LETTERS, 10000*50, replace = TRUE), ncol = 50)
colnames(m1) <- paste("varA", sprintf("%02d", 1:50), sep = "_")
colnames(m2) <- paste("varB", sprintf("%02d", 1:50), sep = "_")
dt <- data.table(uid = 1:10000, m1, m2)
The functions for benchmarking
test1 <- function() Stacked(dt, "uid", c("varA", "varB"), "_")
## merged.stack
test2 <- function() merged.stack(dt, "uid", c("varA", "varB"), "_")
## unlist(..., use.names = TRUE) -- OPTIMIZED
test3 <- function() {
list(cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD)),
.SDcols = 2:51]),
cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD)),
.SDcols = 52:101]))
}
## unlist(..., use.names = FALSE) -- OPTIMIZED
test4 <- function() {
list(cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD, use.names = FALSE)),
.SDcols = 2:51]),
cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD, use.names = FALSE)),
.SDcols = 52:101]))
}
## Andreas's current answer
test5 <- function() {
list(dt[, list(variable = names(.SD),
value = unlist(.SD, use.names = FALSE)),
by = uid, .SDcols = 2:51],
dt[, list(variable = names(.SD),
value = unlist(.SD, use.names = FALSE)),
by = uid, .SDcols = 52:101])
}
The results
library(microbenchmark)
microbenchmark(Stacked = test1(), merged.stack = test2(),
unlist.namesT = test3(), unlist.namesF = test4(),
AndreasAns = test5(), times = 3)
# Unit: milliseconds
# expr min lq median uq max neval
# Stacked 391.3251 393.0976 394.8702 421.4185 447.9668 3
# merged.stack 764.3071 769.6935 775.0799 867.2638 959.4477 3
# unlist.namesT 1680.0610 1761.9701 1843.8791 1881.9722 1920.0653 3
# unlist.namesF 215.0827 242.7748 270.4669 270.6944 270.9218 3
# AndreasAns 16193.5084 16249.5797 16305.6510 16793.3832 17281.1154 3
I'm not sure why @Andreas's current answer is so slow here. The "optimization" I did was basically to unlist without using by, which made a huge difference on the "varB" (factor) columns.
The manual approach is still faster than the functions from "splitstackshape", but these are milliseconds we're talking about, and some pretty compact one-liner code!
Sample output
For reference, here is what the output of Stacked() looks like. It's a list of "stacked" data.tables, one list item for each stacked variable.
test1()
# $varA
# uid .time_1 varA
# 1: 1 01 -0.6264538
# 2: 1 02 -0.8043316
# 3: 1 03 0.2353485
# 4: 1 04 0.6179223
# 5: 1 05 -0.2212571
# ---
# 499996: 10000 46 -0.6859073
# 499997: 10000 47 -0.9763478
# 499998: 10000 48 0.6579464
# 499999: 10000 49 0.7741840
# 500000: 10000 50 0.5195232
#
# $varB
# uid .time_1 varB
# 1: 1 01 D
# 2: 1 02 A
# 3: 1 03 S
# 4: 1 04 L
# 5: 1 05 T
# ---
# 499996: 10000 46 A
# 499997: 10000 47 W
# 499998: 10000 48 H
# 499999: 10000 49 U
# 500000: 10000 50 W
And, here is what the merged.stack output looks like. It's similar to what you would get when you use reshape(..., direction = "long") from base R.
test2()
# uid .time_1 varA varB
# 1: 1 01 -0.6264538 D
# 2: 1 02 -0.8043316 A
# 3: 1 03 0.2353485 S
# 4: 1 04 0.6179223 L
# 5: 1 05 -0.2212571 T
# ---
# 499996: 10000 46 -0.6859073 A
# 499997: 10000 47 -0.9763478 W
# 499998: 10000 48 0.6579464 H
# 499999: 10000 49 0.7741840 U
# 500000: 10000 50 0.5195232 W
Shameless Self-promotion
You might want to try melt_ from my package "Kmisc". melt_ is essentially a rewrite of reshape2:::melt.data.frame, with most of the grunt work done in C, and it avoids as much copying and type coercion as possible for a speedy implementation.
An example:
## devtools::install_github("Kmisc", "kevinushey")
library(Kmisc)
library(reshape2)
library(microbenchmark)
n <- 1E6
big_df <- data.frame( stringsAsFactors=FALSE,
x=sample(letters, n, TRUE),
y=sample(LETTERS, n, TRUE),
za=rnorm(n),
zb=rnorm(n),
zc=rnorm(n)
)
all.equal(
melt <- melt(big_df, id.vars=c('x', 'y')),
melt_ <- melt_(big_df, id.vars=c('x', 'y'))
)
## we don't convert the 'variable' column to factor by default
## if we do, we see they're identical
melt_$variable <- factor(melt_$variable)
stopifnot( identical(melt, melt_) )
microbenchmark( times=5,
melt=melt(big_df, id.vars=c('x', 'y')),
melt_=melt_(big_df, id.vars=c('x', 'y'))
)
gives me
Unit: milliseconds
expr min lq median uq max neval
melt 916.40436 931.60031 999.03877 1102.31090 1160.3598 5
melt_ 61.59921 78.08768 90.90615 94.52041 182.0879 5
With any luck, this will be fast enough for your data.