I have read some references to similar problems here on SO, but haven't been able to find a solution yet and wondering if there is any way to do the following using just data.table.
I'll use a simplified example, but in practice, my data table has > 1000 columns similar to var1, var2, ..., var1000.
dt <- data.table(uid=c("a","b"), var1=c(1,2), var2=c(100,200))
I am looking for a solution that will allow me to get an output similar to reshape's melt function --
> melt(dt, id=c("uid"))
uid variable value
1 a var1 1
2 b var1 2
3 a var2 100
4 b var2 200
That is, all the columns except for uid are listed under a single column, with the corresponding values in an adjoining column. I have tried this with various combinations of list() etc., but I might be missing something obvious.
All uids in dt are unique.
Thanks in advance.
For a data.table reshape, try the following:
dt[, list(variable = names(.SD), value = unlist(.SD, use.names = FALSE)), by = uid]
The syntax is a little unwieldy, but it is worth it; this runs very quickly!
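On the example data this gives the following (note that the rows come out grouped by uid rather than by variable, unlike melt(); if the melt() ordering matters, a setorder() call on the result should get you there):
dt[, list(variable = names(.SD), value = unlist(.SD, use.names = FALSE)), by = uid]
#    uid variable value
# 1:   a     var1     1
# 2:   a     var2   100
# 3:   b     var1     2
# 4:   b     var2   200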
stack generally outperforms melt.
A straightforward approach to this problem with stack would be:
dt[, stack(.SD), by = "uid"]
Of course, you can specify your .SDcols if necessary. And then, use setnames() to change the names to whatever you want.
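For instance, on the example data above, that might look like this (a sketch; stack() names its output columns values and ind, hence the renaming):
out <- dt[, stack(.SD), by = "uid", .SDcols = c("var1", "var2")]
setnames(out, c("values", "ind"), c("value", "variable"))
out
#    uid value variable
# 1:   a     1     var1
# 2:   a   100     var2
# 3:   b     2     var1
# 4:   b   200     var2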
(Self-promotion alert)
I wrote some functions and put them in a package called "splitstackshape". One of those functions, Stacked(), should work very fast as of version 1.2.0 of the package.
It's a little bit different from just stacking all the remaining columns in a data.table. It is more analogous to base R's reshape() than melt() from "reshape2". Here's an example of Stacked() in action.
I've created a decently large data.table for this test: 50 numeric columns that we want to stack and 50 factor columns that we want to stack. I've also further optimized @Andreas's answer.
The data
set.seed(1)
m1 <- matrix(rnorm(10000*50), ncol = 50)
m2 <- matrix(sample(LETTERS, 10000*50, replace = TRUE), ncol = 50)
colnames(m1) <- paste("varA", sprintf("%02d", 1:50), sep = "_")
colnames(m2) <- paste("varB", sprintf("%02d", 1:50), sep = "_")
dt <- data.table(uid = 1:10000, m1, m2)
The functions for benchmarking
test1 <- function() Stacked(dt, "uid", c("varA", "varB"), "_")
## merged.stack
test2 <- function() merged.stack(dt, "uid", c("varA", "varB"), "_")
## unlist(..., use.names = TRUE) -- OPTIMIZED
test3 <- function() {
list(cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD)),
.SDcols = 2:51]),
cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD)),
.SDcols = 52:101]))
}
## unlist(..., use.names = FALSE) -- OPTIMIZED
test4 <- function() {
list(cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD, use.names = FALSE)),
.SDcols = 2:51]),
cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD, use.names = FALSE)),
.SDcols = 52:101]))
}
## Andreas's current answer
test5 <- function() {
list(dt[, list(variable = names(.SD),
value = unlist(.SD, use.names = FALSE)),
by = uid, .SDcols = 2:51],
dt[, list(variable = names(.SD),
value = unlist(.SD, use.names = FALSE)),
by = uid, .SDcols = 52:101])
}
The results
library(microbenchmark)
microbenchmark(Stacked = test1(), merged.stack = test2(),
unlist.namesT = test3(), unlist.namesF = test4(),
AndreasAns = test5(), times = 3)
# Unit: milliseconds
# expr min lq median uq max neval
# Stacked 391.3251 393.0976 394.8702 421.4185 447.9668 3
# merged.stack 764.3071 769.6935 775.0799 867.2638 959.4477 3
# unlist.namesT 1680.0610 1761.9701 1843.8791 1881.9722 1920.0653 3
# unlist.namesF 215.0827 242.7748 270.4669 270.6944 270.9218 3
# AndreasAns 16193.5084 16249.5797 16305.6510 16793.3832 17281.1154 3
^^ I'm not sure why Andreas's current answer is so slow here. The "optimization" I did was basically to unlist without using by, which made a huge difference on the "varB" (factor) columns.
The manual approach is still faster than the functions from "splitstackshape", but we are talking about milliseconds, and the package versions are pretty compact one-liners!
Sample output
For reference, here is what the output of Stacked() looks like. It's a list of "stacked" data.tables, one list item for each stacked variable.
test1()
# $varA
# uid .time_1 varA
# 1: 1 01 -0.6264538
# 2: 1 02 -0.8043316
# 3: 1 03 0.2353485
# 4: 1 04 0.6179223
# 5: 1 05 -0.2212571
# ---
# 499996: 10000 46 -0.6859073
# 499997: 10000 47 -0.9763478
# 499998: 10000 48 0.6579464
# 499999: 10000 49 0.7741840
# 500000: 10000 50 0.5195232
#
# $varB
# uid .time_1 varB
# 1: 1 01 D
# 2: 1 02 A
# 3: 1 03 S
# 4: 1 04 L
# 5: 1 05 T
# ---
# 499996: 10000 46 A
# 499997: 10000 47 W
# 499998: 10000 48 H
# 499999: 10000 49 U
# 500000: 10000 50 W
And, here is what the merged.stack output looks like. It's similar to what you would get when you use reshape(..., direction = "long") from base R.
test2()
# uid .time_1 varA varB
# 1: 1 01 -0.6264538 D
# 2: 1 02 -0.8043316 A
# 3: 1 03 0.2353485 S
# 4: 1 04 0.6179223 L
# 5: 1 05 -0.2212571 T
# ---
# 499996: 10000 46 -0.6859073 A
# 499997: 10000 47 -0.9763478 W
# 499998: 10000 48 0.6579464 H
# 499999: 10000 49 0.7741840 U
# 500000: 10000 50 0.5195232 W
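For comparison, a rough base-R equivalent of that call might look like the following (a sketch only; I haven't benchmarked it, and the argument details may need tweaking for your own data):
long <- reshape(as.data.frame(dt), direction = "long",
                idvar = "uid",
                varying = list(grep("^varA_", names(dt), value = TRUE),
                               grep("^varB_", names(dt), value = TRUE)),
                v.names = c("varA", "varB"),
                timevar = ".time_1",
                times = sprintf("%02d", 1:50))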
Shameless Self-promotion
You might want to try melt_ from my package Kmisc. melt_ is essentially a rewrite of reshape2:::melt.data.frame with most of the grunt work done in C, and avoids as much copying and type coercion as possible for a speedy implementation.
An example:
## devtools::install_github("Kmisc", "kevinushey")
library(Kmisc)
library(reshape2)
library(microbenchmark)
n <- 1E6
big_df <- data.frame( stringsAsFactors=FALSE,
x=sample(letters, n, TRUE),
y=sample(LETTERS, n, TRUE),
za=rnorm(n),
zb=rnorm(n),
zc=rnorm(n)
)
all.equal(
melt <- melt(big_df, id.vars=c('x', 'y')),
melt_ <- melt_(big_df, id.vars=c('x', 'y'))
)
## we don't convert the 'variable' column to factor by default
## if we do, we see they're identical
melt_$variable <- factor(melt_$variable)
stopifnot( identical(melt, melt_) )
microbenchmark( times=5,
melt=melt(big_df, id.vars=c('x', 'y')),
melt_=melt_(big_df, id.vars=c('x', 'y'))
)
gives me
Unit: milliseconds
expr min lq median uq max neval
melt 916.40436 931.60031 999.03877 1102.31090 1160.3598 5
melt_ 61.59921 78.08768 90.90615 94.52041 182.0879 5
With any luck, this will be fast enough for your data.
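As a side note (not part of the original answers): data.table itself gained a fast melt() method for data.tables in later versions (1.9.6+, if I remember correctly), so the original question can now be handled directly:
library(data.table)
dt <- data.table(uid = c("a", "b"), var1 = c(1, 2), var2 = c(100, 200))
melt(dt, id.vars = "uid")
#    uid variable value
# 1:   a     var1     1
# 2:   b     var1     2
# 3:   a     var2   100
# 4:   b     var2   200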
Related
I am performing a per-policy life insurance valuation in R. Monthly cash flow projections are performed per policy and return a data frame in the following format (for example):
Policy1 = data.frame(ProjM = 1:200,
Cashflow1 = rep(5,200),
Cashflow2 = rep(10,200))
My model returns a list (using lapply and a function which performs the per-policy cash flow projection, based on various per-policy details, escalation assumptions and life contingencies). I want to aggregate the cash flows across all policies by ProjM. The following code does what I want, but I am looking for a more memory-efficient way (i.e., not using the rbindlist function). Example data:
Policy1 = data.frame(ProjM = 1:5,
Cashflow1 = rep(5,5),
Cashflow2 = rep(10,5))
Policy2 = data.frame(ProjM = 1:3,
Cashflow1 = rep(50,3),
Cashflow2 = rep(-45,3))
# this is the output containing 35000 data frames:
ListOfDataFrames = list(Policy1 = Policy1, Policy2 = Policy2)
My code:
library(data.table)
OneBigDataFrame <- rbindlist(ListOfDataFrames)
MyOutput <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
Output required:
ProjM Cashflow1 Cashflow2
1 55 -35
2 55 -35
3 55 -35
4 5 10
5 5 10
I have looked for examples, and R aggregate list of dataframe performs aggregation for all data frames, but does not combine them into one data frame.
With data.table syntax the one step approach would be to create the big data.table first and then do the aggregation:
library(data.table)
OneBigDataFrame <- rbindlist(ListOfDataFrames)
OneBigDataFrame[, lapply(.SD, sum), by = ProjM]
or, more concisely:
rbindlist(ListOfDataFrames)[, lapply(.SD, sum), by = ProjM]
ProjM Cashflow1 Cashflow2
1: 1 55 -35
2: 2 55 -35
3: 3 55 -35
4: 4 5 10
5: 5 5 10
Now, the OP has requested to avoid creating the big data.table first in order to save memory. This requires a two-step approach in which the aggregates are computed for each data.table and are then aggregated to a grand total in the final step:
rbindlist(
lapply(ListOfDataFrames,
function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
)[, lapply(.SD, sum), by = ProjM]
ProjM Cashflow1 Cashflow2
1: 1 55 -35
2: 2 55 -35
3: 3 55 -35
4: 4 5 10
5: 5 5 10
Note that setDT() is used here to coerce the data.frames to data.table by reference, i.e., without creating an additional copy which saves time and memory.
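A quick way to see that by-reference coercion in action (just an illustration):
class(Policy1)
# [1] "data.frame"
setDT(Policy1)   # converts Policy1 to a data.table in place, without copying
class(Policy1)
# [1] "data.table" "data.frame"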
Benchmark
Using the benchmark data of d.b (list of 10000 data.frames with 100 rows each, 28.5 Mb in total) with all answers provided so far:
mb <- microbenchmark::microbenchmark(
malan = {
OneBigDataFrame <- rbindlist(test)
malan <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
},
d.b = d.b <- with(data = data.frame(do.call(dplyr::bind_rows, test)),
expr = aggregate(x = list(Cashflow1 = Cashflow1, Cashflow2 = Cashflow2),
by = list(ProjM = ProjM),
FUN = sum)),
a.gore = {
newagg <- function(dataset) {
dataset <- data.table(dataset)
dataset <- dataset[,lapply(.SD,sum),by=ProjM,.SDcols=c("Cashflow1","Cashflow2")]
return(dataset)
}
a.gore <- newagg(rbindlist(lapply(test,newagg)))
},
dt1 = dt1 <- rbindlist(test)[, lapply(.SD, sum), by = ProjM],
dt2 = dt2 <- rbindlist(
lapply(test,
function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
)[, lapply(.SD, sum), by = ProjM],
times = 5L
)
mb
Unit: milliseconds
expr min lq mean median uq max neval cld
malan 565.43967 583.08300 631.15898 600.45790 605.60237 801.2120 5 b
d.b 707.50261 710.31127 719.25591 713.54526 721.26691 743.6535 5 b
a.gore 14706.40442 14747.76305 14861.61641 14778.88547 14805.29412 15269.7350 5 d
dt1 40.10061 40.92474 42.27034 41.55434 42.07951 46.6925 5 a
dt2 8806.85039 8846.47519 9144.00399 9295.29432 9319.17251 9452.2275 5 c
The fastest solution is the one-step approach using data.table, which is about 15 times faster than the second fastest. Surprisingly, the two-step data.table approaches are orders of magnitude slower than the one-step approach, presumably because the grouped aggregation (and the setDT() coercion) is executed once for each of the 10000 data.frames, so the per-call overhead dominates.
To make sure that all solutions return the same result this can be checked using
all.equal(malan, d.b)
all.equal(malan, as.data.frame(a.gore))
all.equal(malan, as.data.frame(dt1))
all.equal(malan, as.data.frame(dt2))
which return TRUE in all cases.
I think this solution might be efficient. Give it a try and let me know.
require(data.table)
newagg <- function(dataset) {
  dataset <- data.table(dataset)
  dataset <- dataset[, lapply(.SD, sum), by = ProjM, .SDcols = c("Cashflow1", "Cashflow2")]
  return(dataset)
}
newagg(rbindlist(lapply(ListOfDataFrames,newagg)))
# ProjM Cashflow1 Cashflow2
# 1: 1 55 -35
# 2: 2 55 -35
# 3: 3 55 -35
# 4: 4 5 10
# 5: 5 5 10
I have a data table with 10 columns.
town
tc
one
two
three
four
five
six
seven
total
I need to generate the mean for columns "one" through "total", for which I am using:
DTmean <- DT[,(lapply(.SD,mean)),by = .(town,tc),.SDcols=3:10]
This generates the means, but I then want the resulting column names to be suffixed with "_mean", while the first two columns remain "town" and "tc". How can I do this? I tried the code below, but it renames every column from "one" through "total" to just "_mean":
for (i in 3:10) {
setnames(DTmean,i,paste0(names(i),"_mean"))
}
If you want to do it the data.table way, you should use setnames as follows:
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))
or:
cols <- names(DT)[3:10]
setnames(DTmean, cols, paste0(cols, '_mean'))
Furthermore, you don't need the .SDcols statement as you are aggregating all the other columns. Using DT[, lapply(.SD,mean), by = .(town,tc)] should thus give you the same result as using DT[, (lapply(.SD,mean)), by = .(town,tc), .SDcols=3:10].
On the following example dataset:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
tc = rep(c('C','D'), 10),
one = rnorm(20,1,1),
two = rnorm(20,2,1),
three = rnorm(20,3,1),
four = rnorm(20,4,1),
five = rnorm(20,5,2),
six = rnorm(20,6,2),
seven = rnorm(20,7,2),
total = rnorm(20,28,3))
using:
DTmean <- DT[, lapply(.SD,mean), by = .(town,tc)]
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))
gives:
> DTmean
town tc one_mean two_mean three_mean four_mean five_mean six_mean seven_mean total_mean
1: A C 1.7368898 1.883586 3.358440 4.849896 4.742609 5.089877 6.792513 29.20286
2: A D 0.8906842 1.826135 3.267684 3.760931 6.210145 7.320693 5.571687 26.56142
3: B C 1.4037955 2.474836 2.587920 3.719658 3.446612 6.510183 8.309784 27.80012
4: B D 0.8103511 1.153000 3.360940 3.945082 5.555999 6.198380 8.652779 28.95180
In reply to your comment: If you want to calculate both the mean and the sd simultaneously, you could do (adapted from my answer here):
DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))))), by = .(town,tc)]
which gives:
town tc one.mean one.sd two.mean two.sd three.mean three.sd four.mean four.sd five.mean five.sd six.mean six.sd seven.mean seven.sd total.mean total.sd
1: A C 0.2981842 0.3556520 1.578174 0.7788545 2.232366 0.9047046 4.896201 1.238877 4.625866 0.7436584 7.607439 1.7262628 7.949366 1.772771 28.94287 3.902602
2: A D 1.2099018 1.0205252 1.686068 1.5497989 2.671027 0.8323733 4.811279 1.404794 7.235969 0.7883873 6.765797 2.7719942 6.657298 1.107843 27.42563 3.380785
3: B C 0.9238309 0.6679821 2.525485 0.8054734 3.138298 1.0111270 3.876207 0.573342 3.843140 2.1991052 4.942155 0.7784024 6.783383 2.595116 28.95243 1.078307
4: B D 0.8843948 0.9384975 1.988908 1.0543981 3.673393 1.3505701 3.957534 1.097837 2.788119 1.9089660 6.463784 0.7642144 6.416487 2.041441 27.88205 3.807119
However, it is probably better to store this in long format. To get that, you could use data.table's melt function as follows:
cols <- names(DT)[3:10]
DT2 <- melt(DT[, as.list(unlist(lapply(.SD, function(x) list(mn = mean(x), sdev = sd(x))))), by = .(town,tc)],
id.vars = c('town','tc'),
measure.vars = patterns('.mn','.sdev'),
value.name = c('mn','sdev'))[, variable := cols[variable]]
or in a much simpler operation:
DT2 <- melt(DT, id.vars = c('town','tc'))[, .(mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)]
which results in:
> DT2
town tc variable mn sdev
1: A C one 0.2981842 0.3556520
2: A D one 1.2099018 1.0205252
3: B C one 0.9238309 0.6679821
4: B D one 0.8843948 0.9384975
5: A C two 1.5781743 0.7788545
6: A D two 1.6860675 1.5497989
7: B C two 2.5254855 0.8054734
8: B D two 1.9889082 1.0543981
9: A C three 2.2323655 0.9047046
10: A D three 2.6710267 0.8323733
11: B C three 3.1382982 1.0111270
12: B D three 3.6733929 1.3505701
.....
In response to your last comments, you can detect outliers as follows:
DT3 <- melt(DT, id.vars = c('town','tc'))
DT3[, `:=` (mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)
][, outlier := +(value < mn - sdev | value > mn + sdev)]
which gives:
town tc variable value mn sdev outlier
1: A C one 0.5681578 0.2981842 0.355652 0
2: A D one 0.5528128 1.2099018 1.020525 0
3: A C one 0.5214274 0.2981842 0.355652 0
4: A D one 1.4171454 1.2099018 1.020525 0
5: A C one 0.5820994 0.2981842 0.355652 0
---
156: B D total 23.4462542 27.8820524 3.807119 1
157: B C total 30.5934956 28.9524305 1.078307 1
158: B D total 30.5618759 27.8820524 3.807119 0
159: B C total 27.5940307 28.9524305 1.078307 1
160: B D total 24.8378437 27.8820524 3.807119 0
I have a fitting process that runs 100 times. Each time the output is a dataframe -- which I capture by using lapply to create a list of 100 dataframes.
The first two dataframes might look something like this (I have more than 1 column):
n1 = c(4, 5, 6)
df1 = data.frame(n1, row.names = c("height", "weight", "favcolor"))
n2 = c(2, 3, 5, 7)
df2 = data.frame(n2, row.names = c("height", "weight", "inseam", "favcolor"))
I would like to combine these dataframes (take the average height value, or the standard deviation of the weights, for example).
My first thought was to turn this list into a data frame, but that is not going to work because the arguments imply differing numbers of rows.
My second thought was to insert NAs for each rowname that didn't appear (so, I would be adding:
new_row <- c(NA, NA)
row.names(new_row) <- "inseam"
But I can't get that to work for a myriad of reasons, and I'm unable to even test whether, after adding an "inseam" row of NAs to the end of df1, taking the average of "inseam" gives the correct answer.
The correct average in this case being:
height 3
weight 4
inseam 5
favcolor 6.5
So, my question:
If you had a list of dataframes, where the row names are meaningful and need to be aggregated across like the above example, what is the best way to do that?
################
EDIT
Here is the full data from 3 of my datasets:
> mega_df[1]
[[1]]
coeff error pval
rf2 -1.15099200 0.5752430 4.540538e-02
rf3 -0.53430218 0.4928974 2.783635e-01
rf4 0.08784138 0.4933079 8.586711e-01
rf5 0.96002601 0.5070957 5.833327e-02
rm2 -0.36188368 0.4626464 4.340949e-01
rm3 0.01805873 0.4355164 9.669251e-01
rm4 0.45008373 0.4319557 2.974268e-01
rm5 1.04056503 0.4441024 1.912556e-02
rc_cat1 0.86231928 0.2827566 2.290799e-03
rc_cat2 1.21335473 0.2448206 7.192321e-07
rc_cat3 0.96196637 0.2044198 2.528247e-06
rc_cat4 1.04477290 0.3302644 1.559142e-03
rc_cat5 -0.58902623 1.5893867 7.109357e-01
rc_cat6 0.62569607 0.1720676 2.765407e-04
rc_cat7 0.29381724 0.4115594 4.752815e-01
rs2 0.12333678 0.7186019 8.637250e-01
rs3 1.22018613 0.6423970 5.750837e-02
rs4 1.96075220 0.6454184 2.381892e-03
rs5 2.58404946 0.6543862 7.853927e-05
1|3 0.01561497 0.4851330 9.743229e-01
3|4 1.82853786 0.4937675 2.128663e-04
4|5 3.73480100 0.5023435 1.047616e-13
> mega_df[2]
[[1]]
coeff error pval
rf2 -0.23364248 5.849338e-01 6.895734e-01
rf3 0.24054894 5.219730e-01 6.449094e-01
rf4 0.84072979 5.208259e-01 1.064788e-01
rf5 1.47867154 5.346970e-01 5.684640e-03
rm2 -0.29555400 4.465509e-01 5.080612e-01
rm3 0.31147504 4.131024e-01 4.508553e-01
rm4 0.73696523 4.141224e-01 7.514424e-02
rm5 1.14273148 4.271863e-01 7.472508e-03
rc_cat1 1.27479299 3.094432e-01 3.794740e-05
rc_cat2 1.10917318 2.619011e-01 2.284654e-05
rc_cat3 0.65782540 2.161602e-01 2.340525e-03
rc_cat4 0.40512225 3.301662e-01 2.198131e-01
rc_cat5 12.78797722 5.612311e-08 0.000000e+00
rc_cat6 0.41622889 1.677804e-01 1.310894e-02
rc_cat7 0.16833629 3.806498e-01 6.583198e-01
rs2 -0.02279305 7.225878e-01 9.748360e-01
rs3 0.68299485 6.759050e-01 3.122608e-01
rs4 1.36149302 6.780722e-01 4.465519e-02
rs5 2.18484594 6.863696e-01 1.456612e-03
1|3 0.35419237 5.844931e-01 5.445266e-01
3|4 2.12603072 5.928308e-01 3.354863e-04
4|5 3.97564508 5.999369e-01 3.431064e-11
> mega_df[3]
[[1]]
coeff error pval
rf2 -0.2733408 0.5884741 6.422961e-01
rf3 0.1764257 0.5257697 7.372050e-01
rf4 0.6504428 0.5248386 2.152271e-01
rf5 1.3967757 0.5356706 9.119879e-03
rm2 -0.2361284 0.4870015 6.277745e-01
rm3 0.2078729 0.4609270 6.519977e-01
rm4 0.6390950 0.4622065 1.667555e-01
rm5 1.1738653 0.4730686 1.308730e-02
rc_cat1 0.9337627 0.2958630 1.599133e-03
rc_cat2 1.0292916 0.2493133 3.651281e-05
rc_cat3 0.7088285 0.2012026 4.267587e-04
rc_cat4 0.6296966 0.3664883 8.576193e-02
rc_cat6 0.5475018 0.1720841 1.464662e-03
rc_cat7 0.4521113 0.3588440 2.077017e-01
rs2 -0.4663666 0.7031265 5.071541e-01
rs3 0.7810059 0.6489673 2.287985e-01
rs4 1.5178641 0.6522175 1.995271e-02
rs5 2.1916080 0.6578769 8.643075e-04
1|3 0.2569225 0.4659460 5.813597e-01
3|4 2.0648302 0.4769118 1.493906e-05
4|5 3.9312070 0.4855339 5.648509e-16
And I'm hoping to do some basic aggregations that end up returning:
avcoef averror avpval std(coef)
rf2 W X Y Z
rf3 ...
.
.
.
I guess you could just create a new column in each data set which will contain the row names and then merge accordingly, something like:
l <- lapply(list(df1, df2), function(x) {x$New <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "New", all = TRUE), l)
cbind(Res[1], Means = rowMeans(Res[-1], na.rm = TRUE))
#        New Means
# 1 favcolor 6.5
# 2 height 3.0
# 3 inseam 5.0
# 4 weight 4.0
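Since the question also mentions the standard deviation of the weights, the same Res object can be reused for that (a sketch):
cbind(Res[1],
      Means = rowMeans(Res[-1], na.rm = TRUE),
      SDs   = apply(Res[-1], 1, sd, na.rm = TRUE))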
This is probably highly related to this
Edit: For the new data set
l <- lapply(list(mega_df1, mega_df2, mega_df3), function(x) {x$RowName <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "RowName", all = TRUE), l)
library(data.table) ## v1.9.6+
dcast(melt(setDT(Res), "RowName"),
RowName ~ sub("\\..*", "", variable),
mean,
na.rm = TRUE,
value.var = "value")
# RowName coeff error pval
# 1: cat1 1.0236250 0.2960209 1.309293e-03
# 2: cat2 1.1172732 0.2520117 2.002619e-05
# 3: cat3 0.7762068 0.2072609 9.232706e-04
# 4: cat4 0.6931972 0.3423063 1.023781e-01
# 5: cat5 6.0994755 0.7946934 3.554678e-01
# 6: cat6 0.5298089 0.1706440 4.950048e-03
# 7: cat7 0.3047549 0.3836844 4.471010e-01
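And since the question also asks for std(coef), one way to get both the mean and the sd out of the same merged Res object is to melt it first (a sketch, not from the original answer):
long <- melt(setDT(Res), id.vars = "RowName")
long[, stat := sub("\\..*", "", variable)]
long[, .(mean = mean(value, na.rm = TRUE),
         sd   = sd(value, na.rm = TRUE)),
     by = .(RowName, stat)]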
I have a large data.table in R with several columns with dollar values. In a different column I have an inflation adjustment number. I am trying to figure out how to update each of my monetary columns with it multiplied by the inflation adjustment column. Suppose I have the data:
DT <- data.table(id=1:1000,year=round(runif(1000)*10),
inc1 = runif(1000), inc2 = runif(1000), inc3 = runif(1000),
deflator = rnorm(1000))
which gives output:
id year inc1 inc2 inc3 deflator
1: 1 8 0.4754808 0.6678110 0.41533976 -0.64126988
2: 2 2 0.6568746 0.7765634 0.70616373 0.39687915
3: 3 6 0.8192947 0.9236281 0.90002534 -0.69545700
4: 4 4 0.7781929 0.1624902 0.17565790 0.05263055
5: 5 7 0.6232520 0.8024975 0.86449836 0.70781887
---
996: 996 2 0.9676383 0.2238746 0.19822000 0.78564836
997: 997 9 0.9877410 0.5783748 0.57497438 -1.63365223
998: 998 8 0.2220570 0.6500632 0.19814932 1.00260174
999: 999 3 0.4793767 0.2830457 0.54835581 1.04168818
1000: 1000 8 0.2003476 0.6121637 0.02921505 0.34933690
In reality I have inc1 through inc100, rather than just three variables, and I want to figure out a way to perform this action:
DT[, inc1 := inc1 * deflator]
for each of my 100 income columns (inc1, inc2, inc3 in the fake data above). I will have more than 100 columns in the future, so I would like to figure out a way to loop the action over the columns. Is there a way to do this for all the income columns at once?
I would like to do something like:
inc_cols = c(inc1, inc2, inc3)
DT[, inc_cols := lapply(inc_cols,function(x)= x * deflator),]
or
DT[, inc_cols := lapply(.SD,function(x)= x * deflator),.SDcols = inc_cols]
but neither of these seems to work. I also tried using the get() function to make it clear that deflator is referencing a column, like:
DT[, inc_cols := lapply(.SD,function(x)= x * get(deflator)),.SDcols = inc_cols]
but had no luck. I also tried to loop through the variables with something like:
for (var in inc_cols) {
print(var)
DT[, get(var) := get(var) *infAdj2010_mult]
}
which returns
[1] "inc1"
Error in get(var) : object 'inc1' not found
I realize this is probably a straightforward question and I have tried to search the other questions here and various online guides and tutorials, but I cannot find an example matching my specific problem. It is similar to this question, but not exactly.
Thanks for your help!
You could try
DT[, (inc_cols) := lapply(.SD, function(x)
x * DT[['deflator']] ), .SDcols = inc_cols]
head(DT, 2)
# id year inc1 inc2 inc3 deflator
#1: 1 3 0.614838304 0.009796974 0.3236051 0.7735552
#2: 2 2 -0.001583579 -0.082289606 -0.1365115 -0.6644330
Or if you need a loop
for(inc in inc_cols){
nm1 <- as.symbol(inc)
DT[,(inc):= eval(nm1)*deflator]
}
head(DT,2)
# id year inc1 inc2 inc3 deflator
#1: 1 3 0.614838304 0.009796974 0.3236051 0.7735552
#2: 2 2 -0.001583579 -0.082289606 -0.1365115 -0.6644330
Or a possible option is using set(), which should be very fast as the overhead of [.data.table is avoided (suggested by @Arun):
indx <- grep('inc', colnames(DT))
for(j in indx){
set(DT, i=NULL, j=j, value=DT[[j]]*DT[['deflator']])
}
head(DT,2)
# id year inc1 inc2 inc3 deflator
#1: 1 3 0.614838304 0.009796974 0.3236051 0.7735552
#2: 2 2 -0.001583579 -0.082289606 -0.1365115 -0.6644330
where
inc_cols <- grep('^inc', colnames(DT), value=TRUE)
data
set.seed(24)
DT <- data.table(id=1:1000,year=round(runif(1000)*10),
inc1 = runif(1000), inc2 = runif(1000), inc3 = runif(1000),
deflator = rnorm(1000))
Since you can use dplyr on data.tables, you could also do:
library(dplyr)
DT %>% mutate_each(funs(.*deflator), starts_with("inc"))
Which will multiply each column of DT that starts with "inc" by the "deflator" column.
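Note that mutate_each() has since been deprecated in dplyr; with current dplyr (1.0.0 or later) the equivalent uses across():
library(dplyr)
DT %>% mutate(across(starts_with("inc"), ~ .x * deflator))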
This approach is also quite convenient, but likely slower than using set():
library(data.table); library(magrittr)
set.seed(42)
DT <- data.table(id=1:1000,year=round(runif(1000)*10),
inc1 = runif(1000), inc2 = runif(1000), inc3 = runif(1000),
deflator = rnorm(1000))
vars <- names(DT) %>% .[grepl("inc", .)]
DT[, (vars) := .SD * deflator, .SDcols = vars]
DT[]
id year inc1 inc2 inc3 deflator
1: 1 9 0.212563676 0.24806366 0.06860638 0.2505781
2: 2 9 -0.017438715 -0.12186792 -0.26241497 -0.2779240
3: 3 3 -1.414016119 -1.20714809 -0.76920337 -1.7247357
4: 4 8 -1.082336969 -1.78411512 -1.08720698 -2.0067049
5: 5 6 -0.644638321 -1.07757416 -0.20895576 -1.2918083
---
996: 996 1 -0.573551720 -1.93996157 -0.50171303 -2.1569621
997: 997 5 -0.007899417 -0.01561619 -0.05708009 -0.0920275
998: 998 1 -0.090975121 -0.30475714 -0.27291825 -0.3974001
999: 999 5 -0.045984079 -0.01563942 -0.07868934 -0.1383273
1000: 1000 0 -0.785962308 -0.63266975 -0.29247974 -0.8257650
@IceCreamToucan noted that the following should work in newer versions of data.table:
DT[, inc_cols := lapply(.SD,function(x)= x * deflator),.SDcols = inc_cols]
For me, I had to wrap the LHS as shown below to make it work (note that the RHS is also corrected here to function(x) x * deflator, without the stray =):
DT[, c(inc_cols) := lapply(.SD, function(x) x * deflator), .SDcols = inc_cols]
-or-
DT[, (inc_cols) := lapply(.SD, function(x) x * deflator), .SDcols = inc_cols]
If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't flag the first occurrence of a duplicated record. What is the best way to mark those first occurrences as well? With base::duplicated(), I solved this simply by a disjunction with the reversed-order call: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE)) -- but data.table::duplicated() doesn't offer fromLast=TRUE (I don't know why)...
P.S. ok, here's a primitive example
myDT <- fread(
"id,fB,fC
1, b1,c1
2, b2,c2
3, b1,c1
4, b3,c3
5, b1,c1
")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD:=duplicated(myDT)]
Rows 1, 3 and 5 are all duplicates, but only 3 and 5 are flagged by duplicated(), while I need to mark all of them.
UPD. Important notice: the answer I've accepted below works only for a keyed table. If you want to find duplicate records considering all columns, you have to setkey on all of those columns explicitly. For now I use the following workaround for that case:
dups1 <- duplicated(myDT);
dups2 <- duplicated(myDT, fromLast=TRUE);
dups <- dups1 | dups2;
This appears to work:
> myDT[unique(myDT),fD:=.N>1]
> myDT
id fB fC fD
1: 1 b1 c1 TRUE
2: 3 b1 c1 TRUE
3: 5 b1 c1 TRUE
4: 2 b2 c2 FALSE
5: 4 b3 c3 FALSE
Thanks to @flodel, the better way to do it is this:
> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
id fB fC fD
1: 1 b1 c1 TRUE
2: 3 b1 c1 TRUE
3: 5 b1 c1 TRUE
4: 2 b2 c2 FALSE
5: 4 b3 c3 FALSE
The difference in efficiency is substantial:
> microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1])
Unit: microseconds
expr min lq median uq max neval
key 679.874 715.700 735.0575 773.7595 1825.437 100
unique 1417.845 1485.913 1522.7475 1567.9065 24053.645 100
Especially for the max. What's going on there?
Many years ago this was the fastest answer by a large margin (see revision history if interested):
dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]
There have been a lot of internal changes since then, however, which have made many of the options about the same order of magnitude:
myDT <- data.table(id = sample(1e6),
fB = sample(seq_len(1e3), size= 1e6, replace=TRUE),
fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
setkey(myDT, fB, fC)
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT, by = key(myDT)),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)],
dup2 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(tail(dups, -1L), FALSE)]},
dup3 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(dups[-1L], FALSE)]},
times=10)
# expr min lq mean median uq max neval
# key 523.3568 567.5372 632.2379 578.1474 678.4399 886.8199 10
# unique 189.7692 196.0417 215.4985 210.5258 224.4306 290.2597 10
# dup 4440.8395 4685.1862 4786.6176 4752.8271 4900.4952 5148.3648 10
# dup2 143.2756 153.3738 236.4034 161.2133 318.1504 419.4082 10
# dup3 144.1497 150.9244 193.3058 166.9541 178.0061 460.5448 10
As of data.table version 1.9.8, the solution by eddi needs to be modified to be:
dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]
since:
Changes in v1.9.8 (on CRAN 25 Nov 2016)
POTENTIALLY BREAKING CHANGES
By default all columns are now used by unique(), duplicated() and
uniqueN() data.table methods, #1284 and #1841. To restore old
behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this
option to restore the old default will be deprecated with warning. In
2 years the option will be removed. Please explicitly pass by=key(DT)
for clarity. Only code that relies on the default is affected. 266
CRAN and Bioconductor packages using data.table were checked before
release. 9 needed to change and were notified. Any lines of code
without test coverage will have been missed by these checks. Any
packages not on CRAN or Bioconductor were not checked.
A third approach (one that appears more efficient for this small example):
You can explicitly call duplicated.data.frame....
myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)]
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)])
## Unit: microseconds
## expr min lq median uq max neval
## key 556.608 575.9265 588.906 600.9795 27713.242 100
## unique 1112.913 1164.8310 1183.244 1216.9000 2263.557 100
## dup 420.173 436.3220 448.396 461.3750 699.986 100
If we expand the size of the sample data.table, then the key approach is the clear winner
myDT <- data.table(id = sample(1e6),
fB = sample(seq_len(1e3), size= 1e6, replace=TRUE),
fC = sample(seq_len(1e3), size= 1e6,replace=TRUE ))
setkeyv(myDT, c('fB', 'fC'))
microbenchmark(
key=myDT[, fD := .N > 1, by = key(myDT)],
unique=myDT[unique(myDT),fD:=.N>1],
dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
.SDcols = key(myDT)],times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## key 355.9258 358.1764 360.7628 450.9218 500.8360 10
## unique 451.3794 458.0258 483.3655 519.3341 553.2515 10
## dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012 10
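As a final note, newer data.table versions (1.9.4 onwards, if I recall correctly) added a fromLast argument to duplicated(), so the base-R idiom from the question now carries over directly (a sketch):
myDT[, fD := duplicated(.SD) | duplicated(.SD, fromLast = TRUE), .SDcols = key(myDT)]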