How can I use data.table syntax to produce a data.table where each column contains the differences between the column of the original data.table and the next column?
Example: I have a data.table where each row is a group and each column is the surviving population after year 0, year 1, year 2, etc. Such as:
pop <- data.table(group_id = c(1, 2, 3),
N = c(4588L, 4589L, 4589L),
N_surv_1 = c(4213, 4243, 4264),
N_surv_2 = c(3703, 3766, 3820),
N_surv_3 = c(2953, 3054, 3159) )
# group_id N N_surv_1 N_surv_2 N_surv_3
# 1 4588 4213 3703 2953
# 2 4589 4243 3766 3054
# 3 4589 4264 3820 3159
(Data types differ because N is a true integer count and N_surv_1, etc. are projections that could be fractional.)
What I have done: using base R's diff and matrix transposition, we can do:
diff <- data.table(t(diff(t(as.matrix(pop[,-1,with=FALSE])))))
setnames(diff, paste0("deaths_",1:ncol(diff)))
cbind(group_id = pop[,group_id],diff)
# produces desired output:
# group_id deaths_1 deaths_2 deaths_3
# 1 -375 -510 -750
# 2 -346 -477 -712
# 3 -325 -444 -661
I know that I can use base diff by group on a single column produced by melt.data.table, so this works but ain't pretty:
melt(pop,
id.vars = "group_id"
)[order(group_id)][, setNames(as.list(diff(value)),
paste0("deaths_",1:(ncol(pop)-2)) ),
keyby = group_id]
Is that the most data.table-riffic way to do this, or is there a way to do it as a multi-column operation in data.table?
Well, you could subtract the subsets:
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
`-`,
utils:::tail.default(.SD, -1),
utils:::head.default(.SD, -1)
), .SDcols=ncols]
# N_surv_1 N_surv_2 N_surv_3
# 1: -375 -510 -750
# 2: -346 -477 -712
# 3: -325 -444 -661
You could assign these values to new columns with :=. I have no idea why tail and head are not made more easily available... As pointed out by @akrun, you could use with=FALSE instead, like pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols].
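To spell out that := assignment (a minimal sketch; the deaths_* names are just chosen for illustration):
# Sketch: add the pairwise differences as new deaths_* columns by reference
newcols <- paste0("deaths_", seq_len(length(ncols) - 1))
pop[, (newcols) := Map(`-`,
                       utils:::tail.default(.SD, -1),
                       utils:::head.default(.SD, -1)),
    .SDcols = ncols]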
Anyway, this is pretty convoluted compared to simply reshaping:
melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
# group_id V1
# 1: 1 -375
# 2: 1 -510
# 3: 1 -750
# 4: 2 -346
# 5: 2 -477
# 6: 2 -712
# 7: 3 -325
# 8: 3 -444
# 9: 3 -661
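If you want the wide deaths_* layout back from this long result, one option (a sketch reusing the same expression; rowid() just numbers the differences within each group) is to label each difference and dcast:
long <- melt(pop, id = "group_id")[, .(value = tail(value, -1) - head(value, -1)), by = group_id]
long[, variable := paste0("deaths_", rowid(group_id))]
dcast(long, group_id ~ variable, value.var = "value")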
Without reshaping the data, and since each row has a unique id, you can group by the id column and then calculate the differences with diff on each row, i.e. on unlist(.SD):
pop[, setNames(as.list(diff(unlist(.SD))), paste0("deaths_", 1:(ncol(pop)-2))), group_id]
# group_id deaths_1 deaths_2 deaths_3
# 1: 1 -375 -510 -750
# 2: 2 -346 -477 -712
# 3: 3 -325 -444 -661
Essentially, something like this if you ignore setting up the column names:
pop[, as.list(diff(unlist(.SD))), group_id]
Here's another way to do it without reshaping or grouping, which might make it faster. With a small number of rows the difference probably won't be noticeable.
cols <- names(pop)[-1]
# Build the list of adjacent column pairs: c("N", "N_surv_1"), c("N_surv_1", "N_surv_2"), ...
combs <- list()
for (i in 2:length(cols)) {
  combs[[length(combs) + 1]] <- c(cols[i - 1], cols[i])
}
# Name each new column after the later column of its pair
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
deathpop <- copy(pop)
# Each new column is the difference between the pair's columns; then drop the originals
deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
deathpop[, (cols) := NULL]
I did some benchmarking
rows<-10000000
pop <- data.table(group_id = 1:rows,
N = runif(rows,3000,4000),
N_surv_1 = runif(rows,3000,4000),
N_surv_2 = runif(rows,3000,4000),
N_surv_3 = runif(rows,3000,4000))
system.time({
cols<-names(pop)[-1]
combs<-list()
for(i in 2:length(cols)) {
combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
}
newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
deathpop<-copy(pop)
deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
deathpop[,(cols):=NULL]})
and it returned
user system elapsed
0.192 0.808 1.003
In contrast I did
system.time(pop[, as.list(diff(unlist(.SD))), group_id])
and it returned
user system elapsed
169.836 0.428 170.469
I also did
system.time({
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
`-`,
utils:::tail.default(.SD, -1),
utils:::head.default(.SD, -1)
), .SDcols=ncols]
})
which returned
user system elapsed
0.044 0.044 0.089
Finally, doing
system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id])
returns
user system elapsed
223.360 1.736 225.315
Frank's Map solution is fastest. If you take the copying out of mine then it gets a lot closer to Frank's time but his still wins for this test case.
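For reference, the no-copy variant mentioned above just assigns by reference into pop itself (a sketch, reusing combs and newnames from the benchmark code):
# Sketch: skip copy() and modify pop by reference instead
pop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]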
The problem is well-known: unlike data.frames, where one can refer to columns by character variables, the default behaviour of data.table is to want actual column names (e.g. you cannot do DT[, "X"]; you must do DT[, X] if your table has a column named "X").
In some cases this is a problem, because one wants to handle a generic dataset with arbitrary, user-defined column names.
I saw a couple of posts about this:
Pass column name in data.table using variable
Select / assign to data.table when variable names are stored in a character vector
And the official FAQ says I should use with = FALSE:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html#i-assigned-a-variable-mycol-x-but-then-dt-mycol-returns-x.-how-do-i-get-it-to-look-up-the-column-name-contained-in-the-mycol-variable
I do not really understand the quote + eval method, and the one with .. gave an error before I could even get started.
So I only compared the method using the actual column names (which I could not use in real practice), the one using get and the one using with = FALSE.
Interestingly, the latter, i.e. the official, recommended one, is the only one that does not work at all.
And get, while it works, for some reason is far slower than using the actual column names, which I really don't get (no pun intended).
So I guess I am doing something wrong...
Incidentally, but importantly, I turned to data.table because I needed to make a grouped mean of a fairly large dataset, and my previous attempts using aggregate, by or tapply were either too slow, or too memory-hungry, and they crashed R.
I cannot disclose the actual data I am working with, so I made a simulated dataset of the same size here:
require(data.table)
row.var = "R"
col.var = "C"
value.var = "V"
set.seed(934293)
d <- setNames(data.frame(sample(1:758145, 7582953, replace = T), sample(1:450, 7582953, replace = T), runif(7582953, 5, 9)),
c(row.var, col.var, value.var))
DT <- as.data.table(d)
rm(m)
print(system.time({
m <- DT[, mean(V), by = .(R, C)]
}))
# user system elapsed
# 1.64 0.27 0.51
rm(m)
print(system.time({
m <- DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]
}))
# user system elapsed
# 16.05 0.02 14.97
rm(m)
print(system.time({
m <- DT[, mean(value.var), by = .(row.var, col.var), with = FALSE]
}))
# Error in h(simpleError(msg, call)) :
#   error in evaluating the argument 'x' in selecting a method for function 'print':
#   missing value where TRUE/FALSE needed
# In addition: Warning message:
# In mean.default(value.var) :
#
# Error in h(simpleError(msg, call)) :
#   error in evaluating the argument 'x' in selecting a method for function 'print':
#   missing value where TRUE/FALSE needed
# Timing stopped at: 0 0 0
Any ideas?
collap from collapse gives better timing:
library(collapse)
> system.time(collap(DT, reformulate(c(row.var, col.var),
response = value.var), fmean))
user system elapsed
0.881 0.020 0.901
> system.time(fmean(fgroup_by(DT, c(row.var, col.var))))
user system elapsed
0.931 0.021 0.952
> system.time(DT[, mean(V), by = .(R, C)])
user system elapsed
5.052 0.099 0.646
Since the get approach (or the one with .SDcols) takes time, another approach is to interpolate the variable names into a string and evaluate it:
system.time(eval(parse(text = glue::glue("DT[, mean({value.var}), by = .({row.var}, {col.var})]"))))
user system elapsed
5.065 0.105 0.660
Checking the output:
> out_c <- collap(DT, reformulate(c(row.var, col.var),
response = value.var), fmean)
> out_d <- DT[, mean(V), by = .(R, C)]
> out_dte <- eval(parse(text = glue::glue("DT[, mean({value.var}), by = .({row.var}, {col.var})]")))
> out_c
R C V
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
> out_d[order(R, C)]
R C V1
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
> out_dte[order(R, C)]
R C V1
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
Once we get the output, the column names can be updated with setnames
> setnames(out_dte[order(R, C)], 'V1', value.var)[]
R C V
<int> <int> <num>
1: 1 16 5.237421
2: 1 77 5.080965
3: 1 95 5.822834
4: 1 107 7.276902
5: 1 224 5.565160
---
7499419: 758145 162 8.077307
7499420: 758145 181 6.094559
7499421: 758145 251 7.816277
7499422: 758145 310 8.373066
7499423: 758145 435 8.222885
This particular problem of programming data.tables can be solved without get() at least in two different ways:
1. Using .SDcols and passing character values to by
Here, .SDcols takes a vector of character column names to operate on. by = also accepts a character vector of column names.
DT[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)]
2. Using the new env parameter
With development version 1.14.3, data.table has gained a new interface for programming on data.table (see item 10 in the Changelog).
data.table::update.dev.pkg() # Install latest dev version (1.14.3)
library(data.table)
DT[, mean(v1), by = .(v2, v3), env = list(v1 = value.var, v2 = row.var, v3 = col.var)]
Benchmarks
microbenchmark::microbenchmark(
nat = DT[, mean(V), by = .(R, C)],
# get = DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))],
chr = DT[, lapply(.SD, mean), .SDcols = value.var, by = c(row.var, col.var)],
env = DT[, mean(v1), by = .(v2, v3), env = list( v1 = value.var, v2 = row.var, v3 = col.var)],
times = 3L,
check = "equivalent"
)
Unit: seconds
expr min lq mean median uq max neval
nat 1.275479 1.313737 1.356826 1.351995 1.397500 1.443005 3
chr 1.279219 1.297576 1.328261 1.315933 1.352782 1.389630 3
env 1.324985 1.327743 1.331603 1.330502 1.334912 1.339323 3
All three variants took approximately the same time; get was excluded because its run time was too long.
Edit: Benchmarks results including get:
Unit: seconds
expr min lq mean median uq max neval
nat 1.238719 1.288629 1.315594 1.338539 1.354032 1.369525 3
get 569.560843 572.417951 576.482720 575.275059 579.943658 584.612257 3
chr 1.275734 1.279536 1.309346 1.283338 1.326153 1.368967 3
env 1.298941 1.316105 1.326649 1.333268 1.340503 1.347738 3
So, get takes about 500 times longer than the other three variants.
An explanation for this behaviour is given by data.table when the verbose option is switched on:
options(datatable.verbose = TRUE)
Now,
DT[, mean(get(value.var)), by = .(get(row.var), get(col.var))]
gives the following hints:
Argument 'by' after substitute: .(get(row.var), get(col.var))
'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single
  j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
Old ansvars: [R, C, V]
New ansvars: [R, C, V]
Finding groups using forderv ... forder.c received 7582953 rows and 2 columns
0.600s elapsed (0.720s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.070s elapsed (0.060s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 7499423
0.470s elapsed (0.640s cpu)
lapply optimization is on, j unchanged as 'mean(get(value.var))'
GForce is on, left j unchanged
Old mean optimization changed j from 'mean(get(value.var))' to '.External(Cfastmean, get(value.var), FALSE)'
Making each group and running j (GForce FALSE) ...
[...]
The three other variants all use data.table's GForce optimization.
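As a quick sanity check (a sketch; the exact wording of the verbose log may differ between data.table versions), you can run one of the fast variants with verbose output switched on and look for the GForce message:
options(datatable.verbose = TRUE)
DT[, mean(V), by = .(R, C)]   # the verbose log should report GForce optimizing mean()
options(datatable.verbose = FALSE)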
I have a data frame with numeric columns and a character column with labels. See example:
library(tidyverse)
a <- c(0.036210845, 0.005546561, 0.004394322 ,0.006635205, 2.269306824 ,0.013542101, 0.006580308 ,0.006854309,0.009076331 ,0.006577178 ,0.099406840 ,0.010962796, 0.011491922,0.007454443 ,0.004463684,0.005836916,0.011119906 ,0.009543205, 0.003990476, 0.007793532 ,0.020776231, 0.011713687, 0.010045341, 0.008411304, 0.032514994)
b <- c(0.030677829, 0.005210211, 0.004164294, 0.006279456 ,1.095908581 ,0.012029876, 0.006193405 ,0.006486812, 0.008589699, 0.006167356, 0.068956516 ,0.010140064 ,0.010602171 ,0.006898081 ,0.004193735, 0.005447855 ,0.009936211, 0.008743681, 0.003774822, 0.007375678, 0.019695336, 0.010827791, 0.009258572, 0.007960328,0.026956408)
c <- c(0.025855453, 0.004882746 ,0.003946182, 0.005929399 ,0.466284591 ,0.010704604 ,0.005815709, 0.006125196, 0.008110854, 0.005769223, 0.046847336, 0.009356712, 0.009803620 ,0.006366758, 0.003936953 ,0.005072295, 0.008885989 ,0.007989028, 0.003565631, 0.006964512, 0.018636187, 0.010009413, 0.008540876, 0.007516569,0.022227924)
label <- c("fa05","fa05" ,"fa05", "fa10", "fa10", "fa10", "fa20","fa20", "faflat", "faflat", "sa05", "sa05", "sa10" , "sa10" , "sa10" , "sa10", "sa10", "sa10", "sa20", "sa20", "sa20" ,"sa20", "saflat", "saflat", "saflat")
dataframe <- as.data.frame(cbind(a,b,c,label))
dataframe <- dataframe %>%
transform(a = as.numeric(a)) %>%
transform(b = as.numeric(b)) %>%
transform(c = as.numeric(c))
I have written a function that, for each label, takes a sample of rows (sample size = number of rows with that label) and outputs the average of the sampled rows. Example: in the source data (dataframe) there are 3 rows with the label "fa05". Let's call them fa05_1, fa05_2, fa05_3 (just for explanation). The function takes a sample of these three rows, each consisting of the 3 columns (a, b and c). The number of fa05 rows in the sample equals the number of fa05 rows in the source data, so 3 in this case. The function samples with replacement, so it could e.g. be fa05_3, fa05_1, fa05_1. Then it takes the average of those three sampled rows for each of the three columns a, b and c and returns the output. It looks like this:
samp <- function(df, col1, var){
df %>%
group_by(!!col1) %>%
nest() %>%
ungroup() %>%
mutate(n = !!var) %>%
mutate(samp = map2(data, n, sample_n, replace=T)) %>%
select(-data) %>%
unnest(samp) %>%
group_by(!!col1) %>%
dplyr::summarise(across("a":"c", mean))
}
list <- c(3,3,2,2,2,6,4,3) # The number of times each label occurs in the data
samp(dataframe, quo(label), quo(list))
label a b c
<chr> <dbl> <dbl> <dbl>
1 fa05 0.00439 0.00416 0.00395
2 fa10 0.00894 0.00820 0.00752
3 fa20 0.00672 0.00634 0.00597
4 faflat 0.00908 0.00859 0.00811
5 sa05 0.0552 0.0395 0.0281
6 sa10 0.00715 0.00657 0.00603
7 sa20 0.0101 0.00956 0.00903
8 saflat 0.0250 0.0211 0.0177
I would like to use this function on some data and repeat it 1000 times efficiently. At first it was not a function and I used rerun(), but that was very inefficient. I read that I could write it as a function and then use lapply, which should be more efficient, but it does not work when I do it like this:
lapply(dataframe, samp, col1=quo(Pattern), var=quo(list))
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('double', 'numeric')"
How do I make this work with lapply? And how to I tell lapply to rerun the function 1000 times? I hope you can help.
You can just do this
replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE)
However, this is really slow.
> system.time(replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE))
user system elapsed
33.83 0.03 33.87
To make it faster, we need to rewrite your samp function. Here is a tidyverse approach
group_sample_size <- c("fa05" = 3, "fa10" = 3, "fa20" = 2, "faflat" = 2, "sa05" = 2, "sa10" = 6, "sa20" = 4, "saflat" = 3)
prep <- function(df, grp_var, sample_size) {
df %>%
mutate(size = sample_size[.data[[grp_var]]]) %>%
group_by(across(!!grp_var))
}
rep_sample <- function(df, n) {
replicate(
n,
df %>%
slice(sample.int(n(), size[[1L]], replace = TRUE)) %>%
summarise(across(a:c, mean), .groups = "drop"),
simplify = FALSE
)
}
dataframe %>%
prep("label", group_sample_size) %>%
rep_sample(1000)
Performance has improved significantly but is still suboptimal IMO. It takes about 5-6 seconds to finish the simulation.
> system.time(dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000))
user system elapsed
5.80 0.01 5.81
For efficiency, I think the following data.table approach would be better.
library(data.table)
fsamp <- function(df, grp_var, size, nsim) {
  df <- as.data.table(df)
  # Frequency of each group (also supplies the N column used when sampling below)
  group_info <- table(df[[grp_var]], dnn = list(grp_var))
  # Split the non-grouping columns into one sampling pool per group
  simu_pool <- df[, -grp_var, with = FALSE]
  simu_vars <- names(simu_pool)
  simu_pool <- split(simu_pool, df[[grp_var]])
  # Output skeleton: one row per group per simulation
  out <- data.table(
    simu = rep(seq_len(nsim), each = length(group_info)),
    group_info
  )
  out[
    # Look up each group's sample size by its label
    , size := size[out[[grp_var]]]
  ][
    # For each simulation/group, resample 'size' rows from that group's pool and average each column
    , (simu_vars) := lapply(simu_pool[[.BY[[grp_var]]]][sample.int(N, size, replace = TRUE)], mean),
    by = c("simu", grp_var)
  ][]
}
This one is about four times faster than the optimised tidyverse approach.
> system.time(fsamp(dataframe, "label", group_sample_size, 1000))
user system elapsed
1.47 0.04 1.50
All three approaches produce the same set of results
> set.seed(124)
> # rbindlist converts a list of tibbles into a single data.table
> dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569
> set.seed(124)
> fsamp(dataframe, "label", group_sample_size, 1000)
simu label N size a b c
1: 1 fa05 3 3 0.015383909 0.013350778 0.011561460
2: 1 fa10 3 3 0.763161377 0.371405971 0.160972865
3: 1 fa20 2 2 0.006717308 0.006340109 0.005970452
4: 1 faflat 2 2 0.009076331 0.008589699 0.008110854
5: 1 sa05 2 2 0.055184818 0.039548290 0.028102024
---
7996: 1000 faflat 2 2 0.007826754 0.007378527 0.006940039
7997: 1000 sa05 2 2 0.099406840 0.068956516 0.046847336
7998: 1000 sa10 6 6 0.006648513 0.006118159 0.005626362
7999: 1000 sa20 4 4 0.020776231 0.019695336 0.018636187
8000: 1000 saflat 3 3 0.008411304 0.007960328 0.007516569
> set.seed(124)
> replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569
I have a dataframe where one column is the amount spent. In the amount spent column there are the values for amount spent and also negative values for any returns. For example.
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
I want to remove the negative value and then one of its positive counterparts - the idea is to only keep fully completed spend amounts so I can look at total spend.
Right now I am thinking something like this - where I have the data frame sorted by spend
if spend < 0 {
take absolute value of spend
if diff between abs(spend) and spend+1 = 0 then both are NA}
I would like to have something like
df[df$spend < 0] <- NA
where I can also set one positive counterpart to NA as well. Any suggestions?
There should be a simpler solution to this, but here is one way. I also created my own example, since the one shared did not have enough data points to test with.
#Original vector
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
#Count the frequency of negative numbers, keeping all the unique numbers
vals <- table(factor(abs(x[x < 0]), levels = unique(abs(x))))
#Count the frequency of absolute value of original vector
vals1 <- table(abs(x))
#Subtract the frequencies between two vectors
new_val <- vals1 - (vals * 2 )
#Recreate the new vector
as.integer(rep(names(new_val), new_val))
#[1] 1 2 3
If you add a rowid column you can do this with data.table anti-joins.
Here's an example which takes ID into account, not deleting "positive counterparts" unless they're the same ID
First create more interesting sample data
df <- fread('
ID Store Spent
123 A 18.50
123 A -18.50
123 A 18.50
123 A -19.50
123 A 19.50
123 A -99.50
124 A -94.50
124 A 99.50
124 A 94.50
124 A 94.50
')
Now remove all the negative values with positive counterparts, and remove those counterparts
# Take the negative rows, flip the sign, and number duplicates within (ID, Spent)
negs <- df[Spent < 0][, Spent := -Spent][, rid := rowid(ID, Spent)]
pos <- df[Spent > 0][, rid := rowid(ID, Spent)]
# Anti-join: keep only the positive rows without a matching flipped negative
pos[!negs, on = .(ID, Spent, rid), -'rid']
# ID Store Spent rid
# 1: 123 A 18.5 2
# 2: 124 A 99.5 1
# 3: 124 A 94.5 2
And as applied to Ronak's x vector example
x <- c(1, 2, -2, 1, -1, -1, 2, 3, -4, 1, 4)
negs <- data.table(x = -x[x<0])[, rid := rowid(x)]
pos <- data.table(x = x[x>0])[, rid := rowid(x)]
pos[!negs, on = names(pos), -'rid']
# x
# 1: 2
# 2: 3
# 3: 1
I used the following code.
library(dplyr)
store <- rep(LETTERS[1:3], 3)
id <- c(1:4, 1:3, 1:2)
expense <- runif(9, -10, 10)
tibble(store, id, expense) %>%
group_by(store) %>%
summarise(net_expenditure = sum(expense))
to get this output:
# A tibble: 3 x 2
store net_expenditure
<chr> <dbl>
1 A 13.3
2 B 8.17
3 C 16.6
Alternatively, if you wanted the net expenditure per store-id pairing, then you could use this code:
tibble(store, id, expense) %>%
group_by(store, id) %>%
summarise(net_expenditure = sum(expense))
I've approached your question from a slightly different perspective. I'm not sure that my code answers your question, but it might help.
Pivoting and subtotals are common auxiliary steps in spreadsheets and SQL.
Assume a data.table with the fields date, myCategory, and revenue. Assume that you want to know the proportion of a day's revenue out of all revenue, and the proportion of a day's revenue within different subgroups, such that
b[,{
#First auxiliary variable of all revenue
totalRev = sum(revenue) #SUBGROUP OF ALL REV
#Second auxiliary variable of revenue by date, syntax wrong! How to do this?
{totalRev_date=sum(revenue), by=list(date)} #DIFFERENT SUBGROUP, by DATE's rev
#Within the subgroup by date and myCategory, we will use 1st&2nd auxiliary vars
.SD[,.(Revenue_prop_of_TOT=revenue/totalRev,
,Revenue_prop_of_DAY=revenue/totalRev_date) ,by=list(myCategory,date)]
},]
where we need to compute the auxiliary sums: all revenue of a specific day, and all revenue over the whole history.
The end result should look like this:
date myCategory Revenue_prop_of_TOT Revenue_prop_of_DAY
2019-01-01 Cat1 0.002 0.2
...
where you see that the auxiliary variables are only help functions.
How can you pivot and compute subtotals within R data.table?
Another option using data.table::cube:
cb <- cube(DT, sum(value), by=c("date","category"), id=TRUE)
cb[grouping==0L, .(date, category,
PropByDate = V1 / cb[grouping==1L][.SD, on="date", x.V1],
PropByCategory = V1 / cb[grouping==2L][.SD, on="category", x.V1],
PropByTotal = V1 / cb[grouping==3L, V1]
)]
output:
date category PropByDate PropByCategory PropByTotal
1: 1 1 0.3333333 0.2500000 0.1
2: 1 2 0.6666667 0.3333333 0.2
3: 2 1 0.4285714 0.7500000 0.3
4: 2 2 0.5714286 0.6666667 0.4
data:
DT <- data.table(date=c(1, 1, 2, 2), category=c(1, 2, 1, 2), value=1:4)
# date category value
#1: 1 1 1
#2: 1 2 2
#3: 2 1 3
#4: 2 2 4
Hopefully I'm understanding correctly what you intend but please let me know in the comments if you need a different output.
b = data.table(date = rep(seq.Date(Sys.Date()-99, Sys.Date(), "days"), each=2),
myCategory = c("a", "b"),
revenue = rnorm(100, 200))
# global total, just create a constant
totalRev = b[, sum(revenue)]
# Total revenue at myCategory and date level / total Revenue
b[, Revenue_prop_of_TOT:=sum(revenue)/totalRev, by=.(myCategory, date)]
# you can calculate totalRev_date independently
b[, totalRev_date:=sum(revenue), by=date]
# If these are all the columns you have you don't need the sum(revenue) and by calls
b[, Revenue_prop_of_DAY:=sum(revenue)/totalRev_date, by=.(myCategory, date)]
Finally I would wrap it in a function.
revenue_total <- function(b){
totalRev = b[, sum(revenue)]
b[, Revenue_prop_of_TOT:=sum(revenue)/totalRev, by=.(myCategory, date)]
b[, totalRev_date:=sum(revenue), by=date]
b[, Revenue_prop_of_DAY:=sum(revenue)/totalRev_date, by=.(myCategory, date)]
b
}
b = revenue_total(b)
Options for pivoting and subtotals in R
cube answered here
groupingsets commented by marbel here
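Since groupingsets() is only mentioned above, here is a rough sketch of how it could be applied to the same DT used in the cube answer (the grouping-set list and id column are illustrative):
groupingsets(DT, j = sum(value), by = c("date", "category"),
             sets = list(c("date", "category"), "date", "category", character(0)),
             id = TRUE)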
I have read some references to similar problems here on SO, but haven't been able to find a solution yet and wondering if there is any way to do the following using just data.table.
I'll use a simplified example, but in practice, my data table has > 1000 columns similar to var1, var2, ... var1000, etc.
dt <- data.table(uid=c("a","b"), var1=c(1,2), var2=c(100,200))
I am looking for a solution that will allow me to get an output similar to reshape's melt function --
> melt(dt, id=c("uid"))
uid variable value
1 a var1 1
2 b var1 2
3 a var2 100
4 b var2 200
That is, all the columns except for uid are listed under a single column with the corresponding values in an adjoining column. I have tried this with a combination of list, etc, but might be missing something that is obvious.
All uids in dt are unique.
Thanks in advance.
For a data.table reshape, try the following:
dt[, list(variable = names(.SD), value = unlist(.SD, use.names = F)), by = uid]
The cost of the syntax is worth it; the function runs very quickly!
stack generally outperforms melt.
A straightforward approach to this problem with stack would be:
dt[, stack(.SD), by = "uid"]
Of course, you can specify your .SDcols if necessary. And then, use setnames() to change the names to whatever you want.
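For example (a small sketch; stack() names its output columns values and ind):
# Sketch: stack per uid, then rename to match melt()'s conventions
res <- dt[, stack(.SD), by = "uid"]
setnames(res, c("values", "ind"), c("value", "variable"))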
(Self-promotion alert)
I wrote some functions and put them in a package called "splitstackshape". One of the functions is called Stacked() and, as of version 1.2.0 of the "splitstackshape" package, it should work very fast.
It's a little bit different from just stacking all the remaining columns in a data.table. It is more analogous to base R's reshape() than melt() from "reshape2". Here's an example of Stacked() in action.
I've created a decently large data.table to do this test. There are 50 numeric columns we want to stack, and 50 factor columns we want to stack. I've also further optimized @Andreas's answer.
The data
set.seed(1)
m1 <- matrix(rnorm(10000*50), ncol = 50)
m2 <- matrix(sample(LETTERS, 10000*50, replace = TRUE), ncol = 50)
colnames(m1) <- paste("varA", sprintf("%02d", 1:50), sep = "_")
colnames(m2) <- paste("varB", sprintf("%02d", 1:50), sep = "_")
dt <- data.table(uid = 1:10000, m1, m2)
The functions for benchmarking
test1 <- function() Stacked(dt, "uid", c("varA", "varB"), "_")
## merged.stack
test2 <- function() merged.stack(dt, "uid", c("varA", "varB"), "_")
## unlist(..., use.names = TRUE) -- OPTIMIZED
test3 <- function() {
list(cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD)),
.SDcols = 2:51]),
cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD)),
.SDcols = 52:101]))
}
## unlist(..., use.names = FALSE) -- OPTIMIZED
test4 <- function() {
list(cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD, use.names = FALSE)),
.SDcols = 2:51]),
cbind(dt[, "uid", with = FALSE],
dt[, list(variable = rep(names(.SD), each = nrow(dt)),
value = unlist(.SD, use.names = FALSE)),
.SDcols = 52:101]))
}
## Andreas's current answer
test5 <- function() {
list(dt[, list(variable = names(.SD),
value = unlist(.SD, use.names = FALSE)),
by = uid, .SDcols = 2:51],
dt[, list(variable = names(.SD),
value = unlist(.SD, use.names = FALSE)),
by = uid, .SDcols = 52:101])
}
The results
library(microbenchmark)
microbenchmark(Stacked = test1(), merged.stack = test2(),
unlist.namesT = test3(), unlist.namesF = test4(),
AndreasAns = test5(), times = 3)
# Unit: milliseconds
# expr min lq median uq max neval
# Stacked 391.3251 393.0976 394.8702 421.4185 447.9668 3
# merged.stack 764.3071 769.6935 775.0799 867.2638 959.4477 3
# unlist.namesT 1680.0610 1761.9701 1843.8791 1881.9722 1920.0653 3
# unlist.namesF 215.0827 242.7748 270.4669 270.6944 270.9218 3
# AndreasAns 16193.5084 16249.5797 16305.6510 16793.3832 17281.1154 3
^^ I'm not sure why Andreas's current answer is so slow here. The "optimization" I did was basically to unlist without using by, which made a huge difference on the "varB" (factor) columns.
The manual approach is still faster than the functions from "splitstackshape", but these are milliseconds we're talking about, and some pretty compact one-liner code!
Sample output
For reference, here is what the output of Stacked() looks like. It's a list of "stacked" data.tables, one list item for each stacked variable.
test1()
# $varA
# uid .time_1 varA
# 1: 1 01 -0.6264538
# 2: 1 02 -0.8043316
# 3: 1 03 0.2353485
# 4: 1 04 0.6179223
# 5: 1 05 -0.2212571
# ---
# 499996: 10000 46 -0.6859073
# 499997: 10000 47 -0.9763478
# 499998: 10000 48 0.6579464
# 499999: 10000 49 0.7741840
# 500000: 10000 50 0.5195232
#
# $varB
# uid .time_1 varB
# 1: 1 01 D
# 2: 1 02 A
# 3: 1 03 S
# 4: 1 04 L
# 5: 1 05 T
# ---
# 499996: 10000 46 A
# 499997: 10000 47 W
# 499998: 10000 48 H
# 499999: 10000 49 U
# 500000: 10000 50 W
And, here is what the merged.stack output looks like. It's similar to what you would get when you use reshape(..., direction = "long") from base R (a rough base-R sketch follows the output below).
test2()
# uid .time_1 varA varB
# 1: 1 01 -0.6264538 D
# 2: 1 02 -0.8043316 A
# 3: 1 03 0.2353485 S
# 4: 1 04 0.6179223 L
# 5: 1 05 -0.2212571 T
# ---
# 499996: 10000 46 -0.6859073 A
# 499997: 10000 47 -0.9763478 W
# 499998: 10000 48 0.6579464 H
# 499999: 10000 49 0.7741840 U
# 500000: 10000 50 0.5195232 W
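For comparison, here is a rough base-R sketch of that reshape(..., direction = "long") equivalent (hedged; the time variable comes out as a plain integer "time" column rather than the zero-padded .time_1 shown above):
# Sketch: base R long reshape over the varA_* and varB_* column sets
long <- reshape(as.data.frame(dt), direction = "long",
                idvar = "uid",
                varying = list(grep("^varA_", names(dt), value = TRUE),
                               grep("^varB_", names(dt), value = TRUE)),
                v.names = c("varA", "varB"))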
Shameless Self-promotion
You might want to try melt_ from my package Kmisc. melt_ is essentially a rewrite of reshape2:::melt.data.frame with most of the grunt work done in C, and avoids as much copying and type coercion as possible for a speedy implementation.
An example:
## devtools::install_github("Kmisc", "kevinushey")
library(Kmisc)
library(reshape2)
library(microbenchmark)
n <- 1E6
big_df <- data.frame( stringsAsFactors=FALSE,
x=sample(letters, n, TRUE),
y=sample(LETTERS, n, TRUE),
za=rnorm(n),
zb=rnorm(n),
zc=rnorm(n)
)
all.equal(
melt <- melt(big_df, id.vars=c('x', 'y')),
melt_ <- melt_(big_df, id.vars=c('x', 'y'))
)
## we don't convert the 'variable' column to factor by default
## if we do, we see they're identical
melt_$variable <- factor(melt_$variable)
stopifnot( identical(melt, melt_) )
microbenchmark( times=5,
melt=melt(big_df, id.vars=c('x', 'y')),
melt_=melt_(big_df, id.vars=c('x', 'y'))
)
gives me
Unit: milliseconds
expr min lq median uq max neval
melt 916.40436 931.60031 999.03877 1102.31090 1160.3598 5
melt_ 61.59921 78.08768 90.90615 94.52041 182.0879 5
With any luck, this will be fast enough for your data.