data.table operations with multiple group-by variable sets

I have a data.table that I would like to perform group-by operations on, but I would like to retain the null set (no grouping variables) and use several different sets of group-by variables.
A toy example:
library(data.table)
set.seed(1)
DT <- data.table(
  id = sample(c("US", "Other"), 25, replace = TRUE),
  loc = sample(LETTERS[1:5], 25, replace = TRUE),
  index = runif(25)
)
I would like to find the sum of index by all combinations of the key variables (including the null set). The concept is analogous to "grouping sets" in Oracle SQL; here is an example of my current workaround:
rbind(
  DT[, list(id = "", loc = "", sindex = sum(index)), by = NULL],
  DT[, list(loc = "", sindex = sum(index)), by = "id"],
  DT[, list(id = "", sindex = sum(index)), by = "loc"],
  DT[, list(sindex = sum(index)), by = c("id", "loc")]
)[order(id, loc)]
       id loc      sindex
 1:            11.54218399
 2:         A   2.82172063
 3:         B   0.98639578
 4:         C   2.89149433
 5:         D   3.93292900
 6:         E   0.90964424
 7: Other       6.19514146
 8: Other   A   1.12107080
 9: Other   B   0.43809711
10: Other   C   2.80724742
11: Other   D   1.58392886
12: Other   E   0.24479728
13:    US       5.34704253
14:    US   A   1.70064983
15:    US   B   0.54829867
16:    US   C   0.08424691
17:    US   D   2.34900015
18:    US   E   0.66484697
Is there a preferred "data table" way to accomplish this?

As of this commit, this is now possible with the development version of data.table, using cube() or groupingsets():
library("data.table")
# data.table 1.10.5 IN DEVELOPMENT built 2017-08-08 18:31:51 UTC
cube(DT, list(sindex = sum(index)), by = c("id", "loc"))
# id loc sindex
# 1: US B 0.54829867
# 2: US A 1.70064983
# 3: Other B 0.43809711
# 4: Other E 0.24479728
# 5: Other C 2.80724742
# 6: Other A 1.12107080
# 7: US E 0.66484697
# 8: US D 2.34900015
# 9: Other D 1.58392886
# 10: US C 0.08424691
# 11: NA B 0.98639578
# 12: NA A 2.82172063
# 13: NA E 0.90964424
# 14: NA C 2.89149433
# 15: NA D 3.93292900
# 16: US NA 5.34704253
# 17: Other NA 6.19514146
# 18: NA NA 11.54218399
groupingsets(DT, j = list(sindex = sum(index)), by = c("id", "loc"),
             sets = list(character(), "id", "loc", c("id", "loc")))
# id loc sindex
# 1: NA NA 11.54218399
# 2: US NA 5.34704253
# 3: Other NA 6.19514146
# 4: NA B 0.98639578
# 5: NA A 2.82172063
# 6: NA E 0.90964424
# 7: NA C 2.89149433
# 8: NA D 3.93292900
# 9: US B 0.54829867
# 10: US A 1.70064983
# 11: Other B 0.43809711
# 12: Other E 0.24479728
# 13: Other C 2.80724742
# 14: Other A 1.12107080
# 15: US E 0.66484697
# 16: US D 2.34900015
# 17: Other D 1.58392886
# 18: US C 0.08424691
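If you want the groupingsets() output to match the formatting of the original workaround (empty strings instead of NA, sorted), a small post-processing step does it; a minimal sketch, assuming no real NA groups exist in the data:
res = groupingsets(DT, j = list(sindex = sum(index)), by = c("id", "loc"),
                   sets = list(character(), "id", "loc", c("id", "loc")))
# blank out the NA markers, then sort as in the workaround
res[is.na(id), id := ""][is.na(loc), loc := ""][order(id, loc)]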

I have a generic function: feed it a data.table and a character vector of the dimensions you wish to group by, and it will return the sum of all numeric fields grouped by those dimensions.
rollSum = function(input, dimensions) {
  # cast dimension columns to character in case a dimension input is numeric
  for (x in seq_along(dimensions)) {
    input[[dimensions[x]]] = as.character(input[[dimensions[x]]])
  }
  # locate the numeric columns to be summed
  numericColumns = which(sapply(input, class) %in% c("integer", "numeric"))
  output = input[, lapply(.SD, sum, na.rm = TRUE), by = eval(dimensions),
                 .SDcols = numericColumns]
  return(output)
}
So then you can create a list of your different group by vectors:
groupings = list(c("id"),c("loc"),c("id","loc"))
And then use it with lapply and rbindlist like this:
groupedSets = rbindlist(lapply(groupings, function(x) {
  rollSum(DT, x)
}), fill = TRUE)
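Note that fill = TRUE pads the grouping columns a given set lacks with NA. Since rollSum() requires at least one dimension, the grand total (the null set) can be appended separately; a minimal sketch:
# append the grand total row, which rollSum() itself cannot produce
# because it requires at least one grouping dimension
grandTotal = DT[, lapply(.SD, sum, na.rm = TRUE), .SDcols = "index"]
allSets = rbindlist(list(groupedSets, grandTotal), fill = TRUE)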

Using dplyr, an adaptation of this should work, if I understand your question correctly.
library(dplyr)

sums <- mtcars %>%
  group_by(vs, am) %>%
  summarise(Sum = sum(mpg))
I didn't check how it treats the missing values, but it should just put them in another group (the last one).
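Applied to the DT from the original question: dplyr has no direct grouping-sets equivalent, so one option (a sketch, assuming the dplyr package) is to stack one summarise() per grouping set with bind_rows(), which fills the grouping columns a given set lacks with NA:
library(dplyr)

bind_rows(
  DT %>% summarise(sindex = sum(index)),
  DT %>% group_by(id) %>% summarise(sindex = sum(index)),
  DT %>% group_by(loc) %>% summarise(sindex = sum(index)),
  DT %>% group_by(id, loc) %>% summarise(sindex = sum(index))
)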

Related

Why does melt produce a warning when one passes a list to measure.vars?

I am just wondering: why does melt throw this warning when I pass a predefined list to the measure.vars argument?
ny = 3
times = 1990L + 10 * (seq_len(ny) - 1)
measure_vars_long = c('gdppc', 'lifexp')
measure_vars_wide = list(measure_var1 = paste0('gdppc_', times[-3]),
                         measure_var2 = paste0('lifexp_', times))
DT = data.table::data.table(
  country = c("c1", "c2", "c3"),
  gdppc_1990 = c(121782.304323278, 126794.359297492, 150476.795178838),
  gdppc_2000 = c(118858.692678623, 143942.932505161, 166981.9295872),
  lifexp_1990 = c(54.4529938697815, 93.5958937462419, 92.9653832130134),
  lifexp_2000 = c(88.843795270659, 77.9958764929324, 96.4652526797727),
  lifexp_2010 = c(81.6346854623407, 90.6221343390644, 63.0786676099524)
)
melt(DT,
     id.vars = 'country',
     measure = measure_vars_wide, # with(measure_vars_wide, list(measure_var1, measure_var2))
     variable.name = 'year',
     value.name = c('gdppc', 'lifexp'))
country year measure_var1 measure_var2
1: c1 1 121782.3 54.45299
2: c2 1 126794.4 93.59589
3: c3 1 150476.8 92.96538
4: c1 2 118858.7 88.84380
5: c2 2 143942.9 77.99588
6: c3 2 166981.9 96.46525
7: c1 3 NA 81.63469
8: c2 3 NA 90.62213
9: c3 3 NA 63.07867
Warning message:
'value.name' provided in both 'measure.vars' and 'value.name argument'; value provided in 'measure.vars' is given precedence.
However, this works:
melt(DT,
     id.vars = 'country',
     measure = with(measure_vars_wide, list(measure_var1, measure_var2)),
     variable.name = 'year',
     value.name = c('gdppc', 'lifexp'))
country year gdppc lifexp
1: c1 1 121782.3 54.45299
2: c2 1 126794.4 93.59589
3: c3 1 150476.8 92.96538
4: c1 2 118858.7 88.84380
5: c2 2 143942.9 77.99588
6: c3 2 166981.9 96.46525
7: c1 3 NA 81.63469
8: c2 3 NA 90.62213
9: c3 3 NA 63.07867
Reading ?data.table::melt, we see that
measure.vars: Measure variables for 'melt'ing. Can be missing, vector,
          list, or pattern-based.
          ...
          For convenience/clarity in the case of multiple 'melt'ed
          columns, resulting column names can be supplied as names to
          the elements 'measure.vars' (in the 'list' and 'patterns'
          usages). See also 'Examples'.
value.name: name for the molten data values column(s). ...
... though note well that
the names provided in 'measure.vars' take precedence.
This means that names in measure.vars can be used in place of value.name.
To confirm that, we can use the named-list variant, remove the value.name, and see that the new column names are indeed taken from names(measure_vars_wide):
melt(DT,
     id.vars = 'country',
     measure = measure_vars_wide,
     variable.name = 'year')
# country year measure_var1 measure_var2
# <char> <fctr> <num> <num>
# 1: c1 1 121782.3 54.45299
# 2: c2 1 126794.4 93.59589
# 3: c3 1 150476.8 92.96538
# 4: c1 2 118858.7 88.84380
# 5: c2 2 143942.9 77.99588
# 6: c3 2 166981.9 96.46525
# 7: c1 3 NA 81.63469
# 8: c2 3 NA 90.62213
# 9: c3 3 NA 63.07867
So you can use the named list exclusively: supply 'gdppc' and 'lifexp' as the list's names, drop the value.name= argument, and get the desired effect without any warnings.
Why does your second call behave differently? Because with(., list(.)) returns an unnamed list:
with(measure_vars_wide, list(measure_var1, measure_var2))
# [[1]]
# [1] "gdppc_1990" "gdppc_2000"
# [[2]]
# [1] "lifexp_1990" "lifexp_2000" "lifexp_2010"
In the end, I think you can stick with the with(...) code, or change your code to:
measure_vars_wide = list(gdppc = paste0('gdppc_', times[-3]),
                         lifexp = paste0('lifexp_', times))
melt(DT,
     id.vars = 'country',
     measure = measure_vars_wide,
     variable.name = 'year')
# country year gdppc lifexp
# <char> <fctr> <num> <num>
# 1: c1 1 121782.3 54.45299
# 2: c2 1 126794.4 93.59589
# 3: c3 1 150476.8 92.96538
# 4: c1 2 118858.7 88.84380
# 5: c2 2 143942.9 77.99588
# 6: c3 2 166981.9 96.46525
# 7: c1 3 NA 81.63469
# 8: c2 3 NA 90.62213
# 9: c3 3 NA 63.07867

Merging a sum by reference with data.table

Let's say I have two data.tables, dt_a and dt_b, defined as below.
library(data.table)
set.seed(20201111L)
dt_a <- data.table(
  foo = c("a", "b", "c")
)
dt_b <- data.table(
  bar = sample(c("a", "b", "c"), 10L, replace = TRUE),
  value = runif(10L)
)
dt_b[]
## bar value
## 1: c 0.4904536
## 2: c 0.9067509
## 3: b 0.1831664
## 4: c 0.0203943
## 5: c 0.8707686
## 6: a 0.4224133
## 7: a 0.6025349
## 8: b 0.4916672
## 9: a 0.4566726
## 10: b 0.8841110
I want to left join dt_b onto dt_a by reference, summing over the multiple matches. One way to do so is to first create a summary of dt_b (thus solving the multiple-match issue) and merge it afterwards.
dt_b_summary <- dt_b[, .(value=sum(value)), bar]
dt_a[dt_b_summary, value_good:=value, on=c(foo="bar")]
dt_a[]
## foo value_good
## 1: a 1.481621
## 2: b 1.558945
## 3: c 2.288367
However, this allocates memory for the intermediate object dt_b_summary, which is inefficient.
I would like to get the same result by joining directly on dt_b and summing over the multiple matches. I'm looking for something like below, but it won't work.
dt_a[dt_b, value_bad:=sum(value), on=c(foo="bar")]
dt_a[]
## foo value_good value_bad
## 1: a 1.481621 5.328933
## 2: b 1.558945 5.328933
## 3: c 2.288367 5.328933
Does anyone know if this is possible?
We can use by = .EACHI:
library(data.table)
dt_b[dt_a, .(value = sum(value)), on = .(bar = foo), by = .EACHI]
# bar value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
If we want to update the original object 'dt_a'
dt_a[, value := dt_b[.SD, sum(value), on = .(bar = foo), by = .EACHI]$V1]
dt_a
# foo value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
For multiple columns
dt_b$value1 <- dt_b$value
nm1 <- c('value', 'value1')
dt_a[, (nm1) := dt_b[.SD, lapply(.SD, sum),
                     on = .(bar = foo), by = .EACHI][, .SD, .SDcols = nm1]]
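For reference, dt_a should then hold one summed column per name in nm1 (the values follow from the sums shown above):
dt_a
#   foo    value   value1
#1:   a 1.481621 1.481621
#2:   b 1.558945 1.558945
#3:   c 2.288367 2.288367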

na.locf in data.table when completing by group

I have a data.table in which I'd like to complete a column to fill in some missing values; however, I'm having some trouble filling in the other columns.
library(data.table)
library(zoo)  # for na.locf

dt = data.table(a = c(1, 3, 5), b = c('a', 'b', 'c'))
dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))]
# a b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 a
# 5: 5 b
However, I'm looking for something more like this:
library(tidyr)
library(dplyr)

dt %>%
  complete(a = seq(min(a), max(a), 1)) %>%
  mutate(b = na.locf(b))
# # A tibble: 5 x 2
# a b
# <dbl> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 c
where the last value is carried forward
Another possible solution with only the (rolling) join capabilities of data.table:
dt[.(min(a):max(a)), on = .(a), roll = Inf]
which gives:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
On large datasets this will probably outperform every other solution.
Credit to @Mako212, who gave the hint of using seq in his answer.
First posted solution which works, but gives a warning:
dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]
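To check the performance claim on your own data, a hedged microbenchmark sketch (on this toy table the differences are noise; results depend on data size, and the zoo package is assumed for na.locf):
library(microbenchmark)
library(zoo)

microbenchmark(
  roll = dt[.(min(a):max(a)), on = .(a), roll = Inf],
  locf = dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))],
  times = 100L
)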
data.table recycles the shorter b column by default when you try dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))], so it never generates any NA values for na.locf to fill. Pretty sure you need to use a join here to "complete" the cases, and then you can use na.locf to fill.
dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))]
Not sure if there's a way to skip creating the intermediate table, but this gives you the desired result.
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
And I'll borrow @Jaap's min/max line to avoid creating the second table. So basically you can either use his rolling-join solution, or, if you want to use na.locf, this gets the same result.

New vector in R by values of another vector

I have a data frame of this form:
df <- data.frame(country = rep(x = LETTERS[1:4], each = 5),
                 year = rep(2001:2005),
                 C = runif(20, 30, 100),
                 Z = rnorm(20, mean = 0, sd = 1))
For each country, I would like to identify the value of Z where year == 2003 and divide all of that country's values of C by it, saving the results in a new vector "New". Each country's C values are thus divided by a different number, but the divisor is the same within one country. For example, all values in C for country A would be divided by -0.80212515, all values for country B by -0.62305076, etc. How can I do it? Thanks!
Your data does not match the example you shared in your post; you need to use set.seed() to make it reproducible. Anyway, here's a solution using dplyr:
set.seed(42)
df <- data.frame(country = rep(x = LETTERS[1:4], each = 5),
                 year = rep(2001:2005),
                 C = runif(20, 30, 100),
                 Z = rnorm(20, mean = 0, sd = 1))

library(dplyr)
df %>%
  group_by(country) %>%
  mutate(
    New = C / Z[year == 2003]
  ) %>%
  pull(New)
# [1] -67.70760 -68.83000 -36.02216 -63.45585 -53.94507 -24.97189 -30.70301
# [8] -14.84183 -28.60558 -29.87234 -360.88226 -467.30510 -555.07518 -278.50602
# [15] -362.73532 -54.33474 -55.85181 -21.67929 -35.87291 -39.26086
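If you want New as a column of the data frame rather than a bare vector, drop the pull() step; a minor variation on the above:
df_new <- df %>%
  group_by(country) %>%
  mutate(New = C / Z[year == 2003]) %>%
  ungroup()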
Another solution, using base R:
# extract Z for 2003 for each country
v1 <- df[df$year == 2003, "Z"]
# create a vector with the correct length for the division
z_2003 <- rep(x = v1[1:4], each = 5)
# new vector: C divided by Z for the appropriate year
new <- df$C / z_2003
If you want to merge the new columns with the old data frame:
df2 <- cbind(df, z_2003, new)
A data.table alternative to @Shree's dplyr:
set.seed(42)
dt <- data.table(country = rep(x = LETTERS[1:4], each = 5),
                 year = rep(2001:2005),
                 C = runif(20, 30, 100),
                 Z = rnorm(20, mean = 0, sd = 1))
dt[, New := C / Z[year == 2003], by = "country"]
dt
# country year C Z New
# 1: A 2001 94.03642 1.3048697 -67.70760
# 2: A 2002 95.59528 2.2866454 -68.83000
# 3: A 2003 50.02977 -1.3888607 -36.02216
# 4: A 2004 88.13133 -0.2787888 -63.45585
# 5: A 2005 74.92219 -0.1333213 -53.94507
# 6: B 2001 66.33672 0.6359504 -24.97189
# 7: B 2002 81.56118 -0.2842529 -30.70301
# 8: B 2003 39.42666 -2.6564554 -14.84183
# 9: B 2004 75.98946 -2.4404669 -28.60558
# 10: B 2005 79.35453 1.3201133 -29.87234
# 11: C 2001 62.04192 -0.3066386 -360.88226
# 12: C 2002 80.33786 -1.7813084 -467.30510
# 13: C 2003 95.42706 -0.1719174 -555.07518
# 14: C 2004 47.88002 1.2146747 -278.50602
# 15: C 2005 62.36050 1.8951935 -362.73532
# 16: D 2001 95.80102 -0.4304691 -54.33474
# 17: D 2002 98.47585 -0.2572694 -55.85181
# 18: D 2003 38.22412 -1.7631631 -21.67929
# 19: D 2004 63.24980 0.4600974 -35.87291
# 20: D 2005 69.22329 -0.6399949 -39.26086
And an option that relies on neither data.table nor dplyr:
do.call(rbind,
by(df, df$country, FUN = function(a) transform(a, New = C/Z[year==2003])))
Use split to process each country's data separately, then combine:
r = sapply(split(df, df$country), function(x) x$C / x$Z[x$year == 2003])
d = tidyr::gather(as.data.frame(r), country, New)
Edit:
set.seed(0)
df <- data.frame(country = rep(x = LETTERS[1:4], each = 5),
                 year = rep(2001:2005),
                 C = runif(20, 30, 100),
                 Z = rnorm(20, mean = 0, sd = 1))
r = sapply(split(df, df$country), function(x) x$C / x$Z[x$year == 2003])
d = tidyr::gather(as.data.frame(r), country, New)
cbind(df, d)

Cross-correlation with multiple groups in one data.table

I'd like to calculate the cross-correlations between groups of time series within one data.table. I have time series data in this format:
data = data.table(group = c(rep("a", 5), rep("b", 5), rep("c", 5)), Y = rnorm(15))
group Y
1: a 0.90855520
2: a -0.12463737
3: a -0.45754652
4: a 0.65789709
5: a 1.27632196
6: b 0.98483700
7: b -0.44282527
8: b -0.93169070
9: b -0.21878359
10: b -0.46713392
11: c -0.02199363
12: c -0.67125826
13: c 0.29263953
14: c -0.65064603
15: c -1.41143837
Each group has the same number of observations. What I am looking for is a way to obtain the cross-correlations between the groups:
group.1 group.2 correlation
a b 0.xxx
a c 0.xxx
b c 0.xxx
I am working on a script to subset each group and append the cross-correlations, but the data size is fairly large. Is there an efficient / zen way to do this?
Does this help?
data[, id := rep(1:5, 3)]
dtw = dcast.data.table(data, id ~ group, value.var = "Y")[, id := NULL]
cor(dtw)
See Correlation between groups in R data.table
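To get the requested long format (group.1, group.2, correlation) from that correlation matrix, you can take its upper triangle; a minimal sketch:
cm = cor(dtw)
# keep each unordered pair exactly once via the upper triangle
idx = which(upper.tri(cm), arr.ind = TRUE)
data.table(group.1 = rownames(cm)[idx[, 1]],
           group.2 = colnames(cm)[idx[, 2]],
           correlation = cm[idx])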
Another way would be:
# data
set.seed(45L)
data = data.table( group = c(rep("a", 5),rep("b",5),rep("c",5)) , Y = rnorm(15) )
# method 2
setkey(data, "group")
data2 = data[J(c("b", "c", "a"))][, list(group2=group, Y2=Y)]
data[, c(names(data2)) := data2]
data[, cor(Y, Y2), by=list(group, group2)]
# group group2 V1
# 1: a b -0.2997090
# 2: b c 0.6427463
# 3: c a -0.6922734
And to generalize this "other" way to more than three groups...
data = data.table(group = c(rep("a", 5), rep("b", 5), rep("c", 5), rep("d", 5)),
                  Y = rnorm(20))
setkey(data, "group")
groups = unique(data$group)
ngroups = length(groups)
library(gtools)
pairs = combinations(ngroups,2,groups)
d1 = data[pairs[,1],,allow.cartesian=TRUE]
d2 = data[pairs[,2],,allow.cartesian=TRUE]
d1[,c("group2","Y2"):=d2]
d1[,cor(Y,Y2), by=list(group,group2)]
# group group2 V1
# 1: a b 0.10742799
# 2: a c 0.52823511
# 3: a d 0.04424170
# 4: b c 0.65407400
# 5: b d 0.32777779
# 6: c d -0.02425053
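If you'd rather not depend on gtools, base R's combn() yields the same pair matrix; a hedged drop-in for the combinations() line:
# same pairs as gtools::combinations(ngroups, 2, groups), base R only
pairs = t(combn(groups, 2))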
