Row operations on selected columns based on substring in data.table - r

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.
Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I take the sum of the values in all columns starting with 'v' and subtract the average of the values in all columns starting with 'f'.
update: the proposed solution must (a) use the := operator to make the most of data.table's fast performance, and (b) generalize to operations other than mean and sum, which I used here just for the sake of simplicity
library(data.table)
# generate data
dt <- data.table(id = letters[1:5],
                 v1 = 1:5,
                 v2 = 1:5,
                 f1 = 11:15,
                 f2 = 11:15)
dt
#> id v1 v2 f1 f2
#> 1: a 1 1 11 11
#> 2: b 2 2 12 12
#> 3: c 3 3 13 13
#> 4: d 4 4 14 14
#> 5: e 5 5 15 15
# what I've tried
dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]

We melt the dataset into 'long' format using the measure argument, get the difference between the sum of 'v' and the mean of 'f' grouped by 'id', join the result to the original dataset on the 'id' column, and assign (:=) 'V1' as the 'Y' variable.
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
, sum(v) - mean(f), id], Y := V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Or another option is with Reduce, after creating indices of the 'v' and 'f' columns
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
nf <- length(nmf)
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE]) - (Reduce(`+`, .SD[, nmf, with = FALSE]) / nf)]

rowSums and rowMeans combined with grep can accomplish this.
dt$Y <- rowSums(dt[, grep("^v", names(dt)), with = FALSE]) - rowMeans(dt[, grep("^f", names(dt)), with = FALSE])
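To also satisfy the update's constraints (assignment with := and the freedom to swap in other row-wise summaries), here is a hedged variant of the same idea; the vcols/fcols names are illustrative only.
vcols <- grep("^v", names(dt), value = TRUE)
fcols <- grep("^f", names(dt), value = TRUE)
# rowSums()/rowMeans() can be swapped for any other row-wise summary,
# e.g. apply(.SD[, fcols, with = FALSE], 1, some_function)
dt[, Y := rowSums(.SD[, vcols, with = FALSE]) -
          rowMeans(.SD[, fcols, with = FALSE])]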

Related

R data.table: keep column when grouping by expression

When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1., .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row index, extract that column ($V1) and subset the original dataset
library(data.table)
Q[Q[, .I[c(1., .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in the 'x' and 'y' columns differ from the question's because no set.seed was used.
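To see what the inner call produces (and why $V1 is extracted), a small sketch of the intermediate index table; idx is an illustrative name.
idx <- Q[, .I[c(1, .N)], by = f(a)]
idx        # two columns: f (the grouping value) and V1 (global row numbers of Q)
Q[idx$V1]  # subsetting Q itself, so the 'a' column is kept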

Last observation of the previous group

If I have data that I can group by a variable, how can I get the last observation of the previous group?
I have the following data:
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))
I would like to create a new data.table that has the group ID and the difference between the last observation of the group and the last observation of the previous group. So that from the above I'd get:
a c
1: 1 NA
2: 2 9
3: 3 1
4: 4 -8
5: 5 5
Thanks!
We group by 'a', take the last element of 'b' as 'c', and then subtract the lagged value of 'c' obtained with shift
dt[, .(c = last(b)), a][, c := c - shift(c)][]
Here is a way:
dt[, c := b * (1:.N == .N), by = a] ## get last row within the group
dt <- dt[b == c] ## filter data.table to get rows of interest
dt[, c := shift(c, type = "lag") - c][] ## getting difference using shift with lag argument
# a b c
#1: 1 11 NA
#2: 2 10 NA
#3: 3 18 9
#4: 4 19 -7
#5: 5 12 -8
data
set.seed(1)
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))

R data.table - group by column includes list

I am trying to use the group-by functionality of the data.table package in R.
start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> group V1
1: a 2.6667
2: b 4.3333
This works fine.
Since I am working with dates, it can happen that a particular date has not just one but two groups.
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 c("a", "b")
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> Error in `[.data.table`(a, , mean(value), by = group) :
The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).
I would like the date that has both groups to be used when calculating the mean of group 'a' as well as the mean of group 'b'.
Expected results:
mean a: 2.6667
mean b: 3.75
Is that possible with the data.table package?
Update
Thanks to akrun, my initial issue is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the groups), I need the data.table back in its "original" form, with unique rows based on the date. My solution so far:
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-02 2 b
4 2014-01-03 3 b
5 2014-01-04 4 b
6 2014-01-05 5 a
7 2014-01-06 6 b
# creates new column with mean based on group
b[,factor := mean(value), by=group]
#creates new data.table c without duplicate rows (based on date) + if a row has group a & b it creates the product of their factors
c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]
date value group factor
01/01/14 1 a 2.666666667
02/01/14 2 c("a", "b") 10
03/01/14 3 b 3.75
04/01/14 4 b 3.75
05/01/14 5 a 2.666666667
06/01/14 6 b 3.75
I guess it is not the perfect way to do it, but it works. Any suggestions on how I could do it better?
Alternative solution (really slow!!!):
d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group]
for(i in 1:NROW(a)){
  y1 <- 1
  for(j in a[i, group][[1]]){
    y1 <- y1 * d[group == j, V1]
  }
  a[i, factor := y1]
}
My fastest solution so far:
# split rows that have more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')
Any chance to make it faster?
One option would be to group by the row sequence: unlist the list column ('group'), paste the elements together (toString(..)), use cSplit from splitstackshape with direction='long' to reshape into 'long' format, and then get the mean of the 'value' column using 'grp' as the grouping variable.
library(data.table)
library(splitstackshape)
a[, grp:= toString(unlist(group)), 1:nrow(a)]
cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
# grp V1
#1: a 2.666667
#2: b 3.750000
Just realized that another option using splitstackshape would be listCol_l, which unlists a list column into long form. As the output is a data.table, we can use data.table methods to calculate the mean. It is a much more compact way to get the mean.
listCol_l(a, 'group')[, mean(value), group_ul]
# group_ul V1
#1: a 2.666667
#2: b 3.750000
Or another option, without using splitstackshape, would be to replicate the rows of the dataset by the length of each list element. lengths() is a convenient wrapper for sapply(group, length) and is much faster. Then we change the 'group' column by unlisting the original 'group' from the 'a' dataset and get the mean of 'value', grouped by 'group'.
a[rep(1:nrow(a), lengths(group))][,
group:=unlist(a$group)][, mean(value), by = group]
# group V1
#1: a 2.666667
#2: b 3.750000
A shorter solution, posted by #mike-h in this question, also uses unlist() but groups by the remaining columns:
require(data.table)
a = data.table(date = time.span,
value = c(1,2,3,4,5,6),
group = list('a',c('a','b'),'b','b','a','b'))
a[ , .(group = unlist(group)), .(date, value)][ , mean(value), group ]
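Regarding the follow-up in the update ("Any chance to make it faster?"), one hedged sketch: compute the per-group means once and look them up against each row's list element, instead of replicating rows a second time for the final product. The names grp_means, mlookup and factor are illustrative only.
b <- a[rep(seq_len(nrow(a)), lengths(group))][, group := unlist(a$group)]
grp_means <- b[, .(m = mean(value)), by = group]    # a: 2.6667, b: 3.75 (as above)
mlookup <- setNames(grp_means$m, grp_means$group)   # named lookup vector
# product of the means of all groups attached to each row,
# e.g. 2.6667 * 3.75 = 10 for 2014-01-02, matching the table in the update
a[, factor := vapply(group, function(g) prod(mlookup[g]), numeric(1))]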

Use of lapply .SD in data.table R

I am not very clear about the use of .SD and by.
For instance, does the snippet below mean: 'change all the columns in DT to factor except A and B'? The data.table manual also says: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)" - so are columns A and B excluded?
DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]
However, I have also read that by works like GROUP BY in SQL when aggregating. For instance, if I would like to sum over all the columns except A and B, do I still use something similar? Or, in this case, does the code below take the sum grouped by the values in columns A and B (sum and group by A, B, as in SQL)?
DT[,lapply(.SD,sum),by=.(A,B)]
Then how do I do a simple colsum over all the columns except A and B?
Just to illustrate the comments above with an example, let's take
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
A few notes:
You said "does the below snippet... change all the columns in DT..."
The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
You mentioned wanting to change the column types
Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (inefficient) way to do this, as is typically done with data.frames in base R, is to overwrite the old data.table with the new one you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].
This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
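To make the by-reference update concrete, a minimal self-contained sketch (the two-column table DT2 is invented here for illustration; it is not the DT from the example above):
library(data.table)
DT2 <- data.table(A = 1:3, B = c("x", "y", "z"))
DT2[, names(DT2) := lapply(.SD, as.factor)]   # updates DT2 in place, no copy made
sapply(DT2, class)
#        A        B
# "factor" "factor"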
You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can see this from the output when the verbose = TRUE argument is added within []); to see this in action, let's beef up your DT a bit and run a benchmark:
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100

Selection of levels of factors within a factor

This is my example:
df<-data.frame(ID=as.factor(c(rep("A",20),rep("B",15))),var=as.factor(c(rep("w",5),rep("x",10),rep("y",12),rep("z",8))), obs=runif(35,0,10))
What I want to do is, for each 'ID', select a single 'var', either at random or possibly by selecting the 'var' with the most 'obs'. So, for example, a random selection could give this:
ID var obs
6 A x 3.44405412
7 A x 1.50957637
8 A x 8.22009420
9 A x 7.47094473
10 A x 8.26098410
11 A x 9.62919537
12 A x 0.10393890
13 A x 0.11298502
14 A x 4.33822574
15 A x 4.20109035
28 B z 1.07697286
29 B z 8.40864310
30 B z 7.62563257
31 B z 0.06885177
32 B z 4.33959316
33 B z 7.98303782
34 B z 8.38335593
35 B z 4.52110318
Thank you in advance for your help.
Here's another data.table approach. To begin...
library(data.table)
setDT(df)
Then, select the var for each ID:
# var with highest #obs
idvar_selected = df[,.(var = .SD[,.N,by=var][which.max(N)]$var), by=ID]
# or... at random, weighted by #obs
idvar_selected = df[,.(var = sample(var,1)), by=ID]
And "join" using the selection:
df[idvar_selected, on=c("ID","var")]
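For a concrete run, a usage sketch with the seeded data shown in the 'data' section of the next answer (set.seed(24)); the counts follow from how the example data is built, so ID 'A' has ten 'x' rows and 'B' has eight 'z' rows, and the join returns those 18 rows.
set.seed(24)
df <- data.frame(ID = as.factor(c(rep("A", 20), rep("B", 15))),
                 var = as.factor(c(rep("w", 5), rep("x", 10), rep("y", 12), rep("z", 8))),
                 obs = runif(35, 0, 10))
setDT(df)
idvar_selected <- df[, .(var = .SD[, .N, by = var][which.max(N)]$var), by = ID]
df[idvar_selected, on = c("ID", "var")]   # 10 rows of (A, x) and 8 rows of (B, z)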
One option using data.table.
We convert the 'data.frame' to 'data.table' (setDT(df)). Grouped by 'ID' and 'var', we create a variable 'N' that gives the number of rows (.N) for each group. Then, we group by 'ID' and subset the rows that have the max value of 'N' (.SD[N==max(N)]). The 'N' column can be assigned to 'NULL' as it is not needed in the expected output.
library(data.table)
setDT(df)[,N := .N , by = .(ID, var)][, .SD[N==max(N)] ,
by = .(ID)][, N:= NULL][]
# ID var obs
# 1: A x 9.2044378
# 2: A x 2.7973557
# 3: A x 7.6382046
# 4: A x 8.0163062
# 5: A x 2.5472509
# 6: A x 6.0488886
# 7: A x 3.7073495
# 8: A x 6.7169025
# 9: A x 6.7298231
#10: A x 3.2043056
#11: B z 5.9973018
#12: B z 6.3014766
#13: B z 0.4663503
#14: B z 3.1951313
#15: B z 2.3874890
#16: B z 3.6881753
#17: B z 1.4802475
#18: B z 9.3776173
By assigning a new column, we are changing the original dataset 'df'. We could remove that column later from the original dataset by
df[, N:=NULL]
Or a modification of the above code without assigning (:=), so that the original dataset remains the same. We concatenate .SD (i.e. the Subset of Data.table) with .N to create the new column 'N', and then subset the rows as before.
setDT(df)[, c(list(N=.N), .SD) ,by =.(ID, var)][,
.SD[N==max(N)], by =ID][, N:= NULL][]
Or as suggested by #Frank, we can copy(.SD) to avoid the original dataset getting changed, then assign the 'N', and do as before.
setDT(df)[,copy(.SD)][,N := .N , by = .(ID, var)][,
.SD[N==max(N)] , by = .(ID)][]
If we want to select a random 'var' within each 'ID', we can use sample to select a single 'var' grouped by 'ID', get a logical vector (var == sample(var, 1)) and subset the rows
setDT(df)[, .SD[var==sample(var, 1)] , by =ID]
data
set.seed(24)
df <- data.frame(ID=as.factor(c(rep("A",20),rep("B",15))),
var=as.factor(c(rep("w",5),rep("x",10),rep("y",12),rep("z",8))),
obs=runif(35,0,10))
