Selection of levels of factors within a factor - r

This is my example:
df <- data.frame(ID = as.factor(c(rep("A", 20), rep("B", 15))),
                 var = as.factor(c(rep("w", 5), rep("x", 10), rep("y", 12), rep("z", 8))),
                 obs = runif(35, 0, 10))
What I want to do is, for each 'ID', select a single 'var', either at random or by choosing the 'var' with the most 'obs'. So, for example, a random selection could give this:
ID var obs
6 A x 3.44405412
7 A x 1.50957637
8 A x 8.22009420
9 A x 7.47094473
10 A x 8.26098410
11 A x 9.62919537
12 A x 0.10393890
13 A x 0.11298502
14 A x 4.33822574
15 A x 4.20109035
28 B z 1.07697286
29 B z 8.40864310
30 B z 7.62563257
31 B z 0.06885177
32 B z 4.33959316
33 B z 7.98303782
34 B z 8.38335593
35 B z 4.52110318
Thank you in advance for your help.

Here's another data.table approach. To begin...
library(data.table)
setDT(df)
Then, select the var for each ID:
# var with highest #obs
idvar_selected = df[,.(var = .SD[,.N,by=var][which.max(N)]$var), by=ID]
# or... at random, weighted by #obs
idvar_selected = df[,.(var = sample(var,1)), by=ID]
And "join" using the selection:
df[idvar_selected, on=c("ID","var")]
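One caveat: sample(var, 1) draws one row's value, so each level is chosen with probability proportional to its row count, which is what the comment above means by weighted. If an unweighted draw were wanted instead, a minimal variation (same df) would be:
# unweighted: every level of 'var' present within an ID is equally likely
idvar_selected = df[, .(var = sample(unique(var), 1)), by = ID]
df[idvar_selected, on = c("ID", "var")]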

One option using data.table.
We convert the 'data.frame' to 'data.table' (setDT(df)). Grouped by 'ID' and 'var', we create a variable 'N' that gives the number of rows (.N) for each group. Then, we group by 'ID' and subset the rows that have the max value of 'N' (.SD[N==max(N)]). The 'N' column can be assigned to 'NULL' as it is not needed in the expected output.
library(data.table)
setDT(df)[, N := .N, by = .(ID, var)][, .SD[N == max(N)],
   by = .(ID)][, N := NULL][]
# ID var obs
# 1: A x 9.2044378
# 2: A x 2.7973557
# 3: A x 7.6382046
# 4: A x 8.0163062
# 5: A x 2.5472509
# 6: A x 6.0488886
# 7: A x 3.7073495
# 8: A x 6.7169025
# 9: A x 6.7298231
#10: A x 3.2043056
#11: B z 5.9973018
#12: B z 6.3014766
#13: B z 0.4663503
#14: B z 3.1951313
#15: B z 2.3874890
#16: B z 3.6881753
#17: B z 1.4802475
#18: B z 9.3776173
By assigning a new column, we are changing the original dataset 'df'. We could remove that column later from the original dataset by
df[, N:=NULL]
Or, a modification of the above code without assignment (:=), so that the original dataset remains the same: we concatenate .N with .SD (i.e., the Subset of Data.table) to create the new column 'N', and then subset the rows as before.
setDT(df)[, c(list(N = .N), .SD), by = .(ID, var)][,
   .SD[N == max(N)], by = ID][, N := NULL][]
Or, as suggested by @Frank, we can copy(.SD) to avoid changing the original dataset, then assign 'N' and proceed as before.
setDT(df)[, copy(.SD)][, N := .N, by = .(ID, var)][,
   .SD[N == max(N)], by = .(ID)][]
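Note that .SD[N == max(N)] keeps every 'var' tied for the maximum count. A sketch that instead breaks such ties by keeping a single randomly chosen 'var' (the name 'picked' is ours) could be:
# count rows per (ID, var), keep the most frequent var(s), pick one at random on ties
picked <- df[, .N, by = .(ID, var)][, .SD[N == max(N)][sample(.N, 1)], by = ID][, !"N"]
df[picked, on = c("ID", "var")]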
If we want to select a random 'var' within each 'ID', we can use sample to select a single 'var' grouped by 'ID', get a logical vector (var == sample(var, 1)) and subset the rows:
setDT(df)[, .SD[var == sample(var, 1)], by = ID]
data
set.seed(24)
df <- data.frame(ID = as.factor(c(rep("A", 20), rep("B", 15))),
                 var = as.factor(c(rep("w", 5), rep("x", 10), rep("y", 12), rep("z", 8))),
                 obs = runif(35, 0, 10))

Related

R data.table: keep column when grouping by expression

When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1, .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row index, extract that column ($V1) and subset the original dataset
library(data.table)
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in columns 'x' and 'y' differ from those in the question, as there was no set.seed().
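If the grouping value f(a) were also wanted in the result, one sketch (the name 'idx' is ours) computes the row indices and the group value together, then binds them back:
# indices of the first and last row per group, alongside the group value
idx <- Q[, .(I = .I[c(1, .N)]), by = .(f = f(a))]
cbind(Q[idx$I], f = idx$f)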

Last observation of the previous group

If I have data that I can group by a variable, how can I get the last observation of the previous group?
I have the following data:
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))
I would like to create a new data.table that has the group ID and the difference between the last observation of each group and the last observation of the previous group. So from the above I'd get:
a c
1: 1 NA
2: 2 9
3: 3 1
4: 4 -8
5: 5 5
Thanks!
We group by 'a', get the last element of 'b', then subtract its lag (shift) to get the difference from the previous group's last observation:
dt[, .(c = last(b)), a][, c := c - shift(c)][]
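For reference, shift() with its defaults lags a vector by one position and pads with NA, which is what lines each group's last value up against the previous group's:
shift(c(11, 10, 18, 19, 12))
# [1] NA 11 10 18 19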
Here is a way:
dt[, c := b * (1:.N == .N), by = a]     ## flag the last row within each group
dt <- dt[b == c]                        ## keep only those last rows
dt[, c := c - shift(c, type = "lag")][] ## difference from the previous group's last value
#    a  b  c
# 1: 1 11 NA
# 2: 2 10 -1
# 3: 3 18  8
# 4: 4 19  1
# 5: 5 12 -7
data
set.seed(1)
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))

Row operations on selected columns based on substring in data.table

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.
Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I sum the values from all columns starting with v and subtract the average of the values from all columns starting with f.
Update: the proposed solution must (a) use the := operator to make the most of data.table's fast performance, and (b) be flexible to operations other than mean and sum, which I used here just for the sake of simplicity.
library(data.table)
# generate data
dt <- data.table(id = letters[1:5],
                 v1 = 1:5,
                 v2 = 1:5,
                 f1 = 11:15,
                 f2 = 11:15)
dt
#> id v1 v2 f1 f2
#> 1: a 1 1 11 11
#> 2: b 2 2 12 12
#> 3: c 3 3 13 13
#> 4: d 4 4 14 14
#> 5: e 5 5 15 15
# what I've tried
dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]
We melt the dataset into 'long' format using the measure argument, get the difference between the sum of 'v' and the mean of 'f' grouped by 'id', join the result to the original dataset on the 'id' column, and assign (:=) the 'V1' column as the 'Y' variable:
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
   , sum(v) - mean(f), id], Y := V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Or another option is with Reduce, after creating indexes of the 'v' and 'f' columns (note that the mean divides by the number of 'f' columns):
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
nf <- length(nmf)
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE]) -
        (Reduce(`+`, .SD[, nmf, with = FALSE]) / nf)]
rowSums and rowMeans combined with grep can also accomplish this (keeping the same order of operations: the sum of the v columns minus the mean of the f columns):
dt$Y <- rowSums(dt[, grep("^v", names(dt)), with = FALSE]) -
        rowMeans(dt[, grep("^f", names(dt)), with = FALSE])
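To also meet the update's := requirement, here is a sketch assuming data.table >= 1.12, where .SDcols accepts patterns(); the temporary column name 'vsum' is ours:
# sum the v* columns by reference, then subtract the mean of the f* columns
dt[, vsum := rowSums(.SD), .SDcols = patterns("^v")]
dt[, Y := vsum - rowMeans(.SD), .SDcols = patterns("^f")][, vsum := NULL]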

R data.table - group by column includes list

I am trying to use the group-by functionality of the data.table package in R.
start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[, mean(value), by = group]
#    group     V1
# 1:     a 2.6667
# 2:     b 4.3333
This works fine.
Since I am working with dates, it can happen that a particular date has not one but two groups.
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 c("a", "b")
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[, mean(value), by = group]
# Error in `[.data.table`(a, , mean(value), by = group) :
#   The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same
#   length as rows in x or number of rows returned by i (6).
I would like the date that has both groups to be used in calculating the mean of group a as well as the mean of group b.
Expected results:
mean a: 2.6667
mean b: 3.75
Is that possible with the data.table package?
Update
Thanks to akrun, my initial issue is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the groups), I need the data.table back in its "original" form, with unique rows based on the date. My solution so far:
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-02 2 b
4 2014-01-03 3 b
5 2014-01-04 4 b
6 2014-01-05 5 a
7 2014-01-06 6 b
# create a new column with the mean based on group
b[, factor := mean(value), by = group]
# create a new data.table 'c' without duplicate rows (based on date); if a row
# has groups a & b, take the product of their factors
c <- b[, .(value = unique(value), group = list(group), factor = prod(factor)), by = date]
date value group factor
01/01/14 1 a 2.666666667
02/01/14 2 c("a", "b") 10
03/01/14 3 b 3.75
04/01/14 4 b 3.75
05/01/14 5 a 2.666666667
06/01/14 6 b 3.75
I guess it is not the perfect way to do it, but it works. Any suggestions on how I could do it better?
Alternative solution (really slow!!!):
d <- a[rep(1:nrow(a), lengths(group))][, group := unlist(a$group)][, mean(value), by = group]
for (i in 1:NROW(a)) {
  y1 <- 1
  for (j in a[i, group][[1]]) {
    y1 <- y1 * d[group == j, V1]
  }
  a[i, factor := y1]
}
My fastest solution so far:
# split rows that more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')
Any chance to make it faster?
One option would be to group by the row sequence, unlist the list column ('group'), paste the list elements together (toString(..)), use cSplit from splitstackshape with direction = 'long' to reshape it into 'long' format, and then get the mean of the 'value' column using 'grp' as the grouping variable:
library(data.table)
library(splitstackshape)
a[, grp:= toString(unlist(group)), 1:nrow(a)]
cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
# grp V1
#1: a 2.666667
#2: b 3.750000
Just realized that another option using splitstackshape would be listCol_l, which unlists a list column into long form. As the output is a data.table, we can use data.table methods to calculate the mean. It is a much more compact way to get it:
listCol_l(a, 'group')[, mean(value), group_ul]
# group_ul V1
#1: a 2.666667
#2: b 3.750000
Or another option, without splitstackshape, would be to replicate the rows of the dataset by the length of each list element. lengths() is a convenient wrapper for sapply(group, length) and is much faster. Then, we replace the 'group' column by unlisting the original 'group' from the 'a' dataset and get the mean of 'value', grouped by 'group':
a[rep(1:nrow(a), lengths(group))][,
group:=unlist(a$group)][, mean(value), by = group]
# group V1
#1: a 2.666667
#2: b 3.750000
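For clarity, lengths() returns the element-wise lengths of the list column, and those counts drive the row replication above:
lengths(list('a', c('a', 'b'), 'b', 'b', 'a', 'b'))
# [1] 1 2 1 1 1 1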
A shorter solution, posted by @mike-h in this question, also uses unlist() but groups by the remaining columns:
require(data.table)
a = data.table(date = time.span,
               value = c(1, 2, 3, 4, 5, 6),
               group = list('a', c('a', 'b'), 'b', 'b', 'a', 'b'))
a[ , .(group = unlist(group)), .(date, value)][ , mean(value), group ]
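As for the follow-up about speed: one untested sketch avoids the row-by-row for loop by computing each group's mean once and then mapping every row's group list through a named lookup vector (the names 'long', 'gm' and 'lookup' are ours):
# expand the list column so each (row, group) pair becomes one row
long <- a[rep(seq_len(.N), lengths(group))][, group := unlist(a$group)]
# compute each group's mean exactly once
gm <- long[, .(m = mean(value)), by = group]
lookup <- setNames(gm$m, gm$group)
# per original row, take the product of the means of all its groups
a[, factor := vapply(group, function(g) prod(lookup[unlist(g)]), numeric(1))]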

When grouping by all columns in a data.table, .SD is empty

I'm having trouble getting consistent output from data.table when using consistent syntax. See the example below:
library(data.table)
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2))
# the data.table looks like this:
#    x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns Empty data.table (0 rows) of 2 cols: x,y
When all columns are used for grouping in by, .SD is empty, causing an empty data.table to be returned.
When one adds another column, .SD contains the columns not being grouped by, and the correct output is returned:
d[, if(.N>1) .SD else NULL, by = x]
# returns
x y
1: 1 1
2: 1 1
3: 2 2
4: 2 2
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2), t = 1:4)
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns
x y t
1: 1 1 1
2: 1 1 2
3: 2 2 3
4: 2 2 4
I'm trying to write code that returns rows appearing more than once and that works both when the by columns consist of all columns in the data.table and when they don't. Toward this end, I tried setting .SDcols = c("x", "y"). However, the columns get repeated in the output:
d[, if(.N>1) .SD else NULL, by = .(x, y), .SDcols = c("x", "y")]
x y x y
1: 1 1 1 1
2: 1 1 1 1
3: 2 2 2 2
4: 2 2 2 2
Is there a way to make it so d[, if(.N > 1) .SD else NULL, by = colnames] returns the desired output independent of whether the column names grouped by consist of all columns in 'd'? Or do I need to use an if statement and break up the 2 cases?
Here's one approach:
setkey(d, x, y)
dnew <- d[d[, .N > 1, by = key(d)][(V1), key(d), with = FALSE]]
This:
sets (x,y) to a key;
identifies which (x,y) groups satisfy the criterion; and then
selects those groups from d.
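A keyless alternative (a sketch over the same 'd'; data.table's duplicated() method accepts a by argument) is to count rows per group and join back, or to flag duplicates in both directions:
# count per group, keep groups with more than one row, join back
d[d[, .N, by = .(x, y)][N > 1], on = .(x, y)][, N := NULL][]
# or flag rows duplicated in either direction over the grouping columns
d[duplicated(d, by = c("x", "y")) | duplicated(d, by = c("x", "y"), fromLast = TRUE)]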
