Last observation of the previous group - r

I would like to know, if I have data that I can group by a variable, how can I get the last observation of the previous group?
I have the following data:
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))
I would like to create a new data.table that has the group ID and the difference between the last observation of the group from the last observation of the previous group. So that from the above I'd get:
a c
1: 1 NA
2: 2 9
3: 3 1
4: 4 -8
5: 5 5
Thanks!

We group by 'a', get the last element of 'b', then take the lag of 'c' by shifting
dt[, .(c = last(b)), a][, c:= shift(c)][]

Here is a way:
dt[, c := b * (1:.N == .N), by = a] ## get last row within the group
dt <- dt[b == c] ## filter data.table to get rows of interest
dt[, c := shift(c, type = "lag") - c][] ## getting difference using shift with lag argument
# a b c
#1: 1 11 NA
#2: 2 10 NA
#3: 3 18 9
#4: 4 19 -7
#5: 5 12 -8
data
set.seed(1)
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))

Related

R data.table: keep column when grouping by expression

When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1., .N)], by = f(a)]
f x y
1: 1 0.2597929 1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503 1.3122561
6: 0 2.6501140 1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
a x y
1: 1 0.2597929 1.0256259
2: 19 2.1106619 -1.4375193
3: 2 1.2862501 0.7918292
4: 20 0.6600591 -0.5827745
5: 3 1.3758503 1.3122561
6: 18 2.6501140 1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row index, extract that column ($V1) and subset the original dataset
library(data.table)
Q[Q[, .I[c(1., .N)], by = f(a)]$V1]
# a x y
#1: 1 0.7265238 0.5631753
#2: 19 1.7110611 -0.3141118
#3: 2 0.1643566 -0.4704501
#4: 20 0.5182394 -0.1309016
#5: 3 -0.6039137 0.1349981
#6: 18 0.3094155 -1.1892190
NOTE: The values in columns 'x', 'y' would be different as there was no set.seed

Row operations on selected columns based on substring in data.table

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.
Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I sum the values from all columns starting with string v and subtract from the average of the values in all columns starting with f.
update: the proposed solution must (a) use the := operator to make the most of data.table fast performance, and (2) be flexible to other operation rather than mean and sum, which I used here just for the sake of simplicity
library(data.table)
# generate data
dt <- data.table(id= letters[1:5],
v1= 1:5,
v2= 1:5,
f1= 11:15,
f2= 11:15)
dt
#> id v1 v2 f1 f2
#> 1: a 1 1 11 11
#> 2: b 2 2 12 12
#> 3: c 3 3 13 13
#> 4: d 4 4 14 14
#> 5: e 5 5 15 15
# what I've tried
dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]
We melt the dataset into 'long' format, by making use of the measure argument, get the difference between the sum of 'v' and mean of 'f', grouped by 'id', join on the 'id' column with the original dataset and assign (:=) the 'V1' as the 'Y' variable
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
, sum(v) - mean(f), id], Y :=V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Or another option is with Reduce after creating index or 'v' and 'f' columns
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
l1 <- length(nmv)
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE])- (Reduce(`+`, .SD[, nmf, with = FALSE])/l1)]
rowSums and rowMeans combined with grep can accomplish this.
dt$Y <- rowMeans(dt[,grep("f", names(dt)),with=FALSE]) - rowSums(dt[,grep("v", names(dt)),with=FALSE])

What does ".N" mean in data.table?

I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
a N
1: A 2
2: B 1
3: C 1
I know the by=a or by="a" means grouped by a column and the N column is the sum of duplicated times of a. However, I don't use nrow() but I get the result. The .N is not just the column name? I can't find the document by ??".N" in R. I tried to use .K, but it doesn't work. What does .N means?
Think of .N as a variable for the number of instances. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
Your example returns a new variable with the number of rows per case:
dt[, new_var := .N, by = a]
dt
# a b new_var
# 1: A 4 2 # 2 'A's
# 2: A 5 2
# 3: B 6 1 # 1 'B'
# 4: C 7 1 # 1 'C'
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols

How to Enter data for only conditioned rows on data table

I need to put number on first or random item in the group.
I do following:
item<-sample(c("a","b", "c"), 30,replace=T)
week<-rep(c("1","2","3"),10)
volume<-c(1:30)
DT<-data.table(item, week,volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[,.SD[1], by= list(item,week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
The sampleDT comes out correct ,but last line puts 5 on all columns instead of conditioned ones.
What am I doing wrong?
I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
Also there is a pending FR to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]
The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
A simpler solution (I think) would be using !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, c("item", "week") ), newCol := 5]

How to change the last value in each group by reference, in data.table

For a data.table DT grouped by site, sorted by time t, I need to change the last value of a variable in each group. I assume it should be possible to do this by reference using :=, but I haven't found a way that works yet.
Sample data:
require(data.table) # using 1.8.11
DT <- data.table(site=c(rep("A",5), rep("B",4)),t=c(1:5,1:4),a=as.double(c(11:15,21:24)))
setkey(DT, site, t)
DT
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 15
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 24
The desired result is to change the last value of a in each group, for example to 999, so the result looks like:
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 999
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 999
It seems like .I and/or .N should be used, but I haven't found a form that works. The use of := in the same statement as .I[.N] gives an error. The following gives me the row numbers where the assignment is to be made:
DT[, .I[.N], by=site]
# site V1
# 1: A 5
# 2: B 9
but I don't seem to be able to use this with a := assignment. The following give errors:
DT[.N, a:=999, by=site]
# Null data.table (0 rows and 0 cols)
DT[, .I[.N, a:=999], by=site]
# Error in `:=`(a, 999) :
# := and `:=`(...) are defined for use in j, once only and in particular ways.
# See help(":="). Check is.data.table(DT) is TRUE.
DT[.I[.N], a:=999, by=site]
# Null data.table (0 rows and 0 cols)
Is there a way to do this by reference in data.table? Or is this better done another way in R?
Currently you can use:
DT[DT[, .I[.N], by = site][['V1']], a := 999]
# or, avoiding the overhead of a second call to `[.data.table`
set(DT, i = DT[,.I[.N],by='site'][['V1']], j = 'a', value = 999L)
alternative approaches:
use replace...
DT[, a := replace(a, .N, 999), by = site]
or shift the replacement to the RHS, wrapped by {} and return the full vector
DT[, a := {a[.N] <- 999L; a}, by = site]
or use mult='last' and take advantage of by-without-by. This requires the data.table to be keyed by the groups of interest.
DT[unique(site), a := 999, mult = 'last']
There is a feature request #2793 that would allow
DT[, a[.N] := 999]
but this is yet to be implemented

Resources