Differences to earlier instance within group or external variable - r

I have data
dat1 <- data.table(id=1:9,
group=c(1,1,2,2,2,3,3,3,3),
t=c(14,17,20,21,26,89,90,95,99),
index=c(1,2,1,2,3,1,2,3,4)
)
and I would like to compute the difference in t from the previous value within each group, ordered by index. For the first instance of each group, I would like to compute the difference from an external variable
dat2 <- data.table(group=c(1,2,3),
start=c(10,15,80)
)
such that the following result should be obtained:
> res
id group t index dif
1: 1 1 14 1 4
2: 2 1 17 2 3
3: 3 2 20 1 5
4: 4 2 21 2 1
5: 5 2 26 3 5
6: 6 3 89 1 9
7: 7 3 90 2 1
8: 8 3 95 3 5
9: 9 3 99 4 4
I have tried using
dat1[ , ifelse(index == min(index), dif := t - dat2$start, dif := t - t[-1]), by = group]
but I was unsure about referencing other elements of the same group and external elements in one step. Is this at all possible using data.table?

A possible solution:
dat1[, dif := ifelse(index == min(index),
t - dat2$start[match(.BY, dat2$group)],
t - shift(t))
, by = group][]
which gives:
id group t index dif
1: 1 1 14 1 4
2: 2 1 17 2 3
3: 3 2 20 1 5
4: 4 2 21 2 1
5: 5 2 26 3 5
6: 6 3 89 1 9
7: 7 3 90 2 1
8: 8 3 95 3 5
9: 9 3 99 4 4
Or a variant as proposed by @jogo in the comments which avoids the ifelse:
dat1[, dif := t - shift(t), by = group
][index == 1, dif := t - dat2[group==.BY, start], by = group][]
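To see why the first row of each group needs separate treatment, it may help to look at shift on a plain vector. This is only a small sketch; the vector holds the t values of group 2 from the question.

```r
library(data.table)

t <- c(20, 21, 26)     # t values of group 2

lagged <- shift(t)     # lag by one position: the first element has no predecessor, so it becomes NA
dif <- t - lagged      # difference to the previous value: NA for the first element, then 1 and 5
```

The NA in the first position is the slot that both variants above fill with t minus the group's start value.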

I would try to avoid ifelse and use data.table's efficient join capabilities:
dat1[dat2, on = "group", # join on group
start := i.start][, # add start value
diff := diff(c(start[1L], t)), by = group][, # compute difference
start := NULL] # remove start value
The resulting table is:
# id group t index diff
#1: 1 1 14 1 4
#2: 2 1 17 2 3
#3: 3 2 20 1 5
#4: 4 2 21 2 1
#5: 5 2 26 3 5
#6: 6 3 89 1 9
#7: 7 3 90 2 1
#8: 8 3 95 3 5
#9: 9 3 99 4 4
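The key step is diff(c(start[1L], t)): putting the start value in front of t makes the first difference the distance to start. A minimal sketch on plain vectors, using the values of group 2 (start = 15):

```r
start <- 15
t <- c(20, 21, 26)

# prepend the start value, then take successive differences
dif <- diff(c(start, t))   # 5 1 5
```

Because diff returns one element fewer than its input, the result has exactly the length of t, which is what the grouped assignment needs.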

You may use shift with a dynamic fill argument: Index 'dat2' with .BY to get 'start' values for each 'group':
dat1[ , dif := t - shift(t, fill = dat2[group == .BY, start]), by = group]
# id group t index dif
# 1: 1 1 14 1 4
# 2: 2 1 17 2 3
# 3: 3 2 20 1 5
# 4: 4 2 21 2 1
# 5: 5 2 26 3 5
# 6: 6 3 89 1 9
# 7: 7 3 90 2 1
# 8: 8 3 95 3 5
# 9: 9 3 99 4 4
Alternatively, you can do this in steps. Probably a matter of taste, but I find it more transparent than the ifelse way.
First the 'normal' shift. Then add an 'index' variable to 'dat2' and do an update join.
dat1[ , dif := t - shift(t), by = group]
dat2[ , index := 1]
dat1[dat2, on = .(group, index), dif := t - start]
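Note that dat2[ , index := 1] adds a column to dat2 by reference. If that is not wanted, the lookup table can be built on the fly instead. A sketch of that variation, where first_rows is just a helper name chosen for this example:

```r
library(data.table)

dat1 <- data.table(id = 1:9,
                   group = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
                   t = c(14, 17, 20, 21, 26, 89, 90, 95, 99),
                   index = c(1, 2, 1, 2, 3, 1, 2, 3, 4))
dat2 <- data.table(group = c(1, 2, 3), start = c(10, 15, 80))

# difference to the previous value within each group (NA in each first row)
dat1[, dif := t - shift(t), by = group]

# temporary lookup table, so dat2 itself gains no 'index' column
first_rows <- dat2[, .(group, index = 1, start)]

# update join: only rows of dat1 matching (group, index = 1) are overwritten
dat1[first_rows, on = .(group, index), dif := t - i.start]
```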

Related

R data.table group by continuous values

I need some help with grouping data by continuous values.
If I have this data.table
dt <- data.table::data.table( a = c(1,1,1,2,2,2,2,1,1,2), b = seq(1:10), c = seq(1:10)+1 )
a b c
1: 1 1 2
2: 1 2 3
3: 1 3 4
4: 2 4 5
5: 2 5 6
6: 2 6 7
7: 2 7 8
8: 1 8 9
9: 1 9 10
10: 2 10 11
I need a group for each run of consecutive equal values in column a. For each group I need the first (also the minimum) value of column b and the last (also the maximum) value of column c.
Like this:
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
Thank you very much for your help. I could not solve it on my own.
Probably we can try
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
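The grouping is done by rleid, which numbers runs of consecutive equal values. A short sketch on column a alone shows the ids it produces:

```r
library(data.table)

a <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 2)

# a new id starts whenever the value changes, so the second run of 1s
# gets its own id instead of being merged with the first
run_id <- rleid(a)   # 1 1 1 2 2 2 2 3 3 4
```

Grouping by a directly would give only two groups, which is why the run id is needed here.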
An option with dplyr
library(dplyr)
dt %>%
group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
summarise(across(a:b, first), c = last(c)) %>%
select(-grp)
-output
# A tibble: 4 × 3
a b c
<dbl> <int> <dbl>
1 1 1 4
2 2 4 8
3 1 8 10
4 2 10 11
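The grouping expression cumsum(c(TRUE, diff(a) != 0)) can be read step by step. A base R sketch, assuming a contains no NA values (diff would propagate them):

```r
a <- c(1, 1, 1, 2, 2, 2, 2, 1, 1, 2)

# TRUE wherever a new run starts; the first element always starts a run
changed <- c(TRUE, diff(a) != 0)

# counting the run starts gives a run id for every row
grp <- cumsum(changed)   # 1 1 1 2 2 2 2 3 3 4
```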

Fill Missing Values

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,3))
library(dplyr);library(tidyverse)
data$timeWANTattempt=data$timeHAVE
data <- data %>%
group_by(student) %>%
fill(timeWANTattempt)+3
I have 'timeHAVE' and I want to replace missing times with the previous time +3. I show my dplyr attempt but it does not work. I seek a data.table solution. Thank you.
You can try:
data %>%
group_by(student) %>%
mutate(n_na = cumsum(is.na(timeHAVE))) %>%
mutate(timeHAVE = ifelse(is.na(timeHAVE), timeHAVE[n_na == 0 & lead(n_na) == 1] + 3*n_na, timeHAVE))
student timeHAVE timeWANT n_na
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 1 4 4 0
3 1 7 7 0
4 1 10 10 0
5 2 2 2 0
6 2 5 5 0
7 2 8 8 1
8 2 11 11 1
9 3 6 6 0
10 3 9 9 1
11 3 12 12 2
12 3 15 15 3
13 4 3 3 0
I included the little helper n_na, which counts the NAs in a row. The second mutate then multiplies the number of NAs by three and adds this to the last non-NA element before the NAs.
Here's an approach using 'locf' filling
setDT(data)
data[ , by = student, timeWANT := {
# carry previous observations forward whenever missing
locf_fill = nafill(timeHAVE, 'locf')
# every next NA, the amount shifted goes up by another 3
na_shift = cumsum(idx <- is.na(timeHAVE))
# add the shift, but only where the original data was missing
locf_fill[idx] = locf_fill[idx] + 3*na_shift[idx]
# return the full vector
locf_fill
}]
Be warned that this won't work if a given student can have more than one non-consecutive set of NA values in timeHAVE, because the cumulative NA count keeps growing across separate gaps.
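One possible way around that limitation is to count positions within each run of NAs rather than across the whole group. This is only a sketch: the data below (one student with two separate gaps) and the names dt, filled and na_pos are made up for illustration.

```r
library(data.table)

# hypothetical data: one student with two separate gaps
dt <- data.table(student = 1, timeHAVE = c(1, NA, 7, NA, NA))

dt[, filled := {
  idx <- is.na(timeHAVE)
  # carry the previous observation forward
  locf_fill <- nafill(timeHAVE, "locf")
  # position inside the current run of NAs: 1, 2, ... restarting at every new gap
  na_pos <- rowid(rleid(idx))
  # add 3 per step, but only where the original value was missing
  locf_fill[idx] <- locf_fill[idx] + 3 * na_pos[idx]
  locf_fill
}, by = student]
```

With a plain cumsum the last two values would be offset by 6 and 9 instead of 3 and 6, because the count would not restart after the 7.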
Another data.table option without grouping:
setDT(data)[, w := fifelse(is.na(timeHAVE) & student==shift(student),
nafill(timeHAVE, "locf") + 3L * rowid(rleid(timeHAVE)),
timeHAVE)]
output:
student timeHAVE timeWANT w
1: 1 1 1 1
2: 1 4 4 4
3: 1 7 7 7
4: 1 10 10 10
5: 2 2 2 2
6: 2 5 5 5
7: 2 NA 8 8
8: 2 11 11 11
9: 3 6 6 6
10: 3 NA 9 9
11: 3 NA 12 12
12: 3 NA 15 15
13: 4 NA NA NA
14: 4 3 3 3
Data in which student 4 has NA as the first timeHAVE:
data = data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,NA,3))

Create columns with different rules with data.table in r

I'm trying to better understand the data.table package in R. I want to do different types of calculation with some columns and assign the result to new columns with specific names. Here is an example:
set.seed(122)
df <- data.frame(rain = rep(5,10),temp=1:10, skip = sample(0:2,10,T),
windw_sz = sample(1:2,10,T),city =c(rep("a",5),rep("b",5)),ord=rep(sample(1:5,5),2))
df <- as.data.table(df)
vars <- c("rain","temp")
df[, paste0("mean.",vars) := lapply(mget(vars),mean), by="city" ]
This works just fine. But now I also want to calculate the sum of these variables, so I try:
df[, c(paste0("mean.",vars), paste("sum.",vars)) := list( lapply(mget(vars),mean),
lapply(mget(vars),sum)), by="city" ]
and I get an error.
How could I implement this last part?
Thanks a lot!
Instead of wrapping with list, we can use c. Each lapply call already returns a list, so wrapping the two calls in list gives a list of lists. With c, the two lists are concatenated end to end (i.e. c(as.list(1:5), as.list(6:10)) as opposed to list(as.list(1:5), as.list(6:10))). Also, instead of mget, make use of .SDcols:
library(data.table)
df[, paste0(rep(c("mean.", "sum."), each = 2), vars) :=
c(lapply(.SD, mean), lapply(.SD, sum)), by = .(city), .SDcols = vars]
df
# rain temp skip windw_sz city ord mean.rain mean.temp sum.rain sum.temp
# 1: 5 1 0 2 a 2 5 3 25 15
# 2: 5 2 1 1 a 5 5 3 25 15
# 3: 5 3 2 2 a 3 5 3 25 15
# 4: 5 4 2 1 a 4 5 3 25 15
# 5: 5 5 2 2 a 1 5 3 25 15
# 6: 5 6 0 1 b 2 5 8 25 40
# 7: 5 7 2 2 b 5 5 8 25 40
# 8: 5 8 1 2 b 3 5 8 25 40
# 9: 5 9 2 1 b 4 5 8 25 40
#10: 5 10 2 2 b 1 5 8 25 40
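The difference between list and c can be checked without data.table. In this sketch the small input list cols is made up; new_names shows how the name vector could be built for any number of variables by using each = length(vars).

```r
cols <- list(rain = c(5, 5), temp = c(1, 2))   # hypothetical stand-in for .SD

means <- lapply(cols, mean)   # a list with one element per column
sums  <- lapply(cols, sum)

n_nested <- length(list(means, sums))   # 2: a list holding two lists
n_flat   <- length(c(means, sums))      # 4: one flat list, one element per new column

vars <- c("rain", "temp")
new_names <- paste0(rep(c("mean.", "sum."), each = length(vars)), vars)
```

The right-hand side of := needs one list element per name on the left, which is what the flat version provides.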

data.table manipulation and merging

I have data
dat1 <- data.table(id=1:8,
group=c(1,1,2,2,2,3,3,3),
value=c(5,6,10,11,12,20,21,22))
dat2 <- data.table(group=c(1,2,3),
value=c(3,6,13))
and I would like to subtract dat2$value from each of the dat1$value, based on group.
Is this possible using data.table or does it require additional packages?
With data.table, you could do:
library(data.table)
dat1[dat2, on = "group"][, new.value := value - i.value, by = "group"][]
Which returns:
id group value i.value new.value
1: 1 1 5 3 2
2: 2 1 6 3 3
3: 3 2 10 6 4
4: 4 2 11 6 5
5: 5 2 12 6 6
6: 6 3 20 13 7
7: 7 3 21 13 8
8: 8 3 22 13 9
Alternatively, you can do this in one step as akrun mentions:
dat1[dat2, newvalue := value - i.value, on = .(group)]
id group value newvalue
1: 1 1 5 2
2: 2 1 6 3
3: 3 2 10 4
4: 4 2 11 5
5: 5 2 12 6
6: 6 3 20 7
7: 7 3 21 8
8: 8 3 22 9
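The two forms differ in what they change. A small sketch with the data from the question: the plain join returns a new table and leaves dat1 alone, while := inside the join adds the column to dat1 by reference. The name joined is only used for this example.

```r
library(data.table)

dat1 <- data.table(id = 1:8,
                   group = c(1, 1, 2, 2, 2, 3, 3, 3),
                   value = c(5, 6, 10, 11, 12, 20, 21, 22))
dat2 <- data.table(group = c(1, 2, 3),
                   value = c(3, 6, 13))

# plain join: a new table with dat2's value carried along as i.value
joined <- dat1[dat2, on = "group"]
cols_before <- names(dat1)   # dat1 still has only id, group, value

# update join: no new table, dat1 gains the column in place
dat1[dat2, newvalue := value - i.value, on = "group"]
cols_after <- names(dat1)
```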

Row conditional column operations in data.table

I have a large data.table where, for each row, I need to make computations based on part of the full data.table. As an example, consider the following data.table, and assume that for each row I want to compute the sum of the num variable over all rows where id2 matches the id1 of the current row and the time variable is within distance 2 of the time of the current row.
set.seed(123)
dat <- data.table(cbind(id1=sample(1:5,10,replace=T),
id2=sample(1:5,10,replace=T),
num=sample(1:10,10,replace=T),
time=sample(1:10,10,replace=T)))
This could easily be done by looping over each row like this
dat[,val:= 0]
for (i in 1:nrow(dat)){
this.val <- dat[ (id2==id1[i]) & (time>=time[i]-2) & (time<=time[i]+2),sum(num)]
dat[i,val:=this.val]
}
dat
The resulting data.table looks like this:
> dat
id1 id2 num time val
1: 2 5 9 10 6
2: 4 3 7 10 0
3: 3 4 7 7 10
4: 5 3 10 8 9
5: 5 1 7 1 2
6: 1 5 8 5 6
7: 3 2 6 8 17
8: 5 1 6 3 10
9: 3 2 3 4 0
10: 3 5 2 3 0
What is the proper/fast way to do things like this using data.table?
We can use a self-join here: create the 'timeminus2' and 'timeplus2' columns, join 'id2' with 'id1' together with the non-equi conditions on 'time', take the sum of 'num' for each row of i (by = .EACHI), and assign (:=) the result as the 'val' column of the original dataset. Rows without any match yield NA, which is replaced by 0.
tmp <- dat[.(id1 = id1, timeminus2 = time - 2, timeplus2 = time + 2),
.(val = sum(num)),
on = .(id2 = id1, time >= timeminus2, time <= timeplus2),
by = .EACHI
][is.na(val), val := 0][]
dat[, val := tmp$val][]
# id1 id2 num time val
# 1: 2 5 9 10 6
# 2: 4 3 7 10 0
# 3: 3 4 7 7 10
# 4: 5 3 10 8 9
# 5: 5 1 7 1 2
# 6: 1 5 8 5 6
# 7: 3 2 6 8 17
# 8: 5 1 6 3 10
# 9: 3 2 3 4 0
#10: 3 5 2 3 0
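The same idea can be written with an explicit table of windows, which may be easier to follow. This is a sketch on a small hand-made table (the values and the names win, lo, hi, res and ref are made up for illustration), together with a row-by-row reference computation for comparison.

```r
library(data.table)

# small hypothetical table so the result is easy to check by eye
dat <- data.table(id1  = c(1, 2, 1, 3),
                  id2  = c(2, 1, 1, 1),
                  num  = c(10, 20, 30, 40),
                  time = c(1, 2, 5, 9))

# one window per row of dat: the id to look for and the allowed time range
win <- dat[, .(id1, lo = time - 2, hi = time + 2)]

# for every window, sum num over the rows of dat that fall inside it
res <- dat[win, on = .(id2 = id1, time >= lo, time <= hi),
           .(val = sum(num)), by = .EACHI]

# windows without any match give NA, where the loop version gives 0
res[is.na(val), val := 0]
dat[, val := res$val]

# row-by-row reference, equivalent to the loop in the question
ref <- sapply(seq_len(nrow(dat)), function(i)
  dat[id2 == dat$id1[i] & abs(time - dat$time[i]) <= 2, sum(num)])
```

Since by = .EACHI returns one row per row of win, in the same order, res$val lines up with the rows of dat.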
