R data.table conditional selection - r

I have a data.table:
library(data.table)
set.seed(1)
DT <- data.table(tag = rep(LETTERS[1:4], each = 2, times = 3),
year = rep(1:3, each = 8), month = rep(1:2, 12),
value = runif(24, 1, 10))
DT
tag year month value
1: A 1 1 3.389578
2: A 1 2 4.349115
3: B 1 1 6.155680
4: B 1 2 9.173870
5: C 1 1 2.815137
6: C 1 2 9.085507
7: D 1 1 9.502077
8: D 1 2 6.947180
9: A 2 1 6.662026
10: A 2 2 1.556076
11: B 2 1 2.853771
12: B 2 2 2.589011
13: C 2 1 7.183206
14: C 2 2 4.456933
15: D 2 1 7.928573
16: D 2 2 5.479293
17: A 3 1 7.458567
18: A 3 2 9.927155
19: B 3 1 4.420317
20: B 3 2 7.997007
21: C 3 1 9.412347
22: C 3 2 2.909283
23: D 3 1 6.865064
24: D 3 2 2.129996
Sort this DT by year, month, and -value:
setorder(DT, year, month, -value)
will produce:
tag year month value
1: D 1 1 9.502077
2: B 1 1 6.155680
3: A 1 1 3.389578
4: C 1 1 2.815137
5: B 1 2 9.173870
6: C 1 2 9.085507
7: D 1 2 6.947180
8: A 1 2 4.349115
9: D 2 1 7.928573
10: C 2 1 7.183206
11: A 2 1 6.662026
12: B 2 1 2.853771
13: D 2 2 5.479293
14: C 2 2 4.456933
15: B 2 2 2.589011
16: A 2 2 1.556076
17: C 3 1 9.412347
18: A 3 1 7.458567
19: D 3 1 6.865064
20: B 3 1 4.420317
21: A 3 2 9.927155
22: B 3 2 7.997007
23: C 3 2 2.909283
24: D 3 2 2.129996
I would like the result to look like the following:
tag year month value
1: D 1 1 9.502077
2: B 1 1 6.155680
3: B 1 2 9.173870
4: D 1 2 6.947180
5: D 2 1 7.928573
6: C 2 1 7.183206
7: D 2 2 5.479293
8: C 2 2 4.456933
9: C 3 1 9.412347
10: A 3 1 7.458567
11: A 3 2 9.927155
12: C 3 2 2.909283
The result should have the following property: within each year, keep only the two tags with the largest values in month 1. For example, in year 1 the two tags with the largest month-1 values are D and B, so keep D and B for the whole of year 1. Year 2 keeps D and C. The selection is redone at month 1 of each year, and the chosen tags apply to all rows of that year.

Group by 'year', take the first two 'tag' values where 'month' is 1 (DT is already sorted by descending value within each year and month), build a logical index with %in%, and subset the rows.
DT[, .SD[tag %in% head(tag[month ==1],2)], .(year)]
# year tag month value
# 1: 1 D 1 9.502077
# 2: 1 B 1 6.155680
# 3: 1 B 2 9.173870
# 4: 1 D 2 6.947180
# 5: 2 D 1 7.928573
# 6: 2 C 1 7.183206
# 7: 2 D 2 5.479293
# 8: 2 C 2 4.456933
# 9: 3 C 1 9.412347
#10: 3 A 1 7.458567
#11: 3 A 2 9.927155
#12: 3 C 2 2.909283
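
The one-liner above relies on DT already being sorted by descending value. A minimal sketch that does not depend on row order, assuming the same DT as in the question (the names top and res are illustrative only): pick the two top tags per year from the month-1 rows, then join back.

```r
library(data.table)

set.seed(1)
DT <- data.table(tag = rep(LETTERS[1:4], each = 2, times = 3),
                 year = rep(1:3, each = 8), month = rep(1:2, 12),
                 value = runif(24, 1, 10))

# Top two tags per year, ranked by their month-1 value
top <- DT[month == 1][order(-value), head(.SD, 2), by = year, .SDcols = "tag"]

# Keep every row of DT whose (year, tag) pair was selected
res <- DT[top, on = .(year, tag)]
res
```

The join returns rows in the order of top, so a final setorder(res, year, month, -value) may be wanted if the layout of the question's expected output matters.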

Related

Is there some way to keep variable names from .SD + .SDcols together with non-.SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) is lost. (This is why I am re-asking the question, which I initially marked as a duplicate of the one linked above.)
Is there some way to keep the names from both .SD variables and non .SD variables?
The goal is simultaneously being able to use .() to select variables without quotes and being able to select variables through .SDcols = patterns("...")
Thanks in advance!
not really sure why.. but it works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6
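
Another way to keep both sets of names, offered as a sketch rather than an explanation of the parenthesised form above: name the extra column explicitly inside the list passed to c(). Here .SDcols is given as a character vector and y is written out in full, purely to keep the example self-contained; res is an illustrative name.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 v = c(1, 1, 1, 2, 2, 1, 1, 2, 2),
                 y = rep(c(1, 3, 6), 3), a = 1:9, b = 9:1)

# list(a = a) carries its name through c(), so the columns come out as a, x, v, y
res <- DT[, c(list(a = a), .SD), .SDcols = c("x", "v", "y")]
names(res)
```

This costs one repetition of the column name (a = a), but it should work the same way when .SDcols is built with patterns().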

column mean for subset of rows in R data.table

I have a data.table like this:
example <- data.table(
id=rep(1:2,each=9),
day=rep(rep(1:3,each=3),2),
time=rep(1:3,6),
mean_day=rnorm(18)
)
I need to compute the mean across multiple days, with a different set of days for each id.
To get the mean across days 1 to 2 for the first individual I've tried the following (inspired by this post):
example[id==1, mean_over_days := mean(mean_day,na.rm=T), by=(day %in% 1:2)]
> example
id day time mean_day mean_over_days
1: 1 1 1 -1.53685184 -0.8908466
2: 1 1 2 0.77445521 -0.8908466
3: 1 1 3 -0.56048917 -0.8908466
4: 1 2 1 -1.78388960 -0.8908466
5: 1 2 2 0.01787129 -0.8908466
6: 1 2 3 -2.25617538 -0.8908466
7: 1 3 1 -0.44886190 -0.0955414
8: 1 3 2 -1.31086985 -0.0955414
9: 1 3 3 1.47310754 -0.0955414
10: 2 1 1 0.53560356 NA
11: 2 1 2 1.16654996 NA
12: 2 1 3 -0.06704728 NA
13: 2 2 1 -0.83897719 NA
14: 2 2 2 -0.85209939 NA
15: 2 2 3 -0.41392341 NA
16: 2 3 1 -0.03014190 NA
17: 2 3 2 0.43835822 NA
18: 2 3 3 -1.62432188 NA
I want every row with id==1 to have the same mean_over_days value (-0.8908466), but the rows for day 3 get the mean over day 3 only.
How can I change the code to correct this?
Don't put the condition in by: that creates separate groups for its TRUE and FALSE values. Subset mean_day inside j instead.
library(data.table)
example[id==1, mean_over_days := mean(mean_day[day %in% 1:2],na.rm=TRUE)][]
# id day time mean_day mean_over_days
# 1: 1 1 1 -0.56047565 0.4471527
# 2: 1 1 2 -0.23017749 0.4471527
# 3: 1 1 3 1.55870831 0.4471527
# 4: 1 2 1 0.07050839 0.4471527
# 5: 1 2 2 0.12928774 0.4471527
# 6: 1 2 3 1.71506499 0.4471527
# 7: 1 3 1 0.46091621 0.4471527
# 8: 1 3 2 -1.26506123 0.4471527
# 9: 1 3 3 -0.68685285 0.4471527
#10: 2 1 1 -0.44566197 NA
#11: 2 1 2 1.22408180 NA
#12: 2 1 3 0.35981383 NA
#13: 2 2 1 0.40077145 NA
#14: 2 2 2 0.11068272 NA
#15: 2 2 3 -0.55584113 NA
#16: 2 3 1 1.78691314 NA
#17: 2 3 2 0.49785048 NA
#18: 2 3 3 -1.96661716 NA
data
set.seed(123)
example <- data.table(
id=rep(1:2,each=9),
day=rep(rep(1:3,each=3),2),
time=rep(1:3,6),
mean_day=rnorm(18)
)
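
Since the question mentions a different set of days per id, here is a hedged sketch of one way to generalise the answer. The lookup table ranges and its columns from and to are hypothetical (not from the original post), and the day sets are assumed to be contiguous ranges.

```r
library(data.table)

set.seed(123)
example <- data.table(
  id = rep(1:2, each = 9),
  day = rep(rep(1:3, each = 3), 2),
  time = rep(1:3, 6),
  mean_day = rnorm(18)
)

# Hypothetical per-id day ranges: id 1 uses days 1-2, id 2 uses days 2-3
ranges <- data.table(id = 1:2, from = c(1L, 2L), to = c(2L, 3L))

# Attach the range to each row with an update join
example[ranges, on = "id", `:=`(from = i.from, to = i.to)]

# Within each id, average only the rows whose day falls inside the range
example[, mean_over_days := mean(mean_day[day >= from & day <= to], na.rm = TRUE),
        by = id]

# Drop the helper columns
example[, c("from", "to") := NULL]
```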

R mutate a column by group in ifelse

I'd like to mutate a column in R data.table.
Here's the example of my data.
df <- data.table(id=c(1,1,1,2,2,2,3,3,3),
stopId=c("a","b","c","a","b","c","a","b","c"),
category=c(1,1,1,NA,NA,NA,2,2,2),
result = c('a','a','a','b','b','b','c','c','c'))
My goal is to create a column using an if-else condition.
The column should hold the first value of groupId within each id.
The values should be the same for every row in a group.
If category is NA, the result should instead be the last value of groupId.
This is the result I'm looking for:
id groupId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA b
5: 2 c NA b
6: 2 b NA b
7: 3 c 2 c
8: 3 b 2 c
9: 3 a 2 c
with data.table:
df[,result:=fifelse(is.na(category),last(stopId),first(stopId)),by=id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
As their names suggest, first and last do this with dplyr:
library(dplyr)
df %>%
group_by(id) %>%
mutate(resultt = ifelse(is.na(category), last(stopId), first(stopId)))
id stopId category result resultt
<dbl> <chr> <dbl> <chr> <chr>
1 1 a 1 a a
2 1 b 1 a a
3 1 c 1 a a
4 2 a NA b b
5 2 c NA b b
6 2 b NA b b
7 3 c 2 c c
8 3 b 2 c c
9 3 a 2 c c
Note that the data you provided differs from the expected output shown above.
We can use .N or 1 to index stopId per group
> df[, result := stopId[ifelse(is.na(category), .N, 1)], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
or shorter
> df[, result := stopId[c(1, .N)[is.na(category) + 1]], id][]
id stopId category result
1: 1 a 1 a
2: 1 b 1 a
3: 1 c 1 a
4: 2 a NA c
5: 2 b NA c
6: 2 c NA c
7: 3 a 2 a
8: 3 b 2 a
9: 3 c 2 a
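
The shorter form works because a logical vector plus 1 becomes an index: FALSE + 1 is 1 and TRUE + 1 is 2, so c(1, .N)[is.na(category) + 1] selects position 1 for non-missing categories and position .N for missing ones. A small sketch using the stopId data from the question (idx is an illustrative name):

```r
library(data.table)

df <- data.table(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = rep(c("a", "b", "c"), 3),
                 category = c(1, 1, 1, NA, NA, NA, 2, 2, 2))

# The indexing trick on its own: FALSE picks the first element, TRUE the second
idx <- c(1, 3)[c(FALSE, TRUE) + 1]

# Applied per group: first stopId unless category is NA, then the last one
df[, result := stopId[c(1, .N)[is.na(category) + 1]], by = id]
df
```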

R find intervals in data.table

I want to add a new column with intervals (defined by breakpoints) by group. As an example:
This is my data.table:
x <- data.table(a = c(1:8,1:8), b = c(rep("A",8),rep("B",8)))
I already have the breakpoints (row indices):
pos <- data.table(b = c("A","A","B","B"), bp = c(3,5,2,4))
I can find the intervals for group "A" with:
findInterval(1:nrow(x[b=="A"]), pos[b=="A"]$bp)
How can I do this for each group, in this case "A" and "B"?
An option is to split the datasets by 'b' column, use Map to loop over the corresponding lists, and apply findInterval
Map(function(u, v) findInterval(seq_len(nrow(u)), v$bp),
split(x, x$b), split(pos, pos$b))
#$A
#[1] 0 0 1 1 2 2 2 2
#$B
#[1] 0 1 1 2 2 2 2 2
Another option is to group 'x' by 'b' and call findInterval on the 'bp' values of 'pos' that belong to the current group, selected with a logical condition based on .BY
x[, findInterval(seq_len(.N), pos$bp[pos$b==.BY]), b]
# b V1
# 1: A 0
# 2: A 0
# 3: A 1
# 4: A 1
# 5: A 2
# 6: A 2
# 7: A 2
# 8: A 2
# 9: B 0
#10: B 1
#11: B 1
#12: B 2
#13: B 2
#14: B 2
#15: B 2
#16: B 2
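
Because the question asks for a new column, the grouped call can also be written with :=. This is a minimal sketch under the same assumptions as above: the breakpoints refer to row positions within each group and are sorted within each group, as findInterval requires. The column name intvl is illustrative.

```r
library(data.table)

x <- data.table(a = c(1:8, 1:8), b = c(rep("A", 8), rep("B", 8)))
pos <- data.table(b = c("A", "A", "B", "B"), bp = c(3, 5, 2, 4))

# For each group, look up that group's breakpoints and bin the row positions
x[, intvl := findInterval(seq_len(.N), pos$bp[pos$b == .BY$b]), by = b]
x
```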
Another option using rolling join in data.table:
pos[, ri := rowid(b)]
x[, intvl := fcoalesce(pos[x, on=.(b, bp=a), roll=Inf, ri], 0L)]
output:
a b intvl
1: 1 A 0
2: 2 A 0
3: 3 A 1
4: 4 A 1
5: 5 A 2
6: 6 A 2
7: 7 A 2
8: 8 A 2
9: 1 B 0
10: 2 B 1
11: 3 B 1
12: 4 B 2
13: 5 B 2
14: 6 B 2
15: 7 B 2
16: 8 B 2
We can nest the bp values of pos into a list column by b, join with x, and use findInterval to get the corresponding intervals.
library(dplyr)
pos %>%
tidyr::nest(data = bp) %>%
right_join(x, by = 'b') %>%
group_by(b) %>%
mutate(interval = findInterval(a, data[[1]][[1]])) %>%
select(-data)
# b a interval
# <chr> <int> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 A 4 1
# 5 A 5 2
# 6 A 6 2
# 7 A 7 2
# 8 A 8 2
# 9 B 1 0
#10 B 2 1
#11 B 3 1
#12 B 4 2
#13 B 5 2
#14 B 6 2
#15 B 7 2
#16 B 8 2

Extract and collapse non-missing elements by row in the data.table

I would like to extract all unique non-missing elements in each row and then collapse them using &&&&. Here is a small example:
#Load needed libraries:
library(data.table)
#Generate the data:
set.seed(1)
n_rows<-10
#Define function to apply to rows:
function_non_missing<-function(x){
x<-x[!is.na(x)]
x<-x[x!="NA"]
x<-unique(x[order(x)])
paste(x,collapse="&&&&")
}
data<-data.table(
a=sample(c(1,2,NA,NA),n_rows,replace=TRUE),
b=sample(c(1,2,NA,NA),n_rows,replace=TRUE),
c=sample(c(1,2,NA,NA),n_rows,replace=TRUE)
)
> data
a b c
1: 1 NA 1
2: NA NA NA
3: NA 1 1
4: 1 1 1
5: 2 1 1
6: 1 2 1
7: NA 2 2
8: NA 2 1
9: 2 2 1
10: 2 NA 2
#Obtain results
data[,paste(.SD),by=1:nrow(data)][,function_non_missing(V1),by=nrow]
nrow V1
1: 1 1
2: 2
3: 3 1
4: 4 1
5: 5 1&&&&2
6: 6 1&&&&2
7: 7 2
8: 8 1&&&&2
9: 9 1&&&&2
10: 10 2
The above code looks very convoluted and I believe there might be better solutions.
Using melt():
data[, row := .I
][, melt(.SD, id.vars = "row")
][order(row, value), paste0(unique(value[!is.na(value)]), collapse = "&&&&"), by = row]
row V1
1: 1 1
2: 2
3: 3 1
4: 4 1
5: 5 1&&&&2
6: 6 1&&&&2
7: 7 2
8: 8 1&&&&2
9: 9 1&&&&2
10: 10 2
Alternatively, using your original function:
data[, function_non_missing(unlist(.SD)), by = 1:nrow(data)]
nrow V1
1: 1 1
2: 2
3: 3 2
4: 4 1&&&&2
5: 5 1&&&&2
6: 6 1&&&&2
7: 7 1
8: 8 2
9: 9 1&&&&2
10: 10 1&&&&2
Probably using apply?
library(data.table)
data[, col := apply(.SD, 1, function(x)
paste(sort(unique(na.omit(x))), collapse = "&&&&"))]
data
# a b c col
# 1: 1 NA 1 1
# 2: NA NA NA
# 3: NA 1 1 1
# 4: 1 1 1 1
# 5: 2 1 1 1&&&&2
# 6: 1 2 1 1&&&&2
# 7: NA 2 2 2
# 8: NA 2 1 1&&&&2
# 9: 2 2 1 1&&&&2
#10: 2 NA 2 2
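
To check the row-wise logic without depending on random data, here is a small sketch on a fixed three-row table. The helper name collapse_unique is made up for this example, and .SDcols is spelled out so that a helper column added earlier would not be picked up.

```r
library(data.table)

data <- data.table(a = c(1, NA, 2), b = c(NA, NA, 1), c = c(1, NA, 1))

# sort() drops NA by default, so no separate na.omit() step is needed
collapse_unique <- function(x) paste(sort(unique(x)), collapse = "&&&&")

# apply() walks the selected columns row by row
data[, col := apply(.SD, 1, collapse_unique), .SDcols = c("a", "b", "c")]
data
```

A row with only missing values collapses to an empty string, matching the behaviour shown in the question.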
