Unexpected result using data.table's shift() by group (bug?)


Consider this dataset
Data
dt <- data.table(ID = c(1,8,9,20,32,33), Char = c("A", "A", "B", "B", "C", "C"))
dt
ID Char
1: 1 A
2: 8 A
3: 9 B
4: 20 B
5: 32 C
6: 33 C
I want to identify "runs" by ID, i.e. consecutive rows where the ID differs by 1, but I only want to consider runs within the same Char group. I can do this as follows
Correct
dt[, InRun := FALSE]
dt[, DistToAbove := abs(ID - shift(ID, type="lag")), by=Char]
dt[, DistToBelow := abs(ID - shift(ID, type="lead")), by=Char]
dt[DistToAbove <= 1 | DistToBelow <= 1, InRun := TRUE, by=Char]
dt
ID Char InRun DistToAbove DistToBelow
1: 1 A FALSE NA 7
2: 8 A FALSE 7 NA
3: 9 B FALSE NA 11
4: 20 B FALSE 11 NA
5: 32 C TRUE NA 1
6: 33 C TRUE 1 NA
I tried simplifying the above code into the lines below, but the answer differs
Incorrect / Unexpected
dt[, InRun := FALSE]
dt[abs(ID - shift(ID, type="lag")) <= 1 | abs(shift(ID, type="lead") - ID) <= 1, InRun := TRUE, by=Char]
dt
ID Char InRun DistToAbove DistToBelow
1: 1 A FALSE NA 7
2: 8 A TRUE 7 NA
3: 9 B TRUE NA 11
4: 20 B FALSE 11 NA
5: 32 C TRUE NA 1
6: 33 C TRUE 1 NA
What gives? (Note I'm using data.table v1.9.7)

I want to identify "runs" by ID, i.e. consecutive rows where the ID differs by 1, but I only want to consider runs within the same Char group.
Here's how I'd approach it:
dt[, run_id := cumsum(
( ID != shift(ID, fill = ID[1L]) + 1L )
|
( Char != shift(Char, fill = Char[1L]) )
)]
dt[, in_run := .N > 1L, by=.(Char, run_id)]
ID Char run_id in_run
1: 1 A 1 FALSE
2: 8 A 2 FALSE
3: 9 B 3 FALSE
4: 20 B 4 FALSE
5: 32 C 5 TRUE
6: 33 C 5 TRUE
This code identifies all runs (including those with length of one) and then tests for length greater than one (the OP's definition).
Regarding the OP's approach:
dt[abs(ID - shift(ID, type="lag")) <= 1 | abs(shift(ID, type="lead") - ID) <= 1, # i
InRun := TRUE # j
, by=Char] # by
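In DT[i, j, by] the steps are: filter using i, then group with by, then calculate j. Because i is evaluated once on the whole table before any grouping happens, shift() inside i looks across Char boundaries (which is why rows 2 and 3, with IDs 8 and 9, get flagged). You can't do by-group calculations in i in the way attempted here.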
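The difference can be checked directly. The sketch below is only an illustration: it evaluates the OP's condition once on the whole table (what i effectively does) and then again inside j with by = Char. The names ungrouped, above and below are made up for this example.

```r
library(data.table)

dt <- data.table(ID = c(1, 8, 9, 20, 32, 33),
                 Char = c("A", "A", "B", "B", "C", "C"))

# The condition as i sees it: computed on the full table, ignoring `by`.
# Rows 2 and 3 (IDs 8 and 9) are neighbours here although Char differs.
ungrouped <- dt[, abs(ID - shift(ID, type = "lag")) <= 1 |
                  abs(shift(ID, type = "lead") - ID) <= 1]

# The same condition computed in j, where `by` does apply.
# NA comparisons at the group edges are treated as FALSE.
dt[, InRun := {
  above <- abs(ID - shift(ID, type = "lag")) <= 1
  below <- abs(shift(ID, type = "lead") - ID) <= 1
  (!is.na(above) & above) | (!is.na(below) & below)
}, by = Char]

print(ungrouped)
print(dt)
```

Moving the condition into j keeps it a single statement while still respecting the grouping.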

Related

Replace a sequence of values by group depending on preceding values

I have a data table of this form (2000000+ rows, 1000+groups):
set.seed(1)
dt <- data.table(id = rep(1:3, each = 5), values = sample(c("a", "b","c"), 15, TRUE))
> dt
id values
1: 1 a
2: 1 c
3: 1 a
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 a
13: 3 a
14: 3 a
15: 3 b
Within each id group, I want to replace every run of "a" that immediately precedes a "b" with "b". So the condition is: if an "a" or a sequence of "a"s appears directly before a "b", replace all of those "a"s. (In my real table, the preceding character should be replaced when "b" is preceded by "a", "x", or "y", but I should be able to generalize.)
In the example above, the "a" in row 3 should be replaced (easy to do with shift() in data.table), as well as all the "a"s in rows 12-14 (not sure how to do this). So the desired output is:
> dt
id values
1: 1 a
2: 1 c
3: 1 b
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 b
13: 3 b
14: 3 b
15: 3 b
What comes to mind is looping from the last index, but I am not sure how to do that if I have multiple groupings (say, ID and DATE), and in any case this doesn't seem to be the fastest data.table solution.

Here's another data.table approach:
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")],
on = .(id, x, values),
values := "b"
][, x := NULL]
Create a new column "x" with the run-length IDs of values, computed within each id.
Join dt on itself: the inner table takes the "b" rows, shifts their run-length ID (x) back by one so that it points at the preceding run, and sets values to "a" (the specific value you want to change); matching rows then get values updated to "b".
Delete column x afterwards.
The result is:
dt
# id values
# 1: 1 a
# 2: 1 c
# 3: 1 b
# 4: 1 b
# 5: 1 a
# 6: 2 c
# 7: 2 c
# 8: 2 b
# 9: 2 b
# 10: 2 c
# 11: 3 c
# 12: 3 b
# 13: 3 b
# 14: 3 b
# 15: 3 b
And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)],
on = .(id, x, values),
values := "b"
][, x := NULL]
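To see why the self-join in the first snippet only touches the intended rows, it can help to build the inner table on its own. This is a sketch under assumptions: the values are typed in by hand from the printed table in the question (rather than regenerated with sample()), lookup is a name made up here, and x - 1L is used to keep x an integer.

```r
library(data.table)

# Values copied from the table printed in the question
dt <- data.table(
  id = rep(1:3, each = 5),
  values = c("a", "c", "a", "b", "a",
             "c", "c", "b", "b", "c",
             "c", "a", "a", "a", "b")
)

dt[, x := rleid(values), by = id]

# One row per "b" row, pointing at the run just before it and asking for "a"
lookup <- dt[values == "b", .(id, x = x - 1L, values = "a")]
print(lookup)

# Only runs that are (id, previous run, "a") match and get overwritten;
# in id 2 the run before "b" is "c", so nothing matches there
dt[lookup, on = .(id, x, values), values := "b"][, x := NULL]
print(dt)
```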

Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.
(1) Create a variable 'v2' which is NA when 'values' is "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' is "a" and the corresponding filled value in 'v2' is "b", update 'values' with 'v2'.
# 1
dt[values != "a" , v2 := values]
# 2
dt[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]
# 3
dt[values == "a" & v2 == "b", values := v2]
# clean-up
dt[ , v2 := NULL]
Currently, nafill only works with numeric variables, hence the replace step in chunk # 2 (modified from @chinsoon12's comment in the issue "nafill, setnafill for character, factor and other types").
The NA replacement code may be slightly shortened by using zoo::na.locf:
dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]
However, note that na.locf is slower.
When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4), values = sample(c("a", "b", "c"), 1e8, replace = TRUE))), it turns out that this alternative is actually faster than the others.
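The index trick in chunk # 2 can be checked on a small vector. This is a minimal sketch; the names v, idx and filled_idx are made up for this example. The idea is to fill integer positions with nafill (which handles numbers) and then use the filled positions to index the character vector.

```r
library(data.table)

v <- c(NA, "c", NA, "b", NA)

# Positions of the vector, with NA wherever v itself is NA
idx <- replace(seq_along(v), is.na(v), NA)

# Next observation carried backward, applied to the positions
filled_idx <- nafill(idx, type = "nocb")

# Index the character vector with the filled positions;
# the trailing NA stays NA because there is no later value to carry back
filled <- v[filled_idx]
print(filled)
```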

This is not pretty but I think this is what you are after:
dt[, .N, by = .(id, values = paste0(values, rleid(values)))
][, values := sub("[0-9]+", "", values)
][, values := fifelse(values == "a" & shift(values, -1L) == "b" & !is.na(shift(values, -1L)), "b", values), by = id
][, .SD[rep(seq_len(.N), N)]
][, !"N"]
id values
1: 1 a
2: 1 c
3: 1 b
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 b
13: 3 b
14: 3 b
15: 3 b

You can use rle().
Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".
dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt
# id var new
# 1: 1 a a
# 2: 1 c c
# 3: 1 a b
# 4: 1 b b
# 5: 1 a a
# 6: 2 c c
# 7: 2 c c
# 8: 2 b b
# 9: 2 b b
# 10: 2 c c
# 11: 3 c c
# 12: 3 a b
# 13: 3 a b
# 14: 3 a b
# 15: 3 b b

Flag randomly selected N rows by group in data.table

In column C3 of a data.table, I want to flag N randomly selected rows within each group (C1). Several similar questions have already been asked on SO here, here and here, but based on the answers I still cannot figure out a solution for my task.
set.seed(1)
dt = data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
C2 = c(2,1,3,1,2,3,4,5,4,5))
dt
C1 C2
1: A 2
2: A 1
3: A 3
4: B 1
5: C 2
6: C 3
7: C 4
8: D 5
9: D 4
10: D 5
Here are row indexes for two randomly selected rows by each group C1 (doesn't work well for group B):
dt[, sample(.I, min(.N, 2)), by = C1]$V1
[1] 1 3 3 7 5 10 9
NB: for B only one row should be selected because group B consists of one row only.
Here is a solution for one randomly selected row in each group, which often doesn't work for group B:
dt[, C3 := .I == sample(.I, 1), by = C1]
dt
C1 C2 C3
1: A 2 FALSE
2: A 1 TRUE
3: A 3 FALSE
4: B 1 FALSE
5: C 2 TRUE
6: C 3 FALSE
7: C 4 FALSE
8: D 5 TRUE
9: D 4 FALSE
10: D 5 FALSE
Actually I want to expand it on N rows. I've tried (for two rows):
dt[, C3 := .I==sample(.I, min(.N, 2)), by = C1]
which of course doesn't work.
Any help is much appreciated!

dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]
Or use head, but I think that should be slower
dt[, C3 := 1:.N %in% head(sample(.N), 2) , by = C1]
If the number of flagged rows is not constant you can do
flagsz <- c(2, 1, 2, 3)
dt[, C3 := 1:.N %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]
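A likely reason the attempts in the question misbehave for group B is that sample(x, n) treats a single number x >= 1 as 1:x, so sample(.I, 1) with .I equal to 4 draws from rows 1 to 4 instead of returning 4. Sampling within-group positions sidesteps this. The sketch below only checks the number of flagged rows per group, which does not depend on the seed; counts is a name made up here.

```r
library(data.table)

dt <- data.table(C1 = c("A", "A", "A", "B", "C", "C", "C", "D", "D", "D"),
                 C2 = c(2, 1, 3, 1, 2, 3, 4, 5, 4, 5))

set.seed(1)
# sample(.N, k) draws k positions from 1..N within the group,
# which is also well defined for a one-row group (sample(1, 1) is 1)
dt[, C3 := seq_len(.N) %in% sample(.N, min(.N, 2L)), by = C1]

# Two rows flagged per group, except B, which has only one row
counts <- dt[, .(flagged = sum(C3)), by = C1]
print(counts)
```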

N=2
dt[, C3 := {if (.N < N) rep(TRUE,.N) else 1:.N %in% sample(.N,N) }, by=C1]
dt
# C1 C2 C3
# 1: A 2 TRUE
# 2: A 1 FALSE
# 3: A 3 TRUE
# 4: B 1 TRUE
# 5: C 2 FALSE
# 6: C 3 TRUE
# 7: C 4 TRUE
# 8: D 5 TRUE
# 9: D 4 TRUE
# 10: D 5 FALSE

Match in lagged group in data.table

I'm trying to create a new column that indicates if an ID was present in a previous group. Here's my data:
data <- data.table(ID = c(1:3, c(9,2,3,4),c(5,1)),
groups = c(rep(c("a", "b", "c"), c(3, 4,2))))
ID groups
1: 1 a
2: 2 a
3: 3 a
4: 9 b
5: 2 b
6: 3 b
7: 4 b
8: 5 c
9: 1 c
I'm not sure how to specify lagged groups. I tried to use shift, but it's not working:
data[,.(ID=ID,match_lagged=ID %in% shift(ID)),by=groups]
Here's my desired result.
The first 3 lines are not matched because there is no previous group. FALSE would also work for these three rows. ID=4 (in group b) is not matched in group a. ID=5 (in group c) is not matched in group b.
Note that ID 1 in group c is not matched in group b so it should be false even though it exists in group a. This is why duplicated(data$ID) does not work. Data from a group has to be matched from the previous group.
groups ID match_lagged
1: a 1 NA
2: a 2 NA
3: a 3 NA
4: b 9 FALSE
5: b 2 TRUE
6: b 3 TRUE
7: b 4 FALSE
8: c 5 FALSE
9: c 1 FALSE
A dplyr solution would also work.

Number the groups, and then check if the diff is equal to one for each ID.
data[, grp.id := .GRP, by = groups]
data[, match_lagged := c(FALSE, diff(grp.id) == 1), by = ID][
grp.id == 1, match_lagged := NA][]
# ID groups grp.id match_lagged
#1: 1 a 1 NA
#2: 2 a 1 NA
#3: 3 a 1 NA
#4: 9 b 2 FALSE
#5: 2 b 2 TRUE
#6: 3 b 2 TRUE
#7: 4 b 2 FALSE
#8: 5 c 3 FALSE
#9: 1 c 3 FALSE
This assumes that each ID appears only once in each group. If that's not the case, you can take the unique (ID, group) pairs, do the above on them, and then merge the result back in.
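One way to handle repeated IDs is sketched below. The data here is a made-up variant of the question's table with ID 3 duplicated inside group a, and u is a name chosen for this example. It also assumes the rows are ordered by group, as in the question.

```r
library(data.table)

# Hypothetical variant of the question's data: ID 3 appears twice in group "a"
data <- data.table(ID = c(1, 2, 3, 3, 9, 2, 3, 4, 5, 1),
                   groups = rep(c("a", "b", "c"), c(4, 4, 2)))

data[, grp.id := .GRP, by = groups]

# Work on unique (ID, group) pairs so diff() sees each group once per ID
u <- unique(data[, .(ID, grp.id)])
u[, match_lagged := c(FALSE, diff(grp.id) == 1), by = ID]
u[grp.id == 1, match_lagged := NA]

# Update join: copy the flag back onto every original row
data[u, on = .(ID, grp.id), match_lagged := i.match_lagged]
print(data)
```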

This works. There is probably an easier solution:
library(dplyr)
data <- data.frame(ID = c(1:3, 1:4,c(5,1)),
groups = c(rep(c("a", "b", "c"), c(3, 4,2))))
z <- data %>% group_by(groups) %>% summarize(all_vals = list(ID))
z <- z %>% mutate(lagged_id = lag(all_vals,1))
match_lagged <- lapply(1:nrow(z) , function(x) {
(z$all_vals[x] %>% unlist) %in% (z$lagged_id[x] %>% unlist)
})
data$match_lagged = match_lagged %>% unlist

ifelse function group in group in R

I have data set
ID <- c(1,1,2,2,2,2,3,3,3,3,3,4,4,4)
Eval <- c("A","A","B","B","A","A","A","A","B","B","A","A","A","B")
med <- c("c","d","k","k","h","h","c","d","h","h","h","c","h","k")
df <- data.frame(ID,Eval,med)
> df
ID Eval med
1 1 A c
2 1 A d
3 2 B k
4 2 B k
5 2 A h
6 2 A h
7 3 A c
8 3 A d
9 3 B h
10 3 B h
11 3 A h
12 4 A c
13 4 A h
14 4 B k
I am trying to create variables x and y, grouped by ID and Eval. For each ID, if Eval = A and med = "h" or "k", I set x = 1, otherwise x = 0; if Eval = B and med = "h" or "k", I set y = 1, otherwise y = 0. The approach below gives the answer, but I don't like it and it doesn't seem that great:
df <- data.table(df)
setDT(df)[, count := uniqueN(med) , by = .(ID,Eval)]
setDT(df)[Eval == "A", x:= ifelse(count == 1 & med %in% c("k","h"),1,0), by=ID]
setDT(df)[Eval == "B", y:= ifelse(count == 1 & med %in% c("k","h"),1,0), by=ID]
ID Eval med count x y
1: 1 A c 2 0 NA
2: 1 A d 2 0 NA
3: 2 B k 1 NA 1
4: 2 B k 1 NA 1
5: 2 A h 1 1 NA
6: 2 A h 1 1 NA
7: 3 A c 3 0 NA
8: 3 A d 3 0 NA
9: 3 B h 1 NA 1
10: 3 B h 1 NA 1
11: 3 A h 3 0 NA
12: 4 A c 2 0 NA
13: 4 A h 2 0 NA
14: 4 B k 1 NA 1
Then I need to collapse the rows to get one row per unique ID. I don't know how to collapse the rows; any ideas?
The output
ID x y
1 0 0
2 1 1
3 0 1
4 0 1

We create the 'x' and 'y' variables grouped by 'ID' without the NA elements, by directly coercing the logical vector to binary (as.integer):
df[, x := as.integer(Eval == "A" & count ==1 & med %in% c("h", "k")) , by = ID]
and similarly for 'y'
df[, y := as.integer(Eval == "B" & count ==1 & med %in% c("h", "k")) , by = ID]
and summarise it, using any after grouping by "ID"
df[, lapply(.SD, function(x) as.integer(any(x))) , ID, .SDcols = x:y]
# ID x y
#1: 1 0 0
#2: 2 1 1
#3: 3 0 1
#4: 4 0 1
If we need a compact approach, instead of assigning (:=), we summarise the output grouped by "ID" and "Eval" based on the conditions; then, grouped by 'ID', we check whether there are any TRUE values in 'x' and 'y' by looping over the columns specified in .SDcols.
setDT(df)[, if(any(uniqueN(med)==1 & med %in% c("h", "k"))) {
.(x= Eval=="A", y= Eval == "B") } else .(x=FALSE, y=FALSE),
by = .(ID, Eval)][, lapply(.SD, any) , by = ID, .SDcols = x:y]
# ID x y
#1: 1 FALSE FALSE
#2: 2 TRUE TRUE
#3: 3 FALSE TRUE
#4: 4 FALSE TRUE
If needed, we can convert to binary as shown in the first solution.
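For reference, here is a self-contained sketch that goes from the raw data to the 0/1 table in two grouped steps. It is not the code above but a variant under the same reading of the rule (a group is flagged when it has a single distinct med and that med is "h" or "k"); flags, flag and res are names made up for this example.

```r
library(data.table)

df <- data.table(
  ID   = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4),
  Eval = c("A", "A", "B", "B", "A", "A", "A", "A", "B", "B", "A", "A", "A", "B"),
  med  = c("c", "d", "k", "k", "h", "h", "c", "d", "h", "h", "h", "c", "h", "k")
)

# Step 1: one logical flag per (ID, Eval) group
flags <- df[, .(flag = uniqueN(med) == 1L && med[1L] %in% c("h", "k")),
            by = .(ID, Eval)]

# Step 2: collapse to one row per ID; any() of an empty selection is FALSE,
# so an ID without "B" rows gets y = 0
res <- flags[, .(x = as.integer(any(flag[Eval == "A"])),
                 y = as.integer(any(flag[Eval == "B"]))),
             by = ID]
print(res)
```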

The OP's goal...
"I try to create variable x and y, group by ID and Eval. For each ID, if Eval = A, and med = "h" or "k", I set x = 1, other wise x = 0, if Eval = B and med = "h" or "k", I set y = 1, other wise y = 0. [...] Then I need to collapse the row to get unique ID"
can be simplified to...
For each ID and Eval, flag if all med values are h or all med values are k.
setDT(df) # only do this once
df[, all(med=="k") | all(med=="h"), by=.(ID,Eval)][, dcast(.SD, ID ~ Eval, fun=any)]
ID A B
1: 1 FALSE FALSE
2: 2 TRUE TRUE
3: 3 FALSE TRUE
4: 4 FALSE TRUE
To see what dcast is doing, read ?dcast and try running just the first part on its own, df[, all(med=="k") | all(med=="h"), by=.(ID,Eval)].
The change to use x and y instead of A and B is straightforward but ill-advised (since unnecessary renaming can be confusing and lead to extra work when there are new Eval values); and ditto the change for 1/0 instead of TRUE/FALSE (since the values captured are actually boolean).

Here is my dplyr solution since I find it more readable than data.table.
library(dplyr)
df %>%
group_by(ID, Eval) %>%
mutate(
count = length(unique(med)),
x = ifelse(Eval == "A" &
count == 1 & med %in% c("h", "k"), 1, 0),
y = ifelse(Eval == "B" &
count == 1 & med %in% c("h", "k"), 1, 0)
) %>%
group_by(ID) %>%
summarise(x1 = max(unique(x)),
y1 = max(unique(y)))

A one-liner for collapsing the rows of your result:
df[,lapply(.SD,function(i) {ifelse(1 %in% i,ifelse(!0 %in% i,1,0),0)}),.SDcols=x:y,by=ID]
ID x y
1: 1 0 0
2: 2 1 1
3: 3 0 1
4: 4 0 1

Order factor levels in order of appearance in data set

I have a survey in which a unique ID must be assigned to questions. Some questions appear multiple times. This means that there is an extra layer of questions. In the sample data below only the first layer is included.
Question: how do I assign a unique index by order of appearance? The solution provided here works alphabetically. I can order the factors, but this defeats the purpose of doing it in R [there are many questions to sort].
library(data.table)
dt = data.table(question = c("C", "C", "A", "B", "B", "D"),
value = c(10,20,30,40,20,30))
dt[, idx := as.numeric(as.factor(question))]
gives:
question value idx
# 1: C 10 3
# 2: C 20 3
# 3: A 30 1
# 4: B 40 2
# 5: B 20 2
# 6: D 30 4
# but required is:
dt[, idx.required := c(1, 1, 2, 3, 3, 4)]

I think the data.table way to do this will be
dt[, idx := .GRP, by = question]
## question value idx
## 1: C 10 1
## 2: C 20 1
## 3: A 30 2
## 4: B 40 3
## 5: B 20 3
## 6: D 30 4

You could respecify the factor levels:
dt[, idx := as.numeric(factor(question, levels=unique(question)))]
# question value idx
# 1: C 10 1
# 2: C 20 1
# 3: A 30 2
# 4: B 40 3
# 5: B 20 3
# 6: D 30 4
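As a cross-check, the same numbering can also be obtained with base R's match() against unique(), which returns values in order of first appearance. This is an alternative not mentioned in the answers; the column names idx_grp and idx_match are made up for this sketch.

```r
library(data.table)

dt <- data.table(question = c("C", "C", "A", "B", "B", "D"),
                 value = c(10, 20, 30, 40, 20, 30))

# .GRP numbers groups in order of first appearance when by is used (not keyby)
dt[, idx_grp := .GRP, by = question]

# unique() keeps first-appearance order, so match() yields the same index
dt[, idx_match := match(question, unique(question))]
print(dt)
```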
