Flag randomly selected N rows by group in data.table


In a data.table, I want to add a column C3 that flags N randomly selected rows within each group (C1). Several similar questions have already been asked on SO (here, here and here), but based on the answers I still cannot work out a solution for my task.
library(data.table)
set.seed(1)
dt = data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
C2 = c(2,1,3,1,2,3,4,5,4,5))
dt
C1 C2
1: A 2
2: A 1
3: A 3
4: B 1
5: C 2
6: C 3
7: C 4
8: D 5
9: D 4
10: D 5
Here are the row indices of two randomly selected rows per group C1 (this doesn't work well for group B):
dt[, sample(.I, min(.N, 2)), by = C1]$V1
[1] 1 3 3 7 5 10 9
NB: for B only one row should be selected because group B consists of one row only.
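A likely reason group B misbehaves (my reading, based on the documented behaviour of base R's sample()): when its first argument is a single number x >= 1, sample() draws from 1:x rather than from the one-element vector. For group B, .I is just 4, so the draw comes from 1:4. A minimal sketch outside data.table, with a made-up variable i standing in for .I:

```r
set.seed(42)

i <- 4L  # stands in for .I of a one-row group (row 4); hypothetical name

# sample() treats a single number as "draw from 1:i"
draws <- replicate(200, sample(i, 1))
sort(unique(draws))  # values taken from 1:4, not only 4

# Sampling a position and indexing back always stays inside the group
safe <- replicate(200, i[sample(length(i), 1)])
unique(safe)
```

Sampling positions (1 to .N) instead of row numbers avoids the special case altogether.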
Here is a solution for one randomly selected row in each group, which often doesn't work for group B:
dt[, C3 := .I == sample(.I, 1), by = C1]
dt
C1 C2 C3
1: A 2 FALSE
2: A 1 TRUE
3: A 3 FALSE
4: B 1 FALSE
5: C 2 TRUE
6: C 3 FALSE
7: C 4 FALSE
8: D 5 TRUE
9: D 4 FALSE
10: D 5 FALSE
Actually, I want to extend this to N rows. I've tried (for two rows):
dt[, C3 := .I==sample(.I, min(.N, 2)), by = C1]
which of course doesn't work.
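The attempt fails because == compares element by element (recycling the shorter vector), whereas the task needs a membership test. A small base R sketch, with made-up vectors idx and picked, shows the difference:

```r
idx    <- 1:3          # row positions within one group (hypothetical)
picked <- c(3L, 1L)    # two sampled positions (hypothetical)

# Element-wise comparison: picked is recycled to c(3, 1, 3), with a warning
eq <- suppressWarnings(idx == picked)
eq        # FALSE FALSE TRUE

# Membership test: TRUE wherever idx occurs anywhere in picked
isin <- idx %in% picked
isin      # TRUE FALSE TRUE
```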
Any help is much appreciated!

dt[, C3 := 1:.N %in% sample(.N, min(.N, 2)), by = C1]
Or use head, but I think that should be slower:
dt[, C3 := 1:.N %in% head(sample(.N), 2) , by = C1]
If the number of flagged rows is not the same for every group, you can do:
flagsz <- c(2, 1, 2, 3)
dt[, C3 := 1:.N %in% sample(.N, min(.N, flagsz[.GRP])), by = C1]
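Note that flagsz[.GRP] depends on the groups being numbered in order of first appearance. A variant that looks the size up by group label may be less fragile; this is only a sketch, and the named vector and the check table are my own additions:

```r
library(data.table)

dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# Hypothetical per-group sizes, keyed by group label rather than position
flagsz <- c(A = 2, B = 1, C = 2, D = 3)

set.seed(1)
dt[, C3 := seq_len(.N) %in% sample(.N, min(.N, flagsz[[.BY$C1]])), by = C1]

# Count flagged rows per group to confirm the requested sizes
check <- dt[, .(rows = .N, flagged = sum(C3)), by = C1]
check
```

Which rows get flagged depends on the seed, but the per-group counts do not.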

N=2
dt[, C3 := {if (.N < N) rep(TRUE,.N) else 1:.N %in% sample(.N,N) }, by=C1]
dt
# C1 C2 C3
# 1: A 2 TRUE
# 2: A 1 FALSE
# 3: A 3 TRUE
# 4: B 1 TRUE
# 5: C 2 FALSE
# 6: C 3 TRUE
# 7: C 4 TRUE
# 8: D 5 TRUE
# 9: D 4 TRUE
# 10: D 5 FALSE

Related

Replace a sequence of values by group depending on preceding values

I have a data table of this form (2000000+ rows, 1000+groups):
set.seed(1)
dt <- data.table(id = rep(1:3, each = 5), values = sample(c("a", "b","c"), 15, TRUE))
> dt
id values
1: 1 a
2: 1 c
3: 1 a
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 a
13: 3 a
14: 3 a
15: 3 b
Within each id group, I want to replace every run of "a" that immediately precedes a "b" with "b". In other words, if an "a" or a sequence of "a"s appears directly before a "b", all of those "a"s should be replaced. (In my real table, the preceding character should be replaced when "b" is preceded by "a", "x", or "y", but I should be able to generalize.)
In the example above, the "a" in row 3 should be replaced (easy to do with shift() in data.table), as well as all the "a"s in rows 12-14 (which I am not sure how to do). So the desired output is this:
> dt
id values
1: 1 a
2: 1 c
3: 1 b
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 b
13: 3 b
14: 3 b
15: 3 b
What comes to mind is looping from the last index, but I am not sure how to do that when I have multiple grouping columns (say, ID and DATE), and in any case this does not seem to be the fastest data.table solution.

Here's another data.table approach:
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(id, x=x-1, values="a")],
on = .(id, x, values),
values := "b"
][, x := NULL]
Create a new column x with the run-length ids of values, grouped by id.
Join the table on itself, shifting the run-length id (x) of each "b" row back by one so that it points at the preceding run, and setting values to "a" (the specific value you want to change); then update the matching rows' values to "b".
Delete column x afterwards.
The result is:
dt
# id values
# 1: 1 a
# 2: 1 c
# 3: 1 b
# 4: 1 b
# 5: 1 a
# 6: 2 c
# 7: 2 c
# 8: 2 b
# 9: 2 b
# 10: 2 c
# 11: 3 c
# 12: 3 b
# 13: 3 b
# 14: 3 b
# 15: 3 b
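To make the join easier to follow, here is the same approach with the intermediate lookup table pulled out into its own step. The values are typed in by hand to match the printed example (sample() output for a given seed can differ between R versions), and the name lookup is mine:

```r
library(data.table)

dt <- data.table(id = rep(1:3, each = 5),
                 values = c("a","c","a","b","a",
                            "c","c","b","b","c",
                            "c","a","a","a","b"))

dt[, x := rleid(values), by = id]
run_ids <- dt$x
dt[id == 3]   # the three "a" rows share one run id, one less than the "b" run

# For each "b" row: the run just before it, which only matches if it is "a"
lookup <- dt[values == "b", .(id, x = x - 1L, values = "a")]

dt[lookup, on = .(id, x, values), values := "b"][, x := NULL]
dt$values
```

In group 2 the run before the "b"s is "c", so the lookup row finds no match and nothing is changed there.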
And here's a generalization to the case where you want to replace values "a", "x", or "y" followed by "b" with "b":
dt[, x := rleid(values), by = .(id)]
dt[dt[values == "b", .(values=c("a", "x", "y")), by = .(id, x=x-1)],
on = .(id, x, values),
values := "b"
][, x := NULL]

Late to the party and several nice run length alternatives were already provided ;) So here I try nafill instead.
(1) Create a variable 'v2' which is NA when 'values' is "a". (2) Fill missing values by next observation carried backward. (3) When the original 'values' is "a" and the corresponding filled value in 'v2' is "b", update 'values' with 'v2'.
# 1
dt[values != "a" , v2 := values]
# 2
dt[, v2 := v2[nafill(replace(seq_len(.N), is.na(v2), NA), type = "nocb")], by = id]
# 3
dt[values == "a" & v2 == "b", values := v2]
# clean-up
dt[ , v2 := NULL]
Currently, nafill only works with numeric variables, hence the replace step in chunk # 2 (modified from @chinsoon12 in the issue "nafill, setnafill for character, factor and other types").
The NA replacement code may be slightly shortened by using zoo::na.locf:
dt[, v2 := zoo::na.locf(v2, fromLast = TRUE, na.rm = FALSE), by = id]
However, note that na.locf is slower.
When comparing the answers on larger data (data.table(id = rep(1:1e4, each = 1e4), values = sample(c("a", "b", "c"), 1e8, replace = TRUE))), it turns out that this alternative is actually faster than the others.
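To make step 2 concrete, here is the index trick on a small hand-made character vector (the vector and variable names are illustrative only): fill the positions, which are numeric, and then index the original vector with the filled positions.

```r
library(data.table)

v2 <- c(NA, "c", NA, "b", NA)    # character vector with gaps

# Positions of the non-missing entries, NA elsewhere
idx <- replace(seq_along(v2), is.na(v2), NA)

# Carry the next known position backward; a trailing NA has nothing to borrow
filled_idx <- nafill(idx, type = "nocb")

filled <- v2[filled_idx]
filled    # "c" "c" "b" "b" NA
```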

This is not pretty but I think this is what you are after:
dt[, .N, by = .(id, values = paste0(values, rleid(values)))
][, values := sub("[0-9]+", "", values)
][, values := fifelse(values == "a" & shift(values, -1L) == "b" & !is.na(shift(values, -1L)), "b", values), by = id
][, .SD[rep(seq_len(.N), N)]
][, !"N"]
id values
1: 1 a
2: 1 c
3: 1 b
4: 1 b
5: 1 a
6: 2 c
7: 2 c
8: 2 b
9: 2 b
10: 2 c
11: 3 c
12: 3 b
13: 3 b
14: 3 b
15: 3 b

You can use rle().
Note: To avoid ambiguity, I rename the "values" column to "var" because the rle() function also produces a list containing a vector named "values".
dt[, new := with(rle(var), rep(ifelse(values == "a" & c(values[-1], "") == "b", "b", values), lengths)), by = id]
dt
# id var new
# 1: 1 a a
# 2: 1 c c
# 3: 1 a b
# 4: 1 b b
# 5: 1 a a
# 6: 2 c c
# 7: 2 c c
# 8: 2 b b
# 9: 2 b b
# 10: 2 c c
# 11: 3 c c
# 12: 3 a b
# 13: 3 a b
# 14: 3 a b
# 15: 3 b b
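Unpacking the one-liner for a single group may help. This sketch uses the values of id 3 from the example; the intermediate names are mine:

```r
v <- c("c", "a", "a", "a", "b")   # group 3 from the example

r <- rle(v)
r$values    # "c" "a" "b"
r$lengths   # 1 3 1

# Value of the following run; "" pads the last run, which has no successor
nxt <- c(r$values[-1], "")

# Turn an "a" run into "b" when the next run is "b", then expand back
new_runs <- ifelse(r$values == "a" & nxt == "b", "b", r$values)
out <- rep(new_runs, r$lengths)
out         # "c" "b" "b" "b" "b"
```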

Match in lagged group in data.table

I'm trying to create a new column that indicates if an ID was present in a previous group. Here's my data:
data <- data.table(ID = c(1:3, c(9,2,3,4),c(5,1)),
groups = c(rep(c("a", "b", "c"), c(3, 4,2))))
ID groups
1: 1 a
2: 2 a
3: 3 a
4: 9 b
5: 2 b
6: 3 b
7: 4 b
8: 5 c
9: 1 c
I'm not sure how to specify lagged groups. I tried to use shift, but it's not working:
data[,.(ID=ID,match_lagged=ID %in% shift(ID)),by=groups]
Here's my desired result.
The first 3 lines are not matched because there is no previous group. FALSE would also work for these three rows. ID=4 (in group b) is not matched in group a. ID=5 (in group c) is not matched in group b.
Note that ID 1 in group c is not matched in group b so it should be false even though it exists in group a. This is why duplicated(data$ID) does not work. Data from a group has to be matched from the previous group.
groups ID match_lagged
1: a 1 NA
2: a 2 NA
3: a 3 NA
4: b 9 FALSE
5: b 2 TRUE
6: b 3 TRUE
7: b 4 FALSE
8: c 5 FALSE
9: c 1 FALSE
A dplyr solution would also work.

Number the groups, and then check if the diff is equal to one for each ID.
data[, grp.id := .GRP, by = groups]
data[, match_lagged := c(FALSE, diff(grp.id) == 1), by = ID][
grp.id == 1, match_lagged := NA][]
# ID groups grp.id match_lagged
#1: 1 a 1 NA
#2: 2 a 1 NA
#3: 3 a 1 NA
#4: 9 b 2 FALSE
#5: 2 b 2 TRUE
#6: 3 b 2 TRUE
#7: 4 b 2 FALSE
#8: 5 c 3 FALSE
#9: 1 c 3 FALSE
This assumes that you find each ID only once in each group. If that's not the case you can unique, do the above, and then merge in.
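A rough sketch of that unique-then-merge variant, on made-up data in which ID 2 appears twice in group a. It assumes the groups appear in the table in their lag order, so that grp.id increases within each ID:

```r
library(data.table)

# Hypothetical data: ID 2 is duplicated within group "a"
data <- data.table(ID = c(1, 2, 2, 3, 9, 2, 3, 4, 5, 1),
                   groups = c("a","a","a","a","b","b","b","b","c","c"))

data[, grp.id := .GRP, by = groups]

# One row per (ID, group), so diff() compares consecutive groups of an ID
u <- unique(data[, .(ID, grp.id)])
u[, match_lagged := c(FALSE, diff(grp.id) == 1), by = ID]
u[grp.id == 1, match_lagged := NA]

# Bring the flag back to every original row, duplicates included
data[u, on = .(ID, grp.id), match_lagged := i.match_lagged]
data
```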

This works. There is probably an easier solution:
library(dplyr)
data <- data.frame(ID = c(1:3, 1:4,c(5,1)),
groups = c(rep(c("a", "b", "c"), c(3, 4,2))))
z <- data %>% group_by(groups) %>% summarize(all_vals = list(ID))
z <- z %>% mutate(lagged_id = lag(all_vals,1))
match_lagged <- lapply(1:nrow(z) , function(x) {
(z$all_vals[x] %>% unlist) %in% (z$lagged_id[x] %>% unlist)
})
data$match_lagged = match_lagged %>% unlist

Column order of `.SD` in j argument differs when `get()` is used

I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
# library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# Generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) L - shift(L, n = 1, type='lag') ),
keyby = col1, .SDcols = d.vars]
The results are as expected: the .SD table's columns are ordered the same way as the names in d.vars.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
Which is the expected output because lapply in j processed column a first and b second, in spite of the column order in dt.
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
neg = -1,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# name of variable to be called in j.
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) {(L - shift(L, n = 1, type='lag') ) * get(negate) }),
keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
p.s. Thanks data.table team! I love this package!

Based on the description, we can use match to reorder 'd.vars' so that it follows the column order of 'dt' (giving 'd.vars1'), and then use 'd.vars1' both for the new names and for .SDcols so that names and values line up:
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.",d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') ), keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
Update
Based on the new dataset
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') * get(negate) ),
keyby = col1, .SDcols = d.vars1]
dt2
# col1 b a neg c d.b d.a
# 1: A -0.3539066 1 -1 A NA NA
# 2: A 0.2702374 4 -1 D -0.0836692 5
# 3: A -0.7834941 7 -1 G -0.5132567 11
# 4: A -1.2765652 10 -1 J -2.0600593 17
# 5: B -0.2936422 2 -1 B NA NA
# 6: B -0.2451996 5 -1 E -0.5388418 7
# 7: B -1.6577614 8 -1 H -1.9029610 13
# 8: B 1.0668059 11 -1 K -0.5909555 19
# 9: C -0.1160938 3 -1 C NA NA
#10: C -0.7940771 6 -1 F -0.9101709 9
#11: C 0.2951743 9 -1 I -0.4989028 15
#12: C -0.4508854 12 -1 L -0.1557111 21
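Another option, which is my own variation rather than something from the question, is to stop depending on the order of .SD at all: loop over the names in d.vars and pull each column out of .SD by name. The sketch below uses a deterministic b instead of rnorm(12), by instead of keyby, and the question's parenthesisation of the difference:

```r
library(data.table)

dt2 <- data.table(col1 = rep(LETTERS[1:3], 4),
                  b    = (1:12)^2,      # deterministic stand-in for rnorm(12)
                  a    = 1:12,
                  neg  = -1,
                  c    = LETTERS[1:12])

d.vars <- c("a", "b")
negate <- "neg"

# Loop over the names, not over .SD, and extract each column by name
dt2[, paste0("d.", d.vars) :=
      lapply(d.vars, function(v) (.SD[[v]] - shift(.SD[[v]])) * .SD[[negate]]),
    by = col1, .SDcols = c(d.vars, negate)]

res <- dt2[col1 == "A", .(a, d.a, b, d.b)]
res
```

Because each result is computed from the column named v, d.a is always derived from a and d.b from b, whatever order .SD happens to have.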

Unexpected result using data.table's shift() by group (bug?)

Consider this dataset
Data
dt <- data.table(ID = c(1,8,9,20,32,33), Char = c("A", "A", "B", "B", "C", "C"))
dt
ID Char
1: 1 A
2: 8 A
3: 9 B
4: 20 B
5: 32 C
6: 33 C
I want to identify "runs" by ID, i.e. consecutive rows where the ID differs by 1, but I only want to consider runs within the same Char group. I can do this as follows
Correct
dt[, InRun := FALSE]
dt[, DistToAbove := abs(ID - shift(ID, type="lag")), by=Char]
dt[, DistToBelow := abs(ID - shift(ID, type="lead")), by=Char]
dt[DistToAbove <= 1 | DistToBelow <= 1, InRun := TRUE, by=Char]
dt
ID Char InRun DistToAbove DistToBelow
1: 1 A FALSE NA 7
2: 8 A FALSE 7 NA
3: 9 B FALSE NA 11
4: 20 B FALSE 11 NA
5: 32 C TRUE NA 1
6: 33 C TRUE 1 NA
I tried simplifying the above code into the lines below, but the answer differs
Incorrect / Unexpected
dt[, InRun := FALSE]
dt[abs(ID - shift(ID, type="lag")) <= 1 | abs(shift(ID, type="lead") - ID) <= 1, InRun := TRUE, by=Char]
dt
ID Char InRun DistToAbove DistToBelow
1: 1 A FALSE NA 7
2: 8 A TRUE 7 NA
3: 9 B TRUE NA 11
4: 20 B FALSE 11 NA
5: 32 C TRUE NA 1
6: 33 C TRUE 1 NA
What gives? (Note I'm using data.table v1.9.7)

I want to identify "runs" by ID, i.e. consecutive rows where the ID differs by 1, but I only want to consider runs within the same Char group.
Here's how I'd approach it:
dt[, run_id := cumsum(
( ID != shift(ID, fill = ID[1L]) + 1L )
|
( Char != shift(Char, fill = Char[1L]) )
)]
dt[, in_run := .N > 1L, by=.(Char, run_id)]
ID Char run_id in_run
1: 1 A 1 FALSE
2: 8 A 2 FALSE
3: 9 B 3 FALSE
4: 20 B 4 FALSE
5: 32 C 5 TRUE
6: 33 C 5 TRUE
This code identifies all runs (including those with length of one) and then tests for length greater than one (the OP's definition).
Regarding the OP's approach:
dt[abs(ID - shift(ID, type="lag")) <= 1 | abs(shift(ID, type="lead") - ID) <= 1, # i
InRun := TRUE # j
, by=Char] # by
In DT[i,j,by] the steps are: filter using i, then group with by, then calculate j. You can't do by-group calculations in i in the way attempted here.
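To illustrate, the sketch below first evaluates the OP's condition over the whole column, which is effectively what i sees, and then evaluates the same test inside j, where shift() restarts within each Char group. The NA handling in the second step is my own addition:

```r
library(data.table)

dt <- data.table(ID = c(1, 8, 9, 20, 32, 33),
                 Char = c("A", "A", "B", "B", "C", "C"))

# What i effectively sees: shift() over the whole column, ignoring Char
whole <- dt[, abs(ID - shift(ID, type = "lag")) <= 1 |
              abs(shift(ID, type = "lead") - ID) <= 1]
whole   # rows 2 and 3 are TRUE only because 8 and 9 are adjacent across groups

# Same test evaluated in j, so shift() works within each Char group
dt[, InRun := {
  above <- abs(ID - shift(ID, type = "lag")) <= 1
  below <- abs(shift(ID, type = "lead") - ID) <= 1
  (!is.na(above) & above) | (!is.na(below) & below)
}, by = Char]
dt$InRun
```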

Order factor levels in order of appearance in data set

I have a survey in which a unique ID must be assigned to questions. Some questions appear multiple times. This means that there is an extra layer of questions. In the sample data below only the first layer is included.
Question: how do I assign a unique index by order of appearance? The solution provided here works alphabetically. I can order the factors, but this defeats the purpose of doing it in R [there are many questions to sort].
library(data.table)
dt = data.table(question = c("C", "C", "A", "B", "B", "D"),
value = c(10,20,30,40,20,30))
dt[, idx := as.numeric(as.factor(question))]
gives:
question value idx
# 1: C 10 3
# 2: C 20 3
# 3: A 30 1
# 4: B 40 2
# 5: B 20 2
# 6: D 30 4
# but required is:
dt[, idx.required := c(1, 1, 2, 3, 3, 4)]

I think the data.table way to do this will be
dt[, idx := .GRP, by = question]
## question value idx
## 1: C 10 1
## 2: C 20 1
## 3: A 30 2
## 4: B 40 3
## 5: B 20 3
## 6: D 30 4

You could respecify the factor levels:
dt[, idx := as.numeric(factor(question, levels=unique(question)))]
# question value idx
# 1: C 10 1
# 2: C 20 1
# 3: A 30 2
# 4: B 40 3
# 5: B 20 3
# 6: D 30 4
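The difference between the two orderings can also be seen in base R without building a factor at all. A small sketch on the question column (variable names are mine):

```r
question <- c("C", "C", "A", "B", "B", "D")

# Default factor levels are sorted, so the codes follow alphabetical order
alpha_idx <- as.integer(factor(question))
alpha_idx        # 3 3 1 2 2 4

# unique() keeps first-appearance order; match() looks each value up in it
appear_idx <- match(question, unique(question))
appear_idx       # 1 1 2 3 3 4
```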
