Filter duplicate sequence of rows - r

(Note: I was surprised not to find a similar question, but I am happy to remove this one if I am mistaken.)
I have the following sample dataset.
library(data.table)
dt <- data.table(val = c(1, 2, 3, 0, 2, 4, 1, 2, 3), id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))
Group id=1 has the same values for val (1, 2, 3) as group id=3. I would like to filter out these "duplicate" rows in group id=3.
My desired output is:
> dt
val id
1: 1 1
2: 2 1
3: 3 1
4: 0 2
5: 2 2
6: 4 2
I only came up with dirty workarounds like taking the sum per group (dt[, filter := sum(val), by = id]) and removing duplicates, but then the rows for id = 2 would also disappear.
Note: if the values for id=3 were 1,3,2 (the same values but in a different order), the rows should not be removed, so order matters.
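For illustration, a quick sketch (using only the sample data above) of the collision that makes the sum-based workaround fail:
library(data.table)
dt <- data.table(val = c(1, 2, 3, 0, 2, 4, 1, 2, 3), id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))
# every group happens to sum to 6, so deduplicating on the sum would also
# drop group id = 2, which is genuinely different
dt[, .(filter = sum(val)), by = id]
#    id filter
# 1:  1      6
# 2:  2      6
# 3:  3      6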

This is not a data.table specific approach, but it would work:
x = split(dt$val, dt$id)
dt[!id %in% names(x[duplicated(x)])]
# val id
#1: 1 1
#2: 2 1
#3: 3 1
#4: 0 2
#5: 2 2
#6: 4 2
It might not be optimal in terms of efficiency.

You can convert to string, remove duplicates and merge, i.e.
merge(dt, unique(dt[, .(new = toString(val)), id], by = 'new'))[,new := NULL][]
# id val
#1: 1 1
#2: 1 2
#3: 1 3
#4: 2 0
#5: 2 2
#6: 2 4
We can avoid merge by pulling the ids and using %in%, i.e.
i1 <- unique(dt[, .(new = toString(val)), id], by = 'new')[, id]
dt[id %in% i1,]
# val id
#1: 1 1
#2: 2 1
#3: 3 1
#4: 0 2
#5: 2 2
#6: 4 2
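One small caveat (my note, not part of the original answer): toString() joins with ", ", which is unambiguous for numeric val; if val were character and could itself contain commas, a separator that cannot occur in the data would be safer, for example:
i1 <- unique(dt[, .(new = paste(val, collapse = "\r")), id], by = 'new')[, id]
dt[id %in% i1]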

Another option with data.table:
# build an order-sensitive key per id, keep only the first id that produced each
# (key, val) pair, then drop the helper column
dt <- dt[, pat := paste(val, collapse = "/"), by = id][
  , .SD[which.min(rleid(pat))], by = .(pat, val)][, pat := NULL]
Output:
val id
1: 1 1
2: 2 1
3: 3 1
4: 0 2
5: 2 2
6: 4 2
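As a quick sanity check of the order-sensitivity requirement (my sketch, reusing the %in% idea from above): if id=3 had the values 1,3,2, its rows would be kept:
dt_check <- data.table(val = c(1, 2, 3, 0, 2, 4, 1, 3, 2), id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))
keep <- unique(dt_check[, .(new = toString(val)), id], by = 'new')[, id]
dt_check[id %in% keep]   # id = 3 survives because "1, 3, 2" differs from "1, 2, 3"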

Related

mutate variable by condition using two variables in long format data.table in r

In this data.table:
dt <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c(1,0,0,0,1,0))
dt
id time x
1: 1 1 1
2: 1 2 0
3: 1 3 0
4: 2 1 0
5: 2 2 1
6: 2 3 0
I need the following:
id time x
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 0
5: 2 2 1
6: 2 3 1
that is
if x==1 at time==1 then x=1 at times 2 and 3, by id
if x==1 at time==2 then x=1 at time 3, by id
For the first point (I guess the second one will be similar), I have tried approaches mentioned in similar questions I posted before (here and here), but none work:
dt[x==1[time == 1], x := x[time == 1], id] gives an error
setDT(dt)[, x2 := ifelse(x==1 & time==1, x[time==1], x), by=id] changes x only at time 1 (so no real change is observed)
It would be much easier to work with data.table in wide format, but I keep facing this kind of problem in long format and I don't want to reshape my data all the time
Thank you!
EDIT:
The answer provided by @GregorThomas, dt[, x := cummax(x), by = id], works for the problem that I presented.
Now I ask the same question for a character variable:
dt2 <- data.table(id=c(1,1,1,2,2,2), time=rep(1:3,2), x=c('a','b','b','b','a','b'))
dt2
id time x
1: 1 1 a
2: 1 2 b
3: 1 3 b
4: 2 1 b
5: 2 2 a
6: 2 3 b
In the table above, how could the following be done:
if x=='a' at time==1 then x='a' at times 2 and 3, by id
if x=='a' at time==2 then x='a' at time 3, by id
Using the cumulative maximum function cummax:
dt[, x := cummax(x), by = id]
dt
# id time x
# 1: 1 1 1
# 2: 1 2 1
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 1
# 6: 2 3 1
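The answer above covers the numeric case. For the character follow-up in the edit, a possible extension of the same idea (my sketch, not part of the original answer) is to take the cumulative maximum of the logical x == 'a' within each id:
dt2 <- data.table(id = c(1,1,1,2,2,2), time = rep(1:3, 2), x = c('a','b','b','b','a','b'))
# once 'a' has been seen in a group, cummax(x == 'a') stays at 1, so every later x becomes 'a'
dt2[, x := fifelse(as.logical(cummax(x == "a")), "a", x), by = id]
dt2
#    id time x
# 1:  1    1 a
# 2:  1    2 a
# 3:  1    3 a
# 4:  2    1 b
# 5:  2    2 a
# 6:  2    3 a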

join data.table after group by

I have a fairly simple question for which I could not find a suitable answer here. I have the following data.table, in which I want to create an indicator variable equal to 1 if the group ID has an observation with a specific value, in this case 13:
DT = data.table(ID = c(1, 1, 2, 3, 3, 3), A = c(13, 1, 13, 11, 12, 12))
DT
ID A
1: 1 13
2: 1 1
3: 2 13
4: 3 11
5: 3 12
6: 3 12
My desired result, which is a simple split-apply-combine in dplyr lingo, would be:
DT
ID A B
1: 1 13 1
2: 1 1 1
3: 2 13 1
4: 3 11 0
5: 3 12 0
6: 3 12 0
My idea was to do something along the lines of DT[A == 13, B := 1][, B := max(B, na.rm=TRUE), by='ID'], and it kind of works but results in some -Inf values for groups with no observations equal to 13. Is there a better way to do this?
In a split-apply-combine framework, I would start with DT[A == 13, B := 1, by='ID'] and then do a LEFT JOIN, but I want to do it the data.table way as much as possible. Thanks!
We can group by 'ID' and assign (:=) based on whether any value in 'A' is equal to 13:
library(data.table)
DT[, B := +(any(A == 13)), ID]
Or with %in%
DT[, B := +(13 %in% A), ID]
DT
# ID A B
#1: 1 13 1
#2: 1 1 1
#3: 2 13 1
#4: 3 11 0
#5: 3 12 0
#6: 3 12 0
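As a side note (my addition, not part of the original answer), the unary + only coerces the logical result to 1/0; more explicit equivalents would be:
DT[, B := as.integer(any(A == 13)), ID]
DT[, B := fifelse(13 %in% A, 1L, 0L), ID]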

R rules for new variable based on current previous and next value

data
data <- data.frame(person = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                   score  = c(1,2,1,3,4,3,1,2,1,3,1,2,3,1,3,2,2,3,1,3,3),
                   want   = c(1,1,1,3,4,1,1,1,1,1,1,1,1,1,1,3,1,1,1,3,3))
I will do my best here to explain what I hope to achieve.
Basically I want to create the 'want' column which depends on the previous, current, and next values.
In the data, an individual can have a score of 1, 2, 3, or 4. I want a new variable 'want' that follows these rules:
a score of 3 will be assigned at time T if there was a score of 3 at time T-1 and a score of 2 or 3 at time T+1.
a score of 3 will be assigned at time T if there was a score of 3 at time T and a score of 4 at time T+1.
otherwise, all score values should be a 1 EXCEPT if there is a 4.
Is it supposed to look like your want column? This gives different results, but it appears to follow your logic:
library(dplyr)
data %>%
  group_by(person) %>%
  mutate(want2 = case_when(
    lag(score) == 3 & lead(score) %in% c(2, 3) ~ 3,
    score == 3 & lead(score) == 4 ~ 3,
    TRUE ~ 1))
Your want column is not following your own rules. Please notice that you have a 4 in the 5th position, but there is no rule that assigns a 4 (other values also seem to be miscalculated, as per your rules).
# load packages
library(data.table)
# create data
dt <- data.table(person = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                 score  = c(1,2,1,3,4,3,1,2,1,3,1,2,3,1,3,2,2,3,1,3,3))
# Make lead and lag vectors
dt[, tMinus := shift(score, 1, type = "lag")]
dt[, tPlus := shift(score, 1, type = "lead")]
# calculate want
dt[, want := 1][tMinus == 3 & tPlus %in% 2:3, want := 3][score == 3 & tPlus == 4, want := 3]
# remove unneeded columns
dt[, c("tMinus", "tPlus") := NULL]
This produces the result:
> dt
person score want
1: 1 1 1
2: 1 2 1
3: 1 1 1
4: 1 3 3
5: 1 4 3
6: 2 3 1
7: 2 1 3
8: 2 2 1
9: 2 1 1
10: 2 3 1
11: 2 1 3
12: 3 2 1
13: 3 3 1
14: 3 1 3
15: 3 3 1
16: 3 2 3
17: 4 2 1
18: 4 3 1
19: 4 1 3
20: 4 3 1
21: 4 3 1
person score want
It wasn't clear whether you wanted to calculate want by person. If so, consider the following code:
dt[, tPlus := shift(score, 1, type = "lead"), by = person]
dt[, tMinus := shift(score, 1, type = "lag"), by = person]
dt[, want := 1][
  tMinus == 3 & tPlus %in% 2:3, want := 3][
  score == 3 & tPlus == 4, want := 3][
  , c("tMinus", "tPlus") := NULL][]
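A more compact equivalent (my sketch, not from the original answer) keeps the whole rule set in a single fcase() call, computing the shifts per person on the fly; %in% is used so that the NAs produced by shift at the group edges never reach the conditions:
dt[, want := fcase(
  shift(score, type = "lag") %in% 3 & shift(score, type = "lead") %in% 2:3, 3,
  score == 3 & shift(score, type = "lead") %in% 4, 3,
  default = 1
), by = person]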

force .GRP counter to start from 2 instead of 1 in data.table

How to force .GRP in data.table to start the group counter from 2 instead of 1?
I have a data.table with groups that I want to label sequentially.
example_data.table <- data.table(Var1 = c(1,2,2,4,5,5,5), Var2 = c(1,2,3,7,1,2,3) )
When I use the .GRP counter, it starts with the very first combination as counter 1.
Group_table <- setDT(example_data.table)[, label := .GRP, by = c("Var1", "Var2" )]
But I want the group with Var1 = 4 and Var2 = 7 to get counter value 1, with the remaining groups numbered after it.
How do I use .GRP so that the combination Var1 = 4, Var2 = 7 gets counter 1 and the others follow in order?
What I was thinking is to manually give counter 1 to the needed combination and start the counter from 2 for the others. There are other ways too, but I am just a bit confused.
If you only have one entry with Var1 = 4 & Var2 = 7, then you can exclude that entry from the .GRP grouping and use replace to set its counter to 1, i.e.
library(data.table)
example_data.table[-which(Var1 == 4 & Var2 == 7), Counter := .GRP + 1, by = c('Var1', 'Var2')][,
  Counter := replace(Counter, is.na(Counter), 1)][]
which gives,
Var1 Var2 Counter
1: 1 1 2
2: 2 2 3
3: 2 3 4
4: 4 7 1
5: 5 1 5
6: 5 2 6
7: 5 3 7
If you want certain groups to "start" the count, you can use order to sort during construction:
ex = copy(example_data.table)
ex[order(Var1 != 4, Var2 != 7), g := .GRP, by=.(Var1, Var2)][]
Var1 Var2 g
1: 1 1 2
2: 2 2 3
3: 2 3 4
4: 4 7 1
5: 5 1 5
6: 5 2 6
7: 5 3 7
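A small variation on the same idea (my sketch): if only that exact combination should lead the count, the condition can target it directly, which avoids also promoting other rows that merely share Var1 == 4 or Var2 == 7:
ex2 <- copy(example_data.table)
ex2[order(!(Var1 == 4 & Var2 == 7)), g := .GRP, by = .(Var1, Var2)][]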

Recode NA with values from similar ID with data.table

I'm in the process of learning data.table and am trying to recode NA to the non-missing value within each group b.
library(data.table)
dt <- data.table(a = rep(1:3, 2),
                 b = c(rep(1, 3), rep(2, 3)),
                 c = c(NA, 4, NA, 6, NA, NA))
> dt
a b c
1: 1 1 NA
2: 2 1 4
3: 3 1 NA
4: 1 2 6
5: 2 2 NA
6: 3 2 NA
I would like to get this:
> dt
a b c
1: 1 1 4
2: 2 1 4
3: 3 1 4
4: 1 2 6
5: 2 2 6
6: 3 2 6
I tried these, but none gives the desired result.
dt[, c := ifelse(is.na(c), !is.na(c), c), by = b]
dt[is.na(c), c := dt[!is.na(c), .(c)], by = b]
I'd appreciate some help, and a little explanation of how I should think about the problem when solving it with a data.table approach.
Assuming a simple case where there is just one non-missing value of c for each level of b:
dt[, c := c[!is.na(c)][1], by = b]
dt
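If a group could hold several non-missing values and you would rather fill each NA from the nearest observation instead, data.table's nafill() could be used for numeric columns (my sketch, not part of the original answer), starting again from the original dt with NAs:
# carry the last observation forward, then fill any leading NAs backwards, per group
dt[, c := nafill(nafill(c, type = "locf"), type = "nocb"), by = b]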
