join data.table after group by - r

I have a fairly simple question that I could not find a suitable answer to here. I have the following data.table, for which I want to create an indicator variable equal to 1 if the group ID has an observation with a specific value, in this case 13:
DT = data.table(ID = c(1, 1, 2, 3, 3, 3), A = c(13, 1, 13, 11, 12, 12))
DT
ID A
1: 1 13
2: 1 1
3: 2 13
4: 3 11
5: 3 12
6: 3 12
My desired result, which is a simple split-apply-combine in dplyr lingo, would be:
DT
ID A B
1: 1 13 1
2: 1 1 1
3: 2 13 1
4: 3 11 0
5: 3 12 0
6: 3 12 0
My idea was to do something along the lines of DT[A == 13, B := 1][, B := max(B, na.rm=TRUE), by='ID'], and it kind of works but results in some -Inf values for groups with no observations equal to 13. Is there a better way to do this?
In a split-apply-combine framework, I would start with DT[A == 13, B := 1, by='ID'], then do a LEFT JOIN, but want to do it the data.table way as much as possible. Thanks!
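For reference, the max-based idea above can be patched by zeroing out the -Inf groups afterwards; a minimal sketch (the grouped `any()` approach in the answer below avoids the warning entirely):

```r
library(data.table)
DT <- data.table(ID = c(1, 1, 2, 3, 3, 3), A = c(13, 1, 13, 11, 12, 12))
# flag rows with A == 13, then take the group max;
# groups with no flagged rows are all-NA, so max(..., na.rm = TRUE)
# warns and returns -Inf for them
DT[A == 13, B := 1][, B := max(B, na.rm = TRUE), by = ID]
# patch: turn the -Inf groups into 0
DT[is.infinite(B), B := 0]
DT$B  # 1 1 1 0 0 0
```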

We can group by 'ID' and assign (:=) based on whether any value in 'A' is equal to 13:
library(data.table)
DT[, B := +(any(A == 13)), ID]
Or with %in%
DT[, B := +(13 %in% A), ID]
DT
# ID A B
#1: 1 13 1
#2: 1 1 1
#3: 2 13 1
#4: 3 11 0
#5: 3 12 0
#6: 3 12 0

Related

Skip NAs when using Reduce() in data.table

I'm trying to get the cumulative sum of data.table rows and was able to find this code in another stackoverflow post:
devDF1[,names(devDF1):=Reduce(`+`,devDF1,accumulate=TRUE)]
It does what I need it to do; however, when it comes across a row that starts off with an NA, it replaces every element in that row with NA (instead of the cumsum of the other elements in the row). I don't want to replace the NAs with 0s, because I'll need this output for further processing and don't want the same final cumsum duplicated across rows. Is there any way I can adjust that piece of code to ignore the NAs? Or is there alternative code that gets the cumulative sum of the rows in a data.table while ignoring NAs?
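To see why a single NA poisons the whole row, here is a minimal self-contained illustration (a toy list of three columns standing in for the columns of devDF1, which isn't shown in the question): `+` propagates NA, so once a running sum hits an NA it stays NA for every later column.

```r
# Reduce(`+`, ..., accumulate = TRUE) adds the columns left to right,
# keeping every intermediate running sum
cols <- list(a = c(1, 2), b = c(NA, 3), c = c(5, 5))
sums <- Reduce(`+`, cols, accumulate = TRUE)
sums[[2]]  # NA  5   <- row 1 is poisoned by the NA in b
sums[[3]]  # NA 10   <- and stays NA from then on
```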
Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = c(3, NA, 1, 2, 4), c = c(NA, 1, NA, 3, 4))
dt
# a b c
#1: 1 3 NA
#2: 2 NA 1
#3: 3 1 NA
#4: 4 2 3
#5: 5 4 4
If you want to carry the previous cumulative value over the NAs (treating NA as 0), you can use:
dt[, names(dt) := lapply(.SD, function(x) cumsum(replace(x, is.na(x), 0))),
   .SDcols = names(dt)]
dt
# a b c
#1: 1 3 0
#2: 3 3 1
#3: 6 4 1
#4: 10 6 4
#5: 15 10 8
If you want to keep NA as NA:
dt[, names(dt) := lapply(.SD, function(x) {
  x1 <- cumsum(replace(x, is.na(x), 0))
  x1[is.na(x)] <- NA
  x1
}), .SDcols = names(dt)]
dt
# a b c
#1: 1 3 NA
#2: 3 NA 1
#3: 6 4 NA
#4: 10 6 4
#5: 15 10 8

R rules for new variable based on current, previous, and next value

data
data = data.frame("person" = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                  "score" = c(1,2,1,3,4,3,1,2,1,3,1,2,3,1,3,2,2,3,1,3,3),
                  "want" = c(1,1,1,3,4,1,1,1,1,1,1,1,1,1,1,3,1,1,1,3,3))
I will do my best here to explain what I hope to achieve.
Basically I want to create the 'want' column which depends on the previous, current, and next values.
In the data, an individual can have a score of 1,2,3,4. I want a new variable 'want' that follows these rules:
a score of 3 will be assigned at time T if there was a score of 3 at time T-1 and a score of 2 or 3 at time T+1.
a score of 3 will be assigned at time T if there was a score of 3 at time T and a score of 4 at time T+1.
otherwise, all score values should be a 1 EXCEPT if there is a 4.
Is it supposed to look like your want column? This gives different results, but appears to follow your logic:
library(dplyr)
data %>%
  group_by(person) %>%
  mutate(want2 = case_when(
    lag(score) == 3 & lead(score) %in% c(2, 3) ~ 3,
    score == 3 & lead(score) == 4 ~ 3,
    TRUE ~ 1
  ))
Your want column is not following your own rules. Notice that you have a 4 in the 5th position, but there is no rule that assigns a 4 (other values also seem to be miscalculated, as per your rules).
# load packages
library(data.table)
# create data
dt <- data.table(person = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                 score = c(1,2,1,3,4,3,1,2,1,3,1,2,3,1,3,2,2,3,1,3,3))
# Make lead and lag vectors
dt[, tMinus := shift(score, 1, type = "lag")]
dt[, tPlus := shift(score, 1, type = "lead")]
# calculate want
dt[, want := 1][tMinus == 3 & tPlus %in% 2:3, want := 3][score == 3 & tPlus == 4, want := 3]
# remove unneeded columns
dt[, c("tMinus", "tPlus") := NULL]
This produces the result:
> dt
person score want
1: 1 1 1
2: 1 2 1
3: 1 1 1
4: 1 3 3
5: 1 4 3
6: 2 3 1
7: 2 1 3
8: 2 2 1
9: 2 1 1
10: 2 3 1
11: 2 1 3
12: 3 2 1
13: 3 3 1
14: 3 1 3
15: 3 3 1
16: 3 2 3
17: 4 2 1
18: 4 3 1
19: 4 1 3
20: 4 3 1
21: 4 3 1
person score want
It wasn't clear whether you wanted to calculate want by person. If so, consider the following code:
dt[, tPlus := shift(score, 1, type = "lead"), by = person]
dt[, tMinus := shift(score, 1, type = "lag"), by = person]
dt[, want := 1][tMinus == 3 & tPlus %in% 2:3, want := 3][
  score == 3 & tPlus == 4, want := 3][
  , c("tMinus", "tPlus") := NULL][]

Filter duplicate sequence of rows

(Note: I was surprised not to find a similar question, but I am happy to remove this one if I am mistaken.)
I have the following sample dataset.
library(data.table)
dt <- data.table(val = c(1, 2, 3, 0, 2, 4, 1, 2, 3), id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))
Group with id=1 has the same values for val (1,2,3) as group with id=3. I would like to filter out these "duplicate" values in group id=3.
My desired output is:
> dt
val id
1: 1 1
2: 2 1
3: 3 1
4: 0 2
5: 2 2
6: 4 2
I only came up with dirty workarounds like taking the sum, dt[, filter := sum(val), by = id], and removing duplicates, but then the rows for id = 2 would also disappear.
Note: if the values for id=3 were 1,3,2 (the same values but in a different order), the rows should not be removed, so order matters.
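To see why the sum-based workaround collides on this data: all three groups happen to sum to 6, so deduplicating on the sum would drop id = 2 along with id = 3. A quick check:

```r
library(data.table)
dt <- data.table(val = c(1, 2, 3, 0, 2, 4, 1, 2, 3),
                 id = c(1, 1, 1, 2, 2, 2, 3, 3, 3))
# per-group sums: every id sums to the same value
grp <- dt[, .(s = sum(val)), by = id]
grp
#    id s
#1:  1 6
#2:  2 6
#3:  3 6
```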
This is not a data.table specific approach, but it would work:
x = split(dt$val, dt$id)
dt[!id %in% names(x[duplicated(x)])]
# val id
#1: 1 1
#2: 2 1
#3: 3 1
#4: 0 2
#5: 2 2
#6: 4 2
It might be not optimal in terms of efficiency.
You can convert to string, remove duplicates and merge, i.e.
merge(dt, unique(dt[, .(new = toString(val)), id], by = 'new'))[,new := NULL][]
# id val
#1: 1 1
#2: 1 2
#3: 1 3
#4: 2 0
#5: 2 2
#6: 2 4
We can avoid merge by pulling the ids and using %in%, i.e.
i1 <- unique(dt[, .(new = toString(val)), id], by = 'new')[, id]
dt[id %in% i1,]
# val id
#1: 1 1
#2: 2 1
#3: 3 1
#4: 0 2
#5: 2 2
#6: 4 2
Another option with data.table:
dt <- dt[, pat := paste(val, collapse = "/"), by = id][
, .SD[which.min(rleid(pat))], by = .(pat, val)][, pat := NULL]
Output:
val id
1: 1 1
2: 2 1
3: 3 1
4: 0 2
5: 2 2
6: 4 2

force .GRP counter to start from 2 instead of 1 in data.table

How to force .GRP in data.table to start the group counter from 2 instead of 1?
I have a data.table with groups which I want to sequentially order by group.
example_data.table <- data.table(Var1 = c(1,2,2,4,5,5,5), Var2 = c(1,2,3,7,1,2,3) )
When I use the .GRP counter, it labels the very first combination as counter 1.
Group_table <- setDT(example_data.table)[, label := .GRP, by = c("Var1", "Var2" )]
But I want the group with Var1 = 4 and Var2 = 7 to get counter value 1, and the others to follow after it.
How do I use .GRP in such a way that the combination Var1 = 4, Var2 = 7 takes counter 1 and the others come next in order?
So, what I was thinking is to manually assign counter 1 to the needed combination and start the counter from 2 for the others. There are other ways too, but I am just a bit confused.
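One literal way to realise that manual idea (a sketch, not taken from the answers below): compute the plain .GRP labels first, then remap so the target group becomes 1 and the groups that came before it shift up by one.

```r
library(data.table)
dt <- data.table(Var1 = c(1, 2, 2, 4, 5, 5, 5), Var2 = c(1, 2, 3, 7, 1, 2, 3))
dt[, label := .GRP, by = .(Var1, Var2)]  # plain counter: 1..7 in order of appearance
target <- dt[Var1 == 4 & Var2 == 7, unique(label)]
# move the target group to 1; bump the groups numbered before it up by one
dt[, label := fifelse(label == target, 1L,
                      fifelse(label < target, label + 1L, label))]
dt$label  # 2 3 4 1 5 6 7
```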
If you only have one entry with Var1 = 4 & Var2 = 7, then you can exclude that entry when computing .GRP (offset by 1), and use replace to set its counter to 1, i.e. (with dt1 as a copy of example_data.table),
library(data.table)
dt1 <- copy(example_data.table)
dt1[-which(dt1$Var1 == 4 & dt1$Var2 == 7), Counter := .GRP + 1, by = c('Var1', 'Var2')][,
    Counter := replace(Counter, is.na(Counter), 1)][]
which gives,
Var1 Var2 Counter
1: 1 1 2
2: 2 2 3
3: 2 3 4
4: 4 7 1
5: 5 1 5
6: 5 2 6
7: 5 3 7
If you want certain groups to "start" the count, you can use order to sort during construction:
ex = copy(example_data.table)
ex[order(Var1 != 4, Var2 != 7), g := .GRP, by=.(Var1, Var2)][]
Var1 Var2 g
1: 1 1 2
2: 2 2 3
3: 2 3 4
4: 4 7 1
5: 5 1 5
6: 5 2 6
7: 5 3 7

Recode NA with values from similar ID with data.table

I'm learning data.table and trying to recode NA to the non-missing value within each group b.
library(data.table)
dt <- data.table(a = rep(1:3, 2),
                 b = c(rep(1, 3), rep(2, 3)),
                 c = c(NA, 4, NA, 6, NA, NA))
> dt
a b c
1: 1 1 NA
2: 2 1 4
3: 3 1 NA
4: 1 2 6
5: 2 2 NA
6: 3 2 NA
I would like to get this:
> dt
a b c
1: 1 1 4
2: 2 1 4
3: 3 1 4
4: 1 2 6
5: 2 2 6
6: 3 2 6
I tried these, but none gives the desired result.
dt[, c := ifelse(is.na(c), !is.na(c), c), by = b]
dt[is.na(c), c := dt[!is.na(c), .(c)], by = b]
I'd appreciate some help, and a bit of explanation on how I should think when trying to solve this kind of problem the data.table way.
Assuming a simple case where there is just one distinct non-missing c for each level of b:
dt[, c := c[!is.na(c)][1], by = b]
dt
# a b c
#1: 1 1 4
#2: 2 1 4
#3: 3 1 4
#4: 1 2 6
#5: 2 2 6
#6: 3 2 6
