Multiple first and last non-NA values by group in R

I have the following data.table:
require(data.table)
dt = data.table(
  id = c(rep('Grp 1', 31), rep('Grp 2', 31)),
  date = rep(as.IDate(as.IDate('2020-01-01') : as.IDate('2020-01-31')), 2),
  change = c(rep(NA, 5), rep('yes', 5), rep(NA, 10), rep('yes', 3), rep(NA, 8),
             rep(NA, 2), rep('yes', 8), rep(NA, 8), rep('yes', 5), rep(NA, 8))
)
For every group id I'd like to filter the first and last dates of a series, where a series is defined by the second column change being 'yes' (i.e. non-NA). I can do the following, which gives me the first and last non-NA row by group. The problem, however, is that a series can occur more than once per group.
dt[ !is.na(change),
    .(head(date, 1),
      tail(date, 1)),
    .(id) ]
These are the row indices I'd like to have filtered:
dt[c(6,10,21,23,34,41,50,54)]

One way is to give a unique group id to each streak identified by an id and change combination. We can use rleid to generate such run-length type ids. Consider something like this
dt[, gid := rleid(id, change)
   ][!is.na(change),
     as.list(range(date)),
     by = .(id, gid)
   ][, gid := NULL]
Note that I also assume that you want the range of dates, not really the first and last elements. Your method will fail if the dates are not in chronological order. Output looks like this
id V1 V2
1: Grp 1 2020-01-06 2020-01-10
2: Grp 1 2020-01-21 2020-01-23
3: Grp 2 2020-01-03 2020-01-10
4: Grp 2 2020-01-19 2020-01-23
rleid works like this
> rleid(c(1, 1, 2, 3, 3), c("a", "b", "b", "d", "d"))
[1] 1 2 3 4 4
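As a side note on the ordering caveat above: if the input might not already be sorted, one option (my own sketch, not part of the answer) is to order by id and date before building the run ids, after which head()/tail() also return the chronological endpoints of each streak:

dt[order(id, date)
   ][, gid := rleid(id, change)
   ][!is.na(change),
     .(first = head(date, 1), last = tail(date, 1)),
     by = .(id, gid)
   ][, gid := NULL]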

Here is an option with dplyr
library(dplyr)
library(data.table)
dt %>%
  group_by(grp = rleid(id, change), id) %>%
  filter(!is.na(change)) %>%
  summarise(V1 = min(date, na.rm = TRUE),
            V2 = max(date, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 4 x 4
# grp id V1 V2
# <int> <chr> <date> <date>
#1 2 Grp 1 2020-01-06 2020-01-10
#2 4 Grp 1 2020-01-21 2020-01-23
#3 7 Grp 2 2020-01-03 2020-01-10
#4 9 Grp 2 2020-01-19 2020-01-23

Related

Identify rows with a value greater than the threshold, but only the first one above it, per group

Suppose we have a dataset with a grouping variable, a value, and a threshold that is unique per group. Say I want to identify the values that are greater than the threshold, but flag only one per group.
test <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2)
)
want <- data.frame(
  grp = c("A", "A", "A", "B", "B", "B"),
  value = c(1, 3, 5, 1, 3, 5),
  threshold = c(4, 4, 4, 2, 2, 2),
  want = c(NA, NA, "yes", NA, "yes", NA)
)
In the table above, Group A has a threshold of 4, and only the value 5 is higher. But in Group B the threshold is 2, and both 3 and 5 are higher. However, only the row with value 3 is marked.
I was able to do this by identifying which rows had a value greater than the threshold, then removing the repeated flag:
library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = if_else(value > threshold, "yes", NA_character_)) %>%
  mutate(across(want, ~ replace(.x, duplicated(.x), NA)))
I was wondering if there was a direct way to do this with a single logical statement rather than this two-step method, something along the lines of:
test %>%
  group_by(grp) %>%
  mutate(want = if_else(???, "yes", NA_character_))
The answer doesn't have to be on R either. Just a logical step explanation would suffice as well. Perhaps using a rank?
Thank you!
library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
  ungroup()
# # A tibble: 6 × 4
# grp value threshold want
# <chr> <dbl> <dbl> <lgl>
# 1 A 1 4 FALSE
# 2 A 3 4 FALSE
# 3 A 5 4 TRUE
# 4 B 1 2 FALSE
# 5 B 3 2 TRUE
# 6 B 5 2 FALSE
If you really want the "yes"/NA strings, you can apply if_else after this.
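For example, a sketch of that last step (the pipeline is the one above with one extra mutate appended):

library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = (value > threshold), want = want & !lag(cumany(want))) %>%
  mutate(want = if_else(want, "yes", NA_character_)) %>%  # recode TRUE -> "yes", otherwise NA
  ungroup()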
Here is a more direct way.
The essential part: with min(which((value > threshold) == TRUE)) we get the position of the first TRUE in the column. Next we use ifelse() to compare that position with the row number and set the value accordingly:
library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = ifelse(row_number() == min(which((value > threshold) == TRUE)),
                       "yes", NA_character_))
grp value threshold want
<chr> <dbl> <dbl> <chr>
1 A 1 4 NA
2 A 3 4 NA
3 A 5 4 yes
4 B 1 2 NA
5 B 3 2 yes
6 B 5 2 NA
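One caveat (my note, not part of the answer above): if a group has no value above its threshold, min(which(...)) returns Inf with a warning. A variant using base R's match() avoids the warning and simply leaves such a group all NA:

library(dplyr)
test %>%
  group_by(grp) %>%
  mutate(want = ifelse(row_number() == match(TRUE, value > threshold),
                       "yes", NA_character_))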
This is a perfect chance for a data.table answer using its non-equi matching and multiple match handling capabilities:
library(data.table)
setDT(test)
test[test, on=.(grp, value>threshold), mult="first", flag := TRUE]
test
# grp value threshold flag
# <char> <num> <num> <lgcl>
#1: A 1 4 NA
#2: A 3 4 NA
#3: A 5 4 TRUE
#4: B 1 2 NA
#5: B 3 2 TRUE
#6: B 5 2 NA
Find the "first" (mult = "first") matching value in each group that is greater than (>) the threshold and set (:=) it to TRUE.
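If the character "yes"/NA coding from the question is needed instead of a logical flag, a small follow-up sketch (the recoding step and the column name want are my additions):

# recode the logical flag (TRUE or NA here) into the "yes"/NA form used in the question
test[, want := fifelse(flag, "yes", NA_character_)]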

Remove rows of one data frame from another data frame but keep duplicates in R

I'm working in R and I have two dataframes: one is the base dataframe, and the other has the rows that I need to remove from the base one. But I can't use the setdiff() function, because it removes duplicated rows. Here's an example:
a <- data.frame(var1 = c(1, NA, 2, 2, 3, 4, 5),
                var2 = c(1, 7, 2, 2, 3, 4, 5))
b <- data.frame(id = c(2, 4),
                numero = c(2, 4))
And the result must be:
var1 var2
1 1
NA 7
2 2
3 3
5 5
It must be an efficient algorithm, too, because the base dataframe has 3 million rows with 26 columns.
We may need to create a sequence column before joining
library(data.table)
setDT(a)[, rn := rowid(var1, var2)
  ][!setDT(b)[, rn := rowid(id, numero)],
    on = .(var1 = id, var2 = numero, rn)
  ][, rn := NULL][]
-output
var1 var2
<num> <num>
1: 1 1
2: NA 7
3: 2 2
4: 3 3
5: 5 5
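A short illustration of why the helper column works (added for context): rowid() numbers repeated key combinations, so the anti-join removes exactly one copy of a row in a for each copy present in b.

# with data.table loaded as above
rowid(c(2, 2, 4), c(2, 2, 4))
# [1] 1 2 1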

How to do a shift operation in R data.table, while appending an additional row for one last comparison?

Here are some data:
library(data.table)
library(lubridate)
foo <- data.table(
  date = seq.Date(from = as_date('2020-01-01'), to = as_date('2020-03-01'), by = '1 month'),
  a = rep(1:3, 2),
  group = c(rep('a', 3), rep('b', 3))
)
> foo
date a group
1: 2020-01-01 1 a
2: 2020-02-01 2 a
3: 2020-03-01 3 a
4: 2020-01-01 1 b
5: 2020-02-01 2 b
6: 2020-03-01 3 b
The desired output is the following:
date a group diff
1: 2020-01-01 1 a 1
2: 2020-02-01 2 a 1
3: 2020-03-01 3 a 1
4: 2020-04-01 0 a -3
5: 2020-01-01 1 b 1
6: 2020-02-01 2 b 1
7: 2020-03-01 3 b 1
8: 2020-04-01 0 b -3
And here follows my own solution.
bar <- foo[foo[, .I[which.max(date)], by = group]$V1]
bar <- bar[a != 0][, c('date', 'a') := .(date %m+% months(1), 0)]
foo <- rbindlist(list(foo, bar))
foo[, diff := a - shift(a, fill = 0), by = group]
foo[order(group, date)]
I wonder if a more compact data.table solution exists, for example a fill option that lets a shift operation look back at the past from the point of view of a future row that does not yet exist.
This is more compact but it's largely similar:
foo[
  order(date),
  .(date = c(date, date[.N] %m+% months(1)), a = c(a, 0)),
  by = group
][, diff := a - shift(a, fill = 0)][]
I guess we could also do things in one query:
foo[
  order(date),
  .(
    date = c(date, date[.N] %m+% months(1)),
    a = c(a, 0),
    diff = c(a - shift(a, fill = 0), -a[.N])
  ),
  by = group
]
another option that may be more palatable:
foo[
  order(date),
  {
    out <- rbind(
      .SD,
      data.table(date = date[.N] %m+% months(1), a = 0)
    )
    out[, diff := a - shift(a, fill = 0)]
    out
  },
  by = group
]
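For context (my addition), shift() is what does the lagging in all of the variants above; the fill argument supplies the value pushed in at the front, which is why the first difference in each group is taken against 0:

# with data.table loaded as above
shift(c(1, 2, 3), fill = 0)
# [1] 0 1 2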

How to group the data by id and get unique values of all columns in R?

I have a table with an ID column and other columns. I want to group the data by ID and get the unique values of all columns.
From the above table, group by ID and get unique(Alt1, Alt2, Alt3).
The result should be in vector form:
A -> 1,2,3,5
B ->1,3,4,5,7
We can get data in long format and for each ID make a list of unique values.
library(dplyr)
library(tidyr)
df1 <- df %>%
  pivot_longer(cols = -ID) %>%
  group_by(ID) %>%
  summarise(value = list(unique(value))) %>%
  unnest(value)
df1
# ID value
# <fct> <dbl>
# 1 A 1
# 2 A 3
# 3 A 2
# 4 A 5
# 5 B 1
# 6 B 4
# 7 B 5
# 8 B 3
# 9 B 6
#10 B 7
We can store it as a list if needed using split.
split(df1$value, df1$ID)
#$A
#[1] 1 3 2 5
#$B
#[1] 1 4 5 3 6 7
The data.table equivalent of the above would be:
library(data.table)
setDT(df)
df2 <- melt(df, id.vars = 'ID')[, .(value = list(unique(value))), ID]
The unique values per ID are present in df2$value as vectors (it is a list column).
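If named access like the split() result above is wanted from the data.table version too, a small sketch (the object name out is my own):

out <- setNames(df2$value, df2$ID)
out[["A"]]
# [1] 1 2 3 5   (order follows melt's column-wise stacking)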
data
df <- data.frame(ID = c('A', 'A', 'B', 'B'),
                 Alt1 = c(1, 2, 1, 3),
                 Alt2 = c(3, 5, 4, 6),
                 Alt3 = c(1, 3, 5, 7))

Keep empty groups when grouping with data.table in R

I want to keep empty groups (with a default value like NA or 0) when grouping by multiple conditions.
dt = data.table(user = c("A", "A", "B"), date = c("t1", "t2", "t1"), duration = c(1, 2, 1))
dt[, .("total" = sum(duration)), by = .(date, user)]
Result:
date user total
1: t1 A 1
2: t2 A 2
3: t1 B 1
Desired result:
date user total
1: t1 A 1
2: t2 A 2
3: t1 B 1
4: t2 B NA
One solution could be to add rows with 0 values before grouping, but that would require creating the Cartesian product of many columns and manually checking whether a value already exists for each combination; I would prefer a built-in / simpler approach.
You can try:
dt[CJ(user = user, date = date, unique = TRUE), on = .(user, date)]
user date duration
1: A t1 1
2: A t2 2
3: B t1 1
4: B t2 NA
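The join above keeps the raw duration column; to get the aggregated total (with NA for the empty group) as in the desired result, one possibility (a sketch) is to aggregate first and then join onto the full user/date grid:

agg <- dt[, .(total = sum(duration)), by = .(date, user)]
agg[CJ(user = user, date = date, unique = TRUE), on = .(user, date)]
#    date user total
# 1:   t1    A     1
# 2:   t2    A     2
# 3:   t1    B     1
# 4:   t2    B    NA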
Here is an option with complete from tidyr
library(tidyr)
library(dplyr)
dt1 <- dt[, .("total" = sum(duration)), by = .(date, user)]
dt1 %>%
  complete(user, date)
# user date total
# <chr> <chr> <dbl>
#1 A t1 1
#2 A t2 2
#3 B t1 1
#4 B t2 NA
Or using dcast/melt
melt(dcast(dt, user ~ date, value.var = 'duration', sum),
     id.var = 'user', variable.name = 'date', value.name = 'total')
