R data.table conditional (min/max) aggregation

I'm relatively new to R and I have a question about how to do conditional aggregation with data.table (or other methods) while still accessing the table columns by reference. There was an answer to a similar question here, but it is slow on my data and uses a lot of memory. Here is some toy data:
t <- data.table(User=c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
Obs=c(1,2,3,4,5,1,2,3,4,1,2,3,4,5,6),
Flag=c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0))
Which looks like this:
User Obs Flag
1: 1 1 0
2: 1 2 1
3: 1 3 0
4: 1 4 1
5: 1 5 0
6: 2 1 0
7: 2 2 1
8: 2 3 0
9: 2 4 0
10: 3 1 1
11: 3 2 0
12: 3 3 0
13: 3 4 0
14: 3 5 1
15: 3 6 0
What I would like to do is get, for each row and by user, the maximum observation at or before the current observation where the flag is 1. The output should look like this:
User Obs Flag min.max
1: 1 1 0 NA
2: 1 2 1 2
3: 1 3 0 2
4: 1 4 1 4
5: 1 5 0 4
6: 2 1 0 NA
7: 2 2 1 2
8: 2 3 0 2
9: 2 4 0 2
10: 3 1 1 1
11: 3 2 0 1
12: 3 3 0 1
13: 3 4 0 1
14: 3 5 1 5
15: 3 6 0 5
Any help would be greatly appreciated!

t[, max := Obs[Flag == 1], by = .(User, cumsum(diff(c(0, Flag)) == 1))]
t
# User Obs Flag max
# 1: 1 1 0 NA
# 2: 1 2 1 2
# 3: 1 3 0 2
# 4: 1 4 1 4
# 5: 1 5 0 4
# 6: 2 1 0 NA
# 7: 2 2 1 2
# 8: 2 3 0 2
# 9: 2 4 0 2
#10: 3 1 1 1
#11: 3 2 0 1
#12: 3 3 0 1
#13: 3 4 0 1
#14: 3 5 1 5
#15: 3 6 0 5
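
An alternative that avoids building a run-based grouping key is to carry the last flagged observation forward within each user. The following is a minimal sketch, not the method used in the answer above: it assumes data.table 1.12.4 or later (for `fifelse` and `nafill`) and rows already ordered by Obs within User. The names `dt` and `last_flagged` are illustrative.

```r
library(data.table)

# Toy data from the question (named dt here to avoid masking base::t)
dt <- data.table(User = c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
                 Obs  = c(1,2,3,4,5,1,2,3,4,1,2,3,4,5,6),
                 Flag = c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0))

# Keep Obs only on flagged rows, then carry the last flagged value forward per user
dt[, last_flagged := nafill(fifelse(Flag == 1, Obs, NA_real_), type = "locf"),
   by = User]
dt
```

Rows before a user's first flag stay NA because there is nothing to carry forward.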

Related

Recode when there is a missing category in R

I need help with recoding. Here is what my dataset looks like.
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3, 4,4,4,4,4),
score = c(0,1,0,1,0, 0,2,0,2,2, 0,3,3,0,0, 0,1,3,1,3))
> df
id score
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 2 0
7 2 2
8 2 0
9 2 2
10 2 2
11 3 0
12 3 3
13 3 3
14 3 0
15 3 0
16 4 0
17 4 1
18 4 3
19 4 1
20 4 3
Some ids have missing score categories. When that is the case for an id, I would like to recode the score categories as follows:
a) if the score options are `0,1,2` and score `1` is missing, then `2` needs to be recoded as `1`,
b) if the score options are `0,1,2,3` and scores `1,2` are missing, then `3` needs to be recoded as `1`,
c) if the score options are `0,1,2,3` and score `2` is missing, then `3` needs to be recoded as `2`,
The idea is that there should not be any gaps between score categories.
The desired output would be:
> df.1
id score score.recoded
1 1 0 0
2 1 1 1
3 1 0 0
4 1 1 1
5 1 0 0
6 2 0 0
7 2 2 1
8 2 0 0
9 2 2 1
10 2 2 1
11 3 0 0
12 3 3 1
13 3 3 1
14 3 0 0
15 3 0 0
16 4 0 0
17 4 1 1
18 4 3 2
19 4 1 1
20 4 3 2
library(dplyr)
# Within each id, map the observed scores to consecutive integers starting at 0
df %>%
  group_by(id) %>%
  mutate(score = as.numeric(factor(score)) - 1)
# A tibble: 20 x 2
# Groups: id [4]
id score
<dbl> <dbl>
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 2 0
7 2 1
8 2 0
9 2 1
10 2 1
11 3 0
12 3 1
13 3 1
14 3 0
15 3 0
16 4 0
17 4 1
18 4 2
19 4 1
20 4 2
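Using data.table
library(data.table)
# match() against the sorted unique positive scores gives each score its rank within the id
setDT(df)[, score.recoded := 0L][
  score > 0, score.recoded := match(score, sort(unique(score))), by = id]
Output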
> df
id score score.recoded
<num> <num> <int>
1: 1 0 0
2: 1 1 1
3: 1 0 0
4: 1 1 1
5: 1 0 0
6: 2 0 0
7: 2 2 1
8: 2 0 0
9: 2 2 1
10: 2 2 1
11: 3 0 0
12: 3 3 1
13: 3 3 1
14: 3 0 0
15: 3 0 0
16: 4 0 0
17: 4 1 1
18: 4 3 2
19: 4 1 1
20: 4 3 2
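
Another way to express "no gaps between categories" is a dense rank within each id. This is a sketch under the assumption that every id contains the score 0, so that the smallest rank maps to 0; `score_dense` is an illustrative column name.

```r
library(data.table)

# Data from the question
df <- data.frame(id = rep(1:4, each = 5),
                 score = c(0,1,0,1,0, 0,2,0,2,2, 0,3,3,0,0, 0,1,3,1,3))
setDT(df)

# Dense ranks are 1, 2, 3, ... with no gaps; subtract 1 so the lowest score maps to 0
df[, score_dense := frank(score, ties.method = "dense") - 1L, by = id]
df
```

Unlike matching on first occurrence, a dense rank does not depend on the order in which the scores appear within an id.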

Conditionally replace value in a single row or replace value of following rows with values from previous row group in R

I have a large data.table with over 20,000 rows, with a column for the time point t and a column for the customer id. For each customer id, I am looking for a way to replace the values of y at t = 5:8 by copying the values of y at t = 3 and t = 4.
The data set below is a short version of my data set:
Dt=data.table(
t=rep(1:8, times=3),
y=c(0,1,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0),
id=rep(1:3, each=8))
t y id
1: 1 0 1
2: 2 1 1
3: 3 0 1
4: 4 0 1
5: 5 0 1
6: 6 1 1
7: 7 1 1
8: 8 0 1
9: 1 0 2
10: 2 0 2
11: 3 0 2
12: 4 1 2
13: 5 0 2
14: 6 0 2
15: 7 1 2
16: 8 0 2
17: 1 0 3
18: 2 1 3
19: 3 1 3
20: 4 1 3
21: 5 0 3
22: 6 1 3
23: 7 0 3
24: 8 0 3
In the end it should look like this:
t y id
1: 1 0 1
2: 2 1 1
3: 3 0 1
4: 4 0 1
5: 5 0 1
6: 6 0 1
7: 7 0 1
8: 8 0 1
9: 1 0 2
10: 2 0 2
11: 3 0 2
12: 4 1 2
13: 5 0 2
14: 6 1 2
15: 7 0 2
16: 8 1 2
17: 1 0 3
18: 2 1 3
19: 3 1 3
20: 4 1 3
21: 5 1 3
22: 6 1 3
23: 7 1 3
24: 8 1 3
Do you maybe have an idea how I could solve this? I thought of doing 2 for loops with the range of t and customer id, but I imagine that for this dataset it would take too long.
Thank you in advance!
Your data does not exactly match what is displayed in your post (particularly the rows where t is 3 and 4 within id 3). You could try replace as in the following approach, though I am not sure how efficient it is, since replace builds a modified copy of y for each group before := assigns it.
library(data.table)
Dt[ , y := replace(y, t %in% 5:8, y[t %in% 3:4]), by = id]
Dt
Output
t y id
1: 1 0 1
2: 2 1 1
3: 3 0 1
4: 4 0 1
5: 5 0 1
6: 6 0 1
7: 7 0 1
8: 8 0 1
9: 1 0 2
10: 2 0 2
11: 3 0 2
12: 4 1 2
13: 5 0 2
14: 6 1 2
15: 7 0 2
16: 8 1 2
17: 1 0 3
18: 2 1 3
19: 3 0 3
20: 4 0 3
21: 5 0 3
22: 6 0 3
23: 7 0 3
24: 8 0 3
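
The replacement above relies on `replace()` recycling the two values from t = 3:4 across the four positions t = 5:8. A sketch that makes the recycling explicit with `rep(..., length.out = )`, using the data as given in the question's code:

```r
library(data.table)

# Data as constructed in the question's code
Dt <- data.table(
  t  = rep(1:8, times = 3),
  y  = c(0,1,0,0,0,1,1,0, 0,0,0,1,0,0,1,0, 0,1,0,0,0,1,0,0),
  id = rep(1:3, each = 8))

# For each id, repeat the values at t = 3:4 until they cover every row with t in 5:8
Dt[, y := replace(y, t %in% 5:8,
                  rep(y[t %in% 3:4], length.out = sum(t %in% 5:8))),
   by = id]
Dt
```

Spelling out `length.out` keeps the intent visible if the source and target windows ever stop being exact multiples of each other.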

R Data Table Assign Subset of Rows and Columns with Zero

I'm trying to explode a data table into a time series by populating future time steps with values of zero. The starting data table has the following structure. Values for V1 and V2 can be thought of as values for the first time step.
dt <- data.table(ID = c(1,2,3), V1 = c(1,2,3), V2 = c(4,5,6))
ID V1 V2
1: 1 1 4
2: 2 2 5
3: 3 3 6
What I want to get to is a data table like this
ID year V1 V2
1: 1 1 1 4
2: 1 2 0 0
3: 1 3 0 0
4: 1 4 0 0
5: 1 5 0 0
6: 2 1 2 5
7: 2 2 0 0
8: 2 3 0 0
9: 2 4 0 0
10: 2 5 0 0
11: 3 1 3 6
12: 3 2 0 0
13: 3 3 0 0
14: 3 4 0 0
15: 3 5 0 0
I've exploded the original data table and appended the year column with the following
dt <- dt[, .(year=1:5), by=ID][dt, on=ID, allow.cartesian=T]
ID year V1 V2
1: 1 1 1 4
2: 1 2 1 4
3: 1 3 1 4
4: 1 4 1 4
5: 1 5 1 4
6: 2 1 2 5
7: 2 2 2 5
8: 2 3 2 5
9: 2 4 2 5
10: 2 5 2 5
11: 3 1 3 6
12: 3 2 3 6
13: 3 3 3 6
14: 3 4 3 6
15: 3 5 3 6
Any ideas on how to populate columns V1 and V2 with zeros for year!=1 would be much appreciated. I also need to avoid spelling out the V1 and V2 column names as the actual data table I'm working with has 58 columns.
I got an error with that last step, but if you have a more recent version of data.table that behaves differently, then by all means just:
dt[year != 1, V1 := 0] # logical condition in the 'i' position
dt[year != 1, V2 := 0] # data.table assign in the 'j' position
Ooops. Didn't read to the end. Will see if I can test a range of columns.
A vector of column names can be supplied on the LHS of data.table's assignment operator (:=):
> dt2[year != 1, paste0("V", 1:2) := 0 ]
> dt2
ID V1 V2 year
1: 1 1 4 1
2: 1 0 0 2
3: 1 0 0 3
4: 1 0 0 4
5: 1 0 0 5
6: 2 2 5 1
7: 2 0 0 2
8: 2 0 0 3
9: 2 0 0 4
10: 2 0 0 5
11: 3 3 6 1
12: 3 0 0 2
13: 3 0 0 3
14: 3 0 0 4
15: 3 0 0 5
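
With many value columns, the names can be computed rather than typed. The sketch below assumes that every column other than ID and year should be zeroed; `long` and `value_cols` are illustrative names, and the row expansion uses plain row indexing rather than the join from the question.

```r
library(data.table)

dt <- data.table(ID = c(1, 2, 3), V1 = c(1, 2, 3), V2 = c(4, 5, 6))

# Repeat each row five times and number the copies as years 1..5
long <- dt[rep(seq_len(nrow(dt)), each = 5)]
long[, year := seq_len(.N), by = ID]

# Every column except the identifiers is treated as a value column
value_cols <- setdiff(names(long), c("ID", "year"))

# Wrapping the character vector in parentheses makes := use its contents as column names
long[year != 1, (value_cols) := 0]
long
```

The same `setdiff` call works unchanged for a table with 58 value columns, as long as the identifier columns are listed.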
You can use tidyr::complete:
library(dplyr)
library(tidyr)
dt %>%
mutate(year = 1) %>%
complete(ID, year = 1:5, fill = list(V1 = 0, V2 = 0))
# ID year V1 V2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 4
# 2 1 2 0 0
# 3 1 3 0 0
# 4 1 4 0 0
# 5 1 5 0 0
# 6 2 1 2 5
# 7 2 2 0 0
# 8 2 3 0 0
# 9 2 4 0 0
#10 2 5 0 0
#11 3 1 3 6
#12 3 2 0 0
#13 3 3 0 0
#14 3 4 0 0
#15 3 5 0 0

What is the R function for detecting successive differences in a data frame?

I use the following code in R and it works very well. More precisely, I compare each Cluster_id with the last Cluster_ref to see when they differ 2 periods in a row (the data is organized by Fund_number). However, I would like to adapt it to 5 periods, and I cannot make it work. Do you have any idea how I can modify this code to solve my problem?
library(dplyr)
library(purrr)
get_output <- function(mon, ref){
  # TRUE where the cluster id is present and differs from the group's last reference
  exp <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
  # 1 where the current and the previous period both differ
  as.integer(exp & lag(exp, default = FALSE))
}
df %>%
  arrange(Fund_number, rolling_window) %>%
  group_by(Fund_number) %>%
  mutate(Deviation = get_output(Cluster_id, Cluster_ref)) %>%
  ungroup()
rolling_window Fund_number Cluster_id Cluster_ref Expected_output
1 1 10 10 0
2 1 10 10 0
3 1 8 9 0
4 1 8 8 0
5 1 7 7 0
6 1 8 8 0
7 1 8 NA 1
8 1 7 NA 1
9 1 7 10 1
10 1 10 10 0
1 2 NA NA 0
2 2 NA 3 0
3 2 3 3 0
4 2 2 5 0
5 2 2 NA 0
6 2 2 4 0
7 2 2 4 1
8 2 5 5 0
9 2 4 5 0
10 2 3 5 0
This is what I want.
So as you can see, the data is organized by Fund_number. I look at the last Cluster_ref for each fund (so every 8 rows) and compare it to each Cluster_id for that fund. As soon as they differ at least 5 periods in a row, I want 1, otherwise 0. So for each fund, I compare the 8th Cluster_ref with the Cluster_id of rows 1 to 8.
The code above does this, but for 2 time periods.
Thank you very much,
Vanie
In data.table we can use rleid on the comparison between Cluster_id and the last Cluster_ref.
library(data.table)
setDT(df)[, temp := rleid(last(Cluster_ref) != Cluster_id), Fund_number]
df[, output := +(seq_along(Cluster_ref) >= 5), .(Fund_number, temp)]
df[, temp := NULL]
df
# rolling_window Fund_number Cluster_id Cluster_ref Expected_output output
# 1: 1 1 10 10 0 0
# 2: 2 1 10 10 0 0
# 3: 3 1 8 9 0 0
# 4: 4 1 8 8 0 0
# 5: 5 1 7 7 0 0
# 6: 6 1 8 8 0 0
# 7: 7 1 8 NA 1 1
# 8: 8 1 7 NA 1 1
# 9: 9 1 7 10 1 1
#10: 10 1 10 10 0 0
#11: 1 2 NA NA 0 0
#12: 2 2 NA 3 0 0
#13: 3 2 3 3 0 0
#14: 4 2 2 5 0 0
#15: 5 2 2 NA 0 0
#16: 6 2 2 4 0 0
#17: 7 2 2 4 1 1
#18: 8 2 5 5 0 0
#19: 9 2 4 5 0 0
#20: 10 2 3 5 0 0
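
Note that grouping on rleid alone counts the position inside any run, including runs where Cluster_id matches the reference or is missing. A hedged variant that only flags runs of actual differences is sketched below; the column name `differs` and the parameter `n_periods` are assumptions for illustration, and the data are retyped from the table in the question.

```r
library(data.table)

df <- data.table(
  rolling_window = rep(1:10, times = 2),
  Fund_number    = rep(1:2, each = 10),
  Cluster_id     = c(10,10,8,8,7,8,8,7,7,10, NA,NA,3,2,2,2,2,5,4,3),
  Cluster_ref    = c(10,10,9,8,7,8,NA,NA,10,10, NA,3,3,5,NA,4,4,5,5,5))

n_periods <- 5  # number of consecutive differing periods required

# TRUE where Cluster_id is present and differs from the fund's last Cluster_ref
df[, differs := !is.na(Cluster_id) & Cluster_id != last(Cluster_ref),
   by = Fund_number]

# Within each run of identical `differs` values, flag positions n_periods and later,
# but only when the run is a run of differences
df[, output := as.integer(differs & seq_len(.N) >= n_periods),
   by = .(Fund_number, rleid(differs))]
df
```

Changing `n_periods` adapts the rule to a different window length without touching the rest of the code.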

Conditional Series Fill in R

Looking for a way to fill in a vector with new values conditional on values within that vector and another variable in the data frame. Pasted an example of what the data looks like below.
PrsVar= c(rep(1,10),rep(2,7),rep(3,11))
IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0)
OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3)
exampdata <- cbind(PrsVar,IndVar,OutVar)
exampdata <- as.data.frame(exampdata)
> exampdata
PrsVar IndVar OutVar
1 1 0 1
2 1 0 1
3 1 0 1
4 1 1 1
5 1 0 2
6 1 0 2
7 1 1 2
8 1 0 3
9 1 0 3
10 1 0 3
11 2 0 1
12 2 0 1
13 2 0 1
14 2 1 1
15 2 0 2
16 2 0 2
17 2 1 2
18 3 0 1
19 3 0 1
20 3 0 1
21 3 1 1
22 3 0 2
23 3 0 2
24 3 0 2
25 3 1 2
26 3 0 3
27 3 0 3
28 3 0 3
This is time-series data and each row represents a person-day. PrsVar is an ID for an individual in the study and IndVar is an indicator that an episode has ended on that person-day. The person-day after that represents a new episode.
I'd like to create a variable that looks like OutVar using just the values from PrsVar and IndVar. This new variable OutVar labels the episode each person-day is in, incrementing by 1, and starting over at 1 for each new individual.
I could run this through a loop, but I need more efficient code to work with 3,000,000+ rows of data. Was trying to use something in dplyr or maybe mapply, but I'm stumped. Thinking a solution to this would be helpful to others and would certainly be helpful to me again in the near future.
The data.table package offers a fast, memory-efficient, and tidy way to do this. The new column is added by reference (the table is not copied), so millions of rows should not be an issue (under a minute, maybe).
library(data.table)
patient.data <- data.table(PrsVar = c(rep(1,10), rep(2,7), rep(3,11)),
IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0),
OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3))
Increment an episode counter EpVar based on the cumulative sum of IndVar (plus 1). This increases the counter at the record where IndVar increases (which is too early) so shift it down a record with shift, replacing the missing value with a reset counter (1). This can be done groupwise with the by keyword.
patient.data[ , EpVar:=shift(1+cumsum(IndVar), fill=1), by=PrsVar]
patient.data
PrsVar IndVar OutVar EpVar
1: 1 0 1 1
2: 1 0 1 1
3: 1 0 1 1
4: 1 1 1 1
5: 1 0 2 2
6: 1 0 2 2
7: 1 1 2 2
8: 1 0 3 3
9: 1 0 3 3
10: 1 0 3 3
11: 2 0 1 1
12: 2 0 1 1
13: 2 0 1 1
14: 2 1 1 1
15: 2 0 2 2
16: 2 0 2 2
17: 2 1 2 2
18: 3 0 1 1
19: 3 0 1 1
20: 3 0 1 1
21: 3 1 1 1
22: 3 0 2 2
23: 3 0 2 2
24: 3 0 2 2
25: 3 1 2 2
26: 3 0 3 3
27: 3 0 3 3
28: 3 0 3 3
A bit ugly, but this logic should be easily adaptable to other methods:
with(exampdata,
ave(IndVar, PrsVar, FUN=function(x) {
out <- rev(cumsum(rev(x)))
max(out) - out + 1
})
)
# [1] 1 1 1 1 2 2 2 3 3 3 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2 3 3 3
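
The same shift-then-accumulate idea can be written in base R without reversing the vector. A minimal sketch, with `EpVar` as an illustrative column name: prepend a 0 and drop the last indicator so that the counter increases on the day after an episode ends.

```r
# Data from the question
PrsVar <- c(rep(1, 10), rep(2, 7), rep(3, 11))
IndVar <- c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0)
OutVar <- c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3)
exampdata <- data.frame(PrsVar, IndVar, OutVar)

# Per person: lag the indicator by one day, accumulate it, and start counting at 1
exampdata$EpVar <- with(exampdata,
  ave(IndVar, PrsVar, FUN = function(x) cumsum(c(0, head(x, -1))) + 1))
exampdata
```

Because `ave` applies the function within each PrsVar group, the counter restarts at 1 for every individual.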
