How to cumulative sum a variable by unique values and input back in - R

I'm looking to do the following -- cumulative sum the indicator values and remove the indicators after those days
original:
transaction day indicator
1 1 0
1 2 0
1 3 0
1 4 1
1 5 1
1 6 1
2 1 0
2 2 0
2 3 0
2 4 0
2 5 1
2 6 1
and make the new table like this --
transaction day indicator
1 1 0
1 2 0
1 3 0
1 4 3
2 1 0
2 2 0
2 3 0
2 4 0
2 5 2

Change every day with indicator == 1 to the first day with indicator == 1 in that transaction, then sum indicator within each transaction and day:
library(dplyr)
df %>%
group_by(transaction) %>%
mutate(day = case_when(indicator == 0 ~ day,
TRUE ~ head(day[indicator == 1], 1))) %>%
group_by(transaction, day) %>%
summarise(indicator = sum(indicator)) %>%
ungroup()
transaction day indicator
<int> <int> <int>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 3
5 2 1 0
6 2 2 0
7 2 3 0
8 2 4 0
9 2 5 2
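The answers in this thread assume a data frame `df` holding the original table, which the question never constructs. As a minimal sketch, the block below rebuilds it (the numeric column types are an assumption) and reproduces the same collapse in base R with `ave` and `aggregate`. This is an alternative illustration, not the method used in the answers.

```r
# Rebuild the question's table (column types are assumed)
df <- data.frame(
  transaction = rep(1:2, each = 6),
  day         = rep(1:6, times = 2),
  indicator   = c(0, 0, 0, 1, 1, 1,
                  0, 0, 0, 0, 1, 1)
)

# First day with indicator == 1, computed per transaction
first_day <- ave(ifelse(df$indicator == 1, df$day, NA), df$transaction,
                 FUN = function(x) min(x, na.rm = TRUE))

# Move every flagged row onto that first day, then sum within transaction and day
df$day <- ifelse(df$indicator == 1, first_day, df$day)
res <- aggregate(indicator ~ transaction + day, data = df, FUN = sum)
res <- res[order(res$transaction, res$day), ]
rownames(res) <- NULL
res
```

Note that `min(x, na.rm = TRUE)` warns and returns `Inf` for a transaction that never has `indicator == 1`. The example data has no such transaction.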

Please try the code below.
code
library(dplyr)
library(tidyr)  # fill() comes from tidyr
df <- bind_rows(df1, df2) %>% group_by(transaction) %>%
mutate(cumsum=cumsum(indicator), cumsum2=ifelse(cumsum==1, day, NA)) %>%
fill(cumsum2) %>%
mutate(day=ifelse(!is.na(cumsum2), cumsum2, day)) %>%
group_by(transaction, day) %>% slice_tail(n=1) %>% select(-cumsum2)
Created on 2023-01-19 with reprex v2.0.2
output
# A tibble: 8 × 4
# Groups: transaction, day [8]
transaction day indicator cumsum
<dbl> <int> <dbl> <dbl>
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 1 3
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 2

Another approach to try. After grouping by transaction, change indicator to either 0 (unchanged) or the group's sum of indicator. Then filter to keep only the rows for which every earlier indicator value is 0, using cumall (cumulative all). Testing lag(indicator) rather than indicator itself means the first row containing the sum is kept as well.
library(tidyverse)
df %>%
group_by(transaction) %>%
mutate(indicator = ifelse(indicator == 0, 0, sum(indicator))) %>%
filter(cumall(lag(indicator, default = 0) == 0))
Output
transaction day indicator
<int> <int> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 3
5 2 1 0
6 2 2 0
7 2 3 0
8 2 4 0
9 2 5 2
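To see what the `cumall(lag(...) == 0)` filter does, here is a small base R sketch on the values transaction 1 holds after the `mutate` step. The expression `cumsum(lagged != 0) == 0` is used as a stand-in for `dplyr::cumall(lagged == 0)`. That equivalence assumes there are no `NA` values.

```r
# indicator for transaction 1 after replacing each 1 with the group sum
ind <- c(0, 0, 0, 3, 3, 3)

# lag(ind, default = 0): shift by one row, padding the front with 0
lagged <- c(0, head(ind, -1))

# TRUE for as long as every lagged value seen so far has been 0
keep <- cumsum(lagged != 0) == 0

keep       # TRUE TRUE TRUE TRUE FALSE FALSE
ind[keep]  # 0 0 0 3
```

Without the lag, the test would already fail on the row that holds the sum, and that row would be dropped.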

Related

Is there a R function for preparing datasets for survival analysis like stset in Stata?

Datasets look like this
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
As you see, when id = 1, it's just the data input to coxph in survival package. However, when id = 2, at the beginning and end, failure occurs, but in the middle, failure disappears.
Is there a general function to extract data from id = 2 and get the result like id = 1?
I think when id = 2, the result should look like below.
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
A bit hacky, but should get the job done.
Data:
# Load data
library(tidyverse)
df <- read_table("
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
")
Data wrangling:
# Check for sub-groups within IDs and remove all but the last one
df <- df %>%
# Group by ID
group_by(
id
) %>%
mutate(
# Check if a new sub-group is starting (after a failure)
new_group = case_when(
# First row is always group 0
row_number() == 1 ~ 0,
# If previous row was a failure, then a new sub-group starts here
lag(failure) == 1 ~ 1,
# Otherwise not
TRUE ~ 0
),
# Assign sub-group number by calculating cumulative sums
group = cumsum(new_group)
) %>%
# Keep only last sub-group for each ID
filter(
group == max(group)
) %>%
ungroup() %>%
# Remove working columns
select(
-new_group, -group
)
Result:
> df
# A tibble: 6 × 5
id start end failure x1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0
2 1 1 3 0 0
3 1 3 6 1 0
4 2 3 4 0 1
5 2 4 6 0 1
6 2 6 7 1 1
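The sub-group numbering in the wrangling step can be traced by hand on the failure column for id = 2. The sketch below is a base R illustration of the same idea, where padding the first row with 0 plays the role of the `row_number() == 1 ~ 0` branch.

```r
# failure values for id = 2, in row order
failure <- c(1, 1, 0, 0, 1)

# previous row's failure; the first row has no predecessor, so use 0
prev <- c(0, head(failure, -1))

# a new sub-group starts on every row that follows a failure
group <- cumsum(prev == 1)

# keep only the last sub-group
keep <- group == max(group)

group  # 0 1 2 2 2
keep   # FALSE FALSE TRUE TRUE TRUE
```

The three kept rows correspond to the intervals 3-4, 4-6 and 6-7 in the desired result.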

How to get the number of consecutive zeroes from a column in a dataframe

I'm trying to work out how to get the number of consecutive zeroes for a given column for a dataframe.
Here is a dataframe:
data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2), value = c(1,0,0,1,0,0,0,0,0,0,4,3))
This would be the desired output:
id value consec
1 1 0
1 0 2
1 0 2
1 1 0
1 0 2
1 0 2
2 0 4
2 0 4
2 0 4
2 0 4
2 4 0
2 3 0
Any ideas on how to achieve this output?
Many thanks
You can do:
data$consec <- with(data, ave(value, value, cumsum(value != 0), id, FUN = length) - (value != 0))
data
id value consec
1 1 1 0
2 1 0 2
3 1 0 2
4 1 1 0
5 1 0 2
6 1 0 2
7 2 0 4
8 2 0 4
9 2 0 4
10 2 0 4
11 2 4 0
12 2 3 0
Here's a base R solution using interaction and rle (run-length encoding):
rlid <- rle(as.numeric(interaction(data$id, data$value)))$lengths
data$consec <- replace(rep(rlid, rlid), data$value != 0, 0)
data
#> id value consec
#> 1 1 1 0
#> 2 1 0 2
#> 3 1 0 2
#> 4 1 1 0
#> 5 1 0 2
#> 6 1 0 2
#> 7 2 0 4
#> 8 2 0 4
#> 9 2 0 4
#> 10 2 0 4
#> 11 2 4 0
#> 12 2 3 0
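As a short illustration of the run-length step, the sketch below applies `rle` to the values of id 1 alone. In the answer above, `interaction(data$id, data$value)` additionally prevents a run from crossing an id boundary, which this single-id example does not need.

```r
x <- c(1, 0, 0, 1, 0, 0)

r <- rle(x)
r$lengths  # 1 2 1 2

# expand each run length back over the rows of that run
run_len <- rep(r$lengths, r$lengths)  # 1 2 2 1 2 2

# rows with a non-zero value get 0 instead of their run length
consec <- replace(run_len, x != 0, 0)
consec  # 0 2 2 0 2 2
```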
This dplyr solution will work. Using a cumulative sum we keep track of every time a new non-zero entry occurs, and for each of these groups we count the number of zeros:
library(dplyr)
data %>%
group_by(id) %>%
mutate(flag_0 = cumsum(value != 0)) %>%
group_by(id, flag_0) %>%
mutate(conseq = ifelse(value == 0, sum(value == 0), 0)) %>%
ungroup()
# A tibble: 12 x 4
id value flag_0 conseq
<dbl> <dbl> <int> <dbl>
1 1 1 1 0
2 1 0 1 2
3 1 0 1 2
4 1 1 2 0
5 1 0 2 2
6 1 0 2 2
7 2 0 0 4
8 2 0 0 4
9 2 0 0 4
10 2 0 0 4
11 2 4 1 0
12 2 3 2 0
This tidyverse approach can also do the job
library(tidyverse)
data %>% group_by(id) %>%
mutate(value2 =cumsum(value)) %>% group_by(id, value, value2) %>%
mutate(consec = ifelse(value == 0, n(), 0)) %>%
ungroup() %>% select(-value2)
# A tibble: 12 x 3
id value consec
<dbl> <dbl> <dbl>
1 1 1 0
2 1 0 2
3 1 0 2
4 1 1 0
5 1 0 2
6 1 0 2
7 2 0 4
8 2 0 4
9 2 0 4
10 2 0 4
11 2 4 0
12 2 3 0

Only Use The First Match For Every N Rows

I have a data.frame that looks like this.
Date Number
1 1
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 1
I would like to create a new column that puts a 1 in the column if it is the first 1 of every 3 rows. Otherwise put a 0. For example, this is how I would like the new data.frame to look
Date Number New
1 1 1
2 0 0
3 1 0
4 0 0
5 0 0
6 1 1
7 0 0
8 0 0
9 1 1
Every three rows we find the first 1 and populate the column otherwise we place a 0. Thank you.
Hmm, at first glance I thought akrun's answer provided the solution. However, it is not exactly what I am looking for. Here is what @akrun's solution provides.
df1 = data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))
head(df1,9)
Number
1 1
2 0
3 1
4 0
5 1
6 1
7 1
8 0
9 1
Attempt at solution:
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(Number == row_number()))
Number grp New
<dbl> <int> <int>
1 1 1 1
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0 #should be a 1
6 1 2 0
7 1 3 1
8 0 3 0
9 1 3 0
As you can see, the code misses the 1 on row 5. I am looking for the first 1 in every chunk. Then everything else should be 0.
Sorry if I was unclear, akrun.
Edit: akrun's new answer is exactly what I am looking for. Thank you very much.
Here is an option to create a grouping column with gl and then do a == with the row_number on the index of matched 1. Here, match will return only the index of the first match.
library(dplyr)
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(row_number() == match(1, Number, nomatch = 0)))
# A tibble: 12 x 3
# Groups: grp [4]
# Number grp New
# <dbl> <int> <int>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
# 4 0 2 0
# 5 1 2 1
# 6 1 2 0
# 7 1 3 1
# 8 0 3 0
# 9 1 3 0
#10 0 4 0
#11 0 4 0
#12 0 4 0
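The behaviour of `match` that this answer relies on can be checked on a single chunk of three rows. This is a minimal base R sketch, where `seq_along(chunk)` stands in for `row_number()` inside a group.

```r
# match() returns the position of the FIRST match only
match(1, c(0, 1, 1))               # 2

# with no 1 in the chunk, nomatch = 0 gives a position no row can equal
match(1, c(0, 0, 0), nomatch = 0)  # 0

chunk <- c(0, 1, 1)
new_col <- +(seq_along(chunk) == match(1, chunk, nomatch = 0))
new_col  # 0 1 0
```

The unary `+` converts the logical comparison to integer 0/1.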
Looking at the logic, perhaps you want to check if Number == 1 and that the prior 2 values were both 0. If that is not correct please let me know.
library(dplyr)
df %>%
mutate(New = ifelse(Number == 1 & lag(Number, n = 1L, default = 0) == 0 & lag(Number, n = 2L, default = 0) == 0, 1, 0))
Output
Date Number New
1 1 1 1
2 2 0 0
3 3 1 0
4 4 0 0
5 5 0 0
6 6 1 1
7 7 0 0
8 8 0 0
9 9 1 1
You can replace Number value to 0 except for the 1st occurrence of 1 in each 3 rows.
library(dplyr)
df %>%
group_by(gr = ceiling(row_number()/3)) %>%
mutate(New = replace(Number, -which.max(Number), 0)) %>%
#Or to be safe and specific use
#mutate(New = replace(Number, -which(Number == 1)[1], 0)) %>%
ungroup() %>% select(-gr)
# A tibble: 9 x 3
# Date Number New
# <int> <int> <int>
#1 1 1 1
#2 2 0 0
#3 3 1 0
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1
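Two details of this answer are easy to verify in base R. First, `ceiling(row_number()/3)` produces the same chunk ids as the `gl` call used in the other answer. Second, `which.max` returns 1 when a chunk contains no 1 at all, which is harmless here because the retained first value is itself 0. The sketch below uses `seq_len(n)` in place of `row_number()`.

```r
n <- 9

# both expressions label rows 1-3 as chunk 1, rows 4-6 as chunk 2, and so on
grp_ceiling <- ceiling(seq_len(n) / 3)
grp_gl      <- as.integer(gl(n, 3, n))
all(grp_ceiling == grp_gl)  # TRUE

# chunk with a 1: everything except the first maximum is set to 0
new_hit  <- replace(c(0, 1, 1), -which.max(c(0, 1, 1)), 0)  # 0 1 0

# chunk without a 1: which.max() is 1, and the value kept there is already 0
new_none <- replace(c(0, 0, 0), -which.max(c(0, 0, 0)), 0)  # 0 0 0
```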

Constructing variable lags based on additional condition

I want to create a lagged variable based on the following additional condition and operations:
When the lag (previous row) of the variable (day_active) is 1, it should also take the lag of the variable n_wins
When the lag (previous row) of day_active is 0, it should just repeat the value of n_wins of the previous row as long as day_active remains 0.
Let's assume we observe a game player for ten days. day_active indicates if he was active on that day and n_wins indicates the number of games he won.
Example dataset:
da = data.frame(day = c(1,2,3,4,5,6,7,8,9,10), day_active = c(1,1,0,0,1,1,0,0,1,1), n_wins = c(2,3,0,0,1,0,0,0,0,1))
da
day day_active n_wins
1 1 1 2
2 2 1 3
3 3 0 0
4 4 0 0
5 5 1 1
6 6 1 0
7 7 0 0
8 8 0 0
9 9 1 0
10 10 1 1
This is how it should look after the transformation:
da2 = data.frame(day = c(1,2,3,4,5,6,7,8,9,10), day_active = c(1,1,0,0,1,1,0,0,1,1), n_wins = c(2,3,0,0,1,0,0,0,0,1), lag_n_wins = c(NA,2,3,3,3,1,0,0,0,0))
da2
day day_active n_wins lag_n_wins
1 1 1 2 NA
2 2 1 3 2
3 3 0 0 3
4 4 0 0 3
5 5 1 1 3
6 6 1 0 1
7 7 0 0 0
8 8 0 0 0
9 9 1 0 0
10 10 1 1 0
We can create a grouping column that starts a new group at every 1 in 'day_active' by taking the cumulative sum of the logical vector. Within each group, if all the 'n_wins' values are 0 we keep them as they are; otherwise we replace the 0s with NA (na_if) and fill each NA with the previous non-NA element using na.locf0 (from zoo). Finally we ungroup and take the lag of the column created.
library(dplyr)
da %>%
group_by(grp = cumsum(day_active == 1)) %>%
mutate(lag_n_wins = zoo::na.locf0(if(all(n_wins == 0)) n_wins
else na_if(n_wins, 0)) ) %>%
ungroup %>%
mutate(lag_n_wins = lag(lag_n_wins)) %>%
select(-grp)
# A tibble: 10 x 4
# day day_active n_wins lag_n_wins
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 NA
# 2 2 1 3 2
# 3 3 0 0 3
# 4 4 0 0 3
# 5 5 1 1 3
# 6 6 1 0 1
# 7 7 0 0 0
# 8 8 0 0 0
# 9 9 1 0 0
#10 10 1 1 0
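For readers without dplyr or zoo, the same steps can be sketched in base R. The helper `fill_group` below is a hypothetical name introduced for this illustration. It carries the last non-NA value forward by hand instead of calling `zoo::na.locf0`.

```r
da <- data.frame(day        = 1:10,
                 day_active = c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
                 n_wins     = c(2, 3, 0, 0, 1, 0, 0, 0, 0, 1))

# a new group starts on every active day
grp <- cumsum(da$day_active == 1)

# hypothetical helper: within one group, carry the last non-zero value forward
fill_group <- function(x) {
  if (all(x == 0)) return(x)    # all-zero groups are left untouched
  x[x == 0] <- NA               # same role as na_if(n_wins, 0)
  idx <- cumsum(!is.na(x))      # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]  # idx == 0 (leading NA) stays NA
}

filled <- ave(da$n_wins, grp, FUN = fill_group)

# shift by one row to obtain the lagged column
da$lag_n_wins <- c(NA, head(filled, -1))
da$lag_n_wins  # NA 2 3 3 3 1 0 0 0 0
```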

R: Long-data: how to remove all following obs within same ID once condition is met?

I have long data looking like this for example:
ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
2 3 1
2 4 0
3 1 1
3 2 1
3 3 0
3 4 0
4 1 0
4 2 1
4 3 NA
4 4 NA
I want to only keep those rows before condition is met once so I want:
ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
3 1 1
4 1 0
4 2 1
I tried to loop but a) it said looping is not good coding style in R and b) it won't work.
Sidenote: just if you are wondering, it does make sense that IDs have condition and then lose it again in my example, but I am only interested in when they first had it.
Thank you.
Here's an easy way with dplyr:
library(dplyr)
df %>% group_by(ID) %>%
filter(row_number() <= which.max(condition) | sum(condition) == 0)
# # A tibble: 7 x 3
# # Groups: ID [3]
# ID time condition
# <int> <int> <int>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 0
# 4 1 4 1
# 5 2 1 0
# 6 2 2 1
# 7 3 1 1
It relies on which.max which returns the index of the first maximum value in vector. The | sum(condition) == 0 takes care to keep censored cases (where condition is always 0).
Using this data:
df <- read.table(header = TRUE, text = '
ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
2 3 1
2 4 0
3 1 1
3 2 1
3 3 0
3 4 0')
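The role of each half of the filter condition in this answer can be seen on two plain vectors. This is a minimal base R sketch, with `seq_along()` standing in for `row_number()` within one ID.

```r
# which.max() gives the position of the first maximum
which.max(c(0, 1, 1, 0))  # 2

# for an all-zero vector the first maximum is simply position 1
which.max(c(0, 0, 0))     # 1

# an ID where the condition is met: keep rows up to the first 1
hit <- c(0, 1, 1, 0)
keep_hit <- seq_along(hit) <= which.max(hit) | sum(hit) == 0
keep_hit   # TRUE TRUE FALSE FALSE

# an ID where it is never met: the sum() clause keeps every row
none <- c(0, 0, 0)
keep_none <- seq_along(none) <= which.max(none) | sum(none) == 0
keep_none  # TRUE TRUE TRUE
```

Without the `sum(condition) == 0` clause, an ID that never meets the condition would be cut down to its first row.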
