I'm stuck trying to keep rows based on a condition in R. I want to collapse duplicated IDs into one row per ID, applying the same condition across a large number of columns. In the example below, for each ID with duplicated rows, a column should be 0 if any of that ID's rows has the value 0 in that column.
Here is the data frame:
ID A B C
1 001 1 1 1
2 002 0 1 0
3 002 1 0 0
4 003 0 1 1
5 003 1 0 1
6 003 0 0 1
I want to get a result like this:
ID A B C
1 001 1 1 1
2 002 0 0 0
3 003 0 0 1
Any help would be much appreciated, thanks!
Please check this code
# A tibble: 6 × 4
ID A B C
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 0 1 0
3 2 1 0 0
4 3 0 1 1
5 3 1 0 1
6 3 0 0 1
code
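library(dplyr)
library(tidyr) # for fill()
data2 <- data %>%
  group_by(ID) %>%
  # 0 where the value is 0, NA otherwise, stored in new columns xA, xB, xC
  mutate(across(c('A','B','C'), ~ ifelse(.x == 0, 0, NA), .names = 'x{col}')) %>%
  fill(xA, xB, xC) %>% # carry each 0 down within the ID
  mutate(across(c('xA','xB','xC'), ~ ifelse(is.na(.x), 1, .x))) %>% # never saw a 0: set to 1
  slice_tail(n = 1) # the last row per ID holds the result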
output
# A tibble: 3 × 7
# Groups: ID [3]
ID A B C xA xB xC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 2 1 0 0 0 0 0
3 3 0 0 1 0 0 1
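Since the columns here only hold 0 and 1, the same result can probably be reached more directly by taking the per-ID minimum of each column. This is a minimal sketch, assuming the values really are limited to 0 and 1; the tibble `data` is rebuilt from the example and `result` is just an illustrative name.

```r
library(dplyr)

# Example data from the question (ID stored as a number, as in the tibble above)
data <- tibble(
  ID = c(1, 2, 2, 3, 3, 3),
  A  = c(1, 0, 1, 0, 1, 0),
  B  = c(1, 1, 0, 1, 0, 0),
  C  = c(1, 0, 0, 1, 1, 1)
)

# For 0/1 columns, the group minimum is 0 as soon as any row in the group is 0
result <- data %>%
  group_by(ID) %>%
  summarise(across(c(A, B, C), min), .groups = "drop")

result
```

With a large number of columns, `across(everything(), min)` should cover all of them, since `across()` leaves out the grouping column.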
I'm looking to do the following: sum the indicator values within each transaction, put the total on the first day where indicator is 1, and remove the rows for the days after it.
original:
transaction day indicator
1 1 0
1 2 0
1 3 0
1 4 1
1 5 1
1 6 1
2 1 0
2 2 0
2 3 0
2 4 0
2 5 1
2 6 1
and make the new table look like this:
transaction day indicator
1 1 0
1 2 0
1 3 0
1 4 3
2 1 0
2 2 0
2 3 0
2 4 0
2 5 2
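Within each transaction, change every day with indicator == 1 to the first day with indicator == 1, then sum indicator by transaction and day:
library(dplyr)
df %>%
  group_by(transaction) %>%
  # move every indicator == 1 row onto the first such day
  mutate(day = case_when(indicator == 0 ~ day,
                         TRUE ~ head(day[indicator == 1], 1))) %>%
  group_by(transaction, day) %>%
  summarise(indicator = sum(indicator)) %>%
  ungroup()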
transaction day indicator
<int> <int> <int>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 3
5 2 1 0
6 2 2 0
7 2 3 0
8 2 4 0
9 2 5 2
Please try the below code
code
library(dplyr)
library(tidyr) # for fill()
# df1 and df2 are input data frames that this answer does not show
df <- bind_rows(df1, df2) %>%
  group_by(transaction) %>%
  mutate(cumsum = cumsum(indicator), cumsum2 = ifelse(cumsum == 1, day, NA)) %>%
  fill(cumsum2) %>%
  mutate(day = ifelse(!is.na(cumsum2), cumsum2, day)) %>%
  group_by(transaction, day) %>% slice_tail(n = 1) %>% select(-cumsum2)
Created on 2023-01-19 with reprex v2.0.2
output
# A tibble: 8 × 4
# Groups: transaction, day [8]
transaction day indicator cumsum
<dbl> <int> <dbl> <dbl>
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 1 4 1 3
5 2 1 0 0
6 2 2 0 0
7 2 3 0 0
8 2 4 1 2
Another approach to try. After grouping by transaction, leave indicator at 0 where it is 0 and otherwise replace it with the group's sum of indicator. Then filter to keep rows for as long as all previous values of indicator are 0, using cumall (cumulative all). Applying lag shifts the comparison by one row, so the first row carrying the sum is kept as well.
library(tidyverse)
df %>%
  group_by(transaction) %>%
  mutate(indicator = ifelse(indicator == 0, 0, sum(indicator))) %>%
  filter(cumall(lag(indicator, default = 0) == 0))
Output
transaction day indicator
<int> <int> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 3
5 2 1 0
6 2 2 0
7 2 3 0
8 2 4 0
9 2 5 2
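To see why the lag matters, it may help to run the filter condition on a single vector. The sketch below uses the indicator values that transaction 1 would have after the mutate() step; the variable names `previous` and `keep` are only for illustration.

```r
library(dplyr)

# indicator for transaction 1 after mutate(): zeros stay 0, ones become the group sum
indicator <- c(0, 0, 0, 3, 3, 3)

# lag() shifts the values down one row, so each row sees the previous row's value
previous <- lag(indicator, default = 0)

# cumall() stays TRUE until the first FALSE, then is FALSE for all later rows
keep <- cumall(previous == 0)

# Rows kept by the filter: the leading zeros plus the first row holding the sum
indicator[keep]
```

Without lag(), the condition would already be FALSE on the first row holding the sum, so that row would be dropped too.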
Datasets look like this
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
As you can see, the rows for id = 1 are already in the form that coxph in the survival package expects as input. For id = 2, however, failure occurs at the beginning and at the end, but not in the middle.
Is there a general function to extract the rows for id = 2 so that they look like those for id = 1?
I think the result, with id = 2 cleaned up, should look like this:
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
A bit hacky, but should get the job done.
Data:
# Load data
library(tidyverse)
df <- read_table("
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
")
Data wrangling:
# Check for sub-groups within IDs and remove all but the last one
df <- df %>%
  # Group by ID
  group_by(id) %>%
  mutate(
    # Check if a new sub-group is starting (after a failure)
    new_group = case_when(
      # First row is always group 0
      row_number() == 1 ~ 0,
      # If previous row was a failure, then a new sub-group starts here
      lag(failure) == 1 ~ 1,
      # Otherwise not
      TRUE ~ 0
    ),
    # Assign sub-group number by calculating cumulative sums
    group = cumsum(new_group)
  ) %>%
  # Keep only last sub-group for each ID
  filter(group == max(group)) %>%
  ungroup() %>%
  # Remove working columns
  select(-new_group, -group)
Result:
> df
# A tibble: 6 × 5
id start end failure x1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0
2 1 1 3 0 0
3 1 3 6 1 0
4 2 3 4 0 1
5 2 4 6 0 1
6 2 6 7 1 1
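The sub-group numbering can be checked on a single vector. This is a small sketch using the failure column for id = 2; it folds the first-row rule into `lag(..., default = 0)`, which is a slight simplification of the case_when above rather than the exact same code.

```r
library(dplyr)

# failure column for id = 2 in the example data
failure <- c(1, 1, 0, 0, 1)

# A new sub-group starts on any row whose previous row was a failure;
# default = 0 means the first row never starts a new sub-group
new_group <- as.integer(lag(failure, default = 0) == 1)

# The running total of the start flags gives a sub-group number per row
group <- cumsum(new_group)

# Positions of the rows that belong to the last sub-group
which(group == max(group))
```

The last sub-group covers rows 3 to 5 of id = 2, which are the intervals (3, 4), (4, 6) and (6, 7) kept in the result.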
I'm trying to calculate a running count (i.e., cumulative sum) that is conditional on other variables and that can reset for particular values on another variable. I'm working in R and would prefer a dplyr-based solution, if possible.
I'd like to create a variable for the running count, cumulative, based on the following algorithm:
Calculate the running count (cumulative) within combinations of id and age
Increment running count (cumulative) by 1 for every subsequent trial where accuracy = 0, block = 2, and condition = 1
Reset running count (cumulative) to 0 for each trial where accuracy = 1, block = 2, and condition = 1, and the next increment resumes at 1 (not the previous number)
For each trial where block != 2, or condition != 1, leave the running count (cumulative) as NA
Here's a minimal working example:
mydata <- data.frame(id = c(1,1,1,1,1,1,1,1,1,1,1),
age = c(1,1,1,1,1,1,1,1,1,1,2),
block = c(1,1,2,2,2,2,2,2,2,2,2),
trial = c(1,2,1,2,3,4,5,6,7,8,1),
condition = c(1,1,1,1,1,2,1,1,1,1,1),
accuracy = c(0,0,0,0,0,0,0,1,0,0,0)
)
id age block trial condition accuracy
1 1 1 1 1 0
1 1 1 2 1 0
1 1 2 1 1 0
1 1 2 2 1 0
1 1 2 3 1 0
1 1 2 4 2 0
1 1 2 5 1 0
1 1 2 6 1 1
1 1 2 7 1 0
1 1 2 8 1 0
1 2 2 1 1 0
The expected output is:
id age block trial condition accuracy cumulative
1 1 1 1 1 0 NA
1 1 1 2 1 0 NA
1 1 2 1 1 0 1
1 1 2 2 1 0 2
1 1 2 3 1 0 3
1 1 2 4 2 0 NA
1 1 2 5 1 0 4
1 1 2 6 1 1 0
1 1 2 7 1 0 1
1 1 2 8 1 0 2
1 2 2 1 1 0 1
Here is an option using data.table. Create a binary column ('ind') by matching the pasted values of 'accuracy', 'block' and 'condition' against the two target combinations. Then, grouped by the run-length id of 'ind' together with 'id' and 'age', take the cumulative sum of 'ind' and assign (:=) it to a new column ('Cumulative').
library(data.table)
setDT(mydata)[, ind := match(do.call(paste0, .SD), c("121", "021")) - 1,
.SDcols = c("accuracy", "block", "condition")
][, Cumulative := cumsum(ind), .(rleid(ind), id, age)
][, ind := NULL][]
# id age block trial condition accuracy Cumulative
# 1: 1 1 1 1 1 0 NA
# 2: 1 1 1 2 1 0 NA
# 3: 1 1 2 1 1 0 1
# 4: 1 1 2 2 1 0 2
# 5: 1 1 2 3 1 0 3
# 6: 1 1 2 4 2 0 NA
# 7: 1 1 2 5 1 1 0
# 8: 1 1 2 6 1 0 1
# 9: 1 1 2 7 1 0 2
#10: 1 2 2 1 1 0 1
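The first step of that chain, turning the three columns into 0, 1 or NA, can be tried in base R on its own. A small sketch with made-up values (four rows chosen only to hit each case), not the question's data:

```r
# Keys are accuracy, block, condition pasted together, in that order
accuracy  <- c(0, 0, 1, 0)
block     <- c(1, 2, 2, 2)
condition <- c(1, 1, 1, 2)

key <- paste0(accuracy, block, condition)

# match() returns 1 for "121", 2 for "021" and NA for anything else;
# subtracting 1 turns that into 0 (reset), 1 (increment), NA (not counted)
ind <- match(key, c("121", "021")) - 1

data.frame(key, ind)
```

Because rleid(ind) starts a new run whenever ind changes, including a change to NA, the cumulative sum restarts after every reset row and after every uncounted row.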
We can use case_when to assign the value we need based on our conditions. We then add an extra grouping variable, built with cumsum, that starts a new group each time the temp column is 0. In the final mutate step we temporarily replace the NA values in temp with 0, take the cumsum, and then put the NA values back in their places to get the final output.
library(dplyr)
mydata %>%
  group_by(id, age) %>%
  mutate(temp = case_when(accuracy == 0 & block == 2 & condition == 1 ~ 1,
                          accuracy == 1 & block == 2 & condition == 1 ~ 0,
                          TRUE ~ NA_real_)) %>%
  ungroup() %>%
  group_by(id, age, group = cumsum(replace(temp == 0, is.na(temp), 0))) %>%
  mutate(cumulative = replace(cumsum(replace(temp, is.na(temp), 0)),
                              is.na(temp), NA)) %>%
  select(-temp, -group)
# group id age block trial condition accuracy cumulative
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 1 1 1 1 1 0 NA
# 2 0 1 1 1 2 1 0 NA
# 3 0 1 1 2 1 1 0 1
# 4 0 1 1 2 2 1 0 2
# 5 0 1 1 2 3 1 0 3
# 6 0 1 1 2 4 2 0 NA
# 7 0 1 1 2 5 1 0 4
# 8 1 1 1 2 6 1 1 0
# 9 1 1 1 2 7 1 0 1
#10 1 1 1 2 8 1 0 2
#11 1 1 2 2 1 1 0 1
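The printed output still carries the helper group column, because select() keeps grouping variables. One possible variant, sketched here under the assumption that only the original columns plus cumulative are wanted, is to ungroup before dropping the helpers. The names `run` and `result` are illustrative, and if_else/coalesce are swapped in for the nested replace calls.

```r
library(dplyr)

mydata <- data.frame(id = c(1,1,1,1,1,1,1,1,1,1,1),
                     age = c(1,1,1,1,1,1,1,1,1,1,2),
                     block = c(1,1,2,2,2,2,2,2,2,2,2),
                     trial = c(1,2,1,2,3,4,5,6,7,8,1),
                     condition = c(1,1,1,1,1,2,1,1,1,1,1),
                     accuracy = c(0,0,0,0,0,0,0,1,0,0,0))

result <- mydata %>%
  # 1 = increment, 0 = reset, NA = trial is not counted
  mutate(temp = case_when(
    block == 2 & condition == 1 & accuracy == 0 ~ 1,
    block == 2 & condition == 1 & accuracy == 1 ~ 0,
    TRUE ~ NA_real_
  )) %>%
  group_by(id, age) %>%
  # start a new run at every reset row
  mutate(run = cumsum(!is.na(temp) & temp == 0)) %>%
  group_by(id, age, run) %>%
  # count within the run, treating NA as 0, then restore NA on uncounted rows
  mutate(cumulative = if_else(is.na(temp), NA_real_, cumsum(coalesce(temp, 0)))) %>%
  ungroup() %>%
  select(-temp, -run)

result
```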
In R, in a repeated measures dataset, how can I create a variable that is the same for each measurement on an individual based upon an incident variable? For instance if I have:
id incident_MI
1 0
1 0
1 1
2 0
2 0
2 0
3 0
3 0
3 0
3 1
And I want to use the incident_MI to create an ever_MI variable like this:
id incident_MI Ever_MI
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
2 0 0
3 0 1
3 0 1
3 0 1
3 1 1
Any ideas on how I might code that in R?
We can check for any 1s in 'incident_MI' after grouping by 'id' and convert the logical result to integer with as.integer to create 'Ever_MI'.
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(Ever_MI = as.integer(any(incident_MI == 1)))
# A tibble: 10 x 3
# Groups: id [3]
# id incident_MI Ever_MI
# <int> <int> <int>
# 1 1 0 1
# 2 1 0 1
# 3 1 1 1
# 4 2 0 0
# 5 2 0 0
# 6 2 0 0
# 7 3 0 1
# 8 3 0 1
# 9 3 0 1
#10 3 1 1
Or, as @lmo commented, the data.table option would be
library(data.table)
setDT(df1)[, Ever_MI := as.integer(any(incident_MI == 1)), by = .(id)][]
Or using base R
df1$Ever_MI <- with(df1, ave(incident_MI, id, FUN = any))
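For a variable that only takes the values 0 and 1, the per-id maximum gives the same answer and keeps the result numeric. A self-contained base R sketch, with `df1` rebuilt from the example in the question:

```r
# Example data from the question
df1 <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  incident_MI = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1)
)

# For a 0/1 variable, the per-id maximum is 1 exactly when any row is 1
df1$Ever_MI <- ave(df1$incident_MI, df1$id, FUN = max)

df1
```

This assumes incident_MI has no missing values; with NAs present, max() would return NA for that id unless na.rm = TRUE is passed through a wrapper function.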