Drop the rows after the Event/Disease (1) occurred in R

I'm new to R. I have a set of patient IDs with disease status, and I want to drop the rows after the first occurrence of Disease == 1 for each ID. My data set looks like this:
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
123 04-03-2014 0
321 03-03-2015 1
423 06-06-2016 1
423 07-06-2017 1
543 08-05-2018 1
543 09-06-2019 0
645 08-09-2019 0
645 10-10-2018 0
645 11-10-2012 0
Expected Output
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
321 03-03-2015 1
423 06-06-2016 1
543 08-05-2018 1
645 08-09-2019 0
645 10-10-2018 0
645 11-10-2012 0
Kindly suggest code that returns the expected output.
Thanks in advance!

Using dplyr, one way would be to keep all rows of an ID if no Disease == 1 occurs, or otherwise keep rows only up to the first 1.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(if (any(Disease == 1)) row_number() <= match(1, Disease) else TRUE)
# ID Date Disease
# <int> <chr> <int>
#1 123 02-03-2012 0
#2 123 03-03-2013 1
#3 321 03-03-2015 1
#4 423 06-06-2016 1
#5 543 08-05-2018 1
#6 645 08-09-2019 0
#7 645 10-10-2018 0
#8 645 11-10-2012 0
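As a quick sanity check of the condition, here is how the two branches evaluate on single, toy Disease vectors (hypothetical values, not the df below):

# ID with an event: keep rows up to and including the first 1
Disease <- c(0, 1, 0)
match(1, Disease)                        # 2
seq_along(Disease) <= match(1, Disease)  # TRUE TRUE FALSE

# ID without an event: the else branch keeps every row
any(c(0, 0, 0) == 1)                     # FALSE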
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L, 645L, 645L), Date = c("02-03-2012", "03-03-2013",
"04-03-2014", "03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018",
"09-06-2019", "08-09-2019", "10-10-2018", "11-10-2012"), Disease = c(0L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-11L))

This would do it, shown here on a small simulated data set (note the use of filter rather than mutate, so the rows after the first event are actually dropped):
library(dplyr)

set.seed(1012)
datas <- tibble(ids = rep(1:3, each = 3),
                times = runif(9, 0, 100),
                event = rep(c(0, 1, 0), 3)) %>%
  arrange(ids, times)

datas %>%
  group_by(ids) %>%
  filter(lag(cumsum(event), default = 0) == 0)
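To see why the lag of the cumulative sum works, here is the condition on a toy event vector for one group (hypothetical values):

event <- c(0, 1, 0, 1)
dplyr::lag(cumsum(event), default = 0) == 0
# [1]  TRUE  TRUE FALSE FALSE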

We can use cumsum to create a logical vector for subsetting
library(data.table)
setDT(df)[df[, .I[cumsum(cumsum(Disease)) <= 1], ID]$V1]
# ID Date Disease
#1: 123 02-03-2012 0
#2: 123 03-03-2013 1
#3: 321 03-03-2015 1
#4: 423 06-06-2016 1
#5: 543 08-05-2018 1
#6: 645 08-09-2019 0
#7: 645 10-10-2018 0
#8: 645 11-10-2012 0
Or using dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(cumsum(cumsum(Disease)) <= 1)
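To see what the double cumsum does, here is the index on a toy Disease vector for one ID (hypothetical values):

Disease <- c(0, 0, 1, 0, 1)
cumsum(Disease)               # 0 0 1 1 2
cumsum(cumsum(Disease))       # 0 0 1 2 4
cumsum(cumsum(Disease)) <= 1  # TRUE TRUE TRUE FALSE FALSE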
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L, 645L, 645L), Date = c("02-03-2012", "03-03-2013",
"04-03-2014", "03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018",
"09-06-2019", "08-09-2019", "10-10-2018", "11-10-2012"), Disease = c(0L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-11L))

Related

Filter row sequence defined by values in several columns

I have a dataframe that I want to filter/subset on row sequences fulfilling two conditions: (i) the first row in the sequence has Q == "q_misc", and (ii) within the sequence, Seq is not NA for any row.
df <- structure(list(Line = c(480L, 481L, 482L, 483L, 484L, 485L, 497L,
498L, 499L, 500L, 501L, 502L, 549L, 550L, 551L, 552L, 557L, 558L,
559L, 560L, 561L, 562L, 563L, 564L), Seq = c(NA, 7L, 7L, 7L,
NA, NA, NA, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, NA, NA, 0L, 0L, 0L,
0L, 0L, NA, NA), Q = c(NA, "q_wh", NA, NA, NA, NA, NA, "q_misc",
NA, NA, NA, NA, "q_pol", NA, NA, NA, NA, "q_misc", NA, NA, NA,
NA, NA, NA)), row.names = c(NA, -24L), class = "data.frame")
The desired output is this row sequence:
Line Seq Q
8 498 0 q_misc
9 499 0 <NA>
10 500 0 <NA>
11 501 0 <NA>
18 558 0 q_misc
19 559 0 <NA>
20 560 0 <NA>
21 561 0 <NA>
22 562 0 <NA>
I've tried this but it only returns the first row of each sequence:
library(dplyr)
df %>%
  filter(Q == "q_misc" & !is.na(Seq))
You may use fill to fill the missing values in Q with the previous non-missing value, then select rows where that filled value is 'q_misc' and Seq is not NA.
library(dplyr)
library(tidyr)
df %>%
  mutate(Q1 = Q) %>%
  fill(Q1) %>%
  filter(Q1 == 'q_misc' & !is.na(Seq)) %>%
  select(-Q1)
# Line Seq Q
#1 498 0 q_misc
#2 499 0 <NA>
#3 500 0 <NA>
#4 501 0 <NA>
#5 558 0 q_misc
#6 559 0 <NA>
#7 560 0 <NA>
#8 561 0 <NA>
#9 562 0 <NA>
Using na.locf0 from zoo
library(zoo)
library(dplyr)
df %>%
  filter(zoo::na.locf0(Q) %in% 'q_misc', complete.cases(Seq))
Line Seq Q
1 498 0 q_misc
2 499 0 <NA>
3 500 0 <NA>
4 501 0 <NA>
5 558 0 q_misc
6 559 0 <NA>
7 560 0 <NA>
8 561 0 <NA>
9 562 0 <NA>
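In both answers the key step is the forward fill; as a quick check, this is roughly what na.locf0 produces on the first elements of Q from the df above (the leading NA is kept):

library(zoo)
zoo::na.locf0(df$Q)[1:9]
# [1] NA       "q_wh"   "q_wh"   "q_wh"   "q_wh"   "q_wh"   "q_wh"   "q_misc" "q_misc"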

How do I remove duplicates based on three columns but keep the row with the highest number in a specific column in R?

I have a dataset that looks like this:
Unique Id|Class Id|Version Id
501 1 1
602 3 1
602 3 1
405 2 1
305 2 3
305 2 2
305 1 1
305 2 1
509 1 1
501 2 1
501 3 1
501 3 2
602 2 1
602 1 1
405 1 1
If I were to run the script the remaining entries should be:
Unique Id|Class Id|Version Id
501 1 1
602 3 1
405 2 1
305 2 3
305 1 1
509 1 1
501 2 1
501 3 2
602 2 1
602 1 1
405 1 1
Note that the row with Unique Id 501, Class Id 3 and Version Id 2 was kept because it has the highest Version Id. Also note that one of the two identical rows with Unique Id 602, Class Id 3 and Version Id 1 is deleted because they are exactly the same from beginning to end.
Basically I want the script to delete all duplicates based on three columns and leave the row with the highest version id.
We can use rleid on the 'Unique Id' column and then do slice_max on 'Version Id' after grouping by that run id and 'Class Id'.
library(dplyr)
library(data.table)
data %>%
  group_by(grp = rleid(`Unique Id`), `Class Id`) %>%
  slice_max(`Version Id`) %>%
  ungroup() %>%
  select(-grp) %>%
  distinct()
-output
# A tibble: 11 x 3
# `Unique Id` `Class Id` `Version Id`
# <int> <int> <int>
# 1 501 1 1
# 2 602 3 1
# 3 405 2 1
# 4 305 1 1
# 5 305 2 3
# 6 509 1 1
# 7 501 2 1
# 8 501 3 2
# 9 602 1 1
#10 602 2 1
#11 405 1 1
Or, if we don't have to treat adjacent blocks of the same Unique Id as one group
data %>%
  group_by(`Unique Id`, `Class Id`) %>%
  slice_max(`Version Id`) %>%
  ungroup() %>%
  distinct()
Or using base R
ind <- with(rle(data$`Unique Id`), rep(seq_along(values), lengths))
data1 <- data[order(ind, -data$`Version Id`),]
data1[!duplicated(cbind(ind, data1$`Class Id`)),]
data
data <- structure(list(`Unique Id` = c(501L, 602L, 602L, 405L, 305L,
305L, 305L, 305L, 509L, 501L, 501L, 501L, 602L, 602L, 405L),
`Class Id` = c(1L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L,
3L, 2L, 1L, 1L), `Version Id` = c(1L, 1L, 1L, 1L, 3L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-15L))
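For reference, rleid() numbers consecutive runs rather than distinct values, which is why adjacent duplicates of the same 'Unique Id' fall into one group while a later reappearance starts a new one (toy vector):

library(data.table)
rleid(c(501, 602, 602, 405, 501))
# [1] 1 2 2 3 4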
If the order doesn't matter then we can reorder the data so that higher version IDs are on top, and then remove duplicated entries.
df <- df[order(df[,1], df[,2], -df[,3]),]
df <- df[!duplicated(df[,-3]),]
df
Unique Id Class Id Version Id
7 305 1 1
5 305 2 3
15 405 1 1
4 405 2 1
1 501 1 1
10 501 2 1
12 501 3 2
9 509 1 1
14 602 1 1
13 602 2 1
2 602 3 1
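This works because duplicated() marks every occurrence after the first one, so once Version Id is sorted in decreasing order within each (Unique Id, Class Id) pair, the first (kept) row is the maximum. A toy illustration of that behaviour on a simple vector:

duplicated(c("a", "a", "b", "a"))
# [1] FALSE  TRUE FALSE  TRUE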

How to combine columns based on contingencies?

I have the following df:
SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE
50 1 1 0 55601 26995
50 7 33 0 218022 105657
50 14 500 0 24881 13133
50 4 70 0 22400 11921
50 3 900 0 57840 28500
50 22 11 0 10138 5527
I would like to make a new column named CODE based on the columns STATE and COUNTY. I would like to paste the number from STATE onto the number from COUNTY. However, if COUNTY is a single or double digit number, I would like it to have zeroes before it, like 001 and 033.
Ideally the final df would look like:
SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE CODE
50 1 1 0 55601 26995 1001
50 7 33 0 218022 105657 7033
50 14 500 0 24881 13133 14500
50 4 70 0 22400 11921 4070
50 3 900 0 57840 28500 3900
50 22 11 0 10138 5527 22011
Is there a short, elegant way of doing this?
We can use sprintf
library(dplyr)
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY))
# SUMLEV STATE COUNTY AGEGRP TOT_POP TOT_MALE CODE
#1 50 1 1 0 55601 26995 1001
#2 50 7 33 0 218022 105657 7033
#3 50 14 500 0 24881 13133 14500
#4 50 4 70 0 22400 11921 4070
#5 50 3 900 0 57840 28500 3900
#6 50 22 11 0 10138 5527 22011
If we need to split the column 'CODE' into two, we can use separate
library(tidyr)
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY)) %>%
  separate(CODE, into = c("CODE1", "CODE2"), sep = "(?=...$)")
Or use extract to capture the substrings as groups
df %>%
  mutate(CODE = sprintf('%d%03d', STATE, COUNTY)) %>%
  extract(CODE, into = c("CODE1", "CODE2"), "^(.*)(...)$")
Or with str_pad
library(stringr)
df %>%
  mutate(CODE = str_c(STATE, str_pad(COUNTY, width = 3, pad = '0')))
Or in base R
df$CODE <- sprintf('%d%03d', df$STATE, df$COUNTY)
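As a quick check of the format string, '%03d' pads COUNTY to at least three digits with zeros while wider values are printed in full (toy inputs):

sprintf('%d%03d', c(1, 14, 3), c(1, 500, 900))
# [1] "1001"  "14500" "3900"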
data
df <- structure(list(SUMLEV = c(50L, 50L, 50L, 50L, 50L, 50L), STATE = c(1L,
7L, 14L, 4L, 3L, 22L), COUNTY = c(1L, 33L, 500L, 70L, 900L, 11L
), AGEGRP = c(0L, 0L, 0L, 0L, 0L, 0L), TOT_POP = c(55601L, 218022L,
24881L, 22400L, 57840L, 10138L), TOT_MALE = c(26995L, 105657L,
13133L, 11921L, 28500L, 5527L)), class = "data.frame", row.names = c(NA,
-6L))

Function for event occurrence data in R

I have a patient data set and I need to drop the rows after the first occurrence of 1 in the Disease column for each ID. For instance:
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
123 04-03-2014 0
321 03-03-2015 1
423 06-06-2016 1
423 07-06-2017 1
543 08-05-2018 1
543 09-06-2019 0
645 08-09-2019 0
and the expected output I want is:
ID Date Disease
123 02-03-2012 0
123 03-03-2013 1
321 03-03-2015 1
423 06-06-2016 1
543 08-05-2018 1
One way with dplyr is to select rows up to the first occurrence of 1 for each ID.
library(dplyr)
df %>% group_by(ID) %>% filter(row_number() <= which(Disease == 1)[1])
# ID Date Disease
# <int> <fct> <int>
#1 123 02-03-2012 0
#2 123 03-03-2013 1
#3 321 03-03-2015 1
#4 423 06-06-2016 1
#5 543 08-05-2018 1
We can also use slice
df %>% group_by(ID) %>% slice(if(any(Disease == 1)) 1:which.max(Disease) else 0)
data
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L), Date = structure(c(1L, 2L, 4L, 3L, 5L, 6L, 7L, 9L,
8L), .Label = c("02-03-2012", "03-03-2013", "03-03-2015", "04-03-2014",
"06-06-2016", "07-06-2017", "08-05-2018", "08-09-2019", "09-06-2019"
), class = "factor"), Disease = c(0L, 1L, 0L, 1L, 1L, 1L, 1L,
0L, 0L)), class = "data.frame", row.names = c(NA, -9L))
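The slice() variant works because which.max() returns the index of the first maximum, i.e. the position of the first 1 in a 0/1 vector (toy vectors):

which.max(c(0, 1, 0))     # 2
which.max(c(0, 1, 1, 0))  # 2, still the first 1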
I have no idea why you don't have the last line 645 08-09-2019 0 in your expected result. The first occurrence of disease for ID 645 has not appeared yet, so I guess you might have missed it in your expected result.
Based on my guess above, maybe you can try the base R solution below, using subset + ave
dfout <- subset(df, !!ave(Disease, ID, FUN = function(v) !duplicated(cumsum(v) > 0)))
such that
> dfout
ID Date Disease
1 123 02-03-2012 0
2 123 03-03-2013 1
4 321 03-03-2015 1
5 423 06-06-2016 1
7 543 08-05-2018 1
9 645 08-09-2019 0
DATA
df <- structure(list(ID = c(123L, 123L, 123L, 321L, 423L, 423L, 543L,
543L, 645L), Date = c("02-03-2012", "03-03-2013", "04-03-2014",
"03-03-2015", "06-06-2016", "07-06-2017", "08-05-2018", "09-06-2019",
"08-09-2019"), Disease = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L
)), class = "data.frame", row.names = c(NA, -9L))
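As a quick check of the ave() helper, on the Disease values of ID 123 from the df above the expression keeps the rows up to and including the first 1. (Note that with more than one 0 before the first 1, the duplicated() call would also drop those extra leading zeros; that case does not occur in this data.)

v <- c(0, 1, 0)             # Disease for ID 123
cumsum(v) > 0               # FALSE  TRUE  TRUE
!duplicated(cumsum(v) > 0)  # TRUE  TRUE FALSE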

Selecting groups with zero values by action column in R

I have the following data:
mydat=structure(list(group = c(111L, 111L, 111L, 111L, 111L, 111L,
111L, 333L, 333L, 333L, 333L, 333L, 333L, 333L, 555L, 555L, 555L,
555L, 555L, 555L, 555L), group2 = c(222L, 222L, 222L, 222L, 222L,
222L, 222L, 444L, 444L, 444L, 444L, 444L, 444L, 444L, 666L, 666L,
666L, 666L, 666L, 666L, 666L), action = c(0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L
), x1 = c(1L, 2L, 3L, 0L, 0L, 1L, 2L, 1L, 2L, 3L, 0L, 0L, 1L,
2L, 1L, 2L, 3L, 10L, 20L, 1L, 2L)), .Names = c("group", "group2",
"action", "x1"), class = "data.frame", row.names = c(NA, -21L
))
Here there are two group variables (group and group2).
There are three groups:
111 222
333 444
555 666
The action column can take only the values 0 and 1.
So I need to find the groups where, for action category 1, x1 has only zero values.
In our case these are
111 222
333 444
because all of their action == 1 rows have zeros in x1.
So I can work only with the 555 666 group,
because it has at least one non-zero x1 value among its action == 1 rows.
The desired output
mydat1: groups with at least one non-zero x1 value among their action == 1 rows.
group group2 action x1
555 666 0 1
555 666 0 2
555 666 0 3
555 666 1 **10**
555 666 1 **20**
555 666 0 1
555 666 0 2
mydat2: groups for which all action == 1 rows have zeros in x1.
group group2 action x1
111 222 0 1
111 222 0 2
111 222 0 3
111 222 1 **0**
111 222 1 **0**
111 222 0 1
111 222 0 2
333 444 0 1
333 444 0 2
333 444 0 3
333 444 1 **0**
333 444 1 **0**
333 444 0 1
333 444 0 2
If I understand you correctly, your question is:
I need to find the groups where, for action category 1, they have
only zero values in x1.
So here is the response:
library(tidyverse)
mydat %>%
  group_by(action) %>%
  filter(action == 1 & x1 == 0)
and the response is:
group group2 action x1
<int> <int> <int> <int>
1 111 222 1 0
2 111 222 1 0
3 333 444 1 0
4 333 444 1 0
What does this code do?
It looks at the action feature and considers its two categories (0 and 1) across all rows. Then it keeps the observations that satisfy action == 1 & x1 == 0. In other words, it returns the rows where action == 1 and x1 == 0 hold at the same time.
Can the script return all rows of the 555/666 group?
No, it does not return that group, and it should not. Let's write code which filters for 555 and 666:
library(tidyverse)
mydat %>%
  group_by(action) %>%
  filter(group == 555 | group2 == 666)
and the result is:
group group2 action x1
<int> <int> <int> <int>
1 555 666 0 1
2 555 666 0 2
3 555 666 0 3
4 555 666 1 10
5 555 666 1 20
6 555 666 0 1
7 555 666 0 2
So, as you can see, none of these observations fulfills the condition action == 1 & x1 == 0. Therefore, they are not part of the returned rows.
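If the goal is to split mydat into the two group-level subsets shown in the question, one possible sketch (assuming each (group, group2) pair defines one group; the names mydat1 and mydat2 are simply those used in the question) is to test the condition per group rather than per row:

library(dplyr)

# groups with at least one non-zero x1 among their action == 1 rows (mydat1)
mydat1 <- mydat %>%
  group_by(group, group2) %>%
  filter(any(action == 1 & x1 != 0)) %>%
  ungroup()

# groups whose action == 1 rows all have x1 == 0 (mydat2)
mydat2 <- mydat %>%
  group_by(group, group2) %>%
  filter(all(x1[action == 1] == 0)) %>%
  ungroup()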
