Conditionally delete individuals from longitudinal data [duplicate] - r

I have a longitudinal data set where I want to drop individuals (id) if they do not fulfill the criterion indicated by criteria == 1 at any time point. To put it in context, we could say that criteria denotes whether the individual was living in the region of interest at any time.
Here is some toy data with a structure similar to mine:
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
event <- c(0,1,0,1,0,0,0,0,0,0,1,0,1,0,1)
criteria <- c(1,0,0,0,0,0, 0, 0, 0, 1, 1, 1,0,0,1)
df <- data.frame(cbind(id,time,event, criteria))
> df
id time event criteria
1 1 1 0 1
2 1 2 1 0
3 1 3 0 0
4 2 1 1 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 0
8 3 2 0 0
9 3 3 0 0
10 4 1 0 1
11 4 2 1 1
12 4 3 0 1
13 5 1 1 0
14 5 2 0 0
15 5 3 1 1
Removing every id that has criteria == 0 at all time points (time) would lead to an end result looking like this:
id time event criteria
1 1 1 0 1
2 1 2 1 0
3 1 3 0 0
4 4 1 0 1
5 4 2 1 1
6 4 3 0 1
7 5 1 1 0
8 5 2 0 0
9 5 3 1 1
I've been trying to achieve this by using dplyr::group_by(id) and then filtering on the criterion, but that does not give the result I want. I'd prefer a tidyverse solution! :D
Thanks!

library(dplyr)
df %>%
  group_by(id) %>%
  # keep every id that has criteria == 1 at least once
  filter(any(criteria == 1)) %>%
  ungroup()
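For comparison, here is a minimal base R sketch of the same filter: keep every id that has criteria == 1 at least once. It rebuilds the toy data from the question; the names keep and result are only illustrative.

```r
# Toy data from the question
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
event <- c(0,1,0,1,0,0,0,0,0,0,1,0,1,0,1)
criteria <- c(1,0,0,0,0,0,0,0,0,1,1,1,0,0,1)
df <- data.frame(id, time, event, criteria)

# ave() returns, for every row, the maximum of criteria within that row's id;
# the maximum is 1 exactly when the id has criteria == 1 at least once
keep <- ave(df$criteria, df$id, FUN = max) == 1

result <- df[keep, ]
rownames(result) <- NULL  # renumber the remaining rows
result
```

With the toy data this keeps ids 1, 4 and 5 (nine rows) and drops ids 2 and 3.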

If you're willing to look into data.table, which I recommend, it is as simple as this:
library(data.table)
setDT(df) # make it a data.table
df[ , .SD[ !all(criteria==0) ], by=id ]
See this page for a general introduction and an explanation of the .SD idiom:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html

Related

Mutate multiple columns based on conditional and column name

I have a dataframe with the following structure (See example). The dots after OperatedIn2007 column signify multiple columns with same name, changing only the year (e.g OperatedIn2008, OperatedIn2009, etc.).
I wish to do the following procedure:
If the group is 1, then add one in all columns whose names start with OperatedIn.
The expected result should be similar to the one presented in the desired output.
A nonscalable solution would be to use:
df <- df %>%
mutate(OperatedIn2006 = ifelse(group == 1, 1, 0)) %>%
[...]
I imagine there is some slick solution using dplyr or data.table, but I could not think of it myself.
Example
ID group OperatedIn2006 OperatedIn2007 ...
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 1 0 0
6 2 0 0
Desired output
ID group OperatedIn2006 OperatedIn2007 ...
1 1 1 1
2 2 0 0
3 3 0 0
4 4 0 0
5 1 1 1
6 2 0 0
We could use across with an ifelse statement:
library(dplyr)
df %>%
mutate(across(-c(ID, group), ~ifelse(group==1, 1, .)))
ID group OperatedIn2006 OperatedIn2007
1 1 1 1 1
2 2 2 0 0
3 3 3 0 0
4 4 4 0 0
5 5 1 1 1
6 6 2 0 0
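The question asks specifically for the columns whose names start with OperatedIn, so selecting them by prefix may be safer than excluding ID and group. A minimal base R sketch, assuming the example data from the question with only two OperatedIn columns (the name cols is illustrative):

```r
# Example data from the question, reduced to two OperatedIn columns
df <- data.frame(
  ID = 1:6,
  group = c(1, 2, 3, 4, 1, 2),
  OperatedIn2006 = 0,
  OperatedIn2007 = 0
)

# Pick the target columns by name prefix
cols <- grep("^OperatedIn", names(df), value = TRUE)

# Set those columns to 1 for every row in group 1
df[df$group == 1, cols] <- 1
df
```

In dplyr the same selection can be written as across(starts_with("OperatedIn"), ...), which leaves any other columns in the data frame untouched.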

If else Condition in R based on different columns and rows

I have a dataset with an ID column and multiple visits for every ID. I am trying to create a new variable, Status, which checks the Visit column and the Value column. The conditions are as follows:
For visits 1, 2 and 3, if the values are 1,1,1 then 1
For visits 1, 2 and 3, if the values are 0,1,1 then 0
For visits 1, 2 and 3, if the values are 0,0,0 then 0
How do I specify this condition in R?
Below is a sample dataset
ID Visit Value
1 1 1
1 2 1
1 3 1
2 1 1
2 2 0
2 3 0
3 1 0
3 2 0
3 3 0
4 1 0
4 2 1
4 3 1
Result dataset
ID Visit Value Status
1 1 1 1
1 2 1 1
1 3 1 1
2 1 1 0
2 2 0 0
2 3 0 0
3 1 0 0
3 2 0 0
3 3 0 0
4 1 0 0
4 2 1 0
4 3 1 0
I would try something like this (assuming your initial table is called df):
status <- c()
for (i in 1:4) { # 1:4 corresponds to the IDs you showed us
  # Status is 1 only if all three visits of this ID have Value 1
  if (sum(df[df$ID == i, "Value"]) == 3) {
    status <- c(status, rep(1, 3))
  } else {
    status <- c(status, rep(0, 3))
  }
}
df <- cbind(df, Status = status)
I hope that it will help you
I believe that case_when from the dplyr package is what you need to use. More details on that function are here: https://dplyr.tidyverse.org/reference/case_when.html
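The rule in the question can also be written without a loop. This is a minimal base R sketch, assuming Status should be 1 only when every visit of an ID has Value 1 and 0 otherwise, which is how the result dataset reads; it does not depend on each ID having exactly three visits.

```r
# Sample data from the question
df <- data.frame(
  ID = rep(1:4, each = 3),
  Visit = rep(1:3, times = 4),
  Value = c(1, 1, 1,  1, 0, 0,  0, 0, 0,  0, 1, 1)
)

# For each ID, check whether all Values are 1; ave() repeats the
# per-ID answer on every row belonging to that ID
df$Status <- ave(df$Value, df$ID, FUN = function(v) as.numeric(all(v == 1)))
df
```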

How to flag duplicate values in r - newbie

I'm trying to flag duplicate IDs in another column. I don't necessarily want to remove them yet, just create an indicator (0/1) of whether the IDs are unique or duplicates. In sql, it would be like this:
SELECT ID, count(ID) count from TABLE group by ID) a
On TABLE.ID = a.ID
set ID Duplicate Flag Column 1 = 1
where count > 1;
Is there a way to do this simply in R?
Any help would be greatly appreciated.
As an example of duplicated let's start with some values (numbers here, but strings would do the same thing)
x <- c(9, 1:5, 3:7, 0:8)
x
# 9 1 2 3 4 5 3 4 5 6 7 0 1 2 3 4 5 6 7 8
If you want to flag the second and later copies
as.numeric(duplicated(x))
# 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0
If you want to flag all values that occur two or more times
as.numeric(x %in% x[duplicated(x)])
# 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0
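Applied to a data frame, the same idea gives the indicator column the question asks for. This is a small sketch; the data frame and the column names dup_flag and dup_flag2 are made up for illustration.

```r
# Hypothetical data with some repeated IDs
df <- data.frame(ID = c("a", "b", "a", "c", "b", "d"))

# 1 for every ID that occurs two or more times, 0 for unique IDs
df$dup_flag <- as.numeric(df$ID %in% df$ID[duplicated(df$ID)])

# Equivalent: a value is repeated if it is a duplicate when scanning
# from the top or when scanning from the bottom
df$dup_flag2 <- as.numeric(duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE))
df
```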

give id to sessions with a mark at the first row of a new session

df <- data.frame(mark_new =c(1,0,0,1,0,0,0,0,0,1,0,1))
> df
mark_new
1 1
2 0
3 0
4 1
5 0
6 0
7 0
8 0
9 0
10 1
11 0
12 1
Every time a 1 appears, it means a new session starts. I want to give each session an id. The result should look like this:
>df1
mark_new id
1 1 1
2 0 1
3 0 1
4 1 2
5 0 2
6 0 2
7 0 2
8 0 2
9 0 2
10 1 3
11 0 3
12 1 4
We can use cumsum on the logical column (df$mark_new == 1 - here I am assuming that there could be other values in addition to 0 and 1)
df$id <- cumsum(df$mark_new==1)
df$id
#[1] 1 1 1 2 2 2 2 2 2 3 3 4
If the 'mark_new' column is binary, as @thelatemail mentioned, we can just do the cumsum on the whole column instead of converting to 'logical'
df$id <- cumsum(df$mark_new)
If we use dplyr
library(dplyr)
df %>%
mutate(id = cumsum(mark_new == 1))

Pivoting Nominal Data in R [duplicate]

I have a data frame in R that I need to manipulate (pivot). At the simplest level the first few rows would look like the following:
Batch Unit Success InputGrouping
1 1 1 A
2 5 1 B
3 4 0 C
1 1 1 D
2 5 1 A
I would like to pivot this data so that the column names would be InputGrouping and the values would be 1 if it exists and 0 if not. Using above:
Batch Unit Success A B C D
1 1 1 1 0 0 1
2 5 1 1 1 0 0
3 4 0 0 0 1 0
I've looked at reshape/cast but can't figure out if this transformation is possible with the package. Any advice would be very much appreciated.
This is indeed possible using reshape2 with the function dcast().
Recreate your data:
dat <- read.table(header=TRUE, text="
Batch Unit Success InputGrouping
1 1 1 A
2 5 1 B
3 4 0 C
1 1 1 D
2 5 1 A")
Now recast the data:
library("reshape2")
dcast(Batch + Unit + Success ~ InputGrouping, data=dat, fun.aggregate = length)
The results:
Using InputGrouping as value column: use value.var to override.
Batch Unit Success A B C D
1 1 1 1 1 0 0 1
2 2 5 1 1 1 0 0
3 3 4 0 0 0 1 0
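Here's a possible solution using the data.table package. InputGrouping has to be a factor so that table() reports every level, including zero counts, within each group:
library(data.table)
setDT(df)[, InputGrouping := factor(InputGrouping)]
df[, as.list(table(InputGrouping)), by = .(Batch, Unit, Success)]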
# Batch Unit Success A B C D
# 1: 1 1 1 1 0 0 1
# 2: 2 5 1 1 1 0 0
# 3: 3 4 0 0 0 1 0
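If neither package is available, a rough base R sketch of the same pivot is possible. It assumes, as in the sample data, that each Batch has a single Unit and Success value; the names keys, counts and wide are illustrative.

```r
dat <- read.table(header = TRUE, text = "
Batch Unit Success InputGrouping
1 1 1 A
2 5 1 B
3 4 0 C
1 1 1 D
2 5 1 A")

# One row per Batch with its Unit and Success
keys <- unique(dat[c("Batch", "Unit", "Success")])

# Cross-tabulate Batch against InputGrouping and turn the table
# into a data frame with one column per grouping
counts <- as.data.frame.matrix(table(dat$Batch, dat$InputGrouping))
counts$Batch <- as.integer(rownames(counts))

# Join the counts back onto the key columns
wide <- merge(keys, counts, by = "Batch")
wide
```

The cells hold counts, so a grouping that appears twice within a batch would show 2 rather than 1.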
