Removing rows with a rule from another column - r

I have data similar to this:
PatientID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
VisitId = c(1, 5, 6, 9, 2, 3, 12, 4, 7, 8, 10, 11)
target = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0)
df <- as.data.frame(cbind(PatientID, VisitId, target))
PatientID VisitId target
1 1 1 0
2 1 5 0
3 1 6 0
4 1 9 1
5 2 2 0
6 2 3 0
7 2 12 0
8 3 4 0
9 3 7 0
10 3 8 0
11 3 10 1
12 3 11 0
I need to delete the rows whose VisitId, within a PatientID, is equal to or larger than the VisitId of the row where the target is 1.
I.e. in the example, rows 4, 11 and 12 should be eliminated, because those rows occurred for the patient at the same time as, or after, the target incident happened - which I wish to predict...

Here is an idea using dplyr. It assumes that each PatientID has at most one row with target == 1.
library(dplyr)
df %>%
  group_by(PatientID) %>%
  mutate(new = ifelse(target == 1, VisitId, 0),
         new = replace(new, new == 0, max(new))) %>%
  filter((target != 1 & VisitId < new) | new == 0) %>%
  select(-new)
which gives,
# A tibble: 9 x 3
# Groups: PatientID [3]
PatientID VisitId target
<dbl> <dbl> <dbl>
1 1 1 0
2 1 5 0
3 1 6 0
4 2 2 0
5 2 3 0
6 2 12 0
7 3 4 0
8 3 7 0
9 3 8 0
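An alternative sketch (my own, not the answer above, under the same at-most-one-target assumption): filter directly against the first target VisitId per patient, using min(..., Inf) so that patients with no target row keep all their visits:

```r
library(dplyr)

df <- data.frame(PatientID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
                 VisitId   = c(1, 5, 6, 9, 2, 3, 12, 4, 7, 8, 10, 11),
                 target    = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0))

result <- df %>%
  group_by(PatientID) %>%
  # min(..., Inf) returns Inf when a patient has no target == 1 row,
  # so all of that patient's rows are kept
  filter(VisitId < min(VisitId[target == 1], Inf)) %>%
  ungroup()
```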

Create new columns based on 2 columns

So I have this kind of table df
Id Type QTY unit
1  A    5   1
2  B    10  2
3  C    5   3
2  A    10  4
3  B    5   5
1  C    10  6
I want to create this data frame df2
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
1  5     1      0     0      10    6
2  10    4      10    2      0     0
3  0     0      5     5      5     3
This means that I want to create a new column for every Type's QTY and unit for each Id. I was thinking of using a loop to first create a new column for each Type, to get something like this:
Id Type QTY unit A_QTY A_unit B_QTY B_unit C_QTY C_unit
1  A    5   1    5     1      0     0      0     0
2  B    10  2    0     0      10    2      0     0
3  C    5   3    0     0      0     0      5     3
2  A    10  4    10    4      0     0      0     0
3  B    5   5    0     0      5     5      0     0
1  C    10  6    0     0      0     0      10    6
, and then use group_by() to aggregate them, resulting in df2. But I get stuck when it comes to creating the new columns. I have tried a for loop, but my level in R is not that great yet, and I can't manage to create new columns from those existing columns...
I'll appreciate any suggestions you have for me!
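For reproducibility, the example table above can be constructed as a data frame (values taken from the question; the name df is assumed):

```r
df <- data.frame(
  Id   = c(1, 2, 3, 2, 3, 1),
  Type = c("A", "B", "C", "A", "B", "C"),
  QTY  = c(5, 10, 5, 10, 5, 10),
  unit = c(1, 2, 3, 4, 5, 6)
)
```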
You can use pivot_wider from the tidyr package:
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = "Type",             # Columns to get the names from
              values_from = c("QTY", "unit"),  # Columns to get the values from
              names_glue = "{Type}_{.value}",  # Column naming
              values_fill = 0,                 # Fill NAs with 0
              names_vary = "slowest")          # To get the right column ordering
output
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<int> <int> <int> <int> <int> <int> <int>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
library(tidyverse)
df %>%
  pivot_longer(-c(Id, Type)) %>%
  mutate(name = str_c(Type, name, sep = "_")) %>%
  select(-Type) %>%
  pivot_wider(names_from = "name", values_from = "value", values_fill = 0)
# A tibble: 3 × 7
Id A_QTY A_unit B_QTY B_unit C_QTY C_unit
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 1 0 0 10 6
2 2 10 4 10 2 0 0
3 3 0 0 5 5 5 3
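A base R alternative sketch using reshape(): note that its default column names use a dot separator (QTY.A, unit.A, ...) rather than the A_QTY naming in the question, and missing combinations come back as NA that need filling afterwards.

```r
# example data from the question
df <- data.frame(
  Id   = c(1, 2, 3, 2, 3, 1),
  Type = c("A", "B", "C", "A", "B", "C"),
  QTY  = c(5, 10, 5, 10, 5, 10),
  unit = c(1, 2, 3, 4, 5, 6)
)

# pivot to wide: one row per Id, columns QTY.A, unit.A, QTY.B, ...
wide <- reshape(df, direction = "wide", idvar = "Id", timevar = "Type")
wide[is.na(wide)] <- 0  # fill missing Type combinations with 0
```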

R Count Unique By Group in DPLYR

HAVE <- data.frame(TRIMESTER = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4),
                   STUDENT = c(1,2,3,3,4,2,5,6,7,1,2,2,2,2,2,1,2,3,4,5))
HAVE$WANT1 <- c(4,4,4,4,4,5,5,5,5,5,1,1,1,1,5,5,5,5,5,5)
HAVE$WANT2 <- c(0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1)
I have HAVE and wish to append a column counting the unique values of STUDENT for every TRIMESTER, shown as WANT1. I also wish to create WANT2, which is the number of times STUDENT == 5 appears in every TRIMESTER: student 5 appears zero times in TRIMESTER == 1, so the value for all of TRIMESTER == 1 is 0, but student 5 appears once in TRIMESTER == 4, so the value there is 1.
After grouping by 'TRIMESTER', get the count of distinct elements of 'STUDENT' with n_distinct, and the count of STUDENT 5 with sum on a logical expression:
library(dplyr)
HAVE %>%
  group_by(TRIMESTER) %>%
  mutate(WANT1new = n_distinct(STUDENT),
         WANT2NEW = sum(STUDENT == 5)) %>%
  ungroup
output
# A tibble: 20 × 6
TRIMESTER STUDENT WANT1 WANT2 WANT1new WANT2NEW
<dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 1 4 0 4 0
2 1 2 4 0 4 0
3 1 3 4 0 4 0
4 1 3 4 0 4 0
5 1 4 4 0 4 0
6 2 2 5 1 5 1
7 2 5 5 1 5 1
8 2 6 5 1 5 1
9 2 7 5 1 5 1
10 2 1 5 1 5 1
11 3 2 1 0 1 0
12 3 2 1 0 1 0
13 3 2 1 0 1 0
14 3 2 1 0 1 0
15 4 2 5 1 5 1
16 4 1 5 1 5 1
17 4 2 5 1 5 1
18 4 3 5 1 5 1
19 4 4 5 1 5 1
20 4 5 5 1 5 1
The code below should produce the desired result.
library(dplyr)
HAVE %>%
  group_by(TRIMESTER) %>%
  mutate(WANT1 = length(unique(STUDENT)),
         WANT2 = as.numeric(any(5 == STUDENT)))

Use R to find values for which a condition is first met

Consider the following sample dataset. Id is an individual identifier.
rm(list = ls()); set.seed(1)
n <- 100
X <- rbinom(n, 1, 0.5) # binary covariate
j <- rep(1:n)
dat <- data.frame(id = 1:n, X)
ntp <- rep(4, n)
mat <- matrix(ncol = 3, nrow = 1)
m <- 0; w <- mat
for (l in ntp) {
  m <- m + 1
  ft <- seq(from = 2, to = 8, length.out = l)
  # ft <- seq(from = 1, to = 9, length.out = l)
  ft <- sort(ft)
  seq <- rep(ft, each = 2)
  seq <- c(0, seq, 10)
  matid <- cbind(matrix(seq, ncol = 2, nrow = l + 1, byrow = TRUE), m)
  w <- rbind(w, matid)
}
d <- data.frame(w[-1, ])
colnames(d) <- c("time1", "time2", "id")
D <- round(merge(d, dat, by = "id"), 2) # merged dataset
nr <- nrow(D)
D$Survival_time <- round(rexp(nr, 0.1) + 1, 3)
head(D, 15)
id time1 time2 X Survival_time
1 1 0 2 0 21.341
2 1 2 4 0 18.987
3 1 4 6 0 4.740
4 1 6 8 0 13.296
5 1 8 10 0 6.397
6 2 0 2 0 10.566
7 2 2 4 0 2.470
8 2 4 6 0 14.907
9 2 6 8 0 8.620
10 2 8 10 0 13.376
11 3 0 2 1 45.239
12 3 2 4 1 11.545
13 3 4 6 1 11.352
14 3 6 8 1 19.760
15 3 8 10 1 7.547
How can I obtain the value at which Survival_time is less than time2 for the very first time per individual? I should end up with the following values:
id Survival_time
1 4.740
2 2.470
3 7.547
Also, how can I subset the data so that it stops, individual-wise, when this condition first occurs? I.e. obtain
id time1 time2 X Survival_time
1 1 0 2 0 21.341
2 1 2 4 0 18.987
3 1 4 6 0 4.740
6 2 0 2 0 10.566
7 2 2 4 0 2.470
11 3 0 2 1 45.239
12 3 2 4 1 11.545
13 3 4 6 1 11.352
14 3 6 8 1 19.760
15 3 8 10 1 7.547
Using data.table
library(data.table)
setDT(D)[, .SD[seq_len(.N) <= which(Survival_time < time2)[1]], id]
output
id time1 time2 X Survival_time
1: 1 0 2 0 21.341
2: 1 2 4 0 18.987
3: 1 4 6 0 4.740
4: 2 0 2 0 10.566
5: 2 2 4 0 2.470
6: 3 0 2 1 45.239
7: 3 2 4 1 11.545
8: 3 4 6 1 11.352
9: 3 6 8 1 19.760
10: 3 8 10 1 7.547
Slight variation:
library(dplyr)
D %>% # Take D, and then
group_by(id) %>% # group by id, and then
filter(Survival_time < time2) %>% # keep Survival times < time2, and then
slice(1) %>% # keep the first row per id, and then
ungroup() # ungroup
You can use:
library(dplyr)
D %>%
  group_by(id) %>%
  summarise(Survival_time = Survival_time[match(TRUE, Survival_time < time2)])
# Alternatively, using which.max:
# summarise(Survival_time = Survival_time[which.max(Survival_time < time2)])
# id Survival_time
# <int> <dbl>
#1 1 4.74
#2 2 2.47
#3 3 7.55
To select the rows up to that point, you may use:
D %>%
  group_by(id) %>%
  filter(row_number() <= match(TRUE, Survival_time < time2)) %>%
  ungroup
# id time1 time2 X Survival_time
# <int> <int> <int> <int> <dbl>
# 1 1 0 2 0 21.3
# 2 1 2 4 0 19.0
# 3 1 4 6 0 4.74
# 4 2 0 2 0 10.6
# 5 2 2 4 0 2.47
# 6 3 0 2 1 45.2
# 7 3 2 4 1 11.5
# 8 3 4 6 1 11.4
# 9 3 6 8 1 19.8
#10 3 8 10 1 7.55
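The match(TRUE, ...) idea also works in base R with split() and lapply(); a sketch on a small stand-in data frame (values loosely mirroring the first rows of D), assuming every id eventually meets the condition, since match() returns NA otherwise:

```r
# toy stand-in for D
D <- data.frame(id = c(1, 1, 1, 2, 2),
                time2 = c(2, 4, 6, 2, 4),
                Survival_time = c(21.3, 19.0, 4.7, 10.6, 2.5))

keep_until_first <- function(d) {
  # index of the first row where the condition holds
  k <- match(TRUE, d$Survival_time < d$time2)
  d[seq_len(k), , drop = FALSE]
}

# apply per id and stack the pieces back together
D_sub <- do.call(rbind, lapply(split(D, D$id), keep_until_first))
```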

Randomly select rows in R using sample_n

df <- data.frame(
  id = 1:12,
  day = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  endpoint = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1))
df
#> id day endpoint
#> 1 1 1 1
#> 2 2 1 1
#> 3 3 1 1
#> 4 4 1 1
#> 5 5 2 2
#> 6 6 2 2
#> 7 7 2 2
#> 8 8 2 2
#> 9 9 3 1
#> 10 10 3 1
#> 11 11 3 1
#> 12 12 3 1
In the above data, some patients (id) reached the endpoint each day. I am trying to randomly select the endpoint number of patients, marking them with s = 1. For each day, ids from that day and previous days are eligible, as long as they have not previously been selected. The following code gets what I expected, but I have to manually enter the day and endpoint values. Any suggestions on how to pick those values directly from the data would be appreciated.
library(dplyr)
df$s <- 0
df$s <- ifelse(df$id %in% sample_n(df[df$day <= 1 & df$s == 0, ], 1)$id, 1, df$s)
df$s <- ifelse(df$id %in% sample_n(df[df$day <= 2 & df$s == 0, ], 2)$id, 1, df$s)
df$s <- ifelse(df$id %in% sample_n(df[df$day <= 3 & df$s == 0, ], 1)$id, 1, df$s)
df
#> id day endpoint s pick_day
#> 1 1 1 1 0 0
#> 2 2 1 1 1 2
#> 3 3 1 1 1 1
#> 4 4 1 1 1 3
#> 5 5 2 2 1 2
#> 6 6 2 2 0 0
#> 7 7 2 2 0 0
#> 8 8 2 2 0 0
#> 9 9 3 1 0 0
#> 10 10 3 1 0 0
#> 11 11 3 1 0 0
#> 12 12 3 1 0 0
EDIT
Is it possible to add a variable to show the day for which a row was picked, like the above variable pick_day? Thanks.
A way in base R using a for loop:
df$s <- 0
set.seed(123)
for (i in unique(df$day)) {
  temp <- subset(df, day <= i & s == 0)
  ids <- with(temp, sample(id, endpoint[day == i][1]))
  df$s[df$id %in% ids] <- 1
}
df
# id day endpoint s
#1 1 1 1 0
#2 2 1 1 0
#3 3 1 1 1
#4 4 1 1 1
#5 5 2 2 1
#6 6 2 2 0
#7 7 2 2 0
#8 8 2 2 1
#9 9 3 1 0
#10 10 3 1 0
#11 11 3 1 0
#12 12 3 1 0
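To also record the requested pick_day, the loop above can be extended with one extra column (a sketch; the data is rebuilt from the question so the block runs standalone):

```r
df <- data.frame(id = 1:12,
                 day = rep(1:3, each = 4),
                 endpoint = rep(c(1, 2, 1), each = 4))
df$s <- 0
df$pick_day <- 0  # day on which a row was selected, 0 if never picked
set.seed(123)
for (i in unique(df$day)) {
  temp <- subset(df, day <= i & s == 0)  # eligible and not yet picked
  ids <- with(temp, sample(id, endpoint[day == i][1]))
  df$s[df$id %in% ids] <- 1
  df$pick_day[df$id %in% ids] <- i       # remember the picking day
}
```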

Summing rows after first occurrence of a certain number

I would like to get the sum of the rows after the first occurrence of a certain number, in this case '10'.
I thought that if we knew the row number right after the first occurrence, and the last row number of that group, we could sum between them.
I can get the first occurrence of '10' in each group, but I don't know how to get the sum of the rows.
df <- data.frame(gr = rep(c(1, 2), c(7, 9)),
                 y_value = c(c(0, 0, 10, 8, 8, 6, 0), c(0, 0, 10, 10, 5, 4, 2, 0, 0)))
> df
gr y_value
1 1 0
2 1 0
3 1 10
4 1 8
5 1 8
6 1 6
7 1 0
8 2 0
9 2 0
10 2 10
11 2 10
12 2 5
13 2 4
14 2 2
15 2 0
16 2 0
My initial attempt is below, which for some reason is not working, even for the grouping part!
library(dplyr)
df %>%
  group_by(gr) %>%
  mutate(check1 = any(y_value == 10), row_sum = which(y_value == 10)[1])
Expected output
> df
gr y_value sum_rows_range
1 1 0 22/4
2 1 0 22/4
3 1 10 22/4
4 1 8 22/4
5 1 8 22/4
6 1 6 22/4
7 1 0 22/4
8 2 0 21/6
9 2 0 21/6
10 2 10 21/6
11 2 10 21/6
12 2 5 21/6
13 2 4 21/6
14 2 2 21/6
15 2 0 21/6
16 2 0 21/6
A dplyr solution:
library(dplyr)
df %>%
  group_by(gr) %>%
  slice(if (any(y_value == 10)) (which.max(y_value == 10) + 1):n() else row_number()) %>%
  summarize(sum = sum(y_value),
            rows = n()) %>%
  inner_join(df)
Notes:
The main idea is to slice the rows after the first 10 occurs. any(y_value == 10) and else row_number() are just to take care of the case where there are no 10's in y_value.
Reading the documentation for ?which.max, you will notice that when it is applied to a logical vector, in this case y_value == 10, "with both FALSE and TRUE values, which.min(x) and which.max(x) return the index of the first FALSE or TRUE, respectively, as FALSE < TRUE."
In other words, which.max(y_value == 10) will give the index of the first occurrence of 10. By adding 1 to it, I can start slicing from the value right after the first occurrence of 10.
Result:
# A tibble: 16 × 4
gr sum rows y_value
<dbl> <dbl> <int> <dbl>
1 1 22 4 0
2 1 22 4 0
3 1 22 4 10
4 1 22 4 8
5 1 22 4 8
6 1 22 4 6
7 1 22 4 0
8 2 21 6 0
9 2 21 6 0
10 2 21 6 10
11 2 21 6 10
12 2 21 6 5
13 2 21 6 4
14 2 21 6 2
15 2 21 6 0
16 2 21 6 0
It's a bit convoluted, and I'm not positive it's what you're looking for, but it does match your output.
df %>%
  group_by(gr) %>%
  mutate(is_ten = cumsum(y_value == 10)) %>%
  filter(is_ten > 0) %>%
  filter(!(y_value == 10 & is_ten == 1)) %>%
  group_by(gr) %>%
  summarize(sum_rows_range = paste(sum(y_value), n(), sep = "/")) %>%
  right_join(df)
# A tibble: 16 x 3
gr sum_rows_range y_value
<dbl> <chr> <dbl>
1 1 22/4 0
2 1 22/4 0
3 1 22/4 10
4 1 22/4 8
5 1 22/4 8
6 1 22/4 6
7 1 22/4 0
8 2 21/6 0
9 2 21/6 0
10 2 21/6 10
11 2 21/6 10
12 2 21/6 5
13 2 21/6 4
14 2 21/6 2
15 2 21/6 0
16 2 21/6 0
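A more compact variant (my own sketch, assuming every group contains at least one 10, since match() returns NA otherwise): mark the rows strictly after the first 10 with row_number() > match(10, y_value), then summarise within mutate():

```r
library(dplyr)

df <- data.frame(gr = rep(c(1, 2), c(7, 9)),
                 y_value = c(c(0, 0, 10, 8, 8, 6, 0), c(0, 0, 10, 10, 5, 4, 2, 0, 0)))

out <- df %>%
  group_by(gr) %>%
  # TRUE for rows strictly after the first occurrence of 10 in each group
  mutate(after = row_number() > match(10, y_value),
         sum_rows_range = paste(sum(y_value[after]), sum(after), sep = "/")) %>%
  select(-after) %>%
  ungroup()
```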
