Select rows with all longitudinal measurements in R

I have a longitudinal dataset with ID, Wave (waves 1-4), and Score. Here's sample data with the same structure. The original data has around 2000 rows covering about 500 participants, in long form.
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
5 1002 1 30
6 1002 3 30
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
11 1004 1 22
12 1005 1 20
13 1005 2 18
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
I would like to select the 'ID's with all four measurements of 'Score' available. In other words, I want to select the rows of the participants whose 'Score' is available for all 4 waves.
I've been trying to select rows with 'ID's that have data in all 'Wave's. My attempt so far has been based on this idea: if a participant has all four measurements, the ID will appear in the data four times.
That's why I tried to count the occurrences of each ID,
table(data$ID) == 4
and although it showed me how many times each ID appears in the data, I cannot use it to select the corresponding rows.
all.data <- subset(data, subset=table(data$ID) == 4)
This fails because the logical vector has one entry per ID, while the long-form data has one row per measurement: "Length of logical index must be 1 or 2637, not 828". I need the data in long form for further analysis, so I wish not to reshape it.

You can try ave(), which computes the group size for each ID and returns it aligned row by row with the original data (the result inherits Wave's integer type, hence the as.logical()):
df[as.logical(with(df, ave(Wave, ID, FUN = function(x) length(x) == 4))), ]
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
Or if you want to keep your basic idea, a slight modification of @jay.sf's code:
df[df$ID %in% names(which(table(df$ID) == 4)), ]
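If you prefer dplyr, a group-size filter along the same lines would be (a sketch, not from the original answers):
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n() == 4) %>%
  ungroup()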

I like your table() approach.
> table(d$ID) == 4
1001 1002 1003 1004 1005 1006
TRUE FALSE TRUE FALSE FALSE TRUE
The interesting IDs are in the names() though. So to get your code to work you could extract the IDs like so
subs <- names(which(table(d$ID) == 4))
and get your desired subset using %in%.
all.data <- subset(d, subset=d$ID %in% subs)
Result
> all.data
ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19
(BTW: Always check with ?<name> that you do not use an existing function name as an object name; this will save you a lot of trouble. In your case, type ?data in a fresh session before loading the object.)
Data
> dput(d)
structure(list(ID = c(1001L, 1001L, 1001L, 1001L, 1002L, 1002L,
1003L, 1003L, 1003L, 1003L, 1004L, 1005L, 1005L, 1006L, 1006L,
1006L, 1006L), Wave = c(1L, 2L, 3L, 4L, 1L, 3L, 1L, 2L, 3L, 4L,
1L, 1L, 2L, 1L, 2L, 3L, 4L), Score = c(28L, 27L, 28L, 26L, 30L,
30L, 30L, 30L, 29L, 28L, 22L, 20L, 18L, 22L, 23L, 25L, 19L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17"))

Instead of feeding table(data$ID) to subset directly, try
ID %in% names(table(data$ID)[table(data$ID)==4])
since table() gives you the number of occurrences of each ID as a named vector.
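For illustration, the intermediate objects look like this with the sample data (using the d object from the answer above):
tab <- table(d$ID)
tab
# 1001 1002 1003 1004 1005 1006
#    4    2    4    1    2    4
names(tab[tab == 4])
# [1] "1001" "1003" "1006"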

This is a quick data.table answer.
library(data.table)
dt <- structure(list(ID = c(1001, 1001, 1001, 1001, 1002, 1002, 1003,
1003, 1003, 1003, 1004, 1005, 1005, 1006, 1006, 1006, 1006),
Wave = c(1, 2, 3, 4, 1, 3, 1, 2, 3, 4, 1, 1, 2, 1, 2, 3,
4), Score = c(28, 27, 28, 26, 30, 30, 30, 30, 29, 28, 22,
20, 18, 22, 23, 25, 19)), row.names = c(NA, -17L), class = c("data.table",
"data.frame"))
dt[ , .(Score, N = uniqueN(.SD)) , by = list(ID), .SDcols = c("Wave")][N == 4,]
ID Score N
1: 1001 28 4
2: 1001 27 4
3: 1001 28 4
4: 1001 26 4
5: 1003 30 4
6: 1003 30 4
7: 1003 29 4
8: 1003 28 4
9: 1006 22 4
10: 1006 23 4
11: 1006 25 4
12: 1006 19 4
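Note that this keeps only Score and the group count and drops Wave. A variant that returns all columns for the qualifying groups (a sketch using the same dt) is:
dt[, if (uniqueN(Wave) == 4L) .SD, by = ID]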

For the sake of completeness, here are two data.table solutions. Both identify the IDs for which Wave has all four values 1 to 4. One approach uses subsetting, the other joining.
Subsetting
library(data.table)
setDT(df)[ID %in% df[, which(uniqueN(Wave) == 4L), by = ID]$ID]
ID Wave Score
1: 1001 1 28
2: 1001 2 27
3: 1001 3 28
4: 1001 4 26
5: 1003 1 30
6: 1003 2 30
7: 1003 3 29
8: 1003 4 28
9: 1006 1 22
10: 1006 2 23
11: 1006 3 25
12: 1006 4 19
Joining
library(data.table)
setDT(df)[df[, .N, .(ID, Wave)][, .N, ID][N == 4L, .(ID)], on = "ID"]
which returns the same result. (The inner chain first counts rows per ID/Wave combination, then counts distinct waves per ID, and keeps the IDs with N == 4; joining back on ID returns their rows.)
Data
library(data.table)
fread("
rn ID Wave Score
1 1001 1 28
2 1001 2 27
3 1001 3 28
4 1001 4 26
5 1002 1 30
6 1002 3 30
7 1003 1 30
8 1003 2 30
9 1003 3 29
10 1003 4 28
11 1004 1 22
12 1005 1 20
13 1005 2 18
14 1006 1 22
15 1006 2 23
16 1006 3 25
17 1006 4 19", drop = 1L)


Add multiple rows based on column string R

I have a data frame that looks like this...
patient_ID CBC CBN totindex samp index
120 1007 BLOQ BLOQ 7 8 1
121 1007 BLOQ BLOQ 8 9 1
122 1007 BLOQ BLOQ 9 10 1
123 1007 BLOQ BLOQ 10 11 1
124 1007 BLOQ BLOQ 11 12 1
125 1007 BLOQ BLOQ 12 15 4
126 1007 BLOQ BLOQ 13 16 1
127 1007 BLOQ BLOQ 14 17 1
128 1007 BLOQ BLOQ 15 18 1
129 1007 BLOQ BLOQ 16 19 1
130 1007 BLOQ BLOQ 17 20 1
I created the index column to mark rows where the samp column is not in consecutive numerical order relative to the previous row. For example, if one value is missing, I set that row's index to 3. In this case, two values (13 and 14) are missing from the samp column between rows 124 and 125, so I placed a 4. Likewise, gaps of three or four missing values are indexed as 5 or 6.
How can I insert 1, 2, 3, or 4 new blank rows at these discrepancies?
In this example, I would like this output...
patient_ID CBC CBN totindex samp index
120 1007 BLOQ BLOQ 7 8 1
121 1007 BLOQ BLOQ 8 9 1
122 1007 BLOQ BLOQ 9 10 1
123 1007 BLOQ BLOQ 10 11 1
124 1007 BLOQ BLOQ 11 12 1
125 1007 0 0 0 13 1
126 1007 0 0 0 14 1
127 1007 BLOQ BLOQ 12 15 1
128 1007 BLOQ BLOQ 13 16 1
129 1007 BLOQ BLOQ 14 17 1
130 1007 BLOQ BLOQ 15 18 1
131 1007 BLOQ BLOQ 16 19 1
132 1007 BLOQ BLOQ 17 20 1
Thus: add 2 rows of zeros before the row indexed 4, keep the patient_ID, change the 4 back to 1, and add 13 and 14 to the samp column.
I have tried a for loop like this...
nidx4 <- as.numeric(rownames(df[grep("4", df$index), ]))
dfnew <- data.frame()
for (idx in 1:length(nidx4)) {
  if (idx == 1) {
    df1 <- df[1:(nidx4[idx] - 1), ]
  } else if (idx == length(nidx4)) {
    df1 <- df[nidx4[idx - 1]:nrow(df), ]
  } else {
    df1 <- df[(nidx4[idx - 1]):(nidx4[idx] - 1), ]
  }
  df1[nrow(df1) + 1, ] <- 0
  df1[nrow(df1) + 1, ] <- 0
  df1[nrow(df1) - 1, 21] <- df1[nrow(df1) - 2, 21] + 1
  df1[nrow(df1), 21] <- df1[nrow(df1) - 1, 21] + 1
  dfnew <- rbind(dfnew, df1)
}
for (row in 1:nrow(dfnew)) {
  if (dfnew[row, "index"] == 0) dfnew[row, "index"] <- 1
  if (dfnew[row, "index"] == 4) dfnew[row, "index"] <- 1
}
rownames(dfnew) <- NULL
df <- dfnew
But that only accounts for the first few indices. The last iteration of 4 in the index is not accounted for. Also, I had to change the loop code to this because, before, it was stopping at the last iteration of 4 and not including the rest of the data frame.
Any help would be great.
Edit + Answer
This worked for me once I added all of the column names that I wanted to fill (with tidyr and dplyr loaded):
library(tidyr)
library(dplyr)
dfnew <- complete(df, patient_ID, samp = full_seq(samp, period = 1),
fill = list("Sample_Name_(run_ID)" = "no_sample",
Sample_Name = "no_sample", THC = "0", OH_THC = "0",
THC_COOH = "0", THC_COO_gluc = "0", THC_gluc = "0",
CBD = "0", "6aOH_CBD" = "0", "7OH_CBD" = "0",
"6bOH_CBD" = "0", CBD_COOH = "0", CBD_gluc = "0",
CBC = "0", CBN = "0", CBG = "0", THCV = "0", CBDV = "0",
totindex = 0,
index = 1)) %>%
mutate(index = 1)
We may use complete
library(tidyr)
library(dplyr)
complete(df1, patient_ID, samp = full_seq(samp, period = 1),
fill = list(CBC = "0", CBN = "0", totindex = 0, index = 1)) %>%
mutate(index = 1)
Output:
# A tibble: 13 × 6
patient_ID samp CBC CBN totindex index
<int> <dbl> <chr> <chr> <int> <dbl>
1 1007 8 BLOQ BLOQ 7 1
2 1007 9 BLOQ BLOQ 8 1
3 1007 10 BLOQ BLOQ 9 1
4 1007 11 BLOQ BLOQ 10 1
5 1007 12 BLOQ BLOQ 11 1
6 1007 13 0 0 0 1
7 1007 14 0 0 0 1
8 1007 15 BLOQ BLOQ 12 1
9 1007 16 BLOQ BLOQ 13 1
10 1007 17 BLOQ BLOQ 14 1
11 1007 18 BLOQ BLOQ 15 1
12 1007 19 BLOQ BLOQ 16 1
13 1007 20 BLOQ BLOQ 17 1
Or use
complete(df1, patient_ID, samp = full_seq(samp, period = 1)) %>%
mutate(across(CBC:CBN, replace_na, "0"),
totindex = replace_na(totindex, 0), index = 1)
Or use
complete(df1, patient_ID, samp = full_seq(samp, period = 1),
fill = setNames(c(as.list(rep('0', 2)), 0, 1),
names(df1)[c(2:4, 6)]))
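As a quick illustration, full_seq() fills in the missing values of a regular sequence, which is what creates the new rows here:
library(tidyr)
full_seq(c(8, 12, 15), period = 1)
# [1]  8  9 10 11 12 13 14 15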
data
df1 <- structure(list(patient_ID = c(1007L, 1007L, 1007L, 1007L, 1007L,
1007L, 1007L, 1007L, 1007L, 1007L, 1007L), CBC = c("BLOQ", "BLOQ",
"BLOQ", "BLOQ", "BLOQ", "BLOQ", "BLOQ", "BLOQ", "BLOQ", "BLOQ",
"BLOQ"), CBN = c("BLOQ", "BLOQ", "BLOQ", "BLOQ", "BLOQ", "BLOQ",
"BLOQ", "BLOQ", "BLOQ", "BLOQ", "BLOQ"), totindex = 7:17, samp = c(8L,
9L, 10L, 11L, 12L, 15L, 16L, 17L, 18L, 19L, 20L), index = c(1L,
1L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame",
row.names = c("120",
"121", "122", "123", "124", "125", "126", "127", "128", "129",
"130"))

Separate hour and minutes in R

I have a column for time, but it hasn't been separated by ":" or anything. It looks like this:
person time
1 356
1 931
1 2017
1 2103
2 256
2 1031
2 1517
2 2206
How do I separate them?
There are different ways of approaching the issue; which method you choose depends on your desired output.
For example, you could use stringr::str_split to split time into a list column of hours and minutes using a positive look-ahead, (?=\\d{2}$), which matches the position just before the final two digits
library(tidyverse)
df %>% mutate(time = str_split(time, "(?=\\d{2}$)"))
# person time
#1 1 3, 56
#2 1 9, 31
#3 1 20, 17
#4 1 2, 13
#5 2 2, 56
#6 2 10, 31
#7 2 15, 17
#8 2 2, 26
Or we can use tidyr::separate to create two new columns hours and minutes
df %>% separate(time, c("hours", "minutes"), sep = "(?=\\d{2}$)")
# person hours minutes
#1 1 3 56
#2 1 9 31
#3 1 20 17
#4 1 2 13
#5 2 2 56
#6 2 10 31
#7 2 15 17
#8 2 2 26
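Note that separate() creates character columns; if you need integer hours and minutes, you can pass its convert argument:
df %>% separate(time, c("hours", "minutes"), sep = "(?=\\d{2}$)", convert = TRUE)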
In response to your comment, you could use stringr::str_replace
df %>% mutate(time = str_replace(time, "(?=\\d{2}$)", ":"))
# person time
#1 1 3:56
#2 1 9:31
#3 1 20:17
#4 1 2:13
#5 2 2:56
#6 2 10:31
#7 2 15:17
#8 2 2:26
And the same in base R using sub
transform(df, time = sub("(?=\\d{2}$)", ":", time, perl = TRUE))
giving the same result.
Sample data
df <- read.table(text = "
person time
1 356
1 931
1 2017
1 213
2 256
2 1031
2 1517
2 226", header = T)
We can use strptime with sprintf in base R
df[c("hour", "min")] <- unclass(strptime(sprintf("%04d00", df$time),
"%H%M%S"))[c('hour', 'min')]
df
# person time hour min
#1 1 356 3 56
#2 1 931 9 31
#3 1 2017 20 17
#4 1 213 2 13
#5 2 256 2 56
#6 2 1031 10 31
#7 2 1517 15 17
#8 2 226 2 26
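The sprintf("%04d00", ...) call zero-pads each value to a fixed six-digit HHMMSS string before parsing, e.g.:
sprintf("%04d00", 356)
# [1] "035600"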
Or if you only need to insert a delimiter
tmp <- sub('(\\d{2})$', ':\\1', df$time)
tmp
#[1] "3:56" "9:31" "20:17" "2:13" "2:56" "10:31" "15:17" "2:26"
and then it can be separated into two columns with read.table
read.table(text = tmp, sep=":", header = FALSE, col.names = c('hour', 'min'))
data
df <- structure(list(person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), time = c(356L,
931L, 2017L, 213L, 256L, 1031L, 1517L, 226L)),
class = "data.frame", row.names = c(NA,
-8L))
Another possibility:
res<-strsplit(gsub("(\\d+(?=\\d{2,}))(\\d{1,})",
"\\1:\\2",df$time,perl = T),":")
df$Minutes <- sapply(res,"[[",2)
df$Hr <- sapply(res,"[[",1)
df
Result:
person time Minutes Hr
1 1 356 56 3
2 1 931 31 9
3 1 2017 17 20
4 1 2103 03 21
5 2 256 56 2
6 2 1031 31 10
7 2 1517 17 15
8 2 2206 06 22
Data:
df <-structure(list(person = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), time = c(356L,
931L, 2017L, 2103L, 256L, 1031L, 1517L, 2206L)), row.names = c(NA,
-8L), class = "data.frame")
If you want to show the time in HH:MM format, we can use sprintf with sub to insert a colon (:) in between
sub("(\\d{2})(\\d{2})", "\\1:\\2",sprintf("%04d", df$time))
#[1] "03:56" "09:31" "20:17" "21:03" "02:56" "10:31" "15:17" "22:06"

Moving average and moving slope in R

I am looking to separately calculate a 7-day moving average and 7-day moving slope of 'oldvar'.
My sincere apologies that I didn't add the details below in my original post. These are repeated observations for each id which can go from a minimum of 3 observations per id to 100 observations per id. The start day can be different for different IDs, and to make things complicated, the days are not equally spaced, so some IDs have missing days.
Here is the data structure. Please note that 'average' is the variable that I am trying to create as a moving 7-day average for each ID:
id day outcome average
1 1 15 100 NA
2 1 16 110 NA
3 1 17 190 NA
4 1 18 130 NA
5 1 19 140 NA
6 1 20 150 NA
7 1 21 160 140
8 1 22 100 140
9 1 23 180 150
10 1 24 120 140
12 2 16 90 NA
13 2 17 110 NA
14 2 18 120 NA
12 2 20 130 NA
15 3 16 110 NA
16 3 18 200 NA
17 3 19 180 NA
18 3 21 170 NA
19 3 22 180 168
20 3 24 210 188
21 3 25 160 180
22 3 27 200 184
Also, I would appreciate advice on how to calculate a moving 7-day slope on the same data.
Thank you and again many apologies for being unclear the first time around.
The real challenge is to create a data.frame after filling in the missing rows. One solution could be to use the zoo library. The rollapply function provides a way to assign NA values to the initial rows.
Using data from OP as is, the solution could be:
library(zoo)
library(dplyr)
# Data from OP
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
day = c(15L,16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 16L, 17L, 18L, 20L,
16L, 18L, 19L, 21L, 22L, 24L, 25L, 27L),
outcome = c(100L, 110L,190L, 130L, 140L, 150L, 160L, 100L, 180L, 120L, 90L, 110L, 120L,
130L, 110L, 200L, 180L, 170L, 180L, 210L, 160L, 200L)),
.Names = c("id", "day", "outcome"), row.names = c(NA, -22L), class = "data.frame")
# Build a complete grid of id and day so that no day is missing for any id
df_complete <- merge(
expand.grid(id=unique(df$id), day=min(df$day):max(df$day)),
df, all=TRUE)
# Valid range of day for each ID group
df_id_wise_range <- df %>% group_by(id) %>%
summarise(min_day = min(day), max_day = max(day)) %>% as.data.frame()
# id min_day max_day
# 1 1 15 24
# 2 2 16 20
# 3 3 16 27
# Join original df and df_complete and then use df_id_wise_range to
# filter it for valid range of day for each group
df_final <- df_complete %>%
left_join(df, by=c("id","day")) %>%
select(-outcome.y) %>%
inner_join(df_id_wise_range, by="id") %>%
filter(day >= min_day & day <= max_day) %>%
mutate(outcome = outcome.x) %>%
select( id, day, outcome) %>%
as.data.frame()
# Now apply mean to get average
df_average <- df_final %>% group_by(id) %>%
mutate(average= rollapply(outcome, 7, mean, na.rm = TRUE, by = 1,
fill = NA, align = "right", partial = 7)) %>% as.data.frame()
df_average
# The result
# id day outcome average
#1 1 15 100 NA
#2 1 16 110 NA
#3 1 17 190 NA
#4 1 18 130 NA
#5 1 19 140 NA
#6 1 20 150 NA
#7 1 21 160 140.0
#8 1 22 100 140.0
#9 1 23 180 150.0
#10 1 24 120 140.0
#11 2 16 90 NA
#12 2 17 110 NA
#13 2 18 120 NA
#....
#....
#19 3 19 180 NA
#20 3 20 NA NA
#21 3 21 170 NA
#22 3 22 180 168.0
#23 3 23 NA 182.5
#24 3 24 210 188.0
#25 3 25 160 180.0
#26 3 26 NA 180.0
#27 3 27 200 184.0
The steps to calculate the moving slope are:
First, create a function that returns the slope.
Then use that function as part of rollapplyr.
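(The z2 object used below is not defined in the answer; a toy data frame consistent with the printed output would be, for example:)
library(zoo)
z2 <- data.frame(a = 1:20, b = 21:40)
z2$mean_a <- rollapplyr(z2$a, 7, mean, fill = NA)  # 7-point moving average of a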
#Function to calculate slope
slop_e <- function(z) coef(lm(b ~ a, as.data.frame(z)))[[2]]
#Apply function
z2$slope <- rollapplyr(zoo(z2), 7, slop_e , by.column = FALSE, fill = NA, align = "right")
z2
a b mean_a slope
1 1 21 NA NA
2 2 22 NA NA
3 3 23 NA NA
4 4 24 NA NA
5 5 25 NA NA
6 6 26 NA NA
7 7 27 4 1
8 8 28 5 1
9 9 29 6 1
10 10 30 7 1
11 11 31 8 1
12 12 32 9 1
13 13 33 10 1
14 14 34 11 1
15 15 35 12 1
16 16 36 13 1
17 17 37 14 1
18 18 38 15 1
19 19 39 16 1
20 20 40 17 1

lapply alternative to for loop to append to data frame

I have a data frame:
df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L,
45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")
> head(df)
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
And I want to calculate pos[i+1] - pos[i] within the same chromosome (chrom).
By using a for loop over each chrom level, and another over each row, I get the expected results:
library(dplyr)  # filter() and arrange() below come from dplyr
for (c in levels(df$chrom)) {
  df_chrom <- filter(df, chrom == c)
  df_chrom <- arrange(df_chrom, df_chrom$pos)
  for (i in 1:nrow(df_chrom)) {
    dist <- df_chrom$pos[i+1] - df_chrom$pos[i]
    logdist <- log10(dist)
    cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
  }
}
However, I want to save this to a data frame, and think that lapply or apply is the right way to go about it. I can't work out how to make the pos[i+1] - pos[i] calculation though, seeing as lapply works on each row/column.
Any pointers would be appreciated
Here's the output from my solution:
chrom index pos dist log10dist
1 1 10 124 2.093422
1 2 134 66 1.819544
1 3 200 200 2.30103
1 4 400 200 2.30103
1 5 600 400 2.60206
1 6 1000 NA NA
2 1 20 13 1.113943
2 2 33 NA NA
3 1 40 5 0.69897
3 2 45 NA NA
4 1 50 5 0.69897
4 2 55 45 1.653213
4 3 100 23 1.361728
4 4 123 NA NA
We could do this using a grouped difference. Convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'chrom', order the 'pos', get the difference of 'pos' (diff), and also take the log of that difference.
library(data.table)
setDT(df)[order(pos), {v1 <- diff(pos)
.(index = seq_len(.N), pos = pos,
dist = c(v1, NA), logdiff = c(log10(v1), NA))}
, by = chrom]
# chrom index pos dist logdiff
# 1: 1 1 10 124 2.093422
# 2: 1 2 134 66 1.819544
# 3: 1 3 200 200 2.301030
# 4: 1 4 400 200 2.301030
# 5: 1 5 600 400 2.602060
# 6: 1 6 1000 NA NA
# 7: 2 1 20 13 1.113943
# 8: 2 2 33 NA NA
# 9: 3 1 40 5 0.698970
#10: 3 2 45 NA NA
#11: 4 1 50 5 0.698970
#12: 4 2 55 45 1.653213
#13: 4 3 100 23 1.361728
#14: 4 4 123 NA NA
Upon running the OP's code, the printed output is
#1 1 10 124 2.093422
#1 2 134 66 1.819544
#1 3 200 200 2.30103
#1 4 400 200 2.30103
#1 5 600 400 2.60206
#1 6 1000 NA NA
#2 1 20 13 1.113943
#2 2 33 NA NA
#3 1 40 5 0.69897
#3 2 45 NA NA
#4 1 50 5 0.69897
#4 2 55 45 1.653213
#4 3 100 23 1.361728
#4 4 123 NA NA
We split df by df$chrom (Note that we reorder both df and df$chrom before splitting). Then we go through each of the subgroups (the subgroups are called a in this example) using lapply. On the pos column of each subgroup, we calculate difference (diff) of consecutive elements and take log10. Since diff decreases the number of elements by 1, we add a NA to the end. Finally, we rbind all the subgroups together using do.call.
do.call(rbind, lapply(split(df[order(df$chrom, df$pos),], df$chrom[order(df$chrom, df$pos)]),
        function(a) data.frame(a, logdist = c(log10(diff(a$pos)), NA))))
# chrom pos logdist
#1.1 1 10 2.093422
#1.3 1 134 1.819544
#1.2 1 200 2.301030
#1.4 1 400 2.301030
#1.5 1 600 2.602060
#1.6 1 1000 NA
#2.7 2 20 1.113943
#2.8 2 33 NA
#3.9 3 40 0.698970
#3.10 3 45 NA
#4.11 4 50 0.698970
#4.12 4 55 1.653213
#4.13 4 100 1.361728
#4.14 4 123 NA
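For completeness, a dplyr sketch of the same grouped computation (not part of the original answers):
library(dplyr)
df %>%
  arrange(chrom, pos) %>%
  group_by(chrom) %>%
  mutate(dist = lead(pos) - pos, logdist = log10(dist)) %>%
  ungroup()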

Remove group of rows by flag indicator in R

I have a dataframe where I have groups of numbers in the unique3 column.
structure(list(unique1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("11/1/2016", "11/10/2016", "11/11/2016",
"11/12/2016", "11/13/2016", "11/14/2016", "11/15/2016", "11/16/2016",
"11/17/2016", "11/18/2016", "11/19/2016", "11/2/2016", "11/20/2016",
"11/21/2016", "11/22/2016", "11/23/2016", "11/24/2016", "11/25/2016",
"11/26/2016", "11/27/2016", "11/28/2016", "11/3/2016", "11/4/2016",
"11/5/2016", "11/6/2016", "11/7/2016", "11/8/2016", "11/9/2016"
),
class = "factor"), unique2 = c(21L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 31L, 41L), unique3 = c(100001L, 100001L, 100001L, 100001L,
100001L, 100001L, 100001L, 100001L, 100002L, 100003L),
flag = c(NA_integer_,1, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), value = c(1L,
6L, 18L, 19L, 22L, 29L, 30L, 32L, 1L, 1L)),
.Names = c("unique1","unique2", "unique3", "flag", "value"), row.names = c(NA, 10L), class = "data.frame")
unique1 unique2 unique3 flag value
1 11/1/2016 21 100001 NA 1
2 11/1/2016 21 100001 1 6
3 11/1/2016 21 100001 NA 18
4 11/1/2016 21 100001 NA 19
5 11/1/2016 21 100001 NA 22
6 11/1/2016 21 100001 NA 29
7 11/1/2016 21 100001 NA 30
8 11/1/2016 21 100001 NA 32
9 11/1/2016 31 100002 NA 1
10 11/1/2016 41 100003 NA 1
I basically need to group by unique3 so that, if any of the rows for 100001 had a 1 in flag, they would be removed. Note that 100001 may not be unique and may repeat for a different value of unique2.
What I would do is make all the flag values for that unique3 group equal to 1, like so
unique1 unique2 unique3 flag value
1 11/1/2016 21 100001 1 1
2 11/1/2016 21 100001 1 6
3 11/1/2016 21 100001 1 18
4 11/1/2016 21 100001 1 19
5 11/1/2016 21 100001 1 22
6 11/1/2016 21 100001 1 29
7 11/1/2016 21 100001 1 30
8 11/1/2016 21 100001 1 32
9 11/1/2016 31 100002 NA 1
10 11/1/2016 41 100003 NA 1
and then group by and filter to have:
unique1 unique2 unique3 flag value
1 11/1/2016 21 100001 1 1
2 11/1/2016 21 100001 1 6
3 11/1/2016 21 100001 1 18
4 11/1/2016 21 100001 1 19
5 11/1/2016 21 100001 1 22
6 11/1/2016 21 100001 1 29
7 11/1/2016 21 100001 1 30
8 11/1/2016 21 100001 1 32
For the first step (applying the flag uniformly to each group):
DF$flag <- ave(DF$flag, DF$unique3, FUN = function(x) max(c(0,x), na.rm=TRUE))
Then you can filter a few different ways. One option is:
subset(DF, flag == 1)
How it works
ave(v, g1, g2, g3, FUN = f) splits up vector v based on grouping variables; applies a function to each subvector; recombines to return a vector with the same class as v.
max(c(0,x), na.rm=TRUE) removes the NA values, adds a 0 value and then takes the max. If x only contains 1s and NAs, this will return a 1 if x contains any 1 and otherwise returns 0.
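For example, with the question's data as DF:
max(c(0, NA, NA), na.rm = TRUE)
# [1] 0
max(c(0, NA, 1), na.rm = TRUE)
# [1] 1
ave(DF$flag, DF$unique3, FUN = function(x) max(c(0, x), na.rm = TRUE))
# [1] 1 1 1 1 1 1 1 1 0 0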
Some alternatives with packages
library(data.table)
DT = setDT(copy(DF))
DT[, flag := max(c(0,flag), na.rm=TRUE), by=unique3][ flag == 1 ]
# or...
library(dplyr)
library(magrittr)  # needed for the %<>% assignment pipe
DF2 = DF
(DF2 %<>%
group_by(unique3) %>%
mutate(flag = max(c(0,flag), na.rm=TRUE))
) %>% filter(flag == 1)
(I'm only creating the DF2 and DT objects here so the code can be run directly without conflicting edits on DF.)
You should be able to do this with just dplyr. Here, I group_by, then use any to return whether or not any values in that column are "1". If you have more complicated criteria in your use case, you could include them here.
df %>%
group_by(unique3) %>%
mutate(newFlag = any(flag == 1, na.rm = TRUE))
returns:
unique1 unique2 unique3 flag value newFlag
<fctr> <int> <int> <dbl> <int> <lgl>
1 11/1/2016 21 100001 NA 1 TRUE
2 11/1/2016 21 100001 1 6 TRUE
3 11/1/2016 21 100001 NA 18 TRUE
4 11/1/2016 21 100001 NA 19 TRUE
5 11/1/2016 21 100001 NA 22 TRUE
6 11/1/2016 21 100001 NA 29 TRUE
7 11/1/2016 21 100001 NA 30 TRUE
8 11/1/2016 21 100001 NA 32 TRUE
9 11/1/2016 31 100002 NA 1 FALSE
10 11/1/2016 41 100003 NA 1 FALSE
where the column newFlag accomplishes what I think you are requesting. You can overwrite flag instead if you prefer.
You can use it to filter as such:
df %>%
group_by(unique3) %>%
mutate(newFlag = any(flag == 1, na.rm = TRUE)) %>%
filter(newFlag)
From your question, it is unclear whether you want to keep or discard groups that have a flag. If you want to remove them, use filter(!newFlag) instead. In either case, if you want to be rid of the new column after filtering, use select(-newFlag).
