Remove rows starting at the first instance meeting a condition - R

In the following dataset, I want to remove, for each ID, every row from the first instance (sorted by Time) where Var is TRUE onward. Put differently, I want to subset each ID's rows to those that are FALSE up until the first TRUE, sorted by Time.
ID <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data = data.frame(ID, Time, Var)
data
ID Time Var
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 A 6 TRUE
5 B 6 TRUE
6 C 6 FALSE
7 A 9 TRUE
8 B 9 TRUE
9 C 9 FALSE
10 A 12 TRUE
11 B 12 FALSE
12 C 12 TRUE
The desired result for this data frame should be:
ID Time Var
A 3 FALSE
B 3 FALSE
C 3 FALSE
C 6 FALSE
C 9 FALSE
Note that the solution should not only remove rows where Var == TRUE, but should also remove rows where Var == FALSE that follow (in Time) another instance where Var == TRUE for that ID.
I've tried many different things but can't seem to figure this out. Any help is much appreciated!

Here's how to do that with dplyr using group_by and cumsum.
The rationale is that Var is a logical vector, where FALSE equals 0 and TRUE equals 1, so cumsum stays at 0 until it hits the first TRUE.
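As a quick illustration of that behavior (a toy vector, not from the original data):
cumsum(c(FALSE, FALSE, TRUE, FALSE, TRUE))
# [1] 0 0 1 1 2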
library(dplyr)
data %>%
  group_by(ID) %>%
  filter(cumsum(Var) < 1)
ID Time Var
<fctr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
Here's the equivalent code with data.table:
library(data.table)
setDT(data)  # convert the data.frame to a data.table first
data[data[, .I[cumsum(Var) < 1], by = ID]$V1]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE

This data.table solution should work.
library(data.table)
setDT(data)[, .SD[1:(which.max(Var)-1)], by=ID]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
Given that you want all the values up to the first TRUE value, which.max is the way to go.
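As a quick illustration of the building block (my note, not from the original answer): which.max on a logical vector returns the index of its first TRUE, but beware the edge cases.
which.max(c(FALSE, FALSE, TRUE, TRUE))  # 3: index of the first TRUE
which.max(c(FALSE, FALSE))              # 1: with no TRUE at all, it still returns 1
# so for a group with no TRUE (or a leading TRUE), 1:(which.max(Var)-1)
# degenerates to 1:0 and selects the wrong rows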

You can do this with dplyr's cumall helper as well:
library(dplyr)
data %>%
  dplyr::group_by(ID) %>%
  dplyr::filter(dplyr::cumall(!Var))
ID Time Var
<chr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
From the dplyr documentation: cumall(!x) keeps all cases until the first TRUE.
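A quick illustration (a toy vector, not from the original answer):
dplyr::cumall(!c(FALSE, FALSE, TRUE, FALSE))
# [1]  TRUE  TRUE FALSE FALSE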

Roll condition ifelse in R data frame

I have a data frame with two columns in R, and I want to create a third column that rolls by 2 over both columns and checks whether a condition is satisfied, as described in the table below.
The condition is a rolling ifelse and goes like this:
IF -A1<B3<A1 TRUE ELSE FALSE
IF -A2<B4<A2 TRUE ELSE FALSE
IF -A3<B5<A3 TRUE ELSE FALSE
IF -A4<B6<A4 TRUE ELSE FALSE
A  B  CHECK
1  4  NA
2  5  NA
3  6  FALSE
4  1  TRUE
5 -4  FALSE
6  1  TRUE
How can I do it in R? Is there a base R function for this, or one within the dplyr framework?
Since R is vectorized, you can do that with one command, using for instance dplyr::lag:
library(dplyr)
df %>%
  mutate(CHECK = -lag(A, n = 2) < B & lag(A, n = 2) > B)
A B CHECK
1 1 4 NA
2 2 5 NA
3 3 6 FALSE
4 4 1 TRUE
5 5 -4 FALSE
6 6 1 TRUE
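Since the question also asks about base R, here's a small sketch of the same idea without dplyr (my addition, building the 2-step lag by hand):
lagA <- c(NA, NA, head(df$A, -2))       # A shifted down by two rows
df$CHECK <- -lagA < df$B & df$B < lagA  # NA where the lag is undefined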

How to find values shared between groups in a data frame?

I have a tidy data.frame with two columns: exp and val. I want to find which values of val are shared among all different experiments.
df <- data.frame(exp = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'),
                 val = c(10, 20, 15, 10, 10, 15, 99, 2, 15, 20, 10, 4))
df
exp val
1 A 10
2 A 20
3 A 15
4 A 10
5 B 10
6 B 15
7 B 99
8 B 2
9 C 15
10 C 20
11 C 10
12 C 4
Expected result could be either a vector of values:
10, 15
or a column on the data frame telling if that value is shared:
exp val shared
<fct> <dbl> <lgl>
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
I was able to find an answer (see the self-answer below), but this seems like a common enough question that there must be a better way than the really hacky solution I came up with.
I tried to solve this problem in dplyr since that's what I'm familiar with, but I'm interested in any kind of solution.
Or you can group by val and then check whether the number of distinct exp for that val equals the number of distinct exp across the whole data frame:
df %>%
  group_by(val) %>%
  mutate(shared = n_distinct(exp) == n_distinct(.$exp))
# notice the first exp refers to exp for each group while .$exp refers
# to the overall exp column in the data frame
# A tibble: 12 x 3
# Groups: val [6]
# exp val shared
# <fct> <dbl> <lgl>
# 1 A 10 TRUE
# 2 A 20 FALSE
# 3 A 15 TRUE
# 4 A 10 TRUE
# 5 B 10 TRUE
# 6 B 15 TRUE
# 7 B 99 FALSE
# 8 B 2 FALSE
# 9 C 15 TRUE
#10 C 20 FALSE
#11 C 10 TRUE
#12 C 4 FALSE
Using base R you can use table:
a <- table(df)  # exp x val contingency table
as.numeric(colnames(a)[colSums(a > 0) == nrow(a)])
[1] 10 15
You can also do:
df %>%
  mutate(s = val %in% as.numeric(colnames(a)[colSums(a > 0) == nrow(a)]))
exp val s
1 A 10 TRUE
2 A 20 FALSE
3 A 15 TRUE
4 A 10 TRUE
5 B 10 TRUE
6 B 15 TRUE
7 B 99 FALSE
8 B 2 FALSE
9 C 15 TRUE
10 C 20 FALSE
11 C 10 TRUE
12 C 4 FALSE
Here is another base R solution:
x <- split(df$val, df$exp)
Reduce(intersect, x)
## [1] 10 15
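If you want the logical column from the expected output instead of the vector, you can feed that result to %in% (a small extension of this answer):
shared_vals <- Reduce(intersect, split(df$val, df$exp))
df$shared <- df$val %in% shared_vals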
We can go through the data.frame row by row and count, for each row's value, how many groups contain it.
To deal with possible repeat values, we have to use group_by %>% distinct to remove repeated values of val within groups. But then to get just the values of val as a vector, we need to ungroup %>% select(val) %>% unlist, which just seems needlessly complicated.
Finally, we can check whether the number of groups the value is found in equals the total number of groups.
df %>%
  rowwise() %>%
  mutate(num_groups = sum(group_by(., exp) %>%
                            distinct(val) %>%
                            ungroup() %>%
                            select(val) %>%
                            unlist() %in% val),
         shared = num_groups == length(unique(.$exp)))
# A tibble: 12 x 4
exp val num_groups shared
<fct> <dbl> <int> <lgl>
1 A 10 3 TRUE
2 A 20 2 FALSE
3 A 15 3 TRUE
4 A 10 3 TRUE
5 B 10 3 TRUE
6 B 15 3 TRUE
7 B 99 1 FALSE
8 B 2 1 FALSE
9 C 15 3 TRUE
10 C 20 2 FALSE
11 C 10 3 TRUE
12 C 4 1 FALSE

Transpose long to wide within groups with tidyverse

I can't quite work this one out.
How do I go from:
Visit Test
1 A
1 B
2 A
2 C
3 B
To:
Visit A B C
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE TRUE FALSE
With dplyr and tidyr you can do
dd %>%
  mutate(Value = TRUE) %>%
  spread(Test, Value, fill = FALSE)
# Visit A B C
# 1 1 TRUE TRUE FALSE
# 2 2 TRUE FALSE TRUE
# 3 3 FALSE TRUE FALSE
tested with
dd<-read.table(text="Visit Test
1 A
1 B
2 A
2 C
3 B", header=T)
Another option is to use reshape2::dcast with fun.aggregate to check if length is greater than 0.
library(reshape2)
dcast(df,Visit~Test, fun.aggregate = function(x)length(x)>0, value.var = "Test")
# Visit A B C
# 1 1 TRUE TRUE FALSE
# 2 2 TRUE FALSE TRUE
# 3 3 FALSE TRUE FALSE
Data:
df<-read.table(text="Visit Test
1 A
1 B
2 A
2 C
3 B",
header = TRUE, stringsAsFactors = FALSE)

Finding the minimums of groups of observations in R

I'm relatively new to R and struggle with "vectorizing" all my code, even though I appreciate that's the proper way to do it.
I need to set a value in a data frame to be the minimum RealTime for each ID, considering only the rows where isTrue is TRUE.
ID isTrue RealTime MinTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3
To be turned into:
ID isTrue RealTime MinTime
1 TRUE 16 10
1 FALSE 8
1 TRUE 10 10
2 TRUE 7 7
2 TRUE 30 7
3 FALSE 3
The following works perfectly. However, it takes 10 minutes to run, which isn't ideal:
for (i in 1:nrow(df)) {
  if (df[i, 'isTrue']) {
    prevTime <- sqldf(paste('select min(MinTime) from df where ID =', df[i, 'ID']))[1, 1]
    if (is.na(prevTime) | is.na(df[i, 'MinTime']) | df[i, 'MinTime'] < prevTime) {
      df[i, 'MinTime'] <- df[i, 'RealTime']
    } else {
      df[i, 'MinTime'] <- prevTime
    }
  }
}
How should I do this properly? I take it for loops are not the best way in R. I've been looking at the apply() and aggregate.data.frame() functions but can't make sense of how to apply them here. Can someone point me in the right direction? Much appreciated!!
Here is a two line base R solution using ave, pmax, and is.na.
# calculate minimum for each ID, excluding FALSE instances
df$MinTime <- ave(pmax(df$RealTime, (!df$isTrue) * max(df$RealTime)), df$ID, FUN=min)
# turn FALSE instances into NA
is.na(df$MinTime) <- (!df$isTrue)
which returns
df
ID isTrue RealTime MinTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA
In the first line, pmax is used to construct a vector that keeps each observation where df$isTrue is TRUE and substitutes the maximum RealTime value in the data.frame elsewhere. This new vector is used in the minimum calculation. The FALSE values are set to NA in the second line.
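As a quick illustration of the masking step, using the ID 1 rows (my illustration, with 30 being the overall maximum RealTime):
pmax(c(16, 8, 10), c(FALSE, TRUE, FALSE) * 30)       # 16 30 10: the FALSE row is masked up to the max
min(pmax(c(16, 8, 10), c(FALSE, TRUE, FALSE) * 30))  # 10: the masked value can no longer win the min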
data
df <- read.table(header=T, text="ID isTrue RealTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3")
It should be far faster with a dplyr chain. Here we group the data frame by both ID and isTrue and get the minima at that level. Then we can ungroup it again and simply blank out the minima for the FALSE rows.
library(dplyr)
df %>%
  group_by(ID, isTrue) %>%
  mutate(Min.all = min(RealTime)) %>%
  ungroup() %>%
  transmute(ID, isTrue, RealTime, MinTime = ifelse(isTrue == T, Min.all, ""))
Output:
# A tibble: 6 × 4
ID isTrue RealTime MinTime
<int> <lgl> <int> <chr>
1 1 TRUE 16 10
2 1 FALSE 8
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3
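Note that ifelse(isTrue == T, Min.all, "") coerces MinTime to character (hence the <chr> column above); if you'd rather keep it numeric, NA is the usual filler (a small variation on the chain above, my addition):
df %>%
  group_by(ID, isTrue) %>%
  mutate(MinTime = ifelse(isTrue, min(RealTime), NA)) %>%  # NA for the FALSE rows
  ungroup()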
I'd really recommend you get familiar with dplyr if you're going to be doing lots of data frame manipulation.
Someone suggested using the ave() function; the following works and is fast, although it returns a ton of warnings (min() gets called on the empty ID/isTrue combinations and warns that it is returning Inf):
df$MinTime <- ave(df$RealTime, df$ID, df$isTrue, FUN = min)
df$MinTime <- ifelse(df$isTrue, df$MinTime, NA)
The code in the question could be simplified by doing it all in SQL or all in R (appropriately vectorized) rather than half and half. There are already some R solutions so here is an SQL solution that shows that the problem amounts to aggregating a custom self-join.
library(sqldf)
sqldf("select a.*, min(b.RealTime) minRealTime
from df a
left join df b on a.ID = b.ID and a.isTRUE and b.isTRUE
group by a.rowid")
giving:
ID isTrue RealTime minRealTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA

If row meets criteria, then TRUE else FALSE in R

I have nested data that looks like this:
ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE
I'd like to create a column called counter in which, for each unique ID, the counter increments by one on each row until Behavior = TRUE.
I am expecting this result:
ID Date Behavior counter
1 1 FALSE 1
1 2 FALSE 2
1 3 TRUE 3
2 3 FALSE 1
2 5 FALSE 2
2 6 TRUE 3
2 7 FALSE
3 1 FALSE 1
3 2 TRUE 2
Ultimately, I would like to pull, for each unique ID, the counter value at which the TRUE observation occurs. However, I'm having trouble developing a solution for this current counter issue.
Any and all help is greatly appreciated!
I'd like to create a counter within each group of unique IDs and, from there, ultimately pull the row-level info; the question is how long on average it takes to reach a TRUE.
I sense there might be an XY problem going on here. You can answer your latter question directly, like so:
library(plyr)
mean(daply(d, .(ID), function(grp) min(which(grp$Behavior))))
[1] 2.666667
(where d is your data frame.)
Here's a dplyr solution that finds the row number for each TRUE in each ID:
library(dplyr)
newdf <- yourdataframe %>%
  group_by(ID) %>%
  summarise(ftrue = which(Behavior))
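From there, the average the question ultimately asks about is just the mean of that column (my extension, assuming exactly one TRUE per ID):
mean(newdf$ftrue)
# [1] 2.666667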
do.call(rbind, by(df, list(df$ID), function(x) {
  n <- nrow(x)            # rows in this ID's group
  m <- which(x$Behavior)  # row of the (single) TRUE
  data.frame(x, Counter = c(1:m, rep(NA, n - m)))  # count up to the TRUE, NA after
}))
ID Date Behavior Counter
1.1 1 1 FALSE 1
1.2 1 2 FALSE 2
1.3 1 3 TRUE 3
2.4 2 3 FALSE 1
2.5 2 5 FALSE 2
2.6 2 6 TRUE 3
2.7 2 7 FALSE NA
3.8 3 1 FALSE 1
3.9 3 2 TRUE 2
df = read.table(text = "ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE", header = T)
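For completeness, here is a dplyr sketch of the same counter (my addition, assuming one TRUE per ID as the other answers do):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(counter = ifelse(row_number() <= which.max(Behavior), row_number(), NA)) %>%
  ungroup()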
