I have nested data that looks like this:
ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE
I'd like to create a column called counter in which, for each unique ID, the counter increments by one on each row up to and including the row where Behavior = TRUE.
I am expecting this result:
ID Date Behavior counter
1 1 FALSE 1
1 2 FALSE 2
1 3 TRUE 3
2 3 FALSE 1
2 5 FALSE 2
2 6 TRUE 3
2 7 FALSE
3 1 FALSE 1
3 2 TRUE 2
Ultimately, I would like to pull, for each unique ID, the minimum counter at which the TRUE observation occurs. However, I'm having trouble developing a solution for this current counter issue.
Any and all help is greatly appreciated!
I'd like to create a counter within each group of unique IDs and, from there, ultimately pull the row-level info. The underlying question is: how long, on average, does it take to reach a TRUE?
I sense there might be an XY problem going on here. You can answer your latter question directly, like so:
> library(plyr)
> mean(daply(d, .(ID), function(grp) min(which(grp$Behavior))))
[1] 2.666667
(where d is your data frame.)
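If you prefer dplyr (plyr has since been retired), a sketch of the same computation, assuming d is again your data frame, might look like this:
library(dplyr)
d %>%
  group_by(ID) %>%
  summarise(first_true = which(Behavior)[1]) %>%   # row within each ID at which Behavior first turns TRUE
  summarise(avg_rows_to_true = mean(first_true))   # should reproduce the 2.666667 above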
Here's a dplyr solution that finds the row number for each TRUE in each ID:
library(dplyr)
newdf <- yourdataframe %>%
  group_by(ID) %>%
  summarise(ftrue = which(Behavior))
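To also get the counter column the question originally asks for, one possible extension of the same pipeline (a sketch, assuming every ID eventually reaches a TRUE, as in the example data) would be:
library(dplyr)
newdf <- yourdataframe %>%
  group_by(ID) %>%
  mutate(counter = if_else(row_number() <= which(Behavior)[1],
                           row_number(),
                           NA_integer_)) %>%   # count up to and including the first TRUE, NA afterwards
  ungroup()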
do.call(rbind, by(df, list(df$ID), function(x) {
  n <- nrow(x)
  m <- which(x$Behavior)
  data.frame(x, Counter = c(1:m, rep(NA, n - m)))
}))
ID Date Behavior Counter
1.1 1 1 FALSE 1
1.2 1 2 FALSE 2
1.3 1 3 TRUE 3
2.4 2 3 FALSE 1
2.5 2 5 FALSE 2
2.6 2 6 TRUE 3
2.7 2 7 FALSE NA
3.8 3 1 FALSE 1
3.9 3 2 TRUE 2
df = read.table(text = "ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE", header = T)
I have an R dataframe where one of the columns is a comma-delimited string. I want to add a new column to the dataset to show whether the column contains a particular value.
For example
> data <- data.frame(a = 1:5, b = c("123", "6475,320", "475", "905,1204,543", "567,475"))
> data
a b
1 1 123
2 2 6475,320
3 3 475
4 4 905,1204,543
5 5 567,475
I want to create a new column to indicate whether b contains 475, which would leave me with
a b has_475
1 1 123 FALSE
2 2 6475,320 FALSE
3 3 475 TRUE
4 4 905,1204,543 FALSE
5 5 567,475 TRUE
You can use word boundaries ('\\b') to look for the number. This ensures that values like 1475 or 24756 are not matched (the output below includes an extra 1475 row to demonstrate):
data$has_475 <- grepl('\\b475\\b', data$b)
data
a b has_475
1 1 123 FALSE
2 2 6475,320 FALSE
3 3 475 TRUE
4 4 905,1204,543 FALSE
5 5 567,475 TRUE
6 6 1475 FALSE
You can use this regular expression, which matches 475 only when it is bounded by the start or end of the string or by a comma:
data["has_475"] = grepl("(^|,)475(,|$)", data$b)
Output:
a b has_475
1 1 123 FALSE
2 2 6475,320 FALSE
3 3 475 TRUE
4 4 905,1204,543 FALSE
5 5 567,475 TRUE
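If you would rather avoid regular expressions, a sketch of an exact-membership check based on splitting the string, assuming the data frame is named data as above, could be:
data$has_475 <- vapply(strsplit(as.character(data$b), ","),
                       function(parts) "475" %in% parts,   # exact match against each comma-separated token
                       logical(1))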
I have a data frame with two columns in R, and I want to create a third column that rolls by 2 over both columns and checks whether a condition is satisfied, as described in the table below.
The condition is a rolling ifelse and goes like this:
IF -A1<B3<A1 TRUE ELSE FALSE
IF -A2<B4<A2 TRUE ELSE FALSE
IF -A3<B5<A3 TRUE ELSE FALSE
IF -A4<B6<A4 TRUE ELSE FALSE
A   B   CHECK
1   4   NA
2   5   NA
3   6   FALSE
4   1   TRUE
5  -4   FALSE
6   1   TRUE
How can I do it in R? Is there a base R function for this, or a way within the dplyr framework?
Since R is vectorized, you can do that with one command, using for instance dplyr::lag:
library(dplyr)
df %>%
  mutate(CHECK = -lag(A, n = 2) < B & lag(A, n = 2) > B)
A B CHECK
1 1 4 NA
2 2 5 NA
3 3 6 FALSE
4 4 1 TRUE
5 5 -4 FALSE
6 6 1 TRUE
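Since the question also asked about base R, a sketch that builds the two-row lag by hand, assuming the data frame is called df as above, might be:
lagA <- c(NA, NA, head(df$A, -2))        # A shifted down by two rows
df$CHECK <- -lagA < df$B & df$B < lagA   # TRUE when B lies strictly between -A and A from two rows up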
In the following dataset, which is sorted by Time and grouped by ID, I want to remove all rows starting at the first instance where Var is TRUE. Put differently, I want to subset the rows for each ID to those which are FALSE up until the first TRUE, sorted by Time.
ID <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data = data.frame(ID, Time, Var)
data
ID Time Var
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 A 6 TRUE
5 B 6 TRUE
6 C 6 FALSE
7 A 9 TRUE
8 B 9 TRUE
9 C 9 FALSE
10 A 12 TRUE
11 B 12 FALSE
12 C 12 TRUE
The desired result for this data frame should be:
ID Time Var
A 3 FALSE
B 3 FALSE
C 3 FALSE
C 6 FALSE
C 9 FALSE
Note that the solution should not only remove rows where Var == TRUE, but should also remove rows where Var == FALSE that follow (in Time) an earlier instance of Var == TRUE for that ID.
I've tried many different things but can't seem to figure this out. Any help is much appreciated!
Here's how to do that with dplyr using group_by and cumsum.
The rationale is that Var is a logical vector where FALSE is equal to 0 and TRUE is equal to 1. cumsum will remain at 0 until it hits the first TRUE.
library(dplyr)
data %>%
  group_by(ID) %>%
  filter(cumsum(Var) < 1)
ID Time Var
<fctr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
Here's the equivalent code with data.table (note that data first needs to be converted with setDT):
library(data.table)
setDT(data)
data[data[, .I[cumsum(Var) < 1], by = ID]$V1]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
This data.table solution should work.
library(data.table)
> setDT(data)[, .SD[1:(which.max(Var)-1)], by=ID]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
Given that you want all the values up to the first TRUE value, which.max is the way to go.
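One caveat: if a group's first row is already TRUE, or if a group contains no TRUE at all, 1:(which.max(Var)-1) evaluates to 1:0 = c(1, 0) and keeps the wrong rows. A more defensive sketch could be:
library(data.table)
# seq_len(0) drops the whole group when its first row is TRUE; groups with no TRUE are kept whole
setDT(data)[, if (any(Var)) .SD[seq_len(which.max(Var) - 1L)] else .SD, by = ID]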
You can do this with the cumall verb as well:
library(dplyr)
data %>%
  dplyr::group_by(ID) %>%
  dplyr::filter(dplyr::cumall(!Var))
ID Time Var
<chr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
cumall(!x): all cases until the first TRUE
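For completeness, a base R sketch of the same cumulative-sum idea, assuming (as in the example) that rows are already ordered by Time within each ID, could be:
data[ave(as.integer(data$Var), data$ID, FUN = cumsum) == 0, ]   # keep only the rows before each ID's first TRUE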
With data like the table below, I'm trying to match each test column (test_A, etc.) to its corresponding time column (time_A, etc.) wherever the test is TRUE, and then find the minimum of all the TRUE test times.
[ID] [time_A] [time_B] [time_C] [test_A] [test_B] [test_C] [min_true_time]
[1,] 1 2 3 4 FALSE TRUE FALSE ?
[2,] 2 -4 5 6 TRUE TRUE FALSE ?
[3,] 3 6 1 -2 TRUE TRUE TRUE ?
[4,] 4 -2 3 4 TRUE FALSE FALSE ?
My actual data set is quite large so my attempts at if and for loops have failed miserably. But I can't make any progress on an apply function.
A more negative time, say -2, would be considered the minimum for row 3.
Any suggestions are welcomed gladly
You don't give much information, but I think this does what you need. No idea if it is efficient enough, since you don't say how big your dataset actually is.
#I assume your data is in a data.frame:
df <- read.table(text="ID time_A time_B time_C test_A test_B test_C
1 1 2 3 4 FALSE TRUE FALSE
2 2 -4 5 6 TRUE TRUE FALSE
3 3 6 1 -2 TRUE TRUE TRUE
4 4 -2 3 4 TRUE FALSE FALSE")
#loop over all rows and subset column 2:4 with column 5:7, then take the mins
df$min_true_time <- sapply(1:nrow(df), function(i) min(df[i,2:4][unlist(df[i,5:7])]))
df
# ID time_A time_B time_C test_A test_B test_C min_true_time
#1 1 2 3 4 FALSE TRUE FALSE 3
#2 2 -4 5 6 TRUE TRUE FALSE -4
#3 3 6 1 -2 TRUE TRUE TRUE -2
#4 4 -2 3 4 TRUE FALSE FALSE -2
Another way, which might be faster (I'm not in the mood for benchmarking):
m <- as.matrix(df[,2:4])
m[!df[,5:7]] <- NA
df$min_true_time <- apply(m,1,min,na.rm=TRUE)
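A fully vectorized sketch of the same masking idea, using pmin for the row-wise minima (note that a row whose tests are all FALSE would come back as NA rather than Inf), might be:
m <- as.matrix(df[, 2:4])
m[!as.matrix(df[, 5:7])] <- NA                                        # blank out times whose test is FALSE
df$min_true_time <- do.call(pmin, c(as.data.frame(m), na.rm = TRUE))  # row-wise minimum over the remaining times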
I have searched exhaustively for a direct R translation for the FIRST. and LAST. pointers in SAS DATA steps but can't seem to find one. For those not familiar with SAS, FIRST. is a boolean that identifies the first appearance of a given element in a table and LAST. is a boolean that identifies the last appearance. For instance, consider the following sorted table:
V1 V2 V3
1 1 1
1 1 2
1 2 3
1 2 4
2 3 5
2 3 6
2 4 7
2 4 8
3 5 9
3 5 10
3 6 11
3 6 12
Because SAS DATA steps read tables line by line, I can use a statement like:
IF FIRST.V1 THEN DO ...
FIRST.V1 will return TRUE if and only if this is the first time the observation has been encountered in V1. In other words, it will return true for V1[1] (the first appearance of '1'), V1[5] (the first appearance of '2'), and V1[9] (the first appearance of '3'). The LAST. pointer functions in analogous fashion, but with the final appearance of that element.
Is there anything in R that emulates this?
You can do this with duplicated and rev (for LAST):
> v1=c(1,1,1,2,2,3,3,3,3,4,4,5)
> data.frame(v1,FIRST=!duplicated(v1),LAST=rev(!duplicated(rev(v1))))
v1 FIRST LAST
1 1 TRUE FALSE
2 1 FALSE FALSE
3 1 FALSE TRUE
4 2 TRUE FALSE
5 2 FALSE TRUE
6 3 TRUE FALSE
7 3 FALSE FALSE
8 3 FALSE FALSE
9 3 FALSE TRUE
10 4 TRUE FALSE
11 4 FALSE TRUE
12 5 TRUE TRUE
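A slightly tidier variant of the same idea: duplicated() has a fromLast argument, so the double rev() can be avoided. Applied to the V1 column of the question's table, assuming it is stored in a data frame called df:
df$FIRST <- !duplicated(df$V1)
df$LAST  <- !duplicated(df$V1, fromLast = TRUE)   # same as rev(!duplicated(rev(...)))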