How can I subset df by a pattern of consecutive rows of characters? In the example below, I'd like to subset the data that have history values of "TRUE", "FALSE", "TRUE" consecutively. The data below is a bit odd but you get the idea!
value <- c(1/1/16,1/2/16, 1/3/16, 1/4/16, 1/5/16, 1/6/16, 1/7/16, 1/8/16, 1/9/16, 1/10/16)
history <- c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")
df <- data.frame(value, history)
df
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
7 0.008928571 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE
I've tried grepl, but that works for character strings - not sequences of characters consecutively across rows.
The output would be the same df as above, but without row 7, as that doesn't follow the pattern mentioned.
You could do...
s = c("TRUE", "FALSE", "TRUE")
library(data.table)
w = as.data.table(embed(history, length(s)))[as.list(s), on=paste0("V", seq_along(s)), which=TRUE]
df$v <- FALSE
df$v[w + rep(seq_along(s)-1L, each=length(w))] <- TRUE
value history v
1 0.062500000 TRUE TRUE
2 0.031250000 FALSE TRUE
3 0.020833333 TRUE TRUE
4 0.015625000 TRUE TRUE
5 0.012500000 FALSE TRUE
6 0.010416667 TRUE TRUE
7 0.008928571 TRUE FALSE
8 0.007812500 TRUE TRUE
9 0.006944444 FALSE TRUE
10 0.006250000 TRUE TRUE
You can then filter with subset(df, v == TRUE).
This works using a data.table join, x[i, which=TRUE], which looks up i = as.list(s) in x = as.data.table(embed(history, length(s))) and reports which rows of x are matched:
> as.data.table(as.list(s))
V1 V2 V3
1: TRUE FALSE TRUE
> as.data.table(embed(history, length(s)))
V1 V2 V3
1: TRUE FALSE TRUE
2: TRUE TRUE FALSE
3: FALSE TRUE TRUE
4: TRUE FALSE TRUE
5: TRUE TRUE FALSE
6: TRUE TRUE TRUE
7: FALSE TRUE TRUE
8: TRUE FALSE TRUE
The w + rep(...) is the same idea as @GGrothendieck's outer(...), except here w contains the position of the start of a match, not the end.
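For reference, w here is just the set of rows of the embed() table above that equal the pattern, i.e. rows 1, 4 and 8. One caveat worth noting: embed() stores each window in reverse order (x[t], x[t-1], ..., x[t-k+1]); that is harmless here only because the pattern is a palindrome, so for a non-symmetric pattern you would join on as.list(rev(s)) instead.
w
# [1] 1 4 8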
The data in the question looks very strange, so we used the data in the Note at the end. If you really have a character vector or factor with values "TRUE" and "FALSE", it can readily be translated to logicals using:
df <- transform(df, history = history == "TRUE")
1) rollapply First define the pattern and then search for it using a moving window with rollapplyr. That gives a logical vector which is TRUE if it is the end of such a pattern match. Find the indexes of the TRUEs and include the prior two indexes as well. Finally perform the subset.
library(zoo)
pattern <- c(TRUE, FALSE, TRUE)
ix <- which(rollapplyr(df$history, length(pattern), identical, pattern, fill = FALSE))
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "-"))))
df[ix, ]
giving:
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE
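For reference, the intermediate rollapplyr() step (using the data from the Note) returns the following logical vector; it is TRUE at positions 3, 6 and 10, i.e. wherever a match ends:
rollapplyr(df$history, length(pattern), identical, pattern, fill = FALSE)
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE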
1a) magrittr The code in (1) could be expressed using magrittr as follows. (Solution (2) could also be expressed using magrittr following similar ideas.)
library(magrittr)
library(zoo)
df %>%
  extract(
    extract(., , "history") %>%
      rollapplyr(length(pattern), identical, pattern, fill = FALSE) %>%
      which %>%
      outer(seq_along(pattern) - 1L, "-") %>%
      sort %>%
      unique, )
2) gregexpr Using the pattern defined above, we convert it to a character string of 0s and 1s and also convert df$history to such a string. We can then use gregexpr to find the indexes of the first element of each match and then expand that to all indexes and subset. We get the same answer as before. This alternative uses no packages.
collapse <- function(x) paste0(x + 0, collapse = "")
ix <- gregexpr(collapse(pattern), collapse(df$history))[[1]]
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "+"))))
df[ix, ]
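For reference, the two collapse() calls turn the pattern and the history column into the strings "101" and "1011011101", and gregexpr() then reports the match starts 1, 4 and 8 (gregexpr finds non-overlapping matches, which is fine for this pattern and data):
collapse(pattern)     # "101"
collapse(df$history)  # "1011011101"
c(gregexpr(collapse(pattern), collapse(df$history))[[1]])  # 1 4 8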
Note
Lines <- "
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
7 0.008928571 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE"
df <- read.table(text = Lines)
An option using lag (lag here is dplyr's lag, not stats::lag):
library(dplyr)
df <- data.frame(value, history)
n <- grepl("TRUE, FALSE, TRUE", paste(lag(lag(history)), lag(history), history, sep = ", "))[-(1:2)]
cond <- n | lag(n) | lag(lag(n))
cond <- c(cond, cond[length(history) - 2], cond[length(history) - 2])
df[cond, ]
I am trying to loop across a vector of patterns/strings to both match strings within another column and assign the results to a column of the same name as the pattern being searched. A simple example is below.
I am aware that this example is trivial but it captures the minimal case that produces the error that I cannot resolve.
> library(rlang)
> library(stringr)
> library(dplyr)
> set.seed(5)
> df <- data.frame(
+ groupA = sample(x = LETTERS[1:6], size = 20, replace = TRUE),
+ id_col = 1:20
+ )
>
> mycols <- c('A','C','D')
>
> dfmatches <-
+ lapply(mycols, function(icol) {
+ data.frame(!!icol := grepl(pattern = icol, x = df$groupA))
+ }) %>%
+ cbind.data.frame()
This gives me the error:
Error: `:=` can only be used within a quasiquoted argument
The desired output would be a data.frame like the below:
> dfmatches
A C D
1 FALSE FALSE FALSE
2 FALSE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE FALSE
5 TRUE FALSE FALSE
6 FALSE FALSE FALSE
7 FALSE FALSE TRUE
8 FALSE FALSE FALSE
9 FALSE FALSE FALSE
10 TRUE FALSE FALSE
11 FALSE FALSE FALSE
12 FALSE TRUE FALSE
13 FALSE FALSE FALSE
14 FALSE FALSE TRUE
15 FALSE FALSE FALSE
16 FALSE FALSE FALSE
17 FALSE TRUE FALSE
18 FALSE FALSE FALSE
19 FALSE FALSE TRUE
20 FALSE FALSE FALSE
I've tried multiple variations using {{}} or !! rlang::sym() etc but cannot quite figure out the right syntax for this.
One option is to use map_dfc from purrr. Also, I think you don't need grepl, since here we are looking for an exact match and not a partial one.
library(dplyr)
library(purrr)
map_dfc(mycols, ~df %>% transmute(!!.x := groupA == .x))
In base R, we can do
setNames(do.call(cbind.data.frame,
                 lapply(mycols, function(x) df$groupA == x)), mycols)
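For completeness, the error in the question arises because := is only understood inside tidy-eval verbs such as mutate()/transmute(); plain data.frame() is not one of them. A minimal sketch that keeps the lapply() shape, moving the !!icol := ... into transmute() and combining the pieces with dplyr's bind_cols() (assumes dplyr is attached):
dfmatches <- lapply(mycols, function(icol) {
  # transmute() supports quasiquotation, so !!icol := ... is allowed here
  df %>% transmute(!!icol := grepl(pattern = icol, x = groupA))
}) %>%
  bind_cols()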
I have an example dataset looks like:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
category
1 A
2 B
3 C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to generate a logical variable that is TRUE when the category (variable) contains "theta" in it. However, I would like to assign the logical value FALSE when the cell value contains "X1" or "X2".
Here is what I did:
data$logic <- str_detect(data$category, "theta")
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta TRUE
5 X2_theta TRUE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Here, all cell values that contain "theta" get the logical value "TRUE".
Then, I wrote this below to just assign "FALSE" when the cell value has "X" in it.
data$logic <- ifelse(grepl("X", data$category), "FALSE", "TRUE")
> data
category logic
1 A TRUE
2 B TRUE
3 C TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
But this, of course, overwrote the previous application.
What I would like is to combine the two conditions:
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Any thoughts?
Thanks
We can create 'logic' by detecting the substring 'theta' at the end and not having 'X' ([^X]) as the starting (^) character:
library(dplyr)
library(stringr)
library(tidyr)
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"))
If we need to split the column into separate columns based on the conditions
data %>%
  mutate(logic = str_detect(category, "^[^X].*theta$"),
         category = case_when(logic ~ str_replace(category, "_", ","),
                              TRUE ~ as.character(category))) %>%
  separate(category, into = c("split1", "split2"), sep = ",", remove = FALSE)
# category split1 split2 logic
#1 A A <NA> FALSE
#2 B B <NA> FALSE
#3 C C <NA> FALSE
#4 X1_theta X1_theta <NA> FALSE
#5 X2_theta X2_theta <NA> FALSE
#6 AB,theta AB theta TRUE
#7 BC,theta BC theta TRUE
#8 CD,theta CD theta TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to have two grepl condition statements
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Not the cleanest in the world (since it adds 2 unnecessary cols) but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 cols if you want but I usually don't bother (I'm a messy coder haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing way more cleanly (as in, it doesn't require the extra cols). I definitely recommend that approach!
I am getting some unexpected behavior using %in% c() versus == c() to filter data on multiple conditions. I get incomplete results when using the == c() method. Is there a logical explanation for this behavior?
df <- data.frame(region = as.factor(c(1,1,1,2,2,3,3,4,4,4)),
value = 1:10)
library(dplyr)
filter(df, region == c(1,2))
filter(df, region %in% c(1,2))
# using base syntax
df[df$region == c(1,2),]
df[df$region %in% c(1,2),]
The results do not change if I convert 'region' to numeric.
I get incomplete results when using the == c() method. Is there a
logical explanation for this behavior?
That's kind of logical, let's see:
df$region == 1:2
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
df$region %in% 1:2
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
The reason is that in the first form you're trying to compare vectors of different lengths. As @lukeA said in his comment, this form is the same as (see implementation-of-standard-recycling-rules):
# 1 1 1 2 2 3 3 4 4 4 ## df$region
# 1 2 1 2 1 2 1 2 1 2 ## c(1,2) recycled to the same length
# T F T T F F F F F F ## equality of the corresponding elements
df$region == c(1,2,1,2,1,2,1,2,1,2)
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Where each value on the left hand side of the operator is tested with the corresponding value on the right hand side of the operator.
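As a side note, the recycling here is silent because the longer length (10) is an exact multiple of the shorter one (2); when it isn't, R at least warns. A small illustration with plain vectors:
x <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)
x == c(1, 2)     # silently recycled
x == c(1, 2, 3)  # warning: longer object length is not a multiple of shorter object length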
However, when you use df$region %in% 1:2, it behaves more like:
sapply(df$region, function(x) { any(x==1:2) })
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
That is, each value on the left-hand side is tested against the whole second vector, and TRUE is returned if there is at least one match.
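For reference, %in% is just a wrapper around match() (see ?match), which is why the length of the right-hand side never matters:
# "%in%" <- function(x, table) match(x, table, nomatch = 0L) > 0L
match(df$region, c(1, 2), nomatch = 0L) > 0L
#  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE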
I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.
I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part in a consistent 'streak' from the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?
In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
  x <- as.numeric(dat[i, 2:6])
  y <- max(grep(1, x))
  finalyear[i] <- y
  streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
Thanks!
We could use max.col and rowSums as a vectorized approach.
dat$finalyear <- max.col(dat[-1], 'last')
If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.
dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])
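A quick sketch of why that safeguard matters, using a hypothetical all-FALSE row (not present in the question's data):
x <- rbind(c(TRUE, FALSE, TRUE),
           c(FALSE, FALSE, FALSE))
max.col(x, 'last')                  # 3 3 -- ties pick the last column even for an all-FALSE row
max.col(x, 'last') * !!rowSums(x)   # 3 0 -- the all-FALSE row is zeroed out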
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
# ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
#2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
#3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
#4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
#5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Or, as suggested by @ColonelBeauvel, a one-liner (split across two lines here to keep it readable):
library(dplyr)
mutate(dat, finalyear = max.col(dat[-1], 'last'),
       streak = rowSums(dat[-1]) == finalyear)
For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})
Or option 2, based on comment by #Spacedman:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})
> dat
ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
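Relatedly, if you do keep a loop, pre-allocating the result vectors (instead of growing them as in the question) removes most of the penalty mentioned above; a sketch of that tweak:
finalyear <- integer(nrow(dat))
streak <- logical(nrow(dat))
for (i in seq_len(nrow(dat))) {
  x <- as.numeric(dat[i, 2:6])
  finalyear[i] <- max(which(x == 1))
  streak[i] <- sum(x) == finalyear[i]
}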
Here is a solution with dplyr and tidyr.
gather(data = dat, year, value, -ids) %>%
  mutate(year = as.integer(gsub("X", "", year))) %>%
  group_by(ids) %>%
  summarize(finalyear = last(year[value]),
            streak = !any(value[year <= finalyear] == FALSE))
output
ids finalyear streak
1 1 2001 TRUE
2 2 2003 FALSE
3 3 2003 TRUE
4 4 2003 FALSE
5 5 2002 TRUE
Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test that the rle has at most two runs and that the first value is TRUE:
> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$length)<=2})
>
> dat
ids X1999 X2000 X2001 X2002 X2003 streak
1 1 TRUE TRUE TRUE FALSE FALSE TRUE
2 2 TRUE FALSE TRUE TRUE TRUE FALSE
3 3 TRUE TRUE TRUE TRUE TRUE TRUE
4 4 FALSE FALSE TRUE TRUE TRUE FALSE
5 5 TRUE TRUE TRUE TRUE FALSE TRUE
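To see how the rle() test behaves, take row 2 (ids 2), the one that breaks its streak:
r <- c(TRUE, FALSE, TRUE, TRUE, TRUE)   # row for ids 2
rle(r)$lengths                          # 1 1 3 -- three runs, so the test fails
r[1] & length(rle(r)$lengths) <= 2      # FALSE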
There are probably loads of ways of working out finalyear; this just finds the last element of each row which is TRUE:
> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1 1 TRUE TRUE TRUE FALSE FALSE TRUE 3
2 2 TRUE FALSE TRUE TRUE TRUE FALSE 5
3 3 TRUE TRUE TRUE TRUE TRUE TRUE 5
4 4 FALSE FALSE TRUE TRUE TRUE FALSE 5
5 5 TRUE TRUE TRUE TRUE FALSE TRUE 4
I have a data.frame with a block of columns that are logicals, e.g.
> tmp <- data.frame(a=c(13, 23, 52),
+ b=c(TRUE,FALSE,TRUE),
+ c=c(TRUE,TRUE,FALSE),
+ d=c(TRUE,TRUE,TRUE))
> tmp
a b c d
1 13 TRUE TRUE TRUE
2 23 FALSE TRUE TRUE
3 52 TRUE FALSE TRUE
I'd like to compute a summary column (say: e) that is a logical AND over the whole range of logical columns. In other words, for a given row, if all b:d are TRUE, then e would be TRUE; if any b:d are FALSE, then e would be FALSE.
My expected result is:
> tmp
a b c d e
1 13 TRUE TRUE TRUE TRUE
2 23 FALSE TRUE TRUE FALSE
3 52 TRUE FALSE TRUE FALSE
I want to indicate the range of columns by indices, as I have a bunch of columns and the names are cumbersome. The following code works, but I'd rather use a vectorized approach to improve performance.
> tmp$e <- NA
> for(i in 1:nrow(tmp)){
+ tmp[i,"e"] <- all(tmp[i,2:(ncol(tmp)-1)]==TRUE)
+ }
> tmp
a b c d e
1 13 TRUE TRUE TRUE TRUE
2 23 FALSE TRUE TRUE FALSE
3 52 TRUE FALSE TRUE FALSE
Any way to do this without using a for loop to step through the rows of the data.frame?
You can use rowSums to loop over rows... and some fancy footwork to make it quasi-automated:
# identify the logical columns
boolCols <- sapply(tmp, is.logical)
# sum each row of the logical columns and
# compare to the total number of logical columns
tmp$e <- rowSums(tmp[,boolCols]) == sum(boolCols)
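With the example data this reproduces the expected e column:
tmp
#    a     b     c     d     e
# 1 13  TRUE  TRUE  TRUE  TRUE
# 2 23 FALSE  TRUE  TRUE FALSE
# 3 52  TRUE FALSE  TRUE FALSE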
By using rowSums in an ifelse statement, it can be achieved in one go:
tmp$e <- ifelse(rowSums(tmp[,2:4] == T) == 3, T, F)
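As a side note, since the b:d columns are already logical, the comparison to T and the ifelse() wrapper can be dropped; two equivalent one-liners (same assumption that columns 2:4 are the logical block):
tmp$e <- rowSums(tmp[, 2:4]) == 3
tmp$e <- Reduce(`&`, tmp[2:4])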