Count string followed by separate string occurrences in R

I am trying to count the number of occurrences of a string followed by another string in R. I cannot seem to get the regex worked out to count this correctly.
As an example:
v <- c("F", "F", "C", "F", "F", "C", "F", "F")
b <- str_count(v, "F(?=C)")
I would like b to tell me how many times string F was followed by string C in vector v (which should equal 2).
I have successfully used str_count() to count single strings, but I cannot figure out how to count a string followed by a different string.
Also, I found that in regex (?=...) should indicate "followed by"; however, this does not seem to be sufficient.

You don't have one string. You have individual strings. Here you can test whether F is followed by C by shifting the vector, using [ for subsetting.
sum(v[-length(v)] == "F" & v[-1] == "C")
#sum(v == "F" & c(v[-1] == "C", FALSE)) #Alternative
#[1] 2
To use stringr::str_count you can paste v to one string.
stringr::str_count(paste(v, collapse = ""), "F(?=C)")
#[1] 2
And for rows of a data.frame:
set.seed(42)
v <- as.data.frame(matrix(sample(c("F", "C"), 25, TRUE), 5))
stringr::str_count(apply(v, 1, paste, collapse = ""), "F(?=C)")
#[1] 1 1 2 1 1
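The shifting idea above can also be wrapped in a small helper. This is only a sketch in base R: the name count_followed_by and its arguments are made up for illustration, and it assumes the vector contains no NA values.

```r
# Hypothetical helper (not from any package): counts how often `first`
# is immediately followed by `second` in a character vector.
count_followed_by <- function(v, first, second) {
  if (length(v) < 2) return(0L)
  # head(v, -1) drops the last element, tail(v, -1) drops the first,
  # so the two vectors line up each element with its successor.
  sum(head(v, -1) == first & tail(v, -1) == second)
}

v <- c("F", "F", "C", "F", "F", "C", "F", "F")
total <- count_followed_by(v, "F", "C")

# Row-wise use on a small example data frame
rows <- data.frame(s1 = c("F", "F", "F"),
                   s2 = c("C", "F", "C"),
                   s3 = c("C", "C", "F"),
                   s4 = c("F", "C", "C"),
                   stringsAsFactors = FALSE)
per_row <- apply(rows, 1, count_followed_by, first = "F", second = "C")
```

Here total is the count for the whole vector and per_row holds one count per row, without pasting anything into a single string.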

You can use lag() from dplyr:
library(dplyr)
sum(v == "C" & lag(v) == "F", na.rm = TRUE)
(The na.rm = TRUE is because the first value of lag(v) is NA).
Your comment notes that you're also interested in applying this across each row of a data frame. This can be done by pivoting the data to be longer, then applying a grouped mutate, then pivoting the data to be wider again. On an example dataset:
library(tidyr) # for pivot_longer() and pivot_wider()
example <- tibble(id = 1:3,
s1 = c("F", "F", "F"),
s2 = c("C", "F", "C"),
s3 = c("C", "C", "F"),
s4 = c("F", "C", "C"))
example %>%
pivot_longer(s1:s4) %>%
group_by(id) %>%
mutate(fc_count = sum(value == "C" & lag(value) == "F", na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
Result:
# A tibble: 3 x 6
id fc_count s1 s2 s3 s4
<int> <int> <chr> <chr> <chr> <chr>
1 1 1 F C C F
2 2 1 F F C C
3 3 2 F C F C
Note that this assumed the data had something like an id column that uniquely identifies each original row. If it doesn't, you can add one with mutate(id = row_number()) first.

Related

Compare two vectors within a data frame with %in% with R

I have the following data
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
Col1 Col2
a    a,b,c
b    aa,c,d
aa   c,d,e
d    d,f,g
I want to select the rows that contain a character from this vector c("a", "e", "g"), specifying the column:
library(dplyr)
T1 %>% filter(Col1 %in% c("a", "e", "g"))
It returns
1 a a,b,c
That is correct, but now I want to compare two vectors. For example:
With unlist and strsplit, I transform the value of each row to a character vector and try to compare it with the reference vector to select the rows that contain any of the values:
unlist(strsplit(T1$Col2[1],","))
[1] "a" "b" "c"
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
It gives me an error:
Error in filter():
! Problem while computing ..1 = unlist(strsplit(Col2, ",")) %in% c("a", "e", "g").
✖ Input ..1 must be of size 4 or 1, not size 12.
Run rlang::last_error() to see where the error occurred.
I can do it like this:
T1[grep(c("a|e|g"), T1$Col2),]
1 a a,b,c
2 b aa,c,d
3 aa c,d,e
4 d d,f,g
But that's wrong: row 2 (b aa,c,d) shouldn't be returned, because its Col2 contains aa, not a.
To search for the "a" alone, you would have to do:
T1[grep(c("\\<a\\>"), T1$Col2),]
I think I will end up making a mistake with this form; I would feel safer comparing vector with vector:
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
Edited answer
You can use \\b, the regular expression word boundary, so that a only matches as a whole element. The | acts as an "or" operation between the alternatives. You can use the following code:
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
library(dplyr)
library(stringr)
T1 %>%
filter(grepl("\\b(a|e|g)\\b", Col2))
#> Col1 Col2
#> 1 a a,b,c
#> 2 aa c,d,e
#> 3 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
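Note: inside an ordinary R string the backslash itself must be escaped, so the pattern is written "\\b"; "\b" would be a backspace character. With a raw string (R 4.0+), r"(\b(a|e|g)\b)" also works.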
old answer
It returns all rows because the pattern is matched as a substring anywhere in Col2: the "a" in "aa" matches in row 2, "e" matches in row 3, and "g" matches in row 4. You could also use str_detect like this:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(any(str_detect(Col2, paste0(vector, collapse="|"))))
#> Col1 Col2
#> 1 a a,b,c
#> 2 b aa,c,d
#> 3 aa c,d,e
#> 4 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
If you want to check whether one of the strings exists in both columns, you can use the following code:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(Reduce(`|`, across(all_of(colnames(T1)), ~str_detect(paste0(vector, collapse="|"), .x))))
#> Col1 Col2
#> 1 a a,b,c
Created on 2022-07-16 by the reprex package (v2.0.1)
Another way you could achieve this (using your original approach with strsplit) is to do it rowwise() and 'sum' the logical test.
T1 %>%
rowwise() %>%
filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)
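The same vector-against-vector comparison can be sketched in base R without rowwise(). The names targets and keep are just illustrative, and the snippet assumes Col2 is a character column with comma-separated elements and no surrounding spaces.

```r
T1 <- data.frame(Col1 = c("a", "b", "aa", "d"),
                 Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g"),
                 stringsAsFactors = FALSE)
targets <- c("a", "e", "g")

# Split each Col2 value into its elements, then test each row's elements
# against the reference vector; vapply returns one TRUE/FALSE per row.
keep <- vapply(strsplit(T1$Col2, ",", fixed = TRUE),
               function(parts) any(parts %in% targets),
               logical(1))

T1[keep, ]
```

Because the comparison is on whole elements, "aa" is never confused with "a", so no word-boundary regex is needed.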

R How to remap letters in a string

I’d be grateful for suggestions as to how to remap letters in strings in a map-specified way.
Suppose, for instance, I want to change all As to Bs, all Bs to Ds, and all Ds to Fs. If I do it like this, it doesn’t do what I want since it applies the transformations successively:
"abc" %>% str_replace_all(c(a = "b", b = "d", d = "f"))
Here’s a way I can do what I want, but it feels a bit clunky.
f <- function (str) str_c( c(a = "b", b = "d", c = "c", d = "f") %>% .[ strsplit(str, "")[[1]] ], collapse = "" )
"abc" %>% map_chr(f)
Better ideas would be much appreciated.
James.
P.S. Forgot to specify. Sometimes I want to replace a letter with multiple letters, e.g., replace all As with the string ZZZ.
P.P.S. Ideally, this would be able to handle vectors of strings too, e.g., c("abc", "gersgaesg", etc.)
We could use chartr in base R
chartr("abc", "bdf", "abbbce")
#[1] "bdddfe"
Or a package solution would be mgsub which would also match and replace strings with number of characters greater than 1
library(mgsub)
mgsub("abbbce", c("a", "b", "c"), c("b", "d", "f"))
#[1] "bdddfe"
mgsub("abbbce", c("a", "b", "c"), c("ba", "ZZZ", "f"))
#[1] "baZZZZZZZZZfe"
Maybe this is more elegant? It will also return warnings when values aren't found.
library(plyr)
library(tidyverse)
mappings <- c(a = "b", b = "d", d = "f")
str_split("abc", pattern = "") %>%
unlist() %>%
mapvalues(from = names(mappings), to = mappings) %>%
str_c(collapse = "")
# The following `from` values were not present in `x`: d
# [1] "bdc"
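For the P.S. and P.P.S. cases (multi-letter replacements and vectors of strings), a base R sketch along the same split-and-look-up lines might look like this. The function name remap_letters is hypothetical; it assumes the mapping is a named character vector whose names are single characters.

```r
# Hypothetical helper: remaps every character of every string in one pass,
# so replacements are never applied on top of each other.
remap_letters <- function(x, mapping) {
  vapply(strsplit(x, ""), function(chars) {
    hit <- chars %in% names(mapping)     # characters that have a mapping
    chars[hit] <- mapping[chars[hit]]    # look each one up by name
    paste(chars, collapse = "")          # unmapped characters stay as they are
  }, character(1))
}

mapped <- remap_letters(c("abc", "dad"), c(a = "ZZZ", b = "d", d = "f"))
```

Each original character is looked up exactly once, so a b produced by an earlier replacement is not turned into a d afterwards.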

Apply filter criteria to variables that contain/start with certain string in R

I am trying to find a way to filter a data frame by criteria applied to variables whose names contain a certain string.
In the example below,
I want to find the subjects for whom any test result contains "d".
d=structure(list(ID = c("a", "b", "c", "d", "e"), test1 = c("a", "b", "a", "d", "a"), test2 = c("a", "b", "b", "a", "s"), test3 = c("b", "c", "c", "c", "d"), test4 = c("c", "d", "a", "a", "f")), class = "data.frame", row.names = c(NA, -5L))
I can use dplyr and write one by one using | which works for small examples like this but for my real data will be time consuming.
library(dplyr)
library(stringr)
d %>% filter(str_detect(d$test1, "d") | str_detect(d$test2, "d") | str_detect(d$test3, "d") | str_detect(d$test4, "d"))
the output I get shows that subjects b, d and e meet the criteria:
ID test1 test2 test3 test4
1 b b b c d
2 d d a c a
3 e a s d f
The output is what I need, but I am looking for an easier way, for example applying the filter criteria to all variables whose names contain the word "test".
I know about the contains() function in dplyr for selecting certain variables, and I tried it here, but it does not work:
d %>% filter(str_detect(contains("test"), "d"))
Is there a way to write this code differently, or is there another way to achieve the same goal?
thank you
In base R you can use lapply/sapply :
d[Reduce(`|`, lapply(d[-1], grepl, pattern = 'd')), ]
#d[rowSums(sapply(d[-1], grepl, pattern = 'd')) > 0, ]
# ID test1 test2 test3 test4
#2 b b b c d
#4 d d a c a
#5 e a s d f
If you are interested in a dplyr solution, you can use any of the methods below:
library(dplyr)
library(stringr)
#1.
d %>%
filter_at(vars(starts_with('test')), any_vars(str_detect(., 'd')))
#2.
d %>%
rowwise() %>%
filter(any(str_detect(c_across(starts_with('test')), 'd')))
#3.
d %>%
filter(Reduce(`|`, across(starts_with('test'), str_detect, 'd')))
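The base R version above selects the test columns by position (d[-1]). If the columns should instead be picked by name, as the question asks, a sketch could look like the following; the names test_cols and hits are only illustrative.

```r
d <- data.frame(ID    = c("a", "b", "c", "d", "e"),
                test1 = c("a", "b", "a", "d", "a"),
                test2 = c("a", "b", "b", "a", "s"),
                test3 = c("b", "c", "c", "c", "d"),
                test4 = c("c", "d", "a", "a", "f"),
                stringsAsFactors = FALSE)

# Logical index of the columns whose name contains "test"
test_cols <- grepl("test", names(d), fixed = TRUE)

# One logical column per test variable: does the value contain "d"?
hits <- vapply(d[test_cols],
               function(col) grepl("d", col, fixed = TRUE),
               logical(nrow(d)))

# Keep rows where at least one test column matched
result <- d[rowSums(hits) > 0, ]
```

This keeps working if more non-test columns are added later, since nothing depends on the column order.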

How can I make a data.frame using chr vector [duplicate]

I have the data.frame below.
> Chr Chr
> A E
> A F
> A E
> B G
> B G
> C H
> C I
> D E
I want to convert the dataset into the form below.
That is, I want to collapse all chr values of each group into a single row.
chr chr
A E,F
B G
C H,I
D E
They are all characters, and I tried several things to get this result.
First, I took the unique values of the first column with FILTER <- unique(chr[,15]) and tried to subset the data by
FILTER, combining the results with rbind or bind_rows.
Second, I tested whether my idea works:
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
result = unique(Top10Data[TGT == FILTER[i]]$`NM`)
print(result)
}
Up to this stage, it seems to be working well.
The problem is that when I use either function, the resulting data frame has only one column and ignores the other values of the second variable.
The functions work when a group has a single value (chr[1,1]), but not when a group has several values (chr[1,n]), which cannot be coerced.
Here's my code for your reference.
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
CGONM <- rbind(NN,unique(Top10Data[TGT == FILTER[i]]$`NM`))
}
Base R solutions:
# Solution 1:
df_str_agg1 <- aggregate(var2~var1, df, FUN = function(x){
paste0(unique(x), collapse = ",")})
# Solution 2:
df_str_agg2 <- data.frame(do.call("rbind",lapply(split(df, df$var1), function(x){
data.frame(var1 = unique(x$var1),
var2 = paste0(unique(x$var2), collapse = ","))
}
)
),
row.names = NULL
)
Tidyverse solution:
library(tidyverse)
df_str_agg3 <-
df %>%
group_by(var1) %>%
summarise(var2 = str_c(unique(var2), collapse = ",")) %>%
ungroup()
Data:
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
var2 = c("E", "F", "E", "G", "G", "H", "I", "E"), stringsAsFactors = FALSE)
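Another base R variant, sketched here with tapply on the same data, may be easier to read than the split/lapply version. The names collapsed and result are only illustrative.

```r
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
                 var2 = c("E", "F", "E", "G", "G", "H", "I", "E"),
                 stringsAsFactors = FALSE)

# For each var1 group, paste the unique var2 values into one string
collapsed <- tapply(df$var2, df$var1,
                    function(x) paste(unique(x), collapse = ","))

# tapply returns a named 1-d array; turn it back into a data frame
result <- data.frame(var1 = names(collapsed),
                     var2 = as.vector(collapsed),
                     stringsAsFactors = FALSE)
```

Note that tapply orders the groups by the sorted values of var1, which happens to match the original order in this example.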

Replace strings in variable using lookup vector

I have a dataframe df with a character variable and the fromvec and tovec.
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use strings in fromvec, check them in df and then replace them with the corresponding strings in tovec so that "A" in df gets replaced with "X", "B" with "Y" and so on to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried the following, but I am not getting the desired result:
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
i.e. this
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr :
df$var <- chartr(paste(fromvec,collapse=""),
paste(tovec,collapse=""),
toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
Or we can use recode
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))
The correct way to do this with stringr is with str_replace_all:
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, @Moody_Mudskipper!)
We can use base R
with(df, ifelse(toupper(var) %in% fromvec,
setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can be also written in two lines by creating a logical condition
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
It's not clear whether the result should be case insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished using JOINs. In this case, it also cements a good practice of tracking your changes in a separate dataframe.
Unfortunately, the tidyverse has no "update dataframe" function....a glaring omission. That means tidyverse-ers must use a work-around, coalesce.
#JOIN Operation
tibble(fromvec, tovec) %>% #< dataframe of changes
right_join(df, by = c("fromvec" = "var")) %>% #< join operation
transmute(var = coalesce(tovec, fromvec)) #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
right_join(df %>% mutate(var = (str_to_upper(var))), #<modify case
by = c("fromvec" = "var")) %>%
transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y
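The lookup at the heart of these answers can also be sketched with match() in base R, which returns the position of each value in fromvec (or NA when there is none). The names idx, out_ci and out_cs are only illustrative, and a plain character vector stands in for df$var.

```r
var     <- c("A", "B", "C", "a", "E", "D", "b")
fromvec <- c("A", "B", "C")
tovec   <- c("X", "Y", "Z")

# Case-insensitive lookup: position of each upper-cased value in fromvec
idx <- match(toupper(var), fromvec)
# Replace where a match exists, otherwise keep the original value
out_ci <- ifelse(is.na(idx), var, tovec[idx])

# Case-sensitive variant: only exact matches are replaced
idx_cs <- match(var, fromvec)
out_cs <- ifelse(is.na(idx_cs), var, tovec[idx_cs])
```

Like the join, this treats the pair fromvec/tovec as a lookup table, and it handles replacement strings of any length, which chartr cannot.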
