Compare two vectors within a data frame with %in% with R - r

I have the following data
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
Col1 Col2
a    a,b,c
b    aa,c,d
aa   c,d,e
d    d,f,g
I want to select the rows that contain a value from the vector c("a", "e", "g"), specifying the column:
library(dplyr)
T1 %>% filter(Col1 %in% c("a", "e", "g"))
It returns:
1 a a,b,c
That is correct, but now I want to compare two vectors. For example:
With unlist and strsplit, I transform the value of each row to a character vector and try to compare it with the reference vector to select the rows that contain any of the values:
unlist(strsplit(T1$Col2[1],","))
[1] "a" "b" "c"
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
It gives me an error:
Error in filter():
! Problem while computing ..1 = unlist(strsplit(Col2, ",")) %in% c("a", "e", "g").
✖ Input ..1 must be of size 4 or 1, not size 12.
Run rlang::last_error() to see where the error occurred.
I can do it like this:
T1[grep(c("a|e|g"), T1$Col2),]
1 a a,b,c
2 b aa,c,d
3 aa c,d,e
4 d d,f,g
But this is wrong: row 2 (b, aa,c,d) should not be returned, because it contains aa, not a.
To search for the "a" alone, you would have to do:
T1[grep(c("\\<a\\>"), T1$Col2),]
I am afraid I will end up making a mistake with this approach; I would feel safer comparing vector with vector:
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))

Edited answer
You can use \\b to match a word boundary in the regular expression. The | acts as an OR operator between the alternatives. You can use the following code:
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
library(dplyr)
library(stringr)
T1 %>%
filter(grepl("\\b(a|e|g)\\b", Col2))
#> Col1 Col2
#> 1 a a,b,c
#> 2 aa c,d,e
#> 3 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
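Note: inside an ordinary R string the backslash must be doubled ("\\b"), because "\b" on its own is the backspace character. In R 4.0.0 and later you can alternatively use a raw string, r"(\b(a|e|g)\b)".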
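If the reference values are held in a vector, the word-boundary pattern can be assembled from it rather than typed by hand. The following is only a sketch in base R; the names targets, pattern and keep are illustrative, and it assumes the values contain no regex metacharacters.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g")
)
targets <- c("a", "e", "g")  # hypothetical name for the reference vector

# Build the pattern \b(a|e|g)\b from the vector
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")

# TRUE for rows whose Col2 contains one of the targets as a whole token
keep <- grepl(pattern, T1$Col2)
T1[keep, ]
```

Row 2 is dropped because "aa" is not the whole word "a".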
Old answer
Your grep call returns all rows because the pattern is matched as a substring of Col2: row 2 matches because "aa" contains "a", row 3 because it contains "e", and row 4 because it contains "g". You could also use str_detect like this (note that any() collapses the result to a single TRUE, so all rows are kept here as well):
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(any(str_detect(Col2, paste0(vector, collapse="|"))))
#> Col1 Col2
#> 1 a a,b,c
#> 2 b aa,c,d
#> 3 aa c,d,e
#> 4 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
If you want to check whether any of the strings occurs in either of the columns, you can use the following code:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(Reduce(`|`, across(all_of(colnames(T1)), ~str_detect(paste0(vector, collapse="|"), .x))))
#> Col1 Col2
#> 1 a a,b,c
Created on 2022-07-16 by the reprex package (v2.0.1)

Another way you could achieve this (using your original approach with strsplit) is to do it rowwise() and 'sum' the logical test.
T1 %>%
rowwise() %>%
filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)
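The same token-by-token comparison can also be written without rowwise(), by keeping the result of strsplit() as a list (one character vector per row) instead of flattening it with unlist(). This is a minimal base R sketch; tokens, targets and keep are illustrative names, not part of the original code.

```r
T1 <- data.frame(
  Col1 = c("a", "b", "aa", "d"),
  Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g")
)
targets <- c("a", "e", "g")  # hypothetical name for the reference vector

# One character vector of tokens per row, e.g. c("a", "b", "c") for row 1
tokens <- strsplit(T1$Col2, ",", fixed = TRUE)

# One logical per row: does any token of that row appear in targets?
keep <- vapply(tokens, function(tok) any(tok %in% targets), logical(1))

result <- T1[keep, ]
result
```

Because keep has exactly one element per row, it can also be used inside dplyr::filter() without triggering the size error from the question.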

Related

Apply filter criteria to variables that contain/start with certain string in R

I am trying to find a way to filter a data frame by criteria applied to variables whose names contain a certain string.
In the example below, I want to find the subjects for whom any test result contains "d".
d=structure(list(ID = c("a", "b", "c", "d", "e"), test1 = c("a", "b", "a", "d", "a"), test2 = c("a", "b", "b", "a", "s"), test3 = c("b", "c", "c", "c", "d"), test4 = c("c", "d", "a", "a", "f")), class = "data.frame", row.names = c(NA, -5L))
I can use dplyr and write one by one using | which works for small examples like this but for my real data will be time consuming.
library(dplyr)
library(stringr)
d %>%
filter(str_detect(d$test1, "d") |
str_detect(d$test2, "d") |
str_detect(d$test3, "d") |
str_detect(d$test4, "d"))
the output I get shows that subjects b, d and e meet the criteria:
ID test1 test2 test3 test4
1 b b b c d
2 d d a c a
3 e a s d f
The output is what I need but I was looking for an easier way, for example, if there is a way to apply the filter criteria to the variables that contain the word "test"
I know about the contains() function in dplyr for selecting certain variables, and I tried it here, but it does not work:
d %>% filter(str_detect(contains("test"), "d"))
Is there a way to write this code differently, or is there another way to achieve the same goal?
Thank you.
In base R you can use lapply/sapply :
d[Reduce(`|`, lapply(d[-1], grepl, pattern = 'd')), ]
#d[rowSums(sapply(d[-1], grepl, pattern = 'd')) > 0, ]
# ID test1 test2 test3 test4
#2 b b b c d
#4 d d a c a
#5 e a s d f
If you are interested in a dplyr solution, you can use any of the methods below:
library(dplyr)
library(stringr)
#1.
d %>%
filter_at(vars(starts_with('test')), any_vars(str_detect(., 'd')))
#2.
d %>%
rowwise() %>%
filter(any(str_detect(c_across(starts_with('test')), 'd')))
#3.
d %>%
filter(Reduce(`|`, across(starts_with('test'), str_detect, 'd')))
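Recent versions of dplyr (1.0.4 and later, to my knowledge) also offer if_any(), which expresses "any selected column satisfies the condition" directly inside filter(). The sketch below assumes such a version is installed; it uses base grepl() in place of str_detect() only to keep the dependencies small, and res is an illustrative name.

```r
library(dplyr)

d <- data.frame(
  ID    = c("a", "b", "c", "d", "e"),
  test1 = c("a", "b", "a", "d", "a"),
  test2 = c("a", "b", "b", "a", "s"),
  test3 = c("b", "c", "c", "c", "d"),
  test4 = c("c", "d", "a", "a", "f")
)

# Keep rows where at least one column starting with "test" contains "d"
res <- d %>%
  filter(if_any(starts_with("test"), ~ grepl("d", .x, fixed = TRUE)))
res
```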

Exclude common rows in tibbles [duplicate]

I'm looking for a way to join two tibbles so that only the rows unique to the first tibble, or the rows unique to either tibble, are kept: simply those that do not have a matching key.
Let's see example:
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
With common dplyr::join I am not able to get this:
A
1 d
2 e
Is there some way within dplyr to overcome it or in general in tidyverse to overcome it?
Use setdiff() function from dplyr library
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
C <- setdiff(A,B)
Just to add:
setdiff(A, B) returns the elements present in A but not in B.
dplyr::anti_join will keep only the rows that are unique to the tibble/data.frame of the first argument.
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
dplyr::anti_join(A, B, by = "A")
# A
# <chr>
# 1 d
# 2 e
A base R possibility (well except the tibble):
A[!A$A %in% B$A,]
returns
# A tibble: 2 x 1
A
<chr>
1 d
2 e
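The question also mentions rows that are unique in either tibble. One way to sketch that with dplyr is to stack two anti-joins, one in each direction. Note that B2 below is a hypothetical variant of B with an extra key "z", added only so that the second anti-join has something to return; only_in_one is likewise an illustrative name.

```r
library(dplyr)

A  <- tibble(A = c("a", "b", "c", "d", "e"))
B2 <- tibble(A = c("a", "b", "c", "z"))  # hypothetical: B plus one key missing from A

# Rows of A without a match in B2, followed by rows of B2 without a match in A
only_in_one <- bind_rows(
  anti_join(A, B2, by = "A"),
  anti_join(B2, A, by = "A")
)
only_in_one
```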

Count string followed by separate string occurrences in r

I am trying to count the number of occurrences of a string followed by another string in r. I cannot seem to get the regex worked out to count this correctly.
As an example:
v <- c("F", "F", "C", "F", "F", "C", "F", "F")
b <- str_count(v, "F(?=C)")
I would like b to tell me how many times string F was followed by string C in vector v (which should equal 2).
I have successfully implemented str_count() to count single strings, but I cannot figure out how to count string followed by a different string.
Also, I found that in regex (?=...) should indicate "followed by"; however, this does not seem to be sufficient.
You don't have one string; you have individual strings. Here you can test whether F is followed by C by shifting the vector, using [ for subsetting.
sum(v[-length(v)] == "F" & v[-1] == "C")
#sum(v == "F" & c(v[-1] == "C", FALSE)) #Alternative
#[1] 2
To use stringr::str_count you can paste v to one string.
stringr::str_count(paste(v, collapse = ""), "F(?=C)")
#[1] 2
And for rows of a data.frame:
set.seed(42)
v <- as.data.frame(matrix(sample(c("F", "C"), 25, TRUE), 5))
stringr::str_count(apply(v, 1, paste, collapse = ""), "F(?=C)")
#[1] 1 1 2 1 1
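The pasted-string count can also be reproduced in base R without stringr, using gregexpr() with perl = TRUE so that the lookahead is supported. This is only a sketch; s, hits and n_fc are illustrative names, and collapsing with "" is only safe here because every element is a single character.

```r
v <- c("F", "F", "C", "F", "F", "C", "F", "F")

# Collapse the vector into the single string "FFCFFCFF"
s <- paste(v, collapse = "")

# Start positions of every "F" that is immediately followed by "C";
# gregexpr() returns -1 when there is no match at all
hits <- gregexpr("F(?=C)", s, perl = TRUE)[[1]]
n_fc <- sum(hits > 0)
n_fc
```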
You can use lag() from dplyr:
library(dplyr)
sum(v == "C" & lag(v) == "F", na.rm = TRUE)
(The na.rm = TRUE is because the first value of lag(v) is NA).
Your comment notes that you're also interested in applying this across each row of a data frame. This can be done by pivoting the data to be longer, then applying a grouped mutate, then pivoting the data to be wider again. On an example dataset:
library(tidyr) # for pivot_longer() and pivot_wider()
example <- tibble(id = 1:3,
s1 = c("F", "F", "F"),
s2 = c("C", "F", "C"),
s3 = c("C", "C", "F"),
s4 = c("F", "C", "C"))
example %>%
pivot_longer(s1:s4) %>%
group_by(id) %>%
mutate(fc_count = sum(value == "C" & lag(value) == "F", na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
Result:
# A tibble: 3 x 6
id fc_count s1 s2 s3 s4
<int> <int> <chr> <chr> <chr> <chr>
1 1 1 F C C F
2 2 1 F F C C
3 3 2 F C F C
Note that this assumed the data had something like an id column that uniquely identifies each original row. If it doesn't, you can add one with mutate(id = row_number()) first.
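If the values are already held row by row, the shift-and-compare idea can be applied to each row without pivoting. The sketch below uses a plain character matrix holding the same s1 to s4 values as the example; m, count_fc and fc_count are illustrative names.

```r
# Same values as columns s1..s4 of the example, one row per id
m <- rbind(
  c("F", "C", "C", "F"),
  c("F", "F", "C", "C"),
  c("F", "C", "F", "C")
)

# Count positions where an "F" is immediately followed by a "C"
count_fc <- function(r) sum(r[-length(r)] == "F" & r[-1] == "C")

fc_count <- apply(m, 1, count_fc)
fc_count
```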

How to paste vector elements comma-separated and in quotation marks?

I want to select columns of data frame dfr by their names in a certain order, which I first obtain using their numeric positions.
> (x <- names(dfr)[c(3, 4, 2, 1, 5)])
[1] "c" "d" "b" "a" "e"
Only the names version should be included in the final code, because it's safer.
dfr[, c("c", "d", "b", "a", "e")]
I want to paste the elements separated with commas and quotation marks into a string, in order to include it into the final code. I've tried a few options, but they don't give me what I want:
> paste(x, collapse='", "')
[1] "c\", \"d\", \"b\", \"a\", \"e"
> paste(x, collapse="', '")
[1] "c', 'd', 'b', 'a', 'e"
I need something like "'c', 'd', 'b', 'a', 'e'",—of course "c", "d", "b", "a", "e" would be much nicer.
Data
dfr <- setNames(data.frame(matrix(1:15, 3, 5)), letters[1:5])
So dput(x) is the correct answer but just in case you were wondering how to achieve this by modifying your existing code you could do something like the following:
cat(paste0('c("', paste(x, collapse='", "'), '")'))
c("c", "d", "b", "a", "e")
This can also be done with packages (as Tung has shown); here is an example using glue:
library(glue)
glue('c("{v}")', v = glue_collapse(x, '", "'))
c("c", "d", "b", "a", "e")
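For completeness, this is roughly what the dput(x) route looks like with the data from the question. dput() prints the R code that recreates the vector; deparse() returns the same code as a character string in case it needs to be stored. The name code_string is illustrative.

```r
dfr <- setNames(data.frame(matrix(1:15, 3, 5)), letters[1:5])
x <- names(dfr)[c(3, 4, 2, 1, 5)]

# Prints the code for the vector, ready to paste into a script
dput(x)

# The same text as a string (collapsed in case deparse() wraps long vectors)
code_string <- paste(deparse(x), collapse = "")
code_string
```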
Try vector_paste() function from the datapasta package
library(datapasta)
vector_paste(input_vector = letters[1:3])
#> c("a", "b", "c")
vector_paste_vertical(input_vector = letters[1:3])
#> c("a",
#> "b",
#> "c")
Or, using base R, this gives you what you want:
(x <- letters[1:3])
q <- "\""
( y <- paste0("c(", paste(paste0(q, x, q), collapse = ", ") , ")" ))
[1] "c(\"a\", \"b\", \"c\")"
Though I'm not really sure why you want it. Surely you can simply subset like this:
df <- data.frame(a=1:3, b = 1:3, c = 1:3)
df[ , x]
a b c
1 1 1 1
2 2 2 2
3 3 3 3
df[ , rev(x)]
c b a
1 1 1 1
2 2 2 2
3 3 3 3
Suppose you want to add a quotation mark in front of and at the end of a text and save it as an R object: use the capture.output function from the utils package.
Example: I want ABCDEFG to be saved as an R object as "ABCDEFG".
> cat("ABCDEFG")
ABCDEFG
> cat("\"ABCDEFG\"")
"ABCDEFG"
# To save the output of cat as an R object, including the quotation marks at the start and end of the word, use capture.output
> add_quote <- capture.output(cat("\"ABCDEFG\""))
> add_quote
[1] "\"ABCDEFG\""

Replace strings in variable using lookup vector

I have a dataframe df with a character variable and the fromvec and tovec.
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use strings in fromvec, check them in df and then replace them with the corresponding strings in tovec so that "A" in df gets replaced with "X", "B" with "Y" and so on to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried the following, but I am not getting the desired result:
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
i.e. this
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr :
df$var <- chartr(paste(fromvec,collapse=""),
paste(tovec,collapse=""),
toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
Or we can use recode
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))
The correct way to do this with stringr is with str_replace_all:
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, @Moody_Mudskipper!)
We can use base R
with(df, ifelse(toupper(var) %in% fromvec,
setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can be also written in two lines by creating a logical condition
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
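The lookup can also be expressed with match(), which returns the position of each value in fromvec (or NA when there is none). This is a minimal base R sketch using a plain data.frame instead of a tibble; idx is an illustrative name.

```r
df <- data.frame(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec   <- c("X", "Y", "Z")

# Position of each upper-cased value in fromvec, NA if it is not listed
idx <- match(toupper(df$var), fromvec)

# Replace matched values with the corresponding tovec entry, keep the rest
df$var <- ifelse(is.na(idx), df$var, tovec[idx])
df$var
```

Dropping the toupper() call makes the replacement case sensitive, so "a" and "b" would be left unchanged.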
It's not clear whether the result should be case insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished using JOINs. In this case, it also cements a good practice of tracking your changes in a separate dataframe.
Unfortunately, the tidyverse has no "update dataframe" function....a glaring omission. That means tidyverse-ers must use a work-around, coalesce.
#JOIN Operation
tibble(fromvec, tovec) %>% #< dataframe of changes
right_join(df, by = c("fromvec" = "var")) %>% #< join operation
transmute(var = coalesce(tovec, fromvec)) #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
right_join(df %>% mutate(var = (str_to_upper(var))), #<modify case
by = c("fromvec" = "var")) %>%
transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y
