Check whether a string appears in another in R

I've got a tibble containing sentences like this:
df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."))
And another containing a long list of names:
names <- tibble(names = c("Bob", "Mary", "Michael", "John", "Etc."))
I would like to check whether each sentence contains a name from the list, and add a column indicating the result, to get the following tibble:
wanted_df <- tibble(sentences = c("Bob is looking for something", "Adriana has an umbrella", "Michael is looking at..."), check = c(TRUE, FALSE, TRUE))
So far I've tried this, without success:
df <- df %>%
mutate(check = grepl(pattern = names$names, x = df$sentences, fixed = TRUE))
And also:
check <- str_detect(names$names %in% df$sentences)
Thanks a lot for any help ;)

You should combine the names into a single regular expression for grepl:
df %>%
mutate(check = grepl(paste(names$names, collapse = "|"), sentences))
# A tibble: 3 × 2
sentences check
<chr> <lgl>
1 Bob is looking for something TRUE
2 Adriana has an umbrella FALSE
3 Michael is looking at... TRUE
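One caveat with the collapsed pattern: each name is interpreted as a regular expression, so the "." in "Etc." matches any character, and a short name can match inside a longer word. Below is a minimal base R sketch, assuming the names should match literally and as whole words. The helper escape_regex and the extra "Bobby" sentence are hypothetical additions for illustration, not part of the original question.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at...",
               "Bobby fetches the ball")   # added to show the whole-word check
name_list <- c("Bob", "Mary", "Michael", "John", "Etc.")

# Hypothetical helper: backslash-escape regex metacharacters so that
# every name is matched literally.
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

# Join the escaped names with "|" and require that no word character
# sits directly before or after the match (PCRE lookarounds).
alternatives <- paste(escape_regex(name_list), collapse = "|")
pattern <- paste0("(?<!\\w)(", alternatives, ")(?!\\w)")

check <- grepl(pattern, sentences, perl = TRUE)
check
#> [1]  TRUE FALSE  TRUE FALSE
```

Without the lookarounds, "Bobby fetches the ball" would also be flagged, because "Bob" occurs inside "Bobby".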

Here is a base R solution.
inx <- sapply(names$names, \(pat) grepl(pat, df$sentences))
inx
#> Bob Mary Michael John Etc.
#> [1,] TRUE FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE TRUE FALSE FALSE
inx <- rowSums(inx) > 0L
df$check <- inx
df
#> # A tibble: 3 × 2
#> sentences check
#> <chr> <lgl>
#> 1 Bob is looking for something TRUE
#> 2 Adriana has an umbrella FALSE
#> 3 Michael is looking at... TRUE
Created on 2023-01-11 with reprex v2.0.2

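grep and family expect pattern= to be length 1, so passing the whole names$names vector does not do what you want. Similarly, str_detect needs a character vector and a pattern, not the logical vector that %in% returns, so that attempt won't work as-is either.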
We have a couple of options:
sapply on the names (into a matrix) and see if each row has one or more matches:
df %>%
mutate(check = rowSums(sapply(names$names, grepl, sentences)) > 0)
# # A tibble: 3 × 2
# sentences check
# <chr> <lgl>
# 1 Bob is looking for something TRUE
# 2 Adriana has an umbrella FALSE
# 3 Michael is looking at... TRUE
(I now see this is in RuiBarradas's answer.)
Do a fuzzy-join on the data using fuzzyjoin:
df %>%
fuzzyjoin::regex_left_join(names, by = c(sentences = "names")) %>%
mutate(check = !is.na(names))
# # A tibble: 3 × 3
# sentences names check
# <chr> <chr> <lgl>
# 1 Bob is looking for something Bob TRUE
# 2 Adriana has an umbrella NA FALSE
# 3 Michael is looking at... Michael TRUE
This method has the advantage that it tells you which pattern (in names) made the match.
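The matched names can also be recovered from the sapply match matrix, without the extra package. This is a rough base R sketch; the vectors sentences and name_list are stand-ins for df$sentences and names$names, and the third sentence was changed so that one row has two matches.

```r
sentences <- c("Bob is looking for something",
               "Adriana has an umbrella",
               "Michael is looking at Bob")
name_list <- c("Bob", "Mary", "Michael", "John")

# One row per sentence, one column per name; fixed = TRUE matches literally.
hits <- sapply(name_list, grepl, x = sentences, fixed = TRUE)

# For each sentence, paste together the names whose column is TRUE
# (an empty string when nothing matched).
matched <- apply(hits, 1, function(row) paste(name_list[row], collapse = ", "))
check <- rowSums(hits) > 0

matched
#> [1] "Bob"          ""             "Bob, Michael"
```

Unlike the join, this keeps one row per sentence even when several names match.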

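We can also try adist with colSums, as below. With fixed = FALSE, adist treats the names as regular expressions and matches them against substrings, so a distance of 0 means the name occurs in the sentence: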
df %>%
mutate(check = colSums(adist(names$names, sentences, fixed = FALSE) == 0) > 0)
which gives
# A tibble: 3 × 2
sentences check
<chr> <lgl>
1 Bob is looking for something TRUE
2 Adriana has an umbrella FALSE
3 Michael is looking at... TRUE

Related

for loop combined with if statement in R

I am working in R and I want to iterate over every unique name in this table: if A == "yes" | B == "yes" in any row for a name, a new column C should be TRUE for all entries with that Name, otherwise FALSE. I don't know how to combine a for loop with this if statement. I always get error messages, although it should be a simple task.
Name    A    B
Jordan  yes  no
Pascal  yes  no
Nando   no   yes
Nando   no   no
Nico    no   no
Nico    no   no
This should be the result:
Name    A    B    C
Jordan  yes  no   TRUE
Pascal  yes  no   TRUE
Nando   no   yes  TRUE
Nando   no   no   TRUE
Nico    no   no   FALSE
Nico    no   no   FALSE
For-loops are often not needed in R.
library(dplyr)
dat |>
group_by(Name) |>
mutate(C = if_else("yes" %in% c(A, B), TRUE, FALSE))
#> # A tibble: 6 x 4
#> # Groups: Name [4]
#> Name A B C
#> <chr> <chr> <chr> <lgl>
#> 1 Jordan yes no TRUE
#> 2 Pascal yes no TRUE
#> 3 Nando no yes TRUE
#> 4 Nando no no TRUE
#> 5 Nico no no FALSE
#> 6 Nico no no FALSE
Created on 2022-07-05 by the reprex package (v2.0.1)
R is a language that prefers vector calculations over loops, so the more common way in R would be
df <- data.frame(
Name = c("Jordan","Pascal","Nando","Nando","Nico","Nico"),
A = c("yes","yes","no","no","no","no"),
B = c("no","no","yes","no","no","no")
)
df$C <- df$Name %in% df$Name[df$A == "yes" | df$B == "yes"]
This solution does not rely on any additional library.
If you feel strongly about looping, you could loop over unique(df$Name) or you could aggregate by df$Name, but all of those are much more involved and inefficient techniques.
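For comparison, here is a sketch of what the loop over unique(df$Name) could look like next to the vectorized one-liner. The column names C_loop and C_vec are made up for this illustration.

```r
df <- data.frame(
  Name = c("Jordan", "Pascal", "Nando", "Nando", "Nico", "Nico"),
  A = c("yes", "yes", "no", "no", "no", "no"),
  B = c("no", "no", "yes", "no", "no", "no")
)

# Loop version: one pass per distinct name, writing the same value
# to every row that belongs to that name.
df$C_loop <- FALSE
for (nm in unique(df$Name)) {
  rows <- df$Name == nm
  df$C_loop[rows] <- any(df$A[rows] == "yes" | df$B[rows] == "yes")
}

# Vectorized version: names that have a "yes" anywhere, then a membership test.
df$C_vec <- df$Name %in% df$Name[df$A == "yes" | df$B == "yes"]

identical(df$C_loop, df$C_vec)
#> [1] TRUE
```

Both columns agree; the loop just takes more code to reach the same result.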

Add column to dataframe to show if an element in that row is in a certain list in R

I have a dataframe df (tibble in my case) in R and several files in a given directory which have a loose correspondence with the elements of one of the columns in df. I want to track which rows in df correspond to these files by adding a column has_file.
Here's what I've tried.
# SETUP
dir.create("temp")
setwd("temp")
LETTERS[1:4] %>%
str_c(., ".png") %>%
file.create()
df <- tibble(x = LETTERS[3:6])
file_list <- list.files()
# ATTEMPT
df %>%
mutate(
has_file = file_list %>%
str_remove(".png") %>%
is.element(x, .) %>%
any()
)
# RESULT
# A tibble: 4 x 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E TRUE
4 F TRUE
I would expect that only the rows with C and D get values of TRUE for has_file, but E and F do as well.
What is happening here, and how may I generate this correspondence in a column?
(Tidyverse solution preferred.)
The any() call is evaluated over the whole column rather than per row: is.element(x, .) returns one value per element of x, and because two of those are TRUE, any() collapses them to a single TRUE that is recycled to every row. Adding rowwise at the top makes each row evaluate separately. With rowwise, there is no need for any, as is.element returns a single TRUE/FALSE for each element of the 'x' column:
df %>%
rowwise %>%
mutate(
has_file = file_list %>%
str_remove(".png") %>%
is.element(x, .)) %>%
ungroup
# A tibble: 4 × 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E FALSE
4 F FALSE
To see the difference that any makes, compare:
> is.element(df$x, LETTERS[1:4])
[1] TRUE TRUE FALSE FALSE
> any(is.element(df$x, LETTERS[1:4]))
[1] TRUE
We may also use map to do this
library(purrr)
df %>%
mutate(has_file = map_lgl(x, ~ file_list %>%
str_remove(".png") %>%
is.element(.x, .)))
# A tibble: 4 × 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E FALSE
4 F FALSE
Or, for a vectorized option, use %in% directly instead of is.element:
df %>%
mutate(has_file = x %in% str_remove(file_list, ".png"))
# A tibble: 4 × 2
x has_file
<chr> <lgl>
1 C TRUE
2 D TRUE
3 E FALSE
4 F FALSE
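A side note on the ".png" pattern used throughout: it is read as a regular expression, so "." matches any character and the match is not anchored to the end of the file name. The base R sketch below shows the difference, assuming a literal vector in place of list.files(); the file "Apng.png" is a made-up name that triggers the problem.

```r
# Stand-in for list.files(); "Apng.png" is a hypothetical problem case.
file_list <- c("A.png", "B.png", "C.png", "D.png", "Apng.png")
x <- c("C", "D", "E", "F")

# Unanchored regex: removes the first "any character followed by png".
sub(".png", "", "Apng.png")
#> [1] ".png"

# Escaped and anchored: removes only a literal ".png" at the end.
stems <- sub("\\.png$", "", file_list)

# tools::file_path_sans_ext() strips the extension without a hand-written regex.
stems2 <- tools::file_path_sans_ext(file_list)

has_file <- x %in% stems
has_file
#> [1]  TRUE  TRUE FALSE FALSE
```

For file names as simple as "A.png" all three approaches give the same stems, so this only matters once the names get less regular.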

str_detect on multiple columns in the same row

I have two datasets, one with full names and one with first and last names.
library(tidyverse)
(x = tibble(fullname = c("Michael Smith",
"Elisabeth Brown",
"John-Henry Albert")))
#> # A tibble: 3 x 1
#> fullname
#> <chr>
#> 1 Michael Smith
#> 2 Elisabeth Brown
#> 3 John-Henry Albert
(y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown"))
#> # A tibble: 3 x 2
#> first last
#> <chr> <chr>
#> 1 Elisabeth Smith
#> 2 John Albert
#> 3 Roland Brown
I'd like to make a single boolean column that is TRUE only if both the first and the last name from the same row of y appear in the fullname column.
In essence, I'm looking for something like:
x %>%
mutate(fname_match = str_detect(fullname, paste0(y$first, collapse = "|")), ## correct
lname_match = str_detect(fullname, paste0(y$last, collapse = "|"))) ## correct
#> # A tibble: 3 x 3
#> fullname fname_match lname_match
#> <chr> <lgl> <lgl>
#> 1 Michael Smith FALSE TRUE
#> 2 Elisabeth Brown TRUE TRUE
#> 3 John-Henry Albert TRUE TRUE
But if I took the rows with two TRUEs here, Elisabeth Brown would be a false positive, because the matching first name and last name are not in the same row of y.
My best idea so far is to combine the first and last columns and search for the result, but this creates a false negative for John-Henry:
y = tribble(~first, ~last,
"Elisabeth", "Smith",
"John", "Albert",
"Roland", "Brown") %>%
rowwise() %>%
mutate(longname = paste(first, last, sep = "&"))
x %>%
mutate(full_match = str_detect(fullname, paste0(y$longname, collapse = "|")))
#> # A tibble: 3 x 2
#> fullname full_match
#> <chr> <lgl>
#> 1 Michael Smith FALSE
#> 2 Elisabeth Brown FALSE
#> 3 John-Henry Albert FALSE
I think this does what you want, using purrr::map2 to iterate over the tuples of first and last.
library(dplyr)
library(purrr)
y %>%
mutate(
name_match = map2_lgl(
first, last,
.f = ~any(grepl(paste0(.x, '.*', .y), x$fullname, ignore.case = TRUE))
)
)
Note that paste0(.x, '.*', .y) combines the two into a regex that only matches when the last name appears in full after the first name. That seemed reasonable (otherwise, first name "Elisabeth" with last name "Abe" would still be TRUE, which I assume you would not want).
Also, the above is case-insensitive.
// UPDATE:
Conversely, if you want to check each fullname value in x against the pairs in y, you can run this:
x %>%
rowwise() %>%
mutate(
name_match = any(map2_lgl(
y$first, y$last,
.f = ~grepl(paste0('\\b', .x, '\\b.*\\b', .y, '\\b'), fullname, ignore.case = TRUE)
))
)
Depending on how important this check is for you and how many assumptions you want to make, it might make sense to tweak the above regex a little further:
ensure that the first name and last name stand as isolated words in the fullname
-> paste0('\\b', .x, '\\b.*\\b', .y, '\\b')
test that the first name comes right at the beginning
-> paste0('^', .x, '\\b.*\\b', .y, '\\b')
test that the fullname ends after the last name
-> paste0('\\b', .x, '\\b.*\\b', .y, '$')
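The word-boundary variant can be tried on the question's data with base R alone. This is only a sketch: the vectors below are copied from the question, and the nested vapply is a stand-in for rowwise plus map2_lgl, not the code from this answer.

```r
fullname <- c("Michael Smith", "Elisabeth Brown", "John-Henry Albert")
first <- c("Elisabeth", "John", "Roland")
last  <- c("Smith", "Albert", "Brown")

# One pattern per (first, last) pair: first name, anything, then last name,
# each delimited by word boundaries.
patterns <- paste0("\\b", first, "\\b.*\\b", last, "\\b")

# For every full name, test whether at least one pair's pattern matches.
name_match <- vapply(
  fullname,
  function(nm) {
    any(vapply(patterns, grepl, logical(1), x = nm, ignore.case = TRUE))
  },
  logical(1),
  USE.NAMES = FALSE
)
name_match
#> [1] FALSE FALSE  TRUE
```

"John-Henry Albert" matches because the hyphen after "John" counts as a word boundary, while "Elisabeth Brown" is rejected because no single row of y pairs those two names.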

For rows in a tibble, how to count the greatest number of TRUE values between FALSE values?

So I have a tibble in the form
passengerId FlightChain
1 1 c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")
2 2 TRUE
3 3 c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE")
and I'm trying to get the highest count of consecutive "TRUE"s between "FALSE"s as its own column.
So in this case:
passengerId fullFlightChain
1 1 3
2 2 1
3 3 2
I first had the tibble in format:
passengerId flightTo
<int> <lgl>
1 1 TRUE
2 1 TRUE
3 1 FALSE
4 1 TRUE
5 1 TRUE
6 1 TRUE
7 1 FALSE
8 1 TRUE
9 2 TRUE
10 3 TRUE
11 3 FALSE
etc....
So if it would be better to work from that format (grouping by passengerId), I'm all ears. From what I've heard, rle() is a function that might work, but I can't get it to work properly.
Thanks
Edit: now with code
df_q3 <- df1 %>%
group_by(passengerId) %>%
arrange(passengerId) %>%
mutate(flightToUK = if_else(to == "uk", FALSE, TRUE)) %>%
summarise(fullFlightChain = paste(flightToUK, collapse = "-")) %>%
mutate(fullFlightChainSplit = str_split(fullFlightChain, "-")) %>%
map(fullFlightChainSplit,rle(fullFlightChainSplit))) %>%
print()
Where the last line is where I'm trying to make the count as seen in the first table
Taking as an input your initial tibble format, i.e.:
library(readr)
library(dplyr)
df <- read_table2("passengerId flightTo
1 TRUE
1 TRUE
1 FALSE
1 TRUE
1 TRUE
1 TRUE
1 FALSE
1 TRUE
2 TRUE
3 TRUE
3 FALSE")
Here is one solution to your problem:
df1 <- df %>%
group_by(passengerId) %>%
transmute(fullFlightChain = with(rle(flightTo), max(lengths[values]))
) %>%
unique(.)
Output:
> df1
# A tibble: 3 x 2
# Groups: passengerId [3]
passengerId fullFlightChain
<dbl> <int>
1 1 3
2 2 1
3 3 1
EDIT: Adding the missing rows to your initial tibble and producing the output:
df <- read_table2("passengerId flightTo
1 TRUE
1 TRUE
1 FALSE
1 TRUE
1 TRUE
1 TRUE
1 FALSE
1 TRUE
2 TRUE
3 TRUE
3 FALSE
3 TRUE
3 TRUE
3 FALSE")
df1 <- df %>%
group_by(passengerId) %>%
transmute(fullFlightChain = with(rle(flightTo), max(lengths[values]))
) %>%
unique(.)
Output:
> df1
# A tibble: 3 x 2
# Groups: passengerId [3]
passengerId fullFlightChain
<dbl> <int>
1 1 3
2 2 1
3 3 2
The rle function encodes a vector as runs of values and their lengths, which lets you find the maximum length among the runs that have a TRUE value. Something along these lines should work, although it is untested because the question has no reproducible example:
RLE <- rle(flightTo)
mxT <- max( RLE$lengths[RLE$values == TRUE] )
Or for multiple items in a list:
lapply(list_name, function(line) {
  # encode the current list element (not the global flightTo) as runs
  RLE <- rle(line)
  # length of the longest run of TRUE
  max(RLE$lengths[RLE$values == TRUE])
})
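As a worked illustration of the rle idea on the question's long format, here is a base R sketch. The helper longest_true_run and passenger 4 (who has no TRUE at all) are assumptions added to show the empty case, where a bare max() would warn and return -Inf.

```r
flight <- data.frame(
  passengerId = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4),
  flightTo    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE,
                  TRUE,
                  TRUE, FALSE, TRUE, TRUE, FALSE,
                  FALSE)   # passenger 4 is hypothetical: no TRUE at all
)

# Hypothetical helper: length of the longest run of TRUE, or 0 if there is none.
longest_true_run <- function(x) {
  runs <- rle(x)
  true_lengths <- runs$lengths[runs$values]
  if (length(true_lengths) == 0) 0L else max(true_lengths)
}

# rle() splits passenger 1's flights into runs of lengths 2, 1, 3, 1, 1.
rle(flight$flightTo[flight$passengerId == 1])$lengths
#> [1] 2 1 3 1 1

# Apply the helper within each passenger.
chain <- tapply(flight$flightTo, flight$passengerId, longest_true_run)
chain
#> 1 2 3 4 
#> 3 1 2 0 
```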
Here is both a reproducible example and the solution based on rle
library(tibble)
library(magrittr)
library(dplyr)
set.seed(4242)
tbl <- tibble(passID = sample(1:3, 20, replace = TRUE),
flightTO = sample(c(T, F), 20, replace = TRUE)) %>%
arrange(passID)
rle(tbl$flightTO)
tbl %>%
group_by(passID) %>%
do({tmp <- with(rle(.$flightTO==TRUE), lengths[values])
data.frame(passID= .$passID, Max=if(length(tmp)==0) 0
else max(tmp)) }) %>%
slice(1L)
UPDATE
Simply use my code to create a temporary object, then join it to the main summarised object, keeping the critical "Max" column, which holds the maximum run length per passID. Here "tbl" corresponds to your "df1".
temp_obj <- tbl %>%
group_by(passID) %>%
do({tmp <- with(rle(.$flightTO==TRUE), lengths[values])
data.frame(passID= .$passID, Max=if(length(tmp)==0) 0
else max(tmp)) }) %>%
slice(1L)
your_new_obj_where_you_summarise_other_stuff <- tbl %>%
group_by(passID) %>%
summarise(..other summary statistics you need..) %>%
inner_join(temp_obj, by = "passID")

Row-wise operation to see if any columns are in any other list

I have a tibble tib as follows:
A B C D
<chr> <chr> <chr> <chr>
1 X123 X456 K234 V333
2 X456 Z000 L888 B323
3 X789 ZZZZ D345 O999
4 M111 M111 M111 M111
.
.
.
(5000 rows)
I also have another vector as follows:
> vec <- c("X123","X456")
> vec
[1] "X123" "X456"
I am looking for a way to add a logical column (with 5000 rows in this example) to the right of the tibble that is TRUE or FALSE depending on whether any of the column values in that row of tib are in vec. My goal output is the following:
A B C D lgl
<chr> <chr> <chr> <chr> <lgl>
1 X123 X456 K234 V333 TRUE
2 X456 Z000 L888 B323 TRUE
3 X789 ZZZZ D345 O999 FALSE
4 M111 M111 M111 M111 FALSE
I have the following:
> tib %>%
+ pmap_lgl(~any(..1 %in% vec))
[1] TRUE TRUE FALSE FALSE
This gets the results that I am seeking, but I'm a bit confused about the syntax.
Why does the above work (i.e. using ..1), instead of having to use ..1, ..2, ..3, and ..4? My understanding is that pmap generates a vector based on the inputs rowwise, so I assume that ..1 in the above means the vector c("X123","X456","K234","V333") for row #1, c("X456","Z000","L888","B323") for row #2, etc.
In the end, I have two questions:
How do I append this new logical vector to the above tib? I haven't had any luck with:
tib %>% mutate(lgl = pmap_lgl(~any(..1 %in% vec)))
Error in mutate_impl(.data, dots): Evaluation error: argument ".f" is missing, with no default.
If I wanted to access each column within each row (e.g. "X123" for the first row in pmap), how would I do that within the syntax of purrr?
You can use add_column and pmap_lgl along with a helper function to get a tidyverse one-liner similar to the base apply solution from @YOLO.
library(tidyverse)
df <- tibble(A = c('X123', 'X456','X789', 'M111'),
B = c('X456', 'Z000', 'ZZZZ', 'M111'),
C = c('K234', 'L888', 'D345', 'M111'),
D = c('V333', 'B323', '0999', 'M111'))
vec <- c('V333', '0999')
check <- function(...) {
any(c(...) %in% vec)
}
add_column(df, row_check = pmap_lgl(df, check))
# A tibble: 4 x 5
A B C D row_check
<chr> <chr> <chr> <chr> <lgl>
1 X123 X456 K234 V333 TRUE
2 X456 Z000 L888 B323 FALSE
3 X789 ZZZZ D345 0999 TRUE
4 M111 M111 M111 M111 FALSE
The caveat of using ... in the function is that it will operate over ALL columns of the provided tibble or data frame. If you have additional columns, you'll need to either name the function arguments explicitly or limit the data passed to pmap_lgl.
To keep it simple, you could use the base function apply together with any:
df$lgl <- apply(df, 1, function(x) any(x %in% vec))
The ..1, ..2, etc. refer to the arguments by position: ..1 is the first argument (the value from the first column), ..2 the second, and so on. We can use these along with the mutate and rowwise functions to get our desired result:
tib %>%
mutate(lgl = pmap(., ~c(..1, ..2, ..3, ..4) %in% vec)) %>%
rowwise() %>%
mutate(lgl = any(unlist(lgl)))
V1 V2 V3 V4 lgl
<chr> <chr> <chr> <chr> <lgl>
1 X123 X456 K234 V333 TRUE
2 X456 Z000 L888 B323 TRUE
3 X789 ZZZZ D345 O999 FALSE
4 M111 M111 M111 M111 FALSE
The call to pmap uses . as its first argument, which is the tibble being piped in; the formula that follows is the function applied to each row. In it, we create a vector of the values from each column using c(..1, ..2, ..3, ..4). We then need rowwise to calculate the final logical value for each row.
The previous iteration of my answer would have returned an incorrect result for vec = c('M111'). It now handles that case correctly:
tib %>%
mutate(lgl = pmap(., ~c(..1, ..2, ..3, ..4) %in% c('M111'))) %>%
rowwise() %>%
mutate(lgl = any(unlist(lgl)))
V1 V2 V3 V4 lgl
<chr> <chr> <chr> <chr> <lgl>
1 X123 X456 K234 V333 FALSE
2 X456 Z000 L888 B323 FALSE
3 X789 ZZZZ D345 O999 FALSE
4 M111 M111 M111 M111 TRUE
Here's a link to the documentation for the function, which might be useful too.
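To make the positional meaning of ..1 concrete, here is a base R sketch that uses mapply as a stand-in for pmap_lgl (both pass each column as a separate argument). The data is the question's, except that row 4 of column B was changed to "X123" (an assumption for illustration) so that the two versions disagree.

```r
tib <- data.frame(
  A = c("X123", "X456", "X789", "M111"),
  B = c("X456", "Z000", "ZZZZ", "X123"),   # row 4 changed for this example
  C = c("K234", "L888", "D345", "M111"),
  D = c("V333", "B323", "O999", "M111")
)
vec <- c("X123", "X456")

# ..1 is only the first argument, i.e. the value from column A.
first_only <- mapply(function(...) any(..1 %in% vec),
                     tib$A, tib$B, tib$C, tib$D, USE.NAMES = FALSE)

# c(...) collects all arguments, i.e. the whole row.
whole_row <- mapply(function(...) any(c(...) %in% vec),
                    tib$A, tib$B, tib$C, tib$D, USE.NAMES = FALSE)

first_only
#> [1]  TRUE  TRUE FALSE FALSE
whole_row
#> [1]  TRUE  TRUE FALSE  TRUE
```

This suggests that the attempt in the question only appeared to work: with the original data, the matching values in rows 1 and 2 happen to sit in column A.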

Resources