Check if string contains anything other than items in vector [R]

I have a dataframe containing a column of strings. I want to check whether any of the elements in each string match any of the elements in one or more predefined vectors, and then return a new logical column. This is easily accomplished using grepl().
However (and this is the part I need help with), I also want to check whether the strings contain any elements other than those contained in the keyword vectors.
Example data:
matchvector1 <- c("Apple","Banana","Orange")
matchvector2 <- c("Strawberry","Kiwi","Grapefruit")
id <- c(1,2,3)
string_column <- c(paste0(c("Apple","Banana"),collapse=", "), paste0(c("Strawberry","Kiwi"), collapse = ", "), paste0(c("Apple","Pineapple"), collapse = ", "))
df <- data.frame(id, string_column)
df$string_column <- as.character(df$string_column)
matches_vector1 <- grepl(paste(matchvector1, collapse = "|"), df$string_column)
matches_vector2 <- grepl(paste(matchvector2, collapse = "|"), df$string_column)
The output should look something like:
matches_vector1: TRUE FALSE TRUE
matches_vector2: FALSE TRUE FALSE
unmatched_words: FALSE FALSE TRUE
I'm stuck on this last part. Is there an easy way to match on anything except something in a list of keywords using grepl() (or another function)? I suspect it will involve using negative lookaround somehow but the few existing threads on this didn't seem to answer my question.

One option is to split the 'string_column' with separate_rows, group by 'id', and check whether any elements of 'string_column' are not %in% the concatenated keyword vectors
library(dplyr)
library(tidyr)
df %>%
  separate_rows(string_column) %>%
  group_by(id) %>%
  summarise(unmatched = any(!string_column %in% c(matchvector1, matchvector2)))
# A tibble: 3 x 2
# id unmatched
#* <dbl> <lgl>
#1 1 FALSE
#2 2 FALSE
#3 3 TRUE
or in base R
lengths(sapply(strsplit(df$string_column, ",\\s*"),
               setdiff, c(matchvector1, matchvector2))) > 0
#[1] FALSE FALSE TRUE
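To get the unmatched_words column from the question alongside matches_vector1 and matches_vector2, the same expression can simply be assigned back to the data frame (a sketch reusing the base R version above):
df$unmatched_words <- lengths(sapply(strsplit(df$string_column, ",\\s*"),
                                     setdiff, c(matchvector1, matchvector2))) > 0
df$unmatched_words
#[1] FALSE FALSE TRUE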

Related

Create new data frame boolean column based on dynamic number of other columns all being true

I have a data frame which always starts with a target column, then an unknown number of other columns, all of booleans (results of agrep searches against a dynamic number of search patterns).
I need to create a column called final_result, which is TRUE if any of the boolean columns have a TRUE value in them. The number of boolean columns is unknown in advance as the data frame is created on the fly.
My rather naive approach was this:
target = c('blood', 'pressure','lymphatic')
result_1 = c(TRUE, TRUE, FALSE)
result_2 = c(TRUE, FALSE, FALSE)
# may be many more columns, unknown at runtime
df = data.frame(target, result_1, result_2)
df$final_result <- any(df[,2:ncol(df)])
but this returns TRUE for every row, because any() collapses all of the boolean columns into a single value, which is then recycled down the new column. The last row, "lymphatic", has FALSE in both result columns and so should return FALSE.
Any ideas appreciated.
A possible solution, based on dplyr:
library(dplyr)
df %>%
  mutate(new = rowSums(across(-target)) > 0)
#> target result_1 result_2 new
#> 1 blood TRUE TRUE TRUE
#> 2 pressure TRUE FALSE TRUE
#> 3 lymphatic FALSE FALSE FALSE
An approach that does not require any additional packages is:
df$final_result <- apply(df[,-1], 1, any)
The -1 means all of the columns except the first one. The apply function will convert the rest of the data frame into a matrix, then apply the any function to each row (the 2nd argument is 1 for rows, 2 for columns).
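With the example data frame above (assuming df still has just the target and the two result columns, i.e. before the naive any() column is added), this should give the row-wise result the question asks for:
df$final_result
#[1]  TRUE  TRUE FALSE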
Another approach that does not convert to a matrix (so could be faster in some cases) is:
df$final_result <- Reduce(`|`, df[-1])
This treats the data frame as a list of columns: it first ORs the first column (after dropping "target") with the second, then ORs that result with the third column, and so on until it runs out of columns.
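For the two result columns in the example, the Reduce call boils down to the hand-written comparison below (shown only to illustrate the reduction; with more columns the chain simply keeps ORing in the next one):
df$result_1 | df$result_2
#[1]  TRUE  TRUE FALSE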
If you want to use the tidyverse, then pmap from the purrr package can do this:
library(tidyverse)
df$final_result <- df[-1] %>% pmap_lgl(any)
For any of these you can replace the -1 with 2:ncol(df) or with the results of which, grep, sapply etc. used to select columns.
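As an illustration of that column-selection idea, here is a sketch using grep (the "^result_" pattern is an assumption based on the example's column names):
result_cols <- grep("^result_", names(df))
df$final_result <- apply(df[, result_cols, drop = FALSE], 1, any)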
In dplyr, we can use if_any
library(dplyr)
df %>%
  mutate(final_result = if_any(starts_with('result')))
-output
target result_1 result_2 final_result
1 blood TRUE TRUE TRUE
2 pressure TRUE FALSE TRUE
3 lymphatic FALSE FALSE FALSE

R - using function to create new column based on string comparison

I'm fairly new to R and I'm trying to write code to solve the Spelling Bee game on the NYTimes website to see how I'm doing. I tried writing a function to compare two strings ('given' and 'test_word') that returns TRUE if you can spell 'test_word' with only the letters from 'given' and FALSE otherwise. I got that to work, so I downloaded the enable1 wordlist and tried to apply that function to every word in the list. Instead of giving me a new column in the dataframe with the result of the function on each word, it just returns FALSE for every row, and I'm just confused as to what I'm doing wrong. It looks like it's just taking the value of the function for the first entry in the wordlist instead of looking at each word individually.
Here's my code:
library(dplyr)
is_good <- function(given, test_word) {
  diffs <- paste(unlist(setdiff(strsplit(test_word, '')[[1]], strsplit(given, '')[[1]])), collapse = '')
  match = case_when(
    diffs == '' ~ TRUE,
    diffs != '' ~ FALSE
  )
  return(match)
}
given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(V1)
names(words) <- c('word')
words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word)
words <- words %>% mutate(is_match = is_good(given,word))
After running all this, I get this output:
> filter(words, is_match == TRUE)
[1] word is_match
<0 rows> (or 0-length row.names)
Just to check I ran a filter on a word I know should work and got
> filter(words, word=='LEXICON')
word is_match
1 LEXICON FALSE
If I run the function on its own with one word I get the expected result:
> is_good(given,'LEXICON')
[1] TRUE
Why is the function call in my mutate step not applying the function to each row? I'm getting comfortable with the idea of lists and data frames but there's obviously something I'm missing when putting it into practice.
UPDATE: I researched the lapply function and it did what I hoped - my new code looks like
test_split <- lapply(test_word, function(w) {strsplit(w, '')[[1]]})
given_split <- strsplit(given, '')[[1]]
diff_1 <- lapply(test_split, function(x) {paste(unlist(setdiff(x, given_split)), collapse = '')})
match = lapply(diff_1, function(x) {
  case_when(
    x == '' ~ TRUE,
    x != '' ~ FALSE
  )})
An answer that follows the OP's approach:
is_good <- function(given, test_word) {
  test_split <- strsplit(test_word, split = "") # don't need lapply here since strsplit is already vectorized
  given_split <- strsplit(given, '')[[1]]
  diff_1 <- lapply(test_split, function(x) {paste(unlist(setdiff(x, given_split)), collapse = '')})
  # From here, it is back to simple things!
  diff_1 <- unlist(diff_1)
  match <- diff_1 == ""
  return(match)
}
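With this vectorised version, the OP's original mutate call behaves as intended; for the sample words and given <- 'CLEXION' it should give something like:
words %>% mutate(is_match = is_good(given, word))
#        word is_match
#1      AAHED    FALSE
#2    LEXICON     TRUE
#3       LION     TRUE
#4 COLLECTION    FALSE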
Thanks for providing sample data. It makes it easier to solve.
It is probably overkill, but here is the dplyr / tidyverse answer.
Note that |> is the base pipe (similar to %>%); it works for R >= 4.1.0.
Note that you will need the extra packages stringr and tidyr; check whether you have them installed.
If not already installed, run
install.packages(c("tidyr", "stringr"))
purrr::map() is used to manipulate elements of a list
purrr::map_lgl() ensures you return a logical vector
Solution for dealing with duplicated letters
library(dplyr)
is_good <- function(given, test_word) {
  # Standardizing to upper case
  given <- toupper(given)
  test_word <- toupper(test_word)
  # Extracting letters
  given_letters <- stringr::str_split(given, pattern = "")
  given_letters <- unlist(given_letters)
  # This part deals with duplicated letters
  # there is probably a base R way to do it.
  given_letters <- tibble(given_letter = given_letters) |>
    group_by(given_letter) |>
    mutate(letter = paste0(given_letter, row_number())) |>
    pull(letter)
  # For the word "DREAD", it will return ("D1", "R1", "E1", "A1", "D2")
  # Manipulating test_word
  letters_in_test_words <- stringr::str_split(test_word, pattern = "")
  # a little bit more complicated, but similar to the above, to mark duplicated
  # letters. It outputs a list. Example: for input "THIN", "MINI"
  # a list of 2
  # [[1]] : "T1", "H1", "I1", "N1"
  # [[2]] : "M1", "I1", "N1", "I2"
  letters_in_test_words <- tibble(
    word_id = 1:length(letters_in_test_words),
    letter = letters_in_test_words
  ) |>
    tidyr::unnest(letter) |>
    group_by(word_id, letter) |>
    mutate(letter = paste0(letter, dplyr::row_number())) |>
    ungroup() |>
    tidyr::nest(data = letter) |>
    mutate(data = purrr::map(data, 1)) |>
    pull(data)
  # iterates over the words to find if there is a complete match
  match <- purrr::map_lgl(letters_in_test_words, ~ all(.x %in% given_letters))
  match
}
given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(word = V1)
words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word) # a good idea to be put inside the function
is_good("AHMED","LEXICO1N")
#> [1] FALSE
words <- words %>% mutate(is_match = is_good(given,word))
words |>
filter(is_match)
#> word is_match
#> 1 LEXICON TRUE
#> 2 LION TRUE
# My solution checks for duplicated letters
# You probably don't want this as TRUE.
is_good(given = "TRUCE", test_word = "TRUCEE")
#> [1] FALSE
Created on 2022-06-17 by the reprex package (v2.0.1)
Note:
My function could probably exist in base R as well, but I am better with tables. It is also overkill since it checks for duplicates.
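For completeness, a minimal base R sketch of the same duplicate-aware check, comparing letter counts with table() (the name is_good_base is just for illustration, not from the answer above):
is_good_base <- function(given, test_word) {
  # count each letter of the 'given' letters
  given_counts <- table(strsplit(toupper(given), "")[[1]])
  vapply(strsplit(toupper(test_word), ""), function(letters) {
    # a word passes if every letter, counting repeats, is available in 'given'
    word_counts <- table(letters)
    all(names(word_counts) %in% names(given_counts)) &&
      all(word_counts <= given_counts[names(word_counts)])
  }, logical(1))
}
is_good_base("TRUCE", c("TRUCE", "TRUCEE"))
#[1]  TRUE FALSE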
A solution that doesn't check for duplicates (much simpler)
library(dplyr)
is_good <- function(given, test_word) {
  # Standardizing
  given <- toupper(given)
  test_word <- toupper(test_word)
  # Extracting letters
  given_letters <- stringr::str_split(given, pattern = "")
  given_letters <- unlist(given_letters)
  # Manipulating test_word
  letters_in_test_words <- stringr::str_split(test_word, pattern = "")
  # iterates over the words to find if there is a complete match
  match <- purrr::map_lgl(letters_in_test_words, ~ all(.x %in% given_letters))
  match
}
given <- 'CLEXION'
#words = read.csv('c:/Users/Dave/Documents/R/enable1.txt', header=FALSE)
# edited to add sample list of words
V1 <- c('AAHED','LEXICON','LION','COLLECTION')
words <- data.frame(word = V1)
words <- filter(words, nchar(word)>=4)
words$word <- toupper(words$word) # a good idea to be put inside the function
is_good("AHMED","LEXICO1N")
#> [1] FALSE
words <- words %>% mutate(is_match = is_good(given,word))
words |>
filter(is_match)
#> word is_match
#> 1 LEXICON TRUE
#> 2 LION TRUE
# You probably don't want this to be TRUE, but without the duplicate handling
# duplicated letters are ignored, so it comes out as TRUE.
is_good(given = "TRUCE", test_word = "TRUCEE")
#> [1] TRUE
Created on 2022-06-17 by the reprex package (v2.0.1)

How to match a vector to a dataframe using mutate

I have a data frame in which I'd like to make a new column which is set to TRUE or FALSE depending on whether it matches with a vector.
So far I've tried two different approaches, the first directly using the %in% operator to check whether elements of test occurred in column apples, the second by putting this in an ifelse statement.
test <- c("a","b","c")
df <- tibble(apples = c("a","d","e","f","z","g","c"))
#First attempt
df_match <- df %>%
  mutate(
    match = test %in% apples
  )
#Second attempt
df_match <- df %>%
  mutate(
    match = ifelse(test %in% apples, TRUE, FALSE)
  )
The desired output for column 'match' would be
> df$match
[1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
Using base R
transform(df, match = apples %in% test)
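The same fix works inside the dplyr pipeline the question started with; the operands of %in% just need to be swapped:
df_match <- df %>%
  mutate(match = apples %in% test)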

How to index a dataframe that contains lists/vectors as values

Indexing a dataframe works for single values but not for values that are list elements or vectors.
I have two lists of the genes that I need to match up. In each list, the genes are named as different gene aliases. I need to query a large list of genes in order to filter out any genes that are not shared between the two datasets. To do this, I created a dataframe that contains all genes from both lists. Each value in the dataframe is either a single string or a vector of multiple strings (aliases). A separate column assigns each group of aliases a unique number, which I am using to match the two lists. For each gene I need to check if it is present in the dataframe. But I cannot index the vector values. See below:
df <- data.frame("col1" = I(list(c("MALAT1","FTK2","CAS9"),
                                 "MS4A6A",
                                 c("LACT1","FLEE6","LOC98"))),
                 "col2" = I(list(c("CASS4","MS4A2","NME"),
                                 "PLD3",
                                 "ADAM4")))
"MALAT1" %in% df$col1
[1] FALSE
"MS4A6A" %in% df$col1
[1] TRUE
As it is a list, we can unlist
"MALAT1" %in% unlist(df$col1)
#[1] TRUE
The reason the second one returns TRUE is that the second list element has length 1: when %in% coerces the list to character, a length-1 element becomes just its string ("MS4A6A") and matches, while the element containing "MALAT1" becomes the whole deparsed vector and does not.
-testing
If we change the list element that have a single element to "MALAT1"
df$col1[2] <- "MALAT1"
"MALAT1" %in% df$col1
#[1] TRUE
Generally, when we have a list, if we want to test on each element
lapply(df$col1, `%in%`, x = "LACT1")
#[[1]]
#[1] FALSE
#[[2]]
#[1] FALSE
#[[3]]
#[1] TRUE
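If a plain logical vector is preferred over a list, sapply (or vapply) collapses the same per-element test:
sapply(df$col1, `%in%`, x = "LACT1")
#[1] FALSE FALSE  TRUE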
Here is another workaround, which plays a trick on df by flattening the lists in columns via rapply + toString
df[] <- rapply(df, toString, how = "unlist")
such that
> df
col1 col2
1 MALAT1, FTK2, CAS9 CASS4, MS4A2, NME
2 MS4A6A PLD3
3 LACT1, FLEE6, LOC98 ADAM4
and then you can use grepl to check if the objective can be found in the column via, e.g.,
> grepl("LACT1", df$col1, fixed = TRUE)
[1] FALSE FALSE TRUE
> grepl("NME", df$col2, fixed = TRUE)
[1] TRUE FALSE FALSE
You were almost there. Just wrap unlist() around the list column before testing with %in%, as shown above.

Use an external list to remove data from rows

I have a data frame
df <- data.frame(
  A = c(4, 2, 7),
  B = c(3, 3, 5),
  C = c("Expert,Foo", "Bar,Wild", "Zap")
)
and a second one which I would like to use as index to remove rows which contain the specific values
mylist <- data.frame(rtext = c("Foo","Bar"))
So I tried this:
subset(df, C %in% mylist$rtext)
How can I remove the specific rows?
As it is a partial match, we can use grep. We paste the elements of the 'mylist' column 'rtext' into a single string with delimiter |, which means OR, then get a logical index with grepl on the 'C' column of 'df' and negate it (!) so that TRUE marks the rows that do not match 'rtext', and use that to subset.
subset(df, !grepl(paste(mylist$rtext, collapse="|"), C))
# A B C
#3 7 5 Zap
Using str_detect from stringr
df[!stringr::str_detect(df$C,paste(mylist$rtext,collapse = '|')),]
A B C
3 7 5 Zap
If you need a 100% (exact) match, meaning that e.g. 'Foooo' will not be removed, re-format your df first with dplyr and tidyr, since str_detect and grepl do partial matching: a value like 'Expert,Foott' would still be flagged as a match for 'Foo'.
library(tidyr)
library(dplyr)
df$id <- seq.int(nrow(df))
df1 <- df %>%
  transform(C = strsplit(C, ",")) %>%
  unnest(C)
df[!df$id %in% df1$id[df1$C %in% mylist$rtext], ]
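A base R alternative for the exact-match case, without reshaping, is to split 'C' on the separator and test each piece with %in% (this assumes "," is always the delimiter, as in the example); it keeps only the "Zap" row:
df[!sapply(strsplit(df$C, ","), function(x) any(x %in% mylist$rtext)), ]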
