R match whole words in phrases

I have a character vector
var1 <- c("pine tree", "forest", "fruits", "water")
and a list
var2 <- list(c("tree", "house", "star"), c("house", "tree", "pine tree", "tree pine", "dense forest"), c("apple", "orange", "grapes"))
I want to match words in var1 with words in var2, and extract the element of var2 with the most matches. For example,
[[1]]
[1] "tree" "house" "star"
has 1 match with var1
[[2]]
[1] "house" "tree" "pine tree" "tree pine" "dense forest"
has 4 matches with var1
[[3]]
[1] "apple" "orange" "grapes"
has 0 matches with var1
And the desired output is the following:
[[2]]
[1] "house" "tree" "pine tree" "tree pine" "dense forest"
I tried
sapply(var1, grep, var2, ignore.case=T, value=T)
without getting the output desired.
How to solve it? A code snippet would be appreciated.
Thanks.

We build a single pattern string ('pat') for grepl. First, split each element of 'var1' on whitespace ('\\s+'); the output is a list. Then loop over that list with sapply, joining the words of each element (plus the original phrase) with paste(collapse='|'), and collapse the resulting vector into one string with another paste. The | acts as OR in the pattern. Applying grepl with 'pat' to each element of 'var2' and summing the matches gives the count vector 'v1', which is then used to subset 'var2' as described in the question.
pat <- paste(sapply(strsplit(var1, '\\s+'), function(x)
         paste(unique(c(x, paste(x, collapse=' '))), collapse='|')),
       collapse='|')
v1 <- sapply(var2, function(x) sum(grepl(pat, x)))
v1
#[1] 1 4 0
var2[which.max(v1)]
#[[1]]
#[1] "house" "tree" "pine tree" "tree pine" "dense forest"
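Because 'pat' contains no word boundaries, it also matches inside longer words (for example "pine" in "pineapple"). If whole-word matching is wanted, one possible variant wraps the alternatives in \\b. This is only a minimal sketch and not part of the original answer; the names 'words', 'pat_whole', 'counts' and 'best' are made up for illustration.

```r
var1 <- c("pine tree", "forest", "fruits", "water")
var2 <- list(c("tree", "house", "star"),
             c("house", "tree", "pine tree", "tree pine", "dense forest"),
             c("apple", "orange", "grapes"))

# Split the phrases into individual words and drop duplicates
words <- unique(unlist(strsplit(var1, "\\s+")))

# \\b anchors every alternative at word boundaries
pat_whole <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Number of strings in each list element that contain a whole-word match
counts <- sapply(var2, function(x) sum(grepl(pat_whole, x)))
counts
# [1] 1 4 0

# Element with the most matches (ties resolve to the first maximum)
best <- var2[which.max(counts)]

# Unlike the unanchored pattern, partial words are not matched
grepl(pat_whole, c("pineapple", "trees"))
# [1] FALSE FALSE
```

Add ignore.case = TRUE to the grepl call if capitalisation should not matter.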

Related

R - Combine two elements within the SAME list. Preferentially a purrr solution

I am looking for a purrr solution for the following problem:
Say, we have some list:
list( c("Hello", "Well", "You" ),
c("again", "done,", "annoy"),
c("my friend!", "boy!", "me!" ) )
Now, I would like to combine the first two elements within that list.
My desired output is:
list( c("Hello", "Well", "You" , "again",
"done,", "annoy"),
c("again", "done,", "annoy"),
c("my friend!", "boy!", "me!" ) )
Appreciate your help! Thanks.
Subset the first two list elements, concatenate with do.call and
assign (<-) it back to the first element
lst1[[1]] <- do.call(c, lst1[1:2])
-output
> lst1
[[1]]
[1] "Hello" "Well" "You" "again" "done," "annoy"
[[2]]
[1] "again" "done," "annoy"
[[3]]
[1] "my friend!" "boy!" "me!"
With purrr, we can use modify_in
library(purrr)
modify_in(lst1, 1, ~ flatten_chr(lst1[1:2]))
[[1]]
[1] "Hello" "Well" "You" "again" "done," "annoy"
[[2]]
[1] "again" "done," "annoy"
[[3]]
[1] "my friend!" "boy!" "me!"
data
lst1 <- list(c("Hello", "Well", "You"), c("again", "done,", "annoy"),
c("my friend!", "boy!", "me!"))
I don't think you want a purrr solution, but if you insist for some workflow reason...
x <- list(c("Hello", "Well", "You"),
c("again", "done,", "annoy"),
c("my friend!", "boy!", "me!"))
library(purrr)
modify_at(x, 1, ~ c(., x[[2]]))
# which can simplify to...
x %>%
modify_at(1, c, .[[2]])
# or with more purrr!!
x %>%
modify_at(1, c, pluck(., 2))
But I would just do...
x[[1]] <- c(x[[1]], x[[2]])

R: How to delete words other than specific words in a corpus

In the corpus "tkn_pb", I would like to delete all words except for some keywords I chose (e.g. "attack" and "gunman"). Is it possible to do this?
You can use which and grepl to subset your corpus:
Data:
sample_tokens <- c("word", "another","a", "new", "word token", "one", "more", "and", "another one")
Remove all words except "a" and "and":
sample_tokens[which(grepl("\\b(a|and)\\b", sample_tokens))]
[1] "a" "and"
EDIT:
If the corpus is a list, then this solution suggested by @John would work:
Data:
sample_tokens <- list(c("word", "another","a", "new", "word token", "one", "more", "and", "another one"),
c("yet", "a", "few", "more", "words"),
c("and", "so on"))
lapply(sample_tokens, function(x) x[which(grepl("\\b(a|and)\\b", x))])
[[1]]
[1] "a" "and"
[[2]]
[1] "a"
[[3]]
[1] "and"
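Note that the grepl approach keeps any token that contains a keyword as a whole word, so a multi-word token such as "a word" would be kept in full. If every token is a single word and only exact matches should survive, %in% may be simpler. The sketch below makes assumptions: the structure of "tkn_pb" is not shown in the question, so 'tokens' is hypothetical sample data and 'keep' is an illustrative name.

```r
# Keywords to keep (taken from the question)
keep <- c("attack", "gunman")

# Hypothetical tokenised corpus: one character vector per document
tokens <- list(c("the", "gunman", "fled", "after", "the", "attack"),
               c("attackers", "planned", "an", "attack"))

# Keep only tokens that are exactly equal to one of the keywords
filtered <- lapply(tokens, function(x) x[x %in% keep])
filtered
# [[1]]
# [1] "gunman" "attack"
# [[2]]
# [1] "attack"
```

Here "attackers" is dropped because %in% compares whole strings rather than searching inside them.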

How to split string in R with regular expression when parts of the regular expression are to be kept in the subsequent splitted strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
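The pattern works because (?=[a-z0-9]) is a zero-width lookahead: the split consumes only the ", " and leaves the following lower-case letter or digit in the second piece, while n = 2 stops after the first split. Base R's strsplit has no equivalent of n, so a rough base-only sketch, assuming the control character "\001" never occurs in the data, is to mark the first match with sub and then split on the marker. The names 'marked' and 'parts' are illustrative.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# sub() replaces only the first match; the lookahead keeps the next character
marked <- sub(", (?=[a-z0-9])", "\001", x, perl = TRUE)

# Split on the marker, treated as a literal string
parts <- strsplit(marked, "\001", fixed = TRUE)
parts
# [[1]]
# [1] "ABC"
# [[2]]
# [1] "ABC, EF"
# [[3]]
# [1] "ABC, DEF" "2 stems"
# [[4]]
# [1] "DE"                        "other comments, and stuff"
```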

How to associate the factor and lists?

I have a lot of named lists. Now I want to separate their elements according to the number of times the letter "a" occurs within each element. For instance,
library(stringr)
data1 <- c("apple","appreciate","available","account","adapt")
data2 <- c("tab","banana","cable","tatabox","aaaaaaa","aaaaaaaaaaa")
list1 <- list(data1,data2)
names(list1) <- c("a","b")
ca <- lapply(list1, function(x) str_count(x, "a")) #counting letter a
factor1 <- lapply(ca,as.factor) #convert ca to factor
# Is it possible to associate factor1 with list1, so that I can separate
# the elements according to factor1?
# ideal results
# result$`1` or result[1]
$`1`
$`a`$`1`
[1] "apple" "account"
$`b`$`1`
[1] "tab" "cable"
You can get very close with one line using split and Map:
Map(split, list1, Map(stringr::str_count, list1, "a"))
$a
$a$`1`
[1] "apple" "account"
$a$`2`
[1] "appreciate" "adapt"
$a$`3`
[1] "available"
$b
$b$`1`
[1] "tab" "cable"
$b$`2`
[1] "tatabox"
$b$`3`
[1] "banana"
$b$`7`
[1] "aaaaaaa"
$b$`11`
[1] "aaaaaaaaaaa"
The result is grouped by list name first ("a", then "b") and, within each name, by the number of "a" characters.
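To get the layout the question asks for, with the count on the outer level and the list name inside it, the nested result can be regrouped. The following is a sketch in base R only; 'count_a', 'by_name', 'keys' and 'result' are illustrative names, and the counting helper is a stand-in for stringr::str_count.

```r
list1 <- list(
  a = c("apple", "appreciate", "available", "account", "adapt"),
  b = c("tab", "banana", "cable", "tatabox", "aaaaaaa", "aaaaaaaaaaa")
)

# Count occurrences of the letter "a" in each string
count_a <- function(x) lengths(regmatches(x, gregexpr("a", x, fixed = TRUE)))

# Same grouping as the Map/split one-liner: name first, then count
by_name <- Map(split, list1, lapply(list1, count_a))

# All counts that occur anywhere, in numeric order
keys <- unique(unlist(lapply(by_name, names)))
keys <- keys[order(as.numeric(keys))]

# Regroup: count first, then name; drop names that lack a given count
result <- lapply(setNames(keys, keys), function(k) {
  Filter(Negate(is.null), lapply(by_name, function(g) g[[k]]))
})

result[["1"]]
# $a
# [1] "apple"   "account"
# $b
# [1] "tab"   "cable"
```

Counts that occur in only one list, such as 7, end up with a single entry (here only "b").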

R match words in list

I have a character vector
var1 <- c("pine tree", "forest", "fruits", "water")
and a list
var2 <- list(c("tree", "house", "star"), c("house", "tree", "dense forest"), c("apple", "orange", "grapes"))
I want to match words in var1 with words in var2, and RANK the list elements according to the number of words matched. For example,
[[2]]
[1] "house" "tree" "dense forest"
has 2 matches with var1
[[1]]
[1] "tree" "house" "star"
has 1 match with var1
[[3]]
[1] "apple" "orange" "grapes"
has 0 matches with var1
And the desired output is the following rank:
[1] "house" "tree" "dense forest"
[2] "tree" "house" "star"
[3] "apple" "orange" "grapes"
I tried
sapply(var1, grep, var2, ignore.case=T, value=T)
without getting the output desired.
How to solve it? A code snippet would be appreciated.
Thanks.
EDIT:
The problem has been edited from single word match to word match in phrases as described above.
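You can try the following (the output shown is from the original single-word data, in which the second element of var2 ended in "forest" instead of "dense forest"):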
var2[[which.max(lapply(var2, function(x) sum(var1 %in% x)))]]
[1] "house" "tree" "forest"
Following the OP's latest edit and @frank's comment, to rank the elements:
var2[order(-sapply(var2, function(x) sum(var1 %in% x)))]
[[1]]
[1] "house" "tree" "forest"
[[2]]
[1] "tree" "house" "star"
[[3]]
[1] "apple" "orange" "grapes"
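Since %in% compares whole strings, it does not count "dense forest" as a match for "forest" in the edited, phrase-based data. One way to handle phrases without regular expressions is to split both sides into words first. This is a sketch under that assumption, not part of the answer above; 'words', 'counts' and 'ranked' are illustrative names.

```r
var1 <- c("pine tree", "forest", "fruits", "water")
var2 <- list(c("tree", "house", "star"),
             c("house", "tree", "dense forest"),
             c("apple", "orange", "grapes"))

# Individual words that occur in var1
words <- unique(unlist(strsplit(var1, "\\s+")))

# For each list element, count the strings sharing at least one word with var1
counts <- sapply(var2, function(x) {
  sum(sapply(strsplit(x, "\\s+"), function(w) any(w %in% words)))
})
counts
# [1] 1 2 0

# Rank the list elements by decreasing number of matches
ranked <- var2[order(-counts)]
ranked[[1]]
# [1] "house"        "tree"         "dense forest"
```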
