I have a character vector
var1 <- c("pine tree", "forest", "fruits", "water")
and a list
var2 <- list(c("tree", "house", "star"), c("house", "tree", "dense forest"), c("apple", "orange", "grapes"))
I want to match words in var1 with words in var2, and RANK the list elements according to the number of words matched. For example,
[[2]]
[1] "house" "tree" "dense forest"
has 2 matches with var1
[[1]]
[1] "tree" "house" "star"
has 1 match with var1
[[3]]
[1] "apple" "orange" "grapes"
has 0 matches with var1
And the desired output is the following rank:
[1] "house" "tree" "dense forest"
[2] "tree" "house" "star"
[3] "apple" "orange" "grapes"
I tried
sapply(var1, grep, var2, ignore.case=T, value=T)
without getting the output desired.
How to solve it? A code snippet would be appreciated.
Thanks.
EDIT:
The problem has been edited from single word match to word match in phrases as described above.
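You can try the following (note that %in% tests for exact equality, so this applies to the original single-word version of the question):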
var2[[which.max(lapply(var2, function(x) sum(var1 %in% x)))]]
[1] "house" "tree" "forest"
Following the OP's latest edit and @Frank's comment, to rank all the elements:
var2[order(-sapply(var2, function(x) sum(var1 %in% x)))]
[[1]]
[1] "house" "tree" "forest"
[[2]]
[1] "tree" "house" "star"
[[3]]
[1] "apple" "orange" "grapes"
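Because %in% only counts exact matches, a phrase such as "dense forest" is not matched by "forest". For the edited, phrase-based version of the question, one possible approach is to split var1 into single words and count partial matches with grepl. This is a minimal sketch, assuming that a substring match on any single word should count as a match; `words`, `pat`, `n_match` and `ranked` are names introduced here for illustration.

```r
var1 <- c("pine tree", "forest", "fruits", "water")
var2 <- list(c("tree", "house", "star"),
             c("house", "tree", "dense forest"),
             c("apple", "orange", "grapes"))

# Split the phrases in var1 into single words and join them with "|" (regex OR)
words <- unique(unlist(strsplit(var1, "\\s+")))
pat <- paste(words, collapse = "|")

# For each list element, count the entries that contain any of the words
n_match <- sapply(var2, function(x) sum(grepl(pat, x, ignore.case = TRUE)))

# Reorder the list by descending match count
ranked <- var2[order(-n_match)]
ranked
```

With the data above, the element containing "dense forest" is ranked first, the one containing only "tree" second, and the fruit names last.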
The problem is I have a list of character vectors.
example:
mylist <- list( c("once","upon","a","time"),
c("once", "in", "olden", "times"),
c("Let","all","good","men"),
c("Let","This"),
c("once", "is","never","enough"),
c("in","the"),
c("Come","dance","all","around"))
and I want to prepend c("one", "two") to those vectors starting "once" to end up with the list
mylist <- list( c("one", "two", "once","upon","a","time"),
c("one", "two", "once", "in", "olden", "times"),
c("Let","all","good","men"),
c("Let","This"),
c("one", "two", "once", "is","never","enough"),
c("in","the"),
c("Come","dance","all","around"))
So far, I can select the relevant vectors
mylist[grep("once",mylist)]
and I can prepend "one" and "two" to create a results list
resultlist <- lapply(mylist[grep("once",mylist)],FUN = function(listrow) prepend(listrow,c("One","Two")))
But putting the results in the correct place in mylist?
Nope, that escapes me!
Hints, tips and solutions most welcome :-)
We can use
lapply(mylist , \(x) if(grepl("once" , x[1]))
append(x, c("one", "two") , 0) else x)
Output
[[1]]
[1] "one" "two" "once" "upon" "a" "time"
[[2]]
[1] "one" "two" "once" "in" "olden" "times"
[[3]]
[1] "Let" "all" "good" "men"
[[4]]
[1] "Let" "This"
[[5]]
[1] "one" "two" "once" "is" "never" "enough"
[[6]]
[1] "in" "the"
[[7]]
[1] "Come" "dance" "all" "around"
I don't think you need grep at all. Loop over the list, checking the first value for "once" and appending via c() the extra values:
lapply(mylist, \(x) if(x[1] == "once") c("one", "two", x) else x)
##[[1]]
##[1] "one" "two" "once" "upon" "a" "time"
##
##[[2]]
##[1] "one" "two" "once" "in" "olden" "times"
##
##[[3]]
##[1] "Let" "all" "good" "men"
##
##[[4]]
##[1] "Let" "This"
##
##[[5]]
##[1] "one" "two" "once" "is" "never" "enough"
##
##[[6]]
##[1] "in" "the"
##
##[[7]]
##[1] "Come" "dance" "all" "around"
Another option with map_if
library(purrr)
map_if(mylist, .p = ~ .x[1] == "once", .f = ~ c("one", "two", .x))
Output
[[1]]
[1] "one" "two" "once" "upon" "a" "time"
[[2]]
[1] "one" "two" "once" "in" "olden" "times"
[[3]]
[1] "Let" "all" "good" "men"
[[4]]
[1] "Let" "This"
[[5]]
[1] "one" "two" "once" "is" "never" "enough"
[[6]]
[1] "in" "the"
[[7]]
[1] "Come" "dance" "all" "around"
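To write the modified vectors back into their original positions, which is the step the question got stuck on, one option is to work with positions and assign in place. This is a minimal base R sketch, assuming the match should be on the first word only; `idx` is a name introduced here for illustration.

```r
mylist <- list(c("once", "upon", "a", "time"),
               c("once", "in", "olden", "times"),
               c("Let", "all", "good", "men"),
               c("Let", "This"),
               c("once", "is", "never", "enough"),
               c("in", "the"),
               c("Come", "dance", "all", "around"))

# Positions of the vectors whose first word is "once"
idx <- which(vapply(mylist, function(x) x[1] == "once", logical(1)))

# Replace only those positions; all other elements are left untouched
mylist[idx] <- lapply(mylist[idx], function(x) c("one", "two", x))
mylist
```

Unlike the lapply versions, this modifies mylist itself rather than returning a new list.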
Let's say I have a string and I only want the unique words in the sentence, as separate elements:
a = "an apple is an apple"
word <- function(a){
words<- c(strsplit(a,split = " "))
return(unique(words))
}
word(a)
This returns
[[1]]
[1] "an" "apple" "is" "an" "apple"
and the output I'm expecting is
'an','apple','is'
What am I doing wrong? I'd really appreciate any help.
Cheers!
The problem is that wrapping strsplit(.) in c(.) does not change the fact that it is still a list, and unique will be operating at the list-level, not the word-level.
c(strsplit(rep(a, 2), "\\s+"))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
# [[2]]
# [1] "an" "apple" "is" "an" "apple"
unique(c(strsplit(rep(a, 2), "\\s+")))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
Alternatives:
If length(a) is always 1, then perhaps
unique(strsplit(a, "\\s+")[[1]])
# [1] "an" "apple" "is"
If length(a) can be 2 or more and you want a list of unique words for each sentence, then
a2 <- c("an apple is an apple", "a pear is a pear", "an orange is an orange")
lapply(strsplit(a2, "\\s+"), unique)
# [[1]]
# [1] "an" "apple" "is"
# [[2]]
# [1] "a" "pear" "is"
# [[3]]
# [1] "an" "orange" "is"
(Note: this always returns a list, regardless of the number of sentences in the input.)
If length(a) can be 2 or more and you want the unique words across all sentences, then
unique(unlist(strsplit(a2, "\\s+")))
# [1] "an" "apple" "is" "a" "pear" "orange"
(Note: this method also works well when length(a) is 1.)
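Applied to the function from the question, the fix amounts to flattening the list before calling unique. This is a minimal sketch, assuming words are separated by whitespace; the name `unique_words` is chosen here for illustration.

```r
unique_words <- function(a) {
  # strsplit() returns a list with one character vector per input string;
  # unlist() flattens it so that unique() compares individual words
  unique(unlist(strsplit(a, "\\s+")))
}

a <- "an apple is an apple"
unique_words(a)
```

The result is a plain character vector, so it can be used directly wherever individual words are expected.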
Another possible solution, based on stringr::str_split:
library(tidyverse)
a %>% str_split("\\s+") %>% unlist %>% unique
#> [1] "an" "apple" "is"
You could try
unique(unlist(strsplit(a, " ")))
I have a list containing a number of lists containing character vectors. The lists are always arranged so that the first list contains a vector with a single element, the second list contains a vector with two elements and the third contains one or more vectors containing three elements.
fruits <- list(
list(c("orange")),
list(c("pear", "orange")),
list(c("orange", "pear", "grape"),
c("orange", "lemon", "pear"))
)
I need to iterate through the lists and, from each vector, remove the elements that appear in the previous list. That is, I would first take the value from the vector in the first list ("orange") and remove it from the vector in the second list, then take the values from the second list ("pear", "orange") and remove them from both vectors in the third list, so that I end up with:
new_fruits <- list(
list(c("orange")),
list(c("pear")),
list(c("grape"),
c("lemon"))
)
I should add that I have had a go at doing this, but I'm finding the lists within lists make it quite complicated and my solution is long and not very efficient.
Maybe you can try the code below
new_fruits <- s <- c()
for (k in seq_along(fruits)) {
new_fruits[[k]] <- lapply(fruits[[k]],function(x) x[!x%in%s])
s <- union(s,unlist(fruits[[k]]))
}
which gives
> new_fruits
[[1]]
[[1]][[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"
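The loop keeps a running set of every value seen in the earlier lists. The same idea can be sketched without a loop variable that grows, by recomputing the "seen so far" set for each position. This is only a sketch under that reading of the problem; `seen_before` is a helper name introduced here, and setdiff also drops duplicates within a vector, which x[!x %in% s] does not.

```r
fruits <- list(
  list(c("orange")),
  list(c("pear", "orange")),
  list(c("orange", "pear", "grape"),
       c("orange", "lemon", "pear"))
)

# All values that occur in the lists before position k (NULL for k = 1)
seen_before <- function(k) unlist(fruits[seq_len(k - 1)])

# For each inner vector, keep only the values not seen in earlier lists
new_fruits <- lapply(seq_along(fruits), function(k) {
  lapply(fruits[[k]], setdiff, seen_before(k))
})
new_fruits
```

This keeps the nested list-of-lists structure requested in the question.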
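Here is an idea where we unlist, convert each vector to a string, and re-split so that the words from different vectors of the same element are kept apart. We then unlist once more and take the unique values. Note that this returns a flat list of the unique words rather than the nested structure shown in the question: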
as.list(unique(unlist(strsplit(unlist(lapply(fruits, function(i) sapply(i, toString))), ', '))))
#[[1]]
#[1] "orange"
#[[2]]
#[1] "pear"
#[[3]]
#[1] "grape"
#[[4]]
#[1] "lemon"
Another two compact options:
> mapply(fruits,append(list(list("")),fruits[-length(fruits)], after = length(fruits)), FUN = function(x,y) sapply(x,function(item)list(setdiff(item,y[[1]]))))
[[1]]
[[1]][[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"
or also
> append(fruits[[1]],mapply(fruits[-1],fruits[-length(fruits)], FUN = function(x,y) sapply(x,function(item)list(setdiff(item,y[[1]])))), after = length(fruits))
[[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"
I have a lot of named lists. Now I want to separate them according to the number of times the letter "a" occurs in each element. For instance,
library(stringr)
data1 <- c("apple","appreciate","available","account","adapt")
data2 <- c("tab","banana","cable","tatabox","aaaaaaa","aaaaaaaaaaa")
list1 <- list(data1,data2)
names(list1) <- c("a","b")
ca <- lapply(list1, function(x) str_count(x, "a")) #counting letter a
factor1 <- lapply(ca,as.factor) #convert ca to factor
# Is it possible to associate factor1 with list1, so that I can separate
# the elements according to factor1?
# Ideal result for result$`1` (or result[1]):
$`1`
$`a`$`1`
[1] "apple" "account"
$`b`$`1`
[1] "tab" "cable"
You can get very close with one line using split and Map:
Map(split, list1, Map(stringr::str_count, list1, "a"))
$a
$a$`1`
[1] "apple" "account"
$a$`2`
[1] "appreciate" "adapt"
$a$`3`
[1] "available"
$b
$b$`1`
[1] "tab" "cable"
$b$`2`
[1] "tatabox"
$b$`3`
[1] "banana"
$b$`7`
[1] "aaaaaaa"
$b$`11`
[1] "aaaaaaaaaaa"
This lists all the "a" elements first and then all the "b" elements grouped by the number of "a" characters.
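To get the grouping the other way round, with the count at the top level and the list names underneath as in the ideal result, the nested list can be regrouped. This is a base R sketch under that assumption; `count_a`, `by_name`, `counts` and `by_count` are names introduced here, and the counting is done with gregexpr instead of stringr.

```r
data1 <- c("apple", "appreciate", "available", "account", "adapt")
data2 <- c("tab", "banana", "cable", "tatabox", "aaaaaaa", "aaaaaaaaaaa")
list1 <- list(a = data1, b = data2)

# Number of "a" characters in each string
count_a <- function(x) lengths(regmatches(x, gregexpr("a", x, fixed = TRUE)))

# Same grouping as above: list name first, then count
by_name <- lapply(list1, function(x) split(x, count_a(x)))

# Regroup: count first, then list name, dropping names without that count
counts <- sort(unique(as.integer(unlist(lapply(by_name, names)))))
by_count <- lapply(setNames(counts, counts), function(n) {
  out <- lapply(by_name, `[[`, as.character(n))
  out[!vapply(out, is.null, logical(1))]
})

by_count[["1"]]
```

Here by_count[["1"]] holds the elements with exactly one "a" from both "a" and "b", while by_count[["7"]] only has a "b" entry.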
I have a character vector
var1 <- c("pine tree", "forest", "fruits", "water")
and a list
var2 <- list(c("tree", "house", "star"), c("house", "tree", "pine tree", "tree pine", "dense forest"), c("apple", "orange", "grapes"))
I want to match words in var1 with words in var2, and extract the maximum matching element in var2. For example,
[[1]]
[1] "tree" "house" "star"
has 1 match with var1
[[2]]
[1] "house" "tree" "pine tree" "tree pine" "dense forest"
has 4 matches with var1
[[3]]
[1] "apple" "orange" "grapes"
has 0 matches with var1
And the desired output is the following:
[[2]]
[1] "house" "tree" "pine tree" "tree pine" "dense forest"
I tried
sapply(var1, grep, var2, ignore.case=T, value=T)
without getting the output desired.
How to solve it? A code snippet would be appreciated.
Thanks.
We create a pattern string ('pat') for grepl by first splitting 'var1' on whitespace ('\\s+'), which gives a list. We loop over that list with sapply, paste each element's words (plus the original phrase) together with collapse='|', and then collapse the resulting vector into a single string with another paste. In a regular expression, | acts as OR, so grepl returns TRUE for any string in 'var2' that contains at least one of the words. Summing those matches per list element gives the count vector ('v1'), and which.max(v1) picks the element of 'var2' with the most matches, as described in the question.
pat <- paste(sapply(strsplit(var1, '\\s+'), function(x)
paste(unique(c(x, paste(x, collapse=' '))), collapse='|')),
collapse='|')
v1 <- sapply(var2, function(x) sum(grepl(pat, x)))
v1
#[1] 1 4 0
var2[which.max(v1)]
#[[1]]
#[1] "house" "tree" "pine tree" "tree pine" "dense forest"
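One caveat of a plain alternation pattern is that it also matches inside longer words (for example, "tree" inside "street"). If whole-word matches are wanted, word boundaries can be added. This is a sketch under that assumption, not part of the answer above; `words` and `pat_wb` are names introduced here.

```r
var1 <- c("pine tree", "forest", "fruits", "water")
var2 <- list(c("tree", "house", "star"),
             c("house", "tree", "pine tree", "tree pine", "dense forest"),
             c("apple", "orange", "grapes"))

# Single words from var1, wrapped in \b ... \b so only whole words match
words <- unique(unlist(strsplit(var1, "\\s+")))
pat_wb <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Count matching entries per list element and extract the best one
v1 <- sapply(var2, function(x) sum(grepl(pat_wb, x, perl = TRUE)))
var2[[which.max(v1)]]
```

For the data in the question the counts are the same as before; the difference only shows up for strings such as "street", which no longer match.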