Return only the unique words [duplicate] - r

This question already has answers here:
List of unique words from data.frame
(2 answers)
Closed 9 months ago.
Let's say I have a string and I only want the unique words in the sentence as separate elements
a = "an apple is an apple"
word <- function(a){
  words <- c(strsplit(a, split = " "))
  return(unique(words))
}
word(a)
This returns
[[1]]
[1] "an" "apple" "is" "an" "apple"
and the output I'm expecting is
'an','apple','is'
What am I doing wrong? Any help would be really appreciated.
Cheers!

The problem is that wrapping strsplit(.) in c(.) does not change the fact that it is still a list, and unique will be operating at the list-level, not the word-level.
c(strsplit(rep(a, 2), "\\s+"))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
# [[2]]
# [1] "an" "apple" "is" "an" "apple"
unique(c(strsplit(rep(a, 2), "\\s+")))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
Alternatives:
If length(a) is always 1, then perhaps
unique(strsplit(a, "\\s+")[[1]])
# [1] "an" "apple" "is"
If length(a) can be 2 or more and you want a list of unique words for each sentence, then
a2 <- c("an apple is an apple", "a pear is a pear", "an orange is an orange")
lapply(strsplit(a2, "\\s+"), unique)
# [[1]]
# [1] "an" "apple" "is"
# [[2]]
# [1] "a" "pear" "is"
# [[3]]
# [1] "an" "orange" "is"
(Note: this always returns a list, regardless of the number of sentences in the input.)
If length(a) can be 2 or more and you want the unique words across all sentences, then
unique(unlist(strsplit(a2, "\\s+")))
# [1] "an" "apple" "is" "a" "pear" "orange"
(Note: this method also works well when length(a) is 1.)

Another possible solution, based on stringr::str_split:
library(tidyverse)
a %>% str_split("\\s+") %>% unlist %>% unique
#> [1] "an" "apple" "is"

You could try
unique(unlist(strsplit(a, " ")))
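For completeness, a minimal fix to the original word() function, written as a sketch of the [[1]] approach shown above (it assumes length(a) is 1):
# strsplit() returns a list, so take its first element before calling unique()
word <- function(a) {
  unique(strsplit(a, split = " ")[[1]])
}
word("an apple is an apple")
# [1] "an"    "apple" "is"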

Related

Program to iterate through lists within a list

I have a list containing a number of lists containing character vectors. The lists are always arranged so that the first list contains a vector with a single element, the second list contains a vector with two elements and the third contains one or more vectors containing three elements.
fruits <- list(
  list(c("orange")),
  list(c("pear", "orange")),
  list(c("orange", "pear", "grape"),
       c("orange", "lemon", "pear"))
)
I need to iterate through the lists in order to remove the elements that appeared in the previous list, i.e. I would first take the value from the vector in the first list ("orange") and remove it from the vector in the second list, then take the values from the second list ("pear", "orange") and remove them from both vectors in the third list, so that I end up with:
new_fruits <- list(
  list(c("orange")),
  list(c("pear")),
  list(c("grape"),
       c("lemon"))
)
I should add that I have had a go at doing this, but I'm finding the lists within lists make it quite complicated and my solution is long and not very efficient.
Maybe you can try the code below
new_fruits <- s <- c()
for (k in seq_along(fruits)) {
  new_fruits[[k]] <- lapply(fruits[[k]], function(x) x[!x %in% s])
  s <- union(s, unlist(fruits[[k]]))
}
which gives
> new_fruits
[[1]]
[[1]][[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"
Here is an idea where we unlist, convert each inner vector to a single string, and re-split so as to differentiate between the different vectors of the same list element. We then unlist one more time and get the unique values, i.e.
as.list(unique(unlist(strsplit(unlist(lapply(fruits, function(i) sapply(i, toString))), ', '))))
#[[1]]
#[1] "orange"
#[[2]]
#[1] "pear"
#[[3]]
#[1] "grape"
#[[4]]
#[1] "lemon"
Two more compact options:
mapply(fruits,
       append(list(list("")), fruits[-length(fruits)], after = length(fruits)),
       FUN = function(x, y) sapply(x, function(item) list(setdiff(item, y[[1]]))))
[[1]]
[[1]][[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"
or also
append(fruits[[1]],
       mapply(fruits[-1], fruits[-length(fruits)],
              FUN = function(x, y) sapply(x, function(item) list(setdiff(item, y[[1]])))),
       after = length(fruits))
[[1]]
[1] "orange"
[[2]]
[[2]][[1]]
[1] "pear"
[[3]]
[[3]][[1]]
[1] "grape"
[[3]][[2]]
[1] "lemon"

ft_tokenizer tokenizes words to lower, I want it to be as they are

I am using ft_tokenizer on a Spark data frame in R. It tokenizes each word and converts everything to lower case; I want the words to keep their original case.
text_data <- data_frame(
  x = c("This IS a sentence", "So is this")
)
tokenized <- text_data_tbl %>%
ft_tokenizer("x", "word")
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
##
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"
I want:
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
##
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"
I guess it is not possible with ft_tokenizer. From ?ft_tokenizer
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
So its basic behaviour is to convert the string to lowercase and split on whitespace, which I guess cannot be changed. Consider doing
text_data$new_x <- lapply(strsplit(text_data$x, "\\s+"), as.list)
which will give the same output as expected and you can continue your process as it is from here.
text_data$new_x
#[[1]]
#[[1]][[1]]
#[1] "This"
#[[1]][[2]]
#[1] "IS"
#[[1]][[3]]
#[1] "a"
#[[1]][[4]]
#[1] "sentence"
#[[2]]
#[[2]][[1]]
#[1] "So"
#[[2]][[2]]
#[1] "is"
#[[2]][[3]]
#[1] "this"

How to split a string in R with a regular expression when parts of the regular expression are to be kept in the resulting split strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"

Filter list in R which has nchar > 1

I have a list of names
> x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
> name <- strsplit(x, " ")
> name
[[1]]
[1] "Test" "t"
[[2]]
[1] "Cuma" "Nama" "K"
[[3]]
[1] "O"
[[4]]
[1] "Test" "satu" "dua" "t"
How can I filter the list so that it becomes like the following? I am trying to keep only the elements with nchar > 1.
> name
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[4]]
[1] "Test" "satu" "dua"
lapply(name, function(x) x[nchar(x)>1])
Results in:
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[3]]
character(0)
[[4]]
[1] "Test" "satu" "dua"
We can loop over the list elements, subset the elements that have nchar greater than 1, and use Filter to remove the entries that are left with 0 elements
Filter(length, lapply(name, function(x) x[nchar(x) > 1]))
#[[1]]
#[1] "Test"
#[[2]]
#[1] "Cuma" "Nama"
#[[3]]
#[1] "Test" "satu" "dua"
If we want to remove the words with one character from the string, we can also do this without splitting
setdiff(gsub("(^| ).( |$)", "", x), "")
#[1] "Test" "Cuma Nama" "Test satu dua"

R match words in list

I have a character vector
var1 <- c("pine tree", "forest", "fruits", "water")
and a list
var2 <- list(c("tree", "house", "star"), c("house", "tree", "dense forest"), c("apple", "orange", "grapes"))
I want to match words in var1 with words in var2, and RANK the list elements according to the number of words matched. For example,
[[2]]
[1] "house" "tree" "dense forest"
has 2 matches with var1
[[1]]
[1] "tree" "house" "star"
has 1 match with var1
[[3]]
[1] "apple" "orange" "grapes"
has 0 matches with var1
And the desired output is the following rank:
[1] "house" "tree" "dense forest"
[2] "tree" "house" "star"
[3] "apple" "orange" "grapes"
I tried
sapply(var1, grep, var2, ignore.case=T, value=T)
without getting the desired output.
How can I solve this? A code snippet would be appreciated.
Thanks.
EDIT:
The problem has been edited from single word match to word match in phrases as described above.
You can try
var2[[which.max(lapply(var2, function(x) sum(var1 %in% x)))]]
[1] "house" "tree" "forest"
From the last modification of the OP and @Frank's comment:
var2[order(-sapply(var2, function(x) sum(var1 %in% x)))]
[[1]]
[1] "house" "tree" "forest"
[[2]]
[1] "tree" "house" "star"
[[3]]
[1] "apple" "orange" "grapes"
