R: How to subset multiple elements from a list

> x
[[1]]
[1] "Bob"    "John"   "Tom"    "Claire" "Betsy"

[[2]]
[1] "Strawberry" "Banana"     "Kiwi"

[[3]]
[1] "Red"   "Blue"  "White"
Suppose I have a list x as shown above. I wish to extract the 2nd element of each entry in the list:
x[[1]][2]
x[[2]][2]
x[[3]][2]
How can I do that in one command? I tried x[[1:3]][2] but I got an error.
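(Why x[[1:3]] fails: `[[` with a vector index performs recursive indexing, so x[[c(1, 2, 3)]] means x[[1]][[2]][[3]], not "entries 1 through 3". A minimal sketch, assuming each entry of x is a single character vector:)

```r
x <- list(c("Bob", "John", "Tom", "Claire", "Betsy"),
          c("Strawberry", "Banana", "Kiwi"),
          c("Red", "Blue", "White"))

x[[c(1, 2)]]    # recursive indexing: x[[1]][[2]] -> "John"
try(x[[1:3]])   # x[[1]][[2]][[3]] -> Error: subscript out of bounds
```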

Try
sapply(x2, `[`, 2)
#[1] " from localhost (localhost [127.0.0.1])"
#[2] " from phobos [127.0.0.1]"
#[3] " from n20.grp.scd.yahoo.com (n20.grp.scd.yahoo.com"
#[4] " from [66.218.67.196] by n20.grp.scd.yahoo.com with NNFMP;"
data
x2 <- list(c("Received", " from localhost (localhost [127.0.0.1])"),
           c("Received", " from phobos [127.0.0.1]"),
           c("Received", " from n20.grp.scd.yahoo.com (n20.grp.scd.yahoo.com"),
           c("Received", " from [66.218.67.196] by n20.grp.scd.yahoo.com with NNFMP;"))


Return only the unique words [duplicate]

Let's say I have a string and I only want the unique words in the sentence as separate elements:
a = "an apple is an apple"
word <- function(a){
  words <- c(strsplit(a, split = " "))
  return(unique(words))
}
word(a)
This returns
[[1]]
[1] "an" "apple" "is" "an" "apple"
and the output I'm expecting is
'an','apple','is'
What am I doing wrong? Any help is really appreciated.
Cheers!
The problem is that wrapping strsplit(.) in c(.) does not change the fact that it is still a list, and unique will be operating at the list-level, not the word-level.
c(strsplit(rep(a, 2), "\\s+"))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
# [[2]]
# [1] "an" "apple" "is" "an" "apple"
unique(c(strsplit(rep(a, 2), "\\s+")))
# [[1]]
# [1] "an" "apple" "is" "an" "apple"
Alternatives:
If length(a) is always 1, then perhaps
unique(strsplit(a, "\\s+")[[1]])
# [1] "an" "apple" "is"
If length(a) can be 2 or more and you want a list of unique words for each sentence, then
a2 <- c("an apple is an apple", "a pear is a pear", "an orange is an orange")
lapply(strsplit(a2, "\\s+"), unique)
# [[1]]
# [1] "an" "apple" "is"
# [[2]]
# [1] "a" "pear" "is"
# [[3]]
# [1] "an" "orange" "is"
(Note: this always returns a list, regardless of the number of sentences in the input.)
If length(a) can be 2 or more and you want the unique words across all sentences, then
unique(unlist(strsplit(a2, "\\s+")))
# [1] "an" "apple" "is" "a" "pear" "orange"
(Note: this method also works well when length(a) is 1.)
Another possible solution, based on stringr::str_split:
library(tidyverse)
a %>% str_split("\\s+") %>% unlist %>% unique
#> [1] "an" "apple" "is"
You could try
unique(unlist(strsplit(a, " ")))

Subsetting a string based on multiple conditions

I have a vector where each element is a string. I only want to keep the part of the string right before the '==' regardless of whether it is at the beginning of the string, after the & symbol, or after the | symbol. Here is my data:
data <- c("name=='John'", "name=='David'&age=='50'|job=='Doctor'&city=='Liverpool'",
"job=='engineer'&name=='Andrew'",
"city=='Manchester'", "age=='40'&city=='London'"
)
My ideal format would be something like this:
[1] "name"
[2] "name" "age" "job" "city"
[3] "job" "name"
[4] "city"
[5] "age" "city"
The closest I have got is using genXtract from the qdap library, which puts the data in the format above, but I only know how to use it with one condition, i.e.
qdap::genXtract(data, "&", "==")
But I don't just want the part of the string between & and == but also between | and == or the beginning of the string and ==
What this regex does is capture every run of a-zA-Z0-9 (letters and digits) that comes right before an occurrence of ==.
stringr::str_extract_all( data, "[0-9a-zA-Z]+(?=(==))")
[[1]]
[1] "name"
[[2]]
[1] "name" "age" "job" "city"
[[3]]
[1] "job" "name"
[[4]]
[1] "city"
[[5]]
[1] "age" "city"
If you want the output as a vector, use
L <- stringr::str_extract_all( data, "[0-9a-zA-Z]+(?=(==))" )
unlist( lapply( L, paste, collapse = " " ) )
results in
[1] "name"
[2] "name age job city"
[3] "job name"
[4] "city"
[5] "age city"
In base R, this can be done with regmatches/gregexpr
lst1 <- regmatches(data, gregexpr("\\w+(?=\\={2})", data, perl = TRUE))
sapply(lst1, paste, collapse = " ")
#[1] "name"
#[2] "name age job city"
#[3] "job name"
#[4] "city"
#[5] "age city"

Stemming function in R

The corpus package lets you supply a custom stemming function. The stemming function should, when given a term as input, return the stem of that term as output.
From Stemming Words I took the following example, which uses the hunspell dictionary to do the stemming.
First I define the sentences on which to test this function:
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
The custom stemming function is:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
This code (text_tokens is from the corpus package)
library(corpus)
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
produces:
> sentences
[[1]]
[1] "the" "color" "blue" "neutralize" "orange" "yellow"
[7] "reflection" "."
[[2]]
[1] "zod" "stabbed" "me" "with" "blue" "kryptonite"
[7] "."
[[3]]
[1] "because" "blue" "i" "your" "favourite" "colour"
[7] "."
[[4]]
[1] "re" "i" "wrong" "," "blue" "i" "right" "."
[[5]]
[1] "you" "and" "i" "are" "go"
[6] "to" "yellowstone" "."
[[6]]
[1] "van" "gogh" "look" "for" "some" "yellow" "at" "sunset" "."
[[7]]
[1] "you" "ruin" "my" "beautiful" "green" "dress"
[7] "."
[[8]]
[1] "you" "do" "not" "agree" "."
[[9]]
[1] "there" "nothing" "wrong" "with" "green" "."
After stemming I would like to apply other operations to the text, e.g. removing stop words. However, when I applied the tm function
removeWords(sentences, stopwords)
to my sentences, I obtained the following error:
Error in UseMethod("removeWords", x) :
no applicable method for 'removeWords' applied to an object of class "list"
If I use
unlist(sentences)
I don't get the desired result, because I end up with a character vector of 65 elements. The desired result should be (e.g. for the first sentence):
"the color blue neutralize orange yellow reflection."
If you want to remove stopwords from each sentence, you could use lapply :
library(tm)
lapply(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection" "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
#...
#...
However, from your expected output it looks like you want to paste the text together.
lapply(sentences, paste0, collapse = " ")
#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."
#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
We can use map
library(tm)
library(purrr)
map(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection"
#[8] "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."

How to split a string in R with a regular expression when parts of the regular expression are to be kept in the subsequent split strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"

Filter list in R which has nchar > 1

I have a list of names
> x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
> name <- strsplit(x, " ")
> name
[[1]]
[1] "Test" "t"
[[2]]
[1] "Cuma" "Nama" "K"
[[3]]
[1] "O"
[[4]]
[1] "Test" "satu" "dua" "t"
How can I filter the list so that it becomes like the following? I am trying to keep only the words with nchar > 1.
> name
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[4]]
[1] "Test" "satu" "dua"
lapply(name, function(x) x[nchar(x)>1])
Results in:
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[3]]
character(0)
[[4]]
[1] "Test" "satu" "dua"
We can loop over the list elements, subset the elements that have nchar greater than 1, and use Filter to drop the entries left with 0 elements
Filter(length, lapply(name, function(x) x[nchar(x) > 1]))
#[[1]]
#[1] "Test"
#[[2]]
#[1] "Cuma" "Nama"
#[[3]]
#[1] "Test" "satu" "dua"
If we want to remove the words with one character from the string, we can also do this without splitting
setdiff(gsub("(^| ).( |$)", "", x), "")
#[1] "Test" "Cuma Nama" "Test satu dua"
