Subsetting a string based on multiple conditions


I have a vector where each element is a string. I only want to keep the part of the string right before the '==' regardless of whether it is at the beginning of the string, after the & symbol, or after the | symbol. Here is my data:
data <- c("name=='John'", "name=='David'&age=='50'|job=='Doctor'&city=='Liverpool'",
"job=='engineer'&name=='Andrew'",
"city=='Manchester'", "age=='40'&city=='London'"
)
My ideal format would be something like this:
[1] "name"
[2] "name" "age" "job" "city"
[3] "job" "name"
[4] "city"
[5] "age" "city"
The closest I have got is using genXtract from the qdap library, which puts the data in the format above, but I only know how to use it with one condition, i.e.
qdap::genXtract(data, "&", "==")
But I don't just want the part of the string between & and == but also between | and == or the beginning of the string and ==

This regex captures every run of letters and digits ([0-9a-zA-Z]) that is immediately followed by ==. The lookahead (?=(==)) checks for the == without including it in the match.
stringr::str_extract_all( data, "[0-9a-zA-Z]+(?=(==))")
[[1]]
[1] "name"
[[2]]
[1] "name" "age" "job" "city"
[[3]]
[1] "job" "name"
[[4]]
[1] "city"
[[5]]
[1] "age" "city"
If you want the output as a vector, use
L <- stringr::str_extract_all( data, "[0-9a-zA-Z]+(?=(==))" )
unlist( lapply( L, paste, collapse = " " ) )
results in
[1] "name"
[2] "name age job city"
[3] "job name"
[4] "city"
[5] "age city"

In base R, this can be done with regmatches/gregexpr
lst1 <- regmatches(data, gregexpr("\\w+(?=\\={2})", data, perl = TRUE))
sapply(lst1, paste, collapse = " ")
#[1] "name"
#[2] "name age job city"
#[3] "job name"
#[4] "city"
#[5] "age city"
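As an alternative to the lookahead, the same result can be sketched in base R by splitting first and trimming second. This is only a minimal sketch: it assumes no value ever contains a literal & or |, and the name keys is purely illustrative.

```r
data <- c("name=='John'",
          "name=='David'&age=='50'|job=='Doctor'&city=='Liverpool'",
          "job=='engineer'&name=='Andrew'",
          "city=='Manchester'", "age=='40'&city=='London'")

# Split each string at & or |, then drop everything from the first == onward.
# Assumption: values never contain & or | themselves.
keys <- lapply(strsplit(data, "[&|]"), function(parts) sub("==.*$", "", parts))

keys[[2]]
# [1] "name" "age"  "job"  "city"
```

This keeps one character vector per input string, like the str_extract_all result above.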

Related

How to extract words with exactly one vowel

I have strings like these:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
What I'm trying to do is extract those words that have exactly one vowel. I do get the correct result with this:
library(stringr)
str_extract_all(turns, "\\b[b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\\b")
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "i" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
However, it feels cumbersome to define a consonant class. Is there a more elegant and more concise way?

We can use str_count on the words after splitting the 'turns' at the spaces
library(stringr)
lapply(strsplit(turns, "\\s+"), function(x) x[str_count(x, '[aeiou]') == 1])
-output
#[[1]]
#[1] "him" "to" "stir" "him" "up" "now" "and"
#[[2]]
#[1] "when" "when" "him" "he" "on" "the"
#[[3]]
#[1] "yes" "it" "for" "a" "long"
#[[4]]
#[1] "it" "was"

You can use a PCRE regex with character classes containing double negation:
turns <- c("does him good to stir him up now and again .",
"when , when I see him he w's on the settees .",
"yes it 's been eery for a long time .",
"blissful timing , indeed it was ")
rx <- "\\b[^[:^alpha:]aeiou]*[aeiou][^[:^alpha:]aeiou]*\\b"
regmatches(turns, gregexpr(rx, turns, perl=TRUE, ignore.case=TRUE))
The result is as in the question. Details:
\b - word boundary
[^[:^alpha:]aeiou]* - zero or more letters other than a, e, i, o, u (the class excludes non-letters and the vowels, which leaves only consonants)
[aeiou] - a vowel
[^[:^alpha:]aeiou]* - zero or more letters other than a, e, i, o, u
\b - word boundary.
An equivalent expression:
(?i)\b[^\P{L}aeiou]*[aeiou][^\P{L}aeiou]*\b
\P{L} matches any char but a letter, and (?i) is the equivalent of ignore.case=TRUE.
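To see what the double-negated class matches on its own, here is a small sketch. The test string and the names rx_consonant and consonants are made up for illustration.

```r
# [:^alpha:] is the PCRE negated POSIX class (anything that is not a letter).
# Negating it again together with the vowels leaves letters that are not vowels.
rx_consonant <- "[^[:^alpha:]aeiou]"

s <- "s7t-i r!"
consonants <- regmatches(s, gregexpr(rx_consonant, s, perl = TRUE))[[1]]

consonants
# [1] "s" "t" "r"
```

The digit, the hyphen, the space, the exclamation mark and the vowel i are all rejected, so only consonants survive.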

Here is a base R option using strsplit + nchar + gsub
lapply(
strsplit(turns, "\\s"),
function(v) v[nchar(gsub("[^aeiou]", "", v)) == 1]
)
which gives
[[1]]
[1] "him" "to" "stir" "him" "up" "now" "and"
[[2]]
[1] "when" "when" "him" "he" "on" "the"
[[3]]
[1] "yes" "it" "for" "a" "long"
[[4]]
[1] "it" "was"
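Note that the split-and-count approaches above are case-sensitive, so the uppercase "I" is dropped. A possible variation, assuming you want uppercase vowels counted as well, lowercases each token only for the count (the name one_vowel is illustrative):

```r
turns <- c("does him good to stir him up now and again .",
           "when , when I see him he w's on the settees .",
           "yes it 's been eery for a long time .",
           "blissful timing , indeed it was ")

one_vowel <- lapply(strsplit(turns, "\\s+"), function(w) {
  # Count vowels on a lowercased copy so that "I" counts as one vowel.
  n_vowels <- nchar(gsub("[^aeiou]", "", tolower(w)))
  w[n_vowels == 1]
})

one_vowel[[2]]
# [1] "when" "when" "I"    "him"  "he"   "on"   "the"
```

The tokens are returned unchanged, so "I" keeps its original case.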

R: Possible to extract groups of words from each sentence(rows)? and create data frame(or matrix)?

I created lists for each word to extract words from sentences, for example like this
hello<- NULL
for (i in 1:length(text)){
hello[i]<-as.character(regmatches(text[i], gregexpr("[H|h]ello?", text[i])))
}
But I have more than 25 words list to extract, that's very long coding.
Is it possible to extract a group of characters(words) from text data?
Below is just a pseudo set.
words<-c("[H|h]ello","you","so","tea","egg")
text=c("Hello! How's you and how did saturday go?",
"hello, I was just texting to see if you'd decided to do anything later",
"U dun say so early.",
"WINNER!! As a valued network customer you have been selected" ,
"Lol you're always so convincing.",
"Did you catch the bus ? Are you frying an egg ? ",
"Did you make a tea and egg?"
)
subsets<-NULL
for ( i in 1:length(text)){
.....???
}
Expected output as below
[1] Hello you
[2] hello you
[3] you
[4] you so
[5] you you egg
[6] you tea egg

In base R, you could do:
regmatches(text,gregexpr(sprintf("\\b(%s)\\b",paste0(words,collapse = "|")),text))
[[1]]
[1] "Hello" "you"
[[2]]
[1] "hello" "you"
[[3]]
[1] "so"
[[4]]
[1] "you"
[[5]]
[1] "you" "so"
[[6]]
[1] "you" "you" "egg"
[[7]]
[1] "you" "tea" "egg"
Depending on how you want the results:
trimws(gsub(sprintf(".*?\\b(%s).*?|.*$",paste0(words,collapse = "|")),"\\1 ",text))
[1] "Hello you" "hello you" "so" "you" "you so" "you you egg"
[7] "you tea egg"
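Since the question also asks for a data frame, the list of matches can be collapsed into a column. The following is a sketch rather than the answerer's code: the names pattern, hits and result are assumptions, and the first word is written as [Hh]ello because | inside a character class is a literal pipe, not an alternation.

```r
words <- c("[Hh]ello", "you", "so", "tea", "egg")
text <- c("Hello! How's you and how did saturday go?",
          "hello, I was just texting to see if you'd decided to do anything later",
          "U dun say so early.",
          "WINNER!! As a valued network customer you have been selected",
          "Lol you're always so convincing.",
          "Did you catch the bus ? Are you frying an egg ? ",
          "Did you make a tea and egg?")

# One alternation wrapped in word boundaries, as in the answer above.
pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
hits <- regmatches(text, gregexpr(pattern, text))

# One row per sentence, with the matched words pasted into a single string.
result <- data.frame(text = text,
                     matched = sapply(hits, paste, collapse = " "),
                     stringsAsFactors = FALSE)
result$matched
```

Keeping hits as a list and pasting only at the end means the individual matches stay available if a matrix or a long format is needed later.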

You say that you have a long list of word-sets. Here's a way to turn each wordset into a regex, apply it to a corpus (a list of sentences) and pull out the hits as character-vectors. It's case-insensitive, and it checks for word boundaries, so you don't pull age out of agent or rage.
wordsets <- c(
"oak dogs cheese age",
"fire open jail",
"act speed three product"
)
library(tidyverse)
harvSent <- read_table("SENTENCE
Oak is strong and also gives shade.
Cats and dogs each hate the other.
The pipe began to rust while new.
Open the crate but don't break the glass.
Add the sum to the product of these three.
Thieves who rob friends deserve jail.
The ripe taste of cheese improves with age.
Act on these orders with great speed.
The hog crawled under the high fence.
Move the vat over the hot fire.") %>%
pull(SENTENCE)
aWset builds the regexes from the wordsets and applies them to the sentences:
aWset <- function(harvSent, wordsets){
# Turn out a vector of regex like "(?ix) \\b (oak|dogs|cheese) \\b"
regexS <- paste0("(?ix) \\b (",
str_replace_all(wordsets, " ", "|" ),
") \\b")
# Apply each regex to the sentences
map(regexS,
~ str_extract_all(harvSent, .x, simplify = TRUE) %>%
# str_extract_all returns a character matrix of hits. Paste it together by row.
apply( MARGIN = 1,
FUN = function(x){
str_trim(paste(x, collapse = " "))}))
}
Giving us
aWset(harvSent , wordsets)
[[1]]
[1] "Oak" "dogs" "" "" "" "" "cheese age" ""
[9] "" ""
[[2]]
[1] "" "" "" "Open" "" "jail" "" "" "" "fire"
[[3]]
[1] "" "" "" "" "product three" "" ""

How to split string in R with regular expression when parts of the regular expression are to be kept in the subsequent splitted strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!

One option would be str_split
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
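Base R's strsplit has no counterpart to n = 2, so the same lookahead would also split at the later comma in the fourth string. A possible base R sketch splits at the first match only; split_first is a hypothetical helper, not part of any package.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# Hypothetical helper: split a single string at the first match of `pattern`.
split_first <- function(s, pattern) {
  m <- regexpr(pattern, s, perl = TRUE)
  if (m == -1) return(s)                      # no match: return the string unchanged
  start <- as.integer(m)
  len <- attr(m, "match.length")
  c(substr(s, 1, start - 1), substr(s, start + len, nchar(s)))
}

# The lookahead keeps the lowercase letter or digit in the second piece.
parts <- lapply(x, split_first, pattern = ", (?=[a-z0-9])")

parts[[3]]
# [1] "ABC, DEF" "2 stems"
parts[[4]]
# [1] "DE"                        "other comments, and stuff"
```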

Accessing element of a split string in R

If I have a string,
x <- "Hello World"
How can I access the second word, "World", using string split, after
x <- strsplit(x, " ")
x[[2]] does not do anything.

As mentioned in the comments, it's important to realise that strsplit returns a list object. Since your example is only splitting a single item (a vector of length 1) your list is length 1. I'll explain with a slightly different example, inputting a vector of length 3 (3 text items to split):
input <- c( "Hello world", "Hi there", "Back at ya" )
x <- strsplit( input, " " )
> x
[[1]]
[1] "Hello" "world"
[[2]]
[1] "Hi" "there"
[[3]]
[1] "Back" "at" "ya"
Notice that the returned list has 3 elements, one for each element of the input vector. Each of those list elements is split as per the strsplit call. So we can recall any of these list elements using [[ (this is what your x[[2]] call was doing, but you only had one list element, which is why you couldn't get anything in return):
> x[[1]]
[1] "Hello" "world"
> x[[3]]
[1] "Back" "at" "ya"
Now we can get the second part of any of those list elements by appending a [ call:
> x[[1]][2]
[1] "world"
> x[[3]][2]
[1] "at"
This will return the second item from each list element (note that the "Back at ya" input has returned "at" in this case). You can do this for all items at once using something from the apply family. sapply will return a vector, which will probably be good in this case:
> sapply( x, "[", 2 )
[1] "world" "there" "at"
The last value in the input here (2) is passed to the [ operator, meaning the operation x[2] is applied to every list element.
If instead of the second item, you'd like the last item of each list element, we can use tail within the sapply call instead of [:
> sapply( x, tail, 1 )
[1] "world" "there" "ya"
This time, we've applied tail( x, 1 ) to every list element, giving us the last item.
As a preference, my favourite way to apply actions like these is with the magrittr pipe, for the second word like so:
library(magrittr)
x <- input %>%
strsplit( " " ) %>%
sapply( "[", 2 )
> x
[1] "world" "there" "at"
Or for the last word:
x <- input %>%
strsplit( " " ) %>%
sapply( tail, 1 )
> x
[1] "world" "there" "ya"
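If some inputs might have fewer than two words, vapply is a stricter alternative to sapply because it always returns a character vector. This is a sketch with an extra one-word input ("Solo") added for illustration; the names pieces, second and last are assumptions.

```r
input <- c("Hello world", "Hi there", "Back at ya", "Solo")
pieces <- strsplit(input, " ")

# Indexing past the end of a character vector gives NA, so a missing
# second word shows up as NA instead of causing an error.
second <- vapply(pieces, function(p) p[2], character(1))
last   <- vapply(pieces, function(p) tail(p, 1), character(1))

second
# [1] "world" "there" "at"    NA
last
# [1] "world" "there" "ya"    "Solo"
```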

Another approach that might be a little easier to read and apply to a data frame within a pipeline (though it takes more lines) would be to wrap it in your own function and apply that.
library(tidyverse)
df <- data.frame(
greetings = c( "Hello world", "Hi there", "Back at ya" )
)
split_params = function (x, sep, n) {
# Splits string into list of substrings separated by 'sep'.
# Returns nth substring.
x = strsplit(x, sep)[[1]][n]
return(x)
}
df = df %>%
mutate(
second_word = sapply(
X = greetings,
FUN = split_params,
# Arguments for split_params.
sep = ' ',
n = 2
)
)
df
### (Output in RStudio Notebook)
greetings second_word
<chr> <chr>
Hello world world
Hi there there
Back at ya at
3 rows
###

With stringr 1.5.0, you can use str_split_i to access the ith element of a split string:
library(stringr)
x <- "Hello World"
str_split_i(x, " ", i = 2)
#[1] "World"
It is vectorized:
x <- c("Hello world", "Hi there", "Back at ya")
str_split_i(x, " ", 2)
#[1] "world" "there" "at"

x=strsplit("a;b;c;d",";")
x
[[1]]
[1] "a" "b" "c" "d"
x=as.character(x[[1]])
x
[1] "a" "b" "c" "d"
x=strsplit(x," ")
x
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] "c"
[[4]]
[1] "d"

R: How to subset multiple elements from a list

> x
[[1]]
[1] "Bob" "John" "Tom"
[2] "Claire" "Betsy"
[[2]]
[1] "Strawberry" "Banana"
[2] "Kiwi"
[[3]]
[1] "Red"
[2] "Blue" "White"
Suppose I had a list x as shown above. I wish to subset the 2nd element of each entry in the list
x[[1]][2]
x[[2]][2]
x[[3]][2]
How can I do that in one command? I tried x[[1:3]][2] but I got an error.

Try
sapply(x2, `[`,2)
#[1] " from localhost (localhost [127.0.0.1])"
#[2] " from phobos [127.0.0.1]"
#[3] " from n20.grp.scd.yahoo.com (n20.grp.scd.yahoo.com"
#[4] " from [66.218.67.196] by n20.grp.scd.yahoo.com with NNFMP;"
data
x2 <- list(c("Received", " from localhost (localhost [127.0.0.1])"),
c("Received", " from phobos [127.0.0.1]"), c("Received",
" from n20.grp.scd.yahoo.com (n20.grp.scd.yahoo.com"),
c("Received", " from [66.218.67.196] by n20.grp.scd.yahoo.com with NNFMP;" ) )
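Applied to data shaped like the list in the question, the same idea looks as follows. The list x here is an assumed reconstruction of the printed output in the question, and second_items and first_two are illustrative names.

```r
# Assumed reconstruction of the question's list.
x <- list(c("Bob", "John", "Tom", "Claire", "Betsy"),
          c("Strawberry", "Banana", "Kiwi"),
          c("Red", "Blue", "White"))

# `[` is an ordinary function, so sapply can pass the index 2 to it.
second_items <- sapply(x, `[`, 2)
second_items
# [1] "John"   "Banana" "Blue"

# The same pattern takes several positions at once; lapply keeps a list.
first_two <- lapply(x, `[`, 1:2)
```

x[[1:3]][2] fails because [[ with a vector index indexes recursively into nested elements rather than selecting several list entries.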
