Extract string between exact word and pattern using stringr - r

I would like to extract, using stringr or another R package, the text between the exact phrase "to the" (which is always lowercase) and the second comma that follows it.
For instance:
String: "This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want"
Desired output: "THIS IS WHAT I WANT, DO YOU SEE IT?"
I have this vector:
x<-c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
"HYU_IO TO TO to the I WANT, THIS, this i dont, want", "uiui uiu to the xxxx,,this is not, what I want")
and I am trying to use this code
str_extract(string = x, pattern = "(?<=to the ).*(?=\\,)")
but I can't seem to get it to give me this:
"THIS IS WHAT I WANT, DO YOU SEE IT?"
"I WANT, THIS"
"xxxx,"
Thank you guys so much for your time and help

You were close!
str_extract(string = x, pattern = "(?<=to the )[^,]*,[^,]*")
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
The look-behind stays the same, [^,]* matches anything but a comma, then , matches exactly one comma, then [^,]* again for anything but a comma.
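For anyone who prefers to avoid the stringr dependency, the same pattern can be sketched in base R. The helper name extract_after and its arguments start and n_commas are made up for this illustration; it assumes start is a literal, fixed-length string with no regex metacharacters (PCRE look-behinds need a fixed length), and it returns NA where nothing matches.

```r
x <- c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
       "HYU_IO TO TO to the I WANT, THIS, this i dont, want",
       "uiui uiu to the xxxx,,this is not, what I want")

# Hypothetical helper: text after `start`, up to (not including) the n-th comma
extract_after <- function(x, start = "to the ", n_commas = 2) {
  # [^,]* is one comma-free run; each (?:,[^,]*) adds one comma plus another run
  pattern <- paste0("(?<=", start, ")[^,]*(?:,[^,]*){", n_commas - 1, "}")
  m <- regexpr(pattern, x, perl = TRUE)   # first match per string, -1 if none
  hit <- as.vector(m) != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)            # regmatches() drops non-matches
  out
}

result <- extract_after(x)
result
```

The original attempt fails because .* is greedy and runs to the last comma in the string, whereas [^,]* cannot cross a comma at all.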

An alternative approach, not as concise as Gregor Thomas's, but it may still be useful:
convert the vector to a tibble
separate twice: first by "to the ", then by ","
paste the pieces back together
pull to get a vector as output.
library(tidyverse)
as_tibble(x) %>%
separate(value, c("a", "b"), sep = 'to the ') %>%
separate(b, c("a", "c"), sep =",") %>%
mutate(x = paste0(a, ",", c), .keep="unused") %>%
pull(x)
[1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
[2] "I WANT, THIS"
[3] "xxxx,"

Related

Is there a way to replace a word only if it's paired with another word?

I have a regular expressions/string replacement dilemma (sorry if this is a duplicate post, I was looking for a solution and couldn't find one, but please let me know if I just missed a similar post!).
We have a dataset that's structured in two columns: subject and verb. I want to delete every modal auxiliary in the verb column, but only if the modal is with another word. So I want to replace "can" in the string "can do" with "", but I don't want to replace "can" if it appears on its own. I thought I could maybe use an ifelse statement, like in the code below:
all_doubles <- all_doubles %>%
mutate(modal_removed = ifelse(str_detect(all_doubles$verb_lemma, modal_with_words) == TRUE,
str_replace_all("can|could|may|might|shall|should|will|would|need", ""),
all_doubles$verb_lemma))
But I'm having trouble getting the regex right to return only modal auxiliaries that are accompanied by other words. Right now, I'm using this but it doesn't seem to be working well:
modal_with_words <- ".+can|could|may|might|shall|should|will|would|need.+"
Any advice would be very appreciated (I'm sure there's a better way to do this)! Thank you very much!
It seems all you want is to remove a modal verb from your list if there is whitespace followed by a letter right after it.
In this case, all you need is
rx <- '(?:\\s+|^)(?:can|could|may|might|shall|should|will|would|need)(\\s+[[:alpha:]])'
verb <- c('I can help you.', 'We shall not stop here!')
gsub(rx, '\\1', verb)
# => [1] "I help you." "We not stop here!"
See the R demo. The (?:\s+|^)(?:can|could|may|might|shall|should|will|would|need)(\s+[[:alpha:]]) regex matches
(?:\s+|^) - one or more whitespaces or start of string
(?:can|could|may|might|shall|should|will|would|need) - one of the words
(\s+[[:alpha:]]) - Group 1 (\1 in the replacement refers to this value): one or more whitespaces and a letter.
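As a rough sketch of how this regex could be applied to a column like the one in the question: the data frame below is invented for illustration (only the names all_doubles, verb_lemma and modal_removed come from the question), and trimws() is added because a modal at the start of the string leaves the captured whitespace behind.

```r
rx <- "(?:\\s+|^)(?:can|could|may|might|shall|should|will|would|need)(\\s+[[:alpha:]])"

# Made-up sample data; the real all_doubles is assumed to look roughly like this
all_doubles <- data.frame(subject    = c("he", "she", "they"),
                          verb_lemma = c("can do", "can", "might go"),
                          stringsAsFactors = FALSE)

# A lone modal has no following whitespace + letter, so it is left untouched
all_doubles$modal_removed <- trimws(gsub(rx, "\\1", all_doubles$verb_lemma, perl = TRUE))
all_doubles$modal_removed
```

Because the regex itself only matches a modal that is followed by another word, no separate ifelse() or str_detect() step is needed.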
If your list of potential candidates is fairly short and unambiguous, you can just concatenate the potential words into a set of lookup words in a regular expression.
#Create what I think the data looks like based on your question
replacement_targets <- data.frame(subject = c("He", "He", "She", "She", "They", "They", "It", "It"),
verb = c("can", "can do", "can't help", "can help", "can't do", "can't", "will not help", "will help"))
replacement_targets$string <- paste0(replacement_targets$subject, " ", replacement_targets$verb)
substitution_list <- data.frame(modal_aux = c("can", "can't", "can", "can't", "will", "will not"),
target = c("do", "do", "help", "help", "help", "help"))
#Constructs a regular expression based on the list of words
pattern <- paste0("(", paste(unique(substitution_list$modal_aux), collapse = "|"), ").?(", paste(unique(substitution_list$target), collapse="|"), ")")
#Replaces any matches with just the second captured group, where applicable
gsub(pattern, "\\2", replacement_targets$string)

Looping through and replacing text in a data frame

I have a dataframe that consists of a variable with multiple words, such as:
variable
"hello my name is this"
"greetings friend"
And another dataframe that consists of two columns, one of which is words, the other of which is replacements for those words, such as:
word
"hello"
"greetings"
replacement:
replacement
"hi"
"hi"
I'm trying to find an easy way to replace the words in "variable" with the replacement words, looping over both all the observations, and all the words in each observation. The desired result is:
variable
"hi my name is this"
"hi friend"
I've looked into some methods that use cSplit, but it's not feasible for my application (there are too many words in any given observation of "variable", so this creates too many columns). I'm not sure how I would use strsplit for this, but am guessing that is the correct option?
EDIT: From my understanding of this question, my question may be a repeat of a previously unanswered question: Replace strings in text based on dictionary
stringr's str_replace_all would be handy in this case:
df = data.frame(variable = c('hello my name is this','greetings friend'))
replacement <- data.frame(word = c('hello','greetings'), replacement = c('hi','hi'), stringsAsFactors = F)
stringr::str_replace_all(df$variable,replacement$word,replacement$replacement)
Output:
> stringr::str_replace_all(df$variable,replacement$word,replacement$replacement)
[1] "hi my name is this" "hi friend"
This is similar to @amrrs's solution, but I am using a named vector instead of supplying two separate vectors. This also addresses the issue mentioned by OP in the comments:
library(dplyr)
library(stringr)
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .) %>%
str_replace_all(df1$variable, .)
# [1] "hi my name is this" "hi friend" "hi, hellomy is not a word"
# [4] "hi! my friend"
This is the named vector with regex as the names and string to replace with as elements:
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .)
# \\bhello\\b \\bgreetings\\b
# "hi" "hi"
Data:
df1 = data.frame(variable = c('hello my name is this',
'greetings friend',
'hello, hellomy is not a word',
'greetings! my friend'))
df2 = data.frame(word = c('hello','greetings'),
replacement = c('hi','hi'),
stringsAsFactors = F)
Note:
In order to address the issue of root words also being converted, I wrapped the regex with word boundaries (\\b). This makes sure that I am not converting words that live inside another, like "helloguys".
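If stringr is not available, the same dictionary idea can be sketched in base R with a plain loop. The function name replace_words is made up for this example, and it assumes the dictionary words contain no regex metacharacters. Every dictionary entry is applied to every string, so the result does not depend on the row order of either data frame.

```r
df1 <- data.frame(variable = c("hello my name is this",
                               "greetings friend",
                               "hello, hellomy is not a word"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(word = c("hello", "greetings"),
                  replacement = c("hi", "hi"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: apply each word -> replacement pair to all strings in turn
replace_words <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    # \\b on both sides keeps "hello" inside "hellomy" from being replaced
    text <- gsub(paste0("\\b", dict$word[i], "\\b"), dict$replacement[i],
                 text, perl = TRUE)
  }
  text
}

replaced <- replace_words(df1$variable, df2)
replaced
```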

gsubfn function not giving desired output when ignore.case = TRUE

I am trying to substitute multiple patterns within a character vector with their corresponding replacement strings. After doing some research I found the package gsubfn which I think is able to do what I want it to, however when I run the code below I don't get my expected output (see end of question for results versus what I expected to see).
library(gsubfn)
# Our test data that we want to search through (while ignoring case)
test.data<- c("1700 Happy Pl","155 Sad BLVD","82 Lolly ln", "4132 Avent aVe")
# A list data frame which contains the patterns we want to search for
# (again ignoring case) and the associated replacement strings we want to
# exchange any matches we come across with.
frame<- data.frame(pattern= c(" Pl"," blvd"," LN"," ave"), replace= c(" Place", " Boulevard", " Lane", " Avenue"),stringsAsFactors = F)
# NOTE: I added spaces in front of each of our replacement terms to make
# sure we only grab matches that are their own word (for instance if an
# address was 45 Splash Way we would not want to replace "pl" inside of
# "Splash" with "Place").
# The following set of paste lines are supposed to eliminate the substitute function from
# grabbing instances like first instance of " Ave" found directly after "4132"
# inside "4132 Avent Ave" which we don't want converted to " Avenue".
pat <- paste(paste(frame$pattern,collapse = "($|[^a-zA-Z])|"),"($|[^a-zA-Z])", sep = "")
# Here is the gsubfn function I am calling
gsubfn(x = test.data, pattern = pat, replacement = setNames(as.list(frame$replace),frame$pattern), ignore.case = T)
Output being received:
[1] "1700 Happy" "155 Sad" "82 Lolly" "4132 Avent"
Output expected:
[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
My working theory on why this isn't working is that the matches don't match the names associated with the list I am passing into the gsubfn's replacement argument because of some case discrepancies (eg: the match being found on "155 Sad BLVD" doesn't == " blvd" even though it was able to be seen as a match due to the ignore.case argument). Can someone confirm that this is the issue/point me to what else might be going wrong, and perhaps a way of fixing this that doesn't require me expanding my pattern vector to include all case permutations if possible?
Seems like stringr has a simple solution for you:
library(stringr)
str_replace_all(test.data,
regex(paste0('\\b',frame$pattern,'$'),ignore_case = T),
frame$replace)
#[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
Note that I had to alter the regex to match only at the end of the string because of the tricky 'Avent aVe'. There are of course other ways to handle that too.
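One thing to keep in mind with the call above: when the pattern vector has the same length as the input, str_replace_all() pairs the i-th pattern with the i-th string, so it relies on test.data and frame being in the same order. A base R sketch that tries every pattern on every address is shown below. The helper name expand_suffix is made up, and the leading spaces are dropped from the patterns and replacements because \\b already enforces the word boundary.

```r
test.data <- c("1700 Happy Pl", "155 Sad BLVD", "82 Lolly ln", "4132 Avent aVe")
frame <- data.frame(pattern = c("Pl", "blvd", "LN", "ave"),
                    replace = c("Place", "Boulevard", "Lane", "Avenue"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: expand a trailing abbreviation, ignoring case
expand_suffix <- function(x, dict) {
  for (i in seq_len(nrow(dict))) {
    rx <- paste0("\\b", dict$pattern[i], "$")   # whole word at end of string only
    x <- gsub(rx, dict$replace[i], x, ignore.case = TRUE, perl = TRUE)
  }
  x
}

expanded <- expand_suffix(test.data, frame)
expanded

# Order of the addresses no longer matters
expand_suffix(rev(test.data), frame)
```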

combine any word which comes after specific word

I want to join a specific word with whatever word comes after it. I have tried a bigram approach, which is too slow, and also gregexpr, but didn't get a good solution. For example:
text="This approach isnt good enough."
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
BigramTokenizer(text)
[1] "This approach" "approach isnt" "isnt good" "good enough"
What I really want is isnt_good as a single word in the text, i.e. to join isnt with the word that comes after it:
text
"This approach isnt_good enough."
Is there an efficient way to do this so that the result is a single unigram? Thanks.
To extract all occurrences of the word "isnt" and the following word you can do this:
library(stringr)
pattern <- "isnt \\w+"
str_extract_all(text, pattern)
[[1]]
[1] "isnt good"
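It does essentially the same thing as the base R example below, but I find the stringr solution more elegant and readable. Note that regexpr() returns only the first match in each string; use gregexpr() to get all of them, as str_extract_all() does.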
> regmatches(text, regexpr(pattern, text))
[1] "isnt good"
Update
To replace the occurrences of isnt x with isnt_x you just need gsub of the base package.
gsub("isnt (\\w+)", "isnt_\\1", text)
[1] "This approach isnt_good enough."
What you do is to use a capturing group that copies whatever is found inside the parentheses to the \\1. See this page for a good introduction: http://www.regular-expressions.info/brackets.html
How about this function?
joinWords <- function(string, word){
y <- paste0(word, " ")
x <- unlist(strsplit(string, y))
paste0(x[1], word, "_", x[2])
}
> text <- "This approach isnt good enough."
> joinWords(text, "isnt")
# [1] "This approach isnt_good enough."
> joinWords("This approach might work for you", "might")
# [1] "This approach might_work for you"
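joinWords() as written only joins the first occurrence and drops anything after a second one. A possible variant built on the gsub() idea from the earlier answer handles repeated occurrences. The name join_after is made up for this sketch, and it assumes word contains no regex metacharacters.

```r
# Hypothetical helper: join every occurrence of `word` with the word after it
join_after <- function(string, word) {
  # group 1 = the target word, group 2 = the following word
  pattern <- paste0("\\b(", word, ")\\s+(\\w+)")
  gsub(pattern, "\\1_\\2", string, perl = TRUE)
}

single   <- join_after("This approach isnt good enough.", "isnt")
repeated <- join_after("This isnt good and isnt bad.", "isnt")
single
repeated
```

Since gsub() is vectorised, the same call also works on a whole character vector of documents.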

How to Convert "space" into "%20" with R

As the title says, I am trying to work out how to convert the spaces between words into %20.
For example,
> y <- "I Love You"
How to make y = I%20Love%20You
> y
[1] "I%20Love%20You"
Thanks a lot.
Another option would be URLencode():
y <- "I love you"
URLencode(y)
[1] "I%20love%20you"
gsub() is one option:
R> gsub(pattern = " ", replacement = "%20", x = y)
[1] "I%20Love%20You"
The function curlEscape() from the package RCurl gets the job done.
library('RCurl')
y <- "I love you"
curlEscape(urls=y)
[1] "I%20love%20you"
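I like URLencode(), but be aware that it may not work as expected if your URL already contains a %20 together with a real space. With the default repeated = FALSE, URLencode() returns the input unchanged as soon as it finds something that looks like a percent-escape, and with repeated = TRUE it also re-encodes the existing % sign, so neither option does what you want.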
In my case, I needed to run both URLencode() and gsub consecutively to get exactly what I needed, like so:
a = "encoded%20space/real space.csv"
URLencode(a)
#returns: "encoded%20space/real space.csv"
#note the spaces that are not transformed
URLencode(a, repeated=TRUE)
#returns: "encoded%2520space/real%20space.csv"
#note the %2520 in the first part
gsub(" ", "%20", URLencode(a))
#returns: "encoded%20space/real%20space.csv"
In this particular example, gsub() alone would have been enough, but URLencode() is of course doing more than just replacing spaces.
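Another way to handle a partly encoded string, sketched here as an assumption rather than a recommendation, is to decode first and then encode again. The helper name normalise_url is made up; URLdecode() and URLencode() are both in the utils package. This assumes every % in the input starts a valid percent-escape, so a literal % sign in the data would be decoded incorrectly.

```r
a <- "encoded%20space/real space.csv"

# Hypothetical helper: undo any existing escapes, then encode everything once
normalise_url <- function(url) {
  URLencode(URLdecode(url))
}

normalised <- normalise_url(a)
normalised
```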
