I am trying to substitute multiple patterns within a character vector with their corresponding replacement strings. After doing some research I found the package gsubfn which I think is able to do what I want it to, however when I run the code below I don't get my expected output (see end of question for results versus what I expected to see).
library(gsubfn)
# Our test data that we want to search through (while ignoring case)
test.data<- c("1700 Happy Pl","155 Sad BLVD","82 Lolly ln", "4132 Avent aVe")
# A list data frame which contains the patterns we want to search for
# (again ignoring case) and the associated replacement strings we want to
# exchange any matches we come across with.
frame<- data.frame(pattern= c(" Pl"," blvd"," LN"," ave"), replace= c(" Place", " Boulevard", " Lane", " Avenue"),stringsAsFactors = F)
# NOTE: I added spaces in front of each of our replacement terms to make
# sure we only grab matches that are their own word (for instance if an
# address was 45 Splash Way we would not want to replace "pl" inside of
# "Splash" with "Place
# The following set of paste lines are supposed to eliminate the substitute function from
# grabbing instances like first instance of " Ave" found directly after "4132"
# inside "4132 Avent Ave" which we don't want converted to " Avenue".
pat <- paste(paste(frame$pattern,collapse = "($|[^a-zA-Z])|"),"($|[^a-zA-Z])", sep = "")
# Here is the gsubfn function I am calling
gsubfn(x = test.data, pattern = pat, replacement = setNames(as.list(frame$replace),frame$pattern), ignore.case = T)
Output being received:
[1] "1700 Happy" "155 Sad" "82 Lolly" "4132 Avent"
Output expected:
[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
My working theory on why this isn't working is that the matches don't match the names associated with the list I am passing into the gsubfn's replacement argument because of some case discrepancies (eg: the match being found on "155 Sad BLVD" doesn't == " blvd" even though it was able to be seen as a match due to the ignore.case argument). Can someone confirm that this is the issue/point me to what else might be going wrong, and perhaps a way of fixing this that doesn't require me expanding my pattern vector to include all case permutations if possible?
Seems like stringr has a simple solution for you:
library(stringr)
str_replace_all(test.data,
regex(paste0('\\b',frame$pattern,'$'),ignore_case = T),
frame$replace)
#[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
Note that I had to alter the regex to look for only words at the end of the string because of the tricky 'Avent aVe'. But of course there's other ways to handle that too.
Related
I am working with really long strings. How can I remove the first 4 words after a certain string pattern occurs? For example:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
#remove the fist 4 words after and including "stackoverflow"
result
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Solution with base R
A one line solution:
pattern <- "stackoverflow"
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
How it works
Create the pattern you want with a regex:
"stackoverflow" followed by 4 words.
Definitely, check out ?regex for more info about it.
Words are identified by \\w+ and separators are identified by \\W+ (capital w, it includes spaces and special characters like the apostrophe that you have in the sentence)
(...){0,4} means that the combination of word and separator may repeat up to 4 times.
\\W* needs to identify a possible final separator, so that the remaining two pieces of the sentence won't have two separators dividing them. Try it without, you'll see what I mean.
gsub locates the pattern you want and replace it with "" (thus deliting it).
Handle Exceptions
Note that it works even for particular cases:
# end of a sentence with fewer than 4 words after
string <- "hello I am a user of stackoverflow and I am"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "
# beginning of a sentence
string <- "stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."
# pattern == string
string <- "stackoverflow and I am really"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
A tidyverse solution
library(stringr)
# locate start and end position of pattern
tmp <- str_locate(string, paste0(pattern,"(\\W+\\w+){0,4}\\W*"))
# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)
# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])
# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Search for your pattern with additional spaces and words after it. Find the positions of the first last match, split the string and paste it back together. At the end gsub any double (or more) spaces.
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
pat="stackoverflow"
library(stringr)
tmp=str_locate(
string,
paste0(
pat,
paste0(
rep("\\s?[a-zA-Z]+",4),
collapse=""
)
)
)
gsub("\\s{2,}"," ",
paste0(
substring(string,1,tmp[1]-1),
substring(string,tmp[2]+1)
)
)
[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
Quick answer, I am sure you can have better code thant that:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
t<-read.table(textConnection(string))
string2<-''
i<-0
j<-0
for(i in 1:length(t)){
if(t[i]=="stackoverflow"){
j=i
}else if(j>0){
if(i-j>4){
string2=paste0(string2, " " , t[i])
}
}else if(j==0){
if(i>1){
string2=paste0(string2, " " , t[i])
}else{
string2=t[i]
}
}
}
print(string2)
Given the following string:
my.str <- "I welcome you my precious dude"
One splits it:
my.splt.str <- strsplit(my.str, " ")
And then concatenates:
paste(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6], sep = " ")
The result is:
[1] "I you precious" "welcome my dude"
When not using the colon operator it returns the correct order:
paste(my.splt.str[[1]][1], my.splt.str[[1]][2], my.splt.str[[1]][3], my.splt.str[[1]][4], my.splt.str[[1]][5], my.splt.str[[1]][6], sep = " ")
[1] "I welcome you my precious dude"
Why is this happening?
paste is designed to work with vectors element-by-element. Say you did this:
names <- c('Alice', 'Bob', 'Charlie')
paste('Hello', names)
You'd want to result to be [1] "Hello Alice" "Hello Bob" "Hello Charlie", rather than "Hello Hello Hello Alice Bob Charlie".
To make it work like you want it to, rather than giving the different sections to paste as separate arguments, you could first combine them into a single vector with c:
paste(c(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6]), collapse = " ")
## [1] "I welcome you my precious dude"
We can use collapse instead of sep
paste(my.splt.str[[1]], collapse= ' ')
If we use the first approach by OP, it is pasteing the corresponding elements from each of the subset
If we want to selectively paste, first create an object because the [[ repeat can be avoided
v1 <- my.splt.str[[1]]
v1[3:4] <- toupper(v1[3:4])
paste(v1, collapse=" ")
#[1] "I welcome YOU MY precious dude"
When we have multiple arguments in paste, it is doing the paste on the corresponding elements of it
paste(v1[1:2], v1[3:4])
#[1] "I you" "welcome my"
If we use collapse, then it would be a single string, but still the order is different because the first element of v1[1:2] is pasteed with the first element of v1[3:4] and 2nd with the 2nd element
paste(v1[1:2], v1[3:4], collapse = ' ')
#[1] "I you welcome my"
It is documented in ?paste
paste converts its arguments (via as.character) to character strings, and concatenates them (separating them by the string given by sep). If the arguments are vectors, they are concatenated term-by-term to give a character vector result. Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
Also, converting to uppercase can be done on a substring without splitting as well
sub("^(\\w+\\s+\\w+)\\s+(\\w+\\s+\\w+)", "\\1 \\U\\2", my.str, perl = TRUE)
#[1] "I welcome YOU MY precious dude"
I have some strings of text (example below). As you can see each string was split at a period or question mark.
[1]"I am a Mr."
[2]"asking for help."
[3]"Can you help?"
[4]"Thank you ms."
[5]"or mr."
I want to collapse where the string ends with an abbreviation like mr., mrs. so the end result would be the desired output below.
[1]"I am a Mr. asking for help."
[2]"Can you help?"
[3]"Thank you ms. or mr."
I already created a vector (called abbr) containing all my abbreviations in the following format:
> abbr
[1] "Mr|Mrs|Ms|Dr|Ave|Blvd|Rd|Mt|Capt|Maj"
but I can't figure out how to use it in paste function to collapse. I have also tried using gsub (didn't work) to replace \n following abbreviation with a period with a space like this:
lines<-gsub('(?<=abbr\\.\\n)(?=[A-Z])', ' ', lines, perl=FALSE)
We can use tapply to collapse string and grepl to create groups to collapse.
x <- c("I am a Mr.", "asking for help.","Can you help?","Thank you ms.", "or Mr.")
#Include all the abbreviations with proper cases
#Note that "." has a special meaning in regex so you need to escape it.
abbr <- 'Mr\\.|Mrs\\.|Ms\\.|Dr\\.|mr\\.|ms\\.'
unname(tapply(x, c(0, head(cumsum(!grepl(abbr, x)), -1)), paste, collapse = " "))
#[1] "I am a Mr. asking for help." "Can you help?" "Thank you ms. or mr."
I have a string printed out like this:
"\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
(The "\" wasn't there. R just automatically prints it out.)
I would like to calculate how many non-empty segments there are in this string. In this case the answer should be 11.
I tried to convert it to a vector, but R ignores the quotation marks so I still ended up with a vector with length 1.
I don't know whether I need to extract those segments first and then count, or there're easier ways to do that.
If it's the former case, which regular expression function best suits my need?
Thank you very much.
You can use scan to convert your large string into a vector of individual ones, then use nchar to count the lengths. Assuming your large string is x:
y <- scan(text=x, what="character", sep=",", strip.white=TRUE)
Read 12 items
sum(nchar(y)>0)
[1] 11
I assume a segment is defined as anything between . or ,. An option using strsplit can be found as:
length(grep("\\w+", trimws(strsplit(str, split=",|\\.")[[1]])))
#[1] 11
Note: trimws is not mandatory in above statement. I have included so that one can get the value of each segment by just adding value = TRUE argument in grep.
Data:
str <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
strsplit might be one possibility?
txt <- "Jenna and Alex were making cupcakes., Jenna asked Alex whether all were ready to be frosted.,
Alex said that, some of them , were., He added, that, the rest, would be, ready, soon.,"
a <- strsplit(txt, split=",")
length(a[[1]])
[1] 11
If the backslashes are part of the text it doesnt really change a lot, except for the last element which would have "\"" in it. By filtering that out, the result is the same:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all
were ready to be frosted.\", \"Alex said that\", \" some of them \",
\"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
a <- strsplit(txt, split=", \"")
length(a[[1]][a[[1]] != "\""])
[1] 11
This is an absurd idea, but it does work:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
Txt <-
read.csv(text = txt,
header = FALSE,
colClasses = "character",
na.strings = c("", " "))
sum(!vapply(Txt, is.na, logical(1)))
For a multiple pattern matches (present in a character vector), i tried to apply grep(paste(States,collapse="|), Description). It works fine, but the problem here is that
Consider,
Descritpion=C("helloWorld Washington DC","Hello Stackoverflow////Newyork RBC")
States=C("DC","RBC","WA")
if the multiple pattern match for "WA" in the Description Vector. My function works for "helloWorld **Wa**shington DC" because "WA" is present. But i need a suggestion regarding the search pattern not in the whole String but at the end of String here with DC,RBC.
Thanks in advance
I guess you want something like the following. I've taken the liberty to clean up your example a bit.
Description <- c("helloWorld Washington DC", "Hello Stackoverflow", "Newyork RBC")
States <- c("DC","RBC","WA")
search.string <- paste0(States, "$", collapse = "|") # Construct the reg. exprs.
grep(search.string, Description, value = TRUE)
#[1] "helloWorld Washington DC" "Newyork RBC"
Note, we use $ to signify end-of-string match.