This is my character vector:
mycharacter<-" Directors:Chris Renaud, Yarrow Cheney | Stars:Louis C.K., Eric Stonestreet, Kevin Hart, Lake Bell "
Why can't I extract the "|" from my character vector?
Also, after extracting the "|", how can I build a data frame with two columns, one for Directors and the other for Stars?
Any help?
We can use fixed(), because | is a regex metacharacter that means OR (alternation) by default. To match the literal character, wrap the pattern in fixed(), escape it (\\|), or place it inside square brackets ([|]).
library(stringr)
str_extract(mycharacter, fixed("|"))
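The three options mentioned above can also be tried in base R. This is only a minimal sketch; the variable names has_fixed, has_escaped and has_class are made up for illustration:

```r
mycharacter <- " Directors:Chris Renaud, Yarrow Cheney | Stars:Louis C.K., Eric Stonestreet, Kevin Hart, Lake Bell "

# 1. Treat the pattern as a plain string instead of a regex
has_fixed <- grepl("|", mycharacter, fixed = TRUE)

# 2. Escape the metacharacter
has_escaped <- grepl("\\|", mycharacter)

# 3. Put it inside a character class
has_class <- grepl("[|]", mycharacter)
```

All three should report that the string contains a literal pipe.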
You can use gsub:
# return the left side of |
gsub("^(.*)\\|(.*)$","\\1",mycharacter)
[1] " Directors:Chris Renaud, Yarrow Cheney "
# return the right side of |
gsub("^(.*)\\|(.*)$","\\2",mycharacter)
[1] " Stars:Louis C.K., Eric Stonestreet, Kevin Hart, Lake Bell "
If you also want to remove the surrounding spaces, match the leading whitespace outside the capture group (.*) and then trim the trailing whitespace:
director <- gsub("^\\s+(.*)\\|(.*)$","\\1",mycharacter)
director <- gsub("\\s+$","",director)
star <- gsub("^(.*)\\|\\s+(.*)$","\\2",mycharacter)
star <- gsub("\\s+$","",star)
You can then build a data.frame with
myDF <- data.frame(Directors = director, Stars= star)
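Note that the columns built this way still carry the "Directors:" and "Stars:" labels. If those should become the column names instead, one possible sketch is to split on the literal pipe and then on the first colon. The names parts, labels, values and myDF2 are illustrative assumptions, not part of the original answer:

```r
mycharacter <- " Directors:Chris Renaud, Yarrow Cheney | Stars:Louis C.K., Eric Stonestreet, Kevin Hart, Lake Bell "

# Split on the literal pipe and trim surrounding whitespace
parts <- trimws(strsplit(mycharacter, "|", fixed = TRUE)[[1]])

# Everything before the first colon is the label, the rest is the value
labels <- sub(":.*$", "", parts)
values <- sub("^[^:]+:", "", parts)

# One-row data frame with the labels as column names
myDF2 <- as.data.frame(as.list(setNames(values, labels)),
                       stringsAsFactors = FALSE)
```

This assumes every piece has the form "Label:value" and that the labels are valid column names.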
I would like to manually correct a record by using R. Last name and first name should always be separated by a comma.
names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
Sometimes, however, a full stop has crept in as a separator, as in the case of "JOHNSON. Richard". I would like to do this automatically. Since the last name is always at the beginning of the line, I can simply access it via sub:
sub("^[[:upper:]]+\\.","^[[:upper:]]+\\,",names)
However, I cannot use a function for the replacement that specifically replaces the full stop with a comma.
Is there a way to insert a function into the replacement that does this for me?
Your sub is mostly correct, but you need a capture group (the parentheses) in the pattern and a backreference (\\1) in the replacement.
Because the upper-case letters are captured, \\1 stands for the original upper-case letters of each string, so the only actual change is the full stop becoming a comma. In other words, we replace the upper-case letters ^([[:upper:]]+) plus the full stop \\. with the captured content \\1 plus a comma.
For more details you can visit this page.
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
sub("^([[:upper:]]+)\\.","\\1,",test_names)
[1] "ADAM, Smith J." "JOHNSON, Richard" "BROWN, Wilhelm K."
[4] "DAVIS, Daniel"
This can also be done with a small function:
names <- c("ADAM, Smith", "JOHNSON. Richard", "BROWN, Wilhelm", "DAVIS, Daniel")
replacedots <- function(mystring) {
  # Use the argument, not the global `names`; note this replaces every full stop
  gsub("\\.", ",", mystring)
}
replacedots(names)
[1] "ADAM, Smith" "JOHNSON, Richard" "BROWN, Wilhelm" "DAVIS, Daniel"
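The two answers differ once initials such as "J." are present: replacing every full stop also changes the initials, while the anchored capture-group pattern only touches the separator after the last name. A short sketch of the contrast, where fix_separator is a hypothetical helper name:

```r
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard")

# Anchored version: only a full stop directly after the leading
# upper-case last name is replaced
fix_separator <- function(x) sub("^([[:upper:]]+)\\.", "\\1,", x)
anchored <- fix_separator(test_names)

# Global version: every full stop becomes a comma, including the initial
global <- gsub("\\.", ",", test_names)
```

Here anchored keeps "ADAM, Smith J." intact, whereas global turns it into "ADAM, Smith J,".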
I have a number of strings containing the pattern "of" followed by an uppercase letter without spaces (in regex: "of[A-Z]"). I want to add spaces, e.g. "PrinceofWales" should become "Prince of Wales". However, I couldn't find out how to carry the letter matched by [A-Z] over into the replacement value:
library(tidyverse)
str_replace("PrinceofWales", "of[A-Z]", " of [A-Z]")
# Gives: Prince of [A-Z]ales
# Expected: Prince of Wales
str_replace("DukeofEdinburgh", "of[A-Z]", " of [A-Z]")
# Gives: Duke of [A-Z]dinburgh
# Expected: Duke of Edinburgh
Can someone enlighten me? :)
The letter needs to be captured as a group (([A-Z])) and reinserted with the backreference (\\1) of that group. Regex syntax is only interpreted in the pattern, not in the replacement, which is why " of [A-Z]" was inserted literally.
stringr::str_replace("PrinceofWales", "of([A-Z])", " of \\1")
[1] "Prince of Wales"
According to ?str_replace
replacement - A character vector of replacements. Should be either length one, or the same length as string or pattern. References of the form \1, \2, etc will be replaced with the contents of the respective matched group (created by ()).
Or another option is a regex lookaround
stringr::str_replace("PrinceofWales", "of(?=[A-Z])", " of ")
[1] "Prince of Wales"
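Both approaches also work in base R and are vectorised over the input. A minimal sketch, assuming the two example strings from the question (the vector name titles is made up); the lookahead variant needs perl = TRUE in base R:

```r
titles <- c("PrinceofWales", "DukeofEdinburgh")

# Capture group plus backreference
with_group <- sub("of([A-Z])", " of \\1", titles)

# Lookahead: the uppercase letter is checked but not consumed,
# so nothing has to be put back in the replacement
with_lookahead <- sub("of(?=[A-Z])", " of ", titles, perl = TRUE)
```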
I have a string that contains a persons name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want only the state abbreviation to remain, so in this case the only remaining string would be "OH".
I can get rid of the parentheses and comma with
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
library(stringr)
str_extract(mock, "[A-Z]{2}(?=\\)$)")
See this R demo
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars, as many as possible, then a comma, 0+ whitespaces, then captures 2 uppercase ASCII letters into Group 1 (referred to with \\1 in the replacement pattern), and then matches ) at the end of the string
[A-Z]{2}(?=\\)$) - matches 2 uppercase ASCII letters only if they are followed by ) at the end of the string.
How about this? If all records are formatted the same way, this should work:
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the general case is that the state is in the second and third last characters then match everything, .*, and then a capture group of two characters (..) and then another character . and replace that with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"
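Since the question mentions many records, here is a sketch applying the capture-group and lookahead patterns to a whole vector. Apart from the first entry, the records are invented for illustration:

```r
# Hypothetical records in the format described in the question
records <- c("Joe Smith (Cleveland, OH)",
             "Ann Lee (Louisville, KY)",
             "Bo Ray (Baltimore, MD)")

# Capture the two letters before the closing parenthesis
states_sub <- sub(".+,\\s*([A-Z]{2})\\)$", "\\1", records)

# Base R equivalent of the lookahead extraction
states_match <- regmatches(records,
                           regexpr("[A-Z]{2}(?=\\)$)", records, perl = TRUE))
```

One difference worth keeping in mind: sub returns a non-matching string unchanged, while regmatches drops it, so the two results can differ in length on malformed input.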
Hi, I have a large dataframe of addresses which I need to clean. One of the problems is an unwanted whitespace between a house number and its letter suffix, which I wish to remove as follows:
original <- c("73 A Acacia Avenue","656 B East Street", " FLAT 1 D High Road", "66B West Street")
corrected <- c("73A Acacia Avenue","656B East Street", " FLAT 1D High Road", "66B West Street")
I can identify and isolate what I wish to change using grep and regexpr, but am not sure how to remove the offending space and replace the correction in the original dataframe
reg <- "([0-9]+ [A-Z] )"
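grep(reg, original, value = TRUE, perl = TRUE) # returns the matching strings
grep(reg, original, perl = TRUE) # returns the indices of the matching elements
m <- regexpr(reg, original, perl = TRUE) # position of the match in each element
findstr <- regmatches(original, m) # the matched substrings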
So my final stage is to remove the whitespace and apply the correction.
Any help appreciated
Thank you
You may use gsub with a slightly modified version of your regex and a \1\2 replacement:
original <- c("73 A Acacia Avenue","656 B East Street", " FLAT 1 D High Road", "66B West Street")
reg <- "([0-9]+)\\s([A-Z]\\s+)"
gsub(reg, "\\1\\2", original)
## => [1] "73A Acacia Avenue"  "656B East Street"   " FLAT 1D High Road"
## => [4] "66B West Street"
See the online R demo.
Details:
([0-9]+) - Group 1 matching one or more digits
\\s - a whitespace
([A-Z]\\s+) - Group 2 matching an uppercase ASCII letter and then 1 or more whitespaces.
The replacement is \1\2 where \1 is the value of the first group and \2 references the value in the second group.
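To write the correction back into a data frame, the same gsub call can be assigned to the column. A minimal sketch, where the data frame addresses and its column address are assumed names, since the question does not show the real structure:

```r
# Hypothetical data frame holding the example addresses
addresses <- data.frame(
  address = c("73 A Acacia Avenue", "656 B East Street",
              " FLAT 1 D High Road", "66B West Street"),
  stringsAsFactors = FALSE
)

# Drop the single whitespace between the digits and the letter suffix;
# rows that do not match are returned unchanged
addresses$address <- gsub("([0-9]+)\\s([A-Z]\\s+)", "\\1\\2", addresses$address)
```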
I am rearranging Simpsons names with R so that they follow the "first name last name" format, but large spaces remain between the names. Is it possible to remove the spaces outside the quoted names?
library(stringr)
simpsons <- c("Moe Syzlak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy", "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
# Split on the comma, trim each piece, and reverse the order of the pieces
splitname <- lapply(str_split(simpsons, ","), function(x) rev(str_trim(x)))
for (i in seq_along(splitname)) {
splitname[i]<- paste(unlist(splitname[i]), collapse = " ")
}
splitname <- unlist(splitname)
If we need to put the first name before the last name, we can use sub. We capture one or more characters that are not a comma as the first group, followed by a comma and zero or more spaces (\\s*), then capture one or more characters that are not a comma as the second group, and swap the backreferences in the replacement to get the output. Names without a comma do not match and are returned unchanged.
sub("([^,]+),\\s*([^,]+)", "\\2 \\1", simpsons)
#[1] "Moe Syzlak" "C. Montgomery Burns" "Rev. Timothy Lovejoy" "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
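If the input itself contains stray or repeated spaces, the reordering can be followed by a whitespace clean-up. This is a sketch under that assumption; the vector messy with its extra spaces is invented for illustration:

```r
# Hypothetical input with extra whitespace around and between the names
messy <- c("  Simpson,   Homer  ", "Burns,    C. Montgomery", "Moe   Syzlak")

# Swap "last, first" into "first last"
reordered <- sub("([^,]+),\\s*([^,]+)", "\\2 \\1", messy)

# Trim the ends and collapse runs of whitespace into a single space
cleaned <- gsub("\\s+", " ", trimws(reordered))
```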