R sub replacing part of identified string - r

Hi, I have a large data frame of addresses that I need to clean. One of the problems is an unwanted whitespace between a house number and its letter suffix, which I want to remove as follows:
original <- c("73 A Acacia Avenue","656 B East Street", " FLAT 1 D High Road", "66B West Street")
corrected <- c("73A Acacia Avenue","656B East Street", " FLAT 1D High Road", "66B West Street")
I can identify and isolate what I wish to change using grep and regexpr, but I am not sure how to remove the offending space and write the correction back to the original data frame:
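reg <- "([0-9]+ [A-Z] )"
grep(reg, original, value = TRUE, perl = TRUE) # returns the matching strings
grep(reg, original, perl = TRUE) # returns the indices of the matching elements
match <- grep(reg, original, value = TRUE, perl = TRUE) # keep the matching strings
r <- regexpr(reg, match) # position of the match within each string
findstr <- regmatches(match, r) # the matched substrings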
So my final stage is to remove the whitespace and apply the correction.
Any help appreciated
Thank you

You can use gsub with a slightly modified version of your regex and a \1\2 replacement:
original <- c("73 A Acacia Avenue","656 B East Street", " FLAT 1 D High Road", "66B West Street")
reg <- "([0-9]+)\\s([A-Z]\\s+)"
gsub(reg, "\\1\\2", original)
## => [1] "73A Acacia Avenue" "656B East Street" " FLAT 1D High Road" "66B West Street"
Details:
([0-9]+) - Group 1 matching one or more digits
\\s - a single whitespace character; it sits outside both groups, so it is dropped from the result
([A-Z]\\s+) - Group 2 matching an uppercase ASCII letter and then 1 or more whitespaces.
The replacement is \1\2 where \1 is the value of the first group and \2 references the value in the second group.
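Since the question asks about a data frame, here is a minimal sketch of writing the correction back to a column. The data frame name `addresses` and the column name `address` are assumptions made for illustration; substitute your own names.

```r
# Hypothetical data frame; `addresses` and `address` are illustrative names.
addresses <- data.frame(
  address = c("73 A Acacia Avenue", "656 B East Street",
              " FLAT 1 D High Road", "66B West Street"),
  stringsAsFactors = FALSE
)

# Group 1: the digits; Group 2: the suffix letter plus the whitespace after it.
# The single whitespace between the two groups is not captured, so it is dropped.
reg <- "([0-9]+)\\s([A-Z]\\s+)"

# Overwrite the column in place with the cleaned strings.
addresses$address <- gsub(reg, "\\1\\2", addresses$address)
addresses$address
```

Strings that do not match the pattern (such as "66B West Street") are returned unchanged, so the call is safe to apply to the whole column.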

Related

str_replace: replacement depending on wildcard value [A-Z]

I have a number of strings containing the pattern "of" followed by an uppercase letter without spaces (in regex: "of[A-Z]"). I want to add spaces, e.g. "PrinceofWales" should become "Prince of Wales". However, I couldn't find how to insert the letter matched by [A-Z] into the replacement value:
library(tidyverse)
str_replace("PrinceofWales", "of[A-Z]", " of [A-Z]")
# Gives: Prince of [A-Z]ales
# Expected: Prince of Wales
str_replace("DukeofEdinburgh", "of[A-Z]", " of [A-Z]")
# Gives: Duke of [A-Z]dinburgh
# Expected: Duke of Edinburgh
Can someone enlighten me? :)
The letter needs to be captured as a group (([A-Z])) and reinserted with a backreference (\\1) to that group. Regex syntax is interpreted in the pattern, not in the replacement, so [A-Z] in the replacement is inserted literally.
stringr::str_replace("PrinceofWales", "of([A-Z])", " of \\1")
[1] "Prince of Wales"
According to ?str_replace
replacement - A character vector of replacements. Should be either length one, or the same length as string or pattern. References of the form \1, \2, etc will be replaced with the contents of the respective matched group (created by ()).
Another option is a regex lookahead, which checks for the uppercase letter without consuming it, so nothing has to be reinserted:
stringr::str_replace("PrinceofWales", "of(?=[A-Z])", " of ")
[1] "Prince of Wales"
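For comparison, both ideas can be sketched in base R with sub. This is only an illustration, not part of the stringr answer; the vector `titles` and the extra element "Sofa" are made up to show that "of" followed by a lowercase letter is left alone. Note that lookarounds need perl = TRUE in base R.

```r
titles <- c("PrinceofWales", "DukeofEdinburgh", "Sofa")

# Capture group + backreference: the matched letter is put back via \\1.
with_group <- sub("of([A-Z])", " of \\1", titles)

# Lookahead: the letter is only checked, not consumed, so it stays in place.
with_lookahead <- sub("of(?=[A-Z])", " of ", titles, perl = TRUE)

with_group
with_lookahead
```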

Extract "|" from a character in R

This is my character vector:
mycharacter<-" Directors:Chris Renaud, Yarrow Cheney | Stars:Louis C.K., Eric Stonestreet, Kevin Hart, Lake Bell "
Why can't I extract the "|" from my character vector?
Also, after extracting the "|", how can I build a data frame with two columns, one for Directors and the other for Stars?
Any help?
We can use fixed() because | is a regex metacharacter meaning OR in the default mode. To match the literal character, wrap the pattern in fixed(), escape it (\\|), or place it inside square brackets ([|])
library(stringr)
str_extract(mycharacter, fixed("|"))
You can use gsub:
# return the left side of |
gsub("^(.*)\\|(.*)$","\\1",mycharacter)
[1] " Directors:Chris Renaud, Yarrow Cheney "
# return the right side of |
gsub("^(.*)\\|(.*)$","\\2",mycharacter)
[1] " Stars:Louis C.K., Eric Stonestreet, Kevin Hart, Lake Bell "
If you also want to remove the surrounding spaces, match them outside the capture groups (.*) and strip whatever trailing whitespace remains:
director <- gsub("^\\s+(.*)\\|(.*)$","\\1",mycharacter)
director <- gsub("\\s+$","",director)
star <- gsub("^(.*)\\|\\s+(.*)$","\\2",mycharacter)
star <- gsub("\\s+$","",star)
You can then build a data.frame with
myDF <- data.frame(Directors = director, Stars= star)
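A shorter route to the same data frame, sketched here with base R only: split on the literal "|" with fixed = TRUE, trim each piece, and drop the "Directors:" / "Stars:" labels. Dropping the labels is an assumption about what the columns should contain; skip that step if you want to keep them.

```r
mycharacter <- " Directors:Chris Renaud, Yarrow Cheney | Stars:Louis C.K., Eric Stonestreet, Kevin Hart, Lake Bell "

# fixed = TRUE treats "|" as a literal character rather than regex alternation.
parts <- trimws(strsplit(mycharacter, "|", fixed = TRUE)[[1]])

# Remove everything up to and including the first colon (the label).
values <- sub("^[^:]+:", "", parts)

myDF <- data.frame(Directors = values[1], Stars = values[2],
                   stringsAsFactors = FALSE)
myDF
```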

Regex in R - Extracting two letters between spaces

I am trying to extract the two letters between two spaces -
AAPL US Equity
1836 JP Equity
APPLE SOMETHING NOT
C US Equity
Result -
US
JP
US
What I tried was gsub("\\s[A-Z]{2}\\s", "\\1", vec) but that gives me -
AAPLEquity
1836Equity
APPLE SOMETHING NOT
CEquity
which seems the exact opposite of what I want.
We can use sub
out <- rep("", length(vec))
i1 <- grepl("\\b[A-Z]{2}\\b", vec)
out[i1] <- sub(".*\\s+([A-Z]{2})\\s+.*", "\\1", vec[i1])
out
#[1] "US" "JP" "" "US"
Or use str_extract (from stringr) to extract two uppercase characters that are preceded by a whitespace (checked with a regex lookbehind) and followed by a word boundary (\\b)
str_extract(vec, "(?<=\\s)([A-Z]{2})\\b")
#[1] "US" "JP" NA "US"
data
vec <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity")
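Your gsub call replaces every match with \\1, but the pattern contains no capturing group, so the backreference is empty and the matched text is simply removed. \s[A-Z]{2}\s matches a whitespace character, 2 uppercase ASCII letters and another whitespace character, which is why exactly those parts disappear from the strings.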
You may use
x <- c('AAPL US Equity','1836 JP Equity','APPLE SOMETHING NOT','C US Equity')
sub(".*\\s+([A-Z]{2})\\s.*|.*", "\\1", x)
# => [1] "US" "JP" "" "US"
Here, the .*\\s+([A-Z]{2})\\s.* alternative matches inputs that contain a two-letter "word" between whitespaces and captures that word in Group 1 (\1), while the .* alternative matches all other inputs; Group 1 is then empty, so sub returns an empty string for them.
Or, you may use
library(stringr)
str_extract(x, "(?<=\\s)[A-Z]{2}(?=\\s)")
# => [1] "US" "JP" NA "US"
Here, (?<=\\s)[A-Z]{2}(?=\\s) matches two uppercase letters enclosed by whitespace, and str_extract returns the first such match in each string (NA if there is none).
If the words can be at the start/end of the string use
str_extract(x, "(?<!\\S)[A-Z]{2}(?!\\S)")
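If you would rather avoid the stringr dependency, the same lookaround pattern can be used with regexpr and regmatches. This is a sketch, not part of the answers above; the helper vector `out` is introduced here so that non-matching elements come back as NA, mirroring what str_extract does.

```r
x <- c("AAPL US Equity", "1836 JP Equity", "APPLE SOMETHING NOT", "C US Equity")

# perl = TRUE is required for lookbehind/lookahead in base R.
m <- regexpr("(?<!\\S)[A-Z]{2}(?!\\S)", x, perl = TRUE)

# regexpr returns -1 where nothing matched; regmatches returns only the hits,
# so fill a vector of NAs at the positions that did match.
out <- rep(NA_character_, length(x))
out[m > 0] <- regmatches(x, m)
out
```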

Shifting text around in a sentence via R

I have an R dataframe with movie names like so:
Shawshank Redemption, The
Godfather II, The
Band of Brothers
I would like to display these names as:
The Shawshank Redemption
The Godfather II
Band of Brothers
Can anyone help with how to check each row of the data frame to see whether there is a 'The' after a comma (as above) and, if there is, move it to the front of the title?
You can use gsub:
df$movies2 = gsub("^([\\w\\s]+),*\\s*([Tt]he*($|(?=\\s\\(\\d{4}\\))))", "\\2 \\1", df$movies, perl = TRUE)
Result:
> df
                            movies                         movies2
1 Shawshank Redemption, The (1994) The Shawshank Redemption (1994)
2                Godfather II, The                The Godfather II
3                 Band of Brothers                Band of Brothers
4               Dora, The Explorer              Dora, The Explorer
5             Kill Bill Vol. 2 The            Kill Bill Vol. 2 The
6                  ,The Highlander                 ,The Highlander
7                   Happening, the                   the Happening
Data:
df = data.frame(movies = c("Shawshank Redemption, The (1994)",
"Godfather II, The",
"Band of Brothers",
"Dora, The Explorer",
"Kill Bill Vol. 2 The",
",The Highlander",
"Happening, the"), stringsAsFactors = FALSE)
Notes:
The goal of the regex is to capture the first part (everything before the comma) and the second part ("The" after the comma, only when it is at the end of the string or directly before a "(year)") in separate groups, which are then swapped with \\2 \\1
^([\\w\\s]+) matches one or more word characters or whitespace characters, starting from the beginning of the string
,*\\s* matches zero or more commas followed by zero or more whitespace characters
[Tt]he* matches "Th" or "th" followed by zero or more "e" characters, because the * quantifier applies only to the preceding "e" and not to the whole word; use [Tt]he if only "The" or "the" should match
It is followed by ($|(?=\\s\\(\\d{4}\\))), which matches either the end of the string, $, or a positive lookahead that checks whether the preceding pattern is followed by \\s\\(\\d{4}\\)
\\s\\(\\d{4}\\) matches a whitespace character and four digits in literal parentheses. The backslashes are doubled because a backslash must itself be escaped inside an R string literal
So ([Tt]he*($|(?=\\s\\(\\d{4}\\)))) matches "The" or "the" either at the end of the string or when it is followed by (4 digits)
Each pair of plain parentheses is a capture group (the lookahead (?=...) is not), so \\2 \\1 swaps the first capture group, ([\\w\\s]+), with the second, ([Tt]he*($|(?=\\s\\(\\d{4}\\))))
If a string does not match the whole pattern, for example because it has no trailing "The", gsub leaves it unchanged, which is why "Band of Brothers" is returned as is.
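The pattern above treats the comma as optional, so a title that merely ends in "th" (e.g. "Bath") would also match. A stricter variant, sketched here under the assumption that the article always follows a comma, requires the comma and the whole word. The vector name `movies` and the extra "Bath" element are illustrative only.

```r
movies <- c("Shawshank Redemption, The (1994)", "Godfather II, The",
            "Band of Brothers", "Dora, The Explorer", "Kill Bill Vol. 2 The",
            ",The Highlander", "Happening, the", "Bath")

# Group 1: the title before a mandatory comma; Group 2: "The" or "the".
# The lookahead only allows the swap at the end of the string or before " (year)".
movies2 <- sub("^(.+),\\s*([Tt]he)(?=$|\\s\\(\\d{4}\\))", "\\2 \\1",
               movies, perl = TRUE)
movies2
```

Titles without a matching ", The" at the end (or before the year) are returned unchanged.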
This seems to work for me:
#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers")
#use grep to find those with ", The" at the end
the.end=grep(", The$",x)
#trim movie titles to remove ", The"
trimmed=strtrim(x[the.end],nchar(x[the.end])-5)
#add "The " to the beginning of the trimmed titles
final=paste("The",trimmed)
#replace the trimmed elements of the movie vector
x[the.end]<-final
#take a look
x
Note that this doesn't remove ", The" from anywhere in the name other than the end... I think that's the behaviour that you want. It will also miss any "The" without the comma, or lower case "the". To see what I mean, try this as your initial movie vector:
#create a vector of movies
x=c("Shawshank Redemption, The", "Godfather II, The", "Band of Brothers",
"Dora, The Explorer", "Kill Bill Vol. 2 The", ",The Highlander",
"Happening, the")

extract a number in the middle or end of a string in R

I have a string vector. I would like to extract a number after "# of Stalls: " The numbers are located either in the middle or in the end of the string.
x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40")
Here is my attempt, but it only removes the text before the number. I appreciate your help.
gsub(".*\\# of Stalls: ", "", x)
Since it's HTML, you can use rvest or another HTML parser to extract the nodes you want first, which makes extracting the numbers trivial. XPath selectors and functions afford a little more flexibility than CSS ones for this sort of work.
library(rvest)
x %>% paste(collapse = '<br/>') %>%
read_html() %>%
html_nodes(xpath = '//text()[contains(., "# of Stalls:")]') %>%
html_text() %>%
readr::parse_number()
#> [1] 244 40
The pattern matches one or more characters that are not a # ([^#]+) from the start (^) of the string, then a #, then zero or more characters that are not digits ([^0-9]*), then one or more digits ([0-9]+) captured as a group ((...)), then any remaining characters (.*). The whole match is replaced with the backreference (\\1) to the captured group.
as.integer(sub("^[^#]+#[^0-9]*([0-9]+).*", "\\1", x))
#[1] 244 40
If the string is more specific, then we can specify it
as.integer(sub("^[^#]+# of Stalls:\\s+([0-9]+).*", "\\1", x))
#[1] 244 40
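One caveat with the sub approach: when a string has no "# of Stalls:" label, sub returns it unchanged and as.integer then yields NA with a coercion warning. A sketch that handles missing labels explicitly, using regexpr with a lookbehind, is shown below. The shortened input vector `stalls_text` and its third element are made up for illustration.

```r
# Shortened, hypothetical inputs; the third one has no stall count at all.
stalls_text <- c("Longitude:33.79<br/># of Stalls: 244<br/>Cost: Free",
                 "Operator: Caltrans<br/># of Stalls: 40",
                 "County: Los Angeles")

# The lookbehind anchors on the label without including it in the match.
m <- regexpr("(?<=# of Stalls: )[0-9]+", stalls_text, perl = TRUE)

# Start from NA and fill in only the elements where a match was found.
stalls <- rep(NA_integer_, length(stalls_text))
stalls[m > 0] <- as.integer(regmatches(stalls_text, m))
stalls
```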
There are many ways to solve this problem; I am going to use the stringr package. The inner str_extract fetches the values
[1] "# of Stalls: 244" "# of Stalls: 40"
and the outer str_extract then extracts only the digits from those strings.
I am, however, not clear whether you want to extract the number or replace the matched text with it. If you want to extract the number, the code below works for you:
library(stringr)
as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
In case you want to replace the string then you should do :
str_replace(x,"#\\D*(\\d{1,})","\\1")
Output:
Output for extract:
> as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
[1] 244 40
Output for replace:
> str_replace(x,"#\\D*(\\d{1,})","\\1")
[1] "1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/>244<br/>Cost: Free"
[2] "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/>40"
Here are some solutions. (1) and (1a) are variations of the code in the question. (2) and (2a) take the opposite approach where, instead of removing what we don't want they match what we do want.
1) gsub The code in the question removes the portion before the number but does not remove the portion after it. We can modify it to do both at once, as below. The |\\D.*$ alternative that we added does that. Note that "\\D" matches any non-digit.
as.integer(gsub(".*# of Stalls: |\\D.*$", "", xx))
## [1] 244 40
1a) sub Alternatively, do these in two separate sub calls. The inner sub is from the question and the outer sub removes everything from the first non-digit after the number onwards.
as.integer(sub("\\D.*$", "", sub(".*# of Stalls: ", "", xx)))
## [1] 244 40
2) strcapture With this approach, which uses strcapture (part of the utils package in base R since R 3.4.0), we can simplify the regular expression substantially. We specify a match with a capture group (the portion in parentheses). strcapture will return the portion corresponding to the capture group and create a data.frame from it. The third argument is a prototype structure that tells it to return integers. Note that "\\d" matches any digit.
strcapture("# of Stalls: (\\d+)", xx, list(stalls = integer()))
## stalls
## 1 244
## 2 40
2a) strapply The strapply function in the gsubfn package is similar to strcapture but uses an apply paradigm where the first argument is the input string, the second is the pattern and the third is the function to apply to the capture group.
library(gsubfn)
strapply(xx, "# of Stalls: (\\d+)", as.integer, simplify = TRUE)
## [1] 244 40
Note: The input xx used is the same as x in the question:
xx <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free",
"20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40"
)

Resources