Street type removing pattern match in R

I need to remove the street type (St, Blvd, Rd, etc.) from a series of addresses as a clean-up step before a data match. I'm using the code below, but for some addresses the result is missing part of the street name I want to keep.
library(tidyverse)
c("9123 GLENOAKS BLVD","123 E AVENUE K6 STE B","123 CAMP PLENTY RD","900 E VICTORIA ST","460 SAN FERNANDO RD","176 S SANTA FE AVE STE 9") %>%
sub("AVE.*$|ST.*$| BLVD.*$| RD.*$| PL.*$| 3RD.*$| APT.*$| DR.*$", "", .)
[1] "9123 GLENOAKS" "123 E " "123 CAMP" "900 E VICTORIA " "460 SAN FERNANDO" "176 S SANTA FE "
Below is the expected output
[1] "9123 GLENOAKS" "123 E AVENUE K6 " "123 CAMP PLENTY" "900 E VICTORIA " "460 SAN FERNANDO" "176 S SANTA FE "

You may use
sub("(.*?)\\s+(?:AVE|STE?|BLVD|RD|PL|3RD|APT|DR)\\b.*", "\\1", .)
Details
(.*?) - Group 1 (this group holds the value referred to with \1 in the replacement pattern): any zero or more chars, as few as possible
\s+ - 1 or more whitespaces
(?:AVE|STE?|BLVD|RD|PL|3RD|APT|DR) - a list of the string alternatives: AVE, ST or STE, BLVD, RD, PL, 3RD, APT or DR
\b - a word boundary
.* - the rest of the input.
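The suggested pattern can be checked end-to-end on the sample addresses. A hedged sketch, with perl = TRUE added only to get PCRE's predictable lazy matching:

```r
# Sample addresses from the question
addr <- c("9123 GLENOAKS BLVD", "123 E AVENUE K6 STE B", "123 CAMP PLENTY RD",
          "900 E VICTORIA ST", "460 SAN FERNANDO RD", "176 S SANTA FE AVE STE 9")

# Lazily keep everything before the first whitespace-preceded street type
sub("(.*?)\\s+(?:AVE|STE?|BLVD|RD|PL|3RD|APT|DR)\\b.*", "\\1", addr, perl = TRUE)
```

Note that the trailing space shown in the expected output above is also trimmed here, since \s+ is consumed by the match.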

Related

Adding a string between some pattern in R

I would like to add the string "AND" between the words "STREET" and "HENRY" into the following string:
WEST 156 STREET HENRY HUDSON PARKWAY
So that it reads WEST 156 STREET AND HENRY HUDSON PARKWAY. Essentially, I am trying to geocode intersections so I would like to be able to add "AND" between street types (AVENUE, STREET, BLVD, etc.) and whatever word comes after that to create the intersection like I specified above.
Here are a couple more examples (just made up):
strings = c("WEST 135TH AVE BROADWAY", # want WEST 135TH AVE AND BROADWAY,
"SUNSET BLVD MAIN ST", # SUNSET BLVD AND MAIN ST
"W 45TH ST LAKESHORE BLVD", #...
"HIGH ST BROAD ST") # ...
I would greatly appreciate any help! I am somewhat familiar with regular expressions, but I am not familiar with how to insert another word in this manner.
Capture the street type as a group and replace it with the backreference (\\1) followed by the substring "AND". The third and fourth strings are left unchanged: there the street type sits at the end of the string, so nothing is replaced, as the pattern requires \\s+ (one or more spaces) after it.
sub("(AVENUE|AVE|STREET|BLVD)\\s+", "\\1 AND ", strings)
Output:
[1] "WEST 135TH AVE AND BROADWAY" "SUNSET BLVD AND MAIN ST"
[3] "W 45TH ST LAKESHORE BLVD" "HIGH ST BROAD ST"
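If a bare ST should also trigger the insertion (as in the third and fourth examples), a hedged variant is to add it to the alternation and guard both sides with word boundaries, so the "ST" inside "WEST" is not matched:

```r
strings <- c("WEST 135TH AVE BROADWAY", "SUNSET BLVD MAIN ST",
             "W 45TH ST LAKESHORE BLVD", "HIGH ST BROAD ST")

# \b on both sides keeps "WEST" intact while still catching a bare "ST"
sub("\\b(AVENUE|AVE|STREET|BLVD|ST)\\b\\s+", "\\1 AND ", strings)
```

As before, sub only inserts AND after the first street type followed by a space, so a trailing street type is left alone.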

Extract either two or three words and exclude duplicates and acronyms

I have a group of names worded in a bizarre fashion. Here is a sample:
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
My goal is to clean it so that either first and last name is shown for players, or just the team name is returned for teams. Here is what I have tried:
df$name <- sub("^(.*[a-z])[A-Z]", "\\1", df$name)
This is what I'm getting returned
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
To be clear, goal would be to have this:
Sammy Watkins
Buffalo Bills
New England Patriots
Tre'Quan Smith
JuJu Smith-Schuster
data
df <- data.frame(name = c(
"Sammy WatkinsS. Watkins",
"Buffalo BillsBUF",
"New England PatriotsNE",
"Tre'Quan SmithT. Smith",
"JuJu Smith-SchusterJ. Smith-Schuster"),
stringsAsFactors = FALSE)
I suggest
df$name <- sub("\\B[A-Z]+(?:\\.\\s+\\S+)*$", "", df$name)
Pattern details
\B - a non-word boundary (there must be a letter, digit or _ right before)
[A-Z]+ - 1+ ASCII uppercase letters (use \p{Lu} to match any Unicode uppercase letters)
(?:\.\s+\S+)* - 0 or more sequences of:
\. - a dot
\s+ - 1+ whitespaces
\S+ - 1+ non-whitespaces
$ - end of string.
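A quick check of the suggested pattern against the question's data frame:

```r
df <- data.frame(name = c(
  "Sammy WatkinsS. Watkins", "Buffalo BillsBUF", "New England PatriotsNE",
  "Tre'Quan SmithT. Smith", "JuJu Smith-SchusterJ. Smith-Schuster"),
  stringsAsFactors = FALSE)

# Strip the glued-on abbreviation: an uppercase run not at a word start,
# optionally followed by ". <word>" chunks, anchored to the end of the string
sub("\\B[A-Z]+(?:\\.\\s+\\S+)*$", "", df$name)
```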
What about:
(?<=[a-z])[A-Z](?=[.\sA-Z]).*
Without much experience in R I'm unsure whether this would be accepted (note it uses a lookbehind, which needs perl = TRUE in sub). Also, there may be neater patterns, as I'm rather new to regex.
I've also included a (possibly unlikely) sample: Sammy J. WatkinsJ.S. Watkins
Two passes:
df$name <- gsub(".\\. .*", "", df$name)
df$name <- gsub("[A-Z]*$", "", df$name)
The first line removes all cases of the form "x. surname" and the second removes all capital letters at the end of the string.
Another way :
sub("(.*?\\s.*?[a-z](?=[A-Z])).*", "\\1", df$name, perl = TRUE)
#> [1] "Sammy Watkins" "Buffalo Bills" "New England Patriots"
#> [4] "Tre'Quan Smith" "JuJu Smith-Schuster"
sub(".*?\\s.*?[a-z](?=[A-Z])", "", df$name, perl = TRUE)
#> [1] "S. Watkins" "BUF" "NE"
#> [4] "T. Smith" "J. Smith-Schuster"
We're splitting between a lower case character and an upper case character, but not before we see a space.
You could also use unglue :
library(unglue)
unglue_unnest(df, name, "{name1=.*?\\s.*?[a-z]}{name2=[A-Z].*?}")
#> name1 name2
#> 1 Sammy Watkins S. Watkins
#> 2 Buffalo Bills BUF
#> 3 New England Patriots NE
#> 4 Tre'Quan Smith T. Smith
#> 5 JuJu Smith-Schuster J. Smith-Schuster

Can I create a vector with regexps?

My data looks something like this:
412 U CA, Riverside
413 U British Columbia
414 CREI
415 U Pompeu Fabra
416 Office of the Comptroller of the Currency, US Department of the Treasury
417 Bureau of Economics, US Federal Trade Commission
418 U Carlos III de Madrid
419 U Brescia
420 LUISS Guido Carli
421 U Alicante
422 Harvard Society of Fellows
423 Toulouse School of Economics
424 Decision Economics Inc, Boston, MA
425 ECARES, Free U Brussels
I will need to geocode this data in order to get the coordinates for each specific institution. In order to do that, I need all state names to be spelled out. At the same time, I don't want acronyms like "ECARES" to be transformed into "ECaliforniaRES".
I have been toying with the idea of converting the state.abb and state.name vectors into vectors of regular expressions, so that state.abb would look something like this (Using Alabama and California as state 1 and state 2):
c("^AL "|" AL "|" AL,"|",AL "| " AL$", "^CA "[....])
And the state.name vector something like this:
c("^Alabama "|" Alabama "|" Alabama,"|",Alabama "| " Alabama$", "^California "[....])
Hopefully, I can then use the mgsub function to replace all expressions in the modified state.abb vector with the corresponding entries in the modified state.name vector.
For some reason, however, it doesn't seem to be possible to put regexps in a vector:
test<-c(^AL, ^AB)
Error: unexpected '^' in "test<-c(^"
I have tried escaping the "^" signs, but this doesn't really seem to work:
test<-c(\^AL, \^AB)
Error: unexpected input in "test<-c(\"
> test<-c(\\^AL, \\^AB)
Is there any way of putting regexps in a vector, or is there another way of achieving my goal (that is, to replace all two-letter state abbreviations to state names without messing up other acronyms in the process)?
Excerpt of my data:
c("U Lausanne", "Swiss Finance Institute", "U CA, Riverside",
"U British Columbia", "CREI", "U Pompeu Fabra", "Office of the Comptroller of the Currency, US Department of the Treasury",
"Bureau of Economics, US Federal Trade Commission", "U Carlos III de Madrid",
"U Brescia", "LUISS Guido Carli", "U Alicante", "Harvard Society of Fellows",
"Toulouse School of Economics", "Decision Economics Inc, Boston, MA",
"ECARES, Free U Brussels", "Baylor U", "Research Centre for Education",
"the Labour Market, Maastricht U", "U Bonn", "Swarthmore College"
)
We can make use of the state.abb vector and paste it together, collapsing with |:
pat1 <- paste0("\\b(", paste(state.abb, collapse="|"), ")\\b")
The \\b signifies a word boundary, so that indiscriminate matches like the "AL" inside "SAL" are avoided.
Similarly with state.name, paste ^ and $ as prefix/suffix to mark the start and end of the string respectively:
pat2 <- paste0("^(", paste(state.name, collapse="|"), ")$")
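With pat1 in hand, one base-R way to apply the replacements is a loop over gsub; expand_states is an illustrative helper (an assumption, not part of the original answer):

```r
# Illustrative helper: replace each boundary-protected two-letter
# abbreviation with the corresponding full state name
expand_states <- function(x) {
  for (i in seq_along(state.abb)) {
    x <- gsub(paste0("\\b", state.abb[i], "\\b"), state.name[i], x)
  }
  x
}

expand_states("U CA, Riverside")
#> [1] "U California, Riverside"
expand_states("Decision Economics Inc, Boston, MA")
#> [1] "Decision Economics Inc, Boston, Massachusetts"
```

Because of the word boundaries, "ECARES" is left alone even though it contains "CA".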

Remove a portion of a randomized string over an entire dataframe column in R

Need help removing random text in a string that appears before an address (the data set has ~5000 observations). The dataframe column test2$address reads as follows:
addresses <- c(
"140 National Plz Oxon Hill, MD 20745",
"6324 Windsor Mill Rd Gwynn Oak, MD 21207",
"23030 Indian Creek Dr Sterling, VA 20166",
"Located in Reston Town Center 18882 Explorer St Reston, VA 20190"
)
I want it to spit out all addresses in a common format:
[885] "23030 Indian Creek Dr Sterling, VA 20166"
[886] "18882 Explorer St Reston, VA 20190"
Not sure how to go about doing this as there is no specific pattern to the text that comes before the address number.
If you know that the address portion you want will always start with digits, and the part you want to remove will be text, then you can use this:
sub(".*?(\\d+)", "\\1", x)
Output:
[1] "140 National Plz Oxon Hill, MD 20745"
[2] "6324 Windsor Mill Rd Gwynn Oak, MD 21207"
[3] "23030 Indian Creek Dr Sterling, VA 20166"
[4] "18882 Explorer St Reston, VA 20190"
What this does is lazily match everything up to and including the first digit series (.*?(\\d+)) and replace the whole match with just the captured digits (\\1), stripping the text before the first number.
Sample data:
x <- c("140 National Plz Oxon Hill, MD 20745",
"6324 Windsor Mill Rd Gwynn Oak, MD 21207",
"23030 Indian Creek Dr Sterling, VA 20166",
"Located in Reston Town Center 18882 Explorer St Reston, VA 20190")
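The same idea can be written without a backreference, using a lazy match with a lookahead (needs perl = TRUE); a small sketch:

```r
x <- "Located in Reston Town Center 18882 Explorer St Reston, VA 20190"

# Lazily consume everything up to, but not including, the first digit
sub(".*?(?=\\d)", "", x, perl = TRUE)
#> [1] "18882 Explorer St Reston, VA 20190"
```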

Remove specific string at the end position of each row from dataframe(csv)

I am trying to clean a set of data which is in csv format. After loading the data into R, I need to replace and also remove some characters from it. Below is an example. Ideally I want to
replace the St at the end of each row with Street, and
in cases where there is St St., remove the St and replace St. with just Street.
I tried to use this code
sub(x = evostreet, pattern = "St.", replacement = " ") and later
gsub(x = evostreet, pattern = "St.", replacement = " ") to remove the St. at the end of each row, but this also removes some other occurrences of St and the next character:
3 James St.
4 Glover Road St.
5 Jubilee Estate. St.
7 Fed Housing Estate St.
8 River State School St.
9 Brown State Veterinary Clinic. St.
11 Saw Mill St.
12 Dyke St St.
13 Governor Rd St.
I'm seeing a lot of close answers, but I'm not seeing any that address the second problem the OP is having: replacing "St St." with "Street"; e.g., "Dyke St St."
sub, as stated in the documentation:
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern
So, just using "St\\." as the pattern match is incorrect.
OP needs to match a possible pattern of "St St." and I'll further assume that it could even be "St. St." or "St. St".
Assuming OP is using a simple list:
x = c("James St.", "Glover Road St.", "Jubilee Estate. St.",
"Fed Housing Estate St.", "River State School St St.",
"Brown State Vet Clinic. St. St.", "Dyke St St.")
[1] "James St." "Glover Road St."
[3] "Jubilee Estate. St." "Fed Housing Estate St."
[5] "River State School St St." "Brown State Vet Clinic. St. St."
[7] "Dyke St St."
Then the following will replace the possible combinations mentioned above with "Street", as requested:
y <- sub(x, pattern = "[ St\\.]*$", replacement = " Street")
[1] "James Street" "Glover Road Street"
[3] "Jubilee Estate Street" "Fed Housing Estate Street"
[5] "River State School Street" "Brown State Vet Clinic Street"
[7] "Dyke Street"
Edit:
To answer the OP's follow-up question about replacing one occurrence of St. with Saint and another with Street: I was looking for a way to match similar expressions and return different values, but at this point I haven't been able to find one. I suspect regmatches can do this, but it's something I'll have to fiddle with later.
A simple way to accomplish what you're wanting - let's assume:
x <- c("St. Mary St St.", "River State School St St.", "Dyke St. St")
[1] "Saint Mary St St." "River State School St St."
[3] "Dyke St. St"
So you want x[1] to be Saint Mary Street, x[2] to be River State School Street and x[3] to be Dyke Street. I would want to resolve the Saint issue first by assigning sub() to y like:
y <- sub(x, pattern = "^St\\.", replacement = "Saint")
[1] "Saint Mary Street" "River State School Street"
[3] "Dyke Street"
To resolve the St's at the end, we can use the same solution as I posted above, except notice that now I'm not using x as my input vector but instead the y I just made:
y <- sub(y, pattern = "[ St\\.]*$", replacement = " Street")
And that should take care of it. Now, I don't know if this is the most efficient way, and if your dataset is rather large this may run slowly. If I find a better solution I will post it (provided no one else beats me to it).
You don't need to use regular expression here.
sub(x = evostreet, pattern = "St.", replacement = " ", fixed = TRUE)
The fixed argument means that you want to replace this exact string, not matches of a regular expression.
I think that your problem is that the '.' character in the regular expression world means "any single character". So to match literally in R you should write
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You will need to escape the dot; otherwise it matches any character after St, and that is why some other parts of your text are eliminated.
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You can add $ at the end if you want to remove the tag only when it appears at the end of the text.
sub(x = evostreet, pattern = "St\\.$", replacement = " ")
The difference between sub and gsub is that sub deals just with the first time your pattern appears in a string, while gsub replaces all occurrences. In your case, as you are looking for the pattern at the end of the line, it should not make any difference which one you use once you add the $.
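To make the dot pitfall concrete, a small made-up sketch contrasting the unescaped pattern with the escaped, anchored one:

```r
evostreet <- "River State School St."

sub("St.", "", evostreet)     # unescaped: "St." also matches "Sta" in "State"
#> [1] "River te School St."

sub("St\\.$", "", evostreet)  # escaped and anchored: only the trailing "St."
#> [1] "River State School "
```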
