Capitalizing in R with Exceptions

How can you title-case data in R while leaving certain parts (here, the state abbreviations) in upper case?
For example:
Given a list of cities and states in the form: "NEW YORK, NY"
It needs to be changed to: "New York, NY"
The str_to_title function changes it to "New York, Ny".
Patterns:
WASHINGTON, DC
AMHERST, MA
HANOVER, NH
DAVIDSON, NC
BRUNSWICK, ME
GREENVILLE, SC
PORTLAND, OR
LOUISVILLE, KY
They should all be in the form: Amherst, MA or Brunswick, ME

We could use a negative lookbehind to match the upper case letters that do not come right after the comma and space, capture them as groups ((...)), and in the replacement refer to the captured groups with backreferences (\\1, \\2), converting the second group to lower case with \\L:
gsub("(?<!, )([A-Z])([A-Z]+)\\b", "\\1\\L\\2", str1, perl = TRUE)
#[1] "New York, NY" "Washington, DC" "Amherst, MA" "Hanover, NH"
#[5] "Davidson, NC" "Brunswick, ME"
#[7] "Greenville, SC" "Portland, OR" "Louisville, KY"
data
str1 <- c("NEW YORK, NY", "WASHINGTON, DC", "AMHERST, MA", "HANOVER, NH",
"DAVIDSON, NC", "BRUNSWICK, ME", "GREENVILLE, SC", "PORTLAND, OR",
"LOUISVILLE, KY")
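An alternative sketch in base R: title-case every word first (the same result str_to_title gives), then restore the trailing state code. This assumes the two-letter abbreviation always follows ", "; the variable names are made up for the example.

```r
str1 <- c("NEW YORK, NY", "WASHINGTON, DC", "AMHERST, MA")

# title-case each word: first letter upper (\U), rest lower (\L)
titled <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", str1, perl = TRUE)
# titled is now "New York, Ny" "Washington, Dc" "Amherst, Ma"

# re-uppercase whatever word follows the final ", "
fixed <- sub("(, )(\\w+)$", "\\1\\U\\2", titled, perl = TRUE)
fixed
#[1] "New York, NY"   "Washington, DC" "Amherst, MA"
```

The same second step works after stringr::str_to_title() if you prefer that for the title-casing.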

Adding a string between some pattern in R

I would like to add the string "AND" between the words "STREET" and "HENRY" into the following string:
WEST 156 STREET HENRY HUDSON PARKWAY
So that it reads WEST 156 STREET AND HENRY HUDSON PARKWAY. Essentially, I am trying to geocode intersections so I would like to be able to add "AND" between street types (AVENUE, STREET, BLVD, etc.) and whatever word comes after that to create the intersection like I specified above.
Here are a couple more examples (just made up):
strings = c("WEST 135TH AVE BROADWAY", # want WEST 135TH AVE AND BROADWAY,
"SUNSET BLVD MAIN ST", # SUNSET BLVD AND MAIN ST
"W 45TH ST LAKESHORE BLVD", #...
"HIGH ST BROAD ST") # ...
I would greatly appreciate any help! I am somewhat familiar with regular expressions, but I am not familiar with how to insert another word in this manner.
Capture the street type as a group and replace it with the backreference (\\1) plus the substring "AND". The third and fourth strings are left unchanged: ST is not part of the alternation, and the BLVD at the end of the string is not followed by the \\s+ (one or more spaces) that the pattern requires.
sub("(AVENUE|AVE|STREET|BLVD)\\s+", "\\1 AND ", strings)
-output
[1] "WEST 135TH AVE AND BROADWAY" "SUNSET BLVD AND MAIN ST"
[3] "W 45TH ST LAKESHORE BLVD" "HIGH ST BROAD ST"
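If the third and fourth strings should also get "AND" inserted, one possible extension is to add ST to the alternation and use a lookahead so the match is skipped only when the street type ends the string. This is a sketch under the assumption that a street type never ends the intersection name you want joined:

```r
strings <- c("WEST 135TH AVE BROADWAY", "SUNSET BLVD MAIN ST",
             "W 45TH ST LAKESHORE BLVD", "HIGH ST BROAD ST")
# \b keeps ST from matching inside words like WEST; (?!$) skips a
# trailing street type, so the final ST/BLVD is left alone
out <- sub("\\b(AVENUE|AVE|STREET|ST|BLVD)\\s+(?!$)", "\\1 AND ", strings, perl = TRUE)
out
#[1] "WEST 135TH AVE AND BROADWAY"  "SUNSET BLVD AND MAIN ST"
#[3] "W 45TH ST AND LAKESHORE BLVD" "HIGH ST AND BROAD ST"
```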

In R, remove substring pattern from string with gsub

We have a string column in our database with values for sports teams. The names of these teams are occasionally prefixed with the team's ranking, like so: (13) Miami (FL). Here 13 is Miami's rank, and the (FL) means this is Miami of Florida, not Miami of Ohio (Miami (OH)).
We need to clean up this string, removing (13) and keeping only Miami (FL). So far we've used gsub and tried the following:
> gsub("\\s*\\([^\\)]+\\)", "", "(13) Miami (FL)")
[1] " Miami"
This is incorrectly removing the (FL) suffix, and it's also not handling the white space correctly in front.
Edit
Here are a few additional school names, to show a bit of the data we're working with. Note that not every school has the (##) prefix:
c("North Texas", "Southern Methodist", "Texas-El Paso",
"Brigham Young", "Winner", "(12) Miami (FL)", "Appalachian State",
"Arkansas State", "Army", "(1) Clemson",
"(14) Georgia Southern")
You can use sub to remove a number in brackets followed by whitespace.
sub("\\(\\d+\\)\\s", "", "(13) Miami (FL)")
#[1] "Miami (FL)"
The regex could be made stricter based on the pattern in the data.
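For instance, anchoring the pattern at the start of the string guarantees only a leading rank is removed (a minimal sketch, assuming ranks only ever appear as a prefix):

```r
# ^ ties the bracketed number to the start, so unprefixed names pass through
cleaned <- sub("^\\(\\d+\\)\\s+", "", c("(13) Miami (FL)", "North Texas"))
cleaned
#[1] "Miami (FL)"  "North Texas"
```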
We can match the opening ( followed by one or more digits (\\d+), then the closing ) and one or more spaces (\\s+), and replace with blanks ("")
sub("\\(\\d+\\)\\s+", "", "(13) Miami (FL)")
#[1] "Miami (FL)"
Using the OP's updated example (v1 is the vector of school names above)
sub("\\(\\d+\\)\\s+", "", v1)
#[1] "North Texas" "Southern Methodist" "Texas-El Paso" "Brigham Young" "Winner" "Miami (FL)"
#[7] "Appalachian State" "Arkansas State" "Army" "Clemson" "Georgia Southern"
Or another option with str_remove from stringr
library(stringr)
str_remove("(13) Miami (FL)", "\\(\\d+\\)\\s+")
Another solution, based on stringr, is this:
str_extract(v1, "[A-Z].*")
[1] "North Texas" "Southern Methodist" "Texas-El Paso" "Brigham Young" "Winner"
[6] "Miami (FL)" "Appalachian State" "Arkansas State" "Army" "Clemson"
[11] "Georgia Southern"
This extracts everything starting from the first upper case letter (thereby ignoring the unwanted rankings).

R - using regex to delete all strings with 2 characters or less [duplicate]

This question already has answers here:
R: Find and remove all one to two letter words
(2 answers)
Closed 5 years ago.
I've got a problem and I'm sure it's super simple to fix, but I've been searching for an answer for about an hour and can't seem to work it out.
I have a character vector with data that looks a bit like this:
[5] "Toronto, ON" "Manchester, UK"
[7] "New York City, NY" "Newark, NJ"
[9] "Melbourne" "Los Angeles, CA"
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, CO" "London, UK"
[15] "New York, NY"
and basically I'd like to get rid of all words that are 2 characters or shorter, so that the data then looks as follows:
[5] "Toronto, " "Manchester, "
[7] "New York City, " "Newark, "
[9] "Melbourne" "Los Angeles, "
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, " "London, "
[15] "New York, "
The commas I know how to get rid of. As I said, I'm sure this is super simple, any help would be greatly appreciated. Thanks!
You can use a quantifier on the word character \\w with word boundaries: \\b\\w{1,2}\\b matches a word of one or two characters. Use gsub to remove all of them in case there are multiple matches:
gsub("\\b\\w{1,2}\\b", "", v)
# [1] "Toronto, " "Manchester, " "New York City, " "Newark, " "Melbourne" "Los Angeles, " "New York, USA"
# [8] "Liverpool, England" "Fort Collins, " "London, " "New York, "
Notice that \\w matches letters, digits and the underscore; if you only want to take alphabetic letters into account, you can use gsub("\\b[a-zA-Z]{1,2}\\b", "", v).
v <- c("Toronto, ON", "Manchester, UK", "New York City, NY", "Newark, NJ", "Melbourne", "Los Angeles, CA", "New York, USA", "Liverpool, England", "Fort Collins, CO", "London, UK", "New York, NY")
Doesn't use regex but it gets the job done:
d <- c(
"Toronto, ON", "Manchester, UK",
"New York City, NY", "Newark, NJ",
"Melbourne", "Los Angeles, CA" ,
"New York, USA", "Liverpool, England" ,
"Fort Collins, CO", "London, UK" ,
"New York, NY" )
toks <- strsplit(d, "\\s+")
lens <- sapply(toks, nchar)
mapply(function(a, b) {
paste(a[b > 2], collapse = " ")
}, toks, lens)

Remove specific string at the end position of each row from dataframe(csv)

I am trying to clean a set of data in CSV format. After loading the data into R, I need to replace and also remove some characters from it. Below is an example. Ideally I want to:
replace the St at the end of each row with Street;
in cases where there are St St., remove the St and replace the St. with just Street.
I tried to use this code
sub(x = evostreet, pattern = "St.", replacement = " ") and later
gsub(x = evostreet, pattern = "St.", replacement = " ") to remove the St. at the end of each row, but this also removes some other occurrences of St and the next character:
3 James St.
4 Glover Road St.
5 Jubilee Estate. St.
7 Fed Housing Estate St.
8 River State School St.
9 Brown State Veterinary Clinic. St.
11 Saw Mill St.
12 Dyke St St.
13 Governor Rd St.
I'm seeing a lot of close answers but I'm not seeing any that address the second problem he's having such as replacing "St St." with "Street"; e.g., "Dyke St St."
sub, as stated in the documentation:
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern
So, just using "St\\." as the pattern match is incorrect.
OP needs to match a possible pattern of "St St." and I'll further assume that it could even be "St. St." or "St. St".
Assuming OP is using a simple list:
x = c("James St.", "Glover Road St.", "Jubilee Estate. St.",
"Fed Housing Estate St.", "River State School St St.",
"Brown State Vet Clinic. St. St.", "Dyke St St.")
[1] "James St." "Glover Road St."
[3] "Jubilee Estate. St." "Fed Housing Estate St."
[5] "River State School St St." "Brown State Vet Clinic. St. St."
[7] "Dyke St St."
Then the following will replace the possible combinations mentioned above with "Street", as requested:
y <- sub(x, pattern = "[ St\\.]*$", replacement = " Street")
[1] "James Street" "Glover Road Street"
[3] "Jubilee Estate Street" "Fed Housing Estate Street"
[5] "River State School Street" "Brown State Vet Clinic Street"
[7] "Dyke Street"
Edit:
To answer OP's question below about replacing one substring St. with Saint and another with Street: I was looking for a way to match similar expressions and return different values, but at this point I haven't been able to find it. I suspect regmatches can do this, but it's something I'll have to fiddle with later.
A simple way to accomplish what you're wanting - let's assume:
x <- c("St. Mary St St.", "River State School St St.", "Dyke St. St")
[1] "St. Mary St St." "River State School St St."
[3] "Dyke St. St"
So you want x[1] to be Saint Mary Street, x[2] to be River State School Street and x[3] to be Dyke Street. I would want to resolve the Saint issue first by assigning sub() to y like:
y <- sub(x, pattern = "^St\\.", replacement = "Saint")
[1] "Saint Mary St St." "River State School St St."
[3] "Dyke St. St"
To resolve the St's at the end, we can use the same solution I posted, except notice now I'm not using x as my input vector but instead the y I just made:
y <- sub(y, pattern = "[ St\\.]*$", replacement = " Street")
[1] "Saint Mary Street" "River State School Street"
[3] "Dyke Street"
And that should take care of it. Now, I don't know if this is the most efficient way, and if your dataset is rather large this may run slow. If I find a better solution I will post it (provided no one else beats me).
You don't need to use regular expression here.
sub(x = evostreet, pattern = "St.", replacement = " ", fixed=T)
The fixed argument means that you want to replace this exact character, not matches of a regular expression.
I think that your problem is that the '.' character in the regular expression world means "any single character". So to match literally in R you should write
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You need to escape the dot; otherwise it means any single character after St, and that is why some other parts of your text were eliminated.
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You can add $ at the end if you want to remove only the tag appearing at the very end of the text.
sub(x = evostreet, pattern = "St\\.$", replacement = " ")
The difference between sub and gsub is that sub deals only with the first time your tag appears in the text, while gsub replaces all occurrences. In your case, since you are looking for the pattern at the end of the line, it should not make any difference which one you use if you add the $.
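A quick illustration of that difference, using a made-up string rather than the OP's data:

```r
x <- "St. Mary St. Clinic St."

first_only  <- sub("St\\.", "Street", x)    # first occurrence only
first_only
#[1] "Street Mary St. Clinic St."

all_of_them <- gsub("St\\.", "Street", x)   # every occurrence
all_of_them
#[1] "Street Mary Street Clinic Street"

end_only    <- sub("St\\.$", "Street", x)   # $ anchors to the end of the string
end_only
#[1] "St. Mary St. Clinic Street"
```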

improve nested ifelse statement in r

I have more than 10k address info, looks like "XXX street, city, state, US", in a character vector.
I want to group them by states, so I use nested ifelse to get the address date.frame with two variable, add_info and state.
library(stringr)
for (i in 1:nrow(address)) {
ifelse(str_detect(address, 'Alabama'), address[i,state]='Alabama',
ifelse(str_detect(address, 'Alaska'), address[i,state]='Alaska',
ifelse(str_detect(address, 'Arizona'), address[i,state]='Arizona',
...
ifelse(str_detect(address, 'Wyoming'), address[i,state]='Wyoming', address[i,state]=NA)...)
}
Of course, this is extremely inefficient, but I don't know how to rewrite this nested ifelse. Any idea?
There are many ways to approach this problem. This is one approach assuming that your address string always contains the full spelling of only one US state.
library(stringr)
# Get a list of all states
state.list = scan(text = "Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming", what = "", sep = ",", strip.white = T)
# Extract state from vector address using library(stringr)
state = unlist(sapply(address, function(x) state.list[str_detect(x, state.list)]))
# Generate fake data to test
fake.address = paste0(replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")),
sample(state.list, 20, rep = T),
replicate(10, paste(sample(c(0:9, LETTERS), 10, replace=TRUE), collapse="")))
# Test using fake address
unlist(sapply(fake.address, function(x) state.list[str_detect(x, state.list)]))
Output for fake address
O4H8V0NYEHColoradoA5K5XK35LX 44NDPQVMZ8UtahMY0I4M3086 LJ0LJW8BOBFloridaP5H2QW8B81 521IHHC1MFCaliforniaG7QTYCJRO5
"Colorado" "Utah" "Florida" "California"
YESTB7R6EPRhode IslandXEEGD4GEY3 5OHN2BR29HKansasCOKR9DY1WJ 4UXNJQW0QKNew MexicoH9GVQR3ZFY 5SYELTKO5HTexas3ONM1HU1VB
"Rhode Island" "Kansas" "New Mexico" "Texas"
Z8MKKL7K1RWashingtonGEBS7LJUU0 WPRSQEI2CNIndiana141S0Z1M2E O4H8V0NYEHNorth DakotaA5K5XK35LX 44NDPQVMZ8New HampshireMY0I4M3086
"Washington" "Indiana" "North Dakota" "New Hampshire"
LJ0LJW8BOBWest VirginiaP5H2QW8B811 LJ0LJW8BOBWest VirginiaP5H2QW8B812 521IHHC1MFNew JerseyG7QTYCJRO5 YESTB7R6EPWisconsinXEEGD4GEY3
"Virginia" "West Virginia" "New Jersey" "Wisconsin"
5OHN2BR29HOregonCOKR9DY1WJ 4UXNJQW0QKOhioH9GVQR3ZFY 5SYELTKO5HRhode Island3ONM1HU1VB Z8MKKL7K1ROklahomaGEBS7LJUU0
"Oregon" "Ohio" "Rhode Island" "Oklahoma"
WPRSQEI2CNIowa141S0Z1M2E
"Iowa"
edit: For fuzzy matching, use the following agrep()-based version, which should still work with minor spelling mistakes:
unlist(sapply(fake.address, function(x) {
L <- sapply(state.list, function(s) length(agrep(s, x)) > 0)
state.list[L]
}))
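A self-contained check of the agrep() idea, with a made-up state list and address; max.distance = 1 allows one insertion, deletion, or substitution:

```r
state.list <- c("Colorado", "Utah", "Florida")
addr <- "44NDPQVMZ8UtaahMY0I4M3086"  # "Utah" misspelled as "Utaah"

# agrep returns the indices of approximate matches, or integer(0) for none
hits <- sapply(state.list, function(s) length(agrep(s, addr, max.distance = 1)) > 0)
state.list[hits]
#[1] "Utah"
```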
Assuming that your formatting is consistent (sensu Joran's comment above), you could just parse with strsplit and then use data.frame:
address1<-"410 West Street, Small Town, MN, US"
address2<-"5844 Green Street, Foo Town, NY, US"
address3<-"875 Cardinal Lane, Placeville, CA, US"
vector<-c(address1,address2,address3)
df<-t(data.frame(strsplit(vector,", ")))
colnames(df)<-c("Number","City","State","Country")
rownames(df)<-NULL
df
which produces:
Number City State Country
[1,] "410 West Street" "Small Town" "MN" "US"
[2,] "5844 Green Street" "Foo Town" "NY" "US"
[3,] "875 Cardinal Lane" "Placeville" "CA" "US"
There are several methods.
First we need some sample data.
# some sample data
set.seed(123)
dat <- data.frame(addr=sprintf('123 street, Townville, %s, US',
sample(state.name, 25, replace=T)),
stringsAsFactors=F)
If your data is super regular like that:
# the easy way, split on commas:
matrix(unlist(strsplit(dat$addr, ',')), ncol=4, byrow=T)
Method 2, use grep to search for values. This works even if no commas or different commas in different rows. (As long as the states always appear spelled the same way)
# get a list of state name matches; need to match ', state name,' otherwise
# West Virginia counts as Virginia...
matches <- sapply(paste0(', ', state.name, ','), grep, dat$addr)
# now pair up the state name with the row it matches to
state_df <- data.frame(state=rep(state.name, sapply(matches, length)),
row=unname(unlist(matches)),
stringsAsFactors=F)
# reorder based on position in original data.frame, and there you go!
dat$state <- state_df[order(state_df$row), 'state']
This seemed to be working in my tests:
just.ST <- gsub( paste0(".+(", paste(state.name,collapse="|"), ").+$"),
"\\1", address)
As mentioned in comments and illustrated in other answers, state.name should be available by default. It does have the deficiency that in case of a non-match it returns the whole string, but you can probably use:
is.na(just.ST) <- nchar(just.ST) > max(nchar(state.name))
