I have address data that I extracted from SQL and have now loaded into R. I am trying to extract the individual components, namely the ZIP code at the end of each address (the state would also be nice). I would like the ZIP code and state to be in new, separate columns.
The primary issue is the ZIP-CODE is sometimes 5 digits, and sometimes 9.
Two example rows would be:
Address_FULL
1234 NOWHERE ST WASHINGTON DC 20005
567 EVERYWHERE LN CHARLOTTE NC 22011-1203
I suspect I'll need some kind of regex \\d{5} notation, or some kind of fancy manipulation in dplyr that I'm not aware exists.
If the zip code is always at the end you could use
str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$")
To add a "zip" column via dplyr you could use
df %>% mutate(zip = str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$"))
Where df is your dataframe containing Address_FULL and
str_extract() is from stringr.
State could be extracted as follows:
str_extract(Address_FULL,"(?<=\\s)[[:alpha:]]{2}(?=\\s[[:digit:]]{5})")
However, this makes the following assumptions:
The state abbreviation is 2 characters long
The state abbreviation is followed immediately by a space
The zip code follows immediately after the space that follows the state
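As a rough sketch of how the two extractions could be combined without stringr, the base R snippet below uses one pattern with two capture groups. The data frame df holds the two example rows from the question; the pattern and the column names state and zip are illustrative, and the same assumptions listed above apply.

```r
# Example data: the two rows from the question
df <- data.frame(
  Address_FULL = c("1234 NOWHERE ST WASHINGTON DC 20005",
                   "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"),
  stringsAsFactors = FALSE
)

# Group 1: two-letter state; group 2: 5-digit ZIP with optional -4 suffix at the end
pattern <- "^.*\\s([A-Za-z]{2})\\s(\\d{5}(?:-\\d{4})?)$"

# Rows that do not fit the pattern get NA instead of the unchanged input
matched <- grepl(pattern, df$Address_FULL, perl = TRUE)
df$state <- ifelse(matched, sub(pattern, "\\1", df$Address_FULL, perl = TRUE), NA_character_)
df$zip   <- ifelse(matched, sub(pattern, "\\2", df$Address_FULL, perl = TRUE), NA_character_)

print(df)
```

Checking grepl() first matters because sub() returns the input unchanged when the pattern does not match, which would otherwise put whole addresses into the new columns.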
Assuming that the zip is always at the end, you can try:
tail(unlist(strsplit(STRING, split=" ")), 1)
For example
ex1 = "1234 NOWHERE ST WASHINGTON DC 20005"
ex2 = "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"
> tail(unlist(strsplit(ex1, split=" ")), 1)
[1] "20005"
> tail(unlist(strsplit(ex2, split=" ")), 1)
[1] "22011-1203"
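The call above handles one string at a time. A possible way to apply the same idea to a whole vector of addresses is sketched below; the vector name addresses is made up for the example, and the state lookup assumes the state is always the second-to-last token.

```r
addresses <- c("1234 NOWHERE ST WASHINGTON DC 20005",
               "567 EVERYWHERE LN CHARLOTTE NC 22011-1203")

# Split every address on single spaces; the result is a list with one vector per address
tokens <- strsplit(addresses, " ", fixed = TRUE)

# Last token = ZIP, second-to-last token = state
zip   <- vapply(tokens, function(parts) tail(parts, 1), character(1))
state <- vapply(tokens, function(parts) tail(parts, 2)[1], character(1))

print(data.frame(state, zip))
```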
Use my package tfwstring
Works automatically on any address type, even with prefixes and suffixes.
if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")
library(tfwstring) # attach the package so parseaddress() is found
parseaddress("1234 NOWHERE ST WASHINGTON DC 20005", force_stateabb = FALSE)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"1234" "NOWHERE" "ST" "WASHINGTON" "DC" "20005"
parseaddress("567 EVERYWHERE LN CHARLOTTE NC 22011-1203", force_stateabb = FALSE)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"567" "EVERYWHERE" "LN" "CHARLOTTE" "NC" "22011-1203"
I'm just having a really hard time figuring this out. Let's go straight to the data.
library(countrycode)
countries <- codelist$country.name.en #list of countries from the library
text <- "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. (Spain) No information available. (Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
I'd want to create a list of the parsed text (eg. from "(France)" to "Nissan.") for all three countries. The actual text is 30 pages long and each (countryName) is followed by several paragraphs of text.
All the countryNames are in parentheses but there might be other non-country parentheses in the text or countryNames without parentheses. But the general pattern is that each segment I want to parse starts with (countryName1) and ends with (countryName2)
Output:
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan."
[[2]]
[1] "(Spain) No information available."
[[3]]
[1] "(Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
If every country in 'text' matches an entry in the reference vector, we can paste the reference vector into a single alternation pattern and split the string just before each country match
as.list(strsplit(text, sprintf('(?<=\\s)(?=(%s))',
paste(paste0("\\(", countries), collapse = "|")), perl = TRUE)[[1]])
-output
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. "
[[2]]
[1] "(Spain) No information available. "
[[3]]
[1] "(Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
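Country names can contain characters that are special in a regex, and the pattern above does not require the closing parenthesis. The variant below is a sketch of one way to tighten this: each name is wrapped in \Q...\E so that PCRE treats it literally, and the lookahead demands the full "(Name)". The three-element countries vector is a stand-in for codelist$country.name.en so the snippet runs without the countrycode package.

```r
# Stand-in for codelist$country.name.en (assumption for this example)
countries <- c("France", "Spain", "Chad")
text <- "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. (Spain) No information available. (Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."

# \Q...\E quotes each name literally; the names are then joined into one alternation
alternatives <- paste0("\\Q", countries, "\\E", collapse = "|")

# Split at the empty position after whitespace and before "(CountryName)"
pattern <- sprintf("(?<=\\s)(?=\\((?:%s)\\))", alternatives)
segments <- trimws(strsplit(text, pattern, perl = TRUE)[[1]])

print(as.list(segments))
```

trimws() removes the trailing space that the split leaves at the end of each segment.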
I tried to find an answer for this in other posts but nothing seemed to be working.
I have a data set where people answered the city they were in using a free response format. Therefore, for each city, people wrote the name in many different ways. For example, those living in Atlanta might have written "Atlanta", "atlanta", "Atlanta, GA" and so on.
There are 12 cities represented in this data set. I'm trying to clean this variable so each city is written consistently. Is there a way to do this efficiently for each city?
I've tried mutate_if and str_replace_all but can't seem to figure it out (see my code below)
all_data_city <- mutate_if(all_data_city, is.character,
str_replace_all, pattern = "Atlanta, GA",
replacement = "Atlanta")
all_data_city %>%
str_replace_all(c("Atlanta, GA" & "HCA Atlanta" & "HCC Atlanta" &
"Suwanee" & "Suwanee, GA" & "suwanee"), = "Atlanta")
If we need to pass a vector of elements to be replaced, paste them together with | as pattern and replace with 'Atlanta'
library(dplyr)
library(stringr)
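# Longer variants go first: alternation takes the first alternative that matches,
# so "Suwanee" listed before "Suwanee, GA" would leave a stray ", GA" behind
pat <- str_c(c("Atlanta, GA" , "HCA Atlanta" , "HCC Atlanta" ,
"Suwanee, GA" , "Suwanee" , "suwanee"), collapse = "|")
# all_data_city is a data frame, so apply the replacement to its character columns
all_data_city %>%
mutate_if(is.character, str_replace_all, pattern = pat, replacement = "Atlanta")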
Using a reproducible example with iris
iris %>%
transmute(Species = str_replace_all(Species,
str_c(c("set", "versi"), collapse="|"), "hello")) %>%
pull(Species) %>%
unique
#[1] "helloosa" "hellocolor" "virginica"
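One detail worth checking when variants are pasted together with |: backtracking regex engines try the alternatives from left to right, so a short variant listed before a longer one that starts the same way wins. The base R sketch below (using perl = TRUE, which has this left-to-right behaviour) shows the effect and one possible workaround, sorting the variants by length. The vectors variants and responses are invented for the example.

```r
variants  <- c("Suwanee", "Suwanee, GA", "Atlanta, GA", "HCA Atlanta")
responses <- c("Suwanee, GA", "HCA Atlanta", "Atlantic City")

# Naive order: "Suwanee" matches first and ", GA" is left over
naive <- gsub(paste(variants, collapse = "|"), "Atlanta", responses, perl = TRUE)

# Longest variants first, so the most specific alternative is tried first
by_length <- variants[order(nchar(variants), decreasing = TRUE)]
fixed <- gsub(paste(by_length, collapse = "|"), "Atlanta", responses, perl = TRUE)

print(naive)
print(fixed)
```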
Questions on data cleaning are difficult to answer, as answers strongly depend on the data.
Proposed solutions may work for a (small) sample dataset but may fail for a (large) production dataset.
In this case, I see two possible approaches:
Collecting all possible ways of writing a city's name and replacing these different variants by the desired city name. This can be achieved by str_replace() or by joining. This is safe but tedious.
Looking for a matching character string within the city name and replace if found.
Below is a blueprint which can be extended for other use cases. For demonstration, a data.frame with one column, city, is created:
library(dplyr)
library(stringr)
data.frame(city = c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
"Suwanee", "Suwanee, GA", "suwanee", "Atlantic City")) %>%
mutate(city_new = case_when(
str_detect(city, regex("Atlanta|Suwanee", ignore_case = TRUE)) ~ "Atlanta",
TRUE ~ as.character(city)
)
)
city city_new
1 Atlanta, GA Atlanta
2 HCA Atlanta Atlanta
3 HCC Atlanta Atlanta
4 Suwanee Atlanta
5 Suwanee, GA Atlanta
6 suwanee Atlanta
7 Atlantic City Atlantic City
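The first approach mentioned above (collecting the known variants and replacing them via a lookup) could look roughly like the base R sketch below. The lookup table, its column names and the responses vector are invented for illustration; in practice the table would list the variants actually found in the data.

```r
# Hypothetical lookup table: one row per known spelling, stored in lower case
lookup <- data.frame(
  variant    = c("atlanta", "atlanta, ga", "hca atlanta", "hcc atlanta",
                 "suwanee", "suwanee, ga"),
  city_clean = "Atlanta",
  stringsAsFactors = FALSE
)

responses <- c("Atlanta, GA", "HCA Atlanta", "suwanee", "Atlantic City")

# Normalise case and surrounding whitespace, then look up exact matches only
key <- tolower(trimws(responses))
idx <- match(key, lookup$variant)

# Keep the original response wherever no variant matched
cleaned <- ifelse(is.na(idx), responses, lookup$city_clean[idx])
print(cleaned)
```

Because only whole-string matches are replaced, "Atlantic City" cannot be changed by accident, at the cost of having to list every variant.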
I want to extract the state abbreviation (2 letters) and zip code (either 4 or 5 digits) from the following string
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
For the zip code, I tried a few methods that I found on here, but they didn't work, mainly because of the 5-digit street numbers and the zip codes that have only 4 digits
text <- readLines(textConnection(address))
library(stringi)
zip <- stri_extract_last_regex(text, "\\d{5}")
zip
library(qdapRegex)
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract = TRUE)
zip <- rm_zip3(text)
zip
[1] "99577" "1670" "35007" "0360" "06854" "0404" "83338" "4333" "4330" "8223" "5495" "5233" NA
For the state abbreviation, I have no idea how to extract
Any help is appreciated! Thanks in advance!
Edit 1: Include phone numbers
Code to extract zip code (anchored to the end of the line so that a 5-digit street number such as 19800 is not picked up):
zip <- str_extract(text, "\\d{5}$")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")
Code to extract phone numbers:
phone <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")
NOTE: There seems to be an issue with your data: the last two zip codes should be 5 characters long, not 4 (4330 should actually be 04330). If you don't control the data source but know for sure that these are US zip codes, you could pad zeros on the left as required. However, since you are looking for a solution for 4 or 5 characters, you can use this:
Code to extract zip code (looks for space in front and newline at the back so that parts of a phone number or an address aren't picked)
zip <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")
Demo: https://regex101.com/r/7Im0Mu/2
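For the zero-padding option mentioned in the note, a minimal base R sketch could look like this. It assumes every value is a plain 4- or 5-digit code (no ZIP+4 suffix, no missing values); the vector zip is example data.

```r
zip <- c("99577", "06854", "4330", "5495")

# Convert to integer and print with a fixed width of 5, padding with zeros on the left
zip_padded <- sprintf("%05d", as.integer(zip))
print(zip_padded)
```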
I am using address as input rather than text; see if it works for your case.
Assumptions behind the regex: two capital letters followed by 4 or 5 digits are the state and zip, and the phone number is always on the next line.
Input:
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
I am using the stringr library; you may choose any other to extract the information as you wish.
library(stringr)
df <- data.frame(do.call("rbind",strsplit(str_extract_all(address,"[A-Z][A-Z]\\s\\d{4,5}\\s\\d{3}-\\d{3}-\\d{4}")[[1]],split="\\s|\\n")))
names(df) <- c("state","Zip","Phone")
EDIT:
In case someone wants to use text as input,
text <- readLines(textConnection(address))
text <- data.frame(text)
st_zip <- setNames(data.frame(str_extract_all(text$text,"[A-Z][A-Z]\\s\\d{4,5}",simplify = T)),"St_zip")
pin <- setNames(data.frame(str_extract_all(text$text,"\\d{3}-\\d{3}-\\d{4}",simplify = T)),"pin")
st_zip <- st_zip[st_zip$St_zip != "",]
df1 <- setNames(data.frame(do.call("rbind",strsplit(st_zip,split=' '))),c("State","Zip"))
pin <- pin[pin$pin != "",]
df2 <- data.frame(cbind(df1,pin))
OUTPUT:
State Zip pin
1 AK 99577 907-481-1670
2 AL 35007 205-620-0360
3 CT 06854 860-409-0404
4 ID 83338 208-324-4333
5 ME 4330 207-623-8223
6 VT 5495 802-878-5233
Thank you @Rahul. Both would be great. At least can you show me how to do it with Notepad++?
Extraction using Notepad++
1. Copy your whole data set into a file and open it in Notepad++.
2. Press Ctrl + F to open the search dialog and choose the Replace tab. With the Regular expression search mode selected, search for ([A-Z]{2}\s*\d{4,5})$ and replace with \n-\1-\n. This finds each state abbreviation and ZIP code and puts them on a new line, with - as prefix and suffix.
3. Go to the Mark tab, check the Bookmark line checkbox, search for -(.*?)- and press Mark All. This marks the lines that hold a state abbreviation and ZIP code wrapped in -.
4. Go to Search --> Bookmark --> Remove Unmarked Lines.
5. Finally, search for ^-|-$ and replace with an empty string.
Update
So now there will be phone numbers too? In that case you only have to remove the $ from the regex in step 2, so the regex becomes ([A-Z]{2}\s*\d{4,5}). All other steps stay the same.
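For readers who would rather stay in R, the search pattern from the Notepad++ steps can be reused with base R's regexpr() and regmatches(). This is only a sketch; the vector address_lines contains four lines copied from the question's data.

```r
address_lines <- c("19800 Eagle River Road, Eagle River AK 99577",
                   "907-481-1670",
                   "20175 Civic Center Dr, Augusta ME 4330",
                   "207-623-8223")

# Same pattern as in the Replace step; lines without a match are dropped by regmatches()
hits <- regmatches(address_lines,
                   regexpr("[A-Z]{2}\\s*\\d{4,5}$", address_lines, perl = TRUE))

# Split "AK 99577" into its two parts and collect them in a data frame
parts  <- do.call(rbind, strsplit(hits, "\\s+"))
result <- data.frame(state = parts[, 1], zip = parts[, 2], stringsAsFactors = FALSE)
print(result)
```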
I am trying to determine in R how to split a column that has multiple fields with multiple delimiters.
From an API, I get a column in a data frame called "Location". It has multiple location identifiers in it. Here is an example of one entry. (edit- I added a couple more)
6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)
4284 E 61ST ST
Kansas City, MO 64130
(39.014638172000446, -94.5335298549997)
3002 SPRUCE AVE
Kansas City, MO 64128
(39.07083265200049, -94.53320606399967)
6022 E Red Bridge Rd
Kansas City, MO 64134
(38.92458893200046, -94.52090062499968)
So the above are the entries in rows 1-4 of the column "location".
I want to split this into address, city, state, zip, long and lat columns. Some fields are separated by a space or tab, others by a comma. Also, nothing is fixed width.
I have looked at the reshape package, but it seems I need a single delimiter. I can't use space (or can I?) as the address has spaces in it.
Thoughts?
If the data you have is not like this, let everyone know by adding code we can copy and paste into R to reproduce your data (see how this sample data can be easily copied and pasted into R?)
Sample data:
location <- c(
"6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)",
"456 POOH LANE
New York City, NY 10025
(40, -90)")
location
#[1] "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)"
#[2] "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
A solution:
# Insert a comma between the state abbreviation and the zip code
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
# get rid of parentheses
step2 <- gsub("\\(|\\)", "", step1)
# split on "\n", ",", and ", "
strsplit(step2, "\n|,|, ")
#[[1]]
#[1] "6540 BENNINGTON AVE" "Kansas City" "MO"
#[4] "64133" "39.005620414000475" "-94.50998643299965"
#[[2]]
#[1] "456 POOH LANE" "New York City" "NY" "10025"
#[5] "40" "-90"
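To get from the list above to the columns asked for in the question, the pieces can be bound into a data frame. The sketch below repeats the steps so that it runs on its own; it swaps the split pattern for "\\n|,\\s*" (a comma plus any following whitespace), which is a substitution made here for robustness, and it assumes every entry splits into exactly six pieces. The column names are taken from the question.

```r
location <- c(
  "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)",
  "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
)

# Insert a comma between state and zip, then drop the parentheses
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
step2 <- gsub("[()]", "", step1)

# Split on line breaks and on commas followed by optional whitespace
parts <- strsplit(step2, "\\n|,\\s*", perl = TRUE)

# One row per entry; this only works if every entry yields six pieces
address_df <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(address_df) <- c("address", "city", "state", "zip", "lat", "lon")
address_df$lat <- as.numeric(address_df$lat)
address_df$lon <- as.numeric(address_df$lon)

print(address_df)
```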
Here is an example with the stringr package.
Using @Frank's example data from above, you can do:
library(stringr)
address <- str_match(location,
"(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)")
address <- data.frame(address[,-1]) # get rid of the first column which has the full match
names(address) <- c("address", "city", "state", "zip", "lat", "lon")
> address
address city state zip lat lon
1 6540 BENNINGTON AVE Kansas City MO 64133 39.005620414000475 -94.50998643299965
2 456 POOH LANE New York City NY 10025 40 -90
Note that this is pretty specific to the format of the one entry given. It would need to be tweaked if there is variation in any number of ways.
This takes everything from the start of the string up to the first line break as address ([[:print:]] matches spaces but not the newline, so the street address may contain spaces). The next run of letters, spaces and periods up to the next comma is city. After the comma and a space, the next two letters are state. Following a space, the next five digits are zip. Finally, after the opening parenthesis, the two runs of digits, periods and/or minus signs are assigned to lat and lon.
I have a large file with a variable state that holds full state names. I would like to replace them with the state abbreviations (that is, "NY" for "New York"). Is there an easy way to do this (apart from using several if-else commands)? Maybe using replace()?
R has two built-in constants that might help: state.abb with the abbreviations, and state.name with the full names. Here is a simple usage example:
> x <- c("New York", "Virginia")
> state.abb[match(x,state.name)]
[1] "NY" "VA"
1) grep the full name from state.name and use that to index into state.abb:
state.abb[grep("New York", state.name)]
## [1] "NY"
1a) or using which:
state.abb[which(state.name == "New York")]
## [1] "NY"
2) or create a vector of state abbreviations whose names are the full names and index into it using the full name:
setNames(state.abb, state.name)["New York"]
## New York
## "NY"
Unlike (1), this one works even if "New York" is replaced by a vector of full state names, e.g. setNames(state.abb, state.name)[c("New York", "Idaho")]
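If the column mixes full names with values that are already abbreviations, the named-vector lookup returns NA for the latter. One possible way to keep those values is sketched below; the vector x is example data.

```r
# Named vector: names are full state names, values are abbreviations
lookup <- setNames(state.abb, state.name)

x <- c("New York", "ID", "Virginia")

# Look up each value; entries that are not full state names come back as NA
abbr <- unname(lookup[x])

# Fall back to the original value where the lookup found nothing
result <- ifelse(is.na(abbr), x, abbr)
print(result)
```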
Old post I know, but wanted to throw mine in there. I learned on tidyverse, so for better or worse I avoid base R when possible. I wanted one with DC too, so first I built the crosswalk:
library(tidyverse)
st_crosswalk <- tibble(state = state.name) %>%
bind_cols(tibble(abb = state.abb)) %>%
bind_rows(tibble(state = "District of Columbia", abb = "DC"))
Then I joined it to my data:
left_join(data, st_crosswalk, by = "state")
I found that the built-in state.name and state.abb cover only the 50 states. I got a bigger table (including DC and so on) online (e.g., this link: http://www.infoplease.com/ipa/A0110468.html) and pasted it into a .csv file named States.csv. I then load the state names and abbreviations from this file instead of using the built-ins. The rest is quite similar to @Aniko's answer.
library(dplyr)
library(stringr)
library(stringdist)
# setwd(...) # set the working directory to the folder that holds States.csv
# load data
data = c("NY", "New York", "NewYork")
data = toupper(data)
# load state name and abbr.
State.data = read.csv('States.csv')
State = toupper(State.data$State)
Stateabb = as.vector(State.data$Abb)
# match data with state names, misspell of 1 letter is allowed
match = amatch(data, State, maxDist=1)
data[ !is.na(match) ] = Stateabb[ na.omit( match ) ]
match() only finds exact matches, whereas amatch() returns the closest entry within maxDist edits. See pp. 25-26 here http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
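To illustrate the difference between exact and approximate matching without installing stringdist, the sketch below uses base R's adist(), which computes a generalized Levenshtein distance. It is a stand-in for amatch(), not a reimplementation: amatch() may use a different distance measure, and the helper logic here is made up for the example. The built-in state.name and state.abb are used instead of States.csv.

```r
targets <- toupper(state.name)
inputs  <- c("NEW YORK", "NEWYORK", "NY")

# Exact matching: only the first input is found
exact <- match(inputs, targets)

# Approximate matching: accept the closest state name if it is at most 1 edit away
d <- adist(inputs, targets)
fuzzy <- vapply(seq_len(nrow(d)), function(i) {
  best <- which.min(d[i, ])
  if (d[i, best] <= 1) as.integer(best) else NA_integer_
}, integer(1))

print(state.abb[exact])
print(state.abb[fuzzy])
```

"NY" stays unmatched in both cases because it is more than one edit away from every full state name, which is why the code above leaves existing abbreviations untouched.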
You can also use base::abbreviate if you don't have US state names. This won't give you equally sized abbreviations unless you increase minlength.
state.name %>% base::abbreviate(minlength = 1)
Here is another way of doing it in case you have more than one state in your data and you want to replace the names with the corresponding abbreviations.
#creating a list of names
states_df <- c("Alabama","California","Nevada","New York",
"Oregon","Texas", "Utah","Washington")
states_df <- as.data.frame(states_df)
The output is
> print(states_df)
states_df
1 Alabama
2 California
3 Nevada
4 New York
5 Oregon
6 Texas
7 Utah
8 Washington
Now, using the built-in state.abb and state.name vectors with match(), you can easily convert the names into abbreviations, and vice versa.
states_df$state_code <- state.abb[match(states_df$states_df, state.name)]
> print(states_df)
states_df state_code
1 Alabama AL
2 California CA
3 Nevada NV
4 New York NY
5 Oregon OR
6 Texas TX
7 Utah UT
8 Washington WA
If matching state names to abbreviations (or the other way around) is something you have to do frequently, you could put Aniko's solution in a function in your .Rprofile or in a package:
state_to_st <- function(x){
c(state.abb, 'DC')[match(x, c(state.name, 'District of Columbia'))]
}
st_to_state <- function(x){
c(state.name, 'District of Columbia')[match(x, c(state.abb, 'DC'))]
}
Using that function as part of a dplyr chain (enframe() comes from the tibble package):
enframe(state.name, value = 'state_name') %>%
mutate(state_abbr = state_to_st(state_name))
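A quick check of the two helpers in base R, with the functions repeated so the snippet runs on its own. "Narnia" is a deliberately invalid input to show that unmatched values come back as NA.

```r
state_to_st <- function(x) {
  c(state.abb, "DC")[match(x, c(state.name, "District of Columbia"))]
}
st_to_state <- function(x) {
  c(state.name, "District of Columbia")[match(x, c(state.abb, "DC"))]
}

# Full name -> abbreviation; unknown names give NA
abbr <- state_to_st(c("New York", "District of Columbia", "Narnia"))

# Abbreviation -> full name
full <- st_to_state(c("VA", "DC"))

print(abbr)
print(full)
```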