I am trying to determine in R how to split a column that has multiple fields with multiple delimiters.
From an API, I get a column in a data frame called "Location". It has multiple location identifiers in it. Here is an example of one entry. (edit- I added a couple more)
6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)
4284 E 61ST ST
Kansas City, MO 64130
(39.014638172000446, -94.5335298549997)
3002 SPRUCE AVE
Kansas City, MO 64128
(39.07083265200049, -94.53320606399967)
6022 E Red Bridge Rd
Kansas City, MO 64134
(38.92458893200046, -94.52090062499968)
So the above is the entry in row 1-4, column "location".
I want split this into address, city, state, zip, long and lat columns. Some fields are separated by space or tab while others by comma. Also nothing is fixed width.
I have looked at the reshape package- but seems I need a single deliminator. I can't use space (or can I?) as the address has spaces in it.
Thoughts?
If the data you have is not like this, let everyone know by adding code we can copy and paste into R to reproduce your data (see how this sample data can be easily copied and pasted into R?)
Sample data:
location <- c(
"6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)",
"456 POOH LANE
New York City, NY 10025
(40, -90)")
location
#[1] "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)"
#[2] "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
A solution:
# Insert a comma between the state abbreviation and the zip code
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
# get rid of parentheses
step2 <- gsub("\\(|\\)", "", step1)
# split on "\n", ",", and ", "
strsplit(step2, "\n|,|, ")
#[[1]]
#[1] "6540 BENNINGTON AVE" "Kansas City" "MO"
#[4] "64133" "39.005620414000475" "-94.50998643299965"
#[[2]]
#[1] "456 POOH LANE" "New York City" "NY" "10025"
#[5] "40" "-90"
Here is an example with the stringr package.
Using #Frank's example data from above, you can do:
library(stringr)
address <- str_match(location,
"(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)")
address <- data.frame(address[,-1]) # get rid of the first column which has the full match
names(address) <- c("address", "city", "state", "zip", "lat", "lon")
> address
address city state zip lat lon
1 6540 BENNINGTON AVE Kansas City MO 64133 39.005620414000475 -94.50998643299965
2 456 POOH LANE New York City NY 10025 40 -90
Note that this is pretty specific to the format of the one entry given. It would need to be tweaked if there is variation in any number of ways.
This takes everything from the start of the string to the first [:space:] character as address. The next set of letters, spaces and periods up until the next comma is given to city. After the comma and a space, the next two letters are given to state. Following a space, the next five digits make up the zip field. Finally, the next set of numbers, period and/or minus signs each get assigned to lat and lon.
Related
I tried to find an answer for this in other posts but nothing seemed to be working.
I have a data set where people answered the city they were in using a free response format. Therefore for each city, people identified in many different ways. For example, those living in Atlanta might have written "Atlanta", "atlanta", "Atlanta, GA" and so on.
There are 12 cities represented in this data set. I'm trying to clean this variable so each city is written consistently. Is there a way to do this efficiently for each city?
I've tried mutate_if and str_replace_all but can't seem to figure it out (see my code below)
all_data_city <- mutate_if(all_data_city, is.character,
str_replace_all, pattern = "Atlanta, GA",
replacement = "Atlanta")
all_data_city %>%
str_replace_all(c("Atlanta, GA" & "HCA Atlanta" & "HCC Atlanta" &
"Suwanee" & "Suwanee, GA" & "suwanee"), = "Atlanta")
If we need to pass a vector of elements to be replaced, paste them together with | as pattern and replace with 'Atlanta'
library(dplyr)
library(stringr)
pat <- str_c(c("Atlanta, GA" , "HCA Atlanta" , "HCC Atlanta" ,
"Suwanee" , "Suwanee, GA" , "suwanee"), collapse = "|")
all_data_city %>%
str_replace_all(pat, "Atlanta")
Using a reproducible example with iris
iris %>%
transmute(Species = str_replace_all(Species,
str_c(c("set", "versi"), collapse="|"), "hello")) %>%
pull(Species) %>%
unique
#[1] "helloosa" "hellocolor" "virginica"
Questions on data cleaning are difficult to answer, as answers strongly depend on the data.
Proposed solutions may work for a (small) sample dataset but may fail for a (large) production dataset.
In this case, I see two possible approaches:
Collecting all possible ways of writing a city's name and replacing these different variants by the desired city name. This can be achieved by str_replace() or by joining. This is safe but tedious.
Looking for a matching character string within the city name and replace if found.
Below is a blue print which can be extended for other uses cases. For demonstration, a data.frame with one column city is created:
library(dplyr)
library(stringr)
data.frame(city = c("Atlanta, GA", "HCA Atlanta", "HCC Atlanta",
"Suwanee", "Suwanee, GA", "suwanee", "Atlantic City")) %>%
mutate(city_new = case_when(
str_detect(city, regex("Atlanta|Suwanee", ignore_case = TRUE)) ~ "Atlanta",
TRUE ~ as.character(city)
)
)
city city_new
1 Atlanta, GA Atlanta
2 HCA Atlanta Atlanta
3 HCC Atlanta Atlanta
4 Suwanee Atlanta
5 Suwanee, GA Atlanta
6 suwanee Atlanta
7 Atlantic City Atlantic City
I have address data I extracted from SQL, and have now loaded into R. I am trying to extract out the individual components, namely the ZIP-CODE at the end of the query (State would also be nice). I would like the ZIP-CODE and State to be in new individual columns.
The primary issue is the ZIP-CODE is sometimes 5 digits, and sometimes 9.
Two example rows would be:
Address_FULL
1234 NOWHERE ST WASHINGTON DC 20005
567 EVERYWHERE LN CHARLOTTE NC 22011-1203
I suspect I'll need some kind of regex \\d{5} notation, or some kind of fancy manipulation in dplyr that I'm not aware exists.
If the zip code is always at the end you could use
str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$")
To add a "zip" column via dplyr you could use
df %>% mutate(zip = str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$"))
Where df is your dataframe containing Address_FULL and
str_extract() is from stringr.
State could be extracted as follows:
str_extract(Address_FULL,"(?<=\\s)[[:alpha:]]{2}(?=\\s[[:digit:]]{5})")
However, this makes the following assumptions:
The state abbreviation is 2 characters long
The state abbreviation is followed immediately by a space
The zip code follows immediately after the space that follows the state
Assuming that the zip is always at the end, you can try:
tail(unlist(strsplit(STRING, split=" ")), 1)
For example
ex1 = "1234 NOWHERE ST WASHINGTON DC 20005"
ex2 = "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"
> tail(unlist(strsplit(ex1, split=" ")), 1)
[1] "20005"
> tail(unlist(strsplit(ex2, split=" ")), 1)
[1] "22011-1203"
Use my package tfwstring
Works automatically on any address type, even with prefixes and suffixes.
if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")
parseaddress("1234 NOWHERE ST WASHINGTON DC 20005", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"1234" "NOWHERE" "ST" "WASHINGTON" "DC" "20005"
parseaddress("567 EVERYWHERE LN CHARLOTTE NC 22011-1203", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"567" "EVERYWHERE" "LN" "CHARLOTTE" "NC" "22011-1203"
I want to extract state abbreviation (2 letters) and zip code (either 4 or 5 numbers) from the following string
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
For the zip code, I tried few methods that I found on here but it didn't work mainly because of the 5 number street address or zip codes that have only 4 numbers
text <- readLines(textConnection(address))
library(stringi)
zip <- stri_extract_last_regex(text, "\\d{5}")
zip
library(qdapRegex)
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract = TRUE)
zip <- rm_zip3(text)
zip
[1] "99577" "1670" "35007" "0360" "06854" "0404" "83338" "4333" "4330" "8223" "5495" "5233" NA
For the state abbreviation, I have no idea how to extract
Any help is appreciated! Thanks in advance!
Edit 1: Include phone numbers
Code to extract zip code:
zip <- str_extract(text, "\\d{5}")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")
Code to extract phone numbers:
phone <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")
NOTE: Looks like there's an issue with your data because the last 2 zip codes should be 5 characters long and not 4. 4330 should actually be 04330. If you don't have control over the data source, but know for sure that they are US codes you could pad 0's on the left as required. However since you are looking for a solution for 4 or 5 characters, you can use this:
Code to extract zip code (looks for space in front and newline at the back so that parts of a phone number or an address aren't picked)
zip <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")
Demo: https://regex101.com/r/7Im0Mu/2
I am using address as input not the text, see if it works for your case.
Assumptions on regex: Two capital letters followed by 4 or 5 numeric letters are for state and zip, The phone numbers are always on next line.
Input:
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
I am using stringr library , you may choose any other to extract the information as you wish.
library(stringr)
df <- data.frame(do.call("rbind",strsplit(str_extract_all(address,"[A-Z][A-Z]\\s\\d{4,5}\\s\\d{3}-\\d{3}-\\d{4}")[[1]],split="\\s|\\n")))
names(df) <- c("state","Zip","Phone")
EDIT:
In case someone want to use text as input,
text <- readLines(textConnection(address))
text <- data.frame(text)
st_zip <- setNames(data.frame(str_extract_all(text$text,"[A-Z][A-Z]\\s\\d{4,5}",simplify = T)),"St_zip")
pin <- setNames(data.frame(str_extract_all(text$text,"\\d{3}-\\d{3}-\\d{4}",simplify = T)),"pin")
st_zip <- st_zip[st_zip$St_zip != "",]
df1 <- setNames(data.frame(do.call("rbind",strsplit(st_zip,split=' '))),c("State","Zip"))
pin <- pin[pin$pin != "",]
df2 <- data.frame(cbind(df1,pin))
OUTPUT:
State Zip pin
1 AK 99577 907-481-1670
2 AL 35007 205-620-0360
3 CT 06854 860-409-0404
4 ID 83338 208-324-4333
5 ME 4330 207-623-8223
6 VT 5495 802-878-5233
Thank you #Rahul. Both would be great. At least can you show me how to do it with Notepad++?
Extraction using Notepad++
Well first copy your whole data in a file.
Go to Find by pressing Ctrl + F. This will open search dialog box. Choose Replace tab search with regex ([A-Z]{2}\s*\d{4,5})$ and replace with \n-\1-\n. This will search for state abbreviation and ZIP code and place them in new line with - as prefix and suffix.
Now go to Mark tab. Check Bookmark Line checkbox then search with -(.*?)- and press Mark All. This will mark state abb and ZIP which are in newlines with -.
Now go to Search --> Bookmark --> Remove Unmarked Lines
Finally search with ^-|-$ and replace with empty string.
Update
So now there will be phone numbers too ? In that case you only have to remove $ from regex in step 2. Regex to use will be ([A-Z]{2}\s*\d{4,5}). Rest all steps will be same.
I have rows/observations in a comma delimited file that ideally should have 55 columns. But there are fields such as addresses that have an an extra comma within them. Such as Manhattan, New York should be one field Manhattan, New York but I get two fields Manhattan and New York when I read the file which increases the number of columns.
Is there anyway I can delete such observations using R or any tool such as Delimit or Excel?
I would eventually like to load this file into R for analysis.
I agree my question is similar to Delete lines or rows in a tab-delimited file, by number of cells in that lines or rows but I am looking for a solution in R.
Input
Name, Address, DOB
John, Manhattan, New York, 2/8/1990
Jacob, Arizona, 9/10/2012
Smith, New Jersey, 8/10/2016
Expected Output
Name, Address, DOB
Jacob, Arizona, 9/10/2012
Smith, New Jersey, 8/10/2016
In general, I do not advocate doing what you want to do, which is to throw away records. Nonetheless, if this is what you want to do, you could do so as follows.
Assuming your data is stored as a text in a file called foo, you can use the count.fields function to count fields defined by the presence of sep. Then just omit them from the readLines function.
text <-
"Name, Address, DOB
John, Manhattan, New York, 2/8/1990
Jacob, Arizona, 9/10/2012
Smith, New Jersey, 8/10/2016
"
cat(text, file = "foo", sep = ",")
fields <- count.fields("foo", sep = ",")
readLines("foo")[fields == 3]
One option would be to read with readLines and then create a quote around the words with sub, and then read the dataset with read.table
lines1 <- gsub(",", " ", lines)
lines1[-1] <- sub("^(\\S+)\\s+([^0-9]+\\b)\\s+(\\d+.*)", "\\1 '\\2' \\3",
lines1[-1])
read.table(text=lines1, stringsAsFactors=FALSE, header = TRUE)
# Name Address DOB
#1 John Manhattan New York 2/8/1990
#2 Jacob Arizona 9/10/2012
#3 Smith New Jersey 8/10/2016
data
lines <- readLines("yourfile.txt")
We can count the number of commas in each line and subset the line vector for only those lines that have the expected number of commas:
## read in raw file lines using readLines()
lines1 <- readLines(textConnection('Name, Address, DOB\nJohn, Manhattan, New York, 2/8/1990\nJacob, Arizona, 9/10/2012\nSmith, New Jersey, 8/10/2016\n'));
## subset for lines with the expected number of commas
lines2 <- lines1[2L==sapply(lines1,function(s) nchar(s)-nchar(gsub(',','',s)))];
## result
lines1;
## [1] "Name, Address, DOB"
## [2] "John, Manhattan, New York, 2/8/1990"
## [3] "Jacob, Arizona, 9/10/2012"
## [4] "Smith, New Jersey, 8/10/2016"
## [5] ""
lines2;
## [1] "Name, Address, DOB"
## [2] "Jacob, Arizona, 9/10/2012"
## [3] "Smith, New Jersey, 8/10/2016"
I have a CSV file like
LocationList,Identity,Category
"New York,New York,United States","42","S"
"NA,California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"NA,NA,United States","87","tree"
I want to remove all 'NA' from the 'LocationList' Column
The Desired Result -
LocationList,Identity,Category
"New York,New York,United States","42","S"
"California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"United States","87","tree"
The number of columns are not fixed and they may increase or decrease. Also I want to write to the CSV file without quotes and without escaping for the 'LocationList' column.
How to achieve the following in R?
New to R any help is appreciated.
In this case, you just want to replace the NA, with nothing. However, this is not the standard way to remove NA values.
Assuming dat is your data, use
dat$LocationList <- gsub("^(NA,)+", "", dat$LocationList)
Try:
my.data <- read.table(text='LocationList,Identity,Category
"New York,New York,United States","42","S"
"NA,California,United States","89","lyt"
"Hartford,Connecticut,United States","879","polo"
"San Diego,California,United States","45454","utyr"
"Seattle,Washington,United States","uytr","69"
"NA,NA,United States","87","tree"', header=T, sep=",")
my.data$LocationList <- gsub("NA,", "", my.data$LocationList)
my.data
# LocationList Identity Category
# 1 New York,New York,United States 42 S
# 2 California,United States 89 lyt
# 3 Hartford,Connecticut,United States 879 polo
# 4 San Diego,California,United States 45454 utyr
# 5 Seattle,Washington,United States uytr 69
# 6 United States 87 tree
If you get rid of the quotes when you write to a conventional csv file, you will have trouble reading the data in later. This is because you have commas already inside each value in the LocationList variable, so you would have commas both in the middle of fields and marking the break between fields. You might try using write.csv2() instead, which will indicate new fields with a semicolon ;. You could use:
write.csv2(my.data, file="myFile.csv", quote=FALSE, row.names=FALSE)
Which yields the following file:
LocationList;Identity;Category
New York,New York,United States;42;S
California,United States;89;lyt
Hartford,Connecticut,United States;879;polo
San Diego,California,United States;45454;utyr
Seattle,Washington,United States;uytr;69
United States;87;tree
(I now notice that the values for Identity and Category for row 5 are presumably messed up. You may want to switch those before writing to file.)
x <- my.data[5, 2]
my.data[5, 2] <- my.data[5, 3]
my.data[5, 2] <- x
rm(x)