I'm trying to extract the city and state from the Address column into 2 separate columns labeled City and State in r. This is what my data looks like:
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ")) %>%
separate(address, c("City", "State"), sep=",")
I tried using the separate function but that only gets the ones with commas. Any ideas on how to do this for both cases?
There is a pattern at the end (space, letter, letter) which I can use to exploit and then remove any commas but not sure how the syntax would work using grep.
Starting from your df
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
> df
address
1 Los Angeles, CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia, PA
6 Trenton, NJ
It's possible to use gsub to subset the string like this:
> city=gsub(',','',gsub("(.*).{3}","\\1",df[,1]))
> city
[1] "Los Angeles" "Pittsburgh" "Miami" "Baltimore" "Philadelphia"
[6] "Trenton"
> state=gsub(".*(\\w{2})","\\1",df[,1])
> state
[1] "CA" "PA" "FL" "MD" "PA" "NJ"
df=data.frame(City=city,State=state)
> df
City State
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
This is a little unorthodox but it works well. It assumes that all states are 2 characters long and that there is at least 1 space between the city and state. Comma's are ignored
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
df$city <- substring(sub(",","",df$address),1,nchar(sub(",","",df$address))-3)
df$state <- substring(as.character(df$address),nchar(as.character(df$address))-1,nchar(as.character(df$address)))
df <- within(df,rm(address))
output:
city state
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
If a have a dataframe with a column with a pattern that is: a row with a string with a name on it followed by other rows containing names and a sequence of numbers. This is repeated all over the dataframe.
I want to create a new column base on the condition that if it founds a row with a string that starts with the word "CANTON" (and without a number), copy the string without the first word (CANTON) trough all the next rows of the new column until it appears another row with a string that starts with the word "CANTON" where it has to take the new string, and copy the new last word in the new column.
An example of the dataframe is the next one:
datos <- data.frame(sitio = c("CANTON SAN JOSE", "01 Carmen", "02 Merced",
"03 Hospital", "04 Catedral", "05 San Franscisco",
"CANTON ESCAZU", "01 Escazu", "02 San Antonio", "03 San Rafael" ),
area = c(44.62, 1.49, 2.29, 3.38, 2.31, 2.85, 34.49, 4.38,
16.99, 13.22))
datos
And the expected result would be:
expected_result <-data.frame(
sitio = c("CANTON SAN JOSE", "01 Carmen", "02 Merced",
"03 Hospital", "04 Catedral", "05 San Franscisco",
"CANTON ESCAZU", "01 Escazu", "02 San Antonio",
"03 San Rafael" ),
area = c(44.62, 1.49, 2.29, 3.38, 2.31, 2.85, 34.49, 4.38,
16.99, 13.22),
canton = c("SAN JOSE", "SAN JOSE", "SAN JOSE", "SAN JOSE",
"SAN JOSE", "SAN JOSE", "ESCAZU", "ESCAZU", "ESCAZU",
"ESCAZU"))
I have tried to do many for loops, subsets and joining dataframes without success. I cannot make clear this pattern in a instruction in R.
Thanks for helping!
Hope this works for you data:
x <- gsub('^CANTON ', '', datos$sitio)
x[!grepl('^CANTON ', datos$sitio)] <- NA
datos$canton <- ave(x, cumsum(!is.na(x)), FUN = function(xx) xx[1])
# > datos
# sitio area canton
# 1 CANTON SAN JOSE 44.62 SAN JOSE
# 2 01 Carmen 1.49 SAN JOSE
# 3 02 Merced 2.29 SAN JOSE
# 4 03 Hospital 3.38 SAN JOSE
# 5 04 Catedral 2.31 SAN JOSE
# 6 05 San Franscisco 2.85 SAN JOSE
# 7 CANTON ESCAZU 34.49 ESCAZU
# 8 01 Escazu 4.38 ESCAZU
# 9 02 San Antonio 16.99 ESCAZU
# 10 03 San Rafael 13.22 ESCAZU
Hello i have an intriguing question here. Suppose that i have a long character which includes city names between others.
test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"
My goal is to extract all the city names of it. And I achieved it by following five steps.
#replace | with ,
test2<-str_replace_all(test, "[|]", ", ")
# Remove punctuation from data
test3<-gsub("[[:punct:]\n]","",test2)
# Split data at word boundaries
test4 <- strsplit(test3, " ")
# Load data from package maps
data(world.cities)
# Match on cities in world.cities
citiestest<-lapply(test4, function(x)x[which(x %in% world.cities$name)])
The result may be correct
citiestest
[[1]]
[1] "San" "Boston" "Boston" "Washington" "York"
[6] "York" "Kettering" "York" "York" "Charlotte"
[11] "Carolina" "Cleveland" "Nashville" "Seattle" "Seattle"
[16] "Washington" "Asan"
But as you can see I cannot deal with cities with two-words name (New York, San Diego etc.) as they are separated. Of course fix this issue manually is not an option as my real dataset is quite large.
A rather different approach which may be more or less useful, depending on the data at hand: Pass each address to a geocoding API, then pull the city out of the response.
library(tidyverse)
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\\|')
places <- places %>%
mutate(geodata = map(string, ~{Sys.sleep(1); ggmap::geocode(.x, output = 'all')}))
places <- places %>%
mutate(address_components = map(geodata, list('results', 1, 'address_components')),
address_components = map(address_components,
~as_data_frame(transpose(.x)) %>%
unnest(long_name, short_name)),
city = map(address_components, unnest),
city = map_chr(city, ~{
l <- set_names(.x$long_name, .x$types);
coalesce(l['locality'], l['administrative_area_level_1'])
}))
Comparing the result and the original,
places %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 St. Louis Washington University, Saint Louis, Missouri, USA
#> 6 New York Mount SInai Medical Center, New York, New York, USA
#> 7 New York Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 Goyang-si National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 Amsterdam VU MEDISCH CENTRUM; Dept. of Medical Oncology
...well, it's not perfect. The biggest issue is that cities are classified as localities for US cities, but administrative_area_level_1 (which in the US is the state) for South Korea. Unlike the other Korean rows, 12 actually has a locality, which is not the city listed (which is in the response as an administrative region). Further, "Seoul" in line 13 was inexplicably translated to Korean.
The good news is that "Saint Louis" has been shortened to "St. Louis", which is a more standardized form, and the last row has been located in Amsterdam.
Scaling such an approach would likely require paying Google a little for the usage of their API.
Here is a base R option using strsplit and sub:
terms <- unlist(strsplit(test, "\\s*\\|\\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\\s*([^,]+),.*", "\\1", x))
cities[1:3]
Ucsd Medical Center, San Diego, California, USA
"San Diego"
Yale Cancer Center, New Haven, Connecticut, USA
"New Haven"
Massachusetts General Hospital., Boston, Massachusetts, USA
"Boston"
Demo
Another way that works with no loop
pat="(,.\\w+,)|(,.\\w+.\\w+,)"
gsub("(,\\s)|,","",regmatches(m<-strsplit(test,"\\|")[[1]],regexpr(pat,m)))
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York"
[8] "Charlotte" "Cleveland" "Nashville" "Seattle" "Gyeonggi-do" "Seoul" "Seoul"
[15] "Seoul" "Seoul"
The other results given in this page do fail: for example, there is a town called Greonggi-do This is not given in the other solutions. Also some of the codes give the whole string as the town
What I would do:
test2 <- str_replace_all(test, "[|]", ", ") #Same as you did
test3 <- unlist(strsplit(test2, split=", ")) #Turns string into a vector
check <- test3 %in% world.cities$name #Check if element vectors match list of city names
test3[check == TRUE] #Select vector elements that match list of city names
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York" "New York"
[9] "New York" "Charlotte" "Cleveland" "Nashville" "Seattle" "Washington"
To expand on #hrbrmstr's comment above, you can use the Stanford CoreNLP library to do named entity recognition (NER) on each string. The big caveat to such an undertaking is that most NER annotators only go so far as to annotate a token as a "location" or equivalent, which is not very useful when cities are mixed in with states and countries. Beyond its usual NER annotator, though, CoreNLP does contain an extra regex NER annotator that can increase NER granularity to the level of cities.
In R, you can use the coreNLP package to run the annotators. It does require rJava, which in some cases can be hard to configure. You'll also need to download the actual (pretty big) library, which can be done with coreNLP::downloadCoreNLP, and, should you like, set the CORENLP_HOME environment variable in ~/.Renviron to the installation path.
Also note that this approach is fairly slow and resource-intensive, as it's doing a lot of work in Java.
library(tidyverse)
library(coreNLP)
# set which annotators to use
writeLines('annotators = tokenize, ssplit, pos, lemma, ner, regexner\n', 'corenlp.properties')
initCoreNLP(libLoc = Sys.getenv('CORENLP_HOME'), parameterFile = 'corenlp.properties')
unlink('corenlp.properties') # clean up
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\\|') # separate strings
places_ner <- places %>%
mutate(annotations = map(string, annotateString),
tokens = map(annotations, 'token'),
tokens = map(tokens, group_by, token_id = data.table::rleid(NER)),
city = map(tokens, filter, NER == 'CITY'),
city = map(city, summarise, city = paste(token, collapse = ' ')),
city = map_chr(city, ~if(nrow(.x) == 0) NA_character_ else .x$city))
which returns
places_ner %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 NA Washington University, Saint Louis, Missouri, USA
#> 6 NA Mount SInai Medical Center, New York, New York, USA
#> 7 NA Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 NA National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 Seoul Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 NA VU MEDISCH CENTRUM; Dept. of Medical Oncology
Failings:
"New York" is recognized twice as a state or province ("New York City" would be recognized as such).
"Saint Louis" is recognized as a person. "St. Louis" is recognized as a location on my installation, but an online version of the same library recognizes the original as a location, so this may be a version issue.
"Gyeonggi-do" isn't recognized, though "Seoul" is. I'm not sure how granular the regexner annotator goes, but given (as the name suggests) it works by regex, there is a size/familiarity threshold under which it doesn't contain a regex. You can add your own regex to it if it's worthwhile, though.
The cleanNLP package also supports Stanford CoreNLP (and a couple other backends) with an easier-to-use interface (setup is still hard), but as far as I can tell doesn't allow the use of regexner at the moment due to how it initializes CoreNLP.
You can use the tidytext to extract bigram--> words --> intersect to get the common part
library(tidyverse)
libraty(tidytext)
# city is a vector containing pre-defined city name
t2 <- test %>% as_tibble() %>%
unnest_tokens(bigram,value,token = 'ngrams', n =2) %>%
separate(bigram,c('word1','word2'),remove = F)
city_get <- c(intersect(t2$bigram,city),intersect(t2$word1,city))%>%
unique()