Finding all string matches from another dataframe in R - r

I am relatively new to R.
I have a dataframe locs that has 1 variable V1 and looks like:
V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal
and another dataframe cities that has two variables that look like this:
city country
edmonton canada
san carlos spain
los angeles united states
santa maria united states
tokyo japan
madrid spain
santa maria portugal
lisbon portugal
I want to create two new variables in locs that record every city whose name appears in V1, together with its country, so that locs looks like this:
V1 city country
edmonton general hospital edmonton canada
hospital san carlos, madrid spain san carlos, madrid spain
hospital of santa maria, lisbon, portugal santa maria, lisbon portugal, united states
A few things to note: V1 may have multiple country names. Also, if there is a repeat country (for instance, both san carlos and madrid are in spain), then I only want one instance of the country.
Please advise.
Thanks.

A solution using tidyverse and stringr. locs2 is the final output.
library(tidyverse)
library(stringr)
locs2 <- locs %>%
rowwise() %>%
mutate(city = list(str_match(V1, cities$city))) %>%
unnest() %>%
drop_na(city) %>%
left_join(cities, by = "city") %>%
group_by(V1) %>%
summarise_all(funs(toString(sort(unique(.)))))
Result
locs2 %>% as.data.frame()
V1 city country
1 cardiovascular institute, hospital san carlos, madrid spain madrid, san carlos spain
2 edmonton general hospital edmonton canada
3 hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
DATA
library(tidyverse)
locs <- data_frame(V1 = c("edmonton general hospital",
"cardiovascular institute, hospital san carlos, madrid spain",
"hospital of santa maria, lisbon, portugal"))
cities <- read.table(text = "city country
edmonton canada
'san carlos' spain
'los angeles' 'united states'
'santa maria' 'united states'
tokyo japan
madrid spain
'santa maria' portugal
lisbon portugal",
header = TRUE, stringsAsFactors = FALSE)
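The same idea can be sketched in base R without the tidyverse. This is only an illustration under stated assumptions: the helper name match_cities is made up here, the data is retyped from the question, and matching is a plain substring test, so a short city name could also match inside a longer word.

```r
# Example data retyped from the question
locs <- data.frame(
  V1 = c("edmonton general hospital",
         "cardiovascular institute, hospital san carlos, madrid spain",
         "hospital of santa maria, lisbon, portugal"),
  stringsAsFactors = FALSE)
cities <- data.frame(
  city    = c("edmonton", "san carlos", "los angeles", "santa maria",
              "tokyo", "madrid", "santa maria", "lisbon"),
  country = c("canada", "spain", "united states", "united states",
              "japan", "spain", "portugal", "portugal"),
  stringsAsFactors = FALSE)

# Hypothetical helper: for one string, find every city it contains
# (plain substring match) and collapse the unique cities and countries
match_cities <- function(v) {
  hit <- vapply(cities$city, function(ct) grepl(ct, v, fixed = TRUE), logical(1))
  c(city    = toString(sort(unique(cities$city[hit]))),
    country = toString(sort(unique(cities$country[hit]))))
}

# One row per entry of V1, one column each for city and country
matched <- t(vapply(locs$V1, match_cities, character(2)))
locs2 <- data.frame(V1 = locs$V1, matched, row.names = NULL,
                    stringsAsFactors = FALSE)
locs2
```

Because countries are collected for every row of cities that matches, "santa maria" contributes both portugal and united states, as in the result shown above.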

Related

Find city names within affiliations and add them with their corresponding countries in new columns of a dataframe

I have a dataframe ‘dfa’ of affiliations that contains city names, for which the country is sometimes missing, e.g. like rows 4 (BAGHDAD) and 7 (BERLIN):
dfa <- data.frame(affiliation=c("DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
"DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.",
"DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES",
"COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
"DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.",
"LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.",
"DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
"INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.",
"DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.",
"DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN",
"DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.",
"LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA."))
I have now a second dataframe ‘dfb’ that contains a list of cities and corresponding country, some of which are present in 'dfa':
dfb <- data.frame(city=c("AGRI","AMSTERDAM","ATHENS","AUCKLAND","BUENOS AIRES","BEIJING","BAGHDAD","BANGKOK","BERLIN","BUDAPEST"),
country=c("TURKEY","NETHERLANDS","GREECE","NEW ZEALAND","ARGENTINA","CHINA","IRAQ","THAILAND","GERMANY","HUNGARY"))
How can I add cities and corresponding countries in two new columns only for cities that are present in both ‘dfa’ and ‘dfb’ (and even when the country is missing, as for BAGHDAD and BERLIN)?
NB: the goal is to match full city names, not parts of them. Below in row 7, an example of what is not wanted: the AGRI city of TURKEY is inappropriately associated with BERLIN because this row includes the word 'AGRICULTURE'.
Is there a simple way to do that, ideally using dplyr?
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN AGRI TURKEY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
A combination of str_extract and either a join or another str_extract is one option to get you there.
str_extract returns the first match it encounters; paste0 collapses the city names into one long alternation ("or") pattern to check against.
library(dplyr)
library(stringr)
dfa %>%
mutate(city = str_extract(affiliation, paste0("\\b", dfb$city, "\\b", collapse = "|"))) %>% # \\b on both sides of every city name
left_join(dfb, by = "city")
Edit: added word boundaries in the paste0 so that only whole city names are matched and partial matching is avoided.
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN BERLIN GERMANY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
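To see why the word boundaries matter, here is a small base R sketch. It uses a shortened, illustrative subset of the cities and affiliations from the question and regexpr() in place of str_extract(); the variable names are made up for this example.

```r
cities <- c("AGRI", "BERLIN", "BAGHDAD")
affil  <- c("DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
            "COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
            "DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN")

# Wrap each name in \b ... \b, then join the alternatives with |
pattern <- paste0("\\b", cities, "\\b", collapse = "|")

# regexpr() locates the first match in each string (-1 when there is none)
m <- regexpr(pattern, affil, perl = TRUE)

# Keep NA for affiliations without any match
first_city <- rep(NA_character_, length(affil))
first_city[m != -1] <- regmatches(affil, m)
first_city
```

With the boundaries in place, AGRI no longer matches inside AGRICULTURE, so the first row resolves to BERLIN.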
This approach accounts for the possibility that an affiliation could match more than one city name.
library(tidyverse)
dfa %>%
mutate(city = map(affiliation, ~ str_extract(.x, dfb$city))) %>%
unnest(cols = c(city)) %>%
group_by(affiliation) %>%
mutate(nmatches = sum(!is.na(city))) %>%
filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
ungroup() %>%
left_join(dfb, by = "city") %>%
mutate(country_match = str_detect(affiliation, country))
# A tibble: 12 x 5
affiliation city nmatches country country_match
<chr> <chr> <int> <chr> <lgl>
1 DEPARTMENT OF PHARMACY,… AMSTE… 1 NETHER… TRUE
2 DEPARTMENT OF BIOCHEMIS… NA 0 NA NA
3 DEPARTMENT OF PATHOLOGY… NA 0 NA NA
4 COLLEGE OF EDUCATION FO… BAGHD… 1 IRAQ FALSE
5 DEPARTMENT OF CLINICAL … BEIJI… 1 CHINA TRUE
6 LABORATORY OF MOLECULAR… NA 0 NA NA
7 BERLIN INSTITUTE OF HEA… BERLIN 1 GERMANY FALSE
8 INSTITUTE OF LABORATORY… NA 0 NA NA
9 DEPARTMENT OF CLINICAL … BANGK… 1 THAILA… TRUE
10 DEPARTMENT OF BIOLOGY, … NA 0 NA NA
11 DEPARTMENT OF MOLECULAR… NA 0 NA NA
12 LABORATORY OF CARDIOVAS… BEIJI… 1 CHINA TRUE
You could then double-check cases where nmatches is 1 but country_match is FALSE; when there are 2 or more matches, you can keep the one where country_match is TRUE.
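A base R sketch of the same multi-match idea, with word boundaries added so that partial names are not counted. The data is a shortened, illustrative subset of dfa and dfb, and the object name hits is an assumption of this example.

```r
dfb <- data.frame(city    = c("AGRI", "AMSTERDAM", "BERLIN"),
                  country = c("TURKEY", "NETHERLANDS", "GERMANY"),
                  stringsAsFactors = FALSE)
affiliation <- c(
  "DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
  "DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
  "DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN")

# One output row per (affiliation, matched city); a single NA row if no city matches
hits <- do.call(rbind, lapply(affiliation, function(a) {
  found <- vapply(dfb$city,
                  function(ct) grepl(paste0("\\b", ct, "\\b"), a, perl = TRUE),
                  logical(1))
  if (!any(found)) {
    return(data.frame(affiliation = a, city = NA_character_,
                      country = NA_character_, country_match = NA,
                      stringsAsFactors = FALSE))
  }
  data.frame(affiliation = a,
             city = dfb$city[found],
             country = dfb$country[found],
             # Does the affiliation text also mention the matched country?
             country_match = unname(vapply(dfb$country[found], grepl,
                                           logical(1), x = a, fixed = TRUE)),
             stringsAsFactors = FALSE)
}))
hits[, c("city", "country", "country_match")]
```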

Extract cell with AND without commas in R

I'm trying to extract the city and state from the address column into 2 separate columns labeled City and State in R. This is what my data looks like:
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ")) %>%
separate(address, c("City", "State"), sep=",")
I tried using the separate function but that only gets the ones with commas. Any ideas on how to do this for both cases?
There is a pattern at the end (space, letter, letter) that I could exploit before removing any commas, but I am not sure what the syntax would look like using grep.
Starting from your df
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
> df
address
1 Los Angeles, CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia, PA
6 Trenton, NJ
It's possible to use gsub to subset the string like this:
> city=gsub(',','',gsub("(.*).{3}","\\1",df[,1]))
> city
[1] "Los Angeles" "Pittsburgh" "Miami" "Baltimore" "Philadelphia"
[6] "Trenton"
> state=gsub(".*(\\w{2})","\\1",df[,1])
> state
[1] "CA" "PA" "FL" "MD" "PA" "NJ"
df=data.frame(City=city,State=state)
> df
City State
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
This is a little unorthodox but it works well. It assumes that all states are 2 characters long and that there is at least 1 space between the city and state. Commas are ignored.
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
df$city <- substring(sub(",","",df$address),1,nchar(sub(",","",df$address))-3)
df$state <- substring(as.character(df$address),nchar(as.character(df$address))-1,nchar(as.character(df$address)))
df <- within(df,rm(address))
output:
city state
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
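Both pieces can also be pulled out with one regular expression and two calls to sub(). This is a sketch that assumes every address ends in whitespace followed by exactly two capital letters, with an optional comma before the whitespace.

```r
address <- c("Los Angeles, CA", "Pittsburgh PA", "Miami FL",
             "Baltimore MD", "Philadelphia, PA", "Trenton, NJ")

# Group 1: the city (lazy, so it stops before an optional comma)
# Group 2: the two capital letters at the end of the string
pattern <- "^(.*?),?\\s+([A-Z]{2})$"

df <- data.frame(City  = sub(pattern, "\\1", address, perl = TRUE),
                 State = sub(pattern, "\\2", address, perl = TRUE),
                 stringsAsFactors = FALSE)
df
```

An address that does not fit the pattern is returned unchanged in both columns, so it is worth checking for rows where City equals State.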

R group or aggregate

I would like to do a group_by or aggregate. I have something like:
> head(affiliation_clean)
Affiliation_ID Affiliation_Name City Country
1 000001 New Mexico State University Las Cruces Las Cruces United States
2 000001 New Mexico State University Las Cruces Las Cruces <NA>
3 000001 New Mexico State University Las Cruces <NA> <NA>
4 000002 Palo Alto Research Center Incorporated Palo Alto <NA>
5 000002 Palo Alto Research Center Incorporated <NA> United States
6 000002 Palo Alto Research Center Incorporated <NA> <NA>
Grouping by "Affiliation_ID" and taking the longest string of "Affiliation_Name", "City" and "Country", I would like to get:
> head(affiliation_clean)
Affiliation_ID Affiliation_Name City Country
1 000001 New Mexico State University Las Cruces Las Cruces United States
2 000002 Palo Alto Research Center Incorporated Palo Alto United States
Thanks in advance.
Here is a dplyr solution based on your description to select the longest string of each Affiliation_ID and column.
library(dplyr)
dat2 <- dat %>%
group_by(Affiliation_ID) %>%
summarise_all(funs(.[which.max(nchar(.))][1]))
dat2
# # A tibble: 2 x 4
# Affiliation_ID Affiliation_Name City Country
# <int> <chr> <chr> <chr>
# 1 1 New Mexico State University Las Cruces Las Cruces United States
# 2 2 Palo Alto Research Center Incorporated Palo Alto United States
DATA
dat <-read.table(text = " Affiliation_ID Affiliation_Name City Country
1 '000001' 'New Mexico State University Las Cruces' 'Las Cruces' 'United States'
2 '000001' 'New Mexico State University Las Cruces' 'Las Cruces' NA
3 '000001' 'New Mexico State University Las Cruces' NA NA
4 '000002' 'Palo Alto Research Center Incorporated' 'Palo Alto' NA
5 '000002' 'Palo Alto Research Center Incorporated' NA 'United States'
6 '000002' 'Palo Alto Research Center Incorporated' NA NA",
header = TRUE, stringsAsFactors = FALSE)
Assuming that there is a single unique 'City'/'Country' for each 'Affiliation_ID' and 'Affiliation_Name', group by the first two columns and then get the unique non-NA element of all other columns with summarise_all.
library(dplyr)
affiliation_clean %>%
group_by(Affiliation_ID, Affiliation_Name) %>%
summarise_all(funs(unique(.[!is.na(.)])) )
# A tibble: 2 x 4
# Groups: Affiliation_ID [?]
# Affiliation_ID Affiliation_Name City Country
# <chr> <chr> <chr> <chr>
#1 000001 New Mexico State University Las Cruces Las Cruces United States
#2 000002 Palo Alto Research Center Incorporated Palo Alto United States
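For comparison, the "longest non-missing string per group" rule can be written as a small helper and applied with base R's aggregate(). The helper name longest is an assumption of this sketch; the data is retyped from the question. In recent dplyr versions funs() is deprecated, and the same helper could be passed to summarise(across(...)) instead.

```r
dat <- data.frame(
  Affiliation_ID   = c("000001", "000001", "000001", "000002", "000002", "000002"),
  Affiliation_Name = rep(c("New Mexico State University Las Cruces",
                           "Palo Alto Research Center Incorporated"), each = 3),
  City    = c("Las Cruces", "Las Cruces", NA, "Palo Alto", NA, NA),
  Country = c("United States", NA, NA, NA, "United States", NA),
  stringsAsFactors = FALSE)

# Hypothetical helper: longest non-NA string, or NA if the group has none
longest <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else x[which.max(nchar(x))]
}

# Apply the helper to every non-ID column within each Affiliation_ID
dat2 <- aggregate(dat[c("Affiliation_Name", "City", "Country")],
                  by = list(Affiliation_ID = dat$Affiliation_ID),
                  FUN = longest)
dat2
```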

Extract city names from large text with R

Hello, I have an intriguing question here. Suppose that I have a long character string that includes city names among other text.
test<-"Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology"
My goal is to extract all the city names from it. I attempted this with the following five steps.
#replace | with ,
test2<-str_replace_all(test, "[|]", ", ")
# Remove punctuation from data
test3<-gsub("[[:punct:]\n]","",test2)
# Split data at word boundaries
test4 <- strsplit(test3, " ")
# Load data from package maps
data(world.cities)
# Match on cities in world.cities
citiestest<-lapply(test4, function(x)x[which(x %in% world.cities$name)])
The result may be correct
citiestest
[[1]]
[1] "San" "Boston" "Boston" "Washington" "York"
[6] "York" "Kettering" "York" "York" "Charlotte"
[11] "Carolina" "Cleveland" "Nashville" "Seattle" "Seattle"
[16] "Washington" "Asan"
But as you can see, I cannot deal with cities that have two-word names (New York, San Diego, etc.) because they are split apart. Of course, fixing this issue manually is not an option, as my real dataset is quite large.
A rather different approach which may be more or less useful, depending on the data at hand: Pass each address to a geocoding API, then pull the city out of the response.
library(tidyverse)
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\\|')
places <- places %>%
mutate(geodata = map(string, ~{Sys.sleep(1); ggmap::geocode(.x, output = 'all')}))
places <- places %>%
mutate(address_components = map(geodata, list('results', 1, 'address_components')),
address_components = map(address_components,
~as_data_frame(transpose(.x)) %>%
unnest(long_name, short_name)),
city = map(address_components, unnest),
city = map_chr(city, ~{
l <- set_names(.x$long_name, .x$types);
coalesce(l['locality'], l['administrative_area_level_1'])
}))
Comparing the result and the original,
places %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 St. Louis Washington University, Saint Louis, Missouri, USA
#> 6 New York Mount SInai Medical Center, New York, New York, USA
#> 7 New York Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 Goyang-si National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 서울특별시 Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 Amsterdam VU MEDISCH CENTRUM; Dept. of Medical Oncology
...well, it's not perfect. The biggest issue is that cities are classified as localities for US cities, but administrative_area_level_1 (which in the US is the state) for South Korea. Unlike the other Korean rows, 12 actually has a locality, which is not the city listed (which is in the response as an administrative region). Further, "Seoul" in line 13 was inexplicably translated to Korean.
The good news is that "Saint Louis" has been shortened to "St. Louis", which is a more standardized form, and the last row has been located in Amsterdam.
Scaling such an approach would likely require paying Google a little for the usage of their API.
Here is a base R option using strsplit and sub:
terms <- unlist(strsplit(test, "\\s*\\|\\s*"))
cities <- sapply(terms, function(x) gsub("[^,]+,\\s*([^,]+),.*", "\\1", x))
cities[1:3]
Ucsd Medical Center, San Diego, California, USA
"San Diego"
Yale Cancer Center, New Haven, Connecticut, USA
"New Haven"
Massachusetts General Hospital., Boston, Massachusetts, USA
"Boston"
Demo
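The regular expression above effectively takes the second comma-separated field of each entry. A sketch of the same rule with strsplit(), on a shortened version of the input, makes that explicit and returns NA for entries that contain no comma rather than the whole string. The variable names are made up for this example.

```r
test <- paste("Ucsd Medical Center, San Diego, California, USA",
              "Asan Medical Center., Seoul, Korea, Republic of",
              "VU MEDISCH CENTRUM; Dept. of Medical Oncology",
              sep = "|")

# One element per institution, then one character vector of fields per element
terms  <- strsplit(test, "\\s*\\|\\s*")[[1]]
fields <- strsplit(terms, "\\s*,\\s*")

# Second field if there is one, otherwise NA
city <- vapply(fields,
               function(f) if (length(f) >= 2) f[2] else NA_character_,
               character(1))
city
```

Like the regex, this assumes the city is always the second field. An entry with an extra leading field, such as "Severance Hospital, Yonsei University Health System, Seoul, ...", would return the wrong part.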
Another way that works with no loop
pat="(,.\\w+,)|(,.\\w+.\\w+,)"
gsub("(,\\s)|,","",regmatches(m<-strsplit(test,"\\|")[[1]],regexpr(pat,m)))
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York"
[8] "Charlotte" "Cleveland" "Nashville" "Seattle" "Gyeonggi-do" "Seoul" "Seoul"
[15] "Seoul" "Seoul"
Other solutions given on this page fail in some cases: for example, there is a town called Gyeonggi-do, which the other solutions do not return. Also, some of the code returns the whole string as the town.
What I would do:
test2 <- str_replace_all(test, "[|]", ", ") #Same as you did
test3 <- unlist(strsplit(test2, split=", ")) #Turns string into a vector
check <- test3 %in% world.cities$name #Check if element vectors match list of city names
test3[check == TRUE] #Select vector elements that match list of city names
[1] "San Diego" "New Haven" "Boston" "Boston" "Saint Louis" "New York" "New York" "New York"
[9] "New York" "Charlotte" "Cleveland" "Nashville" "Seattle" "Washington"
To expand on @hrbrmstr's comment above, you can use the Stanford CoreNLP library to do named entity recognition (NER) on each string. The big caveat to such an undertaking is that most NER annotators only go so far as to annotate a token as a "location" or equivalent, which is not very useful when cities are mixed in with states and countries. Beyond its usual NER annotator, though, CoreNLP does contain an extra regex NER annotator that can increase NER granularity to the level of cities.
In R, you can use the coreNLP package to run the annotators. It does require rJava, which in some cases can be hard to configure. You'll also need to download the actual (pretty big) library, which can be done with coreNLP::downloadCoreNLP, and, should you like, set the CORENLP_HOME environment variable in ~/.Renviron to the installation path.
Also note that this approach is fairly slow and resource-intensive, as it's doing a lot of work in Java.
library(tidyverse)
library(coreNLP)
# set which annotators to use
writeLines('annotators = tokenize, ssplit, pos, lemma, ner, regexner\n', 'corenlp.properties')
initCoreNLP(libLoc = Sys.getenv('CORENLP_HOME'), parameterFile = 'corenlp.properties')
unlink('corenlp.properties') # clean up
places <- data_frame(string = "Ucsd Medical Center, San Diego, California, USA|Yale Cancer Center, New Haven, Connecticut, USA|Massachusetts General Hospital., Boston, Massachusetts, USA|Dana Farber Cancer Institute, Boston, Massachusetts, USA|Washington University, Saint Louis, Missouri, USA|Mount SInai Medical Center, New York, New York, USA|Memorial Sloan Kettering Cancer Center, New York, New York, USA|Carolinas Healthcare System, Charlotte, North Carolina, USA|University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA|Vanderbilt University Medical Center, Nashville, Tennessee, USA|Seattle Cancer Care Alliance, Seattle, Washington, USA|National Cancer Center, Gyeonggi-do, Korea, Republic of|Seoul National University Hospital, Seoul, Korea, Republic of|Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of|Korea University Guro Hospital, Seoul, Korea, Republic of|Asan Medical Center., Seoul, Korea, Republic of|VU MEDISCH CENTRUM; Dept. of Medical Oncology") %>%
separate_rows(string, sep = '\\|') # separate strings
places_ner <- places %>%
mutate(annotations = map(string, annotateString),
tokens = map(annotations, 'token'),
tokens = map(tokens, group_by, token_id = data.table::rleid(NER)),
city = map(tokens, filter, NER == 'CITY'),
city = map(city, summarise, city = paste(token, collapse = ' ')),
city = map_chr(city, ~if(nrow(.x) == 0) NA_character_ else .x$city))
which returns
places_ner %>% select(city, string)
#> # A tibble: 17 x 2
#> city string
#> <chr> <chr>
#> 1 San Diego Ucsd Medical Center, San Diego, California, USA
#> 2 New Haven Yale Cancer Center, New Haven, Connecticut, USA
#> 3 Boston Massachusetts General Hospital., Boston, Massachusetts, USA
#> 4 Boston Dana Farber Cancer Institute, Boston, Massachusetts, USA
#> 5 NA Washington University, Saint Louis, Missouri, USA
#> 6 NA Mount SInai Medical Center, New York, New York, USA
#> 7 NA Memorial Sloan Kettering Cancer Center, New York, New York, USA
#> 8 Charlotte Carolinas Healthcare System, Charlotte, North Carolina, USA
#> 9 Cleveland University Hospitals Case Medical Center; Seidman Cancer Center, Cleveland, Ohio, USA
#> 10 Nashville Vanderbilt University Medical Center, Nashville, Tennessee, USA
#> 11 Seattle Seattle Cancer Care Alliance, Seattle, Washington, USA
#> 12 NA National Cancer Center, Gyeonggi-do, Korea, Republic of
#> 13 Seoul Seoul National University Hospital, Seoul, Korea, Republic of
#> 14 Seoul Severance Hospital, Yonsei University Health System, Seoul, Korea, Republic of
#> 15 Seoul Korea University Guro Hospital, Seoul, Korea, Republic of
#> 16 Seoul Asan Medical Center., Seoul, Korea, Republic of
#> 17 NA VU MEDISCH CENTRUM; Dept. of Medical Oncology
Failings:
"New York" is recognized twice as a state or province ("New York City" would be recognized as such).
"Saint Louis" is recognized as a person. "St. Louis" is recognized as a location on my installation, but an online version of the same library recognizes the original as a location, so this may be a version issue.
"Gyeonggi-do" isn't recognized, though "Seoul" is. I'm not sure how granular the regexner annotator goes, but given (as the name suggests) it works by regex, there is a size/familiarity threshold under which it doesn't contain a regex. You can add your own regex to it if it's worthwhile, though.
The cleanNLP package also supports Stanford CoreNLP (and a couple other backends) with an easier-to-use interface (setup is still hard), but as far as I can tell doesn't allow the use of regexner at the moment due to how it initializes CoreNLP.
You can use tidytext to extract bigrams, split them into single words, and intersect both with a vector of known city names to get the common part.
library(tidyverse)
library(tidytext)
# city is a vector containing pre-defined city name
t2 <- test %>% as_tibble() %>%
unnest_tokens(bigram,value,token = 'ngrams', n =2) %>%
separate(bigram,c('word1','word2'),remove = F)
city_get <- c(intersect(t2$bigram,city),intersect(t2$word1,city))%>%
unique()
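The bigram-and-intersect idea can be sketched in base R as well. The city vector below is a hypothetical lookup list, and the input is a shortened version of test. Note that unnest_tokens() lowercases tokens by default, so in either version the city names need to be lowercase to match.

```r
test <- paste("Yale Cancer Center, New Haven, Connecticut, USA",
              "Massachusetts General Hospital., Boston, Massachusetts, USA",
              sep = "|")
city <- c("new haven", "boston", "new york")  # hypothetical lookup vector

# Lowercase, replace punctuation with spaces, split into words
words <- strsplit(tolower(gsub("[[:punct:]]", " ", test)), "\\s+")[[1]]
words <- words[nzchar(words)]

# Adjacent word pairs, e.g. "new haven"
bigrams <- paste(head(words, -1), tail(words, -1))

# Two-word cities come from the bigrams, one-word cities from the words
city_get <- unique(c(intersect(bigrams, city), intersect(words, city)))
city_get
```

One limitation of this sketch is that bigrams are formed across commas and entry boundaries, so a pair such as "center new" is generated too; it is simply discarded because it is not in the lookup vector.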

R: Mission impossible? How to assign "New York" to a county

I run into problems assigning a county to some city places. When querying via the acs package
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
, you can see that "New York", for instance, has a bunch of counties. So do Los Angeles, Portland, Oklahoma, Columbus etc. How can such data be assigned to a "county"?
The following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works when the query returns a single county name.
Script
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
library(tigris)
library(acs)
library(dplyr) # needed for bind_rows(), left_join() and %>%
data(fips_codes) # FIPS codes with state, code, county information
GeoLookup <- lapply(dat,function(x) {
geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
To retain data, the left_join might better be matched as "look for the county.name that contains place.name (without the appended suffix such as 'city'), or choose the first item by default". It would be great to see how this could be done.
In general, I assume there is no better way than this approach?
Thanks for your help!
What about something like the code below to create a "long" data frame for joining? We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame where each county.name now gets its own row.
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York NA <NA> <NA>
2 36 New York 51000 New York city Bronx County
3 36 New York 51000 New York city Kings County
4 36 New York 51000 New York city New York County
5 36 New York 51000 New York city Queens County
6 36 New York 51000 New York city Richmond County
7 36 New York 51011 New York Mills village Oneida County
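The split-and-stack step does not depend on the acs lookup, so it can be illustrated on a small hand-built data frame in base R. The data below only imitates the shape of the geo.lookup output shown above; it is not the result of a real query.

```r
# Hand-built stand-in for the geo.lookup() result (illustrative only)
dat <- data.frame(
  state.name  = "New York",
  place.name  = c("New York city", "New York Mills village"),
  county.name = c("Bronx County, Kings County, New York County, Queens County, Richmond County",
                  "Oneida County"),
  stringsAsFactors = FALSE)

# Split each county string into its parts (a list with one vector per row)
counties <- strsplit(dat$county.name, ", ", fixed = TRUE)

# Repeat each original row once per county, then attach the counties
long <- dat[rep(seq_len(nrow(dat)), lengths(counties)),
            c("state.name", "place.name")]
long$county.name <- unlist(counties)
rownames(long) <- NULL
long
```

Once each county has its own row, a join on state.name and county.name against the FIPS table can find a match for every county.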
UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
df <- map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
})
df
state state.name place place.name county.name
1 36 New York 51000 New York city Bronx County
2 36 New York 51000 New York city Kings County
3 36 New York 51000 New York city New York County
4 36 New York 51000 New York city Queens County
5 36 New York 51000 New York city Richmond County
6 36 New York 51011 New York Mills village Oneida County
7 25 Massachusetts 7000 Boston city Suffolk County
8 25 Massachusetts 7000 Boston city Suffolk County
9 6 California 20802 East Los Angeles CDP Los Angeles County
10 6 California 39612 Lake Los Angeles CDP Los Angeles County
11 6 California 44000 Los Angeles city Los Angeles County
12 48 Texas 19000 Dallas city Collin County
13 48 Texas 19000 Dallas city Dallas County
14 48 Texas 19000 Dallas city Denton County
15 48 Texas 19000 Dallas city Kaufman County
16 48 Texas 19000 Dallas city Rockwall County
17 48 Texas 40516 Lake Dallas city Denton County
18 6 California 20956 East Palo Alto city San Mateo County
19 6 California 55282 Palo Alto city Santa Clara County
UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....
map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest %>%
slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
})
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York 51000 New York city New York County
2 36 New York 51011 New York Mills village Oneida County
3 25 Massachusetts 7000 Boston city Suffolk County
4 6 California 20802 East Los Angeles CDP Los Angeles County
5 6 California 39612 Lake Los Angeles CDP Los Angeles County
6 6 California 44000 Los Angeles city Los Angeles County
7 48 Texas 19000 Dallas city Dallas County
8 48 Texas 40516 Lake Dallas city Denton County
9 6 California 20956 East Palo Alto city San Mateo County
10 6 California 55282 Palo Alto city Santa Clara County
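The selection rule itself can be isolated in a small function, which makes it easier to test. This is a variant sketch rather than the code above: the name pick_county is made up, and it compares names for exact equality instead of using grepl(), falling back to the first county when no name matches.

```r
# Hypothetical helper: choose the county named like the place,
# or the first county if none has the same name
pick_county <- function(place_name, counties) {
  place  <- sub(" [A-Za-z]*$", "", place_name)  # "New York city" -> "New York"
  county <- sub(" County$", "", counties)       # "New York County" -> "New York"
  same <- which(county == place)
  if (length(same) > 0) counties[same[1]] else counties[1]
}

pick_county("New York city",
            c("Bronx County", "Kings County", "New York County",
              "Queens County", "Richmond County"))
pick_county("East Palo Alto city", "San Mateo County")
```

As noted above, the " County" suffix would need extending (for example to " Parish") before this works for every state.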
Could you prep the data by using something like the below code?
new_york_data <- geo.lookup(state = "NY", place = "New York")
prep_data <- function(full_data){
output <- data.frame()
for(row in 1:nrow(full_data)){
new_rows <- replicateCounty(full_data[row, ])
output <- plyr::rbind.fill(output, new_rows)
}
return(output)
}
replicateCounty <- function(row){
counties <- str_trim(unlist(str_split(row$county.name, ",")))
output <- data.frame(state = row$state,
state.name = row$state.name,
county.name = counties,
place = row$place,
place.name = row$place.name)
return(output)
}
prep_data(new_york_data)
It's a little messy and you'll need the plyr and stringr packages. Once you prep the data, you should be able to join on it.
