Fuzzy Matching/Join Two Data Frames of University Names [duplicate] - r

This question already has answers here:
How can I match fuzzy match strings from two datasets?
(7 answers)
Closed 4 years ago.
I have a list of university names input with spelling errors and inconsistencies. I need to match them against an official list of university names to link my data together.
I know fuzzy matching/join is my way to go, but I'm a bit lost on the correct method. Any help would be greatly appreciated.
d<-data.frame(name=c("University of New Yorkk", "The University of South
Carolina", "Syracuuse University", "University of South Texas",
"The University of No Carolina"), score = c(1,3,6,10,4))
y<-data.frame(name=c("University of South Texas", "The University of North
Carolina", "University of South Carolina", "Syracuse
University","University of New York"), distance = c(100, 400, 200, 20, 70))
And I desire an output that has them merged together as closely as possible
matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina",
"Syracuuse University","University of South Texas","The University of No Carolina"),
correctmatch = c("University of New York", "University of South Carolina",
"Syracuse University","University of South Texas", "The University of North Carolina"))

I use adist() for things like this and have little wrapper function called closest_match() to help compare a value against a set of "good/permitted" values.
library(magrittr) # for the %>%
closest_match <- function(bad_value, good_values) {
distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
as.numeric() %>%
setNames(good_values)
distances[distances == min(distances)] %>%
names()
}
sapply(d$name, function(x) closest_match(x, y$name)) %>%
setNames(d$name)
University of New Yorkk The University of South\n Carolina Syracuuse University
"University of New York" "University of South Carolina" "University of New York"
University of South Texas The University of No Carolina
"University of South Texas" "University of South Carolina"
adist() utilizes Levenshtein distance to compare similarity between two strings.

Related

Is there syntactic sugar to define a data frame in R

I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.
The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):
One starts off with an original list in raw form:
Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"
which eventually leads to this R code:
us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia ","Wisconsin","Wyoming")
us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")
us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)
which is supremely ugly and unmaintainable as the State -> Region mapping is effectively
obfuscated.
I actually wrote a Perl program to generate the above from the original list.
In Perl, one would write things like:
#!/usr/bin/perl
$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia "=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };
which is maintainable because one can verify the mapping on a line-by-line basis.
There must be something similar to this Perl goodness in R?
It seems a bit open for interpretation as to what you're looking for.
Is the mapping meant to be a function type thing such that a call would return the region or vise-versa (Eg. similar to a function call mapping("alabama") => "Gulf")?
I am reading the question to be more looking for a dictionary style storage, which in R could be obtained with an equivalent named list
ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"
This could be performed in a single call
mapping <- list("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
Which makes it very simple to check that the mapping is working as expected. This doesn't convert nicely to a 2 column data.frame however, which we would then obtain using
mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))
note "not nicely" simply means as.data.frame doesn't translate the input into a 2 column output.
Alternatively just using a named character vector would likely be fine too
mapping_c <- c("Alabama" = "Gulf",
"Arizona" = "Four States",
...,
"Wisconsin" = "Midwest",
"Wyoming" = "North")
which would be converted to a data.frame in almost the same fashion
mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
Note however a slight difference in the two choices of storage. While referencing an entry that exists using either single brackets [ or double brackets [[ works just fine
#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania
TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE
But when referencing unknown entries these differ slightly in behaviour
#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]
Error in mapping_c[["hello"]] : subscript out of bounds
If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
If us_region is a named list...
us_region <- list(Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia")
Then,
us_state_to_region_map <- data.frame(us_state = names(us_region),
us_region = sapply(us_region, c),
stringsAsFactors = FALSE)
and, as a bonus, you also get the states as row names...
us_state_to_region_map
us_state us_region
Alabama Alabama Gulf
Arizona Arizona Four States
Arkansas Arkansas Texas
California California South West
Colorado Colorado Four States
Connecticut Connecticut New England
Delaware Delaware Columbia
As #tim-biegeleisen says it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and open it in R (with readxl::read_excel(), readr::read_csv(),...).
However if you want to write it directly in your code you can use tibble:tribble() which allows to write a dataframe row by row :
library(tibble)
tribble(~ state, ~ region,
"Alabama", "Gulf",
"Arizona", "Four States",
(...)
"Wisconsin", "Midwest",
"Wyoming", "North")
One option could be to create a data frame in wide format (your initial list makes it very straightforward and this maintains a very obvious mapping. It is actually quite similar to your Perl code), then transform it to the long format:
library(tidyr)
data.frame(
Alabama = "Gulf",
Arizona = "Four States",
Arkansas = "Texas",
California = "South West",
Colorado = "Four States",
Connecticut = "New England",
Delaware = "Columbia",
stringsAsFactors = FALSE
) %>%
gather("us_state", "us_region") # transform to long format

R: Match character vector with another character vector [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
Can't wrap my mind around this task
Consider a data frame "usa" with 3 columns, "title", "city" and "state" (reproducible):
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <-data.frame(title, city, state)
Resulting in this:
title city state
1 Events in Chicago, September
2 California hotels
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
Now what I am trying to do is to fill the STATE variable for the first 2 observations, which are now missing.
TITLE variable contains a clue: either a city or a state is mentioned in each of the entries.
I need to do the following:
Check if any word in "title" column matches any observation found in "city" and "state" columns;
If any word in "title" matches any observation in "state", paste the same state for the given title's observation;
If any word in "title" matches any observation in "city", paste the matched city's state in the "state" column of the title's row.
So what I want to get eventually is this:
title city state
1 Events in Chicago, September IL
2 California hotels California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
In other words, in the second row the title contained a word "California", so a matching state was found from state vector. However, in the first line, the word "Chicago" was the key, and there was another entry in the data frame (row 4), which linked Chicago to "IL" state, so "IL" has to be pasted in the first row of "state" column.
Waiting for the community's ideas :) Thanks!
I would recommend you use the stringr package; specifically, a function called str_extract.
If you have a complete list of cities, e.g. city <- c("Los Angeles", "Chicago"), then you can make it into regular expression using paste(city, collapse = '|'). That will give you: 'Los Angeles|Chicago'. With str_extract, you can extract that city (will extract the first one it sees, and an NA if none appear). Here's the complete code. Note: this only works if your dataframe is a data_frame (tibble), not a data.frame (not totally sure why, haven't looked into it)
library(tidyverse)
library(stringr)
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <-data_frame(title, city, state) # notice this is a data_frame not data.frame
cities <- paste(c("Los Angeles", "Chicago"), collapse = '|')
states <- paste(c("California", "IL"), collapse = '|')
usa <- usa %>%
mutate(city = ifelse(city == '', str_extract(title, cities), city),
state = ifelse(state == '', str_extract(title, states), state))
This results in:
# A tibble: 4 x 3
title city state
<chr> <chr> <chr>
1 Events in Chicago, September Chicago <NA>
2 California hotels <NA> California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL

R: match each string in a list against another list while returning adjacent value

This is something that I can do easily in Excel. But Im being confounded by R.
I would like to assign country names to a long list of strings ("affiliation").
c("Department of Psychiatry and Behavioural Sciences, University College London Medical School, UK.",
"", "Ty Dewi Sant School of Nursing, University Hospital of Wales, College of Medicine, Cardiff.",
"University of Massachusetts Medical Center.", "Older Women's League.",
"Kimberly Quality Care, Boston, MA.", "Michaux Manor Living Center, Fayetteville, PA.",
"Florida Diagnostic and Learning Resources System, University of South Florida, Tampa 33613.",
"", "Bigel Institute for Health Policy, Brandeis University, Waltham, MA.",
"", "York Health Authority.", "Southern Illinois University, Edwardsville.",
"St. Joseph's Hospital, Memphis, TN.", "Long Term Home Care of the Frail Elderly Foundation, New York City.",
"Catholic University of America, Washington, DC.", "Mercy Health Center, Oklahoma City, OK.",
"", "Visiting Nurse Service of New York.", "RespiteCare Center, Evanston, IL.",
"Camden and Islington HA.", "National Advisory Council on Aging.",
"Visiting Nurse Service of New York.", "American Health Care Association, Washington, DC.",
"HealthCare Partners Medical Group, Los Angeles, CA 90015, USA.",
"Tad Publishing Company, Peoria, IL, USA.", "Child Health Investment Partnership, Roanoke, VA, USA.",
"School of Public Health, State University of New York, Albany 12237, USA.",
"Bundoora Extended Care Centre.", "", "", "Family Respite Center, Falls Church, VA, USA.",
"", "University of Victoria.", "", "Homemaker Health Aide Service of the National Capital Area.",
"West Lambeth Health Authority, London SE1 7EH.", "Bon Secours Hospital/Villa Maria Nursing Center, North Miami, FL 33161.",
"Alzheimer's Disease and Related Disorders Association, Syracuse, NY.",
"Alzheimer's Association, Washington DC.", "South Carolina Commission on Aging, Columbia.",
"University of New Mexico College of Nursing.", "Department of Human Development and Family Studies, University of Alabama, Tuscaloosa.",
"Ballard Health Care Residence, Des Plaines, IL.", "Bowman Gray School of Medicine of Wake Forest University, Winston-Salem, NC.",
"Case Western Reserve University.", "School of Public and Environmental Administration, Indiana University, Indianapolis 46202.",
"Manor HealthCare Corp, Silver Spring, MD.", "Relationship Builders, Napa, CA.",
"", "", "Medical University of South Carolina, USA.", "Tokyo Metropolitan Institute of Gerontology, Itabashi, Japan. tatsuro#tmig.or.jp",
"Medical University of South Carolina, USA.", "Royal Hospital for Sick Children, Bristol.",
"Barefield, Ennis, Co. Clare., Ireland.", "North Georgia College, Dahlonega 30597, USA.",
"Institute for Psychology (I), University of Wurzburg, Germany.",
"Camborne Redruth Community Hospital, Cornwall, United Kingdom.",
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",
"", "", "", "", "", "", "", "", "", "", "Institute of Child Health and Great Ormond Street Hospital for Children NHS Trust, London, UK.",
"Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada. carol.cohen#sunnybrook.on.ca",
"Boston University School of Social Work, MA 02215, USA.", "",
"Neurosciences Unit, General Infirmary at Leeds.", "", "", "School of Kang-Ning Junior College of Nursing, Nei-Hu, Taiwan, ROC.",
"College of Nursing, South Dakota State University, USA.", "Department of Geriatric Medicine, University of Manchester, UK.",
"Southern Illinois University, Department of Social Work, Edwardsville 62026-1450, USA.",
"Redlands Community College, El Reno, Oklahoma, USA.", "", "",
"Department of Geriatric Medicine, Alexandra Hospital, Singapore.",
"School of Nursing and Midwifery, Department of Gerontological and Continuing Care Nursing, University of Sheffield, Sheffield, England. Liz.hanson#act.shef.ac.uk",
"", "State University of New York, Health Science Center at Syracuse, 13210, USA. HAMR#mailbox.hscsyr.edu",
"Div. of Active Palliative Care, Todachuo General Hospital.",
"Children and Young People's Kidney Unit, Nottingham City Hospital, NHS Trust, UK.",
"School of Nursing & Midwifery, Department of Gerontological & Continuing Care Nursing, University of Sheffield. liz.hanson#act.shef.ac.uk",
"Harrington Memorial Hospital, Southbridge, MA, USA.", "", "Department of Curriculum and Instruction, Iowa State University, Ames, 50011. USA.",
"Children & Young People's Kidney Unit, Nottingham City Hospital, U.K.",
"School of Social Work, Boston University, MA 02215, USA. freedman#bu.edu",
"Royal Free Hospital, London, UK.", "Humboldt State University, Department of Nursing, Arcata, CA, USA.",
"Department of Psychiatry, The University of Queensland, Mental Health Centre, Royal Brisbane Hospital, Herston, Australia. davidk#psychiatry.uq.edu.au",
"Centre for Evidence Based Nursing, University of York, Heslington, York, Nth Yorkshire, UK, YO1 5DG. cat4#york.ac.uk",
"School of Nursing, University of British Columbia, Vancouver. magenta#bc.sympatico.ca",
"Medisinsk avdeling, Lovisenberg Diakonale Sykehus, Oslo.", "School of Nursing, Yale University, USA.",
"Centre de la Mémoire, Hôpital Roger Salengro, Centre Hospitalier Universitaire, Lille.",
"University of Ulster and Eastern Health and Social Services Board, Ulster, Northern Ireland. r.mcconkey#ulst.ac.uk",
"Thames Valley Family Practice Research Unit, Department of Family Medicine's Centre for Studies in Family Medicine, University of Western Ontario (UWO), London. jbbrown#julian.uwo.ca",
"", "", "Department of Special Education, University of Nijmegen, The Netherlands. A.Hendriks#ped.kun.nl",
"European Institute of Health and Medical Sciences, University of Surrey, Guildford, England.",
"California State University School of Nursing, Chico, USA.")
Within each string may or may not be a substring referring to a location, which itself may refer to country. The intended output is a dataframe as follows:
Affiliation[1], matchedCountry
Affiliation[2], matchedCountry
...
Affiliation[n], matchedCountry
"matchedCountry" is meant to be assessed based on several lists (university, UK cities, US states, etc.) and NA is allowed. And some lists only return ISO codes.
Based on the feedback thus far (thanks #rbm), I have managed a solution (see answer section) that does the job quite well. That said, I am sure performance could still be improved. Thanks.
References:
Simultaneously merge multiple data.frames in a list
R grepl: quickly match multiple strings against multiple substrings, returning all matches
R grep: Match one string against multiple patterns
Speedy test on R data frame to see if row values in one column are inside another column in the data frame
Extract & combine multiple substrings using multiple patterns from some but not all strings contained in list & return to list in R
How to detect substrings from multiple lists within a string in R
Here's a solution, that checks various lists of substrings against each item in a master list, and then depending on the list, returns either: a) the original substring, b) an adjacent substring, or c) a fixed/pre-defined value. The result is the original table with a "country" column appended.
These conditions are represented in the sample code provided.
edit: it seems the "domain" look-up isnt working as intended. Im not quite sure how to troubleshoot/fix it, but that beyond the scope of this question, I guess...
######### GENERATE COUNTRY ID #############
library("stringr")
library(RCurl)
## Download country lists and perpetrate
countryList <- getURL("https://raw.githubusercontent.com/umpirsky/country-list/master/country/icu/en_US/country.csv")
usstatesList <- getURL("https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv")
ukcitiesList <- getURL("https://raw.githubusercontent.com/encyclopediaio/list-of-cities-in-the-uk/master/src/uk_cities.csv")
ukcountryList <- getURL("https://raw.githubusercontent.com/Gibbs/UK-Postcodes/master/postcodes.csv")
universitiesList <- getURL("https://raw.githubusercontent.com/endSly/world-universities-csv/master/world-universities.csv")
countryList <- read.csv(text = countryList, stringsAsFactors=FALSE)
usstatesList <- read.csv(text = usstatesList, stringsAsFactors=FALSE)
ukcitiesList <- read.csv(text = ukcitiesList, stringsAsFactors=FALSE)
ukcountryList <- read.csv(text = ukcountryList, stringsAsFactors=FALSE)
universitiesList <- read.csv(text = universitiesList, header = FALSE, stringsAsFactors=FALSE)
## Generate affiliation list from ronbun data
affiliationList <- pub.data$Affiliation1
## Generate email domains column and add to countryList
domains <- function(x)
{
x <- tolower(x)
x <- paste0(".", x)
return(x)
}
countryList <- data.frame(countryList[c("name", "iso")], domain = domains(countryList$iso), stringsAsFactors = FALSE)
## Add country names to universitiesList as V4
universitiesList <- data.frame(universitiesList, V4="", stringsAsFactors = FALSE)
i = 0
for (v in universitiesList$V1)
{
tryCatch({
i = i + 1
if (sum(str_detect(v, countryList$iso)) > 0) {
universitiesList$V4[i] <- countryList$name[which(str_detect(v, countryList$iso))]
}
}, error=function(e){})
}
### on to the main show
df <- data.frame(affiliationList, CountryISO="", CountryNAME="", stringsAsFactors = FALSE)
i = 0
for (v in affiliationList)
{
tryCatch({
i = i + 1
if (sum(str_detect(v, countryList$name)) > 0) {
df$CountryISO[i] <- countryList$iso[which(str_detect(v, countryList$name))]
df$CountryNAME[i] <- countryList$name[which(str_detect(v, countryList$name))]
}
if (sum(str_detect(v, ukcitiesList$name)) > 0) {
df$CountryISO[i] <- "GB"
df$CountryNAME[i] <- "United Kingdom"
}
if (sum(str_detect(v, ukcountryList$country_string)) > 0) {
df$CountryISO[i] <- "GB"
df$CountryNAME[i] <- "United Kingdom"
}
if (sum(str_detect(v, usstatesList$State)) > 0 || sum(str_detect(v, usstatesList$Abbreviation)) > 0) {
df$CountryISO[i] <- "US"
df$CountryNAME[i] <- "United States"
}
if (sum(str_detect(v, countryList$domain)) > 0) {
df$CountryISO[i] <- countryList$iso[which(str_detect(v, countryList$domain))]
df$CountryNAME[i] <- countryList$name[which(str_detect(v, countryList$domain))]
}
if (sum(str_detect(v, universitiesList$V2)) > 0) {
df$CountryISO[i] <- universitiesList$V1[which(str_detect(v, universitiesList$V2))]
df$CountryNAME[i] <- universitiesList$V1[which(str_detect(v, universitiesList$V4))]
}
}, error=function(e){})
}
return(df)
Thanks for all the help provided!

How to submit multiple street2coordinates requests? (R) [duplicate]

I am fed up with Google's geocoding, and decided to try an alternative. The Data Science Toolkit (http://www.datasciencetoolkit.org) allows you to Geocode unlimited number of addresses. R has an excellent package that serves as a wrapper for its functions (CRAN:RDSTK). The package has a function called street2coordinates() that interfaces with the Data Science Toolkit's geocoding utility.
However, the RDSTK function street2coordinates() does not work if you try to geocode something simple like City, Country. In the following example I will try to use the function to get the latitude and longitude for the city of Phoenix:
> require("RDSTK")
> street2coordinates("Phoenix+Arizona+United+States")
[1] full.address
<0 rows> (or 0-length row.names)
The utility from the data science toolkit works perfectly. This is the URL request that gives the answer:
http://www.datasciencetoolkit.org/maps/api/geocode/json?sensor=false&address=Phoenix+Arizona+United+States
I am interested in geocoding multiple addresses (which complete addresses and city names). I know that the Data Science Toolkit URL will work well.
How do I interface with the URL and get multiple latitudes and longitudes into a data frame with the addresses?
Here is an sample dataset:
dff <- data.frame(address=c(
"Birmingham, Alabama, United States",
"Mobile, Alabama, United States",
"Phoenix, Arizona, United States",
"Tucson, Arizona, United States",
"Little Rock, Arkansas, United States",
"Berkeley, California, United States",
"Duarte, California, United States",
"Encinitas, California, United States",
"La Jolla, California, United States",
"Los Angeles, California, United States",
"Orange, California, United States",
"Redwood City, California, United States",
"Sacramento, California, United States",
"San Francisco, California, United States",
"Stanford, California, United States",
"Hartford, Connecticut, United States",
"New Haven, Connecticut, United States"
))
Like this:
library(httr)
library(rjson)
data <- paste0("[",paste(paste0("\"",dff$address,"\""),collapse=","),"]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,sapply(json,
function(x) c(long=x$longitude,lat=x$latitude)))
geocode
# long lat
# San Francisco, California, United States -117.88536 35.18713
# Mobile, Alabama, United States -88.10318 30.70114
# La Jolla, California, United States -117.87645 33.85751
# Duarte, California, United States -118.29866 33.78659
# Little Rock, Arkansas, United States -91.20736 33.60892
# Tucson, Arizona, United States -110.97087 32.21798
# Redwood City, California, United States -117.88536 35.18713
# New Haven, Connecticut, United States -72.92751 41.36571
# Berkeley, California, United States -122.29673 37.86058
# Hartford, Connecticut, United States -72.76356 41.78516
# Sacramento, California, United States -121.55541 38.38046
# Encinitas, California, United States -116.84605 33.01693
# Birmingham, Alabama, United States -86.80190 33.45641
# Stanford, California, United States -122.16750 37.42509
# Orange, California, United States -117.85311 33.78780
# Los Angeles, California, United States -117.88536 35.18713
This takes advantage of the POST interface to the street2coordinates API (documented here), which returns all the results in 1 request, rather than using multiple GET requests.
The absence of Phoenix seems to be a bug in the street2coordinates API. If you go the API demo page and try "Phoenix, Arizona, United States", you get a null response. However, as your example shows, using their "Google-style Geocoder" does give a result for Phoenix. So here's a solution using repeated GET requests. Note that this runs much slower.
geo.dsk <- function(addr){ # single address geocode with data sciences toolkit
require(httr)
require(rjson)
url <- "http://www.datasciencetoolkit.org/maps/api/geocode/json"
response <- GET(url,query=list(sensor="FALSE",address=addr))
json <- fromJSON(content(response,type="text"))
loc <- json['results'][[1]][[1]]$geometry$location
return(c(address=addr,long=loc$lng, lat= loc$lat))
}
result <- do.call(rbind,lapply(as.character(dff$address),geo.dsk))
result <- data.frame(result)
result
# address long lat
# 1 Birmingham, Alabama, United States -86.801904 33.456412
# 2 Mobile, Alabama, United States -88.103184 30.701142
# 3 Phoenix, Arizona, United States -112.0733333 33.4483333
# 4 Tucson, Arizona, United States -110.970869 32.217975
# 5 Little Rock, Arkansas, United States -91.207356 33.608922
# 6 Berkeley, California, United States -122.29673 37.860576
# 7 Duarte, California, United States -118.298662 33.786594
# 8 Encinitas, California, United States -116.846046 33.016928
# 9 La Jolla, California, United States -117.876447 33.857515
# 10 Los Angeles, California, United States -117.885359 35.187133
# 11 Orange, California, United States -117.853112 33.787795
# 12 Redwood City, California, United States -117.885359 35.187133
# 13 Sacramento, California, United States -121.555406 38.380456
# 14 San Francisco, California, United States -117.885359 35.187133
# 15 Stanford, California, United States -122.1675 37.42509
# 16 Hartford, Connecticut, United States -72.763564 41.78516
# 17 New Haven, Connecticut, United States -72.927507 41.365709
The ggmap package includes support for geocoding using either Google or Data Science Toolkit, the latter with their "Google-style geocoder". This is quite slow for multiple addresses, as noted in the earlier answer.
library(ggmap)
result <- geocode(as.character(dff[[1]]), source = "dsk")
print(cbind(dff, result))
# address lon lat
# 1 Birmingham, Alabama, United States -86.80190 33.45641
# 2 Mobile, Alabama, United States -88.10318 30.70114
# 3 Phoenix, Arizona, United States -112.07404 33.44838
# 4 Tucson, Arizona, United States -110.97087 32.21798
# 5 Little Rock, Arkansas, United States -91.20736 33.60892
# 6 Berkeley, California, United States -122.29673 37.86058
# 7 Duarte, California, United States -118.29866 33.78659
# 8 Encinitas, California, United States -116.84605 33.01693
# 9 La Jolla, California, United States -117.87645 33.85751
# 10 Los Angeles, California, United States -117.88536 35.18713
# 11 Orange, California, United States -117.85311 33.78780
# 12 Redwood City, California, United States -117.88536 35.18713
# 13 Sacramento, California, United States -121.55541 38.38046
# 14 San Francisco, California, United States -117.88536 35.18713
# 15 Stanford, California, United States -122.16750 37.42509
# 16 Hartford, Connecticut, United States -72.76356 41.78516
# 17 New Haven, Connecticut, United States -72.92751 41.36571

How to skip a paste() argument when its value is NA in R

I have a data frame with the columns city, state, and country. I want to create a string that concatenates: "City, State, Country". However, one of my cities doesn't have a State (has a NA instead). I want the string for that city to be "City, Country". Here is the code that creates the wrong string:
# define City, State, Country
city <- c("Austin", "Knoxville", "Salk Lake City", "Prague")
state <- c("Texas", "Tennessee", "Utah", NA)
country <- c("United States", "United States", "United States", "Czech Rep")
# create data frame
dff <- data.frame(city, state, country)
# create full string
dff["string"] <- paste(city, state, country, sep=", ")
When I display dff$string, I get the following. Note that the last string has a NA,, which is not needed:
> dff["string"]
string
1 Austin, Texas, United States
2 Knoxville, Tennessee, United States
3 Salk Lake City, Utah, United States
4 Prague, NA, Czech Rep
What do I do to skip that NA,, including the sep = ", ".
The alternative is to just fix it up afterwards:
gsub("NA, ","",dff$string)
#[1] "Austin, Texas, United States"
#[2] "Knoxville, Tennessee, United States"
#[3] "Salk Lake City, Utah, United States"
#[4] "Prague, Czech Rep"
Alternative #2, is to use apply once you have your data.frame called dff:
apply(dff, 1, function(x) paste(na.omit(x),collapse=", ") )
Late to the party, but unite provides a one-step approach:
dff %>% unite("string", c(city, state, country), sep=", ", remove = FALSE, na.rm = TRUE)
string city state country
1 Austin, Texas, United States Austin Texas United States
2 Knoxville, Tennessee, United States Knoxville Tennessee United States
3 Salk Lake City, Utah, United States Salk Lake City Utah United States
4 Prague, Czech Rep Prague <NA> Czech Rep

Resources