I am currently working with data that is formatted like this:
tribble(
~street1, ~street2, ~county, ~state,
"N BENTON WY", "W TEMPLE ST", "LOS ANGELES", "CA",
"11TH PL", "BLAINE ST", "LOS ANGELES", "CA",
"W 6TH ST", "HOPE ST", "LOS ANGELES", "CA",
"S GRAND AV", "W 18TH ST", "LOS ANGELES", "CA",
"BROADWAY", "5TH ST", "LOS ANGELES", "CA",
)
This corresponds to a dataset containing around 825,000 observations with missing coordinates. These data contain only the names of the nearest cross streets, the county, and the state (note that they do not include street numbers). I need to geocode these observations and recover coordinates so that my final data will look something like this:
tribble(
~street1, ~street2, ~county, ~state, ~latitude, ~longitude
N BENTON WY, W TEMPLE ST, LOS ANGELES, CA, XX.XXXX, -YY.YYYY,
11TH PL, BLAINE ST, LOS ANGELES, CA, XX.XXXX, -YY.YYYY,
W 6TH ST, HOPE ST, LOS ANGELES, CA, XX.XXXX, -YY.YYYY,
S GRAND AV, W 18TH ST, LOS ANGELES, CA, XX.XXXX, -YY.YYYY,
BROADWAY, 5TH ST, LOS ANGELES, CA, XX.XXXX, -YY.YYYY,
)
I have already researched a few possible solutions but haven't found a method that will work.
While the Google Maps API (ggmap package) is very good at identifying coordinates from cross streets as inputs, the cost to geocode this many observations (4.00 USD per 1000 queries according to their website) makes that option infeasible.
I've looked through the documentation of other packages such as RDSTK and tidygeocoder but they don't seem to support API queries using two street names as inputs. The Census Geocoder similarly does not have that option, allowing only single address inputs.
Using the OpenStreetMap API through the osmdata package seemed like a promising option after reading this very detailed StackOverflow answer, but attempting to replicate this code with much bigger bounding boxes has produced runtime errors every time.
See for example the following code using Los Angeles county, following the format of user hugh-allan in the above post:
library(sf)
library(tidyverse)
library(osmdata)
tribble(
~point, ~lat, ~lon,
1, 32.75004, -118.951721,
2, 34.823302, -118.951721,
3, 34.823302, -117.646374,
4, 32.75004, -117.646374,
) %>%
st_as_sf(
coords = c('lon', 'lat'),
crs = 4326
) %>%
{. ->> LA_bounds}
st_bbox(LA_bounds) %>%
opq %>%
add_osm_feature(key = 'highway') %>%
osmdata_sf %>%
`[[`('osm_lines') %>%
{. ->> LA_streets}
If anyone knows how to get around this error with OpenStreetMap or otherwise adjust the syntax of another package to accommodate cross streets and counties as inputs, I would greatly appreciate it.
I don't have a solution for osmdata, but I did try tidygeocoder. If you're looking for batch geocoding without an API key, the only free method would be the US Census Bureau geocoder in tidygeocoder, but it is computationally expensive. To use it, I join street1 and street2 with an ampersand (&), then combine the result with the county and state into a single column called line_address instead of multiple columns:
examples_address <- tibble(line_address= c("N BENTON WY & W TEMPLE ST, LOS ANGELES, CA", "11TH PL & BLAINE ST, LOS ANGELES, CA", "W 6TH ST & HOPE ST, LOS ANGELES, CA", "S GRAND AV & W 18TH ST, LOS ANGELES, CA", "BROADWAY & 5TH ST, LOS ANGELES, CA"))
examples_address1 <- examples_address %>%
tidygeocoder::geocode(address = line_address, method = "census", verbose = TRUE)
examples_address1
The output that I got:
line_address lat long
N BENTON WY & W TEMPLE ST, LOS ANGELES, CA 34.07289 -118.2757
11TH PL & BLAINE ST, LOS ANGELES, CA NA NA
W 6TH ST & HOPE ST, LOS ANGELES, CA 34.04944 -118.2563
S GRAND AV & W 18TH ST, LOS ANGELES, CA 34.03420 -118.2673
BROADWAY & 5TH ST, LOS ANGELES, CA 34.04808 -118.2507
Unfortunately, as you can see above, not all of the rows gave us a lat and long back from the batch query.
We can use method = "arcgis" inside the function to get results for all rows, but for some reason the returned lat and long may differ. See the last entry:
line_address lat long
N BENTON WY & W TEMPLE ST, LOS ANGELES, CA 34.07290 -118.2757
11TH PL & BLAINE ST, LOS ANGELES, CA 34.04517 -118.2716
W 6TH ST & HOPE ST, LOS ANGELES, CA 34.04946 -118.2564
S GRAND AV & W 18TH ST, LOS ANGELES, CA 34.03417 -118.2673
BROADWAY & 5TH ST, LOS ANGELES, CA 34.01587 -118.4927
Note that the arcgis method does not support batch queries in tidygeocoder.
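The line_address column above was typed out by hand. For the full dataset it could be assembled from the existing columns instead. The following is a minimal base R sketch, assuming the columns are named street1, street2, county, and state as in the question; build_line_address is a hypothetical helper, not part of tidygeocoder.

```r
# Hypothetical helper: join two cross streets with " & " and append county and state.
build_line_address <- function(street1, street2, county, state) {
  paste0(trimws(street1), " & ", trimws(street2), ", ",
         trimws(county), ", ", trimws(state))
}

# Two rows from the question's example data.
cross_streets <- data.frame(
  street1 = c("N BENTON WY", "11TH PL"),
  street2 = c("W TEMPLE ST", "BLAINE ST"),
  county  = c("LOS ANGELES", "LOS ANGELES"),
  state   = c("CA", "CA"),
  stringsAsFactors = FALSE
)

cross_streets$line_address <- with(
  cross_streets,
  build_line_address(street1, street2, county, state)
)

# Geocoding only the distinct intersections avoids repeated queries
# when the same cross streets occur many times.
unique_addresses <- unique(cross_streets$line_address)
```

The resulting column can then be passed to tidygeocoder::geocode(address = line_address, ...) as in the code above.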
I recently updated RStudio to version 2022.07.1 on Windows 10.
When I tried different reverse geocoding functions (where the input is a coordinate and the output is an address), they all returned "not found".
Example 1:
library(revgeo)
revgeo(-77.016472, 38.785026)
This is supposed to return "146 National Plaza, Fort Washington, Maryland, 20745, United States of America". But I got
"Getting geocode data from Photon: http://photon.komoot.de/reverse?lon=-77.016472&lat=38.785026"
[[1]]
[1] "House Number Not Found Street Not Found, City Not Found, State Not Found, Postcode Not Found, Country Not Found"
Data from https://github.com/mhudecheck/revgeo
Example 2:
library(tidygeocoder)
library(dplyr)
path <- "filepath"
df <- read.csv (paste (path, "sample.csv", sep = ""))
reverse <- df %>%
reverse_geocode(lat = longitude, long = latitude, method = 'osm',
address = address_found, full_results = TRUE)
reverse
where sample.csv is:
name                  addr                                        latitude  longitude
White House           1600 Pennsylvania Ave NW, Washington, DC    38.89770  -77.03655
Transamerica Pyramid  600 Montgomery St, San Francisco, CA 94111  37.79520  -122.40279
Willis Tower          233 S Wacker Dr, Chicago, IL 60606          41.87535  -87.63576
I expected to get:
name                  addr                                        latitude  longitude   address_found
White House           1600 Pennsylvania Ave NW, Washington, DC    38.89770  -77.03655   White House, 1600, Pennsylvania Avenue Northwest, Washington, District of Columbia, 20500, United States
Transamerica Pyramid  600 Montgomery St, San Francisco, CA 94111  37.79520  -122.40279  Transamerica Pyramid, 600, Montgomery Street, Chinatown, San Francisco, San Francisco City and County, San Francisco, California, 94111, United States
Willis Tower          233 S Wacker Dr, Chicago, IL 60606          41.87535  -87.63576   South Wacker Drive, Printer’s Row, Loop, Chicago, Cook County, Illinois, 60606, United States
But I got
# A tibble: 3 × 5
name addr latitude longitude address_found
<chr> <chr> <dbl> <dbl> <chr>
1 White House 1600 Pennsylvania Ave NW, Wash… 38.9 -77.0 NA
2 Transamerica Pyramid 600 Montgomery St, San Francis… 37.8 -122. NA
3 Willis Tower 233 S Wacker Dr, Chicago, IL 6… 41.9 -87.6 NA
Data source: https://cran.r-project.org/web/packages/tidygeocoder/readme/README.html
However, when I tried
reverse_geo(lat = 38.895865, long = -77.0307713, method = "osm")
I'm able to get
# A tibble: 1 × 3
lat long address
<dbl> <dbl> <chr>
1 38.9 -77.0 Pennsylvania Avenue, Washington, District of Columbia, 20045, United States
I contacted the tidygeocoder developer, who could not find any problem. Details are in https://github.com/jessecambon/tidygeocoder/issues/175
I'm not sure which part is going wrong. Would anyone be willing to try this in their own RStudio?
The updated revgeo needs to be submitted to CRAN. This has nothing to do with RStudio.
Going to http://photon.komoot.de/reverse?lon=-77.016472&lat=38.785026 in my browser also returns an error. However, I searched for the Photon reverse geocoder, and their example uses .io not .de in the URL, and https://photon.komoot.io/reverse?lon=-77.016472&lat=38.785026 works.
Photon also includes a note at the bottom of their examples:
Until October 2020 the API was available under photon.komoot.de. Requests still work as they redirected to photon.komoot.io but please update your apps accordingly.
Seems like that redirect is either broken or deprecated.
The version of revgeo on GitHub already has this change, so you can get a working version by using remotes::install_github("mhudecheck/revgeo")
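Separately from the revgeo URL issue, Example 2 in the question calls reverse_geocode(lat = longitude, long = latitude, ...), that is, with the two columns passed to the opposite arguments. This is offered as something worth checking rather than a confirmed diagnosis. The base R sketch below flags coordinates that fall outside a rough bounding box for the contiguous United States. The function name outside_conus and the box limits are assumptions made for illustration.

```r
# Hypothetical check: TRUE when a point falls outside a rough bounding box
# for the contiguous United States (limits are approximate assumptions).
outside_conus <- function(lat, long) {
  lat_ok  <- lat >= 24 & lat <= 50
  long_ok <- long >= -125 & long <= -66
  !(lat_ok & long_ok)
}

# Coordinates from sample.csv in the question.
sample_coords <- data.frame(
  name      = c("White House", "Transamerica Pyramid", "Willis Tower"),
  latitude  = c(38.89770, 37.79520, 41.87535),
  longitude = c(-77.03655, -122.40279, -87.63576)
)

# Columns passed in the intended order: no row is flagged.
intended <- outside_conus(sample_coords$latitude, sample_coords$longitude)

# Columns passed as in Example 2 (lat = longitude, long = latitude): every row is flagged.
as_in_example <- outside_conus(sample_coords$longitude, sample_coords$latitude)
```

If every row is flagged only in the second call, swapping the arguments back to lat = latitude, long = longitude would be the first thing to try.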
I was working with the googleway package and had a batch of geocoded addresses whose components I needed to extract from a nested list of lists. Loops (not encouraged) and the apply functions both seemed confusing, and I was not sure whether there was a tidy solution. I found that purrr's map function (specifically the pluck function that it calls on lists behind the scenes) could accomplish my goal, so I will share my solution.
Problem:
I need to pull out certain information about the White House such as
Latitude
Longitude
You need to set up your Google Cloud API Key with googleway::set_key(API_KEY), but this is just an example of a nested list that I hope someone working with this package will see.
library(purrr)
library(googleway)
# Addresses for the White House and the Lincoln Memorial
address_vec <- c(
"1600 Pennsylvania Ave NW, Washington, DC 20006",
"2 Lincoln Memorial Cir NW, Washington, DC 20002"
)
address_vec <- pmap(list(address_vec), googleway::google_geocode)
This outputs:
[[1]]
[[1]]$results
address_components
1 1600, Pennsylvania Avenue Northwest, Northwest Washington, Washington, District of Columbia, United States, 20500, 1600, Pennsylvania Avenue NW, Northwest Washington, Washington, DC, US, 20500, street_number, route, neighborhood, political, locality, political, administrative_area_level_1, political, country, political, postal_code
formatted_address geometry.bounds.northeast.lat
1 1600 Pennsylvania Avenue NW, Washington, DC 20500, USA 38.8979
geometry.bounds.northeast.lng geometry.bounds.southwest.lat geometry.bounds.southwest.lng geometry.location.lat
1 -77.03551 38.89731 -77.03796 38.89766
geometry.location.lng geometry.location_type geometry.viewport.northeast.lat geometry.viewport.northeast.lng
1 -77.03657 ROOFTOP 38.89895 -77.03539
geometry.viewport.southwest.lat geometry.viewport.southwest.lng place_id
1 38.89626 -77.03808 ChIJGVtI4by3t4kRr51d_Qm_x58
types
1 establishment, point_of_interest, premise
[[1]]$status
[1] "OK"
[[2]]
[[2]]$results
address_components
1 2, Lincoln Memorial Circle Northwest, Southwest Washington, Washington, District of Columbia, United States, 20037, 2, Lincoln Memorial Cir NW, Southwest Washington, Washington, DC, US, 20037, street_number, route, neighborhood, political, locality, political, administrative_area_level_1, political, country, political, postal_code
formatted_address geometry.location.lat geometry.location.lng
1 2 Lincoln Memorial Cir NW, Washington, DC 20037, USA 38.88927 -77.05018
geometry.location_type geometry.viewport.northeast.lat geometry.viewport.northeast.lng
1 ROOFTOP 38.89062 -77.04883
geometry.viewport.southwest.lat geometry.viewport.southwest.lng place_id
1 38.88792 -77.05152 ChIJgRuEham3t4kRFju4R6De__g
plus_code.compound_code plus_code.global_code types
1 VWQX+PW Washington, DC, USA 87C4VWQX+PW street_address
[[2]]$status
[1] "OK"
Here's some code that I got from the Googleway Vignette:
df <- google_geocode(address = "Flinders Street Station",
key = key,
simplify = TRUE)
geocode_coordinates(df)
# lat lng
# 1 -37.81827 144.9671
It looks like what you need to do is:
df <- google_geocode("1600 Pennsylvania Ave")
geocode_coordinates(df)
The solution I came up with is a custom function that can access any section of the list:
geocode_accessor <- function(df, accessor, ...) {
unlist(map(df, list(accessor, ...)))
}
This has three important parts to understand:
The map function is calling the pluck function for us (it replaces the use of [[ ). You can read more about what is happening here, but just know this lets us access things by name
The "..." in the function's definition as well as in the list allows us to access multiple levels. Again, the use of list() to access further levels in a list is explained in the pluck documentation
The use of unlist converts the list to a vector (what I want in my instance)
Putting this all together, we can get the latitude of the White House & Lincoln Memorial:
geocode_accessor(address_vec, "results", "geometry", "location", "lat")
[1] 38.89766 38.88927
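For readers who want to try the idea without an API key, here is a rough base R counterpart run on mock data. It relies on the fact that `[[` indexes a list recursively when given a vector of names. Both mock_results and nested_accessor are hypothetical names, and the mock uses plain lists only, whereas the real googleway result stores results as a data frame.

```r
# Mock of the nested structure for two addresses (plain lists only;
# the values are taken from the output shown earlier).
mock_results <- list(
  list(
    results = list(geometry = list(location = list(lat = 38.89766, lng = -77.03657))),
    status  = "OK"
  ),
  list(
    results = list(geometry = list(location = list(lat = 38.88927, lng = -77.05018))),
    status  = "OK"
  )
)

# Hypothetical base R counterpart of geocode_accessor():
# collect the names into a path and index each element recursively with [[.
nested_accessor <- function(x, ...) {
  path <- c(...)
  unlist(lapply(x, function(el) el[[path]]))
}

latitudes <- nested_accessor(mock_results, "results", "geometry", "location", "lat")
statuses  <- nested_accessor(mock_results, "status")
```

The purrr version in the answer is more forgiving, since pluck returns NULL for a missing level instead of raising an error.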
I have a column in my dataset db, say db$affiliation, which looks like:
**db$affiliation**
[1] "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"
[2] "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS."
[3] "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
[4] ...
I would like to create a column within the same dataset containing only the name of the city in db$affiliation, such as
**db$cities**
[1] LOS ANGELES
[2] TWENTE
[3] BANGKOK
[4] ...
If multiple city names are available, I'd like the command to return only the last one, if no city names are available I'd like to have NA. How can I do that?
I thought that I could use world.cities$name in data(world.cities) in the maps package but I can not figure out how.
I even tried to split the db$affiliation column such as:
db$affiliation <- gsub("\\[[^\\]]*\\]", "", db$affiliation, perl=TRUE) # remove content within brackets
db$affiliation[2] # check the separator
db <- cSplit(db, 'affiliation', sep=c(", "), type.convert=FALSE) # split after comma
Which results (I've truncated it after affiliation_3) in:
affiliation_1 affiliation_2 affiliation_3
[1] UNIV CALIF LOS ANGELES DEPT GEOG LOS ANGELES
[2] UNIV TWENTE DEPT WATER ENGN & MANAGEMENT DRIENERLOLAAN
[3] CHULALONGKORN UNIV FAC ARCHITECTURE BANGKOK
And then pass:
db$cities <- lapply(db$affiliation_1, function(x)x[which(x %in% world.cities$name)])
But I get an empty column.
Thanks for the help!
There are many cities in your sample string so you may need to think again if you still want to fetch the 'last city' in case multiple cities are found in affiliation column.
library(maps)
data(world.cities)
# sample data
df <- data.frame(affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
"Prem"), stringsAsFactors = FALSE)
# fetch each city and its respective country from the 'affiliation' column
# (grepl matches substrings, so short city names produce many hits)
cities_country <- lapply(gsub("\\[|\\]|[,;]|\\.", "", df$affiliation), function(x)
paste(as.character(world.cities$name[sapply(world.cities$name, grepl, x, ignore.case = TRUE)]),
as.character(world.cities$country.etc[sapply(world.cities$name, grepl, x, ignore.case = TRUE)]),
sep = "_"))
df$cities_country <- lapply(cities_country, function(x) if(identical(x, character(0))) NA_character_ else x)
df
Output is:
affiliation
1 [SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA
2 [VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.
3 [ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.
4 Prem
cities_country
1 Al_Norway, Alle_Switzerland, Allen_Philippines, Allen_USA, Angeles_Costa Rica, Angeles_Philippines, Cali_Colombia, Cot_Costa Rica, Li_Norway, Los Angeles_Chile, Los Angeles_USA, Os_Kyrgyzstan, Os_Norway, U_Micronesia, Usa_Japan
2 Ae_Marshall Islands, Ede_Netherlands, Ede_Nigeria, Enschede_Netherlands, Hede_China, Ine_Marshall Islands, Laa_Austria, Lola_Guinea, Man_Ivory Coast, Mana_French Guiana, Manage_Belgium, Nagem_Luxembourg, Ob_Russia, Ola_Panama, Po_Burkina Faso, U_Micronesia, Van_Turkey, Wa_Ghana, We_New Caledonia
3 Aila_Estonia, Al_Norway, Anan_Japan, Ba_Fiji, Bangkok_Thailand, Hit_Iraq, Ila_Nigeria, Ilan_Taiwan, Long_Thailand, Nan_Thailand, Tsu_Japan, U_Micronesia, Ula_Turkey
4 NA
(Note that in above output I have kept all occurrences of cities and for convenience also suffixed it with their respective countries)
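Many of the hits above (Al, Cali, U, and so on) come from city names matching inside longer words. One way to cut these down, sketched here under stated assumptions, is to require word boundaries around each city name. The helper match_whole_words and the short small_cities vector are hypothetical stand-ins for world.cities$name, and the sketch assumes the city names contain no regex metacharacters.

```r
# Hypothetical helper: keep only city names that occur as whole words in `text`.
# Assumes the city names contain no regex metacharacters.
match_whole_words <- function(text, city_names) {
  hits <- vapply(city_names, function(city) {
    grepl(paste0("\\b", city, "\\b"), text, ignore.case = TRUE, perl = TRUE)
  }, logical(1))
  unname(city_names[hits])
}

# Small stand-in for world.cities$name.
small_cities <- c("Al", "Cali", "Los Angeles", "U", "Bangkok")
affiliation_text <- "UNIV CALIF LOS ANGELES DEPT GEOG LOS ANGELES CA 90095 USA"

whole_word_hits <- match_whole_words(affiliation_text, small_cities)
```

With this input only Los Angeles is kept, because Al, Cali, and U appear only inside CALIF, UNIV, and USA.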
From the few lines you have shown it looks like you might be able to do the following (note you missed aligning the casing):
tmpVec <- sapply(strsplit(db$affiliation, split = ","), function(x) {
cleanVec <- toupper(trimws(x))
cleanVec[max(which(cleanVec %in% toupper(maps::world.cities$name)))]
})
Or put a bit more code into the function to avoid the ugly warnings.
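A version with that extra code might look like the sketch below. It returns NA explicitly when no comma-separated piece matches, so max() is never called on an empty vector. The name last_city and the small known_cities vector are hypothetical; in practice known_cities would be maps::world.cities$name.

```r
# Hypothetical helper: return the last comma-separated piece that exactly
# matches a known city name (case-insensitive), or NA when none matches.
last_city <- function(affiliation, city_names) {
  city_names <- toupper(city_names)
  vapply(strsplit(affiliation, split = ",", fixed = TRUE), function(parts) {
    parts <- toupper(trimws(parts))
    hits <- parts[parts %in% city_names]
    if (length(hits) == 0) NA_character_ else hits[length(hits)]
  }, character(1))
}

# Small stand-in for maps::world.cities$name.
known_cities <- c("Los Angeles", "Bangkok", "Enschede")

affiliations <- c(
  "UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
  "CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
  "Prem"
)

cities_found <- last_city(affiliations, known_cities)
```

Like the original, this only matches pieces that are exactly a city name, so a piece such as "NL-7500 AE ENSCHEDE" would still be missed.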
Let me offer a partial solution. As far as I can tell from my own research, the text in the square brackets appears to be personal names. For example, I found that Sutee Anantsuksomsri is an actual name. This suggests that we probably want to remove the text in the brackets.
Once the text in square brackets is removed, the words can be split using unnest_tokens() from the tidytext package. Note that the function converts all letters to lowercase; if you do not want that, specify to_lower = FALSE. First, I split each city name into words and assigned an ID number to each city. Second, I cleaned up your data: as mentioned above, I removed the text in square brackets using gsub(), applied unnest_tokens() to the data, and then used filter() to keep only the words that appear in cities. The result up to this point is shown after the code. Obviously, there is more work to do. The sample data, mydf, is given at the end. I hope you can move on from here.
library(dplyr)
library(tidytext)
library(maps)
data(world.cities)
cities <- world.cities %>%
mutate(id = 1:n()) %>%
unnest_tokens(input = name, output = word, token = "words")
temp <- mydf %>%
mutate(affiliation = gsub(x = affiliation, pattern = "\\[.*\\]", replacement = "")) %>%
unnest_tokens(input = affiliation, output = word, token = "words") %>%
filter(word %in% cities$word)
id word
1 1 los
2 1 angeles
3 1 los
4 1 angeles
5 1 ca
6 1 usa
7 2 water
8 2 ae
9 2 enschede
10 3 bangkok
DATA
mydf <- structure(list(id = 1:3, affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
)), .Names = c("id", "affiliation"), row.names = c(NA, -3L), class = "data.frame")
Can't wrap my mind around this task
Consider a data frame "usa" with 3 columns, "title", "city" and "state" (reproducible):
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <-data.frame(title, city, state)
Resulting in this:
title city state
1 Events in Chicago, September
2 California hotels
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
Now what I am trying to do is to fill the STATE variable for the first 2 observations, which are now missing.
TITLE variable contains a clue: either a city or a state is mentioned in each of the entries.
I need to do the following:
Check if any word in "title" column matches any observation found in "city" and "state" columns;
If any word in "title" matches any observation in "state", paste the same state for the given title's observation;
If any word in "title" matches any observation in "city", paste the matched city's state in the "state" column of the title's row.
So what I want to get eventually is this:
title city state
1 Events in Chicago, September IL
2 California hotels California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
In other words, in the second row the title contained a word "California", so a matching state was found from state vector. However, in the first line, the word "Chicago" was the key, and there was another entry in the data frame (row 4), which linked Chicago to "IL" state, so "IL" has to be pasted in the first row of "state" column.
Waiting for the community's ideas :) Thanks!
I would recommend you use the stringr package; specifically, a function called str_extract.
If you have a complete list of cities, e.g. city <- c("Los Angeles", "Chicago"), then you can turn it into a regular expression using paste(city, collapse = '|'). That will give you: 'Los Angeles|Chicago'. With str_extract, you can extract that city (it extracts the first match it sees, and returns NA if none appears). Here's the complete code. Note: this only works if your data frame is a tibble, not a data.frame. I haven't looked into it closely, but the most likely reason is that data.frame() converted strings to factors by default before R 4.0.0, which makes ifelse() return integer codes instead of the strings.
library(tidyverse)
library(stringr)
title <- c("Events in Chicago, September", "California hotels",
"Los Angeles, August", "Restaurant in Chicago")
city <- c("","", "Los Angeles", "Chicago")
state <- c("","", "California", "IL")
usa <- tibble(title, city, state) # notice this is a tibble, not a data.frame (data_frame() is deprecated)
cities <- paste(c("Los Angeles", "Chicago"), collapse = '|')
states <- paste(c("California", "IL"), collapse = '|')
usa <- usa %>%
mutate(city = ifelse(city == '', str_extract(title, cities), city),
state = ifelse(state == '', str_extract(title, states), state))
This results in:
# A tibble: 4 x 3
title city state
<chr> <chr> <chr>
1 Events in Chicago, September Chicago <NA>
2 California hotels <NA> California
3 Los Angeles, August Los Angeles California
4 Restaurant in Chicago Chicago IL
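In this result row 1 still has no state, because Chicago is matched as a city but never linked to IL. One way to close that gap, sketched here in base R, is to build a city-to-state lookup from the rows where both are known and use it as a fallback. The names known, known_states, and first_match are hypothetical, and the matching is plain case-sensitive substring search, so short abbreviations such as IL could match by accident in other titles.

```r
usa <- data.frame(
  title = c("Events in Chicago, September", "California hotels",
            "Los Angeles, August", "Restaurant in Chicago"),
  city  = c("", "", "Los Angeles", "Chicago"),
  state = c("", "", "California", "IL"),
  stringsAsFactors = FALSE
)

# Lookup of city -> state built from rows where both values are known.
known <- unique(usa[usa$city != "" & usa$state != "", c("city", "state")])
known_states <- unique(usa$state[usa$state != ""])

# Hypothetical helper: first element of `candidates` found in `title`, or NA.
first_match <- function(title, candidates) {
  hit <- candidates[vapply(candidates, grepl, logical(1), x = title, fixed = TRUE)]
  if (length(hit) == 0) NA_character_ else unname(hit[1])
}

# Fill missing states: prefer a state named in the title,
# otherwise fall back to the state of a city named in the title.
for (i in which(usa$state == "")) {
  state_hit <- first_match(usa$title[i], known_states)
  city_hit  <- first_match(usa$title[i], known$city)
  if (!is.na(state_hit)) {
    usa$state[i] <- state_hit
  } else if (!is.na(city_hit)) {
    usa$state[i] <- known$state[match(city_hit, known$city)]
  }
}
```

After the loop, row 1 gets IL through the Chicago lookup and row 2 gets California directly from its title, which matches the output requested in the question.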
Need help removing random text in a string that appears before an address (data set has ~5000 observations). Dataframe test2$address reads as follows:
addresses <- c(
"140 National Plz Oxon Hill, MD 20745",
"6324 Windsor Mill Rd Gwynn Oak, MD 21207",
"23030 Indian Creek Dr Sterling, VA 20166",
"Located in Reston Town Center 18882 Explorer St Reston, VA 20190"
)
I want it to spit out all addresses in a common format:
[885] "23030 Indian Creek Dr Sterling, VA 20166"
[886] "18882 Explorer St Reston, VA 20190"
Not sure how to go about doing this as there is no specific pattern to the text that comes before the address number.
If you know that the address portion you want will always start with digits, and the part you want to remove will be text, then you can use this:
sub(".*?(\\d+)", "\\1", x)
Output:
[1] "140 National Plz Oxon Hill, MD 20745"
[2] "6324 Windsor Mill Rd Gwynn Oak, MD 21207"
[3] "23030 Indian Creek Dr Sterling, VA 20166"
[4] "18882 Explorer St Reston, VA 20190"
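What this does is remove everything (.*?) before the first run of digits (\\d+). The ? makes .* lazy, so it stops at the first digits rather than the last, and \\1 puts the captured digits back into the result.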
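To see why the ? matters, the short sketch below runs the same substitution with and without it on the one address that has leading text. The variable names are illustrative only.

```r
address <- "Located in Reston Town Center 18882 Explorer St Reston, VA 20190"

# Lazy: .*? stops at the first run of digits, so the street number is kept.
lazy <- sub(".*?(\\d+)", "\\1", address)

# Greedy: .* runs as far as it can, so the match extends to the end of the
# ZIP code and only its last digit is captured and kept.
greedy <- sub(".*(\\d+)", "\\1", address)
```

Here lazy is "18882 Explorer St Reston, VA 20190", while greedy collapses to "0", which is why the lazy form is the one to use.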
Sample data:
x <- c("140 National Plz Oxon Hill, MD 20745",
"6324 Windsor Mill Rd Gwynn Oak, MD 21207",
"23030 Indian Creek Dr Sterling, VA 20166",
"Located in Reston Town Center 18882 Explorer St Reston, VA 20190")