Find city, state and country information from a location string in R - r

I have a string vector with location information. Here is the part of my string
location_information = c("Hartville, Ohio","Malaysia,Johor Bahru","Culpeper, irginia",
"MD", "Atlanta","Granada Hills CA","Kansas City, mo")
With this string vector, I wanted to get the city, state, and country information. Here is the desired output for the sample.
desired_out = data.frame( Country = c("US","Malaysia",rep("US",5)),
State = c("Ohio",NA,"Virginia","Maryland","Georgia","California","Missouri"),
City = c("Hartville","Johor Bahru","Culpeper",NA, "Atlanta","Granada Hills","Kansas City"))
How can I get that information in a consistent format?
I think I may need to use Google API or something. How can I do it in R?

Here is a solution using the geocoding from openstreetmap to get needed additional information.
Note that you (probably) will not be able to parse hundreds or thousands of locations in one go.
library(tmap)
library(tmaptools)
library(dplyr)
# sample data of locations
location_information = c("Hartville, Ohio","Malaysia,Johor Bahru","Culpeper, Virginia",
"MD", "Atlanta","Granada Hills CA","Kansas City, mo")
# geocode the locations
loc.data <- tmaptools::geocode_OSM(location_information, as.sf = TRUE)
# reverse geocode the locations for additional OSM data
tmaptools::rev_geocode_OSM(loc.data) %>%
dplyr::select(country, state, city, town, village, city_district)
#         country      state        city     town   village city_district
# 1 United States       Ohio        <NA>     <NA> Hartville          <NA>
# 2      Malaysia      Johor Johor Bahru     <NA>      <NA>          <NA>
# 3 United States   Virginia        <NA> Culpeper      <NA>          <NA>
# 4 United States   Maryland        <NA>     <NA>      <NA>          <NA>
# 5 United States    Georgia     Atlanta     <NA>      <NA>          <NA>
# 6 United States California Los Angeles     <NA>      <NA> Granada Hills
# 7 United States   Missouri Kansas City     <NA>      <NA>          <NA>
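The reverse-geocoded result spreads the place name over several columns (city, town, village, city_district), while the desired output has a single City column. The following is a minimal base R sketch of one way to collapse them. It assumes the column layout shown in the output above, and the preference order (district first, then city, town, village) is only a guess that happens to reproduce the desired output for this sample. The helper first_non_na is a hypothetical name, not part of tmaptools.

```r
# Rows typed in by hand from the output above, so this runs without calling OSM.
osm <- data.frame(
  city          = c(NA, "Johor Bahru", NA, NA, "Atlanta", "Los Angeles", "Kansas City"),
  town          = c(NA, NA, "Culpeper", NA, NA, NA, NA),
  village       = c("Hartville", NA, NA, NA, NA, NA, NA),
  city_district = c(NA, NA, NA, NA, NA, "Granada Hills", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: element-wise first non-missing value across vectors.
first_non_na <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

# Assumed preference order: the most specific place name wins.
osm$City <- first_non_na(osm$city_district, osm$city, osm$town, osm$village)
osm$City
```

Row 4 ("MD") stays NA because none of the four columns is filled for it.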

Related

Replace all partial string entries with NA

I have a data frame similar to:
df<-as.data.frame(cbind(rep("Canada",6),
c(rep("Alberta",3), rep("Manitoba",2),rep("Unknown_province",1)),
c("Edmonton", "Unknown_city","Unknown_city","Brandon","Unknown_city","Unknown_city")))
colnames(df)<- c("Country","Province","City")
I would like to substitute all entries that contain "Unknown" with NA.
I have tried using grepl, but it removes all entries for that variable if one entry matches. I would like to replace only the individual cells.
df[grepl("Unknown", df, ignore.case=TRUE)] <- NA
df1 <- df # Keep a copy so that we can refer back to df in case there is an issue
Then you could use any of the following:
is.na(df1) <- array(grepl('Unknown', as.matrix(df1)), dim(df1))
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
or even:
df1[] <- sub("Unknown.*", NA, as.matrix(df1), ignore.case = TRUE)
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
Note that grepl and sub are vectorized, so there is no need to use the *apply family or for loops.
Here is one possible way to solve your problem:
df[] <- lapply(df, function(x) ifelse(grepl("Unknown", x, TRUE), NA, x))
df
# Country Province City
# 1 Canada Alberta Edmonton
# 2 Canada Alberta <NA>
# 3 Canada Alberta <NA>
# 4 Canada Manitoba Brandon
# 5 Canada Manitoba <NA>
# 6 Canada <NA> <NA>
Using dplyr
library(dplyr)
library(stringr)
df %>%
mutate(across(everything(),
~ case_when(str_detect(., 'Unknown', negate = TRUE) ~ .)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
I like to use replace() in cases like this, where values in a vector are either replaced or left as is, depending on a condition:
library(dplyr)
library(stringr)
df%>%mutate(across(everything(), ~replace(.x, str_detect(.x, 'Unknown'), NA)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
df[]<- lapply(df, gsub, pattern = "Unknown", replacement = NA, fixed = TRUE)
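To see why the attempt in the question blanks whole columns, it may help to compare what grepl returns for the data frame as a whole with what it returns column by column. The sketch below rebuilds the example data with character columns (an assumption; with factor columns the first call can behave differently). As far as I understand, grepl coerces a data frame with as.character, so each column is treated as one long string and the result has one value per column.

```r
df <- data.frame(
  Country  = rep("Canada", 6),
  Province = c(rep("Alberta", 3), rep("Manitoba", 2), "Unknown_province"),
  City     = c("Edmonton", "Unknown_city", "Unknown_city",
               "Brandon", "Unknown_city", "Unknown_city"),
  stringsAsFactors = FALSE
)

# grepl() on the whole data frame yields one value per column, not per cell,
# so df[col_hits] <- NA would overwrite entire columns.
col_hits <- grepl("Unknown", df)
length(col_hits)

# Applying grepl() to each column gives a logical matrix with one entry per cell.
cell_hits <- sapply(df, grepl, pattern = "Unknown")
dim(cell_hits)

# A logical matrix index replaces only the matching cells.
df[cell_hits] <- NA
df
```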

How to clean the city and state(both full and abbreviation) using R

I have a list of uncleaned city and state from "Location" in twitter, for example:
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
How to clean the data to make a clean list of two columns with city and state?
You can do this:
splitted_list <- strsplit(location,",")
wide_matrix <- sapply(splitted_list,function(x) c(rep(NA,length(x)==1),x))
res <- setNames(data.frame(t(wide_matrix),stringsAsFactors = FALSE),c("city","state"))
res
# city state
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
Assuming your data (location) is already part of a data.frame that you want to clean up, tidyr::separate can be a suitable option.
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
"Pennsylvania", "MI", "Detroit,MI")
library(tidyverse)
as.data.frame(location) %>% # I created a data.frame, which is not needed in actual data
tidyr::separate(location, c("City", "State"), sep=",", fill="left")
# City State
# 1 <NA> the Great Lake State
# 2 <NA> PA
# 3 Harrisburg Pennsylvania
# 4 <NA> Pennsylvania
# 5 <NA> MI
# 6 Detroit MI
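Both approaches above split on the comma but leave the state as typed, so "PA" and "Pennsylvania" remain different values and any space after the comma is kept. A possible follow-up step, sketched here in base R with the built-in state.abb and state.name vectors, trims whitespace and expands two-letter abbreviations. Setting unrecognised values such as the nickname "the Great Lake State" to NA is an assumption about what "clean" should mean, not something the question specifies.

```r
location <- c("the Great Lake State", "PA", "Harrisburg, Pennsylvania",
              "Pennsylvania", "MI", "Detroit,MI")

parts <- strsplit(location, ",", fixed = TRUE)

# City is the part before the comma, if there is a comma at all.
city <- sapply(parts, function(p) if (length(p) == 2) trimws(p[1]) else NA_character_)

# The state candidate is always the last part.
state_raw <- sapply(parts, function(p) trimws(p[length(p)]))

# Expand abbreviations such as "PA" to full names; keep everything else as is.
idx <- match(toupper(state_raw), state.abb)
state <- ifelse(is.na(idx), state_raw, state.name[idx])

# Assumption: anything that is still not a known state name becomes NA.
state[!state %in% state.name] <- NA

res <- data.frame(city, state, stringsAsFactors = FALSE)
res
```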

extracting country name from city name in R

This question may look like a duplicate, but I am facing some issues while extracting country names from a string. I have gone through the question "Extracting Country Name from Author Affiliations", but I was not able to solve my problem. I have tried grepl and a for loop for text matching and replacement, but my data column consists of more than 300k rows, so pattern matching with grepl in a for loop is very slow.
I have a column like this.
org_loc
Zug
Zug Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza
York United Kingdom
Delhi
Yalleroi Queensland
Waterloo Ontario
Waterloo ON
Washington D.C.
Washington D.C. Metro
New York
df$org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom", "Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
The string may contain the name of a state, city, or country. I just want the country as output, like this:
org_loc
Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United States
United States
United States
I am trying to convert the state (if a match is found) to its country using the countrycode library, but I am not able to do so. Any help would be appreciated.
You can use your City_and_province_list.csv as a custom dictionary for countrycode. The custom dictionary can not have duplicates in the origin vector (the City column in your City_and_province_list.csv), so you'll have to remove them or deal with them somehow first (as in my example below). Currently, you don't have all of the possible strings in your example in your lookup CSV, so they are not all converted, but if you added all of the possible strings to the CSV, it would work completely.
library(countrycode)
org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
"Zaragoza", "York United Kingdom", "Delhi",
"Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
"Washington D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)
city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")
# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]
df$country <- countrycode(df$org_loc, "City", "Country",
custom_dict = city_country)
df
# org_loc country
# 1 Zug Switzerland
# 2 Zug Canton of Zug <NA>
# 3 Zimbabwe <NA>
# 4 Zigong China
# 5 Zhuhai China
# 6 Zaragoza Spain
# 7 York United Kingdom <NA>
# 8 Delhi India
# 9 Yalleroi Queensland <NA>
# 10 Waterloo Ontario <NA>
# 11 Waterloo ON <NA>
# 12 Washington D.C. <NA>
# 13 Washington D.C. Metro <NA>
# 14 New York United States of America
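The NA rows above come from exact matching: "Zug Canton of Zug" is not itself a key in the dictionary. One possible workaround, sketched here in base R with match() (which is vectorized and should scale to a few hundred thousand rows far better than a grepl loop), is to try the whole string first and then fall back to its first word. The lookup table below is a small hypothetical stand-in for the CSV, and the first-word fallback is a crude assumption that fails for names such as "New York".

```r
# Hypothetical mini lookup table standing in for City_and_province_list.csv.
lookup <- data.frame(
  key     = c("zug", "zimbabwe", "zigong", "delhi", "waterloo"),
  country = c("Switzerland", "Zimbabwe", "China", "India", "Canada"),
  stringsAsFactors = FALSE
)

org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong",
             "Delhi", "Waterloo ON", "New York")

# Normalise case and whitespace so "Zug" and "zug" hit the same key.
norm <- tolower(trimws(org_loc))

# Pass 1: exact match on the whole string.
country <- lookup$country[match(norm, lookup$key)]

# Pass 2 (assumption): for unmatched rows, retry with the first word only.
first_word <- sub(" .*$", "", norm)
fallback <- lookup$country[match(first_word, lookup$key)]
country[is.na(country)] <- fallback[is.na(country)]

data.frame(org_loc, country, stringsAsFactors = FALSE)
```

"New York" stays NA here because neither "new york" nor "new" is a key in the hypothetical table.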
library(countrycode)
df <- c("zug switzerland", "zug canton of zug switzerland", "zimbabwe",
"zigong chengdu pr china", "zhuhai guangdong china", "zaragoza","York United Kingdom", "Yamunanagar","Yalleroi Queensland Australia","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","USA")
df1 <- countrycode(df, 'country.name', 'country.name')
It didn't match a lot of them, but that should do what you're looking for, based on the reference manual for countrycode.
With the geocode function from the ggmap package you can accomplish your task with good, though not perfect, accuracy. You also have to accept its judgment that "Zaragoza" is a city in Spain (which is what geocode returns) and not somewhere in Argentina; geocode tends to give you the biggest city when there are several places with the same name.
(remove the $country to see all of the output)
library(ggmap)
org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom",
"Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
geocode(org_loc, output = "more")$country
As geocode is provided by Google, it has a query limit of 2,500 per day per IP address. If it returns NAs, it may be because of an inconsistent limit check, so just try again.

R package "acs": Get county name, FIPS?

In search of a solution to an unsolved problem, I came across the acs package. I assume there's no way within the choroplethr package to get any county information from data in the format [city, state]. That's why pre-processing with acs needs to be done.
I tried following code to get the county information on a city:
library(acs)
geo.lookup(state="CA", place="San Francisco")
> geo.lookup(state="CA", place="San Francisco")
  state state.name          county.name place               place.name
1     6 California                 <NA>    NA                     <NA>
2     6 California San Francisco County 67000       San Francisco city
3     6 California     San Mateo County 73262 South San Francisco city
As we know, cities can be part of different counties. Most likely, I will go with the second
> geo.lookup(state="CA", place="San Francisco")[2,]
  state state.name          county.name place         place.name
2     6 California San Francisco County 67000 San Francisco city
by default.
My question:
Is there a way to get the state abbreviation, county name and county FIPS, too? I could not find the answer in the documentation.
Also, for further processing (matching with choroplethr), the last "County" in county.name and "city" in place.name need to be removed.
Here's how to add the state abbreviation, county name, and county FIPS to your example. R has built-in variables for state names and state abbreviations. For the FIPS codes, I read a csv file from the Census Bureau's website.
library(acs)
library(tidyverse)
states <- tibble(state.name, state.abb)
fips <-
read_csv(
"https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
col_names = c("state.abb", "statefp", "countyfp", "county.name", "classfp")
)
query <- geo.lookup(state = "CA", place = "San Francisco")[2, ] %>%
as_tibble() %>%
left_join(states, by = "state.name") %>%
left_join(fips, by = c("county.name", "state.abb"))
query
# # A tibble: 1 x 9
# state state.name county.name place place.name state.abb statefp countyfp classfp
# <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
# 1 6 California San Francisco County 67000 San Francisco city CA 06 075 H6
As you note at the end of your question, you may need to clean up this data a bit more to make it fit choroplethr.
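For that last clean-up step, a minimal base R sketch could look like the following. It uses the values from the query result above, typed in by hand; the combined five-digit FIPS code is simply the state and county codes pasted together, and whether choroplethr expects exactly this form is an assumption worth checking against its documentation.

```r
# Values copied by hand from the query result above.
county.name <- "San Francisco County"
place.name  <- "San Francisco city"
statefp     <- "06"
countyfp    <- "075"

# Drop the trailing " County" and " city"; the $ anchor keeps other text intact.
county_clean <- sub(" County$", "", county.name)
place_clean  <- sub(" city$", "", place.name)

# State and county codes pasted together into one county FIPS string.
county_fips <- paste0(statefp, countyfp)

c(county_clean, place_clean, county_fips)
```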

Extracting parts of data.frame

I have an issue while extracting and creating a new data.frame on the basis of previous one.
So we have:
> head(data.raw)
date id contacted contacted_again region
1 2015-11-29 234 CHAT EMAIL APAC
2 2015-11-29 234 EMAIL EMAIL APAC
3 2015-11-27 257 PHONE PHONE EMEA
4 2015-11-27 278 PHONE EMAIL APAC
5 2015-11-27 293 CHAT EMAIL EMEA
6 2015-11-27 243 EMAIL EMAIL EMEA
market
1 AU/NZ
2 SE Asia (English)
3 Spain
4 China Mainland
5 DACH
6 DACH
However, once I write
data.ru <- data.raw[data.raw$market=="Russia",]
I receive the following mess:
date id contacted contacted_again region market
67 2015-11-25 334 CHAT EMAIL EMEA Russia
NA <NA> <NA> <NA> <NA> <NA> <NA>
NA.1 <NA> <NA> <NA> <NA> <NA> <NA>
NA.2 <NA> <NA> <NA> <NA> <NA> <NA>
NA.3 <NA> <NA> <NA> <NA> <NA> <NA>
NA.4 <NA> <NA> <NA> <NA> <NA> <NA>
How should I write a command to get a normal data.frame with all rows where market == "Russia", without any NAs?
I would just use the subset function.
test <- data.frame(x = c("USA", "USA", "USA", "Russia", "Russia", NA), y = c("Orlando", "Boston", "Memphis", NA, "St. Petersburg", "Mexico City"))
print(test)
x y
1 USA Orlando
2 USA Boston
3 USA Memphis
4 Russia <NA>
5 Russia St. Petersburg
6 <NA> Mexico City
subset(test, x == "Russia")
x y
4 Russia <NA>
5 Russia St. Petersburg
You may want to try: data.ru <- data.raw[data.raw$market %in% "Russia",]
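Explanation: I am assuming you have empty lines in your dataset, which are read as NA (missing values). Comparing NA with "Russia" using == returns NA rather than FALSE, because R cannot know whether a missing value equals "Russia", and indexing rows with NA produces an all-NA row. %in% returns FALSE for NA, so those rows are dropped.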
Illustration in code:
# create sample dataset
example.df <- data.frame(market=c(NA, "Russia", NA), outcome = c(1,2,3))
# match market using ==
example.df$market == "Russia"
example.df[example.df$market == "Russia",]
# match market using %in%
example.df$market %in% "Russia"
example.df[example.df$market %in% "Russia",]
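Two other common ways to drop the NA rows while still comparing with == are sketched below, using the same sample data: which() keeps only the positions where the comparison is TRUE, and an explicit !is.na() check turns the NA results into FALSE.

```r
example.df <- data.frame(market = c(NA, "Russia", NA), outcome = c(1, 2, 3))

# The comparison itself is NA wherever market is missing.
keep <- example.df$market == "Russia"

# Indexing with NA yields all-NA rows, as in the question.
by_eq <- example.df[keep, ]

# which() returns only the positions that are TRUE, silently skipping NA.
by_which <- example.df[which(keep), ]

# Alternatively, turn NA into FALSE explicitly.
by_notna <- example.df[!is.na(keep) & keep, ]

nrow(by_eq)
nrow(by_which)
nrow(by_notna)
```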
