extracting country name from city name in R - r

This question may look like a duplicate but I am facing some issue while extracting country names from the string. I have gone through this link [link]Extracting Country Name from Author Affiliations but I was not able to solve my problem.I have tried grepl and for loop for text matching and replacement, my data column consists of more than 300k rows so using grepl and for loop for pattern matching is very very slow.
I have a column like this.
org_loc
Zug
Zug Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza
York United Kingdom
Delhi
Yalleroi Queensland
Waterloo Ontario
Waterloo ON
Washington D.C.
Washington D.C. Metro
New York
df$org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom", "Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
the string may contain the name of a state, city or country. I just want Country as output. Like this
org_loc
Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United State
United state
United state
I am trying to convert state (if match found) to its country using countrycode library but not able to do so. Any help would be appreciable.

You can use your City_and_province_list.csv as a custom dictionary for countrycode. The custom dictionary can not have duplicates in the origin vector (the City column in your City_and_province_list.csv), so you'll have to remove them or deal with them somehow first (as in my example below). Currently, you don't have all of the possible strings in your example in your lookup CSV, so they are not all converted, but if you added all of the possible strings to the CSV, it would work completely.
library(countrycode)
org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
"Zaragoza", "York United Kingdom", "Delhi",
"Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
"Washington D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)
city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")
# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]
df$country <- countrycode(df$org_loc, "City", "Country",
custom_dict = city_country)
df
# org_loc country
# 1 Zug Switzerland
# 2 Zug Canton of Zug <NA>
# 3 Zimbabwe <NA>
# 4 Zigong China
# 5 Zhuhai China
# 6 Zaragoza Spain
# 7 York United Kingdom <NA>
# 8 Delhi India
# 9 Yalleroi Queensland <NA>
# 10 Waterloo Ontario <NA>
# 11 Waterloo ON <NA>
# 12 Washington D.C. <NA>
# 13 Washington D.C. Metro <NA>
# 14 New York United States of America

library(countrycode)
df <- c("zug switzerland", "zug canton of zug switzerland", "zimbabwe",
"zigong chengdu pr china", "zhuhai guangdong china", "zaragoza","York United Kingdom", "Yamunanagar","Yalleroi Queensland Australia","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","USA")
df1 <- countrycode(df, 'country.name', 'country.name')
It didn't match a lot of them, but that should do what you're looking for, based on the reference manual for countrycode.

With function geocode from package ggmap you may accomplish, with good but not total accuracy your task; you must also use your criterion to say "Zaragoza" is a city in Spain (which is what geocode returns) and not somewhere in Argentina; geocode tends to give you the biggest city when there are several homonyms.
(remove the $country to see all of the output)
library(ggmap)
org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom",
"Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
geocode(org_loc, output = "more")$country
as geocode is provided by google, it has a query limit, 2,500 per day per IP address; if it returns NAs it may be because an unconsistent limit check, just try it again.

Related

Find city, state and country information from a location string in R

I have a string vector with location information. Here is the part of my string
location_information = c("Hartville, Ohio","Malaysia,Johor Bahru","Culpeper, irginia",
"MD", "Atlanta","Granada Hills CA","Kansas City, mo")
With this string vector, I wanted to get the city, state, and country information. Here is the desired output for the sample.
desired_out = data.frame( Country = c("US","Malaysia",rep("US",5)),
State = c("Ohio",NA,"Virginia","Maryland","Georgia","California","Missouri"),
City = c("Hartville","Johor Bahru","Culpeper",NA, "Atlanta","Granada Hills","Kansas City"))
How can I get that information with the consistent string format?
I think I may need to use Google API or something. How can I do it in R?
Here is a solution using the geocoding from openstreetmap to get needed additional information.
Note that you (probably) will not be able to parse hunderds/thousands of locations in one go.
library(tmap)
library(tmaptools)
library(dplyr)
# sample data of locations
location_information = c("Hartville, Ohio","Malaysia,Johor Bahru","Culpeper, Virginia",
"MD", "Atlanta","Granada Hills CA","Kansas City, mo")
# geocode the locations
loc.data <- tmaptools::geocode_OSM(location_information, as.sf = TRUE)
# reverse geocode the locations for additional OSM data
tmaptools::rev_geocode_OSM(loc.data) %>%
dplyr::select(country, state, city, town, village, city_district)
# country state city town village city_district
# 1 United States Ohio <NA> <NA> Hartville <NA>
# 2 Malaysia Johor Johor Bahru <NA> <NA> <NA>
# 3 United States Virginia <NA> Culpeper <NA> <NA>
# 4 United States Maryland <NA> <NA> <NA> <NA>
# 5 United States Georgia Atlanta <NA> <NA> <NA>
# 6 United States California Los Angeles <NA> <NA> Granada Hills
# 7 United States Missouri Kansas City <NA> <NA> <NA>

How do I rename the values in my column as I have misspelt them and cant rename them in R or Colab

I have a data frame that was given to me. Under the column titled state, there are two components with the same name but with different case sensitivities ie one is "London" and the other is "LONDON". How would i be able to rename "LONDON" to become "London" in order to total them up together and not separately. reminder, I am trying to change the name of the input not the name of the column.
You can use the following code, df is your current dataframe, in which you want to substitute "LONDON" for "London"
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil","US", "Brazil", "UK", "Germany"),
State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
Candidate = c(1:8))
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
then run the following code to substitute London to all the instances where State is equal to "LONDON"
df[df$State == "LONDON", "State"] <- "London"
Now the output will be as
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
Maybe you could try using the case_when function. I would do something like this:
ยดยดยดยด
mutate(data, State_def=case_when(State=="LONDON" ~ "London",
State=="London" ~ "London",
TRUE ~ NA_real_)
I might misunderstand, but I think it should be as simple as this:
x$state <- sub( "LONDON", "London", x$state, fixed=TRUE )
This should change LONDON to London

R package "acs": Get county name, FIPS?

in search for a solution to an unsolved problem, I came across the acs package. I assume, there's no way within the choropleth package to get any county information from data in the format [city, state]. That's why pre-processing with acs needs to be done.
I tried following code to get the county information on a city:
library(acs)
geo.lookup(state="CA", place="San Francisco")
> geo.lookup(state="CA", place="San Francisco")
state state.name county.name place place.name
1 6 California <NA> NA <NA>
2 6 California San Francisco County 67000 San Francisco city
3 6 California San Mateo County 73262 South San Francisco city
As we know, cities can be part of different counties. Most likely, I will go with the second
> geo.lookup(state="CA", place="San Francisco")[2,]
state state.name county.name place place.name
2 6 California San Francisco County 67000 San Francisco city
by default.
My question:
Is there a way to get the state abbreviation, county name and county FIPS, too? I could not find the answer in the documentation.
Also, for further processing (matching with choroplethr), the last "County" in county.name and "city" in place.name need to be removed.
Here's how to add the state abbreviation, county name, and county FIPS to your example. R has built-in variables for state names and state abbreviations. For the FIPS codes, I read a csv file from the Census Bureau's website.
library(acs)
library(tidyverse)
states <- cbind(state.name, state.abb) %>% tbl_df()
fips <-
read_csv(
"https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
col_names = c("state.abb", "statefp", "countyfp", "county.name", "classfp")
)
query <- geo.lookup(state = "CA", place = "San Francisco")[2, ] %>%
tbl_df() %>%
left_join(states, by = "state.name") %>%
left_join(fips, by = c("county.name", "state.abb"))
query
# # A tibble: 1 x 9
# state state.name county.name place place.name state.abb statefp countyfp classfp
# <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
# 1 6 California San Francisco County 67000 San Francisco city CA 06 075 H6
As you note at the end of your question, you may need to clean up this data a bit more to make it fit choroplethr.

Find groups that contain all elements, but do not overlap [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 5 years ago.
Improve this question
I've been given a set of country groups and I'm trying to get a set of mutually exclusive regions so that I can compare them. The problem is that my data contains several groups, many of which overlap. How can I get a set of groups which contain all countries, but do not overlap with each other?
For example, assume that this is the list of countries in the world:
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
Assume that this is my set of groups:
df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania", "Oceania", "Commonwealth Countries"),
element = c("Angola", "France", "Germany", "France", "Australia", "New Zealand", "Australia"))
group element
1 Africa Angola
2 Western Europe France
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
7 Commonwealth Countries Australia
How could I remove overlapping groups (in this case Western Europe) to get a set of groups that contains all countries like the following:
df_solved <- data.frame(group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
element = c("Angola", "France", "Germany", "Australia", "New Zealand"))
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
One possible rule could be to minimize the number of groups, e.g. to associate an element with that group which includes the most elements.
library(data.table)
setDT(df)[, n.elements := .N, by = group][
order(-n.elements), .(group = group[1L]), by = element]
element group
1: Germany Europe
2: France Europe
3: Australia Oceania
4: New Zealand Oceania
5: Angola Africa
Explanation
setDT(df)[, n.elements := .N, by = group][]
returns
group element n.elements
1: Africa Angola 1
2: Western Europe France 1
3: Europe Germany 2
4: Europe France 2
5: Oceania Australia 2
6: Oceania New Zealand 2
7: Commonwealth Countries Australia 1
Now, the rows are ordered by decreasing number of elements and for each country the first, i.e., the "largest", group is picked. This should return a group for each country as requested.
In case of ties, i.e., one group contains equally many elements, you can add additional citeria when ordering, e.g., length of the group name, or just alphabetical order.
1) If you want to simply eliminate duplicate elements then use !duplicated(...) as shown. No packages are used.
subset(df, !duplicated(element))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
2) set partitioning If each group must be wholly in or wholly out and each element may only appear once then this is a set partitioning problem:
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, "=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
3) set covering Of course there may be no exact set partition so we could consider the set covering problem (same code exceept "=" is replaced by ">=" in the lp line.
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, ">=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
and we could optionally then apply (1) to remove any duplicates in the cover.
4) Non-dominated groups Another approach is to remove any group whose elements form a strict subset of the elements of some other group. For example, every element in Western Europe is in Europe and Europe has more elements than Western Europe so the elements of Western Europe are a strict subset of the elements of Europe and we remove Western Europe. Using const.mat from above:
# returns TRUE if jth column of const.mat is dominated by some other column
is_dom_fun <- function(j) any(apply(const.mat[, j] <= const.mat[, -j], 2, all) &
sum(const.mat[, j]) < colSums(const.mat[, -j]))
is_dom <- sapply(seq_len(ncol(const.mat)), is_dom_fun)
subset(df, group %in% colnames(const.mat)[!is_dom])
giving:
group element
1 Africa Angola
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
If there are any duplicates left we can use (1) to remove them.
library(dplyr)
df %>% distinct(element, .keep_all=TRUE)
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
Shoutout to Axeman for beating me with this answer.
Update
Your question is ill-defined. Why is 'Europe' preferred over 'Western Europe'? Put another way, each country is assigned several groups. You want to reduce it to one group per country. How do you decide which group?
Here's one way, we always prefer the biggest:
groups <- df %>% count(group)
df %>% inner_join(groups, by='group') %>%
arrange(desc(n)) %>% distinct(elemenet, .keep_all=TRUE)
group element n
1 Europe France 2
2 Europe Germany 2
3 Oceania Australia 2
4 Oceania New Zealand 2
5 Africa Angola 1
Here is one option with data.table
library(data.table)
setDT(df)[, head(.SD, 1), element]
Or with unique
unique(setDT(df), by = 'element')
# group element
#1: Africa Angola
#2: Europe France
#3: Europe Germany
#4: Oceania Australia
#5: Oceania New Zealand
Packages are used and it is data.table
A completely different approach would be to ignore the given groups but to look up just the country names in the catalogue of UN regions which are available in the countrycodes or ISOcodes packages.
The countrycodes package seems to offer the simpler interface and it also warns about country names which can not be found in its database:
# given country names - note the deliberately misspelled last entry
World <- c("Angola", "France", "Germany", "Australia", "New Zealand", "New Sealand")
# regions
countrycode::countrycode(World, "country.name.en", "region")
[1] "Middle Africa" "Western Europe" "Western Europe" "Australia and New Zealand"
[5] "Australia and New Zealand" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "region") :
Some values were not matched unambiguously: New Sealand
# continents
countrycode::countrycode(World, "country.name.en", "continent")
[1] "Africa" "Europe" "Europe" "Oceania" "Oceania" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "continent") :
Some values were not matched unambiguously: New Sealand

R data.frame transformation?

I have a R data frame that looks like this:
Country Property Value
Canada Capital Ottawa
Canada Population 38
Canada Language1 French
Canada Language2 English
United States Capital Washington
United States Population 280
United States Language1 English
United States Language2 NA
I want to re-arrange this so that it looks like this:
Country Capital Population Language1 Language2
Canada Ottawa 38 French English
United States Washington 280 English NA
Is there any way to do this transformation ?
Thanks.
As per Paul Hiemstra's comment:
the reshape2 package's dcast will do this nicely:
dcast(data=yourdataframe, Country~Property, value.var='Value')
If you've got duplicated values in there though it will try to aggregate them using length as a default, which isn't what you want!

Resources