How can I add the country name to a dataset based on city name and population? [duplicate]

This question already has answers here:
extracting country name from city name in R
(3 answers)
Closed 7 months ago.
I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.
population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin
I expect the output to look like this:
population city country
500,000 Oslo Norway
750,000 Bristol England
500,000 Liverpool England
1,000,000 Dublin Ireland
How can I add a column of country names based on the city and population to a large dataset in R?

I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.
library(maps)
library(dplyr)
data("world.cities")
df <- readr::read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df |>
inner_join(
select(world.cities, name, country.etc, pop),
by = c("city" = "name")
) |> group_by(city) |>
filter(
abs(pop - population) == min(abs(pop - population))
)
# A tibble: 4 x 4
# Groups: city [4]
# population city country.etc pop
# <dbl> <chr> <chr> <int>
# 1 500000 Oslo Norway 821445
# 2 750000 Bristol UK 432967
# 3 500000 Liverpool UK 468584
# 4 1000000 Dublin Ireland 1030431
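The closest-population tie-break above can also be tried without the maps package. This is only a minimal sketch: `ref` is a hypothetical stand-in for `world.cities`, the two USA population figures are made up for illustration, and `slice_min()` is used in place of the `filter()` call so that exactly one row per city is kept.

```r
library(dplyr)

# Hypothetical reference table standing in for maps::world.cities.
# The UK and Ireland figures come from the output above; the USA ones are invented.
ref <- tibble(
  name        = c("Bristol", "Bristol", "Dublin", "Dublin"),
  country.etc = c("UK", "USA", "Ireland", "USA"),
  pop         = c(432967, 61000, 1030431, 45000)
)

cities <- tibble(
  population = c(750000, 1000000),
  city       = c("Bristol", "Dublin")
)

matched <- cities %>%
  inner_join(ref, by = c("city" = "name")) %>%
  group_by(city) %>%
  # Keep the candidate whose population is closest to the one in our data
  slice_min(abs(pop - population), n = 1, with_ties = FALSE) %>%
  ungroup()

matched
```

With `with_ties = FALSE` a city never comes back twice, even if two candidates are equally close; the original `filter()` would keep both.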

As others have noted, cities with these names exist in more than one country, so a join on city name alone returns several candidate rows:
library(tidyverse)
library(maps)
data("world.cities")
df <- read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df %>%
merge(., world.cities %>%
select(name, country.etc),
by.x = "city",
by.y = "name")
# A tibble: 7 × 3
city population country.etc
<chr> <dbl> <chr>
1 Bristol 750000 UK
2 Bristol 750000 USA
3 Dublin 1000000 USA
4 Dublin 1000000 Ireland
5 Liverpool 500000 UK
6 Liverpool 500000 Canada
7 Oslo 500000 Norway
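Before joining, it can help to list which city names are ambiguous. A small sketch, assuming a hypothetical `ref` table that holds only the city/country pairs shown in the output above:

```r
library(dplyr)

# Hypothetical reference table built from the pairs in the merge output
ref <- tibble(
  city    = c("Bristol", "Bristol", "Dublin", "Dublin",
              "Liverpool", "Liverpool", "Oslo"),
  country = c("UK", "USA", "USA", "Ireland", "UK", "Canada", "Norway")
)

# Count distinct countries per city name and keep names with more than one
ambiguous <- ref %>%
  distinct(city, country) %>%
  count(city, name = "n_countries") %>%
  filter(n_countries > 1)

ambiguous
```

Only the names listed in `ambiguous` need an extra tie-break such as population.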

I think your best bet would be to add a new column called country to your dataset and fill it in yourself. This belongs to the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.

Related

I am trying to filter on two conditions, but I keep removing all patients with either condition

I'm a beginner on R so apologies for errors, and thank you for helping.
I have a dataset (liver) where rows are patient ID numbers, and columns include what region the patient resides in (London, Yorkshire etc) and what unit the patient was treated in (hospital name). Some of the units are private units. I've identified 120 patients from London, of whom 100 were treated across three private units. I want to remove the 100 London patients treated in private units but I keep accidentally removing all patients treated in the private units (around 900 patients). I'd be grateful for advice on how to just remove the London patients treated privately.
I've tried various combinations of using subset and filter with different exclamation points and brackets in different places including for example:
liver <- filter(liver, region_name != "London" & unit_name!="Primrose Hospital" & unit_name != "Oak Hospital" & unit_name != "Wilson Hospital")
Thank you very much.

Your unit_name condition is zeroing your results. Try using the match function which is more commonly seen in its infix form %in%:
liver <- filter(liver,
region_name != "London",
! unit_name %in% c("Primrose Hospital",
"Oak Hospital",
"Wilson Hospital"))
Also you can separate logical AND conditions using a comma.

Building on Pariksheet's great start, which still drops private-hospital patients from outside London: we need the OR operator | inside filter(), so that a row is kept if it is either outside the excluded regions or outside the excluded hospitals. I've made an example data frame to show how this works for your case. It contains your three private hospitals plus one non-private hospital that we want to keep. It also has Manchester patients, some treated at Manch General and some at one of the private hospitals, all of whom we want to keep.
EDITED: Now includes character vectors to allow generalisation of combinations to exclude.
liver <- tibble(region_name = rep(c('London', 'Liverpool', 'Glasgow', 'Manchester'), each = 4),
unit_name = c(rep(c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital',
'State Hospital'), times = 3),
rep(c('Manch General', 'Primrose Hospital'), each = 2)))
liver
# A tibble: 16 x 2
region_name unit_name
<chr> <chr>
1 London Primrose Hospital
2 London Oak Hospital
3 London Wilson Hospital
4 London State Hospital
5 Liverpool Primrose Hospital
6 Liverpool Oak Hospital
7 Liverpool Wilson Hospital
8 Liverpool State Hospital
9 Glasgow Primrose Hospital
10 Glasgow Oak Hospital
11 Glasgow Wilson Hospital
12 Glasgow State Hospital
13 Manchester Manch General
14 Manchester Manch General
15 Manchester Primrose Hospital
16 Manchester Primrose Hospital
excl.private.regions <- c('London',
'Liverpool',
'Glasgow')
excl.private.hospitals <- c('Primrose Hospital',
'Oak Hospital',
'Wilson Hospital')
liver %>%
filter(! region_name %in% excl.private.regions |
! unit_name %in% excl.private.hospitals)
# A tibble: 7 x 2
region_name unit_name
<chr> <chr>
1 London State Hospital
2 Liverpool State Hospital
3 Glasgow State Hospital
4 Manchester Manch General
5 Manchester Manch General
6 Manchester Primrose Hospital
7 Manchester Primrose Hospital
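By De Morgan's law the same filter can be written as "not (in an excluded region AND in a private hospital)", which some people find easier to read. A minimal sketch for the London-only case from the question, using a small made-up data frame:

```r
library(dplyr)

# Small illustrative data set (not the asker's real data)
liver <- tibble(
  region_name = c("London", "London", "Manchester", "Manchester"),
  unit_name   = c("Primrose Hospital", "State Hospital",
                  "Primrose Hospital", "Manch General")
)
private_units <- c("Primrose Hospital", "Oak Hospital", "Wilson Hospital")

# Drop only rows that are BOTH in London AND in a private unit
kept <- liver %>%
  filter(!(region_name == "London" & unit_name %in% private_units))

# Equivalent OR form, as used in the answer above
kept_or <- liver %>%
  filter(region_name != "London" | !unit_name %in% private_units)

kept
```

Both forms keep the Manchester patient treated at Primrose Hospital and drop only the London one.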

Insert a value to a column by condition

I am attempting to fill in a new column in my dataset. I have a dataset containing information on football matches. There is a column called "Stadium", which has various stadium names. I wish to add a new column containing the country in which each stadium is located. My dataset looks something like this:
Match ID Stadium
1 Anfield
2 Camp Nou
3 Stadio Olimpico
4 Anfield
5 Emirates
I am attempting to create a new column looking like this:
Match ID Stadium Country
1 Anfield England
2 Camp Nou Spain
3 Stadio Olimpico Italy
4 Anfield England
5 Emirates England
There is only a handful of stadiums but many rows, meaning I am trying to find a way to avoid inserting the values manually. Any tips?

You want to get the unique stadium names from your data, manually create a vector with the country for each of those stadiums, then join them using Stadium as a key.
library(dplyr)
# Example data
df <- data.frame(`Match ID` = 1:12,
Stadium = rep(c("Stadio Olympico", "Anfield",
"Emirates"), 4))
# Get the unique stadium names in a vector
unique_stadiums <- df %>% pull(Stadium) %>% unique()
unique_stadiums
#> [1] "Stadio Olympico" "Anfield" "Emirates"
# Manually create a vector of country names corresponding to each element of
# the unique stadium name vector. Ordering matters here!
countries <- c("Italy", "England", "England")
# Place them both into a data.frame
lookup <- data.frame(Stadium = unique_stadiums, Country = countries)
# Join the country names to the original data on the stadium key
left_join(x = df, y = lookup, by = "Stadium")
#> Match.ID Stadium Country
#> 1 1 Stadio Olympico Italy
#> 2 2 Anfield England
#> 3 3 Emirates England
#> 4 4 Stadio Olympico Italy
#> 5 5 Anfield England
#> 6 6 Emirates England
#> 7 7 Stadio Olympico Italy
#> 8 8 Anfield England
#> 9 9 Emirates England
#> 10 10 Stadio Olympico Italy
#> 11 11 Anfield England
#> 12 12 Emirates England
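If relying on the order of two parallel vectors feels fragile, a named character vector can serve as the lookup instead, since each stadium is then written next to its country. A minimal base R sketch using the stadiums from the question (the object names are just illustrative):

```r
# Named vector: names are stadiums, values are countries
stadium_country <- c(
  "Anfield"         = "England",
  "Camp Nou"        = "Spain",
  "Stadio Olimpico" = "Italy",
  "Emirates"        = "England"
)

matches <- data.frame(
  Match.ID = 1:5,
  Stadium  = c("Anfield", "Camp Nou", "Stadio Olimpico", "Anfield", "Emirates"),
  stringsAsFactors = FALSE
)

# Index the lookup by stadium name; unname() drops the stadium labels
matches$Country <- unname(stadium_country[matches$Stadium])

matches
```

A stadium missing from the lookup comes back as NA, which makes gaps easy to spot.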

R package "acs": Get county name, FIPS?

In search of a solution to an unsolved problem, I came across the acs package. I assume there's no way within the choroplethr package to get any county information from data in the format [city, state]. That's why pre-processing with acs needs to be done.
I tried following code to get the county information on a city:
library(acs)
geo.lookup(state="CA", place="San Francisco")
> geo.lookup(state="CA", place="San Francisco")
state state.name county.name place place.name
1 6 California <NA> NA <NA>
2 6 California San Francisco County 67000 San Francisco city
3 6 California San Mateo County 73262 South San Francisco city
As we know, cities can be part of different counties. Most likely, I will go with the second
> geo.lookup(state="CA", place="San Francisco")[2,]
state state.name county.name place place.name
2 6 California San Francisco County 67000 San Francisco city
by default.
My question:
Is there a way to get the state abbreviation, county name and county FIPS, too? I could not find the answer in the documentation.
Also, for further processing (matching with choroplethr), the last "County" in county.name and "city" in place.name need to be removed.

Here's how to add the state abbreviation, county name, and county FIPS to your example. R has built-in variables for state names and state abbreviations. For the FIPS codes, I read a csv file from the Census Bureau's website.
library(acs)
library(tidyverse)
states <- cbind(state.name, state.abb) %>% as_tibble()
fips <-
read_csv(
"https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
col_names = c("state.abb", "statefp", "countyfp", "county.name", "classfp")
)
query <- geo.lookup(state = "CA", place = "San Francisco")[2, ] %>%
as_tibble() %>%
left_join(states, by = "state.name") %>%
left_join(fips, by = c("county.name", "state.abb"))
query
# # A tibble: 1 x 9
# state state.name county.name place place.name state.abb statefp countyfp classfp
# <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
# 1 6 California San Francisco County 67000 San Francisco city CA 06 075 H6
As you note at the end of your question, you may need to clean up this data a bit more to make it fit choroplethr.
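For that clean-up, something along these lines may be enough. It is only a sketch: `query` here is a hand-built one-row data frame mimicking the result above, and whether choroplethr needs exactly this shape is an assumption I have not verified. The five-digit county FIPS is the state code followed by the county code.

```r
# Hand-built stand-in for the one-row query result shown above
query <- data.frame(
  county.name = "San Francisco County",
  place.name  = "San Francisco city",
  statefp     = "06",
  countyfp    = "075",
  stringsAsFactors = FALSE
)

# Strip the trailing " County" and " city" suffixes
query$county.name <- sub(" County$", "", query$county.name)
query$place.name  <- sub(" city$", "", query$place.name)

# Full county FIPS = state FIPS followed by county FIPS
query$county.fips <- paste0(query$statefp, query$countyfp)

query
```

Anchoring the patterns with `$` means a name such as "County Line city" would only lose its final word.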

Merge two data frames by a max number condition in r

Cheers, I have a data frame df1 with the Major City with max visitors in 2011.
df1:
Country City Visitors_2011
UK London 100000
USA Washington D.C 200000
USA New York 100000
France Paris 100000
The other data frame df2 consists of top visited cities in the country for 2012:
df2:
Country City Visitors_2012
USA Washington D.C 200000
USA New York 100000
USA Los Angeles 100000
UK London 100000
UK Manchester 100000
France Paris 100000
France Nice 100000
The Output I would need is:
Logic: To obtain df3, merge df1 and df2 by Country and City and if you can't find city in df1 then add that volume to biggest city in df1.
Example: Los Angeles visitor count here is added to Washington D.C because Los Angeles is not present in df1 and Washington D.C has more visitors(2012) than New York.
df3:
Country City Visitors_2011 Visitors_2012
UK London 100000 200000
USA Washington D.C 200000 300000
USA New York 100000 100000
France Paris 100000 200000
Can anyone point me to the right direction?

Assume df1.txt and df2.txt contain your space-delimited dataframes.
Here is a solution in base R:
df1 <- read.table("df1.txt", header = T, stringsAsFactors = F);
df2 <- read.table("df2.txt", header = T, stringsAsFactors = F);
# Merge with all = TRUE, see ?merge
df <- merge(df1, df2, all = TRUE);
# Deal with missing values
tmp <- lapply(split(df, df$Country), function(x) {
# Make sure NA's are at the bottom
x <- x[order(x$Visitors_2011), ];
# Select first max Visitors_2012 entry
idx <- which.max(x$Visitors_2012);
# Add any NA's to max entry
x$Visitors_2012[idx] <- x$Visitors_2012[idx] + sum(x$Visitors_2012[is.na(x$Visitors_2011)]);
# Return dataframe
return(x[!is.na(x$Visitors_2011), ])});
# Bind list entries into dataframe
df <- do.call(rbind, tmp);
print(df);
Country City Visitors_2011 Visitors_2012
France France Paris 100000 200000
UK UK London 100000 200000
USA.6 USA New_York 100000 100000
USA.7 USA Washington_D.C 200000 300000

A dplyr approach:
library(dplyr)
max.cities <- df1 %>% group_by(Country) %>% summarise(City = City[which.max(Visitors_2011)])
result <- df2 %>% mutate(City=ifelse(City %in% df1$City, City,
max.cities$City[match(Country, max.cities$Country)])) %>%
group_by(Country,City) %>%
summarise(Visitors_2012=sum(Visitors_2012)) %>%
left_join(df1,., by=c("Country", "City"))
Notes:
First, compute the City that has the max visitors group_by Country in df1 and set that to a separate data frame max.cities.
mutate the City column in df2 so that if the City is in df1, then the name is unchanged; otherwise, the City from max.cities that matches the Country is used.
Once the City has been suitably modified, group_by both Country and City and sum up the Visitors_2012.
Finally, left_join with df1 by c("Country", "City") to get the final result.
The result using your posted data is as expected:
print(result)
## Country City Visitors_2011 Visitors_2012
##1 UK London 100000 200000
##2 USA Washington D.C 200000 300000
##3 USA New York 100000 100000
##4 France Paris 100000 200000
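For anyone who wants to run this end to end, here is a self-contained sketch that builds the two data frames from the question inline and applies the same idea with a join instead of `match()`. It assumes "biggest city" means the city with the most 2011 visitors in df1, as in the answer above; the names `top` and `totals` are just illustrative.

```r
library(dplyr)

df1 <- tibble(
  Country       = c("UK", "USA", "USA", "France"),
  City          = c("London", "Washington D.C", "New York", "Paris"),
  Visitors_2011 = c(100000, 200000, 100000, 100000)
)
df2 <- tibble(
  Country       = c("USA", "USA", "USA", "UK", "UK", "France", "France"),
  City          = c("Washington D.C", "New York", "Los Angeles",
                    "London", "Manchester", "Paris", "Nice"),
  Visitors_2012 = c(200000, 100000, 100000, 100000, 100000, 100000, 100000)
)

# Largest 2011 city per country (assumed meaning of "biggest city")
top <- df1 %>%
  group_by(Country) %>%
  slice_max(Visitors_2011, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Country, top_city = City)

# Reassign cities missing from df1 to that top city, then total per city
totals <- df2 %>%
  left_join(top, by = "Country") %>%
  mutate(City = if_else(City %in% df1$City, City, top_city)) %>%
  group_by(Country, City) %>%
  summarise(Visitors_2012 = sum(Visitors_2012), .groups = "drop")

df3 <- left_join(df1, totals, by = c("Country", "City"))
df3
```

Note that `City %in% df1$City` ignores the country, which is fine for this data but would need a join on both columns if the same city name appeared in two countries.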

Harvest (rvest) multiple HTML pages from a list of urls

I have a dataframe that looks like this:
country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
country link
1 Canada http://en.wikipedia.org/wiki/United_States
2 US http://en.wikipedia.org/wiki/Canada
3 Japan http://en.wikipedia.org/wiki/Japan
4 China http://en.wikipedia.org/wiki/China
Using rvest I'd like to scrape the table of contents for each url and bind them to one single output.
This code extracts the table of contents for one url:
library(rvest)
toc <- read_html(url[1]) %>%
html_nodes(".toctext") %>%
html_text()
Desired Output:
country toc
US Etymology
History
Native American and European contact
Settlements
...
Canada Etymology
History
Aboriginal peoples
European colonization
...etc

This will scrape them into a full data frame (one row per TOC entry). Tedious-but-straightforward "print/output" code left to the OP:
library(rvest)
library(dplyr)
country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
"http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan",
"http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
# Fetch and parse each page in turn; x is the current URL
bind_rows(lapply(url, function(x) {
  data.frame(url = x,
             toc_entry = read_html(x) %>%
               html_nodes(".toctext") %>%
               html_text(),
             stringsAsFactors = FALSE)
})) -> toc_entries
# Attach the country label that belongs to each URL
df <- toc_entries %>% left_join(df, by = "url")
# Inspect 10 random rows (columns: url, toc_entry, country)
df[sample(nrow(df), 10),]
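The same per-page pattern can be tested offline. In this sketch the `pages` vector holds small hand-written HTML snippets standing in for the live Wikipedia pages, and `scrape_toc()` is a hypothetical helper name; with real URLs you would pass the URL to `read_html()` instead.

```r
library(rvest)

# Hypothetical inline HTML standing in for live pages, keyed by country
pages <- c(
  US     = '<ul><li><span class="toctext">Etymology</span></li><li><span class="toctext">History</span></li></ul>',
  Canada = '<ul><li><span class="toctext">Etymology</span></li></ul>'
)

# Extract the table-of-contents entries from one page
scrape_toc <- function(source) {
  read_html(source) %>%
    html_nodes(".toctext") %>%
    html_text()
}

# One character vector of entries per country
toc_list <- lapply(pages, scrape_toc)

# Stack into one row per TOC entry, repeating the country label as needed
toc_df <- data.frame(
  country   = rep(names(toc_list), lengths(toc_list)),
  toc_entry = unlist(toc_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

toc_df
```

Building the data frame with `rep()` and `lengths()` also copes with a page that has no matching entries, which simply contributes zero rows.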
