Get world region name from a country name in R

In my data I have one column with country names. I want to make a new variable that lists which region each country is in, based on an Excel sheet in which I have labelled each country by region.
I don't want to use the package countrycode because its regions aren't specific enough (e.g. it labels the Netherlands as Europe, and not Northern Europe). Is there a way to get R to inspect a cell and match the contents of that cell to another dataset?

Import your spreadsheet into R. (Use RExcel, or export as CSV and import that using base functions.) Suppose your spreadsheet has two columns, named Country and Region, something like this:
regions <- data.frame(Country = c("Greece", "Netherlands"),
                      Region = c("Southern Europe", "Northern Europe"),
                      stringsAsFactors = FALSE)
regions
#> Country Region
#> 1 Greece Southern Europe
#> 2 Netherlands Northern Europe
Now create a named vector from the dataframe:
named <- regions$Region
names(named) <- regions$Country
named
#> Greece Netherlands
#> "Southern Europe" "Northern Europe"
Now you can index the named vector to convert country names to regions in any other vector.
other <- c("Netherlands", "Greece", "Greece")
named[other]
#> Netherlands Greece Greece
#> "Northern Europe" "Southern Europe" "Southern Europe"
If you have any missing countries (or variant spellings), you'll get NA for the region, e.g.
other2 <- c("Greece", "France")
named[other2]
#> Greece <NA>
#> "Southern Europe" NA
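To attach the regions to your own data as a new column, the same lookup can be written with match(). This is only a minimal sketch: mydata and its country column are made-up names standing in for your data, and regions is the two-column table from above.

```r
# Lookup table as imported from the spreadsheet (same as above)
regions <- data.frame(Country = c("Greece", "Netherlands"),
                      Region = c("Southern Europe", "Northern Europe"),
                      stringsAsFactors = FALSE)

# Hypothetical data set with a column of country names
mydata <- data.frame(country = c("Netherlands", "France", "Greece"),
                     stringsAsFactors = FALSE)

# match() returns the row of 'regions' for each country, or NA if absent
mydata$region <- regions$Region[match(mydata$country, regions$Country)]

# Countries that still need an entry (or a spelling fix) in the spreadsheet
unmatched <- unique(mydata$country[is.na(mydata$region)])
unmatched
```

Checking unmatched before going further is a cheap way to catch variant spellings such as "Holland" versus "Netherlands".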

The rnaturalearth library has country shapefiles with region and subregion.
library(rnaturalearth)
world <- ne_countries(returnclass = "sf")
# Column names as I recall them from Natural Earth; check names(world) to confirm
world$region_un
world$subregion

Related

Assigning Value to New Variable Based on Specific Values in Another Variable in R

I have a data.frame that contains state names and I would like to create a new variable called "region" in which a value is assigned based on the state that is found under the "state" variable.
For example, if the state variable has "Alabama" or "Georgia", I would like to have "Region" assigned as "South". If state is "Washington" or "California", I would like it assigned to "West". I have to do this for each of the 48 contiguous U.S. states, and I'm having difficulty figuring out the best way to do this. Any help in this (I'm sure simple) procedure would be great. What I am looking for is something like this in the end:
State Region
Wyoming West
Michigan Midwest
Alabama South
Georgia South
California West
Texas Central
And to be clear, I don't have the regions in a separate file; I have to create this as a new variable and create the region names myself. I'm just looking for a way for the code to go through all 3000 lines that I have and automatically assign the region name once I tell it how to do so.
Rather than type the region for every state, you can use the built-in "state.name" and "state.region" variables from the 'datasets' package (like Jon Spring suggests in his comment), e.g.
library(tidyverse)
library(datasets)
state_lookup_table <- data.frame(name = state.name,
                                 region = state.region)
my_df <- data.frame(place = c("Washington", "California"),
                    value = c(1000, 2000))
my_df
#> place value
#> 1 Washington 1000
#> 2 California 2000
my_df %>%
  left_join(state_lookup_table, by = c("place" = "name"))
#> place value region
#> 1 Washington 1000 West
#> 2 California 2000 West
Created on 2022-09-02 by the reprex package (v2.0.1)
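The same lookup can also be sketched without dplyr, using a named vector built from state.name and state.region. The my_df below is a made-up example (including a deliberately unknown "Atlantis"), not the asker's data.

```r
# state.name and state.region ship with R's 'datasets' package
lookup <- setNames(as.character(state.region), state.name)

# Hypothetical data; "Atlantis" is included to show what a miss looks like
my_df <- data.frame(place = c("Washington", "Alabama", "Atlantis"),
                    value = c(1000, 2000, 3000),
                    stringsAsFactors = FALSE)

# Index the named vector by state name; unknown names come back as NA
my_df$region <- unname(lookup[my_df$place])
my_df
```

Note that state.region uses the four Census regions, so a custom scheme with a "Central" region would still need its own lookup table.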
I would go this way:
df <- data.frame(name = c("john", "will", "thomas", "Ali"),
                 state = c("California", "Alabama", "Washington", "Georgia"))
region_df <- data.frame(state = c("Alabama", "Georgia", "Washington"),
                        region = c("south", "south", "west"))
# merge() takes 'by', not 'on'; all.x = TRUE keeps states with no region (as NA)
merged.df <- merge(df, region_df, all.x = TRUE, by = "state")
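One thing to be aware of with merge() is that it sorts the result by the key column by default, so the rows no longer come back in their original order. A possible workaround, sketched here with a helper column named row_id (my own name, not part of the answer above), is to record the row order first and restore it afterwards.

```r
df <- data.frame(name = c("john", "will", "thomas", "Ali"),
                 state = c("California", "Alabama", "Washington", "Georgia"),
                 stringsAsFactors = FALSE)
region_df <- data.frame(state = c("Alabama", "Georgia", "Washington"),
                        region = c("south", "south", "west"),
                        stringsAsFactors = FALSE)

# Remember the original row order before merging
df$row_id <- seq_len(nrow(df))

merged <- merge(df, region_df, by = "state", all.x = TRUE)

# Restore the original order and drop the helper column
merged <- merged[order(merged$row_id), ]
merged$row_id <- NULL
rownames(merged) <- NULL
merged
```

California has no entry in region_df, so its region is NA, which makes missing states easy to spot.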
I think you need a reference table to do so. For your specific question, a dict-like lookup would be the best solution; in R that is a named vector.
ref_ge <- character()
ref_ge["Georgia"] <- "South"
ref_ge["Alabama"] <- "South"
ref_ge["California"] <- "West"
ref_ge["Georgia"]
# Or, if you can read the state -> region information from an Excel file into a data frame
df <- data.frame(state = c("Georgia", "Alabama", "California"),
                 region = c("South", "South", "West"))
ref2 <- df$region
names(ref2) <- df$state
ref2["Georgia"]

How to group a column with character values in a new column in r

I have a data set with a countries column. I want to create a new column that classifies the countries into the following categories: first world, second world, and third world countries.
I'm relatively new to R and I'm finding it difficult to find a proper function that deals with characters!
My dataset contains the countries like this, and I have three vectors with a list of countries as shown below:
nt_final_table$`Country name`
#[1] "Finland" "Denmark" "Switzerland"
#[4] "Iceland" "Netherlands" "Norway"
#[7] "Sweden" "Luxembourg" "New Zealand"
#[10] "Austria" "Australia" "Israel"
first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea",
"Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
Second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
Third_world_countries <- c("Somalia","Niger","South Sudan")
I would want a new column that contains the following values :
First World, Second World, Third World based on the Country name column
Any help would be appreciated!
Thanks!
Here are 2 ways you could do this.
Using dplyr package
You could use case_when from the dplyr package to do this.
library(dplyr)
country_name <-c("Finland", "Denmark", "Switzerland","Iceland", "Netherlands", "Norway", "Sweden", "Luxembourg", "New Zealand",
"Austria", "Australia", "Israel")
nt_final_table <- data.frame(country_name)
first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea", "Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
third_world_countries <- c("Somalia","Niger","South Sudan")
nt_final_table_categorized <- nt_final_table %>%
  mutate(category = case_when(country_name %in% first_world_countries ~ "First",
                              country_name %in% second_world_countries ~ "Second",
                              country_name %in% third_world_countries ~ "Third",
                              TRUE ~ "Not listed"))
nt_final_table_categorized
Sample output
country_name category
1 Finland Not listed
2 Denmark First
3 Switzerland First
4 Iceland First
5 Netherlands First
6 Norway First
7 Sweden First
8 Luxembourg First
9 New Zealand First
10 Austria First
11 Australia First
12 Israel First
Using base R
In base R we could create a data frame that lists the countries and their category then use merge to perform a left-join on the 2 dataframes.
country_name <-c("Finland", "Denmark", "Switzerland","Iceland", "Netherlands", "Norway", "Sweden", "Luxembourg", "New Zealand",
"Austria", "Australia", "Israel")
nt_final_table <- data.frame(country_name)
first_world_countries <- c("Australia","Austria","Belgium","Canada","Denmark","France","Germany","Greece","Iceland","Ireland","Israel","Italy","Japan","Luxembourg","Netherlands","New Zealand","Norway","Portugal","South Korea", "Spain","Sweden","Switzerland","Turkey","United Kingdom","USA")
second_world_countries <- c("Albania","Armenia","Azerbaijan","Belarus","Bosnia and Herzegovina","Bulgaria","China","Croatia","Cuba","Czech Republic","EastGermany","Estonia","Georgia","Hungary","Kazakhstan","Kyrgyzstan","Laos","Poland","Romania","Russia","Serbia","Slovakia","Slovenia","Tajikistan","Turkmenistan","Ukraine","Uzbekistan","Vietnam")
third_world_countries <- c("Somalia","Niger","South Sudan")
country_name <- c(first_world_countries,second_world_countries,third_world_countries)
categories <- c(rep("First", length(first_world_countries)),
                rep("Second", length(second_world_countries)),
                rep("Third", length(third_world_countries)))
all_countries_categorised <- data.frame(country_name, categories)
nt_final_table_categorized <- merge(nt_final_table, all_countries_categorised, by = "country_name", all.x = TRUE)
nt_final_table_categorized
Sample output
country_name categories
1 Australia First
2 Austria First
3 Denmark First
4 Finland <NA>
5 Iceland First
6 Israel First
7 Luxembourg First
8 Netherlands First
9 New Zealand First
10 Norway First
11 Sweden First
12 Switzerland First
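A more compact way to build the lookup table for the base R approach is to keep the groups in a named list and flatten it with stack(). The sketch below uses shortened, illustrative country lists rather than the full vectors from the question.

```r
# Shortened, illustrative groups; in practice use the full vectors
groups <- list(First  = c("Denmark", "Norway"),
               Second = c("Poland"),
               Third  = c("Niger"))

# stack() turns the named list into a two-column data frame (values, ind)
lookup <- stack(groups)
names(lookup) <- c("country_name", "category")
lookup$category <- as.character(lookup$category)

countries <- c("Finland", "Denmark", "Poland")

# Look up each country; label anything unmatched explicitly
category <- lookup$category[match(countries, lookup$country_name)]
category[is.na(category)] <- "Not listed"
category
```

Because match() works position by position, this keeps the original row order, unlike merge(), which sorts by the key column.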

Get continent name from country name in R

I have a data frame with one column representing country names. My goal is to add one more column which gives the continent information. Please check the following use case:
my.df <- data.frame(country = c("Afghanistan","Algeria"))
Is there a package that I can use to append a column of data containing the continent names without having the original data?
You can use the countrycode package for this task.
library(countrycode)
df <- data.frame(country = c("Afghanistan",
                             "Algeria",
                             "USA",
                             "France",
                             "New Zealand",
                             "Fantasyland"))
df$continent <- countrycode(sourcevar = df[, "country"],
                            origin = "country.name",
                            destination = "continent")
#warning
#In countrycode(sourcevar = df[, "country"], origin = "country.name", :
# Some values were not matched unambiguously: Fantasyland
Result
df
# country continent
#1 Afghanistan Asia
#2 Algeria Africa
#3 USA Americas
#4 France Europe
#5 New Zealand Oceania
#6 Fantasyland <NA>
Expanding on Markus' answer: countrycode takes its continents from the 'continent' column of its codelist data.
?codelist
Definition of continent:
continent: Continent as defined in the World Bank Development Indicators
The question asked for continents, but sometimes continents don't provide enough groups to delineate the data. For example, the continent scheme groups North and South America together as Americas.
What you might want is region:
region: Regions as defined in the World Bank Development Indicators
It is unclear how the World Bank groups regions but the below code shows how this destination is more granular.
library(countrycode)
egnations <- c("Afghanistan","Algeria","USA","France","New Zealand","Fantasyland")
countrycode(sourcevar = egnations, origin = "country.name",destination = "region")
Output:
[1] "Southern Asia"
[2] "Northern Africa"
[3] "Northern America"
[4] "Western Europe"
[5] "Australia and New Zealand"
[6] NA
You can try
my.df <- data.frame(country = c("Afghanistan","Algeria"),
                    continent = as.factor(c("Asia","Africa")))
merge(my.df, raster::ccodes()[, c("NAME", "CONTINENT")], by.x = "country", by.y = "NAME", all.x = TRUE)
# country continent CONTINENT
# 1 Afghanistan Asia Asia
# 2 Algeria Africa Africa
Some country values might need an adjustment; I dunno since you did not provide all values.

aggregates variables into new variable

I have a column in a dataframe which includes 30 different countries. I want to group these countries into 5 new values.
For example,
I have
China
Japan
US
Canada
....
Aggregate to new variables:
Asia
Asia
North America
North America
....
One solution I am thinking about is using nested ifelse. However it seems that I need 4 or 5 nested ifelse to get what I need. I don't think that's a good way. I want to know other efficient solutions.
One option would be to use a key/value dataset. The countrycode_data from the library(countrycode) can be used for this purpose. We match the 'country.name' column in 'countrycode_data' with the example data column ('Col1'). If there are no matches, it will return NA. Using the OP's example, 'US' returns NA as the 'country.name' is 'United States'. But, we can get the abbreviated form using the 'cowc' column. However, the abbreviated version is also USA, which we can find using grep. I would suggest to grep all NA elements in 'indx'. The 'indx' can be used for returning 'region' from the 'countrycode_data'.
library(countrycode)
indx <- match(df1$Col1, countrycode_data$country.name)
pat <- paste0('^',paste(df1$Col1[is.na(indx)], collapse='|'))
indx[is.na(indx)] <- grep(pat, countrycode_data$cowc)
countrycode_data$region[indx]
#[1] "Eastern Asia" "Eastern Asia" "Northern America" "Northern America"
NOTE: This returns a region that is a bit more specific than the general 'Asia'.
If we use the 'continent' column,
countrycode_data$continent[indx]
#[1] "Asia" "Asia" "Americas" "Americas"
data
df1 <- structure(list(Col1 = c("China", "Japan", "US", "Canada")),
                 .Names = "Col1", class = "data.frame", row.names = c(NA, -4L))
Another approach is to use the recode function from the car package:
library(car)
dat$Region <- recode(dat$Country, "c('China', 'Japan') = 'Asia'; c('US','Canada') = 'North America'")
Country Region
1 China Asia
2 Japan Asia
3 US North America
4 Canada North America
There are only 30 countries, so you can make a few vectors as shown below, create a new column, and replace its values according to the vectors. Note that %in% is case-sensitive, so the spellings must match the data exactly.
asia <- c("China", "Japan")
NorthAmerica <- c("US", "Canada")
df$continent <- df$countries
df$continent <- with(df, replace(continent, countries%in%asia,"Asia"))
df$continent <- with(df, replace(continent, countries%in%NorthAmerica,"North America"))
'continent' is a built-in destination code of the countrycode package. You can pass a vector of country names and get a vector of continent names back with...
library(countrycode)
countries <- c('China', 'Japan', 'US', 'Canada')
countrycode(countries, 'country.name', 'continent')
returns...
[1] "Asia" "Asia" "Americas" "Americas"
When using Veera's and Jay's approaches, make sure the column is a character vector rather than a factor. Assigning a value that is not already one of a factor's levels produces NA with a warning, so convert first:
df$continent <- as.character(df$countries)
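If the column really is a factor, another base R option is to collapse its levels directly by assigning a named list to levels(). This is a sketch with made-up data; countries and continent are illustrative names.

```r
# Made-up factor of country names
countries <- factor(c("China", "Japan", "US", "Canada", "China"))

# Copy the factor, then map old levels onto new ones:
# each list name is a new level, each element lists the old levels it absorbs
continent <- countries
levels(continent) <- list(Asia = c("China", "Japan"),
                          "North America" = c("US", "Canada"))

as.character(continent)
```

Any old level that is not mentioned in the list becomes NA, which doubles as a check that all 30 countries were assigned to a group.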

Isolating partial text in r data frame

I have an R data frame that contains U.S. state and county names in one column. The data is in the format:
United States - State name - County name
where each cell is a unique county. For example:
United States - North Carolina - Wake County
United States - North Carolina - Warren County
etc.
I need to break the column into 2 columns, one containing just the state name and the other containing just the county name. I've experimented with sub and gsub but am getting no results. I understand this is probably a simple matter for R experts but I'm a newbie. I would be most grateful if anyone can point me in the right direction.
You can use tidyr's separate function:
library(tidyr)
df <- separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
This assumes the data is as you show in your question (including United States as the country), that your data frame is called df, and that the column holding the data is called currentColumn.
Example:
df <- data.frame(currentColumn = c("United States - North Carolina - Wake County",
                                   "United States - North Carolina - Warren County"),
                 val = rnorm(2))
df
# currentColumn val
#1 United States - North Carolina - Wake County 0.8173619
#2 United States - North Carolina - Warren County 0.4941976
separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
# Country State County val
#1 United States North Carolina Wake County 0.8173619
#2 United States North Carolina Warren County 0.4941976
Using read.table, and assuming your data is in df$var as a character column:
read.table(text = df$var, sep = "-", strip.white = TRUE,
           col.names = c("Country", "State", "County"))
If speed is an issue, then strsplit will be a lot quicker:
setNames(data.frame(do.call(rbind, strsplit(df$var, split = " - "))),
         c("Country", "State", "County"))
Both give:
# Country State County
#1 United States North Carolina Wake County
#2 United States North Carolina Warren County
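Since the question mentions trying sub and gsub, here is one way that approach could look. It is a sketch that assumes every value has exactly the three-part "Country - State - County" layout shown in the question; x is a made-up vector standing in for the column.

```r
x <- c("United States - North Carolina - Wake County",
       "United States - North Carolina - Warren County")

# Capture the middle field between the two " - " separators
state <- sub("^[^-]* - (.*) - .*$", "\\1", x)

# Drop everything up to and including the last " - "
county <- sub("^.* - ", "", x)

data.frame(State = state, County = county, stringsAsFactors = FALSE)
```

Anchoring on " - " with the surrounding spaces, rather than on a bare hyphen, is a little safer if a name itself contains a hyphen.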
