How do I combine two data frames one of which contains nested lists in R? - r

Specifically, I'm trying to combine the two data frames UN_M.49_Countries and UN_M.49_Regions; the latter contains the country codes in nested lists.
> UN_M.49_Countries
Code Name ISO_Alpha_3
1 004 Afghanistan AFG
2 248 Åland Islands ALA
3 008 Albania ALB
...
> UN_M.49_Regions
Code Name Parent Children Type
1 001 World 002, 019, 010, 142, 150, 009 Region
2 002 Africa 001 015, 202 Region
3 015 Northern Africa 002 012, 818, 434, 504, 729, 788, 732 Region
...
I would like to build a new table which adds two columns to UN_M.49_Countries.
> new_table
Code Name ISO_Alpha_3 Region Subregion
1 004 Afghanistan AFG Asia Southern Asia
2 248 Åland Islands ALA Europe Northern Europe
3 008 Albania ALB Europe Southern Europe
...
I am new to programming and R and, to be honest, I do not even know where to start. Any help would be much appreciated!
install.packages("ISOcodes")
library(ISOcodes)
UN_M.49_Countries
UN_M.49_Regions

To get the countries of a specific region, change "Southern Europe" to any region name you like; if you skip the subsetting step, you get the whole world.
Check out the package documentation:
https://cran.r-project.org/web/packages/ISOcodes/ISOcodes.pdf
data("UN_M.49_Regions")
data("UN_M.49_Countries")
region <- subset(UN_M.49_Regions, Name == "Southern Europe")
codes <- unlist(strsplit(region$Children, ", "))
subset(UN_M.49_Countries, Code %in% codes)
Using the tidyverse:
library(ISOcodes)
library(tidyverse)  # attaches dplyr, tidyr and stringr
countries <- UN_M.49_Countries
regions <- UN_M.49_Regions
# one row per region/child pair; split on ", " so the codes carry no leading space
region_focused <- regions %>%
  mutate(codes = str_split(Children, ", ")) %>%
  unnest(codes) %>%
  left_join(countries, by = c("codes" = "Code"))
# same join, but keeping every country
country_focused <- regions %>%
  mutate(codes = str_split(Children, ", ")) %>%
  unnest(codes) %>%
  right_join(countries, by = c("codes" = "Code"))
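Neither snippet above builds the Region and Subregion columns asked for in the question. The following is a minimal base R sketch of one way to do it. It assumes tiny hand-made stand-ins for the two ISOcodes tables (only the Code, Name and Children columns, with shortened Children lists), and names such as parent_of are made up for illustration.

```r
# Toy stand-ins for UN_M.49_Countries and UN_M.49_Regions (assumed structure,
# shortened Children lists; not the real package data).
countries <- data.frame(
  Code = c("004", "008"),
  Name = c("Afghanistan", "Albania"),
  stringsAsFactors = FALSE
)
regions <- data.frame(
  Code = c("142", "034", "150", "039"),
  Name = c("Asia", "Southern Asia", "Europe", "Southern Europe"),
  Children = c("034", "004", "039", "008"),
  stringsAsFactors = FALSE
)

# Split the comma-separated Children strings into one child code per row.
children <- strsplit(regions$Children, ", ", fixed = TRUE)
parent_of <- data.frame(
  child = unlist(children),
  parent_code = rep(regions$Code, lengths(children)),
  parent_name = rep(regions$Name, lengths(children)),
  stringsAsFactors = FALSE
)

# First hop: country code -> subregion. Second hop: subregion code -> region.
sub_idx <- match(countries$Code, parent_of$child)
reg_idx <- match(parent_of$parent_code[sub_idx], parent_of$child)
countries$Region <- parent_of$parent_name[reg_idx]
countries$Subregion <- parent_of$parent_name[sub_idx]
print(countries)
```

match() only returns the first hit, so if a code appears under more than one parent in the real table, the regions data frame may need to be filtered (for example by its Type column) before building the lookup.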

Related

Joining two dataframes to plot a map with ggplot2

I want to make a world map visualization using a data frame that looks like this:
Country Year Sex Age Suicides Population Suicides_per_100k Country_Year HDI/Year Year_GDP
1 Albania 1987 Male 15-24 years 21 312900 6.71 Albania1987 NA 2156624900
2 Albania 1987 Male 35-54 years 16 308000 5.19 Albania1987 NA 2156624900
3 Albania 1987 Female 15-24 years 14 289700 4.83 Albania1987 NA 2156624900
4 Albania 1987 Male 75+ years 1 21800 4.59 Albania1987 NA 2156624900
5 Albania 1987 Male 25-34 years 9 274300 3.28 Albania1987 NA 2156624900
6 Albania 1987 Female 75+ years 1 35600 2.81 Albania1987 NA 2156624900
GDP_Per_Capita Generation Continent
1 796 Generation X Europe
2 796 Silent Europe
3 796 Generation X Europe
4 796 G.I. Generation Europe
5 796 Boomers Europe
6 796 G.I. Generation Europe
I tried to use the following code:
world <- ggplot2::map_data('world')
worldstart <- left_join(df,world,by = c("Country"="region"))
This code created a new dataframe with 14 million observations.
But I'd like to keep the same number of rows as the dataset "df".
What is the best approach?
Indeed, the map_data function returns one row for each point of each polygon in the world (roughly 100k rows in total), so every row of df is matched to all the points of its country. As mentioned earlier, you cannot choose which point to keep.
You can use the sf package to get around this difficulty, keeping the geometry (here, multipolygons) on one side and your data on the other.
My proposal would be the following:
library(dplyr)
library(sf)
library(ggplot2)
df <- tibble(Country = "Albania",
             GDP_per_Capita = 796)
world <- maps::map('world', plot = FALSE, fill = TRUE) %>% st_as_sf(stringsAsFactors = FALSE)
world_df <- df %>%
  left_join(world, by = c("Country" = "ID"))
In my example, you have only one row of data, but the geometry column contains all the information needed for plotting.
sf and ggplot2 work well together, so you are good to go.
Best regards
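To see why the original join grew so large, here is a small base R sketch of a one-to-many join. The outline table and its coordinates are made up for illustration; only the Suicides values come from the question's data.

```r
# Three data rows for one country (values taken from the question).
df <- data.frame(
  Country = c("Albania", "Albania", "Albania"),
  Suicides = c(21, 16, 14),
  stringsAsFactors = FALSE
)
# A hypothetical outline with four boundary points (made-up coordinates),
# standing in for the many points per country that map_data() returns.
outline <- data.frame(
  region = rep("Albania", 4),
  long = c(19.3, 20.1, 21.0, 20.0),
  lat = c(41.9, 42.5, 40.9, 39.7),
  stringsAsFactors = FALSE
)

# Every data row is paired with every boundary point: 3 x 4 = 12 rows.
joined <- merge(df, outline, by.x = "Country", by.y = "region")
print(nrow(joined))

# Summarising to one row per country first keeps the join small.
per_country <- aggregate(Suicides ~ Country, data = df, FUN = sum)
small_join <- merge(per_country, outline, by.x = "Country", by.y = "region")
print(nrow(small_join))
```

Whether to summarise (and how) depends on what the map should show; the sum used here is only one possible choice.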

create a variable in a dataframe based on another matrix on R

I am having some problems with the following task
I have a data frame of this type with 99 different countries for thousands of IDs
ID Nationality var 1 var 2 ....
1 Italy //
2 Eritrea //
3 Italy //
4 USA
5 France
6 France
7 Eritrea
....
I want to add a variable corresponding to a given macroregion of Nationality
so I created a matrix of this kind with the rule to follow
Nationality Continent
Italy Europe
Eritrea Africa
Usa America
France Europe
Germany Europe
....
I'd like to obtain this
ID Nationality var 1 var 2 Continent
1 Italy // Europe
2 Eritrea // Africa
3 Italy // Europe
4 USA America
5 France Europe
6 France Europe
7 Eritrea Africa
....
I was trying with this command
datasubset <- merge(dataset , continent.matrix )
but it doesn't work; it reports the following error:
Error: cannot allocate vector of size 56.6 Mb
That seems very strange to me. Applying this code to a subset doesn't work either. Do you have any suggestions on how to proceed?
Thank you very much in advance for your help. I hope my question doesn't sound too trivial, but I am quite new to R.
You can do this with the left_join function from the dplyr package:
library(dplyr)
df <- tibble(ID=c(1,2,3),
Nationality=c("Italy", "Usa", "France"),
var1=c("a", "b", "c"),
var2=c(4,5,6))
nat_cont <- tibble(Nationality=c("Italy", "Eritrea", "Usa", "Germany", "France"),
Continent=c("Europe", "Africa", "America", "Europe", "Europe"))
df_2 <- left_join(df, nat_cont, by=c("Nationality"))
The output:
> df_2
# A tibble: 3 x 5
ID Nationality var1 var2 Continent
<dbl> <chr> <chr> <dbl> <chr>
1 1 Italy a 4 Europe
2 2 Usa b 5 America
3 3 France c 6 Europe
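Note that the question's data spells the country "USA" while the lookup table has "Usa", and an exact join would leave such rows without a continent. A minimal base R sketch of a lookup that ignores case is shown below; the data frames are cut-down versions of the question's tables, and lowercasing the key is an assumption about how the two spellings should be reconciled.

```r
dataset <- data.frame(
  ID = 1:4,
  Nationality = c("Italy", "Eritrea", "Italy", "USA"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  Nationality = c("Italy", "Eritrea", "Usa", "France"),
  Continent = c("Europe", "Africa", "America", "Europe"),
  stringsAsFactors = FALSE
)

# match() returns one position per row of dataset, so the row count and
# row order of dataset never change; unmatched names give NA.
idx <- match(tolower(dataset$Nationality), tolower(lookup$Nationality))
dataset$Continent <- lookup$Continent[idx]
print(dataset)
```

Because the result is assigned as a new column, this approach cannot multiply rows the way a join on a non-unique key can.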

Find groups that contain all elements, but do not overlap [closed]

I've been given a set of country groups and I'm trying to get a set of mutually exclusive regions so that I can compare them. The problem is that my data contains several groups, many of which overlap. How can I get a set of groups which contain all countries, but do not overlap with each other?
For example, assume that this is the list of countries in the world:
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")
Assume that this is my set of groups:
df <- data.frame(group = c("Africa", "Western Europe", "Europe", "Europe", "Oceania", "Oceania", "Commonwealth Countries"),
element = c("Angola", "France", "Germany", "France", "Australia", "New Zealand", "Australia"))
group element
1 Africa Angola
2 Western Europe France
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
7 Commonwealth Countries Australia
How could I remove overlapping groups (in this case Western Europe) to get a set of groups that contains all countries like the following:
df_solved <- data.frame(group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
element = c("Angola", "France", "Germany", "Australia", "New Zealand"))
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
One possible rule could be to minimize the number of groups, e.g. to associate an element with that group which includes the most elements.
library(data.table)
setDT(df)[, n.elements := .N, by = group][
order(-n.elements), .(group = group[1L]), by = element]
element group
1: Germany Europe
2: France Europe
3: Australia Oceania
4: New Zealand Oceania
5: Angola Africa
Explanation
setDT(df)[, n.elements := .N, by = group][]
returns
group element n.elements
1: Africa Angola 1
2: Western Europe France 1
3: Europe Germany 2
4: Europe France 2
5: Oceania Australia 2
6: Oceania New Zealand 2
7: Commonwealth Countries Australia 1
Now, the rows are ordered by decreasing number of elements and for each country the first, i.e., the "largest", group is picked. This should return a group for each country as requested.
In case of ties, i.e., several groups contain equally many elements, you can add additional criteria when ordering, e.g., length of the group name, or just alphabetical order.
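The tie-breaking idea can be sketched in base R as follows. Breaking ties by alphabetical order of the group name is just one of the options mentioned above, and the helper column n is a name chosen for illustration.

```r
df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe",
            "Oceania", "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France",
              "Australia", "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)

# Number of elements in each row's group.
df$n <- ave(seq_along(df$group), df$group, FUN = length)

# Largest groups first; ties broken by alphabetical group name.
ordered <- df[order(-df$n, df$group), ]

# Keep the first (i.e. preferred) group for every element.
result <- ordered[!duplicated(ordered$element), c("element", "group")]
print(result)
```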
1) If you want to simply eliminate duplicate elements then use !duplicated(...) as shown. No packages are used.
subset(df, !duplicated(element))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
2) set partitioning If each group must be wholly in or wholly out and each element may only appear once then this is a set partitioning problem:
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, "=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
3) set covering Of course there may be no exact set partition, so we could consider the set covering problem (same code except that "=" is replaced by ">=" in the lp line):
library(lpSolve)
const.mat <- with(df, table(element, group))
obj <- rep(1L, ncol(const.mat))
res <- lp("min", obj, const.mat, ">=", 1L, all.bin = TRUE)
subset(df, group %in% colnames(const.mat[, res$solution == 1]))
giving:
group element
1 Africa Angola
2 Europe France
3 Europe Germany
5 Oceania Australia
6 Oceania New Zealand
and we could optionally then apply (1) to remove any duplicates in the cover.
4) Non-dominated groups Another approach is to remove any group whose elements form a strict subset of the elements of some other group. For example, every element in Western Europe is in Europe and Europe has more elements than Western Europe so the elements of Western Europe are a strict subset of the elements of Europe and we remove Western Europe. Using const.mat from above:
# returns TRUE if jth column of const.mat is dominated by some other column
is_dom_fun <- function(j) any(apply(const.mat[, j] <= const.mat[, -j], 2, all) &
sum(const.mat[, j]) < colSums(const.mat[, -j]))
is_dom <- sapply(seq_len(ncol(const.mat)), is_dom_fun)
subset(df, group %in% colnames(const.mat)[!is_dom])
giving:
group element
1 Africa Angola
3 Europe Germany
4 Europe France
5 Oceania Australia
6 Oceania New Zealand
If there are any duplicates left we can use (1) to remove them.
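Whichever approach is used, it can help to check that the result really contains every country exactly once. The helper below is a small base R sketch; is_partition is a hypothetical name, not a function from any package.

```r
World <- c("Angola", "France", "Germany", "Australia", "New Zealand")

# TRUE if no element appears twice and the elements are exactly `world`.
is_partition <- function(d, world) {
  !anyDuplicated(d$element) && setequal(d$element, world)
}

df <- data.frame(
  group = c("Africa", "Western Europe", "Europe", "Europe",
            "Oceania", "Oceania", "Commonwealth Countries"),
  element = c("Angola", "France", "Germany", "France",
              "Australia", "New Zealand", "Australia"),
  stringsAsFactors = FALSE
)
df_solved <- data.frame(
  group = c("Africa", "Europe", "Europe", "Oceania", "Oceania"),
  element = c("Angola", "France", "Germany", "Australia", "New Zealand"),
  stringsAsFactors = FALSE
)

print(is_partition(df, World))        # overlapping groups
print(is_partition(df_solved, World)) # the desired result
```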
library(dplyr)
df %>% distinct(element, .keep_all=TRUE)
group element
1 Africa Angola
2 Europe France
3 Europe Germany
4 Oceania Australia
5 Oceania New Zealand
Shoutout to Axeman for beating me with this answer.
Update
Your question is ill-defined. Why is 'Europe' preferred over 'Western Europe'? Put another way, each country is assigned several groups. You want to reduce it to one group per country. How do you decide which group?
Here's one way, we always prefer the biggest:
groups <- df %>% count(group)
df %>% inner_join(groups, by='group') %>%
arrange(desc(n)) %>% distinct(element, .keep_all=TRUE)
group element n
1 Europe France 2
2 Europe Germany 2
3 Oceania Australia 2
4 Oceania New Zealand 2
5 Africa Angola 1
Here is one option with data.table
library(data.table)
setDT(df)[, head(.SD, 1), element]
Or with unique
unique(setDT(df), by = 'element')
# group element
#1: Africa Angola
#2: Europe France
#3: Europe Germany
#4: Oceania Australia
#5: Oceania New Zealand
This uses the data.table package.
A completely different approach would be to ignore the given groups and instead look up the country names in the catalogue of UN regions, which is available in the countrycode and ISOcodes packages.
The countrycode package seems to offer the simpler interface, and it also warns about country names which cannot be found in its database:
# given country names - note the deliberately misspelled last entry
World <- c("Angola", "France", "Germany", "Australia", "New Zealand", "New Sealand")
# regions
countrycode::countrycode(World, "country.name.en", "region")
[1] "Middle Africa" "Western Europe" "Western Europe" "Australia and New Zealand"
[5] "Australia and New Zealand" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "region") :
Some values were not matched unambiguously: New Sealand
# continents
countrycode::countrycode(World, "country.name.en", "continent")
[1] "Africa" "Europe" "Europe" "Oceania" "Oceania" NA
Warning message:
In countrycode::countrycode(World, "country.name.en", "continent") :
Some values were not matched unambiguously: New Sealand

Pass a string argument to a function as dataframe column name in dplyr

I am trying to pass a string variable to a function, to be used as the column name after some data alteration.
Here is the function:
cleandata <- function(df,name){
df <- df %>%
gather(key = 'Year',value = name,X1960:X2015)
df <- df %>%
select(-c(X,Indicator.Name,Indicator.Code))
df$Year <- substr(df$Year,start = 2,stop = 5)
df$Year <- as.factor(df$Year)
return(df)
}
I want to pass a string variable to 'name', and have it as the column name.
The current output of the function is:
> cleandata(lifeexp,'LifeExp')
Source: local data frame [13,888 x 4]
Country.Name Country.Code Year name
(fctr) (fctr) (fctr) (dbl)
1 Aruba ABW 1960 65.56937
2 Andorra AND 1960 NA
3 Afghanistan AFG 1960 32.32851
4 Angola AGO 1960 32.98483
5 Albania ALB 1960 62.25437
6 Arab World ARB 1960 46.84706
7 United Arab Emirates ARE 1960 52.24322
8 Argentina ARG 1960 65.21554
9 Armenia ARM 1960 65.86346
10 American Samoa ASM 1960 NA
.. ... ... ... ...
>
The last column should be 'LifeExp', not name. What am I missing?
Thanks in advance,
Rahul
You want to use gather_ here. See vignette('nse') for an explanation why.
year_cols <- names(df)[grepl('^X\\d{4}$', names(df))]
df %>% gather_('Year', name, year_cols)
The issue is that gather takes an unquoted name for its key and value columns, so you can't pass in a variable name. It just interprets whatever variable name you put there as the unquoted name you want for the value column. This is consistent with the principle that the tidyr functions without underscores are meant for interactive use, while those with underscores should be used when your work is more programmatic.
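In more recent versions of tidyr the underscore functions are deprecated, and pivot_longer() accepts the new column names as strings directly. The sketch below assumes tidyr 1.0.0 or later; the toy lifeexp data frame and the function name clean_long are made up for illustration.

```r
library(tidyr)

# Toy stand-in for the question's data (made-up values).
lifeexp <- data.frame(
  Country.Name = c("Aruba", "Albania"),
  X1960 = c(65.6, 62.3),
  X1961 = c(66.1, 63.3),
  stringsAsFactors = FALSE
)

clean_long <- function(df, name) {
  # Columns that look like X1960, X1961, ...
  year_cols <- grep("^X\\d{4}$", names(df), value = TRUE)
  # names_to and values_to take strings, so `name` can be a variable.
  long <- pivot_longer(df, cols = all_of(year_cols),
                       names_to = "Year", values_to = name)
  long$Year <- substr(long$Year, 2, 5)  # drop the leading "X"
  long
}

res <- clean_long(lifeexp, "LifeExp")
print(names(res))
```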

Removing repeats and blanks from R data frame

I apologise in advance for the data structure here, but I'm stuck with it...
I have a data frame with lots of repeats and blanks, like so:
df <- data.frame(
country=c("Afghanistan", "Afghanistan", "Algeria", "Australia", "Australia", "Australia"),
survey.1=c("Influenza","", "","","Influenza","Influenza"),
survey.2=c("","Hepatitis C","","","",""),
survey.3=c("West Nile Virus", "", "", "", "", "West Nile Virus"))
country survey.1 survey.2 survey.3
1 Afghanistan Influenza West Nile Virus
2 Afghanistan Hepatitis C
3 Algeria
4 Australia
5 Australia Influenza
6 Australia Influenza West Nile Virus
I need to remove the repeats and blanks but keep the same data structure (I don't know what you would call this... 'concentrating' as opposed to 'aggregating' maybe?). So what I'd end up with is this:
country survey.1 survey.2 survey.3
1 Afghanistan Influenza Hepatitis C West Nile Virus
2 Australia Influenza West Nile Virus
Can anyone help?
Using plyr:
ddply(df,.(country),
function(x)
sapply(x,function(y){
xx= unique(y[nchar(y)>0])
ifelse(length(xx)>0,xx,unique(y))
}
)
)
country survey.1 survey.2 survey.3
1 Afghanistan Influenza Hepatitis C West Nile Virus
2 Algeria
3 Australia Influenza West Nile Virus
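The plyr result above still contains the all-blank Algeria row, which the desired output drops. A base R sketch that also removes such rows could look like this; first_nonblank is a hypothetical helper name, and taking the first non-blank value per country assumes each survey column holds at most one distinct value per country.

```r
df <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Algeria",
              "Australia", "Australia", "Australia"),
  survey.1 = c("Influenza", "", "", "", "Influenza", "Influenza"),
  survey.2 = c("", "Hepatitis C", "", "", "", ""),
  survey.3 = c("West Nile Virus", "", "", "", "", "West Nile Virus"),
  stringsAsFactors = FALSE
)

# First non-blank value in a column, or "" if there is none.
first_nonblank <- function(y) {
  y <- unique(y[nzchar(y)])
  if (length(y) > 0) y[1] else ""
}

# Collapse to one row per country.
out <- aggregate(df[-1], by = df["country"], FUN = first_nonblank)

# Drop countries whose survey columns are all blank.
out <- out[rowSums(out[-1] != "") > 0, ]
print(out)
```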
