Joining two dataframes to plot a map with ggplot2 - r

I want to make a worldmap visualization using an data frame, which look like this:
Country Year Sex Age Suicides Population Suicides_per_100k Country_Year HDI/Year Year_GDP
1 Albania 1987 Male 15-24 years 21 312900 6.71 Albania1987 NA 2156624900
2 Albania 1987 Male 35-54 years 16 308000 5.19 Albania1987 NA 2156624900
3 Albania 1987 Female 15-24 years 14 289700 4.83 Albania1987 NA 2156624900
4 Albania 1987 Male 75+ years 1 21800 4.59 Albania1987 NA 2156624900
5 Albania 1987 Male 25-34 years 9 274300 3.28 Albania1987 NA 2156624900
6 Albania 1987 Female 75+ years 1 35600 2.81 Albania1987 NA 2156624900
GDP_Per_Capita Generation Continent
1 796 Generation X Europe
2 796 Silent Europe
3 796 Generation X Europe
4 796 G.I. Generation Europe
5 796 Boomers Europe
6 796 G.I. Generation Europe
I tried to use the following code:
world <- ggplot2::map_data('world')
worldstart <- left_join(df,world,by = c("Country"="region")
This code created a new dataframe with 14 million observations.
But, I'd like to keep the same number of the dataset "df".
What is the best approach?

Indeed, the map_data functions returns the values for each point of each multipolygons in the world (~10k rows). As mentioned earlier, you cannot chose what point to keep.
You can use the sf library to go around this difficulty, keeping the geometry (here multipolygons) on one side and your data on the other.
My proposal would be the following :
library(dplyr)
library(sf)
library(ggplot2)
df <- tibble(Country = "Albania",
GDP_per_Capita = 796)
world <- maps::map('world', plot = F, fill = T) %>% st_as_sf(stringsAsFactors = F)
world_df <- df %>%
left_join(world, by = c("Country" = "ID"))
In my example, you would have only one row of data, but the geometry columns contains all necessary information for plotting.
sf and ggplot2 are well linked so you are good to go.
Best regards

Related

How to filter values within a threshold in R

I have a data set that looks like this with the first 10 rows
country freq
Albania 2
Argentina 4
Australia 26
Austria 14
Belgium 22
Brazil 46
Bulgaria 2
Cambodia 2
Canada 37
Chile 19
I want to filter out counts(frequency) that are less than 30
i tried this code:
dd %>%
group_by(freq) %>%
filter(n()<30)
The output was same with the dataset. I did not get want i want
how do I resolve this?
Thanks in advance
Use simple indexing. Why are you grouping by?
dd <- dd[dd$freq >= 30, ]

How to create a data group (factor variables) in my dataframe based on categorical variables #R

I want to create a factor variables in my dataframes based on categorical variables.
My data:
# A tibble: 159 x 3
name.country gpd rate_suicide
<chr> <dbl> <dbl>
1 Afghanistan 2129. 6.4
2 Albania 12003. 5.6
3 Algeria 11624. 3.3
4 Angola 7103. 8.9
5 Antigua and Barbuda 19919. 0.5
6 Argentina 20308. 9.1
7 Armenia 10704. 5.7
8 Australia 47350. 11.7
9 Austria 52633. 11.4
10 Azerbaijan 14371. 2.6
# ... with 149 more rows
I want to create factor variable region, which contains a factors as:
region <- c('Asian', 'Europe', 'South America', 'North America', 'Africa')
region = factor(region, levels = c('Asian', 'Europe', 'South America', 'North America', 'Africa'))
I want to do this with dplyr packages, that can to choose a factor levels depends on name.countrybut it doesn't work. Example:
if (new_data$name.country[new_data$name.country == "N"]) {
mutate(new_data, region_ = region[1])
}
How i can solve the problem?
I think the way I would think about your problem is
Create a reproducible problem. (see How to make a great R reproducible example. ) Since you already have the data, use dput to make it easier for people like me to recreate your data in their environment.
dput(yourdf)
structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
raw_data<-structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
Define vectors that specify your regions
Use case_when to separate countries into regions
Use as.factor to convert your character variable to a factor
asia=c("Afghanistan","India","...","Rest of countries in Asia")
europe=c("Albania","France","...","Rest of countries in Europe")
africa=c("Algeria","Egypt","...","Rest of countries in Africa")
df<-raw_data %>%
mutate(region=case_when(
name.country %in% asia ~ "asia",
name.country %in% europe ~ "europe",
name.country %in% africa ~ "africa",
TRUE ~ "other"
)) %>%
mutate(region=region %>% as.factor())
You can check that your variable region is a factor using str
str(df)
'data.frame': 3 obs. of 4 variables:
$ name.country: chr "Afghanistan" "Albania" "Algeria"
$ gpd : int 2129 12003 11624
$ rate_suicide: num 6.4 5.6 3.3
$ region : Factor w/ 3 levels "africa","asia",..: 2 3 1
Here is a working example that combines data from the question with a file of countries and region information from Github. H/T to Luke Duncalfe for maintaining the region data, which is:
...a combination of the Wikipedia ISO-3166 article for alpha and numeric country codes and the UN Statistics site for countries' regional and sub-regional codes.
regionFile <- "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
regionData <- read.csv(regionFile,header=TRUE)
textFile <- "rowID|country|gdp|suicideRate
1|Afghanistan|2129.|6.4
2|Albania|12003.|5.6
3|Algeria|11624.|3.3
4|Angola|7103.|8.9
5|Antigua and Barbuda|19919.|0.5
6|Argentina|20308.|9.1
7|Armenia|10704.|5.7
8|Australia|47350.|11.7
9|Austria|52633.|11.4
10|Azerbaijan|14371.|2.6"
data <- read.csv(text=textFile,sep="|")
library(dplyr)
data %>%
left_join(.,regionData,by = c("country" = "name"))
...and the output:
rowID country gdp suicideRate alpha.2 alpha.3 country.code
1 1 Afghanistan 2129 6.4 AF AFG 4
2 2 Albania 12003 5.6 AL ALB 8
3 3 Algeria 11624 3.3 DZ DZA 12
4 4 Angola 7103 8.9 AO AGO 24
5 5 Antigua and Barbuda 19919 0.5 AG ATG 28
6 6 Argentina 20308 9.1 AR ARG 32
7 7 Armenia 10704 5.7 AM ARM 51
8 8 Australia 47350 11.7 AU AUS 36
9 9 Austria 52633 11.4 AT AUT 40
10 10 Azerbaijan 14371 2.6 AZ AZE 31
iso_3166.2 region sub.region intermediate.region
1 ISO 3166-2:AF Asia Southern Asia
2 ISO 3166-2:AL Europe Southern Europe
3 ISO 3166-2:DZ Africa Northern Africa
4 ISO 3166-2:AO Africa Sub-Saharan Africa Middle Africa
5 ISO 3166-2:AG Americas Latin America and the Caribbean Caribbean
6 ISO 3166-2:AR Americas Latin America and the Caribbean South America
7 ISO 3166-2:AM Asia Western Asia
8 ISO 3166-2:AU Oceania Australia and New Zealand
9 ISO 3166-2:AT Europe Western Europe
10 ISO 3166-2:AZ Asia Western Asia
region.code sub.region.code intermediate.region.code
1 142 34 NA
2 150 39 NA
3 2 15 NA
4 2 202 17
5 19 419 29
6 19 419 5
7 142 145 NA
8 9 53 NA
9 150 155 NA
10 142 145 NA
At this point one can decide whether to use the region, sub region, or intermediate region and convert it to a factor.
We can set region to a factor by adding a mutate() function to the dplyr pipeline:
data %>%
left_join(.,regionData,by = c("country" = "name")) %>%
mutate(region = factor(region)) -> mergedData
At this point mergedData$region is a factor.
str(mergedData$region)
table(mergedData$region)
> str(mergedData$region)
Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 2 3 5 4 3
> table(mergedData$region)
Africa Americas Asia Europe Oceania
2 2 3 2 1
Now the data is ready for further analysis. We will generate a table of average suicide rates by region.
library(knitr) # for kable
mergedData %>% group_by(region) %>%
summarise(suicideRate = mean(suicideRate)) %>%
kable(.)
...and the output:
|region | suicideRate|
|:--------|-----------:|
|Africa | 6.1|
|Americas | 4.8|
|Asia | 4.9|
|Europe | 8.5|
|Oceania | 11.7|
When rendered in an HTML / markdown viewer, the result looks like this:

How to subtract each Country's value by year

I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
Don't have a clear path to accomplish what I want but it would look like
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment with the name data2015, data2016 and data2017, we can add a year column with the respective year and keep the columns which are present in keepcols vector. arrange the data by Country and Year, group_by Country, keep only those countries which are present in all 3 years and then subtract the values from previous rows using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The value of first row for each Year would always be NA, rest of the values would be subtracted by it's previous values.

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is a excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is, to add another row with UN_$sector = Residual. The value of residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look like:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I may recommend is to take advantage of the tidyverse suite of packages which includes dplyr.
Without getting too far into what dplyr and tidyverse can achieve, we can talk about the power of dplyr's inline commands group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. Also, there are tons of great tutorials, cheat sheets, and documentation on all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values then bring it back into your original data frame.
Step 1: Calculating all residual values
We want to calculate the sum of UN values, grouped by country and year. We can achieve this by this value
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN, na.rm = T))
Step 2: Add sector column to res_UN with value 'residual'
This should yield a data frame which contains country, year, and UN, we now need to add a column sector which the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3 : Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns and they can now be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together, should answer your question and can be achieved in a couple lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN, na.rm = T))`
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

How to reshape this complicated data frame?

Here is first 4 rows of my data;
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two colums, one of each Indicator code, and I want each row correspond to a country, something like this;
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on excel everytime I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 colums, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.

Resources