Height of tile with discrete values in ggplot2 - r

I am trying to make a heat map for one year (2014) for about 180 countries, where the fill is GHG emissions. The y axis is supposed to show the countries and the x axis the year. This is the code I am using:
g<-ggplot(data, aes(x = Year, y = Country, fill = `GHG_Emissions_Per_Capita_w_LULUCF_(tCO2e)`)) +
geom_tile() + scale_fill_gradient2(low = colors[1],
mid = colors[paletteSize/2],
high = colors[paletteSize],
midpoint = (max(data$`GHG_Emissions_Per_Capita_w_LULUCF_(tCO2e)`)+min(data$`GHG_Emissions_Per_Capita_w_LULUCF_(tCO2e)`)) / 2,
name = "Total GHG w LULUCF")
I am facing two problems:
1. The names on the y axis overlap, and I cannot seem to make the y axis longer or the tiles taller to accommodate the names.
2. The x axis is supposed to show one discrete element, "2014", but it is treated as a continuous variable.
Here is the output of str(data)
'data.frame': 191 obs. of 4 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ GHG_Emissions_Per_Capita_wo_LULUCF_(tCO2e): num 1.02 3.13 5.17 6.54 5.86 ...
$ GHG_Emissions_Per_Capita_w_LULUCF_(tCO2e) : num 1.02 3 5.16 6.27 9.36 ...
Here is some sample data.
Country Year GHG_Emissions_Per_Capita_wo_LULUCF_(tCO2e) GHG_Emissions_Per_Capita_w_LULUCF_(tCO2e)
1 Afghanistan 2014 1.0185643 1.0185643
2 Albania 2014 3.1277710 3.0039601
3 Algeria 2014 5.1667095 5.1564317
4 Andorra 2014 6.5401349 6.2655871
5 Angola 2014 5.8623365 9.3643598
6 Antigua & Barbuda 2014 11.4753630 11.5420209
7 Argentina 2014 8.1115625 10.3127052
8 Armenia 2014 2.9682122 2.9177457
9 Australia 2014 25.1371295 22.3016710
10 Austria 2014 8.7978737 8.2251670
11 Azerbaijan 2014 7.6137557 6.7257221
12 Bahamas, The 2014 7.1029869 8.1139307
This is a screenshot of the plot I am trying to make
Thanks in advance for your help.
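One possible approach, offered as a sketch under assumptions rather than a confirmed answer: map factor(Year) to x so that 2014 is treated as a single discrete value, shrink the y-axis text, and save the plot on a tall canvas so that each country gets its own row. The toy data frame `emissions`, the short column name `ghg`, the colours, the file name, and the sizes are all hypothetical placeholders.

```r
library(ggplot2)

# Hypothetical stand-in for the question's data (short column name for readability)
emissions <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  Year    = 2014L,
  ghg     = c(1.02, 3.00, 5.16, 6.27, 9.36)
)

g <- ggplot(emissions, aes(x = factor(Year), y = Country, fill = ghg)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "red",   # placeholder colours
    midpoint = mean(range(emissions$ghg)),        # (max + min) / 2, as in the question
    name = "Total GHG w LULUCF"
  ) +
  labs(x = "Year") +
  theme(axis.text.y = element_text(size = 6))     # smaller labels overlap less

# Optionally save with a height that grows with the number of countries
# (0.15 inches per row is a guess to tune):
# ggsave("heatmap.png", g, width = 4, height = 1 + 0.15 * nrow(emissions))
```

With about 190 countries, the saved height matters more than the on-screen preview: the tiles always fill the panel, so a taller output file is what gives each label room.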

Related

How to create a data group (factor variables) in my dataframe based on categorical variables #R

I want to create a factor variable in my data frame based on a categorical variable.
My data:
# A tibble: 159 x 3
name.country gpd rate_suicide
<chr> <dbl> <dbl>
1 Afghanistan 2129. 6.4
2 Albania 12003. 5.6
3 Algeria 11624. 3.3
4 Angola 7103. 8.9
5 Antigua and Barbuda 19919. 0.5
6 Argentina 20308. 9.1
7 Armenia 10704. 5.7
8 Australia 47350. 11.7
9 Austria 52633. 11.4
10 Azerbaijan 14371. 2.6
# ... with 149 more rows
I want to create a factor variable, region, with the following levels:
region <- c('Asian', 'Europe', 'South America', 'North America', 'Africa')
region = factor(region, levels = c('Asian', 'Europe', 'South America', 'North America', 'Africa'))
I want to do this with the dplyr package, choosing the factor level based on name.country, but it doesn't work. Example:
if (new_data$name.country[new_data$name.country == "N"]) {
mutate(new_data, region_ = region[1])
}
How can I solve this problem?
Here is how I would approach your problem:
Create a reproducible example (see How to make a great R reproducible example). Since you already have the data, use dput to make it easier for people like me to recreate your data in their environment.
dput(yourdf)
structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
raw_data<-structure(list(name.country = c("Afghanistan", "Albania", "Algeria"
), gpd = c(2129L, 12003L, 11624L), rate_suicide = c(6.4, 5.6,
3.3)), class = "data.frame", row.names = c(NA, -3L))
Define vectors that specify your regions
Use case_when to separate countries into regions
Use as.factor to convert your character variable to a factor
library(dplyr)
asia=c("Afghanistan","India","...","Rest of countries in Asia")
europe=c("Albania","France","...","Rest of countries in Europe")
africa=c("Algeria","Egypt","...","Rest of countries in Africa")
df<-raw_data %>%
mutate(region=case_when(
name.country %in% asia ~ "asia",
name.country %in% europe ~ "europe",
name.country %in% africa ~ "africa",
TRUE ~ "other"
)) %>%
mutate(region=region %>% as.factor())
You can check that your variable region is a factor using str
str(df)
'data.frame': 3 obs. of 4 variables:
$ name.country: chr "Afghanistan" "Albania" "Algeria"
$ gpd : int 2129 12003 11624
$ rate_suicide: num 6.4 5.6 3.3
$ region : Factor w/ 3 levels "africa","asia",..: 2 3 1
Here is a working example that combines data from the question with a file of country and region information from GitHub. H/T to Luke Duncalfe for maintaining the region data, which is:
...a combination of the Wikipedia ISO-3166 article for alpha and numeric country codes and the UN Statistics site for countries' regional and sub-regional codes.
regionFile <- "https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv"
regionData <- read.csv(regionFile,header=TRUE)
textFile <- "rowID|country|gdp|suicideRate
1|Afghanistan|2129.|6.4
2|Albania|12003.|5.6
3|Algeria|11624.|3.3
4|Angola|7103.|8.9
5|Antigua and Barbuda|19919.|0.5
6|Argentina|20308.|9.1
7|Armenia|10704.|5.7
8|Australia|47350.|11.7
9|Austria|52633.|11.4
10|Azerbaijan|14371.|2.6"
data <- read.csv(text=textFile,sep="|")
library(dplyr)
data %>%
left_join(.,regionData,by = c("country" = "name"))
...and the output:
rowID country gdp suicideRate alpha.2 alpha.3 country.code
1 1 Afghanistan 2129 6.4 AF AFG 4
2 2 Albania 12003 5.6 AL ALB 8
3 3 Algeria 11624 3.3 DZ DZA 12
4 4 Angola 7103 8.9 AO AGO 24
5 5 Antigua and Barbuda 19919 0.5 AG ATG 28
6 6 Argentina 20308 9.1 AR ARG 32
7 7 Armenia 10704 5.7 AM ARM 51
8 8 Australia 47350 11.7 AU AUS 36
9 9 Austria 52633 11.4 AT AUT 40
10 10 Azerbaijan 14371 2.6 AZ AZE 31
iso_3166.2 region sub.region intermediate.region
1 ISO 3166-2:AF Asia Southern Asia
2 ISO 3166-2:AL Europe Southern Europe
3 ISO 3166-2:DZ Africa Northern Africa
4 ISO 3166-2:AO Africa Sub-Saharan Africa Middle Africa
5 ISO 3166-2:AG Americas Latin America and the Caribbean Caribbean
6 ISO 3166-2:AR Americas Latin America and the Caribbean South America
7 ISO 3166-2:AM Asia Western Asia
8 ISO 3166-2:AU Oceania Australia and New Zealand
9 ISO 3166-2:AT Europe Western Europe
10 ISO 3166-2:AZ Asia Western Asia
region.code sub.region.code intermediate.region.code
1 142 34 NA
2 150 39 NA
3 2 15 NA
4 2 202 17
5 19 419 29
6 19 419 5
7 142 145 NA
8 9 53 NA
9 150 155 NA
10 142 145 NA
At this point one can decide whether to use the region, sub region, or intermediate region and convert it to a factor.
We can set region to a factor by adding a mutate() function to the dplyr pipeline:
data %>%
left_join(.,regionData,by = c("country" = "name")) %>%
mutate(region = factor(region)) -> mergedData
At this point mergedData$region is a factor.
str(mergedData$region)
table(mergedData$region)
> str(mergedData$region)
Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 2 3 5 4 3
> table(mergedData$region)
Africa Americas Asia Europe Oceania
2 2 3 2 1
Now the data is ready for further analysis. We will generate a table of average suicide rates by region.
library(knitr) # for kable
mergedData %>% group_by(region) %>%
summarise(suicideRate = mean(suicideRate)) %>%
kable(.)
...and the output:
|region | suicideRate|
|:--------|-----------:|
|Africa | 6.1|
|Americas | 4.8|
|Asia | 4.9|
|Europe | 8.5|
|Oceania | 11.7|
When rendered in an HTML / markdown viewer, the result looks like this:

How to plot monthly data having in the x-axis months and Years R studio

I have a dataframe where column 1 are Months, column 2 are Years and column 3 are precipitation values.
I want to plot the precipitation values for EACH month and EACH year.
My data runs from January 1961 to February 2019.
How can I plot that?
Here is my data:
If I use this:
plot(YearAn,PPMensual,type="l",col="red",xlab="años", ylab="PP media anual")
I get this:
This is wrong because it puts all the monthly values in every single year! What I'm looking for is an x axis that looks like "JAN-1961, FEB-1961, ... until FEB-2019".
It can be done easily using the ggplot2/tidyverse packages.
First, let's load the packages (ggplot2 is part of the tidyverse) and create some sample data:
library(tidyverse)
set.seed(123)
df <- data.frame(month = rep(c(1:12), 2),
year = rep(c("1961", "1962"),
each = 12),
ppmensual = rnorm(24, 5, 2))
Now we can plot the data (df):
df %>%
ggplot(aes(month, ppmensual,
group = year,
color = year)) +
geom_line()
Using lubridate and ggplot2 but with no grouping:
Setup
library(lubridate) # for make_date()
library(ggplot2) # for the graphic
library(tibble) # for tibble()
df <- tibble(month = rep(month.name, 40),
year = rep(c(1961:2000), each = 12),
PP = runif(12*40) * runif(12*40) * 10) # PP data is random here
print(df, n = 20)
month year PP
<chr> <int> <dbl>
1 January 1961 5.42
2 February 1961 0.855
3 March 1961 5.89
4 April 1961 1.37
5 May 1961 0.0894
6 June 1961 2.63
7 July 1961 1.89
8 August 1961 0.148
9 September 1961 0.142
10 October 1961 3.49
11 November 1961 1.92
12 December 1961 1.51
13 January 1962 5.60
14 February 1962 1.69
15 March 1962 1.14
16 April 1962 1.81
17 May 1962 8.11
18 June 1962 0.879
19 July 1962 4.85
20 August 1962 6.96
# … with 460 more rows
Graph
df %>%
ggplot(aes(x = make_date(year, match(month, month.name)), y = PP)) + # match() gives calendar month numbers; factor() would order the names alphabetically
geom_line() +
xlab("años")
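To get axis labels in the "JAN-1961" style the question asks for, one option is to format the date axis with scale_x_date(). This is only a minimal sketch on a made-up data frame `pp`; the break spacing, the label format, and the rotation are assumptions to adjust to taste, and the month abbreviations produced by "%b" depend on the session locale.

```r
library(ggplot2)

set.seed(1)
# Hypothetical monthly series from January 1961 to February 2019
pp <- data.frame(
  date = seq(as.Date("1961-01-01"), as.Date("2019-02-01"), by = "month")
)
pp$PP <- runif(nrow(pp)) * 10   # random placeholder values

p <- ggplot(pp, aes(x = date, y = PP)) +
  geom_line() +
  scale_x_date(
    date_breaks = "5 years",
    labels = function(d) toupper(format(d, "%b-%Y"))  # upper-case month-year labels
  ) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```

Labelling every single month over 58 years would be unreadable, which is why the sketch keeps one label every five years while still plotting every monthly value.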

How to get the smooth line for monthly rainfall using ggplot?

I am trying to plot the monthly rainfall data from 1986 to 2016 using ggplot. My dataframe looks like this:
head(df)
Year Month Station Rainfall Remarks
1 1986 Jan stn1 0.0 Observed
2 1986 Feb stn1 10.4 Observed
3 1986 Mar stn1 16.5 Estimated
4 1986 Apr stn1 34.0 Observed
5 1986 May stn1 27.0 Observed
6 1986 Jun stn1 159.4 Observed
str(df)
'data.frame': 1488 obs. of 5 variables:
$ Year : chr "1986" "1986" "1986" "1986" ...
$ Month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 2 3 4 5 6 7 8 9 10 ...
$ Station : Factor w/ 4 levels "stn1","stn2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Rainfall: num 0 10.4 16.5 34 27 ...
$ Remarks : Factor w/ 2 levels "Estimated","Observed": 2 2 1 2 2 2 2 2 2 2 ...
I tried the following code:
library(ggplot2)
ggplot(df, aes(x=Year, y=Rainfall, col=Station)) + geom_line()
However, the above code produces a plot of vertical lines, while I want smoothly varying lines.
I want to plot all four stations (stn1 to stn4) such that the color of each line is based on df$Remarks.
Also, is it possible to have a unique color for each station?
Your help would be appreciated.
Here is one approach if you create a month-year variable:
library(ggplot2)
library(zoo)
df$Mo_Yr <- as.yearmon(paste0(df$Year, '-', df$Month), "%Y-%b")
ggplot(df, aes(x=Mo_Yr, y=Rainfall, col=Station)) +
geom_line() +
scale_x_yearmon()
If you want to use different color points for Remarks (Observed and Estimated), for a single Station, you could try the following:
ggplot(df, aes(x=Mo_Yr, y=Rainfall)) +
geom_point(aes(col = Remarks)) +
geom_line() +
scale_x_yearmon()
If you want to plot 2 lines for Observed and Estimated, you could add col argument to geom_line as below. Note I added some example data to illustrate. Depending on what data you have available this may (or may not) be what you need.
ggplot(df, aes(x=Mo_Yr, y=Rainfall)) +
geom_line(aes(col=Remarks)) +
scale_x_yearmon()
Data (for last example)
df <- read.table(text =
"Year Month Station Rainfall Remarks
1986 Jan stn1 0.0 Observed
1986 Feb stn1 10.4 Observed
1986 Mar stn1 16.5 Estimated
1986 Apr stn1 34.0 Observed
1986 May stn1 27.0 Observed
1986 Jun stn1 159.4 Observed
1986 Jul stn1 83.1 Estimated
1986 Aug stn1 55.7 Observed
1986 Sep stn1 12.3 Estimated", header = T, stringsAsFactors = T)
You might want to try adding a stat_smooth layer (convert Year to numeric first, since it is stored as character in the question's data):
ggplot(df) +
geom_line(aes(y= Rainfall, x= Year, color= Station)) +
stat_smooth(aes(y= Rainfall, x= Year), method = lm, formula = y ~ poly(x, 10), se = FALSE)

R Merging Boxplots

I am trying to use R to show a merged boxplot. I am sure this is easy and I am just missing something:
boxplot(WHO$Male, WHO$Female, ylim=c(0,100))
boxplot(WHO$Female ~ WHO$Year, ylim=c(0,100))
boxplot(WHO$Male ~ WHO$Year, ylim=c(0,100))
All three work, but when I try:
boxplot(WHO$Male ~ WHO$Year, WHO$Female ~ WHO$Year, ylim=c(0,100))
It returns:
Error in as.data.frame.default(data) :
cannot coerce class "formula" to a data.frame
Note that Year contains only three values: 1990, 2000, and 2010.
> head(WHO)
Year WHO.region Country Male Female
1 1990 Africa Algeria 66 68
2 1990 Africa Angola 39 43
3 1990 Africa Benin 45 50
4 1990 Africa Botswana 63 66
5 1990 Africa Burkina Faso 45 49
6 1990 Africa Burundi 47 50
The reshape2 package does something similar. There was also a very similar question, "Plot multiple boxplot in one graph", which may be helpful.
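The error appears because boxplot() accepts a single formula, so the second formula ends up matched to the data argument. One workaround, sketched here in base R on a small made-up data frame shaped like WHO (the values are invented), is to stack the Male and Female columns into long form and use a two-factor formula:

```r
# Made-up data with the same columns as the question's WHO data frame
who <- data.frame(
  Year   = rep(c(1990, 2000, 2010), each = 4),
  Male   = c(66, 39, 45, 63, 68, 42, 50, 60, 71, 49, 55, 58),
  Female = c(68, 43, 50, 66, 71, 45, 54, 62, 74, 52, 58, 60)
)

# Stack Male and Female into one value column with a Sex label
long <- data.frame(
  Year  = rep(who$Year, times = 2),
  Sex   = rep(c("Male", "Female"), each = nrow(who)),
  Value = c(who$Male, who$Female)
)

# One box per Sex-by-Year combination, all drawn in a single plot
bp <- boxplot(Value ~ Sex + Year, data = long, ylim = c(0, 100), las = 2)
```

The reshape2 route mentioned above would produce the same long layout with melt(); the manual stacking is used here only to keep the sketch free of extra packages.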

R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five year averages of my other variables. When I sat down to do this I realized the only way I could think to do this involved a for loop and then decided that it was time to come to stackoverflow for help.
So, is there an easy way to turn data that looks like this:
country country.isocode year POP ci grgdpch
Argentina ARG 1951 17517.34 18.445022145 3.4602044759
Argentina ARG 1952 17876.96 17.76066507 -7.887407586
Argentina ARG 1953 18230.82 18.365255769 2.3118720688
Argentina ARG 1954 18580.56 16.982113434 1.5693778844
Argentina ARG 1955 18927.82 17.488907008 5.3690276523
Argentina ARG 1956 19271.51 15.907756547 0.3125559183
Argentina ARG 1957 19610.54 17.028450999 2.4896639667
Argentina ARG 1958 19946.54 17.541597134 5.0025894968
Argentina ARG 1959 20281.15 16.137310492 -6.763501447
Argentina ARG 1960 20616.01 20.519539628 8.481742144
...
Venezuela VEN 1997 22361.80 21.923577413 5.603872759
Venezuela VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela VEN 2002 24191.77 16.290691319 -8.02535946
Venezuela VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela VEN 2004 24904.62 17.147693312 14.644028806
Venezuela VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela VEN 2007 26023.53 26.518210052 4.1367897561
into something like this:
country country.isocode period AvPOP Avci Avgrgdpch
Argentina ARG 1 18230 17.38474 1.423454
...
Venezuela VEN 12 25274 21.45343 5.454334
Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?
This is exactly what aggregate is made for:
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Level <-cut(Df$year,seq(1951,1971,by=5),right=F)
id <- c("var1","var2")
> aggregate(Df[id],list(Df$country,Level),mean)
Group.1 Group.2 var1 var2
1 Arg [1951,1956) 3 18
2 Ven [1951,1956) 53 68
3 Arg [1956,1961) 8 13
4 Ven [1956,1961) 58 63
5 Arg [1961,1966) 13 8
6 Ven [1961,1966) 63 58
7 Arg [1966,1971) 18 3
8 Ven [1966,1971) 68 53
The only thing you might still want to do is rename the categories and the variable names.
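A minimal sketch of that renaming, assuming the same toy data: numbered period labels come from the labels argument of cut(), and naming the elements of the grouping list names the grouping columns. The "Av" prefix simply mirrors the column names in the question.

```r
Df <- data.frame(
  year    = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1    = c(1:20, 51:70),
  var2    = c(20:1, 70:51)
)

# labels = 1:4 numbers the periods instead of showing the interval text
period <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE, labels = 1:4)

# Naming the list elements names the grouping columns in the result
res <- aggregate(Df[c("var1", "var2")],
                 list(country = Df$country, period = period),
                 mean)

# Prefix the averaged columns, as in the question's desired output
names(res)[3:4] <- c("Avvar1", "Avvar2")
res
```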
For this type of problem, the plyr package is truly phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.
library(plyr)
library(zoo)
library(pwt)
# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
pwt6.3$country %in% c("Argentina", "Venezuela"),
c("country", "isocode", "year", "pop", "ci", "rgdpch")
]
# Use rollmean() in zoo as the basis for a 5-period rolling mean
rollmean5 <- function(x){
rollmean(x, 5)
}
# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))
Here is the output from this:
> head(pwt, 10)
country isocode year pop ci rgdpch
ARG-1950 Argentina ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina ARG 1959 20281.15 16.13731 8125.515
> head(pwt.ma)
country year pop ci rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250
Note that rollmean(), by default, calculates the centred moving mean. You can modify this behaviour to get the left or right moving mean by passing this parameter to the helper function.
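As a small illustration of that parameter, assuming only the zoo package: the align argument of rollmean() selects the window position, and fill = NA pads the result to the input length so the difference is visible. The align argument added to the helper below is a hypothetical extension of rollmean5, not part of the original answer.

```r
library(zoo)

x <- 1:10

# Centred window (the default): each value is the mean of 2 before, itself, 2 after
centred  <- rollmean(x, 5, fill = NA)

# Right-aligned window: each value is the mean of itself and the 4 previous values
trailing <- rollmean(x, 5, fill = NA, align = "right")

# Helper with the alignment exposed as an argument (hypothetical extension)
rollmean5 <- function(x, align = "center") {
  rollmean(x, 5, align = align)
}
```

The two results hold the same six means; only the position of the NA padding differs, which is what shifts the associated year in the ddply() output.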
EDIT:
@Joris Meys gently pointed out that you might in fact be after the average for five-year periods.
Here is the modified code to do this:
pwt$period <- cut(pwt$year, seq(1900, 2100, 5))
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma
And the output:
> pwt.ma
country period year pop ci rgdpch
1 Argentina (1945,1950] 1950.0 17150.336 13.29214 7736.338
2 Argentina (1950,1955] 1953.0 18226.699 17.80839 7730.874
3 Argentina (1955,1960] 1958.0 19945.149 17.42693 8410.610
4 Argentina (1960,1965] 1963.0 21616.623 19.09067 9000.918
5 Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6 Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7 Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8 Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9 Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0 5009.006 41.07972 7067.947
15 Venezuela (1950,1955] 1953.0 5684.009 44.60849 8132.041
16 Venezuela (1955,1960] 1958.0 6988.078 37.87946 9468.001
17 Venezuela (1960,1965] 1963.0 8451.073 26.93877 9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. There are many other answers that show how; see https://stackoverflow.com/questions/tagged/r+reshape
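A minimal sketch of that recipe, with two assumptions stated up front: it uses reshape2 (a later rewrite of reshape by the same author, in which cast is split into dcast and acast) rather than reshape itself, and it runs on the toy data used elsewhere on this page.

```r
library(reshape2)

Df <- data.frame(
  year    = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1    = c(1:20, 51:70),
  var2    = c(20:1, 70:51)
)

# Five-year periods, closed on the left: [1951,1956), [1956,1961), ...
Df$period <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE)

# Long form: one row per country, period, variable and value
long <- melt(Df, id.vars = c("country", "period"),
             measure.vars = c("var1", "var2"))

# Back to wide form, averaging the values within each country and period
avg <- dcast(long, country + period ~ variable, fun.aggregate = mean)
avg
```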
There is a base stats and a plyr answer, so for completeness, here is a dplyr based answer. Using the toy data given by Joris, we have
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Now, using cut to create the periods, we can then group on them and get the means:
library(dplyr)
Df %>% mutate(period = cut(year, seq(1951, 1971, by = 5), right = FALSE)) %>%
group_by(country, period) %>% summarise(V1 = mean(var1), V2 = mean(var2))
Source: local data frame [8 x 4]
Groups: country
country period V1 V2
1 Arg [1951,1956) 3 18
2 Arg [1956,1961) 8 13
3 Arg [1961,1966) 13 8
4 Arg [1966,1971) 18 3
5 Ven [1951,1956) 53 68
6 Ven [1956,1961) 58 63
7 Ven [1961,1966) 63 58
8 Ven [1966,1971) 68 53
