Plot line graph in R with ggplot2 from dataset

I have a dataset of countries' health expenditure and life expectancy and wish to plot these visually.
I currently have the code:
dd = data.frame(Series_Name = "Health expenditure per capita (current US$) Australia",
Year = c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014),
Value = c(1665.200,1883.316,2370.881,2933.229,3214.031,3421.908,4077.852,4410.438,4256.641,5324.517,6368.424,6543.524,6258.467,6031.107))
Which I am then plotting with:
require(ggplot2)
## The values Year, Value, Series_Name are
## inherited by the geoms
ggplot(dd, aes(Year, Value,colour=Series_Name)) +
geom_line() +
geom_point()
This displays the graph the way I would like. However, I would like to be able to specify which series of data is placed in the Value variable, rather than entering it manually. The Year does not need to change and can stay as it is.
The data has been read in from a csv file and saved to the variable 'statistics'. The data looks like this:
Series Name 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Health expenditure per capita (current US$) Australia 1665.200 1883.316 2370.881 2933.229 3214.031 3421.908 4077.852 4410.438 4256.641 5324.517 6368.424 6543.524 6258.467 6031.107
If I wished to change the data from Australia to Japan, how would I go about doing so? The series name is set out the same way, except for the country name.
Thanks for your help!
EDIT: Thought it might be beneficial to add an image of the data layout.
The statistics.csv file - https://ufile.io/ocynw

You could use the following approach. If your wide data frame (the one read from the CSV, with one column per year) is called dd:
names(dd) <- c("Series_Name", seq(2001,2014,1))
library(reshape2)
library(tidyverse)
library(stringr)
We first convert your data frame from wide to long format:
dd2 <- melt(dd, id.vars=c("Series_Name"), value.name = c("value"))
Selecting only the 'Health expenditure per capita' series:
dd2 <- dd2[startsWith(as.character(dd2$Series_Name), prefix = "Health expenditure per capita"), ]
Creating a column with the name of the country that will appear in the legend:
dd2$country <- as.factor(word(dd2$Series_Name,-1) )
Sorting your data:
dd2 <- arrange(dd2, country)
and plotting all the countries:
ggplot(dd2, aes(x = variable, y = value, group=country, color=country)) + geom_line() +
geom_point()
If you want just Japan:
filter(dd2, country == "Japan") %>%
ggplot(aes(x = variable, y = value, group=country, color=country)) +
geom_line() + geom_point()
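The same reshaping can also be sketched with tidyr's pivot_longer, which keeps Year numeric so the x-axis stays continuous. This is only an illustration under assumptions: the data frame below is a toy stand-in for 'statistics', the Japan values are invented placeholders (not real figures), and plot_country is a hypothetical helper name.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Toy stand-in for the wide 'statistics' data frame.
# The Australia values come from the question; the Japan values are
# invented placeholders for illustration only.
statistics <- data.frame(
  Series_Name = c("Health expenditure per capita (current US$) Australia",
                  "Health expenditure per capita (current US$) Japan"),
  `2001` = c(1665.200, 2500),
  `2002` = c(1883.316, 2600),
  `2003` = c(2370.881, 2700),
  check.names = FALSE
)

# Wide to long: one row per series and year, with Year as an integer
long <- statistics %>%
  pivot_longer(-Series_Name, names_to = "Year", values_to = "Value") %>%
  mutate(Year = as.integer(Year))

# Hypothetical helper: plot every series whose name ends with the given country
plot_country <- function(data, country) {
  data %>%
    filter(endsWith(Series_Name, country)) %>%
    ggplot(aes(Year, Value, colour = Series_Name)) +
    geom_line() +
    geom_point()
}

p <- plot_country(long, "Japan")
```

Switching from Japan to Australia then only means changing the second argument, e.g. plot_country(long, "Australia").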

Related

Create and add a new variable with the pipe operator = Create a Plot out of it

I want to create and add a new variable called "XY" by multiplying the income (gdp) of several countries by 1000 and then dividing it by the population, i.e. gdp*1000/pop. Afterwards I want to make a plot for selected years and countries (for example Turkey, France and Germany, from 1996 onwards, i.e. 1997, 1998, etc.), with the years on the x-axis and this per-capita income on the y-axis. Every country should have a different colour. How can I write this in R code? My start so far is:
> # two pipes
> gapminder %>%
+ select(Germany, Italy, France) %>%
+ select(population >= 1996)
plot(eu_macro$germany, eu_macro$turkey, eu_macro$france, gdp_capita$medv, main = "GDP
development", xlab = "Year", ylab = "gdp_capita")
Sorry, I am very new to R and Coding in general and just hope to get an understanding for it for Finance classes.
Are you looking for a solution like this?
library(gapminder)
library(tidyverse)
gapminder %>%
mutate(XY = gdpPercap*1000/pop) %>%
filter(country=="Germany" |
country=="Turkey"|
country=="France") %>%
filter(year >= 1996) %>%
ggplot(aes(x=factor(year), y=XY, fill=country)) +
geom_col(position = "dodge") +
ggtitle("GDP development") +
xlab("Year") + ylab("gdp_capita")
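If a line chart with years on a continuous x-axis is preferred over bars, a variant might look like the sketch below. It assumes the gapminder package's gapminder data frame and plots its existing gdpPercap column directly; the name 'selected' is introduced here for illustration. Note that gapminder records one observation every five years, so year >= 1996 keeps 1997, 2002 and 2007.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep three countries and the years from 1996 onwards
selected <- gapminder %>%
  filter(country %in% c("Germany", "Turkey", "France"), year >= 1996)

# One coloured line per country, years on a continuous x-axis
p <- ggplot(selected, aes(x = year, y = gdpPercap, colour = country)) +
  geom_line() +
  geom_point() +
  labs(title = "GDP development", x = "Year", y = "GDP per capita")
```

Using %in% with a vector of names avoids chaining several country == "..." comparisons with |.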

Trying to plot change in electricity consumption per year with R

EDIT: I figured out that the year has to be of the numeric data type, or the code has to be written as:
ggplot(data = Electricity_Consumption_per_Capita_United_States, aes(x = year, y = value)) +
geom_line(group = 1) +
scale_x_continuous(breaks = seq(1960, 2010, 5))
Original Question:
I downloaded the "Electricity use, per person" data set from here. This is what the data frame looks like:
I am trying to plot the change in electricity consumption per person for any given country over the years in the data frame (1960 to 2011), and decided to start with the United States. I thought it made sense to use tidyr to organize the years under one column, and the actual kWh under another column:
Electricity_Consumption_Per_Capita <- read_excel("Datasets/Indicator_Electricity consumption per capita.xlsx")
#Gather the years and corresponding electricity consumption per capita values per country.
Electricity_Consumption_Per_Capita %>%
gather(key = "year", value = "value", -"Electricity consumption, per capita (kWh)") -> Electricity_Consumption_Per_Capita
#Rename the Electricity consumption, per capita (kWh) variable to Country, then filter to obtain the data for the United States.
Electricity_Consumption_Per_Capita %>%
rename(Country = `Electricity consumption, per capita (kWh)`) %>%
group_by(Country) %>%
filter(Country == "United States") -> Electricity_Consumption_per_Capita_United_States
The resulting data frame looks like:
Unfortunately, I cannot figure out how to plot the value (kWh) and the year on the same plot. I tried a normal line chart with no success:
ggplot(data = Electricity_Consumption_per_Capita_United_States, aes(x = "year", y = "value")) +
geom_line()
I think this is a discrete versus continuous variable problem, but I'm not certain. Could someone point me in the right direction? Do I have to change the "year" column, which is currently a character vector, to a date data type?
Remove the quotes from the aesthetics.
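With quotes, aes(x = "year", y = "value") maps both aesthetics to constant strings rather than to the columns, so there is nothing to connect with a line. A minimal sketch of the corrected call follows; the data frame 'us' is a toy stand-in with placeholder values (not real consumption figures), and year is converted to numeric so the x-axis is continuous.

```r
library(ggplot2)

# Toy stand-in for the filtered United States data.
# The values are placeholders; year is character, as it is after gather().
us <- data.frame(year = as.character(1960:1965),
                 value = seq(4000, 5000, by = 200),
                 stringsAsFactors = FALSE)

# Convert year to numeric so ggplot treats the x-axis as continuous
us$year <- as.numeric(us$year)

# Bare column names inside aes(), no quotes
p <- ggplot(us, aes(x = year, y = value)) +
  geom_line()
```

If year is left as a character column instead, geom_line(group = 1) is needed so that all points are joined into a single line.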

How can I color a line graph by grouping the variables in R?

I have produced a line graph that looks something like this
I have a data set of 50 countries and their GDP for the last 10 years.
Sample data:
Country variable value
China Y2007 3.55218e+12
USA Y2007 1.45000e+13
Japan Y2007 4.51526e+12
UK Y2007 3.06301e+12
Russia Y2007 1.29971e+12
Canada Y2007 1.46498e+12
Germany Y2007 3.43995e+12
India Y2007 1.20107e+12
France Y2007 2.66311e+12
SKorea Y2007 1.12268e+12
I generated the line graph using the code
GDP_lineplot = ggplot(data=GDP_linechart, aes(x=variable,y=value)) +
geom_line() +
scale_y_continuous(name = "GDP(USD in Trillions)",
breaks = c(0.0e+00,5.0e+12,1.0e+13,1.5e+13),
labels = c(0,5,10,15)) +
scale_x_discrete(name = "Years", labels = c(2007,"",2009,"",2011,"",2013,"",2015))
The idea is to make the graph look like this.
I tried adding
group=country, color = country
This colors every country separately.
How can I color only the top 4 countries and show the rest in a single color?
PS: I am still naive with R.
By plotting subsets, the other groups aren't included in the colour legend on the right. The alternative approach below manipulates factor levels and uses a customized color scale to overcome this.
Preparing data
It is assumed that GDP_long contains the data in long format. This is in line with the data shown by the OP (GDP_linechart, but see the Data section below for differences). To manipulate factor levels, the forcats package is used (together with data.table).
library(data.table)
library(forcats)
# coerce to data.table, reorder factor levels by the value in the last (most recent) year
setDT(GDP_long)[, Country := fct_reorder(Country, -value, last)]
# create new factor which collapses all countries to "Other" except the top 4 countries
GDP_long[, top_country := fct_other(Country, keep = head(levels(Country), 4))]
Create plot
library(ggplot2)
ggplot(GDP_long, aes(Year, value/1e12, group = Country, colour = top_country)) +
geom_point() + geom_line(size = 1) + theme_bw() + ylab("GDP(USD in Trillions)") +
scale_colour_manual(name = "Country",
values = c("green3", "orange", "blue", "red", "grey"))
The chart is now quite similar to the expected result. The lines of the top 4 countries are displayed in different colours while the other countries are displayed in grey but do appear in the colour legend to the right.
Note that the group aesthetic is still needed so that a single line is plotted for each country while colour is controlled by the levels of top_country.
Data
The data set is too large to be reproduced here (even with dput()). The structure
str(GDP_long)
'data.frame': 1763 obs. of 3 variables:
$ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
$ value : num 9.84e+09 1.07e+10 1.35e+11 4.01e+09 6.04e+10 ...
is similar to OP's data with the exception that the variable column has already been converted to an integer column Year. This gives a nicely formatted x-axis without additional effort.
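For readers without data.table or forcats, the idea of collapsing all but the top 4 countries into "Other" can be sketched in base R. The data frame 'gdp' below is a toy example with invented values and single-letter country names, and the names 'latest' and 'top4' are introduced here for illustration.

```r
library(ggplot2)

# Toy data: six hypothetical countries, two years, invented values
gdp <- data.frame(Country = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
                  Year = rep(c(2014L, 2015L), times = 6),
                  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
                  stringsAsFactors = FALSE)

# Rank countries by their value in the most recent year and keep the top 4
latest <- gdp[gdp$Year == max(gdp$Year), ]
top4 <- latest$Country[order(-latest$value)][1:4]

# Every other country is collapsed into the level "Other"
gdp$top_country <- factor(ifelse(gdp$Country %in% top4, gdp$Country, "Other"),
                          levels = c(top4, "Other"))

# group keeps one line per country; colour follows the collapsed factor
p <- ggplot(gdp, aes(Year, value, group = Country, colour = top_country)) +
  geom_line() +
  scale_colour_manual(name = "Country",
                      values = c("green3", "orange", "blue", "red", "grey"))
```

Putting "Other" last in the factor levels lines it up with the final "grey" entry of the manual colour scale.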
My apologies, I missed the part about only coloring a subset of the countries. In the geom_line calls you can add the subsetting that suits your needs.
df <- data.frame(Country=rep(LETTERS[1:10], each=5),
Year=rep(2007:2011, length.out=10),
value=rnorm(50))
ggplot(df) +
geom_line(data=df[21:50, ], aes(x=Year, y=value, group=Country), color="#999999") +
geom_line(data=df[1:20, ], aes(Year, y=value, color=Country))

ggplot: Multiple years on same plot by month

So, I've hit something I don't think I have ever come across. I scoured Google looking for the answer, but have not found anything (yet)...
I have two data sets - one for 2015 and one for 2016. They represent the availability of an IT system. The data frames read as such:
2015 Data Set:
variable value
Jan 2015 100
Feb 2015 99.95
... ...
2016 Data Set:
variable value
Jan 2016 99.99
Feb 2016 99.90
... ...
They just go from Jan to Dec, listing the availability of the system. The "variable" column is of class yearmon (created with as.yearmon) and the value column is a plain numeric.
I want to create a geom_line() chart with ggplot2 that will basically have the percentages as the y-axis and the months as the x-axis. I have been able to do this where there are two lines, but the x-axis runs from Jan 2015 - Dec 2016. What I'd like is to have them only be plotted by month, so they overlap. I have tried some various things with the scales and so forth, but I have yet to figure out how to do this.
Basically, I need the x-axis to read January - December in chronological order, but I want to plot both 2015 and 2016 on the same chart. Here is my ggplot code (non-working) as I have it now:
ggplot(data2015,aes(variable,value)) +
geom_line(aes(color="2015")) +
geom_line(data=data2016,aes(color="2016")) +
scale_x_yearmon() +
theme_classic()
This plots in a continuous stream as I am dealing with a yearmon() data type. I have tried something like this:
ggplot(data2015,aes(months(variable),value)) +
geom_line(aes(color="2015")) +
geom_line(data=data2016,aes(color="2016")) +
theme_classic()
Obviously that won't work. I figure the months() is probably still carrying the year somehow. If I plot them as factors() they are not in order. Any help would be very much appreciated. Thank you in advance!
To get a separate line for each year, you need to extract the year from each date and map it to colour. To get months (without year) on the x-axis, you need to extract the month from each date and map to the x-axis.
library(zoo)
library(lubridate)
library(ggplot2)
Let's create some fake data with the dates in as.yearmon format. I'll create two separate data frames so as to match what you describe in your question:
# Fake data
set.seed(49)
dat1 = data.frame(date = seq(as.Date("2015-01-15"), as.Date("2015-12-15"), "1 month"),
value = cumsum(rnorm(12)))
dat1$date = as.yearmon(dat1$date)
dat2 = data.frame(date = seq(as.Date("2016-01-15"), as.Date("2016-12-15"), "1 month"),
value = cumsum(rnorm(12)))
dat2$date = as.yearmon(dat2$date)
Now for the plot. We'll extract the year and month from date with the year and month functions, respectively, from the lubridate package. We'll also turn the year into a factor, so that ggplot will use a categorical color palette for year, rather than a continuous color gradient:
ggplot(rbind(dat1,dat2), aes(month(date, label=TRUE, abbr=TRUE),
value, group=factor(year(date)), colour=factor(year(date)))) +
geom_line() +
geom_point() +
labs(x="Month", colour="Year") +
theme_classic()
month value year
Jan 99.99 2015
Feb 99.90 2015
Jan 100 2016
Feb 99.95 2016
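You need one long-format dataset that has a year column, as above. Then you can plot both lines with ggplot. Because month is discrete, the group aesthetic is needed so that each year is drawn as one line, and month should be a factor with its levels in calendar order (e.g. levels = month.abb):
ggplot(dataset, aes(x = month, y = value, color = factor(year), group = year)) + geom_line()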
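One way to build such a long-format dataset from two yearmon data frames might look like the sketch below. The stand-ins for data2015 and data2016 contain only the two months shown in the question, and the combined name 'both' is introduced here for illustration. The month is derived from the numeric month ("%m") so that the factor levels do not depend on the locale.

```r
library(zoo)
library(ggplot2)

# Stand-ins for data2015 / data2016, using only the values shown in the question
data2015 <- data.frame(variable = as.yearmon(2015 + 0:1 / 12), value = c(100, 99.95))
data2016 <- data.frame(variable = as.yearmon(2016 + 0:1 / 12), value = c(99.99, 99.90))

# Stack the two years, then split each yearmon into a year and a month column
both <- rbind(data2015, data2016)
d <- as.Date(both$variable)   # first day of each month
both$year <- factor(format(d, "%Y"))
both$month <- factor(month.abb[as.integer(format(d, "%m"))], levels = month.abb)

# One line per year; the ordered month factor keeps Jan to Dec in calendar order
p <- ggplot(both, aes(x = month, y = value, colour = year, group = year)) +
  geom_line() +
  geom_point()
```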
ggseasonplot from the forecast package can do that for you. Example code with a ts object:
ggseasonplot(a10, year.labels=TRUE, year.labels.left=TRUE) +
ylab("$ million") +
ggtitle("Seasonal plot: antidiabetic drug sales")
Source

Formatting dates in ggplot to highlight the start of financial years

I've got data referring to financial years, each starting on 1 April and ending on 31 March of the next calendar year.
df <- data.frame(date = seq(as.POSIXct("2008-04-01"), by="month", length.out=49),
var = rnorm(49))
head(df,3)
date var
1 2008-04-01 0.04265025
2 2008-05-01 -1.59671801
3 2008-06-01 0.4909673
Plotting df with library(ggplot2); ggplot(df) + geom_line(aes(date, var)) I get:
Now, what I'm interested in is having, say, the "2009" label positioned at "2009-04-01", as that is the actual start of FY 2009. I managed to get that with the following code (date_format comes from the scales package):
ggplot(df) + geom_line(aes(date, var)) +
scale_x_datetime(breaks = df$date[months(df$date)=="April"],
labels = date_format("%Y"))
which correctly gives:
My question is (finally :-) ): does anyone have a better way of showing financial years, and ideally better code than the above?
You could use geom_rect to highlight the financial years. Assuming you save your original plot as p, try:
bgdf <- data.frame(xmin=as.POSIXct(paste0(2008:2011,"-04-01")),
xmax=as.POSIXct(paste0(2009:2012,"-04-01")),
ymin=min(df$var),ymax=max(df$var),alpha=((2008:2011)%%2)*0.1)
p + geom_rect(aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax),
data=bgdf,alpha=bgdf$alpha,fill="blue")
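It can also help to compute the financial year explicitly, for example to label or group observations. The helper below is a hypothetical sketch (financial_year is not part of any package) and assumes, as in the question, that FY 2009 runs from 2009-04-01 to 2010-03-31.

```r
# Hypothetical helper: return the financial year a date falls in,
# where each financial year starts on 1 April
financial_year <- function(date) {
  lt <- as.POSIXlt(date)
  year <- lt$year + 1900   # POSIXlt counts years from 1900
  month <- lt$mon + 1      # POSIXlt months run from 0 to 11
  ifelse(month >= 4, year, year - 1)
}

fy <- financial_year(as.Date(c("2009-03-31", "2009-04-01", "2010-03-31")))

# Locale-independent way to pick the April dates to use as axis breaks
dates <- seq(as.Date("2008-04-01"), by = "month", length.out = 49)
april_breaks <- dates[as.POSIXlt(dates)$mon == 3]
```

Testing the month number instead of months(date) == "April" avoids depending on the session's locale for month names.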

Resources