Using ggplot to create a professional looking graph - r

I have tried to use ggplot2 to create a professional-looking graph, but I am having trouble with several things. I would like to add color to the data points, add dates on the x-axis, and create a line of best fit or something similar if possible. I have been searching on Stack Exchange and Google in general to try and solve this problem, but to no avail. I am using the "Civilian Labor Force Participation Rate: 20 years and over, Black or African American Men" series from the Federal Reserve Bank of St. Louis (FRED).
I am using RStudio. I downloaded the LNS11300031 series and read it in with the read.csv() function. I initially used the plot() function to plot the data, but I want to use ggplot() to create a better-looking graph. When I create the graph, the data points look very faint, blurry, and cloudy, and there are no labels on the x-axis. I would like to add color and a line of best fit, but I do not know how to do that.
This is the code I used to create the graph with no x-axis labels:
ggplot(data = labor, mapping = aes(x = labor$DATE, y = labor$LNS11300031)) + geom_point(alpha = 0.1)
This is the graph that my code produced:
Here is some sample data (labor is the variable I used to store the data from the FRED site):
head(labor)
DATE LNS11300031
1 1972-01-01 77.6
2 1972-02-01 78.3
3 1972-03-01 78.7
4 1972-04-01 78.6
5 1972-05-01 78.7
6 1972-06-01 79.4
I would like to change the variable name LNS11300031 to Labor Force Participation Rate
Additional information about the data:
str(labor)
'data.frame': 566 obs. of 2 variables:
$ DATE : Factor w/ 566 levels "1972-01-01","1972-02-01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ LNS11300031: num 77.6 78.3 78.7 78.6 78.7 79.4 78.8 78.7 78.6 78.1 ...
I would like the code to create much clearer data points with color and a trend line, and be able to have an x-axis with the corresponding dates.

Here's a basic attempt to cover the improvements you asked for:
Clearer points: don't set the alpha too low! A bit of alpha is good for overlapping points, but alpha = 0.1 makes them too blurry.
Colour: R understands simple colour names like "red", but also hex colour codes. Pick any colours you want.
Trend line: easy to add with stat_smooth(). I've used method='lm' which gives a straight linear regression line but there are more flexible alternatives.
Date labels on the x-axis: Make sure your DATE column is correctly set as a Date type, and use scale_x_date() to tweak the labels.
library(dplyr)    # %>% and mutate()
library(tibble)   # rownames_to_column()
library(ggplot2)
# Your data is available from FRED via the quantmod package
quantmod::getSymbols("LNS11300031", src = "FRED")
labor = LNS11300031 %>%
  as.data.frame() %>%
  rownames_to_column(var = "DATE") %>%
  # Make sure DATE is a Date column
  mutate(DATE = as.Date(DATE))
# Generally, you don't use data$column syntax within ggplot,
# just give the column name
ggplot(data = labor, mapping = aes(x = DATE, y = LNS11300031)) +
  geom_point(alpha = 0.7, colour = "#B07AA1") +
  stat_smooth(method = "lm", colour = "#E15759", se = FALSE) +
  scale_x_date(date_breaks = "5 years", date_labels = "%Y") +
  theme_minimal()
Output:
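If the data comes from read.csv() as in the question, DATE arrives as a factor rather than a Date, and the y-axis still shows the series ID. A minimal sketch of handling both might look like the following. The six rows are copied from the question's head(labor) output as a stand-in for the full file, and the axis titles and date label format are illustrative choices rather than part of the original answer.

```r
library(ggplot2)

# Stand-in for the CSV import: the first six rows shown in the question,
# with DATE stored as a factor as reported by str(labor).
labor <- data.frame(
  DATE = factor(c("1972-01-01", "1972-02-01", "1972-03-01",
                  "1972-04-01", "1972-05-01", "1972-06-01")),
  LNS11300031 = c(77.6, 78.3, 78.7, 78.6, 78.7, 79.4)
)

# Go through character so the date text itself is what gets parsed
labor$DATE <- as.Date(as.character(labor$DATE))

p <- ggplot(labor, aes(x = DATE, y = LNS11300031)) +
  geom_point(alpha = 0.7, colour = "#B07AA1") +
  stat_smooth(method = "lm", formula = y ~ x, colour = "#E15759", se = FALSE) +
  scale_x_date(date_labels = "%b %Y") +
  # labs() changes the displayed axis titles without renaming the column
  labs(x = "Date", y = "Labor Force Participation Rate") +
  theme_minimal()
```

Printing p draws the plot. Setting the title with labs() avoids having to rename the column to something containing spaces.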

Related

Trying to plot a time course, 2 factors, 1 response variable with standard error bars

As the title states, I'm trying to plot a time course of a response variable (with 2 factors), working in RStudio.
I'm working off a data frame that's already in long format. Something like:
Week  Factor1  Factor2  Response
1     Sunny    High     2.0
1     Sunny    High     3.5
1     Rainy    Low      2.5
1     Rainy    Low      1.5
2     Sunny    High     42.5
2     Sunny    High     435
2     Rainy    Low      44.5
2     Rainy    Low      42.5
3     Sunny    High     80.5
3     Sunny    High     89.5
3     Rainy    Low      88.5
3     Rainy    Low      87.5
I would like to do a time course with this data frame, but haven't had much success as I cannot figure out how to make ggplot2 plot the Response line as a variable responding to the combination of Factors.
I've sort of done it with geom_smooth(se=FALSE) instead of geom_line. Doing so with the se=FALSE argument removes the confidence interval bounds.
library(dplyr)
library(ggplot2)
library(tidyr)
df %>%
unite(Factor, c(Factor1, Factor2)) %>%
ggplot(aes(Week, Response, group = Factor, color = Factor)) +
geom_smooth(se=FALSE)
or
df <- unite(df,col='Combined Factors',c(Factor1,Factor2),sep='-',remove=FALSE)
df %>%
group_by(Week,`Combined Factors`) %>%
mutate(avg_Response=mean(Response),se_Response=sd(Response)/sqrt(4)) %>%
ggplot(aes(Week, Response,group=`Combined Factors`,color=`Combined Factors`))+
geom_smooth(se=FALSE) %>%
{.} -> Line_plot_Response
This creates the output graph
Now for adding error bars at each plotted point.
So far, I have only somewhat succeeded in plotting the line graph, but I cannot make R recognise Factor1 and Factor2 as a group and attach the standard error to it. I've tried creating another data frame and used the summarise.data function, but the resulting graph is a mess: the standard error bars are all over the graph and not fixed to each plotted point. I can do all of this in Excel, to be honest, but I would like to do it in R so that I can improve my coding understanding. I don't understand what is going wrong.
If I am understanding you correctly, for each combination of Factor1 and Factor2 you want to calculate the mean value of Response for each week, and use these points as the co-ordinates for a line plot. You would also like to draw error bars around these points.
The sample data you have provided isn't ideal to demonstrate this, but the code that gives the result as I have described it above would be something like:
df %>%
ggplot(aes(Week, Response, group = interaction(Factor1, Factor2),
color = interaction(Factor1, Factor2))) +
stat_summary(geom = 'line', fun = mean) +
stat_summary(geom = 'point', fun = mean, color = 'black') +
stat_summary(geom = 'errorbar', fun.data = mean_se, width = 0.1, alpha = 0.8, size = 0.3) +
labs(color = 'Conditions') +
scale_color_brewer(palette = 'Set1') +
theme_minimal(base_size = 16)
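For comparison, the same plot can be built by computing the summary table explicitly, which is closer to what the question attempted. This is only a sketch: the data frame below retypes the sample rows from the question exactly as given, and the names summary_df, mean_response, se_response, and Condition are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Sample data retyped from the question, values as given there
df <- data.frame(
  Week     = rep(1:3, each = 4),
  Factor1  = rep(c("Sunny", "Sunny", "Rainy", "Rainy"), times = 3),
  Factor2  = rep(c("High", "High", "Low", "Low"), times = 3),
  Response = c(2.0, 3.5, 2.5, 1.5, 42.5, 435, 44.5, 42.5, 80.5, 89.5, 88.5, 87.5)
)

# One row per Week x Factor1 x Factor2 with the mean and standard error;
# n() is the group size, so the divisor is not hard-coded
summary_df <- df %>%
  group_by(Week, Factor1, Factor2) %>%
  summarise(mean_response = mean(Response),
            se_response   = sd(Response) / sqrt(n()),
            .groups = "drop") %>%
  mutate(Condition = paste(Factor1, Factor2, sep = "-"))

# The error bars are drawn from the same summary rows as the points,
# so each bar sits on its own point
p <- ggplot(summary_df, aes(Week, mean_response,
                            group = Condition, colour = Condition)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean_response - se_response,
                    ymax = mean_response + se_response),
                width = 0.1)
```

Plotting the summarised data frame, rather than mixing raw and summarised data in one call, is what keeps the error bars attached to the plotted points.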

ggplot2 - How to plot length of time using geom_bar?

I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=T))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there are a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with geom_bar or geom_col. A more general approach would be to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit:
library(magrittr) # provides the %>% pipe
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so we have one row per rectangle that we want to draw, and we've also made Crop a factor. We can then plot it like this:
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here is that we are using a discrete scale for y but geom_rect prefers numeric values. Since Crop is a factor now, we use the numeric codes of the factor to create the ymin and ymax positions. Then we need to relabel the y axis with the names of the factor levels.
If you also wanted to get the month names on the x axis, you could do something like:
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to your plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))
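One thing to watch in the reshaped data is a row such as South/Soybean planting, where Begin is 245 and End is 1. Assuming an End smaller than Begin means the period runs past the end of the year, one possible way to handle it is to split such a row into two rectangles before plotting. The helper split_wrapping() below is hypothetical, not part of the original answer, and the 365-day year is an assumption.

```r
# Hypothetical helper: split periods that wrap past the end of the year
# into two rows, Begin..year_end and 1..End
split_wrapping <- function(d, year_end = 365) {
  wraps <- d$End < d$Begin
  first_part <- d[wraps, ]
  first_part$End <- year_end
  second_part <- d[wraps, ]
  second_part$Begin <- 1
  rbind(d[!wraps, ], first_part, second_part)
}

# Two rows taken from the South/Soybean entries in the question's data
intervals <- data.frame(
  Crop  = c("Soybean", "Soybean"),
  Type  = c("Planting", "Harvest"),
  Begin = c(245, 1),
  End   = c(1, 122)
)

fixed <- split_wrapping(intervals)
```

Applying the same helper to plotdata before the ggplot() call should leave every rectangle with xmin no larger than xmax.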

How do I plot 3 variables separately in ggplot?

I want to create a time series plot showing how two variables have changed over time, coloured by region.
I have 2 regions, England and Wales, and for each I have calculated the total_tax and the total_income.
I want to plot these with ggplot over the years, using the years variable.
How would I do this and colour the regions separately?
I have the year variable, which I will put on the x axis, and I want to plot both incometax and taxpaid on the graph to show how they have both changed over time.
How would I add a 3rd axis to plot how these two variables have changed over time?
I have tried this code but it has not worked the way I wanted it to do.
ggplot(tax_data, filter %>% aes(x=date)) +
geom_line(aes(y=incometax, color=region)) +
geom_line(aes(y=taxpaid, color=region))+
ggplot is a bit hard to grasp at the beginning. I guess you're trying to achieve something like the following:
Assuming your data is in a format with a column for each date, incometax and taxpaid - I'm creating here an example:
library(tidyverse)
dataset <- tibble(date = seq(from = as.Date("2015-01-01"), to = as.Date("2019-12-31"), by = "month"),
incometax = rnorm(60, 100, 10),
taxpaid = rnorm(60, 60, 5))
Now, for plotting a line for each incometax and taxpaid we need to shape or "tidy" the data (see here for details):
dataset <- dataset %>% pivot_longer(cols = c(incometax, taxpaid))
Now you have three columns like this - we've turned the former column names into the variable name:
# A tibble: 6 x 3
date name value
<date> <chr> <dbl>
1 2015-01-01 incometax 106.
2 2015-01-01 taxpaid 56.9
3 2015-02-01 incometax 112.
4 2015-02-01 taxpaid 65.0
5 2015-03-01 incometax 95.8
6 2015-03-01 taxpaid 64.6
This now has the right format for ggplot, and you can map the name to the colour of the lines:
ggplot(dataset, aes(x = date, y = value, colour = name)) + geom_line()
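The example above has no region column, whereas the question asks for England and Wales to be coloured separately. A sketch of one way to extend it, assuming the data has year, region, incometax, and taxpaid columns, is to map region to colour and the measure to linetype. The values below are random placeholders, and the names tax_long, measure, and amount are made up for illustration.

```r
library(tidyr)
library(ggplot2)

set.seed(1)
# Placeholder data: one row per year and region
tax_data <- data.frame(
  year      = rep(2015:2019, times = 2),
  region    = rep(c("England", "Wales"), each = 5),
  incometax = rnorm(10, 100, 10),
  taxpaid   = rnorm(10, 60, 5)
)

# Long format: one row per year, region, and measure
tax_long <- pivot_longer(tax_data, cols = c(incometax, taxpaid),
                         names_to = "measure", values_to = "amount")

# Colour distinguishes the regions, linetype distinguishes the two measures;
# the group makes sure each region/measure pair gets its own line
p <- ggplot(tax_long, aes(x = year, y = amount,
                          colour = region, linetype = measure,
                          group = interaction(region, measure))) +
  geom_line()
```

This keeps everything on a single y axis, so no third axis is needed.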

Simple ggplot2 situation with colors and legend

Trying to make some plots with ggplot2 and cannot figure out how colour works as defined in aes. Struggling with errors of aesthetic length.
I've tried defining colours in either main ggplot call aes to give legend, but also in geom_line aes.
# Define dataset:
number<-rnorm(8,mean=10,sd=3)
species<-rep(c("rose","daisy","sunflower","iris"),2)
year<-c("1995","1995","1995","1995","1996","1996","1996","1996")
d.flowers<-cbind(number,species,year)
d.flowers<-as.data.frame(d.flowers)
#Plot with no colours:
ggplot(data=d.flowers,aes(x=year,y=number))+
geom_line(group=species) # Works fine
#Adding colour:
#Defining aes in main ggplot call:
ggplot(data=d.flowers,aes(x=year,y=number,colour=factor(species)))+
geom_line(group=species)
# Doesn't work with data size 8, asks for data of size 4
ggplot(data=d.flowers,aes(x=year,y=number,colour=unique(species)))+
geom_line(group=species)
# doesn't work with data size 4, now asking for data size 8
The first plot gives
Error: Aesthetics must be either length 1 or the same as the data (4): group
The second gives
Error: Aesthetics must be either length 1 or the same as the data (8): x, y, colour
So I'm confused - when given aes of length either 4 or 8 it's not happy!
How could I think about this more clearly?
Here are @kath's comments as a solution. It's subtle to learn at first, but what goes inside or outside the aes() is key. Some more info here - When does the aesthetic go inside or outside aes()? and lots of good googleable "ggplot aesthetic" centric pages with lots of examples to cut and paste and try.
library(ggplot2)
number <- rnorm(8,mean=10,sd=3)
species <- rep(c("rose","daisy","sunflower","iris"),2)
year <- c("1995","1995","1995","1995","1996","1996","1996","1996")
d.flowers <- data.frame(number, species, year)
head(d.flowers)
#number species year
#1 8.957372 rose 1995
#2 7.145144 daisy 1995
#3 9.864917 sunflower 1995
#4 7.645287 iris 1995
#5 4.996174 rose 1996
#6 8.859320 daisy 1996
ggplot(data = d.flowers, aes(x = year,y = number,
group = species,
colour = species)) + geom_line()
#note geom_point() doesn't need to be grouped - try:
ggplot(data = d.flowers, aes(x = year,y = number, colour = species)) + geom_point()
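A side note on the switch from cbind() to data.frame() in the code above: cbind() on a mix of numeric and character vectors builds a character matrix, so number stops being numeric. A short sketch with made-up numbers shows the difference.

```r
# Made-up values standing in for rnorm() so the example is repeatable
number  <- c(8.9, 7.1, 9.8, 7.6, 5.0, 8.8, 10.2, 9.1)
species <- rep(c("rose", "daisy", "sunflower", "iris"), 2)
year    <- rep(c("1995", "1996"), each = 4)

# cbind() coerces everything to one type (character) before the conversion
via_cbind <- as.data.frame(cbind(number, species, year))

# data.frame() keeps each column's own type
via_data_frame <- data.frame(number, species, year)

is.numeric(via_cbind$number)       # FALSE
is.numeric(via_data_frame$number)  # TRUE
```

With a non-numeric y column, ggplot treats number as discrete, so building the data with data.frame() is the safer habit.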

Plotting different y-axis scaling using ggplot facet_grid()?

I'm running into trouble plotting some data onto two separate y-scales. Here are two visualizations of some air quality data I've been working with. The first figure depicts each pollutant on a parts per billion y-scale. In this figure, co dominates the y-axis, and none of the other pollutants' variation is being properly represented. Within air quality science, the pollutant co is conventionally represented in parts per million instead of parts per billion. The second figure illustrates the same no, no2, and o3 data, but I've converted the co concentration from ppb to ppm (divide by 1000). However, while no, no2, and o3 look better, the variation in co is not being justly represented...
Is there an easy way using ggplot() to normalize the scale of the y-axis and best represent each type of pollutant? I'm also trying to work through some other examples that make use of gridExtra to stitch together two separate plots, each retaining their original y-scales.
The data required to generate these figures is huge (26,295 observations), so I'm still working on a reproducible example. Hopefully a solution can be found within the ggplot() code described below:
plt <- ggplot(df, aes(x=date, y = value, color = pollutant)) +
geom_point() +
facet_grid(id~pollutant, labeller = label_both, switch = "y")
plt
Here's what the head(df) looks like (before converting the co to ppm):
date id pollutant value
1 2017-06-16 10:00:00 Pohl co 236.00
2 2017-06-16 10:00:00 Pohl no 23.06
3 2017-06-16 10:00:00 Pohl no2 12.05
4 2017-06-16 10:00:00 Pohl o3 8.52
5 2017-06-16 11:00:00 Pohl co 207.00
6 2017-06-16 11:00:00 Pohl no 20.82
Marius pointed out that including scales = "free_y" in the facet_grid() function would provide the desired output. Thanks!
Solution:
plt <- ggplot(df, aes(x=date, y = value, color = pollutant)) +
geom_point() +
facet_grid(pollutant~id, scales = "free_y", labeller = label_both, switch = "y")
plt
Output:
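If co should also be shown in ppm, as the question describes, the conversion can be combined with free y scales by putting the unit into the facet label. This is only a sketch: it uses the six rows shown in head(df) rather than the full data set, and the panel column is a made-up name.

```r
library(ggplot2)

# The six rows shown in the question's head(df), values in ppb
df <- data.frame(
  date = as.POSIXct(c("2017-06-16 10:00:00", "2017-06-16 10:00:00",
                      "2017-06-16 10:00:00", "2017-06-16 10:00:00",
                      "2017-06-16 11:00:00", "2017-06-16 11:00:00"),
                    tz = "UTC"),
  id        = "Pohl",
  pollutant = c("co", "no", "no2", "o3", "co", "no"),
  value     = c(236.00, 23.06, 12.05, 8.52, 207.00, 20.82)
)

# Convert co from ppb to ppm and record the unit in the facet label
is_co <- df$pollutant == "co"
df$value[is_co] <- df$value[is_co] / 1000
df$panel <- ifelse(is_co, "co (ppm)", paste0(df$pollutant, " (ppb)"))

# Each row of facets gets its own y range because of scales = "free_y"
plt <- ggplot(df, aes(x = date, y = value, color = pollutant)) +
  geom_point() +
  facet_grid(panel ~ id, scales = "free_y", switch = "y")
```

With free y scales the conversion no longer affects how much variation is visible; it only changes the numbers printed on the co axis.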

Resources