How to plot mixed-frequency series with NAs in ggplot? - r

I have the following dataframe x:
library(dplyr)  # needed for left_join()
x1 <- data.frame(Date = seq(as.Date("2010-01-01"),
                            as.Date("2012-12-01"),
                            by = "month"),
                 TS1 = rnorm(36, 0, 1),
                 TS2 = rnorm(36, 0, 1),
                 stringsAsFactors = FALSE)
x2 <- data.frame(Date = seq(as.Date("2010-01-01"),
                            as.Date("2012-12-01"),
                            by = "quarter"),
                 TS3 = rnorm(12, 0, 1),
                 stringsAsFactors = FALSE)
x <- left_join(x1, x2, by = "Date")
x contains two monthly series and one quarterly series.
I would like to plot all three series at the same time with ggplot. I am aware of dualplot as a way to do it. The issue with it, however, is that it only allows you to plot 2 mixed-frequency series.
Is there anyone who can help me with this?
Thanks!

Note that ggplot requires long format, so we first use tidyr::pivot_longer.
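Next, we can plot TS1 and TS2 easily, but TS3 will not plot at all: geom_line() breaks a line at every missing value, and each quarterly observation of TS3 is surrounded by NAs, so no segment can be drawn.
One option is to plot the line with missing values in a separate geom_line call that uses only the non-missing rows: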
library(dplyr)
library(ggplot2)
x2 <- x %>%
  tidyr::pivot_longer(cols = c(TS1, TS2, TS3), names_to = "TS") %>%
  mutate(TS = as.factor(TS))
ggplot(x2, aes(x = Date, y = value, group = TS, color = TS)) +
  geom_line() +
  geom_line(data = subset(x2, TS == "TS3" & !is.na(value)))
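
A possible variation on the approach above is to drop the missing rows while reshaping, so that a single geom_line() call draws all three series. This is only a sketch: it rebuilds the example data with base merge() instead of left_join(), and the names x_long and p are made up for illustration. values_drop_na is an argument of tidyr::pivot_longer().

```r
library(tidyr)
library(ggplot2)

set.seed(1)
x1 <- data.frame(Date = seq(as.Date("2010-01-01"), as.Date("2012-12-01"), by = "month"),
                 TS1 = rnorm(36), TS2 = rnorm(36))
x2 <- data.frame(Date = seq(as.Date("2010-01-01"), as.Date("2012-12-01"), by = "quarter"),
                 TS3 = rnorm(12))
# Left join in base R: keep every monthly date, NA where TS3 was not observed.
x <- merge(x1, x2, by = "Date", all.x = TRUE)

# Reshape to long format and drop the rows where the value is NA,
# so TS3 keeps only its 12 quarterly observations.
x_long <- pivot_longer(x, cols = c(TS1, TS2, TS3),
                       names_to = "TS", values_to = "value",
                       values_drop_na = TRUE)

# With no NAs left, one geom_line() connects the quarterly points directly.
p <- ggplot(x_long, aes(x = Date, y = value, colour = TS)) +
  geom_line()
```

Printing p draws the plot. The quarterly series is joined by straight segments between its observed points, which is the same picture the two-layer version produces.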

In this instance, the data does not have to be transformed into long format for ggplot (although that is a nice solution if you are familiar with transforming data, and it is recommended when there are many columns or separate lines to plot).
For simplicity, especially when learning ggplot, here is an alternative solution.
TS1 and TS2 can easily be plotted against date, as neither has NA values. Here, we call geom_line() twice, once for each line:
x %>%
ggplot()+
geom_line(aes(Date, TS1), colour = 'red')+
geom_line(aes(Date, TS2), colour = 'blue')
If you try and include a third geom_line() with TS3, only the original two lines are plotted due to TS3's missing values (NA). A solution is to fill in the NA values in the data before plotting, using zoo::na.approx(). As the name suggests, zoo::na.approx() is able to approximate values when you have NAs, by linear interpolation. In this instance, I assume linear interpolation between known values is appropriate for plotting (as geom_line is doing anyway). Check out ?zoo::na.approx for more details, including non-linear interpolation.
zoo::na.approx(TS3, Date, na.rm = FALSE) may be read aloud like: "Approximate the values of TS3 where they are missing (NA), using the values of Date as the positions to interpolate along, and keep any NAs that cannot be interpolated instead of removing them, so the result has the same length as the input."
x %>%
mutate(
TS3 = zoo::na.approx(TS3, Date, na.rm = FALSE)
) %>%
ggplot()+
geom_line(aes(Date, TS1), colour = 'red')+
geom_line(aes(Date, TS2), colour = 'blue')+
geom_line(aes(Date, TS3), colour = 'green')
Note that the green line stops two data points short of the other two lines. By default, zoo::na.approx() only interpolates an NA that lies between two known data points, so the NAs after the last quarterly value remain. Specifying na.rm = FALSE keeps those trailing NAs in place, so the result has the same length as the other columns. Look at the help page ?zoo::na.approx for alternatives (such as repeating the last known observation).
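
To see what happens at the ends of the series without installing anything, here is a small sketch using base R's stats::approx(), the linear interpolation function that zoo::na.approx() builds on. The data is a made-up one-year example, and the names ts3, inside_only and extended are illustrative only. The rule argument of approx() controls what happens outside the range of known points.

```r
# Quarterly observations placed on a monthly grid: NA where nothing was observed.
dates <- seq(as.Date("2010-01-01"), as.Date("2010-12-01"), by = "month")
ts3 <- rep(NA_real_, 12)
ts3[c(1, 4, 7, 10)] <- c(1, 4, 7, 10)

known <- !is.na(ts3)

# rule = 1 (the default): dates after the last known value stay NA.
inside_only <- approx(x = as.numeric(dates[known]), y = ts3[known],
                      xout = as.numeric(dates), rule = 1)$y

# rule = 2: dates outside the known range take the nearest known value,
# i.e. the last observation is repeated at the end.
extended <- approx(x = as.numeric(dates[known]), y = ts3[known],
                   xout = as.numeric(dates), rule = 2)$y
```

Here inside_only is NA for November and December, which is the "two points short" effect described above, while extended carries the October value forward to the end of the year.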

Related

geom_vline for values over a threshold on Y-axis

I have a ggplot of temperature values plotted against time. I'd like to add vertical lines to my graph where temperature exceeds a threshold (let's say 12 degrees).
reprex:
#example data
Temp <- c(10.55, 11.02, 6.75, 12.55, 15.5)
Date <- c("01/01/2000", "02/01/2000", "03/01/2000", "04/01/2000", "05/01/2000")
#data.frame
df1 <- data.frame(Temp, Date)
#plot
df1%>%
ggplot(aes(Date, format(as.numeric(Temp))))+
geom_line(group=1)
I thought I could maybe do something with geom_hline and then rotate 90 degrees. I went about this by trying to create an object of all values (to 2dp) between 12 and 20. I would then tell geom_hline to use that object to match values and draw the lines.
Then I get a bit stuck. I don't really know how to rotate the lines or whether that's even a good idea.
Disclaimer: I know my dates are not actually dates in the reprex, but they are in my real data.
geom_vline can accept an xintercept either
in the xintercept parameter (if you want to specify it manually) or
in aes(xintercept = ...) if you want to use values from a data frame. We can use data = . %>% filter... to use the same data frame that came into ggplot, but apply some further manipulations.
df1 %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
ggplot(aes(Date, Temp)) +
geom_line() +
geom_vline(data = . %>% filter(Temp > 12),
aes(xintercept = Date))
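
If the `. %>% filter(...)` shorthand looks opaque, the same idea can be sketched with an ordinary function: a layer's data argument may be a function, which ggplot2 then applies to the plot's data. The helper name hot_days and the ISO-formatted dates are assumptions made for this example.

```r
library(ggplot2)

df1 <- data.frame(
  Temp = c(10.55, 11.02, 6.75, 12.55, 15.5),
  Date = as.Date(c("2000-01-01", "2000-01-02", "2000-01-03",
                   "2000-01-04", "2000-01-05"))
)

# Hypothetical helper: keep only the rows above the 12-degree threshold.
hot_days <- function(d) d[d$Temp > 12, ]

p <- ggplot(df1, aes(Date, Temp)) +
  geom_line() +
  # ggplot2 calls hot_days() on the plot data and draws one line per remaining row.
  geom_vline(data = hot_days, aes(xintercept = Date))
```

Calling hot_days(df1) directly is a quick way to check which dates will get a vertical line before plotting.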
If you want to have vertical lines starting from the level of 12:
ggplot(df1, aes(Date, as.numeric(Temp)))+
geom_line(group=1) +
geom_segment(data= df1[df1$Temp>12,],
aes(x = Date,
xend = Date,
y = 12,
yend = Temp),
color = "blue", lwd = 1)

How to graph mean of each date for different groups?

I have a data set with 3 columns: date, weight, and location. I want to make a graph with time on the x-axis and weight on the y-axis with a different line for each location, where each point on the line is the mean weight of all samples from that location at that date. The only ways I've been able to come up with to do this would take way too long and require more lines of code than seems reasonable just to make a graph. For instance I tried subsetting like this:
A <- df$Location == "A"
Aug10_19 <- df$Date == "2019/07/10"
ind <- Aug10_19 & A
mean(df$Weight[ind])
But then I would have to do this manually for every individual combination of date and location and then force all the means into a new data frame. What is the shorter way to accomplish this?
You can use ggplot2 to quickly create summary plots.
library(dplyr)
library(ggplot2)
df <- transmute(
iris,
Location = Species,
Date = as.Date(as.character(
cut(Sepal.Length, breaks = 3,
labels = c("2019-07-10", "2019-07-12", "2019-07-15")))),
Weight = Sepal.Width)
ggplot(data = df,
mapping = aes(x = Date, y = Weight, colour = Location)) +
stat_summary(fun = "mean", geom = "line") +
theme_bw()
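
If you also need the means themselves, not just the plot, one option is to compute them first and plot the summarised table. This is a sketch with made-up data; the column names follow the question, and means is a name chosen for illustration. It uses base R's aggregate().

```r
library(ggplot2)

# Hypothetical data: two locations, two dates, two samples each.
df <- data.frame(
  Date     = as.Date(rep(c("2019-07-10", "2019-07-12"), each = 4)),
  Location = rep(c("A", "A", "B", "B"), times = 2),
  Weight   = c(1, 3, 5, 7, 2, 4, 6, 8)
)

# One row per Date/Location combination, holding the mean weight.
means <- aggregate(Weight ~ Date + Location, data = df, FUN = mean)

# Plot the pre-computed means: one line per location.
p <- ggplot(means, aes(x = Date, y = Weight, colour = Location)) +
  geom_line() +
  theme_bw()
```

This should draw the same lines as the stat_summary() version, with the advantage that means can be inspected or exported.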

Is there a function to allow the assignment of numeric values to characters?

So, the issue is as follows: I have a dataset which contains
A Condition factor variable with (for this example) 3 levels that need to be plotted on a y axis,
A Group factor variable with three levels to be plotted on the x, and
A value for each group at every condition (example data below).
The three levels on the x axis indicate conditions and I would like to display observations at each level on the y in a violin plot format. I am aware of the fact that I need a numeric on the y axis for ggplot to plot these data, but cannot find a solution to solve this issue of nesting specific values (which will change from experiment to experiment) for the y value at each x condition. My progress (after receiving prior help here) has been properly formatting the data into a data frame, and melting the data into a long format for ggplot.
Example data below:
Condition   Observation   Value
1           A             11
1           B             7
1           C             2
2           A             21
2           B             2
2           C             5
3           A             16
3           B             45
3           C             34
EDIT:
SampleA <- c(3,7,9)
SampleB <- c(15,23,33)
SampleC <- c(21,19,12)
Observations <- c("Observation 1", "Observation 2", "Observation 3")
df0 <- data.frame(Observations = as.factor(Observations), SampleA, SampleB, SampleC)
library(ggplot2)
df0 <- reshape2::melt(df0, id.vars = "Observations")
I'd suggest something like this:
library(dplyr)
df0 = df0 %>%
group_by(Observations) %>%
mutate(norm_value = value / sum(value))
ggplot(df0, aes(x = Observations, y = variable, fill = norm_value)) +
geom_tile() +
geom_label(aes(label = scales::percent(norm_value)), fill = "gray80") +
guides(fill = "none") +
coord_equal() +
labs(x = "", y = "") +
theme_minimal()
If you have a lot of data, I'd remove the individual labels and rely on the color scale, but with this few points direct labels seem clearest.

How can I set my own tick labels in ggplot while plotting factor values of time series?

So, I am plotting some time series in ggplot and on the x axis I got some date/time data. Data from 2008 to 2016. The problem is that dates are not continuous and for instance the last date of 2008 is
2008/05/14 19:05:12
and the next date is for 2009 something like this
2009/03/24 10:17:54
While plotting these, the result is the following
In order to get rid of the empty spaces and get the correct plot, I turn my dates into factors:
dates <- factor(dates)
But after that I am unable to set the x tick labels as they don't change using
scale_x_continuous(breaks = c(1,1724,2283,5821,8906,10112,10156,14875 ),
labels = c("2008","2009","2010","2011","2012","2013","2014","2015"))
How can I change them?
There are a few problems here, and the solution will really depend on what you're looking for. I'd suggest you post some sample data and your code so far to get a more precise answer, but here's a possibility in the meantime:
Your graph above is not showing a continuous scale (though it may look like it); it's a discrete scale with the number of levels corresponding to unique date observations. Two problems come out of this:
applying scale_x_continuous won't work, because the axis is discrete and the year breaks won't be evenly spread
your data looks like it's smoothly spread over time, but it isn't, which isn't a good principle for visualisation.
If what you're trying to do is show change year-by-year you could sort all of your data into yearly 'bins' and plot:
library(tidyverse)
library(lubridate)
# creating random data
df <- tibble(date = as_datetime(runif(1000, as.numeric(as_datetime("2001/01/24 09:30:43")), as.numeric(as_datetime("2006/02/24 09:30:43")))))
df["val"] <- rnorm(nrow(df), 25, 5)
# use lubridate to extract year as new variable, and plot grouped years
df %>%
mutate(year = factor(year(date))) %>%
ggplot(aes(year, val)) +
geom_point(position = "jitter")
Another possibility could be to use a colour scale to note your groupings by year, keeping all the dates in order but removing the gaps (and therefore not using a continuous x-axis scale):
df %>% # begin by simulating a data 'gap'
filter(date>as_datetime("2003/07/24 09:30:43")|date<as_datetime("2002/09/24 09:30:43")) %>%
mutate(year = factor(year(date)), # 'year' to select colour
date = factor(date)) %>%
ggplot(aes(date, val, col = year)) +
geom_point() +
theme(axis.ticks.x = element_blank(), # removes all ticks and labels, as too many unique times
axis.text.x = element_blank())
If neither of those is helpful, do comment below with any clarifications of what you're looking for, and I'll see if I can help!
Edit: One last idea, you could create an invisible series of points which act as the breaks for your axis ticks:
blank_labels <- tibble(date = as_datetime(c("20020101 000000",
"20030101 000000",
"20040101 000000",
"20050101 000000",
"20060101 000000")),
col = "NA", val = 0)
df2 <- df %>%
filter(date>as_datetime("2003/07/24 09:30:43")|date<as_datetime("2002/09/24 09:30:43")) %>%
mutate(col = "black") %>%
bind_rows(blank_labels) %>%
mutate(date_fac = factor(date))
tick_values <- left_join(blank_labels, df2, by = c("date", "col"))
df2 %>%
ggplot(aes(date_fac, val, col = col)) +
geom_point() +
scale_x_discrete(breaks = tick_values$date_fac, labels = c("2002", "2003", "2004", "2005", "2006")) +
scale_color_identity()
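
A lighter variation on the same idea, sketched here under the assumption that the rows are already sorted by time, is to use the first observation of each year as the break for that year instead of adding invisible points. The timestamps and the names date_fac, year_breaks and year_labels are made up for illustration.

```r
library(ggplot2)

# Hypothetical irregular timestamps with long gaps between years.
dates <- as.POSIXct(c("2008-05-13 08:00:00", "2008-05-14 19:05:12",
                      "2009-03-24 10:17:54", "2009-03-25 11:00:00",
                      "2010-06-01 09:30:00"), tz = "UTC")
df <- data.frame(
  # Formatted this way, alphabetical level order is also chronological order.
  date_fac = factor(format(dates, "%Y/%m/%d %H:%M:%S")),
  year     = format(dates, "%Y"),
  val      = c(1.2, 0.7, 1.9, 1.4, 0.3)
)

# First observation of each year (assumes rows are sorted by time).
first_of_year <- !duplicated(df$year)
year_breaks <- as.character(df$date_fac[first_of_year])
year_labels <- df$year[first_of_year]

# On a discrete axis the breaks are factor levels, not numeric positions.
p <- ggplot(df, aes(date_fac, val)) +
  geom_point() +
  scale_x_discrete(breaks = year_breaks, labels = year_labels)
```

The tick for each year then sits at the first point recorded in that year, so the spacing between ticks reflects the number of observations rather than elapsed time.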

Plotting multiple time series on the same plot using ggplot()

I am fairly new to R and am attempting to plot two time series lines simultaneously (using different colors, of course) making use of ggplot2.
I have 2 data frames. the first one has 'Percent change for X' and 'Date' columns. The second one has 'Percent change for Y' and 'Date' columns as well, i.e., both have a 'Date' column with the same values whereas the 'Percent Change' columns have different values.
I would like to plot the 'Percent Change' columns against 'Date' (common to both) using ggplot2 on a single plot.
The examples that I found online made use of the same data frame with different variables to achieve this, I have not been able to find anything that makes use of 2 data frames to get to the plot. I do not want to bind the two data frames together, I want to keep them separate. Here is the code that I am using:
ggplot(jobsAFAM, aes(x=jobsAFAM$data_date, y=jobsAFAM$Percent.Change)) + geom_line() +
xlab("") + ylab("")
But this code produces only one line and I would like to add another line on top of it.
Any help would be much appreciated.
TIA.
ggplot allows you to have multiple layers, and that is what you should take advantage of here.
In the plot created below, you can see that there are two geom_line statements hitting each of your datasets and plotting them together on one plot. You can extend that logic if you wish to add any other dataset, plot, or even features of the chart such as the axis labels.
library(ggplot2)
jobsAFAM1 <- data.frame(
data_date = runif(5,1,100),
Percent.Change = runif(5,1,100)
)
jobsAFAM2 <- data.frame(
data_date = runif(5,1,100),
Percent.Change = runif(5,1,100)
)
ggplot() +
geom_line(data = jobsAFAM1, aes(x = data_date, y = Percent.Change), color = "red") +
geom_line(data = jobsAFAM2, aes(x = data_date, y = Percent.Change), color = "blue") +
xlab('data_date') +
ylab('percent.change')
If both data frames have the same column names, put one data frame inside the ggplot() call and name the x and y values inside aes() of that call. Then add a first geom_line() for the first line and a second geom_line() with data = df2 (where df2 is your second data frame). If you need the lines in different colors, add color= with a name for each line inside aes() of each geom_line().
df1<-data.frame(x=1:10,y=rnorm(10))
df2<-data.frame(x=1:10,y=rnorm(10))
ggplot(df1,aes(x,y))+geom_line(aes(color="First line"))+
geom_line(data=df2,aes(color="Second line"))+
labs(color="Legend text")
I prefer using the ggfortify library. It is a ggplot2 wrapper that recognizes the type of object inside the autoplot function and chooses the best ggplot methods to plot. At least I don't have to remember the syntax of ggplot2.
library(ggfortify)
ts1 <- 1:100
ts2 <- 1:100*0.8
autoplot(ts( cbind(ts1, ts2) , start = c(2010,5), frequency = 12 ),
facets = FALSE)
I know this is old but it is still relevant. You can take advantage of reshape2::melt to change the dataframe into a more friendly structure for ggplot2.
Advantages:
allows you plot any number of lines
each line with a different color
adds a legend for each line
with only one call to ggplot/geom_line
Disadvantage:
an extra package(reshape2) required
melting is not so intuitive at first
For example:
jobsAFAM1 <- data.frame(
data_date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 100),
Percent.Change = runif(5,1,100)
)
jobsAFAM2 <- data.frame(
data_date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 100),
Percent.Change = runif(5,1,100)
)
jobsAFAM <- merge(jobsAFAM1, jobsAFAM2, by="data_date")
jobsAFAMMelted <- reshape2::melt(jobsAFAM, id.var='data_date')
ggplot(jobsAFAMMelted, aes(x=data_date, y=value, col=variable)) + geom_line()
This is an old question, but here is an updated tidyverse workflow that is not mentioned above.
library(tidyverse)
jobsAFAM1 <- tibble(
date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 5),
Percent.Change = runif(5, 0,1)
) %>%
mutate(serial='jobsAFAM1')
jobsAFAM2 <- tibble(
date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 5),
Percent.Change = runif(5, 0,1)
) %>%
mutate(serial='jobsAFAM2')
jobsAFAM <- bind_rows(jobsAFAM1, jobsAFAM2)
ggplot(jobsAFAM, aes(x=date, y=Percent.Change, col=serial)) + geom_line()
@Chris Njuguna
tidyr::gather() is the tidyverse function for turning a wide data frame into a long, tidy layout (it has since been superseded by tidyr::pivot_longer()); ggplot can then plot multiple series from the long data.
An alternative is to bind the dataframes, and assign them the type of variable they represent. This will let you use the full dataset in a tidier way
library(ggplot2)
library(dplyr)
df1 <- data.frame(dates = 1:10,Variable = rnorm(mean = 0.5,10))
df2 <- data.frame(dates = 1:10,Variable = rnorm(mean = -0.5,10))
df3 <- df1 %>%
mutate(Type = 'a') %>%
bind_rows(df2 %>%
mutate(Type = 'b'))
ggplot(df3,aes(y = Variable,x = dates,color = Type)) +
geom_line()

Resources