ggplot2 comparison of time periods - r

I need to visualize and compare two equally long sales periods, 2018/2019 and 2019/2020. Both periods begin at week 44 and end at week 36 of the following year. If I plot them against the year-week, the two periods are drawn one after the other instead of on top of each other. If I use only the week number, the values are sorted from 1 to 52, so each period is split in the wrong place and the graph does not make sense. Can you think of a solution?
Thank you
Data:
library(ggplot2)
library(tsibble) # provides yearweek()
set.seed(1)
df1 <- data.frame(sells = runif(44),
week = c(44:52,1:35),
YW = yearweek(seq(as.Date("2018-11-01"), as.Date("2019-08-31"), by = "1 week")),
period = "18/19")
df2 <- data.frame(sells = runif(44),
week = c(44:52,1:35),
YW = yearweek(seq(as.Date("2019-11-01"), as.Date("2020-08-31"), by = "1 week")),
period = "19/20")
# Year-week on the x axis: the two periods are drawn one after the other
ggplot(df1, aes(YW, sells)) +
geom_line(aes(color="Period 18/19")) +
geom_line(data=df2, aes(color="Period 19/20")) +
labs(color="Legend text")
# Week on the x axis: weeks are sorted from 1 to 52 and not split by year
ggplot(df1, aes(week, sells)) +
geom_line(aes(color="Period 18/19")) +
geom_line(data=df2, aes(color="Period 19/20")) +
labs(color="Legend text")

Another alternative is to facet it. This'll require combining the two sets into one, preserving the data source. (This is commonly a better way of dealing with it in general, anyway.)
(I don't have tsibble, so my YW is just the seq(...) of dates, without yearweek. It should translate.)
ggplot(dplyr::bind_rows(tibble::lst(df1, df2), .id = "id"), aes(YW, sells)) +
geom_line(aes(color = id)) +
facet_wrap(id ~ ., scales = "free_x", ncol = 1)
In place of dplyr::bind_rows, one might also use data.table::rbindlist(..., idcol="id"), or do.call(rbind, ...), though with the latter you will need to assign id externally.
One more note: the default formatting of the x-axis is obscuring the "year" of the data. If this is relevant/important (and not apparent elsewhere), then use ggplot2's normal mechanism for forcing labels, e.g.,
... +
scale_x_date(labels = function(z) format(z, "%Y-%m"))
It is unlikely that you have dplyr without tibble::lst being available, but if so, you can replace it with list(df1=df1, df2=df2) or similar.

If you want to keep the x axis as a numeric scale, you can do:
ggplot(df1, aes((week + 9) %% 52, sells)) +
geom_line(aes(color="Period 18/19")) +
geom_line(data=df2, aes(color="Period 19/20")) +
scale_x_continuous(breaks = 1:52,
labels = function(x) ifelse(x == 9, 52, (x - 9) %% 52),
name = "week") +
labs(color="Legend text")
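To see why the shift works, the mapping can be checked on its own. This is a small sketch in base R; the names shifted and relabel are made up for illustration and are not part of the answer above.

```r
# Weeks in plotting order: 44..52 of the first year, then 1..35 of the next
week <- c(44:52, 1:35)

# Shift so that week 44 lands on position 1 and week 35 on position 44
shifted <- (week + 9) %% 52

# Inverse mapping used for the axis labels (hypothetical helper name)
relabel <- function(x) ifelse(x == 9, 52, (x - 9) %% 52)

# Show the first rows of the mapping: original week, axis position, axis label
head(data.frame(week, shifted, label = relabel(shifted)), 12)
```

Position 9 needs the special case because (9 - 9) %% 52 would give 0 rather than 52.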

Try this. You can convert your week variable to an ordered factor so that it keeps the desired order. Here is the code:
library(ggplot2)
library(tsibble)
#Data
df1$week <- factor(df1$week,levels = unique(df1$week),ordered = TRUE)
df2$week <- factor(df2$week,levels = unique(df2$week),ordered = TRUE)
#Plot
ggplot(df1, aes(week, sells)) +
geom_line(aes(color="Period 18/19",group=1)) +
geom_line(data=df2, aes(color="Period 19/20",group=1)) +
labs(color="Legend text")

Related

How to plot a continuous line with repeating x-axis values

I have a data set of Standardized Precipitation Index values from 1980 to 2005. There is one value for each month, so altogether there are 312 (26 years * 12 months) values. The SPI values range between -3 and +3. Here is an easy reproducible example, since the exact values are not important for my question:
vec1 <- rep(seq(1980, 2005), each= 12)
vec2 <- sample(x = -3:3, size = 312, replace = TRUE)
df <- data.frame(vec1, vec2)
colnames(df) <- c("Year", "SPI")
Now I would like to plot the SPI values with the years being the x-axis.
When I try to plot it using ggplot2:
ggplot() +
geom_line(aes(x=df$Year, y=df$SPI))
Something like this comes out:
So the problem is, there is no continuous line.
I can plot it with a continuous line with Base R for example:
plot(vec2, type="l")
But then the problem is that the x-axis only shows the values 1:312 and I need the years as the x-values.
Anybody with a hint?
EDIT after the answer of marcguery:
It turned out that I cannot use a line plot for my purpose. Instead, I need to do a column plot with many single columns when using ggplot2 since I need to color the areas above/below zero.
marcguery's answer works for a geom_line() plot, but unfortunately not for a geom_col() plot. I have no idea why.
Here is the modified code:
vec1 <- seq(as.Date("1980-01-01"),
by = "month",
to = as.Date("2005-12-01"))
vec2 <- sample(x = -3:3, size = 312, replace = TRUE)
vec3 <- 1:312
df <- data.frame(vec1, vec2, vec3)
colnames(df) <- c("Date", "SPI", "ID")
library(data.table)
df <- as.data.table(df)
This is what unfortunately does not work with the dates as x-axis, there is a strange output:
library(ggplot2)
# with Date as x-axis
ggplot(data= df, aes(x= Date, y= SPI, width= 1)) +
geom_col(data = df[SPI <= 0], fill = "red") +
geom_col(data = df[SPI >= 0], fill = "blue") +
theme_bw()
This is what works with the simple rownumber as x-axis:
# with ID as x-axis
ggplot(data= df, aes(x= ID, y= SPI, width= 1)) +
geom_col(data = df[SPI <= 0], fill = "red") +
geom_col(data = df[SPI >= 0], fill = "blue") +
theme_bw()
I need something like the last example, just with the dates as the x-axis.
All twelve monthly observations of a year share the same value in your Year column, which is why ggplot cannot assign them different x values. Since you are working with dates, you could use the Date format for your time points so that each month is assigned a different value.
#Seed for reproducibility
set.seed(123)
#Data
vec1 <- seq(as.Date("1980-01-01"),
by = "month",
to = as.Date("2005-12-01"))
vec2 <- sample(x = -3:3, size = 312, replace = TRUE)
df <- data.frame(vec1, vec2)
colnames(df) <- c("Date", "SPI")
#Plot
library(ggplot2)
ggplot(df) +
geom_line(aes(x = Date, y = SPI))+
scale_x_date(breaks = "5 years", date_labels = "%Y",
limits = c(as.Date("1979-12-01"),
as.Date("2006-01-01")),
expand = c(0,0))
Edit after you added your question about coloring the area between your values and 0 based on the sign of the values:
You can definitely use a geom_line plot for that purpose. Using a geom_col plot is a possibility, but you would lose visual information about the change between your x values, which are continuously related as they represent dates.
To plot a nice geom_line, I will base my approach on the answer here https://stackoverflow.com/a/18009173/14027775. You will have to adapt your data by transforming your dates to numerical values, for instance number of days since a given date (typically 1970/01/01).
#Colored plot
#Numerical format for dates (number of days after 1970-01-01)
df$numericDate <- difftime(df$Date,
as.Date("1970-01-01", "%Y-%m-%d"),
units="days")
df$numericDate <- as.numeric(df$numericDate)
rx <- do.call("rbind",
sapply(1:(nrow(df)-1), function(i){
f <- lm(numericDate~SPI, df[i:(i+1),])
if (f$qr$rank < 2) return(NULL)
r <- predict(f, newdata=data.frame(SPI=0))
if(df[i,]$numericDate < r & r < df[i+1,]$numericDate)
return(data.frame(numericDate=r,SPI=0))
else return(NULL)
}))
#Get back to Date format
rx$Date <- as.Date(rx$numericDate, origin = "1970-01-01")
d2 <- rbind(df,rx)
ggplot(d2,aes(Date,SPI)) +
geom_area(data=subset(d2, SPI<=0), fill="red") +
geom_area(data=subset(d2, SPI>=0), fill="blue") +
geom_line()+
scale_x_date(breaks = "5 years", date_labels = "%Y",
limits = c(as.Date("1979-12-01"),
as.Date("2006-01-01")),
expand = c(0,0))
Now, if you want to keep using geom_col: the reason you don't see all the bars when using dates on the x axis is that each bar is only one day wide on an axis spanning 26 years, so it is too thin for its fill to be drawn. By setting the outline color as well as the fill, you should be able to see all of them.
ggplot(data= df, aes(x= Date, y= SPI)) +
geom_col(data = df[df$SPI <= 0,],
fill = "red", color="red", width= 1) +
geom_col(data = df[df$SPI >= 0,],
fill = "blue", color="blue", width= 1) +
scale_x_date(breaks = "5 years", date_labels = "%Y",
limits = c(as.Date("1979-12-01"),
as.Date("2006-01-01")),
expand = c(0,0))
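As a possible variation, the bars can be widened instead of outlined, because on a Date axis the width of a bar is measured in days. The sketch below also maps the fill to the sign of SPI so that a single geom_col layer is enough. The width of 28 and the choice to colour zeros blue are assumptions for illustration.

```r
library(ggplot2)

# Same kind of made-up monthly data as above
set.seed(123)
df <- data.frame(
  Date = seq(as.Date("1980-01-01"), as.Date("2005-12-01"), by = "month"),
  SPI  = sample(-3:3, 312, replace = TRUE)
)

# On a Date axis, width is in days, so 28 fills roughly one month
# without overlapping the neighbouring bar (February has 28 days)
ggplot(df, aes(Date, SPI, fill = SPI >= 0)) +
  geom_col(width = 28) +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"),
                    guide = "none") +
  scale_x_date(date_breaks = "5 years", date_labels = "%Y") +
  theme_bw()
```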

Avoid repetitive, similar analysis and plots

I have a table with many variables. One of the variables contains year information: from 1999 till 2010.
I need to do for each year the same analysis, for instance, to plot a graph, a histogram, etc.
Currently, I subset the data so that each year goes into a data frame(table) and I do the analysis in turn for each year. This is very inefficient:
library(data.table)
library(ggplot2)
dates <- sample(seq(as.Date('1999/01/01'), as.Date('2010/01/01'), by="day"), 50, replace = TRUE)
dt<-data.table( YEAR = format.Date(dates,"%Y"),
Var1=sample(0:100, 50, rep=TRUE),
Var2 =sample(0:500, 50, rep=TRUE)
)
year_1999<-dt[YEAR=="1999"]
plot_1999<- ggplot(year_1999, aes (x=Var1))+
geom_line(aes(y=Var2), size=1, color="blue") +
labs(y="V2", x="V1", title="Year 1999")
plot_1999
How can I better write this in a compact way? I suppose I need a function but I have no idea how to.
Instead of repeating the code several times, we can pass 'YEAR' to facet_wrap to get one panel per year:
library(ggplot2)
ggplot(dt, aes(x = Var1, y = Var2)) +
geom_line(size = 1, color = "blue") + # constants go outside aes()
labs(y = "V2", x = "V1") +
facet_wrap(~ YEAR)
Try this if you want to create a separate plot object for each unique year in dt$YEAR:
for (i in unique(dt$YEAR)) {
year <- dt[YEAR==i]
plot <- ggplot(year, aes (x=Var1))+
geom_line(aes(y=Var2), size=1, color="blue") +
labs(y="V2", x="V1", title=paste("Year", i))
assign(paste("plot", i, sep=""), plot)
}
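An alternative to assign() is to keep the plots in a named list. The following is only a sketch under assumptions: the data are made up, and plot_year is a hypothetical helper name, not something from the answers above.

```r
library(ggplot2)

# Toy data with the same columns as the question (values are made up)
set.seed(42)
dt <- data.frame(YEAR = sample(as.character(1999:2001), 60, replace = TRUE),
                 Var1 = sample(0:100, 60, replace = TRUE),
                 Var2 = sample(0:500, 60, replace = TRUE))

# Hypothetical helper: builds one plot from one year's subset
plot_year <- function(d) {
  ggplot(d, aes(x = Var1, y = Var2)) +
    geom_line(color = "blue") +
    labs(x = "V1", y = "V2", title = paste("Year", d$YEAR[1]))
}

# One plot per year, stored in a named list instead of separate variables
plots <- lapply(split(dt, dt$YEAR), plot_year)
plots[["1999"]]
```

A list like this can be looped over later, for example to save every plot with ggsave, without having to reconstruct variable names.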

Timestamp on x-axis in timeseries ggplot

I have measurement data from the past months:
Variables
x <- df$DatoTid
y <- df$Partikler
color <- df$Opgave
I'm trying to plot my data based on the timestamp, so that I have the hours of the day in the x-axis, instead of the specific POSIXct datetime.
I would like the labels and ticks of the x-axis to be, e.g., "00:00", "01:00", ..., "24:00".
So that noon is in the middle of the x-axis.
So far I tried to convert the datetime values into characters.
Doesn't look good yet (as you can see the axis ticks and labels are gone. Possibly other things are wrong as well).
Can someone help me?
And please let me know how to upload the data for you. I don't know how to add a huge .csv-file....
# Rounding to the nearest 5 min (300 s) and shifting by one hour (3600 s):
head(df)
df$Tid2 <- format(strptime("1970-01-01", "%Y-%m-%d", tz="CET") +
round(as.numeric(df$DatoTid)/300)*300 + 3600, "%Y-%m-%d %H:%M:%S")
head(df)
df$Tid2 <- as.character(df$Tid2)
str(df)
x <- df$Tid2
y <- df$Partikler
color <- df$Opgave
plot2 <- ggplot(data = df, aes(x = x, y = y, color = color)) +
geom_point(shape=16, alpha=0.6, size=1.8) +
scale_y_continuous(labels=function(x) format(x, big.mark = ".", decimal.mark = ",", scientific = FALSE)) +
scale_x_discrete(breaks=c("00:00:00", "06:00:00", "09:00:00", "12:00:00", "18:00:00", "21:00:00")) +
scale_color_discrete(name = "Case") +
xlab(" ") +
ylab(expression(paste("Partikelkoncentration [pt/cc]"))) +
myTheme +
theme(legend.text=element_text(size=8), legend.title=element_text(size=8))
plot2
I would approach this by making a new time stamp that uses a single day, but the hours/minutes/seconds of your existing time stamp.
First, here's a made-up version of your data, here using a linear trend in Partikler:
library(tidyverse); library(lubridate)
df <- tibble(Tid2 = seq.POSIXt(from = ymd_h(2019010100),
to = ymd_h(2019011500), by = 60*60),
Partikler = seq(from = 0, to = 2.5E5, along.with = Tid2),
Opgave = as.factor(floor_date(Tid2, "3 days")))
# Here's a plot that's structurally similar to yours:
ggplot(df, aes(Tid2, Partikler, col = Opgave)) +
geom_point() +
scale_color_discrete(name = "Case")
Now, if we change the timestamps to be in the same day, we can control them like usual in ggplot, but with them collapsed into a single day of timing. We can also change the x axis so it doesn't mention the date component of the time stamp:
df2 <- df %>%
mutate(Tid2_sameday = ymd_hms(paste(Sys.Date(),
hour(Tid2), minute(Tid2), second(Tid2))))
ggplot(df2, aes(Tid2_sameday, Partikler, col = Opgave)) +
geom_point() +
scale_color_discrete(name = "Case") +
scale_x_datetime(date_labels = "%H:%M")
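Another option, sketched here with base R and ggplot2 only, is to plot the hour of day as a plain number, which makes it easy to get ticks from "00:00" to "24:00". The data and the column names hour_of_day and day are made up for illustration.

```r
library(ggplot2)

# Made-up hourly measurements over three days (UTC to avoid DST surprises)
stamps <- seq(as.POSIXct("2019-01-01 00:00", tz = "UTC"),
              as.POSIXct("2019-01-03 23:00", tz = "UTC"), by = "hour")
df <- data.frame(Tid = stamps, Partikler = seq_along(stamps))

# Hour of day as a decimal number, e.g. 13:30 becomes 13.5
df$hour_of_day <- as.numeric(format(df$Tid, "%H")) +
  as.numeric(format(df$Tid, "%M")) / 60
df$day <- format(df$Tid, "%Y-%m-%d")

ggplot(df, aes(hour_of_day, Partikler, color = day)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 24), breaks = seq(0, 24, by = 3),
                     labels = function(h) sprintf("%02d:00", as.integer(h)),
                     name = NULL)
```

With limits of 0 to 24, noon sits in the middle of the axis.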

superpose densities, non exclusive subsets

I need to have several density functions onto a single plot. Each density corresponds to a subset of my overall dataset. The subsets are defined by the value taken by one of the variables in the dataset.
Concretely, I would like to draw a density function for the 1, 3, and 10 year horizons. Of course, the 10 year horizon includes the shorter ones. Likewise, the 3 year horizon density should include the data from the last year.
The subsets need to correspond to data[period == 1,], data[period <= 3, ], data[period <= 10,].
I have managed to do so by adding geom_density layers on top of each other, i.e., by redefining the data each time.
ggplot() +
geom_density(data = data[period <=3,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="red") +
geom_density(data = data[period ==1,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="grey") +
geom_density(data = data, aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="green")
It works fine but I feel like this is not the right way to do it (and indeed, it makes e.g., the creation of a legend cumbersome).
On the other hand, doing like that :
ggplot(data, aes(x=BEST_CUR_EV_TO_EBITDA, color=period)) +
geom_density(alpha=.2, fill="blue")
won't do because then the periods are taken to be mutually exclusive.
Is there a way to specify aes(color) based on the value taken by period where subsets overlap?
Running code:
library(data.table)
library(lubridate)
library(ggplot2)
YEARS <- 10
today <- Sys.Date()
lastYr <- Sys.Date()-years(1)
last3Yr <- Sys.Date()-years(3)
start.date = Sys.Date()-years(YEARS)
date = seq(start.date, Sys.Date(), by=1)
BEST_CUR_EV_TO_EBITDA <- rnorm(length(date), 3,1)
data <- cbind.data.frame(date, BEST_CUR_EV_TO_EBITDA)
data <- cbind.data.frame(data, period = rep(10, nrow(data)))
subPeriods <- function(aDf, from, to, value){
aDf[aDf$date >= from & aDf$date <= to, "period"] = value
return(aDf)
}
data <- subPeriods(data, last3Yr, today, 3)
data <- subPeriods(data, lastYr, today, 1)
data <- data.table(data)
colScale <- scale_colour_manual(
name = "horizon"
, values = c("1 Y" = "grey", "3 Y" = "red", "10 Y" = "green"))
ggplot() +
geom_density(data = data[period <=3,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="red") +
geom_density(data = data[period ==1,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="grey") +
geom_density(data = data, aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="green") +
colScale
One way to deal with dependent grouping is to create an independent grouping based on the existing groups. Below I do this by creating three new columns (period_one, period_three and period_ten) with the mutate function, where
period_one = BEST_CUR_EV_TO_EBITDA values for period==1
period_three = BEST_CUR_EV_TO_EBITDA values for period<=3
period_ten = BEST_CUR_EV_TO_EBITDA values for all periods
These columns are then converted to long format with the gather function, which stacks the column names (period_one, period_three and period_ten) in the variable "period" and the corresponding values in the column "val".
library(dplyr) # mutate(), select(), %>%
library(tidyr) # gather()
df2 <- data %>%
mutate(period_one=ifelse(period==1, BEST_CUR_EV_TO_EBITDA, NA),
period_three=ifelse(period<=3, BEST_CUR_EV_TO_EBITDA, NA),
period_ten=BEST_CUR_EV_TO_EBITDA) %>%
select(date, starts_with("period_")) %>%
gather(period, val, period_one, period_three, period_ten)
The ggplot is straightforward with long format consisting of independent grouping:
ggplot(df2, aes(val, fill=period)) + geom_density(alpha=.2)
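The same idea can be sketched without dplyr or tidyr by stacking the overlapping subsets and tagging each copy with its horizon. The data below are made up, and the names long and horizon are assumptions for illustration.

```r
library(ggplot2)

# Made-up data: period is 1, 3 or 10, as in the question
set.seed(1)
data <- data.frame(period = sample(c(1, 3, 10), 300, replace = TRUE),
                   BEST_CUR_EV_TO_EBITDA = rnorm(300, 3, 1))

# Stack the overlapping subsets; a row can appear in several horizons
long <- rbind(
  cbind(data[data$period == 1, ], horizon = "1 Y"),
  cbind(data[data$period <= 3, ], horizon = "3 Y"),
  cbind(data,                     horizon = "10 Y")
)
long$horizon <- factor(long$horizon, levels = c("1 Y", "3 Y", "10 Y"))

ggplot(long, aes(BEST_CUR_EV_TO_EBITDA, fill = horizon)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(name = "horizon",
                    values = c("1 Y" = "grey", "3 Y" = "red", "10 Y" = "green"))
```

Because the horizon is now an ordinary column, the legend comes for free from the fill scale.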

ggplot2: plotting non-contiguous time durations as a bar chart

I'm using ggplot to plot various events as a function of the date (x-axis) and start time (y-axis) on which they began. The data/code are as follows:
date<-c("2013-06-05","2013-06-05","2013-06-04","2013-06-04","2013-06-04","2013-06-04","2013-06-04",
"2013-06-04","2013-06-04","2013-06-03","2013-06-03","2013-06-03","2013-06-03","2013-06-03",
"2013-06-02","2013-06-02","2013-06-02","2013-06-02","2013-06-02","2013-06-02","2013-06-02")
start <-c("07:36:00","01:30:00","22:19:00","22:12:00","20:16:00","19:19:00","09:00:00",
"06:45:00","01:03:00","22:15:00","19:05:00","08:59:00","08:01:00","07:08:00",
"23:24:00","20:39:00","18:53:00","16:57:00","15:07:00","14:33:00","13:24:00")
duration <-c(0.5,6.1,2.18,0.12,1.93,0.95,10.32,
2.25,5.7,2.78,3.17,9.03,0.95,0.88,
7.73,2.75,1.77,1.92,1.83,0.57,1.13)
event <-c("AF201","SS431","BE201","CD331","HG511","CD331","WQ115",
"CD331","SS431","WQ115","HG511","WQ115","CD331","AF201",
"SS431","WQ115","HG511","WQ115","CD331","AS335","CD331")
df<-data.frame(date,start,duration,event)
library(ggplot2)
library(scales)
p <- ggplot(df, aes(as.Date(date),as.POSIXct(start,format='%H:%M:%S'),color=event))
p <- p+geom_point(alpha = I(6/10),size=5)
p + ylab("time (hr)") + xlab("date") + scale_x_date(labels = date_format("%m/%d")) +
scale_y_datetime(labels = date_format("%H"))+
scale_colour_hue(h=c(360, 90)) +
theme(axis.text.x = element_text(hjust=1, angle=0))
The resulting plot looks like this:
Question: Instead of simply indicating the start time of the event with a single point (shown above), how can I plot a bar that spans the time duration of the event? As shown in the data frame above I have this duration data (in hours). Alternatively, I could supply a 'stop time' (not shown).
I'm imagining the solution would look something like a stacked bar chart. However, a bar chart isn't quite right as it assumes the bar starts at the bottom of the plot and that the vertically stacked events have no gaps between them. My events may be non-contiguous -- 'starting' and 'stopping' at various positions along the y-axis. The solution will also have to take into consideration that 1) some events may ultimately be concurrent (overlap in time) and 2) some events will span multiple days.
I'd be very grateful for any suggestions!
It's a bit unclear exactly what you want. @Michele's answer seemed good; I wasn't clear whether you wanted to use geom_rect because it would make for thicker lines (if so, just change the line width) or for another reason. I decided to give it a go using geom_rect to enable dodging. I've plotted it with the starting date on the x axis, and the start and end times on y. I've set up the data slightly differently to enable that. If you're after something different, try to make it explicit, but at least here's another option:
df<-data.frame(date,start,duration,event)
df <- transform(df,
start = as.POSIXct(paste(date, start)),
end = as.POSIXct(paste(date, start)) + duration*3600)
df <- df[c("event", "start", "end")]
df$date <- strptime(df$start, "%Y-%m-%d")
df$start.new <- format(df$start, format = "%H:%M:%S")
df$end.new <- format(df$end, format = "%H:%M:%S")
df$day <- factor(as.POSIXct(df$date))
levels(df$day) <- 1:4
df$day <- as.numeric(as.character(df$day))
# Integer code 1..7 per event; factor() makes this work even when event is a character column (R >= 4.0)
df$event.int <- as.numeric(factor(df$event))
p <- ggplot(df, aes(day, start)) + geom_rect(aes(ymin = start, ymax = end,
xmin = (day - 0.45) + event.int/10,
xmax = (day - 0.35) + event.int/10,
fill = event)) +
scale_x_discrete(limits = 1:4,breaks = 1:4, labels = sort(unique(date)),
name = "Start date") + ylab("Duration")
Thanks (+1s) to @Michele and @alexwhan for your input. Using geom_rect I was able to get all of the events which occur on the same date on the same point on the x axis. (I'm anticipating that this data set may ultimately include many months of events.)
df<-data.frame(date,start,duration,event)
library(ggplot2)
library(scales) # provides date_format()
p <- ggplot(df, aes(xmin=as.Date(date),xmax=as.Date(date)+1,
ymin=as.POSIXct(start,format='%H:%M:%S'),
ymax=as.POSIXct(start,format='%H:%M:%S')+duration*3600,
fill=event))
p <- p+geom_rect(alpha = I(8/10))
p + ylab("time") + xlab("date") + scale_x_date(labels = date_format("%m/%d")) +
scale_y_datetime(labels = date_format("%H"))+
scale_colour_hue(h=c(360, 90)) +
theme(axis.text.x = element_text(hjust=1, angle=0))
... resulting in this:
This is pretty close to what I was aiming for.
I think I can deal with the potential overplotting issue by adjusting the alpha.
Ideally I'd like the y axis to include just a single day (00 to 00). To do this I guess I'll probably need to reformat the data such that events with durations that extend beyond midnight are reallocated to the next day. (Not sure how to do this in R.)
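One way to do that reallocation is to cut each event at the midnight following its start. This is a minimal sketch in base R on two made-up events; it assumes that no event lasts 24 hours or more, and the names ev, midnight, crosses and pieces are chosen for illustration.

```r
# Two made-up events; the second one runs past midnight
ev <- data.frame(
  event    = c("AF201", "SS431"),
  start    = as.POSIXct(c("2013-06-05 07:36:00", "2013-06-04 22:19:00"), tz = "UTC"),
  duration = c(0.5, 6.1)                       # hours
)
ev$end <- ev$start + ev$duration * 3600

# Midnight following each start (assumes durations below 24 h)
midnight <- as.POSIXct(paste(as.Date(ev$start, tz = "UTC") + 1, "00:00:00"),
                       tz = "UTC")
crosses <- ev$end > midnight

# Part of every event that falls on its start day
first <- data.frame(event = ev$event, start = ev$start,
                    end = pmin(ev$end, midnight))
# Remainder of the events that cross midnight, moved to the next day
second <- data.frame(event = ev$event[crosses], start = midnight[crosses],
                     end = ev$end[crosses])

pieces <- rbind(first, second)
pieces$day <- as.Date(pieces$start, tz = "UTC")
pieces
```

Each row of pieces then lies within a single calendar day, so the time of day can be used for ymin and ymax and day for the x position.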
Try this method. It is probably different from what you planned, but I think it's quite a clear way to show your data:
df<-data.frame(date,start,duration,event)
df <- transform(df,
start = as.POSIXct(paste(date, start)),
end = as.POSIXct(paste(date, start)) + duration*3600)
df <- df[c("event", "start", "end")]
library(reshape2)
df <- melt(df, id.vars="event")
df$value <- as.POSIXct(df$value, origin=as.Date("1970-01-01"))
df <- df[order(df$event, df$value),]
df$eventID <- rep(seq(1, nrow(df)/2, 1), each=2)
library(ggplot2)
ggplot(df) +
geom_line(aes(value, event, group=eventID, color=event))
This answer combines the benefits of: (i) a y-axis containing a single ~24 hour period; (ii) events not overlapping; (iii) events labelled within the graph in addition to the legend; and (iv) concise code.
library(dplyr)
library(lubridate)
library(ggplot2)
# Re-create data frame (tibble() replaces the deprecated data_frame())
df <- tibble(date, start, duration, event) %>%
mutate(start_dt = as.POSIXct(paste(date, start), tz = 'UTC'),
start_hr = hour(start_dt),
end_dt = start_dt + duration * 3600,
# as.numeric() turns the day difference (a difftime) into a plain number
end_hr = hour(end_dt) + as.numeric(as.Date(end_dt) - as.Date(start_dt)) * 24)
# Plot
df %>% ggplot() +
geom_segment(aes(x = event, y = start_hr, xend = event, yend = end_hr,
color = event, size = 1)) +
facet_wrap(~ date, nrow = 1) +
guides(size = 'none')