I am doing an exploratory data analysis for data that is collected at the daily level over many years. The relevant time period is about 18 - 20 months from the same date each year. What I would like to do is visually inspect these 18 month periods one on top of the other. I can do this as below by adding the data for each period in its own geom_point() call, but I would like to avoid one call per period.
min ex:
library(tidyverse)
library(lubridate) # mdy(), month(), day(); only attached automatically by tidyverse >= 2.0.0
minex <- data.frame(dts = seq((mdy('01/01/2010')), mdy('11/10/2013'), by = 'days'))
minex$day <- as.numeric(minex$dts - min(minex$dts))
minex$MMDD <- paste0(month(minex$dts), "-", day(minex$dts))
minex$v1 <- 20 + minex$day^0.4 -cos(2*pi*minex$day/365) + rnorm(nrow(minex), 0, 0.3)
ggplot(filter(minex, dts %in% seq((mdy('11/10/2013') - (365 + 180)), mdy('11/10/2013'), by =
'days')), aes(day, v1)) +
geom_point() +
geom_point(data = filter(minex, dts %in% seq((mdy('11/10/2012') - (365 + 180)),
mdy('11/10/2012'), by = 'days')), aes(day+365, v1), color = 'red')
Since you have overlapping spans of time, I think we can lapply over your end dates, mutate the data a little, then use normal ggplot2 aesthetics to color them.
spans <- bind_rows(lapply(mdy("11/10/2010", "11/10/2011", "11/10/2012", "11/10/2013"), function(end) {
filter(minex, between(dts, end - (365 + 180), end)) %>%
mutate(day = day - min(day), end = end)
}))
ggplot(spans, aes(day, v1)) +
geom_point(aes(color = factor(end)))
You can see the range of each with a quick summary:
spans %>%
group_by(end) %>%
summarize(startdate = min(dts), enddate = max(dts))
# # A tibble: 4 x 3
# end startdate enddate
# <date> <date> <date>
# 1 2010-11-10 2010-01-01 2010-11-10
# 2 2011-11-10 2010-05-14 2011-11-10
# 3 2012-11-10 2011-05-15 2012-11-10
# 4 2013-11-10 2012-05-14 2013-11-10
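If typing the end dates by hand gets tedious, they could also be generated. This is only a sketch in base R, assuming the end date falls on the same calendar day each year; the names ends and windows are made up for illustration and are not part of the answer above.

```r
# End dates one year apart, starting from the first end date of interest
ends <- seq(as.Date("2010-11-10"), by = "year", length.out = 4)

# Each window covers the 365 + 180 days before its end date, as in the answer
windows <- data.frame(start = ends - (365 + 180), end = ends)
windows
```

The ends vector could then be passed to lapply() in place of the mdy(...) call.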
I have a dataframe df_have that looks like this:
Arrival_Date Cust_ID Wait_Time_Mins Cust_Priority
<chr> <int> <int> <int>
1 1/01/2010 612345 114 1
2 1/01/2010 415911 146 4
3 1/01/2010 445132 13 2
4 1/01/2010 515619 72 3
5 1/01/2010 725521 155 4
6 1/01/2010 401404 100 5
... ... ... ...
And I want to create five line graphs - 1 for each of the unique values in Cust_Priority -
overlaid on the same plot, such that each line shows the average Wait_Time_Mins by Cust_Priority by month.
How would I do this?
I know how
You can use floor_date to change the date to 1st day of the month. Then for each Cust_Priority in each Month get the average wait time and create a line plot.
We use scale_x_date to format the labels on X-axis.
library(dplyr)
library(lubridate)
library(ggplot2)
df_have %>%
# If the date is in mdy format, use the mdy() function instead of dmy() to convert Arrival_Date
mutate(Arrival_Date = dmy(Arrival_Date),
date = floor_date(Arrival_Date, 'month')) %>%
group_by(Cust_Priority, date) %>%
summarise(Wait_Time_Mins = mean(Wait_Time_Mins), .groups = 'drop') %>%
ggplot(aes(date, Wait_Time_Mins, color = factor(Cust_Priority),
group = Cust_Priority)) +
geom_line() +
labs(x = "Month", y = "Average wait time",
title = "Average wait time for each month", color = "Customer Priority") +
scale_x_date(date_labels = '%b - %Y', date_breaks = '1 month')
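To check the aggregation step on its own before plotting, here is a small sketch. The data frame toy and its values are invented for illustration, and the dates are assumed to be in day/month/year order.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data in the same shape as df_have
toy <- data.frame(
  Arrival_Date   = c("1/01/2010", "15/01/2010", "3/02/2010", "20/02/2010"),
  Cust_Priority  = c(1L, 1L, 2L, 2L),
  Wait_Time_Mins = c(100L, 200L, 30L, 50L)
)

monthly <- toy %>%
  # floor_date() maps every date to the first day of its month
  mutate(month = floor_date(dmy(Arrival_Date), "month")) %>%
  group_by(Cust_Priority, month) %>%
  summarise(avg_wait = mean(Wait_Time_Mins), .groups = "drop")

monthly
```

Each row of monthly is one point on one of the lines in the plot.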
Having a dataframe (df) containing a time series for a single variable (X):
X time
1 6.905551 14-01-2021 14:53
2 6.852534 27-01-2021 18:24
3 7.030995 23-01-2021 11:11
4 7.083345 23-01-2021 01:19
5 7.003437 28-01-2021 01:07
6 7.040500 14-01-2021 23:34
7 6.940566 14-01-2021 13:42
8 6.989434 22-01-2021 18:37
9 7.032720 22-01-2021 17:50
10 7.001651 23-01-2021 19:05
I am using the time as a factor to create a plot displaying points in an equidistant manner, for which I require a conversion from the original timestamp e.g. "2021-01-14 12:07:53 CET" to 14-01-2021 12:07.
This is done by factor(format(timestamp, "%d-%m-%Y %H:%M")).
Now for the plotting I use ggplot2:
ggplot(aes(x = time, y = X, group=1), data=df) +
geom_line(linetype="dotted") + geom_point() + theme_linedraw() +
theme(axis.text.x = element_text(angle = -40)) +
scale_x_discrete(breaks=df$time[seq(1,length(df$time),by=4)], name="Date")
As indicated, I want to change the tick frequency for the x axis to avoid overlap. Ideally, ticks are placed in an equidistant manner as well per day, e.g 14-01-2021, 22-01-2021 and so on. By scale_x_discrete, I am able to place ticks for every nth factor but they end up plotting this (which is to be expected):
I have also looked into using the dates directly by as.Date(timestamp) and for the scaling e.g. scale_x_date(date_breaks = "4 days"). This obviously yields the correct equidistant tick spacing but the plot itself will end stacking values for the same date and thus containing gaps.
EDIT
@Jon Springs' answer works well if there are no duplicate times caused by multiple observations. With duplicates, however, the result is the following when facet_grid is used to separate the observations by the said variable.
In this case the df looks like (with grouper being the variable used for facet_wrap):
X time grouper
1 6.905551 14-01-2021 14:53 red
2 6.905551 14-01-2021 14:53 green
3 6.852534 27-01-2021 18:24 red
4 6.852534 27-01-2021 18:24 green
5 7.030995 23-01-2021 11:11 red
6 7.030995 23-01-2021 11:11 green
set.seed(0)
library(dplyr)
library(ggplot2)
my_data <- tibble(X = rnorm(10),
time_delay = runif(10, 1, 1000)) %>%
mutate(time = as.POSIXct("2021-01-14") + cumsum(time_delay)*1E5) %>%
# Label every other NEW time
arrange(time) %>%
mutate(label = if_else(
cumsum(time != lag(time, default = as.POSIXct("2000-01-01"))) %% 2 < 1,
format(time, "%d-%m-%Y\n%H:%M"),
"")
)
my_data
ggplot(my_data, aes(x = time %>% as.factor,
y = X, group = 1)) +
geom_line() +
scale_x_discrete(labels = my_data$label)
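When the same timestamp occurs more than once (as in the faceted case from the EDIT), one possible variation is to build the labels from the factor levels instead of from the rows, by passing a function to scale_x_discrete(). This is only a sketch; toy, time_f and label_every_other are made-up names, and the data are invented.

```r
library(ggplot2)

# Hypothetical toy data: every timestamp appears once per group
toy <- data.frame(
  X = c(6.90, 6.90, 7.08, 7.08, 7.03, 7.03, 6.85, 6.85),
  time = as.POSIXct(rep(c("2021-01-14 14:53", "2021-01-23 01:19",
                          "2021-01-23 11:11", "2021-01-27 18:24"), each = 2),
                    tz = "UTC"),
  grouper = rep(c("red", "green"), times = 4)
)

# Year-first format, so the factor levels sort chronologically
toy$time_f <- factor(format(toy$time, "%Y-%m-%d %H:%M"))

# Label every other level and blank out the rest
label_every_other <- function(lv) {
  keep <- seq_along(lv) %% 2 == 1
  ifelse(keep, format(as.POSIXct(lv, tz = "UTC"), "%d-%m-%Y\n%H:%M"), "")
}

p <- ggplot(toy, aes(x = time_f, y = X, group = 1)) +
  geom_line() +
  geom_point() +
  scale_x_discrete(labels = label_every_other) +
  facet_wrap(~grouper)
```

Because the labels depend only on the levels, repeated times in several facets do not shift them.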
I am new to coding in R so please excuse the simple question. I am trying to run ggridges geom in R to create monthly density plots. The code is below, but it creates a plot with the months in the wrong order:
The code references a csv data file with 3 columns (see image) - MST, Aeco_5a, and month: Any suggestions on how to fix this would be greatly appreciated. Here is my code:
> library(ggridges)
> read_csv("C:/Users/Calvin Johnson/Desktop/Aeco_Price_2017.csv")
Parsed with column specification:
cols(
MST = col_character(),
Month = col_character(),
Aeco_5a = col_double()
)
# A tibble: 365 x 3
MST Month Aeco_5a
<chr> <chr> <dbl>
1 1/1/2017 January 3.2678
2 1/2/2017 January 3.2678
3 1/3/2017 January 3.0570
4 1/4/2017 January 2.7811
5 1/5/2017 January 2.6354
6 1/6/2017 January 2.7483
7 1/7/2017 January 2.7483
8 1/8/2017 January 2.7483
9 1/9/2017 January 2.5905
10 1/10/2017 January 2.6902
# ... with 355 more rows
>
> mins<-min(Aeco_Price_2017$Aeco_5a)
> maxs<-max(Aeco_Price_2017$Aeco_5a)
>
> ggplot(Aeco_Price_2017,aes(x = Aeco_5a,y=Month,height=..density..))+
+ geom_density_ridges(scale=3) +
+ scale_x_continuous(limits = c(mins,maxs))
This has two parts: (1) you want your months to be factor instead of chr, and (2) you need to order the factors the way we typically order months.
With some reproducible data:
library(tidyverse) # as_tibble(), gather(), mutate(), ggplot()
library(ggridges)
df <- sapply(month.abb, function(x) { rnorm(10, rnorm(1), sd = 1)})
df <- as_tibble(df) %>% gather(key = "month")
Then you need to mutate month to be a factor, and use the levels defined by the actual order they show up in the data.frame (unique gives the unique values in the dataset, in the order they first appear in your data: "Jan", "Feb", ...). Then you need to reverse them, because otherwise "Jan", being the first level, would end up at the bottom of the plot.
df %>%
# switch to factor, and define the levels the way you want them to show up
# in the ggplot; "Dec", "Nov", "Oct", ...
mutate(month = factor(month, levels = rev(unique(df$month)))) %>%
ggplot(aes(x = value, y = month)) +
geom_density_ridges()
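If the rows are not already in calendar order, unique() will not give the right levels. Since the CSV in the question holds full English month names, a sketch using the built-in month.name constant might look like this (the vector Month here is invented for illustration):

```r
# Hypothetical month column, deliberately out of calendar order
Month <- c("March", "January", "February")

# Reversed calendar order, so that January ends up at the top of the ridge plot
month_f <- factor(Month, levels = rev(month.name))

levels(month_f)
```

For abbreviated names, month.abb would play the same role.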
With ggplot2, I would like to create a multiplot (facet_grid) where each plot is the weekly count values for the month.
My data are like this :
day_group count
1 2012-04-29 140
2 2012-05-06 12595
3 2012-05-13 12506
4 2012-05-20 14857
For this dataset I have created two other columns, Month and Week, based on day_group:
day_group count Month Week
1 2012-04-29 140 Apr 17
2 2012-05-06 12595 May 18
3 2012-05-13 12506 May 19
4 2012-05-20 14857 May 2
Now I would like for each Month to create a barplot where I have the sum of the count values aggregated by week. So for example for a year I would have 12 plots with 4 bars (one per week).
Below is what I use to generate the plot :
ggplot(data = count_by_day, aes(x=day_group, y=count)) +
stat_summary(fun.y="sum", geom = "bar") +
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
facet_grid(facets = Month ~ ., scales="free", margins = FALSE)
So far, my plot looks like this
https://dl.dropboxusercontent.com/u/96280295/Rplot.png
As you can see, the x axis is not what I'm looking for. Instead of showing only weeks 1, 2, 3 and 4, it displays the whole month.
Do you know what I must change to get what I'm looking for?
Thanks for your help
Okay, now that I see what you want, I wrote a small program to illustrate it. The key to your order of month problem is making month a factor with the levels in the right order:
library(dplyr)
library(ggplot2)
#initialization
set.seed(1234)
sday <- as.Date("2012-01-01")
eday <- as.Date("2012-07-31")
# List of the first day of the months
mfdays <- seq(sday,length.out=12,by="1 month")
# list of months - this is key to keeping the order straight
mlabs <- months(mfdays)
# list of first weeks of the months
mfweek <- trunc((mfdays-sday)/7)
names(mfweek) <- mlabs
# Generate a bunch of event-days, and then months, then week numbs in our range
n <- 1000
edf <-data.frame(date=sample(seq(sday,eday,by=1),n,T))
edf$month <- factor(months(edf$date),levels=mlabs) # use the factor in the right order
edf$week <- 1 + as.integer(((edf$date-sday)/7) - mfweek[edf$month])
# Now summarize with dplyr
ndf <- group_by(edf,month,week) %>% summarize( count = n() )
ggplot(ndf) + geom_bar(aes(x=week,y=count),stat="identity") + facet_wrap(~month,nrow=1)
Yielding:
(As an aside, I am kind of proud I did this without lubridate ...)
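A different and simpler way to get a week-within-month number is to derive it from the day of the month. Note that this is not the same rule as in the program above (weeks here are just 7-day blocks starting on the 1st, so a month can have a 5th partial week); it is only a sketch with invented dates.

```r
# Hypothetical example dates (not from the question)
dates <- as.Date(c("2012-04-29", "2012-05-01", "2012-05-07",
                   "2012-05-08", "2012-05-31"))

# Days 1-7 -> week 1, days 8-14 -> week 2, ..., days 29-31 -> week 5
week_in_month <- (as.integer(format(dates, "%d")) - 1) %/% 7 + 1

# Month as a factor in calendar order, independent of the locale
month_f <- factor(month.name[as.integer(format(dates, "%m"))],
                  levels = month.name)

data.frame(dates, month_f, week_in_month)
```

The columns month_f and week_in_month can then be used for faceting and for the x axis respectively.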
I think you have to do this but I am not sure I understand your question:
ggplot(data = count_by_day, aes(x=Week, y=count, group= Month, color=Month))
I am initially having the dataset as shown below:
ID A B Type Time Date
1 12 13 R 23:20 1-1-01
1 13 12 F 23:40 1-1-01
1 13 11 F 00:00 2-1-01
1 15 10 R 00:20 2-1-01
1 12 06 W 00:40 2-1-01
1 11 09 F 01:00 2-1-01
1 12 10 R 01:20 2-1-01
so on...
I tried to make the ggplot of the above dataset for A and B.
ggplot(data=dataframe, aes(x=A, y=B, colour = Type)) +geom_point()+geom_path()
Problem:
How do I add a subsetting variable that marks the first 24 hours after every 'F' point?
For the time being I have posted a data set that is continuous with respect to time, but my original data set is not. How can I make my data set continuous at an interval of 10 mins? I have used the interpolation function xspline() on A and B, but I don't know how to make my data set continuous with respect to time.
The highlighted part shown below is what I am looking for, I want to extract this dataset and then plot a new ggplot:
From MarkusN plots this is what I am looking for:
Taking first point as 'F' point and traveling 24hrs from that point (Since there is no 24 hrs data set available here so it should produce like this) :
I've tried the following, maybe you can get an idea from here. I recommend you to first have a variable with the time ordered (either in minutes or hours, in this example I've used hours). Let's see if it helps
library(ggplot2)
# a data set is built as an example
N = 100
set.seed(1)
dataframe = data.frame(A = cumsum(rnorm(N)),
B = cumsum(rnorm(N)),
Type = sample(c('R','F','W'), size = N,
prob = c(5/7,1/7,1/7), replace=T),
time.h = seq(0,240,length.out = N))
# here, a list with dataframes is built with the sequences
l_dfs = lapply(which(dataframe$Type == 'F'), function(i, .data){
transform(subset(.data[i:nrow(.data),], (time.h - time.h[1]) <= 24),
t0 = sprintf('t0=%4.2f', time.h[1]))
}, dataframe)
ggplot(data=do.call('rbind', l_dfs), aes(x=A, y=B, colour=Type)) +
geom_point() + geom_path(colour='black') + facet_wrap(~t0)
First I created sample data. Hope it's similar to your problem:
df = data.frame(id=rep(1:9), A=c(12,13,13,14,12,11,12,11,10),
B=c(13,12,10,12,6,9,10,11,12),
Type=c("F","R","F","R","W","F","R","F","R"),
datetime=as.POSIXct(c("2015-01-01 01:00:00","2015-01-01 22:50:00",
"2015-01-02 08:30:00","2015-01-02 23:00:00",
"2015-01-03 14:10:00","2015-01-05 16:30:00",
"2015-01-05 23:00:00","2015-01-06 17:00:00",
"2015-01-07 23:00:00")),
stringsAsFactors = F)
Your first question is to plot the data, highlighting the first 24h after an F-point. I used dplyr and ggplot for this task.
library(dplyr)
library(ggplot2)
df %>%
mutate(nf = cumsum(Type=="F")) %>% # build F-to-F groups
group_by(nf) %>%
mutate(first24h = as.numeric((datetime-min(datetime)) < (24*3600))) %>% # find the first 24h of each F-group
mutate(lbl=paste0(row_number(),"-",Type)) %>%
ggplot(aes(x=A, y=B, label=lbl)) +
geom_path(aes(colour=first24h)) + scale_size(range = c(1, 2)) +
geom_text()
One thing I'm not happy with is the use of different line colours for path sections: the colour only changes at data points. If first24h is a discrete variable,
geom_path draws two separate paths. That's why I defined the variable as numeric. Maybe someone can improve this?
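To see what the F-to-F grouping and the 24h flag do without any plotting, here is a small base R sketch on invented data (toy, start and hours are names made up for this illustration):

```r
# Hypothetical toy data with two 'F' points
toy <- data.frame(
  Type = c("F", "R", "R", "F", "R"),
  datetime = as.POSIXct(c("2015-01-01 01:00:00", "2015-01-01 12:00:00",
                          "2015-01-02 08:00:00", "2015-01-03 00:00:00",
                          "2015-01-03 23:00:00"), tz = "UTC")
)

# Every 'F' starts a new group
toy$nf <- cumsum(toy$Type == "F")

# Hours elapsed since the 'F' point that opened the group
start <- ave(as.numeric(toy$datetime), toy$nf, FUN = min)
toy$hours <- (as.numeric(toy$datetime) - start) / 3600

# TRUE for rows within 24 hours of their 'F' point
toy$first24h <- toy$hours < 24
toy
```

Working in plain seconds via as.numeric() avoids having to think about the units of the time difference.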
Your second question about an interpolation can easily be solved with the zoo package:
library(zoo)
full.time = seq(df$datetime[1], tail(df$datetime, 1), by=600) # new timeline with point at every 10 min
d.zoo = zoo(df[,2:3], df$datetime) # convert to zoo object
d.full = as.data.frame(na.approx(d.zoo, xout=full.time)) # interpolate; result is also a zoo object
d.full$datetime = as.POSIXct(rownames(d.full))
With these two dataframes combined, you get the solution. Every F-F section is drawn in a separate plot and only the points no more than 24h after the F-point are shown.
df %>%
select(Type, datetime) %>%
right_join(d.full, by="datetime") %>%
mutate(Type = ifelse(is.na(Type),"",Type)) %>%
mutate(nf = cumsum(Type=="F")) %>%
group_by(nf) %>%
mutate(first24h = (datetime-min(datetime)) < (24*3600)) %>%
filter(first24h == TRUE) %>% # keep only the first 24h of each F-group
mutate(lbl=paste0(row_number(),"-",Type)) %>%
ggplot(aes(x=A, y=B, label=Type)) +
geom_path() + geom_text() + facet_wrap(~ nf)
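If you would rather not depend on zoo, the 10-minute interpolation can also be sketched with approx() from base R. This is an alternative to the na.approx() step above, not the same code, and the data frame obs is invented for illustration.

```r
# Hypothetical irregular observations (not from the question)
obs <- data.frame(
  datetime = as.POSIXct(c("2015-01-01 01:00:00", "2015-01-01 01:30:00",
                          "2015-01-01 02:10:00"), tz = "UTC"),
  A = c(12, 15, 11)
)

# Regular timeline with a point every 10 minutes (600 seconds)
grid <- seq(min(obs$datetime), max(obs$datetime), by = 600)

# Linear interpolation of A onto the regular timeline
interp <- data.frame(
  datetime = grid,
  A = approx(x = as.numeric(obs$datetime), y = obs$A,
             xout = as.numeric(grid))$y
)
interp
```

The same approx() call would be repeated for B, or for any other numeric column.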