Having trouble correctly producing time series plot - r

I am trying to plot a time series from an Excel file in RStudio. It has a single column named 'Dates'. This column contains datetime data of customer visits in the form 2/15/2014 6:17:22 AM. The datetime was originally in character format, and I converted it into a large POSIXct vector using lubridate:
tsData <- mdy_hms(fullUsage$Dates)
Which gives me a value:
POSIXct[1:25,354], format: "2018-04-13 10:18:14" "2018-04-14 13:27:11" .....
I then tried converting it into a time series object using the code below:
require(xts)
visitTimes.ts <- xts(tsData, start = 1, order.by=as.POSIXct(tsData))
plot(visitTimes.ts)
ts_plot(visitTimes.ts)
ts_info(visitTimes.ts)
I'm not 100% sure, but it looks like the plot is coming out using the summed count of visits. I believe my problem may be in correctly indexing my data using the dates. I apologize in advance if this is a simple issue to deal with; I am still learning R. I have included a screenshot of my plot.

Yes, you are right: you need to provide both the dates (x axis) and the values (y axis).
Here's a simple example:
library(lubridate)
library(xts)
v1 <- data.frame(Date = mdy_hms(c("1-1-2020-00-00-00", "1-2-2020-00-00-00", "1-3-2020-00-00-00")), Value = c(1, 3, 6))
v2 <- xts(v1["Value"], order.by = v1[, "Date"])
plot(v2)
The first argument of xts takes the values to plot (the y axis); order.by takes the date-times that index those values (the x axis).
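For the original problem, where there is only a timestamp per visit and no value column, one option is to give every visit the value 1 and sum within each day. The following is a minimal sketch, not the answerer's code: the `stamps` vector is made-up data standing in for `fullUsage$Dates`, and `apply.daily()` from xts is used for the per-day sum.

```r
library(xts)

# Hypothetical timestamps standing in for fullUsage$Dates (assumption)
set.seed(1)
stamps <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") +
  sort(runif(200, 0, 5 * 86400))

# One observation per visit, each carrying the value 1
visits <- xts(rep(1, length(stamps)), order.by = stamps)

# Sum the 1s within each calendar day to get visits per day
daily <- apply.daily(visits, sum)

plot(daily, main = "Visits per day")
```

Each row of `daily` is stamped with the last visit of that day, so the y axis now shows a count per day rather than the raw timestamps.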

You need to count the number of events in each time period and plot these values on the y axis. You didn't provide enough data for a reproducible example, so I have created a small example. We'll use the tidyverse packages dplyr and lubridate to help us out here:
library(lubridate)
library(dplyr)
library(ggplot2)
set.seed(69)
fullUsage <- data.frame(Dates = as.POSIXct("2020-01-01") +
minutes(round(cumsum(rexp(10000, 1/25))))
)
head(fullUsage)
#> Dates
#> 1 2020-01-01 00:02:00
#> 2 2020-01-01 00:15:00
#> 3 2020-01-01 00:22:00
#> 4 2020-01-01 00:29:00
#> 5 2020-01-01 01:13:00
#> 6 2020-01-01 01:27:00
First of all, we will create columns that show the hour of day and the month that events occurred:
fullUsage$hours <- hour(fullUsage$Dates)
fullUsage$month <- floor_date(fullUsage$Dates, "month")
Now we can effectively just count the number of events per month and plot this number for each month:
fullUsage %>%
group_by(month) %>%
summarise(n = length(hours)) %>%
ggplot(aes(month, n)) +
geom_col()
And we can do the same for the hour of day:
fullUsage %>%
group_by(hours) %>%
summarise(n = length(hours)) %>%
ggplot(aes(hours, n)) +
geom_col() +
scale_x_continuous(breaks = 0:23) +
labs(x = "Hour of day")
Created on 2020-08-05 by the reprex package (v0.3.0)
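The same counting idea works at a daily resolution. This is a small sketch under assumptions: `visits` is simulated data, and `count()` from dplyr is used as a shorthand for `group_by()` followed by `summarise()`.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Simulated visit timestamps (assumption, not the asker's data)
set.seed(42)
visits <- data.frame(
  Dates = as.POSIXct("2020-01-01", tz = "UTC") + sort(runif(500, 0, 10 * 86400))
)

# Round each timestamp down to its day, then count rows per day
daily <- visits %>%
  mutate(day = floor_date(Dates, "day")) %>%
  count(day, name = "n")

# Plot the number of visits per day as a line
p <- ggplot(daily, aes(day, n)) +
  geom_line() +
  labs(x = "Day", y = "Number of visits")
print(p)
```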

Related

plotting daily rainfall data using geom_step

I have some rainfall data collected continuously from which I have calculated daily totals. Here is some toy data:
Date <- c(seq(as.Date("2016-07-01"), by = "1 day", length.out = 10))
rain_mm <- c(3,6,8,12,0,0,34,23,5,1)
rain_data <- data.frame(Date, rain_mm)
I can plot this data as follows:
ggplot(rain_data, aes(Date, rain_mm)) +
geom_bar(stat = "identity") +
scale_x_date(date_labels = "%d")
Which gives the following:
This seems fine. It is clear how much rainfall there was on a certain day. However, it could also be interpreted that between midday of one day and midday of the next, a certain amount of rain fell, which is wrong. This is especially a problem if the graph is combined with other plots of related continuous variables over the same period.
To get round this issue I could use geom_step as follows:
library(ggplot2)
ggplot(rain_data, aes(Date, rain_mm)) +
geom_step() +
scale_x_date(date_labels = "%d")
Which gives:
This is a better way to display the data, and now scale_x_date appears to be a continuous axis. However, it would be nice to fill the area below the steps, and I can't find a straightforward way of doing this.
Q1: How can I fill beneath the geom_step? Is it possible?
It may also be useful to convert Date into POSIXct to facilitate identical x-axis in multi-plot figures as discussed in this SO question here.
I can do this as follows:
library(dplyr)
rain_data_POSIX <- rain_data %>% mutate(Date = as.POSIXct(Date))
Date rain_mm
1 2016-07-01 01:00:00 3
2 2016-07-02 01:00:00 6
3 2016-07-03 01:00:00 8
4 2016-07-04 01:00:00 12
5 2016-07-05 01:00:00 0
6 2016-07-06 01:00:00 0
7 2016-07-07 01:00:00 34
8 2016-07-08 01:00:00 23
9 2016-07-09 01:00:00 5
10 2016-07-10 01:00:00 1
However, this gives a time of 01:00 for each date. I would rather have 00:00. Can I change this in the as.POSIXct function call, or do I have to do it afterwards using a separate function? I think it has something to do with tz = "" but can't figure it out.
How can I convert from class Date to POSIXct so that the time generated is 00:00?
Thanks
For your first question, you can work off this example. First, create a time-lagged version of your data:
rain_tl <- mutate( rain_data, rain_mm = lag( rain_mm ) )
Then combine this time-lagged version with the original data, and re-sort by date:
rain_all <- bind_rows( old = rain_data, new = rain_tl, .id="source" ) %>%
arrange( Date, source )
(Note the newly created source column is used to break ties, correctly interlacing the original data with the time-lagged version):
> head( rain_all )
source Date rain_mm
1 new 2016-07-01 NA
2 old 2016-07-01 3
3 new 2016-07-02 3
4 old 2016-07-02 6
5 new 2016-07-03 6
6 old 2016-07-03 8
You can now use the combined data frame to "fill" your steps:
ggplot(rain_data, aes(Date, rain_mm)) +
geom_step() +
geom_ribbon( data = rain_all, aes( ymin = 0, ymax = rain_mm ),
fill="tomato", alpha=0.5 )
This produces the following plot:
For your second question, the problem is that as.POSIXct does not pass additional arguments to the converter when given a Date, so specifying the tz argument does nothing.
You basically have two options:
1) Reformat the output to what you want: format( as.POSIXct( Date ), "%F 00:00" ), which returns a vector of type character. If you want to preserve the object type as POSIXct, you can instead...
2) Cast your Date vector to character prior to passing it to as.POSIXct: as.POSIXct( as.character(Date) ), but this will leave off the time entirely, which may be what you want anyway.
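Option 2 can be sketched as follows. Passing an explicit `tz` here is an addition for illustration (the answer above does not specify one); it makes the result independent of the machine's local time zone.

```r
# Toy dates, as in the question
Date <- seq(as.Date("2016-07-01"), by = "1 day", length.out = 3)

# Going through character means the date is parsed as midnight in the given zone
midnight_utc <- as.POSIXct(as.character(Date), tz = "UTC")

# Show the full timestamp to confirm the time component
print(format(midnight_utc, "%Y-%m-%d %H:%M:%S"))
```

The printed timestamps all carry a time of 00:00:00, and the object remains of class POSIXct.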
If you would like to avoid the hack, you can customize the position in the geom_bar expression.
I found good results with:
ggplot(rain_data, aes(Date, rain_mm)) +
geom_bar(stat = "identity", position = position_nudge(x = 0.51), width = 0.99) +
scale_x_date(date_labels = "%d")
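Another way to make each bar cover exactly its own day, without nudging, is to draw rectangles from the start of a day to the start of the next. This is a sketch using `geom_rect()`, which is not what the answers above use; the fill colour is arbitrary.

```r
library(ggplot2)

# Toy data from the question
Date <- seq(as.Date("2016-07-01"), by = "1 day", length.out = 10)
rain_mm <- c(3, 6, 8, 12, 0, 0, 34, 23, 5, 1)
rain_data <- data.frame(Date, rain_mm)

# Each rectangle spans one full day on the x axis and 0..rain_mm on the y axis
p <- ggplot(rain_data) +
  geom_rect(aes(xmin = Date, xmax = Date + 1, ymin = 0, ymax = rain_mm),
            fill = "tomato", alpha = 0.5) +
  scale_x_date(date_labels = "%d") +
  labs(x = "Date", y = "rain_mm")
print(p)
```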

R ggplot by month and values group by Week

With ggplot2, I would like to create a multiplot (facet_grid) where each plot is the weekly count values for the month.
My data are like this :
day_group count
1 2012-04-29 140
2 2012-05-06 12595
3 2012-05-13 12506
4 2012-05-20 14857
For this dataset I have created two other columns, Month and Week, based on day_group:
day_group count Month Week
1 2012-04-29 140 Apr 17
2 2012-05-06 12595 May 18
3 2012-05-13 12506 May 19
4 2012-05-20 14857 May 2
Now I would like for each Month to create a barplot where I have the sum of the count values aggregated by week. So for example for a year I would have 12 plots with 4 bars (one per week).
Below is what I use to generate the plot :
ggplot(data = count_by_day, aes(x=day_group, y=count)) +
stat_summary(fun.y="sum", geom = "bar") +
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
facet_grid(facets = Month ~ ., scales="free", margins = FALSE)
So far, my plot looks like this
https://dl.dropboxusercontent.com/u/96280295/Rplot.png
As you can see, the x axis is not what I'm looking for. Instead of showing only weeks 1, 2, 3 and 4, it displays the whole month.
Do you know what I must change to get what I'm looking for?
Thanks for your help
Okay, now that I see what you want, I wrote a small program to illustrate it. The key to your order of month problem is making month a factor with the levels in the right order:
library(dplyr)
library(ggplot2)
#initialization
set.seed(1234)
sday <- as.Date("2012-01-01")
eday <- as.Date("2012-07-31")
# List of the first day of the months
mfdays <- seq(sday,length.out=12,by="1 month")
# list of months - this is key to keeping the order straight
mlabs <- months(mfdays)
# list of first weeks of the months
mfweek <- trunc((mfdays-sday)/7)
names(mfweek) <- mlabs
# Generate a bunch of event-days, and then months, then week numbs in our range
n <- 1000
edf <-data.frame(date=sample(seq(sday,eday,by=1),n,T))
edf$month <- factor(months(edf$date),levels=mlabs) # use the factor in the right order
edf$week <- 1 + as.integer(((edf$date-sday)/7) - mfweek[edf$month])
# Now summarize with dplyr
ndf <- group_by(edf,month,week) %>% summarize( count = n() )
ggplot(ndf) + geom_bar(aes(x=week,y=count),stat="identity") + facet_wrap(~month,nrow=1)
Yielding:
(As an aside, I am kind of proud I did this without lubridate ...)
I think you have to do this but I am not sure I understand your question:
ggplot(data = count_by_day, aes(x=Week, y=count, group= Month, color=Month))
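If the goal is a week index that restarts in every month, one possible sketch is to derive it from the day of the month. The `WeekOfMonth` column is a hypothetical name, and counting weeks in 7-day blocks from the first of the month is an assumption about what "week 1, 2, 3, 4" should mean.

```r
library(ggplot2)

# The four rows shown in the question
day_group <- as.Date(c("2012-04-29", "2012-05-06", "2012-05-13", "2012-05-20"))
count_by_day <- data.frame(day_group, count = c(140, 12595, 12506, 14857))

# Month as an ordered factor, built from the month number to avoid locale issues
count_by_day$Month <- factor(month.abb[as.integer(format(day_group, "%m"))],
                             levels = month.abb)

# Days 1-7 -> week 1, days 8-14 -> week 2, and so on
count_by_day$WeekOfMonth <- (as.integer(format(day_group, "%d")) - 1) %/% 7 + 1

# One panel per month, one bar per week within the month
p <- ggplot(count_by_day, aes(factor(WeekOfMonth), count)) +
  geom_col() +
  facet_wrap(~Month) +
  labs(x = "Week of month")
print(p)
```

Note that with this definition a month can have a fifth week (days 29 to 31).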

How to compute daily average over 31 days for 15 years, taking into account missing values?

This question was marked as a duplicate. I don't think it is one, because two specific issues have not been dealt with elsewhere:
averaging over a time span measured in days across several years,
and handling missing data.
I have worked on an answer which I am not allowed to post on the original question, so I am posting it here.
The data are daily and cover 15 years, from 1993 to 2008. How can I compute the daily average of the variable Open for each day of the year, based on a 31-day window centred on the day of interest? Thus, 15 × 31 = 465 dates contribute to the statistics of one day.
The output is just 365 values out of the 15 years.
The file can be downloaded from here:
http://chart.yahoo.com/table.csv?s=sbux&a=2&b=01&c=1993&d=2&e=01&f=2008&g=d&q=q&y=0&z=sbux&x=.csv
Load packages and data
library(lubridate)
library(dplyr)
dtf <- read.csv("http://chart.yahoo.com/table.csv?s=sbux&a=2&b=01&c=1993&d=2&e=01&f=2008&g=d&q=q&y=0&z=sbux&x=.csv", stringsAsFactors = FALSE)
# I prefer lower case column names
names(dtf) <- tolower(names(dtf))
The lubridate package has a nice function ddays() that adds a number of days. It deals with February 29. For example
ymd("2008-03-01") - ddays(15)
# [1] "2008-02-15 UTC"
ymd("2007-03-01") - ddays(15)
# [1] "2007-02-14 UTC"
Add minus15 and plus15 dates to the dataset, these will be the time bounds over which the average should be calculated for a given date in a given year.
dtf <- dtf %>%
mutate(date = ymd(date),
minus15 = date - ddays(15),
plus15 = date + ddays(15),
monthday = substr(as.character(date),6,10),
year = year(date),
plotdate = ymd(paste(2008,monthday,sep="-")))
calendardays <- dtf %>%
select(monthday) %>%
distinct() %>%
arrange(monthday)
Create a function that gives the average over all those 15 years for a given day :
meanday <- function(givenday, dtf){
# Extract the given day minus 15 days in all years available
# Day minus 15 days will differ for example for march first
# in years where there is a february 29
lowerbound <- dtf$minus15[dtf$monthday == givenday]
# Produce the series of 31 days around the given day
# that is the lower bound + 30 days
filterdates <- lapply(lowerbound, function(x) x + ddays(0:30))
filterdates <- Reduce(c, filterdates)
# filter all of these days
dtfgivenday <- dtf %>%
filter(date %in% filterdates)
return(mean(dtfgivenday$open))
}
Use that function over all dates available in the calendar:
meandays <- sapply(calendardays$monthday, meanday, dtf)
calendardays <- calendardays %>%
mutate(mean = meandays,
plotdate = ymd(paste(2008,monthday,sep="-")))
Plots
plot(dtf$date,dtf$open,type="l")
library(ggplot2)
ggplot(dtf, aes(x=date,y=open, color = as.factor(year))) + geom_line()
ggplot(dtf, aes(x=plotdate,y=open, color = as.factor(year))) + geom_line()
ggplot(calendardays, aes(x=plotdate, y=mean)) + geom_line()
Is it strange to see a periodicity appear here?
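Since the Yahoo URL may no longer be reachable, here is a self-contained sketch of the same windowed average on simulated data. It is a simplification, not the method above: the window is defined on day-of-year numbers with wrap-around (so it is approximate around February 29), missing values are dropped with `na.rm = TRUE`, and the names `window_mean`, `half_width` and `period` are hypothetical.

```r
# Simulated daily series with some missing values (assumption)
set.seed(1)
dates <- seq(as.Date("1993-03-01"), as.Date("2008-03-01"), by = "day")
open  <- rnorm(length(dates), mean = 10)
open[sample(length(open), 200)] <- NA

# Day of year (1..366) for every observation
doy <- as.integer(format(dates, "%j"))

# Mean of all values whose day of year lies within half_width days of d,
# measuring the distance circularly so late December and early January connect
window_mean <- function(d, doy, values, half_width = 15, period = 366) {
  dist <- abs(doy - d)
  dist <- pmin(dist, period - dist)
  mean(values[dist <= half_width], na.rm = TRUE)
}

daily_mean <- vapply(1:366, window_mean, numeric(1), doy = doy, values = open)
plot(1:366, daily_mean, type = "l", xlab = "Day of year", ylab = "Windowed mean")
```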

Extracting a point from ggplot and plot it

I am initially having the dataset as shown below:
ID A B Type Time Date
1 12 13 R 23:20 1-1-01
1 13 12 F 23:40 1-1-01
1 13 11 F 00:00 2-1-01
1 15 10 R 00:20 2-1-01
1 12 06 W 00:40 2-1-01
1 11 09 F 01:00 2-1-01
1 12 10 R 01:20 2-1-01
so on...
I tried to make the ggplot of the above dataset for A and B.
ggplot(data=dataframe, aes(x=A, y=B, colour = Type)) +geom_point()+geom_path()
Problem:
How do I add a subsetting variable that looks at the first 24 hours after every 'F' point?
For the time being I have posted a continuous data set [with respect to time] but my original data set is not continuous. How can I make my data set continuous in a interval of 10 mins? I have used interpolation xspline() function on A and B but I don't know how to make my data set continuous with respect to time,
The highlighted part shown below is what I am looking for, I want to extract this dataset and then plot a new ggplot:
From MarkusN plots this is what I am looking for:
Taking first point as 'F' point and traveling 24hrs from that point (Since there is no 24 hrs data set available here so it should produce like this) :
I've tried the following, maybe you can get an idea from here. I recommend you to first have a variable with the time ordered (either in minutes or hours, in this example I've used hours). Let's see if it helps
#a data set is built as an example
N = 100
set.seed(1)
dataframe = data.frame(A = cumsum(rnorm(N)),
B = cumsum(rnorm(N)),
Type = sample(c('R','F','W'), size = N,
prob = c(5/7,1/7,1/7), replace=T),
time.h = seq(0,240,length.out = N))
# here, a list with dataframes is built with the sequences
l_dfs = lapply(which(dataframe$Type == 'F'), function(i, .data){
transform(subset(.data[i:nrow(.data),], (time.h - time.h[1]) <= 24),
t0 = sprintf('t0=%4.2f', time.h[1]))
}, dataframe)
ggplot(data=do.call('rbind', l_dfs), aes(x=A, y=B, colour=Type)) +
geom_point() + geom_path(colour='black') + facet_wrap(~t0)
First I created sample data. Hope it's similar to your problem:
df = data.frame(id=rep(1:9), A=c(12,13,13,14,12,11,12,11,10),
B=c(13,12,10,12,6,9,10,11,12),
Type=c("F","R","F","R","W","F","R","F","R"),
datetime=as.POSIXct(c("2015-01-01 01:00:00","2015-01-01 22:50:00",
"2015-01-02 08:30:00","2015-01-02 23:00:00",
"2015-01-03 14:10:00","2015-01-05 16:30:00",
"2015-01-05 23:00:00","2015-01-06 17:00:00",
"2015-01-07 23:00:00")),
stringsAsFactors = F)
Your first question is to plot the data, highlighting the first 24h after an F-point. I used dplyr and ggplot for this task.
library(dplyr)
library(ggplot2)
df %>%
mutate(nf = cumsum(Type=="F")) %>% # build F-to-F groups
group_by(nf) %>%
mutate(first24h = as.numeric((datetime-min(datetime)) < (24*3600))) %>% # find the first 24h of each F-group
mutate(lbl=paste0(row_number(),"-",Type)) %>%
ggplot(aes(x=A, y=B, label=lbl)) +
geom_path(aes(colour=first24h)) + scale_size(range = c(1, 2)) +
geom_text()
The problem here is that the colour only changes at some points. One thing I'm not happy with is the use of different line colours for path sections. If first24h is a discrete variable, geom_path draws two separate paths. That's why I defined the variable as numeric. Maybe someone can improve this?
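The grouping trick used above can be seen in isolation with a few values. This is a base R sketch on made-up data: `hours` is a hypothetical elapsed-time column, and `ave()` stands in for the `group_by()` plus `mutate()` step.

```r
# Made-up event types and elapsed hours (assumption)
Type  <- c("F", "R", "F", "R", "W", "F")
hours <- c(0, 20, 30, 45, 60, 100)

# Every 'F' starts a new group: the running count of F's labels the groups
nf <- cumsum(Type == "F")

# Start time of each group, repeated for every row of that group
start <- ave(hours, nf, FUN = min)

# TRUE for rows that fall less than 24 hours after their group's F point
first24h <- (hours - start) < 24

print(data.frame(Type, hours, nf, first24h))
```

Here the fifth row (60 hours, 30 hours after the F at 30) is the only one flagged FALSE.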
Your second question about an interpolation can easily be solved with the zoo package:
library(zoo)
full.time = seq(df$datetime[1], tail(df$datetime, 1), by=600) # new timeline with point at every 10 min
d.zoo = zoo(df[,2:3], df$datetime) # convert to zoo object
d.full = as.data.frame(na.approx(d.zoo, xout=full.time)) # interpolate; result is also a zoo object
d.full$datetime = as.POSIXct(rownames(d.full))
With these two data frames combined, you get the solution. Every F-F section is drawn in a separate plot, and only the points no more than 24h after the F-point are shown.
df %>%
select(Type, datetime) %>%
right_join(d.full, by="datetime") %>%
mutate(Type = ifelse(is.na(Type),"",Type)) %>%
mutate(nf = cumsum(Type=="F")) %>%
group_by(nf) %>%
mutate(first24h = (datetime-min(datetime)) < (24*3600)) %>%
filter(first24h) %>% # keep only the first 24h of each F-group
mutate(lbl=paste0(row_number(),"-",Type)) %>%
ggplot(aes(x=A, y=B, label=Type)) +
geom_path() + geom_text() + facet_wrap(~ nf)
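If you would rather not depend on zoo, the 10-minute interpolation can also be sketched with base R's `approx()`. The three data points below are made up for illustration.

```r
# Made-up irregular observations (assumption)
datetime <- as.POSIXct(c("2015-01-01 01:00:00",
                         "2015-01-01 02:00:00",
                         "2015-01-01 03:00:00"), tz = "UTC")
A <- c(12, 13, 15)

# Regular timeline with one point every 10 minutes
grid <- seq(min(datetime), max(datetime), by = "10 min")

# Linear interpolation on the numeric (seconds) representation of the times
A_interp <- approx(x = as.numeric(datetime), y = A, xout = as.numeric(grid))$y

interp <- data.frame(datetime = grid, A = A_interp)
print(head(interp))
```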

Extract Date in R

I struggle mightily with dates in R and could do this pretty easily in SPSS, but I would love to stay within R for my project.
I have a date column in my data frame and want to remove the year completely in order to leave the month and day. Here is a peek at my original data.
> head(ds$date)
[1] "2003-10-09" "2003-10-11" "2003-10-13" "2003-10-15" "2003-10-18" "2003-10-20"
> class((ds$date))
[1] "Date"
I "want" it to be.
> head(ds$date)
[1] "10-09" "10-11" "10-13" "10-15" "10-18" "10-20"
> class((ds$date))
[1] "Date"
If possible, I would love to set the first date to be October 1st instead of January 1st.
Any help you can provide will be greatly appreciated.
EDIT: I felt like I should add some context. I want to plot an NHL player's performance over the course of a season which starts in October and ends in April. To add to this, I would like to facet the plots by each season which is a separate column in my data frame. Because I want to compare cumulative performance over the course of the season, I believe that I need to remove the year portion, but maybe I don't; as I indicated, I struggle with dates in R. What I am looking to accomplish is a plot that compares cumulative performance over relative dates by season and have the x-axis start in October and end in April.
> d = as.Date("2003-10-09", format="%Y-%m-%d")
> format(d, "%m-%d")
[1] "10-09"
Is this what you are looking for?
library(ggplot2)
library(reshape2) # provides melt()
## make up data for two seasons a and b
a = as.Date("2010/10/1")
b = as.Date("2011/10/1")
a.date <- seq(a, by='1 week', length=28)
b.date <- seq(b, by='1 week', length=28)
## make up some score data
a.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
b.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
## create a data frame
df <- data.frame(a.date, b.date, a.score, b.score)
df
## Since I am using ggplot I better create a "long formated" data frame
df.molt <- melt(df, measure.vars = c("a.score", "b.score"))
levels(df.molt$variable) <- c("First season", "Second season")
df.molt
Then, I am using ggplot2 for plotting the data:
## plot it
ggplot(aes(y = value, x = a.date), data = df.molt) + geom_point() +
geom_line() + facet_wrap(~variable, ncol = 1) +
scale_x_date("Date", date_labels = "%m-%d")
If you want to modify the x-axis (e.g., display format), then you'll probably be interested in scale_x_date.
You have to remember Date is a numeric format, representing the number of days passed since the "origin" of the internal date counting :
> str(Date)
Class 'Date' num [1:10] 14245 14360 14475 14590 14705 ...
This is the same as in Excel, if you want a reference. Hence the solution with format is perfectly valid.
Now if you want to set the first date of a year as October 1st, you can construct some year index like this :
redefine.year <- function(x,start="10-1"){
year <- as.numeric(strftime(x,"%Y"))
yearstart <- as.Date(paste(year,start,sep="-"))
year + (x >= yearstart) - min(year) + 1
}
Testing code :
Start <- as.Date("2009-1-1")
Stop <- as.Date("2011-11-1")
Date <- seq(Start,Stop,length.out=10)
data.frame( Date=as.character(Date),
year=redefine.year(Date))
gives
Date year
1 2009-01-01 1
2 2009-04-25 1
3 2009-08-18 1
4 2009-12-11 2
5 2010-04-05 2
6 2010-07-29 2
7 2010-11-21 3
8 2011-03-16 3
9 2011-07-09 3
10 2011-11-01 4
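To put several seasons on a common x axis, one possible companion to the year index is the day number within the season. The `season_day` function below is a hypothetical helper written for illustration, assuming every season starts on October 1.

```r
# Day number within a season that starts on October 1 (October 1 = day 1)
season_day <- function(x) {
  yr  <- as.integer(format(x, "%Y"))
  mon <- as.integer(format(x, "%m"))
  # Dates from January to September belong to the season that began the previous October
  start_year <- ifelse(mon >= 10, yr, yr - 1)
  season_start <- as.Date(paste0(start_year, "-10-01"))
  as.integer(x - season_start) + 1
}

d <- as.Date(c("2003-10-01", "2003-10-09", "2004-01-01", "2004-04-01"))
print(data.frame(date = d, season_day = season_day(d)))
```

Plotting cumulative performance against `season_day` and faceting by season would then line up October with October across years.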
