How can I get a summary table/boxplot for a time-sequence data frame? - r

I have a data frame which contains time sequence, like this:
example <- data.frame(
Date=seq(
from=as.POSIXct("2012-1-1 0:00", tz="UTC"),
to=as.POSIXct("2012-1-31 23:00", tz="UTC"),
by="10 min"),
frequency=runif(4459, min=12, max=26))
I would like to compute the min, mean, max, etc. (a summary table) by day: for example, a summary table for 2012-01-01 (using only the first 144 rows), for 2012-01-02 (using rows 145 to 288), for 2012-01-03 (using rows 289 to 432), and so on.
How can I get this table? I have tried this:
summary(example$frequency, example$Date, by="day")
How can I draw a boxplot for every day separately? I have tried this:
boxplot(example$frequency, example$Date, by="day")
How can I select time data within days? I also want to calculate a summary table by day, but in this case using only the observations taken on the hour (e.g. 0:00, 1:00, 2:00, etc.).
Can somebody help me?

To get summary of frequency by day, you could use aggregate from base R in combination with strftime():
aggregate(frequency ~ strftime(Date, "%d"),
FUN = summary, data = example)
To get a boxplot per day, we just need to create a $day column for the x-axis in ggplot2.
library(ggplot2)
example$day <- strftime(example$Date, "%d")
ggplot(example, aes(x = factor(day), y = frequency)) + geom_boxplot()
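If you would rather stay in base R for the plot as well, the formula interface of boxplot() does the grouping the question was attempting with by="day". This is only a minimal sketch: the set.seed() call, the stats name and the full-date day label are my own choices, not part of the question, and tz = "UTC" is passed explicitly so the day boundaries match the data.

```r
# Rebuild the example data (seed added only for reproducibility)
set.seed(1)
example <- data.frame(
  Date = seq(from = as.POSIXct("2012-01-01 00:00", tz = "UTC"),
             to   = as.POSIXct("2012-01-31 23:00", tz = "UTC"),
             by   = "10 min"))
example$frequency <- runif(nrow(example), min = 12, max = 26)

# Label every row with its calendar day, in the data's own time zone
example$day <- format(example$Date, "%Y-%m-%d", tz = "UTC")

# One row of summary statistics per day
stats <- do.call(rbind, tapply(example$frequency, example$day, summary))

# One box per day; las = 2 turns the day labels vertical
bp <- boxplot(frequency ~ day, data = example, las = 2,
              xlab = "", ylab = "frequency")
```

Using the full date as the label rather than only "%d" keeps the groups distinct if the data ever spans more than one month.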

Try this simply:
within days:
example$str.date <- substring(as.character(example$Date),1,10)
summary.example <- aggregate(frequency~str.date, example, FUN = summary)
library(ggplot2)
ggplot(example, aes(str.date, frequency, group=str.date, fill=str.date)) + geom_boxplot() +
theme(axis.text.x = element_text(angle=90, vjust = 0.5))
within hours (within each day):
example$str.date.hrs <- substring(as.character(example$Date),1,13)
summary.example <- aggregate(frequency~str.date.hrs, example, FUN = summary)
library(ggplot2)
ggplot(example[example$str.date=='2012-01-01',], aes(str.date.hrs, frequency, group=str.date.hrs, fill=str.date.hrs)) + geom_boxplot() +
theme(axis.text.x = element_text(angle=90, vjust = 0.5))
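For the last part of the question (a daily summary that uses only the readings taken exactly on the hour), one possible approach is to filter on the minute component before aggregating. A sketch under the assumption that "every hour" means minute == 00; the names on_hour and daily are made up for illustration.

```r
# Rebuild the example data (seed added only for reproducibility)
set.seed(1)
example <- data.frame(
  Date = seq(from = as.POSIXct("2012-01-01 00:00", tz = "UTC"),
             to   = as.POSIXct("2012-01-31 23:00", tz = "UTC"),
             by   = "10 min"))
example$frequency <- runif(nrow(example), min = 12, max = 26)

# Keep only the rows whose timestamp falls on a full hour (0:00, 1:00, ...)
on_hour <- example[format(example$Date, "%M", tz = "UTC") == "00", ]

# Summarise those rows by calendar day
on_hour$day <- format(on_hour$Date, "%Y-%m-%d", tz = "UTC")
daily <- aggregate(frequency ~ day, data = on_hour, FUN = summary)
head(daily)
```

Each day then contributes 24 values instead of 144.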

Related

Is there a way to subdivide dates into smaller segments on the x axis?

I'm working on a school project and have been trying to solve this for some time, but I can't find a solution.
The problem is that whenever I run this, the x axis is crowded with too many labels. I found a similar post, but it works with ordinary numeric variables, not with date-time values (%Y/%m) like mine, which creates problems when I try to run code like this:
"scale_x_discrete(breaks = seq(0, 100, by = 5))"
Keep in mind that I have many rows; I don't know whether that can cause problems.
The code:
plottest1 <- function(St, na){
test1 <- ggplot(data = KunskiDepozit1, aes(x=Datum, y=St, group = 1)) +
geom_line() + labs(x = "Datum", y = na, title = paste("vizualization ", na)) + geom_point()
test1 <- test1 +
theme(plot.title = element_text(hjust = 0.5)) # theme() adds to this plot; theme_update() changes the global theme instead
return(test1)
}
As Geosopher and PoGibas have noted you need to make sure ggplot understands that Datum is a date. You may want to consider the package lubridate.
If I squint, I think your date information is in one column as YYYY-MM, so you merely need something like:
library(dplyr) # for %>% and mutate()
date_df <- existing_df %>%
mutate(Datum = paste0(Datum, "-01")) %>%
mutate(Datum = lubridate::ymd(Datum))
I have extracted the following sample code from the chapter on lubridate of R for Data Science (freely available online) which explains how to do it when your date and time elements are split in various columns, using the function lubridate::make_datetime. It also shows that you can plot a date-time variable directly and ggplot will do the right thing.
library(tidyverse)
library(lubridate)
library(nycflights13) # Dataset with flight details
# Custom function to transform the date and time information from several columns
# into one "date-time" column. You may be able to get away simply with make_datetime
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
# Apply that function to relevant columns in the dataset
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
# Plot the dataset
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
As your x axis is a date I would try using an actual date axis instead of a discrete axis. Maybe play around with something like:
scale_x_date(date_breaks = "2 weeks")
Check out the ggplot2 scale documentation for the details!
?ggplot2::scale_x_date
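Putting the two suggestions together (convert the column to a real Date, then thin the labels with a date scale) might look like the sketch below. The data frame is a made-up stand-in for KunskiDepozit1, and I am assuming Datum is stored as "YYYY/MM" text as the question's %Y/%m suggests; the 6-month spacing is arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for the asker's data: 60 monthly rows, Datum as "YYYY/MM" text
set.seed(1)
KunskiDepozit1 <- data.frame(
  Datum = format(seq(as.Date("2010-01-01"), by = "month", length.out = 60), "%Y/%m"),
  St    = cumsum(rnorm(60)))

# A year-month is not a full date, so append a day before parsing
KunskiDepozit1$Datum <- as.Date(paste0(KunskiDepozit1$Datum, "/01"),
                                format = "%Y/%m/%d")

# With a Date column, scale_x_date() controls how many labels are drawn
p <- ggplot(KunskiDepozit1, aes(x = Datum, y = St)) +
  geom_line() +
  geom_point() +
  scale_x_date(date_breaks = "6 months", date_labels = "%Y/%m") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
p
```

All rows are still plotted; only the axis labels are reduced to one every six months.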

Time series from three years in one plot

I am struggling (due to lack of knowledge and experience) to create a plot in R with time series from three different years (2009, 2013 and 2017). Failing to solve this problem by searching online has led me here.
I wish to create a plot that shows change in nitrate concentrations over the course of May to October for all years, but keep failing since the x-axis is defined by one specific year. I also receive errors because the x-axis lengths differ (due to different number of samples). To solve this I have tried making separate columns for month and year, with no success.
Data example:
date NO3.mg.l year month
2009-04-22 1.057495 2009 4
2013-05-08 1.936000 2013 5
2017-05-02 2.608000 2017 5
Code:
ggplot(nitrat.all, aes(x = date, y = NO3.mg.l, colour = year)) + geom_line()
This code produces a plot where the lines are positioned next to one another, whilst I want a plot where they overlay one another. Any help will be much appreciated.
Something like this may be helpful for plotting:
library("lubridate")
library("ggplot2")
# example data with a few points for each year
nitrat.all <- data.frame(date = c(ymd("2009-03-21"), ymd("2009-04-22"), ymd("2009-05-27"),
ymd("2010-03-15"), ymd("2010-04-17"), ymd("2010-05-10")), NO3.mg.l = c(1.057495, 1.936000, 2.608000,
3.157495, 2.336000, 3.908000))
nitrat.all$year <- format(nitrat.all$date, format = "%Y")
ggplot(data = nitrat.all) +
geom_point(mapping = aes(x = format(date, format = "%m-%d"), y = NO3.mg.l, group = year, colour = year)) +
geom_line(mapping = aes(x = format(date, format = "%m-%d"), y = NO3.mg.l, group = year, colour = year))
To select the dates that fall within a certain range of months, you can subset your data frame by a condition using base R functions:
n_month1 <- 3 # index of the first month of the period to select
n_month2 <- 4 # index of the last month of the period to select
test_for_month <- (as.numeric(format(nitrat.all$date, format = "%m")) >= n_month1) &
(as.numeric(format(nitrat.all$date, format = "%m")) <= n_month2)
nitrat_to_plot <- nitrat.all[test_for_month, ]
Another, quite elegant, approach is to use filter() from the dplyr package:
nitrat.all$month <- as.numeric(format(nitrat.all$date, format = "%m"))
library("dplyr")
nitrat_to_plot <- filter(nitrat.all, ((month >= n_month1) & (month <= n_month2)))
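If you prefer a continuous date axis over the "%m-%d" text labels used above, one option is to move every observation into the same dummy year and hide that year in the axis labels. A rough sketch in base R plus ggplot2; the plot_date column and the choice of 2000 as the dummy year are my own assumptions (2000 is a leap year, so a 29 February would still parse).

```r
library(ggplot2)

# Same toy data as above, built with base as.Date()
nitrat.all <- data.frame(
  date = as.Date(c("2009-03-21", "2009-04-22", "2009-05-27",
                   "2010-03-15", "2010-04-17", "2010-05-10")),
  NO3.mg.l = c(1.057495, 1.936000, 2.608000,
               3.157495, 2.336000, 3.908000))
nitrat.all$year <- format(nitrat.all$date, "%Y")

# Keep month and day, replace the year with a constant dummy year
nitrat.all$plot_date <- as.Date(paste0("2000-", format(nitrat.all$date, "%m-%d")))

# One line per real year; the axis shows month names only
p <- ggplot(nitrat.all,
            aes(x = plot_date, y = NO3.mg.l, colour = year, group = year)) +
  geom_point() +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  labs(x = NULL)
p
```

Because the x values are real dates, unevenly spaced samples keep their true spacing within the season.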

R - How to create a seasonal plot - Different lines for years

I already asked the same question yesterday, but I didn't get any suggestions, so I decided to delete the old one and ask again with additional information.
So here again:
I have a dataframe like this:
Link to the original dataframe: https://megastore.uni-augsburg.de/get/JVu_V51GvQ/
Date DENI011
1 1993-01-01 9.946
2 1993-01-02 13.663
3 1993-01-03 6.502
4 1993-01-04 6.031
5 1993-01-05 15.241
6 1993-01-06 6.561
....
....
6569 2010-12-26 44.113
6570 2010-12-27 34.764
6571 2010-12-28 51.659
6572 2010-12-29 28.259
6573 2010-12-30 19.512
6574 2010-12-31 30.231
I want to create a plot that enables me to compare the monthly values in the DENI011 over the years. So I want to have something like this:
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Seasonal%20Plot
Jan-Dec on the x-scale, values on the y-scale and the years displayed by different colored lines.
I found several similar questions here, but nothing works for me. I tried to follow the instructions on the website with the example, but the problem is that I can't create a ts object.
Then I tried it this way:
Ref_Data$MonthN <- as.numeric(format(as.Date(Ref_Data$Date),"%m")) # Month's number
Ref_Data$YearN <- as.numeric(format(as.Date(Ref_Data$Date),"%Y"))
Ref_Data$Month <- months(as.Date(Ref_Data$Date), abbreviate=TRUE) # Month's abbr.
g <- ggplot(data = Ref_Data, aes(x = MonthN, y = DENI011, group = YearN, colour=YearN)) +
geom_line() +
scale_x_discrete(breaks = Ref_Data$MonthN, labels = Ref_Data$Month)
That also didn't work; the plot looks horrible. I don't need to put all the years from 1993 to 2010 in one plot. Actually, only a few years would be OK, maybe 1998 to 2006.
Any suggestions on how to solve this?
As others have noted, in order to create a plot such as the one you used as an example, you'll have to aggregate your data first. However, it's also possible to retain daily data in a similar plot.
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-11
library(tidyverse)
library(lubridate)
# Import the data
url <- "https://megastore.uni-augsburg.de/get/JVu_V51GvQ/"
raw <- read.table(url, stringsAsFactors = FALSE)
# Parse the dates, and use lower case names
df <- as_tibble(raw) %>%
rename_all(tolower) %>%
mutate(date = ymd(date))
One trick to achieve this would be to set the year component in your date variable to a constant, effectively collapsing the dates to a single year, and then controlling the axis labelling so that you don't include the constant year in the plot.
# Define the plot
p <- df %>%
mutate(
year = factor(year(date)), # use year to define separate curves
date = update(date, year = 1) # use a constant year for the x-axis
) %>%
ggplot(aes(date, deni011, color = year)) +
scale_x_date(date_breaks = "1 month", date_labels = "%b")
# Raw daily data
p + geom_line()
In this case though, your daily data are quite variable, so this is a bit of a mess. You could hone in on a single year to see the daily variation a bit better.
# Hone in on a single year
p + geom_line(aes(group = year), color = "black", alpha = 0.1) +
geom_line(data = function(x) filter(x, year == 2010), size = 1)
But ultimately, if you want to look at several years at a time, it's probably a good idea to present smoothed lines rather than raw daily values. Or, indeed, some monthly aggregate.
# Smoothed version
p + geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess'
#> Warning: Removed 117 rows containing non-finite values (stat_smooth).
There are multiple values for each month, so when you plot your original data you get multiple points per month. That is why the line looks strange.
If you want to create something similar to the example you provided, you have to summarize your data by year and month. Below I calculate the mean for each year and month in your data. In addition, you need to convert your year and month to factors if you want to plot them as discrete variables.
library(dplyr)
Ref_Data2 <- Ref_Data %>%
group_by(MonthN, YearN, Month) %>%
summarize(DENI011 = mean(DENI011)) %>%
ungroup() %>%
# Convert the Month column to factor variable with levels from Jan to Dec
# Convert the YearN column to factor
mutate(Month = factor(Month, levels = unique(Month)),
YearN = as.factor(YearN))
g <- ggplot(data = Ref_Data2,
aes(x = Month, y = DENI011, group = YearN, colour = YearN)) +
geom_line()
g
If you don't want to add in library(dplyr), this is the base R code. Exact same strategy and results as www's answer.
dat <- read.delim("~/Downloads/df1.dat", sep = " ")
dat$Date <- as.Date(dat$Date)
dat$month <- factor(months(dat$Date, TRUE), levels = month.abb)
dat$year <- gsub("-.*", "", dat$Date)
month_summary <- aggregate(DENI011 ~ month + year, data = dat, mean)
ggplot(month_summary, aes(month, DENI011, color = year, group = year)) +
geom_path()

R + ggplot2: how to hide missing dates from x-axis?

Say we have the following simple data-frame of date-value pairs, where some dates are missing in the sequence (i.e. Jan 12 thru Jan 14). When I plot the points, it shows these missing dates on the x-axis, but there are no points corresponding to those dates. I want to prevent these missing dates from showing up in the x-axis, so that the point sequence has no breaks. Any suggestions on how to do this? Thanks!
dts <- c(as.Date( c('2011-01-10', '2011-01-11', '2011-01-15', '2011-01-16')))
df <- data.frame(dt = dts, val = seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point() +
scale_x_date(date_labels = '%d%b', date_breaks = '1 day')
I made a package that does this. It's called bdscale and it's on CRAN and github. Shameless plug.
To replicate your example:
> library(bdscale)
> library(ggplot2)
> library(scales)
> dts <- as.Date( c('2011-01-10', '2011-01-11', '2011-01-15', '2011-01-16'))
> df <- data.frame(dt = dts, val = seq_along(dts))
> ggplot(df, aes(x=dt, y=val)) + geom_point() +
scale_x_bd(business.dates=dts, labels=date_format('%d%b'))
But what you probably want is to load known valid dates, then plot your data using the valid dates on the x-axis:
> nyse <- bdscale::yahoo('SPY') # get valid dates from SPY prices
> dts <- as.Date('2011-01-10') + 1:10
> df <- data.frame(dt=dts, val=seq_along(dts))
> ggplot(df, aes(x=dt, y=val)) + geom_point() +
scale_x_bd(business.dates=nyse, labels=date_format('%d%b'), max.major.breaks=10)
Warning message:
Removed 3 rows containing missing values (geom_point).
The warning is telling you that it removed three dates:
15th = Saturday
16th = Sunday
17th = MLK Day
Turn the date data into a factor then. At the moment, ggplot is interpreting the data in the sense you have told it the data are in - a continuous date scale. You don't want that scale, you want a categorical scale:
require(ggplot2)
dts <- as.Date( c('2011-01-10', '2011-01-11', '2011-01-15', '2011-01-16'))
df <- data.frame(dt = dts, val = seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point() +
scale_x_date(date_labels = '%d%b', date_breaks = '1 day')
versus
df <- data.frame(dt = factor(format(dts, format = '%d%b')),
val = seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point()
which produces a plot with only the four observed dates on the x-axis, evenly spaced and without a gap.
Is that what you wanted?
First question is: why do you want to do that? There is no point in showing a coordinate-based plot if your axes are not coordinates. If you really want to do this, you can convert to a factor. Be careful with the order though:
dts <- c(as.Date( c('31-10-2011', '01-11-2011', '02-11-2011',
'05-11-2011'),format="%d-%m-%Y"))
dtsf <- format(dts, format= '%d%b')
df <- data.frame(dt=ordered(dtsf,levels=dtsf),val=seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point()
With factors you have to be careful, because the levels are not in date order unless you set them that way yourself. As factor levels are sorted alphabetically by default, you can get in trouble with some date formats. So be careful what you do. If you don't take the order into account, you get:
df <- data.frame(dt=factor(dtsf),val=seq_along(dts))
ggplot(df, aes(dt,val)) + geom_point()
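To see the ordering problem without drawing anything, you can compare the factor levels directly. A small sketch in base R; I use a numeric "%d-%m" label here instead of '%d%b' only so that the result does not depend on the locale's month names.

```r
dts <- as.Date(c("2011-10-31", "2011-11-01", "2011-11-02", "2011-11-05"))
lab <- format(dts, "%d-%m")   # "31-10" "01-11" "02-11" "05-11"

# Default: levels are sorted as text, so 31 October ends up last
by_default <- factor(lab)
levels(by_default)
#> [1] "01-11" "02-11" "05-11" "31-10"

# Explicit levels taken in date order keep the axis chronological
chronological <- factor(lab, levels = unique(lab[order(dts)]))
levels(chronological)
#> [1] "31-10" "01-11" "02-11" "05-11"
```

Ordering the levels by the original dates, rather than by their position in the vector, also works when the rows are not already sorted.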

Is it possible to 'bin' values by date to get a total per 2 weeks in ggplot2 and R?

I have a dataframe which is a history of runs. Some of the variables include a date (in POSIXct) and a value for that run (here = size). I want to produce various graphs showing a line based on the total of the size column for a particular date range. Ideally I'd like to use the same dataset and switch between totals per week, 2 weeks, month, or quarter.
Here's an example dataset;
require(ggplot2)
set.seed(666)
today <- Sys.time() # reference point used to build the date sequence below
foo<-data.frame(Date=sample(seq(today-(365*24*60*60), today, by="day"),50, replace=FALSE),
value=rnorm(50, mean=100, sd=25),
type=sample(c("Red", "Blue", "Green"), 50, replace=TRUE))
I can create this plot, which shows individual values;
ggplot(data=foo, aes(x=Date, y=value, colour=type))+stat_summary(fun=sum, geom="line")
Or I can do this to show a sum per month;
ggplot(data=foo, aes(x=format(Date, "%m %y"), y=value, colour=type))+stat_summary(fun=sum, geom="line", aes(group=type))
However it gets more complicated to do sums per quarter / 2 weeks etc. Ideally I'd like something like the stat_bin and stat_summary combined so I could specify a binwidth (or have ggplot make a best guess based on the range)
Am I missing something obvious, or is this just not possible ?
It's pretty easy with plyr and lubridate to do all the calculations yourself:
library(plyr)
library(lubridate)
foo <- data.frame(
date = sample(today() + days(1:365), 50, replace = FALSE),
value = rnorm(50, mean = 100, sd = 25),
type = sample(c("Red", "Blue", "Green"), 50, replace = TRUE))
foo$date2 <- floor_date(foo$date, "week") # round each date down to the start of its week
foosum <- ddply(foo, c("date2", "type"), summarise,
n = length(value),
mean = mean(value))
ggplot(foosum, aes(date2, mean, colour = type)) +
geom_point(aes(size = n)) +
geom_line()
The chron package could be very useful for converting dates in ways not covered by the basic format() command. But the latter can also do smart things (like strftime in PHP), e.g.:
Show the year and month of a date:
format(foo$Date, "%Y-%m")
And showing the appropriate quarter of the year (in current R, quarters() works on Date and POSIXct objects without chron):
quarters(foo$Date)
To compute the 2-week period, you might not find a ready-made function, but it can be computed easily from the week number, e.g.:
floor(as.numeric(format(foo$Date, "%V"))/2)+1
After computing the new variables in the dataframe, you could easily plot your data just like in your original example.
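For switching between week, 2-week, month and quarter totals from the same data, base R's cut() on Date objects accepts interval strings such as "2 weeks" or "quarter", which gets close to the "binwidth" idea in the question. A minimal sketch; bin_totals() is a hypothetical helper I made up, and the data uses a fixed start date instead of Sys.time() only so that it is reproducible.

```r
set.seed(666)
start <- as.Date("2011-01-03")   # arbitrary fixed start date
foo <- data.frame(
  Date  = sample(seq(start, by = "day", length.out = 365), 50),
  value = rnorm(50, mean = 100, sd = 25),
  type  = sample(c("Red", "Blue", "Green"), 50, replace = TRUE))

# Hypothetical helper: total of `value` per time bin and type.
# `width` is any interval cut() understands for dates, e.g.
# "week", "2 weeks", "month", "quarter".
bin_totals <- function(df, width) {
  df$bin <- as.Date(cut(df$Date, breaks = width))  # bin label = first day of the bin
  aggregate(value ~ bin + type, data = df, FUN = sum)
}

two_weeks <- bin_totals(foo, "2 weeks")
quarterly <- bin_totals(foo, "quarter")
head(two_weeks)
```

The result has a real Date column, so it can go straight into ggplot(two_weeks, aes(bin, value, colour = type)) + geom_line() without any stat_summary() step. If Date is POSIXct, as in the question, wrap it in as.Date() first.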
