Histogram of events grouped by month and day - r

I am trying to make a histogram (or other plot) of the number of occurrences of each event from a set of data spanning multiple years, but grouped by month and day. Basically I want a year-long x-axis starting from 1 March that shows how many times each date occurs, with the bars shaded by a categorical value. Below are the first 20 entries in the data set:
goose
Index DateLost DateLost1 Nested
1 2/5/1988 1988-02-05 N
2 5/20/1988 1988-05-20 N
3 1/31/1985 1985-01-31 N
4 9/6/1997 1997-09-06 Y
5 9/24/1996 1996-09-24 N
6 9/27/1996 1996-09-27 N
7 9/15/1997 1997-09-15 Y
8 1/18/1989 1989-01-18 Y
9 1/12/1985 1985-01-12 Y
10 2/12/1988 1988-02-12 N
11 1/12/1985 1985-01-12 Y
12 10/26/1986 1986-10-26 N
13 9/15/1988 1988-09-15 Y
14 12/30/1986 1986-12-30 N
15 1/19/1991 1991-01-19 N
16 1/7/1992 1992-01-07 N
17 10/9/1999 1999-10-09 N
18 10/20/1990 1990-10-20 N
19 10/25/2001 2001-10-25 N
20 9/23/1996 1996-09-23 Y
I have tried grouping using strftime, zoo, and lubridate but then the plots don't recognize the time sequence or allow me to adjust the starting value. I have tried numerous methods using plot() and ggplot2() but either can't get the grouped data to plot correctly or can't get data grouped. My best plot so far is from this code:
ggplot(goose, aes(x=DateLost1,fill=Nested))+
stat_bin(binwidth=100 ,position="identity") +
scale_x_date("Date")
This gets me a nice plot but over all years, rather than one year. I have also played with the code from a previous answer here:
Understanding dates and plotting a histogram with ggplot2 in R
But am having trouble choosing a start date. Any help would be greatly appreciated. Let me know if I can provide the example data in an easier to use format.

Let's read in your data:
goose <- read.table(header = TRUE, text = "Index DateLost DateLost1 Nested
1 2/5/1988 1988-02-05 N
2 5/20/1988 1988-05-20 N
3 1/31/1985 1985-01-31 N
4 9/6/1997 1997-09-06 Y
5 9/24/1996 1996-09-24 N
6 9/27/1996 1996-09-27 N
7 9/15/1997 1997-09-15 Y
8 1/18/1989 1989-01-18 Y
9 1/12/1985 1985-01-12 Y
10 2/12/1988 1988-02-12 N
11 1/12/1985 1985-01-12 Y
12 10/26/1986 1986-10-26 N
13 9/15/1988 1988-09-15 Y
14 12/30/1986 1986-12-30 N
15 1/19/1991 1991-01-19 N
16 1/7/1992 1992-01-07 N
17 10/9/1999 1999-10-09 N
18 10/20/1990 1990-10-20 N
19 10/25/2001 2001-10-25 N
20 9/23/1996 1996-09-23 Y")
Now we can convert this to POSIXct format:
goose$DateLost1 <- as.POSIXct(goose$DateLost,
format = "%m/%d/%Y",
tz = "GMT")
Then we need to figure out what year it was lost in, relative to March 1. Don't try to do this in ggplot(). This requires some mucking about to figure out which year we are in, and then calculate the number of days after March 1.
goose$DOTYMarch1 = as.numeric(format(as.POSIXct(paste0("3/1/",format(goose$DateLost1,"%Y")),
format = "%m/%d/%Y",
tz = "GMT"),
"%j"))
goose$DOTYLost = as.numeric(format(goose$DateLost1,
"%j"))
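# >= so that a loss on March 1 itself counts as day 0 of the new year, not day 365 of the old one
goose$YLost = as.numeric(format(goose$DateLost1,"%Y")) + (as.numeric(goose$DOTYLost>=goose$DOTYMarch1) -1)
# days since the most recent March 1 (the column name says March31, but March 1 is the reference)
goose$DOTYAfterMarch31Lost = as.numeric(difftime(goose$DateLost1, as.POSIXct(paste0("3/1/",goose$YLost),
format = "%m/%d/%Y",
tz = "GMT"), units = "days"))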
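The same day offset can also be sketched with Date arithmetic alone. This is a minimal sketch rather than the code used in this answer: `days_since_march1` is a hypothetical helper name, and it assumes the dates are given as `YYYY-MM-DD` strings or Date objects.

```r
# Hypothetical helper (not part of the original answer): days elapsed
# since the most recent March 1, so March 1 itself maps to 0.
days_since_march1 <- function(d) {
  d <- as.Date(d)
  yr <- as.integer(format(d, "%Y"))
  # Dates before this year's March 1 belong to the season that began last year
  before <- d < as.Date(paste0(yr, "-03-01"))
  season_start <- as.Date(paste0(yr - before, "-03-01"))
  as.integer(d - season_start)
}

offsets <- days_since_march1(c("1988-03-01", "1988-02-05", "1997-09-06"))
```

Because the result is a plain integer, it can be used as the x aesthetic in the same way as the column computed above.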
Then we can plot it. Your code was pretty much perfect already.
require(ggplot2)
p <- ggplot(goose,
aes(x=DOTYAfterMarch31Lost,
fill=Nested))+
stat_bin(binwidth=1,
position="identity")
print(p)
And we get this:

Related

Time series of inflation rate

Hello, I have a data set with 20 years of CPI values.
I calculated the inflation rate:
"/" <- function(x,y) ifelse(y==0,0,base:::"/"(x,y))
n <- length(CPI.germany$CPI)
infl <- CPI.germany$CPI[13:n]/CPI.germany$CPI[1:(n-12)]
# adjust the date column
date <- CPI.germany1$Date
datenew<- date[13:252]
#control
length(datenew)
length(infl)
infl datenew
1 1.08182862 1991-01-15
2 1.08195654 1991-02-15
3 1.08191389 1991-03-15
4 1.22093054 1991-04-15
5 1.28206524 1991-05-15
6 1.56516705 1991-06-15
7 2.01404189 1991-07-15
8 1.58665134 1991-08-15
How can I now create a time series graph like the one I attached?
And which package is the easiest one? ggplot2?
Assuming DF is as shown reproducibly in the Note at the end, convert it to a zoo series z and then use one of the methods shown.
library(zoo)
z <- read.zoo(DF, index.column = "datenew") # full argument name instead of relying on partial matching
# classic graphics
plot(z)
# ggplot2
library(ggplot2)
autoplot(z)
# lattice
library(lattice)
xyplot(z)
Note
Lines <- " infl datenew
1 1.08182862 1991-01-15
2 1.08195654 1991-02-15
3 1.08191389 1991-03-15
4 1.22093054 1991-04-15
5 1.28206524 1991-05-15
6 1.56516705 1991-06-15
7 2.01404189 1991-07-15
8 1.58665134 1991-08-15"
DF <- read.table(text = Lines)
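If no extra package is wanted, a base-graphics line plot is also possible. The following is a minimal sketch under the assumption that only the eight rows shown in the question are available; the data frame is rebuilt by hand here, and the axis labels are arbitrary choices.

```r
# The eight rows shown in the question, rebuilt as a data frame
DF <- data.frame(
  infl = c(1.08182862, 1.08195654, 1.08191389, 1.22093054,
           1.28206524, 1.56516705, 2.01404189, 1.58665134),
  datenew = c("1991-01-15", "1991-02-15", "1991-03-15", "1991-04-15",
              "1991-05-15", "1991-06-15", "1991-07-15", "1991-08-15")
)
# A real Date column lets plot() draw a proper time axis
DF$datenew <- as.Date(DF$datenew)
plot(DF$datenew, DF$infl, type = "l", xlab = "Date", ylab = "infl")
```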

Order of dates when plotting time series in R

I would like to know if the order of dates matter when plotting a time series in R.
For example, the data frame below has its dates starting from the year 2010 and increasing as it goes down, here up to 2011:
Date Number of visits
2010-05-17 13
2010-05-18 11
2010-05-19 4
2010-05-20 2
2010-05-21 23
2010-05-22 26
2011-05-13 14
and below is one where the years are jumbled up.
Date Number of visits
2011-06-19 10
2009-04-25 5
2012-03-09 20
2011-01-04 45
Would I be able to plot a time series in R for the second example above? Is it required that the dates be sorted in order to plot a time series?
Assuming the data shown reproducibly in the Note at the end, create an ordering vector o and then plot the ordered data:
o <- order(dat$Date)
plot(dat[o, ], type = "o")
or convert the data to a zoo series, which will automatically order it, and then plot.
library(zoo)
z <- read.zoo(dat)
plot(z, type = "o")
Note
The data in reproducible form:
Lines <- "Date Number of visits
2010-05-17 13
2010-05-18 11
2010-05-19 4
2010-05-20 2
2010-05-21 23
2010-05-22 26
2011-05-13 14"
# sub() replaces only the first run of spaces, so "Number of visits" stays a single column name
dat <- read.csv(text = sub(" +", ",", readLines(textConnection(Lines))),
check.names = FALSE)
dat$Date <- as.Date(dat$Date)
as.Date solves your problem:
data$Date <- as.Date(data$Date)
ggplot(data, aes(Date, Number_of_visits)) + geom_line()
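To make the effect of ordering concrete, here is a minimal sketch using the four jumbled rows from the question. The names `dat2`, `visits` and `sorted` are made up for this illustration.

```r
# The jumbled rows from the question; "visits" is a shortened column name
dat2 <- data.frame(
  Date = as.Date(c("2011-06-19", "2009-04-25", "2012-03-09", "2011-01-04")),
  visits = c(10, 5, 20, 45)
)
# order() returns the row permutation that puts the dates in ascending order
sorted <- dat2[order(dat2$Date), ]
# With unsorted rows the connecting line would double back on itself
plot(sorted$Date, sorted$visits, type = "o",
     xlab = "Date", ylab = "Number of visits")
```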

R ggplot by month and values group by Week

With ggplot2, I would like to create a multiplot (facet_grid) where each plot is the weekly count values for the month.
My data are like this :
day_group count
1 2012-04-29 140
2 2012-05-06 12595
3 2012-05-13 12506
4 2012-05-20 14857
For this dataset I have created two other columns, Month and Week, based on day_group:
day_group count Month Week
1 2012-04-29 140 Apr 17
2 2012-05-06 12595 May 18
3 2012-05-13 12506 May 19
4 2012-05-20 14857 May 2
Now I would like for each Month to create a barplot where I have the sum of the count values aggregated by week. So for example for a year I would have 12 plots with 4 bars (one per week).
Below is what I use to generate the plot :
ggplot(data = count_by_day, aes(x=day_group, y=count)) +
stat_summary(fun.y="sum", geom = "bar") +
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
facet_grid(facets = Month ~ ., scales="free", margins = FALSE)
So far, my plot looks like this
https://dl.dropboxusercontent.com/u/96280295/Rplot.png
As you can see, the x axis is not what I'm looking for. Instead of showing only weeks 1, 2, 3 and 4, it displays the whole month.
Do you know what I must change to get what I'm looking for ?
Thanks for your help
Okay, now that I see what you want, I wrote a small program to illustrate it. The key to your order of month problem is making month a factor with the levels in the right order:
library(dplyr)
library(ggplot2)
#initialization
set.seed(1234)
sday <- as.Date("2012-01-01")
eday <- as.Date("2012-07-31")
# List of the first day of the months
mfdays <- seq(sday,length.out=12,by="1 month")
# list of months - this is key to keeping the order straight
mlabs <- months(mfdays)
# list of first weeks of the months
mfweek <- trunc((mfdays-sday)/7)
names(mfweek) <- mlabs
# Generate a bunch of event-days, and then months, then week numbs in our range
n <- 1000
edf <-data.frame(date=sample(seq(sday,eday,by=1),n,T))
edf$month <- factor(months(edf$date),levels=mlabs) # use the factor in the right order
edf$week <- 1 + as.integer(((edf$date-sday)/7) - mfweek[edf$month])
# Now summarize with dplyr
ndf <- group_by(edf,month,week) %>% summarize( count = n() )
ggplot(ndf) + geom_bar(aes(x=week,y=count),stat="identity") + facet_wrap(~month,nrow=1)
Yielding:
(As an aside, I am kind of proud I did this without lubridate ...)
I think you have to do this but I am not sure I understand your question:
ggplot(data = count_by_day, aes(x=Week, y=count, group= Month, color=Month))
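A week-within-month number can also be derived from the day of the month alone. The sketch below uses a simpler definition than the first answer (days 1 to 7 are week 1, days 8 to 14 are week 2, and so on, which allows a week 5); `week_of_month` is a hypothetical helper, and the four rows are the ones shown in the question.

```r
# Hypothetical helper: days 1-7 of a month are week 1, days 8-14 week 2, ...
week_of_month <- function(d) {
  (as.integer(format(as.Date(d), "%d")) - 1L) %/% 7L + 1L
}

day_group <- as.Date(c("2012-04-29", "2012-05-06", "2012-05-13", "2012-05-20"))
count <- c(140, 12595, 12506, 14857)

wk <- week_of_month(day_group)
mon <- format(day_group, "%Y-%m")
# One row per month/week pair, holding the summed count
weekly <- aggregate(count, list(month = mon, week = wk), sum)
```

The resulting `week` column can then be mapped to x, with the month used for faceting.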

How to get sum of values every 8 days by date in data frame in R

I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data as below for several years with some values, and I want to get the sum of the values for each 8-day period. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function, but for some reason it doesn't work and aggregates the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the xts package for such manipulations.
I read your data as a zoo object; note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the endpoints of each period:
ep <- endpoints(ts.dat,'days',k=8)
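Finally I apply my function to the time series over each period. Note that in the output below the first period ends on 2001-03-29 and covers only six days, so the 8-day blocks are not counted from the first observation.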
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then it's just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
Those are NOT Date classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked since :
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
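The grouping that cut.Date performs here can be sketched by hand: integer-divide the number of days since the first date by 8. This is a minimal base R sketch on the nine rows from the question; `block` and `sums` are names introduced for this illustration.

```r
# The nine rows shown in the question (head(temp) version)
temp <- data.frame(
  Date = c("24/03/2001", "25/03/2001", "26/03/2001", "27/03/2001", "28/03/2001",
           "29/03/2001", "30/03/2001", "31/03/2001", "1/04/2001"),
  amount = c(-3.927020472, -6.4427004, -5.477592528, -6.09462162, -7.45666902,
             -6.731540928, -6.855206184, -6.807210228, -5.40278802)
)
d <- as.Date(temp$Date, format = "%d/%m/%Y")
# 0 for the first 8 calendar days, 1 for the next 8, and so on
block <- as.integer(d - min(d)) %/% 8L
# Label each block by its start date and sum the amounts within it
sums <- tapply(temp$amount, min(d) + 8L * block, sum)
```

Because the blocks are based on calendar days rather than row positions, missing days simply leave a block with fewer rows.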
A cleaner approach that extends @G. Grothendieck's approach. Note: it does not take into account whether the dates are continuous or discontinuous; the sum is calculated over a fixed number of rows.
code
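library(zoo) # for rollapply
interval = 8 # your desired date interval: 2 days, 3 days or whatever
enddate = interval-1 # offset from each start row to its end row
z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
nrows = nrow(z) # must be computed after z exists
starts = seq(1, nrows, interval) # first row of each interval
data.frame ( Start.date = z[starts,1],
End.date = z[pmin(starts+enddate, nrows),1], # clamp the last, shorter interval
Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE, align = "left")) # left-aligned so each window starts at its Start.date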
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))

How to select and plot hourly averages from data frame?

I have a CSV file that looks like this, where "time" is a UNIX timestamp:
time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14
... and so on
I am reading it into R and converting the time column into POSIXct like so:
data <- read.csv(file="data.csv",head=TRUE,sep=",")
data[,1] <- as.POSIXct(data[,1], origin="1970-01-01")
Great so far, but now I would like to build a histogram with each bin corresponding to the average hourly count. I'm stuck on selecting by hour and then counting. I've looked through ?POSIXt and ?cut.POSIXt, but if the answer is in there, I am not seeing it.
Any help would be appreciated.
Here is one way:
R> lines <- "time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14"
R> con <- textConnection(lines); df <- read.csv(con); close(con)
R> df$time <- as.POSIXct(df$time, origin="1970-01-01")
R> df$hour <- as.POSIXlt(df$time)$hour
R> df
time count hour
1 2011-03-15 05:13:52 5 5
2 2011-03-11 13:23:52 0 13
3 2011-03-13 09:48:52 1 9
4 2011-03-16 04:18:52 4 4
5 2011-03-12 12:28:52 1 12
6 2011-03-14 08:53:52 1 8
7 2011-03-10 17:03:52 9 17
8 2011-03-09 20:43:52 14 20
R> tapply(df$count, df$hour, FUN=mean)
4 5 8 9 12 13 17 20
4 5 1 1 1 0 9 14
R>
Your data doesn't actually yet have multiple entries per hour-of-the-day but this would average over the hours, properly parsed from the POSIX time stamps. You can adjust with TZ info as needed.
You can calculate the hour "bin" for each time by converting to a POSIXlt and subtracting away the minute and seconds components. Then you can add a new column to your data frame that would contain the hour bin marker, like so:
date.to.hour <- function (vec)
{
as.POSIXct(
sapply(
vec,
function (x)
{
lt = as.POSIXlt(x)
x - 60*lt$min - lt$sec
}),
tz="GMT",
origin="1970-01-01")
}
data$hour <- date.to.hour(as.POSIXct(data[,1], origin="1970-01-01"))
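The same hour bin can also be sketched without the element-wise sapply, since trunc() works on whole date-time vectors. This is a minimal sketch, assuming the timestamps should be interpreted in UTC; it uses the first four rows from the question, and `tm`, `hour_bin` and `hourly_mean` are names made up here.

```r
# First four rows of the question's data
stamps <- c(1300162432, 1299849832, 1300006132, 1300245532)
count <- c(5, 0, 1, 4)

tm <- as.POSIXct(stamps, origin = "1970-01-01", tz = "UTC")
# trunc() drops the minutes and seconds; it returns POSIXlt, so convert back
hour_bin <- as.POSIXct(trunc(tm, units = "hours"))
# Mean count per hour bin, keyed by a readable label
hourly_mean <- tapply(count, format(hour_bin, "%Y-%m-%d %H:00"), mean)
```

Unlike the hour-of-day approach in the first answer, this keeps the date, so 05:00 on two different days ends up in two different bins.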
There's a good post on this topic on Mages' blog. To get the bucketed data:
aggregate(. ~ cut(time, 'hours'), data, mean)
If you just want a quick graph, ggplot2 is your friend:
qplot(cut(time, "hours"), count, data=data, stat='summary', fun.y='mean')
Unfortunately, because cut returns a factor, the x axis won't work properly. You may want to write your own, less awkward bucketing function for time, e.g.
timebucket = function(x, bucketsize = 1,
units = c("secs", "mins", "hours", "days", "weeks")) {
secs = as.numeric(as.difftime(bucketsize, units=units[1]), units="secs")
structure(floor(as.numeric(x) / secs) * secs, class=c('POSIXct','POSIXt')) # class order used by current R
}
qplot(timebucket(time, units="hours"), ...)
