I have a factored time series that looks like this:
df <- data.frame(a=c("11-JUL-2004", "11-JUL-2005", "11-JUL-2006",
"11-JUL-2007", "11-JUL-2008"),
b=c("11-JUN-1999", "11-JUN-2000", "11-JUN-2001",
"11-JUN-2002", "11-JUN-2003"))
First, I would like to convert this to a format native to R. Second, I would like to calculate the number of months between the two columns.
Update:
Essentially I'm trying to recreate what I do in SPSS, in R.
In SPSS I would:
Convert the strings to date format DD-MMM-YYYY
COMPUTE. RND((a-b)/60/60/24/30.416)
30.416 is short for 365/12. I don't care much about month edge cases, hence the rounding operation.
df <- data.frame(c("11-JUL-2004","11-JUL-2005","11-JUL-2006","11-JUL-2007","11-JUL-2008"),
c("11-JUN-1999","11-JUN-2000","11-JUN-2001","11-JUN-2002","11-JUN-2003"))
names(df) <- c("X1","X2")
df <- within(df, X1 <- as.Date(X1, format = "%d-%b-%Y"))
df <- within(df, X2 <- as.Date(X2, format = "%d-%b-%Y"))
Then difftime() will give the difference in weeks:
> with(df, difftime(X1, X2, units = "weeks"))
Time differences in weeks
[1] 265.2857 265.1429 265.1429 265.1429 265.2857
Or if we use Brandon's approximation:
> with(df, difftime(X1, X2) / 30.416)
Time differences in days
[1] 61.05339 61.02052 61.02052 61.02052 61.05339
The closest I could get with lubridate (as highlighted by Dirk), using the above df, is:
> m <- with(df, as.period(subtract_dates(X1, X2)))
> m
[1] 5 years and 1 month 5 years and 1 month 5 years and 1 month 5 years and 1 month 5 years and 1 month
> str(m)
Classes ‘period’ and 'data.frame': 5 obs. of 6 variables:
$ year : int 5 5 5 5 5
$ month : int 1 1 1 1 1
$ day : num 0 0 0 0 0
$ hour : int 0 0 0 0 0
$ minute: int 0 0 0 0 0
$ second: num 0 0 0 0 0
Josh is spot-on with respect to the difficulty of what a month could mean. The lubridate package has some answers on that.
In terms of base R, we can answer it for weeks though:
> df[,"pa"] <- as.POSIXct(strptime(as.character(df$a),
+ format="%d-%B-%Y", tz="GMT"))
> df[,"pb"] <- as.POSIXct(strptime(as.character(df$b),
+ format="%d-%B-%Y",tz="GMT"))
> df[,"weeks"] <- difftime(df$pa, df$pb, unit="weeks")
> df[,"months"] <- difftime(df$pa, df$pb, unit="days")/30.416
> df
a b pa pb weeks months
1 11-JUL-2004 11-JUN-1999 2004-07-11 1999-06-11 265.29 weeks 61.053 days
2 11-JUL-2005 11-JUN-2000 2005-07-11 2000-06-11 265.14 weeks 61.021 days
3 11-JUL-2006 11-JUN-2001 2006-07-11 2001-06-11 265.14 weeks 61.021 days
4 11-JUL-2007 11-JUN-2002 2007-07-11 2002-06-11 265.14 weeks 61.021 days
5 11-JUL-2008 11-JUN-2003 2008-07-11 2003-06-11 265.29 weeks 61.053 days
>
This uses the altered data.frame as per my edit so that we have proper column names. And if you throw an as.numeric() around difftime() you also get numbers.
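A minimal sketch of that last point, assuming an English locale so that %b matches "JUL" and "JUN". Wrapping difftime() in as.numeric() gives plain numbers, which can then be divided by 365/12 and rounded, as in the SPSS computation from the question. The names days_apart and months_apart are only illustrative.

```r
# First two rows of the question's data, parsed as Date
a <- as.Date(c("11-JUL-2004", "11-JUL-2005"), format = "%d-%b-%Y")
b <- as.Date(c("11-JUN-1999", "11-JUN-2000"), format = "%d-%b-%Y")

# as.numeric() drops the "Time differences in days" wrapper
days_apart <- as.numeric(difftime(a, b, units = "days"))

# Average month length of 365/12 days, rounded as in the SPSS COMPUTE
months_apart <- round(days_apart / (365 / 12))
months_apart
```

Both rows come out as 61 here, which agrees with the unrounded 61.05 and 61.02 shown above.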
Brandon,
You could do this with the lubridate package.
> library(lubridate)
Tell R that these are dates. Use the dmy() parser function because the dates are written day, month, year (i.e., dmy).
> df <- transform(df, a = dmy(a), b = dmy(b))
Calculate the difference as a period. This will give you the number of whole years, months, days, etc.
> diff <- as.period(df$a - df$b)
Use math to convert the results to just months.
> 12* diff$year + diff$month
All of these are 61 months apart. This floors the result to whole months. If you want to round based on the number of days, you could do something like
> 12* diff$year + diff$month + round(diff$day/30)
I'm working on making these steps easier/more intuitive in the next version of lubridate.
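For comparison, the same "whole months" idea can be sketched in base R without any package. This is only an illustration, not the lubridate implementation: whole_months is a hypothetical helper, and it treats a month as complete once the day of the month has been reached, so end-of-month dates (e.g. 31 January to 28 February) may not behave the way you want.

```r
# Hypothetical helper: whole calendar months from `earlier` to `later`
whole_months <- function(later, earlier) {
  l <- as.POSIXlt(later)
  e <- as.POSIXlt(earlier)
  # Count month boundaries crossed, then drop one month when the
  # day of the month in `later` has not yet reached that of `earlier`
  12 * (l$year - e$year) + (l$mon - e$mon) - (l$mday < e$mday)
}

whole_months(as.Date("2004-07-11"), as.Date("1999-06-11"))  # 61
whole_months(as.Date("2004-07-10"), as.Date("1999-06-11"))  # 60
```

The function is vectorised, so it can be applied to whole columns, e.g. whole_months(df$a, df$b) once both columns are Date objects.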
> Data <- data.frame(
+ V1=c("11-JUL-2004","11-JUL-2005","11-JUL-2006","11-JUL-2007","11-JUL-2008"),
+ V2=c("11-JUN-1999","11-JUN-2000","11-JUN-2001","11-JUN-2002","11-JUN-2003"))
> Data[,1] <- as.Date(Data[,1],"%d-%b-%Y")
> Data[,2] <- as.Date(Data[,2],"%d-%b-%Y")
> # Assuming 30 days per month
> (Data[,1]-Data[,2])/30
Time differences in days
[1] 61.90000 61.86667 61.86667 61.86667 61.90000
> # Assuming 30.416 days per month
> (Data[,1]-Data[,2])/30.416
Time differences in days
[1] 61.05339 61.02052 61.02052 61.02052 61.05339
> # Assuming month crosses
> require(zoo)
> Data[,1] <- as.yearmon(Data[,1])
> Data[,2] <- as.yearmon(Data[,2])
> (Data[,1]-Data[,2])*12
[1] 61 61 61 61 61
Number 1 below seems closest to what you are asking for but 2 and 3 are alternatives you might also want to consider depending on your purpose. Also numbers 1 and 3 can be tried without rounding if you want to consider a fractional number of months.
# first convert columns of df to "Date" class
df[] <- lapply(df, as.Date, "%d-%b-%Y")
# 1. difference in days divided by 365.25/12
with(df, round((as.numeric(a) - as.numeric(b)) / (365.25/12)))
# 2. convert to 1st of month & then take diff in mos
library(zoo)
with(df, 12 * (as.yearmon(a) - as.yearmon(b)))
# 3. business style difference in months. See: ?"mondate-class"
library(mondate)
with(df, round(as.numeric(mondate(a) - mondate(b))))
I have a data frame with dates and the ranks of three tennis players. I want to convert the dates to a continuous count of days, so that 2001-01-01 is day 0.
I tried this:
days <- yday(x) - 1 # so Jan 1 = day 0
total_days <- cumsum(days)
It does the job, but only within each year: the count starts over for 2002, again for 2003, and so on.
I would very much appreciate some help with this.
Cheers
We could also convert to integer first and subtract
as.integer(x) - as.integer(as.Date('2001-01-01'))
[1] 92 396
data
x <- as.Date(c('2001-04-03', '2002-02-01'))
You can subtract '2001-01-01' from your dates to get the number of days since that date.
x <- as.Date(c('2001-04-03', '2002-02-01'))
total_days <- as.numeric(x - as.Date('2001-01-01'))
total_days
#[1] 92 396
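To show that the count keeps running across year boundaries, here is a small sketch on a data frame. The data frame ranks and its date column are made up for illustration; only the subtraction itself comes from the answers above.

```r
# Hypothetical data: dates spanning three calendar years
ranks <- data.frame(
  date = as.Date(c("2001-01-01", "2001-12-31", "2002-01-01", "2003-01-01"))
)

# Days elapsed since the chosen origin; unlike yday(), this never resets
origin <- as.Date("2001-01-01")
ranks$day <- as.numeric(ranks$date - origin)
ranks$day
# [1]   0 364 365 730
```

If the first date in the data should be day 0 instead of a fixed date, min(ranks$date) could be used as the origin.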
I am trying to find the difference between two timestamps.
The code:
survey <- data.frame(date=c("07/2012","07/2012"),tx_start=c("01/2012","01/2012"))
survey$date_diff <- as.Date(as.character(survey$date), format="%m/%Y")-
as.Date(as.character(survey$tx_start), format="%m/%Y")
survey
I expect the new column to contain the difference, but I get NA.
The results:
> survey
date tx_start date_diff
1 07/2012 01/2012 NA days
2 07/2012 01/2012 NA days
What should I change so that as.Date handles month/year values, or what should I use instead?
Update based on comment of Gregor:
> survey <- data.frame(date=c("07/2012","07/2012"),tx_start=c("01/2012","01/2012"))
> survey$date <- as.Date(paste0("01/", as.character(survey$date)), "%d/%m/%Y")
> survey$tx_start <- as.Date(paste0("01/", as.character(survey$tx_start)), "%d/%m/%Y")
> survey$date_diff <- as.Date(survey$date, format="%d/%m/%Y")-
+ as.Date(survey$tx_start, format="%d/%m/%Y")
> survey
date tx_start date_diff
1 2012-07-01 2012-01-01 182 days
2 2012-07-01 2012-01-01 182 days
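The reason for the NA is that a month and a year alone do not identify a day, so as.Date() cannot build a date from them. The sketch below shows that, and then gets the difference in months rather than days by combining the year and month fields. The column names month_diff and year_diff are only illustrative.

```r
# Month and year without a day cannot be parsed into a Date
as.Date("07/2012", format = "%m/%Y")
# [1] NA

survey <- data.frame(date = c("07/2012", "07/2012"),
                     tx_start = c("01/2012", "01/2012"))

# Prepend an arbitrary day (the 1st) so the strings parse as full dates
end_lt   <- as.POSIXlt(as.Date(paste0("01/", survey$date), format = "%d/%m/%Y"))
start_lt <- as.POSIXlt(as.Date(paste0("01/", survey$tx_start), format = "%d/%m/%Y"))

# Difference in calendar months, and the same expressed in years
survey$month_diff <- 12 * (end_lt$year - start_lt$year) + (end_lt$mon - start_lt$mon)
survey$year_diff  <- survey$month_diff / 12
survey
```

Both rows give a month_diff of 6, which matches the 182 days shown above.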
I usually convert my dates to POSIXct format. Then, when direct differences are taken with normal syntax, you get an answer in units of seconds. There is a difftime() function in base R that you can use as well:
survey <- data.frame(date=c("07/2012","07/2012"),tx_start=c("01/2012","01/2012"))
# Dates are finicky, add a day so that conversion will work
survey$date2 <- paste0("01/",survey$date)
survey$tx_start2 <- paste0("01/",survey$tx_start)
# conversion
survey$date2 <- as.POSIXct(x=survey$date2,format="%d/%m/%Y")
survey$tx_start2 <- as.POSIXct(x=survey$tx_start2,format="%d/%m/%Y")
# take the difference
survey$date_diff <- with(survey,difftime(time1=date2,time2=tx_start2,units="hours"))
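One caveat with POSIXct, sketched below under the assumption that you only care about calendar dates: if no time zone is given, the session's local zone is used, and a daylight saving change between the two dates can shift the result by an hour. Passing tz = "UTC" avoids that. The variable names are illustrative.

```r
# Parse both dates in UTC so daylight saving time cannot shift the result
t_start <- as.POSIXct("01/01/2012", format = "%d/%m/%Y", tz = "UTC")
t_end   <- as.POSIXct("01/07/2012", format = "%d/%m/%Y", tz = "UTC")

# difftime() lets you pick the unit; as.numeric() strips the units label
hours_apart <- as.numeric(difftime(t_end, t_start, units = "hours"))
days_apart  <- as.numeric(difftime(t_end, t_start, units = "days"))
c(hours = hours_apart, days = days_apart)
# hours  days
#  4368   182
```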
I have dates of format 2015-03 (i.e year-month). Now I want to calculate the month difference in between 2 dates.
Example: difference between dates 2015-03 and 2014-12 should be 3 or 4 as December to March is 3 months or 4 months depending on whether we consider December or not.
You can do it via diff
require(lubridate)
a <- c("2014-12","2015-03") # earlier month first: diff() computes each element minus the one before it
a_parsed <- ymd(paste0(a,"-01")) # There might be a nicer solution to get the dates
diff(year(a_parsed)) * 12 + diff(month(a_parsed)) # Results in 3
Use + 1 to "consider December"
Explanation:
diff(year(a_parsed)) gives the difference in years, and multiplying it by 12 converts that to months. diff(month(a_parsed)) gives the difference between the month numbers, ignoring the years (it can be negative). Added together, they give the difference in months you asked for.
a <- "2015-03"
b <- "2014-12"
a <- unlist(strsplit(a, "-"))
b <- unlist(strsplit(b, "-"))
a <- (as.numeric(a[1])*12) + as.numeric(a[2])
b <- (as.numeric(b[1])*12) + as.numeric(b[2])
difference <- diff(c(b,a))
difference
The result of this is 3
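The same arithmetic can be wrapped in a small vectorised helper, so that it works on a whole column of "YYYY-MM" strings. This is just a sketch: month_index is a hypothetical name, and it assumes every string has exactly the form "YYYY-MM".

```r
# Hypothetical helper: map "YYYY-MM" to a running month count
month_index <- function(ym) {
  as.integer(substr(ym, 1, 4)) * 12L + as.integer(substr(ym, 6, 7))
}

dates <- c("2014-12", "2015-03", "2016-01")

# Months between consecutive entries
diff(month_index(dates))
# [1]  3 10

# Add 1 if the starting month should be counted as well
month_index("2015-03") - month_index("2014-12") + 1L
# [1] 4
```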
I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data, as below, covering several years, and for each 8-day period I want the sum of the corresponding values. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function, but for some reason it doesn't work and aggregates the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the package xts for such manipulations.
I read your data as a zoo object; note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the indices at which each 8-day period ends:
ep <- endpoints(ts.dat,'days',k=8)
Finally, I apply my function to the time series over each of those periods:
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then it's just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
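Because ix is built from row numbers, it assumes one row per day with no gaps. A possible variant, sketched here on made-up data, derives the groups from the dates themselves, so a missing day does not shift the later periods. The sample values and the period_start column are assumptions for illustration only.

```r
# Hypothetical data: daily values with two missing days (30 and 31 March)
temp <- data.frame(Date = as.Date("2001-03-24") + c(0:5, 8:10),
                   amount = 1:9)

# Whole days elapsed since the first date
offset <- as.integer(temp$Date - min(temp$Date))

# Start date of the 8-day period that each row falls into
temp$period_start <- min(temp$Date) + 8 * (offset %/% 8)

res <- aggregate(amount ~ period_start, data = temp, FUN = sum)
res
```

Here the first period (starting 2001-03-24) sums the six available rows to 21, and the second period (starting 2001-04-01) sums the remaining three to 24.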
Those are NOT Date-classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked:
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
A cleaner approach that extends G. Grothendieck's approach. Note: it does not check whether the dates are continuous; the sum is calculated over a fixed number of rows.
code
library(zoo) # for rollapply()
interval = 8 # your desired date interval: 2 days, 3 days or whatever
enddate = interval-1 # offset from each start row to its end row
z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
nrows = nrow(z) # computed after z exists
data.frame ( Start.date = (z[seq(1, nrows, interval),1]),
End.date = z[seq(1, nrows, interval)+enddate,1],
Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE, align = "left")) # left-align so each window starts at Start.date
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))
I have a CSV file that looks like this, where "time" is a UNIX timestamp:
time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14
... and so on
I am reading it into R and converting the time column into POSIXct like so:
data <- read.csv(file="data.csv",head=TRUE,sep=",")
data[,1] <- as.POSIXct(data[,1], origin="1970-01-01")
Great so far, but now I would like to build a histogram with each bin corresponding to the average hourly count. I'm stuck on selecting by hour and then counting. I've looked through ?POSIXt and ?cut.POSIXt, but if the answer is in there, I am not seeing it.
Any help would be appreciated.
Here is one way:
R> lines <- "time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14"
R> con <- textConnection(lines); df <- read.csv(con); close(con)
R> df$time <- as.POSIXct(df$time, origin="1970-01-01")
R> df$hour <- as.POSIXlt(df$time)$hour
R> df
time count hour
1 2011-03-15 05:13:52 5 5
2 2011-03-11 13:23:52 0 13
3 2011-03-13 09:48:52 1 9
4 2011-03-16 04:18:52 4 4
5 2011-03-12 12:28:52 1 12
6 2011-03-14 08:53:52 1 8
7 2011-03-10 17:03:52 9 17
8 2011-03-09 20:43:52 14 20
R> tapply(df$count, df$hour, FUN=mean)
4 5 8 9 12 13 17 20
4 5 1 1 1 0 9 14
R>
Your data doesn't actually yet have multiple entries per hour-of-the-day but this would average over the hours, properly parsed from the POSIX time stamps. You can adjust with TZ info as needed.
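On the time zone remark: the hour you get depends on the zone the timestamps are displayed in. A small sketch, assuming you want hours in UTC and using made-up counts in which two timestamps fall in the same hour, so that the mean actually averages something:

```r
# Hypothetical data: the first two timestamps are 10 minutes apart
counts <- data.frame(time  = c(1300162432, 1300163032, 1299849832),
                     count = c(5, 7, 0))

# Fix the time zone explicitly so the hour does not depend on the session
counts$time <- as.POSIXct(counts$time, origin = "1970-01-01", tz = "UTC")
counts$hour <- as.integer(format(counts$time, "%H"))

# Mean count per hour of the day
hourly_mean <- tapply(counts$count, counts$hour, mean)
hourly_mean
#  4 13
#  6  0
```

A quick plot could then be barplot(hourly_mean).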
You can calculate the hour "bin" for each time by converting to a POSIXlt and subtracting away the minute and seconds components. Then you can add a new column to your data frame that would contain the hour bin marker, like so:
date.to.hour <- function (vec)
{
as.POSIXct(
sapply(
vec,
function (x)
{
lt = as.POSIXlt(x)
x - 60*lt$min - lt$sec
}),
tz="GMT",
origin="1970-01-01")
}
data$hour <- date.to.hour(as.POSIXct(data[,1], origin="1970-01-01"))
There's a good post on this topic on Mages' blog. To get the bucketed data:
aggregate(. ~ cut(time, 'hours'), data, mean)
If you just want a quick graph, ggplot2 is your friend:
qplot(cut(time, "hours"), count, data=data, stat='summary', fun.y='mean')
Unfortunately, because cut returns a factor, the x axis won't work properly. You may want to write your own, less awkward bucketing function for time, e.g.
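timebucket = function(x, bucketsize = 1,
units = c("secs", "mins", "hours", "days", "weeks")) {
units = match.arg(units) # pick the single unit that was requested
secs = as.numeric(as.difftime(bucketsize, units=units), units="secs")
# floor to a multiple of the bucket size; current R expects the class order POSIXct, POSIXt
structure(floor(as.numeric(x) / secs) * secs, class=c('POSIXct','POSIXt'))
}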
qplot(timebucket(time, units="hours"), ...)
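The bucketing idea can also be tried on its own in base R. The sketch below, which assumes UTC and uses three of the timestamps from the question, floors each time to the start of its hour and keeps the result as POSIXct, so it can go straight onto a continuous time axis.

```r
times <- as.POSIXct(c(1300162432, 1300163032, 1299849832),
                    origin = "1970-01-01", tz = "UTC")

# Floor the underlying seconds to a multiple of 3600 (one hour)
bucket <- as.POSIXct(floor(as.numeric(times) / 3600) * 3600,
                     origin = "1970-01-01", tz = "UTC")

format(bucket, "%Y-%m-%d %H:%M:%S")
# [1] "2011-03-15 04:00:00" "2011-03-15 04:00:00" "2011-03-11 13:00:00"
```

Unlike cut(), which returns a factor, the buckets here are still date-times, which is what makes the x axis behave.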