I have a CSV file that looks like this, where "time" is a UNIX timestamp:
time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14
... and so on
I am reading it into R and converting the time column into POSIXct like so:
data <- read.csv(file="data.csv",head=TRUE,sep=",")
data[,1] <- as.POSIXct(data[,1], origin="1970-01-01")
Great so far, but now I would like to build a histogram with each bin corresponding to the average hourly count. I'm stuck on selecting by hour and then counting. I've looked through ?POSIXt and ?cut.POSIXt, but if the answer is in there, I am not seeing it.
Any help would be appreciated.
Here is one way:
R> lines <- "time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14"
R> con <- textConnection(lines); df <- read.csv(con); close(con)
R> df$time <- as.POSIXct(df$time, origin="1970-01-01")
R> df$hour <- as.POSIXlt(df$time)$hour
R> df
time count hour
1 2011-03-15 05:13:52 5 5
2 2011-03-11 13:23:52 0 13
3 2011-03-13 09:48:52 1 9
4 2011-03-16 04:18:52 4 4
5 2011-03-12 12:28:52 1 12
6 2011-03-14 08:53:52 1 8
7 2011-03-10 17:03:52 9 17
8 2011-03-09 20:43:52 14 20
R> tapply(df$count, df$hour, FUN=mean)
4 5 8 9 12 13 17 20
4 5 1 1 1 0 9 14
R>
Your data doesn't actually yet have multiple entries per hour-of-the-day but this would average over the hours, properly parsed from the POSIX time stamps. You can adjust with TZ info as needed.
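On the time zone remark: the hour you get depends on the zone used when the timestamps are formatted. Here is a minimal sketch that pins everything to UTC, so the result does not depend on the machine's locale. The four rows are taken from the question; the choice of "UTC" and the names x and hourly_mean are assumptions for illustration only.

```r
# Four rows from the question, with the zone fixed to UTC (an assumption)
x <- data.frame(time  = c(1300162432, 1299849832, 1300006132, 1300245532),
                count = c(5, 0, 1, 4))
x$time <- as.POSIXct(x$time, origin = "1970-01-01", tz = "UTC")

# Hour of day (0-23), extracted in the same zone
x$hour <- as.integer(format(x$time, "%H", tz = "UTC"))

# Mean count per hour of day
hourly_mean <- tapply(x$count, x$hour, mean)
hourly_mean
```

Swap "UTC" for an Olson name such as "America/Chicago" if the hours should follow local wall-clock time.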
You can calculate the hour "bin" for each time by converting to a POSIXlt and subtracting away the minute and seconds components. Then you can add a new column to your data frame that would contain the hour bin marker, like so:
date.to.hour <- function (vec)
{
as.POSIXct(
sapply(
vec,
function (x)
{
lt = as.POSIXlt(x)
x - 60*lt$min - lt$sec
}),
tz="GMT",
origin="1970-01-01")
}
data$hour <- date.to.hour(as.POSIXct(data[,1], origin="1970-01-01"))
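The same hour bin can probably be had without the sapply loop, since trunc() has a method for date-times. A rough sketch, assuming UTC timestamps (the values and the names hour_bin and hour_bin2 are just for illustration):

```r
# Two timestamps from the question, interpreted as UTC (an assumption)
x <- as.POSIXct(c(1300162432, 1299849832), origin = "1970-01-01", tz = "UTC")

# trunc() drops minutes and seconds; it returns POSIXlt, so convert back
hour_bin <- as.POSIXct(trunc(x, units = "hours"))

# Equivalent arithmetic on the underlying seconds; only valid for zones
# whose UTC offset is a whole number of hours
hour_bin2 <- as.POSIXct(floor(as.numeric(x) / 3600) * 3600,
                        origin = "1970-01-01", tz = "UTC")
hour_bin
```

Both versions are vectorised, so they should scale better than calling as.POSIXlt once per element.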
There's a good post on this topic on Mages' blog. To get the bucketed data:
aggregate(. ~ cut(time, 'hours'), data, mean)
If you just want a quick graph, ggplot2 is your friend:
qplot(cut(time, "hours"), count, data=data, stat='summary', fun.y='mean')
Unfortunately, because cut returns a factor, the x axis won't work properly. You may want to write your own, less awkward bucketing function for time, e.g.
timebucket = function(x, bucketsize = 1,
units = c("secs", "mins", "hours", "days", "weeks")) {
secs = as.numeric(as.difftime(bucketsize, units=units[1]), units="secs")
structure(floor(as.numeric(x) / secs) * secs, class=c('POSIXt','POSIXct'))
}
qplot(timebucket(time, units="hours"), ...)
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to @Henry Navarro's answer, solving an additional problem mentioned in the question.
If I understand correctly, the other concern in the question is to find the dates for which there is at least one data point for each hour of the given interval within the day. One possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(days = unique(your_data$days),
all_hours_present = sapply(function(Z) (sum(unique(Z$hour_only) %in% hours_intervals) >=
length(hours_intervals)), X = your_data_by_days_list), row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
The data now has a column "all_hours_present" indicating whether the data for the corresponding day contains at least one value for each hour in the given hours_intervals. You can use this column to subset your data:
subset(your_data, all_hours_present)
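For comparison, here is a base-R sketch of the whole pipeline (flag complete days, keep only the wanted hours, sum mat.3 per day) on made-up data. The data frame d, the narrowed interval 11:13 and the helper names are assumptions for illustration, not part of the original data.

```r
# Made-up data: 19 May has hours 11, 12 and 13; 20 May is missing hour 12
d <- data.frame(
  times = as.POSIXct(c("2017-05-19 11:01:00", "2017-05-19 12:01:00",
                       "2017-05-19 13:01:00", "2017-05-20 11:01:00",
                       "2017-05-20 13:01:00"), tz = "UTC"),
  mat.3 = c(1, 2, 3, 4, 5))
d$day  <- as.Date(d$times, tz = "UTC")
d$hour <- as.integer(format(d$times, "%H"))

wanted <- 11:13  # narrowed interval, as in the answer above

# TRUE for each day on which every wanted hour occurs at least once
complete_day <- tapply(d$hour, d$day, function(h) all(wanted %in% h))

# Keep rows from complete days that fall inside the wanted hours
keep <- d[complete_day[as.character(d$day)] & d$hour %in% wanted, ]

# Sum mat.3 per remaining day
daily_sum <- aggregate(mat.3 ~ day, data = keep, FUN = sum)
daily_sum
```

Only 19 May survives here, because 20 May has no reading in the 12 o'clock hour.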
Try creating a new variable in your data frame that holds only the time of day.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable, flag the rows that fall inside your interval:
#auxiliary variable flagging rows inside the time interval
#(use element-wise &, not ||: both bounds must hold, and || is not vectorised)
your_data$aux_var<-ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var==1:
your_data[which(your_data$aux_var ==1),]
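A small sketch of the same filter on made-up timestamps, in case it helps to see it run. It relies on zero-padded "HH:MM:SS" strings sorting in clock order, and uses the element-wise & so that each row is tested against both bounds. The vector times and the names hms and in_window are hypothetical.

```r
# Made-up timestamps: one before, two inside and one after the window
times <- as.POSIXct(c("2017-05-19 10:30:00", "2017-05-19 11:01:00",
                      "2017-05-19 15:59:00", "2017-05-19 18:31:00"),
                    tz = "UTC")

# Zero-padded time-of-day strings compare in clock order
hms <- format(times, "%H:%M:%S")

# Element-wise test: inside the window only if both conditions hold
in_window <- hms >= "11:00:00" & hms <= "16:00:00"
times[in_window]
```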
Here is my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames, each with different fake years of 365 days. This means that each of the 10,000 data.frames needs to have a different start and end of year.
In total df has 14,965 days, which divided by 365 days gives 41 years. In other words, df needs to be grouped 10,000 different ways into 41 years (of 365 days each).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc., and the dates left over at the end of df need to wrap around to the beginning.
The grouped fake years need to appear in a third column of the data.frames.
I would put all the data.frames into a list, but I don't know how to write the function that generates 10,000 different year start dates and then groups each data.frame with a 365-day window 41 times.
Can anyone help me?
@gringer gave a good answer but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of numbers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this and still produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
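As an aside on the simplified result: if simplify=FALSE is left out, the numbers in the matrix can still be turned back into dates column by column, because they are day counts since 1970-01-01. A short sketch, where the small date range and the names dates, m and col1 are made up for illustration:

```r
# A small made-up date range to sample start dates from
dates <- seq(as.Date("1974-01-01"), by = "day", length.out = 100)

set.seed(1)
# Without simplify = FALSE the Date class is dropped: m is a numeric matrix
m <- replicate(3, seq(sample(dates, 1), length.out = 5, by = "day"))

# Days since 1970-01-01 convert back with an explicit origin
col1 <- as.Date(m[, 1], origin = "1970-01-01")
col1
```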
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, the result columns are subsets of the original data frame (where the arithmetic is already done): each one starts at a date in the first half of the frame and takes the next 14965 values. I'm using nrow(df)/2 rather than a hard-coded 14965 to keep the code more generic:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
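The wraparound can also be done with modular indexing instead of doubling the vector, and the fake-year label the question asks for can be added at the same time. A sketch for a single data frame, assuming the 14,965-day range from the question (the names idx, one and fake_year are hypothetical; wrap the body in lapply or replicate to get 10,000 of them):

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(start_day, end_day, by = "day")
n <- length(date_vec)  # 14965 = 41 * 365

set.seed(42)
start <- sample.int(n, 1)  # random starting position

# Positions start, start + 1, ..., wrapping back to 1 after n
idx <- ((start - 1 + 0:(n - 1)) %% n) + 1

one <- data.frame(date      = date_vec[idx],
                  fake_year = rep(seq_len(n / 365), each = 365))
head(one)
```

Each fake year is simply a block of 365 consecutive rows, so fake_year runs from 1 to 41 regardless of where the sequence starts.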
I have previous experience with MATLAB but am very new to R. The basic problem I am having is this:
I have data with 10 columns. The first 6 columns correspond to year, month, day, hour, min and secs.
E.g data_example =
2013 6 15 11 15 0 ...
2013 6 15 11 20 0 ...
2013 6 15 11 25 0 ...
In MATLAB, to deal with dates as numbers, I could easily compute this using datenum(data_example(:,1:6)),
but in R, what is the best way to get a similar numerical representation of the 6 columns?
Here are some alternatives. They all make use of ISOdatetime :
1) Assuming DF is your data frame try ISOdatetime like this:
DF$datetime <- ISOdatetime(DF[[1]], DF[[2]], DF[[3]], DF[[4]], DF[[5]], DF[[6]])
2) or like this:
DF$datetime <- do.call(ISOdatetime, setNames(as.list(DF[1:6]), NULL))
3a) If this is a time series suitable for zoo (distinct times and all numeric) then we could use read.zoo in the zoo package together with ISOdatetime like this:
library(zoo)
z <- read.zoo(DF, index = 1:6, FUN = ISOdatetime)
3b) or using read.zoo to read from a file or character string (latter shown here):
# sample input lines
Lines <- "2013 6 15 11 15 0 1
2013 6 15 11 20 0 2
2013 6 15 11 25 0 3
"
library(zoo)
z <- read.zoo(text = Lines, index = 1:6, FUN = ISOdatetime)
which gives this zoo series:
> z
2013-06-15 11:15:00 2013-06-15 11:20:00 2013-06-15 11:25:00
1 2 3
Use the parse_date_time function from the lubridate package.
x <- paste0(data_example[,1:6])
x <- parse_date_time(x,"%y%m%d %H%M")
More information in the documentation
EDIT
@joran told me to test it, and it didn't work, so I made some modifications:
data_example = data.frame(t(c(13,2,9,14,30)))
x <- paste0(data_example[,1:3],collapse="-")
y <- paste0(data_example[,4:5],collapse=":")
xy<- paste(x,y)
xy <- parse_date_time(xy,"%y%m%d %H%M")
xy
# "2013-02-09 14:30:00 UTC"
I don't know if there is a cleaner way to do it
The units of the returned value are a bit different in R than in Matlab (see comment in code). Also, since you have other columns in your data frame, you will first need to subset the data frame to contain only the relevant (6) date columns, then add them back to the data frame as a new column at the end.
test <- data.frame("year"=c(2013, 2013, 2013, 2001, 1970)
, "month"=c(6,6, 6, 4, 1)
, "day"=c(15,15, 15, 19, 1)
, "hour"=c(11,11, 11, 11, 0)
, "min"=c(15,20, 25, 30, 0)
, "second"=c(0,0, 0 ,0, 0))
# pad to the right # of digits
dates00 <- apply(test, c(1,2), sprintf, fmt="%02s")
# combine the date components in each row into a single string
dates0 <- apply(dates00, 1, paste, collapse=" ")
#format to a date object
dates <- as.POSIXct(dates0, format="%Y %m %d %H %M %S")
# numbers are seconds since "1970-01-01 00:00:00 UTC"; according
# to the help file for datenum, Matlab returns the number (from
# datenum) as fractional days since "January 0, 0000"
as.numeric(dates)
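If a MATLAB-style number is really needed, the seconds can be rescaled to days and shifted. This is only a sketch: the offset 719529 is my assumption for the value datenum gives for 1970-01-01, and the time zone is fixed to UTC so the result does not depend on the local machine. Check both against your own MATLAB output before relying on it.

```r
# One row of the example data, built directly from its components
dt <- ISOdatetime(2013, 6, 15, 11, 15, 0, tz = "UTC")

# Seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(dt)

# Fractional days, shifted by the assumed datenum of 1970-01-01
matlab_datenum <- secs / 86400 + 719529
matlab_datenum
```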
I don't often have to work with dates in R, but I imagine this is fairly easy. I have several years of daily data with some values, as below, and for each 8-day period I want to get the sum of the related values. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function, but for some reason it doesn't work and aggregates the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the package xts for such manipulations.
I read your data in as a zoo object. Note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the end-point indices of each period:
ep <- endpoints(ts.dat,'days',k=8)
Finally, I apply my function to the time series over each period:
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then it's just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
Those are NOT Date classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked since :
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
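The 8-day grouping that cut.Date performs can also be written as plain integer arithmetic on the day offsets, which may make it clearer what is going on. A sketch using the nine rows from the question (the names d, grp, binned and res are just for illustration):

```r
temp <- data.frame(
  Date   = c("24/03/2001", "25/03/2001", "26/03/2001", "27/03/2001",
             "28/03/2001", "29/03/2001", "30/03/2001", "31/03/2001",
             "1/04/2001"),
  amount = c(-3.927020472, -6.4427004, -5.477592528, -6.09462162,
             -7.45666902, -6.731540928, -6.855206184, -6.807210228,
             -5.40278802))

d <- as.Date(temp$Date, format = "%d/%m/%Y")

# Days since the first date, integer-divided into blocks of 8
grp <- as.integer(d - min(d)) %/% 8

# Label each block by its first day and sum within blocks
binned <- data.frame(start = min(d) + 8 * grp, amount = temp$amount)
res <- aggregate(amount ~ start, data = binned, FUN = sum)
res
```

Because the blocks are counted from the first date present, gaps in the dates do not shift later block boundaries.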
A cleaner approach, extending @G. Grothendieck's approach. Note: it does not take into account whether the dates are continuous or discontinuous; the sum is calculated over a fixed number of rows.
code
library(zoo) # provides rollapply
interval = 8 # your desired date interval. 2 days, 3 days or whatever
enddate = interval-1 # this sets the enddate
z <- aggregate(.~V1,data = df,sum) # aggregate sum of all duplicate dates
nrows = nrow(z) # must come after z is created
z$V1 <- as.Date(z$V1)
data.frame ( Start.date = (z[seq(1, nrows, interval),1]),
End.date = z[seq(1, nrows, interval)+enddate,1],
Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))
I have a factored time series that looks like this:
df <- data.frame(a=c("11-JUL-2004", "11-JUL-2005", "11-JUL-2006",
"11-JUL-2007", "11-JUL-2008"),
b=c("11-JUN-1999", "11-JUN-2000", "11-JUN-2001",
"11-JUN-2002", "11-JUN-2003"))
First, I would like to convert this to a format native to R. Second, I would like to calculate the number of months between the two columns.
Update:
Essentially I'm trying to recreate what I do in SPSS, in R.
In SPSS I would:
Convert the strings to date format DD-MMM-YYYY
COMPUTE. RND((a-b)/60/60/24/30.416)
30.416 is short for 365/12. I don't care so much about month edge cases, hence the rounding operation.
df <- data.frame(c("11-JUL-2004","11-JUL-2005","11-JUL-2006","11-JUL-2007","11-JUL-2008"),
c("11-JUN-1999","11-JUN-2000","11-JUN-2001","11-JUN-2002","11-JUN-2003"))
names(df) <- c("X1","X2")
df <- within(df, X1 <- as.Date(X1, format = "%d-%b-%Y"))
df <- within(df, X2 <- as.Date(X2, format = "%d-%b-%Y"))
Then difftime() will give the difference in weeks:
> with(df, difftime(X1, X2, units = "weeks"))
Time differences in weeks
[1] 265.2857 265.1429 265.1429 265.1429 265.2857
Or if we use Brandon's approximation:
> with(df, difftime(X1, X2) / 30.416)
Time differences in days
[1] 61.05339 61.02052 61.02052 61.02052 61.05339
Closest I could get with lubridate (as highlighted by Dirk) is (using the above df)
> m <- with(df, as.period(subtract_dates(X1, X2)))
> m
[1] 5 years and 1 month 5 years and 1 month 5 years and 1 month 5 years and 1 month 5 years and 1 month
> str(m)
Classes ‘period’ and 'data.frame': 5 obs. of 6 variables:
$ year : int 5 5 5 5 5
$ month : int 1 1 1 1 1
$ day : num 0 0 0 0 0
$ hour : int 0 0 0 0 0
$ minute: int 0 0 0 0 0
$ second: num 0 0 0 0 0
Josh is spot-on with respect to the difficulty of what a month could mean. The lubridate package has some answers on that.
In terms of base R, we can answer it for weeks though:
> df[,"pa"] <- as.POSIXct(strptime(as.character(df$a),
+ format="%d-%B-%Y", tz="GMT"))
> df[,"pb"] <- as.POSIXct(strptime(as.character(df$b),
+ format="%d-%B-%Y",tz="GMT"))
> df[,"weeks"] <- difftime(df$pa, df$pb, unit="weeks")
> df[,"months"] <- difftime(df$pa, df$pb, unit="days")/30.416
> df
a b pa pb weeks months
1 11-JUL-2004 11-JUN-1999 2004-07-11 1999-06-11 265.29 weeks 61.053 days
2 11-JUL-2005 11-JUN-2000 2005-07-11 2000-06-11 265.14 weeks 61.021 days
3 11-JUL-2006 11-JUN-2001 2006-07-11 2001-06-11 265.14 weeks 61.021 days
4 11-JUL-2007 11-JUN-2002 2007-07-11 2002-06-11 265.14 weeks 61.021 days
5 11-JUL-2008 11-JUN-2003 2008-07-11 2003-06-11 265.29 weeks 61.053 days
>
This uses the altered data.frame as per my edit so that we have proper column names. And if you throw an as.numeric() around difftime() you also get numbers.
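To illustrate the as.numeric() remark: dividing a difftime by 30.416 keeps the "days" label, so converting to a plain number first avoids the misleading unit. A small sketch with one pair of dates from the question, written in ISO format so that parsing does not depend on the locale (the names d and months_approx are made up):

```r
a <- as.Date("2004-07-11")
b <- as.Date("1999-06-11")

# Difference with explicit units, then stripped to a bare number
d <- difftime(a, b, units = "days")
months_approx <- as.numeric(d, units = "days") / 30.416

round(months_approx)
```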
Brandon,
You could do this with the lubridate package.
> library(lubridate)
Notify R that these are dates. Use the dmy() parser function because the dates are written Day, Month, Year (i.e, dmy).
> df <- transform(df, a = dmy(a), b = dmy(b))
Calculate the difference as a period. This will give you the number of whole years, months, days, etc.
> diff <- as.period(df$a - df$b)
Use math to convert the results to just months.
> 12* diff$year + diff$month
These were all 61 months apart. This would floor it to the nearest month. If you want to round based on the number of days you could do something like
> 12* diff$year + diff$month + round(diff$day/30)
I'm working on making these steps easier/more intuitive in the next version of lubridate.
> Data <- data.frame(
+ V1=c("11-JUL-2004","11-JUL-2005","11-JUL-2006","11-JUL-2007","11-JUL-2008"),
+ V2=c("11-JUN-1999","11-JUN-2000","11-JUN-2001","11-JUN-2002","11-JUN-2003"))
> Data[,1] <- as.Date(Data[,1],"%d-%b-%Y")
> Data[,2] <- as.Date(Data[,2],"%d-%b-%Y")
> # Assuming 30 days per month
> (Data[,1]-Data[,2])/30
Time differences in days
[1] 61.90000 61.86667 61.86667 61.86667 61.90000
> # Assuming 30.416 days per month
> (Data[,1]-Data[,2])/30.416
Time differences in days
[1] 61.05339 61.02052 61.02052 61.02052 61.05339
> # Assuming month crosses
> require(zoo)
> Data[,1] <- as.yearmon(Data[,1])
> Data[,2] <- as.yearmon(Data[,2])
> (Data[,1]-Data[,2])*12
[1] 61 61 61 61 61
Number 1 below seems closest to what you are asking for but 2 and 3 are alternatives you might also want to consider depending on your purpose. Also numbers 1 and 3 can be tried without rounding if you want to consider a fractional number of months.
# first convert columns of df to "Date" class
df[] <- lapply(df, as.Date, "%d-%b-%Y")
# 1. difference in days divided by 365.25/12
with(df, round((as.numeric(a) - as.numeric(b)) / (365.25/12)))
# 2. convert to 1st of month & then take diff in mos
library(zoo)
with(df, 12 * (as.yearmon(a) - as.yearmon(b)))
# 3. business style difference in months. See: ?"mondate-class"
library(mondate)
with(df, round(as.numeric(mondate(a) - mondate(b))))
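If installing zoo or mondate is not an option, a calendar-month difference similar to option 2 can be sketched in base R from the year and month fields of POSIXlt. The dates are written in ISO format here to sidestep locale-dependent month abbreviations, and the names la, lb, months_apart and approx are hypothetical.

```r
a <- as.Date(c("2004-07-11", "2005-07-11"))
b <- as.Date(c("1999-06-11", "2000-06-11"))

la <- as.POSIXlt(a)
lb <- as.POSIXlt(b)

# Whole calendar months between the two dates, ignoring the day of month
months_apart <- 12 * (la$year - lb$year) + (la$mon - lb$mon)

# For comparison, option 1: days divided by the average month length
approx <- round(as.numeric(a - b) / (365.25 / 12))

months_apart
approx
```

The first version counts month boundaries crossed, while the second measures elapsed time, so the two can differ by one near the ends of a month.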