Correct previous year by id within R - r

I have data something like this:
df <- data.frame(Id=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,9,9,9,9),Date=c("2013-04","2013-12","2013-01","2013-12","2013-11",
"2013-12","2012-04","2013-12","2012-08","2014-12","2013-08","2014-12","2013-08","2014-12","2011-01","2013-11","2013-12","2014-01","2014-04"))
To get the correct format:
df$Date <- paste0(df$Date,"-01")
I need only the years, such that each Id ends up with two consecutive years.
If I do something like this on the existing data:
require(lubridate)
df$Date <- year(as.Date(df$Date)-days(1))
I sometimes get the same year twice for a given Id.
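A minimal base R sketch of why this happens, assuming first-of-month dates as above: subtracting one day only moves a date into the previous year when it falls on 1 January, so two dates from the same calendar year usually keep the same year. format(..., "%Y") stands in for lubridate's year() here purely to keep the sketch dependency-free.

```r
# Two first-of-month dates from the same year, plus one 1 January date
dates <- as.Date(c("2013-04-01", "2013-12-01", "2013-01-01"))

# Subtract one day, then keep only the year
shifted_year <- as.integer(format(dates - 1, "%Y"))

# Only the 1 January date rolls back to the previous year
shifted_year
```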
The desired output for the column Date is this:
2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011 2013 2014
Please note that the last date for a given Id is always correct, so only the preceding year has to be corrected based on the last date. The dates have to be in a format that can be converted to years only, as shown.
EDIT Here is the case:
Id Date
1 2013-11-01
1 2013-12-01
1 2014-01-01
1 2014-04-01
Now I'm getting this: 2012,2013,2013,2013
I would need: 2012,2013,2013,2014

This is how I would solve this using the data.table package (though it looks overcomplicated to me):
library(data.table)
setDT(df)[, year := year(Date)][,
year := if(.N == 2) (year[2] - 1):year[2] else year,
Id][]
# Id Date year indx
# 1: 1 2013-04-01 2012 2
# 2: 1 2013-12-01 2013 2
# 3: 2 2013-01-01 2012 2
# 4: 2 2013-12-01 2013 2
# 5: 3 2013-11-01 2012 2
# 6: 3 2013-12-01 2013 2
# 7: 4 2012-04-01 2012 2
# 8: 4 2013-12-01 2013 2
# 9: 5 2012-08-01 2013 2
# 10: 5 2014-12-01 2014 2
# 11: 6 2013-08-01 2013 2
# 12: 6 2014-12-01 2014 2
# 13: 7 2013-08-01 2013 2
# 14: 7 2014-12-01 2014 2
# 15: 8 2011-01-01 2011 1
Or all in one step (thanks to @Arun for providing this):
setDT(df)[, year := {tmp = year(Date);
if (.N == 2L) (tmp[2]-1L):tmp[2] else tmp},
Id]
Edit:
Per the OP's new data, we can modify the code by adding an additional index:
setDT(df)[, indx := if(.N > 2) rep(seq_len(.N/2), each = 2) + 1L else .N, Id]
df[, year := {tmp = year(Date); if (.N > 1L) (tmp[2] - 1L):tmp[2] else tmp},
list(Id, indx)][]
# Id Date indx year
# 1: 1 2013-04-01 2 2012
# 2: 1 2013-12-01 2 2013
# 3: 2 2013-01-01 2 2012
# 4: 2 2013-12-01 2 2013
# 5: 3 2013-11-01 2 2012
# 6: 3 2013-12-01 2 2013
# 7: 4 2012-04-01 2 2012
# 8: 4 2013-12-01 2 2013
# 9: 5 2012-08-01 2 2013
# 10: 5 2014-12-01 2 2014
# 11: 6 2013-08-01 2 2013
# 12: 6 2014-12-01 2 2014
# 13: 7 2013-08-01 2 2013
# 14: 7 2014-12-01 2 2014
# 15: 8 2011-01-01 1 2011
# 16: 9 2013-11-01 2 2012
# 17: 9 2013-12-01 2 2013
# 18: 9 2014-01-01 3 2013
# 19: 9 2014-04-01 3 2014
Or another possible solution, provided by @akrun:
setDT(df)[, `:=`(year = year(Date), indx = .N, indx2 = as.numeric(gl(.N,2, .N))), Id]
df[indx > 1, year:=(year[2]-1):year[2], list(Id, indx2)][]

Using dplyr, with an approach similar to @David Arenburg's:
library(dplyr)
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date)),
year=replace(year, n()>1, c(year[2]-1, year[2])))
# Id Date year
#1 1 2013-04 2012
#2 1 2013-12 2013
#3 2 2013-01 2012
#4 2 2013-12 2013
#5 3 2013-11 2012
#6 3 2013-12 2013
#7 4 2012-04 2012
#8 4 2013-12 2013
#9 5 2012-08 2013
#10 5 2014-12 2014
#11 6 2013-08 2013
#12 6 2014-12 2014
#13 7 2013-08 2013
#14 7 2014-12 2014
#15 8 2011-01 2011
Or using base R
with(df, ave(as.numeric(sub('-.*', '', Date)), Id,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
Update
You can try
df$indx <- with(df, ave(Id, Id, FUN=function(x) (seq_along(x)-1)%/%2+1))
with(df, ave(as.numeric(sub('-.*', '', Date)), Id, indx,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
#[16] 2012 2013 2013 2014
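As a small illustration of the indx expression used above, here is a sketch of what (seq_along(x)-1)%/%2+1 produces for groups of different sizes. The helper name pair_index is hypothetical and only wraps that same arithmetic.

```r
# Hypothetical helper: label consecutive rows in pairs (1, 1, 2, 2, ...)
pair_index <- function(n) (seq_len(n) - 1) %/% 2 + 1

pair_index(1)  # a single row forms its own group
pair_index(2)  # one pair
pair_index(4)  # two pairs, as for Id 9 in the updated data
```

Each pair is then corrected independently, which is why a four-row Id gets two separate (last year - 1, last year) blocks.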
Or
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date))) %>%
group_by(indx=cumsum(rep(c(TRUE,FALSE), length.out=n())), .add=TRUE) %>% # .add replaces add= in dplyr >= 1.0.0
mutate(year=replace(year, n()>1, c(year[2]-1, year[2])))

Here's a dplyr solution. You can remove the intermediate fields last_year and year2, but I left them here for clarity:
library(stringr)
library(dplyr)
df %>%
group_by(Id) %>%
mutate(
last_year = last(as.integer(str_sub(Date, 1, 4))),
year2 = row_number() - n(),
year = last_year + year2
)
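To make the arithmetic of this approach concrete, here is a base R sketch of the same idea (last year of the group plus the row's offset from the last row), run on a reduced, assumed subset of the question's data. Note that for a group with more than two rows it counts back one year per row, which differs from the pairwise result requested in the question's EDIT.

```r
# Assumed subset of the question's data: Ids with 2, 1 and 4 rows
df <- data.frame(
  Id   = c(1, 1, 8, 9, 9, 9, 9),
  Date = c("2013-04", "2013-12", "2011-01",
           "2013-11", "2013-12", "2014-01", "2014-04")
)

yr <- as.integer(substr(df$Date, 1, 4))

# Within each Id: last year + (row position - group size)
df$year <- ave(yr, df$Id,
               FUN = function(x) x[length(x)] + seq_along(x) - length(x))
df
```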

Related

data.table: is it possible to merge .SD and return a new 'sub data table' by group?

I have a data table organized by id and year, with a frequency (freq) value for every year where the frequency is at least 1. The start and end year may differ for every id.
Example:
> dt <- data.table(id=c('A','A','A','A','B','B','B','B'),year=c(2010,2012,2013,2015,2006,2007,2010,2011),freq=c(2,1,4,3,1,3,5,7))
> dt
id year freq
1: A 2010 2
2: A 2012 1
3: A 2013 4
4: A 2015 3
5: B 2006 1
6: B 2007 3
7: B 2010 5
8: B 2011 7
I would like to make each time series by id complete, i.e. add rows with freq=0 for any missing year. So the result for the example above should look like this:
id year freq
A 2010 2
A 2011 0
A 2012 1
A 2013 4
A 2014 0
A 2015 3
B 2006 1
B 2007 3
B 2008 0
B 2009 0
B 2010 5
B 2011 7
I'm starting with data.table and I'm interested to see if this is doable. With plyr or dplyr I would have used a merge operation with a complete column of years for every sub dataframe by id. Is there an equivalent to this solution with data.table?
We can't use CJ-based approaches because the missing rows need to be by-id. An alternative is:
library(data.table)
dt[ dt[, .(year = do.call(seq, as.list(range(year)))), by = .(id)],
on = .(id, year)
][is.na(freq), freq := 0][]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7
Another solution, maybe more explicit than @r2evans'? First make a table of complete series:
years <- dt[, list(year= seq(min(year), max(year))), by= id]
years
id year
1: A 2010
2: A 2011
3: A 2012
4: A 2013
5: A 2014
6: A 2015
7: B 2006
8: B 2007
9: B 2008
10: B 2009
11: B 2010
12: B 2011
then merge and replace NAs:
full <- merge(dt, years, all.y= TRUE)
full[, freq := ifelse(is.na(freq), 0, freq)]
full
id year freq
1: A 2010 2
2: A 2011 0
3: A 2012 1
4: A 2013 4
5: A 2014 0
6: A 2015 3
7: B 2006 1
8: B 2007 3
9: B 2008 0
10: B 2009 0
11: B 2010 5
12: B 2011 7
Here is another data.table way to solve your problem:
dt[, .SD[.(min(year):max(year)), on="year"], by=id][is.na(freq), freq:=0]
# id year freq
# <char> <int> <num>
# 1: A 2010 2
# 2: A 2011 0
# 3: A 2012 1
# 4: A 2013 4
# 5: A 2014 0
# 6: A 2015 3
# 7: B 2006 1
# 8: B 2007 3
# 9: B 2008 0
# 10: B 2009 0
# 11: B 2010 5
# 12: B 2011 7
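For comparison, the same "build the full year range per id, then join and fill" idea can be sketched in base R alone. This is only an illustration under the assumption that a plain data.frame (here called dt_df, a hypothetical stand-in for dt) is acceptable; it is not a data.table idiom.

```r
# Hypothetical data.frame stand-in for the data.table in the question
dt_df <- data.frame(
  id   = c("A", "A", "A", "A", "B", "B", "B", "B"),
  year = c(2010, 2012, 2013, 2015, 2006, 2007, 2010, 2011),
  freq = c(2, 1, 4, 3, 1, 3, 5, 7)
)

# One complete min:max year sequence per id
full_years <- do.call(rbind, lapply(split(dt_df, dt_df$id), function(d) {
  data.frame(id = d$id[1], year = seq(min(d$year), max(d$year)))
}))

# Left join onto the complete grid, then fill the gaps with 0
out <- merge(full_years, dt_df, all.x = TRUE)
out$freq[is.na(out$freq)] <- 0
out <- out[order(out$id, out$year), ]
out
```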

data.table aggregation by one column using the maximum value of another column - R

I've got a data.table DT that I would like to aggregate by one column (year) using the maximum value of another column (month). Here's a sample of my data.table.
> DT <- data.table(month = c("2016-01", "2016-02", "2016-03", "2017-01", "2017-02", "2017-03")
, col1 = c(3,5,2,8,4,9)
, year = c(2016, 2016,2016, 2017,2017,2017))
> DT
month col1 year
1: 2016-01 3 2016
2: 2016-02 5 2016
3: 2016-03 2 2016
4: 2017-01 8 2017
5: 2017-02 4 2017
6: 2017-03 9 2017
The desired output
> ## desired output
> DT
month col1 year desired_output
1: 2016-01 3 2016 2
2: 2016-02 5 2016 2
3: 2016-03 2 2016 2
4: 2017-01 8 2017 9
5: 2017-02 4 2017 9
6: 2017-03 9 2017 9
Aggregating by the column year, the desired output should be the value of col1 for the latest month. But somehow the following code doesn't work: it gives me a warning and returns NAs. What am I doing wrong?
> ## wrong output
> DT[, output := col1[which.max(month)], by = .(year)]
Warning messages:
1: In which.max(month) : NAs introduced by coercion
2: In which.max(month) : NAs introduced by coercion
> DT
month col1 year output
1: 2016-01 3 2016 NA
2: 2016-02 5 2016 NA
3: 2016-03 2 2016 NA
4: 2017-01 8 2017 NA
5: 2017-02 4 2017 NA
6: 2017-03 9 2017 NA
We get the index of the max value in 'month' by converting it to the yearmon class from zoo, and use that index to pick the corresponding value from 'col1' when creating the 'desired_output' column, grouped by 'year':
library(zoo)
library(data.table)
DT[, desired_output := col1[which.max(as.yearmon(month))], .(year)]
DT
# month col1 year desired_output
#1: 2016-01 3 2016 2
#2: 2016-02 5 2016 2
#3: 2016-03 2 2016 2
#4: 2017-01 8 2017 9
#5: 2017-02 4 2017 9
#6: 2017-03 9 2017 9
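The warning in the question arises because which.max() tries to coerce the character 'month' column to numeric. As a dependency-free sketch of an alternative: max() does work on character vectors, and strings in a fixed "YYYY-MM" layout sort chronologically, so the latest month can be found by plain comparison. The data.frame below and the helper value_at_latest are assumptions made for illustration only.

```r
# Plain data.frame stand-in for DT (assumption for this sketch)
DT <- data.frame(
  month = c("2016-01", "2016-02", "2016-03", "2017-01", "2017-02", "2017-03"),
  col1  = c(3, 5, 2, 8, 4, 9),
  year  = c(2016, 2016, 2016, 2017, 2017, 2017)
)

# Hypothetical helper: value at the latest "YYYY-MM" string
value_at_latest <- function(value, month) value[which(month == max(month))[1]]

# Apply per year; ave() recycles the single value across each group
DT$desired_output <- ave(seq_len(nrow(DT)), DT$year, FUN = function(i) {
  value_at_latest(DT$col1[i], DT$month[i])
})
DT
```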
Or extract the 'month' and get the index of max value
DT[, desired_output := col1[which.max(month(as.IDate(paste0(month,
"-01"))))], .(year)]

How to add means to an existing column in R

I am manipulating a dataset but I can't make things right.
Here's an example for this, where df is the name of data frame.
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame df1 <- aggregate(value ~ year, df, mean, na.rm=TRUE)
And made this data frame df1:
year ID value
2013 avg 13.3
2014 avg 23.3
2015 avg 20
But I want to add each mean by year into each row of df.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20
Here is an option with data.table: we convert the 'data.frame' to a 'data.table' (setDT(df)), compute the mean of 'value' grouped by 'year' with 'ID' set to 'avg', then use rbindlist to bind both datasets and order by 'year'.
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or using the OP's method, rbind both the datasets and then order
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]
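A self-contained run of the rbind approach, as a sketch: the df below is rebuilt from the table in the question, and ID is converted to character explicitly so that the 'avg' label and the numeric IDs share one column type.

```r
# Data rebuilt from the question's table
df <- data.frame(
  year  = rep(2013:2015, each = 3),
  ID    = rep(1:3, times = 3),
  value = c(10, 20, 10, 20, 20, 30, 20, 10, 30)
)

# Yearly means (aggregate() returns the columns year and value)
df1 <- aggregate(value ~ year, df, mean)

# Label the means, stack them under the original rows, sort by year
df2 <- rbind(transform(df, ID = as.character(ID)),
             transform(df1, ID = "avg"))
df2 <- df2[order(df2$year), ]
df2
```

Because order() leaves ties in their original order, each 'avg' row ends up after the rows of its year.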

Finding the third Friday of a month and data table

I want to find the third Friday of each month, which is the delivery date of the futures. I used the solution here, getNthDayOfWeek from the RcppBDT package:
library(data.table)
library(RcppBDT)
data <- setDT(data.frame(mon=c(5:12, 1:12, 1:12, 1:4),
year=c(rep(2011,8), rep(2012,12), rep(2013,12), rep(2014,4))))
data[, third.friday:= getNthDayOfWeek(third, Fri, mon, year)]
However I get this message: Error: expecting a single value. What am I missing?
Since you did not specify a by clause in your transformation, := is (presumably) trying to apply getNthDayOfWeek as a vectorized function.
This should work:
data[
,third.friday := getNthDayOfWeek(third, Fri, mon, year)
,by = "mon,year"]
data
# mon year third.friday
#1: 5 2011 2011-05-20
#2: 6 2011 2011-06-17
#3: 7 2011 2011-07-15
#4: 8 2011 2011-08-19
#5: 9 2011 2011-09-16
#6: 10 2011 2011-10-21
#7: 11 2011 2011-11-18
#8: 12 2011 2011-12-16
#9: 1 2012 2012-01-20
Or, more generally, in case you have duplicate mon,year tuples in your object:
data[,Idx := 1:.N][
,third.friday := getNthDayOfWeek(third, Fri, mon, year)
,by = "mon,year,Idx"
][,Idx := NULL][]
# mon year third.friday
#1: 5 2011 2011-05-20
#2: 6 2011 2011-06-17
#3: 7 2011 2011-07-15
#4: 8 2011 2011-08-19
#5: 9 2011 2011-09-16
#6: 10 2011 2011-10-21
#7: 11 2011 2011-11-18
#8: 12 2011 2011-12-16
#9: 1 2012 2012-01-20
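If a vectorized calculation is preferred over grouping row by row, the third Friday can also be derived with base R date arithmetic. This is only a sketch: third_friday is a hypothetical helper, not part of RcppBDT, and it assumes mon and year are whole numbers.

```r
# Hypothetical helper: third Friday for vectors of month and year
third_friday <- function(mon, year) {
  first <- as.Date(sprintf("%d-%02d-01", as.integer(year), as.integer(mon)))
  wday  <- as.POSIXlt(first)$wday   # 0 = Sunday, ..., 5 = Friday
  # Days until the first Friday, then two more weeks
  first + (5 - wday) %% 7 + 14
}

third_friday(c(5, 12, 1), c(2011, 2011, 2012))
```

Because the helper works on whole vectors, it could be used inside := without a by clause.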

Calculating mean date by row

I wish to obtain the mean date by row, where each row contains two dates. Eventually I found a way, posted below. However, the approach I used seems rather cumbersome. Is there a better way?
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE
1 3 6 2012 3 10 2012 1
2 3 10 2012 3 20 2012 1
3 3 16 2012 3 30 2012 1
4 3 20 2012 4 8 2012 1
5 3 20 2012 4 9 2012 1
6 3 20 2012 4 10 2012 1
7 3 20 2012 4 11 2012 1
8 4 4 2012 4 5 2012 1
9 4 6 2012 4 6 2012 1
10 4 6 2012 4 7 2012 1
", header = TRUE, stringsAsFactors = FALSE)
my.data
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1, my.data$DAY1, my.data$YEAR1))
my.data$MY.DATE2 <- do.call(paste, list(my.data$MONTH2, my.data$DAY2, my.data$YEAR2))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$MY.DATE2 <- as.Date(my.data$MY.DATE2, format=c("%m %d %Y"))
my.data
desired.result = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
", header = TRUE, stringsAsFactors = FALSE)
Here is the approach that worked for me:
my.data$mean.date <- (my.data$MY.DATE1 + ((my.data$MY.DATE2 - my.data$MY.DATE1) / 2))
my.data
These approaches did not work:
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 0)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 1)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 0.5)
my.data$mean.data <- apply(my.data, 1, function(x) {(x[9] + x[10]) / 2})
I think I am supposed to use the Ops.Date command, but have not found an example.
Thank you for any suggestions.
Keep things simple and use mean.Date in base R.
mean.Date(as.Date(c("01-01-2014", "01-07-2014"), format=c("%m-%d-%Y")))
[1] "2014-01-04"
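One point worth spelling out, as a small sketch with assumed example dates: mean() averages the elements of its first argument only, so the two dates of a row must be combined into one vector first. Applied to two whole columns at once, that gives a single overall mean rather than one mean per row, whereas plain Date arithmetic is vectorized.

```r
d1 <- as.Date(c("2012-03-06", "2012-03-10"))
d2 <- as.Date(c("2012-03-10", "2012-03-20"))

# One Date: the mean over all four values
overall <- mean(c(d1, d2))

# One Date per row: midpoint between the paired dates
per_row <- d1 + (d2 - d1) / 2

overall
per_row
```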
Using the good advice of @jaysunice3401, I came up with this. If you want to keep the original data, you can add remove = FALSE in the two lines with unite:
library(dplyr)
library(tidyr)
my.data %>%
unite(whatever1, matches("1"), sep = "-") %>%
unite(whatever2, matches("2"), sep = "-") %>%
mutate(across(contains("whatever"), ~ as.Date(.x, "%m-%d-%Y"))) %>% # across() replaces the deprecated mutate_each(funs(...)); needs dplyr >= 1.0.0
rowwise %>%
mutate(mean.date = mean.Date(c(whatever1, whatever2)))
# OBS whatever1 whatever2 STATE mean.date
#1 1 2012-03-06 2012-03-10 1 2012-03-08
#2 2 2012-03-10 2012-03-20 1 2012-03-15
#3 3 2012-03-16 2012-03-30 1 2012-03-23
#4 4 2012-03-20 2012-04-08 1 2012-03-29
#5 5 2012-03-20 2012-04-09 1 2012-03-30
#6 6 2012-03-20 2012-04-10 1 2012-03-30
#7 7 2012-03-20 2012-04-11 1 2012-03-31
#8 8 2012-04-04 2012-04-05 1 2012-04-04
#9 9 2012-04-06 2012-04-06 1 2012-04-06
#10 10 2012-04-06 2012-04-07 1 2012-04-06
Maybe something like that?
library(data.table)
setDT(my.data)[, `:=`(MY.DATE1 = as.Date(paste(DAY1 ,MONTH1, YEAR1), format = "%d %m %Y"),
MY.DATE2 = as.Date(paste(DAY2 ,MONTH2, YEAR2), format = "%d %m %Y"))][,
mean.date := MY.DATE2 - ceiling((MY.DATE2 - MY.DATE1)/2)]
my.data
# OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
# 1: 1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
# 2: 2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
# 3: 3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
# 4: 4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
# 5: 5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
# 6: 6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
# 7: 7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
# 8: 8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
# 9: 9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
# 10: 10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
Or if you insist on using mean.Date, here's an alternative solution:
library(data.table)
setDT(my.data)[, `:=`(MY.DATE1 = as.Date(paste(DAY1 ,MONTH1, YEAR1), format = "%d %m %Y"),
MY.DATE2 = as.Date(paste(DAY2 ,MONTH2, YEAR2), format = "%d %m %Y"))][,
mean.date := mean.Date(c(MY.DATE1, MY.DATE2)), by = OBS]
One-liner (split for readability), uses lubridate and dplyr and (of course) pipes:
> require(lubridate)
> require(dplyr)
> my.data = my.data %>%
mutate(
MY.DATE1=as.Date(mdy(paste(MONTH1,DAY1,YEAR1))),
MY.DATE2=as.Date(mdy(paste(MONTH2,DAY2,YEAR2)))) %>%
rowwise %>%
mutate(mean.data=mean.Date(c(MY.DATE1,MY.DATE2))) %>% data.frame()
> head(my.data)
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2
1 1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10
2 2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20
3 3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30
4 4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08
5 5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09
6 6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10
mean.data
1 2012-03-08
2 2012-03-15
3 2012-03-23
4 2012-03-29
5 2012-03-30
6 2012-03-30
As an afterthought, if you like pipes, you can put a pipe in your pipe so you can pipe while you pipe - rewriting the first mutate step thus:
my.data %>% mutate(
MY.DATE1 = paste(MONTH1,DAY1,YEAR1) %>% mdy %>% as.Date,
MY.DATE2 = paste(MONTH2,DAY2,YEAR2) %>% mdy %>% as.Date)
1) Create Date class columns and then it's easy. No external packages are used:
asDate <- function(x) as.Date(x, "1970-01-01")
my.data2 <- transform(my.data,
date1 = as.Date(ISOdate(YEAR1, MONTH1, DAY1)),
date2 = as.Date(ISOdate(YEAR2, MONTH2, DAY2))
)
transform(my.data2, mean.date = asDate(rowMeans(cbind(date1, date2))))
If we did add a library(zoo) call then we could omit the asDate definition using as.Date in the last line instead of asDate since zoo adds a default origin to as.Date.
1a) A dplyr version would look like this (using asDate from above):
library(dplyr)
my.data %>%
mutate(
date1 = ISOdate(YEAR1, MONTH1, DAY1) %>% as.Date,
date2 = ISOdate(YEAR2, MONTH2, DAY2) %>% as.Date,
mean.date = cbind(date1, date2) %>% rowMeans %>% asDate)
2) Another way uses julian in the chron package. julian converts a month/day/year to the number of days since the Epoch. We can average the two julians and convert back to Date class:
library(zoo)
library(chron)
transform(my.data,
mean.date = as.Date( ( julian(MONTH1,DAY1,YEAR1) + julian(MONTH2,DAY2,YEAR2) )/2 )
)
We could omit library(zoo) if we used asDate from (1) in place of as.Date.
Update Discussed use of zoo to shorten the solutions and made further reductions in solution (1).
What about:
apply(my.data[,c("MY.DATE1","MY.DATE2")],1,function(date){substr(strptime(mean(c(strptime(date[1],"%y%y-%m-%d"),strptime(date[2],"%y%y-%m-%d"))),format="%y%y-%m-%d"),1,10)})
?
(I just had to use substr because of CET and CEST that put my output as a list...)
This is a vectorized version of the answer posted by jaysunice3401. It seems fairly straightforward, except that I had to use trial and error to identify the correct origin. I do not know how general origin = "1970-01-01" is, or whether a different origin would have to be specified for each data set.
According to this website: http://www.ats.ucla.edu/stat/r/faq/dates.htm
When R looks at dates as integers, its origin is January 1, 1970.
Which seems to suggest that origin = "1970-01-01" is fairly general. Although, if I had dates prior to "1970-01-01" in my data set I would definitely test the code before using it.
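A quick check of that origin, as a sketch with assumed example dates: a Date is stored as the number of days since 1970-01-01, and earlier dates are simply negative, so the same origin applies on both sides of 1970.

```r
# Day counts relative to the origin
n_origin <- as.numeric(as.Date("1970-01-01"))  # day zero
n_before <- as.numeric(as.Date("1969-12-31"))  # one day earlier, negative

# Averaging day counts and converting back also works across 1970
d <- as.Date(c("1969-12-30", "1970-01-03"))
mid <- as.Date(mean(as.numeric(d)), origin = "1970-01-01")

n_origin
n_before
mid
```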
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE
1 3 6 2012 3 10 2012 1
2 3 10 2012 3 20 2012 1
3 3 16 2012 3 30 2012 1
4 3 20 2012 4 8 2012 1
5 3 20 2012 4 9 2012 1
6 3 20 2012 4 10 2012 1
7 3 20 2012 4 11 2012 1
8 4 4 2012 4 5 2012 1
9 4 6 2012 4 6 2012 1
10 4 6 2012 4 7 2012 1
", header = TRUE, stringsAsFactors = FALSE)
desired.result = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
", header = TRUE, stringsAsFactors = FALSE)
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1,my.data$DAY1,my.data$YEAR1))
my.data$MY.DATE2 <- do.call(paste, list(my.data$MONTH2,my.data$DAY2,my.data$YEAR2))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$MY.DATE2 <- as.Date(my.data$MY.DATE2, format=c("%m %d %Y"))
my.data$mean.date2 <- as.Date( apply(my.data, 1, function(x) {
mean.Date(c(as.Date(x['MY.DATE1']), as.Date(x['MY.DATE2'])))
}) , origin = "1970-01-01")
my.data
desired.result

Resources