Categorize based on date ranges in R

How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?
My large dataframe, captures, looks similar to this when called via head(captures) :
id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1
My small dataframe, seasons, looks similar to this in its entirety:
Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15
I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date falls in the ranges defined in seasons.
Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.
#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
for (j in 1:length(captures$id)){
captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
}
}
Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!

Here's an approach using non-equi joins from data.table:
require(data.table) # v1.10.4+
setDT(captures) # convert data.frames to data.tables
setDT(seasons)
ans <- seasons[captures, Season,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first"]
# [1] 2016 2016 2016 2015 2015 2015
captures[, season := ans]
For each row in captures, the index of the first matching row (mult="first") in seasons is determined from the condition given to the on argument. The value of Season at those indices is then returned and saved as ans. It is then added as a new column to captures by reference.
I've shown it in two steps for sake of understanding.
You can see the first matching indices by using which=TRUE instead:
seasons[captures,
on=.(Opening.Date<=date, Closing.Date>=date),
mult="first",
which=TRUE]
# [1] 1 1 1 2 2 2
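The two steps above can also be folded into a single update join. The following is a minimal sketch, not the answer author's code: it assumes a data.table version that supports non-equi joins (as the answer above does), uses a cut-down copy of the sample data, and adds a made-up off-season row (id 999) to show that unmatched captures are left as NA.

```r
library(data.table)  # assumes a version with non-equi join support

seasons <- data.table(
  Season       = c(2016L, 2015L),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10"))
)
captures <- data.table(
  id   = c(160520L, 2058366L, 999L),  # id 999 is a hypothetical off-season row
  date = as.Date(c("2016-11-22", "2015-09-26", "2016-06-01"))
)

# Update join: for each row of seasons, find the captures rows whose date lies
# in [Opening.Date, Closing.Date] and write that Season into them by reference.
# Rows of captures that match no season keep NA in the new column.
captures[seasons,
         on = .(date >= Opening.Date, date <= Closing.Date),
         season := i.Season]

captures
```

Because the assignment happens by reference, no copy of the large table is made, which matters at the 2M-row scale described in the question.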

It would indeed be great to join efficiently on a range of values rather than on equality. Unfortunately, I don't know whether a general solution exists. For the time being, I suggest using a single for loop.
Vectorization pays off most along the longer dimension. If we loop over one data.frame and vectorize over the other, it makes sense to loop over the short one and vectorize over the long one. With this in mind, we'll loop over the rows of seasons and vectorize across the 2M rows of data.
Your data:
txt <- "Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)
To start the process, we assume that no row has a season assigned yet:
dat$season <- NA
Loop around each of the seasons' rows:
for (i in seq_len(nrow(seasons))) {
dat$season <- ifelse(is.na(dat$season) &
dat$date >= seasons$Opening.Date[i] &
dat$date <= seasons$Closing.Date[i],  # inclusive, so captures on the closing date count
seasons$Season[i], dat$season)
}
dat
# id date sex season
# 1 160520 2016-11-22 1 2016
# 2 1029735 2016-11-12 1 2016
# 3 1885200 2016-11-05 1 2016
# 4 2058366 2015-09-26 2 2015
# 5 2058367 2015-09-26 1 2015
# 6 2058368 2015-09-26 1 2015
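If the seasons never overlap, as in the sample data, the loop can be dropped altogether. The sketch below swaps in a different technique, base R's findInterval, and is only an illustration under that non-overlap assumption. It uses a cut-down copy of the data plus one made-up off-season row (id 1) to show the NA case.

```r
seasons <- data.frame(
  Season       = c(2016, 2015, 2014),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26", "2014-09-27")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10", "2015-01-11"))
)
dat <- data.frame(
  id   = c(160520, 2058366, 1),  # id 1 is a hypothetical off-season row
  date = as.Date(c("2016-11-22", "2015-09-26", "2016-06-01"))
)

# findInterval needs the break points sorted in increasing order.
s <- seasons[order(seasons$Opening.Date), ]

# Index of the latest opening date that is <= each capture date
# (0 means the capture predates every season).
idx <- findInterval(as.numeric(dat$date), as.numeric(s$Opening.Date))
idx[idx == 0] <- NA

# A capture belongs to that season only if it is also on or before the close.
in_season  <- !is.na(idx) & dat$date <= s$Closing.Date[idx]
dat$season <- ifelse(in_season, s$Season[idx], NA)

dat
```

This makes one vectorized pass over the long table instead of one pass per season. If seasons could overlap, the loop or a join would be the safer choice.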

You could try sqldf. Note that I had to replace the dot in Opening.Date and Closing.Date with an underscore ("_"), since a dot has a special meaning in SQL column references.
library(sqldf)
# A left join keeps every capture, including those outside all seasons (Season = NA)
captures <- sqldf("select c.*, s.Season from captures c
left join seasons s
on c.date >= s.Opening_Date and c.date <= s.Closing_Date")
captures
id date sex Season
1 160520 2016-11-22 1 2016
2 1029735 2016-11-12 1 2016
3 1885200 2016-11-05 1 2016
4 2058366 2015-09-26 2 2015
5 2058367 2015-09-26 1 2015
6 2058368 2015-09-26 1 2015
data
txt <- "Season Opening_Date Closing_Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
captures <- read.table(text = txt, header = TRUE)
captures$date <- as.Date(captures$date)

Related

Averaging a monthly time series with incomplete observations

I have the following dataset:
id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
I would like to create monthly averages of observation_value. In those cases that there are no values for a certain month, I would like to fill in the data with the average between the months where I have data.
Using the data in the Note at the end -- we have added a second id -- convert to zoo using column 1 to split by and column 2 as the index with yearmon class. Also in the same statement aggregate using mean over year/month giving the zoo object z. Then convert to ts which will fill in the missing months with NA and then convert back to zoo and use na.approx to fill in the NAs (or use na.spline or na.locf depending on what you want). fortify.zoo(zz) and fortify.zoo(zz, melt = TRUE) can be used to convert zoo objects to data frames.
library(zoo)
z <- read.zoo(dat, FUN = as.yearmon, index = 2, split = 1, aggregate = mean)
zz <- na.approx(as.zoo(as.ts(z)))
giving
> zz
1 2
Feb 2015 5.5 5.5
Mar 2015 24.0 24.0
Apr 2015 18.5 18.5
May 2015 13.0 13.0
Jun 2015 7.5 7.5
Jul 2015 2.0 2.0
Aug 2015 5.5 5.5
Sep 2015 9.0 9.0
Oct 2015 10.0 10.0
Nov 2015 11.0 11.0
Dec 2015 12.0 12.0
Note
Lines <- "id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
2 2015-02-23 5
2 2015-02-24 6
2 2015-03-01 24
2 2015-07-16 2
2 2015-09-28 9
2 2015-12-05 12"
dat <- read.table(text = Lines, header = TRUE)
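For readers without zoo, the same steps (average per month, then fill the gaps by linear interpolation) can be sketched in base R. This is an illustrative alternative rather than the answer's method. It uses a single id and a shortened copy of the data, and the variable names are made up for the example.

```r
dat <- data.frame(
  observation_date  = as.Date(c("2015-02-23", "2015-02-24",
                                "2015-03-01", "2015-07-16")),
  Observation_value = c(5, 6, 24, 2)
)

# Step 1: mean per calendar month that has data.
ym           <- format(dat$observation_date, "%Y-%m")
monthly_mean <- tapply(dat$Observation_value, ym, mean)
observed     <- as.Date(paste0(names(monthly_mean), "-01"))

# Step 2: the full run of months from first to last observation.
all_months <- seq(min(observed), max(observed), by = "month")

# Step 3: interpolate linearly on the month position, so that each missing
# month sits evenly between its observed neighbours.
pos    <- match(observed, all_months)
filled <- approx(x = pos, y = as.numeric(monthly_mean),
                 xout = seq_along(all_months))$y

result <- data.frame(month = format(all_months, "%Y-%m"), value = filled)
result
```

For several ids the same steps would be applied per group, for example with split and lapply.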

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a running day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the column counts the days across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result is wrong and the code is too long. It doesn't count February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, look for tools to do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
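To see why the Date class is worth using here, the short sketch below shows that date subtraction already accounts for leap years, which is the part the hand-written function in the question gets wrong. The start date is an assumption chosen for illustration.

```r
# Feb 28 to Mar 1 spans a different number of days in a leap year.
leap_gap  <- as.integer(as.Date("2012-03-01") - as.Date("2012-02-28"))
plain_gap <- as.integer(as.Date("2013-03-01") - as.Date("2013-02-28"))
c(leap_gap, plain_gap)

# Running day count from an assumed start date, with the start as day 1.
start  <- as.Date("2011-01-01")
d      <- as.Date(c("2011-01-05", "2012-02-24", "2013-06-24"))
day_no <- as.integer(d - start) + 1L
day_no
```

Counting from 2011-01-01 as day 1 gives the same numbers as the days_since_2010 column above, which measures from 2010-12-31.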

R data.table merge drops rows (December only)

Solved with the help of @Uwe Block.
R data.table merge drops December observations by shifting the month-index back in one data set while trying to merge a monthly data set onto a set of daily observations. What's a good way to do this merge that works as expected?
Using merge per @Harry Daniels, merge(monthly, daily, by=c("year","month"), all=TRUE) instead of daily[monthly, on=c("year","month"), all=TRUE] retains all daily observations correctly, but the monthly data are still shifted so that January->0.
Problem: generating the month and year columns on the monthly dataset made months not quite exactly integer values. I.e. 1 was actually 0.999999999999091 so the merge took the floor internally and offset it.
Example: `monthly[, month := 100 * (Date %% 1)]`, where Date was stored as numeric 2016.01, 2016.02, ..., 2016.12.
See the following:
> monthly
year month CPI
1: 2016 1 236.916
2: 2016 2 237.111
3: 2016 3 238.132
4: 2016 4 239.261
5: 2016 5 240.229
6: 2016 6 241.018
7: 2016 7 240.628
8: 2016 8 240.849
9: 2016 9 241.428
10: 2016 10 241.729
11: 2016 11 241.353
12: 2016 12 241.432
> daily
date year month close
1: 2016-01-04 2016 1 2012.66
2: 2016-01-05 2016 1 2016.71
3: 2016-01-06 2016 1 1990.26
4: 2016-01-07 2016 1 1943.09
5: 2016-01-08 2016 1 1922.03
---
248: 2016-12-23 2016 12 2263.79
249: 2016-12-27 2016 12 2268.88
250: 2016-12-28 2016 12 2249.92
251: 2016-12-29 2016 12 2249.26
252: 2016-12-30 2016 12 2238.83
> daily[monthly, on=c("year","month")]
date year month close CPI
1: <NA> 2016 0 NA 236.916
2: 2016-01-04 2016 1 2012.66 237.111
3: 2016-01-05 2016 1 2016.71 237.111
4: 2016-01-06 2016 1 1990.26 237.111
5: 2016-01-07 2016 1 1943.09 237.111
---
228: 2016-11-23 2016 11 2204.72 241.432
229: 2016-11-25 2016 11 2213.35 241.432
230: 2016-11-28 2016 11 2201.72 241.432
231: 2016-11-29 2016 11 2204.66 241.432
232: 2016-11-30 2016 11 2198.81 241.432
> merge(monthly, daily, by=c("year","month"), all=TRUE)
year month CPI close
1: 2016 0 236.916 NA
2: 2016 1 237.111 2012.66
3: 2016 1 237.111 2016.71
4: 2016 1 237.111 1990.26
5: 2016 1 237.111 1943.09
---
249: 2016 12 NA 2263.79
250: 2016 12 NA 2268.88
251: 2016 12 NA 2249.92
252: 2016 12 NA 2249.26
253: 2016 12 NA 2238.83
This should suffice:
merge(monthly, daily , by = 'month', all = TRUE )
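The root cause described in the question, month values that are not exactly integers, can be addressed before merging. The following is a minimal base R sketch: it rebuilds year and month from a numeric Date such as 2016.01 and rounds before converting to integer, so that both tables share exact integer keys. The vector below is a shortened stand-in for the real column.

```r
# Date stored as numeric year.month, as described in the question.
Date <- c(2016.01, 2016.02, 2016.12)

# The fractional part is not exactly representable in floating point, so
# this can land slightly off the intended integers (the question reports
# 0.999999999999091 for January).
raw_month <- 100 * (Date %% 1)
print(raw_month, digits = 15)

# Round first, then convert to integer, so the join keys are exact.
year  <- as.integer(floor(Date))
month <- as.integer(round(100 * (Date %% 1)))
data.frame(year, month)
```

With integer keys in both monthly and daily, joining on c("year", "month") no longer depends on how near-integer doubles are compared or truncated.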

How to add means to an existing column in R

I am manipulating a dataset but I can't make things right.
Here's an example for this, where df is the name of data frame.
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame: df1 <- aggregate(value ~ year, df, mean, na.rm=TRUE)
And made this data frame df1:
year ID value
2013 avg 13.3
2014 avg 23.3
2015 avg 20
But I want to add each mean by year into each row of df.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20
Here is an option with data.table: convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'year', take the mean of 'value' with 'ID' set to 'avg', then use rbindlist to rbind both datasets and order by 'year'.
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or using the OP's method, rbind both the datasets and then order
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]

Calculating mean date by row

I wish to obtain the mean date by row, where each row contains two dates. Eventually I found a way, posted below. However, the approach I used seems rather cumbersome. Is there a better way?
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE
1 3 6 2012 3 10 2012 1
2 3 10 2012 3 20 2012 1
3 3 16 2012 3 30 2012 1
4 3 20 2012 4 8 2012 1
5 3 20 2012 4 9 2012 1
6 3 20 2012 4 10 2012 1
7 3 20 2012 4 11 2012 1
8 4 4 2012 4 5 2012 1
9 4 6 2012 4 6 2012 1
10 4 6 2012 4 7 2012 1
", header = TRUE, stringsAsFactors = FALSE)
my.data
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1, my.data$DAY1, my.data$YEAR1))
my.data$MY.DATE2 <- do.call(paste, list(my.data$MONTH2, my.data$DAY2, my.data$YEAR2))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$MY.DATE2 <- as.Date(my.data$MY.DATE2, format=c("%m %d %Y"))
my.data
desired.result = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
", header = TRUE, stringsAsFactors = FALSE)
Here is the approach that worked for me:
my.data$mean.date <- (my.data$MY.DATE1 + ((my.data$MY.DATE2 - my.data$MY.DATE1) / 2))
my.data
These approaches did not work:
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 0)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 1)
my.data$mean.date <- mean(my.data$MY.DATE1, my.data$MY.DATE2, trim = 0.5)
my.data$mean.data <- apply(my.data, 1, function(x) {(x[9] + x[10]) / 2})
I think I am supposed to use the Ops.Date command, but have not found an example.
Thank you for any suggestions.
Keep things simple and use mean.Date in base R.
mean.Date(as.Date(c("01-01-2014", "01-07-2014"), format=c("%m-%d-%Y")))
[1] "2014-01-04"
Using the good advice of @jaysunice3401, I came up with this. If you want to keep the original columns, you can add remove = FALSE in the two lines with unite.
library(dplyr)
library(tidyr)
my.data %>%
unite(whatever1, matches("1"), sep = "-") %>%
unite(whatever2, matches("2"), sep = "-") %>%
mutate_each(funs(as.Date(., "%m-%d-%Y")), contains("whatever")) %>%
rowwise %>%
mutate(mean.date = mean.Date(c(whatever1, whatever2)))
# OBS whatever1 whatever2 STATE mean.date
#1 1 2012-03-06 2012-03-10 1 2012-03-08
#2 2 2012-03-10 2012-03-20 1 2012-03-15
#3 3 2012-03-16 2012-03-30 1 2012-03-23
#4 4 2012-03-20 2012-04-08 1 2012-03-29
#5 5 2012-03-20 2012-04-09 1 2012-03-30
#6 6 2012-03-20 2012-04-10 1 2012-03-30
#7 7 2012-03-20 2012-04-11 1 2012-03-31
#8 8 2012-04-04 2012-04-05 1 2012-04-04
#9 9 2012-04-06 2012-04-06 1 2012-04-06
#10 10 2012-04-06 2012-04-07 1 2012-04-06
Maybe something like that?
library(data.table)
setDT(my.data)[, `:=`(MY.DATE1 = as.Date(paste(DAY1 ,MONTH1, YEAR1), format = "%d %m %Y"),
MY.DATE2 = as.Date(paste(DAY2 ,MONTH2, YEAR2), format = "%d %m %Y"))][,
mean.date := MY.DATE2 - ceiling((MY.DATE2 - MY.DATE1)/2)]
my.data
# OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
# 1: 1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
# 2: 2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
# 3: 3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
# 4: 4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
# 5: 5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
# 6: 6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
# 7: 7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
# 8: 8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
# 9: 9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
# 10: 10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
Or if you insist on using mean.Date, here's an alternative solution:
library(data.table)
setDT(my.data)[, `:=`(MY.DATE1 = as.Date(paste(DAY1 ,MONTH1, YEAR1), format = "%d %m %Y"),
MY.DATE2 = as.Date(paste(DAY2 ,MONTH2, YEAR2), format = "%d %m %Y"))][,
mean.date := mean.Date(c(MY.DATE1, MY.DATE2)), by = OBS]
One-liner (split for readability), uses lubridate and dplyr and (of course) pipes:
> require(lubridate)
> require(dplyr)
> my.data = my.data %>%
mutate(
MY.DATE1=as.Date(mdy(paste(MONTH1,DAY1,YEAR1))),
MY.DATE2=as.Date(mdy(paste(MONTH2,DAY2,YEAR2)))) %>%
rowwise %>%
mutate(mean.data=mean.Date(c(MY.DATE1,MY.DATE2))) %>% data.frame()
> head(my.data)
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2
1 1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10
2 2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20
3 3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30
4 4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08
5 5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09
6 6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10
mean.data
1 2012-03-08
2 2012-03-15
3 2012-03-23
4 2012-03-29
5 2012-03-30
6 2012-03-30
As an afterthought, if you like pipes, you can put a pipe in your pipe so you can pipe while you pipe - rewriting the first mutate step thus:
my.data %>% mutate(
MY.DATE1 = paste(MONTH1,DAY1,YEAR1) %>% mdy %>% as.Date,
MY.DATE2 = paste(MONTH2,DAY2,YEAR2) %>% mdy %>% as.Date)
1) Create Date class columns and then it's easy. No external packages are used:
asDate <- function(x) as.Date(x, "1970-01-01")
my.data2 <- transform(my.data,
date1 = as.Date(ISOdate(YEAR1, MONTH1, DAY1)),
date2 = as.Date(ISOdate(YEAR2, MONTH2, DAY2))
)
transform(my.data2, mean.date = asDate(rowMeans(cbind(date1, date2))))
If we did add a library(zoo) call then we could omit the asDate definition using as.Date in the last line instead of asDate since zoo adds a default origin to as.Date.
1a) A dplyr version would look like this (using asDate from above):
library(dplyr)
my.data %>%
mutate(
date1 = ISOdate(YEAR1, MONTH1, DAY1) %>% as.Date,
date2 = ISOdate(YEAR2, MONTH2, DAY2) %>% as.Date,
mean.date = cbind(date1, date2) %>% rowMeans %>% asDate)
2) Another way uses julian in the chron package. julian converts a month/day/year to the number of days since the Epoch. We can average the two julians and convert back to Date class:
library(zoo)
library(chron)
transform(my.data,
mean.date = as.Date( ( julian(MONTH1,DAY1,YEAR1) + julian(MONTH2,DAY2,YEAR2) )/2 )
)
We could omit library(zoo) if we used asDate from (1) in place of as.Date.
Update: Discussed use of zoo to shorten the solutions and made further reductions in solution (1).
what about :
apply(my.data[,c("MY.DATE1","MY.DATE2")],1,function(date){substr(strptime(mean(c(strptime(date[1],"%y%y-%m-%d"),strptime(date[2],"%y%y-%m-%d"))),format="%y%y-%m-%d"),1,10)})
?
(I just had to use substr because of CET and CEST that put my output as a list...)
This is a vectorized version of the answer posted by jaysunice3401. It seems fairly straight-forward, except that I had to use trial-and-error to identify the correct origin. I do not know how general origin = "1970-01-01" is or whether a different origin would have to be specified with each data set.
According to this website: http://www.ats.ucla.edu/stat/r/faq/dates.htm
When R looks at dates as integers, its origin is January 1, 1970.
Which seems to suggest that origin = "1970-01-01" is fairly general. Although, if I had dates prior to "1970-01-01" in my data set I would definitely test the code before using it.
my.data = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE
1 3 6 2012 3 10 2012 1
2 3 10 2012 3 20 2012 1
3 3 16 2012 3 30 2012 1
4 3 20 2012 4 8 2012 1
5 3 20 2012 4 9 2012 1
6 3 20 2012 4 10 2012 1
7 3 20 2012 4 11 2012 1
8 4 4 2012 4 5 2012 1
9 4 6 2012 4 6 2012 1
10 4 6 2012 4 7 2012 1
", header = TRUE, stringsAsFactors = FALSE)
desired.result = read.table(text = "
OBS MONTH1 DAY1 YEAR1 MONTH2 DAY2 YEAR2 STATE MY.DATE1 MY.DATE2 mean.date
1 3 6 2012 3 10 2012 1 2012-03-06 2012-03-10 2012-03-08
2 3 10 2012 3 20 2012 1 2012-03-10 2012-03-20 2012-03-15
3 3 16 2012 3 30 2012 1 2012-03-16 2012-03-30 2012-03-23
4 3 20 2012 4 8 2012 1 2012-03-20 2012-04-08 2012-03-29
5 3 20 2012 4 9 2012 1 2012-03-20 2012-04-09 2012-03-30
6 3 20 2012 4 10 2012 1 2012-03-20 2012-04-10 2012-03-30
7 3 20 2012 4 11 2012 1 2012-03-20 2012-04-11 2012-03-31
8 4 4 2012 4 5 2012 1 2012-04-04 2012-04-05 2012-04-04
9 4 6 2012 4 6 2012 1 2012-04-06 2012-04-06 2012-04-06
10 4 6 2012 4 7 2012 1 2012-04-06 2012-04-07 2012-04-06
", header = TRUE, stringsAsFactors = FALSE)
my.data$MY.DATE1 <- do.call(paste, list(my.data$MONTH1,my.data$DAY1,my.data$YEAR1))
my.data$MY.DATE2 <- do.call(paste, list(my.data$MONTH2,my.data$DAY2,my.data$YEAR2))
my.data$MY.DATE1 <- as.Date(my.data$MY.DATE1, format=c("%m %d %Y"))
my.data$MY.DATE2 <- as.Date(my.data$MY.DATE2, format=c("%m %d %Y"))
my.data$mean.date2 <- as.Date( apply(my.data, 1, function(x) {
mean.Date(c(as.Date(x['MY.DATE1']), as.Date(x['MY.DATE2'])))
}) , origin = "1970-01-01")
my.data
desired.result
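On the question of how general origin = "1970-01-01" is: the small check below, a sketch with made-up dates, suggests it does not depend on the data set, because a Date is stored as the number of days since 1970-01-01 and dates before that are simply negative. It also shows a vectorized row mean that needs no apply.

```r
# A Date is a day count relative to 1970-01-01.
as.numeric(as.Date("1970-01-01"))   # 0
as.numeric(as.Date("1969-12-30"))   # -2, dates before the origin are negative

# Two hypothetical date columns, one pair straddling the origin.
d1 <- as.Date(c("2012-03-06", "1969-12-30"))
d2 <- as.Date(c("2012-03-10", "1970-01-03"))

# cbind drops the Date class and leaves day counts, rowMeans averages them,
# and as.Date with the fixed origin turns the result back into dates.
mean_date <- as.Date(rowMeans(cbind(d1, d2)), origin = "1970-01-01")
mean_date
```

The origin is a property of how R stores Date values, so the same value applies whatever dates the data contain.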
