R: Extract data based on date, "if date lesser than"

I have a dataset with two values for each date like this:
date x y
1 2013-05-01 1 2
2 2013-05-02 2 2
3 2013-05-03 3 2
date is stored in Date format (via as.Date), using the lubridate package.
Now I want to have the mean of the two values, except for a certain time span, in which I want to use the values of x.
I tried the following:
mean=(x+y)/2
newdata=ifelse((data$date < 2013-10-01 | date$date > 2014-04-09), mean, x)
but it will just take the mean for all dates.
Is it possible to use greater/lesser than relationships for dates?
Any suggestions on how to make this work?
Thanks in advance

It looks like you are not casting the comparison values as dates. Also the dates you used for comparison don't exclude any of the dates in the dataframe you provided so I'd expect the mean to be selected every time.
date <- as.Date(c('2013-05-01', '2013-05-02', '2013-05-03'))
x <- c(1, 2, 3)
y <- c(2, 2, 2)
mean <- (x + y)/2
df <- data.frame(date = date, x = x, y = y)
newdata <- ifelse((df$date < as.Date('2013-05-02') | df$date > as.Date('2014-04-09')), mean, x)
newdata
I changed the dates in the condition to be more selective and I got 1.5 2.0 3.0. It selects the first value from mean and the others from x, which agrees with the condition I used in the ifelse().
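Since the question already loads lubridate, its ymd() parser works just as well for the comparison constants; a minimal equivalent sketch:
library(lubridate)
# ymd('2013-05-02') parses to a Date, so the comparison behaves
# exactly like the as.Date() version above
newdata2 <- ifelse(df$date < ymd('2013-05-02') | df$date > ymd('2014-04-09'), mean, x)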

How about something like this:
library(lubridate)
library(data.table)
##
set.seed(123)
Data <- data.frame(
  date = as.Date(ymd(20130904)) + 0:364,
  x = as.numeric(sample(1:3, 365, replace = TRUE)),
  y = as.numeric(sample(1:3, 365, replace = TRUE)))
setDT(Data)
##
xSpan <- seq.Date(
  from = as.Date("2013-10-01"),
  to = as.Date("2014-04-09"),
  by = "day")
##
(Edited - I forgot to group by date.)
Data[, z := ifelse(
  date %in% xSpan,
  x,
  mean(c(x, y))),
  by = date]
##
> head(Data)
date x y z
1: 2013-09-04 1 3 2.0
2: 2013-09-05 3 1 2.0
3: 2013-09-06 2 1 1.5
4: 2013-09-07 3 2 2.5
5: 2013-09-08 3 2 2.5
6: 2013-09-09 1 2 1.5
> head(subset(Data, date %in% xSpan))
date x y z
1: 2013-10-01 2 3 2
2: 2013-10-02 1 3 1
3: 2013-10-03 1 1 1
4: 2013-10-04 3 1 3
5: 2013-10-05 3 1 3
6: 2013-10-06 3 1 3
I just defined xSpan as a contiguous sequence of days over which one of the functions is used (in your example, just the identity function of x). Dates not included in this time span will use mean to determine their value of z.
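As an aside (my addition, not part of the answer): when each date occurs on exactly one row, the by=date grouping can be dropped, because the row-wise mean of x and y is simply (x + y)/2. A vectorized sketch under that assumption:
# assumes one row per date; otherwise keep the grouped version above
Data[, z := ifelse(date %in% xSpan, x, (x + y)/2)]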

Related

Conditional accumulating sum with a dynamic condition

Good afternoon,
I'm trying to create an accumulating mean with a "twist": I only want to average fields dated before the current one (there may be fields with the same date).
I succeeded doing it the "dirty way" using several custom functions, but it takes too long and is very inefficient; I'm pretty sure there is a better way.
I was thinking of something along the lines of:
averages <- DB %>% group_by(field1, field2) %>% mutate(Avg = cummean(??? * value1))
How do I access the current observation for the cummean function?
The route I went was to create a logical vector for every subset with a loop:
for (i in 1:(length(datevector) - 1)) {
  logicalvector[i] <- datevector[length(datevector)] > datevector[i]
}
logicalvector[length(datevector)] <- FALSE
and use that in another function to calculate the mean.
a simple example is:
df <- data.frame(id=1:5,Date=as.Date(c("2013-08-02","2013-08-02","2013-08-03","2013-08-03","2013-08-04")),Value=c(1,4,5,2,4))
id Date Value accum mean
1 02/08/2013 1 0
2 02/08/2013 4 0
3 03/08/2013 5 2.5
4 03/08/2013 2 2.5
5 04/08/2013 4 3
Explanation:
There are no observations with a prior date for the first 2 observations, so the mean is 0.
The 3rd observation averages the 1st and 2nd; so does the 4th.
The 5th observation averages all four prior values.
This can be implemented as a complex self-join in SQL. This joins to each row all rows with a lesser Date and, for each row, averages Value over the joined rows. coalesce is used to assign 0 where the mean would otherwise be NULL.
library(sqldf)
sqldf("select a.*, coalesce(avg(b.Value), 0) as mean
from df as a
left join df as b on b.Date < a.Date
group by a.rowid")
giving:
id Date Value mean
1 1 2013-08-02 1 0.0
2 2 2013-08-02 4 0.0
3 3 2013-08-03 5 2.5
4 4 2013-08-03 2 2.5
5 5 2013-08-04 4 3.0
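Since the question mentions dplyr pipelines, here is a hedged base-R sketch of the same idea (not the answerer's code): for each row, average every Value with a strictly earlier Date, defaulting to 0.
df$mean <- sapply(df$Date, function(d) {
  prior <- df$Value[df$Date < d]  # values with strictly earlier dates
  if (length(prior)) mean(prior) else 0
})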
Using data.table and lubridate you have this option:
library(data.table)
library(lubridate)
dt <- data.table(id=c(1:5))
dt$Date <- c("02/08/2013", "02/08/2013", "03/08/2013", "03/08/2013", "04/08/2013")
dt$Value <- c(1,4,5,2,4)
dt$Date <- dmy(dt$Date)
cummean <- function(d){
  if(nrow(dt[Date < d]) > 0)
    dt[Date < d, sum(Value)/.N]
  else 0
}
dt[, accuMean:=mapply(cummean,Date)]
# id Date Value accuMean
#1: 1 2013-08-02 1 0.0
#2: 2 2013-08-02 4 0.0
#3: 3 2013-08-03 5 2.5
#4: 4 2013-08-03 2 2.5
#5: 5 2013-08-04 4 3.0
Solution when you have multiple values:
library(data.table)
library(lubridate)
dt <- data.table(id=c(1:5))
dt$Date <- c("02/08/2013", "02/08/2013", "03/08/2013", "03/08/2013", "04/08/2013")
dt$Value_1 <- c(1,4,5,2,4)
dt$Value_2 <- c(3,2,0,1,2)
dt$Value_3 <- c(4,9,3,3,3)
dt$Date <- dmy(dt$Date)
cummean <- function(d, Value){
  if(nrow(dt[Date < d]) > 0)
    sum(dt[Date < d, Value, with = F])/dt[Date < d, .N]
  else 0
}
n <- 3
accuMean <- paste0("accuMean_", (1:n))
for(i in 1:n){
  print(i)
  dt[, (accuMean[i]) := mapply(cummean, Date, MoreArgs = list(paste0("Value_", i)))]
}
Suppose you have n values named Value_i. Then in your case, you only need to set n = 10.

Replacing missing values in time series data in R

I am new to R. I am hoping to replace the missing values for X in the data. How can I replace the missing values of "X" when "Time" = 1 or 2 with the value of "X" when "Time" = 3, for the same "SubID" and the same "Day"?
SubID: subject number
Day: each subject's day number (1,2,3...21)
Time: morning marked as 1, afternoon marked as 2, and evening marked as 3
X: only has a valid value when Time is 3, others are missing.
SubID Day Time X
1 1 1 NA
1 1 2 NA
1 1 3 7.4
1 2 1 NA
1 2 3 6.2
2 1 1 NA
2 1 2 NA
2 1 3 7.1
2 2 3 5.9
2 2 2 NA
2 2 1 NA
I was able to go as far as the following code using zoo. I have very limited experience in R. Thank you in advance!
data2 <- transform(data1,
                   x = na.aggregate(x, by = SubID, FUN = sum, na.rm = T))
Here's the explanation of my comment:
library(data.table)
library(zoo)
setDT(data1)
data1[order(-Time),
      Xf := na.locf(X),
      by = .(SubID, Day)]
Ok so the setDT function makes the data1 object a data.table. Then order(-Time) orders data1 with respect to Time in descending order (because of the -). Xf := na.locf(X) creates a new column Xf by reference (which means you don't have to assign this back to data1) as na.locf(X) which is a function in the zoo package that fills the NAs forward with the previous value (in this case filling 2 and 1 with the value in 3). The last line specifies that we want to do this grouped by SubID and Day.
Hope it's clearer now, feel free to ask if you have further doubts.
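If you prefer the tidyverse, tidyr's fill() does the same forward-fill; a sketch of that route, assuming the same data1 columns:
library(dplyr)
library(tidyr)
data1 %>%
  group_by(SubID, Day) %>%
  arrange(desc(Time), .by_group = TRUE) %>%  # Time 3 first within each group
  fill(X, .direction = "down") %>%           # carry the Time-3 value into 2 and 1
  ungroup()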
You can sort the data by descending time and then use X[1].
library(dplyr)
df <- tibble(SubID=1, Day=1, Time=c(1,2,3), X=c(NA, NA, 2.2))
df <- df %>%
  group_by(SubID, Day) %>%
  arrange(desc(Time)) %>%
  mutate(
    X = case_when(
      is.na(X) ~ X[1],
      TRUE ~ X)
  )
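Equivalently (my variation, not the answerer's): dplyr's coalesce() replaces NAs with a fallback value, so the case_when() can be shortened, assuming exactly one Time == 3 row per group:
df %>%
  group_by(SubID, Day) %>%
  mutate(X = coalesce(X, X[Time == 3])) %>%  # fall back to the evening value
  ungroup()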

Rolling weighted mean across two factor levels or time points

I would like to create a rolling 2-quarter average for alpha, bravo, and charlie (and lots of other variables). Research keeps taking me to the zoo and lubridate packages, but I always seem to end up back at rolling within one variable or grouping.
set.seed(123)
dates <- c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16", "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")
df <- data.frame(dates = sample(dates, 100, replace = TRUE, prob=rep(c(.03,.07,.03,.08, .05),2)),
alpha = rnorm(100, 5), bravo = rnorm(100, 10), charlie = rnorm(100, 15))
I'm looking for something like
x <- df %>% mutate_if(is.numeric, funs(rollmean(., 2, align='right', fill=NA)))
Desired result: a weighted average across "Q4'15" & "Q1'16", "Q1'16" & "Q2'16", etc for each column of data (alpha, bravo, charlie). Not looking for the average of the paired quarterly averages.
Here is what the averages would be for the Q4'15&"Q1'16" time point
df %>% filter(dates %in% c("Q4'15", "Q1'16")) %>% select(-dates) %>% summarise_all(mean)
I like data.table for this, and I have a solution for you but there may be a more elegant one. Here is what I have:
Data
Now as data.table:
R> suppressMessages(library(data.table))
R> set.seed(123)
R> datesvec <- c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16",
+ "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")
R> df <- data.table(dates = sample(datesvec, 100, replace = TRUE,
+ prob=rep(c(.03,.07,.03,.08, .05),2)),
+ alpha = rnorm(100, 5),
+ bravo = rnorm(100, 10),
+ charlie = rnorm(100, 15))
R> df[ , ind := which(datesvec==dates), by=dates]
R> setkey(df, ind) # optional but may as well
R> head(df)
dates alpha bravo charlie ind
1: Q4'15 5.37964 11.05271 14.4789 1
2: Q4'15 7.05008 10.36896 15.0892 1
3: Q4'15 4.29080 12.12845 13.6047 1
4: Q4'15 5.00576 8.93667 13.3325 1
5: Q4'15 3.53936 9.81707 13.6360 1
6: Q1'16 3.45125 10.56299 16.0808 2
R>
The key here is that we need to restore / maintain the temporal ordering of your quarters which your data representation does not have.
Average by quarter
This is easy with data.table:
R> ndf <- df[ ,
+ .(qtr=head(dates,1), # label of quarter
+ sa=sum(alpha), # sum of a in quarter
+ sb=sum(bravo), # sum of b in quarter
+ sc=sum(charlie), # sum of c in quarter
+ n=.N), # number of observations
+ by=ind]
R> ndf
ind qtr sa sb sc n
1: 1 Q4'15 25.2656 52.3039 70.1413 5
2: 2 Q1'16 65.8562 132.6650 192.7921 13
3: 3 Q2'16 10.3422 17.8061 31.3404 2
4: 4 Q3'16 84.6664 168.1914 256.9010 17
5: 5 Q4'16 41.3268 87.8253 139.5873 9
6: 6 Q1'17 42.6196 85.4059 134.8205 9
7: 7 Q2'17 76.5190 162.0784 241.2597 16
8: 8 Q3'17 42.8254 83.2483 127.2600 8
9: 9 Q4'17 68.1357 133.5794 198.1920 13
10: 10 Q1'18 37.0685 78.4107 120.2808 8
R>
Lag those averages once
R> ndf[, `:=`(psa=shift(sa), # previous sum of a
+ psb=shift(sb), # previous sum of b
+ psc=shift(sc), # previous sum of c
+ pn=shift(n))] # previous nb of obs
R> ndf
ind qtr sa sb sc n psa psb psc pn
1: 1 Q4'15 25.2656 52.3039 70.1413 5 NA NA NA NA
2: 2 Q1'16 65.8562 132.6650 192.7921 13 25.2656 52.3039 70.1413 5
3: 3 Q2'16 10.3422 17.8061 31.3404 2 65.8562 132.6650 192.7921 13
4: 4 Q3'16 84.6664 168.1914 256.9010 17 10.3422 17.8061 31.3404 2
5: 5 Q4'16 41.3268 87.8253 139.5873 9 84.6664 168.1914 256.9010 17
6: 6 Q1'17 42.6196 85.4059 134.8205 9 41.3268 87.8253 139.5873 9
7: 7 Q2'17 76.5190 162.0784 241.2597 16 42.6196 85.4059 134.8205 9
8: 8 Q3'17 42.8254 83.2483 127.2600 8 76.5190 162.0784 241.2597 16
9: 9 Q4'17 68.1357 133.5794 198.1920 13 42.8254 83.2483 127.2600 8
10: 10 Q1'18 37.0685 78.4107 120.2808 8 68.1357 133.5794 198.1920 13
R>
Average over current and previous quarter
R> ndf[is.finite(psa), # where we have valid data
+ `:=`(ra=(sa+psa)/(n+pn), # total sum / total n == avg
+ rb=(sb+psb)/(n+pn),
+ rc=(sc+psc)/(n+pn))]
R> ndf[,c(1:2, 11:13)]
ind qtr ra rb rc
1: 1 Q4'15 NA NA NA
2: 2 Q1'16 5.06233 10.27605 14.6074
3: 3 Q2'16 5.07989 10.03141 14.9422
4: 4 Q3'16 5.00045 9.78935 15.1706
5: 5 Q4'16 4.84589 9.84680 15.2496
6: 6 Q1'17 4.66369 9.62395 15.2449
7: 7 Q2'17 4.76554 9.89937 15.0432
8: 8 Q3'17 4.97268 10.22195 15.3550
9: 9 Q4'17 5.28386 10.32513 15.4977
10: 10 Q1'18 5.00972 10.09476 15.1654
R>
This takes advantage of the fact that the total sum over two quarters divided by the total number of observations is the same as the mean over all observations in those two quarters. (And this reflects an edit following an earlier thinko of mine.)
Spot check
We can use the selection feature of data.table to compute two of those rows by hand; I picked those for indices <1,2> and <4,5> here:
R> df[ ind <= 2, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
a b c
1: 5.06233 10.276 14.6074
R> df[ ind == 4 | ind == 5, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
a b c
1: 4.84589 9.8468 15.2496
R>
This pans out fine, and the approach should scale easily to millions of rows thanks to data.table.
PS: All in One
As you mentioned pipes etc., you can write all this with chained data.table operations. Not my preferred style, but possible. The following creates the exact same output without ever creating an ndf temporary as above:
## All in one
df[ , ind := which(datesvec==dates), by=dates][
,
.(qtr=head(dates,1), # label of quarter
sa=sum(alpha), # sum of a in quarter
sb=sum(bravo), # sum of b in quarter
sc=sum(charlie), # sum of c in quarter
n=.N), # number of observations
by=ind][
,
`:=`(psa=shift(sa), # previous sum of a
psb=shift(sb), # previous sum of b
psc=shift(sc), # previous sum of c
pn=shift(n))][
is.finite(psa), # where we have valid data
`:=`(ra=(sa+psa)/(n+pn), # total sum / total n == avg
rb=(sb+psb)/(n+pn),
rc=(sc+psc)/(n+pn))][
,c(1:2, 11:13)][]

R: How to substitute a value from one dataframe to another based on a condition using a function?

Objective: To map a date from one dataframe to another given that it falls within a certain date interval. For example, let's say we need to deliver a gift within the time interval from either 20/12/2017 to 25/12/2017 or 26/12/2017 to 30/12/2017, and receive a response from the gift sender on the 23/12/2017. I want to create a function that can identify where to place the response date based on the date interval it falls within. In the example above, the response date would fall in the interval 20/12/2017 to 25/12/2017.
Note: The term "Match" below means a certain condition is met from one data frame with another.
Here is some sample code to demonstrate what I mean.
# Creating the Data Frame with a start and end date interval
StartDate <- seq(as.Date("2000/1/1"), by = "month", length.out = 10)
EndDate <- StartDate +7
Dates_Interval <- data.frame(StartDate,EndDate)
# Creating a second data frame with the response dates only
ResponseDate <- seq(as.Date("2000/1/6"), by = "month", length.out = 10)
Response_Substitute <- data.frame(ResponseDate)
# Substituting random NA values
Response_Substitute[c(1,5,8),] <- NA
# > Response_Substitute
# ResponseDate
# 1 <NA>
# 2 2000-02-09
# 3 2000-03-06
# 4 2000-04-06
# 5 <NA>
# 6 2000-06-06
# 7 2000-07-06
# 8 <NA>
# 9 2000-09-06
# 10 2000-10-06
# Creating a function which evaluates a value in data frame two (Response_Substitute)
# and checks whether it meets a condition in Dates_Interval.
dateresponses <- function(x, y, z) {
  sub_date <- ifelse(y <= x && x <= z, x, NA)
  converteddate <- as.Date(sub_date, origin = "1899-12-30")
  return(converteddate)
}
# Example of the function in use to show how it matches a certain condition.
x <- Response_Substitute[2,1]
b <- dateresponses(x,Dates_Interval[2,1],Dates_Interval[2,2])
# > b
# [1] "1930-02-04"
# Example of the function in use to show when a response date does not
# match a certain condition
x <- Response_Substitute[2,1] <- as.Date("2000/2/9")
b <- dateresponses(x,Dates_Interval[2,1],Dates_Interval[2,2])
# > b
# [1] NA
# Example of the function in use to show when there is no response date in
# the Response_Substitute variable
x <- Response_Substitute[1,1]
b <- dateresponses(x,Dates_Interval[2,1],Dates_Interval[2,2])
# > b
# [1] NA
I need a function that will create a new column in the Dates_Interval data frame, matching each response date with the date interval it falls within from the StartDate and EndDate columns. If there is no response, the new value should be NA. If there is a response but the response date does not fall into any interval, then I want a dataframe to be created that captures the unmatched responses.
This is what the final dataframe could look like:
Dates_Interval$ResponseDate <- Response_Substitute
# > Dates_Interval
# StartDate EndDate ResponseDate
# 1 2000-01-01 2000-01-08 <NA>
# 2 2000-02-01 2000-02-08 2000-02-06
# 3 2000-03-01 2000-03-08 2000-03-06
# 4 2000-04-01 2000-04-08 2000-04-06
# 5 2000-05-01 2000-05-08 <NA>
# 6 2000-06-01 2000-06-08 2000-06-06
# 7 2000-07-01 2000-07-08 2000-07-06
# 8 2000-08-01 2000-08-08 <NA>
# 9 2000-09-01 2000-09-08 2000-09-06
# 10 2000-10-01 2000-10-08 2000-10-06
And for response dates that are not NA but do not match any interval another dataframe could be created like this:
Unmatched_Response_Date <- data.frame(seq(as.Date("2000/1/9"), by = "month",
length.out = 2))
colnames(Unmatched_Response_Date) <- "Unmatched Responses"
Unmatched_Response_Date
# > Unmatched_Response_Date
# Unmatched Responses
# 1 2000-01-09
# 2 2000-02-09
EDIT:
There is a bug I have noticed when using the dateresponses function: when I use a date from the Response_Substitute data frame, the output date is not the same as in the data frame. E.g. for Response_Substitute[2,1] the value should be 2000-02-09, but instead I get 1930-02-04. Any ideas for solving this issue?
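(A note on that EDIT, since the answer below does not address it: ifelse() strips the Date class and returns the underlying day count, and R counts days from 1970-01-01, not from Excel's 1899-12-30 epoch. A sketch of the corrected conversion:)
dateresponses <- function(x, y, z) {
  sub_date <- ifelse(y <= x & x <= z, x, NA)   # ifelse() returns a bare number
  as.Date(sub_date, origin = "1970-01-01")     # R's Date epoch, not Excel's
}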
Here is the code you provided:
StartDate <- seq(as.Date("2000/1/1"), by = "month", length.out = 10)
EndDate <- StartDate +7
Dates_Interval <- data.frame(StartDate,EndDate)
# Creating a second data frame with the response dates only
ResponseDate <- seq(as.Date("2000/1/6"), by = "month", length.out = 10)
Response_Substitute <- data.frame(ResponseDate)
# Substituting random NA values
Response_Substitute[c(1,5,8),] <- NA
So to answer your question, I added another date column so that we had an interval in both data.frames. I also removed the NA values from the first data.frame containing the responses. These don't appear to have any bearing on your expected output; correct me if I'm wrong.
Response_Substitute$Date2 <- Response_Substitute$ResponseDate - 1
Response_Substitute <- Response_Substitute[!is.na(Response_Substitute$ResponseDate),]
The crux of this question comes down to using a data.table function called foverlaps(). From the documentation, this is a fast overlap join function. It is designed to find where two intervals overlap and join the data together. The code below does just that.
This is also why I needed to generate the other date in the Response_Substitute. foverlaps() requires two intervals to work.
library(data.table)
Dates_Interval <- as.data.table(Dates_Interval)
Response_Substitute <- as.data.table(Response_Substitute)
setkey(Response_Substitute, Date2, ResponseDate)
join_df <- foverlaps(Dates_Interval, Response_Substitute,
by.x = c('StartDate', 'EndDate'))
Output:
ResponseDate Date2 StartDate EndDate
1: <NA> <NA> 2000-01-01 2000-01-08
2: 2000-02-06 2000-02-05 2000-02-01 2000-02-08
3: 2000-03-06 2000-03-05 2000-03-01 2000-03-08
4: 2000-04-06 2000-04-05 2000-04-01 2000-04-08
5: <NA> <NA> 2000-05-01 2000-05-08
6: 2000-06-06 2000-06-05 2000-06-01 2000-06-08
7: 2000-07-06 2000-07-05 2000-07-01 2000-07-08
8: <NA> <NA> 2000-08-01 2000-08-08
9: 2000-09-06 2000-09-05 2000-09-01 2000-09-08
10: 2000-10-06 2000-10-05 2000-10-01 2000-10-08
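As a side note (my observation, not part of the answer): foverlaps() also accepts zero-width intervals, so instead of subtracting a day you could set the second column equal to the response date itself:
# a point 'interval' overlaps the same way with type = 'any' (the default)
Response_Substitute$Date2 <- Response_Substitute$ResponseDate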
Final Step is to remove the column, and generate the empty vector of the non-matches.
# Removes the Date2 Column
join_df[, Date2:=NULL]
# Generate list of responses that didn't join
setdiff(Response_Substitute$ResponseDate, join_df$ResponseDate)
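One caveat worth adding (mine, not the answerer's): base setdiff() drops the Date class, so the result prints as raw day counts. Converting restores readable dates:
unmatched <- setdiff(Response_Substitute$ResponseDate, join_df$ResponseDate)
as.Date(unmatched, origin = "1970-01-01")  # restore the Date class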
Does this work for your problem?

Find average values of a column in terms of date range of another column in R

I have two data frames that look like this:
> head(y,n=4)
Source: local data frame [6 x 3]
Start Date End Date Length
1 2006-06-08 2006-06-10 3
2 2006-06-12 2006-06-14 3
3 2006-06-18 2006-06-21 4
4 2006-06-24 2006-06-25 2
and
> head(x,n=19)
Date Group.Size
413 2006-06-07 6
414 2006-06-08 3
415 2006-06-09 1
416 2006-06-10 3
417 2006-06-11 15
418 2006-06-12 12
419 2006-06-13 NA
420 2006-06-14 4
421 2006-06-15 8
422 2006-06-16 3
423 2006-06-17 1
424 2006-06-18 3
425 2006-06-19 10
426 2006-06-20 2
427 2006-06-21 7
428 2006-06-22 6
429 2006-06-23 2
430 2006-06-24 1
431 2006-06-25 0
I'm looking for a way to add a new column in data frame y that will show the average Group.Size of data frame x (rounded to nearest integer), depending on the given Start Date and End Dates provided in y.
For example, in the first row of y, I have 6/8/06 to 6/10/06. This is a length of 3 days, so I would want the new column to have the number 2, because the corresponding Group.Size values are 3, 1, and 3 for the respective days in data frame x (mean=2.33, rounded to nearest integer is 2).
If there is an NA in my dataframe x, I'd like to consider it a 0.
There are multiple steps involved in this task, and there is probably a straightforward approach... I am relatively new to R, and am having a hard time breaking it down. Please let me know if I should clarify my example.
Assuming that x$Date, y$StartDate, and y$EndDate are of class Date (or character), the following apply approach should do the trick:
y$AvGroupSize <- apply(y, 1, function(z) {
  round(mean(x$Group.Size[which(x$Date >= z[1] & x$Date <= z[2])], na.rm = TRUE), 0)
})
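A caveat with apply() (my note, not the answerer's): it coerces each row of y to a character vector, so z[1] and z[2] arrive as strings; the comparison still works because R converts character to Date when comparing against a Date. A mapply()-based sketch that keeps the dates as dates, assuming the columns are named StartDate and EndDate (the printout above shows spaces in the names):
y$AvGroupSize <- mapply(function(s, e) {
  round(mean(x$Group.Size[x$Date >= s & x$Date <= e], na.rm = TRUE))
}, y$StartDate, y$EndDate)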
#Replace missing values in x with 0
x[is.na(x)] <- 0
#Create new 'Group' variable and loop through x to create groups
x$Group <-1
j <- 1
for(i in 1:nrow(x)){
  if(x[i, "Date"] == y[j, "StartDate"]){
    x[i, "Group"] <- j + 1
    if(j < nrow(y)){
      j <- j + 1
    } else {
      j <- j
    }
  } else if(i > 1){
    x[i, "Group"] <- x[i-1, "Group"]
  } else {
    x[i, "Group"] <- 1
  }
}
#Use tapply function to get the rounded mean of each Group
tapply(x$Group.Size, x$Group, function(z) round(mean(z)))
Here is a different dplyr solution
library(dplyr)
na2zero <- function(x) ifelse(is.na(x),0,x) # Convert NA to zero
ydf %>%
  group_by(Start_Date, End_Date) %>%
  mutate(avg = round(mean(na2zero(xdf$Group.Size[between(xdf$Date, Start_Date, End_Date)])), 0)) %>%
  ungroup
## Start_Date End_Date Length avg
## (time) (time) (int) (dbl)
## 1 2006-06-08 2006-06-10 3 2
## 2 2006-06-12 2006-06-14 3 5
## 3 2006-06-18 2006-06-21 4 6
## 4 2006-06-24 2006-06-25 2 0
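A small variation (my note): dplyr ships coalesce(), which does the same NA-to-zero substitution as the na2zero helper:
coalesce(xdf$Group.Size, 0)  # equivalent to na2zero(xdf$Group.Size)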
This is a solution that applies over the rows of the data frame y:
library(dplyr)
get_mean_size <- function(start, end, length) {
  s <- sum(filter(x, Date >= start, Date <= end)$Group.Size, na.rm = TRUE)
  round(s / length)
}
y$Mean.Size = Map(get_mean_size, y$Start_Date, y$End_Date, y$Length)
y
## Start_Date End_Date Length Mean.Size
## 1 2006-06-08 2006-06-10 3 2
## 2 2006-06-12 2006-06-14 3 5
## 3 2006-06-18 2006-06-21 4 6
## 4 2006-06-24 2006-06-25 2 0
It uses the filter() function from the dplyr package, together with base R's Map().
First I define the function get_mean_size, which is supplied with the three values from a row of y: Start_Date, End_Date and Length. It first selects the relevant rows from x using filter() and sums up the column Group.Size. Using na.rm = TRUE tells sum() to ignore NA values, which here is the same as setting them to zero. Then the average is calculated by dividing by length and rounding. Note that round() rounds half to even, thus 0.5 is rounded to 0, while 1.5 is rounded to 2.
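A quick illustration of that banker's rounding in R:
round(0.5)  # 0 -- halves round to the nearest even integer
round(1.5)  # 2
round(2.5)  # 2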
This function is then applied to all rows of y using Map() and added as a new column to y.
A final note regarding the dates in x and y. This solution assumes that the dates are stored as Date objects. You can check this using, e.g.,
is(x$Date, "Date")
If they do not have class Date, you can convert them using
x$Date <- as.Date(x$Date)
(and similarly for y$Start_Date and y$End_Date).
There are many ways but here is one. We can first create a list of date positions with lapply (side note: be sure that the dates are in chronological order). Then we map the function round(mean(Group.Size)) to each pair of positions:
lst <- lapply(y[1:2], function(.x) match(.x, x[,"Date"]))
y$avg <- mapply(function(i,j) round(mean(x$Group.Size[i:j], na.rm=TRUE)), lst[[1]],lst[[2]])
y
# StartDate EndDate Length avg
# 1 2006-06-08 2006-06-10 3 2
# 2 2006-06-12 2006-06-14 3 8
# 3 2006-06-18 2006-06-21 4 6
# 4 2006-06-24 2006-06-25 2 0
