average for time period dependent on date of row - r

I have a list of dates, and each date has a value.
This is what my data frame looks like right now. Note that there can be repeats in the date, but the entry in value will also repeat with the same value (i.e. row 2 and 3 have the same date, but the respective values are also the same).
date value
1 2018-02-08 1
2 2018-02-09 2
3 2018-02-09 2
4 2018-02-10 4
... ...
This is what I want my data frame to look like
date value weekavg
1 2018-02-08 1 ...
2 2018-02-09 2 ...
3 2018-02-09 2 ...
4 2018-02-10 4 ...
5 2018-02-11 0 ...
6 2018-02-12 0 ...
7 2018-02-13 0 ...
8 2018-02-14 0 ...
9 2018-02-15 0 1
... ... ...
To clarify, the entry in the ninth row is calculated by finding the dates that occurred before it for a week, so for 2018-02-15 that would be the date range 2018-02-08 to 2018-02-13. Thus, the result is 1 since 1+2+4+0+0+0+0 = 7. How could I do this in R, and then do it for every row?
------ Reproducible example -----
data
lines <- "date value
1 2018-02-08 NA
2 2018-02-08 NA
3 2018-02-09 NA
4 2018-02-10 295
5 2018-02-10 295
6 2018-02-11 329
7 2018-02-12 242
8 2018-02-12 242
9 2018-02-13 317
10 2018-02-14 341
11 2018-02-15 292
12 2018-02-16 363
13 2018-02-17 380
14 2018-02-18 319
15 2018-02-19 307
16 2018-02-20 328
17 2018-02-21 290"
df <- read.table(text = lines)
newDF <- merge(df, transform(unique(df), mean = rollmeanr(value, 7, fill = NA)))
the mean column is just NA's for me.
P.S. Apologies for the image comments, I didn't know. Your help is much appreciated.

The question does not fully define the output but assuming:
there are no missing days, only duplicated days
if a day is duplicated then the average on its row should be duplicated
then:
library(zoo)
merge(DF, transform(unique(DF), mean = rollmeanr(value, 7, fill = NA)))
For the sample data shown reproducibly in the Note at the end this gives:
date value mean
1 2018-02-08 1 NA
2 2018-02-09 2 NA
3 2018-02-09 2 NA
4 2018-02-10 4 NA
5 2018-02-11 0 NA
6 2018-02-12 0 NA
7 2018-02-13 0 NA
8 2018-02-14 0 1.0000000
9 2018-02-15 0 0.8571429
Note
Lines <- "
date value
1 2018-02-08 1
2 2018-02-09 2
3 2018-02-09 2
4 2018-02-10 4
5 2018-02-11 0
6 2018-02-12 0
7 2018-02-13 0
8 2018-02-14 0
9 2018-02-15 0
"
DF <- read.table(text = Lines)

Related

Group two dfs based on dates that closely match

These are subsets of two dataframes.
df1:
plot
mean_first_flower_date
gdd
1
2019-07-15
60
1
2019-07-21
50
1
2019-07-23
78
2
2019-05-13
100
2
2019-05-22
173
2
2019-05-25
245
(cont.)
df2:
plot
date
flowers
1
2019-07-12
2
1
2019-07-13
9
1
2019-07-14
3
1
2019-07-15
3
2
2019-05-12
10
2
2019-05-13
10
2
2019-05-14
14
2
2019-05-15
17
(cont.)
df2 has some matching dates with df1 but sometimes the dates are off for one or a couple days (highlighted in bold).
I would like to group both dfs based on both 'date' and 'plot', keeping df2, without losing 'gdd' data from df1.
This will happen if, for example, I inner_join both dfs because the dates will not match.
So if a date in df1 is one to three days earlier or later than what it's possible to match in df2, it's fine because the dates are relatively close. This is tricky because I want this data replacement only if there is not data available in df1 for that data range.
My goal is to have something like this:
plot
date
flowers
gdd
1
2019-07-12
2
60
1
2019-07-13
9
60
1
2019-07-14
3
60
1
2019-07-15
3
60
2
2019-05-12
10
100
2
2019-05-13
10
100
2
2019-05-14
14
100
2
2019-05-15
17
100
Is it possible to do?
I greatly appreciate any help!
Thanks!
I think a 'rolling join' from the data.table package can handle this:
library(data.table)
setDT(df1)
setDT(df2)
df1[, mean_first_flower_date := as.Date(mean_first_flower_date)]
df2[, date := as.Date(date)]
df1[df2, on=c("plot","mean_first_flower_date==date"), roll=3, rollends=TRUE]
# plot mean_first_flower_date gdd flowers
#1: 1 2019-07-12 60 2
#2: 1 2019-07-13 60 9
#3: 1 2019-07-14 60 3
#4: 1 2019-07-15 60 3
#5: 2 2019-05-12 100 10
#6: 2 2019-05-13 100 10
#7: 2 2019-05-14 100 14
#8: 2 2019-05-15 100 17
Using this data:
df1 <- read.table(text="plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245", header=TRUE)
df2 <- read.table(text="plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17", header=TRUE)
Try fill from dplyr. use this syntax
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 NA
8 2 2019-05-15 17 NA
As you can notice there are two NAs in the last two rows which shouldn't be there if you'll join your actual df2 where these rows will be filled by 173 as there will be a match for 2019-05-22. Still if you want to fill the last NA rows, if any, you can use fill again with .direction = "down"
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up") %>% fill(gdd, .direction = "down")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 100
8 2 2019-05-15 17 100

How to update a data.frame based on information from another data.frame

I have two tables: Display and Review. The Review table contains information on reviews on products of an online store. Each row represents the date of the review as well as the cumulative number of reviews and the average rating for the product up to the date.
page_id<-c("1072659", "1072659" , "1072659","1072650","1072660","1072660")
review_id<-c("1761023","1761028","1762361","1918387","1761427","1863914")
date<-as.Date(c("2013-07-11","2013-08-12","2014-07-15","2014-09-10","2013-07-27","2014-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(1,2,3,1,1,2)
average_rating<-c(5,3.5,4,3,5,5)
Review<-data.frame(page_id,review_id,date,cumulative_No_reviews,average_rating)
page_id review_id date cumulative_No_reviews average_rating
1072659 1761023 2013-07-11 1 5
1072659 1761028 2013-08-12 2 3.5
1072659 1762361 2014-07-15 3 4
1072650 1918387 2014-09-10 1 3
1072660 1761427 2013-07-27 1 5
1072660 1863914 2014-08-12 2 5
The Display table captures the data on customers’ visit to product pages.
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
Display<-data.frame(page_id,date)
page_id date
1072659 2013-07-10
1072659 2013-08-03
1072659 2015-02-11
1072650 2014-08-10
1072650 2014-09-09
1072660 2013-08-12
1072660 2014-09-12
1072660 2015-08-12
I’d like to add two column to the Display table (call it Display2) in a way that it reflects the latest review information up the point of visit for each product, as follows:
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(0,1,3,0,0,1,2,2)
average_rating<-c(NA,5,4,NA,NA,5,5,5)
Display2<-data.frame(page_id,date,cumulative_No_reviews,average_rating)
page_id date cumulative_No_reviews average_rating
1072659 2013-07-10 0 NA
1072659 2013-08-03 1 5
1072659 2015-02-11 3 4
1072650 2014-08-10 0 NA
1072650 2014-09-09 0 NA
1072660 2013-08-14 1 5
1072660 2014-09-11 2 5
1072660 2015-08-12 2 5
I would appreciate your help with this.
You can do this with a data.table join. You can join the Review table with the Display table on the condition that the page_ids match and the Review date is less than the Display date. For some rows of Display there will be multiple rows of Review which match according to these conditions, so with mult = 'last' we're just picking the last one. Since Review is sorted by date, this means the one with the most recent date.
library(data.table) # 1.12.6 for nafill (used below)
setDT(Display)
setDT(Review)
Display2 <- Review[Display, on = .(page_id, date < date), mult = 'last']
Display2
# page_id review_id date cumulative_No_reviews average_rating
# 1: 1072659 <NA> 2013-07-10 NA NA
# 2: 1072659 1761023 2013-08-03 1 5
# 3: 1072659 1762361 2015-02-11 3 4
# 4: 1072650 <NA> 2014-08-10 NA NA
# 5: 1072650 <NA> 2014-09-09 NA NA
# 6: 1072660 1761427 2013-08-12 1 5
# 7: 1072660 1863914 2014-09-12 2 5
# 8: 1072660 1863914 2015-08-12 2 5
Now this output almost matches what you show in the question, we just need to remove the review_id column and replace NAs in the cumulative_No_reviews column with 0s.
Display2[, review_id := NULL]
Display2[, cumulative_No_reviews := nafill(cumulative_No_reviews, fill = 0)][]
# page_id date cumulative_No_reviews average_rating
# 1: 1072659 2013-07-10 0 NA
# 2: 1072659 2013-08-03 1 5
# 3: 1072659 2015-02-11 3 4
# 4: 1072650 2014-08-10 0 NA
# 5: 1072650 2014-09-09 0 NA
# 6: 1072660 2013-08-12 1 5
# 7: 1072660 2014-09-12 2 5
# 8: 1072660 2015-08-12 2 5

Calculate maximum date interval - R

The challenge is a data.frame with with one group variable (id) and two date variables (start and stop). The date intervals are irregular and I'm trying to calculate the uninterrupted interval in days starting from the first startdate per group.
Example data:
data <- data.frame(
id = c(1, 2, 2, 3, 3, 3, 3, 3, 4, 5),
start = as.Date(c("2016-02-18", "2016-12-07", "2016-12-12", "2015-04-10",
"2015-04-12", "2015-04-14", "2015-05-15", "2015-07-14",
"2010-12-08", "2011-03-09")),
stop = as.Date(c("2016-02-19", "2016-12-12", "2016-12-13", "2015-04-13",
"2015-04-22", "2015-05-13", "2015-07-13", "2015-07-15",
"2010-12-10", "2011-03-11"))
)
> data
id start stop
1 1 2016-02-18 2016-02-19
2 2 2016-12-07 2016-12-12
3 2 2016-12-12 2016-12-13
4 3 2015-04-10 2015-04-13
5 3 2015-04-12 2015-04-22
6 3 2015-04-14 2015-05-13
7 3 2015-05-15 2015-07-13
8 3 2015-07-14 2015-07-15
9 4 2010-12-08 2010-12-10
10 5 2011-03-09 2011-03-11
The aim would a data.frame like this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-12 7
3 2 2016-12-12 2016-12-13 7
4 3 2015-04-10 2015-04-13 34
5 3 2015-04-12 2015-04-22 34
6 3 2015-04-14 2015-05-13 34
7 3 2015-05-15 2015-07-13 34
8 3 2015-07-14 2015-07-15 34
9 4 2010-12-08 2010-12-10 3
10 5 2011-03-09 2011-03-11 3
Or this:
id start stop duration_from_start
1 1 2016-02-18 2016-02-19 2
2 2 2016-12-07 2016-12-13 7
3 3 2015-04-10 2015-05-13 34
4 4 2010-12-08 2010-12-10 3
5 5 2011-03-09 2011-03-11 3
It's important to identify the gap from row 6 to 7 and to take this point as the maximum interval (34 days). The interval 2018-10-01to 2018-10-01 would be counted as 1.
My usual lubridate approaches don't work with this example (interval %within lag(interval)).
Any idea?
library(magrittr)
library(data.table)
setDT(data)
first_int <- function(start, stop){
ind <- rleid((start - shift(stop, fill = Inf)) > 0) == 1
list(start = min(start[ind]),
stop = max(stop[ind]))
}
newdata <-
data[, first_int(start, stop), by = id] %>%
.[, duration := stop - start + 1]
# id start stop duration
# 1: 1 2016-02-18 2016-02-19 2 days
# 2: 2 2016-12-07 2016-12-13 7 days
# 3: 3 2015-04-10 2015-05-13 34 days
# 4: 4 2010-12-08 2010-12-10 3 days
# 5: 5 2011-03-09 2011-03-11 3 days

How can I connect a data set to another when the date is between 2 other dates R

I need to merge two datasets, but the rows have to merge if the date of the one dataset is between two dates of the other one. The first dataset data looks like this:
Date Weight diff Loc.nr
2013-01-24 1040 7 2
2013-01-31 1000 7 2
2013-02-07 1185 7 2
2013-02-14 915 7 2
2013-02-21 1090 7 2
2013-03-01 1065 9 2
2013-01-19 500 4 9
2013-01-23 1040 3 9
2013-01-28 415 5 9
2013-01-31 650 3 9
2013-02-04 725 4 9
2013-02-07 450 3 9
2013-02-11 550 4 9
The other data set matches looks like this:
Date winning
2013-01-20 1
2013-01-27 0
2013-02-03 1
2013-02-10 0
2013-02-17 1
2013-02-24 0
I wrote a code to connect the winning column from matches to the data set "data":
data$winning <- NA
for(i in 1:nrow(data)) {
for(j in 1:nrow(matches)) {
if((data$Date[i]-data$diff[i]) < matches$Date[j] & data$Date[i] > matches$Date[j]) {
data$winning[i] <- matches$winning[j]
}
}
}
This code takes 3 days to run, is there a faster way to do this?
My expected output is:
Date Weight diff Loc.nr winning
2013-01-24 1040 7 2 1
2013-01-31 1000 7 2 0
2013-02-07 1185 7 2 1
2013-02-14 915 7 2 0
2013-02-21 1090 7 2 1
2013-03-01 1065 9 2 0
2013-01-19 500 4 9 NA
2013-01-23 1040 3 9 NA
2013-01-28 415 5 9 0
2013-01-31 650 3 9 NA
2013-02-04 725 4 9 1
2013-02-07 450 3 9 NA
2013-02-11 550 4 9 0
With non-equi join as suggested by Gregor you can try something along
library(data.table)
setDT(data)[, winning := setDT(matches)[data[, .(upper = Date, lower = Date - diff)],
on = .(Date < upper, Date > lower)]$winning][]
Date Weight diff Loc.nr winning
1: 2013-01-24 1040 7 2 1
2: 2013-01-31 1000 7 2 0
3: 2013-02-07 1185 7 2 1
4: 2013-02-14 915 7 2 0
5: 2013-02-21 1090 7 2 1
6: 2013-03-01 1065 9 2 0
7: 2013-01-19 500 4 9 NA
8: 2013-01-23 1040 3 9 NA
9: 2013-01-28 415 5 9 0
10: 2013-01-31 650 3 9 NA
11: 2013-02-04 725 4 9 1
12: 2013-02-07 450 3 9 NA
13: 2013-02-11 550 4 9 0

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers from these two previous posts but have not been able to address my specific problem in R; multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' after grouping by 'event_type'. Here, I am using data.table approach by converting the 'data.frame' to 'data.table' (setDT(df)), grouped by 'event_type', we get the diff of 'date'.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or as #Frank mentioned in the comments, we can also use shift (from version v1.9.5+ onwards) to get the lag (by default, the type='lag') of 'date' and subtract from the 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, #akrun has posted the data.tables approach, the parallel dplyr approach would be straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days

Resources