How to calculate moving average for different starting date? - r

I would like to calculate moving averages for each participant in the dataset.
Participant may have more than one visit date, and I would like to calculate the average value in the past 3 days and in the past 2 days before each visit (not including the day of visit).
For example, let id=1, date=6/6/2017.
Average value in the past 2 days should be an average of value on 6/5/2017 and 6/4/2017.
Sample datasets are generated as below.
I am working on a much larger dataset, with more participants, more visits, and more days of value. I want to find an efficient way to calculate these averages.
timeseries <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3), date=c("6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
"6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
"6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017"),
value=c(2,3,4,NA,6,7,
NA,9,5,NA,3,2,
5,7,3,8,3,5))
> timeseries
id date value
1 1 6/1/2017 2
2 1 6/2/2017 3
3 1 6/3/2017 4
4 1 6/4/2017 NA
5 1 6/5/2017 6
6 1 6/6/2017 7
7 2 6/1/2017 NA
8 2 6/2/2017 9
9 2 6/3/2017 5
10 2 6/4/2017 NA
...
visit <- data.frame(id=c(1,1,2,3,3,3),
date=c("6/6/2017","6/5/2017",
"6/6/2017",
"6/6/2017","6/5/2017","6/4/2017"))
> visit
id date
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017
The result table should be something like this, where mean3 is the average value in the past 3 days, and mean2 is the average value in the past 2 days
> result
id date mean3 mean2
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017

For each id of visit, I subset corresponding data from timeseries and then calculated mean of the value within n_days.
library(lubridate)
n_days = 2
sapply(1:NROW(visit), function(i)
with(subset(x = timeseries,
subset = timeseries$id == visit$id[i]),
mean(x = value[difftime(time1 = mdy(visit$date[i]),
time2 = mdy(date),
units = "days") <= n_days &
difftime(time1 = mdy(visit$date[i]),
time2 = mdy(date),
units = "days") > 0],
na.rm = TRUE)))
#[1] 6.0 4.0 3.0 5.5 5.5 5.0

Related

summing based on conditions from two dataframes and dealing with dates

I have two dataframes, one with climate data for every location and date across 4 years. The other data frame has a date for each day an animal was trapped at a site. I am trying to calculate the mean of each climate variable based on a specific amount of time before the day the animal was trapped (time length depends on variable in question).
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
> climate
site precip humid date
1 1 0.1 1 6/13/2020
2 1 0.2 1 6/12/2020
3 1 0.1 3 6/11/2020
4 1 0.1 1 6/14/2020
5 2 0.5 2 6/13/2020
6 2 0.2 3 6/12/2020
7 2 0.3 3 6/11/2020
8 2 0.1 1 6/14/2020
9 1 0.2 1 2/13/2019
10 1 0.1 3 2/14/2019
11 1 0.1 1 2/15/2019
12 1 0.5 2 2/16/2019
> trap
site date
1 1 7/1/2020
2 2 7/1/2020
3 3 7/2/2020
4 3 7/4/2020
I want to calculate the mean humid 18-20 days before the date written in the trap dataframe. So essentially what is the mean humid between 6/11/2020 and 6/13/2020 according to the climate data.frame for animals trapped on 7/1/2020. So for site 1 that would be: 1.667 and site 2 that would be 2.67.
I also want to calculate the sum of precipitation 497-500 days before the date written in the trap dataframe. So I would need to calculate the sum (total) precip between 2/13/2019 and 2/16/2019 for an animal trapped on 7/1/2020 at each site. So for site 1 precip would be 0.9.
I know how to create new columns in the trap data frame for mean precip and sum humid but I'm not sure where to start in terms of coding so that each value is calculated as described above and the data that corresponds to the correct date is used for the large dataset that contains many different trap dates.
Thank you very much, hopefully I am being clear in my description.
I have a solution using functions from the tidyverse. It is always useful to convert date variables to the class date. With this class, you can make calculations. Note, that I renamed the date column in the trap data to trap_date. See comments for more details:
library(tidyverse)
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), trap_date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
# merge data
data <- merge(climate, trap, by="site")
> head(data)
site precip humid date trap_date
1 1 0.1 1 2020-06-13 2020-07-01
2 1 0.2 1 2020-06-12 2020-07-01
3 1 0.1 3 2020-06-11 2020-07-01
4 1 0.1 1 2020-06-14 2020-07-01
5 1 0.2 1 2019-02-13 2020-07-01
6 1 0.1 3 2019-02-14 2020-07-01
# parse dates to class 'date'; enables calculations
data <- data %>%
mutate(date = parse_date(date, format="%m/%d/%Y"),
trap_date = parse_date(trap_date, format="%m/%d/%Y"))
For means:
# humid means
data %>%
group_by(site) %>%
filter(date >= trap_date-20 & date <= trap_date-18) %>%
summarise(mean = mean(humid))
# A tibble: 2 x 2
site mean
<dbl> <dbl>
1 1 1.67
2 2 2.67
However, it seems that the range of 497 to 500 days before the trap date contains no observations. When I used your specified dates, I got the same result of 0.9:
# precip sums
data %>%
group_by(site) %>%
filter(date >= trap_date-500 & date <= trap_date-497)
# A tibble: 0 x 5
# Groups: site [0]
# ... with 5 variables: site <dbl>, precip <dbl>, humid <dbl>,
# date <date>, trap_date <date>
# using your provided dates
data %>%
group_by(site) %>%
filter(date >= as.Date("2019-02-13") & date <= as.Date("2019-02-16")) %>%
summarise(sum = sum(precip))
# A tibble: 1 x 2
site sum
<dbl> <dbl>
1 1 0.9
Hope I can help.

Calculate rolling average of simulated data series with data.table

I am simulating a price time series, where the time horizon basically is that each month consists of 20 working days and 12 months are one year. I now would like to calculate the rolling average of this price, always based on the first day of the month.
I do have a working solution, but would like to know if there's a more elegant or faster one.
dt.oil.price
Period Month Day.Month Oil.Price Oil.Supply Risk.Free.Interest
1: 1 1 1 39.4560000 NA 0.08642857
2: 2 1 2 3.7889460 NA 0.08642857
3: 3 1 3 51.0748751 NA 0.08642857
4: 4 1 4 60.6282853 NA 0.08642857
5: 5 1 5 35.7267224 NA 0.08642857
6: 6 1 6 26.1868977 NA 0.08642857
7: 7 1 7 32.6488136 NA 0.08642857
8: 8 1 8 42.6397549 NA 0.08642857
9: 9 1 9 18.8969991 NA 0.08642857
...
20: 20 1 20 8.8036135 NA 0.08642857
21: 21 2 1 2.5559526 NA 0.08642857
22: 22 2 2 24.3996401 NA 0.08642857
...
40: 40 2 20 41.2988566 NA 0.08642857
41: 41 3 1 20.8012327 NA 0.08642857
42: 42 3 2 70.5297726 NA 0.08642857
Just to give you an idea on the structure of the data. To create the above data structure with 60 periods:
set.seed(1);
dt.oil.price <- as.data.table(cbind( Period = 1:60,
Month = as.integer(rep(1:(60/20), each = 20))[1:60],
Oil.Price=rnorm(3*20,mean = 50, sd = 10)))
dt.oil.price[,"Day.Month" := rank(Period),by="Month"]
With the following code I can then select all first days of a month and calculate the mean of the oil price for these days:
dt.oil.price[ Day.Month == 1, mean(Oil.Price)]
In the next step I use another helper column "Num.Months" to rank the number of months accordingly, by
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)]
and with this I can then select only the last two months for the average calculation, by subsetting this
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)][Num.Months <= 2, Oil.Price]
A code snippet, which allows to calculate the mean without using an explicit helper column for the last three months:
dt.oil.price[Day.Month == 1 & Period <= 60, {Num.Months = rank(-Period); list("Period" = Period, "Month" = Month, "Oil.Price" = Oil.Price, "Num.Months" = Num.Months)}][Num.Months <=12, mean(Oil.Price)]
I hope my steps are all clear and it becomes also clear, what I would like to achieve. It is also possible to calculate the moving average dynamically by defining for example a period and then calculate the moving average for the last 12 months preceding that period. This can be achieved, by sub-setting the data.table only to periods smaller than the defined period and then calculating "Num.Months" for this data.table subset.

Identify and remove duplicates by a criteria in R

Hi I am puzzled with a problem concerning duplicates in R. I have looked around a lot and don't seem to find any help. I have a dataset like that
x = data.frame( id = c("A","A","A","A","A","A","A","B","B","B","B"),
StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
"08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011",
"02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
Group = c(1,1,1,2,2,3,4,2,3,4,4),
TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011",
"NA", "07/09/2009", "07/09/2009", "08/10/2009"),
Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4)
)
> x
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
2 A 09/07/2006 06/08/2006 1 08/09/2006 4
3 A 09/07/2006 06/08/2006 1 08/10/2006 4858
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
9 B 08/06/2009 06/07/2009 3 07/09/2009 795
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
So basically what I am trying to do is to identify duplicates in the TestDate variable by ID. For example dates 08/09/2006 and 08/10/2006 seem to be repeated in the same person but for different Group and I don't want the same Testdate to be in different Group by ID. The criteria to choose which TestDate to choose is to take the difference in days of TestDate with StartDate and EndDate for the different groups and then keep the one with the smallest difference in days. For example, about the date 08/10/2006 I would like to keep row 5 as the TestDate there is closer to the StartDate, than compared with the same differences in row 3. Eventually, I would like to get with a dataset like that
> xfinal
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
Any help on that will be much appreciated. Thanks
x$StartDate <- as.Date(x$StartDate,format="%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate,format="%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate,format="%d/%m/%Y")
x$Diff <- difftime(x$EndDate,x$StartDate,"days")
x <- x[order(x$id,x$Diff),]
x <- x[!duplicated(x[,c("id","TestDate")]),]
x$Diff <- NULL
x

Creating a vector containing total quantities sold per delivery term

Have a look at the simplified table below. I want for each product a vector containing the quantities sold within each delivery time. A delivery time is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15 and within the first delivery term (until 07/12/15) it has sold a quantity of 4. The second delivery term starts at 08/12/15 and ends at 12/12/15. So for this period there is 1 quantity sold. The following delivery term starts at 13/12/15 and ends at 17/12/15. During these period there are no quantities sold and thus for this period the vector must have a value of 0. In the last period, finally, 2 products are sold. So basically the problem here is that information regarding the periods were no products are sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable on a real dataset containing over 1000 product categories, so it has to be 'automatized' in one way.
I would be very gratefull if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# create date time columns
dt$Date = dmy(dt$Date)
dt %>%
group_by(Product) %>%
do(data.frame(days = seq(min(.$Date), max(.$Date), by="1 day"))) %>% # create all combinations between product and days
mutate(dist = as.numeric(difftime(days,min(days), units="days"))) %>% # create distance of each day with min date
ungroup() %>%
left_join(dt, by=c("Product"="Product","days"="Date")) %>% # join info to get quantities for each day
mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity), # replace NAs with 0s
id = floor(dist/5 + 1)) %>% # create the 4 period id
group_by(Product, id) %>%
summarise(Sum = sum(Quantity),
min_date = min(days),
max_date = max(days)) %>%
ungroup
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
First row of the output tells you that for product A in the first 4 days period (id = 1) you had 4 quantities in total and the period is from 3/12 to 7/12.
I would suggest {dplyr}'s summarise(),mutate() and group_by() functions. group_by() groups your data by desired variables (in your case - product and delivery term),mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look:
convert date into proper format:
library(dplyr)
df=tbl_df(df)
df$Date=as.Date(df$Date,format="%d/%m/%y")
calculating delivery terms
df=group_by(df,Product) %>% arrange(Date)
df=mutate(df,term=1+unclass((Date-min(Date)))%/%4)
group by product and terms and calculate sum of quantity:
df=group_by(df,Product,term)
summarise(df,sum=sum(Quantity))
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
intrvl <- findInterval(x, seq(min(x), max(x),4))
as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")

R finding date intervals by ID

Having the following table which comprises some key columns which are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG Format, in that you will get multi line items for the 1 Customer ID.
I can get the first date last date using R DateDiff but converting the file to WIDE format using Plyr, still end up with the same problem of getting multiple orders by customer, just less rows and more columns.
Is there an R function that extends R DateDiff to work out how to get the time interval between purchases by Customer ID? That is, time between order 1 and 2, order 2 and 3, and so on assuming these orders exists.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what do you want to do since you don't give the expected result. But I guess you want to the the intervals between 2 orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
"2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
tmp <-diff(x$OrderDate)
tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.

Resources