Calculate average and std same day last 3 weeks in R [duplicate]

This question already has an answer here:
use rollapply and zoo to calculate rolling average of a column of variables
(1 answer)
Closed 2 years ago.
I have a data frame like the one below (sample data). I want to add two columns showing, for each day, the average and standard deviation of sales on the same weekday over the last 3 weeks. What I mean by this is the 3 previous occurrences of the same weekday (the last 3 Tuesdays, the last 3 Wednesdays, etc.)
df <- data.frame(
  stringsAsFactors = FALSE,
  date = c("3/28/2019", "3/27/2019", "3/26/2019", "3/25/2019", "3/24/2019",
           "3/23/2019", "3/22/2019", "3/21/2019", "3/20/2019", "3/19/2019",
           "3/18/2019", "3/17/2019", "3/16/2019", "3/15/2019", "3/14/2019",
           "3/13/2019", "3/12/2019", "3/11/2019", "3/10/2019", "3/9/2019",
           "3/8/2019", "3/7/2019", "3/6/2019", "3/5/2019", "3/4/2019",
           "3/3/2019"),
  weekday = c(4L, 3L, 2L, 1L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 7L, 6L, 5L, 4L,
              3L, 2L, 1L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 7L),
  store_id = rep(344L, 26),
  store_sales = c(1312005L, 1369065L, 1354185L, 1339183L, 973780L, 1112763L,
                  1378349L, 1331890L, 1357713L, 1366399L, 1303573L, 936919L,
                  1099826L, 1406752L, 1318841L, 1321099L, 1387767L, 1281097L,
                  873449L, 1003667L, 1387767L, 1281097L, 873449L, 1003667L,
                  1331636L, 1303804L)
)
For example, for 3/28/2019 take the average of the sales on 3/21/2019, 3/14/2019, and 3/7/2019, like this:
date      weekday store_id store_sales avg_sameday3
3/28/2019       4      344     1312005      1310609

We can group by weekday and store_id and calculate a rolling mean over the previous 3 entries using zoo::rollapplyr.
library(dplyr)

df %>%
  mutate(date = as.Date(date, "%m/%d/%Y")) %>%
  arrange(date) %>%
  group_by(store_id, weekday) %>%
  mutate(store_sales_avg = zoo::rollapplyr(store_sales, 4,
                                           function(x) mean(x[-length(x)]),
                                           partial = TRUE))
Note that I have used a window size of 4 and dropped the last entry (the current day) from the mean calculation, so the current value is not included in its own average. Sorting by date first makes each right-aligned window look backwards over the previous weeks. With partial = TRUE a mean is still computed when fewer than 4 values are available; the first occurrence of each weekday has no history and therefore yields NaN.
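The question also asks for the standard deviation; the same rolling pattern works with sd(). A minimal sketch on invented data (one store, one weekday, ascending dates, so each row looks back at up to 3 earlier weeks; the helper roll_stat and the column names avg_sameday3/sd_sameday3 are mine):

```r
library(dplyr)
library(zoo)

# Toy sample: four consecutive same-weekday dates, ascending
sales <- data.frame(
  store_id    = 344L,
  weekday     = 4L,
  store_sales = c(100, 110, 120, 130)
)

# Apply f to the window after dropping the current (last) value;
# return NA when there is no history yet
roll_stat <- function(x, f) {
  prev <- x[-length(x)]
  if (length(prev) == 0) return(NA_real_)
  f(prev)
}

result <- sales %>%
  group_by(store_id, weekday) %>%
  mutate(
    avg_sameday3 = rollapplyr(store_sales, 4, roll_stat, f = mean, partial = TRUE),
    sd_sameday3  = rollapplyr(store_sales, 4, roll_stat, f = sd,   partial = TRUE)
  ) %>%
  ungroup()
```

The last row then averages the three earlier values (100, 110, 120), giving 110 with a standard deviation of 10.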


Calculating MAD in two different ways in R return different results

(I have posted a similar question at Cross Validated, but I believe this is more fitting for Stack Overflow).
I have a large dataframe data with following columns:
date time orig new
2001-01-01 00:30:00 345 856
2001-01-01 00:32:43 4575 9261
2001-01-01 00:51:07 6453 2352
...
2001-01-01 23:57:51 421 168
2001-01-02 00:06:14 5612 3462
...
2001-01-31 23:49:11 14420 8992
2001-02-01 00:04:32 213 521
...
I want to calculate the monthly aggregated MAD, which can be calculated by mean(abs(orig - new)) when grouped by month. Ideally, at the end, I want the solution (a dataframe) in the following form:
month mad
2001-01-01 7452.124
2001-02-01 3946.734
2001-03-01 995.938
...
I calculated the monthly MAD in two different ways.
Approach 1
I grouped data by month and took an average of the summed absolute differences (which is a "mathematical" way to do it, as I explained):
data %>%
  group_by(
    month = lubridate::floor_date(date, 'month')
  ) %>%
  summarise(mad = mean(abs(orig - new)))
Approach 2
I grouped data by day and hour to get the hourly MAD, then re-grouped it by month and took the average. This is counter-intuitive, but I use the hourly grouped dataframe for other analyses and tried computing the monthly MAD from it directly.
data_grouped_by_hour <- data %>%
  group_by(
    day  = lubridate::floor_date(date, 'day'),
    hour = as.POSIXlt(time)$hour
  ) %>%
  summarise(mad = mean(abs(orig - new)))

data_grouped_by_hour %>%
  group_by(
    month = lubridate::floor_date(day, 'month')
  ) %>%
  summarise(mad = mean(mad))
As hinted by the post title, these approaches return different values. I assume my first approach is correct, as it is more concise and follows the definition directly, but I wonder why the second approach does not return the same value.
I want to note that I would prefer Approach 2 so that I don't have to make separate tables for every analysis with different time unit. Any insights are appreciated.
Because an average of averages is not the same as the overall average.
This is a common misconception. Let's try to understand it with the help of an example.
Consider a list with 2 elements a and b
x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))
Now, similar to your question, we will take the average in 2 ways.
Average of all the values of x
res1 <- mean(unlist(x))
res1
#[1] 4.25
Average of each element separately and then complete average.
sapply(x, mean)
# a b
#3.833333 5.500000
res2 <- mean(sapply(x, mean))
res2
#[1] 4.666667
Notice that res1 and res2 have different values because the second case is an average of averages. The same logic applies in your case when you take an hourly average first and then a monthly average of those: the hours contain different numbers of observations, so weighting each hour equally distorts the monthly result.
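If you do want to reuse the pre-aggregated table, the overall mean can be recovered by weighting each group mean by its group size. A minimal sketch of the idea using the list from above (the variable names group_means, group_sizes, res3 are mine):

```r
x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))

group_means <- sapply(x, mean)  # 3.833333 and 5.5
group_sizes <- lengths(x)       # 6 and 2

# Weighting each group mean by its group size recovers the overall mean
res3 <- weighted.mean(group_means, group_sizes)
res3
# [1] 4.25
```

Applied to Approach 2, this would mean carrying a per-hour observation count through the first summarise (e.g. n = n()) and using weighted.mean(mad, n) in the monthly step.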

Replacement of missing day and month in dates using R

This question is about how to replace missing days and months in dates in a data frame using R. In the data frame below, 99 denotes a missing day or month and NA represents dates that are completely unknown.
df <- data.frame(
  id   = c(1, 2, 3, 4, 5),
  date = c("99/10/2014", "99/99/2011", "23/02/2016", "NA", "99/04/2009")
)
I am trying to replace the missing days and months based on the following criteria:
For dates with missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). Example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Of note is the varying number of days for different months, e.g. 31 days for October and 30 days for April.
As in the case of id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first to last day of the year), e.g. 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)   # day is missing
selm <- grepl("^../99", df$date) # month is missing
md <- seld & (!selm)             # day missing, month known
mm <- seld & selm                # both day and month missing
# replace the 99s with 01 so the strings parse as dates
df$date <- as.Date(gsub("99", "01", as.character(df$date)), format = "%d/%m/%Y")
# length in days of each affected month
monrng <- sapply(df$date[md], function(x) seq(x, length.out = 2, by = "month")[2]) -
  as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1)
# length in days of each affected year
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out = 2, by = "12 months")[2]) -
  as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1)

df
#  id       date   dateorig
#1  1 2014-10-14 99/10/2014
#2  2 2011-02-05 99/99/2011
#3  3 2016-02-23 23/02/2016
#4  4       <NA>         NA
#5  5 2009-04-19 99/04/2009
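If you prefer not to derive the interval lengths with seq(), lubridate::days_in_month gives the month length directly. A rough alternative sketch (the helper impute_date and this sampling scheme are my own, not from the answer above; note that for a fully missing month it samples a month first and then a day within it, which is not exactly uniform over the days of the year):

```r
library(lubridate)

df <- data.frame(
  id   = c(1, 2, 3, 4, 5),
  date = c("99/10/2014", "99/99/2011", "23/02/2016", "NA", "99/04/2009"),
  stringsAsFactors = FALSE
)

impute_date <- function(d) {
  if (is.na(d) || d == "NA") return(d)    # completely unknown: leave as-is
  parts <- strsplit(d, "/")[[1]]          # day / month / year
  if (parts[2] == "99") {                 # month unknown: sample one
    parts[2] <- sprintf("%02d", sample(12, 1))
  }
  if (parts[1] == "99") {                 # day unknown: sample a valid day
    first <- as.Date(paste(parts[3], parts[2], "01", sep = "-"))
    parts[1] <- sprintf("%02d", sample(days_in_month(first), 1))
  }
  paste(parts, collapse = "/")
}

set.seed(999)
df$imputed <- vapply(df$date, impute_date, character(1), USE.NAMES = FALSE)
```

Complete dates and "NA" entries pass through unchanged, and sampled days always respect the month length (30 for April, 31 for October, and so on).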

Calculate mean of one column for 14 rows before certain row, as identified by date for each group (year)

I would like to calculate the mean of Mean.Temp.c before a certain date, such as the 1963-03-23 shown in the date2 column in this example. This is when peak snowmelt runoff occurred in 1963 in my area. I want the mean temperature over the 10 days before this date (i.e., before 1963-03-23). How can I do this? I have 50 years of data, and the peak snowmelt date differs each year.
example data
You can try:
library(dplyr)
df %>%
  mutate(date2 = as.Date(as.character(date2)),
         ten_day_mean = mean(Mean.Temp.c[between(date2,
                                                 as.Date("1963-03-14"),
                                                 as.Date("1963-03-23"))]))
In this case the desired mean would populate the whole column.
Or with data.table:
library(data.table)
setDT(df)[between(as.Date(as.character(date2)), "1963-03-14", "1963-03-23"), ten_day_mean := mean(Mean.Temp.c)]
In the latter case you'd get NA for those days that are not relevant for your date range.
Supposing date2 is a Date field and your data.frame is called x:
start_date <- as.Date("1963-03-23")-10
end_date <- as.Date("1963-03-23")
mean(x$Mean.Temp.c.[x$date2 >= start_date & x$date2 <= end_date])
Now, if you have multiple years of interest, you could wrap this code within a for loop (or [s|l]apply) taking elements from a vector of dates.
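Instead of a loop over a vector of dates, another option is to hold one peak date per year in a small table and compute each window from it. A sketch under that assumption (temps, peaks, and their contents are invented for illustration; on real data the peaks table would list all 50 peak dates):

```r
library(dplyr)

# Invented daily temperatures around two peak dates
temps <- data.frame(
  date2       = as.Date(c("1963-03-20", "1963-03-21", "1963-03-22",
                          "1964-03-25", "1964-03-26", "1964-03-27")),
  Mean.Temp.c = c(1, 2, 3, 4, 5, 6)
)

# One peak snowmelt date per year
peaks <- data.frame(
  year = c(1963, 1964),
  peak = as.Date(c("1963-03-23", "1964-03-28"))
)

# For each year, average the temperatures in the 10 days up to the peak
ten_day_means <- peaks %>%
  rowwise() %>%
  mutate(ten_day_mean = mean(temps$Mean.Temp.c[
    temps$date2 >= peak - 10 & temps$date2 <= peak
  ])) %>%
  ungroup()
```

With the toy data this yields 2 for 1963 (mean of 1, 2, 3) and 5 for 1964 (mean of 4, 5, 6), one row per year rather than a repeated column.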

averaging by months with daily data [duplicate]

This question already has answers here:
Get monthly means from dataframe of several years of daily temps
(3 answers)
Closed 5 years ago.
I have daily data in a matrix with 6 columns: Years, months, days, SSTs, anoms, missing.
I want to calculate the monthly average of SST for each year (for example, for September 1981 the average of the SST values of all days in that month), and the same for all the years. I have tried to get my code working but am unable to do so.
You should use the dplyr package in R. For this, we will call your data df:
require(dplyr)

df.mnths <- df %>%
  group_by(Years, months) %>%
  summarise(Mean.SST = mean(SSTs))

df.years <- df %>%
  group_by(Years) %>%
  summarise(Mean.SST = mean(SSTs))
These are two new data sets: df.mnths holds the mean SST for each month of each year, and df.years holds the mean SST for each year.
In terms of data.table you can perform the following action
library(data.table)
dt[, average_sst := mean(SSTs), by = .(years,months)]
adding an extra column average_sst.
Just suppose that your data is stored in a data.frame named "data":
years months SSTs
1 1981 1 -0.46939368
2 1981 1 0.03226932
3 1981 1 -1.60266798
4 1981 1 -1.53095676
5 1981 1 1.71177023
6 1981 1 -0.61309846
# monthly means as a year x month matrix
tapply(data$SSTs, list(data$years, data$months), mean)
# yearly means
tapply(data$SSTs, factor(data$years), mean)

Writing a running average of last 50 scores by name and date in R [duplicate]

This question already has answers here:
Rolling mean (moving average) by group/id with dplyr
(4 answers)
Closed 6 years ago.
I have a csv file of every golf score recorded by player, tournament, and date, and I want to create a column that calculates the running average of the past 50 scores BY player, using the date field to order them. It needs to be a running average per player.
example: Table
PLAYER,SCORE,Tournament,ROUNDDATE,Observation,ID
Matthew Fitzpatrick,60,KLM Open,42258,1,1
Jaco Van Zyl,61,Turkish Airlines Open,42306,1,2
Paul Lawrie,61,KLM Open,42257,1,3
Wade Ormsby,61,KLM Open,42257,1,4
Callum Shinkwin,62,Shenzhen International,42483,1,5
Danny Willett,62,Omega European Masters,42209,1,6
Joakim Lagergren,62,Alfred Dunhill Links Championship,42280,1,7
I have tried this code, but it just produces the exact same values and not an average of anything:
get.mav <- function(bp, n = 50){
  require(zoo)
  if (is.na(bp[1])) bp[1] <- mean(bp, na.rm = TRUE)
  bp <- na.locf(bp, na.rm = FALSE)
  if (length(bp) < n) return(bp)
  c(bp[1:(n - 1)], rollapply(bp, width = n, mean, align = "right"))
}

test <- with(test, test[order(PLAYER, ROUNDDATE), ])
test$SCORE_UPDATED <-
  unlist(aggregate(SCORE ~ PLAYER, test, get.mav, na.action = NULL, n = 50)$SCORE)
test
You can just arrange by date and then take the mean of the most recent 50 scores. Here is a quick example:
# Sample data
dat <- data.frame(date = seq(as.Date("2000/1/1"), by = "day", length.out = 365),
                  score = round(rnorm(365, 70, 5)))

# Arrange newest-first and take the mean of the most recent 50 observations
dat <- dat[order(dat$date, decreasing = TRUE), ]
mean(dat$score[1:50])
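Note that this gives a single number, while the question asks for a running average per player; dplyr grouping plus zoo::rollapplyr is a more direct route for that. A sketch with invented toy data and a 3-round window so the small sample shows the effect (use 50 on the real data):

```r
library(dplyr)
library(zoo)

# Toy data: two players, one row per round, already numeric round order
scores <- data.frame(
  PLAYER    = rep(c("A", "B"), each = 4),
  ROUNDDATE = rep(1:4, times = 2),
  SCORE     = c(70, 72, 68, 74, 65, 67, 69, 71)
)

n <- 3  # window size; use 50 for "last 50 scores"
running <- scores %>%
  arrange(PLAYER, ROUNDDATE) %>%
  group_by(PLAYER) %>%
  # right-aligned rolling mean; partial = TRUE averages whatever
  # history exists before a player has n rounds
  mutate(run_avg = rollapplyr(SCORE, n, mean, partial = TRUE)) %>%
  ungroup()
```

Each player's column restarts with their own history: player A's fourth value is mean(72, 68, 74), unaffected by player B's scores.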
