How to handle NAs when averaging aggregated datasets

In my study, every individual is in one dataset. It is time series data, so every row covers an equal amount of time. I have three different groups, and I want to average all datasets that belong to one group. In the end, I want one dataset per group where every row is one hour and each cell holds the average of the group at that time point. Now, the problem is that my dataset has a lot of missing values. I have two methods for averaging the values and aggregating them by hour.
This is what the dataset of one individual looks like (the dataset has more rows than shown below):
DateTime V2
1: 2018-01-01 20:38:00 2.346598
2: 2018-01-01 20:42:00 NA
3: 2018-01-01 20:46:00 NA
4: 2018-01-01 20:50:00 6.000000
5: 2018-01-01 20:54:00 5.234660
6: 2018-01-01 20:58:00 6.132660
I used two methods to do this.
Method one:
I first averaged every row across the two datasets and then aggregated the averaged dataset by hour.
daxy <- bind_rows(dx, dy) %>%
  group_by(DateTime) %>%
  summarise_all(funs(mean(., na.rm = TRUE))) # average the two datasets
daxy.1 <- melt(as.data.frame(daxy), id = c("DateTime")) # melt the data into the right format
daxy.2 <- aggregate(daxy.1$value,
                    by = list(format(daxy.1$DateTime, "%Y-%m-%d %H"), variable = daxy.1$variable),
                    FUN = mean, na.rm = TRUE) # aggregate all values by hour and compute the mean for every hour
Method two:
For every individual dataset, I first aggregate the dataset (calculate the mean for every hour) and then average those aggregated datasets.
dx.1 <- melt(as.data.frame(dx), id = c("DateTime"))
dx.2 <- aggregate(dx.1$value,
                  by = list(format(dx.1$DateTime, "%Y-%m-%d %H"), variable = dx.1$variable),
                  FUN = mean, na.rm = TRUE) # aggregate individual X by hour
dy.1 <- melt(as.data.frame(dy), id = c("DateTime"))
dy.2 <- aggregate(dy.1$value,
                  by = list(format(dy.1$DateTime, "%Y-%m-%d %H"), variable = dy.1$variable),
                  FUN = mean, na.rm = TRUE) # aggregate individual Y by hour
daxy.3 <- bind_rows(dx.2, dy.2) %>%
  group_by(variable, Group.1) %>%
  summarise_all(funs(mean(., na.rm = TRUE))) # average aggregated individuals X and Y
Now I would expect that daxy.2 and daxy.3 have the same averaged values per hour. But this is the result:
head(daxy.2)
Group.1 variable x
1 2018-01-01 20 V2 3.666548
2 2018-01-01 21 V2 5.543472
head(daxy.3)
variable Group.1 x
1 V2 2018-01-01 20 3.732948
2 V2 2018-01-01 21 6.409164
I know this discrepancy is due to the missing values; if I replace all missing values by 0, the two outcomes are exactly the same.
My question is: which of these two methods is right? First average every individual dataset of one group and then aggregate per hour, or first aggregate every individual dataset per hour and then average per group?
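To see why the two orders disagree, here is a minimal sketch with toy numbers (not taken from my data): within one hour, individual X has one valid reading and individual Y has three.
x <- c(2, NA, NA)  # individual X: three rows in one hour, two missing
y <- c(4, 6, 8)    # individual Y: same three rows, complete
# Method one: average across individuals per row, then average the hour
mean(rowMeans(cbind(x, y), na.rm = TRUE))  # mean(c(3, 6, 8)) = 5.666667
# Method two: hourly mean per individual, then average the individuals
mean(c(mean(x, na.rm = TRUE), mean(y)))    # mean(c(2, 6)) = 4
Method one lets Y dominate the rows where X is missing, while method two gives X's single reading the same weight as Y's three readings.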

I am not completely sure I understand the problem, so here is what I have done; please feel free to not consider this an answer.
First, if you want to average by hour and by group for V2, V3 and V4, you should rbind all the dataframes, just as you have done. Then, try this:
library(tidyverse)
library(reshape2)
daverage.1 <- melt(daverage, id.vars = "DateTime")
daverage.2 <- aggregate(value ~ format(DateTime, "%Y-%m-%d %H") + variable, daverage.1,
                        FUN = mean, na.rm = TRUE)
daverage.3 <- daverage.1 %>%
  mutate(DateHour = format(DateTime, "%Y-%m-%d %H")) %>%
  group_by(DateHour, variable) %>%
  summarise(value = mean(value, na.rm = TRUE))
all.equal(as.data.frame(daverage.2), as.data.frame(daverage.3))
#[1] "Names: 1 string mismatch"
As you can see, both methods produce equal mean values; only one of the column names differs.
As for the different results you are getting: it seems you are averaging first by hour and then using that result to average by groups of V*. That is not at all the same thing. Use the code above and the results will be the ones you want.
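If the goal is simply one mean per hour with every non-missing reading weighted equally, a third option (a sketch, assuming dx and dy as in the question) is to pool all raw rows first and aggregate once:
library(dplyr)
bind_rows(dx, dy) %>%
  mutate(hour = format(DateTime, "%Y-%m-%d %H")) %>%
  group_by(hour) %>%
  summarise(V2 = mean(V2, na.rm = TRUE))
This pooled hourly mean matches neither of your two methods exactly when values are missing, but it weights each valid reading equally, which is often what is wanted.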

Calculating MAD in two different ways in R return different results

(I have posted a similar question at Cross Validated, but I believe this is more fitting for Stack Overflow).
I have a large dataframe data with the following columns:
date time orig new
2001-01-01 00:30:00 345 856
2001-01-01 00:32:43 4575 9261
2001-01-01 00:51:07 6453 2352
...
2001-01-01 23:57:51 421 168
2001-01-02 00:06:14 5612 3462
...
2001-01-31 23:49:11 14420 8992
2001-02-01 00:04:32 213 521
...
I want to calculate the monthly aggregated MAD, which can be calculated as mean(abs(orig - new)) when grouped by month. Ideally, at the end, I want the solution (a dataframe) in the following form:
month mad
2001-01-01 7452.124
2001-02-01 3946.734
2001-03-01 995.938
...
I calculated the monthly MAD in two different ways.
Approach 1
I grouped data by month and took the mean of the absolute differences (which is the "mathematical" way to do it, as I explained):
data %>%
  group_by(
    month = lubridate::floor_date(date, 'month')
  ) %>%
  summarise(mad = mean(abs(orig - new)))
Approach 2
I grouped data by hour to get the hourly MAD, then re-grouped it by month and took an average. This is counter-intuitive, but I use the hourly grouped dataframe for other analyses and tried to compute the monthly MAD from it directly.
data_grouped_by_hour <- data %>%
  group_by(
    day = lubridate::floor_date(date, 'day'),
    hour = as.POSIXlt(time)$hour
  ) %>%
  summarise(mad = mean(abs(orig - new)))
data_grouped_by_hour %>%
  group_by(
    month = lubridate::floor_date(day, 'month')
  ) %>%
  summarise(mad = mean(mad))
As hinted at in the post title, these approaches return different values. I assume my first approach is correct, as it is more concise and follows the definition directly, but I wonder why the second approach does not return the same value.
I want to note that I would prefer Approach 2, so that I don't have to make separate tables for every analysis with a different time unit. Any insights are appreciated.
Because an average of averages is not the same as the overall average.
This is a common misconception. Let's try to understand with the help of an example.
Consider a list with 2 elements a and b
x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))
Now, similar to your question, we will take the average in two ways.
Average of all the values of x
res1 <- mean(unlist(x))
res1
#[1] 4.25
Average of each element separately, and then the average of those averages.
sapply(x, mean)
# a b
#3.833333 5.500000
res2 <- mean(sapply(x, mean))
res2
#[1] 4.666667
Notice that res1 and res2 have different values because the second case is an average of averages.
The same logic applies in your case: taking the hourly average and then the monthly average of those values is an average of averages, so hours with few observations count as much as hours with many. A weighted fix is sketched below.
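If you want to keep the hourly table, a weighted mean recovers the exact pooled result. A sketch (assuming the same data, date and time columns as in the question, and dplyr >= 1.0 for the .groups argument): carry the group sizes from the hourly step and weight by them in the monthly step.
library(dplyr)
library(lubridate)
data_grouped_by_hour <- data %>%
  group_by(
    day  = floor_date(date, 'day'),
    hour = as.POSIXlt(time)$hour
  ) %>%
  summarise(mad = mean(abs(orig - new)), n = n(), .groups = "drop")
data_grouped_by_hour %>%
  group_by(month = floor_date(day, 'month')) %>%
  summarise(mad = weighted.mean(mad, n))
This matches Approach 1 exactly, because the pooled monthly mean equals the count-weighted mean of the hourly means.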

Plot data over time in R

I'm working with a dataframe including the columns 'timestamps' and 'amount'. The data can be produced like this:
sample_size <- 40
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-03 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
Now I'd like to plot the sum of the amount entries for some timeframe (like every hour, 30 min, 20 min). The final plot would look like a histogram of the timestamps, but instead of counting how many timestamps fall into each timeframe, it should sum the amount that falls into it.
How can I approach this? I could create an extra vector with the amount for each timeframe, but I don't know how to proceed.
Also, I'd like to add a feature to reduce by hour, such that just one day is plotted (notice the range between start_date and end_date is two days) and each timeframe (let's say every hour) shows the amount of data located in that hour. In this case the data
2020-01-01 13:03:00 5
2020-01-02 13:21:00 10
2020-01-02 13:38:00 1
2020-01-01 13:14:00 3
would produce a bar of height sum(5, 10, 1, 3) = 19 in the timeframe 13:00-14:00. How can I implement the plotting so that I can easily switch between these two modes (plot all days / plot just one day and reduce)?
EDIT: Following the advice of #Gregor Thomas I added a grouping column like this:
df$time_group <- lubridate::floor_date(df$timestamps, unit="20 minutes")
Now I'm wondering how to ignore the dates and thus reduce by 20-minute frame (independent of the date).
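A sketch of one way to do both modes (assuming dplyr, lubridate and ggplot2 are acceptable here): build a grouping column, sum amount per bin, and draw the sums with geom_col.
library(dplyr)
library(lubridate)
library(ggplot2)
# Mode 1: sum per 20-minute bin across the full date range
df %>%
  mutate(time_group = floor_date(timestamps, "20 minutes")) %>%
  group_by(time_group) %>%
  summarise(amount = sum(amount)) %>%
  ggplot(aes(time_group, amount)) +
  geom_col()
# Mode 2: reduce over days by keeping only the time of day of each bin
df %>%
  mutate(tod = format(floor_date(timestamps, "20 minutes"), "%H:%M")) %>%
  group_by(tod) %>%
  summarise(amount = sum(amount)) %>%
  ggplot(aes(tod, amount)) +
  geom_col()
Switching between the modes is then just a matter of which grouping column is built.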

Replacement of missing day and month in dates using R

This question is about how to replace missing days and months in dates in a data frame using R. In the data frame below, 99 denotes a missing day or month, and NA represents dates that are completely unknown.
df <- data.frame("id" = c(1, 2, 3, 4, 5),
                 "date" = c("99/10/2014", "99/99/2011", "23/02/2016", "NA",
                            "99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with a missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). For example, for id 1 the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014, and for id 5 from the middle of 01/04/2009 to 30/04/2009. Note the varying number of days in different months, e.g. 31 days for October and 30 days for April.
As in the case of id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g. 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried using the seq function together with the as.POSIXct and as.Date functions to obtain the sequences of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (they vary across ids) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)   # day is missing
selm <- grepl("^../99", df$date) # month is missing
md <- seld & (!selm)             # only the day is missing
mm <- seld & selm                # both day and month are missing
# set missing day/month to 01 so the string parses as a date
df$date <- as.Date(gsub("99", "01", as.character(df$date)), format = "%d/%m/%Y")
# number of days in each affected month
monrng <- sapply(df$date[md], function(x) seq(x, length.out = 2, by = "month")[2]) -
  as.numeric(df$date[md])
df$date[md] <- df$date[md] + sapply(monrng, sample, 1)
# number of days in each affected year
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out = 2, by = "12 months")[2]) -
  as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1)
#df
# id date dateorig
#1 1 2014-10-14 99/10/2014
#2 2 2011-02-05 99/99/2011
#3 3 2016-02-23 23/02/2016
#4 4 <NA> NA
#5 5 2009-04-19 99/04/2009
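A rough lubridate-based alternative for the month case (a sketch, not tested against every edge case): days_in_month() returns the interval length directly, so a uniform draw within the month becomes a one-liner.
library(lubridate)
d <- as.Date("2014-10-01")          # "99/10/2014" with the day set to 01
d + sample(days_in_month(d), 1) - 1 # a random day within October 2014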

Mean Returns in Time Series - Restarting after NA values

Has anyone calculated historical mean log returns in a time series dataset?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between returns for differing securities. My idea is to calculate a historical mean that restarts after each NA that appears.
A simple cumsum() probably will not do it, as the NAs will have to be dropped.
I thought about using rollmean(), if only I knew an efficient way to set the 'width' parameter to the number of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes far too much time given the size of the dataset I am working with.
For any x of the form x : [r(1) r(2) ... r(N)], where r(2) is the log return in period 2:
df <- data.frame(x, zcount = NA)
df[1, 2] <- 0 # df$x[1] = NA by construction of the data set
for (i in 2:nrow(df)) {
  df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i - 1] + 1, 0)
}
Any idea how to speed this up would be highly appreciated!
You will need to reshape the data.frame to apply the cumsum function
over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months which I think corresponds to your description of the data set
securities <- 100
months <- 100
time <- seq.Date(as.Date("2010/1/1"), by = "months", length.out = months)
ID <- rep(paste0("sec", 1:securities), each = months)
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, you must reshape your data so that each security column contains its
returns, and each row represents the date.
library(tidyr)
df_wide <- spread(df, ID, returns)
Once this is done, you can use the apply function to sum every column which now represents each security. Or use the cumsum function. Notice the data object df_wide[-1], which drops the time column. This is necessary to avoid the sum or cumsum functions throwing an error.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
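As a complement, the restart-after-NA counter from the question's loop can be vectorized in base R. A minimal sketch on a toy vector (not the real data): cumsum(is.na(x)) labels each run between NAs, and running sums divided by running counts give the historical mean that restarts after every NA.
x <- c(NA, 1, 2, 3, NA, 4, 6)
run_id <- cumsum(is.na(x)) # each NA starts a new run
num <- ave(ifelse(is.na(x), 0, x), run_id, FUN = cumsum)  # running sums per run
den <- ave(as.numeric(!is.na(x)), run_id, FUN = cumsum)   # running counts of non-NAs (the zcount from the loop)
num / den # NaN at the NA positions themselves
# [1] NaN 1.0 1.5 2.0 NaN 4.0 5.0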

grouping by date and treatment in R

I have a time series that looks at how caffeine impacts test scores. On each day, the first test is used to measure a baseline score for the day, and the second score is the effect of a treatment.
Post Caffeine Score Time/Date
yes 10 3/17/2014 17:58:28
no 9 3/17/2014 23:55:47
no 7 3/18/2014 18:50:50
no 10 3/18/2014 23:09:03
Some days have a caffeine treatment, others do not. Here's the question: how do I group observations by day and create a measure of impact by subtracting the day's first score from the second?
I'm going to be using these groupings for later graphs and analysis, so I think it's most efficient if there's a way to create objects that look at the improvement in score each day and group by whether caffeine (the treatment) was used.
Thank you for your help!
First make a column for the day:
df$day = strftime(strptime(df$'Time/Date', "%m/%d/%Y %H:%M:%S"), format="%Y-%m-%d") # parse the "m/d/Y H:M:S" strings first
then I think what you're after is two aggregates:
1) To find if the day had caffeine
dayCaf = aggregate(df$Caffeine~df$day, FUN=function(x) ifelse(length(which(grepl("yes",x)))>0,1,0))
2) To calculate the difference in scores
dayDiff = aggregate(df$Score~df$day, FUN=function(x) x[2]-x[1])
Now put the two together
out = merge(dayCaf, dayDiff, by='df$day')
That gives:
df$day df$caff df$score
1 2014-03-17 1 -1
2 2014-03-18 0 3
The whole code is:
df$day = strftime(strptime(df$'Time/Date', "%m/%d/%Y %H:%M:%S"), format="%Y-%m-%d")
dayCaf = aggregate(df$Caffeine~df$day, FUN=function(x) ifelse(length(which(grepl("yes",x)))>0,1,0))
dayDiff = aggregate(df$Score~df$day, FUN=function(x) x[2]-x[1])
out = merge(dayCaf, dayDiff, by='df$day')
Just replace "df" with the name of your frame and it should work.
Alternatively:
DF <- data.frame(Post.Caffeine = c("Yes","No","No","No"),Score=c(10,9,7,10),Time.Date=c("3/17/2014 17:58:28","3/17/2014 23:55:47","3/18/2014 18:50:50", "3/18/2014 23:09:03"))
DF$Time.Date <- as.Date(DF$Time.Date,format="%m/%d/%Y")
DF2 <- setNames(aggregate(Score~Time.Date,DF,diff),c("Date","Diff"))
DF2$PC <- DF2$Date %in% DF$Time.Date[DF$Post.Caffeine=="Yes"]
DF2
EDIT: This assumes that your data is in the order that you demonstrate.
data.table solution. The order part sorts your data first (if it is already sorted, you can remove the order part; just leave the comma in place). The advantage of this approach is that you do the whole process in one line, and it is fast too.
library(data.table)
setDT(temp)[order(as.POSIXct(strptime(`Time/Date`, "%m/%d/%Y %H:%M:%S"))),
            list(HadCaffeine = if (any(PostCaffeine == "yes")) "yes" else "no",
                 Score = diff(Score)),
            by = as.Date(strptime(`Time/Date`, "%m/%d/%Y"))]
##       as.Date HadCaffeine Score
## 1: 2014-03-17         yes    -1
## 2: 2014-03-18          no     3
This solution assumes temp is your data set and PostCaffeine instead of Post Caffeine as the variable name (it is bad practice in R to put spaces or / in variable names, as it limits what you can do with them).
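For comparison, a rough dplyr equivalent (a sketch under the same assumptions: a temp data set with PostCaffeine, Score and Time/Date columns, and exactly two tests per day):
library(dplyr)
library(lubridate)
temp %>%
  mutate(dt = mdy_hms(`Time/Date`)) %>%  # parse the "m/d/Y H:M:S" strings
  arrange(dt) %>%                        # make sure tests are in time order
  group_by(day = as.Date(dt)) %>%
  summarise(HadCaffeine = if (any(PostCaffeine == "yes")) "yes" else "no",
            Score = diff(Score))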
