Grouping by date and treatment in R

I have a time series that looks at how caffeine impacts test scores. On each day, the first test measures a baseline score for the day, and the second test measures the effect of the treatment.
Post Caffeine   Score   Time/Date
yes             10      3/17/2014 17:58:28
no               9      3/17/2014 23:55:47
no               7      3/18/2014 18:50:50
no              10      3/18/2014 23:09:03
Some days have a caffeine treatment, others don't. Here's my question: how do I group the observations by day and create a measure of impact by subtracting each day's first (baseline) score from its second?
I'm going to be using these groupings for later graphs and analysis, so I think it's most efficient to create objects that capture the improvement in score on each day and group by whether caffeine (the treatment) was used.
Thank you for your help!

First make a column for the day:
df$day <- strftime(strptime(df$`Time/Date`, "%m/%d/%Y %H:%M:%S"), format = "%Y-%m-%d")
then I think what you're after is two aggregates:
1) To find if the day had caffeine
dayCaf <- aggregate(df$Caffeine ~ df$day, FUN = function(x) as.integer(any(grepl("yes", x))))
2) To calculate the difference in scores
dayDiff <- aggregate(df$Score ~ df$day, FUN = function(x) x[2] - x[1])
Now put the two together
out <- merge(dayCaf, dayDiff, by = 'df$day')
That gives:
      df$day df$Caffeine df$Score
1 2014-03-17           1       -1
2 2014-03-18           0        3
The whole code is:
df$day <- strftime(strptime(df$`Time/Date`, "%m/%d/%Y %H:%M:%S"), format = "%Y-%m-%d")
dayCaf <- aggregate(df$Caffeine ~ df$day, FUN = function(x) as.integer(any(grepl("yes", x))))
dayDiff <- aggregate(df$Score ~ df$day, FUN = function(x) x[2] - x[1])
out <- merge(dayCaf, dayDiff, by = 'df$day')
Just replace "df" with the name of your frame (and Caffeine with your caffeine column's name) and it should work.

Alternatively:
DF <- data.frame(Post.Caffeine = c("Yes", "No", "No", "No"),
                 Score = c(10, 9, 7, 10),
                 Time.Date = c("3/17/2014 17:58:28", "3/17/2014 23:55:47",
                               "3/18/2014 18:50:50", "3/18/2014 23:09:03"))
DF$Time.Date <- as.Date(DF$Time.Date,format="%m/%d/%Y")
DF2 <- setNames(aggregate(Score~Time.Date,DF,diff),c("Date","Diff"))
DF2$PC <- DF2$Date %in% DF$Time.Date[DF$Post.Caffeine=="Yes"]
DF2
EDIT: This assumes that your data is in the order that you demonstrate.

A data.table solution. The order(...) part sorts your data first (if it is already sorted, you can remove it, but leave the comma in place). The advantage of this approach is that you do the whole process in one line, and it will be fast too.
library(data.table)
setDT(temp)[order(as.POSIXct(strptime(`Time/Date`, "%m/%d/%Y %H:%M:%S"))),
            list(HadCaffeine = if (any(PostCaffeine == "yes")) "yes" else "no",
                 Score = diff(Score)),
            by = as.Date(strptime(`Time/Date`, "%m/%d/%Y"))]
## as.Date HadCaffeine Score
## 1: 2014-03-17 yes -1
## 2: 2014-03-18 no 3
This solution assumes temp is your data set and PostCaffeine instead of Post Caffeine as the variable name (it is bad practice in R to put spaces or / in variable names, as it limits what you can do with them).
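If you prefer dplyr, here is a minimal sketch of the same idea (an extra illustration, not taken from the answers above); it assumes the raw frame is called df, keeps the original column names (hence the backticks), and that every day has exactly two tests in time order:
library(dplyr)
df %>%
  mutate(ts  = as.POSIXct(`Time/Date`, format = "%m/%d/%Y %H:%M:%S"),  # parse the timestamp
         day = as.Date(ts)) %>%
  arrange(ts) %>%                                                      # make sure tests are in time order
  group_by(day) %>%
  summarise(caffeine    = as.integer(any(`Post Caffeine` == "yes")),   # 1 if the day had a caffeine test
            improvement = Score[2] - Score[1])                         # second score minus the baseline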

Related

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr:
library(dplyr)
df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%   # convert from character to a date-time first
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(MeanDifference = mean(Difference, na.rm = TRUE))
Due to the grouping, the first difference for any customer will be NA (including customers with only one visit), so those values are dropped by na.rm = TRUE when taking the mean.
Using base R (no extra packages):
1) Sort the data, ordering by customer ID, then by timestamp.
2) Calculate the time difference between consecutive rows (using the diff() function), grouping by customer ID (tapply() does the grouping).
3) Find the averages.
4) Squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
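Since lubridate came up, here is a minimal parsing sketch; the formats below are assumptions for illustration, not taken from the real data:
library(lubridate)
# if the strings look like "2019-12-31 12:13:25", ymd_hms() returns POSIXct,
# which lag()/diff() can subtract directly
df$Timestamp <- ymd_hms(df$Timestamp)
# or, if the real column mixes formats, let parse_date_time() try several candidate orders
df$Timestamp <- parse_date_time(df$Timestamp, orders = c("ymd HMS", "mdy HMS"))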

How to handle NA's when averaging aggregated datasets

In my study, every individual has its own dataset. It is time series data, so every row covers an equal amount of time. There are three different groups, and I want to average all datasets that belong to one group. In the end I want one dataset per group in which every row is one hour and each cell is the group average at that time point. The problem is that my data has a lot of missing values. I have two methods for averaging the values and aggregating them by hour.
This is how the dataset looks like of one individual (dataset has more rows than indicated below):
DateTime V2
1: 2018-01-01 20:38:00 2.346598
2: 2018-01-01 20:42:00 NA
3: 2018-01-01 20:46:00 NA
4: 2018-01-01 20:50:00 6.000000
5: 2018-01-01 20:54:00 5.234660
6: 2018-01-01 20:58:00 6.132660
I used two methods to do this.
Method one:
I first averaged every row across the two datasets and then aggregated the averaged dataset by hour.
daxy<-bind_rows(dx,dy) %>%
group_by(DateTime) %>%
summarise_all(funs(mean(., na.rm = TRUE))) #average the two datasets
daxy.1 <- melt(as.data.frame(daxy), id=c("DateTime")) #melt the data in right format
daxy.2 <- aggregate(daxy.1$value, by=list(format(daxy.1$DateTime, "%Y-%m-%d %H"),variable=daxy.1$variable),
FUN=mean,na.rm = TRUE) #Aggregate all values by hour and calculate the mean for every hour
Method two:
For every individual dataset I first aggregate the dataset (calculate the mean for every hour) and then average those aggregated datasets.
dx.1 <- melt(as.data.frame(dx), id=c("DateTime"))
dx.2 <- aggregate(dx.1$value, by=list(format(dx.1$DateTime, "%Y-%m-%d %H"),variable=dx.1$variable),
FUN=mean,na.rm = TRUE) #Aggregate individual X by hour
dy.1 <- melt(as.data.frame(dy), id=c("DateTime"))
dy.2 <- aggregate(dy.1$value, by=list(format(dy.1$DateTime, "%Y-%m-%d %H"),variable=dy.1$variable),
FUN=mean,na.rm = TRUE) #Aggregate individual Y by hour
daxy.3 <-bind_rows(dx.2,dy.2) %>%
group_by(variable,Group.1) %>%
summarise_all(funs(mean(., na.rm = TRUE))) #Average aggregated individuals X and Y
Now I would expect that daxy.2 and daxy.3 have the same averaged values per hour. But this is the result:
head(daxy.2)
Group.1 variable x
1 2018-01-01 20 V2 3.666548
2 2018-01-01 21 V2 5.543472
head(daxy.3)
variable Group.1 x
1 V2 2018-01-01 20 3.732948
2 V2 2018-01-01 21 6.409164
I know this discrepancy is due to the missing values. If I replace all missing values by 0 then the outcome is exactly the same.
My question is which of these two methods is right. First average every individual dataset of one group and then aggregate it per hour. Or first aggregate every individual dataset per hour and then average the dataset per group?
I am not completely understanding the problem so here is what I have done. Please feel free to not consider this an answer.
First, if you want to average by hour and by the V2, V3 and V4 variables, you should rbind all the data frames, just as you have done. Then try this:
library(tidyverse)
library(reshape2)
daverage.1 <- melt(daverage, id.vars = "DateTime")
daverage.2 <- aggregate(value ~ format(DateTime, "%Y-%m-%d %H") + variable, daverage.1,
                        FUN = mean, na.rm = TRUE)
daverage.3 <- daverage.1 %>%
  mutate(DateHour = format(DateTime, "%Y-%m-%d %H")) %>%
  group_by(DateHour, variable) %>%
  summarise(value = mean(value, na.rm = TRUE))
all.equal(as.data.frame(daverage.2), as.data.frame(daverage.3))
#[1] "Names: 1 string mismatch"
As you can see, both methods produce equal mean values; only one of the column names differs.
As for the different results you are getting: it seems you are averaging by hour first and then using that result to average across the V* groups, which is not the same thing. Use the code above and you will get the results you want.
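To see why the two orders can disagree once NAs appear, here is a tiny made-up illustration (not from the question's data); the NA changes how many values each intermediate mean is based on, so the two stages weight the individuals differently:
x <- c(2, 6)    # individual X: two readings in the hour
y <- c(NA, 10)  # individual Y: the first reading is missing
# method one: average across individuals per timestamp, then average the hour
mean(c(mean(c(x[1], y[1]), na.rm = TRUE),   # first timestamp: only X contributes -> 2
       mean(c(x[2], y[2]), na.rm = TRUE)))  # second timestamp: (6 + 10) / 2 -> 8
## [1] 5
# method two: hourly mean per individual first, then average the individuals
mean(c(mean(x, na.rm = TRUE), mean(y, na.rm = TRUE)))   # (4 + 10) / 2
## [1] 7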

Mean Returns in Time Series - Restarting after NA values

Has anyone encountered calculating historical mean log returns in time series datasets?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between returns for differing securities. My idea is to calculate a historical mean that restarts after each NA that appears.
A simple cumsum() probably will not do it, as the NAs will have to be dropped.
I thought about using rollmean(), if I only knew an efficient way to specify the 'width' parameter to the length of the vector of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes significantly too much time, given the size of the data set I am working with.
For any x of the form x : [r(1) r(2) ... r(N)], where r(2) is the log return in period 2:
df <- data.frame(x, zcount = NA)
df[1, 2] <- 0   # df$x[1] = NA by construction of the data set
for (i in 2:nrow(df)) {
  df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i - 1] + 1, 0)
}
Any idea how to speed this up would be highly appreciated!
You will need to reshape the data.frame to apply the cumsum function
over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months which I think corresponds to your description of the data set
securities <- 100
months <- 100
time <- seq.Date(as.Date("2010/1/1"), by = "months", length.out = months)
ID <- rep(paste0("sec", 1:securities), each = months)
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, reshape your data so that each security has its own column of returns and each row represents a date.
library(tidyr)
df_wide <- spread(df, ID, returns)
Once this is done, you can use the apply function to apply sum over every column, each of which now represents one security, or use the cumsum function for a running total. Notice the data object df_wide[-1], which drops the time column; this is necessary to stop sum and cumsum from throwing an error on the date column.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
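As a side note on the speed question itself, the zcount loop in the question can be vectorized; a minimal sketch, assuming (as the question states) that x is the full return vector in order, that NAs mark the break between securities, and that x starts with an NA:
grp <- cumsum(is.na(x))                                  # run id: bumps at every NA
zcount <- ave(as.numeric(!is.na(x)), grp, FUN = cumsum)  # restart counter, 0 at the NA rows
# expanding ("historical") mean that restarts after each NA; NaN at the NA rows themselves
hist_mean <- ave(replace(x, is.na(x), 0), grp, FUN = cumsum) / zcount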

How to calculate the difference between dates in R for each unique id

I am new to R and have the following data of user name and their usage date for a product (truncated output):
Name, Date
Jane, 01-24-2016 10:02:00
Mary, 01-01-2016 12:18:00
Mary, 01-01-2016 13:18:00
Mary, 01-02-2016 13:18:00
Jane, 01-23-2016 10:02:00
I would like to do some analysis on difference between Date, in particular the number of days between usage for each user. I'd like to plot a histogram to determine if there is a pattern among users.
How do I compute the difference between dates for each user in R?
Are there any other visualizations besides histograms I should explore?
Thanks
Try this, assuming your data frame is df:
## in case you have different column names
colnames(df) <- c("Name", "Date")
## you might also have Date as factors when reading in data
## the following ensures it is character string
df$Date <- as.character(df$Date)
## convert to Date object
## see ?strptime for various available format
## see ?as.Date for Date object
df$Date <- as.Date(df$Date, format = "%m-%d-%Y %H:%M:%S")
## reorder, so that date are ascending (see Jane)
## this is necessary, otherwise negative number occur after differencing
## see ?order on ordering
df <- df[order(df$Name, df$Date), ]
## take day lags per person
## see ?diff for taking difference
## see ?tapply for applying FUN on grouped data
## as.integer() makes output clean
## if unsure, compare with: lags <- with(df, tapply(Date, Name, FUN = diff))
lags <- with(df, tapply(Date, Name, FUN = function (x) as.integer(diff(x))))
For your truncated data (with 5 rows), I get:
> lags
$Jane
[1] 1
$Mary
[1] 0 1
lags is a list. If you want Jane's information, do lags$Jane. To get a histogram, do hist(lags$Jane). Furthermore, if you want a single histogram for all clients, ignoring individual differences, use hist(unlist(lags)); unlist() collapses a list into a single vector.
Comments:
regarding good references for R, see the CRAN manual An Introduction to R, and the Advanced R book;
using tapply with multiple indices? You can use paste() to first construct an auxiliary composite index (see the sketch just after these comments);
er, it looks like I made things more complicated than necessary by using density, the central limit theorem, etc., for visualization, so I removed my other answer.
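A minimal sketch of that paste() idea, assuming a hypothetical second grouping column df$ID alongside Name (not part of the question's data):
grp <- paste(df$Name, df$ID, sep = "_")   # one composite key per (Name, ID) pair
lags <- tapply(df$Date, grp, FUN = function(x) as.integer(diff(x)))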
We can use data.table with lubridate
library(lubridate)
library(data.table)
setDT(df1)[order(mdy_hms(Date)), .(Diff=as.integer(diff(as.Date(mdy_hms(Date))))), Name]
# Name Diff
#1: Mary 0
#2: Mary 1
#3: Jane 1
If there are several grouping variables, e.g. an additional "ID" column, we can place them in the by:
setDT(df1)[order(mdy_hms(Date)), .(Diff=as.integer(diff(as.Date(mdy_hms(Date))))),
by = .(Name, ID)]

Creating a weekend dummy variable

I'm trying to create a dummy variable in my dataset in R for weekend i.e. the column has a value of 1 when the day is during a weekend and a value of 0 when the day is during the week.
I first tried iterating through the entire dataset by row and assigning the weekend variable a 1 if the date is on the weekend. But this takes forever considering there are ~70,000 rows and I know there is a much simpler way, I just can't figure it out.
Below is what I want the dataframe to look like. Right now it looks like this except for the weekend column. I don't know if this changes anything, but right now date is a factor. I also have a list of the dates that fall on weekends:
weekend <- c("2/9/2013", "2/10/2013", "2/16/2013", "2/17/2013", ... , "3/2/2013")
date hour weekend
2/10/2013 0 1
2/11/2013 1 0
.... .... ....
Thanks for the help
It might be safer to rely on data structures and functions that are actually built around dates:
dat <- read.table(text = "date hour weekend
2/10/2013 0 1
2/11/2013 1 0", header = TRUE, sep = "")
weekdays(as.Date(as.character(dat$date), "%m/%d/%Y")) %in% c('Sunday', 'Saturday')
## [1]  TRUE FALSE
This is essentially the same idea as SenorO's answer, but we convert the dates to an actual date column and then simply use weekdays, which means we don't need to have a list of weekends already on hand.
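For the 0/1 column the question asks for, a minimal sketch building on that same weekdays() idea (column names taken from the question's example; note that weekdays() returns locale-dependent day names, English here):
dat$weekend <- as.integer(
  weekdays(as.Date(as.character(dat$date), "%m/%d/%Y")) %in% c("Saturday", "Sunday"))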
DF$IsWeekend <- DF$date %in% weekend
Then if you really prefer 0s and 1s:
DF$IsWeekend <- as.numeric(DF$IsWeekend)
I would first check whether my dates really are weekend dates:
weekends <- c("2/9/2013", "2/10/2013", "2/16/2013", "2/17/2013","3/2/2013")
weekends <- weekends[as.POSIXlt(as.Date(weekends, '%m/%d/%Y'))$wday %in% c(0, 6)]
Then, using transform and ifelse, I create the new column:
transform(dat, weekend = ifelse(as.Date(as.character(date), '%m/%d/%Y') %in% as.Date(weekends, '%m/%d/%Y'), 1, 0))
