Apply a test by groups in R

I have a data sample of five-minute asset price returns (FiveMinRet) and selected events over a period covering several years. These events are hypothesized to have an effect on FiveMinRet (i.e. to cause non-zero abnormal returns). From the time series sample, I construct a sub-sample (sub_sample) containing, for each event, only the, say, 100 minutes (the window) around that event.
As a part of a preliminary data analysis, I wish to formally test for the presence of heteroskedasticity and first-order autocorrelation within each window. Each window occurs on different dates, so a variable (Date) will be my grouping variable.
So my question is: is there a way to apply a Ljung-Box test (the Box.test(x, lag = 1, type = c("Box-Pierce", "Ljung-Box"), fitdf = 0) command in R) by groups (the Date variable) and to present the test statistics/results in a list or data frame?
I tried the following approach:
Testresults = df %>%
  group_by(Date) %>%
  do(tidy(Box.test(df$FiveMinRet_sq, lag = 1, type = "Ljung-Box", fitdf = 0)))
The output is in the format I am looking for; however, with this approach I obtain the same test statistic for every date, so my approach is incorrect.

Without a reproducible example in the question I can't test the code for this solution, but one way to separate the tests by Date is to use split() and purrr::map().
df %>%
  split(.$Date) %>%
  purrr::map(function(x) {
    tidy(Box.test(x$FiveMinRet_sq, lag = 1, type = "Ljung-Box", fitdf = 0))
  }) -> testResults
# combine into a data frame
as.data.frame(do.call(rbind, testResults))
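If you also want the Date carried along as a column in the combined result, one option (a minimal sketch, assuming dplyr and broom are attached) is to let bind_rows() turn the names created by split() into an id column:

library(dplyr)
library(broom)

testResults <- df %>%
  split(.$Date) %>%
  purrr::map(~ tidy(Box.test(.x$FiveMinRet_sq, lag = 1,
                             type = "Ljung-Box", fitdf = 0))) %>%
  bind_rows(.id = "Date")  # the names from split() become a Date column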

Test if group means are statistically significantly different in R

(I asked this question earlier, but it was migrated to another Stack Exchange site and labeled 'unclear', and I couldn't edit it, so I'm going to try to clean up the question and make it clearer.)
I have the following data frame and need to determine whether there are statistically significant differences among the means of the Test Groups, repeating this for each Task Grouping:
set.seed(123)
Task_Grouping <- sample(c("A","B","C"),500,replace=TRUE)
Test_Group <- sample(c("Green","Yellow","Orange"),500,replace=TRUE)
TotalTime <- rnorm(500, mean = 3, sd = 3)
mydataframe <- data.frame(Task_Grouping, Test_Group, TotalTime)
For example, for Task A, I need to see if there are significant differences in the means of the Test Groups (Green, Yellow, Orange).
I've tried the following code, but something is wrong, since the p-value is the same for every Test Group combination across the different Task Groupings (i.e. every p-value is 0.6190578):
results <- mydataframe %>%
  group_by(Task_Grouping) %>%
  do(tidy(pairwise.t.test(mydataframe$TotalTime, mydataframe$Test_Group,
                          p.adjust.method = "BH")))
I'm also not 100% sure whether pairwise.t.test is the correct statistical test to use. To rephrase: I need to see whether the Test_Group means are statistically different from one another, and then I need to repeat this analysis for each Task Grouping.
Here's how you might do it using dplyr, tidyr, purrr and broom:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
mydataframe %>%
  nest(data = c(Test_Group, TotalTime)) %>%
  mutate(tidy = map(data, ~ tidy(pairwise.t.test(.$TotalTime, .$Test_Group,
                                                 p.adjust.method = "BH")))) %>%
  select(-data) %>%
  unnest(tidy)
Note that since we are using map, we use .$ rather than mydataframe$ so that we get the current group rather than the original table. See more examples in the broom and dplyr vignette.
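On the question of whether pairwise.t.test is the right test: the thread leaves that open, but if you want an overall one-way ANOVA per Task Grouping before (or instead of) the pairwise comparisons, the same nest/map pattern works. A minimal sketch under the same column names:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

mydataframe %>%
  nest(data = c(Test_Group, TotalTime)) %>%
  mutate(anova = map(data, ~ tidy(aov(TotalTime ~ Test_Group, data = .x)))) %>%
  select(-data) %>%
  unnest(anova)  # one row per model term, with an F statistic and p-value per Task Grouping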

How can I create subsets from this data frame?

I want to aggregate my data. The goal is to have one point in a diagram for each time interval. I have a data frame with 2 columns: the first column is a timestamp, the second is a value. I want to evaluate each time period, that is, add together all the values within a given time period, for example 1 second.
I don't know how to make this work with the aggregate function, because that function has no support for times.
0.000180 8
0.000185 8
0.000474 32
It is not easy to tell from your question what you're specifically trying to do. Your data has no column headings, we do not know the data types, you did not include the error message, and you contradicted yourself between your original question and your comment (is the first column the timestamp, or is the second column the timestamp?).
I'm trying to understand. Are you trying to:
Split your original data.frame into multiple data.frames?
View a specific sub-set of your data? Effectively, you want to filter your data?
Group your data.frame in to specific increments of a set time-interval to then aggregate the results?
Assuming that you have named the variables in your data frame time and value, I've addressed these three cases in the examples below.
# Set up example data
num <- 100
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180, 0.000500, 0.000005), num, TRUE),
                     value = sample(1:100, num, TRUE))
# Example 1: Split your data into multiple data frames (using base functions)
temp1 <- tempdf[tempdf$time > 0.0003, ]
temp2 <- tempdf[tempdf$time > 0.0003 & tempdf$time < 0.0004, ]
# Example 2: Filter your data (using the dplyr::filter() function)
dplyr::filter(tempdf, time > 0.0003 & time < 0.0004)
# Example 3: Chain the functions together using dplyr to group and summarise your data
library(dplyr)
tempdf %>%
  mutate(group = floor(time * 10000) / 10000) %>%
  group_by(group) %>%
  summarise(avg = mean(value),
            num = n())
I hope that helps?
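A small variation on Example 3, tied to the 1-second period the question mentions: the bin width can be made an explicit parameter (interval is a hypothetical name, and this sketch assumes time is numeric seconds), and summing rather than averaging matches the "values added all together" requirement:

library(dplyr)
interval <- 1  # bin width in seconds; adjust as needed
tempdf %>%
  mutate(group = floor(time / interval) * interval) %>%  # assign each row to a bin
  group_by(group) %>%
  summarise(total = sum(value))  # add together all values within each bin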

Lag function usage within a dplyr subset

My basic goal is to subset a data set and summarise it with new columns that use the lag function. I understand how to subset the data set, but I am struggling to use the lag function within it, and that is giving me trouble.
I have already tried a few different ways of implementing it, but have been unsuccessful.
gapminder %>%
  na.omit() %>%
  group_by(country) %>%
  summarise(prevPeriod = lag(year),
            lifeExpGrowth = lag(lifeExp),
            popGrowth = lag(pop),
            gdppcGrowth = 100 * (gdpPercap / lag(gdpPercap) - 1))
My code currently runs the lag based upon the country, not the year. gdppcGrowth is supposed to return a percentage as well, and I am getting an error:
Column `gdppcGrowth` must be length 1 (a summary value), not 12
For each of the functions, I want to analyze the data by country focusing on growth rates. I want to use the lag(x) function to access the previous value of a series or vector so that 100*(x/lag(x) - 1) computes standard (arithmetic) growth rates of x expressed as a percent.
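One likely fix (a sketch, not from the original thread): summarise() expects each expression to return a single value per group, while lag() returns a vector as long as the group, which is exactly the "must be length 1, not 12" error. mutate() keeps one row per observation and is the usual verb for lagged columns. Assuming the gapminder package as the data source and that rows should be in year order within each country:

library(dplyr)
library(gapminder)  # assumed source of the gapminder data

gapminder %>%
  na.omit() %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%  # make sure lags follow time order
  mutate(prevPeriod    = lag(year),
         lifeExpGrowth = 100 * (lifeExp / lag(lifeExp) - 1),
         popGrowth     = 100 * (pop / lag(pop) - 1),
         gdppcGrowth   = 100 * (gdpPercap / lag(gdpPercap) - 1)) %>%
  ungroup()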

R aggregating irregular time series data by groups (with meta data)

Hi, I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data, obviously on a different scale. I have several similar time series, so I've kept the example general, as I want to be able to apply the solution in different cases:
Data1 <- data.frame(DateTimes = as.POSIXct("1988-04-30 13:20:00") +
                      c(1:10, 12:15, 20:30, 5:13, 16:20, 22:35) * 300,
                    Site = c(rep("SiteA", 25), rep("SiteB", 28)),
                    Quality = rep(25, 53),
                    Value = round(runif(53, 0, 5), 2),
                    Othermetadata = c(rep("E1", 10), rep("E2", 15),
                                      rep("E1", 10), rep("E2", 18)))
What I'm looking for is a simple way to group and aggregate this data to different time steps while keeping the metadata, which doesn't vary within a group.
I have tried using the zoo library and zoo::aggregate, i.e.:
library(zoo)
library(dplyr)  # for select()
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData)) %% 3600, FUN = sum, reg = T)
However, when I do this I lose all my metadata and merge the data from different sites.
I wondered about using plyr or dplyr to split up the data and then apply the aggregation, but I would still lose my other columns.
Is there a better way to do this? I had a brief look at the documentation for the xts library but couldn't see an intuitive solution there either.
Note: as I want this to work for a few different things, both the starting time step and the final time step might change, with the possibility of a random time step, or a somewhat regular time step with missing points. The FUN applied may vary (mostly sum or mean), as may the fields I want to split by.
Edit: I found the solution after Hercules Apergis pushed me in the right direction.
Data1 <- Data1 %>% mutate(timeagg = lubridate::floor_date(DateTimes, "hour"))  # one way to build the timeagg variable; the original edit doesn't show how it was created
newData <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newData) %>% select(-DateTimes, -Value) %>% distinct()
The original DateTimes column wasn't a grouping variable, it was the time series, so I added a grouping variable for my aggregated time (here: time to the nearest hour) and summarised on this. The problem was that if I joined on this new column, I missed any points that fell within that hour but not exactly on the hour. Hence the inner_join %>% select %>% distinct approach.
Now hopefully it works with my real data, not just the example data!
Given the aggregation call that you have:
aggregate(zooData, time(zooData) - as.numeric(time(zooData)) %% 3600, FUN = sum, reg = T)
You want to sum the values by group of times and not lose the other columns. You can do this with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
newdata is a data frame in which the Values are summed for each group of DateTimes. inner_join then merges the parts that are common to the two datasets, by the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good starting point.
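For the coarser time steps the question is really about, one variation (a sketch, assuming the lubridate package for floor_date()) is to include the non-varying metadata columns among the grouping keys so they survive the summarise:

library(dplyr)
library(lubridate)

hourly <- Data1 %>%
  mutate(timeagg = floor_date(DateTimes, "hour")) %>%  # aggregated time step
  group_by(timeagg, Site, Quality, Othermetadata) %>%  # metadata kept as grouping keys
  summarise(Total = sum(Value), .groups = "drop")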

R code for creating variable for accuracy/percentages

I am having some trouble with R code for a variable I am trying to add to my data frame. Essentially, participants responded to two classes of stimuli (A and B) and their responses could either be correct or incorrect. The important variables (columns) in my data set are: ID (participants' ID), stimtype (A or B), and response (correct or incorrect).
What I want to do is create, for each participant, two "accuracy score" variables (columns): one that lists the accuracy percentage for stimulus type A, and one for stimulus type B.
I can get those percentages fairly easily using table functions, but am having difficulty creating those variables in my dataset. Any advice is very much appreciated, thank you!
If you have a data.frame mydata with character stimtypes and a TRUE/FALSE response (if response is stored as "correct"/"incorrect" as in the question, response == "correct" will convert it), you can use:
library(dplyr)
result <- mydata %>%
  group_by(ID, stimtype) %>%
  summarize(pct_response = 100 * mean(response, na.rm = TRUE))
This treats the logical responses (TRUE/FALSE) as 1/0, so taking the mean gives the percentage correct for a given ID and stimtype. However, the result will have two rows per ID, one for each stimtype. If you want the results in two columns, you can use tidyr::spread:
library(tidyr)
result %>%
  spread(key = stimtype, value = pct_response)
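In current tidyr, spread() is superseded by pivot_wider(); the equivalent call (a sketch under the same column names as above) is:

library(tidyr)
result %>%
  pivot_wider(names_from = stimtype, values_from = pct_response)  # one column per stimtype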
