Test if group means are statistically significantly different in R

(I asked this question earlier, but it was migrated to Stack Exchange and labeled 'unclear', and I couldn't edit it, so I'm going to try to clean up the question and make it clearer.)
I have the following data frame and need to determine whether there are statistically significant differences among the means of the Test Groups, and then repeat this for each Task Grouping:
set.seed(123)
Task_Grouping <- sample(c("A","B","C"),500,replace=TRUE)
Test_Group <- sample(c("Green","Yellow","Orange"),500,replace=TRUE)
TotalTime <- rnorm(500, mean = 3, sd = 3)
mydataframe <- data.frame(Task_Grouping, Test_Group, TotalTime)
For example, for Task A, I need to see if there are significant differences in the means of the Test Groups (Green, Yellow, Orange).
I've tried the following code, but something is wrong since the p.value is the same for each Test Group combination among different Task Groupings (i.e. every p-value is 0.6190578):
results <- mydataframe %>%
  group_by(Task_Grouping) %>%
  do(tidy(pairwise.t.test(mydataframe$TotalTime, mydataframe$Test_Group,
                          p.adjust.method = "BH")))
I'm also not 100% sure if a pairwise.t.test is the correct statistical test to use. To rephrase, I need to see if the Test_Group means are statistically different from one another. And then I need to repeat this analysis for each Task Grouping.

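Here's how you might do it using dplyr, tidyr, purrr and broom:
library(dplyr)
library(tidyr)   # nest() and unnest()
library(purrr)   # map()
library(broom)   # tidy()
mydataframe %>%
  # one row per Task_Grouping, with that group's rows stored in a list-column
  nest(data = c(Test_Group, TotalTime)) %>%
  mutate(tidy = map(data, ~ tidy(pairwise.t.test(.$TotalTime, .$Test_Group,
                                                 p.adjust.method = "BH")))) %>%
  select(-data) %>%
  unnest(tidy)
Note that inside map() we use .$ rather than mydataframe$, so each test sees only the rows of the current group rather than the original table. Referring to mydataframe$ is why every p-value in the question came out the same. See more examples in the broom and dplyr vignette.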
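The same per-group idea can also be sketched in base R, which may help confirm that the p-values now differ between Task Groupings. This is only an illustration on the question's simulated data; by_task and p_tables are names made up here and are not part of the original answer.

```r
# Recreate the question's data
set.seed(123)
Task_Grouping <- sample(c("A", "B", "C"), 500, replace = TRUE)
Test_Group <- sample(c("Green", "Yellow", "Orange"), 500, replace = TRUE)
TotalTime <- rnorm(500, mean = 3, sd = 3)
mydataframe <- data.frame(Task_Grouping, Test_Group, TotalTime)

# Split into one data frame per Task_Grouping (by_task is a hypothetical name)
by_task <- split(mydataframe, mydataframe$Task_Grouping)

# Run pairwise.t.test on each subset and keep the matrix of adjusted p-values
p_tables <- lapply(by_task, function(d) {
  pairwise.t.test(d$TotalTime, d$Test_Group, p.adjust.method = "BH")$p.value
})
p_tables
```

Each element of p_tables is computed only from the rows of its own Task Grouping, so the matrices are no longer identical across groups.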

Related

How to create statistical summary for the result of clustering for different group of variable in R

I am wondering if there is a package or fast way to generate a statistical summary table for the result of clustering.
I imagine I can choose the variables of interest, group by cluster number, and then calculate the mean, max, etc. I am looking for a fast way to do it. Is there any package I can use?
Thanks
The fastest and easiest way might depend on the exact results you want. The easiest approach is probably summary() in base R; the more versatile one is to use the dplyr package with its functions group_by() and summarize(). For specific types of data, other packages may provide a more practical summary.
An example:
DF <- data.frame(groups = sample(LETTERS, 20, replace = TRUE),
                 var = runif(20))
summary(DF)
library(dplyr)
DF %>%
  group_by(groups) %>%
  summarize(mean_by_group = mean(var),
            number = n())
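For the clustering case in the question, a base R sketch along these lines might work. The data, the use of hclust()/cutree() to obtain cluster labels, and the names dat and cluster_summary are all assumptions made for illustration; any clustering method that yields one label per row could be substituted.

```r
set.seed(1)
# Hypothetical data: two numeric variables of interest
dat <- data.frame(x = rnorm(60), y = rnorm(60))

# Assumed clustering step: hierarchical clustering cut into 3 clusters
dat$cluster <- cutree(hclust(dist(dat[, c("x", "y")])), k = 3)

# Mean and max of each variable within each cluster
cluster_summary <- aggregate(
  cbind(x, y) ~ cluster, data = dat,
  FUN = function(v) c(mean = mean(v), max = max(v))
)
cluster_summary
```

Because FUN returns a named vector, the x and y columns of cluster_summary are small matrices with "mean" and "max" columns, one row per cluster.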

Apply a test by groups

I have a data sample of five-minute asset price returns (FiveMinRet) and select events for a period covering several years. These events are hypothesized to have an effect on FiveMinRet (i.e., to cause non-zero abnormal returns). From the time series, I construct a sub-sample (sub_sample) containing, for each event, only the window of, say, 100 minutes around that event.
As part of a preliminary data analysis, I wish to formally test for the presence of heteroskedasticity and first-order autocorrelation within each window. Each window occurs on a different date, so a variable (Date) will be my grouping variable.
So my question in this regard is: Is there a way to apply a Ljung-Box test (the Box.test(x, lag = 1, type = c("Box-Pierce", "Ljung-Box"), fitdf = 0) command in R) by group (the Date variable) and to present the test statistics/test results in a list or data frame?
I tried the following approach:
Testresults = df %>% group_by(Date) %>% do(tidy(Box.test(df$FiveMinRet_sq, lag = 1, type = c("Ljung-Box"), fitdf = 0)))
The output has the shape I am looking for. However, with this approach I obtain the same test statistic for all dates, so my approach is incorrect.
Without a reproducible example in the question I can't test the code for this solution, but one way to separate the tests by Date is to use split() and purrr::map().
library(dplyr)   # for %>%
library(broom)   # for tidy()
df %>%
  split(.$Date) %>%
  purrr::map(function(x) {
    # x holds the rows for a single Date, so the test uses that window only
    tidy(Box.test(x$FiveMinRet_sq, lag = 1, type = "Ljung-Box", fitdf = 0))
  }) -> testResults
# combine into a data frame
as.data.frame(do.call(rbind, testResults))
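Since the question has no reproducible example, here is a small base R sketch on simulated data that shows the same split-then-test idea end to end. The data frame below is a made-up stand-in (three dates, 100 squared returns each), and tests and testResults are illustrative names.

```r
set.seed(42)
# Hypothetical stand-in for the question's data
df <- data.frame(
  Date = rep(as.Date("2020-01-01") + 0:2, each = 100),
  FiveMinRet_sq = rnorm(300)^2
)

# One Ljung-Box test per Date, each on that date's observations only
tests <- lapply(split(df$FiveMinRet_sq, df$Date), Box.test,
                lag = 1, type = "Ljung-Box", fitdf = 0)

# Collect the statistic and p-value into one row per Date
testResults <- data.frame(
  Date = names(tests),
  statistic = sapply(tests, function(t) unname(t$statistic)),
  p.value = sapply(tests, function(t) t$p.value),
  row.names = NULL
)
testResults
```

Each row now comes from a different subset, so the statistics differ by Date instead of repeating the whole-sample result.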

Take the nesting variables and map them to predicted values?

I'm fitting linear models for several different time periods (the example below uses countries).
I have two questions.
Question 1: Is there a way I can pass the mods column to ggpredict without starting with df1$mods? Something like df1 %>% select(mods) %>% map(., ggpredict). I actually tried that but it doesn't work. It's not a huge deal, but I am curious.
Question 2: Is there a way I can automate taking the names of the grouping variable's categories and mapping them to the predicted values?
Thank you!
library(tidyverse)
library(ggeffects)
# make a fake data frame
df1 <- data.frame(country = rep(c("A", "B"), 100), var1 = rnorm(200), var2 = rnorm(200))
df1
df1 %>%
  group_by(country) %>%
  # in reality I have 10 to 12 variables and 1 grouping variable with 12 categories
  nest(data = c(var1, var2)) %>%
  # in reality I'm also doing glms, but I don't think it matters
  mutate(mods = map(data, function(x) lm(var1 ~ var2, data = x))) -> out
out$mods %>%
  map_df(., ggpredict, terms = c('var2 [0,0.5]')) %>%
  # Is there a way to automate this line, by taking the values of the grouping
  # variable somehow?
  mutate(country = rep(c('A', 'B'), each = 2))
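One possible way to carry the group labels along automatically is to keep the models in a named list and reuse those names when collecting the predictions. The sketch below deliberately swaps ggpredict() for base R predict() so that it runs without extra packages; mods, newdata and preds are illustrative names, and the same naming idea would need to be adapted to the ggeffects output.

```r
set.seed(1)
df1 <- data.frame(country = rep(c("A", "B"), 100),
                  var1 = rnorm(200), var2 = rnorm(200))

# Fit one model per country; split() names each list element by its group
mods <- lapply(split(df1, df1$country), function(d) lm(var1 ~ var2, data = d))

# Predict at var2 = 0 and 0.5 for each model, tagging rows with the group name
newdata <- data.frame(var2 = c(0, 0.5))
preds <- do.call(rbind, lapply(names(mods), function(nm) {
  data.frame(country = nm,
             var2 = newdata$var2,
             predicted = unname(predict(mods[[nm]], newdata = newdata)))
}))
preds
```

The country column is filled from names(mods), so it stays correct however many categories the grouping variable has.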

Compare two variables (both numeric or both factors) in expss tables

I am digging deeper and deeper into the expss package, and I am facing one of the examples mentioned here --> https://gdemin.github.io/expss/#example_of_data_processing_with_multiple-response_variables (more particularly, the last table of the section).
Consider the following dataframes:
vecA <- factor(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)),levels=c(1,2,3,4,5))
vecB <- factor(c(rep(1,20),rep(2,20),rep(NA,10)),levels=c(1,2,3,4,5))
df_fact <- data.frame(vecA, vecB)
vecA_num <- as.numeric(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)))
vecB_num <- as.numeric(c(rep(1,20),rep(2,20),rep(NA,10)))
df_num <- data.frame(vecA_num, vecB_num)
Strictly copying the suggested code (URL above), here is what my table looks like:
df_fact %>%
  tab_cols(total(label = "#Total| |")) %>%
  tab_cells(list(vecA)) %>%
  tab_stat_cpct(label="vecA", total_row_position="above", total_statistic="u_cases") %>%
  tab_cells(list(vecB)) %>%
  tab_stat_cpct(label="vecB", total_row_position="above", total_statistic="u_cases") %>%
  tab_pivot(stat_position = "inside_columns") %>%
  recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
Slightly different procedure with a numeric example:
df_num %>%
  tab_cols(total(label = "#Total| |")) %>%
  tab_cells(vecA_num, vecB_num) %>%
  tab_stat_valid_n(label = "Valid N") %>%
  tab_stat_mean(label="Mean") %>%
  tab_pivot(stat_position = "inside_columns") %>%
  recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy) %>%
  tab_transpose()
Issues start here, since these complex constructs are... complex!
1) I would like to include tab_last_sig* family of functions but I cannot figure out how to do it (and possibly subtotals/nets when variables are factors)
2) Including multiple statistics (cases, percents, means...) altogether is a challenge
3) Last, it is not clear to me where I should write the statistic names / variable names
I have not found detailed documentation for these constructs, hence this message in a bottle :)
Unfortunately, significance testing is currently supported only for independent samples. In your examples you want to compare statistics on dependent samples. You can run the significance calculations for independent proportions, but the results will be inaccurate.
Including multiple statistics is not difficult: you just need to write the tab_stat_* calls sequentially. But a complex table layout really is a challenge :(
Variable names for a statistic should always be written in tab_cells. After that you can write the statistic functions, such as tab_stat_mean, tab_stat_cpct, etc. You can find the documentation by typing ?tab_pivot in the R console. It is the standard way of getting the manual for R functions.

Correlations between vectors in two groups (defined by: group_by)

I want to make a correlation between two vectors in two different groups (defined by group_by). The solution needs to be based on dplyr.
My data is in the so-called CDISC format. For simplicity: here is some dummy data (note that one column ("values") holds all the data):
n=5
bmi<-rnorm(n=n,mean=25)
glucose<-rnorm(n=n,mean=5)
insulin<-rnorm(n=n,mean=10)
id<-rep(paste0("id",1:n),3)
myData<-data.frame(id=id,measurement=c(rep("BMI",n),rep("glucose",n),rep("insulin",n)),values=c(bmi,glucose,insulin))
Keep in mind that all my functions for working with this kind of data use the dplyr package, such as:
myData %>% group_by(measurement) %>% summarise(mean(values), n())
How do I get the correlation between glucose and insulin (cor(glucose, insulin))? Or in a more general way: how do I get the correlation between two groups?
The following solution is obviously very wrong (but may help to understand my question):
myData %>% group_by(measurement) %>% summarise(cor(glucose,insulin))

Resources