simple question, I want to perform the one-sample rank test with cluster in data, after searching for a while, I got clusWilcox.test from the package clusrank. A toy example for illustration:
df = data.frame(x_1 = rnorm(200),
x_2 = rnorm(200),
group = c(rep('A',100),rep('B',100)),
clus = c(rep('a_1',50),rep('a_2',50),rep('b_1',50),rep('b_2',50)))
Worked like a charm when used directly
clusWilcox.test(x_1,paired = TRUE,cluster = "clus",data = df)
But went wrong when I tried to perform the test by group:
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus")$p.value), vars = c('x_1','x_2'))
Error in complete.cases(x, cluster, group, stratum) :
not all arguments have the same length
Seems like a data problem, so I fill the data option of the function with df, it worked, but test all the data instead of by group.
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(clusWilcox.test(.,paired = TRUE,cluster = "clus",data = df)$p.value), vars = c('x_1','x_2'))
> temp_test
# A tibble: 2 x 3
group vars1 vars2
<fct> <dbl> <dbl>
1 A 0.168 0.136
2 B 0.168 0.136
This won't happen when I tried to perform the one-sample t.test
temp_test <-
df %>%
group_by(group) %>%
summarise_each(funs(t.test(.)$p.value), vars = c('x_1','x_2'))
My guess is that the clusWilcox.test somehow could not inherit data from dplyr, anyone know how to get the problem fixed?
According to ?clusWilcox.test, the cluster parameter should be a numeric vector. In your df, it is a factor.
Therefore, running the test separately for group A with your factor cluster variable
clusWilcox.test(x_1, paired = TRUE, cluster = clus, data = df[df$group == "A", ])
results in:
Clustered Wilcoxon signed rank test using Rosner-Glynn-Lee method
data: x_1; cluster: clus; (from [)x_1; cluster: clus; (from temp)x_1; cluster: clus; (from temp$group == "A")x_1; cluster: clus; (from )
number of observations: 100; number of clusters: 4
Z = NA, p-value = NA
alternative hypothesis: true shift in location is not equal to 0
If you create a new cluster variable that is numeric, it runs the tests correctly:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise(pvalue = clusWilcox.test(x_1, paired = TRUE, cluster = clus)$p.value)
group pvalue
<fct> <dbl>
1 A 0.175
2 B 0.801
If you want to calculate it for different columns:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value)
group x_1 x_2
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
To indicate that it contains the p-value:
df %>%
mutate(clus = group_indices(., group, clus)) %>%
group_by(group) %>%
summarise_at(vars(x_1, x_2), list(pvalue = ~ clusWilcox.test(., paired = TRUE, cluster = clus)$p.value))
group x_1_pvalue x_2_pvalue
<fct> <dbl> <dbl>
1 A 0.264 0.712
2 B 0.794 0.289
Related
As a continuation from this question, I'm trying to efficiently perform many logistic regression in order to generate a column saying if a group differs significantly from my reference group.
When I try to nest my data by just one column, this solution works beautifully. However, now that I need to group by two columns, the code runs, but I cannot change the reference group. I've tried the following:
Adding a relevel argument (shown below)
Adding a relevel argument within the custom function itself (also shown below)
Renaming the desired reference group to start with 'AAA' to trick R into making it the first option
Here's a sample dataset:
library(dplyr)
library(lubridate)
library(tidyr)
library(purrr)
library(broom)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
app_deadline = ymd(c(rep("'2021-04-04", 3), rep("'2020-03-23", 3), rep("'2019-05-23", 1))),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
Here's the code that won't let me change the reference level:
#Custom function --note that english has been set as reference level
library(tidyr)
library(dplyr)
library(purrr)
library(broom)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg)) %>%
full_join(test2)
#Note that, based on the significance column, it's clear that 'undeclared' is being used as the reference group
Why is this happening? For a solution, I'd prefer if it could be flexible--i.e., not just work for 'english' but could also be switched to work for 'computer science' too.
It does respect the relevel() function, the problem, such as it is, is that the returned results don't match with the major column. See what happens if you stop at the unnest() function:
test2 <- test %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
get_model_t <- function(df) {
tryCatch(
expr = glm(accept_rate ~ relevel(major, ref = "english"), data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
#putting it altogether--note again that english has been marked as reference level
tmp <- test2 %>%
# create year column
mutate(year = year(time),
major = relevel(major, "english")) %>%
# nest by year
group_nest(year, app_deadline) %>%
# compute regression
mutate(reg = map(data, get_model_t), reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy))
Now, look at major and term
tmp %>% select(major, term)
# # A tibble: 6 × 2
# major term
# <fct> <chr>
# 1 undeclared "(Intercept)"
# 2 computer science "relevel(major, ref = \"english\")computer science"
# 3 english "relevel(major, ref = \"english\")undeclared"
# 4 undeclared "(Intercept)"
# 5 computer science "relevel(major, ref = \"english\")computer science"
# 6 english "relevel(major, ref = \"english\")undeclared"
You can see that the rows where major is "english" are actually for the "undeclared" parameter estimate. Taking the above result, I think you can capture what you want with the following:
tmp %>%
filter(term != "(Intercept)") %>%
mutate(major = gsub(".*\\)(.*)", "\\1", term)) %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(year, app_deadline, major, time, significant) %>%
full_join(test2)
# Joining, by = c("app_deadline", "major", "time")
# # A tibble: 7 × 9
# year app_deadline major time significant admit reject total accept_rate
# <dbl> <date> <chr> <date> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 2020-03-23 computer science 2020-01-01 Yes 300 100 400 0.75
# 2 2020 2020-03-23 undeclared 2020-01-01 Yes 800 210 1010 0.792
# 3 2021 2021-04-04 computer science 2021-01-01 Yes 1000 300 1300 0.769
# 4 2021 2021-04-04 undeclared 2021-01-01 No 500 1000 1500 0.333
# 5 NA 2021-04-04 english 2021-01-01 NA 450 1000 1450 0.310
# 6 NA 2020-03-23 english 2020-01-01 NA 100 900 1000 0.1
# 7 NA 2019-05-23 undeclared 2019-01-01 NA 1000 1500 2500 0.4
As a continuation from this question, I want to run many logistic regression equations at once and then note if a group was significantly different from a reference group. This solution works, but it only works when I'm not missing values. Being that my data has 100 equations, it's bound to have missing values, so rather than this solution failing when it hits an error, how can I program it to skip the instances that throw an error?
Here's a modified dataset that's missing cases:
library(dplyr)
library(lubridate)
library(broom)
test <- tibble(major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)) %>%
mutate(total = rowSums(test[ , c("admit", "reject")], na.rm=TRUE)) %>%
mutate(accept_rate = admit/total)
And here's the solution that works when it has all cases (see dataset here), but when it hits the 2019 grouping that's missing cases, it fails:
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test %>%
# create year column
mutate(year = year(time),
major = relevel(major, "undeclared")) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, ~glm(accept_rate ~ major, data = .,
family = binomial, weights = total, na.action = na.exclude)),
# use broom::tidy to make a tibble out of model object
reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
I also tried wrapping the solution using the custom functions here, but I also couldn't get it to work. Last, I'm also open to other ideas for a solution if it produces a similar output and is resistant to these errors.
To ignore errors, use this function:
get_model <- function(df) {
tryCatch(
glm(accept_rate ~ major, data = df, family = binomial, weights = total, na.action = na.exclude),
error = function(e) NULL, warning=function(w) NULL)
}
Use it where you call mutate(reg=map()...):
# compute regression
mutate(reg = map(data, get_model), reg_tidy = map(reg, tidy))
Output:
# A tibble: 4 x 8
year major time admit reject total accept_rate significant
<dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
2 2021 english 2021-01-01 450 1000 1450 0.310 No
3 2020 computer science 2020-01-01 300 100 400 0.75 No
4 2020 english 2020-01-01 100 900 1000 0.1 Yes
purrr::safely allows to take care of errors. To wrap glm call inside purrr::safely, I use a helper function glm_safe. glm_safe returns a list with two elements, result and error.
When there is no error, result contains the model object, while element is NULL. In case of an error, the error message is stored in error and result is NULL.
To use the results in your pipeline, we have to extract the result elements which could be achieved via transpose(reg)$result.
library(dplyr)
library(lubridate)
library(broom)
library(tidyr)
library(purrr)
test <- tibble(
major = as.factor(c(rep(c("undeclared", "computer science", "english"), 2), "undeclared")),
time = ymd(c(rep("'2021-01-01", 3), rep("'2020-01-01", 3), rep("'2019-01-01", 1))),
admit = c(500, 1000, 450, 800, 300, 100, 1000),
reject = c(1000, 300, 1000, 210, 100, 900, 1500)
)
test <- test %>%
mutate(total = rowSums(test[, c("admit", "reject")], na.rm = TRUE)) %>%
mutate(accept_rate = admit / total)
glm_safe <- purrr::safely(
function(x) {
glm(accept_rate ~ major,
data = x,
family = binomial, weights = total, na.action = na.exclude
)
}
)
test %>%
# create year column
mutate(
year = year(time),
major = relevel(major, "undeclared")
) %>%
# nest by year
nest(data = -year) %>%
# compute regression
mutate(reg = map(data, glm_safe),
reg = transpose(reg)$result) |>
mutate(reg_tidy = map(reg, tidy)) %>%
# get data and regression results back to tibble form
unnest(c(data, reg_tidy)) %>%
filter(term != "(Intercept)") %>%
# create the significant yes/no column
mutate(significant = ifelse(p.value < 0.05, "Yes", "No")) %>%
# remove the unnecessary columns
select(-c(term, estimate, std.error, statistic, p.value, reg))
#> # A tibble: 4 × 8
#> year major time admit reject total accept_rate significant
#> <dbl> <fct> <date> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2021 computer science 2021-01-01 1000 300 1300 0.769 Yes
#> 2 2021 english 2021-01-01 450 1000 1450 0.310 No
#> 3 2020 computer science 2020-01-01 300 100 400 0.75 No
#> 4 2020 english 2020-01-01 100 900 1000 0.1 Yes
I have a dataframe with three variables: Var1 (with values of A, B, and C), Var2 (with values of X and Y), and Metric (various numeric values). For every group in Var1 there exists multiple of each Var2 (unique by other variables but I am collapsing them here and are not relevant).
For every Var1 group I would like to compare the means of Metric between Var2 groups (X mean vs. Y mean) using a statistical test to determine if they are significantly different from each other (the exact test is not important but ideally the solution is agnostic). Doing this in isolation is easy and I have done so with filtering and TukeyHSD and selected my pairs of interest. With the below example I have filtered for 'A' in Var1 and get the p-value for X vs Y:
#Generating dataframe
set.seed(3)
df = data.frame(Var1 = sample(c("A","B","C"), 15, replace=TRUE),
Var2 = sample(c("X","Y"),15, replace=TRUE),
Metric = 1:15)
df
#Calculating significance between X and Y for A in Var1
stat = aov(Metric ~ Var2, data = subset(df, Var1 %in% c("A")))
summary(stat)
TukeyHSD(stat)
Ideally I would like a way to automate this so that the final result is in an easily accessible format or list of outputs in R such that I have a p-value for X vs Y for A, B, and C groups using some form of loop where I just have to input the desired Var1 groups of interest (as my actual dataset has much more than 3 groups for it).
If I understood correctly
#Generating dataframe
set.seed(3)
df = data.frame(Var1 = sample(c("A","B","C"), 100, replace=TRUE),
Var2 = sample(c("X","Y"),100, replace=TRUE),
Metric = 1:100)
library(tidyverse)
df %>%
group_nest(Var1) %>%
mutate(AOV = map(data, ~aov(Metric ~ Var2, data = .x))) %>%
transmute(Tukey_HSD = map(AOV, TukeyHSD) %>% map(broom::tidy)) %>%
unnest(Tukey_HSD)
#> # A tibble: 3 x 7
#> term contrast null.value estimate conf.low conf.high adj.p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Var2 Y-X 0 0.931 -20.6 22.4 0.930
#> 2 Var2 Y-X 0 -8.06 -27.3 11.2 0.400
#> 3 Var2 Y-X 0 8.90 -14.1 31.9 0.435
Created on 2021-11-04 by the reprex package (v2.0.1)
with variable filtering
filter_vars <- c("B", "C")
df %>%
filter(Var1 %in% filter_vars) %>%
group_nest(Var1) %>%
mutate(AOV = map(data, ~aov(Metric ~ Var2, data = .x))) %>%
transmute(Tukey_HSD = map(AOV, TukeyHSD) %>% map(broom::tidy)) %>%
unnest(Tukey_HSD)
I am trying to run multiple linear models for a very large dataset and store the outputs in a dataframe. I have managed to get estimates and p-values into dataframe (see below) but I also want to store the AIC for each model.
#example dataframe
dt = data.frame(x = rnorm(40, 5, 5),
y = rnorm(40, 3, 4),
group = rep(c("a","b"), 20))
library(dplyr)
library(broom)
# code that runs lm for each group in row z and stores output
dt_lm <- dt %>%
group_by(group) %>%
do(tidy(lm(y~x, data=.)))
Use glance instead of tidy:
dt_lm <- dt %>%
group_by(group) %>%
do(glance(lm(y~x, data=.))) %>%
select(AIC)
which gives:
Adding missing grouping variables: `group`
# A tibble: 2 x 2
# Groups: group [2]
group AIC
<chr> <dbl>
1 a 119.
2 b 114.
If you not only want to store the AIC but other metrics just skip the select part.
In the newer version of dplyr i.e. >= 1.0, we can also use nest_by
library(dplyr)
library(tidyr)
library(broom)
dt %>%
nest_by(group) %>%
transmute(out = list(glance(lm(y ~ x, data = data)))) %>%
unnest(c(out)) %>%
select(AIC)
# A tibble: 2 x 2
# Groups: group [2]
# group AIC
# <chr> <dbl>
#1 a 115.
#2 b 100.
I need to sum sequences generated by one of column. I have done it in that way:
test <- tibble::tibble(
x = c(1,2,3)
)
test %>% dplyr::mutate(., s = plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))}))
Is there a cleaner way to do this? Is there some helper functions, construction which allows me to reduce this:
plyr::aaply(x, .margins = 1, .fun = function(x_i){sum(seq(x_i))})
I am looking for a generic solution, here sum and seq is only an example. Maybe the real problem is that I do want to execute function on element not all vector.
This is my real case:
test <- tibble::tibble(
x = c(1,2,3),
y = c(0.5,1,1.5)
)
d <- c(1.23, 0.99, 2.18)
test %>% mutate(., s = (function(x, y) {
dn <- dnorm(x = d, mean = x, sd = y)
s <- sum(dn)
s
})(x,y))
test %>% plyr::ddply(., c("x","y"), .fun = function(row) {
dn <- dnorm(x = d, mean = row$x, sd = row$y)
s <- sum(dn)
s
})
I would like to do that by mutate function in a row way not vectorized way.
For the specific example, it is a direct application of cumsum
test %>%
mutate(s = cumsum(x))
For generic cases to loop through the sequence of rows, we can use map
test %>%
mutate(s = map_dbl(row_number(), ~ sum(seq(.x))))
# A tibble: 3 x 2
# x s
# <dbl> <dbl>
#1 1 1
#2 2 3
#3 3 6
Update
For the updated dataset, use map2, as we are using corresponding arguments in dnorm from the 'x' and 'y' columns of the dataset
test %>%
mutate(V1 = map2_dbl(x, y, ~ dnorm(d, mean = .x, sd = .y) %>%
sum))
# A tibble: 3 x 3
# x y V1
# <dbl> <dbl> <dbl>
#1 1 0.5 1.56
#2 2 1 0.929
#3 3 1.5 0.470