grouped statistical test tidyverse

grouped statistical test tidyverse - r

I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.
The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.
Please see an example reformatting the iris dataset to mimic my data, and finally my attempts to perform the task.
I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.
# Reformatting Iris to mimic my data.
long_format <- iris %>%
gather(key = "attribute", value = "measurement", -Species) %>%
mutate(descriptor =
case_when(
str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
mutate(Feature =
case_when(
str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))
# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)
# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
group_by(Species, Feature) %>%
do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>%
mutate(Wilcox = w$p.value)
# This is an attempt with the dplyr experimental group_map function. If only I could just make this a new column appended to the original df in one step.
cleaned_up %>%
group_by(Species, Feature) %>%
group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)
Thanks for your help.

The model object can be wrapped in a list
library(tidyverse)
cleaned_up %>%
group_by(Species, Feature) %>%
nest %>%
mutate(model = map(data, ~
.x %>%
transmute(w = list(wilcox.test(measurement~descriptor,
data=., paired=FALSE)))))
Or another option is group_split into a list, then map through the list, elements create the 'pval' column after applying the model
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))

Another option is to avoid the data argument entirely. The wilcox.test function only requires a data argument when the variables being tested aren't in the calling scope, but functions called within mutate have all the columns from the data frame in scope.
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
Same as akrun's output (thanks to his correction in the comments above)
akrun <-
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
me <-
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
all.equal(akrun, me)
# [1] TRUE

Related

How to apply multiple functions to a list of data frames?

I have a list of more than 50 csv files with the same numbers of columns and rows.
I want to find the percentage of missing values for each of the data frames and I have found the code that works fine with a single file which is the following:
missing.values <- estaciones2 %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing))
Now I want to apply these functions to each of my data frames in my list.
I read that I can use the map function to create a loop and run the code for each of my files in the list, although I am not quite sure how to insert the map function into my code shown above and I have tried the following but doesn't seem right:
missing.values <- map(estaciones2, ~ map(estaciones2, ~ estaciones2 %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing)))

We need a lambda function (~) to loop over the list (assuming estaciones2 is a list object). The .x is the data.frame element of the list using the lambda call
library(purrr)
library(tidyr)
library(dplyr)
map(estaciones2, ~ .x %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(num.missing = n()) %>%
filter(is.missing==T) %>%
select(-is.missing) %>%
arrange(desc(num.missing)))
In the OP's code, multiple map functions are called on the same list element again and again i.e. estaciones2

How to reuse parts of long chain of pipe operators in R?

I have a set of chains of pipe operators (%>%) doing different things with different datasets.
For instance:
dataset %>%
mutate(...) %>%
filter(...) %>%
rowwise() %>%
summarise() %>%
etc...
If I want to reuse some parts of these chains, is there a way to do it, without just wrapping it into a function?
For instance (in pseudocode obviously):
subchain <- filter(...) %>%
rowwise() %>%
summarise()
# and then instead of the chain above it would be:
dataset %>%
mutate(...) %>%
subchain() %>%
etc...

Similar in syntax to desired pseudo-code:
library(dplyr)
subchain <- . %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld)
all.equal(
starwars %>%
group_by(gender) %>%
filter(mass > mean(mass, na.rm = TRUE)) %>%
select(name, gender, homeworld),
starwars %>%
group_by(gender) %>%
subchain()
)
Using a dot . as start of a piping sequence. This is in effect close to function wrapping, but this is called a magrittr functional sequence. See ?functions and try magrittr::functions(subchain)

Save intermediate list output in dplyr pipeline and map it back to another list further down the pipeline - R

I am running pcas on groups in a data set using dplyr pipelines. I am starting with group_split, so am working with a list. In order to run the prcomp() function, only the numeric columns of each list can be included, but I would like the factor column brought back in for plotting at the end. I have tried saving an intermediate output using {. ->> temp} partway through the pipeline, but since it is a list, I don't know how to index the grouping column when plotting.
library(tidyverse)
library(ggbiplot)
iris %>%
group_split(Species, keep = T) %>% #group by species, one pca per species
{. ->> temp} %>% # save intermediate output to preserve species column for use in plotting later
map(~.x %>% select_if(is.numeric) %>% select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE))%>% #run pca on numeric columns only
map(~ggbiplot(.x), label=temp$Species)#plot each pca, labeling points as species names form the temporary object
This works to produce one pca plot for each species in the irisdata set, but since temp$species = NULL, the points are not labelled.

If you use map2() and pass the .y argument as the species list you can get the result I think you want. Note that in your original code the labels argument was outside the ggbiplot() function and was ignored.
library(tidyverse)
library(ggbiplot)
iris %>%
group_split(Species, keep = T) %>%
{. ->> temp} %>%
map(~.x %>%
select_if(is.numeric) %>%
select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE)) %>%
map2(map(temp, "Species"), ~ggbiplot(.x, labels = .y))
In response to your comment, if you wanted to add a third argument you could use pmap() instead of map2(). In the example below, pmap() is being passed a (nested) list of the data for the ggbiplot() arguments. Note I've changed the new variable so that it's a factor and not constant across groups.
iris %>%
mutate(new = factor(sample(1:3, 150, replace = TRUE))) %>%
group_split(Species, keep = T) %>%
{. ->> temp} %>%
map(~.x %>%
select_if(is.numeric) %>%
select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE)) %>%
list(map(temp, "Species"), map(temp, "new")) %>%
pmap(~ ggbiplot(pcobj = ..1, labels = ..2, groups = ..3))

One option is to use split and imap
library(tidyverse)
library(ggbiplot)
iris %>%
split(.$Species) %>% # save intermediate output to preserve species column for use in plotting later
map(~.x %>% select_if(is.numeric) %>% select_if(~var(.) != 0) %>%
prcomp(scale. = TRUE)) %>%
imap(~ggbiplot(.x, labels = .y))

R: dynamic variable name comparisons

I recoded a bunch of variables in a dataset, and and gave the newly recoded variables the prefix "r_" in my dataset.
I'd like to run table on the pairs to ensure the recoding was correct. Something like table(v1, r_v1), but I need to do it for lots of variables. They are not in any particular order, so I couldn't use indexing.
Here is a reproducible example of data one can use (also any tips on optimizing that code are appreciated!).
mtcars %>% select(c(disp,hp)) %>%
mutate_all(funs(if_else(.>100,1,0))) %>%
rename_(.dots=setNames(names(.), paste0('r_', names(.)))) %>%
cbind(mtcars,.)
Any ideas?

I would just use variable names and simple for loop. Calling your modified data dd,
orig = c("disp", "hp")
trans = paste0("r_", orig)
check_list = list()
for (i in seq_along(orig)) {
check_list[[i]] = table(dd[[orig[i]]], dd[[trans[i]]])
# or whatever other check you want to do
}
check_list
You can then examine the check_list contents one at a time.

To keep things in the tidy format with which you started:
library(purrr)
library(tidyr)
mtcars %>%
select(disp,hp) %>%
mutate_all(funs(r = if_else(.>100,1,0))) %>%
mutate(index = row_number()) %>%
gather(key = key, value = value, -index) %>%
separate(key, c("Variable", "Type")) %>%
mutate(Type = ifelse(is.na(Type), "Original", "Recode")) %>%
spread(key = Type, value = value) %>%
select(-index) %>%
split(.$Variable) %>%
map(~ select(.,-Variable)) %>%
map(~ table(.))

bootstrap by group and calculate statistics

I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to #aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})

One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists all together:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))

#aosmith answered my question but here is a simpler solution overall
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

grouped statistical test tidyverse - r

Related

How to apply multiple functions to a list of data frames?

How to reuse parts of long chain of pipe operators in R?

Save intermediate list output in dplyr pipeline and map it back to another list further down the pipeline - R

R: dynamic variable name comparisons

bootstrap by group and calculate statistics

Categories

Resources