I am wondering how to tidy the following:
First I gather a selection of columns into a tibble with three columns: strain (=grouping factor), params (names of parameters) and values (the actual values)
sel <- t_tcellact %>% select(strain, contains("nbr_")) %>% gather(params, values, nbr_DP:nbr_CD3p)
Then I perform multiple pairwise.t.test():
test2 <- sel %>% bind_rows(sel) %>% split(.$params) %>% map(~ pairwise.t.test(x=.$values, g=.$strain, p.adj = "none"))
The result is a list of pairwise.t.test() results, which I can start cleaning with:
test3 <- lapply(test2, tidy)
The list now looks like this:
$nbr_CD3p
group1 group2 p.value
1 SKG Balb/c 0.000001849548
$nbr_DN_CD69nCD25n
group1 group2 p.value
1 SKG Balb/c 0.6295371
and so on....
From this I need a tibble with the following columns: parameter (e.g. nbr_CD3p), group1, group2, p.value.
In this example I had only two groups, but I want to do it in a generic way also applicable when I have multiple groups.
Does anybody have an idea how to get to this point in an elegant way (without a loop)?
You should be able to use bind_rows(), taking advantage of the .id argument:
test3 <- lapply(test2, tidy) %>%
bind_rows(.id = 'parameter')
That will use the names of test2 as a new column named parameter in the data frame. All that said, replacing lapply with map_df() as aosmith suggested in a comment above should also work.
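For the multi-group case, here is a minimal sketch of the same idea on iris (three species, so three pairwise comparisons per parameter). The iris data and the object names sel_demo and res are stand-ins for illustration, not the data from the question:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Long format: one row per (Species, parameter, value); iris stands in for t_tcellact
sel_demo <- iris %>%
  gather(params, values, -Species)

res <- sel_demo %>%
  split(.$params) %>%                       # one data frame per parameter
  map(~ pairwise.t.test(x = .x$values, g = .x$Species,
                        p.adjust.method = "none")) %>%
  map(tidy) %>%                             # group1, group2, p.value per test
  bind_rows(.id = "parameter")              # list names become the parameter column

res
```

Each parameter contributes one row per pair of groups, so nothing has to change when the number of groups grows.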
I found a way to do that:
test2 <- sel %>% bind_rows(sel) %>% split(.$params) %>% map(~ pairwise.t.test(x=.$values, g=.$strain, p.adj = "none")) %>% lapply(tidy) %>% do.call("rbind", .) %>% mutate(params = rownames(.)) %>% as_tibble()
I have a list of data frames with inconsistent but overlapping variables. Some of the shared variables have similar but not identical names. I would like to conditionally rename the variable so that it is consistent across datasets. The way to do this one at a time would be
library(tidyverse)
df_1 <- starwars
df_2 <- starwars %>% rename(haircolor = hair_color)
df_3 <- starwars
df_list <- list(df_1, df_2, df_3)
df_list[[2]] <- df_list[[2]] %>% rename(hair_color = haircolor)
But I would like this to be flexible such that I can just feed in a list of any size and it will rename any variable titled hair_color as haircolor. Is there a way to purrr::map over these in a way that renames conditionally on the variable existing? The most basic interpretation would look something like:
df_list %>%
purrr::map( ~ rename(., hair_color = haircolor))
We can use a select helper (matches) inside rename_at(), so the rename only applies where the column exists
library(dplyr)
library(purrr)
df_list %>%
purrr::map( ~ .x %>%
rename_at(vars(matches('hair_color')), ~ 'haircolor'))
Or use an if/else condition
df_list %>%
purrr::map( ~ if('hair_color' %in% names(.)) {
rename(., haircolor = hair_color)
} else .)
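Another option, assuming dplyr 1.0 or later, is to pass a named lookup vector through any_of(), which silently skips names that are not present. This is only a sketch; the names lookup and renamed are made up for the example:

```r
library(dplyr)
library(purrr)

# Same setup as in the question: the middle data frame uses a different name
df_list <- list(
  starwars,
  starwars %>% rename(haircolor = hair_color),
  starwars
)

# new_name = "old_name"; any_of() ignores entries whose old name is missing
lookup <- c(haircolor = "hair_color")

renamed <- map(df_list, ~ rename(.x, any_of(lookup)))

# Every element should now have a haircolor column
map_lgl(renamed, ~ "haircolor" %in% names(.x))
```

The lookup vector can hold several new = "old" pairs, which keeps the renaming rules in one place.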
I am trying to compute the column means and row means of some data I have.
It's similar to the following:
library(rsample)
library(tidyquant)
library(tidyverse)
library(tsibble)
aapl <- tq_get("AAPL", from = "2000-01-01")
aapl_monthly_nested <- aapl %>%
mutate(ym = yearmonth(date)) %>%
nest(-ym)
aapl_rolled <- aapl_monthly_nested %>%
rolling_origin(cumulative = FALSE)
map(aapl_rolled$splits, ~ analysis(.x)) %>%
head
I try using the summarise_all function once I have mapped over the data but I cannot seem to get the colMeans. I have replaced colMeans with mean without luck.
x <- map(aapl_rolled$splits, ~analysis(.x),
~map(data,
~summarise_all(.funs(colMeans))))
x[[1]]$data
I would like a single observation of the column means for each of the splits.
EDIT:
I think I got it. I believe I forgot to unnest the data after nesting it previously.
x <- map(aapl_rolled$splits, ~ analysis(.x) %>%
unnest() %>%
as_tibble(.) %>%
select(-ym) %>%
summarise_all(mean))
If you have a better solution please let me know.
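For comparison, here is a small offline sketch of the same pattern that returns one row of column means per split. It assumes dplyr 1.0 or later for across(); the toy tibble and the initial/assess window sizes are invented for the example and are not the AAPL data:

```r
library(dplyr)
library(purrr)
library(rsample)

# Toy data standing in for the price data: y is always 2 * x
toy <- tibble(x = 1:20, y = seq(2, 40, by = 2))

# Sliding (non-cumulative) windows of 5 rows, 1 row held out for assessment
rolled <- rolling_origin(toy, initial = 5, assess = 1, cumulative = FALSE)

# One row of column means per split, bound into a single tibble
means <- map_dfr(rolled$splits,
                 ~ analysis(.x) %>% summarise(across(everything(), mean)))

head(means)
```

With nested monthly data the only extra step is the unnest() before summarising, as in the edit above.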
I want to make a bunch of new variables a,b,c,d.....z to store tibble data frames. I will then rbind the new variables that store tibble data frames and export them as a csv. How do I do this faster without having to specify the new variables each time?
a<- subset(data.frame, variable1="condition1",....,) %>% group_by() %>% summarize( a=mean())
b<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
....
z<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
rbind(a,b,....,z)
There's got to be a faster way to do this. My data set is large so having it stored in memory as partitions of a,b,c,....z is causing the computer to crash. Typing the subset conditions to form the partitions repeatedly is tedious.
You could do something like this using the purrr package.
Depending on what your conditions look like, you may need non-standard evaluation (NSE); see the "Programming with dplyr" vignette.
purrr::map_df(
c("condition1","condition2",..., "conditionn"),
# .x for each condition
~ subset(your_data_frame, variable1 == .x,....,) %>% group_by(some_columns) %>% summarise(a = mean(some_columns))
)
Example using iris:
library(rlang)
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, function(x){
iris %>%
dplyr::filter(!!rlang::parse_expr(x)) %>%
head()
})
Example using iris:
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, ~ iris %>% dplyr::filter(!!rlang::parse_expr(.x)) %>% nrow())
# or evaluate the parsed expression with eval() inside filter(); !! instead injects the expression into the call
map(conditions, ~ iris %>% dplyr::filter(eval(rlang::parse_expr(.x))) %>% nrow())
[[1]]
[1] 113
[[2]]
[1] 50
[[3]]
[1] 118
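To get closer to the original goal (one bound data frame instead of a..z), the filter step can be combined with group_by() and summarise() inside map_dfr(). This is a sketch on iris, assuming dplyr 1.0 or later for the .groups argument; the grouping column, the summarised column and the name summary_tbl are placeholders:

```r
library(dplyr)
library(purrr)
library(rlang)

conditions <- c("Petal.Length > 1.5", "Sepal.Length > 5")

# set_names() makes each condition string the name of its list element,
# so .id can record which condition produced each row
summary_tbl <- map_dfr(
  set_names(conditions),
  ~ iris %>%
    filter(!!parse_expr(.x)) %>%
    group_by(Species) %>%
    summarise(a = mean(Sepal.Width), .groups = "drop"),
  .id = "condition"
)

summary_tbl
```

Only the small summaries are kept in memory, and the result can go straight to write.csv().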
Instead of creating multiple objects in the global environment, read them into a list and bind them
library(data.table)
files <- list.files(pattern = "\\.csv", full.names = TRUE)
rbindlist(lapply(files, fread))
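A self-contained sketch of that pattern, writing two small files to a temporary directory first so it can be run as is. The file names and the idcol label are made up for the example:

```r
library(data.table)

# Create a scratch directory with two small CSV files
tmp_dir <- tempfile("parts_")
dir.create(tmp_dir)
fwrite(head(iris, 3), file.path(tmp_dir, "a.csv"))
fwrite(tail(iris, 3), file.path(tmp_dir, "b.csv"))

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack the results; idcol records which list element
# (by position) each row came from
combined <- rbindlist(lapply(files, fread), idcol = "source")

combined
```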
For large files, fread is typically much faster than read.csv
If we are using strings to be passed onto group_by, convert the string to symbol with sym from rlang and evaluate (!!)
library(purrr)
map2_df(c("condition1", "condition2"), c("a", "b"), ~ df1 %>%
group_by(!! rlang::sym(.x)) %>%
summarise(!! .y := mean(colname)))
If 'condition1', 'condition2' etc. are expressions, capture them as quosures with quos and unquote them (!!) inside filter
map2_df(quos(condition1, condition2), c("a", "b"), ~ df1 %>%
filter(!! .x) %>%
summarise(!! .y := mean(colname)))
Using a reproducible example
conditions <- quos(Petal.Length>1.5,Species == 'setosa',Sepal.Length > 5)
map2(conditions, c('a', 'b', 'c'), ~
iris %>%
filter(!! .x) %>%
summarise(!! .y := mean(Sepal.Length)))
#[[1]]
# a
#1 6.124779
#[[2]]
# b
#1 5.006
#[[3]]
# c
#1 6.129661
It would be a 3 column dataset if we use map2_dfc
NOTE: It is not clear whether the OP meant 'condition1', 'condition2' as expressions to be passed on for filtering the rows or not.
In this SO Question bootstrapping by several groups and subgroups seemed to be easy using the broom::bootstrap function specifying the by_group argument with TRUE.
My desired output is a nested tibble with n rows where the data column contains the bootstrapped data generated by each bootstrap call (and each group and subgroup has the same amount of cases as in the original data).
In broom I did the following:
# packages
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(rsample)
library(broom)
# some data to bootstrap
set.seed(123)
data <- tibble(
group=rep(c('group1','group2','group3','group4'), 25),
subgroup=rep(c('subgroup1','subgroup2','subgroup3','subgroup4'), 25),
v1=rnorm(100),
v2=rnorm(100)
)
# the actual approach using broom::bootstrap
tibble(id = 1:100) %>%
mutate(data = map(id, ~ {data %>%
group_by(group,subgroup) %>%
broom::bootstrap(100, by_group=TRUE)}))
Since the broom::bootstrap function is deprecated, I rebuilt my approach with rsample::bootstraps to get the same output. It seems to be much more complicated to get there. Am I doing something wrong, or have things gotten more complicated in the tidyverse when generating grouped bootstraps?
data %>%
dplyr::mutate(group2 = group,
subgroup2 = subgroup) %>%
tidyr::nest(-group2, -subgroup2) %>%
dplyr::mutate(boot = map(data, ~ rsample::bootstraps(., 100))) %>%
pull(boot) %>%
purrr::map(., "splits") %>%
transpose %>%
purrr::map(., ~ purrr::map_dfr(., rsample::analysis)) %>%
tibble(id = 1:length(.), data = .)
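For what it's worth, the same nested output can be sketched without rsample by resampling within groups using dplyr::slice_sample() (this assumes dplyr 1.0 or later and is a different technique, not a replacement API for broom::bootstrap). The source data is called dat here only to avoid clashing with the new data column:

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(123)
dat <- tibble(
  group    = rep(c("group1", "group2", "group3", "group4"), 25),
  subgroup = rep(c("subgroup1", "subgroup2", "subgroup3", "subgroup4"), 25),
  v1 = rnorm(100),
  v2 = rnorm(100)
)

# On a grouped data frame, slice_sample() draws within each group;
# prop = 1 with replace = TRUE keeps every group at its original size
boots <- tibble(id = 1:100) %>%
  mutate(data = map(id, ~ dat %>%
    group_by(group, subgroup) %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    ungroup()))

boots
```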
My dataset looks something like this:
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"))
df <- matrix(rnorm(12*4), ncol = 12)
colnames(df) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3", "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"), df)
df
compound AC.1 AC.2 AC.3 AM.1 AM.2 AM.3 SC.1 SC.2 SC.3 SM.1
1 alanine 1.18362683 -2.03779314 -0.7217692 -1.7569264 -0.8381042 0.06866567 0.2327702 -1.1558879 1.2077454 0.437707310
2 arginine -0.19610110 0.05361113 0.6478384 -0.1768597 0.5905398 -0.67945600 -0.2221109 1.4032349 0.2387620 0.598236199
3 asparagine 0.02540509 0.47880021 -0.1395198 0.8394257 1.9046667 0.31175358 -0.5626059 0.3596091 -1.0963363 -1.004673116
4 aspartate -1.36397906 0.91380826 2.0630076 -0.6817453 -0.2713498 -2.01074098 1.4619707 -0.7257269 0.2851122 -0.007027878
I want to perform a t-test for each row (compound) on the columns [2:4] as one, and [5:7] as one, and store all the p-values. Basically see if there is a difference between the AC group and AM group for each compound.
I am aware there is another topic on this, but I couldn't find a viable solution for my problem there.
PS. my real dataset has about 35000 rows (maybe it needs a different solution than only 4 rows)
After selecting the columns of interest, use pmap to apply the t.test on each row by selecting the first 3 and next 3 observations as input to t.test and bind the extracted 'p value' as another column in the original data
library(tidyverse)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{t.test(.[1:3], .[4:6])$p.value}) %>%
bind_cols(df, pval_AC_AM = .)
Or after selecting the columns, do a gather to convert to 'long' format, spread, apply the t.test in summarise and join with the original data
df %>%
select(compound, AC.1:AM.3) %>%
gather(key, val, -compound) %>%
separate(key, into = c('key1', 'key2')) %>%
spread(key1, val) %>%
group_by(compound) %>%
summarise(pval_AC_AM = t.test(AC, AM)$p.value) %>%
right_join(df)
Update
If a row contains only a single unique value in both groups (zero variance), t.test throws an error. One option is to return NA for those cases instead, which can be done by wrapping the call in possibly
posttest <- possibly(function(x, y) t.test(x, y)$p.value, otherwise = NA)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{posttest(.[1:3], .[4:6])}) %>%
bind_cols(df, pval_AC_AM = .)
posttest(rep(3,5), rep(1, 5))
#[1] NA
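The same idea can be sketched in base R with apply() and tryCatch(), which avoids the tidyverse dependency. The seeded demo data frame and the helper name safe_p are made up for illustration:

```r
set.seed(1)

# Small stand-in for the real data: 3 AC and 3 AM replicates per compound
m <- matrix(rnorm(4 * 6), ncol = 6,
            dimnames = list(NULL, c("AC.1", "AC.2", "AC.3", "AM.1", "AM.2", "AM.3")))
demo <- data.frame(compound = c("alanine", "arginine", "asparagine", "aspartate"), m)

# Return NA instead of an error when t.test() cannot be computed
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# Row-wise test: columns 2:4 are the AC group, 5:7 the AM group
demo$pval_AC_AM <- apply(demo[, 2:7], 1, function(r) safe_p(r[1:3], r[4:6]))

demo[, c("compound", "pval_AC_AM")]
```

apply() still calls t.test() once per row, so for around 35000 rows a vectorised implementation is likely to be noticeably faster.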
If you can use an external library:
library(matrixTests)
row_t_welch(df[,2:4], df[,5:7])$pvalue
[1] 0.67667626 0.39501003 0.26678161 0.01237438