I have a data frame that needs to be grouped by multiple columns so that I can calculate ratios between the values of different conditions, followed by the row-wise means and standard deviations of those ratios.
grouper1 grouper2 condition value
foo baz A 1
foo baz B 2
foo oof A 1
foo oof C 3
bar zab B 2
bar zab C 4
Based on this elegant answer I have managed to build a generic solution:
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
crossing(c("A"), c("B","C")) %>%
pmap(~ df %>%
group_by(grouper1, grouper2) %>%
summarise(!! str_c('ratio_', ..1, ..2) :=
value[condition == ..1]/value[condition == ..2])) %>%
reduce(full_join, by = c('grouper1', 'grouper2')) %>%
ungroup() %>% mutate(mean=rowMeans(select(.,-c(grouper1, grouper2))), SD=unlist(pmap(select(.,-c(grouper1, grouper2)), ~sd(c(...)))))
This works well if all the values in condition column are found in all groups. If this is not the case, e.g. A is not present in the second grouping using grouper1 in the above example, I will receive the following error:
Error: Column ratio_AC must be length 1 (a summary value), not 0
I could obviously preselect the values for crossing, but this would require a filter on the df and I would lose generality. I would thus like a solution that simply ignores the missing combinations and still calculates the metrics.
One possible solution would be pivot_wider, but I have not been able to implement a working solution for calculating the ratios with it.
We could reshape to wide format with pivot_wider, which fills the missing group/condition combinations with NA, and then compute the ratios from that dataset
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df1 <- df %>%
pivot_wider(names_from = condition, values_from = value)
crossing(v1 = c("A"), v2 = c("B","C")) %>%
pmap(~ df1 %>%
transmute(grouper1, grouper2,
!! str_c('ratio_', ..1, ..2) :=
.[[..1]]/.[[..2]]))%>%
reduce(full_join, by = c('grouper1', 'grouper2')) %>%
mutate(mean = rowMeans(select(., -grouper1, -grouper2), na.rm = TRUE),
SD= pmap_dbl(select(., -grouper1, -grouper2),
~sd(c(...), na.rm = TRUE)))
data
df <- structure(list(grouper1 = c("foo", "foo", "foo", "foo", "bar",
"bar"), grouper2 = c("baz", "baz", "oof", "oof", "zab", "zab"
), condition = c("A", "B", "A", "C", "B", "C"), value = c(1L,
2L, 1L, 3L, 2L, 4L)), class = "data.frame", row.names = c(NA,
-6L))
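As a quick check of why the wide reshape sidesteps the error, here is a minimal sketch that rebuilds the example data and computes the same two ratios. It swaps the crossing()/pmap() step for across() with a .names template, which is my substitution and not the answer's method, and it assumes dplyr 1.0 or later. The names wide and ratios are made up for illustration. A group that lacks a condition gets NA after pivot_wider, so the division returns NA instead of a zero-length vector.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  grouper1 = c("foo", "foo", "foo", "foo", "bar", "bar"),
  grouper2 = c("baz", "baz", "oof", "oof", "zab", "zab"),
  condition = c("A", "B", "A", "C", "B", "C"),
  value = c(1L, 2L, 1L, 3L, 2L, 4L)
)

# One column per condition; absent combinations become NA
wide <- df %>%
  pivot_wider(names_from = condition, values_from = value)

# Divide A by each of B and C, naming the results ratio_AB and ratio_AC
ratios <- wide %>%
  mutate(across(c(B, C), ~ A / .x, .names = "ratio_A{.col}")) %>%
  select(grouper1, grouper2, starts_with("ratio_"))

# Row-wise summaries that skip the NA ratios
ratios <- ratios %>%
  mutate(
    mean = rowMeans(cbind(ratio_AB, ratio_AC), na.rm = TRUE),
    SD = apply(cbind(ratio_AB, ratio_AC), 1, sd, na.rm = TRUE)
  )

ratios
```

Rows where every ratio is NA end up with NaN for the mean, and SD is NA whenever fewer than two ratios are available, so those rows may still need filtering afterwards.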
Related
I have a data frame in tidy format as follows:
df <- data.frame(name = c("A", "C", "B", "A", "B", "C", "D") ,
group = c(rep("case", 3), rep("cntrl", 4)),
mean = rnorm(7, 0,1))
I would like to group the data by two variables, name and group, and apply a t.test to the mean values of each category: for example, run a t.test between A_case and A_cntrl and add the p-value to the table.
Do you have any idea how I can do this using the tidyverse packages?
Thanks,
Here, a group-wise t.test on 'name' cannot be performed, as there is only a single observation for each name/group pair. Instead, we can compare the two groups across all names
library(dplyr)
df %>%
summarise(ttest = list(t.test(mean[group == 'case'],
mean[group == 'cntrl']))) %>%
pull(ttest)
Update
If we need to create a column, use mutate
df %>%
mutate(pval = t.test(mean[group == 'case'],
mean[group == 'cntrl'])$p.value)
Or reshape to 'wide' format and then do the t.test on the columns
library(tidyr)
df %>%
pivot_wider(names_from = group, values_from = mean) %>%
summarise(ttest = list(t.test(case, cntrl))) %>%
pull(ttest)
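If only the p-value is needed as a one-row summary, a sketch along these lines may help. The fixed seed and the column names n_case, n_cntrl and pval are my own additions for reproducibility and illustration; the test itself is the same case-versus-cntrl comparison as above.

```r
library(dplyr)

set.seed(42)  # fixed seed so the example is reproducible (an assumption)
df <- data.frame(
  name = c("A", "C", "B", "A", "B", "C", "D"),
  group = c(rep("case", 3), rep("cntrl", 4)),
  mean = rnorm(7, 0, 1)
)

# Collapse to one row: group sizes plus the Welch t-test p-value
res <- df %>%
  summarise(
    n_case = sum(group == "case"),
    n_cntrl = sum(group == "cntrl"),
    pval = t.test(mean[group == "case"], mean[group == "cntrl"])$p.value
  )

res
```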
Let's assume I have a data frame with lots of columns: var1, ..., var100, and also a matching named vector of the same length.
I would like to create a function that, wherever the data frame has NAs, picks the replacement value from the named vector. This is what I wrote so far:
data %>%
mutate(var1 = ifelse(is.na(var1), named_vec["var1"], var1),
var2 = ifelse(is.na(var2), named_vec["var2"], var2),
...)
It works; however, with hundreds of variables it would be very impractical to write so many conditions. I then tried this:
data %>%
mutate_if(~ifelse(is.na(.x), named_vec[colnames(.x)], .x))
Error in selected[[i]] <- eval_tidy(.p(column, ...)) :
more elements supplied than there are to replace
However, this does not work. Is there a way in dplyr to extract the column name so I can index the named vector?
Here is a small example of data to try:
data <- data.frame(var1 = c(1, 1, NA, 1),
var2 = c(2, NA, NA, 2),
var3 = c(3, 3, 3, NA))
named_vec <- c("var1" = 1, "var2" = 2, "var3" = 3)
It may be easier to do this with coalesce
library(dplyr)
library(purrr)
library(stringr)
nm1 <- str_c('var', 1:3)
data[nm1] <- map_dfc(nm1, ~ coalesce(data[[.x]], named_vec[.x]))
data
# var1 var2 var3
#1 1 2 3
#2 1 2 3
#3 1 2 3
#4 1 2 3
Or if we replicate the 'named_vec',
data[] <- coalesce(c(as.matrix(data)), named_vec[col(data)])
Another option is to convert to 'long' format, then do a left_join, coalesce the 'value' columns, and reshape back to 'wide' format
library(tidyr)
library(tibble) # for enframe
data %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
left_join(enframe(named_vec), by = 'name') %>%
transmute(rn, name, value = coalesce(value.x, value.y)) %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-rn)
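To address the original question of getting at the column name inside dplyr, one more option is across() together with cur_column(). This is a sketch assuming dplyr 1.0 or later; the result name filled is made up for illustration.

```r
library(dplyr)

data <- data.frame(var1 = c(1, 1, NA, 1),
                   var2 = c(2, NA, NA, 2),
                   var3 = c(3, 3, 3, NA))
named_vec <- c(var1 = 1, var2 = 2, var3 = 3)

# cur_column() gives the name of the column currently being processed,
# which is used to look up the matching replacement in named_vec
filled <- data %>%
  mutate(across(all_of(names(named_vec)),
                ~ coalesce(.x, named_vec[[cur_column()]])))

filled
```

Using all_of(names(named_vec)) restricts the replacement to columns that actually have an entry in the named vector, so other columns are left untouched.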
This is an extension of the question I asked here, where I was looking for a way to automate my labeling of subjects into groups based on whether their data matched my filter.
Prior to attempting to automate the labeling, this is what I had.
library(tidyverse)
df <- structure(list(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
Location = c(1, 2, 3, 1, 4, 2, 1, 2, 5)), class = "data.frame",
row.names = c(NA, -9L))
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
complete.df <- df2 %>% filter(complete.cases(.))
In my actual data, there are some rows that have NA's and I need to be able to filter for both complete and incomplete cases so I can review the sub-data sets separately if needed.
My new code looks like this which assigns a subject to a group based on if they have a location data point 4 or 5:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>% ##this chunk breaks filter(complete.cases(.))
group_by(Subj_ID) %>%
mutate(group2 = case_when(any(Location == 4) | any(Location == 5) ~ "YES", TRUE ~ "NO"))
complete.df <- df3 %>% filter(complete.cases(.))
Once I generate df3 by mutating df2, my filter(complete.cases(.)) subsequently fails.
Yet, if I generate df3 by manual recoding, it works! Like so:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>%
mutate(group2=
if_else(Subj_ID ==2 |
Subj_ID ==3,
"TRUE", "FALSE"))
complete.df <- df3 %>% filter(complete.cases(.))
Thoughts?
It is the grouping attribute added by group_by that causes the issue; it can be solved by ungrouping and then applying the filter. The OP's last code block (manual recoding) does not create a grouping attribute, and thus it works
library(dplyr)
df3 %>%
ungroup %>%
filter(complete.cases(.))
Or instead of complete.cases in filter, we can use !is.na with filter_all and all_vars (so that a row is kept only when every column is non-missing) without removing the grouping attribute
df3 %>%
filter_all(all_vars(!is.na(.)))
The OP mentioned that the last code block works, but that is because it has no grouping attribute. If we create one, it fails too
df3 %>%
group_by(group) %>%
filter(complete.cases(.))
Error: Result must have length 3, not 9
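The length mismatch in that error comes from the fact that, inside a grouped filter(), the condition is evaluated once per group, while `.` still refers to the whole piped data frame. So complete.cases(.) returns one value per row of the full table (9) where a group of 3 rows is expected. The sketch below reproduces the setup with two NA values that I added for illustration (the original example data has none) and shows two ways to split complete from incomplete rows: ungrouping first, or tidyr's drop_na(), which can be applied to the grouped data directly.

```r
library(dplyr)
library(tidyr)

# Example data with two NAs added for illustration (an assumption)
df <- data.frame(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
                 Location = c(1, 2, NA, 1, 4, 2, 1, NA, 5))

# Flag each subject by whether any of its rows has Location 4 or 5
df3 <- df %>%
  group_by(Subj_ID) %>%
  mutate(group2 = if_else(any(Location %in% c(4, 5)), "YES", "NO"))

# Option 1: drop the grouping, then filter on complete.cases
complete_a <- df3 %>% ungroup() %>% filter(complete.cases(.))

# Option 2: drop_na() keeps the grouping and removes rows with any NA
complete_b <- df3 %>% drop_na()

# The incomplete rows, for separate review
incomplete <- df3 %>% ungroup() %>% filter(!complete.cases(.))
```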
I want a column that tracks which items are included in a set based on a predicate. It seems like I should be able to do this with some combination of the purrr accumulate function and the dplyr lead/lag and union/setdiff functions.
This is probably best expressed as a reprex:
input_df <- dplyr::tibble(user = c("1", "1", "1", "1"),
item = c("a", "b", "a", "a"),
include = c(TRUE, TRUE, FALSE, TRUE))
output_df <- dplyr::tibble(user = c("1", "1", "1", "1"),
set = list(
c("a"),
c("a", "b"),
c("b"),
c("a", "b")))
Edit: I'm very close. I need to find a way of finding the "bag difference" (instead of the set difference) between vectors in case a user includes, excludes and then re-includes an item.
numbered_input_df <- input_df %>%
mutate(id = row_number())
include_df <- numbered_input_df %>%
filter(include == TRUE) %>%
mutate(include_set = purrr::accumulate(item, c)) %>%
select(user, id, include_set)
exclude_df <- numbered_input_df %>%
filter(include == FALSE) %>%
mutate(exclude_set = purrr::accumulate(item, c)) %>%
select(user, id, exclude_set)
numbered_input_df %>%
left_join(include_df) %>%
left_join(exclude_df) %>%
fill(include_set, exclude_set) %>%
mutate(set = map2(include_set, exclude_set, ~.x[! .x %in% .y]))
Define Update, which takes the union or setdiff of the basket with the ith item, and use Reduce to apply it to each i. Use ave to do all that by user. No packages are used.
Update <- function(basket, i) with(input_df[i, ],
(if (include) union else setdiff)(basket, item)
)
n <- nrow(input_df)
reduce_user <- function(ix) Reduce(Update, init = NULL, ix, accumulate = TRUE)[-1]
transform(input_df["user"], set = I(ave(as.list(1:n), user, FUN = reduce_user)))
giving:
user set
1 1 a
2 1 a, b
3 1 b
4 1 b, a
Alternately, translating the above to dplyr and purrr and making use of Update from above we get the code below.
library(dplyr)
library(purrr)
input_df %>%
mutate(ix = 1:n()) %>%
group_by(user) %>%
mutate(set = accumulate(ix, Update, .init = NULL)[-1]) %>%
ungroup %>%
select(user, set)
(Note that the only use of purrr is accumulate and that could easily be replaced with Reduce if you want to reduce dependencies.)
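A variation that avoids row indices and the lookup into the global input_df is purrr's accumulate2(), which walks item and include in parallel. This is only a sketch of the same union/setdiff idea; the helper name update_basket and the empty character vector used as the starting basket are my assumptions.

```r
library(dplyr)
library(purrr)

input_df <- tibble(user = c("1", "1", "1", "1"),
                   item = c("a", "b", "a", "a"),
                   include = c(TRUE, TRUE, FALSE, TRUE))

# Add the item to the basket when include is TRUE, otherwise remove it
update_basket <- function(basket, item, include) {
  if (include) union(basket, item) else setdiff(basket, item)
}

result <- input_df %>%
  group_by(user) %>%
  # .init is the empty basket; [-1] drops it from the accumulated list
  mutate(set = accumulate2(item, include, update_basket,
                           .init = character(0))[-1]) %>%
  ungroup() %>%
  select(user, set)

result$set
```

As with the Reduce version, the last basket comes out as b, a instead of a, b, because union appends the re-included item at the end.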
Data:
df <- data.frame(A=c(rep(letters[1],3),rep(letters[2],3),rep(letters[3],3)),
B=rnorm(9),
stringsAsFactors=F)
I don't know if there's a way to do this, but what I'd like to know is whether there's a way to discard the last group by directly referencing the groups after group_by(A) to get the desired output:
A B
1 a -0.4900863
2 a 1.4106594
3 a -0.2245738
4 b -0.2124955
5 b 0.6963785
6 b 0.9151825
I AM INTERESTED IN SOLUTIONS THAT DIRECTLY WORK AT THE GROUPS LEVEL
For instance, something like:
df %>% group_by(A) %>% head(.Groups,-1)
or
df %>% group_by(A) %>% Groups[1:2]
I AM NOT INTERESTED IN THE FOLLOWING KINDS OF SOLUTIONS
df %>% filter(!(A == max(A)))
df %>% filter(!(A %in% max(A)))
OR OTHER SOLUTIONS THAT DO NOT REQUIRE group_by TO WORK
I am assuming we are not supposed to know in advance how many groups there are. Try using the labels attribute:
all_but_last <- df %>% group_by(A) %>% attr("labels") %>% head(-1)
A
1 a
2 b
... to extract desired rows
> df %>% filter(A %in% all_but_last[[1]])
A B
1 a -0.799026840
2 a -0.712402478
3 a 0.685320094
4 b 0.971492883
5 b -0.001479117
6 b -0.817766296
It helps to use dput to look at the actual contents of a "grouped_df":
dput( df %>% group_by(A) )
structure(list(A = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), B = c(-0.799026840397576, -0.712402478350695, 0.685320094252465,
0.971492883452258, -0.00147911717469651, -0.817766295631676,
-1.00112471676908, 1.88145909873596, -0.305560178617216)), .Names = c("A",
"B"), row.names = c(NA, -9L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = "A", drop = TRUE, indices = list(
0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 3L), biggest_group_size = 3L,
labels = structure(list(
A = c("a", "b", "c")),
row.names = c(NA, -3L),
class = "data.frame",
vars = "A", drop = TRUE, .Names = "A"))
Note that the labels are a data.frame so you could have further applied unlist to the result that became all_but_last and you then would not have needed to extract its value with "[[".
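The "labels" attribute shown above belongs to older dplyr releases. As far as I know, since dplyr 0.8 the group metadata is kept in a single "groups" attribute, and group_keys() is the supported way to read the keys. A sketch of the same idea for recent versions follows, assuming dplyr 1.0 or later for cur_group_id(); the fixed seed and the names grouped, res and res2 are made up for illustration.

```r
library(dplyr)

set.seed(1)  # fixed seed so the example is reproducible (an assumption)
df <- data.frame(A = rep(c("a", "b", "c"), each = 3),
                 B = rnorm(9),
                 stringsAsFactors = FALSE)

grouped <- df %>% group_by(A)

# Keys of every group except the last one
all_but_last <- grouped %>% group_keys() %>% head(-1)

# Keep only the rows whose key is among the retained groups
res <- df %>% semi_join(all_but_last, by = "A")

# Alternative that works on the group index directly
res2 <- grouped %>%
  filter(cur_group_id() < n_groups(grouped)) %>%
  ungroup()
```

Both variants work with any number of groups, since nothing depends on knowing the group labels or their count in advance.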
Perhaps this helps
library(dplyr)
df %>%
group_by(A) %>%
group_indices(.) %in% 1:2 %>%
df[.,]
Or with data.table
library(data.table)
setDT(df)[, grp := .GRP, A][grp %in% unique(grp)[1:2]][, grp := NULL][]