Multiple group_by with dplyr - r

There is a lot of questions about it on this forum but I could not do it. I got a dataframe with a bunch of categrocial variables (class factor). I got a target column (1 or 0). I want to compute the frequency of 1's within each level of the categorical variables. I want to do those 3 group_by computations in once.
library(dplyr)
# Build the toy dataset
target = sample(x = c(0,1),size = 100,replace = T)
cat1 = sample(x = c("a","b","c"),size = 100,replace = T)
cat2 = sample(x = c("x","y","z"),size = 100,replace = T)
cat3 = sample(x = c("T","U","V"),size = 100,replace = T)
df = data.frame(target,cat1,cat2,cat3)
# How to do those 3 group_by computations in once knowing that in reality I got thousands of those categorical columns?
df %>%
group_by(cat1) %>%
summarise(statistics = mean(target))
df %>%
group_by(cat2) %>%
summarise(statistics = mean(target))
df %>%
group_by(cat3) %>%
summarise(statistics = mean(target))

If I understand your question correctly, I believe this code can help you:
df %>%
group_by(cat1,cat2,cat3) %>%
summarise(statistics = mean(target)) %>% arrange(cat1,cat2)

Related

How to plot sjPlots from a nested tibble?

I create some models like this using a nested tidyr dataframe:
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(purrr)
fits <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0, sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1, sample(10, replace = T), sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data=-group) %>%
mutate(fit= map(data, ~glm(formula = colA ~ colB + colC, data = .x, family="binomial"))) %>%
dplyr::select(group, fit) %>%
tibble::column_to_rownames("group")
I would like to use this data to create some quick marginal effects plots with sjPlot::plot_models like this
plot_models(as.list(fits), type = "pred", terms = c("colB", "colA", "colC"))
Unfortunately, I get the error
Error in if (fam.info$is_linear) tf <- NULL else tf <- "exp" :
argument is of length zero
In addition: Warning message:
Could not access model information.
I've played around a bit with the nesting of the data but I've been unable to get it into a format that sjPlot::plot_models will accept.
What I was expecting to get is a "Forest plot of multiple regression models" as described in the help file. Ultimately, the goal is to plot the marginal effects of regression models by group, which I was hoping the plot_models will do (please correct me if I'm wrong).
It think there are some issues with the original code as well as with the data. There are arguments from plot_model in the function call which are not supported in plot_models. I first show an example that shows how plot_models can be called and used with a nested tibble using {ggplot2}'s diamonds data set. Then I apply this approach to the OP's sample data, which doesn't yield useable results*. Finally, I create some new toy data to show how the approach could be applied to a binominal model.
(* In the original toy data the dependent variable is either always 0 or always 1 in each model so this is unlikely to yield useable results).
set.seed(1)
library(tidyr)
library(dplyr)
library(sjPlot)
library(tibble)
library(ggplot2)
# general example
fits <- tibble(id = c("x", "y", "z")) %>%
rowwise() %>%
mutate(fit = list(glm(reformulate(
termlabels = c("cut", "color", "depth", "table", "price", id),
response = "carat"),
data = diamonds)))
plot_models(fits$fit)
# OP's example data
fits2 <- tribble(~group, ~colA, ~colB, ~colC,
sample(c("group1", "group2"), 10, replace = T), 0,
sample(10, replace = T), sample(10, replace = T),
sample(c("group1", "group2"), 10, replace = T), 1,
sample(10, replace = T),
sample(10, replace = T)) %>%
unnest(cols = c(colB, colC)) %>%
nest(data = -group) %>%
rowwise() %>%
mutate(fit = list(glm(formula = colA ~ colB + colC, data = data, family="binomial")))
plot_models(fits2$fit)
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 4 rows containing missing values (geom_point).
# new data for binominal model
n <- 500
g <- round(runif(n, 0L, 1L), 0)
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y <- (x2 - x1 + rnorm(n,sd=20)) < 0
fits3 <- tibble(g, y, x1, x2) %>%
nest_by(g) %>%
mutate(fit = list(glm(formula = y ~ x1 + x2, data = data, family="binomial")))
plot_models(fits3$fit)
Created on 2021-01-23 by the reprex package (v0.3.0)

Group dataframe row and column wise based on other dataframe?

I have a dataframe that I would like to group in both directions, first rowise and columnwise after. The first part worked well, but I am stuck with the second one. I would appreciate any help or advice for a solution that does both steps at the same time.
This is the dataframe:
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
This is the second dataframe, which holds the "recipe" for grouping:
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
Rowise grouping:
df1_grouped<-bind_cols(df1[1:2], map_df(df2, ~rowSums(df1[unique(.x)])))
Now i would like to apply the same grouping to the ID2 column and sum the values in the other columns. My idea was to mutate a another column (e.g. "group", which contains the name of the final group of ID2. After this i can use group_by() and summarise() to calculate the sum for each. However, I can't figure out an automated way to do it
bind_cols(df1_grouped,
#add group label
data.frame(
group = rep(c("Group_2","Group_1","Group_1","Group_2","Group_3"),2))) %>%
#remove temporary label column and make ID a character column
mutate(ID2=group,
ID=as.character(ID))%>%
select(-group) %>%
#summarise
group_by(ID,ID2)%>%
summarise_if(is.numeric, sum, na.rm = TRUE)
This is the final table I need, but I had to manually assign the groups, which is impossible for big datasets
I will offer such a solution
library(tidyverse)
set.seed(1)
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
df2 <- df2 %>% pivot_longer(everything())
df1 %>%
pivot_longer(-c(ID, ID2)) %>%
mutate(gr_r = df2$name[match(ID2, table = df2$value)],
gr_c = df2$name[match(name, table = df2$value)]) %>%
arrange(ID, gr_r, gr_c) %>%
pivot_wider(c(ID, gr_r), names_from = gr_c, values_from = value, values_fn = list(value = sum))

How do you compare means row-wise for the same ratings object in the R expss package?

I have repeated measures data with two ratings (reliable and fast) repeated on two different objects, (each survey respondent rates each object using the same two ratings measures). I would like to have two columns, one for object 1 and one for object 2, with the ratings displayed in two separate rows.
In the reference manual there is reference to using a | separator to compare two variables, but the example given is for mrsets not means, I'm not sure how to do the same with means and keep them in separate data frame columns.
In the code below, the problem is that instead of placing the means side by side (for comparison) they are stacked on top of each other.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE))
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot()
I discovered that if I add variable labels first and use 'tab_pivot(stat_position = "inside_columns")' it solved the problem.
#library
library(expss)
library(magrittr)
#dummy data
set.seed(9)
df <- data.frame(
q1_reliable=sample(c(1,5), 100, replace = TRUE),
q1_fast=sample(c(1,5), 100, replace = TRUE),
q2_reliable=sample(c(1,5), 100, replace = TRUE),
q2_fast=sample(c(1,5), 100, replace = TRUE)
)
#labels
df = apply_labels(df,
q1_reliable = "reliable",
q1_fast = "fast",
q2_reliable = "reliable",
q2_fast = "fast")
#table
df %>%
tab_cells(q1_reliable,q1_fast) %>%
tab_stat_mean(label = "") %>%
tab_cells(q2_reliable,q2_fast) %>%
tab_stat_mean(label = "") %>%
tab_pivot(stat_position = "inside_columns")
Like this data.table approach?
library(data.table)
#melt first
DT <- melt( setDT(df),
measure.vars = patterns( reliable = "reliable", fast = "fast"),
variable.name = "q")
#then summarise
DT[, lapply(.SD, mean), by = .(q), .SDcols = c("reliable", "fast")]
q reliable fast
1: 1 3.04 2.96
2: 2 2.92 2.96

Use variable names in function in dplyr for sum and cumsum

dplyr programming question here. Trying to write a dplyr function which takes column names as inputs and also filters on a component outlined in the function. What I am trying to recreate is as follow called test:
#test df
x<- sample(1:100, 10)
y<- sample(c(TRUE, FALSE), 10, replace = TRUE)
date<- seq(as.Date("2018-01-01"), as.Date("2018-01-10"), by =1)
my_df<- data.frame(x = x, y =y, date =date)
test<- my_df %>% group_by(date) %>%
summarise(total = n(), total_2 = sum(y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter(date >= "2018-01-03")
The function I am testing is as follows:
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- enquo(cumulative_y)
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(!!cumulative_y ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data = my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-03")
I have looked looked at some examples of using enquo and this thread gets me half way there:
Use variable names in functions of dplyr
But the issue is I get two different data frame outputs for test 1 and test 2. The one from the function outputs does not have data from the logical y referenced column.
I also tried this instead
cumsum_df<- function(data, date_field, cumulative_y, minimum_date = "2017-04-21") {
date_field <- enquo(date_field)
cumulative_y <- deparse(substitute(cumulative_y))
data %>% group_by(!!date_field) %>%
summarise(total = n(), total_2 = sum(data[[cumulative_y]] ==TRUE, na.rm=TRUE)) %>%
mutate(cumulative_a = cumsum(total), cumulative_b = cumsum(total_2)) %>%
ungroup() %>% filter((!!date_field) >= minimum_date)
}
test2<- cumsum_df(data= my_df, date_field = date, cumulative_y = y, minimum_date = "2018-01-04")
Based on this thread: Pass a data.frame column name to a function
But the output from my test 2 column is also wildly different and it seems to do some kind or recursive accumulation. Which again is different to my test date frame.
If anyone can help that would be much appreciated.

Nested Lists with different lengths into dataframe

I am using R and the package mongolite to get data from a MongoDB. This results in data consisting of lots of nested lists that cannot be simplified into a data frame by using unlist, rbindlist, nor bind_rows from dplyr (at least I did not manage to do it).
After a lot of trial and error I found a way to do it using the package reshape2 with the function melt and using dplyr and tidyr to get it into the form I want it to be. However, the melting takes a lot of time (up to 15 mins per list, and I have 6 of them).
Do you have any ideas how I can make this faster? (Of course another possible solution would be to query the MongoDB correctly such that it does not result in lists but in something more like my target data frame).
The following code creates a dummy dataset with similar attributes, the target form of the dataset and my solution to get there.
Dummy Data:
dummy_data <- list(
list(actions = list(list(action_type = "link_clicks", value = 30),
list(action_type = "post_likes", value = 3)),
date = '2015-08-04'),
list(actions = list(list(action_type = "link_clicks", value = 10),
list(action_type = "post_likes", value = 2),
list(action_type = "page_engagement", value = 5)),
date = '2015-08-02')
)
Target Form:
final_data = data.frame(c(30, 10), c(3, 2), c(NA, 5), c('2015-08-04', '2015-08-02'))
names(final_data) = c('actions: link_clicks', 'actions: post_likes', 'actions: page_engagement', 'date')
final_data
Temporary solution
Solution <- reshape2::melt(dummy_data)
Solution <- Solution %>%
select(L1, L2, L3, L4, value) %>%
mutate(L4 = ifelse(is.na(L4), L2, L4)) %>%
spread(key = L4, value = value) %>%
mutate(L2 = ifelse(!is.na(action_type), paste0(L2, ": ", action_type), L2),
value = ifelse(!is.na(value), value, date)) %>%
select(L1, L2, value) %>%
spread(key = L2, value = value) %>%
select(-L1)
If you have any advice on the mongolite query here is the simplest kind of queries I use:
M_DB <- mongolite::mongo(collection = "name", url = "url")
M_DB_List <- M_DB$iterate()$batch(size = 100000)
Thanks a lot
**Edit: **
A more complex data structure as this is closer to my problem
dummy_data_complex <- list(
list(actions = list(list(action_type = "link_clicks", value = 30),
list(action_type = "post_likes", value = 3)),
date = '2015-08-04',
currency = 'EUR'),
list(actions = list(list(action_type = "link_clicks", value = 10),
list(action_type = "post_likes", value = 2),
list(action_type = "page_engagement", value = 5)),
date = '2015-08-02',
demographics = list(gender = "female",
list(actions = list(action_type = "link_clicks", value = 10)))
))
Here is an option with tidyverse
library(tidyverse)
dummy_data %>%
map_df(~ .x %>%
as_tibble(.) %>%
mutate(actions = map(actions, as_tibble)) %>%
unnest) %>%
group_by(date, action_type) %>%
mutate(n = row_number()) %>%
spread(action_type, value) %>%
select(-n)
# A tibble: 2 x 4
# Groups: date [2]
# date link_clicks page_engagement post_likes
#* <chr> <dbl> <dbl> <dbl>
#1 2015-08-02 10.0 5.00 2.00
#2 2015-08-04 30.0 NA 3.00
I was able to find a fast solution for my problem.
It was solved by using another query as SymbolixAU suggested.
Instead of using iterate() I used find() which then resulted in a dataframe with nested lists inside it.
From this point on I was easily able to get to my target using tidyr::unnest()
Thanks for your help.

Resources