dplyr function group_by several variables - r

I've read the introduction to R's dplyr programming (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html), which is very useful.
I often build quite complex functions which include several sets of grouping variables. For example, given a dataset df, I may want the function to summarise by some variables (let's say grouping variables G1 and G2) and then summarise by some others (let's say G3), and I'll then use these summaries together to produce some final result.
df <- data.frame(xV = 1:3,yV=0:2, G1 =c(1,1,0),G2=c(0,0,1),G3=c(1,1,1))
#Within my function I want to calculate
#a)
df%>%group_by(G1,G2)%>%summarise(MEANS1= mean(xV,na.rm=T))
#As well as (b)
df%>%group_by(G3)%>%summarise(MEAN2= mean(xV,na.rm=T))
If I only had to do the first grouping, i.e. (a), I could build a function using the ellipsis (...):
TAB2<-function(data,x,...){
require(dplyr)
x<-enquo (x)
groupSet1 <- enquos(...)
data%>%group_by(!!!(groupSet1))%>%
summarise(MEAN=mean(!!x,na.rm=T))
}
#Which gives me my results
TAB2(data=df,x=xV,G1,G2)
# A tibble: 2 x 3
# Groups: G1 [2]
G1 G2 MEAN
<dbl> <dbl> <dbl>
1 0 1 3
2 1 0 1.5
But if I want to do both (a) and (b) I need in some way to distinguish between the first and second set of grouping variables (G1, G2) and G3 respectively. I can't do it by just chucking the grouping variables after all the other inputs. Is there any way I can specify these two sets in the input, something along the lines of
TAB3<-function(data,x,y, GroupSet1=c(G1,G2) and GroupSet2=(G3)){
x<-enquo (x)
y<-enquo (y)
#a)
df%>%group_by(GroupSet1)%>%summarise(MEANS1= mean(!!x,na.rm=T))
#As well as (b)
df%>%group_by(GroupSet2)%>%summarise(MEAN2= mean(!!y,na.rm=T))
}
I have tried to "quote" the two sets in a similar way to x<-enquo(x) in a range of ways but I always get an error. Could you please help? If it was also possible to pass a list of variables as x and y to summarise_at it would also make the function as generic as possible, which would be even better. Basically I'm trying to create a template function that can take several variable sets x and y as well as several group sets, with the aim to produce the mean of the variables in the sets x and y by the corresponding group sets (G1, G2 and G3 respectively).

You can try
TAB3<-function(data, y, grouping_list){
require(tidyverse)
map(grouping_list, ~group_by_at(data, .) %>%
summarise_at(y, list(Mean= mean), na.rm=T)) }
TAB3(df, "xV", list(c("G1", "G2"), c("G3")))
[[1]]
# A tibble: 2 x 3
# Groups: G1 [2]
G1 G2 Mean
<dbl> <dbl> <dbl>
1 0 1 3
2 1 0 1.5
[[2]]
# A tibble: 1 x 2
G3 Mean
<dbl> <dbl>
1 1 2

If you wanted to use the ellipsis as per your TAB2 example, you could try the following (updated based on new info):
TAB3<-function(df,x,...){
args <- substitute(list(...))
names_env <- setNames(as.list(names(df)), names(df))
arg_list <- eval(args, names_env)
out <- vector(mode = "list", length(arg_list))
for(i in seq_along(arg_list)){
out[[i]] <- df %>% group_by(!!!syms(arg_list[[i]])) %>%
summarise_at(vars(!!!enquos(x)) ,.funs = list(mean=mean), na.rm = T)
}
out
}
TAB3(df, x = c(xV,yV), GroupSet1=c(G1,G2), GroupSet2=G3)
#[[1]]
# A tibble: 2 x 4
# Groups: G1 [2]
# G1 G2 xV_mean yV_mean
# <dbl> <dbl> <dbl> <dbl>
#1 0 1 3 2
#2 1 0 1.5 0.5
#[[2]]
# A tibble: 1 x 3
# G3 xV_mean yV_mean
# <dbl> <dbl> <dbl>
#1 1 2 1
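For readers on a recent dplyr release, where group_by_at() and summarise_at() are superseded by across(), the same idea can be sketched as below. This is only a sketch under assumptions: the function name mean_by_groups and its argument names are hypothetical, and variables and grouping sets are passed as character vectors rather than bare names.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: mean of `vars` within each grouping set in `group_sets`.
# `vars` is a character vector; `group_sets` is a list of character vectors.
mean_by_groups <- function(data, vars, group_sets) {
  map(group_sets, function(g) {
    data %>%
      group_by(across(all_of(g))) %>%
      summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE),
                       .names = "{.col}_mean"),
                .groups = "drop")
  })
}

df <- data.frame(xV = 1:3, yV = 0:2, G1 = c(1, 1, 0), G2 = c(0, 0, 1), G3 = c(1, 1, 1))
res <- mean_by_groups(df, c("xV", "yV"),
                      list(GroupSet1 = c("G1", "G2"), GroupSet2 = "G3"))
```

The result is a named list with one summary tibble per grouping set, so res$GroupSet1 and res$GroupSet2 can be combined later inside the calling function.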

Related

Re-order columns: split the second half of columns up so that they come as every second

I want to rearrange a dataframe so that 1) the first column always stays first, and 2) that the second half of the remaining columns are split up to come every second instead.
Please see the example data below:
# Example data
N <- "AB"
l_x_1 <- 1
l_x_2 <- 2
l_x_3 <- 3
# ... note: it should be able to handle a different number of columns
s_x_1 <- 1
s_x_2 <- 2
s_x_3 <- 3
# ... note: it should be able to handle a different number of columns (although always an equal number of s_ and l_ columns).
# Current state
df <- tibble(N, l_x_1, l_x_2, l_x_3, s_x_1, s_x_2, s_x_3)
df
# What I want (but with the potential to handle a different number of l_ and s_ columns)
df <- tibble(N, l_x_1, s_x_1, l_x_2, s_x_2, l_x_3, s_x_3)
df
df[c(1, matrix(seq_along(df)[-1], 2, byrow = TRUE))]
# # A tibble: 1 × 7
# N l_x_1 s_x_1 l_x_2 s_x_2 l_x_3 s_x_3
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 AB 1 1 2 2 3 3
The idea is to reorder column indices with matrix(). E.g.
c(matrix(1:6, 2, byrow = TRUE))
# [1] 1 4 2 5 3 6
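As a small sketch of how the matrix() trick generalises to any even number of remaining columns, it can be wrapped in a helper. The name interleave_halves is hypothetical, and the sketch assumes the first column stays in place and the remaining columns split into two equal halves.

```r
# Hypothetical helper: keep column 1, then interleave the two halves
# of the remaining columns (first half, second half, first half, ...).
interleave_halves <- function(data) {
  idx <- seq_along(data)[-1]
  stopifnot(length(idx) %% 2 == 0)  # the halves must be the same size
  # Filling a 2-row matrix by row and reading it by column interleaves the halves
  data[c(1, matrix(idx, nrow = 2, byrow = TRUE))]
}

small <- data.frame(N = "AB", l_x_1 = 1, l_x_2 = 2, s_x_1 = 1, s_x_2 = 2)
reordered <- interleave_halves(small)
names(reordered)
```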
A solution based on stringi::stri_reverse:
cbind(df[1], df[-1][order(stringi::stri_reverse(colnames(df[-1])))])
N l_x_1 s_x_1 l_x_2 s_x_2 l_x_3 s_x_3
1 AB 1 1 2 2 3 3
This relies on the fact that if you reverse every name, sorting the reversed names gives the desired order:
> sort(stringi::stri_reverse(colnames(df[-1])))
[1] "1_x_l" "1_x_s" "2_x_l" "2_x_s" "3_x_l" "3_x_s"
library(tidyverse)
(preferred_order <- str_split(names(df)[-1],"_",
n = 3, # because 2 underscores
simplify = TRUE) |>
data.frame() |>
arrange(X3,X1) |> # also because 2 underscores
rowwise() |>
mutate(nm=paste0(c_across(everything()),
collapse = "_")) |> pull(nm))
relocate(df,
all_of(preferred_order),.after = "N")

R dplyr find all mutated rows

I would like to identify all rows of a tibble that have been altered after mutate.
My real data has multiple columns and the mutate function changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name=c("A","B","C","D"),value=c(1,2,3,4))
# mutate df
dfnew <- df %>%
mutate(value=case_when(name=="A" ~ value+1, TRUE ~value)) %>%
mutate(name=case_when(name=="B" ~ "K", TRUE ~name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I am looking for a way to compare all rows of df with dfnew and identify all rows with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 2 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2
You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
@tmfmnk's response does the trick, but if you'd like to use a loop (e.g. for some flexibility using different kinds of messages or warnings depending on what you're checking) you could do:
output <- list()
for (i in 1:nrow(dfnew)) {
if (all(df[i, ] == dfnew[i, ])) {
next
}
output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2
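The row-by-row comparison in the loop can also be sketched without a loop. This is a minimal sketch that assumes df and dfnew have the same rows in the same order, the same columns, and no NA values; the example data is rebuilt in base R so the block is self-contained.

```r
# Rebuild the example: row 1 gets a new value, row 2 gets a new name
df <- data.frame(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))
dfnew <- df
dfnew$value[1] <- 2
dfnew$name[2] <- "K"

# df != dfnew gives a logical matrix; a row changed if any cell differs
changed <- rowSums(df != dfnew) > 0
dfnew[changed, ]
```

Unlike anti_join() or setdiff(), this compares rows by position, so a row that was altered into a copy of another existing row is still reported.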
We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))

R/dplyr: Mutate based on multiple dynamic variable names

I have a list of data frames, each of which contains multiple variables that contain surface area values (ending in "_area"). For each surface area variable there is a corresponding conversion factor (ending in “_unit”) that I want to use to calculate a third variable that contains the area in a standard unit of measurement. I want these variables to end in “_area_ha”.
Below are my sample data frames:
a <- tibble(a1_area = c(1,1,1), a2_area_unit = c(1,1,0.5), a2_area = c(1,1,1),
a1_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
b <- tibble(b1_area = c(1,1,1), b1_area_unit = c(1,1,0.5), b2_area = c(1,1,1),
b2_area_unit = c(1,0.5,0.5), abc = c(1,2,3))
ab_list <- list(a, b)
names(ab_list) <- c("a", "b")
I know how to do this with the help of a loop but would like to understand how this can be done in the tidyverse/dplyr logic. My loop (which gives me the desired output) looks like this:
df_names <- names(ab_list)
for (d in df_names) {
df <- ab_list[[d]]
var_names <- names(select(df, matches("_area$")))
for (v in var_names) {
int <- df %>% select(all_of(v))
int2 <- df %>% select(matches(paste0(names(int), "_unit")))
int3 <- int*int2
names(int3) <- paste0(names(int), "_ha")
df <- cbind(df, int3)
rm(int, int2, int3)
}
ab_list[[d]] <- tibble(df)
rm(df)
}
> ab_list
$`a`
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
I have tried using lapply and mutate_at but my approach does not work. If I understand correctly, this is because my environment is nested and I cannot access x in the function that calculates the variable "ha".
ab_list %>%
lapply(function(x) mutate_at(x, vars(matches("_area$")), list(ha = ~.*x[[paste0(names(.),"_unit")]])))
Error: Column `a1_area_ha` must be length 3 (the number of rows) or one, not 0
Is there a way to get the function within mutate_at to access a variable from the parent data frame based on the name of initial variable within the function?
I would of course be happy about any other suggestion for a tidyverse approach to calculate the "_ha" variables based on dynamic variable names.
Great question. Below is a base R solution. I am sure it can be adapted to a tidyverse solution (e.g., with purrr::map2()). Here I built a function that does a basic test and then used it with lapply(). Note: the answer is tailored for your example, so you'll need to adapt it if you have different column names for the value / units. Hope this helps!!
val_by_unit <- function(data) {
df <- data[order(names(data))]
# Selecting columns for values and units
val <- df[endsWith(names(df), "area")]
unit <- df[endsWith(names(df), "unit")]
# Check names are multiplying correctly
if(!all(names(val) == sub("_unit", "", names(unit)))) {
stop("Not all areas have a corresponding unit")
}
# Multiplying corresponding columns
output <- Map(`*`, val, unit)
# Renaming output and adding columns
data[paste0(names(output), "_ha")] <- output
data
}
Results:
lapply(ab_list, val_by_unit)
$a
# A tibble: 3 x 7
a1_area a2_area_unit a2_area a1_area_unit abc a1_area_ha a2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 0.5 1
3 1 0.5 1 0.5 3 0.5 0.5
$b
# A tibble: 3 x 7
b1_area b1_area_unit b2_area b2_area_unit abc b1_area_ha b2_area_ha
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1
2 1 1 1 0.5 2 1 0.5
3 1 0.5 1 0.5 3 0.5 0.5
The tidyverse functions work best with 'long' formatted data where each of your rows represents a unique data point. To do this, you will want to use the tidyr::pivot_longer function:
# Stack the data frames, recording which one each row came from
dplyr::bind_rows(ab_list, .id = 'df') %>%
# Convert area and unit columns to long format in one step, so that each
# area stays paired with its own unit
tidyr::pivot_longer(
cols = dplyr::matches('_area(_unit)?$'),
names_to = c('site', '.value'),
names_pattern = '^(.+?)_(area_unit|area)$',
values_drop_na = TRUE
) %>%
# Give the conversion factor column a shorter name
dplyr::rename(unit = area_unit) %>%
# Calculate area in HA
dplyr::mutate(
area_ha = area * unit
)
Once your data is in long format, you can just use dplyr::mutate to multiply your area column by the unit column to get an area_ha column. If you want to convert your data back to its original format, you can use tidyr::pivot_wider to convert the data back to a wide format, which would give you columns with names a1_area_ha, a2_area_ha, etc.
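If the wide layout should be kept, the question's original idea (looking up the matching "_unit" column from the name of the current column) can be sketched with across() and cur_column(). This is a sketch, assuming dplyr 1.0 or later and assuming every "_area" column has a partner named exactly "<name>_unit"; the helper name add_ha is hypothetical.

```r
library(dplyr)

a <- tibble(a1_area = c(1, 1, 1), a2_area_unit = c(1, 1, 0.5), a2_area = c(1, 1, 1),
            a1_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
b <- tibble(b1_area = c(1, 1, 1), b1_area_unit = c(1, 1, 0.5), b2_area = c(1, 1, 1),
            b2_area_unit = c(1, 0.5, 0.5), abc = c(1, 2, 3))
ab_list <- list(a = a, b = b)

# Hypothetical helper: for each *_area column, multiply by its *_area_unit partner
add_ha <- function(data) {
  mutate(data, across(ends_with("_area"),
                      # cur_column() is the name of the column being processed;
                      # get() fetches the partner column from the data frame
                      ~ .x * get(paste0(cur_column(), "_unit")),
                      .names = "{.col}_ha"))
}

ab_ha <- lapply(ab_list, add_ha)
```

Each element of ab_ha keeps its original columns and gains one "_area_ha" column per "_area" column.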

How to apply functions sequentially with purrr and pipes

I am struggling with the purrr package.
I am trying to apply the function is.factor to a data frame, and then fct_count on those columns that are factors.
I have tried some variations of modify_if and summarise_if. I guess I am using the dots (.) incorrectly when referring to the previous object.
(A guide about purrr, and dots would be really beneficial if you have a link).
For example,
df <- data.frame(f1 = c("men", "woman", "men", "men"),
f2 = c("high", "low", "low", "low"),
n1 = c(1, 3, 3, 6),
stringsAsFactors = TRUE) # needed in R >= 4.0 so that f1 and f2 are factors
Then
map(df, is.factor)
If I use
map_if(df, is.factor, forcats::fct_count)
I got results for every variable, instead of only for the factors.
I think it is a pretty simple problem, and with a bit of understanding of the dots (.) can be solved.
Thanks in advance
:)
The issue is that map_if returns the unmodified columns as well. Hence, when the OP tries the code (repeating the same code as in the OP just to show)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6 ### it is the same column value unchanged
Here, we can specify .else so that the other columns return NULL, and then discard the NULL elements. The result is a list of factor counts only.
library(tidyverse)
map_if(df, is.factor, forcats::fct_count, .else = ~ NULL) %>%
discard(is.null)
#$f1
## A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
Or another option is summarise_if and place the output in a list
df %>%
summarise_if(is.factor, list(~ list(fct_count(.)))) %>%
unclass
Or another option would be to gather into 'long' format and then count once
gather(df, key, val, f1:f2) %>%
dplyr::count(key, val)
Or this can be done with lapply from base R
lapply(df[sapply(df, is.factor)], fct_count)
Or using only base R
lapply(df[sapply(df, is.factor)], table)
Or the results can be represented in a different way
table(names(df)[1:2][col(df[1:2])], unlist(df[1:2]))
The issue with map_if/modify_if is that it applies the function only to the columns which satisfy the predicate function, and the rest are returned as they are.
Hence, when you try
library(tidyverse)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6
fct_count is applied to columns f1 and f2, which are factors, and column n1 is returned as is. If you want to get only the factor columns in the output, one way would be to select them first and then apply the function
df %>%
select_if(is.factor) %>%
map(forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
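On the question about dots: in a purrr formula lambda (written with ~), .x (or .) stands for the element currently being processed, here one column of the data frame. A minimal sketch, using base table() instead of fct_count so that only purrr is needed:

```r
library(purrr)

df <- data.frame(f1 = c("men", "woman", "men", "men"),
                 f2 = c("high", "low", "low", "low"),
                 n1 = c(1, 3, 3, 6),
                 stringsAsFactors = TRUE)

# Keep only the factor columns, then count levels in each one.
# Inside the lambda, .x is one factor column at a time.
factor_cols <- df[map_lgl(df, is.factor)]
counts <- map(factor_cols, ~ table(.x))
```

Here counts is a named list with one table per factor column (f1 and f2); n1 is dropped before the mapping step, so no .else handling is needed.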

how to write a function that uses broom, dplyr and lm?

Consider this very simple example
library(dplyr)
library(broom)
dataframe <- tibble(id = c(1,2,3,4,5,6),
group = c(1,1,1,2,2,2),
value = c(200,400,120,300,100,100))
# A tibble: 6 x 3
id group value
<dbl> <dbl> <dbl>
1 1 1 200
2 2 1 400
3 3 1 120
4 4 2 300
5 5 2 100
6 6 2 100
Here I want to write a function that outputs the upper bound of the confidence interval for the mean of value. That is,
get_ci_high <- function(data, myvar){
confint_tidy(lm(data = data, myvar ~ 1)) %>% pull(conf.high)
}
Now, this works easily
confint_tidy(lm(data = dataframe, value ~ 1)) %>% pull(conf.high)
[1] 332.9999
This works as well (note the call after a group_by)
dataframe %>% group_by(group) %>% mutate(dealwithit = get_ci_high(., value))
# A tibble: 6 x 4
# Groups: group [2]
id group value dealwithit
<dbl> <dbl> <dbl> <dbl>
1 1 1 200 598.2674
2 2 1 400 598.2674
3 3 1 120 598.2674
4 4 2 300 453.5102
5 5 2 100 453.5102
6 6 2 100 453.5102
This works wonderfully
mindblow <- function(data, groupvar, outputvar){
quo_groupvar <- enquo(groupvar)
quo_outputvar <- enquo(outputvar)
data %>% group_by(!!quo_groupvar) %>%
summarize(output = get_ci_high(., !!quo_outputvar))%>%
ungroup()
}
> mindblow(dataframe, groupvar = group, outputvar = value)
# A tibble: 2 x 2
group output
<dbl> <dbl>
1 1 598.2674
2 2 453.5102
... but this FAILS
get_ci_high(dataframe, value)
Error in eval(expr, envir, enclos) : object 'value' not found
I don't get what is wrong here. I really need a solution that works in the four cases above.
Any ideas?
Many thanks!!
The reason is that when you pass the value argument, you want R to use its name "value" in the formula, rather than the value of the variable (which doesn't exist).
One solution would be to extract the name using substitute() (non-standard evaluation), and create a formula using as.formula:
get_ci_high <- function(data, myvar) {
col_name <- as.character(substitute(myvar))
fmla <- as.formula(paste(col_name, "~ 1"))
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value)
However, I'd strongly recommend passing the formula value ~ 1 as the second argument instead. This is both simpler and more flexible for performing other linear models (when you have predictors as well).
get_ci_high <- function(data, fmla) {
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value ~ 1)
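If confint_tidy() is not available in your installed broom version, the same upper bound can be sketched with base R only. This is a sketch under assumptions: the column name is passed as a string (col_name is a hypothetical argument name), the formula is built with reformulate(), and the default 95% level of confint() is used.

```r
# Base R sketch: upper 95% confidence bound for the mean of one column
get_ci_high <- function(data, col_name) {
  fmla <- reformulate("1", response = col_name)  # builds e.g. value ~ 1
  fit <- lm(fmla, data = data)
  # confint() returns a matrix with columns "2.5 %" and "97.5 %"
  confint(fit)["(Intercept)", "97.5 %"]
}

dataframe <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                        group = c(1, 1, 1, 2, 2, 2),
                        value = c(200, 400, 120, 300, 100, 100))
ci_high <- get_ci_high(dataframe, "value")
```

Passing the name as a string avoids non-standard evaluation altogether, at the cost of a slightly different calling convention from the one in the question.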
