How to apply functions sequentially with purrr and pipes - r

I am struggling with the purrr package.
I am trying to apply the function is.factor to a data frame, and then fct_count to those columns that are factors.
I have tried some variations of modify_if and summarise_if. I guess I am using the dots (.) incorrectly when referring to the previous object.
(A link to a guide about purrr and the dots would be really helpful.)
For example,
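df <- data.frame(f1 = c("men", "woman", "men", "men"),
                 f2 = c("high", "low", "low", "low"),
                 n1 = c(1, 3, 3, 6),
                 stringsAsFactors = TRUE) # needed on R >= 4.0, where strings are no longer converted to factors by default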
Then
map(df, is.factor)
If I use
map_if(df, is.factor, forcats::fct_count)
I get results for every variable, not only for the factors.
I think it is a pretty simple problem that can be solved with a bit of understanding of the dots (.).
Thanks in advance
:)

The issue is that map_if returns the unmodified columns as well. Hence, when the OP's code is run (repeated here just to show the output)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6 ### it is the same column value unchanged
Here, we can use the .else argument to return NULL for the other columns and then discard the NULL elements, which leaves a list of factor counts only.
library(tidyverse)
map_if(df, is.factor, forcats::fct_count, .else = ~ NULL) %>%
discard(is.null)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
Another option is summarise_if, placing the output of each column in a list
df %>%
summarise_if(is.factor, list(~ list(fct_count(.)))) %>%
unclass
Or another option would be to gather into 'long' format and then count once
gather(df, key, val, f1:f2) %>%
dplyr::count(key, val)
Or this can be done with lapply from base R (fct_count still comes from forcats)
lapply(df[sapply(df, is.factor)], fct_count)
Or using only base R
lapply(df[sapply(df, is.factor)], table)
Or the results can be represented in a different way
table(names(df)[1:2][col(df[1:2])], unlist(df[1:2]))

The issue with map_if/modify_if is that it applies the function only to the columns that satisfy the predicate function; the rest are returned as they are.
Hence, when you try
library(tidyverse)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6
fct_count is applied to columns f1 and f2, which are factors, and column n1 is returned as it is. If you want only the factor columns in the output, one way is to select them first and then apply the function:
df %>%
select_if(is.factor) %>%
map(forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
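On the dots the question asks about: in purrr, a formula such as ~ fct_count(.x) is shorthand for an anonymous function, and . or .x stands for the element currently being processed (here, one column). As a minimal sketch, not part of the answers above, the same select-then-apply idea can be written with purrr's keep(). The data frame is the one from the question, with stringsAsFactors = TRUE assumed so that the text columns are factors on R 4.0 and later.

```r
library(purrr)
library(forcats)

# Data from the question; stringsAsFactors = TRUE is an assumption for R >= 4.0
df <- data.frame(f1 = c("men", "woman", "men", "men"),
                 f2 = c("high", "low", "low", "low"),
                 n1 = c(1, 3, 3, 6),
                 stringsAsFactors = TRUE)

# keep() drops every column for which the predicate returns FALSE
factor_counts <- df %>%
  keep(is.factor) %>%
  map(fct_count)

# Formula shorthand: .x is the current column, so extra arguments can be passed
factor_counts_sorted <- df %>%
  keep(is.factor) %>%
  map(~ fct_count(.x, sort = TRUE))

names(factor_counts)
```

factor_counts holds one tibble per factor column (f1 and f2); n1 is dropped by keep() before map() runs.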

Related

Count number of observations per distinct group inside summarise with dplyr (n_distinct equivalent?)

Is there a function that counts the number of observations within unique groups and not the number of distinct groups as n_distinct() does?
I'm summarising data with dplyr and group_by(), and I'm trying to calculate means of numbers of observations per a different grouping variable.
df<-data.frame(id=c('A', 'A', 'A', 'B', 'B', 'C','C','C'),
id.2=c('1', '2', '2', '1','1','1','2','2'),
v=c(sample(1:10, 8)))
df%>%
group_by(id.2)%>%
summarise(n.mean=mean(n_distinct(id)),
v.mean=mean(v))
# A tibble: 2 × 3
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 3 5
2 2 2 4.5
What I instead need:
id.2 n.mean v.mean
1 1 5
2 2 4.5
because for
id.2==1 n.mean is the mean of 1 observation for A, 2 for B, 1 observation for C,
> mean(1,2,1)
[1] 1
id.2==2 n.mean is the mean of 2 observations for A, 0 for B, 2 for C,
mean(2,0,2)
[1] 2
I tried grouping with group_by(id, id.2) first to count the observations and then passing those counts on when grouping by only id.2 in a subsequent step, but that didn't work (though I probably just don't know how to implement this with dplyr, as I'm not very experienced with tidyverse solutions).
You are not using mean correctly. mean(1, 2, 1) ignores all but the first argument and therefore will return 1 no matter what other numbers are in the second and third positions. For id.2 == 1, you'd want mean(c(1, 2, 1)), which returns 1.333.
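A quick check of the difference, as a small base R sketch: mean() takes the data as its first argument only, so the extra numbers are matched to its trim and na.rm parameters rather than averaged.

```r
# Only the first argument is treated as data; 2 and 1 are matched to trim and na.rm
wrong <- mean(1, 2, 1)

# Wrap the values in c() so that all of them are averaged
right <- mean(c(1, 2, 1))

print(wrong)  # 1
print(right)  # 1.333333
```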
We can use table to quickly calculate the frequencies of id within each grouping of id.2, and then take the mean of those. We can compute v.mean in the same step.
library(tidyverse)
df %>%
group_by(id.2) %>%
summarize(
n.mean = mean(table(id)),
v.mean = mean(v)
)
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 2 6
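The two-step approach described in the question (count per id within id.2 first, then average the counts) can also be sketched with count(). This is an illustration rather than part of the original answer. The v values are hard-coded from the example data in this answer so that the result does not depend on the random seed. Like the table() solution above, it ignores combinations that have no rows.

```r
library(dplyr)

# Example data with fixed v values (assumed; copied from the answer's example)
df <- data.frame(id   = c("A", "A", "A", "B", "B", "C", "C", "C"),
                 id.2 = c("1", "2", "2", "1", "1", "1", "2", "2"),
                 v    = c(9, 4, 7, 1, 2, 5, 3, 10))

# Step 1: number of observations per id within each id.2
per_id <- df %>%
  count(id.2, id, name = "n_obs")

# Step 2: average those counts per id.2
n_means <- per_id %>%
  group_by(id.2) %>%
  summarise(n.mean = mean(n_obs))

# v.mean comes from the original rows, then both summaries are joined
v_means <- df %>%
  group_by(id.2) %>%
  summarise(v.mean = mean(v))

result <- left_join(n_means, v_means, by = "id.2")
print(result)
```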
Your example notes that id.2 == 2 does not have any values for id == B. It is not clear whether your desired solution counts this as a zero-length category, or simply ignores it. The solution above ignores it. The following includes it as a zero-length category by first complete-ing the input data (note new row #7, which has NA data):
df_complete <- complete(df, id.2, id)
id.2 id v
<chr> <chr> <int>
1 1 A 9
2 1 B 1
3 1 B 2
4 1 C 5
5 2 A 4
6 2 A 7
7 2 B NA
8 2 C 3
9 2 C 10
We can convert id to factor data, which will force table to preserve its unique levels even in groupings of zero length:
df_complete %>%
group_by(id.2) %>%
mutate(id = factor(id)) %>%
filter(!is.na(v)) %>%
summarize(
n.mean = mean(table(id)),
v.mean = mean(v, na.rm = TRUE)
)
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Or an alternate recipe that does not rely on table:
df_complete %>%
group_by(id.2, id) %>%
summarize(
n_rows = sum(!is.na(v)),
id_mean = mean(v)
) %>%
group_by(id.2) %>%
summarize(
n.mean = mean(n_rows),
v.mean = weighted.mean(id_mean, n_rows, na.rm = TRUE)
)
id.2 n.mean v.mean
<chr> <dbl> <dbl>
1 1 1.33 4.25
2 2 1.33 6
Note that when providing randomized example data, you should use set.seed to control the randomization and ensure reproducibility. Here is what I used:
set.seed(0)
df<-data.frame(id=c('A', 'A', 'A', 'B', 'B', 'C','C','C'),
id.2=c('1', '2', '2', '1','1','1','2','2'),
v=c(sample(1:10, 8)))

R: Encoding categorical data using across()

I have a dataset with features of type character (not all are binary and one of them represents a region).
In order to avoid having to use the function several times, I was trying to use a pipeline and across() to identify all of the columns of character type and encode them with the function created.
encode_ordinal <- function(x, order = unique(x)) {
x <- as.numeric(factor(x, levels = order, exclude = NULL))
x
}
dataset <- dataset %>%
encode_ordinal(across(where(is.character)))
However, it seems that I am not using across() correctly as I get the error:
Error: across() must only be used inside dplyr verbs.
I wonder if I am overcomplicating things and there is an easier way of achieving this, i.e., identifying all of the features of character type and encoding them.
You should call across and encode_ordinal inside mutate, as illustrated in the following example:
dataset <- tibble(x = 1:3, y = c('a', 'b', 'b'), z = c('A', 'A', 'B'))
# # A tibble: 3 x 3
# x y z
# <int> <chr> <chr>
# 1 1 a A
# 2 2 b A
# 3 3 b B
dataset %>%
mutate(across(where(is.character), encode_ordinal))
# # A tibble: 3 x 3
# x y z
# <int> <dbl> <dbl>
# 1 1 1 1
# 2 2 2 1
# 3 3 2 2
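The error arises because across() only has meaning inside a dplyr verb such as mutate() or summarise(), which supplies the data it selects from. As a small variation on the answer, sketched under the assumption that you want to keep the original character columns, the .names argument of across() writes the codes to new columns instead of overwriting. The _code suffix is an arbitrary choice for illustration.

```r
library(dplyr)

# Encoder from the question: numeric codes in order of first appearance
encode_ordinal <- function(x, order = unique(x)) {
  as.numeric(factor(x, levels = order, exclude = NULL))
}

dataset <- tibble(x = 1:3, y = c("a", "b", "b"), z = c("A", "A", "B"))

# .names builds the new column names; "{.col}" is the name of the input column
encoded <- dataset %>%
  mutate(across(where(is.character), encode_ordinal, .names = "{.col}_code"))

print(encoded)
```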

R dplyr find all mutated rows

I would like to identify all rows of a tibble that have been altered by a mutate call.
My real data has multiple columns and the mutate function changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name=c("A","B","C","D"),value=c(1,2,3,4))
# mutate df
dfnew <- df %>%
mutate(value=case_when(name=="A" ~ value+1, TRUE ~value)) %>%
mutate(name=case_when(name=="B" ~ "K", TRUE ~name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I am looking for a way to compare all rows of df with dfnew and identify all rows with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 2 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2
You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
@tmfmnk's response does the trick, but if you'd like to use a loop (e.g. for some flexibility in issuing different messages or warnings depending on what you're checking), you could do:
output <- list()
for (i in seq_len(nrow(dfnew))) {
  # skip rows that are identical in both tibbles
  if (all(df[i, ] == dfnew[i, ])) {
    next
  }
  output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2
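The same positional comparison can be sketched without an explicit loop. This assumes, as the loop does, that both tibbles have the same rows in the same order and contain no NA values (a comparison with NA yields NA rather than TRUE or FALSE). The dfnew below is hard-coded to the result of the mutate calls in the question.

```r
library(tibble)

df    <- tibble(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))
dfnew <- tibble(name = c("A", "K", "C", "D"), value = c(2, 2, 3, 4))

# Element-wise comparison gives a logical matrix;
# a row counts as changed if at least one of its cells differs
changed <- unname(rowSums(df != dfnew) > 0)

changed_rows <- dfnew[changed, ]
print(changed_rows)
```

Unlike anti_join or setdiff, this compares rows by position, so a row that is merely moved would also be reported as changed.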
We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))

Create multiple data that count for unique values of each variables using dplyr and loop

I have a question about using dplyr with a for loop to create multiple data frames. The code without the loop works very well, but the code with the for loop gives an error message instead of the expected result.
The error message was:
Error in UseMethod ("select_") : no applicable method for 'select_' applied to an object of class "character"
Could anyone point me in the right direction?
The code below worked
B <- data %>% select (column1) %>% group_by (column1) %>% arrange (column1) %>% summarise (n = n ())
The code below did not work
column_list <- c ('column1', 'column2', 'column3')
for (b in column_list) {
a <- data %>% select (b) %>% group_by (b) %>% arrange (b) %>% summarise (n = n () )
assign (paste0(b), a)
}
Don't use assign; use lists instead.
We can use the _at variants in dplyr, which work with character vectors of column names.
library(dplyr)
split_fun <- function(df, col) {
df %>% group_by_at(col) %>% summarise(n = n()) %>% arrange_at(col)
}
and then use lapply/map to apply it to different columns
purrr::map(column_list, ~split_fun(data, .))
This returns a list of data frames, which can be accessed individually with [[ if needed.
Using example with mtcars
df <- mtcars
column_list <- c ('cyl', 'gear', 'carb')
purrr::map(column_list, ~split_fun(df, .))
#[[1]]
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
#[[2]]
# A tibble: 3 x 2
# gear n
# <dbl> <int>
#1 3 15
#2 4 12
#3 5 5
#[[3]]
# A tibble: 6 x 2
# carb n
# <dbl> <int>
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1
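In dplyr 1.0.0 and later the _at variants are superseded by across(). A minimal sketch of the same helper under that assumption follows. The name count_by is hypothetical, and set_names() is used only so that the resulting list is named after the columns.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: count rows per distinct value of the column named in `col`
count_by <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(n = n(), .groups = "drop")
}

column_list <- c("cyl", "gear", "carb")

# set_names() with no argument names each element after itself,
# so the list can be indexed by column name
counts <- column_list %>%
  set_names() %>%
  map(~ count_by(mtcars, .x))

counts$gear
```

With a named list, counts$gear or counts[["gear"]] retrieves a single result, which covers what assign() was meant to do in the question.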

how to use tidyeval functions with loops?

Consider this simple example
library(dplyr)
dataframe <- tibble(id = c(1,2,3,4),
group = c('a','b','c','c'),
value = c(200,400,120,300))
> dataframe
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
and this tidyeval function that uses dplyr to aggregate my dataframe according to some input column.
func_tidy <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
now, this works
> func_tidy(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
but doing the same thing from within a loop fails:
for(col in c(group)){
func_tidy(dataframe, col)
}
Error in grouped_df_impl(data, unname(vars), drop) : Column `col` is unknown
What is the problem here? How can I use my tidyeval function in a loop?
Thanks!
For looping through column names you will need to use character strings.
for(col in "group")
When you pass this variable to your function, you will need to convert it from a character string to a symbol using rlang::sym. You use !! to unquote so the expression is evaluated.
So your loop would look like (I add a print to see the output):
for(col in "group"){
print( func_tidy(dataframe, !! rlang::sym(col) ) )
}
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
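To run this over several columns and keep the results, the same !! rlang::sym() pattern can be wrapped in lapply(). This is a sketch, assuming dplyr 1.0.0 or later. The second column name (id) is used purely for illustration, and func_tidy is the function from the question with .groups = "drop" in place of the explicit ungroup().

```r
library(dplyr)

dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

func_tidy <- function(data, mygroup) {
  quo_var <- enquo(mygroup)
  data %>%
    group_by(!!quo_var) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n(),
              .groups = "drop")
}

cols <- c("group", "id")

# sym() turns each string into a symbol; !! unquotes it at the call site
results <- lapply(cols, function(col) func_tidy(dataframe, !!rlang::sym(col)))
names(results) <- cols

results$group
```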

Resources