dplyr - element-wise summarise in nested lists - r

I have the following data.frame:
df <- data.frame(X1 = c(1,2,2))
df$X2 <- list(list(1, 2), list(0, 1), list(1,0))
df
X1 X2
1 1 1, 2
2 2 0, 1
3 2 1, 0
Now, I would like to add a new column that is the element-wise mean of all the lists in X2 that share the same X1 value, e.g.:
X1 mean
1 1 1, 2
2 2 0.5, 0.5
I tried with the following instructions:
df %>% group_by(X1) %>% summarise(mean = mean(X2))
But all I get is
X1 mean
<dbl> <dbl>
1 1.00 NA
2 2.00 NA
Warning messages:
1: In mean.default(X2) : argument is not numeric or logical: returning NA
How can I build this new column?

We may use
library(dplyr)
library(purrr)
df <- df %>% group_by(X1) %>%
  summarise(mean = list(map(reduce(X2, `map2`, `+`), `/`, n())))
df$mean
# [[1]]
# [[1]][[1]]
# [1] 1
#
# [[1]][[2]]
# [1] 2
#
#
# [[2]]
# [[2]][[1]]
# [1] 0.5
#
# [[2]][[2]]
# [1] 0.5
Explanation: first, after grouping, with
reduce(X2, `map2`, `+`)
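we add all the lists element-wise. Then, to get the mean, we use another map with / to divide each sum by n(), the number of rows in the group. Lastly, list wraps the result so that it can be stored in a list column.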
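To see what the two steps do outside of dplyr, here is a minimal base-R sketch on the two lists from the X1 == 2 group. Reduce() and Map() stand in for purrr's reduce() and map2(), and the variable names are only illustrative.

```r
# The two lists that share X1 == 2 in the example data
lists <- list(list(0, 1), list(1, 0))

# Step 1: add the lists element-wise (base-R analogue of reduce(X2, map2, `+`))
sums <- Reduce(function(acc, x) Map(`+`, acc, x), lists)

# Step 2: divide every element-wise sum by the number of lists
means <- lapply(sums, `/`, length(lists))

str(means)
```

Each element of means is the mean of the values found at that position across the lists.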
Update: you may also use
df %>% group_by(X1) %>%
summarise(mean = list(pmap(X2, ~ sum(...) / n())))
or
df %>% group_by(X1) %>%
summarise(mean = list(pmap(X2, ~ mean(c(...)))))
Unfortunately list(pmap(X2, mean)) doesn't work, because mean() averages only its first argument (a second positional argument is matched to trim):
mean(1, 2)
# [1] 1
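A short check of that behaviour, as a sketch: the values have to be combined into one vector before mean() sees them, which is what c(...) does in the working version.

```r
# mean() averages only its first argument; here 2 is matched to `trim`
first_only <- mean(1, 2)

# Combining the values into a single vector gives the intended result
combined <- mean(c(1, 2))

print(c(first_only, combined))
```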

Related

what is the best code to write this conditional rowwise function in a tibble?

I want to calculate the geometric mean of two vectors in a tibble using the tidyverse. The mean should be computed row-wise for the two variables. I wrote the function below for this, and it works, but I am wondering whether the operation could be written more efficiently, by which I mean less code, faster and neater. Any better ideas? Just thinking out loud: can map_*() be used in this case? I am also aware of rowwise(), but as far as I know the author of the tidyverse, Hadley Wickham, has lately downplayed its use.
A minimal representative example is below:
Reprex
df <- tribble(
~v1, ~ v2,
4, 5,
NA, 7,
2, 2,
3, NA,
NA, NA,
9, 9)
Suggested function
gMean <- function (df, v1, v2){
output <- vector ("double", nrow (df))
for (i in 1:nrow(df)){
output[[i]] <- case_when (!is.na(df$v1[i]) && !is.na(df$v2[i]) ~ ((df$v1[i] * df$v2[i]) ^ 0.5),
is.na (df$v1[i]) && is.na (df$v2[i]) ~ 1,
!is.na(df$v1[i]) && is.na(df$v2[i]) ~ df$v1[i],
is.na(df$v1[i]) && !is.na(df$v2[i]) ~ df$v2[i]
)
}
output
}
Output
df %>%
gMean (v1, v2)
[1] 4.472136 7.000000 2.000000 3.000000 1.000000 9.000000
You could also (just) use mutate instead of looping over each row.
In your case, there is no need to map or to use rowwise, and because case_when evaluates its conditions from top to bottom and uses the first one that matches, you can simplify your is.na calls as well.
df |> mutate(gMean = case_when(is.na(v1) & is.na(v2) ~ 1,
is.na(v1) ~ v2,
is.na(v2) ~ v1,
TRUE ~ sqrt(v1 * v2)))
However, if we want to use rowwise() or map2_dbl(), we could use prod, which has an na.rm option, and take the root only over the values that are available: the exponent is 1 divided by the number of non-missing values. When both values are missing, this relies on 1/0 being defined as Inf in R, and 1^Inf as 1.
df |>
rowwise() |>
mutate(gMean = prod(v1, v2, na.rm = TRUE) ^ (1 / sum(c(!is.na(v1), !is.na(v2))))) |>
ungroup()
df |>
mutate(gMean = map2_dbl(v1, v2, ~ prod(.x, .y, na.rm = TRUE) ^ (1 / sum(c(!is.na(.x), !is.na(.y))))))
Output:
# A tibble: 6 × 3
v1 v2 gMean
<dbl> <dbl> <dbl>
1 4 5 4.47
2 NA 7 7
3 2 2 2
4 3 NA 3
5 NA NA 1
6 9 9 9
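The edge cases that the prod() approach relies on can be checked in isolation. The helper below is hypothetical (gmean_pair is not part of any package) and works on a single pair of values, as in one row of the tibble.

```r
# Hypothetical helper: geometric mean of two scalars, ignoring NAs
gmean_pair <- function(x, y) {
  n_obs <- sum(!is.na(c(x, y)))          # 0, 1 or 2 available values
  prod(x, y, na.rm = TRUE)^(1 / n_obs)   # prod() of nothing is 1, and 1^Inf is 1
}

both    <- gmean_pair(4, 5)    # sqrt(4 * 5)
one     <- gmean_pair(NA, 7)   # 7^(1/1)
neither <- gmean_pair(NA, NA)  # 1^(1/0), i.e. 1^Inf

print(c(both, one, neither))
```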
Another possible solution:
library(tidyverse)
df %>%
mutate(gMean = map2_dbl(v1, v2, ~ sqrt(.x * .y)) %>%
coalesce(v1, v2) %>% if_else(is.na(.), 1, .))
#> # A tibble: 6 × 3
#> v1 v2 gMean
#> <dbl> <dbl> <dbl>
#> 1 4 5 4.47
#> 2 NA 7 7
#> 3 2 2 2
#> 4 3 NA 3
#> 5 NA NA 1
#> 6 9 9 9

Conditional mean from different column

I do have an R data frame like this:
city2001 <- c('a', 'b', 'a')
grade2001 <- c(5, 5, 7)
city2002 <- c('b', 'b', 'a')
grade2002 <- c(8, 9, 10)
df <- data.frame(city2001, grade2001, city2002, grade2002)
and would like to return
avg_a = 7.333
# from (5 + 7 + 10)/3
What is the logic for that? Thanks.
Try
mean(df[,grepl("grade",colnames(df))][df[,grepl("city",colnames(df))]=="a"])
[1] 7.333333
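This matches city and grade values by position, so your df (columns) had better be sorted: the city and grade columns must appear in the same order.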
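To make the one-liner easier to follow, here is the same computation split into steps, as a sketch. The intermediate names (grades, cities, is_a, picked) are only illustrative.

```r
city2001  <- c("a", "b", "a")
grade2001 <- c(5, 5, 7)
city2002  <- c("b", "b", "a")
grade2002 <- c(8, 9, 10)
df <- data.frame(city2001, grade2001, city2002, grade2002)

# Split the columns into two data frames with the same shape
grades <- df[, grepl("grade", colnames(df))]
cities <- df[, grepl("city", colnames(df))]

# Comparing a data frame with a scalar gives a logical matrix
is_a <- cities == "a"

# Indexing by that matrix picks the grades at the matching positions
picked <- grades[is_a]
avg_a <- mean(picked)

print(picked)
print(avg_a)
```

Because the match is purely positional, the first city column has to pair with the first grade column, the second with the second, and so on.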
If you want for all the groups and not just "a"
tapply(
unlist(df[,grepl("grade",colnames(df))]),
unlist(df[,grepl("city",colnames(df))]),
mean
)
a b
7.333333 7.333333
library(tidyverse)
df %>%
pivot_longer(everything(), names_to = c('.value', 'year'),
names_pattern = '(\\D+)(\\d+)') %>%
group_by(city)%>%
summarise(mean=mean(grade))
# A tibble: 2 x 2
city mean
<chr> <dbl>
1 a 7.33
2 b 7.33
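The names_pattern used above splits each column name into a non-digit part and a digit part. A small base-R sketch shows which pieces the two capture groups pick up; sub() is used here purely for illustration and is not what pivot_longer calls internally.

```r
cols <- c("city2001", "grade2001", "city2002", "grade2002")
pattern <- "(\\D+)(\\d+)"

# First group: the leading non-digits, which become the new column names (.value)
value_part <- sub(pattern, "\\1", cols)

# Second group: the trailing digits, which go into the year column
year_part <- sub(pattern, "\\2", cols)

print(data.frame(cols, value_part, year_part))
```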
Here is a one-liner via base R,
aggregate(v2 ~ v1,
cbind.data.frame(v1 = stack(df[c(TRUE, FALSE)])$values,
v2 = stack(df[c(FALSE, TRUE)])$values),
mean)
# v1 v2
#1 a 7.333333
#2 b 7.333333
Another possible solution:
library(tidyverse)
map(list(df[1:2], df[3:4]), ~ filter(.x, .x[[1]] == "a") %>% pull(2)) %>%
unlist %>% mean
#> [1] 7.333333
Or getting means for each city:
library(tidyverse)
map_dfr(list(df[1:2], df[3:4]), ~ `colnames<-`(.x, c("city", "grade"))) %>%
group_by(city) %>%
summarise(means = mean(grade))
#> # A tibble: 2 x 2
#> city means
#> <chr> <dbl>
#> 1 a 7.33
#> 2 b 7.33

Using rbind within a pipe

Is it possible to use rbind within a pipe so that I don't have to define and store a variable to use it?
library(tidyverse)
## works fine
df <- iris %>%
group_by(Species) %>%
summarise(Avg.Sepal.Length = mean(Sepal.Length)) %>%
ungroup
df %>%
rbind(df)
## anyway to make this work?
iris %>%
group_by(Species) %>%
summarise(Avg.Sepal.Length = mean(Sepal.Length)) %>%
ungroup %>%
rbind(.)
Just to elaborate on @MichaelDewar's answer, note the following section of ?magrittr::`%>%`:
Placing lhs elsewhere in rhs call
Often you will want lhs to the rhs call at another position than the first. For this purpose you can use the dot (.) as placeholder. For example, y %>% f(x, .) is equivalent to f(x, y) and z %>% f(x, y, arg = .) is equivalent to f(x, y, arg = z).
My understanding is that when . appears as an argument in the right hand side call, the left hand side is not inserted in the first position. The call is evaluated "as is", with . evaluating to the left hand side. Hence:
library("dplyr")
x <- data.frame(a = 1:2, b = 3:4)
x %>% rbind() # rbind(x)
## a b
## 1 1 3
## 2 2 4
x %>% rbind(.) # same
## a b
## 1 1 3
## 2 2 4
x %>% rbind(x) # rbind(x, x)
## a b
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
x %>% rbind(x, .) # same
x %>% rbind(., x) # same
x %>% rbind(., .) # same
## a b
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
You can devise clever tricks if you know the rules:
x %>% rbind((.)) # rbind(x, (x))
## a b
## 1 1 3
## 2 2 4
## 3 1 3
## 4 2 4
(.) isn't parsed like ., so the left hand is inserted in the first position of the right hand side call. Compare:
as.list(quote(.))
## [[1]]
## .
as.list(quote((.)))
## [[1]]
## `(`
##
## [[2]]
## .
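The rule can be mimicked with a few lines of base R. This is only a sketch of the idea, not magrittr's actual implementation, and has_bare_dot is a hypothetical helper name: it checks whether any direct argument of a call is exactly the symbol `.`.

```r
# Hypothetical helper: TRUE if some direct argument of the call is a bare `.`
has_bare_dot <- function(call) {
  args <- as.list(call)[-1]   # drop the function name, keep the arguments
  any(vapply(args, function(a) identical(a, quote(.)), logical(1)))
}

bare    <- has_bare_dot(quote(rbind(.)))      # `.` is a direct argument
wrapped <- has_bare_dot(quote(rbind((.))))    # the argument is the call `(`(.)
mixed   <- has_bare_dot(quote(rbind(x, .)))   # `.` appears next to x
empty   <- has_bare_dot(quote(rbind()))       # no arguments at all

print(c(bare = bare, wrapped = wrapped, mixed = mixed, empty = empty))
```

Under this reading, the left-hand side is inserted as the first argument only in the cases where the check returns FALSE.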
I don't know why you would want to rbind something with itself, but here you go:
iris %>%
group_by(Species) %>%
summarise(Avg.Sepal.Length = mean(Sepal.Length)) %>%
ungroup %>%
rbind(., .)

applying function to each group using dplyr and return specified dataframe

I used group_map for the first time and think I do it correctly. This is my code:
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this df
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
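If you already have a list with one result per group, it can also be bound into a data frame by hand. The base-R sketch below uses a stand-in statistic (the range of the values), which is an assumption for illustration and not REAT::gini, so the numbers differ from the wanted output.

```r
df <- data.frame(value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Stand-in statistic for illustration only (not REAT::gini)
spread_stat <- function(v) max(v) - min(v)

# One result per group, stored in a list named by group
per_group <- lapply(split(df$value, df$group), spread_stat)

# Bind the list into a data frame, recovering the group key from the names
result <- data.frame(group = as.numeric(names(per_group)),
                     stat  = unlist(per_group, use.names = FALSE))
print(result)
```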
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
For grouped datasets, we can also refer to columns through the .data pronoun in case we don't want to use unquoted column names
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))

Using quotations inside mutate: an alternative to mutate_(.dots = ...)

I want to apply different functions to the same column in a tibble. These functions are stored in a character string. I used to do this with mutate_ and the .dots argument like this:
library(dplyr)
myfuns <- c(f1 = "a^2", f2 = "exp(a)", f3 = "sqrt(a)")
tibble(a = 1:3) %>%
mutate_(.dots = myfuns)
This approach still works fine but mutate_ is deprecated. I tried to achieve the same result with mutate and the rlang package but did not get very far.
In my real example myfuns contains about 200 functions so typing them one by one is not an option.
Thanks in advance.
For simple equations that take a single input, it’s sufficient to supply the function itself, e.g.
iris %>% mutate_at(vars(-Species), sqrt)
Or, when using an equation rather than a simple function, via a formula:
iris %>% mutate_at(vars(-Species), ~ . ^ 2)
When using equations that access more than a single variable, you need to use rlang quosures instead:
area = quo(Sepal.Length * Sepal.Width)
iris %>% mutate(Sepal.Area = !! area)
Here, quo creates a “quosure” — i.e. a quoted representation of your equation, same as your use of strings, except, unlike strings, this one is properly scoped, is directly usable by dplyr, and is conceptually cleaner: It is like any other R expression, except not yet evaluated. The difference is as follows:
1 + 2 is an expression with value 3.
quo(1 + 2) is an unevaluated expression with value 1 + 2 that evaluates to 3, but it needs to be explicitly evaluated. So how do we evaluate an unevaluated expression? Well …:
Then !! (pronounced “bang bang”) unquotes the previously-quoted expression, i.e. evaluates it — inside the context of mutate. This is important, because Sepal.Length and Sepal.Width are only known inside the mutate call, not outside of it.
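A rough base-R analogy of quoting and later evaluating, as a sketch: quote() plays the role of quo() and eval() with a data frame plays the role of unquoting inside mutate. Unlike a quosure, a plain quoted expression does not remember the environment it was created in.

```r
# Capture the expression without evaluating it;
# Sepal.Length is not visible at this point, and that is fine
area <- quote(Sepal.Length * Sepal.Width)
print(is.call(area))

# Evaluate it later, in a context where the columns are known
area_values <- eval(area, iris)
print(head(area_values))
```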
In all the cases above, the expressions can be inside a list, too. The only difference is that for lists you need to use !!! instead of !!:
funs = list(
Sepal.Area = quo(Sepal.Length * Sepal.Width),
Sepal.Ratio = quo(Sepal.Length / Sepal.Width)
)
iris %>% mutate(!!! funs)
The !!! operation is known as “unquote-splice”. The idea is that it “splices” the list elements of its arguments into the parent call. That is, it seems to modify the call as if it contained the list elements verbatim as arguments (this only works in functions, such as mutate, that support it, though).
Convert your strings to expressions
myexprs <- purrr::map( myfuns, rlang::parse_expr )
then pass those expressions to regular mutate using quasiquotation:
tibble(a = 1:3) %>% mutate( !!!myexprs )
# # A tibble: 3 x 4
# a f1 f2 f3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 2.72 1
# 2 2 4 7.39 1.41
# 3 3 9 20.1 1.73
Note that this will also work with strings / expressions involving multiple columns.
You have only one column, so both approaches below will give you the same result.
You only have to modify your functions' list.
library(dplyr)
myfuns <- c(f1 = ~.^2, f2 = ~exp(.), f3 = ~sqrt(.))
tibble(a = 1:3) %>% mutate_at(vars(a), myfuns)
tibble(a = 1:3) %>% mutate_all(myfuns)
# # A tibble: 3 x 4
# a f1 f2 f3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 2.72 1
# 2 2 4 7.39 1.41
# 3 3 9 20.1 1.73
A base alternative :
myfuns <- c(f1 = "a^2", f2 = "exp(a)", f3 = "sqrt(a)")
df <- data.frame(a = 1:3)
df[names(myfuns)] <- lapply(myfuns , function(x) eval(parse(text= x), envir = df))
df
#> a f1 f2 f3
#> 1 1 1 2.718282 1.000000
#> 2 2 4 7.389056 1.414214
#> 3 3 9 20.085537 1.732051
Created on 2019-07-08 by the reprex package (v0.3.0)
One way using parse_expr from rlang
library(tidyverse)
library(rlang)
tibble(a = 1:3) %>%
mutate(ans = map(myfuns, ~eval(parse_expr(.)))) %>%
#OR mutate(ans = map(myfuns, ~eval(parse(text = .)))) %>%
unnest() %>%
group_by(a) %>%
mutate(temp = row_number()) %>%
spread(a, ans) %>%
select(-temp) %>%
rename_all(~names(myfuns))
# A tibble: 3 x 3
# f1 f2 f3
# <dbl> <dbl> <dbl>
#1 1 2.72 1
#2 4 7.39 1.41
#3 9 20.1 1.73
you can try also a purrr approach
# define the functions
f1 <- function(a) a^2
f2 <- function(a, b) a + b
f3 <- function(b) sqrt(b)
# put all functions in one list
tibble(funs=list(f1, f2, f3)) %>%
# give each function a name
mutate(fun_id=paste0("f", row_number())) %>%
# add to each row/function the matching column profile
# first extract the column names you specified in each function
#mutate(columns=funs %>%
# toString() %>%
# str_extract_all(., "function \\(.*?\\)", simplify = T) %>%
# str_extract_all(., "(?<=\\().+?(?=\\))", simplify = T) %>%
# gsub(" ", "", .) %>%
# str_split(., ",")) %>%
# with the help of Konrad we can use fn_fmls_names
mutate(columns=map(funs, ~ rlang::fn_fmls_names(.))) %>%
# select the columns and add to our tibble/data.frame
mutate(params=map(columns, ~select(df, .))) %>%
# invoke the functions
mutate(results = invoke_map(.f = funs, .x = params)) %>%
# transform to desired output
unnest(results) %>%
group_by(fun_id) %>%
mutate(n=row_number()) %>%
spread(fun_id, results) %>%
left_join(mutate(df, n=row_number()), .) %>%
select(-n)
Joining, by = "n"
# A tibble: 5 x 5
a b f1 f2 f3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 1 4 3 1
2 4 1 16 5 1
3 5 2 25 7 1.41
4 7 2 49 9 1.41
5 8 2 64 10 1.41
Some data (define df before running the pipeline above):
df <- tibble(
a = c(2, 4, 5, 7, 8),
b = c(1, 1, 2, 2, 2))
